https://arxiv.org/pdf/2004.06305.pdf
In summary, the usual ways to push performance are:
1) Use larger training datasets, including generated data (sketch below).
2) Use larger input resolution, e.g., 384x384 or 512x512 (sketch below).
3) Use a modern backbone, e.g., a Swin Transformer pretrained on ImageNet-22k (sketch below).
4) Combine more losses, e.g., contrastive loss + circle loss as in https://github.com/layumi/Person_reID_baseline_pytorch (sketch below).
5) Design a strong attention module (sketch below).
6) Follow X-VLM and use a two-stage test: fast retrieval first, then re-rank the top candidates with a heavier matcher (sketch below).
7) Use a harder sampling strategy to reduce the impact of noisy samples (sketch below).
etc.
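
A minimal sketch for 1), assuming the generated images sit in a second `ImageFolder`-style directory (the paths and the shared identity label space here are assumptions, not the repo's actual data pipeline):

```python
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((256, 128)),
    transforms.ToTensor(),
])

# Hypothetical paths; in practice the generated set's labels must be merged
# into the same identity label space as the real set beforehand.
real_set = datasets.ImageFolder('data/real/train', transform=transform)
generated_set = datasets.ImageFolder('data/generated/train', transform=transform)

# ConcatDataset lets one DataLoader sample from both sources transparently.
train_set = torch.utils.data.ConcatDataset([real_set, generated_set])
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
```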
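
For 2), switching to a larger input is mostly a transform change; a sketch with standard ImageNet normalization (the augmentations are placeholders):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((384, 384), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
```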
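
For 3), a sketch that pulls a 22k-pretrained Swin Transformer from the `timm` library as the feature extractor (the exact model name varies across timm versions; this is not the repo's own backbone code):

```python
import timm
import torch

# num_classes=0 makes timm return pooled features instead of classifier logits.
backbone = timm.create_model(
    'swin_base_patch4_window12_384_in22k',  # ImageNet-22k pretrained weights
    pretrained=True,
    num_classes=0,
)

x = torch.randn(2, 3, 384, 384)
features = backbone(x)       # (2, 1024) pooled features for swin-base
print(features.shape)
```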
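
For 4), one way to wire an ID (cross-entropy) loss together with circle + contrastive losses, here via the `pytorch-metric-learning` package rather than the repo's own implementations; the equal loss weights are placeholder assumptions:

```python
import torch.nn.functional as F
from pytorch_metric_learning import losses

circle = losses.CircleLoss(m=0.25, gamma=64)
contrastive = losses.ContrastiveLoss(pos_margin=0, neg_margin=1)

def total_loss(logits, embeddings, labels):
    id_loss = F.cross_entropy(logits, labels)   # ID classification loss on logits
    emb = F.normalize(embeddings, dim=1)        # metric losses on normalized features
    return id_loss + circle(emb, labels) + contrastive(emb, labels)
```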
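
For 5), a squeeze-and-excitation style channel-attention block, one simple example of an attention module that can be dropped after a backbone stage:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                            # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                       # squeeze: global average pool -> (B, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)   # excitation -> (B, C, 1, 1)
        return x * w                                 # reweight channels

x = torch.randn(2, 256, 24, 12)
print(ChannelAttention(256)(x).shape)  # torch.Size([2, 256, 24, 12])
```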
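
For 6), a sketch of the two-stage idea: rank the whole gallery with cheap embedding similarity, then re-score only the top-k with a slower matcher. `fine_matcher` is a hypothetical stand-in for the heavy second-stage model (e.g., a cross-attention scorer), not an actual X-VLM API:

```python
import torch
import torch.nn.functional as F

def two_stage_rank(query_emb, gallery_emb, fine_matcher, k=50):
    # Stage 1: cosine similarity over the whole gallery (cheap).
    q = F.normalize(query_emb, dim=0)       # (D,)
    g = F.normalize(gallery_emb, dim=1)     # (N, D)
    coarse = g @ q                          # (N,)
    topk = coarse.topk(k).indices           # candidate shortlist

    # Stage 2: expensive pairwise scoring on the shortlist only.
    fine = torch.stack([fine_matcher(q, g[i]) for i in topk])
    order = fine.argsort(descending=True)
    return topk[order]                      # gallery indices, re-ranked
```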
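
For 7), a sketch of batch-hard mining (the strategy from "In Defense of the Triplet Loss"), one common harder-sampling scheme: each anchor is paired with its hardest positive and hardest negative inside the batch:

```python
import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    # Pairwise Euclidean distances within the batch: (B, B)
    dist = torch.cdist(embeddings, embeddings, p=2)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-identity mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)

    # Hardest positive: farthest sample with the same label (excluding self).
    pos = dist.masked_fill(~same | eye, float('-inf')).max(dim=1).values
    # Hardest negative: closest sample with a different label.
    neg = dist.masked_fill(same, float('inf')).min(dim=1).values

    return torch.relu(pos - neg + margin).mean()
```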