•We also propose consistent sample mining (CSM) to discard samples whose pseudo labels are inconsistent during each training epoch. These discarded samples are potentially noisy and may hinder network training (a minimal sketch of this criterion follows the contribution list).
•Extensive experiments on three large-scale datasets
(Market-1501 [7], MSMT17 [8], and PersonX [9])
demonstrate that our method outperforms the fully un-
supervised state-of-the-art methods by a large margin and even surpasses most UDA methods and methods utilizing camera information.
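For illustration only, the following minimal sketch shows one plausible realization of the consistency criterion behind CSM: since cluster IDs from independent clustering rounds are not aligned, a sample is kept only when its cluster co-membership largely overlaps between two consecutive rounds. The function name, the overlap rule, and the threshold are illustrative assumptions, not the exact algorithm.

```python
import numpy as np

def consistent_sample_mask(labels_prev: np.ndarray,
                           labels_curr: np.ndarray,
                           min_overlap: float = 0.5) -> np.ndarray:
    """Keep sample i only if its current cluster largely coincides with
    its previous cluster. Cluster IDs are not comparable across rounds,
    so consistency is measured by the Jaccard overlap of each sample's
    cluster co-membership sets."""
    keep = np.zeros(labels_curr.shape[0], dtype=bool)
    for i in range(labels_curr.shape[0]):
        if labels_curr[i] < 0 or labels_prev[i] < 0:  # clustering outliers
            continue
        curr = set(np.flatnonzero(labels_curr == labels_curr[i]))
        prev = set(np.flatnonzero(labels_prev == labels_prev[i]))
        keep[i] = len(curr & prev) / len(curr | prev) >= min_overlap
    return keep

# Samples with keep[i] == False are treated as potentially noisy and
# excluded from the current training epoch.
```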
II. RELATED WORKS
A. Unsupervised Person Re-ID
Unsupervised person Re-ID methods are mainly divided into unsupervised domain adaptation (UDA) methods and unsupervised learning (USL) methods.
1) UDA Person Re-ID: UDA methods generally pre-train a
model using labeled data on the source domain and transfer the
learned knowledge from the source domain to the unlabeled
target domain. Recent studies on UDA for person Re-ID can be mainly grouped into clustering-based adaptation [21]–[24] and cross-domain translation [8], [25]–[29].
Clustering-based adaptation methods leverage clustering to generate pseudo labels for unlabeled data in the target domain. Fan et al. [21] utilize the pseudo labels generated by k-means [30] to fine-tune the model. Song et al. [22] adopt DBSCAN [31] to generate pseudo labels, where the number of clusters is determined by the density of features. AD-cluster [23] leverages iterative density-based clustering to generate pseudo labels and learns an image generator that augments the training samples to enhance the discrimination ability of Re-ID models. To avoid overfitting to noisy pseudo labels, AdaDC [24] adaptively and alternately utilizes different clustering methods. Although clustering-based methods have proven effective and achieve state-of-the-art performance, the pseudo labels assigned by clustering are inevitably noisy owing to indistinguishable persons with similar appearance, which seriously hinders network training.
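As a concrete illustration of this family of methods, the sketch below assigns pseudo labels by running DBSCAN over extracted features, in the spirit of [22]. The helper name, the plain Euclidean metric, and the hyperparameters are illustrative placeholders; published methods typically use re-ranked Jaccard distances and carefully tuned settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

def generate_pseudo_labels(features: np.ndarray,
                           eps: float = 0.6,
                           min_samples: int = 4) -> np.ndarray:
    """Cluster L2-normalized features and return one pseudo label per
    sample; DBSCAN marks un-clustered samples as noise with label -1."""
    features = normalize(features, axis=1)  # cosine ~ Euclidean on unit sphere
    return DBSCAN(eps=eps, min_samples=min_samples,
                  metric="euclidean").fit_predict(features)

# Typical usage per epoch:
#   feats  = extract_features(model, unlabeled_loader)  # (N, D) array
#   pseudo = generate_pseudo_labels(feats)
#   train only on samples with pseudo != -1, then re-cluster next epoch
```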
Cross-domain translation is another approach, which learns domain-invariant features from source-domain images. Generative adversarial networks (GANs) are among the main representatives of this type of method. PTGAN [8] and SPGAN [26] utilize source-domain images to generate transferred images that share the style of the target-domain images. However, the quality of the generated images restricts the performance of such methods. DAAL [25] simultaneously separates the feature map into a domain-shared feature map and a domain-specific feature map. The
former is transferred from the source domain to the target
domain to facilitate the Re-ID task. ECN [27] and ECN++ [28]
adopt a feature memory to learn exemplar-invariance, camera-
invariance, and neighborhood-invariance. HCN [32] proposes
a heterogeneous convolutional network, which leverages CNN
and GCN to learn the appearance and correlation information
of person images. TAL-MIRN [29] leverages triple adversarial
learning and multi-view imaginative reasoning to improve the
generalization ability of the Re-ID model from the source
domain to the target domain. Although these UDA methods perform well in the cross-domain scenario, their requirement of tremendous manual annotation in the source domain largely limits their practical usage. In addition, UDA methods rely on the transferable knowledge learned from the source domain, and thus the discriminative information of the target domain may not be fully explored.
2) USL Person Re-ID: USL methods do not require any
labeled data. In recent years, clustering-based methods [10],
[11], [33] have become the mainstream of USL methods.
BUC [10] presents bottom-up clustering to generate pseudo
labels, and a diversity regularization is employed to control the
number of samples in each cluster. However, only one pass of bottom-up clustering is performed over the entire training process, so samples incorrectly merged in earlier merging steps persistently affect subsequent training. HCT [11]
adopts hierarchical clustering to generate pseudo labels and
employs batch hard triplet loss [34] to facilitate training. TSSL
[35] designs a unified formulation to consider tracklet frame
coherence, tracklet neighbourhood compactness, and tracklet
cluster structure. In order to improve the generation quality
of pseudo labels, IICS [33] decomposes the sample similarity
computation into two stages: intra-camera and inter-camera
computation. PPLR [13] exploits the complementary rela-
tionship between global and local features to reduce pseudo
label noise. To reduce “sub and mixed” clustering errors, ISE [14] generates support samples around cluster boundaries to associate samples of the same identity.
Some studies address unsupervised person Re-ID without
using clustering. SSL [36] explores the similarity between unlabeled images via softened similarity learning and introduces a cross-camera encouragement term to boost it. MMCL [37] employs the multi-label
classification method to tackle unsupervised person Re-ID
and proposes a memory-based multi-label classification loss
to promote training. Although these methods achieve satisfactory performance, a gap remains between them and clustering-based methods.
In recent research, contrastive-learning-based methods have achieved remarkable performance. SpCL [15] stores the features of all instances in a hybrid memory and optimizes the encoder with a unified contrastive loss. Cluster-
Contrast [38] stores features and computes contrastive loss
at the cluster level. CAP [39] designs both intra-camera and
inter-camera contrastive learning to boost training. ICE [12]
employs inter-instance pairwise similarity scores to promote
contrastive learning. However, the inevitable pseudo label
noise limits the performance of these methods.
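For concreteness, the sketch below shows a generic cluster-level contrastive loss in the spirit of these memory-based methods: batch features are compared against a memory of cluster centroids with an InfoNCE-style objective. This is a simplified illustration rather than the exact implementation of any cited method, and the momentum update in the trailing comment is one common design choice.

```python
import torch
import torch.nn.functional as F

def cluster_contrastive_loss(queries: torch.Tensor,
                             labels: torch.Tensor,
                             centroids: torch.Tensor,
                             temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss between query features and cluster centroids.

    queries:   (B, D) L2-normalized features of the current batch
    labels:    (B,)   pseudo labels indexing rows of `centroids`
    centroids: (K, D) L2-normalized cluster-level memory
    """
    logits = queries @ centroids.t() / temperature  # (B, K) similarities
    return F.cross_entropy(logits, labels)

# After each batch, the centroid of each involved cluster y is often
# refreshed with a momentum rule, e.g.:
#   centroids[y] = F.normalize(m * centroids[y] + (1 - m) * q, dim=0)
```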
B. Learning with Noisy Labels
In recent years, training networks on noisy or unlabeled data has been widely studied. Existing approaches can be classified into four categories: estimating the noise transition matrix [40], [41], designing robust loss functions [42], [43], correcting noisy labels [44], [45], and utilizing the peer-teaching strategy [16], [17], [20].
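As an illustration of the peer-teaching strategy, the sketch below implements the small-loss selection step popularized by co-teaching-style methods: each of two peer networks selects the low-loss (likely clean) samples in a batch to train the other. The function name and the keep ratio are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def small_loss_selection(logits_a, logits_b, targets, keep_ratio=0.7):
    """Each network picks the small-loss samples to train its peer."""
    loss_a = F.cross_entropy(logits_a, targets, reduction="none")
    loss_b = F.cross_entropy(logits_b, targets, reduction="none")
    k = max(1, int(keep_ratio * targets.size(0)))
    idx_for_b = torch.topk(-loss_a, k).indices  # clean set chosen by net A
    idx_for_a = torch.topk(-loss_b, k).indices  # clean set chosen by net B
    return idx_for_a, idx_for_b

# Training step (sketch): each network is updated only on the subset
# selected by its peer, which suppresses mutually confirmed noisy labels:
#   loss_net_a = F.cross_entropy(logits_a[idx_for_a], targets[idx_for_a])
#   loss_net_b = F.cross_entropy(logits_b[idx_for_b], targets[idx_for_b])
```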