Dual Clustering Co-teaching with Consistent
Sample Mining for Unsupervised Person
Re-Identification
Zeqi Chen1, Zhichao Cui2, Chi Zhang1, Jiahuan Zhou3, Yuehu Liu1
1Xi'an Jiaotong University  2Chang'an University  3Peking University
Abstract—In unsupervised person Re-ID, the peer-teaching strategy, which leverages two networks to facilitate training, has proven to be an effective way to deal with pseudo label noise. However, training two networks with a single set of noisy pseudo labels reduces the complementarity of the two networks and results in label noise accumulation. To handle this issue, this paper proposes a novel Dual Clustering Co-teaching (DCCT) approach. DCCT exploits the features extracted by the two networks to generate two sets of pseudo labels separately by clustering with different parameters. Each network is trained with the pseudo labels generated by its peer network, which increases the complementarity of the two networks and reduces the impact of noise. Furthermore, we propose dual clustering with dynamic parameters (DCDP) to make the networks adaptive and robust to dynamically changing clustering parameters. Moreover, consistent sample mining (CSM) is proposed to find samples whose pseudo labels remain unchanged during training, so that potential noisy samples can be removed. Extensive experiments demonstrate the effectiveness of the proposed method, which outperforms state-of-the-art unsupervised person Re-ID methods by a considerable margin and surpasses most methods utilizing camera information.
Index Terms—Unsupervised person re-identification, peer-teaching strategy, sample mining.
I. INTRODUCTION
PERSON re-identification (Re-ID) aims to retrieve the images of the same person captured by different cameras [1]. Although supervised person Re-ID [2]–[6] has achieved excellent accuracy on publicly available datasets [7]–[9], the requirement of tremendous manual annotation limits its practicality in the real world. To tackle this issue, unsupervised person Re-ID methods [10]–[14], which require no labeled data, have been extensively studied.
The mainstream unsupervised methods are clustering-based [10], [11], [13]–[15] and mainly consist of two stages: (1) generating pseudo labels by clustering; (2) training the network with the pseudo labels. Although these methods achieve excellent performance, their generated pseudo labels are inevitably noisy. On the one hand, person images with different identities may have similar appearance, viewpoint, pose, and illumination. Due to such subtle differences, they may be grouped into the same cluster by the clustering algorithm. On the other hand, images of the same person may suffer from occlusion, different resolutions, and motion blur. They may be split into different clusters
due to their distinct differences. Training with noisy pseudo
labels hinders the model’s performance.
To mitigate the influence of such noisy pseudo labels, peer-teaching strategies [16]–[20] have been employed, which leverage the difference and complementarity between two networks to filter out different noise through their cooperative training. ACT [19] trains two networks in an asymmetric manner to enhance their complementarity: one network is trained with pure samples, while the other is trained with diverse samples. To enhance the output independence of the two networks, MMT [20] utilizes the outputs of each network's temporally average model [18] as soft pseudo labels to train its peer network. However, both ACT and MMT adopt only one set of noisy pseudo labels to train the two networks, resulting in the accumulation and propagation of pseudo label noise during training.
To overcome the aforementioned shortcomings, we propose a novel Dual Clustering Co-teaching (DCCT) framework that trains two networks using two sets of pseudo labels. Training with pseudo labels obtained from different clusterings increases the difference and complementarity between the two networks, thereby reducing the effect of noise and improving the final performance. Specifically, we propose dual clustering with dynamic parameters (DCDP) to obtain different clustering parameters at each epoch. The features extracted by the temporally average models (Mean Nets) of the two networks are then clustered to generate two sets of pseudo labels, and two memory banks are initialized according to the clustering results, as shown in Fig. 1(a). We then adopt the pseudo labels generated by one network to train its peer network, as shown in Fig. 1(b). In addition, we propose consistent sample mining (CSM) in each mini-batch to discard potential noisy samples with incorrect pseudo labels, which further improves performance.
The main contributions of this paper are threefold:
• We design a novel peer-teaching framework called Dual Clustering Co-teaching (DCCT), which employs dual clustering with dynamic parameters (DCDP) to generate two sets of pseudo labels. Training with different pseudo labels enhances the difference and complementarity between the two networks and improves their final performance. The proposed DCDP is flexible enough to be effective with multiple clustering algorithms.
• We also propose consistent sample mining (CSM) to discard samples whose pseudo labels are inconsistent within each training epoch. Such inconsistent samples are potential noisy samples that may hinder network training.
• Extensive experiments on three large-scale datasets (Market-1501 [7], MSMT17 [8], and PersonX [9]) demonstrate that our method outperforms fully unsupervised state-of-the-art methods by a large margin and even surpasses most UDA methods and methods utilizing camera information.
II. RELATED WORKS
A. Unsupervised Person Re-ID
Unsupervised person Re-ID methods are mainly divided into unsupervised domain adaptation (UDA) methods and unsupervised learning (USL) methods.
1) UDA Person Re-ID: UDA methods generally pre-train a model using labeled data on the source domain and transfer the learned knowledge to the unlabeled target domain. Recent UDA studies for person Re-ID can mainly be grouped into clustering-based adaptation [21]–[24] and cross-domain translation [8], [25]–[29].
Clustering-based adaptation methods leverage clustering to generate pseudo labels for the unlabeled data on the target domain. Fan et al. [21] utilize the pseudo labels generated by k-means [30] to fine-tune the model. Song et al. [22] adopt DBSCAN [31] to generate pseudo labels, where the number of clusters is determined by the density of features. AD-Cluster [23] leverages iterative density-based clustering to generate pseudo labels and learns an image generator that augments the training samples to strengthen the discrimination ability of Re-ID models. To avoid overfitting to noisy pseudo labels, AdaDC [24] adaptively and alternately utilizes different clustering methods. Although clustering-based methods have proven effective and achieve state-of-the-art performance, some persons with similar appearance are hard to distinguish, so the pseudo labels assigned by clustering are inevitably noisy, which seriously hinders network training.
Cross-domain translation is another approach, which learns domain-invariant features from source-domain images. The Generative Adversarial Network (GAN) is one of the main representatives of this type of method. PTGAN [8] and SPGAN [26] utilize source-domain images to generate transferred images in the style of the target domain. However, the quality of the generated images restricts the performance of such methods. DAAL [25] separates the feature map into a domain-shared feature map and a domain-specific feature map; the former is transferred from the source domain to the target domain to facilitate the Re-ID task. ECN [27] and ECN++ [28] adopt a feature memory to learn exemplar-invariance, camera-invariance, and neighborhood-invariance. HCN [32] proposes a heterogeneous convolutional network that leverages a CNN and a GCN to learn the appearance and correlation information of person images. TAL-MIRN [29] leverages triple adversarial learning and multi-view imaginative reasoning to improve the generalization ability of the Re-ID model from the source domain to the target domain. Although these UDA methods perform well in the cross-domain scenario, the requirement of tremendous manual annotation on the source domain largely limits their usage in practice. In addition, UDA methods rely on transferable knowledge learned from the source domain, so the discriminative information of the target domain may not be fully explored.
2) USL Person Re-ID: USL methods do not require any labeled data. In recent years, clustering-based methods [10], [11], [33] have become the mainstream of USL methods. BUC [10] presents bottom-up clustering to generate pseudo labels, with a diversity regularization to control the number of samples in each cluster. However, only one bottom-up clustering is performed over the entire training process, so samples incorrectly merged in earlier merging steps affect all subsequent training. HCT [11] adopts hierarchical clustering to generate pseudo labels and employs the batch-hard triplet loss [34] to facilitate training. TSSL [35] designs a unified formulation that considers tracklet frame coherence, tracklet neighbourhood compactness, and tracklet cluster structure. To improve the quality of generated pseudo labels, IICS [33] decomposes the sample similarity computation into two stages: intra-camera and inter-camera computation. PPLR [13] exploits the complementary relationship between global and local features to reduce pseudo label noise. To reduce "sub and mixed" clustering errors, ISE [14] generates support samples around cluster boundaries to associate samples of the same identity.

Some studies address unsupervised person Re-ID without clustering. SSL [36] explores the similarity between unlabeled images via softened similarity learning and proposes a cross-camera encouragement term to boost it. MMCL [37] treats unsupervised person Re-ID as a multi-label classification problem and proposes a memory-based multi-label classification loss to promote training. Although these methods achieve satisfactory performance, there is still a gap between them and clustering-based methods.
In the latest research, some contrastive learning based methods have achieved remarkable performance. SpCL [15] stores the features of all instances in a hybrid memory and optimizes the encoder with a unified contrastive loss. Cluster-Contrast [38] stores features and computes the contrastive loss at the cluster level. CAP [39] designs both intra-camera and inter-camera contrastive learning to boost training. ICE [12] employs inter-instance pairwise similarity scores to promote contrastive learning. However, the inevitable pseudo label noise limits the performance of these methods.
B. Learning with Noisy Labels
In recent years, training networks on noisy or unlabeled data has been widely studied. Existing approaches can be classified into four categories: estimating the noise transition matrix [40], [41], designing robust loss functions [42], [43], correcting noisy labels [44], [45], and utilizing a peer-teaching strategy [16], [17], [20].
[Fig. 1: framework diagram with three panels. (a) Dual clustering with dynamic parameters (DCDP) for pseudo label generation and memory initialization. (b) Consistent sample mining (CSM) and co-teaching. (c) One of the Mean Nets is employed for inference.]
Fig. 1. The framework of the proposed Dual Clustering Co-teaching (DCCT) approach. To show the co-teaching process of the two networks more clearly, Net1 and its results are shown in blue, while Net2 and its results are shown in red. (a) The features extracted by Mean Net1 and Mean Net2 are clustered with different parameters ε1 and ε2 at each epoch to generate two sets of pseudo labels and initialize two memory banks. Since ε1 and ε2 change dynamically during training, we call this dual clustering with dynamic parameters (DCDP). (b) Consistent sample mining (CSM) is performed in each iteration. Specifically, Mean Net1 and Memory1 are employed to mine the consistent sample set X1* from the training dataset X1; X1* is then adopted to train Net2 (the training of Net1 is similar). The contrastive loss shown in Eq. 4 is used for training. (c) Since a Mean Net performs better than its corresponding Net, the better-performing Mean Net is employed for inference. More details are given in Algorithm 1.
This paper focuses on leveraging the peer-teaching strategy to alleviate label noise. Co-teaching [16] trains two networks, and each network selects the samples with small losses to train its peer. Inspired by Co-teaching, Co-mining [17] trains two networks for face recognition and re-weights the clean samples in each mini-batch. Mean Teacher [18] averages model weights to cope with large datasets and achieves better performance than averaging label predictions. Also drawing inspiration from Co-teaching, ACT [19] trains two networks in an asymmetric way to tackle unsupervised person Re-ID. However, one of its networks is trained only with clean samples, which limits its generalization capacity. MMT [20] employs the peer-teaching strategy for unsupervised person Re-ID and utilizes the temporally average model [18] to generate pseudo labels and soft pseudo labels so as to avoid training error amplification. However, it leverages a single set of noisy pseudo labels to train the two networks simultaneously, which results in noise accumulation and degrades the performance of the model.
III. DUAL CLUSTERING CO-TEACHING (DCCT)
Inspired by previous peer-teaching methods [19], [20], we develop a novel Dual Clustering Co-teaching (DCCT) framework that trains two networks using two sets of pseudo labels. To increase the difference and independence of the two networks, we follow MMT [20] in employing temporally average models [18].
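As background, a temporally average model maintains an exponential moving average of its network's weights. The minimal sketch below, in PyTorch, illustrates this update with the momentum coefficient α depicted in Fig. 1; the function name and the default value of α are our illustrative assumptions, not specifics from the paper.

```python
import torch

@torch.no_grad()
def ema_update(net, mean_net, alpha=0.999):
    """Momentum update of the temporally average (Mean Net) weights:
    theta_mean <- alpha * theta_mean + (1 - alpha) * theta.
    Called after every optimizer step on `net`."""
    for p_mean, p in zip(mean_net.parameters(), net.parameters()):
        p_mean.mul_(alpha).add_(p, alpha=1.0 - alpha)
```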
A. The Framework of DCCT

To better illustrate the workflow of our method, the two networks are referred to as Net1 and Net2, and their temporally average models as Mean Net1 and Mean Net2. As shown in Fig. 1, our method mainly contains two stages: (a) pseudo label generation and memory initialization; (b) co-teaching of the two networks.
(a) In the stage of pseudo label generation and memory initialization, we adopt Mean Net1 and Mean Net2 to extract features from the unlabeled dataset X. Then, different clustering parameters are calculated according to the proposed dual clustering with dynamic parameters (DCDP). After that, the features extracted by Mean Net1 and Mean Net2 are clustered with these different parameters to generate two sets of pseudo labels, and some outliers in X may be discarded according to the clustering results. This yields two relatively clean datasets with their pseudo labels: training dataset X1 with pseudo labels Y1, and training dataset X2 with pseudo labels Y2. At the same time, the two clustering results are exploited to initialize the memory banks Memory1 and Memory2, as shown in Fig. 1(a).
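To make stage (a) concrete, the sketch below assumes DBSCAN as the clustering algorithm (the paper states DCDP is effective with multiple clustering algorithms) and initializes each memory bank with the L2-normalized centroid of each cluster. The min_samples setting, the cosine metric, and the centroid-based initialization are our illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dcdp_stage(feats1, feats2, eps1, eps2):
    """Stage (a) sketch: cluster Mean Net1 / Mean Net2 features with
    distinct, per-epoch parameters eps1 / eps2 (dual clustering)."""
    labels1 = DBSCAN(eps=eps1, min_samples=4, metric="cosine").fit_predict(feats1)
    labels2 = DBSCAN(eps=eps2, min_samples=4, metric="cosine").fit_predict(feats2)

    def init_memory(feats, labels):
        # Discard DBSCAN outliers (label == -1) and initialize the memory
        # bank with the normalized mean feature (centroid) of each cluster.
        keep = labels != -1
        mem = np.stack([feats[labels == c].mean(axis=0)
                        for c in np.unique(labels[keep])])
        mem /= np.linalg.norm(mem, axis=1, keepdims=True)
        return keep, mem

    keep1, memory1 = init_memory(feats1, labels1)
    keep2, memory2 = init_memory(feats2, labels2)
    # (X1, Y1) = samples selected by keep1 with labels1; likewise (X2, Y2).
    return (keep1, labels1, memory1), (keep2, labels2, memory2)
```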
(b) In the co-teaching stage, we perform consistent sample mining (CSM) in each iteration. Concretely, we leverage Mean Net1 and Memory1 to mine the consistent sample set X1* from the training dataset X1, and X1* is employed to train Net2. Similarly, the consistent sample set X2* is employed to train Net1, as shown in Fig. 1(b). Net1 and Net2 are trained with the contrastive loss shown in Eq. 4.
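Eq. 4 lies outside this excerpt. The sketch below gives our reading of one direction of the co-teaching: a sample is deemed consistent when its most similar memory centroid agrees with its pseudo label, and a cluster-level InfoNCE-style contrastive loss stands in for Eq. 4. Both the consistency criterion and the loss form (including the temperature value) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mine_consistent(mean_feats, labels, memory):
    """CSM sketch: keep samples whose most similar memory centroid
    matches their pseudo label (our reading of 'consistent')."""
    sims = mean_feats @ memory.t()        # cosine similarity (normalized inputs)
    return sims.argmax(dim=1) == labels   # boolean mask over the mini-batch

def contrastive_loss(feats, labels, memory, tau=0.05):
    """Cluster-level contrastive loss used as a stand-in for Eq. 4:
    softmax over similarities to all cluster centroids (labels: LongTensor)."""
    logits = feats @ memory.t() / tau
    return F.cross_entropy(logits, labels)

# One co-teaching step for Net2 (the step for Net1 is symmetric):
#   mask = mine_consistent(mean_net1(x), y1, memory1)   # CSM via Mean Net1
#   loss = contrastive_loss(net2(x)[mask], y1[mask], memory1)
#   loss.backward(); optimizer2.step(); ema_update(net2, mean_net2)
```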