•We also propose consistent sample mining (CSM) to discard samples whose pseudo labels are inconsistent during each training epoch. These discarded samples are potentially noisy and may hinder network training (a minimal sketch of this criterion follows the contribution list).
•Extensive experiments on three large-scale datasets
(Market-1501 [7], MSMT17 [8], and PersonX [9])
demonstrate that our method outperforms the fully un-
supervised state-of-the-art methods by a large margin and even surpasses most UDA methods and methods utilizing camera information.
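For illustration only, the following minimal sketch shows one plausible realization of the consistency criterion behind CSM: since cluster IDs from independent clustering rounds are not aligned, a sample is kept only when its cluster co-membership largely overlaps between two consecutive rounds. The function name, the overlap rule, and the threshold are illustrative assumptions, not the exact algorithm.

```python
import numpy as np

def consistent_sample_mask(labels_prev: np.ndarray,
                           labels_curr: np.ndarray,
                           min_overlap: float = 0.5) -> np.ndarray:
    """Keep sample i only if its current cluster largely coincides with
    its previous cluster. Cluster IDs are not comparable across rounds,
    so consistency is measured by the Jaccard overlap of each sample's
    cluster co-membership sets."""
    keep = np.zeros(labels_curr.shape[0], dtype=bool)
    for i in range(labels_curr.shape[0]):
        if labels_curr[i] < 0 or labels_prev[i] < 0:  # clustering outliers
            continue
        curr = set(np.flatnonzero(labels_curr == labels_curr[i]))
        prev = set(np.flatnonzero(labels_prev == labels_prev[i]))
        keep[i] = len(curr & prev) / len(curr | prev) >= min_overlap
    return keep

# Samples with keep[i] == False are treated as potentially noisy and
# excluded from the current training epoch.
```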
II. RELATED WORKS
A. Unsupervised Person Re-ID
Unsupervised person Re-ID methods are mainly divided into unsupervised domain adaptation (UDA) methods and unsupervised learning (USL) methods.
1) UDA Person Re-ID: UDA methods generally pre-train a
model using labeled data on the source domain and transfer the
learned knowledge from the source domain to the unlabeled
target domain. Recent studies on UDA for person Re-ID can be mainly grouped into clustering-based adaptation [21]–[24] and cross-domain translation [8], [25]–[29].
Clustering-based adaptation methods leverage clustering to generate pseudo labels for unlabeled data in the target domain. Fan et al. [21] utilize the pseudo labels generated by k-means [30] to fine-tune the model. Song et al. [22] adopt DBSCAN [31] to generate pseudo labels, where the number of clusters is determined by the density of features. AD-cluster [23] leverages iterative density-based clustering to generate pseudo labels and learns an image generator that augments the training samples to enhance the discrimination ability of Re-ID models. To avoid overfitting to noisy pseudo labels, AdaDC [24] adaptively and alternately utilizes different clustering methods. Although clustering-based methods have proven effective and achieve state-of-the-art performance, the pseudo labels assigned by clustering are inevitably noisy owing to indistinguishable persons with similar appearance, which seriously hinders network training.
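As a concrete illustration of this family of methods, the sketch below assigns pseudo labels by running DBSCAN over extracted features, in the spirit of [22]. The helper name, the plain Euclidean metric, and the hyperparameters are illustrative placeholders; published methods typically use re-ranked Jaccard distances and carefully tuned settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

def generate_pseudo_labels(features: np.ndarray,
                           eps: float = 0.6,
                           min_samples: int = 4) -> np.ndarray:
    """Cluster L2-normalized features and return one pseudo label per
    sample; DBSCAN marks un-clustered samples as noise with label -1."""
    features = normalize(features, axis=1)  # cosine ~ Euclidean on unit sphere
    return DBSCAN(eps=eps, min_samples=min_samples,
                  metric="euclidean").fit_predict(features)

# Typical usage per epoch:
#   feats  = extract_features(model, unlabeled_loader)  # (N, D) array
#   pseudo = generate_pseudo_labels(feats)
#   train only on samples with pseudo != -1, then re-cluster next epoch
```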
Cross-domain translation is another approach, which learns domain-invariant features from source-domain images. Generative adversarial networks (GANs) are among the main representatives of this type of method. PTGAN [8] and SPGAN [26] utilize source-domain images to generate transferred images that share the style of the target-domain images. However, the quality of the generated images restricts the performance of such methods. DAAL [25] simultaneously separates the feature map into a domain-shared feature map and a domain-specific feature map. The
former is transferred from the source domain to the target
domain to facilitate the Re-ID task. ECN [27] and ECN++ [28]
adopt a feature memory to learn exemplar-invariance, camera-
invariance, and neighborhood-invariance. HCN [32] proposes
a heterogeneous convolutional network, which leverages CNN
and GCN to learn the appearance and correlation information
of person images. TAL-MIRN [29] leverages triple adversarial
learning and multi-view imaginative reasoning to improve the
generalization ability of the Re-ID model from the source
domain to the target domain. Although these UDA methods perform well in the cross-domain scenario, their requirement of tremendous manual annotation in the source domain largely limits their practical usage. In addition, UDA methods rely on the transferable knowledge learned from the source domain, and thus the discriminative information of the target domain may not be fully explored.
2) USL Person Re-ID: USL methods do not require any
labeled data. In recent years, clustering-based methods [10],
[11], [33] have become the mainstream of USL methods.
BUC [10] presents bottom-up clustering to generate pseudo
labels, and a diversity regularization is employed to control the
number of samples in each cluster. However, only one pass of bottom-up clustering is performed over the entire training process, so samples incorrectly merged in earlier merging steps persistently affect subsequent training. HCT [11]
adopts hierarchical clustering to generate pseudo labels and
employs batch hard triplet loss [34] to facilitate training. TSSL
[35] designs a unified formulation to consider tracklet frame
coherence, tracklet neighbourhood compactness, and tracklet
cluster structure. In order to improve the generation quality
of pseudo labels, IICS [33] decomposes the sample similarity
computation into two stages: intra-camera and inter-camera
computation. PPLR [13] exploits the complementary rela-
tionship between global and local features to reduce pseudo
label noise. To reduce “sub and mixed” clustering errors, ISE [14] generates support samples around cluster boundaries to associate samples of the same identity.
Some studies address unsupervised person Re-ID without
using clustering. SSL [36] explores the similarity between unlabeled images via softened similarity learning and introduces a cross-camera encouragement term to boost it. MMCL [37] employs the multi-label
classification method to tackle unsupervised person Re-ID
and proposes a memory-based multi-label classification loss
to promote training. Although these methods achieve satisfactory performance, a gap remains between them and clustering-based methods.
In recent research, contrastive-learning-based methods have achieved remarkable performance. SpCL [15] stores the features of all instances in a hybrid memory and optimizes the encoder with a unified contrastive loss. Cluster-
Contrast [38] stores features and computes contrastive loss
at the cluster level. CAP [39] designs both intra-camera and
inter-camera contrastive learning to boost training. ICE [12]
employs inter-instance pairwise similarity scores to promote
contrastive learning. However, the inevitable pseudo label
noise limits the performance of these methods.
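For concreteness, the sketch below shows a generic cluster-level contrastive loss in the spirit of these memory-based methods: batch features are compared against a memory of cluster centroids with an InfoNCE-style objective. This is a simplified illustration rather than the exact implementation of any cited method, and the momentum update in the trailing comment is one common design choice.

```python
import torch
import torch.nn.functional as F

def cluster_contrastive_loss(queries: torch.Tensor,
                             labels: torch.Tensor,
                             centroids: torch.Tensor,
                             temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss between query features and cluster centroids.

    queries:   (B, D) L2-normalized features of the current batch
    labels:    (B,)   pseudo labels indexing rows of `centroids`
    centroids: (K, D) L2-normalized cluster-level memory
    """
    logits = queries @ centroids.t() / temperature  # (B, K) similarities
    return F.cross_entropy(logits, labels)

# After each batch, the centroid of each involved cluster y is often
# refreshed with a momentum rule, e.g.:
#   centroids[y] = F.normalize(m * centroids[y] + (1 - m) * q, dim=0)
```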
B. Learning with Noisy Labels
In recent years, training networks on noisy or unlabeled data has been widely studied. Existing approaches can be classified into four categories: estimating the noise transition matrix [40], [41], designing robust loss functions [42], [43], correcting noisy labels [44], [45], and utilizing the peer-teaching strategy [16], [17], [20].
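As an illustration of the peer-teaching strategy, the sketch below implements the small-loss selection step popularized by co-teaching-style methods: each of two peer networks selects the low-loss (likely clean) samples in a batch to train the other. The function name and the keep ratio are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def small_loss_selection(logits_a, logits_b, targets, keep_ratio=0.7):
    """Each network picks the small-loss samples to train its peer."""
    loss_a = F.cross_entropy(logits_a, targets, reduction="none")
    loss_b = F.cross_entropy(logits_b, targets, reduction="none")
    k = max(1, int(keep_ratio * targets.size(0)))
    idx_for_b = torch.topk(-loss_a, k).indices  # clean set chosen by net A
    idx_for_a = torch.topk(-loss_b, k).indices  # clean set chosen by net B
    return idx_for_a, idx_for_b

# Training step (sketch): each network is updated only on the subset
# selected by its peer, which suppresses mutually confirmed noisy labels:
#   loss_net_a = F.cross_entropy(logits_a[idx_for_a], targets[idx_for_a])
#   loss_net_b = F.cross_entropy(logits_b[idx_for_b], targets[idx_for_b])
```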