trum is very large. In V2V ReID, some recent studies [10], [11] have considered generating pseudo training samples to reduce the intra-class variance, whereas, to our knowledge, training sample expansion has not been studied in V2I ReID. Existing V2I ReID methods [12], [8], [13] mainly strive to reduce the cross-modality discrepancy via cross-modality image generation, and most of them employ generative adversarial networks (GANs) [14], [15] to generate missing-modality counterparts for existing-modality images.
In this paper, we develop Flow2Flow, a unified framework
to explore how image generation, including training sample generation and cross-modality image generation, improves the
V2I person ReID task. Specifically, our framework contains
two flow-based generative models [16], [17], i.e., a visible
flow and an infrared flow, which learn invertible or bijective
transformations from the visible image domain and infrared
image domain to an isotropic Gaussian domain, respectively.
Thereby, pseudo visible or infrared training samples can be generated by forward flow propagation from the latent noise domain to the visible or infrared image domain, while missing-modality images can be generated from given-modality images via transformations from the given-modality domain to the Gaussian noise domain and then to the missing-modality domain. Fig. 1 illustrates the training sample generation and cross-modality image generation.
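To make the two generation paths concrete, let $f_V$ and $f_I$ denote the visible and infrared flows, each mapping its image domain to the shared Gaussian domain (the notation here is ours and only illustrative; the precise formulation follows in Section IV). A pseudo visible sample and an infrared translation of a visible image $x_V$ can then be sketched as
$$
\tilde{x}_V = f_V^{-1}(z), \ z \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad \hat{x}_I = f_I^{-1}\big(f_V(x_V)\big),
$$
where the first path samples from latent noise and the second translates a visible image into the infrared modality through the shared latent space.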
To guarantee invertibility and exact log-likelihood computation, existing flow models [16], [17] are composed of multiple 1×1 convolutional layers and linear coupling layers, which leads to insufficient nonlinearity. To resolve this, we implement an extra invertible activation layer in the last block of the visible and infrared flows to increase the model nonlinearity.
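For context, a coupling layer in [16], [17] splits its input $x$ into two parts $(x_a, x_b)$ and transforms only one part using functions of the other, e.g., in the affine form
$$
y_a = x_a, \qquad y_b = x_b \odot \exp\big(s(x_a)\big) + t(x_a),
$$
which is trivially invertible, $x_b = \big(y_b - t(y_a)\big) \odot \exp\big(-s(y_a)\big)$, and has a triangular Jacobian whose log-determinant is simply $\sum_j s(x_a)_j$. Since the output is only an affine function of the transformed part, stacking such layers with 1×1 convolutions provides limited nonlinearity, which is what the extra invertible activation layer above is meant to alleviate.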
In addition, we propose an identity adversarial training strategy and a modality adversarial training strategy to encourage the generated images to correspond to specific identities and modalities. For the purpose of adversarial training, we implement two discriminators for each modality: an image encoder for identity alignment and a modality discriminator for modality alignment. To enable identity alignment between the real and generated images, we minimize the distance between their encoded features when training the generators, and maximize it when training the discriminators. Meanwhile, the modality discriminators distinguish whether images are generated or drawn from a specific real modality.
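A minimal sketch of the identity adversarial objective is given below; the encoder interface, tensor names, and the squared Euclidean feature distance are illustrative assumptions rather than the exact formulation, which is specified in Section IV.

```python
import torch.nn.functional as F

def identity_adversarial_losses(encoder, x_real, x_gen):
    """Illustrative sketch only: `encoder`, `x_real` and `x_gen` are
    placeholder names, and the squared Euclidean distance is an assumed
    choice of feature distance."""
    # Generator step: pull the features of a generated image toward those of
    # a real image of the same identity (minimize the feature distance).
    loss_generator = F.mse_loss(encoder(x_gen), encoder(x_real).detach())
    # Discriminator (encoder) step: push the two features apart
    # (maximize the distance, i.e., minimize its negative).
    loss_encoder = -F.mse_loss(encoder(x_gen.detach()), encoder(x_real))
    return loss_generator, loss_encoder
```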
To the best of our knowledge, this is the first study that
achieves both training sample generation and cross-modality
generation via a unified framework. Experimental results
demonstrate that both types of generation significantly improve V2I ReID performance. For example, the training sample expansion and cross-modality generation obtain gains of 2.0% and 1.2% mAP over the baseline model on the SYSU-MM01 [5] dataset. The main contributions of this paper are
three-fold:
• To explore how image generation helps V2I person ReID, we propose Flow2Flow, a unified framework that jointly generates pseudo training samples and cross-modality images; it contains a visible flow and an infrared flow to learn bijective transformations from the image domains to the Gaussian domain;
• For the purpose of identity alignment and modality alignment of the generated images, we develop an image encoder and a modality discriminator for each modality to perform identity adversarial training and modality adversarial training, respectively;
• We demonstrate that both training sample expansion and cross-modality generation significantly improve V2I ReID accuracy. In addition, our Flow2Flow model achieves new state-of-the-art (SOTA) performance on the SYSU-MM01 dataset.
The remainder of this paper is organized as follows: Section II reviews recent literature related to this paper; Section III briefly reviews the theoretical background of flow-based generative models; Section IV elaborates the Flow2Flow model in detail; Section V presents ablation studies, visualizations, and comparisons with the SOTA; and Section VI draws conclusions.
II. RELATED WORKS
A. Visible-to-Visible Person ReID
The V2V person ReID is a single-modality image retrieval task that is devoted to enlarging the inter-class variance and reducing the intra-class variance. To this end, existing
methods mainly consider three levels of factors: objective-
level, network-level and data-level. For the objectives or loss
functions, TriNet [1] proposed a hard triplet mining strategy on the basis of the triplet loss to learn pedestrian representations; BoT [2] combined the cross-entropy loss and triplet loss to train the network; moreover, the center loss [18] and angular
loss [19] have also been successfully applied in the V2V
person ReID. For the network, early works [1] learned global features from pedestrian images via a single CNN branch. Subsequently, multi-branch architectures have been adopted to learn multi-granularity or part-level features [20], [21], [22]. Furthermore, data augmentation or generation [23], [11], which belongs to the data-level category, can also improve ReID accuracy. For example, PN-GAN [11] generated multi-pose pedestrian images via a GAN model, which reduces the pedestrian view variance; JVTC [23] performed online data augmentation for contrastive learning, in which mesh projections were taken as references to generate multi-view images.
B. Visible-to-Infrared Person ReID
The V2I person ReID enables cross-spectrum pedestrian retrieval, whose crux is to reduce the large cross-modality discrepancy. Existing V2I ReID methods mainly adopt two techniques to reduce the modality discrepancy: 1) learning modality-shared pedestrian representations and 2) compensating for the information of the missing modality via generative models [14], [15]. The modality-shared ones [24], [25], [26],
[7] projected the visible and infrared pedestrian images into a
shared Euclidean space, in which the intra-class similarity and
inter-class similarity are maximized and minimized, respec-
tively. For example, DGD-MSR [24] proposed a modality-
specific network to extract modality-specific representations