How Image Generation Helps Visible-to-Infrared
Person Re-Identification?
Honghu Pan, Yongyong Chen, Member, IEEE, Yunqi He, Xin Li*, Zhenyu He*, Senior Member, IEEE
Abstract—Compared to visible-to-visible (V2V) person re-
identification (ReID), the visible-to-infrared (V2I) person ReID
task is more challenging due to the lack of sufficient training
samples and the large cross-modality discrepancy. To this end,
we propose Flow2Flow, a unified framework that can jointly achieve training sample expansion and cross-modality image generation for V2I person ReID. Specifically, Flow2Flow learns
bijective transformations from both the visible image domain
and the infrared domain to a shared isotropic Gaussian domain
with an invertible visible flow-based generator and an infrared
one, respectively. With Flow2Flow, we are able to generate
pseudo training samples by the transformation from latent
Gaussian noises to visible or infrared images, and generate
cross-modality images by transformations from existing-modality
images to latent Gaussian noises to missing-modality images.
For the purpose of identity alignment and modality alignment
of generated images, we develop adversarial training strategies
to train Flow2Flow. Specifically, we design an image encoder
and a modality discriminator for each modality. The image
encoder encourages the generated images to be similar to real
images of the same identity via identity adversarial training,
and the modality discriminator makes the generated images
modal-indistinguishable from real images via modality adversar-
ial training. Experimental results on SYSU-MM01 and RegDB
demonstrate that both training sample expansion and cross-
modality image generation can significantly improve V2I ReID
accuracy.
Index Terms—Visible-to-Infrared Person Re-Identification,
Flow-based Generative Model, Adversarial Training.
I. INTRODUCTION
Person re-identification (ReID), which aims to match pedes-
trian images captured by non-overlapped cameras, is a crucial
technique in video surveillance. In recent years, the person
ReID methods [1], [2] have achieved human-level accuracy
This research is supported in part by the National Natural Science Founda-
tion of China (Grant No.62172126 and Grant No.62106063), by the Shenzhen
Research Council (Grant No. JCYJ20210324120202006), by the Guang-
dong Natural Science Foundation under Grant 2022A1515010819, by the
Shenzhen College Stability Support Plan (Grant GXWD20201230155427003-
20200824113231001), and by The Major Key Project of PCL (Grant
PCL2021A03-1).
H. Pan is with School of Computer Science and Technology, Harbin
Institute of Technology, Shenzhen, Shenzhen 518055, China. (Email:
19B951002@stu.hit.edu.cn)
Y. Chen is with School of Computer Science and Technology, Harbin
Institute of Technology, Shenzhen 518055, China, and also with Guangdong
Provincial Key Laboratory of Novel Security Intelligence Technologies.
(Email: YongyongChen.cn@gmail.com)
Y. He is with College of Information and Computer Engineering, Northeast
Forestry University, Harbin 150000, China. (Email: heyunqi.cs@gmail.com)
X. Li is with Peng Cheng Laboratory, Shenzhen 518055, China. (Email:
xinlihitsz@gmail.com)
Z. He is with School of Computer Science and Technology, Harbin Institute
of Technology, Shenzhen, Shenzhen 518055, China, and also with Peng Cheng
Laboratory, Shenzhen 518055, China. (Email: zhenyuhe@hit.edu.cn)
Fig. 1. Schematic of (a) training sample generation, (b) visible-to-infrared cross-modality image generation, and (c) infrared-to-visible cross-modality image generation, in which images outlined by red boxes are generated images. The proposed Flow2Flow contains a visible flow and an infrared flow, which learn bijective transformations from the visible image domain and the infrared image domain to an isotropic Gaussian domain, respectively. (a) Training sample generation: a latent Gaussian noise can be transformed to a pseudo visible sample $\hat{x}_j^{\langle v\rangle}$ or a pseudo infrared sample $\hat{x}_j^{\langle r\rangle}$ by the forward propagation of the visible flow or the infrared flow. (b) Cross-modality generation from the visible domain to the infrared domain: the visible image $x_i^{\langle v\rangle}$ is first transformed to a latent Gaussian noise $z_i$ by reverse propagation of the visible flow; then $z_i$ can be transformed to the corresponding infrared image $\hat{x}_i^{\langle r\rangle}$ by forward propagation of the infrared flow. (c) Vice versa for cross-modality generation from the infrared domain to the visible domain.
on some large-scale datasets [3], [4]. However, these methods
assume that the pedestrian images are captured by visible-
spectrum cameras under bright environments, and do not work
well in nighttime surveillance scenarios. Considering that infrared radiation is immune to illumination, visible-to-infrared (V2I) person ReID [5], [6], [7], [8], which is a cross-spectrum or cross-modality matching task, has gained broad attention in the computer vision community.
Although recent studies [7], [8], [9] have made great efforts on V2I ReID, it remains very challenging due to the
following two reasons. First, the number of training images
in V2I datasets [5], [6] is not as large as that in visible-
to-visible (V2V) ReID datasets [3], [4], especially for the
infrared images. For example, MSMT17 [4], one of the large-scale V2V datasets, contains 32,621 training samples, while
SYSU-MM01 [5] and RegDB [6] only contain 9,929 and 2,060
infrared images for training, respectively. Second, the modality discrepancy between the visible spectrum and the infrared spectrum is very large. In V2V ReID, some recent studies [10], [11] have considered generating pseudo training samples to reduce the intra-class variance, whereas, to our knowledge, training sample expansion has not been studied in V2I ReID. Existing V2I ReID methods [12], [8], [13] mainly strive to reduce the cross-modality discrepancy via cross-modality image generation, and most of them employ the generative adversarial network (GAN) [14], [15] to generate missing-modality images for existing-modality images.
In this paper, we develop Flow2Flow, a unified framework
to explore how image generation, including the training sample
generation and cross-modality image generation, improves the
V2I person ReID task. Specifically, our framework contains
two flow-based generative models [16], [17], i.e., a visible
flow and an infrared flow, which learn invertible or bijective
transformations from the visible image domain and infrared
image domain to an isotropic Gaussian domain, respectively.
Thereby, pseudo visible or infrared training samples can be generated by forward flow propagation from the latent noise domain to the visible or infrared image domain, while missing-modality images can be generated from given-modality images by transformations from the given-modality domain through the Gaussian noise domain to the missing-modality domain. Fig. 1 shows the schematic of training sample generation and cross-modality image generation.
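To make the two generation modes concrete, the following sketch (in PyTorch-style Python) shows how they could be driven, assuming hypothetical flow objects exposing a forward_flow (noise to image) and a reverse_flow (image to noise) method; the names and shapes are illustrative, not the paper's actual API:

import torch

def generate_pseudo_sample(flow, shape=(3, 288, 144)):
    """Training sample expansion: forward propagation of a Gaussian noise."""
    z = torch.randn(1, *shape)       # latent isotropic Gaussian noise
    x_hat, _ = flow.forward_flow(z)  # noise domain -> image domain
    return x_hat

def cross_modality_generate(src_flow, dst_flow, x_src):
    """Existing modality -> shared Gaussian noise -> missing modality."""
    z, _ = src_flow.reverse_flow(x_src)      # e.g., visible image -> noise
    x_dst_hat, _ = dst_flow.forward_flow(z)  # noise -> e.g., infrared image
    return x_dst_hat

For instance, cross_modality_generate(visible_flow, infrared_flow, x_v) would realize the visible-to-infrared path of Fig. 1(b), and swapping the two flows gives the infrared-to-visible path of Fig. 1(c).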
To guarantee invertibility and exact log-likelihood computation, existing flow models [16], [17] are composed of multiple 1×1 convolutional layers and linear coupling layers, which leads to insufficient nonlinearity. To resolve this, we implement an extra invertible activation layer in the last block of the visible and infrared flows to increase model nonlinearity.
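The paper does not detail the form of this activation, so the following is only a minimal sketch of one invertible elementwise nonlinearity with an exact log-determinant, a LeakyReLU-style function chosen purely for illustration:

import math
import torch

class InvertibleLeakyReLU(torch.nn.Module):
    """Elementwise nonlinearity that is bijective with closed-form log|det J|."""
    def __init__(self, slope=0.1):
        super().__init__()
        self.slope = slope

    def forward(self, x):
        y = torch.where(x >= 0, x, self.slope * x)
        # the derivative is 1 for x >= 0 and `slope` for x < 0, so the
        # log-determinant is log(slope) times the number of negative entries
        log_det = (x < 0).float().flatten(1).sum(dim=1) * math.log(self.slope)
        return y, log_det

    def inverse(self, y):
        # the map preserves sign, so the negative branch divides by slope
        return torch.where(y >= 0, y, y / self.slope)

Any elementwise bijection with a tractable derivative would serve the same purpose of adding nonlinearity while keeping the log-likelihood computation exact.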
In addition, we propose an identity adversarial training strategy and a modality adversarial training strategy to encourage the generated images to correspond to specific identities and modalities. For the purpose of adversarial training, we implement two discriminators for each modality: an image encoder for identity alignment and a modality discriminator for modality alignment. To enable identity alignment between real and generated images, we minimize the distance between their encoded features when training the generators and maximize it when training the discriminators. The modality discriminators, in turn, distinguish whether an image is generated or drawn from a specific real modality.
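A minimal sketch of the two adversarial objectives, assuming a hypothetical image encoder and modality discriminator per modality; the exact distance functions and loss weights used in the paper may differ:

import torch
import torch.nn.functional as F

def identity_adv_gen_loss(encoder, x_real, x_gen):
    # generator step: pull generated features toward the (detached)
    # real features of the same identity
    return F.mse_loss(encoder(x_gen), encoder(x_real).detach())

def identity_adv_enc_loss(encoder, x_real, x_gen):
    # encoder step: push real and generated features apart (adversarial)
    return -F.mse_loss(encoder(x_gen.detach()), encoder(x_real))

def modality_adv_losses(disc, x_real, x_gen):
    # discriminator step: real images of this modality vs. generated ones
    real_logit, fake_logit = disc(x_real), disc(x_gen.detach())
    d_loss = (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
              + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))
    # generator step: make generated images modality-indistinguishable
    gen_logit = disc(x_gen)
    g_loss = F.binary_cross_entropy_with_logits(gen_logit, torch.ones_like(gen_logit))
    return d_loss, g_loss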
To the best of our knowledge, this is the first study that
achieves both training sample generation and cross-modality
generation via a unified framework. Experimental results
demonstrate that both generations improve the V2I ReID
performance significantly. For example, training sample expansion and cross-modality generation yield gains of 2.0% and 1.2% mAP, respectively, over the baseline model on the SYSU-MM01 [5] dataset. The main contributions of this paper are
three-fold:
• To explore how image generation helps V2I person ReID, we propose Flow2Flow, a unified framework that jointly generates pseudo training samples and cross-modality images; it contains a visible flow and an infrared flow that learn bijective transformations from the image domains to a Gaussian domain;
• For the purpose of identity alignment and modality alignment of generated images, we develop an image encoder and a modality discriminator for each modality to perform identity adversarial training and modality adversarial training, respectively;
• We demonstrate that both training sample expansion and cross-modality generation significantly improve V2I ReID accuracy. In addition, our Flow2Flow model achieves new state-of-the-art (SOTA) performance on the SYSU-MM01 dataset.
The remainder of this paper is organized as follows: Section II reviews recent literature related to this paper; Section III briefly reviews the theoretical background of flow-based generative models; Section IV elaborates the Flow2Flow model in detail; Section V presents ablation studies, visualizations, and comparisons with the SOTA; Section VI draws brief conclusions.
II. RELATED WORKS
A. Visible-to-Visible Person ReID
The V2V person ReID is a single-modality image retrieval task, which aims to enlarge the inter-class variance and reduce the intra-class variance. To this end, existing
methods mainly consider three levels of factors: objective-
level, network-level and data-level. For the objectives or loss
functions, TriNet [1] proposed the hard triplet mining strategy
on the basis of triplet loss to learn pedestrian representations;
BoT [2] combined the cross entropy loss and triplet loss to
train network; moreover, the center loss [18] and angular
loss [19] have also been successfully applied in the V2V
person ReID. For the network, early works [1] learned the
global features from pedestrian images via a single CNN
branch.Next, the multi-branch architecture has been adopted
to learn the multi-granularity or part-level features [20], [21],
[22]. Furthermore, data augmentation or generation [23], [11]
could also improve the ReID accuracy, which belongs to the
data-based category. For example, PN-GAN [11] generated multi-pose pedestrian images via a GAN model, which could reduce the pedestrian view variance; JVTC [23] conducted
the online data augmentation for contrastive learning, in which
the mesh projections were taken as the references to generate
multi-view images.
B. Visible-to-Infrared Person ReID
The V2I person ReID enables the cross-spectrum pedes-
trian retrieval, whose crux is to reduce the large cross-modality discrepancy. Existing V2I ReID methods mainly adopt two techniques to reduce the modality discrepancy: 1) learning modality-shared pedestrian representations and 2) compensating for missing-modality information via generative models [14], [15]. The modality-shared methods [24], [25], [26],
[7] projected the visible and infrared pedestrian images into a
shared Euclidean space, in which the intra-class similarity and
inter-class similarity are maximized and minimized, respec-
tively. For example, DGD-MSR [24] proposed a modality-
specific network to extract modality-specific representations
from each modality; expAT [25] devised an exponential angu-
lar triplet loss beyond the Euclidean metric based constraints
to learn the angularly discriminative features; MPANet [7]
aimed to capture the nuances of cross-modality images via a
modality alleviation module and a pattern alignment module.
The modality compensation ones [9], [27], [12], [8], [13]
usually generated missing modality information from existing
modality data: DDRL [27] proposed an image-level sub-
network based on GAN model, which could translate a visible
(infrared) image to a corresponding infrared (visible) one;
cmPIG [13] employed the set-level alignment information
to generate instance alignment cross-modality paired-images;
FMCNet [9] utilized the feature-level modality compensation
to reduce modality discrepancy, which generated the cross-
modality features rather than images. The method proposed in this paper falls into the modality compensation category. Compared to existing methods that directly learn
a transformation from given modality to missing modality
via GAN models, our method employs the flow-based gen-
erative models to construct invertible transformations from
given modality to latent Gaussian noise to missing modality.
Thereby, besides the cross-modality generation, our method
could generate pseudo training samples via transformations
from Gaussian noise to image modalities.
C. Flow-based Generative Model
The flow-based generative model constructs an invertible or
bijective mapping from the complex distribution of true data to
a simple distribution (e.g., isotropic Gaussian distribution). For
the purpose of invertibility and exact log-likelihood computa-
tion, layers in flow-based model should be carefully designed.
RealNVP [16] proposed the affine coupling layer, whose Jacobian determinant can be computed easily; Glow [17] presented an invertible 1×1 convolution layer, in which the LU decomposition was utilized to speed up the computation of determinants; cAttnFlow [28] introduced the
invertible attentions to increase the nonlinearity of flow-based
model. Recently, a great number of works have extended the
flow-based model into speech synthesis [29], molecular graph
generation [30], [31] and image generation [17], [32], [33]. For
the molecular graph generation, MoFlow [31] implemented an
atom flow and a conditional bond flow to generate the atom
features and atom bonds of a molecule, respectively. For the
image super-resolution, SRFlow [32] and HCFlow [33] took
the low-resolution images as the condition, and thus learned
the high-resolution images via a conditional flow. In this paper,
we take advantage of the invertibility of flow-based model to
achieve 1) generating pseudo samples from isotropic Gaussian
noises and 2) cross-modality image generation from existing
modality to latent noises to missing modality. As far as we can
tell, this is the first study that applies the flow-based model to person ReID.
D. Generative Adversarial Network
The first GAN model was proposed in [14]; it consists of a generator and a discriminator, which improve each other through adversarial training. In the GAN model, the generator produces samples from noise variables with a known probability density function (PDF) and tries to fool the discriminator, while the discriminator distinguishes whether the data is real or fake to beat the generator. Recently, GAN architectures have been heavily refined to suit various application scenarios. For instance, the conditional GAN [34], [35] can generate samples corresponding to specific condition labels, and CycleGAN [15] enables unpaired cross-domain image translation via the cycle consistency loss. Meanwhile, the GAN model has also shown its strength in both the V2I person ReID [9], [27], [12] and V2V person ReID [23], [11] areas.
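In the notation used below, this adversarial game can be written as the well-known minimax objective of the original GAN [14], reproduced here for reference (this is the standard formulation, not an equation from this paper):

$$\min_G \max_D \; \mathbb{E}_{x \sim P(X)}\big[\log D(x)\big] + \mathbb{E}_{z \sim \Pi(Z)}\big[\log\big(1 - D(G(z))\big)\big],$$

where the discriminator $D$ is trained to assign high scores to real data and low scores to generated data, while the generator $G$ is trained to do the opposite.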
Unlike the flow-based model [16], [17], which can exactly compute the log-likelihood of the true data, the GAN model implicitly minimizes the KL divergence between the true data and the data generated from noises. To make the generated data indistinguishable from the real data, training a GAN model pursues an equilibrium between the generator and the discriminator, which requires careful tuning of the experimental setup. In this paper,
we combine the flow-based model and adversarial training
to generate the high-quality visible and infrared pedestrian
images.
III. PRELIMINARIES
The flow-based generative model aims to learn a bijective transformation from a complex distribution $X \sim P(X)$ to a simple distribution $Z \sim \Pi(Z)$ with a known probability density function, in which $X$ denotes the true training data and $\Pi(Z)$ is usually a Gaussian distribution. To achieve a bijective mapping, the flow-based model consists of a sequence of invertible generators $G = G_1 \circ \cdots \circ G_L$:

$$x_i = G(z_i), \qquad z_i = G^{-1}(x_i). \tag{1}$$
By the change-of-variables formula, $P(X)$ and $\Pi(Z)$ satisfy the following transformation:

$$P(X) = \Pi(Z)\,\big|\det(J_{G^{-1}})\big|, \tag{2}$$

where $\det(J_{G^{-1}})$ denotes the determinant of the Jacobian matrix.
Then the objective $\max\{\log(P(X))\}$ can be converted to

$$\max\Big\{\sum_i \log(\Pi(z_i)) + \sum_{l=1}^{L} \log\big|\det(J_{G_l^{-1}})\big|\Big\}. \tag{3}$$
From Eqs. (1), (2) and (3), we can see that the training process of the flow-based model follows the reverse propagation, while the inference or generation process follows the forward propagation.
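As a concrete illustration, the training objective of Eq. (3) could be computed as follows, assuming a hypothetical flow object whose reverse_flow returns the latent code together with the accumulated log-determinant of all layers, and a standard-normal prior $\Pi(Z)$:

import math
import torch

def flow_nll_loss(flow, x):
    """Negative log-likelihood of real images under the flow (cf. Eq. (3))."""
    # reverse propagation: image -> latent code, accumulating log|det J|
    z, sum_log_det = flow.reverse_flow(x)
    # log-density of the isotropic standard Gaussian prior, summed per sample
    log_pz = (-0.5 * (z ** 2 + math.log(2 * math.pi))).flatten(1).sum(dim=1)
    log_px = log_pz + sum_log_det  # change of variables, Eq. (2)
    return -log_px.mean()          # maximizing likelihood = minimizing NLL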
A standard flow-based model mainly contains two categories of layers: the invertible 1×1 convolution layer [17] and the affine coupling layer [36], [16]. For a single generator $G_l$ in $G$, the reverse and forward propagation of the 1×1 convolution layer have the following expressions:

$$z_i^{\langle l-1\rangle} = W_l\, z_i^{\langle l\rangle}, \qquad z_i^{\langle l\rangle} = W_l^{-1} z_i^{\langle l-1\rangle}, \tag{4}$$

where $Z^{\langle 0\rangle}$ and $Z^{\langle L\rangle}$ denote $Z$ and $X$, respectively. The
design of the affine coupling layer should allow 1) an invertible transformation and 2) exact computation of the Jacobian determinant.
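For completeness, a minimal sketch of an affine coupling layer in the style of RealNVP [16] is given below; it operates on flat feature vectors for brevity (image flows would use convolutional sub-networks), and scale_net/shift_net are placeholder sub-networks rather than the paper's actual design:

import torch

class AffineCoupling(torch.nn.Module):
    """Half of the features parameterize an affine map of the other half,
    so the inverse and the log-determinant are exact (assumes even width)."""
    def __init__(self, channels, hidden=256):
        super().__init__()
        half = channels // 2
        self.scale_net = torch.nn.Sequential(
            torch.nn.Linear(half, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, half), torch.nn.Tanh())
        self.shift_net = torch.nn.Sequential(
            torch.nn.Linear(half, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, half))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        log_s = self.scale_net(x1)  # log-scale, bounded by Tanh
        y2 = x2 * torch.exp(log_s) + self.shift_net(x1)
        # the Jacobian is triangular, so log|det J| = sum of log-scales
        return torch.cat([x1, y2], dim=1), log_s.sum(dim=1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=1)
        x2 = (y2 - self.shift_net(y1)) * torch.exp(-self.scale_net(y1))
        return torch.cat([y1, x2], dim=1)

Because the first half of the features passes through unchanged, the Jacobian of this transform is triangular and its determinant reduces to the product of the scales, which is exactly the property demanded above.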