trum is very large. In V2V ReID, some recent studies [10], [11] have considered generating pseudo training samples to reduce the intra-class variance, whereas, to our knowledge, training sample expansion has not been studied in V2I ReID. Existing V2I ReID methods [12], [8], [13] mainly strive to reduce the cross-modality discrepancy via cross-modality image generation, and most of them employ generative adversarial networks (GANs) [14], [15] to generate missing-modality counterparts for existing-modality images.
In this paper, we develop Flow2Flow, a unified framework
to explore how image generation, including training sample generation and cross-modality image generation, improves the
V2I person ReID task. Specifically, our framework contains
two flow-based generative models [16], [17], i.e., a visible
flow and an infrared flow, which learn invertible or bijective
transformations from the visible image domain and infrared
image domain to an isotropic Gaussian domain, respectively.
Thereby, pseudo visible or infrared training samples can be generated by forward flow propagation from the latent noise domain to the visible or infrared image domain, while missing-modality images can be generated from given-modality images via transformations from the given-modality domain to the Gaussian noise domain and then to the missing-modality domain. Fig. 1 illustrates the training sample generation and cross-modality image generation.
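To make the two generation paths concrete, let $f_V$ and $f_I$ denote the visible and infrared flows, each mapping its image domain to the shared Gaussian domain (the notation here is ours and only illustrative; the precise formulation follows in Section IV). A pseudo visible sample and an infrared translation of a visible image $x_V$ can then be sketched as
$$
\tilde{x}_V = f_V^{-1}(z), \ z \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad \hat{x}_I = f_I^{-1}\big(f_V(x_V)\big),
$$
where the first path samples from latent noise and the second translates a visible image into the infrared modality through the shared latent space.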
To guarantee invertibility and exact log-likelihood computation, existing flow models [16], [17] are composed of multiple 1×1 convolutional layers and linear coupling layers, which leads to insufficient nonlinearity. To resolve this, we implement an extra invertible activation layer in the last block of the visible and infrared flows to increase the model nonlinearity.
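For context, a coupling layer in [16], [17] splits its input $x$ into two parts $(x_a, x_b)$ and transforms only one part using functions of the other, e.g., in the affine form
$$
y_a = x_a, \qquad y_b = x_b \odot \exp\big(s(x_a)\big) + t(x_a),
$$
which is trivially invertible, $x_b = \big(y_b - t(y_a)\big) \odot \exp\big(-s(y_a)\big)$, and has a triangular Jacobian whose log-determinant is simply $\sum_j s(x_a)_j$. Since the output is only an affine function of the transformed part, stacking such layers with 1×1 convolutions provides limited nonlinearity, which is what the extra invertible activation layer above is meant to alleviate.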
In addition, we propose an identity adversarial training strategy and a modality adversarial training strategy to encourage the generated images to correspond to specific identities and modalities. For the purpose of adversarial training, we implement two discriminators for each modality: an image encoder for identity alignment and a modality discriminator for modality alignment. To enable identity alignment between the real and generated images, we minimize the distance between their encoded features when training the generators, and maximize it when training the discriminators. Meanwhile, the modality discriminators distinguish whether images are generated or drawn from a specific real modality.
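A minimal sketch of the identity adversarial objective is given below; the encoder interface, tensor names, and the squared Euclidean feature distance are illustrative assumptions rather than the exact formulation, which is specified in Section IV.

```python
import torch.nn.functional as F

def identity_adversarial_losses(encoder, x_real, x_gen):
    """Illustrative sketch only: `encoder`, `x_real` and `x_gen` are
    placeholder names, and the squared Euclidean distance is an assumed
    choice of feature distance."""
    # Generator step: pull the features of a generated image toward those of
    # a real image of the same identity (minimize the feature distance).
    loss_generator = F.mse_loss(encoder(x_gen), encoder(x_real).detach())
    # Discriminator (encoder) step: push the two features apart
    # (maximize the distance, i.e., minimize its negative).
    loss_encoder = -F.mse_loss(encoder(x_gen.detach()), encoder(x_real))
    return loss_generator, loss_encoder
```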
To the best of our knowledge, this is the first study that
achieves both training sample generation and cross-modality
generation via a unified framework. Experimental results
demonstrate that both types of generation significantly improve V2I ReID performance. For example, the training sample expansion and cross-modality generation obtain gains of 2.0% and 1.2% mAP over the baseline model on the SYSU-MM01 [5] dataset. The main contributions of this paper are
three-fold:
• To explore how image generation helps V2I person ReID, we propose Flow2Flow, a unified framework that jointly generates pseudo training samples and cross-modality images; it contains a visible flow and an infrared flow to learn bijective transformations from the image domains to the Gaussian domain;
• For the purpose of identity alignment and modality alignment of the generated images, we develop an image encoder and a modality discriminator for each modality to perform identity adversarial training and modality adversarial training, respectively;
• We demonstrate that both training sample expansion and cross-modality generation significantly improve V2I ReID accuracy. In addition, our Flow2Flow model achieves new state-of-the-art (SOTA) performance on the SYSU-MM01 dataset.
The remainder of this paper is organized as follows: Section II reviews recent literature related to this paper; Section III briefly reviews the theoretical background of flow-based generative models; Section IV elaborates the Flow2Flow model in detail; Section V presents ablation studies, visualizations, and comparisons with the SOTA; and Section VI draws conclusions.
II. RELATED WORKS
A. Visible-to-Visible Person ReID
The V2V person ReID is a single-modality image retrieval task that is devoted to enlarging the inter-class variance and reducing the intra-class variance. To this end, existing
methods mainly consider three levels of factors: objective-
level, network-level and data-level. For the objectives or loss
functions, TriNet [1] proposed a hard triplet mining strategy on the basis of the triplet loss to learn pedestrian representations; BoT [2] combined the cross-entropy loss and triplet loss to train the network; moreover, the center loss [18] and angular
loss [19] have also been successfully applied in the V2V
person ReID. For the network, early works [1] learned global features from pedestrian images via a single CNN branch. Subsequently, multi-branch architectures have been adopted to learn multi-granularity or part-level features [20], [21], [22]. Furthermore, data augmentation or generation [23], [11], which belongs to the data-level category, can also improve ReID accuracy. For example, PN-GAN [11] generated multi-pose pedestrian images via a GAN model, which reduces the pedestrian view variance; JVTC [23] performed online data augmentation for contrastive learning, in which mesh projections were taken as references to generate multi-view images.
B. Visible-to-Infrared Person ReID
The V2I person ReID enables cross-spectrum pedestrian retrieval, whose crux is to reduce the large cross-modality discrepancy. Existing V2I ReID methods mainly adopt two techniques to reduce the modality discrepancy: 1) learning modality-shared pedestrian representations and 2) compensating for the information of the missing modality via generative models [14], [15]. The modality-shared ones [24], [25], [26],
[7] projected the visible and infrared pedestrian images into a
shared Euclidean space, in which the intra-class similarity and
inter-class similarity are maximized and minimized, respec-
tively. For example, DGD-MSR [24] proposed a modality-
specific network to extract modality-specific representations