CoopHash: Cooperative Learning of Multipurpose
Descriptor & Contrastive Pair Generator via
Variational MCMC Teaching for Image Hashing
Khoa D. Doan1, Jianwen Xie2, Yaxuan Zhu3, Yang Zhao4, Ping Li5
1College of Engineering and Computer Science, VinUniversity 2Akool Research
3Department of Statistics and Data Science, UCLA 4Google Research 5VecML
Abstract
Leveraging supervised information can lead to superior retrieval performance in
the image hashing domain but the performance degrades significantly without
enough labeled data. One effective solution to boost performance is to employ
generative models, such as Generative Adversarial Networks (GANs), to generate
synthetic data in an image hashing model. However, GAN-based methods are
difficult to train, which prevents the hashing approaches from jointly training
the generative models and the hash functions. This limitation results in suboptimal retrieval
performance. To overcome this limitation, we propose a novel framework, the generative
cooperative hashing network, which is based on energy-based cooperative learning. This
framework jointly learns a powerful generative representation of the data and a robust hash
function via two components: a top-down contrastive pair generator that synthesizes
contrastive images and a bottom-up
multipurpose descriptor that simultaneously represents the images from multiple
perspectives, including probability density, hash code, latent code, and category.
The two components are jointly learned via a novel likelihood-based cooperative
learning scheme. We conduct experiments on several real-world datasets and
show that the proposed method outperforms competing supervised hashing methods, achieving
up to 10% relative improvement over the current state of the art, and exhibits significantly
better performance in out-of-distribution retrieval.
1 Introduction
The rapid growth of digital data, especially images, brings many challenges to the problem of similarity
search. A complete linear scan of all the images in such massive databases is computationally
expensive, especially when the database contains millions (or billions) of items. To guarantee both
computational efficiency and retrieval accuracy, approximate nearest-neighbor (ANN) methods have
become increasingly important. The research on efficient algorithms for ANN search dates back
to [Friedman et al., 1975, 1977]. Hashing is an ANN method that has several advantages compared
to other ANN methods [Broder et al., 1997, Indyk and Motwani, 1998, Charikar, 2002, Datar et al.,
2004, Li and Church, 2005, Lv et al., 2007, Paulevé et al., 2010, Shrivastava and Li, 2012, Li, 2019,
Li and Zhao, 2022]. In hashing, high-dimensional data points are projected onto a much smaller
locality-preserving binary space. Searching for similar images can then be performed efficiently in this
binary space using the Hamming distance, which is cheap to compute.
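As a brief aside on why this is fast (our illustration, not part of any hashing method discussed here): binary codes packed into bytes can be compared with a single XOR followed by a bit count.

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Hamming distance between packed binary codes (uint8 arrays).

    XOR marks the bits where the codes differ; counting the set bits
    of the result gives the Hamming distance per code.
    """
    diff = np.bitwise_xor(a, b)
    return np.unpackbits(diff, axis=-1).sum(axis=-1)

# Two 16-bit codes differing in exactly 3 positions.
x = np.packbits([[1,0,1,1,0,0,1,0, 1,1,0,0,1,0,1,1]], axis=-1)
y = np.packbits([[1,1,1,1,0,0,1,0, 1,0,0,0,1,0,1,0]], axis=-1)
print(hamming_distance(x, y))  # -> [3]
```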
This paper focuses on the learning-to-hash methods that “learn” data-dependent hash functions
for efficient image retrieval. Prior studies on learning-to-hash include [Kulis and Darrell, 2009,
Grauman and Fergus, 2013, Gong et al., 2013, Wang et al., 2016, Gui and Li, 2018, Dong et al., 2020].
Recently, deep hashing methods learn similarity-preserving representations while simultaneously controlling the
quantization error of binarizing the continuous representations into binary codes [Lai et al., 2015, Zhu
et al., 2016, Cao et al., 2017, Yuan et al., 2020, Jin et al., 2020, Hoe et al., 2021, Doan et al., 2022].
These methods achieve state-of-the-art performance on several datasets. However, their performance
is limited by the amount of supervised information available in the training dataset. Sufficiently
annotating massive datasets (common in retrieval applications) is an expensive and tedious task.
Subject to such scarcity of supervised similarity information, deep hashing methods run into problems
such as overfitting and train/test distribution mismatch, resulting in a significant loss in retrieval
performance.
To improve generalization, data-synthesis techniques have been proposed for hashing [Qiu et al., 2017, Cao
et al., 2018a]. These methods rely on increasingly powerful generative models, such as Generative
Adversarial Network (GAN), and achieve a substantial retrieval improvement. However, training
GANs is difficult because of problems such as mode collapse. In existing GAN-based methods, the
generative model is first trained, then the trained generator model is used to synthesize additional
data to fine-tune the hash function while the discriminator is usually discarded. This separation is ad
hoc and does not utilize the full power of generative models, such as representation learning.
On the other hand, energy-based models (EBMs), parameterized by modern neural networks for
generative learning, have recently received significant attention in computer vision. For example,
EBMs have been successfully applied to generating images [Xie et al., 2016], videos [Xie et al.,
2017], and 3D volumetric shapes [Xie et al., 2018c]. Maximum likelihood estimation of EBMs
typically relies on Markov chain Monte Carlo (MCMC) [Liu, 2008, Barbu and Zhu, 2020] sampling,
such as Langevin dynamics [Neal, 2011], to compute the intractable gradient for updating the model
parameters. MCMC is known to be computationally expensive; therefore, [Xie et al., 2018a,b, 2022b]
propose the Generative Cooperative Networks (CoopNets) to learn an EBM, called the descriptor,
together with a GAN-style generator serving as an amortized sampler for efficient learning of the
EBM. CoopNets jointly trains the descriptor as a teacher network and the generator as a student
network via the MCMC teaching algorithm [Xie et al., 2018a]. The generator plays the role of a fast
sampler to initialize the MCMC of the descriptor for fast convergence, while the descriptor teaches
the generator how to mimic the MCMC transition such that the generator can be a good approximate
sampler for the descriptor.
Compared to GANs, CoopNets’ framework has several appealing advantages: First, cooperative
learning of two probabilistic models is based on maximum likelihood estimation, which generally
does not suffer from GAN's mode-collapse issue. Second, while a GAN's bottom-up discriminator
becomes invalid after training, since it can no longer tell real and fake examples apart, and only the
generator remains useful for synthesis, in the CoopNets framework both the bottom-up descriptor
and the top-down generator remain valid models for representation and generation. Our paper belongs to the
category of energy-based cooperative learning: we bring the representational and generative power
of CoopNets to hashing, further advancing the state-of-the-art performance in this domain.
Specifically, building on the foundation of cooperative learning, we jointly train a novel multipurpose
descriptor for representation and a contrastive pair generator for generation via MCMC teaching.
In the context of cooperative learning with a generator, the proposed multipurpose descriptor
simultaneously learns to represent the images from multiple perspectives, including the probability
density of the image (with an energy head and a negative log-likelihood loss), the latent code of the
image (with an inference head and a VAE loss), the category of the image (with a discriminative
head and a classification loss), and especially the hash code of the image (with a hash head and a
triplet loss). Our hash function is part of the novel, multipurpose architecture of the descriptor. The
modules of this cooperative network play different but essential roles in learning higher-quality hash
codes with desirable properties: (i) The contrastive pair generator learns to synthesize contrastive
samples to improve the generalization of the hash function; (ii) The energy head learns effective image
representations and improves the robustness of the learned hash function against out-of-distribution
(OOD) data; and (iii) The inference head improves the training efficiency of the model and helps
recover the corrupted data during retrieval, thus improving the retrieval robustness. The proposed
CoopHash framework is illustrated in Figure 1.
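To make the architecture concrete, the following sketch shows one plausible PyTorch layout of the multipurpose descriptor described above and in Figure 1: a shared bottom-up base network with four lightweight heads. All layer sizes and names (e.g., `base`, `energy_head`) are our illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MultipurposeDescriptor(nn.Module):
    """Shared bottom-up base f0 with four lightweight heads (illustrative sizes)."""

    def __init__(self, num_classes: int, hash_bits: int, latent_dim: int, feat_dim: int = 256):
        super().__init__()
        # Shared base network f0(x; theta_0): image -> feature vector.
        self.base = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim), nn.SiLU(),
        )
        # Energy head: one scalar per (image, label) pair; the label enters
        # through a learned embedding concatenated with the image feature.
        self.label_emb = nn.Embedding(num_classes, feat_dim)
        self.energy_head = nn.Linear(2 * feat_dim, 1)
        # Inference head: mean and log-variance of q(z | x) for the VAE loss.
        self.inference_head = nn.Linear(feat_dim, 2 * latent_dim)
        # Discriminative head: logits for p(c | x).
        self.class_head = nn.Linear(feat_dim, num_classes)
        # Hash head: relaxed real-valued codes; sign() gives binary codes at retrieval.
        self.hash_head = nn.Linear(feat_dim, hash_bits)

    def forward(self, x: torch.Tensor, c: torch.Tensor):
        h = self.base(x)
        energy = self.energy_head(torch.cat([h, self.label_emb(c)], dim=-1)).squeeze(-1)
        z_mu, z_logvar = self.inference_head(h).chunk(2, dim=-1)
        logits = self.class_head(h)
        codes = torch.tanh(self.hash_head(h))  # in (-1, 1); binarized by sign()
        return energy, (z_mu, z_logvar), logits, codes
```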
In a controlled environment, the learned hash function achieves state-of-the-art results. When there
is a small conceptual drift in the underlying data distribution, a realistic scenario in today’s digital
Figure 1: CoopHash consists of two main components: 1) a contrastive pair generator, which takes as input the concatenation of a random noise vector $z$ and a label $c^+$ and synthesizes a contrastive image pair $\{\hat{x}^+, \hat{x}^-\}$ from the same class $c^+$ and a different class $c^-$; and 2) a multipurpose descriptor, which describes the images in multiple ways, including an explicit density model $p(x|c)$, a variational inference model $p(z|x)$, a discriminative model $p(c|x)$, and a hashing model. All four models share a base bottom-up representational network. The multipurpose descriptor network is trained by a loss comprising the negative log-likelihood, a variational loss, a triplet-ranking loss, and a classification loss, while the contrastive pair generator learns from the descriptor and serves as a fast initializer of the descriptor's MCMC. In the retrieval phase, only the hashing computational path is used; the binary hash codes are the signs of the real-valued outputs.
world, the learned hash function can still retrieve the most relevant results for a query, while the
retrieval performance of other existing hashing methods degrades significantly. Finally, our approach
can handle corrupted data in both training and testing, making our method well-suited for real-world
applications. The contributions of our paper are summarized below:
• We are the first to study the problem of supervised image hashing in the context of generative
cooperative learning, where we specially design a pair of descriptor (energy-based model) and
generator (latent variable model), and jointly train them via likelihood-based cooperative training.
• We extend the original cooperative learning framework into a multi-task version by proposing a
novel multi-headed or multipurpose descriptor, which integrates cooperative learning (i.e., the MCMC
teaching process), hash coding (i.e., a triplet-ranking loss), classification, and variational inference
into a single framework to improve both the generalization capacity of the learned hashing model
and the cooperative learning efficiency.
• We train our model (i.e., the descriptor and the generator) with an improved cooperative learning
algorithm, where the MCMC-based inference step of the generator in the original cooperative
learning framework is replaced by variational inference for computational efficiency. The
resulting training strategy is a novel variational MCMC teaching algorithm.
• We provide theoretical understanding, including convergence analysis and mode collapse analysis,
for our model (please see the appendix).
• We conduct extensive experiments, including conventional retrieval evaluation and out-of-
distribution retrieval evaluation, on several benchmark datasets to demonstrate the advantages of
the proposed framework over several state-of-the-art hashing techniques.
2 Related Work
2.1 Image hashing
In hashing, shallow methods learn linear hash functions and rely on carefully constructed features
extracted by hand-crafted feature-extraction techniques or representation-learning
algorithms. Conversely, deep hashing methods combine the feature representation learning phase
and the hashing phase into an end-to-end model and have demonstrated significant performance
improvements over the hand-crafted feature-based approaches [Xia et al., 2014, Cao et al., 2017,
Doan et al., 2022].
Generative Supervised Hashing. Hashing methods can also be divided into unsupervised [Gong
et al., 2013, Weiss et al., 2008, Yang et al., 2019, Dizaji et al., 2018, Lin et al., 2016, Li and Zhao,
2022] and supervised hashing [Shen et al., 2015, Yang et al., 2018, Ge et al., 2014, Gui and Li, 2018,
Deng et al., 2020, Yuan et al., 2020, Jin et al., 2020, Li et al., 2023, Lei et al., 2023, Xu et al., 2021,
Wei et al., 2023, Zhang et al., 2023, Wang et al., 2023]. Supervised methods demonstrate superior
performance over unsupervised ones, but they can easily overfit when there are limited labeled data.
To overcome such limitations, data synthesis techniques have been successfully used in image hashing
to improve the retrieval performance [Qiu et al., 2017, Gao et al., 2018]. These methods employ
generative models, such as the popular GAN, to synthesize contrastive images. However, GANs are
difficult to train, and their generative models serve no purpose in learning the hash function
beyond synthesizing data.
2.2 Generative cooperative network
Xie et al. [2018b, 2022b] propose a powerful generative model, called generative cooperative network
(CoopNets), which is able to generate realistic image and video patterns. The CoopNets framework
jointly trains an EBM (i.e., descriptor network) and a latent variable model (i.e., generator network)
via a cooperative learning scheme, where the descriptor is trained by MCMC-based maximum
likelihood estimation [Xie et al., 2016], while the generator learns from the descriptor and serves as a
fast initializer of the MCMC of the descriptor. Other variants include CoopVAEBM [Xie et al., 2021c]
and CoopFlow [Xie et al., 2022a]. Xie et al. [2021a] study the conditional version of CoopNets for
supervised image-to-image translation. As to applications, Zhang et al. [2022] apply the conditional
framework to generative salient object detection. Xie et al. [2021b] propose to jointly train two
CoopNets models with cycle consistency for unsupervised image-to-image translation. Zhu et al.
[2024] turn the cooperative network into a diffusion generative model. Most of the above works
focus on leveraging CoopNets for data synthesis. Our paper studies generative hashing based on
the cooperative learning scheme.
3 Cooperative Hashing Network
The CoopHash framework (described in Figure 1) consists of a contrastive pair generator network
and a multipurpose descriptor network. They are jointly trained by an MCMC-based cooperative
learning algorithm [Xie et al., 2018b].
3.1 Problem Statement
Given a dataset $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$ of $n$ images, hashing aims to learn a discrete-output, nonlinear mapping function $H: x \rightarrow \{-1, 1\}^K$, which encodes each image $x$ into a $K$-bit binary vector such that the similarity structure between the images is preserved in the discrete space. In supervised hashing, each example $x_i \in \mathcal{X}$ is associated with a label $c_i$. Note that this is a point-wise label of an image. Another common supervised scenario has a pairwise similarity label for each pair of images. However, for most image applications, pair-wise labeling is significantly more labor-intensive because a dataset of $n$ images requires $n^2$ pairwise labels.
3.2 Contrastive Pair Generator Network
Let $g(c, z; \Lambda)$ be a nonlinear mapping function parameterized by a top-down decoder network, where $\Lambda$ contains all the learning parameters in the network. The conditional generator in the form of a latent variable model is given by
$$z \sim \mathcal{N}(0, I_d), \quad x = g(c, z; \Lambda) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I_D),$$
which defines an implicit conditional distribution of an image $x$ given a label $c$, i.e., $p(x|c; \Lambda) = \int p(z)\, p(x|c, z; \Lambda)\, dz$, where $p(x|c, z; \Lambda) = \mathcal{N}(g(c, z; \Lambda), \sigma^2 I_D)$. We further revise the generator for contrastive pair generation given a query example $(c_i, x_i)$ sampled from the empirical data distribution $p_{\text{data}}(c, x)$. The contrastive pair generator produces a pair of synthetic examples, consisting of one similar example generated with the same label as the query example, $c^+ = c_i$, and one dissimilar example generated with a different label, $c^- \neq c_i$. Both share the same latent code $z$ for semantic feature preservation. To be specific,
$$z \sim \mathcal{N}(0, I_d); \quad c^+, c^- \sim p_{\text{data}}(c); \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I_D); \quad x^+ = g(c^+, z; \Lambda) + \epsilon; \quad x^- = g(c^-, z; \Lambda) + \epsilon. \tag{1}$$
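A minimal sketch of the sampling procedure in Eq. (1), assuming a generic conditional decoder `g(c, z)`; the function name, shapes, and noise scale are illustrative assumptions:

```python
import torch

def sample_contrastive_pair(g, num_classes: int, latent_dim: int, sigma: float = 0.3):
    """Draw one contrastive pair (x_plus, x_minus) following Eq. (1).

    Both samples share the latent code z, so only the class label
    (the semantic content) differs between them; requires num_classes >= 2.
    """
    z = torch.randn(1, latent_dim)                 # z ~ N(0, I_d)
    c_plus = torch.randint(0, num_classes, (1,))   # label of the query class
    c_minus = torch.randint(0, num_classes, (1,))  # resample until c_minus != c_plus
    while c_minus.item() == c_plus.item():
        c_minus = torch.randint(0, num_classes, (1,))
    out_plus = g(c_plus, z)
    eps = sigma * torch.randn_like(out_plus)       # eps ~ N(0, sigma^2 I_D)
    x_plus = out_plus + eps                        # x+ = g(c+, z) + eps
    x_minus = g(c_minus, z) + eps                  # x- = g(c-, z) + eps
    return x_plus, x_minus
```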
The generator plays two key roles in the framework: the first one is to provide contrastive pairs
for contrastive learning of the hash function, which is the main goal of the framework, while the
second one is to serve as an approximate sampler for efficient MCMC sampling and training of the
energy-based descriptor, which is the core step of the cooperative learning. Although the first role
serves the end goal, i.e., training the hash function, the second role is the foundation of the learning:
without it, the generator cannot be trained successfully and will fail to generate useful contrastive
image pairs.
In the cooperative training scheme, the generator as an ancestral sampler learns to approximate the MCMC sampling of the EBM. Thus, the learning objective is to minimize the negative log-likelihood of the samples $\{(\tilde{x}_i, c_i)\}_{i=1}^{\tilde{n}}$ drawn from the EBM, i.e., $\mathcal{L}_G(\Lambda) = -\frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} \log p(\tilde{x}_i|c_i; \Lambda)$, which amounts to minimizing the following objective
$$\mathcal{L}_G(\Lambda) = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} \|\tilde{x}_i - g(c_i, \tilde{z}_i; \Lambda)\|^2, \tag{2}$$
where $\tilde{z}_i \sim p(z|\tilde{x}_i; \Lambda)$ is the corresponding latent vector inferred from $\tilde{x}_i$. In the original cooperative learning algorithm [Xie et al., 2018b], the inference process is typically achieved by MCMC sampling from the intractable posterior $p(z|\tilde{x}_i; \Lambda)$. In Section 3.3, we propose to learn an encoder as a fast inference model with the reparameterization trick [Kingma and Welling, 2014] to speed up the computation of Eq. (2). Both the EBM and the inference model are represented by the multipurpose descriptor network.
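To make the variational inference step concrete, one generator update under Eq. (2) might look like the sketch below, where `descriptor.infer` stands for the encoder head of Section 3.3; the names and the single reparameterized draw are our assumptions:

```python
import torch

def generator_step(generator, descriptor, x_tilde, c, optimizer):
    """One variational MCMC-teaching update of the generator (Eq. (2)).

    x_tilde are samples revised by the descriptor's Langevin dynamics;
    the encoder head replaces MCMC inference of the latent z with a
    reparameterized draw from q(z | x_tilde).
    """
    z_mu, z_logvar = descriptor.infer(x_tilde)
    z = z_mu + torch.exp(0.5 * z_logvar) * torch.randn_like(z_mu)
    recon = generator(c, z)
    # Reconstruction objective of Eq. (2): || x_tilde - g(c, z) ||^2.
    loss = ((x_tilde.detach() - recon) ** 2).flatten(1).sum(-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```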
3.3 Multipurpose Descriptor Network
The multipurpose descriptor aims at representing the images from different perspectives. We propose
to parameterize the descriptor by a multi-headed bottom-up neural network, where each branch
accounts for one different representation of the image. The proposed descriptor assembles four
types of representational models of data in a single network in the sense that all models share a
base network but have separate lightweight heads built on top of the base network for different
representational purposes. Let $f_0(x; \theta_0)$ be the shared base network with parameters $\theta_0$.
Energy head. The energy head $h_E$, along with the base network $f_0$, specifies an energy function $f_E(x, c; \Theta_E)$, where observed image-label pairs are assigned lower energy values than unobserved ones. For notational simplicity, let $\Theta_E = (\theta_0, \theta_E)$, and then the energy function is $f_E(x, c; \Theta_E) = h_E(c, f_0(x; \theta_0); \theta_E)$. With the energy function $f_E$, the descriptor explicitly defines a probability distribution of $x$ given its label $c$ in the form of an energy-based model
$$p(x|c; \Theta_E) = \frac{p(x, c; \Theta_E)}{\int p(x, c; \Theta_E)\,dx} = \frac{\exp[-f_E(x, c; \Theta_E)]}{Z(c; \Theta_E)}, \tag{3}$$
where $Z(c; \Theta_E) = \int \exp[-f_E(x, c; \Theta_E)]\,dx$ is the intractable normalizing constant. Eq. (3) is also called the generative modeling of the neural network $f_E$ [Xie et al., 2016]. The training of $\Theta_E$ in this context can be achieved by maximum likelihood estimation, which leads to the “analysis by synthesis” algorithm [Grenander et al., 2007]. Given a set of training images with labels $\{(c_i, x_i)\}_{i=1}^{n}$, we train $\Theta_E$ by minimizing the negative log-likelihood (NLL):
$$\mathcal{L}_{\text{NLL}}(\Theta_E) = -\frac{1}{n} \sum_{i=1}^{n} \log p(x_i|c_i; \Theta_E), \tag{4}$$
whose gradient is given by
$$\frac{1}{n} \sum_{i=1}^{n} \left\{ \frac{\partial f_E(x_i, c_i; \Theta_E)}{\partial \Theta_E} - \mathbb{E}_{p(x|c_i; \Theta_E)} \left[ \frac{\partial f_E(x, c_i; \Theta_E)}{\partial \Theta_E} \right] \right\},$$
where $\mathbb{E}_{p(x|c_i; \Theta_E)}$ denotes the intractable expectation with respect to $p(x|c_i; \Theta_E)$, which can be approximated by the average of MCMC samples. We can rely on a cooperative MCMC sampling strategy that draws samples by (i) first generating initial examples $\hat{x}$ with the generator, and then (ii) revising $\hat{x}$ by finite steps of Langevin updates [Zhu and Mumford, 1998] to obtain the final $\tilde{x}$, that is,
$$\hat{x} = g(c, \hat{z}; \Lambda), \quad \hat{z} \sim \mathcal{N}(0, I_d), \tag{5}$$
$$\tilde{x}^{t+1} = \tilde{x}^t - \frac{\delta^2}{2} \frac{\partial f_E(\tilde{x}^t, c; \Theta_E)}{\partial \tilde{x}} + \delta \mathcal{N}(0, I_D), \tag{6}$$
where $\tilde{x}^0 = \hat{x}$, $t$ indexes the Langevin time steps, and $\delta$ is the step size. The Langevin dynamics in Eq. (6) is a gradient-based MCMC, which is equivalent to a stochastic gradient descent algorithm that
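A compact sketch of this cooperative sampling strategy (Eqs. (5) and (6)), assuming `energy_fn(x, c)` returns the scalar energy $f_E$ per example from the descriptor's energy head; the step count and step size are illustrative:

```python
import torch

def cooperative_sample(generator, energy_fn, c, latent_dim, steps=30, delta=0.02):
    """Generator-initialized Langevin sampling, following Eqs. (5)-(6).

    The generator gives a fast initialization x_hat = g(c, z_hat); the
    Langevin updates then move the samples toward low-energy regions
    of the descriptor's conditional EBM.
    """
    z_hat = torch.randn(c.shape[0], latent_dim)
    x = generator(c, z_hat).detach()              # Eq. (5): fast initializer
    for _ in range(steps):                        # Eq. (6): Langevin revision
        x.requires_grad_(True)
        grad = torch.autograd.grad(energy_fn(x, c).sum(), x)[0]
        x = (x - 0.5 * delta ** 2 * grad + delta * torch.randn_like(x)).detach()
    return x
```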