Grauman and Fergus, 2013, Gong et al., 2013, Wang et al., 2016, Gui and Li, 2018, Dong et al., 2020].
Recently, deep hashing methods learn similarity-preserving representations while simultaneously
controlling the quantization error of binarizing the continuous representations into binary codes [Lai
et al., 2015, Zhu et al., 2016, Cao et al., 2017, Yuan et al., 2020, Jin et al., 2020, Hoe et al., 2021,
Doan et al., 2022].
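To make the quantization-control idea concrete, the sketch below shows one common form of such a penalty in PyTorch; the function name and the squared-error form are illustrative assumptions rather than the exact loss of any single cited method.

```python
import torch

def quantization_loss(u):
    """Penalize continuous codes u (shape: batch x bits) that stray from
    binary values; an illustrative choice, not a specific cited paper's
    loss. ||u - sign(u)||^2 vanishes exactly when every entry of u
    lies in {-1, +1}.
    """
    return ((u - torch.sign(u)) ** 2).mean()
```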
These methods achieve state-of-the-art performance on several datasets. However, their performance
is limited by the amount of supervised information available in the training dataset. Sufficiently
annotating massive datasets (common in retrieval applications) is an expensive and tedious task.
Given such scarce supervised similarity information, deep hashing methods run into problems
such as overfitting and train/test distribution mismatch, resulting in a significant loss in retrieval
performance.
To improve generalization, data-synthesis techniques have been proposed for hashing [Qiu et al., 2017, Cao
et al., 2018a]. These methods rely on increasingly powerful generative models, such as the Generative
Adversarial Network (GAN), and achieve substantial retrieval improvements. However, training
GANs is difficult because of problems such as mode collapse. In existing GAN-based methods, the
generative model is first trained, and the trained generator is then used to synthesize additional
data to fine-tune the hash function, while the discriminator is usually discarded. This separation is ad
hoc and does not utilize the full power of generative models, such as their capacity for representation learning.
On the other hand, energy-based models (EBMs), parameterized by modern neural networks for
generative learning, have recently received significant attention in computer vision. For example,
EBMs have been successfully applied to the generation of images [Xie et al., 2016], videos [Xie et al.,
2017], and 3D volumetric shapes [Xie et al., 2018c]. Maximum likelihood estimation of EBMs
typically relies on Markov chain Monte Carlo (MCMC) [Liu, 2008, Barbu and Zhu, 2020] sampling,
such as Langevin dynamics [Neal, 2011], to compute the intractable gradient for updating the model
parameters. MCMC is known to be computationally expensive; therefore, [Xie et al., 2018a,b, 2022b]
propose Generative Cooperative Networks (CoopNets) to learn an EBM, called the descriptor,
together with a GAN-style generator that serves as an amortized sampler for efficient learning of the
EBM. CoopNets jointly trains the descriptor as a teacher network and the generator as a student
network via the MCMC teaching algorithm [Xie et al., 2018a]. The generator plays the role of a fast
sampler that initializes the descriptor's MCMC for fast convergence, while the descriptor teaches
the generator to mimic the MCMC transitions so that the generator becomes a good approximate
sampler for the descriptor.
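To make this teacher-student loop concrete, the PyTorch sketch below implements one cooperative training step under the convention p(x) ∝ exp(-E(x)); the function names, step size, and number of Langevin steps are illustrative assumptions, not the exact settings of [Xie et al., 2018a].

```python
import torch

def cooperative_step(E, G, x_real, opt_E, opt_G, K=15, eps=0.01, z_dim=128):
    # 1) The generator proposes initial samples (fast amortized sampling).
    z = torch.randn(x_real.size(0), z_dim)
    x = G(z).detach().requires_grad_(True)

    # 2) The descriptor revises them with K Langevin steps:
    #    x <- x - (eps^2 / 2) * dE/dx + eps * noise
    for _ in range(K):
        grad = torch.autograd.grad(E(x).sum(), x)[0]
        x = (x - 0.5 * eps ** 2 * grad
             + eps * torch.randn_like(x)).detach().requires_grad_(True)
    x_revised = x.detach()

    # 3) Descriptor update (maximum likelihood gradient): lower the energy
    #    of real data, raise the energy of the revised synthetic data.
    loss_E = E(x_real).mean() - E(x_revised).mean()
    opt_E.zero_grad(); loss_E.backward(); opt_E.step()

    # 4) MCMC teaching: the generator learns to map the same latent codes
    #    to the revised samples, absorbing the MCMC transition.
    loss_G = ((G(z) - x_revised) ** 2).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```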
Compared to GANs, the CoopNets framework has several appealing advantages. First, the cooperative
learning of the two probabilistic models is based on maximum likelihood estimation, which generally
does not suffer from GANs' mode-collapse issue. Second, whereas a GAN's bottom-up discriminator
becomes useless after training, because it can no longer tell real examples from fake ones and only the
generator remains useful for synthesis, in the CoopNets framework both the bottom-up descriptor
and the top-down generator remain valid models for representation and generation. Our paper belongs to the
category of energy-based cooperative learning: we bring the representational and generative
powers of CoopNets to hashing, further advancing the state-of-the-art performance in this domain.
Specifically, building on the foundation of cooperative learning, we jointly train a novel multipurpose
descriptor for representation and a contrastive pair generator for generation via MCMC teaching.
In the context of cooperative learning with a generator, the proposed multipurpose descriptor
simultaneously learns to represent the images from multiple perspectives, including the probability
density of the image (with an energy head and a negative log-likelihood loss), the latent code of the
image (with an inference head and a VAE loss), the category of the image (with a discriminative
head and a classification loss), and especially the hash code of the image (with a hash head and a
triplet loss). Our hash function is part of the novel, multipurpose architecture of the descriptor. The
modules of this cooperative network play different but essential roles in learning higher-quality hash
codes with desirable properties: (i) The contrastive pair generator learns to synthesize contrastive
samples to improve the generalization of the hash function; (ii) The energy head learns effective image
representations and improves the robustness of the learned hash function against out-of-distribution
(OOD) data; and (iii) The inference head improves the training efficiency of the model and helps
recover the corrupted data during retrieval, thus improving the retrieval robustness. The proposed
CoopHash framework is illustrated in Figure 1.
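For concreteness, the sketch below outlines how such a multipurpose descriptor can be organized as a shared bottom-up backbone with four heads; the layer sizes, module names, and head architectures are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MultipurposeDescriptor(nn.Module):
    def __init__(self, feat_dim=512, latent_dim=128, n_classes=10, hash_bits=64):
        super().__init__()
        self.backbone = nn.Sequential(            # shared bottom-up features
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim), nn.ReLU(),
        )
        self.energy_head = nn.Linear(feat_dim, 1)              # scalar energy
        self.infer_head = nn.Linear(feat_dim, 2 * latent_dim)  # VAE mean/log-var
        self.cls_head = nn.Linear(feat_dim, n_classes)         # class logits
        self.hash_head = nn.Linear(feat_dim, hash_bits)        # continuous codes

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.infer_head(h).chunk(2, dim=1)
        return {
            "energy": self.energy_head(h).squeeze(1),
            "latent": (mu, logvar),
            "logits": self.cls_head(h),
            # tanh keeps codes near {-1, +1}; sign() binarizes at retrieval time
            "hash": torch.tanh(self.hash_head(h)),
        }
```

Each head is then trained with its corresponding loss on the shared features: negative log-likelihood (via MCMC) for the energy head, a VAE loss for the inference head, a classification loss for the discriminative head, and a triplet loss for the hash head.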
In a controlled environment, the learned hash function achieves state-of-the-art results. When there
is a small concept drift in the underlying data distribution, a realistic scenario in today's digital