Grauman and Fergus, 2013, Gong et al., 2013, Wang et al., 2016, Gui and Li, 2018, Dong et al., 2020].
Recently, deep hashing methods learn similarity-preserving representations while simultaneously
controlling the quantization error of binarizing the continuous representations into binary codes [Lai
et al., 2015, Zhu et al., 2016, Cao et al., 2017, Yuan et al., 2020, Jin et al., 2020, Hoe et al., 2021,
Doan et al., 2022].
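To make the quantization-control idea concrete, the sketch below shows one common form of such a penalty in PyTorch; the function name and the squared-error form are illustrative assumptions rather than the exact loss of any single cited method.

```python
import torch

def quantization_loss(u):
    """Penalize continuous codes u (shape: batch x bits) that stray from
    binary values; an illustrative choice, not a specific cited paper's
    loss. ||u - sign(u)||^2 vanishes exactly when every entry of u
    lies in {-1, +1}.
    """
    return ((u - torch.sign(u)) ** 2).mean()
```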
These methods achieve state-of-the-art performance on several datasets. However, their performance
is limited by the amount of supervised information available in the training dataset. Sufficiently
annotating massive datasets (common in retrieval applications) is an expensive and tedious task.
Given such scarce supervised similarity information, deep hashing methods run into problems
such as overfitting and train/test distribution mismatch, resulting in a significant loss in retrieval
performance.
To improve generalization, data-synthesis techniques have been proposed for hashing [Qiu et al., 2017, Cao
et al., 2018a]. These methods rely on increasingly powerful generative models, such as the Generative
Adversarial Network (GAN), and achieve substantial retrieval improvements. However, training
GANs is difficult because of problems such as mode collapse. In existing GAN-based methods, the
generative model is first trained, and the trained generator is then used to synthesize additional
data to fine-tune the hash function, while the discriminator is usually discarded. This separation is ad
hoc and does not utilize the full power of generative models, such as their capacity for representation learning.
On the other hand, energy-based models (EBMs), parameterized by modern neural networks for
generative learning, have recently received significant attention in computer vision. For example,
EBMs have been successfully applied to the generation of images [Xie et al., 2016], videos [Xie et al.,
2017], and 3D volumetric shapes [Xie et al., 2018c]. Maximum likelihood estimation of EBMs
typically relies on Markov chain Monte Carlo (MCMC) [Liu, 2008, Barbu and Zhu, 2020] sampling,
such as Langevin dynamics [Neal, 2011], to compute the intractable gradient for updating the model
parameters. MCMC is known to be computationally expensive; therefore, [Xie et al., 2018a,b, 2022b]
propose Generative Cooperative Networks (CoopNets) to learn an EBM, called the descriptor,
together with a GAN-style generator that serves as an amortized sampler for efficient learning of the
EBM. CoopNets jointly trains the descriptor as a teacher network and the generator as a student
network via the MCMC teaching algorithm [Xie et al., 2018a]. The generator plays the role of a fast
sampler that initializes the descriptor's MCMC for fast convergence, while the descriptor teaches
the generator to mimic the MCMC transitions so that the generator becomes a good approximate
sampler for the descriptor.
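To make this teacher-student loop concrete, the PyTorch sketch below implements one cooperative training step under the convention p(x) ∝ exp(-E(x)); the function names, step size, and number of Langevin steps are illustrative assumptions, not the exact settings of [Xie et al., 2018a].

```python
import torch

def cooperative_step(E, G, x_real, opt_E, opt_G, K=15, eps=0.01, z_dim=128):
    # 1) The generator proposes initial samples (fast amortized sampling).
    z = torch.randn(x_real.size(0), z_dim)
    x = G(z).detach().requires_grad_(True)

    # 2) The descriptor revises them with K Langevin steps:
    #    x <- x - (eps^2 / 2) * dE/dx + eps * noise
    for _ in range(K):
        grad = torch.autograd.grad(E(x).sum(), x)[0]
        x = (x - 0.5 * eps ** 2 * grad
             + eps * torch.randn_like(x)).detach().requires_grad_(True)
    x_revised = x.detach()

    # 3) Descriptor update (maximum likelihood gradient): lower the energy
    #    of real data, raise the energy of the revised synthetic data.
    loss_E = E(x_real).mean() - E(x_revised).mean()
    opt_E.zero_grad(); loss_E.backward(); opt_E.step()

    # 4) MCMC teaching: the generator learns to map the same latent codes
    #    to the revised samples, absorbing the MCMC transition.
    loss_G = ((G(z) - x_revised) ** 2).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```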
Compared to GANs, the CoopNets framework has several appealing advantages. First, the cooperative
learning of the two probabilistic models is based on maximum likelihood estimation, which generally
does not suffer from GANs' mode-collapse issue. Second, whereas a GAN's bottom-up discriminator
becomes useless after training, because it can no longer tell real examples from fake ones and only the
generator remains useful for synthesis, in the CoopNets framework both the bottom-up descriptor
and the top-down generator remain valid models for representation and generation. Our paper belongs to the
category of energy-based cooperative learning: we bring the representational and generative
powers of CoopNets to hashing, further advancing the state-of-the-art performance in this domain.
Specifically, building on the foundation of cooperative learning, we jointly train a novel multipurpose
descriptor for representation and a contrastive pair generator for generation via MCMC teaching.
In the context of cooperative learning with a generator, the proposed multipurpose descriptor
simultaneously learns to represent the images from multiple perspectives, including the probability
density of the image (with an energy head and a negative log-likelihood loss), the latent code of the
image (with an inference head and a VAE loss), the category of the image (with a discriminative
head and a classification loss), and especially the hash code of the image (with a hash head and a
triplet loss). Our hash function is part of the novel, multipurpose architecture of the descriptor. The
modules of this cooperative network play different but essential roles in learning higher-quality hash
codes with desirable properties: (i) The contrastive pair generator learns to synthesize contrastive
samples to improve the generalization of the hash function; (ii) The energy head learns effective image
representations and improves the robustness of the learned hash function against out-of-distribution
(OOD) data; and (iii) The inference head improves the training efficiency of the model and helps
recover the corrupted data during retrieval, thus improving the retrieval robustness. The proposed
CoopHash framework is illustrated in Figure 1.
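For concreteness, the sketch below outlines how such a multipurpose descriptor can be organized as a shared bottom-up backbone with four heads; the layer sizes, module names, and head architectures are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MultipurposeDescriptor(nn.Module):
    def __init__(self, feat_dim=512, latent_dim=128, n_classes=10, hash_bits=64):
        super().__init__()
        self.backbone = nn.Sequential(            # shared bottom-up features
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim), nn.ReLU(),
        )
        self.energy_head = nn.Linear(feat_dim, 1)              # scalar energy
        self.infer_head = nn.Linear(feat_dim, 2 * latent_dim)  # VAE mean/log-var
        self.cls_head = nn.Linear(feat_dim, n_classes)         # class logits
        self.hash_head = nn.Linear(feat_dim, hash_bits)        # continuous codes

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.infer_head(h).chunk(2, dim=1)
        return {
            "energy": self.energy_head(h).squeeze(1),
            "latent": (mu, logvar),
            "logits": self.cls_head(h),
            # tanh keeps codes near {-1, +1}; sign() binarizes at retrieval time
            "hash": torch.tanh(self.hash_head(h)),
        }
```

Each head is then trained with its corresponding loss on the shared features: negative log-likelihood (via MCMC) for the energy head, a VAE loss for the inference head, a classification loss for the discriminative head, and a triplet loss for the hash head.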
In a controlled environment, the learned hash function achieves state-of-the-art results. When there
is a small concept drift in the underlying data distribution, a realistic scenario in today's digital