
Efficient Document Retrieval by End-to-End Refining and Quantizing BERT Embedding with Contrastive Product Quantization
Zexuan Qiu1, Qinliang Su1,2∗, Jianxing Yu3, and Shijing Si4
1School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
2Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou, China
3School of Artificial Intelligence, Sun Yat-sen University, Guangzhou, China
4School of Economics and Finance, Shanghai International Studies University, China
{qiuzx3@mail2, suqliang@mail, yujx26@mail}.sysu.edu.cn
Abstract
Efficient document retrieval heavily relies on the technique of semantic hashing, which learns a binary code for every document and employs the Hamming distance to evaluate document distances. However, existing semantic hashing methods are mostly built on outdated TFIDF features, which lack much of the important semantic information in documents. Furthermore, the Hamming distance can only take one of a few integer values, significantly limiting its ability to represent document distances. To address these issues, we propose in this paper to leverage BERT embeddings for efficient retrieval based on the product quantization technique, which assigns to every document a real-valued codeword from a codebook, instead of a binary code as in semantic hashing. Specifically, we first transform the original BERT embeddings via a learnable mapping and feed the transformed embeddings into a probabilistic product quantization module that outputs the assigned codeword. The refining and quantizing modules can be optimized in an end-to-end manner by minimizing a probabilistic contrastive loss. A mutual information maximization based method is further proposed to improve the representativeness of the codewords, so that documents can be quantized more accurately. Extensive experiments conducted on three benchmarks demonstrate that our proposed method significantly outperforms current state-of-the-art baselines1.

∗ Corresponding author.
1 Our PyTorch code is available at https://github.com/qiuzx2/MICPQ, and our MindSpore code will also be released soon.
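To make the refine-then-quantize pipeline described above concrete, the following is a minimal PyTorch sketch. The module name, the dimensions (768-dimensional embeddings, 8 subspaces, 256 codewords per subspace), and the softmax-relaxed codeword assignment are our own assumptions for illustration; the actual implementation in the repository above may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticPQ(nn.Module):
    """Illustrative refine-and-quantize module (names and sizes are assumptions)."""

    def __init__(self, dim=768, num_subspaces=8, num_codewords=256):
        super().__init__()
        assert dim % num_subspaces == 0
        self.M, self.d = num_subspaces, dim // num_subspaces
        # Learnable mapping that refines the raw BERT embedding.
        self.refine = nn.Linear(dim, dim)
        # One codebook of K codewords per subspace: (M, K, d).
        self.codebooks = nn.Parameter(torch.randn(self.M, num_codewords, self.d))

    def forward(self, bert_emb, temperature=1.0):
        x = self.refine(bert_emb)                       # (B, dim)
        x = x.view(-1, self.M, self.d).transpose(0, 1)  # (M, B, d)
        # Squared distances from each sub-vector to every codeword: (M, B, K).
        dists = torch.cdist(x, self.codebooks) ** 2
        # Probabilistic (soft) codeword assignment within each subspace.
        probs = F.softmax(-dists / temperature, dim=-1)
        # Probability-weighted codewords, concatenated across subspaces.
        quantized = torch.einsum('mbk,mkd->bmd', probs, self.codebooks)
        return quantized.reshape(bert_emb.size(0), -1), probs
```

At retrieval time, only the per-subspace codeword indices (the argmax of the assignment probabilities) need to be stored for each document, so distances to a query can be computed from small precomputed query-to-codeword lookup tables.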
1 Introduction
In the era of big data, Approximate Nearest Neighbor (ANN) search has attracted tremendous attention thanks to its high search efficiency and strong performance in modern information retrieval systems. By quantizing each document into a compact binary code, semantic hashing (Salakhutdinov and Hinton, 2009) has become the main solution to ANN search due to the extremely low cost of calculating the Hamming distance between binary codes. One of the main approaches to unsupervised semantic hashing is built on generative models (Chaidaroon and Fang, 2017; Shen et al., 2018; Dong et al., 2019; Zheng et al., 2020), which encourage the binary codes to be able to reconstruct the input documents. Alternatively, other methods are driven by graphs (Weiss et al., 2008; Chaidaroon et al., 2020; Hansen et al., 2020; Ou et al., 2021a), hoping the binary codes can recover the neighborhood relationships among documents. Though these methods have achieved strong retrieval performance, two main problems remain.
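As a brief aside on why the Hamming distance is so cheap: a binary code fits in one or a few machine words, so the distance reduces to a bitwise XOR followed by a population count. A minimal Python illustration (ours, not from the paper):

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of bit positions where two binary codes differ."""
    return (code_a ^ code_b).bit_count()  # int.bit_count() needs Python >= 3.10

# Two 4-bit codes differing in three positions.
assert hamming_distance(0b1011, 0b0110) == 3
```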
Firstly, these methods are mostly built on top of outdated TFIDF features, which discard important information about documents, such as word order and context. In recent years, pre-trained language models like BERT have achieved tremendous success in various downstream tasks. A natural question, then, is whether efficient retrieval methods can be established on BERT embeddings. However, it has been widely reported that BERT embeddings are not suitable for semantic similarity-related tasks (Reimers and Gurevych, 2019), on which they perform even worse than traditional GloVe embeddings (Pennington et al., 2014). Ethayarajh (2019) and Li et al. (2020) attribute this to the "anisotropy" phenomenon: BERT embeddings occupy only a narrow cone in the vector space, making the semantic information hidden in them difficult to leverage directly. Thus, it is important to investigate how to effectively leverage BERT embeddings for efficient document retrieval.
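This anisotropy is easy to observe empirically. The following sketch is our own illustration, not an experiment from the paper; it assumes the HuggingFace transformers package and the bert-base-uncased checkpoint, and shows that [CLS] embeddings of unrelated sentences still exhibit high pairwise cosine similarity:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

# Three sentences on unrelated topics.
sentences = ["The cat sat on the mat.",
             "Stock markets fell sharply today.",
             "Quantum computers manipulate qubits."]
batch = tok(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    emb = bert(**batch).last_hidden_state[:, 0]  # [CLS] embeddings, (3, 768)

emb = F.normalize(emb, dim=-1)
sim = emb @ emb.T  # pairwise cosine similarities
mask = ~torch.eye(len(sentences), dtype=torch.bool)
print(sim[mask].mean())  # typically well above 0 despite unrelated topics
```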
Secondly, to guarantee the efficiency of retrieval, most existing methods quantize every document into a binary code via semantic hashing. There is