Efficient Document Retrieval by End-to-End Refining and Quantizing BERT Embedding with Contrastive Product Quantization

Zexuan Qiu¹, Qinliang Su¹,²∗, Jianxing Yu³, and Shijing Si⁴
¹School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
²Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou, China
³School of Artificial Intelligence, Sun Yat-sen University, Guangzhou, China
⁴School of Economics and Finance, Shanghai International Studies University, China
{qiuzx3@mail2, suqliang@mail, yujx26@mail}.sysu.edu.cn
Abstract
Efficient document retrieval heavily relies on the technique of semantic hashing, which learns a binary code for every document and employs the Hamming distance to evaluate document distances. However, existing semantic hashing methods are mostly built on outdated TFIDF features, which lack much of the important semantic information of documents. Furthermore, the Hamming distance can only take one of a few integer values, significantly limiting its ability to represent document distances. To address these issues, in this paper we propose to leverage BERT embeddings to perform efficient retrieval based on the product quantization technique, which assigns to every document a real-valued codeword from a codebook, instead of a binary code as in semantic hashing. Specifically, we first transform the original BERT embeddings via a learnable mapping and feed the transformed embeddings into a probabilistic product quantization module to output the assigned codeword. The refining and quantizing modules can be optimized in an end-to-end manner by minimizing a probabilistic contrastive loss. A mutual information maximization based method is further proposed to improve the representativeness of codewords, so that documents can be quantized more accurately. Extensive experiments conducted on three benchmarks demonstrate that our proposed method significantly outperforms current state-of-the-art baselines.¹

∗Corresponding author.
¹Our PyTorch code is available at https://github.com/qiuzx2/MICPQ, and our MindSpore code will also be released soon.
1 Introduction
In the era of big data, Approximate Nearest Neighbor (ANN) search has attracted tremendous attention thanks to its high search efficiency and strong performance in modern information retrieval systems. By quantizing each document into a compact binary code, semantic hashing (Salakhutdinov and Hinton, 2009) has become the main solution to ANN search due to the extremely low cost of calculating the Hamming distance between binary codes. One of the main approaches to unsupervised semantic hashing is built on generative models (Chaidaroon and Fang, 2017; Shen et al., 2018; Dong et al., 2019; Zheng et al., 2020), which encourage the binary codes to reconstruct the input documents. Alternatively, some other methods are driven by graphs (Weiss et al., 2008; Chaidaroon et al., 2020; Hansen et al., 2020; Ou et al., 2021a), hoping that the binary codes can recover the neighborhood relationships. Although these methods have achieved strong retrieval performance, two main problems remain.
Firstly, these methods are mostly built on top of outdated TFIDF features, which discard important information about documents, such as word order and contextual information. In recent years, pre-trained language models like BERT have achieved tremendous success in various downstream tasks. A natural question to ask is thus whether we can build efficient retrieval methods on BERT embeddings. However, it has been widely reported that BERT embeddings are not well suited to semantic similarity-related tasks (Reimers and Gurevych, 2019), performing even worse than traditional GloVe embeddings (Pennington et al., 2014). Ethayarajh (2019) and Li et al. (2020) attribute this to the "anisotropy" phenomenon: BERT embeddings occupy only a narrow cone in the vector space, making the semantic information hidden in them difficult to leverage directly. It is therefore important to investigate how to effectively leverage BERT embeddings for efficient document retrieval.
Secondly, to guarantee retrieval efficiency, most existing methods quantize every document into a binary code via semantic hashing. There is no doubt that the Hamming distance improves retrieval efficiency significantly, but it also severely restricts the representation of document similarities, since it can only take an integer value from 0 to $B$ for $B$-bit codes. Recently, an alternative approach named product quantization (Jégou et al., 2011; Jang and Cho, 2021; Wang et al., 2021) has been proposed in the computer vision community to alleviate this problem. Basically, it seeks to quantize every item to one of the codewords in a codebook, which is represented as a Cartesian product of multiple small codebooks. It has been shown that product quantization can deliver superior performance to semantic hashing while keeping the cost of computation and storage essentially unchanged. However, this technique has rarely been explored in unsupervised document retrieval.
Motivated by the two problems above, in this paper we propose an end-to-end contrastive product quantization model that jointly refines the original BERT embeddings and quantizes the refined embeddings into codewords. Specifically, we first transform the original BERT embeddings via a learnable mapping and feed the transformed embeddings into a probabilistic product quantization module to output a quantized representation (codeword). To preserve as much semantic information as possible in the quantized representations, inspired by recent successes of contrastive learning, a probabilistic contrastive loss is designed and trained in an end-to-end manner, simultaneously optimizing the refining and quantizing modules. Then, to further improve retrieval performance, inspired by recent developments in clustering, a mutual information maximization based method is developed to increase the representativeness of the learned codewords. In this way, the cluster structure hidden in the document collection can be preserved, allowing documents to be quantized more accurately. Extensive experiments are conducted on three real-world datasets, and the results demonstrate that our proposed method significantly outperforms current state-of-the-art baselines. Empirical analyses also demonstrate the effectiveness of every component of the proposed model.
2 Preliminaries of Product Quantization for Information Retrieval
In the field of efficient information retrieval, a prevalent approach is semantic hashing, which maps every item $x$ to a binary code $b$ and then uses Hamming distances to reflect the semantic similarity between items. Thanks to the low cost of computing the Hamming distance, retrieval can be performed very efficiently. However, the Hamming distance can only take an integer value from 0 to $B$ for $B$-bit codes, which is too restrictive to reflect rich similarity information.
An alternative approach is vector quantization (VQ) (Gray and Neuhoff, 1998), which assigns every item a codeword from a codebook $\mathcal{C}$. A codeword in VQ can be any vector in $\mathbb{R}^D$, rather than being limited to the binary form used in semantic hashing. By storing the pre-computed distances between any two codewords in a table, the distance between items can be obtained efficiently by looking up the table. However, to ensure competitive performance, the number of codewords in the codebook needs to be very large. For example, there are $2^{64}$ different codes for a 64-bit binary code, and thus the number of codewords in VQ should also be of this scale, which is too large to be handled.
To tackle this issue, product quantization (Jégou et al., 2011) proposes to represent the codebook $\mathcal{C}$ as a Cartesian product of $M$ small codebooks

$$\mathcal{C} = \mathcal{C}^1 \times \mathcal{C}^2 \times \cdots \times \mathcal{C}^M, \quad (1)$$

where the $m$-th codebook $\mathcal{C}^m$ consists of $K$ codewords $\{c^m_k\}_{k=1}^{K}$ with $c^m_k \in \mathbb{R}^{D/M}$. For an item, product quantization chooses a codeword from every codebook $\mathcal{C}^m$, and the final codeword assigned to this item is

$$c = c^1 \oplus c^2 \oplus \cdots \oplus c^M, \quad (2)$$

where $c^m$ denotes the codeword chosen from $\mathcal{C}^m$, and $\oplus$ denotes concatenation. For each codeword $c$, we only need to record its indices in the $M$ codebooks, which requires only $M \log_2 K$ bits. Thanks to the Cartesian product decomposition, we now only need to store $MK$ codewords in $\mathbb{R}^{D/M}$ and $M$ lookup tables of size $K \times K$. As an example, to enable a total of $2^{64}$ codewords, we can set $M = 32$ and $K = 4$, which obviously reduces the memory footprint and the size of the lookup tables significantly. During retrieval, we only need to look up the $M$ tables and sum the results, which is only slightly more costly than computing the Hamming distance.
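To make this bookkeeping concrete, the sketch below is a self-contained NumPy toy (with assumed sizes $D=64$, $M=8$, $K=16$, and randomly initialized codebooks standing in for learned ones, not the paper's implementation). It encodes an item into $M$ codeword indices and computes an approximate distance between two encoded items purely by $M$ table lookups and a sum.

```python
import numpy as np

D, M, K = 64, 8, 16              # embedding dim, number of codebooks, codewords per codebook
rng = np.random.default_rng(0)

# M small codebooks, each holding K codewords of dimension D/M. In practice these are learned
# (e.g. by k-means); random values keep the sketch self-contained.
codebooks = rng.normal(size=(M, K, D // M))

def encode(x):
    """Quantize x (shape [D]) to M codeword indices, i.e. M * log2(K) bits in total."""
    segments = x.reshape(M, D // M)
    # for each segment, pick the nearest codeword of the corresponding codebook
    dists = ((segments[:, None, :] - codebooks) ** 2).sum(-1)   # [M, K]
    return dists.argmin(axis=1)                                  # [M] indices

# M pre-computed lookup tables of size K x K: squared distances between codewords of one codebook.
tables = ((codebooks[:, :, None, :] - codebooks[:, None, :, :]) ** 2).sum(-1)  # [M, K, K]

def table_distance(codes_a, codes_b):
    """Approximate squared distance between two encoded items via M table lookups and a sum."""
    return sum(tables[m, codes_a[m], codes_b[m]] for m in range(M))

a, b = rng.normal(size=D), rng.normal(size=D)
print(table_distance(encode(a), encode(b)))   # cheap: M lookups plus M-1 additions
```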
3 The End-to-End Joint Refining and Quantizing Framework
To retrieve semantically similar documents, a core problem is how to produce for every document a quantized representation that preserves the semantic-similarity information of the original documents. In this section, a simple two-stage method is first presented, and then we develop an end-to-end method that directly outputs representations with the desired properties.
3.1 A Simple Two-Stage Approach
To obtain semantics-preserving quantized representations, a naive approach is to first enhance the semantic information in the original BERT embeddings and then quantize them. Many methods have been proposed for refining BERT embeddings to enhance their semantic information. These methods can essentially be described as transforming the original embedding $z(x)$ into another embedding $\tilde{z}(x)$ via a mapping $g(\cdot)$ as

$$\tilde{z}(x) = g(z(x)), \quad (3)$$

where $g(\cdot)$ could be a flow-based mapping (Li et al., 2020), a neural network trained to maximize the agreement between the representations of a document's two views (Gao et al., 2021), etc. It has been reported that the refined embeddings $\tilde{z}(x)$ are semantically much richer than the original ones $z(x)$. Then, we can follow the standard product quantization procedure to quantize the refined embeddings $\tilde{z}(x)$ into discrete representations.
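The following is a minimal sketch of this two-stage baseline, assuming a generic refining map `g` and per-segment k-means codebooks; these specific choices are illustrative, not necessarily those used in the paper's experiments. The key point is that refining finishes entirely before quantization begins.

```python
import numpy as np
from sklearn.cluster import KMeans

def two_stage_pq(bert_embeddings, g, M=8, K=16):
    """Stage 1: refine embeddings with a pre-trained mapping g (e.g. a flow or a SimCSE-style encoder).
    Stage 2: quantize each of the M segments independently with a K-codeword k-means codebook."""
    refined = g(bert_embeddings)                       # [N, D]; refining is fixed before quantization
    N, D = refined.shape
    segments = refined.reshape(N, M, D // M)
    codebooks, codes = [], []
    for m in range(M):                                 # fit one small codebook per segment
        km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(segments[:, m, :])
        codebooks.append(km.cluster_centers_)          # [K, D/M]
        codes.append(km.labels_)                       # [N] codeword indices for segment m
    return np.stack(codebooks), np.stack(codes, axis=1)   # [M, K, D/M], [N, M]
```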
3.2 End-to-End Refining and Quantizing via Contrastive Product Quantization
Obviously, the separation between the refining and quantizing steps in the two-stage method could result in a significant loss of semantic information in the final quantized representation. To address this issue, an end-to-end refining and quantizing method is proposed. We first slice the original BERT embedding $z(x)$ into $M$ segments $z^m(x) \in \mathbb{R}^{D/M}$ for $m = 1, 2, \cdots, M$. Then, we refine $z^m(x)$ by transforming it into a semantically richer form $\tilde{z}^m(x)$ via a mapping $g^m_{\theta}(\cdot)$, that is,

$$\tilde{z}^m(x) = g^m_{\theta}(z^m(x)), \quad (4)$$

where the subscript $\theta$ denotes the learnable parameters. Different from the mapping $g(\cdot)$ in the two-stage method, which is determined at the refining stage and is unaware of the quantization, the mapping $g^m_{\theta}(\cdot)$ here will be learned by taking the influence of the quantization error into account. Now, instead of quantizing the refined embedding $\tilde{z}^m(x)$ to a fixed codeword, we propose to quantize it to one of the codewords $\{c^m_{k_m}\}_{k_m=1}^{K}$ by stochastically selecting $k_m$ according to the distribution

$$p(k_m|x) = \frac{\exp\left(-\|\tilde{z}^m(x) - c^m_{k_m}\|^2\right)}{\sum_{i=1}^{K} \exp\left(-\|\tilde{z}^m(x) - c^m_i\|^2\right)}, \quad (5)$$

with $k_m = 1, 2, \cdots, K$. Obviously, the probability that $\tilde{z}^m(x)$ is quantized to a codeword decreases with their distance. Thus, by denoting $k_m$ as a random sample drawn from $p(k_m|x)$, i.e., $k_m \sim p(k_m|x)$, we can represent the $m$-th quantized representation of document $x$ as

$$h^m(x) = C^m \cdot \mathrm{one\_hot}(k_m), \quad (6)$$

and the whole quantized representation of $x$ as

$$h(x) = h^1(x) \oplus h^2(x) \oplus \cdots \oplus h^M(x). \quad (7)$$

Note that the quantized representation $h(x)$ depends on the random variables $k_m \sim p(k_m|x)$ for $m = 1, 2, \cdots, M$; thus $h(x)$ itself is also random.
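A minimal PyTorch sketch of this refine-then-quantize pipeline is given below. The module name, the single-linear-layer refiners, the toy sizes, and the use of a Gumbel-softmax style relaxation for differentiable sampling are our assumptions for illustration, not details taken from the paper: each segment is refined by its own small network (Eq. (4)), a softmax over negative squared distances gives the assignment distribution of Eq. (5), and the sampled one-hot vector selects the codeword as in Eqs. (6)-(7).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticPQ(nn.Module):
    """Sketch of end-to-end refining + probabilistic product quantization (Eqs. (4)-(7))."""
    def __init__(self, D=768, M=8, K=16):
        super().__init__()
        self.M, self.K, self.d = M, K, D // M
        # one refining map g^m_theta per segment (Eq. (4)); a single linear layer here for brevity
        self.refiners = nn.ModuleList([nn.Linear(self.d, self.d) for _ in range(M)])
        # M codebooks C^m, each with K codewords of dimension D/M
        self.codebooks = nn.Parameter(torch.randn(M, K, self.d))

    def forward(self, z):                        # z: [batch, D] BERT embeddings
        segs = z.view(-1, self.M, self.d)        # slice into M segments
        h = []
        for m in range(self.M):
            zt = self.refiners[m](segs[:, m, :])                        # refined segment, Eq. (4)
            logits = -(torch.cdist(zt, self.codebooks[m]) ** 2)         # -||z~ - c||^2, logits of Eq. (5)
            # stochastic codeword selection; a hard Gumbel-softmax is one differentiable way to
            # sample k_m ~ p(k_m|x) -- an assumption of this sketch, not necessarily the paper's choice
            one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)      # [batch, K]
            h.append(one_hot @ self.codebooks[m])                       # Eq. (6)
        return torch.cat(h, dim=-1)              # Eq. (7): concatenation of the M quantized segments
```

Sampling a hard one-hot vector keeps each quantized segment exactly equal to one codebook entry, while the relaxation lets gradients flow back into both the refining maps and the codebooks during end-to-end training.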
Now, we seek to preserve as much semantic information as possible in the quantized representation $h(x)$. Inspired by the recent successes of contrastive learning in learning semantically rich representations (Gao et al., 2021), we propose to minimize a contrastive loss. Specifically, for every document $x$, we first obtain two BERT embeddings by passing it through BERT twice with two independent dropout masks, and then use the two embeddings to generate two quantized representations $h^{(1)}(x)$ and $h^{(2)}(x)$ according to (6) and (7). Then, we define the contrastive loss as

$$\mathcal{L}_{cl} = \frac{1}{|B|} \sum_{x \in B} \left( \ell^{(1)}(x) + \ell^{(2)}(x) \right), \quad (8)$$

where $B$ denotes a mini-batch of training documents, and $\ell^{(i)}(x)$ for $i = 1, 2$ is defined as

$$\ell^{(i)}(x) \triangleq -\log \frac{S\big(h^{(1)}_x, h^{(2)}_x\big)}{S\big(h^{(1)}_x, h^{(2)}_x\big) + \sum_{t \in B \setminus x} \sum_{n=1,2} S\big(h^{(i)}_x, h^{(n)}_t\big)}, \quad (9)$$

with $h^{(1)}_x$ being an abbreviation of $h^{(1)}(x)$ for conciseness, and $S(h_1, h_2)$ defined as

$$S(h_1, h_2) \triangleq \exp\left(\mathrm{sim}(h_1, h_2)/\tau_{cl}\right), \quad (10)$$
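For completeness, here is a sketch of this in-batch contrastive objective (the choice of cosine similarity for $\mathrm{sim}(\cdot,\cdot)$ and the temperature value are illustrative assumptions of the sketch, since the preview truncates before those definitions): the two quantized views of each document form the positive pair, and the quantized views of all other in-batch documents serve as negatives, as in Eqs. (8)-(10).

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h1, h2, tau=0.5):
    """h1, h2: [batch, D] quantized representations of the two views from Eqs. (6)-(7).
    In-batch contrastive objective of Eqs. (8)-(10); cosine similarity for sim(.,.) and
    tau = 0.5 are illustrative assumptions of this sketch."""
    B = h1.size(0)
    h = torch.cat([h1, h2], dim=0)                                             # [2B, D], both views stacked
    sim = F.cosine_similarity(h.unsqueeze(1), h.unsqueeze(0), dim=-1) / tau    # [2B, 2B] similarity matrix
    mask = torch.eye(2 * B, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(mask, float('-inf'))                                 # a view is never its own negative
    # the positive of view i of document x is the other view of the same document
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)]).to(sim.device)
    # cross_entropy averages over the 2B rows; the factor 2 recovers Eq. (8)'s (1/|B|) sum of l^(1)+l^(2)
    return 2 * F.cross_entropy(sim, targets)
```

With the diagonal masked out, the denominator for each row contains exactly the positive pair plus the $2(|B|-1)$ negatives of Eq. (9).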