Efficient Document Retrieval by End-to-End Refining and Quantizing BERT Embedding with Contrastive Product Quantization

Zexuan Qiu¹, Qinliang Su¹,²∗, Jianxing Yu³, and Shijing Si⁴
¹School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
²Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou, China
³School of Artificial Intelligence, Sun Yat-sen University, Guangzhou, China
⁴School of Economics and Finance, Shanghai International Studies University, China
{qiuzx3@mail2, suqliang@mail, yujx26@mail}.sysu.edu.cn
Abstract
Efficient document retrieval heavily relies on the technique of semantic hashing, which learns a binary code for every document and employs the Hamming distance to evaluate document distances. However, existing semantic hashing methods are mostly built on outdated TFIDF features, which lack much of the important semantic information of documents. Furthermore, the Hamming distance can only take one of a few integer values, significantly limiting its ability to represent document distances. To address these issues, in this paper we propose to leverage BERT embeddings to perform efficient retrieval based on the product quantization technique, which assigns to every document a real-valued codeword from a codebook, instead of a binary code as in semantic hashing. Specifically, we first transform the original BERT embeddings via a learnable mapping and feed the transformed embeddings into a probabilistic product quantization module to output the assigned codeword. The refining and quantizing modules can be optimized in an end-to-end manner by minimizing a probabilistic contrastive loss. A mutual information maximization based method is further proposed to improve the representativeness of codewords, so that documents can be quantized more accurately. Extensive experiments conducted on three benchmarks demonstrate that our proposed method significantly outperforms current state-of-the-art baselines.¹

∗Corresponding author.
¹Our PyTorch code is available at https://github.com/qiuzx2/MICPQ, and our MindSpore code will also be released soon.
1 Introduction
In the era of big data, Approximate Nearest Neighbor (ANN) search has attracted tremendous attention thanks to its high search efficiency and strong performance in modern information retrieval systems. By quantizing each document into a compact binary code, semantic hashing (Salakhutdinov and Hinton, 2009) has become the main solution to ANN search due to the extremely low cost of calculating the Hamming distance between binary codes. One of the main approaches to unsupervised semantic hashing is built on generative models (Chaidaroon and Fang, 2017; Shen et al., 2018; Dong et al., 2019; Zheng et al., 2020), which encourage the binary codes to reconstruct the input documents. Alternatively, some other methods are driven by graphs (Weiss et al., 2008; Chaidaroon et al., 2020; Hansen et al., 2020; Ou et al., 2021a), hoping that the binary codes can recover the neighborhood relationships. Although these methods have achieved strong retrieval performance, two main problems remain.
Firstly, these methods are mostly built on top of outdated TFIDF features, which discard important information about documents, such as word order and contextual information. In recent years, pre-trained language models like BERT have achieved tremendous success in various downstream tasks. A natural question to ask is thus whether we can build efficient retrieval methods on BERT embeddings. However, it has been widely reported that BERT embeddings are not well suited to semantic similarity-related tasks (Reimers and Gurevych, 2019), performing even worse than traditional GloVe embeddings (Pennington et al., 2014). Ethayarajh (2019) and Li et al. (2020) attribute this to the "anisotropy" phenomenon: BERT embeddings occupy only a narrow cone in the vector space, making the semantic information hidden in them difficult to leverage directly. It is therefore important to investigate how to effectively leverage BERT embeddings for efficient document retrieval.
Secondly, to guarantee retrieval efficiency, most existing methods quantize every document into a binary code via semantic hashing. There is no doubt that the Hamming distance improves retrieval efficiency significantly, but it also severely restricts the representation of document similarities, since it can only take an integer value from 0 to $B$ for $B$-bit codes. Recently, an alternative approach named product quantization (Jégou et al., 2011; Jang and Cho, 2021; Wang et al., 2021) has been proposed in the computer vision community to alleviate this problem. Basically, it seeks to quantize every item to one of the codewords in a codebook, which is represented as a Cartesian product of multiple small codebooks. It has been shown that product quantization can deliver superior performance to semantic hashing while keeping the cost of computation and storage essentially unchanged. However, this technique has rarely been explored in unsupervised document retrieval.
Motivated by the two problems above, in this paper we propose an end-to-end contrastive product quantization model that jointly refines the original BERT embeddings and quantizes the refined embeddings into codewords. Specifically, we first transform the original BERT embeddings via a learnable mapping and feed the transformed embeddings into a probabilistic product quantization module to output a quantized representation (codeword). To preserve as much semantic information as possible in the quantized representations, inspired by recent successes of contrastive learning, a probabilistic contrastive loss is designed and trained in an end-to-end manner, simultaneously optimizing the refining and quantizing modules. Then, to further improve retrieval performance, inspired by recent developments in clustering, a mutual information maximization based method is developed to increase the representativeness of the learned codewords. In this way, the cluster structure hidden in the document collection can be preserved, allowing documents to be quantized more accurately. Extensive experiments are conducted on three real-world datasets, and the results demonstrate that our proposed method significantly outperforms current state-of-the-art baselines. Empirical analyses also demonstrate the effectiveness of every component of the proposed model.
2 Preliminaries of Product Quantization for Information Retrieval
In the field of efficient information retrieval, a prevalent approach is semantic hashing, which maps every item $x$ to a binary code $b$ and then uses Hamming distances to reflect the semantic similarity between items. Thanks to the low cost of computing the Hamming distance, retrieval can be performed very efficiently. However, the Hamming distance can only take an integer value from 0 to $B$ for $B$-bit codes, which is too restrictive to reflect rich similarity information.
An alternative approach is vector quantization (VQ) (Gray and Neuhoff, 1998), which assigns every item a codeword from a codebook $\mathcal{C}$. A codeword in VQ can be any vector in $\mathbb{R}^D$, rather than being limited to the binary form used in semantic hashing. By storing the pre-computed distances between any two codewords in a table, the distance between items can be obtained efficiently by looking up the table. However, to ensure competitive performance, the number of codewords in the codebook needs to be very large. For example, there are $2^{64}$ different codes for a 64-bit binary code, and thus the number of codewords in VQ should also be of this scale, which is too large to be handled.
To tackle this issue, product quantization (Jégou et al., 2011) proposes to represent the codebook $\mathcal{C}$ as a Cartesian product of $M$ small codebooks

$$\mathcal{C} = \mathcal{C}^1 \times \mathcal{C}^2 \times \cdots \times \mathcal{C}^M, \quad (1)$$

where the $m$-th codebook $\mathcal{C}^m$ consists of $K$ codewords $\{c^m_k\}_{k=1}^{K}$ with $c^m_k \in \mathbb{R}^{D/M}$. For an item, product quantization chooses a codeword from every codebook $\mathcal{C}^m$, and the final codeword assigned to this item is

$$c = c^1 \oplus c^2 \oplus \cdots \oplus c^M, \quad (2)$$

where $c^m$ denotes the codeword chosen from $\mathcal{C}^m$, and $\oplus$ denotes concatenation. For each codeword $c$, we only need to record its indices in the $M$ codebooks, which requires only $M \log_2 K$ bits. Thanks to the Cartesian product decomposition, we now only need to store $MK$ codewords in $\mathbb{R}^{D/M}$ and $M$ lookup tables of size $K \times K$. As an example, to enable a total of $2^{64}$ codewords, we can set $M = 32$ and $K = 4$, which obviously reduces the memory footprint and the size of the lookup tables significantly. During retrieval, we only need to look up the $M$ tables and sum the results, which is only slightly more costly than computing the Hamming distance.
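To make this bookkeeping concrete, the sketch below is a self-contained NumPy toy (with assumed sizes $D=64$, $M=8$, $K=16$, and randomly initialized codebooks standing in for learned ones, not the paper's implementation). It encodes an item into $M$ codeword indices and computes an approximate distance between two encoded items purely by $M$ table lookups and a sum.

```python
import numpy as np

D, M, K = 64, 8, 16              # embedding dim, number of codebooks, codewords per codebook
rng = np.random.default_rng(0)

# M small codebooks, each holding K codewords of dimension D/M. In practice these are learned
# (e.g. by k-means); random values keep the sketch self-contained.
codebooks = rng.normal(size=(M, K, D // M))

def encode(x):
    """Quantize x (shape [D]) to M codeword indices, i.e. M * log2(K) bits in total."""
    segments = x.reshape(M, D // M)
    # for each segment, pick the nearest codeword of the corresponding codebook
    dists = ((segments[:, None, :] - codebooks) ** 2).sum(-1)   # [M, K]
    return dists.argmin(axis=1)                                  # [M] indices

# M pre-computed lookup tables of size K x K: squared distances between codewords of one codebook.
tables = ((codebooks[:, :, None, :] - codebooks[:, None, :, :]) ** 2).sum(-1)  # [M, K, K]

def table_distance(codes_a, codes_b):
    """Approximate squared distance between two encoded items via M table lookups and a sum."""
    return sum(tables[m, codes_a[m], codes_b[m]] for m in range(M))

a, b = rng.normal(size=D), rng.normal(size=D)
print(table_distance(encode(a), encode(b)))   # cheap: M lookups plus M-1 additions
```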
3 The End-to-End Joint Refining and Quantizing Framework
To retrieve semantically similar documents, a core problem is how to produce for every document a quantized representation that preserves the semantic-similarity information of the original documents. In this section, a simple two-stage method is first presented, and then we develop an end-to-end method that directly outputs representations with the desired properties.
3.1 A Simple Two-Stage Approach
To obtain semantics-preserving quantized representations, a naive approach is to first enhance the semantic information in the original BERT embeddings and then quantize them. Many methods have been proposed for refining BERT embeddings to enhance their semantic information. These methods can essentially be described as transforming the original embedding $z(x)$ into another embedding $\tilde{z}(x)$ via a mapping $g(\cdot)$ as

$$\tilde{z}(x) = g(z(x)), \quad (3)$$

where $g(\cdot)$ could be a flow-based mapping (Li et al., 2020), a neural network trained to maximize the agreement between the representations of a document's two views (Gao et al., 2021), etc. It has been reported that the refined embeddings $\tilde{z}(x)$ are semantically much richer than the original ones $z(x)$. Then, we can follow the standard product quantization procedure to quantize the refined embeddings $\tilde{z}(x)$ into discrete representations.
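The following is a minimal sketch of this two-stage baseline, assuming a generic refining map `g` and per-segment k-means codebooks; these specific choices are illustrative, not necessarily those used in the paper's experiments. The key point is that refining finishes entirely before quantization begins.

```python
import numpy as np
from sklearn.cluster import KMeans

def two_stage_pq(bert_embeddings, g, M=8, K=16):
    """Stage 1: refine embeddings with a pre-trained mapping g (e.g. a flow or a SimCSE-style encoder).
    Stage 2: quantize each of the M segments independently with a K-codeword k-means codebook."""
    refined = g(bert_embeddings)                       # [N, D]; refining is fixed before quantization
    N, D = refined.shape
    segments = refined.reshape(N, M, D // M)
    codebooks, codes = [], []
    for m in range(M):                                 # fit one small codebook per segment
        km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(segments[:, m, :])
        codebooks.append(km.cluster_centers_)          # [K, D/M]
        codes.append(km.labels_)                       # [N] codeword indices for segment m
    return np.stack(codebooks), np.stack(codes, axis=1)   # [M, K, D/M], [N, M]
```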
3.2 End-to-End Refining and Quantizing via Contrastive Product Quantization
Obviously, the separation between the refining and quantizing steps in the two-stage method could result in a significant loss of semantic information in the final quantized representation. To address this issue, an end-to-end refining and quantizing method is proposed. We first slice the original BERT embedding $z(x)$ into $M$ segments $z^m(x) \in \mathbb{R}^{D/M}$ for $m = 1, 2, \cdots, M$. Then, we refine $z^m(x)$ by transforming it into a semantically richer form $\tilde{z}^m(x)$ via a mapping $g^m_{\theta}(\cdot)$, that is,

$$\tilde{z}^m(x) = g^m_{\theta}(z^m(x)), \quad (4)$$

where the subscript $\theta$ denotes the learnable parameters. Different from the mapping $g(\cdot)$ in the two-stage method, which is determined at the refining stage and is unaware of the quantization, the mapping $g^m_{\theta}(\cdot)$ here will be learned by taking the influence of the quantization error into account. Now, instead of quantizing the refined embedding $\tilde{z}^m(x)$ to a fixed codeword, we propose to quantize it to one of the codewords $\{c^m_{k_m}\}_{k_m=1}^{K}$ by stochastically selecting $k_m$ according to the distribution

$$p(k_m|x) = \frac{\exp\left(-\|\tilde{z}^m(x) - c^m_{k_m}\|^2\right)}{\sum_{i=1}^{K} \exp\left(-\|\tilde{z}^m(x) - c^m_i\|^2\right)}, \quad (5)$$

with $k_m = 1, 2, \cdots, K$. Obviously, the probability that $\tilde{z}^m(x)$ is quantized to a codeword decreases with their distance. Thus, by denoting $k_m$ as a random sample drawn from $p(k_m|x)$, i.e., $k_m \sim p(k_m|x)$, we can represent the $m$-th quantized representation of document $x$ as

$$h^m(x) = C^m \cdot \mathrm{one\_hot}(k_m), \quad (6)$$

and the whole quantized representation of $x$ as

$$h(x) = h^1(x) \oplus h^2(x) \oplus \cdots \oplus h^M(x). \quad (7)$$

Note that the quantized representation $h(x)$ depends on the random variables $k_m \sim p(k_m|x)$ for $m = 1, 2, \cdots, M$; thus $h(x)$ itself is also random.
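A minimal PyTorch sketch of this refine-then-quantize pipeline is given below. The module name, the single-linear-layer refiners, the toy sizes, and the use of a Gumbel-softmax style relaxation for differentiable sampling are our assumptions for illustration, not details taken from the paper: each segment is refined by its own small network (Eq. (4)), a softmax over negative squared distances gives the assignment distribution of Eq. (5), and the sampled one-hot vector selects the codeword as in Eqs. (6)-(7).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticPQ(nn.Module):
    """Sketch of end-to-end refining + probabilistic product quantization (Eqs. (4)-(7))."""
    def __init__(self, D=768, M=8, K=16):
        super().__init__()
        self.M, self.K, self.d = M, K, D // M
        # one refining map g^m_theta per segment (Eq. (4)); a single linear layer here for brevity
        self.refiners = nn.ModuleList([nn.Linear(self.d, self.d) for _ in range(M)])
        # M codebooks C^m, each with K codewords of dimension D/M
        self.codebooks = nn.Parameter(torch.randn(M, K, self.d))

    def forward(self, z):                        # z: [batch, D] BERT embeddings
        segs = z.view(-1, self.M, self.d)        # slice into M segments
        h = []
        for m in range(self.M):
            zt = self.refiners[m](segs[:, m, :])                        # refined segment, Eq. (4)
            logits = -(torch.cdist(zt, self.codebooks[m]) ** 2)         # -||z~ - c||^2, logits of Eq. (5)
            # stochastic codeword selection; a hard Gumbel-softmax is one differentiable way to
            # sample k_m ~ p(k_m|x) -- an assumption of this sketch, not necessarily the paper's choice
            one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)      # [batch, K]
            h.append(one_hot @ self.codebooks[m])                       # Eq. (6)
        return torch.cat(h, dim=-1)              # Eq. (7): concatenation of the M quantized segments
```

Sampling a hard one-hot vector keeps each quantized segment exactly equal to one codebook entry, while the relaxation lets gradients flow back into both the refining maps and the codebooks during end-to-end training.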
Now, we seek to preserve as much semantic information as possible in the quantized representation $h(x)$. Inspired by the recent successes of contrastive learning in learning semantically rich representations (Gao et al., 2021), we propose to minimize a contrastive loss. Specifically, for every document $x$, we first obtain two BERT embeddings by passing it through BERT twice with two independent dropout masks, and then use the two embeddings to generate two quantized representations $h^{(1)}(x)$ and $h^{(2)}(x)$ according to (6) and (7). Then, we define the contrastive loss as

$$\mathcal{L}_{cl} = \frac{1}{|B|} \sum_{x \in B} \left( \ell^{(1)}(x) + \ell^{(2)}(x) \right), \quad (8)$$

where $B$ denotes a mini-batch of training documents, and $\ell^{(i)}(x)$ for $i = 1, 2$ is defined as

$$\ell^{(i)}(x) \triangleq -\log \frac{S\big(h^{(1)}_x, h^{(2)}_x\big)}{S\big(h^{(1)}_x, h^{(2)}_x\big) + \sum_{t \in B \setminus x} \sum_{n=1,2} S\big(h^{(i)}_x, h^{(n)}_t\big)}, \quad (9)$$

with $h^{(1)}_x$ being an abbreviation of $h^{(1)}(x)$ for conciseness, and $S(h_1, h_2)$ defined as

$$S(h_1, h_2) \triangleq \exp\left(\mathrm{sim}(h_1, h_2)/\tau_{cl}\right), \quad (10)$$
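For completeness, here is a sketch of this in-batch contrastive objective (the choice of cosine similarity for $\mathrm{sim}(\cdot,\cdot)$ and the temperature value are illustrative assumptions of the sketch, since the preview truncates before those definitions): the two quantized views of each document form the positive pair, and the quantized views of all other in-batch documents serve as negatives, as in Eqs. (8)-(10).

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h1, h2, tau=0.5):
    """h1, h2: [batch, D] quantized representations of the two views from Eqs. (6)-(7).
    In-batch contrastive objective of Eqs. (8)-(10); cosine similarity for sim(.,.) and
    tau = 0.5 are illustrative assumptions of this sketch."""
    B = h1.size(0)
    h = torch.cat([h1, h2], dim=0)                                             # [2B, D], both views stacked
    sim = F.cosine_similarity(h.unsqueeze(1), h.unsqueeze(0), dim=-1) / tau    # [2B, 2B] similarity matrix
    mask = torch.eye(2 * B, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(mask, float('-inf'))                                 # a view is never its own negative
    # the positive of view i of document x is the other view of the same document
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)]).to(sim.device)
    # cross_entropy averages over the 2B rows; the factor 2 recovers Eq. (8)'s (1/|B|) sum of l^(1)+l^(2)
    return 2 * F.cross_entropy(sim, targets)
```

With the diagonal masked out, the denominator for each row contains exactly the positive pair plus the $2(|B|-1)$ negatives of Eq. (9).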