arXiv:2210.01701v1 [cs.IR] 4 Oct 2022
Knowledge Distillation based Contextual Relevance Matching
for E-commerce Product Search
Ziyang Liu§, Chaokun Wang§*, Hao Feng§, Lingfei Wu†, Liqun Yang‡
§Tsinghua University, †JD.com, ‡CNAEIT
§liu-zy21@mails.tsinghua.edu.cn, chaokun@tsinghua.edu.cn
†lwu@email.wm.edu, ‡yanglq@cnaeit.com
Abstract
Online relevance matching is an essential task in e-commerce product search to boost the utility of search engines and ensure a smooth user experience. Previous work adopts either classical relevance matching models or Transformer-style models to address it. However, these models ignore the inherent bipartite graph structures that are ubiquitous in e-commerce product search logs and are too inefficient to deploy online. In this paper, we design an efficient knowledge distillation framework for e-commerce relevance matching to integrate the respective advantages of Transformer-style models and classical relevance matching models. Especially for the core student model of the framework, we propose a novel method using $k$-order relevance modeling. The experimental results on large-scale real-world data (the size is 6,174 million) show that the proposed method significantly improves the prediction accuracy in terms of human relevance judgment. We deploy our method to the anonymous online search platform. The A/B testing results show that our method significantly improves UV-value by 5.7% under the price sort mode.
1 Introduction
Relevance matching (Guo et al., 2016; Rao et al., 2019; Wang et al., 2020) is an important task in the field of ad-hoc information retrieval (Zhai and Lafferty, 2017), which aims to return a sequence of information resources related to a user query (Huang et al., 2020; Chang et al., 2021a; Sun and Duh, 2020). Generally, texts are the dominant form of the user query and the returned information resources. Given two sentences, the target of relevance matching is to estimate their relevance score and then judge whether they are relevant or not. However, text similarity does not mean semantic similarity. For example, while "mac pro 1.7GHz" and "mac lipstick 1.7ml" look alike, they describe two different and irrelevant products. Therefore, relevance matching is important, especially for the long-term user satisfaction of e-commerce search (Cheng et al., 2022; Niu et al., 2020; Xu et al., 2021; Zhu et al., 2020).

*Chaokun Wang is the corresponding author.
Related work. With the rapid development of deep learning, the current research on relevance matching can be grouped into two camps (see Appendix A for further details): 1. Classical relevance matching models. For the given query and item, classical relevance matching models either learn their individual embeddings or learn an overall embedding based on the calculation from word-level interaction to sentence-level interaction. Representative methods include ESIM (Chen et al., 2017) and BERT2DNN (Jiang et al., 2020). 2. Transformer-style models. These models adopt the multi-layer Transformer network structure (Vaswani et al., 2017a). They have achieved breakthroughs on many NLP tasks and even reached human-level accuracy. Representative methods include BERT (Devlin et al., 2019) and ERNIE (Sun et al., 2019b).
Although Transformer-style models show satisfactory performance on relevance matching, they are hard to deploy to the online environment due to their high time complexity. Moreover, the above methods cannot deal with the abundant context information (i.e., the neighbor features in a query-item bipartite graph) in e-commerce product search. Last but not least, when applied to real-world scenarios, existing classical relevance matching models directly use user behaviors as labeling information (Fig. 1). However, this solution is not directly suitable for relevance matching because user behaviors are often noisy and deviate from relevance signals (Mao et al., 2019; Liu and Mao, 2020).
Figure 1: Shortcoming of the existing relevance matching model. Here we take the ARC-I model as an example. The right part shows the ground truth of queries and item titles (red denotes problematic examples). The left part shows two problematic examples in ARC-I, which deviate from the ground truth.

In this paper, we propose to incorporate bipartite graph embedding into the knowledge distillation framework (Li et al., 2021a; Dong et al., 2021; Rashid et al., 2021; Wu et al., 2021b; Zhang et al., 2020) to solve the relevance matching problem in the scene of e-commerce product search.
We adopt BERT (Devlin et al., 2019) as the teacher model in this framework. We design a novel model called BERM (Bipartite graph Embedding for Relevance Matching), which acts as the student model in our knowledge distillation framework. This model captures the 0-order relevance using a word interaction matrix attached with positional encoding, and captures the higher-order relevance using metapath embeddings with graph attention scores. For online deployment, it is further distilled into a tiny model, BERM-O.
Our main contributions are as follows:
• We formalize the $k$-order relevance problem in a bipartite graph (Section 2.1) and address it with a knowledge distillation framework built around a novel student model called BERM.
• We apply BERM to the e-commerce product search scene with abundant context information (Section 2.3) and evaluate its performance (Section 3). The results indicate that BERM outperforms the state-of-the-art methods.
• To facilitate online applications, we further distill BERM into a faster model, i.e., BERM-O. The results of online A/B testing indicate that BERM-O significantly improves UV-value by 5.7% (relative value) under the price sort mode.
2 Method
2.1 Problem Definition
We first give the definition of the bipartite graph:
Definition 1 (Bipartite Graph). Given a graph $\mathcal{G}=(\mathcal{U},\mathcal{V},\mathcal{E},\mathcal{A},\mathcal{R})$, it contains two disjoint node sets $\mathcal{U}=\{u_1,u_2,\cdots,u_n\}$ and $\mathcal{V}=\{v_1,v_2,\cdots,v_{n'}\}$. For the edge set $\mathcal{E}=\{e_1,\cdots,e_m\}$, each edge $e_i$ connects a node $u_j$ in $\mathcal{U}$ and a node $v_k$ in $\mathcal{V}$. In addition, there is a node type mapping function $f_1:\mathcal{U}\cup\mathcal{V}\to\mathcal{A}$ and an edge type mapping function $f_2:\mathcal{E}\to\mathcal{R}$. Such a graph $\mathcal{G}$ is called a bipartite graph.
Example 1. Given an e-commerce search log, we can build a query-item bipartite graph as shown in the left part of Fig. 1, where $\mathcal{A}=\{\text{Query},\text{Item}\}$ and $\mathcal{R}=\{\text{Click}\}$.
In a bipartite graph, we use the metapath and
metapath instance to incorporate the neighboring
node information into relevance matching. They
are defined as follows:
Definition 2 (Metapath and Metapath Instance in Bipartite Graph). Given a bipartite graph $\mathcal{G}=(\mathcal{U},\mathcal{V},\mathcal{E},\mathcal{A},\mathcal{R})$, a metapath $P_i = a_1 \xrightarrow{r_1} a_2 \xrightarrow{r_2} \cdots \xrightarrow{r_l} a_{l+1}$ ($a_j \neq a_{j+1}$, $1 \leq j \leq l$) represents the path from $a_1$ to $a_{l+1}$ successively through $r_1, r_2, \cdots, r_l$ ($a_j \in \mathcal{A}$, $r_j \in \mathcal{R}$). The length of $P_i$ is denoted as $|P_i|$, and $|P_i| = l$. For brevity, the set of all metapaths on $\mathcal{G}$ can be represented as the regular expression $P_{\mathcal{G}} = (a|\varepsilon)(a'a)^{+}(a'|\varepsilon)$, where $a, a' \in \mathcal{A}$ and $a \neq a'$. A metapath instance $p$ is a definite node sequence instantiated from metapath $P_i$. The set of all instances of $P_i$ is denoted as $I(P_i)$; then $p \in I(P_i)$.
Example 2. As shown in Fig. 1, an instance of the metapath "Query-Item-Query" is "q2-i3-q3".
Definition 3 (k-order Relevance). Given a bipartite graph $\mathcal{G}=(\mathcal{U},\mathcal{V},\mathcal{E},\mathcal{A},\mathcal{R})$, a function $F^{k}_{rel}: \mathcal{U} \times \mathcal{V} \to [0,1]$ is called a $k$-order relevance function on $\mathcal{G}$ if $F^{k}_{rel}(u_i, v_j) = G(\Phi(u_i), \Phi(v_j) \mid C_k)$, where $\Phi(\cdot)$ is a function that maps each node to a representation vector, $G(\cdot)$ is the score function, $u_i \in \mathcal{U}$, $v_j \in \mathcal{V}$, and the context information $C_k = \bigcup_{I_{P_i} \in I(P_i),\, P_i \in P_{\mathcal{G}},\, |P_i| = k} I_{P_i}$.
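To make Definitions 1-3 concrete, the following is a minimal Python sketch (our illustration, not code from the paper) that builds the query-item bipartite graph of Fig. 1 from a toy click log and enumerates the length-2 "Query-Item-Query" metapath instances that would contribute to the context information $C_2$. All function and variable names are ours.

```python
# A minimal sketch: build the query-item bipartite graph (Definition 1) from a
# click log and enumerate "Query-Item-Query" metapath instances (Definitions 2-3).
from collections import defaultdict

click_log = [("q1", "i1"), ("q2", "i3"), ("q3", "i3"), ("q3", "i4")]  # toy edges

items_of_query = defaultdict(set)   # neighbors of each query node in V
queries_of_item = defaultdict(set)  # neighbors of each item node in U
for q, i in click_log:
    items_of_query[q].add(i)
    queries_of_item[i].add(q)

def qiq_instances(anchor_query):
    """Enumerate instances of the metapath 'Query-Item-Query' rooted at anchor_query."""
    instances = []
    for item in items_of_query[anchor_query]:
        for other_query in queries_of_item[item]:
            if other_query != anchor_query:
                instances.append((anchor_query, item, other_query))
    return instances

print(qiq_instances("q2"))  # [('q2', 'i3', 'q3')], matching Example 2
```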
Many existing relevance matching models (Huang et al., 2013; Shen et al., 2014; Hu et al., 2014b) ignore the context information $C_k$ and only consider the sentences w.r.t. the query and item to be matched, which corresponds to 0-order relevance (for more details, please see the "Related Work" part in Appendix A). We call it context-free relevance matching in this paper. Considering that both the 0-order neighbor (i.e., the node itself) and the $k$-order neighbors ($k > 1$) are necessary for relevance matching, we argue that a reasonable mechanism should ensure that they can cooperate with each other. Then the research objective of our work is defined as follows:
Figure 2: The e-commerce knowledge distillation framework proposed in our work. Three models are used in this framework: teacher model BERT, student model BERM, and online model BERM-O.
Definition 4 (Contextual Relevance Matching). Given a bipartite graph $\mathcal{G}=(\mathcal{U},\mathcal{V},\mathcal{E},\mathcal{A},\mathcal{R})$, the task of contextual relevance matching is to determine the context information $C_k$ on $\mathcal{G}$ and learn the score function $G(\cdot)$.
2.2 Overview
We propose a complete knowledge distillation framework (Fig. 2), whose student model incorporates the context information, for contextual relevance matching in e-commerce product search. The main components of this framework are described as follows (see Appendix B for further details):

Graph construction. We first construct a raw bipartite graph $\mathcal{G}$ based on the search data collected from the anonymous platform. Then we construct a knowledge-enhanced bipartite graph $\mathcal{G}'$ with the help of BERT, which is fine-tuned on the human-labeled relevance data.

Student model design. We design a novel student model BERM corresponding to the score function $G(\cdot)$ in Def. 4. Specifically, macro and micro matching embeddings are derived in BERM to capture the sentence-level and word-level relevance matching signals, respectively. Also, based on the metapaths "Q-I-Q" and "I-Q-I", we design a node-level encoder and a metapath-instance-level aggregator to derive metapath embeddings.

Online application. To serve online search, we further distill BERM into BERM-O, which can be easily deployed to the online search platform.
2.3 BERM Model
Next we describe BERM in detail, including 0-order relevance modeling (Section 2.3.1), $k$-order relevance modeling (Section 2.3.2), and the overall learning objective (Section 2.3.3).
2.3.1 0-order Relevance Modeling
The whole structure of BERM includes both 0-order relevance modeling and $k$-order relevance modeling. This subsection introduces the 0-order relevance modeling, which captures sentence-level and word-level matching signals by incorporating the macro matching embedding and the micro matching embedding, respectively.
Word embedding in the e-commerce scene. In the e-commerce scene, obtaining basic representations of queries and items is an intractable problem. On one hand, it is infeasible to represent queries and items as individual embeddings due to the unbounded entity space. On the other hand, product type names (like "iphone11") or attribute names (like "256GB") carry special background information and can contain complex lexicons such as different languages and numerals. To address these problems, we adopt word embedding in BERM, which dramatically reduces the representation space. Also, we treat contiguous numerals, contiguous English letters, or single Chinese characters as one word and only retain the high-frequency words (such as words occurring more than fifty times in a six-month search log) in the vocabulary. The final vocabulary contains only tens of thousands of words, which saves memory consumption and index lookup time by a large margin. Each word is represented by a $d$-dimensional embedding vector trained by Word2Vec (Mikolov et al., 2013). The $i$-th word embedding of query $Q$ (or item title $I$) is denoted as $\mathbf{E}^i_Q \in \mathbb{R}^d$ (or $\mathbf{E}^i_I \in \mathbb{R}^d$).
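As a concrete illustration of this segmentation rule, here is a minimal Python sketch (our own, not the paper's released code); the regular expression and the frequency threshold are illustrative assumptions.

```python
# A minimal sketch of the word segmentation rule described above: contiguous
# numerals, contiguous English letters, or a single Chinese character each count
# as one word, and only high-frequency words are kept in the vocabulary.
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"[0-9]+|[A-Za-z]+|[\u4e00-\u9fff]")

def tokenize(text: str):
    return TOKEN_PATTERN.findall(text.lower())

def build_vocab(corpus, min_count=50):
    """Keep only words occurring at least `min_count` times (threshold is illustrative)."""
    counts = Counter(tok for text in corpus for tok in tokenize(text))
    return {w for w, c in counts.items() if c >= min_count}

print(tokenize("Apple MacBook Pro 16GB 1TB 2.3GHz"))
# ['apple', 'macbook', 'pro', '16', 'gb', '1', 'tb', '2', '3', 'ghz']
```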
Macro and micro matching embeddings. To capture sentence-level and word-level matching signals, we employ the macro matching embedding and the micro matching embedding, respectively. For the macro matching embedding, taking query $Q$ with $l_Q$ words and item $I$ with $l_I$ words as examples, their macro embeddings $\mathbf{E}^{Q}_{seq}, \mathbf{E}^{I}_{seq} \in \mathbb{R}^d$ are calculated as the column-wise mean of $\mathbf{E}_Q \in \mathbb{R}^{l_Q \times d}$ and $\mathbf{E}_I \in \mathbb{R}^{l_I \times d}$:
$$\mathbf{E}^{Q}_{seq} = \frac{1}{l_Q}\sum_{i=1}^{l_Q}\mathbf{E}^i_Q, \qquad \mathbf{E}^{I}_{seq} = \frac{1}{l_I}\sum_{i=1}^{l_I}\mathbf{E}^i_I. \tag{1}$$
For the micro matching embedding, we first build an interaction matrix $\mathbf{M}_{int} \in \mathbb{R}^{l_Q \times l_I}$ whose $(i,j)$-th entry is the dot product of $\mathbf{E}^i_Q$ and $\mathbf{E}^j_I$:
$$\mathbf{M}_{int} = \{m^{i,j}_{int}\}_{l_Q \times l_I}, \quad m^{i,j}_{int} = \langle \mathbf{E}^i_Q, \mathbf{E}^j_I \rangle. \tag{2}$$
Then the micro matching embedding $\mathbf{E}_{int} \in \mathbb{R}^{l_Q l_I}$ is the vectorization of $\mathbf{M}_{int}$, i.e., $\mathbf{E}_{int} = \mathrm{vec}(\mathbf{M}_{int})$.
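For clarity, a minimal numpy sketch of Eqs. (1)-(2) follows (our illustration; array names mirror the notation above, with random toy inputs).

```python
# A minimal numpy sketch of Eqs. (1)-(2): macro embeddings are column-wise means
# of the word-embedding matrices, and the micro matching embedding is the
# vectorized word-by-word dot-product interaction matrix.
import numpy as np

d, l_Q, l_I = 8, 3, 5                      # toy sizes
E_Q = np.random.randn(l_Q, d)              # word embeddings of the query
E_I = np.random.randn(l_I, d)              # word embeddings of the item title

E_Q_seq = E_Q.mean(axis=0)                 # Eq. (1), shape (d,)
E_I_seq = E_I.mean(axis=0)                 # Eq. (1), shape (d,)

M_int = E_Q @ E_I.T                        # Eq. (2), shape (l_Q, l_I)
E_int = M_int.reshape(-1)                  # vec(M_int), shape (l_Q * l_I,)
```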
2.3.2 k-order Relevance Modeling
The $k$-order relevance model contains a node-level encoder and a metapath-instance-level aggregator.
Node-level encoder. The input of the node-level encoder is node embeddings, and its output is an instance embedding (i.e., the embedding of a metapath instance). Specifically, to obtain the instance embedding, we integrate the embeddings of neighboring nodes into the anchor node embedding with a mean encoder. Taking "$Q$-$I_{top1}$-$Q_{top1}$" (note that "$I_{top1}$" is the top-1 node in the 1-hop neighbor list of node $Q$; see Appendix B.1 for further details) as an example, its embedding $\mathbf{E}_{Q\text{-}I_{top1}\text{-}Q_{top1}} \in \mathbb{R}^d$ is calculated by:
$$\mathbf{E}_{Q\text{-}I_{top1}\text{-}Q_{top1}} = \mathrm{MEAN}(\mathbf{E}^{Q}_{seq}, \mathbf{E}^{I_{top1}}_{seq}, \mathbf{E}^{Q_{top1}}_{seq}). \tag{3}$$
The metapath instance bridges the communication gap between different types of nodes and can be used to update the anchor node embedding with structure information.
Metapath-instance-level aggregator. The inputs of the metapath-instance-level aggregator are instance embeddings, and its output is a metapath embedding. Different metapath instances convey different information, so they have different effects on the final metapath embedding. However, the mapping relationship between the instance embeddings and the metapath embedding is unknown. To learn this relationship automatically, we introduce the "graph attention" mechanism to generate metapath embeddings (Wu et al., 2021a; Liu et al., 2022b). Taking the metapath "Q-I-Q" as an example, we use graph attention to represent the mapping relationship between "Q-I-Q" and its instances. The final metapath embedding $\mathbf{E}_{Q\text{-}I\text{-}Q} \in \mathbb{R}^d$ is obtained ($\mathbf{E}_{I\text{-}Q\text{-}I} \in \mathbb{R}^d$ is calculated similarly) by accumulating all instance embeddings with attention scores $\mathrm{Att}_1, \mathrm{Att}_2, \mathrm{Att}_3, \mathrm{Att}_4 \in \mathbb{R}$:
$$\mathbf{E}_{Q\text{-}I\text{-}Q} = \mathrm{LeakyReLU}(\mathrm{Att}_1 \cdot \mathbf{E}_{Q\text{-}I_{top1}\text{-}Q_{top1}} + \mathrm{Att}_2 \cdot \mathbf{E}_{Q\text{-}I_{top1}\text{-}Q_{top2}} + \mathrm{Att}_3 \cdot \mathbf{E}_{Q\text{-}I_{top2}\text{-}Q_{top1}} + \mathrm{Att}_4 \cdot \mathbf{E}_{Q\text{-}I_{top2}\text{-}Q_{top2}}). \tag{4}$$
Though $\mathrm{Att}_i$ can be set as a fixed value, we adopt a more flexible way, i.e., using a neural network to learn $\mathrm{Att}_i$ automatically. Specifically, we feed the concatenation of the anchor node embeddings and the metapath instance embeddings into a one-layer neural network (with weight $\mathbf{W}_{att} \in \mathbb{R}^{6d \times 4}$ and bias $\mathbf{b}_{att} \in \mathbb{R}^{1 \times 4}$) followed by a softmax layer, which outputs an attention distribution:
$$(\mathrm{Att}_i)_{1 \leq i \leq 4} = \mathrm{softmax}(\mathbf{E}_{concat}\mathbf{W}_{att} + \mathbf{b}_{att}), \tag{5}$$
$$\mathbf{E}_{concat} = [\mathbf{E}^{Q}_{seq} \,|\, \mathbf{E}^{I}_{seq} \,|\, \mathbf{E}_{Q\text{-}I_{top1}\text{-}Q_{top1}} \,|\, \mathbf{E}_{Q\text{-}I_{top1}\text{-}Q_{top2}} \,|\, \mathbf{E}_{Q\text{-}I_{top2}\text{-}Q_{top1}} \,|\, \mathbf{E}_{Q\text{-}I_{top2}\text{-}Q_{top2}}]. \tag{6}$$
The above process is shown in Fig. 3.

Figure 3: Calculation process of $\mathbf{E}_{Q\text{-}I\text{-}Q}$ and $\mathbf{E}_{I\text{-}Q\text{-}I}$ in BERM.
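Below is a minimal PyTorch sketch of the attention-based aggregation in Eqs. (4)-(6), assuming two top-ranked item neighbors and two top-ranked query neighbors (hence four instances); the module and variable names are our own.

```python
# A minimal PyTorch sketch of Eqs. (4)-(6): a one-layer attention network over
# the four "Q-I-Q" metapath-instance embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetapathAggregator(nn.Module):
    def __init__(self, d: int, num_instances: int = 4):
        super().__init__()
        # W_att in R^{6d x 4} and b_att in R^{1 x 4} (Eq. 5)
        self.att = nn.Linear((2 + num_instances) * d, num_instances)

    def forward(self, e_q_seq, e_i_seq, instance_embs):
        # instance_embs: tensor of shape (4, d), one row per metapath instance (Eq. 3)
        e_concat = torch.cat([e_q_seq, e_i_seq, instance_embs.reshape(-1)])  # (6d,)
        att = torch.softmax(self.att(e_concat), dim=-1)                      # (4,), Eq. (5)
        # Eq. (4): attention-weighted sum of instance embeddings
        return F.leaky_relu(att @ instance_embs)                             # (d,)

d = 8
agg = MetapathAggregator(d)
e_qiq = agg(torch.randn(d), torch.randn(d), torch.randn(4, d))
```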
Embedding fusion. From the 0-order and $k$-order relevance modeling, three types of embeddings are generated: the macro matching embeddings ($\mathbf{E}^{Q}_{seq}, \mathbf{E}^{I}_{seq} \in \mathbb{R}^d$), the micro matching embedding ($\mathbf{E}_{int} \in \mathbb{R}^{l_Q l_I}$), and the metapath embeddings ($\mathbf{E}_{Q\text{-}I\text{-}Q}, \mathbf{E}_{I\text{-}Q\text{-}I} \in \mathbb{R}^d$). We concatenate them together and feed the result to a three-layer neural network (with weights $\mathbf{W}_0 \in \mathbb{R}^{(4d + l_Q l_I) \times d}$, $\mathbf{W}_1, \mathbf{W}_2 \in \mathbb{R}^{d \times d}$, $\mathbf{W}_3 \in \mathbb{R}^{d \times 1}$ and biases $\mathbf{b}_0, \mathbf{b}_1, \mathbf{b}_2 \in \mathbb{R}^{1 \times d}$, $\mathbf{b}_3 \in \mathbb{R}^{1 \times 1}$), which outputs the final relevance estimation score $\hat{y}_i$:
$$\hat{y}_i = \mathrm{Sigmoid}(\mathbf{E}_3\mathbf{W}_3 + \mathbf{b}_3), \tag{7}$$
$$\mathbf{E}_{j+1} = \mathrm{ReLU}(\mathbf{E}_j\mathbf{W}_j + \mathbf{b}_j), \quad \mathbf{E}_0 = \mathbf{E}_{all}, \quad j = 0, 1, 2, \tag{8}$$
$$\mathbf{E}_{all} = [\mathbf{E}^{Q}_{seq} \,|\, \mathbf{E}^{I}_{seq} \,|\, \mathbf{E}_{int} \,|\, \mathbf{E}_{Q\text{-}I\text{-}Q} \,|\, \mathbf{E}_{I\text{-}Q\text{-}I}]. \tag{9}$$
2.3.3 Overall Learning Objective
We evaluate the cross-entropy error between the estimated score $\hat{y}_i$ and the label $y_i$ (note that $y_i \in [0,1]$ is the score of the teacher model BERT; see Appendix B.1 for further details), and then minimize the following loss function:
$$\mathcal{L} = -\sum_{i=1}^{\tilde{n}} \left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]. \tag{10}$$
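As a small sketch of this distillation objective (our illustration with toy numbers), Eq. (10) reduces to binary cross-entropy against the teacher's soft scores:

```python
# A minimal sketch of Eq. (10): binary cross-entropy between the student's score
# y_hat and the teacher's (BERT's) soft relevance score y in [0, 1].
import torch
import torch.nn.functional as F

y_teacher = torch.tensor([0.92, 0.10, 0.55])   # soft labels from the fine-tuned BERT
y_hat = torch.tensor([0.80, 0.20, 0.60])       # BERM's predictions (Eq. 7)

loss = F.binary_cross_entropy(y_hat, y_teacher, reduction="sum")  # sum over the transfer set
```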
In Appendix C, we analyze the complexities of
BERT, BERM, and BERM-O.