arXiv:2210.01701v1 [cs.IR] 4 Oct 2022
Knowledge Distillation based Contextual Relevance Matching
for E-commerce Product Search
Ziyang Liu§, Chaokun Wang§*, Hao Feng§, Lingfei Wu†, Liqun Yang‡
§Tsinghua University, †JD.com, ‡CNAEIT
§liu-zy21@mails.tsinghua.edu.cn, chaokun@tsinghua.edu.cn
†lwu@email.wm.edu, ‡yanglq@cnaeit.com
Abstract
Online relevance matching is an essential task in e-commerce product search to boost the utility of search engines and ensure a smooth user experience. Previous work adopts either classical relevance matching models or Transformer-style models to address it. However, these models ignore the inherent bipartite graph structures that are ubiquitous in e-commerce product search logs and are too inefficient to deploy online. In this paper, we design an efficient knowledge distillation framework for e-commerce relevance matching to integrate the respective advantages of Transformer-style models and classical relevance matching models. Especially for the core student model of the framework, we propose a novel method using $k$-order relevance modeling. The experimental results on large-scale real-world data (the size is 6,174 million) show that the proposed method significantly improves the prediction accuracy in terms of human relevance judgment. We deploy our method to the anonymous online search platform. The A/B testing results show that our method significantly improves UV-value by 5.7% under the price sort mode.
1 Introduction
Relevance matching (Guo et al., 2016; Rao et al., 2019; Wang et al., 2020) is an important task in the field of ad-hoc information retrieval (Zhai and Lafferty, 2017), which aims to return a sequence of information resources related to a user query (Huang et al., 2020; Chang et al., 2021a; Sun and Duh, 2020). Generally, texts are the dominant form of the user query and the returned information resources. Given two sentences, the target of relevance matching is to estimate their relevance score and then judge whether they are relevant or not. However, text similarity does not mean semantic similarity. For example, while "mac pro 1.7GHz" and "mac lipstick 1.7ml" look alike, they describe two different and irrelevant products. Therefore, relevance matching is important, especially for the long-term user satisfaction of e-commerce search (Cheng et al., 2022; Niu et al., 2020; Xu et al., 2021; Zhu et al., 2020).

*Chaokun Wang is the corresponding author.
Related work. With the rapid development of deep learning, the current research on relevance matching can be grouped into two camps (see Appendix A for further details): 1. Classical relevance matching models. For the given query and item, classical relevance matching models either learn their individual embeddings or learn an overall embedding based on the calculation from word-level interaction to sentence-level interaction. Representative methods include ESIM (Chen et al., 2017) and BERT2DNN (Jiang et al., 2020). 2. Transformer-style models. These models adopt the multi-layer Transformer network structure (Vaswani et al., 2017a). They have achieved breakthroughs on many NLP tasks and even reached human-level accuracy. Representative methods include BERT (Devlin et al., 2019) and ERNIE (Sun et al., 2019b).
Although Transformer-style models show satisfactory performance on relevance matching, they are hard to deploy to the online environment due to their high time complexity. Moreover, the above methods cannot deal with the abundant context information (i.e., the neighbor features in a query-item bipartite graph) in e-commerce product search. Last but not least, when applied to real-world scenarios, existing classical relevance matching models directly use user behaviors as labeling information (Fig. 1). However, this solution is not directly suitable for relevance matching because user behaviors are often noisy and deviate from relevance signals (Mao et al., 2019; Liu and Mao, 2020).
Figure 1: Shortcoming of the existing relevance matching model. Here we take the ARC-I model as an example. The right part shows the ground truth of queries and item titles (red denotes problematic examples). The left part shows two problematic examples in ARC-I, which deviate from the ground truth.

In this paper, we propose to incorporate bipartite graph embedding into the knowledge distillation framework (Li et al., 2021a; Dong et al., 2021; Rashid et al., 2021; Wu et al., 2021b; Zhang et al., 2020) to solve the relevance matching problem in the scene of e-commerce product search.
We adopt BERT (Devlin et al., 2019) as the teacher model in this framework. We design a novel model called BERM (Bipartite graph Embedding for Relevance Matching), which acts as the student model in our knowledge distillation framework. This model captures the 0-order relevance using a word interaction matrix attached with positional encoding, and captures the higher-order relevance using metapath embeddings with graph attention scores. For online deployment, it is further distilled into a tiny model, BERM-O.
Our main contributions are as follows:
• We formalize the $k$-order relevance problem in a bipartite graph (Section 2.1) and address it with a knowledge distillation framework built around a novel student model called BERM.
• We apply BERM to the e-commerce product search scene with abundant context information (Section 2.3) and evaluate its performance (Section 3). The results indicate that BERM outperforms the state-of-the-art methods.
• To facilitate online applications, we further distill BERM into a faster model, i.e., BERM-O. The results of online A/B testing indicate that BERM-O significantly improves UV-value by 5.7% (relative value) under the price sort mode.
2 Method
2.1 Problem Definition
We first give the definition of the bipartite graph:
Definition 1 (Bipartite Graph). Given a graph $\mathcal{G}=(\mathcal{U},\mathcal{V},\mathcal{E},\mathcal{A},\mathcal{R})$, it contains two disjoint node sets $\mathcal{U}=\{u_1,u_2,\cdots,u_n\}$ and $\mathcal{V}=\{v_1,v_2,\cdots,v_{n'}\}$. For the edge set $\mathcal{E}=\{e_1,\cdots,e_m\}$, each edge $e_i$ connects a node $u_j$ in $\mathcal{U}$ and a node $v_k$ in $\mathcal{V}$. In addition, there is a node type mapping function $f_1:\mathcal{U}\cup\mathcal{V}\to\mathcal{A}$ and an edge type mapping function $f_2:\mathcal{E}\to\mathcal{R}$. Such a graph $\mathcal{G}$ is called a bipartite graph.
Example 1. Given an e-commerce search log, we can build a query-item bipartite graph as shown in the left part of Fig. 1, where $\mathcal{A}=\{\text{Query},\text{Item}\}$ and $\mathcal{R}=\{\text{Click}\}$.
In a bipartite graph, we use the metapath and
metapath instance to incorporate the neighboring
node information into relevance matching. They
are defined as follows:
Definition 2 (Metapath and Metapath Instance in Bipartite Graph). Given a bipartite graph $\mathcal{G}=(\mathcal{U},\mathcal{V},\mathcal{E},\mathcal{A},\mathcal{R})$, a metapath $P_i = a_1 \xrightarrow{r_1} a_2 \xrightarrow{r_2} \cdots \xrightarrow{r_l} a_{l+1}$ ($a_j \neq a_{j+1}$, $1 \leq j \leq l$) represents the path from $a_1$ to $a_{l+1}$ successively through $r_1, r_2, \cdots, r_l$ ($a_j \in \mathcal{A}$, $r_j \in \mathcal{R}$). The length of $P_i$ is denoted as $|P_i|$, and $|P_i| = l$. For brevity, the set of all metapaths on $\mathcal{G}$ can be represented as the regular expression $P_{\mathcal{G}} = (a|\varepsilon)(a'a)^{+}(a'|\varepsilon)$, where $a, a' \in \mathcal{A}$ and $a \neq a'$. A metapath instance $p$ is a definite node sequence instantiated from metapath $P_i$. The set of all instances of $P_i$ is denoted as $I(P_i)$; then $p \in I(P_i)$.
Example 2. As shown in Fig. 1, an instance of the metapath "Query-Item-Query" is "q2-i3-q3".
Definition 3 (k-order Relevance). Given a bipartite graph $\mathcal{G}=(\mathcal{U},\mathcal{V},\mathcal{E},\mathcal{A},\mathcal{R})$, a function $F^{k}_{rel}: \mathcal{U} \times \mathcal{V} \to [0,1]$ is called a $k$-order relevance function on $\mathcal{G}$ if $F^{k}_{rel}(u_i, v_j) = G(\Phi(u_i), \Phi(v_j) \mid C_k)$, where $\Phi(\cdot)$ is a function that maps each node to a representation vector, $G(\cdot)$ is the score function, $u_i \in \mathcal{U}$, $v_j \in \mathcal{V}$, and the context information $C_k = \bigcup_{I_{P_i} \in I(P_i),\, P_i \in P_{\mathcal{G}},\, |P_i| = k} I_{P_i}$.
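To make Definitions 1-3 concrete, the following is a minimal Python sketch (our illustration, not code from the paper) that builds the query-item bipartite graph of Fig. 1 from a toy click log and enumerates the length-2 "Query-Item-Query" metapath instances that would contribute to the context information $C_2$. All function and variable names are ours.

```python
# A minimal sketch: build the query-item bipartite graph (Definition 1) from a
# click log and enumerate "Query-Item-Query" metapath instances (Definitions 2-3).
from collections import defaultdict

click_log = [("q1", "i1"), ("q2", "i3"), ("q3", "i3"), ("q3", "i4")]  # toy edges

items_of_query = defaultdict(set)   # neighbors of each query node in V
queries_of_item = defaultdict(set)  # neighbors of each item node in U
for q, i in click_log:
    items_of_query[q].add(i)
    queries_of_item[i].add(q)

def qiq_instances(anchor_query):
    """Enumerate instances of the metapath 'Query-Item-Query' rooted at anchor_query."""
    instances = []
    for item in items_of_query[anchor_query]:
        for other_query in queries_of_item[item]:
            if other_query != anchor_query:
                instances.append((anchor_query, item, other_query))
    return instances

print(qiq_instances("q2"))  # [('q2', 'i3', 'q3')], matching Example 2
```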
Many existing relevance matching models (Huang et al., 2013; Shen et al., 2014; Hu et al., 2014b) ignore the context information $C_k$ and only consider the sentences w.r.t. the query and item to be matched, which corresponds to 0-order relevance (for more details, please see the "Related Work" part in Appendix A). We call it context-free relevance matching in this paper. Considering that both the 0-order neighbor (i.e., the node itself) and the $k$-order neighbors ($k > 1$) are necessary for relevance matching, we argue that a reasonable mechanism should ensure that they can cooperate with each other. Then the research objective of our work is defined as follows:
Figure 2: The e-commerce knowledge distillation framework proposed in our work. Three models are used in this framework: teacher model BERT, student model BERM, and online model BERM-O.
Definition 4 (Contextual Relevance Matching). Given a bipartite graph $\mathcal{G}=(\mathcal{U},\mathcal{V},\mathcal{E},\mathcal{A},\mathcal{R})$, the task of contextual relevance matching is to determine the context information $C_k$ on $\mathcal{G}$ and learn the score function $G(\cdot)$.
2.2 Overview
We propose a complete knowledge distillation framework (Fig. 2), whose student model incorporates the context information, for contextual relevance matching in e-commerce product search. The main components of this framework are described as follows (see Appendix B for further details):

Graph construction. We first construct a raw bipartite graph $\mathcal{G}$ based on the search data collected from the anonymous platform. Then we construct a knowledge-enhanced bipartite graph $\mathcal{G}'$ with the help of BERT, which is fine-tuned on the human-labeled relevance data.

Student model design. We design a novel student model BERM corresponding to the score function $G(\cdot)$ in Def. 4. Specifically, macro and micro matching embeddings are derived in BERM to capture the sentence-level and word-level relevance matching signals, respectively. Also, based on the metapaths "Q-I-Q" and "I-Q-I", we design a node-level encoder and a metapath-instance-level aggregator to derive metapath embeddings.

Online application. To serve online search, we further distill BERM into BERM-O, which can be easily deployed to the online search platform.
2.3 BERM Model
Next we describe BERM in detail, including 0-order relevance modeling (Section 2.3.1), $k$-order relevance modeling (Section 2.3.2), and the overall learning objective (Section 2.3.3).
2.3.1 0-order Relevance Modeling
The whole structure of BERM includes both 0-order relevance modeling and $k$-order relevance modeling. This subsection introduces the 0-order relevance modeling, which captures sentence-level and word-level matching signals by incorporating the macro matching embedding and the micro matching embedding, respectively.
Word embedding in the e-commerce scene. In the e-commerce scene, obtaining basic representations of queries and items is an intractable problem. On one hand, it is infeasible to represent queries and items as individual embeddings due to the unbounded entity space. On the other hand, product type names (like "iphone11") or attribute names (like "256GB") carry special background information and can contain complex lexicons such as different languages and numerals. To address these problems, we adopt word embedding in BERM, which dramatically reduces the representation space. Also, we treat contiguous numerals, contiguous English letters, or single Chinese characters as one word and only retain the high-frequency words (such as words occurring more than fifty times in a six-month search log) in the vocabulary. The final vocabulary contains only tens of thousands of words, which saves memory consumption and index lookup time by a large margin. Each word is represented by a $d$-dimensional embedding vector trained by Word2Vec (Mikolov et al., 2013). The $i$-th word embedding of query $Q$ (or item title $I$) is denoted as $\mathbf{E}^i_Q \in \mathbb{R}^d$ (or $\mathbf{E}^i_I \in \mathbb{R}^d$).
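As a concrete illustration of this segmentation rule, here is a minimal Python sketch (our own, not the paper's released code); the regular expression and the frequency threshold are illustrative assumptions.

```python
# A minimal sketch of the word segmentation rule described above: contiguous
# numerals, contiguous English letters, or a single Chinese character each count
# as one word, and only high-frequency words are kept in the vocabulary.
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"[0-9]+|[A-Za-z]+|[\u4e00-\u9fff]")

def tokenize(text: str):
    return TOKEN_PATTERN.findall(text.lower())

def build_vocab(corpus, min_count=50):
    """Keep only words occurring at least `min_count` times (threshold is illustrative)."""
    counts = Counter(tok for text in corpus for tok in tokenize(text))
    return {w for w, c in counts.items() if c >= min_count}

print(tokenize("Apple MacBook Pro 16GB 1TB 2.3GHz"))
# ['apple', 'macbook', 'pro', '16', 'gb', '1', 'tb', '2', '3', 'ghz']
```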
Macro and micro matching embeddings. To capture sentence-level and word-level matching signals, we employ the macro matching embedding and the micro matching embedding, respectively. For the macro matching embedding, taking query $Q$ with $l_Q$ words and item $I$ with $l_I$ words as examples, their macro embeddings $\mathbf{E}^{Q}_{seq}, \mathbf{E}^{I}_{seq} \in \mathbb{R}^d$ are calculated as the column-wise mean of $\mathbf{E}_Q \in \mathbb{R}^{l_Q \times d}$ and $\mathbf{E}_I \in \mathbb{R}^{l_I \times d}$:
$$\mathbf{E}^{Q}_{seq} = \frac{1}{l_Q}\sum_{i=1}^{l_Q}\mathbf{E}^i_Q, \qquad \mathbf{E}^{I}_{seq} = \frac{1}{l_I}\sum_{i=1}^{l_I}\mathbf{E}^i_I. \tag{1}$$
For the micro matching embedding, we first build an interaction matrix $\mathbf{M}_{int} \in \mathbb{R}^{l_Q \times l_I}$ whose $(i,j)$-th entry is the dot product of $\mathbf{E}^i_Q$ and $\mathbf{E}^j_I$:
$$\mathbf{M}_{int} = \{m^{i,j}_{int}\}_{l_Q \times l_I}, \quad m^{i,j}_{int} = \langle \mathbf{E}^i_Q, \mathbf{E}^j_I \rangle. \tag{2}$$
Then the micro matching embedding $\mathbf{E}_{int} \in \mathbb{R}^{l_Q l_I}$ is the vectorization of $\mathbf{M}_{int}$, i.e., $\mathbf{E}_{int} = \mathrm{vec}(\mathbf{M}_{int})$.
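For clarity, a minimal numpy sketch of Eqs. (1)-(2) follows (our illustration; array names mirror the notation above, with random toy inputs).

```python
# A minimal numpy sketch of Eqs. (1)-(2): macro embeddings are column-wise means
# of the word-embedding matrices, and the micro matching embedding is the
# vectorized word-by-word dot-product interaction matrix.
import numpy as np

d, l_Q, l_I = 8, 3, 5                      # toy sizes
E_Q = np.random.randn(l_Q, d)              # word embeddings of the query
E_I = np.random.randn(l_I, d)              # word embeddings of the item title

E_Q_seq = E_Q.mean(axis=0)                 # Eq. (1), shape (d,)
E_I_seq = E_I.mean(axis=0)                 # Eq. (1), shape (d,)

M_int = E_Q @ E_I.T                        # Eq. (2), shape (l_Q, l_I)
E_int = M_int.reshape(-1)                  # vec(M_int), shape (l_Q * l_I,)
```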
2.3.2 k-order Relevance Modeling
The $k$-order relevance model contains a node-level encoder and a metapath-instance-level aggregator.
Node-level encoder. The input of the node-level encoder is node embeddings, and its output is an instance embedding (i.e., the embedding of a metapath instance). Specifically, to obtain the instance embedding, we integrate the embeddings of neighboring nodes into the anchor node embedding with a mean encoder. Taking "$Q$-$I_{top1}$-$Q_{top1}$" (note that "$I_{top1}$" is the top-1 node in the 1-hop neighbor list of node $Q$; see Appendix B.1 for further details) as an example, its embedding $\mathbf{E}_{Q\text{-}I_{top1}\text{-}Q_{top1}} \in \mathbb{R}^d$ is calculated by:
$$\mathbf{E}_{Q\text{-}I_{top1}\text{-}Q_{top1}} = \mathrm{MEAN}(\mathbf{E}^{Q}_{seq}, \mathbf{E}^{I_{top1}}_{seq}, \mathbf{E}^{Q_{top1}}_{seq}). \tag{3}$$
The metapath instance bridges the communication gap between different types of nodes and can be used to update the anchor node embedding with structure information.
Metapath-instance-level aggregator. The inputs of the metapath-instance-level aggregator are instance embeddings, and its output is a metapath embedding. Different metapath instances convey different information, so they have different effects on the final metapath embedding. However, the mapping relationship between the instance embeddings and the metapath embedding is unknown. To learn this relationship automatically, we introduce the "graph attention" mechanism to generate metapath embeddings (Wu et al., 2021a; Liu et al., 2022b). Taking the metapath "Q-I-Q" as an example, we use graph attention to represent the mapping relationship between "Q-I-Q" and its instances. The final metapath embedding $\mathbf{E}_{Q\text{-}I\text{-}Q} \in \mathbb{R}^d$ is obtained ($\mathbf{E}_{I\text{-}Q\text{-}I} \in \mathbb{R}^d$ is calculated similarly) by accumulating all instance embeddings with attention scores $\mathrm{Att}_1, \mathrm{Att}_2, \mathrm{Att}_3, \mathrm{Att}_4 \in \mathbb{R}$:
$$\mathbf{E}_{Q\text{-}I\text{-}Q} = \mathrm{LeakyReLU}(\mathrm{Att}_1 \cdot \mathbf{E}_{Q\text{-}I_{top1}\text{-}Q_{top1}} + \mathrm{Att}_2 \cdot \mathbf{E}_{Q\text{-}I_{top1}\text{-}Q_{top2}} + \mathrm{Att}_3 \cdot \mathbf{E}_{Q\text{-}I_{top2}\text{-}Q_{top1}} + \mathrm{Att}_4 \cdot \mathbf{E}_{Q\text{-}I_{top2}\text{-}Q_{top2}}). \tag{4}$$
Though $\mathrm{Att}_i$ can be set as a fixed value, we adopt a more flexible way, i.e., using a neural network to learn $\mathrm{Att}_i$ automatically. Specifically, we feed the concatenation of the anchor node embeddings and the metapath instance embeddings into a one-layer neural network (with weight $\mathbf{W}_{att} \in \mathbb{R}^{6d \times 4}$ and bias $\mathbf{b}_{att} \in \mathbb{R}^{1 \times 4}$) followed by a softmax layer, which outputs an attention distribution:
$$(\mathrm{Att}_i)_{1 \leq i \leq 4} = \mathrm{softmax}(\mathbf{E}_{concat}\mathbf{W}_{att} + \mathbf{b}_{att}), \tag{5}$$
$$\mathbf{E}_{concat} = [\mathbf{E}^{Q}_{seq} \,|\, \mathbf{E}^{I}_{seq} \,|\, \mathbf{E}_{Q\text{-}I_{top1}\text{-}Q_{top1}} \,|\, \mathbf{E}_{Q\text{-}I_{top1}\text{-}Q_{top2}} \,|\, \mathbf{E}_{Q\text{-}I_{top2}\text{-}Q_{top1}} \,|\, \mathbf{E}_{Q\text{-}I_{top2}\text{-}Q_{top2}}]. \tag{6}$$
The above process is shown in Fig. 3.

Figure 3: Calculation process of $\mathbf{E}_{Q\text{-}I\text{-}Q}$ and $\mathbf{E}_{I\text{-}Q\text{-}I}$ in BERM.
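Below is a minimal PyTorch sketch of the attention-based aggregation in Eqs. (4)-(6), assuming two top-ranked item neighbors and two top-ranked query neighbors (hence four instances); the module and variable names are our own.

```python
# A minimal PyTorch sketch of Eqs. (4)-(6): a one-layer attention network over
# the four "Q-I-Q" metapath-instance embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetapathAggregator(nn.Module):
    def __init__(self, d: int, num_instances: int = 4):
        super().__init__()
        # W_att in R^{6d x 4} and b_att in R^{1 x 4} (Eq. 5)
        self.att = nn.Linear((2 + num_instances) * d, num_instances)

    def forward(self, e_q_seq, e_i_seq, instance_embs):
        # instance_embs: tensor of shape (4, d), one row per metapath instance (Eq. 3)
        e_concat = torch.cat([e_q_seq, e_i_seq, instance_embs.reshape(-1)])  # (6d,)
        att = torch.softmax(self.att(e_concat), dim=-1)                      # (4,), Eq. (5)
        # Eq. (4): attention-weighted sum of instance embeddings
        return F.leaky_relu(att @ instance_embs)                             # (d,)

d = 8
agg = MetapathAggregator(d)
e_qiq = agg(torch.randn(d), torch.randn(d), torch.randn(4, d))
```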
Embedding fusion. From the 0-order and $k$-order relevance modeling, three types of embeddings are generated: the macro matching embeddings ($\mathbf{E}^{Q}_{seq}, \mathbf{E}^{I}_{seq} \in \mathbb{R}^d$), the micro matching embedding ($\mathbf{E}_{int} \in \mathbb{R}^{l_Q l_I}$), and the metapath embeddings ($\mathbf{E}_{Q\text{-}I\text{-}Q}, \mathbf{E}_{I\text{-}Q\text{-}I} \in \mathbb{R}^d$). We concatenate them together and feed the result to a three-layer neural network (with weights $\mathbf{W}_0 \in \mathbb{R}^{(4d + l_Q l_I) \times d}$, $\mathbf{W}_1, \mathbf{W}_2 \in \mathbb{R}^{d \times d}$, $\mathbf{W}_3 \in \mathbb{R}^{d \times 1}$ and biases $\mathbf{b}_0, \mathbf{b}_1, \mathbf{b}_2 \in \mathbb{R}^{1 \times d}$, $\mathbf{b}_3 \in \mathbb{R}^{1 \times 1}$), which outputs the final relevance estimation score $\hat{y}_i$:
$$\hat{y}_i = \mathrm{Sigmoid}(\mathbf{E}_3\mathbf{W}_3 + \mathbf{b}_3), \tag{7}$$
$$\mathbf{E}_{j+1} = \mathrm{ReLU}(\mathbf{E}_j\mathbf{W}_j + \mathbf{b}_j), \quad \mathbf{E}_0 = \mathbf{E}_{all}, \quad j = 0, 1, 2, \tag{8}$$
$$\mathbf{E}_{all} = [\mathbf{E}^{Q}_{seq} \,|\, \mathbf{E}^{I}_{seq} \,|\, \mathbf{E}_{int} \,|\, \mathbf{E}_{Q\text{-}I\text{-}Q} \,|\, \mathbf{E}_{I\text{-}Q\text{-}I}]. \tag{9}$$
2.3.3 Overall Learning Objective
We evaluate the cross-entropy error between the estimated score $\hat{y}_i$ and the label $y_i$ (note that $y_i \in [0,1]$ is the score of the teacher model BERT; see Appendix B.1 for further details), and then minimize the following loss function:
$$\mathcal{L} = -\sum_{i=1}^{\tilde{n}} \left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]. \tag{10}$$
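As a small sketch of this distillation objective (our illustration with toy numbers), Eq. (10) reduces to binary cross-entropy against the teacher's soft scores:

```python
# A minimal sketch of Eq. (10): binary cross-entropy between the student's score
# y_hat and the teacher's (BERT's) soft relevance score y in [0, 1].
import torch
import torch.nn.functional as F

y_teacher = torch.tensor([0.92, 0.10, 0.55])   # soft labels from the fine-tuned BERT
y_hat = torch.tensor([0.80, 0.20, 0.60])       # BERM's predictions (Eq. 7)

loss = F.binary_cross_entropy(y_hat, y_teacher, reduction="sum")  # sum over the transfer set
```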
In Appendix C, we analyze the complexities of
BERT, BERM, and BERM-O.