
Short Text Pre-training with Extended Token Classification for E-commerce Query Understanding
Haoming Jiang∗, Tianyu Cao, Zheng Li, Chen Luo, Xianfeng Tang
Qingyu Yin, Danqing Zhang, Rahul Goutam, Bing Yin
Amazon Search
jhaoming@amazon.com
Abstract
E-commerce query understanding is the process of inferring the shopping intent of customers by extracting semantic meaning from their search queries. The recent progress of pre-trained masked language models (MLM) in natural language processing is extremely attractive for developing effective query understanding models. Specifically, MLM learns contextual text embeddings by recovering masked tokens in sentences. Such a pre-training process relies on sufficient contextual information. It is, however, less effective for search queries, which are usually short texts. When masking is applied to short search queries, most of the contextual information is lost and the intent of the query may change. To mitigate these issues for MLM pre-training on search queries, we propose a novel pre-training task specifically designed for short text, called Extended Token Classification (ETC). Instead of masking the input text, our approach extends the input by inserting tokens via a generator network and trains a discriminator to identify which tokens were inserted into the extended input. We conduct experiments in an E-commerce store to demonstrate the effectiveness of ETC.
1 Introduction
Query Understanding (QU) plays an essential role in E-commerce shopping platforms, where it extracts the shopping intents of customers from their search queries. Traditional approaches usually rely on handcrafted features or rules (Henstock et al., 2001), which only have limited coverage. More recently, deep learning models have been proposed to improve the generalization performance of QU models (Nigam et al., 2019; Lin et al., 2020). These methods usually train a deep learning model from scratch, which requires a large amount of manually labeled data. Annotating a large number of queries can be expensive, time-consuming, and prone to human errors. Therefore, the labeled data is often limited.
To achieve better model performance with limited data, researchers have resorted to masked language model (MLM) pre-training with large amounts of unlabeled open-domain data (Devlin et al., 2019; Liu et al., 2019b; Jiang et al., 2019; He et al., 2021) and achieved state-of-the-art performance on QU tasks (Kumar et al., 2019; Jiang et al., 2021; Zhang et al., 2021; Li et al., 2021). However, open-domain pre-trained models can only provide limited semantic and syntactic information for QU tasks in the E-commerce search domain. In order to capture domain-specific information, Lee et al. (2020); Gururangan et al. (2020); Gu et al. (2021) propose to pre-train the MLM on large in-domain unlabeled data, either initialized randomly or from a public pre-trained checkpoint.
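As a concrete illustration of this in-domain MLM pre-training, the following is a minimal sketch assuming the HuggingFace transformers and datasets libraries and a hypothetical one-query-per-line corpus file queries.txt; it is not the authors' actual training setup.

```python
# Minimal sketch of domain-adaptive MLM pre-training on unlabeled queries,
# assuming HuggingFace transformers/datasets and a hypothetical corpus file
# "queries.txt" with one search query per line.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # or random init

queries = load_dataset("text", data_files={"train": "queries.txt"})["train"]
queries = queries.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=32),
    batched=True, remove_columns=["text"])

# Standard BERT-style MLM: mask 15% of tokens and train the model to recover them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-queries", per_device_train_batch_size=256),
    train_dataset=queries,
    data_collator=collator,
)
trainer.train()
```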
Figure 1: Original Query ('bamboo charcoal bag') vs. Masked Query ('bamboo [MASK] bag') vs. Extended Query ('organic bamboo charcoal bag'). 'bamboo charcoal bag' is a bag of 'bamboo charcoal', while 'bamboo bag' is a bag made of bamboo. Masking out 'charcoal' will completely change the user's search intent. On the contrary, extending the query to 'organic bamboo charcoal bag' does not change the user's search intent, even though the combination of 'organic' and 'bamboo charcoal bag' is not common.
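The following minimal sketch illustrates, for the Figure 1 example, the token-level target that ETC's discriminator is trained to predict. In ETC the inserted tokens are proposed by a generator network; here the insertion is hard-coded for illustration, and the variable and label names are our own rather than the paper's.

```python
# ETC target for the Figure 1 example: the discriminator is trained to tell
# inserted tokens apart from original query tokens in the extended query.
original = ["bamboo", "charcoal", "bag"]
inserted = {0: "organic"}            # position -> token (produced by a generator in ETC)

extended, labels = [], []
for i, tok in enumerate(original):
    if i in inserted:                # splice the generated token before position i
        extended.append(inserted[i])
        labels.append(1)             # 1 = inserted token (to be detected)
    extended.append(tok)
    labels.append(0)                 # 0 = original query token

print(extended)  # ['organic', 'bamboo', 'charcoal', 'bag']
print(labels)    # [1, 0, 0, 0]
# Unlike masking, the original query tokens (and hence the search intent)
# are fully preserved in the discriminator's input.
```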
Although a search-query domain-specific MLM can adapt to the search query distribution to some extent, it is not effective at capturing the contextual information of search queries due to their short length. There are two major challenges: