Short Text Pre-training with Extended Token Classification for
E-commerce Query Understanding
Haoming Jiang, Tianyu Cao, Zheng Li, Chen Luo, Xianfeng Tang
Qingyu Yin, Danqing Zhang, Rahul Goutam, Bing Yin
Amazon Search
jhaoming@amazon.com
Abstract

E-commerce query understanding is the process of inferring the shopping intent of customers by extracting semantic meaning from their search queries. The recent progress of pre-trained masked language models (MLM) in natural language processing is extremely attractive for developing effective query understanding models. Specifically, MLM learns contextual text embeddings by recovering masked tokens in sentences. Such a pre-training process relies on sufficient contextual information. It is, however, less effective for search queries, which are usually short texts. When masking is applied to short search queries, most contextual information is lost and the intent of the search queries may be changed. To mitigate these issues for MLM pre-training on search queries, we propose a novel pre-training task specifically designed for short text, called Extended Token Classification (ETC). Instead of masking the input text, our approach extends the input by inserting tokens via a generator network, and trains a discriminator to identify which tokens were inserted in the extended input. We conduct experiments in an E-commerce store to demonstrate the effectiveness of ETC.
1 Introduction

Query Understanding (QU) plays an essential role in E-commerce shopping platforms, where it extracts the shopping intents of customers from their search queries. Traditional approaches usually rely on handcrafted features or rules (Henstock et al., 2001), which have only limited coverage. More recently, deep learning models have been proposed to improve the generalization performance of QU models (Nigam et al., 2019; Lin et al., 2020). These methods usually train a deep learning model from scratch, which requires a large amount of manually labeled data. Annotating a large number of queries can be expensive, time-consuming, and prone to human errors. Therefore, labeled data is often limited.
To achieve better model performance with limited data, researchers have resorted to masked language model (MLM) pre-training on large amounts of unlabeled open-domain data (Devlin et al., 2019; Liu et al., 2019b; Jiang et al., 2019; He et al., 2021) and achieved state-of-the-art performance on QU tasks (Kumar et al., 2019; Jiang et al., 2021; Zhang et al., 2021; Li et al., 2021). However, open-domain pre-trained models can only provide limited semantic and syntactic information for QU tasks in the E-commerce search domain. In order to capture domain-specific information, Lee et al. (2020), Gururangan et al. (2020), and Gu et al. (2021) propose to pre-train the MLM on large in-domain unlabeled data, either initialized randomly or from a public pre-trained checkpoint.
Figure 1: Original Query ('bamboo charcoal bag') vs. Masked Query ('bamboo [MASK] bag') vs. Extended Query ('organic bamboo charcoal bag'). 'bamboo charcoal bag' is a bag of 'bamboo charcoal', while 'bamboo bag' is a bag made of bamboo. Masking out 'charcoal' completely changes the user's search intent. On the contrary, extending the query to 'organic bamboo charcoal bag' does not change the user's search intent, even though the combination of 'organic' and 'bamboo charcoal bag' is not common.
Although a search-query domain-specific MLM can adapt to the search query distribution to some extent, it is not effective in capturing the contextual information of search queries because the queries are so short. There are two major challenges:
First, MLM (Devlin et al., 2019) randomly replaces tokens in the text with [MASK] tokens and trains the model to recover them, using a low masking probability (e.g., 15% in Devlin et al. (2019); Liu et al. (2019b)). Since search queries are short, many queries will contain no masked tokens at all during training. Even if we ensure that at least one token in each query is masked, the effective masking rate becomes much higher and much of the already scarce contextual information is lost (see the rough estimate sketched below).

Second, masking out tokens may significantly change the intent of the search queries. Figure 1 shows an example where masking a token changes the intent of the query.
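As a rough, purely illustrative estimate of the first challenge (the query lengths below are assumptions, not figures from the paper), the following sketch computes how often a short query receives no [MASK] at all under independent 15% masking, and how high the effective masking rate becomes once at least one mask is forced:

```python
# Back-of-the-envelope estimate (illustrative only): with independent
# per-token masking at rate p, a query of length n has probability
# (1 - p) ** n of containing no [MASK] token at all.
p = 0.15

for n in (2, 3, 4):  # typical short-query lengths (assumed for illustration)
    no_mask = (1 - p) ** n
    forced_rate = 1 / n  # effective rate if exactly one token must be masked
    print(f"n={n}: P(no mask)={no_mask:.2f}, forced masking rate={forced_rate:.2f}")

# For n=3, P(no mask) is roughly 0.61, and forcing one mask means masking
# about a third of the query, so most of the limited context disappears.
```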
In this paper, we propose a new pre-training task, Extended Token Classification (ETC), to mitigate the above-mentioned issues for search query pre-training. Instead of masking out tokens in the search query and training the model to recover them, we extend the search query and train the model to identify which tokens were added. The extended query is generated by inserting tokens with a generator, which is a pre-trained masked language model. The generator takes the query with randomly inserted [MASK] tokens as input and fills in the blanks with its predictions. ETC has several benefits:

It turns the language modeling task into a binary classification task on all tokens, which makes the model easier to train;

All samples are used to train the model, even when the probability of inserting tokens is low;

Since the generator has already been pre-trained, the extended queries alter the meaning of the search query less frequently.
We conduct experiments in an E-commerce store to demonstrate the effectiveness of ETC. We run fine-tuning experiments on a wide range of query understanding tasks, including three classification tasks, one sequence labeling task, and one text generation task. We show that ETC outperforms open-domain pre-trained models, a search-query domain-specific pre-trained MLM, and an ELECTRA (Clark et al., 2020) model.
2 Background

Masked language modeling (MLM) pre-training was first introduced in Devlin et al. (2019) to learn contextual word representations with a large transformer model (Vaswani et al., 2017). Given a sequence of tokens $x = [x_1, \dots, x_n]$, Devlin et al. (2019) corrupt it into $x^{\mathrm{mask}}$ by masking 15% of its tokens at random:

$m_i \sim \mathrm{Binomial}(0.15), \quad \text{for } i \in [1, \dots, n],$
$x^{\mathrm{mask}} = \mathrm{REPLACE}(x, [m_1, \dots, m_n], \texttt{[MASK]}).$

Devlin et al. (2019) then train a transformer-based language model $G$, parameterized by $\theta$, to reconstruct $x$ conditioned on $x^{\mathrm{mask}}$:

$\min_{\theta} \; \mathbb{E}\Big[-\sum_{t=1}^{n} \mathbb{1}(m_t = 1)\,\log p_G(x_t \mid x^{\mathrm{mask}})\Big],$

where $p_G(x_t \mid x^{\mathrm{mask}})$ denotes the predicted probability of the $t$-th token being $x_t$ given $x^{\mathrm{mask}}$.
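As a minimal sketch of the corruption step described above (not the original implementation; the mask token id and the tokenized input are assumptions), in Python:

```python
import random

MASK_ID = 103     # assumed [MASK] token id in a BERT-style vocabulary
MASK_PROB = 0.15  # masking rate used by Devlin et al. (2019)

def corrupt_for_mlm(token_ids, mask_prob=MASK_PROB, seed=None):
    """Return (x_mask, m): x_mask replaces the selected tokens with [MASK],
    and m marks which positions were masked (the reconstruction targets)."""
    rng = random.Random(seed)
    m = [1 if rng.random() < mask_prob else 0 for _ in token_ids]
    x_mask = [MASK_ID if masked else tok for tok, masked in zip(token_ids, m)]
    return x_mask, m

# A 3-token query frequently ends up with no masked position at all,
# which is exactly the short-text problem discussed in the introduction.
x_mask, m = corrupt_for_mlm([2023, 5005, 3225])
print(x_mask, m)
```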
Devlin et al. (2019) also introduce a next sentence prediction (NSP) pre-training task, which a later work shows to be not very effective (Liu et al., 2019b). In this paper, we do not discuss the NSP pre-training task.
3 Method

ETC adopts two transformer-based neural networks: a generator $G$ and a discriminator $D$. Some [MASK] tokens are first inserted into the raw text input, and then the generator, a masked language model, fills the [MASK] tokens with its predictions. The discriminator is trained to identify which tokens were generated. The encoder of the discriminator is then used as the pre-trained model for fine-tuning on downstream tasks. We summarize the extended token classification pre-training task in Figure 2.
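A minimal sketch of this generator/discriminator pipeline is shown below, assuming the Hugging Face transformers library; the checkpoint names, the single-query step, and the insertion helper are illustrative assumptions rather than the paper's released code:

```python
import torch
from transformers import (AutoModelForMaskedLM, AutoModelForTokenClassification,
                          AutoTokenizer)

# Illustrative choices (not specified by the paper): a public BERT checkpoint
# as the generator, and a 2-label token classifier as the discriminator.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
generator = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()
discriminator = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # 0 = original token, 1 = inserted token
)

def etc_step(query, insert_positions):
    """One ETC pre-training step on a single query (batching omitted)."""
    ids = tokenizer(query, add_special_tokens=False)["input_ids"]

    # 1) Extend the query: insert a [MASK] before each selected position.
    extended, labels = [], []
    for i, tok in enumerate(ids):
        if i in insert_positions:
            extended.append(tokenizer.mask_token_id)
            labels.append(1)
        extended.append(tok)
        labels.append(0)
    input_ids = torch.tensor(
        [[tokenizer.cls_token_id] + extended + [tokenizer.sep_token_id]])
    labels = torch.tensor([[-100] + labels + [-100]])  # ignore special tokens

    # 2) Generator fills the inserted [MASK] slots with plausible tokens.
    with torch.no_grad():
        logits = generator(input_ids).logits
    filled = input_ids.clone()
    mask_pos = input_ids == tokenizer.mask_token_id
    filled[mask_pos] = logits.argmax(-1)[mask_pos]

    # 3) Discriminator learns to spot which tokens were inserted.
    return discriminator(input_ids=filled, labels=labels).loss

loss = etc_step("bamboo charcoal bag", insert_positions={0})
loss.backward()
```

In a full pre-training loop the insertion slots would be sampled per query as in Section 3.1, the generator would typically be adapted to the query domain first, and batching and optimizer steps would be added; the data flow, however, is the same.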
3.1 Extended Query Generation

Each query is a sequence of tokens $x = [x_1, \dots, x_n]$, where the number of tokens $n$ is usually small for search queries. As a result, masked language models must use a masking rate high enough to guarantee that at least one token is masked out, so that training can actually be conducted on the sample. Replacing tokens with mask tokens, however, may alter the semantic meaning of the search query. Instead of masking tokens, we propose to insert [MASK] tokens into the query and use a generator to fill in the blanks. Specifically, we randomly select a set of positions $m = [m_0, \dots, m_n]$ with a fixed probability $p$:

$m_i \sim \mathrm{Binomial}(p), \quad \text{for } i \in [0, \dots, n],$
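As a small, hypothetical sketch of this sampling and insertion step (the function names and structure are assumptions, not the paper's code), note that slot 0 sits before the first token and slot n after the last one:

```python
import random

def sample_insertion_slots(n, p, rng=None):
    """Sample m = [m_0, ..., m_n]; m_i = 1 means a [MASK] token is inserted
    into slot i (slot 0 is before the first token, slot n after the last)."""
    rng = rng or random.Random()
    return [1 if rng.random() < p else 0 for _ in range(n + 1)]

def insert_masks(tokens, m, mask_token="[MASK]"):
    """Build the extended query by placing [MASK] into the selected slots."""
    extended = []
    for i, tok in enumerate(tokens):
        if m[i]:
            extended.append(mask_token)
        extended.append(tok)
    if m[len(tokens)]:  # slot after the final token
        extended.append(mask_token)
    return extended

tokens = ["bamboo", "charcoal", "bag"]
m = sample_insertion_slots(len(tokens), p=0.15)
print(insert_masks(tokens, m))
# The generator (a pre-trained MLM) then replaces each inserted [MASK]
# with a plausible token, e.g. yielding 'organic bamboo charcoal bag'.
```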