
Short Text Pre-training with Extended Token Classification for E-commerce Query Understanding
Haoming Jiang∗, Tianyu Cao, Zheng Li, Chen Luo, Xianfeng Tang
Qingyu Yin, Danqing Zhang, Rahul Goutam, Bing Yin
Amazon Search
jhaoming@amazon.com
Abstract
E-commerce query understanding is the process of inferring the shopping intent of customers by extracting semantic meaning from their search queries. The recent progress of pre-trained masked language models (MLM) in natural language processing is extremely attractive for developing effective query understanding models. Specifically, MLM learns contextual text embeddings by recovering masked tokens in sentences. Such a pre-training process relies on sufficient contextual information. It is, however, less effective for search queries, which are usually short texts. When masking is applied to short search queries, most of the contextual information is lost and the intent of the query may change. To mitigate these issues for MLM pre-training on search queries, we propose a novel pre-training task specifically designed for short text, called Extended Token Classification (ETC). Instead of masking the input text, our approach extends the input by inserting tokens via a generator network and trains a discriminator to identify which tokens were inserted into the extended input. We conduct experiments in an E-commerce store to demonstrate the effectiveness of ETC.
1 Introduction
Query Understanding (QU) plays an essential role in E-commerce shopping platforms, where it extracts the shopping intents of customers from their search queries. Traditional approaches usually rely on handcrafted features or rules (Henstock et al., 2001), which only have limited coverage. More recently, deep learning models have been proposed to improve the generalization performance of QU models (Nigam et al., 2019; Lin et al., 2020). These methods usually train a deep learning model from scratch, which requires a large amount of manually labeled data. Annotating a large number of queries can be expensive, time-consuming, and prone to human errors. Therefore, the labeled data is often limited.
To achieve better model performance with limited data, researchers have resorted to masked language model (MLM) pre-training with large amounts of unlabeled open-domain data (Devlin et al., 2019; Liu et al., 2019b; Jiang et al., 2019; He et al., 2021) and achieved state-of-the-art performance on QU tasks (Kumar et al., 2019; Jiang et al., 2021; Zhang et al., 2021; Li et al., 2021). However, open-domain pre-trained models can only provide limited semantic and syntactic information for QU tasks in the E-commerce search domain. In order to capture domain-specific information, Lee et al. (2020); Gururangan et al. (2020); Gu et al. (2021) propose to pre-train the MLM on large in-domain unlabeled data, either initialized randomly or from a public pre-trained checkpoint.
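As a concrete illustration of this in-domain MLM pre-training, the following is a minimal sketch assuming the HuggingFace transformers and datasets libraries and a hypothetical one-query-per-line corpus file queries.txt; it is not the authors' actual training setup.

```python
# Minimal sketch of domain-adaptive MLM pre-training on unlabeled queries,
# assuming HuggingFace transformers/datasets and a hypothetical corpus file
# "queries.txt" with one search query per line.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # or random init

queries = load_dataset("text", data_files={"train": "queries.txt"})["train"]
queries = queries.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=32),
    batched=True, remove_columns=["text"])

# Standard BERT-style MLM: mask 15% of tokens and train the model to recover them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-queries", per_device_train_batch_size=256),
    train_dataset=queries,
    data_collator=collator,
)
trainer.train()
```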
Figure 1: Original Query ('bamboo charcoal bag') vs. Masked Query ('bamboo [MASK] bag') vs. Extended Query ('organic bamboo charcoal bag'). 'bamboo charcoal bag' is a bag of 'bamboo charcoal', while 'bamboo bag' is a bag made of bamboo. Masking out 'charcoal' will completely change the user's search intent. On the contrary, extending the query to 'organic bamboo charcoal bag' does not change the user's search intent, even though the combination of 'organic' and 'bamboo charcoal bag' is not common.
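The following minimal sketch illustrates, for the Figure 1 example, the token-level target that ETC's discriminator is trained to predict. In ETC the inserted tokens are proposed by a generator network; here the insertion is hard-coded for illustration, and the variable and label names are our own rather than the paper's.

```python
# ETC target for the Figure 1 example: the discriminator is trained to tell
# inserted tokens apart from original query tokens in the extended query.
original = ["bamboo", "charcoal", "bag"]
inserted = {0: "organic"}            # position -> token (produced by a generator in ETC)

extended, labels = [], []
for i, tok in enumerate(original):
    if i in inserted:                # splice the generated token before position i
        extended.append(inserted[i])
        labels.append(1)             # 1 = inserted token (to be detected)
    extended.append(tok)
    labels.append(0)                 # 0 = original query token

print(extended)  # ['organic', 'bamboo', 'charcoal', 'bag']
print(labels)    # [1, 0, 0, 0]
# Unlike masking, the original query tokens (and hence the search intent)
# are fully preserved in the discriminator's input.
```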
Although a search-query domain-specific MLM can adapt to the search query distribution to some extent, it is not effective at capturing the contextual information of search queries due to their short length. There are two major challenges: