2 Proposed Method
The development of the proposed ESTC system
consists of three main parts: scene-based topic
generation for each product, scene-based product
clustering to aggregate products with similar topic
titles, and a quality control module to ensure the
quality of AI-generated channels. We also include
a simple data augmentation module that discovers
weakly supervised data to improve the diversity of
generated topic titles.
2.1 Scene-based Topic Generation
In this work, we propose to generate scene-based
topic titles for each product. Specifically, given
the input information $X = (x_1, x_2, \dots, x_{|X|})$ of
a product $P$, including the product's title $T$, a set
of attributes $A$, and side information $O$ obtained
through optical character recognition techniques,
paired with a scene-based topic title
$Y = (y_1, y_2, \dots, y_{|Y|})$, we aim to learn model
parameters $\theta$ and estimate the conditional
probability:

$$P(Y \mid X; \theta) = \prod_{t=1}^{|Y|} p(y_t \mid y_{<t}, X; \theta)$$

where $y_{<t}$ stands for all tokens in a scene title
before position $t$, i.e., $y_{<t} = (y_1, y_2, \dots, y_{t-1})$.
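Under this factorization, the log-probability of a title is the sum of per-step token log-probabilities. A minimal sketch; the per-step distributions below are hypothetical stand-ins for a decoder's outputs under teacher forcing:

```python
import math

def sequence_log_prob(step_log_probs, target_ids):
    """Log P(Y | X) under the autoregressive factorization:
    sum over t of log p(y_t | y_<t, X)."""
    # step_log_probs[t] maps candidate token -> log-probability at step t,
    # already conditioned on X and the gold prefix y_<t (teacher forcing).
    return sum(step_log_probs[t][y] for t, y in enumerate(target_ids))

# Toy example: a 3-token title with made-up per-step distributions.
steps = [
    {"a": math.log(0.6), "b": math.log(0.4)},
    {"b": math.log(0.7), "a": math.log(0.3)},
    {"c": math.log(0.5), "d": math.log(0.5)},
]
lp = sequence_log_prob(steps, ["a", "b", "c"])
```

Training maximizes this quantity (equivalently, minimizes the per-token cross-entropy) over the paired product–title data.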
Pretraining with E-commerce Corpus
Pre-trained models (Radford et al., 2019; Devlin
et al., 2019; Lewis et al., 2020; Raffel et al., 2020;
Zou et al., 2020; Xue et al., 2021) have proved
effective in many downstream tasks; however, most
of them are developed on English corpora from
general domains, such as news articles, books,
stories, and web text. In our scenario, we aim to
produce topic titles in Chinese that summarize
certain usage scenarios of products. The model is
therefore required to understand a product through
its associated information (such as the title and
semi-structured attributes) and generate scene-based
topic titles. We argue that the model should learn
knowledge from the e-commerce field and thus
propose to further pre-train models in domain
(Gururangan et al., 2020). Specifically, besides the
product title, attribute set, and side information, we
also collect the corresponding advertising
copywriting of products from e-commerce platforms
for the second phase of pre-training. We adopt
UniLM (Dong et al., 2019) with BERT initialization
as the backbone structure.
Recall that the product attribute set $A$ has no
fixed order. We observe that inputs containing the
same attributes in different orders may result in
different outputs. Moreover, UniLM is an
architecture with a shared encoder and decoder. To
reinforce both the understanding and generation of
unordered input information, in addition to the
original pre-training objectives of UniLM, we
propose two objectives to adapt to the target
domain:
• Consistency Classification: Given a product
title–attributes pair, this task classifies whether
the two refer to the same product. For a positive
example, the attributes and the title describe the
same product, and the attributes are concatenated
in random order as a sequence to introduce
disorder noise. For a negative example, we
randomly select attributes from a different
product.
• Sentence Reordering: We split the product
copywriting into pieces at punctuation marks
(such as commas and periods). The pieces are
then shuffled and concatenated as a new text
sequence. The model takes the shuffled sequence
as input and learns to generate the original
copywriting.
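The example construction for both objectives can be sketched as follows; the product records, field names, and delimiter choices are illustrative assumptions, not the paper's exact preprocessing:

```python
import random
import re

def make_consistency_example(products, i, rng):
    """Title-attributes pair for the consistency task: the positive pair uses
    the product's own attributes in shuffled order (label 1); the negative
    pair borrows attributes from a different product (label 0)."""
    prod = products[i]
    attrs = list(prod["attributes"])
    rng.shuffle(attrs)  # disorder noise for the positive pair
    positive = (prod["title"], ";".join(attrs), 1)
    j = rng.randrange(len(products) - 1)
    if j >= i:
        j += 1  # ensure the negative attributes come from another product
    negative = (prod["title"], ";".join(products[j]["attributes"]), 0)
    return positive, negative

def make_reordering_example(copywriting, rng):
    """Sentence reordering: split at punctuation marks, shuffle the pieces,
    and pair the shuffled sequence (input) with the original (target)."""
    pieces = [p for p in re.split(r"[，。,.]", copywriting) if p]
    shuffled = pieces[:]
    rng.shuffle(shuffled)
    return ",".join(shuffled), copywriting

rng = random.Random(0)
products = [
    {"title": "Stainless steel thermos", "attributes": ["500ml", "silver", "vacuum"]},
    {"title": "Cotton T-shirt", "attributes": ["white", "size M"]},
]
pos, neg = make_consistency_example(products, 0, rng)
src, tgt = make_reordering_example(
    "Keeps drinks hot for 12 hours,fits car cup holders,easy to clean.", rng)
```

Both objectives are self-supervised: the labels and targets are derived entirely from the collected e-commerce corpus, requiring no extra annotation.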
After the second phase of pre-training in the target
e-commerce domain, we fine-tune the pre-trained
model on the scene-based topic generation dataset.
2.2 Scene-based Product Clustering
One intuitive solution to constructing a scene-based
topic channel is to group products with exactly the
same generated topic titles. However, we observe
that there exist channels with similar topic titles,
each of which contains only a few products,
whereas we expect each channel to offer diverse
products to ensure a good user experience.
Therefore, we design a clustering module to
aggregate products with semantically similar topic
titles.
Topic Encoding
To better learn scene-based topic representations
and distinguish different topic titles, we take all
topic titles from the training set as input and
employ SimCSE (Gao et al., 2021) to further
fine-tune the e-commerce pre-trained UniLM model
in an unsupervised fashion. The embeddings of the
last layer are used as the initialization for product
clustering.
Product Clustering
This module aims to group
products with semantically similar topic titles into