a recommended new game, The Witcher 3, because he clicked on some similar games last week, so the item side fields should be included, as all existing works do. Or it may be because he is currently in the game zone, where any historical click on games indicates a strong interest in games. In the latter case, the game zone feature from the context side plays an important role in capturing his interest from his behaviors.
• Second, existing works interact all behavior fields with all target item side fields. Recent studies [8], [9] show that some interactions in attention are unnecessary and harm performance. Involving more fields as the query may introduce more irrelevant field interactions and further degrade performance.
• Third, as part of the input layer of a more complicated DNN model for CTR prediction, the procedure for generating the user interest vector should be lightweight. Unfortunately, most existing methods use an MLP to calculate the attention weights, which leads to high computational complexity.
To resolve these challenges, we propose to include all item/user/context fields as the query in the attention unit, and to calculate a learnable weight for each field pair between the user behavior fields and these query fields. To avoid introducing noisy field pairs, we further propose to automatically select the most important ones by pruning these weights. In addition, we adopt a simple dot product rather than an MLP as the attention function, leading to a much lower computation cost. We summarize the AUC as well as the average inference time of AutoAttention and several baseline models in Fig. 1. Except for Sum Pooling, which has a very low inference time due to its simplicity, the proposed AutoAttention achieves a higher AUC than all the other baseline models with low inference time. The main contributions of this paper are summarized as follows:
• We propose to involve all item/user/context fields as the query in the attention unit for user interest modeling. A weight is assigned to each field pair between the user behavior fields and these query fields. Pruning these weights automates the field pair selection, preventing the performance deterioration caused by irrelevant field pairs (see the sketch after this list).
• We propose to use a simple dot-product attention rather than the MLP used in existing methods. This greatly reduces the time complexity while achieving comparable or even better performance.
• We conduct extensive experiments on public and production datasets to compare AutoAttention with state-of-the-art methods. Evaluation results verify the effectiveness of AutoAttention. We also study the learned field pair weights and find that AutoAttention does identify several field pairs involving user or context side fields, which are ignored by expert knowledge in existing works.
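To make the first two contributions concrete, below is a minimal PyTorch sketch of the idea (not the paper's exact formulation): one learnable weight per (behavior field, query field) pair scales a plain dot-product attention score, and pairs whose weights fall below a magnitude threshold are pruned. The tensor shapes, the threshold `tau`, and the class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FieldPairDotAttention(nn.Module):
    """Sketch: dot-product attention scaled by learnable field-pair weights.

    Shapes, the pruning threshold, and the class name are illustrative
    assumptions, not AutoAttention's exact design.
    """
    def __init__(self, n_behavior_fields: int, n_query_fields: int, tau: float = 1e-2):
        super().__init__()
        # One learnable weight per (behavior field B_p, query field F_j) pair.
        self.w = nn.Parameter(torch.ones(n_behavior_fields, n_query_fields))
        self.tau = tau  # pruning threshold (assumed)

    def forward(self, vb: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # vb: (batch, H, P, K) per-field embeddings of the H behaviors.
        # q:  (batch, M, K)    embeddings of all item/user/context query fields.
        w = self.w * (self.w.abs() >= self.tau)         # prune noisy field pairs
        scores = torch.einsum('bhpk,bmk->bhpm', vb, q)  # dot product per pair
        a = (w * scores).sum(dim=(2, 3))                # (batch, H) attention
        v = vb.sum(dim=2)                               # v_i = sum_p v_{B_p}
        return (a.unsqueeze(-1) * v).sum(dim=1)         # v_u: (batch, K)
```

Since each attention score here is a plain dot product rather than an MLP pass, the pooling stays lightweight even for long behavior sequences.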
The rest of the paper is organized as follows. Section II provides the preliminaries of existing user behavior modeling methods. In Section III, we describe AutoAttention and its connection with several existing methods. Experiment settings and evaluation results are presented in Section IV. Finally, Sections V and VI discuss the related work and conclude the paper, respectively.
II. PRELIMINARIES
In this section, we present the preliminaries of user behavior modeling in CTR prediction. A CTR prediction model aims at predicting the probability that a user clicks an item given a context (e.g., time, location, and publisher information). It takes fields from three sides as the input:
$$\mathrm{pCTR} = f(\mathrm{user}, \mathrm{item}, \mathrm{context})$$
where the user side fields consist of user demographic fields and user behavior fields, and item and context denote the fields from the item and context sides, respectively. In this paper, we focus on how to capture a user's interest from user behaviors.
Given a user $u$ and her corresponding behaviors $\{v_1, v_2, \dots, v_H\}$, her interest is represented as a fixed-length vector as follows:
$$\mathbf{v}_u = f(\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_H, \mathbf{e}_{F_1}, \mathbf{e}_{F_2}, \dots, \mathbf{e}_{F_M}) \tag{1}$$
where $\mathbf{v}_i$ denotes the embedding of the $i$-th behavior, $H$ denotes the length of the user behaviors, and $\mathbf{e}_{F_j} \in \mathbb{R}^K$ denotes the feature embedding of any other field $F_j$ besides the user behaviors (e.g., item/user/context side fields). Each behavior is usually represented by multiple item side fields. Denoting the set of fields that represent behaviors as $B = \{B_p\}$, each behavior is represented as $\mathbf{v}_i = \sum_{B_p \in B} \mathbf{v}_{B_p}$, where $\mathbf{v}_{B_p} \in \mathbb{R}^K$ denotes the feature embedding of field $B_p$ of the $i$-th behavior.
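As a concrete illustration of this representation, the following minimal PyTorch sketch sums per-field embeddings to form each behavior embedding $\mathbf{v}_i$; the field names, vocabulary size, and dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

K, H, VOCAB = 16, 50, 1000                 # illustrative sizes, not from the paper
BEHAVIOR_FIELDS = ["item_id", "category"]  # hypothetical field set B = {B_p}

# One embedding table per behavior field B_p.
tables = nn.ModuleDict({f: nn.Embedding(VOCAB, K) for f in BEHAVIOR_FIELDS})

# ids[f][b, i] is the id of field f for the i-th behavior: shape (batch, H).
ids = {f: torch.randint(VOCAB, (2, H)) for f in BEHAVIOR_FIELDS}

# v_i = sum over fields B_p of v_{B_p}: shape (batch, H, K).
v = torch.stack([tables[f](ids[f]) for f in BEHAVIOR_FIELDS]).sum(dim=0)
print(v.shape)  # torch.Size([2, 50, 16])
```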
A straightforward way to calculate $\mathbf{v}_u$ is to perform a sum or mean pooling over all these $\mathbf{v}_i$ embedding vectors [7]. However, this neglects the importance of each behavior given a specific target item. Recently, a commonly used behavior modeling strategy is to adopt an attention mechanism over the user's historical behaviors. It learns an attentive weight for each behavior $i$ w.r.t. a given target item $t$ and then conducts a weighted sum pooling, i.e., $\mathbf{v}_u = \sum_{i=1}^{H} a(i, t)\,\mathbf{v}_i$, where $a(i, t)$ denotes an attention function. For example, Deep Interest Network (DIN) considers the influence of the target item on user behaviors [4]: it assigns larger weights to those behaviors that are more important given the target item, as shown in Eqn. (2).
$$\mathbf{v}_u = f(\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_H, \mathbf{e}_t) = \sum_{i=1}^{H} a(i, t)\,\mathbf{v}_i = \sum_{i=1}^{H} \mathrm{MLP}(\mathbf{v}_i, \mathbf{e}_t)\,\mathbf{v}_i \tag{2}$$
where $\mathbf{e}_t$ denotes the embedding vector of the target item $t$, and $\mathrm{MLP}(\cdot)$ denotes an MLP whose output is the attention weight.
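For illustration, here is a minimal PyTorch sketch of the weighted sum pooling in Eqn. (2) with an MLP attention unit; feeding $[\mathbf{v}_i; \mathbf{e}_t; \mathbf{v}_i \odot \mathbf{e}_t]$ into the MLP and the hidden size are assumptions for the sketch, not the exact DIN architecture.

```python
import torch
import torch.nn as nn

class MLPAttentionPooling(nn.Module):
    """Weighted sum pooling with an MLP attention unit, as in Eqn. (2).

    The MLP input [v_i, e_t, v_i * e_t] and the hidden size are
    illustrative assumptions, not the exact DIN architecture.
    """
    def __init__(self, K: int, hidden: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * K, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, v: torch.Tensor, e_t: torch.Tensor) -> torch.Tensor:
        # v: (batch, H, K) behavior embeddings; e_t: (batch, K) target item.
        e = e_t.unsqueeze(1).expand_as(v)           # broadcast to (batch, H, K)
        a = self.mlp(torch.cat([v, e, v * e], -1))  # a(i, t): (batch, H, 1)
        return (a * v).sum(dim=1)                   # v_u: (batch, K)

# Usage: v_u = MLPAttentionPooling(K=16)(torch.randn(2, 50, 16), torch.randn(2, 16))
```

Note that the MLP must run once per behavior, which is precisely the computational overhead that the dot-product attention sketched in Section I avoids.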
Following DIN, DIEN [5] further considers the evolution of
user interest, and DSIN [6] considers the homogeneity and
heterogeneity of a user’s interests within and among sessions.
DIF-SR [10] proposes to only consider the interaction between