Predictive Querying for Autoregressive Neural
Sequence Models
Alex Boyd¹* Sam Showalter²* Stephan Mandt¹,² Padhraic Smyth¹,²
¹Department of Statistics  ²Department of Computer Science
University of California, Irvine
{alexjb,showalte,mandt,p.smyth}@uci.edu
*Authors contributed equally.
Abstract
In reasoning about sequential events it is natural to pose probabilistic queries such as
“when will event A occur next” or “what is the probability of A occurring before B”,
with applications in areas such as user modeling, medicine, and finance. However,
with machine learning shifting towards neural autoregressive models such as RNNs
and transformers, probabilistic querying has been largely restricted to simple cases
such as next-event prediction. This is in part due to the fact that future querying
involves marginalization over large path spaces, which is not straightforward to
do efficiently in such models. In this paper we introduce a general typology for
predictive queries in neural autoregressive sequence models and show that such
queries can be systematically represented by sets of elementary building blocks.
We leverage this typology to develop new query estimation methods based on
beam search, importance sampling, and hybrids. Across four large-scale sequence
datasets from different application domains, as well as for the GPT-2 language
model, we demonstrate the ability to make query answering tractable for arbitrary
queries in exponentially-large predictive path-spaces, and find clear differences in
cost-accuracy tradeoffs between search and sampling methods.
1 Introduction
One of the major successes in machine learning in recent years has been the development of neural
sequence models for categorical sequences, particularly in natural language applications but also in
other areas such as automatic code generation and program synthesis [Shin et al., 2019, Chen et al.,
2021], computer security [Brown et al., 2018], recommender systems [Wu et al., 2017], genomics
[Shin et al., 2021, Amin et al., 2021], and survival analysis [Lee et al., 2019]. Many of the models
(although not all) rely on autoregressive training and prediction, allowing for the sequential generation
of sequence completions in a recursive manner conditioned on sequence history.
A natural question in this context is how to compute answers to predictive queries that go beyond traditional one-step-ahead predictions. Examples of such queries are "how likely is event A to occur before event B?" and "how likely is event C to occur (once or more) within the next K steps of the sequence?" These types of queries are very natural across a wide variety of application contexts, for example, the probability that an individual will finish speaking or writing a sentence within the next K words, or that a user will use one app before another. See Fig. 1 and Appendix A for examples.
In this paper we develop a general framework for answering such predictive queries in the context of
autoregressive (AR) neural sequence models. This amounts to computing conditional probabilities
of propositional statements about future events, conditioned on the history of the sequence as
summarized by the current hidden state representation. We focus in particular on how to perform near
real-time computation of such queries, motivated by use-cases such as answering human-generated queries and utilizing query estimates within the optimization loop of training a model. Somewhat surprisingly, although there has been extensive prior work on multivariate probabilistic querying in areas such as graphical models and database querying, as well as for restricted types of queries for traditional sequence models such as Markov models, querying for neural sequence models appears to be unexplored. One possible reason is that the problem is computationally intractable in the general case (as we discuss later in Section 3), typically scaling as O(V^{K-1}) or worse for predictions K steps ahead, given a sequence model with a V-ary alphabet (e.g., compared to O(KV^2) for Markov chains).

Figure 1: (top) Illustration of a query for the probability of a given sentence ("In my opinion ...") ending within K steps. (bottom) GPT-2 query estimates across four prefixes ("In my opinion", "Hi, my name is", "Once upon a", "Where is"; x-axes: steps into sequence K, from 1 to 30; y-axes: query probability) with V = 50,257 and K ≤ 30. Importance sampling query estimates maintain a 6x reduction in variance relative to naive model sampling for the same computation budget. Open-ended prefixes (top-left) generally possess longer-tailed distributions relative to simple prefixes. Additional details provided in Sections 4, 5, and Appendix H.5.
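To give a rough sense of the O(V^{K-1}) versus O(KV^2) scaling above, a back-of-the-envelope sketch (the values of V and K are illustrative; V matches the GPT-2 vocabulary size from Fig. 1):

```python
# Rough path-space sizes for exact K-step-ahead computation (illustrative values only).
V = 50_257   # vocabulary size (GPT-2's, as in Fig. 1)
K = 10       # number of future steps considered

ar_paths = V ** (K - 1)      # O(V^(K-1)) paths an autoregressive model must marginalize over
markov_cost = K * V ** 2     # O(KV^2) dynamic-programming cost for a first-order Markov chain

print(f"autoregressive paths: {float(ar_paths):.1e}")    # ~2.0e+42
print(f"Markov-chain cost:    {float(markov_cost):.1e}") # ~2.5e+10
```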
Our contributions are three-fold:

1. We introduce and develop the problem of predictive querying in neural sequence models by reducing complex queries to elementary building blocks. These elementary queries define restricted path spaces over future event sequences.

2. We show that the underlying autoregressive model can always be constrained to the restricted path space satisfying a given query. This gives rise to a new proposal distribution that can be used for importance sampling, beam search, or a new hybrid approach.

3. We evaluate these methods across three user-behavior datasets and two language datasets. While all three methods significantly improve over naive forward simulation, the hybrid approach further improves over importance sampling and beam search. We furthermore explore how the performance of all methods relates to the model entropy.
Code for this work is available at https://github.com/ajboyd2/prob_seq_queries.
2 Related Work
Research on efficient computation of probabilistic queries has a long history in machine learning and AI, going back to work on exact inference in multivariate graphical models [Pearl, 1988, Koller and Friedman, 2009]. Queries in this context are typically of two types. The first are conditional probability queries, the focus of our attention here: computing probabilities defined for a subset X of variables of interest, conditioned on a second subset Y = y of observed variable values, and marginalizing over the set Z of all other variables. The second type of queries can broadly be referred to as assignment queries, seeking the most likely (highest conditional probability) assignment of values x for X, again conditioned on Y = y and marginalizing over the set Z. Assignment queries are also referred to as most probable explanation (MPE) queries, or as maximum a posteriori (MAP) queries when Z is the empty set [Koller et al., 2007].
Figure 2: (left) Tree diagram of the complete sequence space for a vocabulary V = {a, b, c} and the corresponding query space Q (right) for when the first appearance of a occurs on the third step (i.e., τ(a) = 3), defined as the set product of restricted domains listed below the figure.
For models that can be characterized with sparse Markov dependence structure, there is a significant body of work on efficient inference algorithms that can leverage such structure [Koller and Friedman, 2009], in particular for sequential models where recursive computation can be effectively leveraged [Bilmes, 2010]. However, autoregressive neural sequence models are inherently non-Markov since the real-valued current hidden state is a function of the entire history of the sequence. Each hidden state vector induces a tree containing V^K unique future trajectories with state-dependent probabilities for each path. Techniques such as dynamic programming (used effectively in Markov-structured sequence models) are not applicable in this context, and both assignment queries and conditional probability queries are NP-hard in general [Chen et al., 2018].
For assignment-type queries there has been considerable work in natural language processing with
neural sequence models, particularly for the MAP problem of generating high-quality/high-probability
sequences conditioned on sequence history or other conditioning information. A variety of heuristic
decoding methods have been developed and found to be useful in practice, including beam search
[Sutskever et al., 2014], best-first search [Xu and Durrett, 2021], sampling methods [Holtzman et al.,
2019], and hybrid variants [Shaham and Levy, 2022]. However, for conditional probability queries
with neural sequence models (the focus of this paper), there has been no prior work in general on
this problem to our knowledge. While decoding techniques such as beam search can also be useful
in the context of conditional probability queries, as we will see later in Section 4, such techniques
have significant limitations in this context, since by definition they produce lower-bounds on the
probabilities of interest and, hence, are biased estimators.
3 Probabilistic Queries
Notation  Let X_{1:N} := [X_1, X_2, ..., X_N] be a sequence of random variables with arbitrary length N. Additionally, let x_{1:N} := [x_1, x_2, ..., x_N] be their respective observed values, where each x_i takes on values from a fixed vocabulary V := {1, ..., V}. Examples of these sequences include sentences where each letter or word is a single value, or streams of discrete events generated by some process or user. We will refer to individual variable-value pairs in the sequence as events.

We consider an autoregressive model p_θ(X_{1:N}) = Π_{i=1}^N p_θ(X_i | X_{1:i-1}), parameterized by θ and trained on a given dataset of M independent draws from a ground-truth distribution P(X_{1:N}). We assume that this model can be conditioned on a subsequence, termed the history H. We will remap the indices of subsequent random variables to start at 1.² We abbreviate conditioning on a history by an asterisk *, i.e., P*(·) := P(·|H) and p*_θ(·) := p_θ(·|H).

² For example, if |H| = 3 then P(X_4 | H) is the distribution of the 7th value in a sequence after conditioning on the first 3 values in the sequence.
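The code sketches in the remainder of this section and in Section 4 assume a single helper, next_token_probs(prefix), that returns the model's next-token distribution p*_θ(X_k | X_{<k} = prefix) as a vector of probabilities indexed by token; both the helper name and its interface are our illustrative assumptions, not the paper's released API. As a concrete reading of the chain-rule factorization above:

```python
import math

def sequence_log_likelihood(next_token_probs, history, continuation):
    """log p*_theta(x_{1:N}) for a continuation of a given history H, via the chain rule.
    Indices of the continuation are implicitly remapped to start at 1, as in the text.
    `next_token_probs` is an assumed interface (see the lead-in), not the paper's API."""
    seq, log_p = list(history), 0.0
    for token in continuation:
        log_p += math.log(float(next_token_probs(seq)[token]))  # one conditional per step
        seq.append(token)
    return log_p
```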
Defining Probabilistic Queries  Given a specific history of events H, there are a variety of different questions one could ask about the future continuing where the history left off: (Q1) What event is likely to happen next? (Q2) Which events are likely to occur k > 1 steps from now? (Q3) What is the distribution of when the next instance of a ∈ V occurs? (Q4) How likely is it that we will see event a ∈ V occur before b ∈ V? (Q5) How likely is it for a ∈ V to occur n times in the next k steps?
We define a common framework for such queries by defining probabilistic queries to be of the form p*_θ(X_{1:K} ∈ Q) with Q ⊂ V^K. This can be extended to the infinite setting (i.e., p*_θ([X_k]_{k∈ℕ} ∈ Q) where Q ⊂ V^ℕ).
#    Question                                 Probabilistic Query     Cost (K · |Q|)
Q1   Next event?                              p*_θ(X_1)               O(1)
Q2   Event K steps from now?                  p*_θ(X_K)               O(V^{K-1})
Q3   Next instance of a?                      p*_θ(τ(a) = K)          O((V-1)^{K-1})
Q4   Will a happen before b?                  p*_θ(τ(a) < τ(b))       O((V-2)^K)
Q5   How many instances of a in K steps?      p*_θ(N_a(K) = n)        O(C(K,n) · (V-1)^{K-n})

Table 1: List of example questions, corresponding probabilistic queries, and associated costs of exact computation with an autoregressive model. The cost of accommodating a history H is assumed to be an additive constant for all queries. While Q4 extends to infinite time, the cost reported is for computing a lower bound up to K steps.
Exact computation of an arbitrary query is straightforward to represent:

p*_θ(X_{1:K} ∈ Q) = Σ_{x_{1:K} ∈ Q} p*_θ(X_{1:K} = x_{1:K}) = Σ_{x_{1:K} ∈ Q} Π_{k=1}^K p*_θ(X_k = x_k | X_{<k} = x_{<k}).    (1)
Depending on |Q|, performing this calculation can quickly become intractable, motivating lower bounds or approximations (developed in more detail in Section 4). In this context it is helpful to impose structure on the query Q to make subsequent estimation easier, in particular by breaking Q into the following structured partition:

Q = ∪_i Q^(i)  where  Q^(i) ∩ Q^(j) = ∅ for i ≠ j,    (2)
and  Q^(i) = V^(i)_1 × V^(i)_2 × ··· × V^(i)_K  where  V^(i)_k ⊆ V for k = 1, ..., K.    (3)

In words, this means a given query Q can be broken into a partition of simpler queries Q^(i) which take the form of a set cross product between restricted domains V^(i)_k, one domain for each token X_k.³ An illustration of an example query set can be seen in Fig. 2. A natural consequence of this is that

p*_θ(X_{1:K} ∈ Q) = Σ_i p*_θ(X_{1:K} ∈ Q^(i)) = Σ_i p*_θ(∩_{k=1}^K {X_k ∈ V^(i)_k}),

which lends itself to more easily estimating each term in the sum. This will be discussed in Section 4.

³ Ideally, the partitioning is chosen to have the smallest number of Q^(i)'s needed.
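As a minimal illustration of Eq. (1) applied to one elementary block Q^(i), the following brute-force sketch (using the assumed next_token_probs helper introduced earlier) enumerates every path in the block; it is only feasible when the block is very small:

```python
from itertools import product

def exact_block_probability(next_token_probs, history, restricted_domains):
    """Exact p*(X_{1:K} in Q^(i)) for one block Q^(i) = V_1^(i) x ... x V_K^(i) (Eq. 1).
    `restricted_domains` is a list of K lists of allowed tokens, one per step;
    `next_token_probs` is the assumed helper, not the paper's released API."""
    total = 0.0
    for path in product(*restricted_domains):         # every sequence in the block
        prob, seq = 1.0, list(history)
        for token in path:                             # chain rule, one conditional per step
            prob *= float(next_token_probs(seq)[token])
            seq.append(token)
        total += prob
    return total
```

The loop visits Π_k |V^(i)_k| paths, which is exactly the exponential cost that the estimation methods of Section 4 are designed to avoid.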
Queries of Interest  All of the queries posed earlier in this section can be represented under the framework detailed in Eq. (2) and Eq. (3), as illustrated in Table 1.

Q1 & Q2  The queries p*_θ(X_1 = a) and p*_θ(X_K = a) for some a ∈ V can be represented with Q = {a} and Q = V^{K-1} × {a}, respectively.

Q3  The probability of the next instance of a ∈ V occurring at some point in time K ≥ 1, p*_θ(τ(a) = K) where τ(·) is the hitting time, can be represented as Q = (V \ {a})^{K-1} × {a}. This can be adapted for a set A ⊂ V by replacing {a} in Q with A.

Q4  The probability of a ∈ V occurring before b ∈ V, p*_θ(τ(a) < τ(b)), is represented as Q = ∪_{i=1}^∞ Q^(i) where Q^(i) = (V \ {a, b})^{i-1} × {a}. Lower bounds to this can be computed by limiting i < i_0. Like Q3, this can also be extended to disjoint sets A, B ⊂ V. (Code sketches of the Q3 and Q4 constructions are given after this list.)

Q5  The probability of a ∈ V occurring n times in the next K steps, p*_θ(N_a(K) = n), is represented as Q = ∪_{i=1}^{C(K,n)} Q^(i), where N_a(K) is a random variable for the number of occurrences of events of type a from steps 1 to K, and the Q^(i) are defined to cover all unique orderings of products composed of {a} (appearing n times) and V \ {a} (appearing K − n times). Like above, this can easily be extended for A ⊂ V.
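For concreteness, the restricted-domain constructions for Q3 and Q4 might be written as follows (a sketch; the function names and the list-of-domains representation are ours):

```python
def domains_hitting_time(vocab, a, K):
    """Q3: the single block for {tau(a) = K} -- `a` is excluded for the first K-1 steps
    and required at step K.  Returns a list of K per-step domains."""
    not_a = [v for v in vocab if v != a]
    return [not_a] * (K - 1) + [[a]]

def domains_a_before_b(vocab, a, b, max_steps):
    """Q4 (lower bound up to max_steps): one block per i, in which neither `a` nor `b`
    occurs for the first i-1 steps and `a` occurs at step i."""
    neither = [v for v in vocab if v not in (a, b)]
    return [[neither] * (i - 1) + [[a]] for i in range(1, max_steps + 1)]
```

For the vocabulary of Fig. 2, domains_hitting_time(['a', 'b', 'c'], 'a', 3) returns [['b', 'c'], ['b', 'c'], ['a']], i.e., the query space shown on the right of the figure. Each block can then be passed to exact_block_probability above or, more practically, to the estimators of Section 4.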
Query Complexity  From Eq. (1), exact computation of a query involves computing K · |Q| conditional distributions (e.g., p*_θ(X_k | X_{<k} = x_{<k})) in an autoregressive manner. Under the structured representation, the number of conditional distributions needed is equivalently Σ_i Π_{k=1}^K |V^(i)_k|.
Non-attention-based neural sequence models often define p*_θ(X_k | X_{<k} = x_{<k}) := f_θ(h_k) where h_k = RNN_θ(h_{k-1}, x_{k-1}). As such, the computation complexity for any individual conditional distribution remains constant with respect to sequence length. We will refer to the complexity of this atomic action as O(1). Naturally, the actual complexity depends on the model architecture and has a multiplicative scaling on the cost of computing a query. The number of atomic operations needed to exactly compute Q1–Q5 for this class of models can be found in Table 1. Should p_θ be an attention-based model (e.g., a transformer [Vaswani et al., 2017]), then the time complexity of computing a single one-step-ahead distribution becomes O(K), further exacerbating the exponential growth of many queries. Note that with some particular parametric forms of p_θ, query computation can be more efficient; e.g., see Appendix C for a discussion on query complexity for Markov models.
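To make the atomic O(1) action concrete, a minimal PyTorch sketch of a single recurrent step (layer sizes and the toy token id are illustrative, not the architectures used in the experiments):

```python
import torch
import torch.nn as nn

V, H = 1000, 128                 # illustrative vocabulary and hidden sizes
embed = nn.Embedding(V, H)
cell = nn.GRUCell(H, H)          # h_k = RNN_theta(h_{k-1}, x_{k-1})
readout = nn.Linear(H, V)        # f_theta(h_k): logits over the vocabulary

h = torch.zeros(1, H)            # hidden state summarizing the history so far
x_prev = torch.tensor([7])       # previous token (arbitrary example id)

h = cell(embed(x_prev), h)                    # one model call; cost independent of step index k
p_next = torch.softmax(readout(h), dim=-1)    # p*_theta(X_k | X_<k), shape (1, V)
```

A transformer would instead attend over all previous positions to produce p_next, giving the O(K) per-step cost noted above.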
4 Query Estimation Methods
Since exact query computation can scale exponentially in K, it is natural to consider approximation methods. In particular, we focus on importance sampling, beam search, and a hybrid of both methods. All methods will be based on a novel proposal distribution, discussed below.
4.1 Proposal Distribution
For the estimation methods discussed later, it is beneficial to have a proposal distribution q(X_{1:K} = x_{1:K}) whose domain matches that of the query Q. For importance sampling we need this distribution as a proposal, while for beam search we use it as the base model for selecting high-probability sequences. We would like the proposal distribution to resemble our original model while also respecting the query. One natural choice is q(X_{1:K} = x_{1:K}) = p*_θ(X_{1:K} = x_{1:K} | X_{1:K} ∈ Q). However, computing this probability involves normalizing by p*_θ(X_{1:K} ∈ Q), which is exactly what we are trying to estimate in the first place. Instead of restricting the joint distribution to the query, we can restrict every conditional distribution to the query's restricted domain. To see this, we first partition Q = ∪_i Q^(i) and define an autoregressive proposal distribution for each Q^(i) = Π_{k=1}^K V^(i)_k as follows:

q^(i)(X_{1:K} = x_{1:K}) = Π_{k=1}^K p*_θ(X_k = x_k | X_{<k} = x_{<k}, X_k ∈ V^(i)_k)    (4)
                         = Π_{k=1}^K [ p*_θ(X_k = x_k | X_{<k} = x_{<k}) · 1(x_k ∈ V^(i)_k) ] / [ Σ_{v ∈ V^(i)_k} p*_θ(X_k = v | X_{<k} = x_{<k}) ],

where 1(·) is the indicator function. That is, we constrain the outcomes of each conditional probability to the restricted domains V^(i)_k and renormalize them accordingly. To evaluate the proposal distribution's probability, we multiply all conditional probabilities according to the chain rule. Since the entire conditional distribution p*_θ(X_k | X_{<k} = x_{<k}) is produced by a single model call, it is possible to both sample a K-length sequence and compute its likelihood under q^(i) with only K model calls. Thus, we can efficiently sample sequences from a distribution that is both informed by the underlying model p_θ and that respects the given domain Q. As discussed in the next section, this proposal will be used for importance sampling and as the base distribution on which beam search is conducted.
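A minimal sketch of Eq. (4), again using the assumed next_token_probs interface (here returning a PyTorch tensor of probabilities over V); the function draws one sequence from q^(i) and records the log-probabilities needed later for importance weighting:

```python
import torch

def sample_from_restricted_proposal(next_token_probs, history, restricted_domains):
    """Draw one sequence from q^(i) (Eq. 4); return it with log q^(i)(x_{1:K}) and
    log p*_theta(x_{1:K}).  Each step restricts the model's conditional to the allowed
    domain V_k^(i), renormalizes, and samples -- K model calls in total.  Assumes the
    model places nonzero mass on each domain; `next_token_probs` is an assumed interface."""
    seq, log_q, log_p = list(history), 0.0, 0.0
    for domain in restricted_domains:                   # one step per restricted domain
        probs = next_token_probs(seq)                   # full conditional p*_theta(X_k | x_<k), shape (V,)
        allowed = torch.tensor(domain)
        restricted = probs[allowed]
        norm = restricted.sum()                         # mass the model places on V_k^(i)
        idx = torch.multinomial(restricted / norm, 1).item()
        token = allowed[idx].item()
        log_q += torch.log(probs[token] / norm).item()  # restricted, renormalized conditional
        log_p += torch.log(probs[token]).item()         # unrestricted conditional (for weights)
        seq.append(token)
    return seq[len(history):], log_q, log_p
```

Each of the K steps makes exactly one model call, and that same call provides both the restricted conditional defining q^(i) and the unrestricted conditional under p*_θ.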
4.2 Estimation Techniques
Sampling  One can naively estimate an arbitrary query probability using Monte Carlo samples via p*_θ(X_{1:K} ∈ Q) = E_{p*_θ}[1(X_{1:K} ∈ Q)]; however, this typically will have high variance. This can be substantially improved upon by exclusively drawing sequences from the query space Q. Arbitrary queries can be written as a sum of probabilities of individual sequences, as seen in Eq. (1). This summation can be equivalently written as an expected value,

p*_θ(X_{1:K} ∈ Q) = Σ_{x_{1:K} ∈ Q} p*_θ(X_{1:K} = x_{1:K}) = |Q| · E_{x_{1:K} ~ U(Q)}[ p*_θ(X_{1:K} = x_{1:K}) ],

where U(Q) is the uniform distribution over Q. It is common for p*_θ to concentrate most of the available probability mass on a small subset of the total possible space V^K. Should |Q| be large, then |Q| · p*_θ(X_{1:K} = x_{1:K}) can vary over many orders of magnitude across uniform draws, again yielding a high-variance estimator.
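A sketch of the two Monte Carlo estimators discussed above, together with the standard importance-sampling estimator built on the proposal of Section 4.1 (the identity E_{q^(i)}[p*_θ(x)/q^(i)(x)] = p*_θ(X_{1:K} ∈ Q^(i)) is the generic one; the paper's concrete estimators and hybrids are developed in the remainder of Section 4, and none of this is code from the released repository):

```python
import math, random

def uniform_estimate(next_token_probs, history, restricted_domains, num_samples):
    """|Q^(i)| * E_{x ~ U(Q^(i))}[ p*_theta(x) ], estimated with uniform draws from the block."""
    size = math.prod(len(d) for d in restricted_domains)
    total = 0.0
    for _ in range(num_samples):
        seq, log_p = list(history), 0.0
        for domain in restricted_domains:
            token = random.choice(domain)               # uniform over the restricted domain
            log_p += math.log(float(next_token_probs(seq)[token]))
            seq.append(token)
        total += math.exp(log_p)
    return size * total / num_samples

def importance_estimate(draw_from_proposal, num_samples):
    """(1/S) * sum_s p*_theta(x^(s)) / q^(i)(x^(s)), with x^(s) drawn from q^(i), e.g. via
    draw_from_proposal = lambda: sample_from_restricted_proposal(next_token_probs, history, domains)."""
    total = 0.0
    for _ in range(num_samples):
        _, log_q, log_p = draw_from_proposal()
        total += math.exp(log_p - log_q)
    return total / num_samples
```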