Predictive Querying for Autoregressive Neural
Sequence Models
Alex Boyd¹* Sam Showalter²* Stephan Mandt¹,² Padhraic Smyth¹,²
¹Department of Statistics  ²Department of Computer Science
University of California, Irvine
{alexjb,showalte,mandt,p.smyth}@uci.edu
*Authors contributed equally.
Abstract
In reasoning about sequential events it is natural to pose probabilistic queries such as
“when will event A occur next” or “what is the probability of A occurring before B”,
with applications in areas such as user modeling, medicine, and finance. However,
with machine learning shifting towards neural autoregressive models such as RNNs
and transformers, probabilistic querying has been largely restricted to simple cases
such as next-event prediction. This is in part due to the fact that future querying
involves marginalization over large path spaces, which is not straightforward to
do efficiently in such models. In this paper we introduce a general typology for
predictive queries in neural autoregressive sequence models and show that such
queries can be systematically represented by sets of elementary building blocks.
We leverage this typology to develop new query estimation methods based on
beam search, importance sampling, and hybrids. Across four large-scale sequence
datasets from different application domains, as well as for the GPT-2 language
model, we demonstrate the ability to make query answering tractable for arbitrary
queries in exponentially-large predictive path-spaces, and find clear differences in
cost-accuracy tradeoffs between search and sampling methods.
1 Introduction
One of the major successes in machine learning in recent years has been the development of neural
sequence models for categorical sequences, particularly in natural language applications but also in
other areas such as automatic code generation and program synthesis [Shin et al., 2019, Chen et al.,
2021], computer security [Brown et al., 2018], recommender systems [Wu et al., 2017], genomics
[Shin et al., 2021, Amin et al., 2021], and survival analysis [Lee et al., 2019]. Many of the models
(although not all) rely on autoregressive training and prediction, allowing for the sequential generation
of sequence completions in a recursive manner conditioned on sequence history.
A natural question in this context is how to compute answers to predictive queries that go beyond traditional one-step-ahead predictions. Examples of such queries are "how likely is event A to occur before event B?" and "how likely is event C to occur (once or more) within the next K steps of the sequence?" These types of queries are very natural across a wide variety of application contexts, for example, the probability that an individual will finish speaking or writing a sentence within the next K words, or that a user will use one app before another. See Fig. 1 and Appendix A for examples.
In this paper we develop a general framework for answering such predictive queries in the context of
autoregressive (AR) neural sequence models. This amounts to computing conditional probabilities
of propositional statements about future events, conditioned on the history of the sequence as
summarized by the current hidden state representation. We focus in particular on how to perform near
real-time computation of such queries, motivated by use-cases such as answering human-generated queries and utilizing query estimates within the optimization loop of training a model. Somewhat surprisingly, although there has been extensive prior work on multivariate probabilistic querying in areas such as graphical models and database querying, as well as for restricted types of queries for traditional sequence models such as Markov models, querying for neural sequence models appears to be unexplored. One possible reason is that the problem is computationally intractable in the general case (as we discuss later in Section 3), typically scaling as O(V^{K-1}) or worse for predictions K steps ahead, given a sequence model with a V-ary alphabet (e.g., compared to O(KV^2) for Markov chains).

Figure 1: (top) Illustration of a query for the probability of a given sentence ("In my opinion ...") ending within K steps. (bottom) GPT-2 query estimates across four prefixes ("In my opinion", "Hi, my name is", "Once upon a", "Where is"; x-axes: steps into sequence K, from 1 to 30; y-axes: query probability) with V = 50,257 and K ≤ 30. Importance sampling query estimates maintain a 6x reduction in variance relative to naive model sampling for the same computation budget. Open-ended prefixes (top-left) generally possess longer-tailed distributions relative to simple prefixes. Additional details provided in Sections 4, 5, and Appendix H.5.
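To give a rough sense of the O(V^{K-1}) versus O(KV^2) scaling above, a back-of-the-envelope sketch (the values of V and K are illustrative; V matches the GPT-2 vocabulary size from Fig. 1):

```python
# Rough path-space sizes for exact K-step-ahead computation (illustrative values only).
V = 50_257   # vocabulary size (GPT-2's, as in Fig. 1)
K = 10       # number of future steps considered

ar_paths = V ** (K - 1)      # O(V^(K-1)) paths an autoregressive model must marginalize over
markov_cost = K * V ** 2     # O(KV^2) dynamic-programming cost for a first-order Markov chain

print(f"autoregressive paths: {float(ar_paths):.1e}")    # ~2.0e+42
print(f"Markov-chain cost:    {float(markov_cost):.1e}") # ~2.5e+10
```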
Our contributions are three-fold:

1. We introduce and develop the problem of predictive querying in neural sequence models by reducing complex queries to elementary building blocks. These elementary queries define restricted path spaces over future event sequences.

2. We show that the underlying autoregressive model can always be constrained to the restricted path space satisfying a given query. This gives rise to a new proposal distribution that can be used for importance sampling, beam search, or a new hybrid approach.

3. We evaluate these methods across three user-behavior datasets and two language datasets. While all three methods significantly improve over naive forward simulation, the hybrid approach further improves over importance sampling and beam search. We furthermore explore how the performance of all methods relates to the model entropy.
Code for this work is available at https://github.com/ajboyd2/prob_seq_queries.
2 Related Work
Research on efficient computation of probabilistic queries has a long history in machine learning and AI, going back to work on exact inference in multivariate graphical models [Pearl, 1988, Koller and Friedman, 2009]. Queries in this context are typically of two types. The first are conditional probability queries, the focus of our attention here: computing probabilities defined for a subset X of variables of interest, conditioned on a second subset Y = y of observed variable values, and marginalizing over the set Z of all other variables. The second type of queries can broadly be referred to as assignment queries, seeking the most likely (highest conditional probability) assignment of values x for X, again conditioned on Y = y and marginalizing over the set Z. Assignment queries are also referred to as most probable explanation (MPE) queries, or as maximum a posteriori (MAP) queries when Z is the empty set [Koller et al., 2007].
Figure 2: (left) Tree diagram of the complete sequence space for a vocabulary V = {a, b, c} and the corresponding query space Q (right) for when the first appearance of a occurs on the third step (i.e., τ(a) = 3), defined as the set product of restricted domains listed below the figure.
For models that can be characterized with sparse Markov dependence structure, there is a significant body of work on efficient inference algorithms that can leverage such structure [Koller and Friedman, 2009], in particular for sequential models where recursive computation can be effectively leveraged [Bilmes, 2010]. However, autoregressive neural sequence models are inherently non-Markov since the real-valued current hidden state is a function of the entire history of the sequence. Each hidden state vector induces a tree containing V^K unique future trajectories with state-dependent probabilities for each path. Techniques such as dynamic programming (used effectively in Markov-structured sequence models) are not applicable in this context, and both assignment queries and conditional probability queries are NP-hard in general [Chen et al., 2018].
For assignment-type queries there has been considerable work in natural language processing with
neural sequence models, particularly for the MAP problem of generating high-quality/high-probability
sequences conditioned on sequence history or other conditioning information. A variety of heuristic
decoding methods have been developed and found to be useful in practice, including beam search
[Sutskever et al., 2014], best-first search [Xu and Durrett, 2021], sampling methods [Holtzman et al.,
2019], and hybrid variants [Shaham and Levy, 2022]. However, for conditional probability queries
with neural sequence models (the focus of this paper), there has been no prior work in general on
this problem to our knowledge. While decoding techniques such as beam search can also be useful
in the context of conditional probability queries, as we will see later in Section 4, such techniques
have significant limitations in this context, since by definition they produce lower-bounds on the
probabilities of interest and, hence, are biased estimators.
3 Probabilistic Queries
Notation  Let X_{1:N} := [X_1, X_2, ..., X_N] be a sequence of random variables with arbitrary length N. Additionally, let x_{1:N} := [x_1, x_2, ..., x_N] be their respective observed values, where each x_i takes on values from a fixed vocabulary V := {1, ..., V}. Examples of these sequences include sentences where each letter or word is a single value, or streams of discrete events generated by some process or user. We will refer to individual variable-value pairs in the sequence as events.

We consider an autoregressive model p_θ(X_{1:N}) = Π_{i=1}^N p_θ(X_i | X_{1:i-1}), parameterized by θ and trained on a given dataset of M independent draws from a ground-truth distribution P(X_{1:N}). We assume that this model can be conditioned on a subsequence, termed the history H. We will remap the indices of subsequent random variables to start at 1.² We abbreviate conditioning on a history by an asterisk *, i.e., P*(·) := P(·|H) and p*_θ(·) := p_θ(·|H).

² For example, if |H| = 3 then P(X_4 | H) is the distribution of the 7th value in a sequence after conditioning on the first 3 values in the sequence.
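The code sketches in the remainder of this section and in Section 4 assume a single helper, next_token_probs(prefix), that returns the model's next-token distribution p*_θ(X_k | X_{<k} = prefix) as a vector of probabilities indexed by token; both the helper name and its interface are our illustrative assumptions, not the paper's released API. As a concrete reading of the chain-rule factorization above:

```python
import math

def sequence_log_likelihood(next_token_probs, history, continuation):
    """log p*_theta(x_{1:N}) for a continuation of a given history H, via the chain rule.
    Indices of the continuation are implicitly remapped to start at 1, as in the text.
    `next_token_probs` is an assumed interface (see the lead-in), not the paper's API."""
    seq, log_p = list(history), 0.0
    for token in continuation:
        log_p += math.log(float(next_token_probs(seq)[token]))  # one conditional per step
        seq.append(token)
    return log_p
```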
Defining Probabilistic Queries  Given a specific history of events H, there are a variety of different questions one could ask about the future continuing where the history left off: (Q1) What event is likely to happen next? (Q2) Which events are likely to occur k > 1 steps from now? (Q3) What is the distribution of when the next instance of a ∈ V occurs? (Q4) How likely is it that we will see event a ∈ V occur before b ∈ V? (Q5) How likely is it for a ∈ V to occur n times in the next k steps?
We define a common framework for such queries by defining probabilistic queries to be of the form p*_θ(X_{1:K} ∈ Q) with Q ⊂ V^K. This can be extended to the infinite setting (i.e., p*_θ([X_k]_{k∈ℕ} ∈ Q) where Q ⊂ V^ℕ).
#    Question                                 Probabilistic Query     Cost (K · |Q|)
Q1   Next event?                              p*_θ(X_1)               O(1)
Q2   Event K steps from now?                  p*_θ(X_K)               O(V^{K-1})
Q3   Next instance of a?                      p*_θ(τ(a) = K)          O((V-1)^{K-1})
Q4   Will a happen before b?                  p*_θ(τ(a) < τ(b))       O((V-2)^K)
Q5   How many instances of a in K steps?      p*_θ(N_a(K) = n)        O(C(K,n) · (V-1)^{K-n})

Table 1: List of example questions, corresponding probabilistic queries, and associated costs of exact computation with an autoregressive model. The cost of accommodating a history H is assumed to be an additive constant for all queries. While Q4 extends to infinite time, the cost reported is for computing a lower bound up to K steps.
Exact computation of an arbitrary query is straightforward to represent:

p*_θ(X_{1:K} ∈ Q) = Σ_{x_{1:K} ∈ Q} p*_θ(X_{1:K} = x_{1:K}) = Σ_{x_{1:K} ∈ Q} Π_{k=1}^K p*_θ(X_k = x_k | X_{<k} = x_{<k}).    (1)
Depending on |Q|, performing this calculation can quickly become intractable, motivating lower bounds or approximations (developed in more detail in Section 4). In this context it is helpful to impose structure on the query Q to make subsequent estimation easier, in particular by breaking Q into the following structured partition:

Q = ∪_i Q^(i)  where  Q^(i) ∩ Q^(j) = ∅ for i ≠ j,    (2)
and  Q^(i) = V^(i)_1 × V^(i)_2 × ··· × V^(i)_K  where  V^(i)_k ⊆ V for k = 1, ..., K.    (3)

In words, this means a given query Q can be broken into a partition of simpler queries Q^(i) which take the form of a set cross product between restricted domains V^(i)_k, one domain for each token X_k.³ An illustration of an example query set can be seen in Fig. 2. A natural consequence of this is that

p*_θ(X_{1:K} ∈ Q) = Σ_i p*_θ(X_{1:K} ∈ Q^(i)) = Σ_i p*_θ(∩_{k=1}^K {X_k ∈ V^(i)_k}),

which lends itself to more easily estimating each term in the sum. This will be discussed in Section 4.

³ Ideally, the partitioning is chosen to have the smallest number of Q^(i)'s needed.
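As a minimal illustration of Eq. (1) applied to one elementary block Q^(i), the following brute-force sketch (using the assumed next_token_probs helper introduced earlier) enumerates every path in the block; it is only feasible when the block is very small:

```python
from itertools import product

def exact_block_probability(next_token_probs, history, restricted_domains):
    """Exact p*(X_{1:K} in Q^(i)) for one block Q^(i) = V_1^(i) x ... x V_K^(i) (Eq. 1).
    `restricted_domains` is a list of K lists of allowed tokens, one per step;
    `next_token_probs` is the assumed helper, not the paper's released API."""
    total = 0.0
    for path in product(*restricted_domains):         # every sequence in the block
        prob, seq = 1.0, list(history)
        for token in path:                             # chain rule, one conditional per step
            prob *= float(next_token_probs(seq)[token])
            seq.append(token)
        total += prob
    return total
```

The loop visits Π_k |V^(i)_k| paths, which is exactly the exponential cost that the estimation methods of Section 4 are designed to avoid.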
Queries of Interest  All of the queries posed earlier in this section can be represented under the framework detailed in Eq. (2) and Eq. (3), as illustrated in Table 1.

Q1 & Q2  The queries p*_θ(X_1 = a) and p*_θ(X_K = a) for some a ∈ V can be represented with Q = {a} and Q = V^{K-1} × {a}, respectively.

Q3  The probability of the next instance of a ∈ V occurring at some point in time K ≥ 1, p*_θ(τ(a) = K) where τ(·) is the hitting time, can be represented as Q = (V \ {a})^{K-1} × {a}. This can be adapted for a set A ⊂ V by replacing {a} in Q with A.

Q4  The probability of a ∈ V occurring before b ∈ V, p*_θ(τ(a) < τ(b)), is represented as Q = ∪_{i=1}^∞ Q^(i) where Q^(i) = (V \ {a, b})^{i-1} × {a}. Lower bounds to this can be computed by limiting i < i_0. Like Q3, this can also be extended to disjoint sets A, B ⊂ V. (Code sketches of the Q3 and Q4 constructions are given after this list.)

Q5  The probability of a ∈ V occurring n times in the next K steps, p*_θ(N_a(K) = n), is represented as Q = ∪_{i=1}^{C(K,n)} Q^(i), where N_a(K) is a random variable for the number of occurrences of events of type a from steps 1 to K, and the Q^(i) are defined to cover all unique orderings of products composed of {a} (appearing n times) and V \ {a} (appearing K − n times). Like above, this can easily be extended for A ⊂ V.
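For concreteness, the restricted-domain constructions for Q3 and Q4 might be written as follows (a sketch; the function names and the list-of-domains representation are ours):

```python
def domains_hitting_time(vocab, a, K):
    """Q3: the single block for {tau(a) = K} -- `a` is excluded for the first K-1 steps
    and required at step K.  Returns a list of K per-step domains."""
    not_a = [v for v in vocab if v != a]
    return [not_a] * (K - 1) + [[a]]

def domains_a_before_b(vocab, a, b, max_steps):
    """Q4 (lower bound up to max_steps): one block per i, in which neither `a` nor `b`
    occurs for the first i-1 steps and `a` occurs at step i."""
    neither = [v for v in vocab if v not in (a, b)]
    return [[neither] * (i - 1) + [[a]] for i in range(1, max_steps + 1)]
```

For the vocabulary of Fig. 2, domains_hitting_time(['a', 'b', 'c'], 'a', 3) returns [['b', 'c'], ['b', 'c'], ['a']], i.e., the query space shown on the right of the figure. Each block can then be passed to exact_block_probability above or, more practically, to the estimators of Section 4.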
Query Complexity  From Eq. (1), exact computation of a query involves computing K · |Q| conditional distributions (e.g., p*_θ(X_k | X_{<k} = x_{<k})) in an autoregressive manner. Under the structured representation, the number of conditional distributions needed is equivalently Σ_i Π_{k=1}^K |V^(i)_k|.
Non-attention-based neural sequence models often define p*_θ(X_k | X_{<k} = x_{<k}) := f_θ(h_k) where h_k = RNN_θ(h_{k-1}, x_{k-1}). As such, the computation complexity for any individual conditional distribution remains constant with respect to sequence length. We will refer to the complexity of this atomic action as O(1). Naturally, the actual complexity depends on the model architecture and has a multiplicative scaling on the cost of computing a query. The number of atomic operations needed to exactly compute Q1–Q5 for this class of models can be found in Table 1. Should p_θ be an attention-based model (e.g., a transformer [Vaswani et al., 2017]), then the time complexity of computing a single one-step-ahead distribution becomes O(K), further exacerbating the exponential growth of many queries. Note that with some particular parametric forms of p_θ, query computation can be more efficient; e.g., see Appendix C for a discussion on query complexity for Markov models.
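To make the atomic O(1) action concrete, a minimal PyTorch sketch of a single recurrent step (layer sizes and the toy token id are illustrative, not the architectures used in the experiments):

```python
import torch
import torch.nn as nn

V, H = 1000, 128                 # illustrative vocabulary and hidden sizes
embed = nn.Embedding(V, H)
cell = nn.GRUCell(H, H)          # h_k = RNN_theta(h_{k-1}, x_{k-1})
readout = nn.Linear(H, V)        # f_theta(h_k): logits over the vocabulary

h = torch.zeros(1, H)            # hidden state summarizing the history so far
x_prev = torch.tensor([7])       # previous token (arbitrary example id)

h = cell(embed(x_prev), h)                    # one model call; cost independent of step index k
p_next = torch.softmax(readout(h), dim=-1)    # p*_theta(X_k | X_<k), shape (1, V)
```

A transformer would instead attend over all previous positions to produce p_next, giving the O(K) per-step cost noted above.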
4 Query Estimation Methods
Since exact query computation can scale exponentially in K, it is natural to consider approximation methods. In particular, we focus on importance sampling, beam search, and a hybrid of both methods. All methods will be based on a novel proposal distribution, discussed below.
4.1 Proposal Distribution
For the estimation methods discussed later, it is beneficial to have a proposal distribution q(X_{1:K} = x_{1:K}) whose domain matches that of the query Q. For importance sampling we need this distribution as a proposal, while for beam search we use it as the base model for selecting high-probability sequences. We would like the proposal distribution to resemble our original model while also respecting the query. One natural choice is q(X_{1:K} = x_{1:K}) = p*_θ(X_{1:K} = x_{1:K} | X_{1:K} ∈ Q). However, computing this probability involves normalizing by p*_θ(X_{1:K} ∈ Q), which is exactly what we are trying to estimate in the first place. Instead of restricting the joint distribution to the query, we can restrict every conditional distribution to the query's restricted domain. To see this, we first partition Q = ∪_i Q^(i) and define an autoregressive proposal distribution for each Q^(i) = Π_{k=1}^K V^(i)_k as follows:

q^(i)(X_{1:K} = x_{1:K}) = Π_{k=1}^K p*_θ(X_k = x_k | X_{<k} = x_{<k}, X_k ∈ V^(i)_k)    (4)
                         = Π_{k=1}^K [ p*_θ(X_k = x_k | X_{<k} = x_{<k}) · 1(x_k ∈ V^(i)_k) ] / [ Σ_{v ∈ V^(i)_k} p*_θ(X_k = v | X_{<k} = x_{<k}) ],

where 1(·) is the indicator function. That is, we constrain the outcomes of each conditional probability to the restricted domains V^(i)_k and renormalize them accordingly. To evaluate the proposal distribution's probability, we multiply all conditional probabilities according to the chain rule. Since the entire conditional distribution p*_θ(X_k | X_{<k} = x_{<k}) is produced by a single model call, it is possible to both sample a K-length sequence and compute its likelihood under q^(i) with only K model calls. Thus, we can efficiently sample sequences from a distribution that is both informed by the underlying model p_θ and that respects the given domain Q. As discussed in the next section, this proposal will be used for importance sampling and as the base distribution on which beam search is conducted.
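A minimal sketch of Eq. (4), again using the assumed next_token_probs interface (here returning a PyTorch tensor of probabilities over V); the function draws one sequence from q^(i) and records the log-probabilities needed later for importance weighting:

```python
import torch

def sample_from_restricted_proposal(next_token_probs, history, restricted_domains):
    """Draw one sequence from q^(i) (Eq. 4); return it with log q^(i)(x_{1:K}) and
    log p*_theta(x_{1:K}).  Each step restricts the model's conditional to the allowed
    domain V_k^(i), renormalizes, and samples -- K model calls in total.  Assumes the
    model places nonzero mass on each domain; `next_token_probs` is an assumed interface."""
    seq, log_q, log_p = list(history), 0.0, 0.0
    for domain in restricted_domains:                   # one step per restricted domain
        probs = next_token_probs(seq)                   # full conditional p*_theta(X_k | x_<k), shape (V,)
        allowed = torch.tensor(domain)
        restricted = probs[allowed]
        norm = restricted.sum()                         # mass the model places on V_k^(i)
        idx = torch.multinomial(restricted / norm, 1).item()
        token = allowed[idx].item()
        log_q += torch.log(probs[token] / norm).item()  # restricted, renormalized conditional
        log_p += torch.log(probs[token]).item()         # unrestricted conditional (for weights)
        seq.append(token)
    return seq[len(history):], log_q, log_p
```

Each of the K steps makes exactly one model call, and that same call provides both the restricted conditional defining q^(i) and the unrestricted conditional under p*_θ.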
4.2 Estimation Techniques
Sampling  One can naively estimate an arbitrary query probability using Monte Carlo samples via p*_θ(X_{1:K} ∈ Q) = E_{p*_θ}[1(X_{1:K} ∈ Q)]; however, this typically will have high variance. This can be substantially improved upon by exclusively drawing sequences from the query space Q. Arbitrary queries can be written as a sum of probabilities of individual sequences, as seen in Eq. (1). This summation can be equivalently written as an expected value,

p*_θ(X_{1:K} ∈ Q) = Σ_{x_{1:K} ∈ Q} p*_θ(X_{1:K} = x_{1:K}) = |Q| · E_{x_{1:K} ~ U(Q)}[ p*_θ(X_{1:K} = x_{1:K}) ],

where U(Q) is the uniform distribution over Q. It is common for p*_θ to concentrate most of the available probability mass on a small subset of the total possible space V^K. Should |Q| be large, then |Q| · p*_θ(X_{1:K} = x_{1:K}) can vary over many orders of magnitude across uniform draws, again yielding a high-variance estimator.
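A sketch of the two Monte Carlo estimators discussed above, together with the standard importance-sampling estimator built on the proposal of Section 4.1 (the identity E_{q^(i)}[p*_θ(x)/q^(i)(x)] = p*_θ(X_{1:K} ∈ Q^(i)) is the generic one; the paper's concrete estimators and hybrids are developed in the remainder of Section 4, and none of this is code from the released repository):

```python
import math, random

def uniform_estimate(next_token_probs, history, restricted_domains, num_samples):
    """|Q^(i)| * E_{x ~ U(Q^(i))}[ p*_theta(x) ], estimated with uniform draws from the block."""
    size = math.prod(len(d) for d in restricted_domains)
    total = 0.0
    for _ in range(num_samples):
        seq, log_p = list(history), 0.0
        for domain in restricted_domains:
            token = random.choice(domain)               # uniform over the restricted domain
            log_p += math.log(float(next_token_probs(seq)[token]))
            seq.append(token)
        total += math.exp(log_p)
    return size * total / num_samples

def importance_estimate(draw_from_proposal, num_samples):
    """(1/S) * sum_s p*_theta(x^(s)) / q^(i)(x^(s)), with x^(s) drawn from q^(i), e.g. via
    draw_from_proposal = lambda: sample_from_restricted_proposal(next_token_probs, history, domains)."""
    total = 0.0
    for _ in range(num_samples):
        _, log_q, log_p = draw_from_proposal()
        total += math.exp(log_p - log_q)
    return total / num_samples
```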