
Figure 2: (left) Tree diagram of the complete sequence space for a vocabulary $\mathcal{V} = \{a, b, c\}$ and (right) the corresponding query space $\mathcal{Q}$ for when the first appearance of $a$ occurs on the third step (i.e., $\tau(a) = 3$), defined as the set product of restricted domains listed below the figure.
For models that can be characterized with a sparse Markov dependence structure, there is a significant body of work on efficient inference algorithms that can leverage such structure [Koller and Friedman, 2009], in particular for sequential models where recursive computation can be exploited effectively [Bilmes, 2010]. However, autoregressive neural sequence models are inherently non-Markov, since the real-valued current hidden state is a function of the entire history of the sequence. Each hidden state vector induces a tree containing $V^k$ unique future trajectories, with state-dependent probabilities for each path. Techniques such as dynamic programming (used effectively in Markov-structured sequence models) are not applicable in this context, and both assignment queries and conditional probability queries are NP-hard in general [Chen et al., 2018].
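To make this contrast concrete, the following minimal sketch (our own illustration, not code from the paper; `p_next` is a hypothetical stand-in for the model's next-symbol distribution) compares the two settings: for a first-order Markov chain, the distribution of the symbol $k$ steps ahead is a matrix power computable by dynamic programming in $O(kV^2)$ time, whereas for a prefix-dependent autoregressive model the same marginal requires visiting every branch of a tree with $V^{k-1}$ paths.

```python
import itertools
import numpy as np

# Markov case: k-step-ahead marginal via a matrix power (dynamic programming).
V, k = 3, 5
T = np.random.dirichlet(np.ones(V), size=V)        # V x V transition matrix
p0 = np.full(V, 1.0 / V)                           # current state distribution
p_markov = p0 @ np.linalg.matrix_power(T, k)       # distribution of the symbol k steps ahead

# Autoregressive case: the next-symbol distribution depends on the entire prefix
# (via the hidden state), so the same marginal requires enumerating all V^(k-1) prefixes.
def p_next(prefix):
    """Hypothetical stand-in for p_theta(. | H, prefix): any prefix-dependent distribution."""
    rng = np.random.default_rng(abs(hash(prefix)) % (2 ** 32))
    w = rng.random(V)
    return w / w.sum()

p_autoreg = np.zeros(V)
for prefix in itertools.product(range(V), repeat=k - 1):   # 3^4 = 81 paths here
    prob = 1.0
    for i, x in enumerate(prefix):
        prob *= p_next(prefix[:i])[x]                       # probability of reaching this node
    p_autoreg += prob * p_next(prefix)                      # weight the leaf's next-symbol distribution
```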
For assignment-type queries there has been considerable work in natural language processing with neural sequence models, particularly for the MAP problem of generating high-quality, high-probability sequences conditioned on sequence history or other conditioning information. A variety of heuristic decoding methods have been developed and found to be useful in practice, including beam search [Sutskever et al., 2014], best-first search [Xu and Durrett, 2021], sampling methods [Holtzman et al., 2019], and hybrid variants [Shaham and Levy, 2022]. However, for conditional probability queries with neural sequence models (the focus of this paper), to our knowledge there has been no prior work that addresses the problem in general. While decoding techniques such as beam search can also be useful for conditional probability queries, as we will see later in Section 4, such techniques have significant limitations in this setting: by definition they produce lower bounds on the probabilities of interest and are therefore biased estimators.
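As a rough illustration of this bias (our own sketch under simplifying assumptions, not the paper's method; `p_next` and `in_Q` are hypothetical stand-ins for the model's next-symbol distribution and a query's membership test), an exact query probability sums over all $V^K$ length-$K$ continuations, whereas a beam-search-style estimate sums only over the partial sequences that survive the beam and can therefore only under-count:

```python
from itertools import product

V, K = 3, 4                                   # vocabulary size and query horizon

def p_next(prefix):
    """Hypothetical stand-in for p_theta(. | H, prefix): a toy prefix-dependent distribution."""
    w = [(hash((prefix, v)) % 97) + 1 for v in range(V)]
    return [wi / sum(w) for wi in w]

def in_Q(path):
    """Example query set: symbol 0 appears at least once in the next K steps."""
    return 0 in path

def exact_query_prob():
    # Sums the probability of every length-K continuation that satisfies the query.
    total = 0.0
    for path in product(range(V), repeat=K):  # V^K paths
        prob = 1.0
        for i, x in enumerate(path):
            prob *= p_next(path[:i])[x]
        if in_Q(path):
            total += prob
    return total

def beam_query_prob(beam_size):
    # Keeps only the beam_size most probable partial sequences at each step, so the
    # final sum misses any qualifying path that fell off the beam -> a lower bound.
    beam = [((), 1.0)]
    for _ in range(K):
        expanded = [(path + (x,), prob * p_next(path)[x])
                    for path, prob in beam for x in range(V)]
        beam = sorted(expanded, key=lambda t: t[1], reverse=True)[:beam_size]
    return sum(prob for path, prob in beam if in_Q(path))

print(exact_query_prob(), beam_query_prob(beam_size=5))   # beam estimate <= exact value
```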
3 Probabilistic Queries
Notation
Let $X_{1:N} := [X_1, X_2, \ldots, X_N]$ be a sequence of random variables with arbitrary length $N$. Additionally, let $x_{1:N} := [x_1, x_2, \ldots, x_N]$ be their respective observed values, where each $x_i$ takes on values from a fixed vocabulary $\mathcal{V} := \{1, \ldots, V\}$. Examples of such sequences include sentences, where each letter or word is a single value, and streams of discrete events generated by some process or user. We will refer to individual variable-value pairs in the sequence as events.
We consider an autoregressive model $p_\theta(X_{1:N}) = \prod_{i=1}^{N} p_\theta(X_i \mid X_{1:i-1})$, parameterized by $\theta$ and trained on a given dataset of $M$ independent draws from a ground-truth distribution $P(X_{1:N})$. We assume that this model can be conditioned on a subsequence, termed the history $\mathcal{H}$. We will remap the indices of subsequent random variables to start at 1². We abbreviate conditioning on a history by an asterisk $*$, i.e., $P^*(\cdot) := P(\cdot \mid \mathcal{H})$ and $p^*_\theta(\cdot) := p_\theta(\cdot \mid \mathcal{H})$.
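As a minimal sketch of this setup in code (ours, not the paper's; `p_next` is a hypothetical stand-in for $p_\theta(X_i \mid X_{1:i-1})$), the chain-rule factorization and history conditioning amount to feeding the history before the queried values and accumulating per-step log-probabilities:

```python
import numpy as np

def p_next(context, V=3):
    """Hypothetical stand-in for p_theta(X_i = . | X_{1:i-1} = context)."""
    rng = np.random.default_rng(abs(hash(tuple(context))) % (2 ** 32))
    w = rng.random(V)
    return w / w.sum()

def log_prob(x, history=()):
    """log p*_theta(x_{1:K}) = sum_i log p_theta(x_i | H, x_{1:i-1})."""
    context = list(history)
    total = 0.0
    for xi in x:
        total += np.log(p_next(context)[xi])
        context.append(xi)
    return total

H = [2, 0, 1]                                   # observed history; later indices are remapped to start at 1
print(np.exp(log_prob([0, 2], history=H)))      # p*_theta(X_1 = 0, X_2 = 2)
```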
Defining Probabilistic Queries
Given a specific history of events $\mathcal{H}$, there are a variety of different questions one could ask about the future, continuing where the history left off: (Q1) What event is likely to happen next? (Q2) Which events are likely to occur $k > 1$ steps from now? (Q3) What is the distribution of when the next instance of $a \in \mathcal{V}$ occurs? (Q4) How likely is it that we will see event $a \in \mathcal{V}$ occur before $b \in \mathcal{V}$? (Q5) How likely is it for $a \in \mathcal{V}$ to occur $n$ times in the next $k$ steps?
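Each of these questions can be written as a set of admissible continuations built from per-step restricted domains, in the spirit of the construction in Figure 2. For instance (our reading, stated here informally; the general framework is defined next):
\[
\text{(Q2), event } a \text{ at step } k:\ \ \mathcal{Q} = \mathcal{V}^{k-1} \times \{a\}; \qquad
\text{(Q3), } \tau(a) = k:\ \ \mathcal{Q} = (\mathcal{V} \setminus \{a\})^{k-1} \times \{a\}; \qquad
\text{(Q4), } a \text{ before } b \text{ within } K \text{ steps}:\ \ \mathcal{Q} = \bigcup_{j=1}^{K} (\mathcal{V} \setminus \{a,b\})^{j-1} \times \{a\} \times \mathcal{V}^{K-j}.
\]
(Q5) is similar but is defined by a count constraint, $\mathcal{Q} = \{x_{1:k} : \sum_{i=1}^{k} \mathbf{1}[x_i = a] = n\}$, rather than a single product of domains.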
We place such questions in a common framework by defining probabilistic queries to be of the form $p^*_\theta(X_{1:K} \in \mathcal{Q})$ with $\mathcal{Q} \subset \mathcal{V}^K$. This can be extended to the infinite setting (i.e., $p^*_\theta([X_k]_k \in \mathcal{Q})$
²For example, if $|\mathcal{H}| = 3$ then $P(X_4 \mid \mathcal{H})$ is the distribution of the 7th value in a sequence after conditioning on the first 3 values in the sequence.