base model learns to propose plausible predictions; the energy function learns to reweight the
proposals. Although the proposals are generated autoregressively, the energy function reads each
entire completed sequence (i.e., true past events together with predicted future events) and learns to
assign higher weights to those which appear more realistic as a whole.
Hybrid models have already proven effective in natural language processing, for example in detecting machine-generated text (Bakhtin et al., 2019) and improving coherence in text generation (Deng et al., 2020). We are the first to develop a model of this kind for time-stamped event sequences. Our model can use any autoregressive event model as its base model, and we choose the state-of-the-art continuous-time Transformer architecture (Yang et al., 2022) as its energy function.
• A family of new training objectives. Our second contribution is a family of training objectives that can estimate the parameters of our proposed model with low computational cost. Our training methods are based on the principle of noise-contrastive estimation, since the log-likelihood of our HYPRO model involves an intractable normalizing constant (due to using energy functions).
• A new efficient inference method. Another contribution is a normalized importance sampling algorithm, which can efficiently draw the predictions of future events over a given time interval from a trained HYPRO model; a generic sketch of this propose-and-reweight recipe follows the list.
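To make the recipe above concrete, here is a minimal sketch of self-normalized importance sampling with an energy-based reweighting. It is only an illustration of the general idea, not the exact algorithm developed later in the paper: the names `sample_future` and `energy` are placeholders, and the weight convention $w \propto \exp(-E)$ is an assumed sign convention.

```python
import numpy as np

def reweighted_predictions(sample_future, energy, history, T, T_prime,
                           num_proposals=64, rng=None):
    """Propose futures with a base autoregressive sampler, then reweight them
    via self-normalized importance sampling with an energy function.

    sample_future : placeholder; draws one continuation of `history` (a list of
                    (time, type) pairs) over the interval (T, T']
    energy        : placeholder; maps a full sequence (past + future) to a scalar,
                    with lower energy assumed to mean "more realistic"
    """
    rng = np.random.default_rng() if rng is None else rng
    proposals = [sample_future(history, T, T_prime) for _ in range(num_proposals)]

    # Self-normalized weights: w_m proportional to exp(-E(past + future_m)).
    energies = np.array([energy(history + prop) for prop in proposals])
    logits = -(energies - energies.min())          # shift for numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()

    # Return the weighted sample set plus one future resampled in proportion to w.
    chosen = proposals[rng.choice(num_proposals, p=weights)]
    return proposals, weights, chosen
```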
2 Technical Background
2.1 Formulation: Generative Modeling of Event Sequences
We are given a fixed time interval $[0, T]$ over which an event sequence is observed. Suppose there are $I$ events in the sequence at times $0 < t_1 < \ldots < t_I \leq T$. We denote the sequence as $x_{[0,T]} = (t_1, k_1), \ldots, (t_I, k_I)$ where each $k_i \in \{1, \ldots, K\}$ is a discrete event type.
Generative models of event sequences are temporal point processes. They are \emph{autoregressive}: events are generated from left to right; the probability of $(t_i, k_i)$ depends on the history of events $x_{[0,t_i)} = (t_1, k_1), \ldots, (t_{i-1}, k_{i-1})$ that were drawn at times $< t_i$. They are \emph{locally normalized}: if we use $p_k(t \mid x_{[0,t)})$ to denote the probability that an event of type $k$ occurs over the infinitesimal interval $[t, t+dt)$, then the probability that nothing occurs will be $1 - \sum_{k=1}^{K} p_k(t \mid x_{[0,t)})$. Specifically, temporal point processes define functions $\lambda_k$ that determine a finite intensity $\lambda_k(t \mid x_{[0,t)}) \geq 0$ for each event type $k$ at each time $t > 0$ such that $p_k(t \mid x_{[0,t)}) = \lambda_k(t \mid x_{[0,t)})\,dt$. Then the log-likelihood of a temporal point process given the entire event sequence $x_{[0,T]}$ is
$$\sum_{i=1}^{I} \log \lambda_{k_i}(t_i \mid x_{[0,t_i)}) - \int_{t=0}^{T} \sum_{k=1}^{K} \lambda_k(t \mid x_{[0,t)})\,dt \tag{1}$$
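For concreteness, the sketch below evaluates Eq. (1) for a single sequence, assuming a user-supplied intensity function. The integral term is approximated by Monte Carlo over uniformly drawn times, which is one common choice; closed-form or quadrature alternatives exist depending on the parameterization.

```python
import numpy as np

def log_likelihood(events, intensity, T, K, num_mc=100, rng=None):
    """Evaluate Eq. (1) for one observed sequence.

    events    : list of (t_i, k_i) pairs with 0 < t_1 < ... < t_I <= T
    intensity : intensity(k, t, history) -> lambda_k(t | x_[0,t)) >= 0
    T         : right end of the observation interval [0, T]
    K         : number of event types
    num_mc    : Monte Carlo points used to approximate the integral term
    """
    rng = np.random.default_rng() if rng is None else rng

    # First term: sum of log-intensities at the observed events.
    log_term = 0.0
    for i, (t_i, k_i) in enumerate(events):
        history = events[:i]                                # x_[0, t_i)
        log_term += np.log(intensity(k_i, t_i, history))

    # Second term: integral of the total intensity over [0, T],
    # approximated by Monte Carlo with uniformly drawn times.
    total = 0.0
    for t in rng.uniform(0.0, T, size=num_mc):
        history = [(s, k) for (s, k) in events if s < t]    # x_[0, t)
        total += sum(intensity(k, t, history) for k in range(1, K + 1))
    integral_term = T * total / num_mc

    return log_term - integral_term
```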
Popular examples of temporal point processes include Poisson processes (Daley & Vere-Jones, 2007)
as well as Hawkes processes (Hawkes, 1971) and their modern neural versions (Du et al., 2016; Mei
& Eisner, 2017; Zuo et al., 2020; Zhang et al., 2020; Yang et al., 2022).
2.2 Task and Challenge: Long-Horizon Prediction and Cascading Errors
We are interested in predicting the future events over an extended time interval $(T, T']$. We call this task \emph{long-horizon prediction} as the boundary $T'$ is so large that (with high probability) many events will happen over $(T, T']$. A principled way to solve this task works as follows: we draw many possible future event sequences over the interval $(T, T']$, and then use this empirical distribution to answer questions such as ``how many events of type $k=3$ will happen over that interval''.
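As an illustration of this empirical-distribution approach, the sketch below estimates the expected number of type-$k$ events over $(T, T']$ from sampled futures. Here `sample_future` is a placeholder for whatever sampler the model provides; one possible instantiation appears in the next sketch.

```python
import numpy as np

def expected_count(sample_future, history, T, T_prime, k_query=3, num_samples=1000):
    """Answer "how many events of type k_query will happen over (T, T']?"
    by averaging over sampled continuations of the observed history."""
    counts = []
    for _ in range(num_samples):
        future = sample_future(history, T, T_prime)         # one possible future
        counts.append(sum(1 for (_, k) in future if k == k_query))
    return np.mean(counts)   # the same samples also give quantiles, histograms, etc.
```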
A serious technical issue arises when we draw each possible future sequence. To draw an event sequence from an autoregressive model, we have to repeatedly draw the next event, append it to the history, and then continue to draw the next event conditioned on the new history. This process is prone to \emph{cascading errors}: any error in a drawn event is likely to cause all the subsequent draws to differ from what they should be, and such errors will accumulate.
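For reference, one standard way to implement such an autoregressive rollout (and a possible instantiation of the `sample_future` placeholder above) is the thinning algorithm. The sketch below assumes a user-supplied upper bound on the total intensity given the current history; in practice the extra arguments would be bound to a trained model, e.g. with `functools.partial`.

```python
import numpy as np

def sample_future(history, T, T_prime, intensity, intensity_bound, K, rng=None):
    """Autoregressive rollout over (T, T'] by thinning.

    intensity       : intensity(k, t, events) -> lambda_k(t | x_[0,t))
    intensity_bound : intensity_bound(events) -> upper bound on the total
                      intensity until the next accepted event
    """
    rng = np.random.default_rng() if rng is None else rng
    events = list(history)       # grows as we append each drawn event
    future, t = [], T
    while True:
        lam_bar = intensity_bound(events)                  # dominating rate
        t = t + rng.exponential(1.0 / lam_bar)             # propose a candidate time
        if t > T_prime:
            break
        lams = np.array([intensity(k, t, events) for k in range(1, K + 1)])
        if rng.uniform() < lams.sum() / lam_bar:           # accept w.p. total / lam_bar
            k = 1 + rng.choice(K, p=lams / lams.sum())     # draw the event type
            events.append((t, k))                          # condition on the new history
            future.append((t, k))
    return future
```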
2.3 Globally Normalized Models: Hope and Difficulties
An ideal fix of this issue is to develop a \emph{globally normalized model} for event sequences. For any time interval $[0, T]$, such a model will give a probability distribution that is normalized over all the possible full sequences on $[0, T]$ rather than over all the possible instantaneous subsequences within each $(t, t+dt)$. Technically, a globally normalized model assigns to each sequence $x_{[0,T]}$ a