HYPRO: A Hybridly Normalized Probabilistic Model
for Long-Horizon Prediction of Event Sequences
Siqiao Xue, Xiaoming Shi, James Y Zhang
Ant Group
569 Xixi Road
Hangzhou, China
siqiao.xsq@alibaba-inc.com
{peter.sxm,james.z}@antgroup.com
Hongyuan Mei
Toyota Technological Institute at Chicago
6045 S Kenwood Ave
Chicago, IL 60637
hongyuan@ttic.edu
Abstract
In this paper, we tackle the important yet under-investigated problem of making
long-horizon prediction of event sequences. Existing state-of-the-art models do
not perform well at this task due to their autoregressive structure. We propose
HYPRO, a hybridly normalized probabilistic model that naturally fits this task:
its first part is an autoregressive base model that learns to propose predictions;
its second part is an energy function that learns to reweight the proposals such
that more realistic predictions end up with higher probabilities. We also propose
efficient training and inference algorithms for this model. Experiments on multiple
real-world datasets demonstrate that our proposed HYPRO model can significantly
outperform previous models at making long-horizon predictions of future events.
We also conduct a range of ablation studies to investigate the effectiveness of each
component of our proposed methods.
1 Introduction
Long-horizon prediction of event sequences is essential in various real-world applied domains:
Healthcare. Given a patient’s symptoms and treatments so far, we would be interested in predicting
their future health conditions over the next several months, including their prognosis and treatment.
Commercial. Given an online consumer’s previous purchases and reviews, we may be interested in
predicting what they would buy over the next several weeks and plan our advertisement accordingly.
Urban planning. Having monitored the traffic flow of a town for the past few days, we’d like to
predict its future traffic over the next few hours, which would be useful for congestion management.
Similar scenarios arise in computer systems, finance, dialogue, music, etc.
Despite its importance, this task has been under-investigated: previous work in this research
area has mostly focused on predicting the next single event (e.g., its time and type).
In this paper, we show that previous state-of-the-art models suffer at making long-horizon predictions,
i.e., predicting the series of future events over a given time interval. That is because those models
are all autoregressive: predicting each future event is conditioned on all the previously predicted
events; an error cannot be corrected after it is made, and it will cascade through all the
subsequent predictions. Problems of the same kind also exist in natural language processing tasks
such as generation and machine translation (Ranzato et al., 2016; Goyal, 2021).
In this paper, we propose a novel modeling framework that learns to make long-horizon prediction of
event sequences. Our main technical contributions include:
A new model. The key component of our framework is HYPRO, a hybridly normalized neural probabilistic model that combines an autoregressive base model with an energy function: the
base model learns to propose plausible predictions; the energy function learns to reweight the
proposals. Although the proposals are generated autoregressively, the energy function reads each
entire completed sequence (i.e., true past events together with predicted future events) and learns to
assign higher weights to those which appear more realistic as a whole.
Hybrid models have already proven effective in natural language processing, for tasks such as detecting
machine-generated text (Bakhtin et al., 2019) and improving coherency in text generation (Deng
et al., 2020). We are the first to develop a model of this kind for time-stamped event sequences. Our
model can use any autoregressive event model as its base model, and we choose the state-of-the-art
continuous-time Transformer architecture (Yang et al., 2022) as its energy function.
A family of new training objectives. Our second contribution is a family of training objectives with
which the parameters of our proposed model can be estimated at low computational cost. Our training
methods are based on the principle of noise-contrastive estimation, since the log-likelihood of our
HYPRO model involves an intractable normalizing constant (due to the use of energy functions).
A new efficient inference method. Another contribution is a normalized importance sampling
algorithm, which can efficiently draw the predictions of future events over a given time interval
from a trained HYPRO model.
2 Technical Background
2.1 Formulation: Generative Modeling of Event Sequences
We are given a fixed time interval $[0, T]$ over which an event sequence is observed. Suppose there are $I$ events in the sequence at times $0 < t_1 < \ldots < t_I \leq T$. We denote the sequence as $x_{[0,T]} = (t_1, k_1), \ldots, (t_I, k_I)$ where each $k_i \in \{1, \ldots, K\}$ is a discrete event type.
Generative models of event sequences are temporal point processes. They are autoregressive: events are generated from left to right; the probability of $(t_i, k_i)$ depends on the history of events $x_{[0,t_i)} = (t_1, k_1), \ldots, (t_{i-1}, k_{i-1})$ that were drawn at times $< t_i$. They are locally normalized: if we use $p_k(t \mid x_{[0,t)})$ to denote the probability that an event of type $k$ occurs over the infinitesimal interval $[t, t+dt)$, then the probability that nothing occurs will be $1 - \sum_{k=1}^{K} p_k(t \mid x_{[0,t)})$. Specifically, temporal point processes define functions $\lambda_k$ that determine a finite intensity $\lambda_k(t \mid x_{[0,t)}) \geq 0$ for each event type $k$ at each time $t > 0$ such that $p_k(t \mid x_{[0,t)}) = \lambda_k(t \mid x_{[0,t)})\,dt$. Then the log-likelihood of a temporal point process given the entire event sequence $x_{[0,T]}$ is

$$\sum_{i=1}^{I} \log \lambda_{k_i}(t_i \mid x_{[0,t_i)}) - \int_{0}^{T} \sum_{k=1}^{K} \lambda_k(t \mid x_{[0,t)})\,dt \qquad (1)$$
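To make equation (1) concrete, here is a minimal Python sketch (ours, for illustration only) that evaluates this log-likelihood under a toy exponentially-decaying intensity; the function `intensity` is a hypothetical stand-in for any model's $\lambda_k$, and the integral term is approximated on a time grid.

```python
import numpy as np

def intensity(k, t, history, mu=0.1, alpha=0.8, decay=1.0):
    """Toy intensity lambda_k(t | history): a base rate plus exponentially
    decaying excitation from past events (same form for every type k)."""
    lam = mu
    for (t_i, k_i) in history:
        if t_i < t:
            lam += alpha * np.exp(-decay * (t - t_i))
    return lam

def log_likelihood(seq, T, K, n_grid=1000):
    """Equation (1): sum of log-intensities at the observed events minus
    the integral of the total intensity over [0, T] (trapezoidal rule)."""
    ll = sum(np.log(intensity(k_i, t_i, seq[:i]))
             for i, (t_i, k_i) in enumerate(seq))
    grid = np.linspace(0.0, T, n_grid)
    total = [sum(intensity(k, t, seq) for k in range(1, K + 1)) for t in grid]
    return ll - np.trapz(total, grid)

seq = [(0.7, 1), (1.9, 2), (3.2, 1)]  # (time, type) pairs with types in {1, 2}
print(log_likelihood(seq, T=5.0, K=2))
```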
Popular examples of temporal point processes include Poisson processes (Daley & Vere-Jones, 2007)
as well as Hawkes processes (Hawkes, 1971) and their modern neural versions (Du et al., 2016; Mei
& Eisner, 2017; Zuo et al., 2020; Zhang et al., 2020; Yang et al., 2022).
2.2 Task and Challenge: Long-Horizon Prediction and Cascading Errors
We are interested in predicting the future events over an extended time interval $(T, T']$. We call this task long-horizon prediction as the boundary $T'$ is so large that (with a high probability) many events will happen over $(T, T']$. A principled way to solve this task works as follows: we draw many possible future event sequences over the interval $(T, T']$, and then use this empirical distribution to answer questions such as “how many events of type $k = 3$ will happen over that interval”.
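The counting query above reduces to simple Monte Carlo averaging over drawn futures; a minimal sketch, where `sample_future` is a hypothetical stand-in for any routine that draws one future sequence over $(T, T']$ from the model:

```python
import numpy as np

def expected_count(sample_future, history, T, T_prime, k_query=3, n_draws=1000):
    """Monte Carlo estimate of E[#events of type k_query in (T, T']]:
    draw many futures, count the matching events in each, and average."""
    counts = []
    for _ in range(n_draws):
        future = sample_future(history, T, T_prime)  # list of (time, type)
        counts.append(sum(1 for (t, k) in future if k == k_query))
    return np.mean(counts)
```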
A serious technical issue arises when we draw each possible future sequence. To draw an event
sequence from an autoregressive model, we have to repeatedly draw the next event, append it to the
history, and then continue to draw the next event conditioned on the new history. This process is
prone to cascading errors: any error in a drawn event is likely to cause all the subsequent draws to differ from what they should be, and such errors will accumulate.
2.3 Globally Normalized Models: Hope and Difficulties
An ideal fix of this issue is to develop a globally normalized model for event sequences. For any time interval $[0, T]$, such a model will give a probability distribution that is normalized over all the possible full sequences on $[0, T]$ rather than over all the possible instantaneous subsequences within each $(t, t+dt)$. Technically, a globally normalized model assigns to each sequence $x_{[0,T]}$ a score $\exp(-E(x_{[0,T]}))$ where $E$ is called the energy function; the normalized probability of $x_{[0,T]}$ is proportional to its score: i.e., $p(x_{[0,T]}) \propto \exp(-E(x_{[0,T]}))$.
Had we trained such a globally normalized model, we would wish to enumerate all the possible $x_{(T,T']}$ for a given $x_{[0,T]}$ and select those which give the highest model probabilities $p(x_{[0,T']})$. Predictions made this way would not suffer from cascading errors: the entire $x_{[0,T']}$ is jointly selected, and thus the overall compatibility between the events has been considered.
However, training such a globally normalized probabilistic model involves computing the normalizing constant $\sum \exp(-E(x_{[0,T]}))$, where the summation $\sum$ is taken over all the possible sequences; this is intractable since there are infinitely many sequences. What's worse, it is also intractable to exactly sample from such a model; approximate sampling is tractable but expensive.
3 HYPRO: A Hybridly Normalized Neural Probabilistic Model
We propose HYPRO, a hybridly normalized neural probabilistic model that combines a temporal
point process and an energy function: it enjoys both the efficiency of autoregressive models and the
capacity of globally normalized models. Our model normalizes over (sub)sequences: for any given interval $[0, T]$ and its extension $(T, T']$ of interest, the model probability of the sequence $x_{(T,T']}$ is

$$p_{\text{HYPRO}}\big(x_{(T,T']} \mid x_{[0,T]}\big) = \frac{p_{\text{auto}}\big(x_{(T,T']} \mid x_{[0,T]}\big)\,\exp\big(-E_\theta(x_{[0,T']})\big)}{Z_\theta(x_{[0,T]})} \qquad (2)$$

where $p_{\text{auto}}$ is the probability under the chosen temporal point process and $E_\theta$ is an energy function with parameters $\theta$. The normalizing constant sums over all the possible continuations $x_{(T,T']}$ for a given prefix $x_{[0,T]}$: $Z_\theta(x_{[0,T]}) \stackrel{\text{def}}{=} \sum_{x_{(T,T']}} p_{\text{auto}}(x_{(T,T']} \mid x_{[0,T]})\,\exp(-E_\theta(x_{[0,T']}))$.
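Although $Z_\theta$ is intractable, the numerator of equation (2) is cheap to evaluate, which is what makes comparing candidate continuations possible; a minimal sketch with hypothetical callables `log_p_auto` and `energy` standing in for $\log p_{\text{auto}}$ and $E_\theta$:

```python
def unnormalized_log_score(log_p_auto, energy, prefix, continuation):
    """Log of the numerator of equation (2):
    log p_auto(continuation | prefix) - E_theta(prefix + continuation).
    The intractable log Z_theta(prefix) is a constant per prefix, so these
    scores suffice for ranking candidate continuations of the same prefix."""
    completed = prefix + continuation  # lists of (time, type) pairs
    return log_p_auto(continuation, prefix) - energy(completed)
```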
The key advantage of our model over autoregressive models is that the energy function $E_\theta$ is able to pick up the global features that may have been missed by the autoregressive base model $p_{\text{auto}}$; intuitively, the energy function fits the residuals that are not captured by the autoregressive model.
Our model is general: in principle, $p_{\text{auto}}$ can be any autoregressive model, including those mentioned in section 2.1, and $E_\theta$ can be any function that is able to encode an event sequence to a real number. In section 5, we will introduce a couple of specific $p_{\text{auto}}$ and $E_\theta$ and experiment with them.
In this section, we focus on the training method and inference algorithm.
3.1 Training Objectives
Training our full model $p_{\text{HYPRO}}$ means learning the parameters of the autoregressive model $p_{\text{auto}}$ as well as those of the energy function $E_\theta$. Maximum likelihood estimation (MLE) is undesirable: the objective would be

$$\log p_{\text{HYPRO}}\big(x_{(T,T']} \mid x_{[0,T]}\big) = \log p_{\text{auto}}\big(x_{(T,T']} \mid x_{[0,T]}\big) - E_\theta(x_{[0,T']}) - \log Z_\theta(x_{[0,T]})$$

where the normalizing constant $Z_\theta(x_{[0,T]})$ is known to be uncomputable and inapproximable for a large variety of reasonably expressive functions $E_\theta$ (Lin & McCarthy, 2022).
We propose a training method that works around this normalizing constant. We first train $p_{\text{auto}}$ just like how previous work trained temporal point processes: this can be done by either maximum likelihood estimation (see Daley & Vere-Jones, 2007) or noise-contrastive estimation (see Mei et al., 2020b, which also has an in-depth discussion of the theoretical connections between these two parameter estimation principles). Then we use the trained $p_{\text{auto}}$ as a noise distribution and learn the parameters $\theta$ of $E_\theta$ by noise-contrastive estimation (NCE). Precisely, we sample $N$ noise sequences $x^{(1)}_{(T,T']}, \ldots, x^{(N)}_{(T,T']}$, compute the “energy” $E_\theta(x^{(n)}_{[0,T']})$ for each completed sequence $x^{(n)}_{[0,T']}$, and then plug those energies into one of the following training objectives. Note that all the completed sequences $x^{(n)}_{[0,T']}$ share the same observed prefix $x_{[0,T]}$.
Binary-NCE Objective. We train a binary classifier based on the energy function $E_\theta$ to discriminate the true event sequence—denoted as $x^{(0)}_{[0,T']}$—against the noise sequences by maximizing

$$J_{\text{binary}} = \log \sigma\big(-E_\theta(x^{(0)}_{[0,T']})\big) + \sum_{n=1}^{N} \log \sigma\big(E_\theta(x^{(n)}_{[0,T']})\big) \qquad (3)$$
where $\sigma(u) = \frac{1}{1+\exp(-u)}$ is the sigmoid function. By maximizing this objective, we are essentially pushing our energy function $E_\theta$ such that the observed sequences have low energy but the noise sequences have high energy. As a result, the observed sequences will be more probable under our full model $p_{\text{HYPRO}}$ while the noise sequences will be less probable: see equation (2).
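A minimal numpy sketch of objective (3) (our illustration; the vector `E` is assumed to hold the energy of the observed completion at index 0 and of the $N$ noise completions afterwards):

```python
import numpy as np

def log_sigmoid(u):
    # numerically stable log sigma(u) = -log(1 + exp(-u))
    return -np.logaddexp(0.0, -u)

def j_binary(E):
    """Equation (3): reward low energy on the observed sequence E[0]
    and high energy on the noise sequences E[1:]."""
    return log_sigmoid(-E[0]) + np.sum(log_sigmoid(E[1:]))

E = np.array([0.5, 2.1, 1.7, 3.0])  # energies of [observed, noise_1, ...]
print(j_binary(E))  # maximized during training (gradient ascent w.r.t. theta)
```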
Theoretical guarantees of general Binary-NCE can be found in Gutmann & Hyvärinen (2010).
For general conditional probability models like ours, Binary-NCE implicitly assumes self-normalization (Mnih & Teh, 2012; Ma & Collins, 2018): i.e., $Z_\theta(x_{[0,T]}) = 1$ is satisfied.
This type of training objective has been used to train a hybridly normalized text generation model by
Deng et al. (2020); see section 4 for more discussion about its relations with our work.
Multi-NCE Objective. Another option is to use the Multi-NCE objective (it was named Ranking-NCE by Ma & Collins (2018), but we think Multi-NCE is a more appropriate name since it constructs a multi-class classifier over one correct answer and multiple incorrect answers), which means we maximize

$$J_{\text{multi}} = -E_\theta(x^{(0)}_{[0,T']}) - \log \sum_{n=0}^{N} \exp\big(-E_\theta(x^{(n)}_{[0,T']})\big) \qquad (4)$$
By maximizing this objective, we are pushing our energy function $E_\theta$ such that each observed sequence has relatively lower energy than the noise sequences sharing the same observed prefix. In contrast, $J_{\text{binary}}$ attempts to make energies absolutely low (for observed data) or high (for noise data) without considering whether they share prefixes. This effect is analyzed in Analysis-III of section 5.2.
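Since objective (4) is the log-softmax probability of the observed completion among all $N+1$ completions sharing the prefix, it can be sketched with a stabilized log-sum-exp (same assumed layout of `E` as above):

```python
import numpy as np

def j_multi(E):
    """Equation (4): -E[0] - logsumexp(-E), i.e., the log probability of
    the observed completion under a softmax over all N+1 completions.
    Z_theta cancels, so no self-normalization assumption is needed."""
    neg_E = -np.asarray(E, dtype=float)
    m = neg_E.max()  # shift for numerical stability
    return neg_E[0] - (m + np.log(np.sum(np.exp(neg_E - m))))

print(j_multi([0.5, 2.1, 1.7, 3.0]))
```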
This $J_{\text{multi}}$ objective also enjoys better statistical properties than $J_{\text{binary}}$ since it doesn't assume self-normalization: the normalizing constant $Z_\theta(x_{[0,T]})$ is neatly cancelled out in its derivation; see Appendix A.1 for a full derivation of both Binary-NCE and Multi-NCE.
Theoretical guarantees of Multi-NCE for discrete-time models were established by Ma & Collins
(2018); Mei et al. (2020b) generalized them to temporal point processes.
Considering Distances Between Sequences.
Previous work (LeCun et al., 2006; Bakhtin et al.,
2019) reported that energy functions may be better learned if the distances between samples are
considered. This has inspired us to design a regularization term that enforces such consideration.
Suppose that we can measure a well-defined “distance” between the true sequence $x^{(0)}_{[0,T']}$ and any noise sequence $x^{(n)}_{[0,T']}$; we denote it as $d^{(n)}$. We encourage the energy of each noise sequence to be higher than that of the observed sequence by a margin; that is, we propose the following regularization:

$$\Omega = \sum_{n=1}^{N} \max\Big(0,\; \beta\,d^{(n)} + E_\theta(x^{(0)}_{[0,T']}) - E_\theta(x^{(n)}_{[0,T']})\Big) \qquad (5)$$
where $\beta > 0$ is a hyperparameter that we tune on the held-out development data. With this regularization, the energies of the sequences with larger distances will be pulled farther apart: this will help discriminate not only between the observed sequence and the noise sequences, but also between the noise sequences themselves, thus making the energy function $E_\theta$ more informed.
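A minimal sketch of the regularizer (5) (our illustration; `d` is assumed to hold the distances $d^{(n)}$ from the observed sequence to each noise sequence, and `E` follows the same layout as above):

```python
import numpy as np

def omega(E, d, beta):
    """Equation (5): hinge penalty that pushes each noise energy E[n] to
    exceed the observed energy E[0] by a margin beta * d[n], so noise
    sequences farther from the truth get pushed further away."""
    margins = beta * np.asarray(d) + E[0] - np.asarray(E[1:])
    return np.sum(np.maximum(0.0, margins))

E = np.array([0.5, 2.1, 1.7])  # [observed, noise_1, noise_2]
d = np.array([1.0, 3.5])       # distances d^(1), d^(2)
print(omega(E, d, beta=0.4))
```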
This method is general, so the distance $d$ can be any appropriately defined metric. In section 5, we will experiment with an optimal transport distance specifically designed for event sequences. Note that the distance $d$ in the regularization may be the final test metric; in that case, our method is directly optimizing for the final evaluation score.
Generating Noise Sequences. Generating event sequences from an autoregressive temporal point process has been well studied in previous literature. The standard way is to call the thinning algorithm (Lewis & Shedler, 1979; Liniger, 2009). The full recipe for our setting is in Algorithm 1.
3.2 Inference Algorithm
Inference involves drawing future sequences $x_{(T,T']}$ from the trained full model $p_{\text{HYPRO}}$; due to the uncomputability of the normalizing constant $Z_\theta(x_{[0,T]})$, exact sampling is intractable.

We propose a normalized importance sampling method to approximately draw $x_{(T,T']}$ from $p_{\text{HYPRO}}$; it is shown in Algorithm 2. We first use the trained $p_{\text{auto}}$ as our proposal distribution and call the thinning algorithm (Algorithm 1) to draw proposals $x^{\langle 1\rangle}_{(T,T']}, \ldots, x^{\langle M\rangle}_{(T,T']}$. Then we weight each proposal in proportion to $\exp(-E_\theta(x^{\langle m\rangle}_{[0,T']}))$, normalize the weights across the $M$ proposals, and draw the final predictions from this reweighted empirical distribution.
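A minimal sketch of this reweighting step (our reading of the recipe above, not a reproduction of Algorithm 2; `draw_proposal` stands in for one call to the thinning algorithm under $p_{\text{auto}}$, and `energy` for $E_\theta$):

```python
import numpy as np

def sample_hypro(draw_proposal, energy, prefix, M, rng=np.random):
    """Normalized importance sampling: draw M proposals from p_auto, weight
    each completed sequence by exp(-E_theta) (the target/proposal ratio up
    to the constant Z), normalize, and draw one proposal accordingly."""
    proposals = [draw_proposal(prefix) for _ in range(M)]
    neg_E = np.array([-energy(prefix + x) for x in proposals])
    w = np.exp(neg_E - neg_E.max())  # stabilize before normalizing
    w /= w.sum()
    return proposals[rng.choice(M, p=w)]
```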