base model learns to propose plausible predictions; the energy function learns to reweight the
proposals. Although the proposals are generated autoregressively, the energy function reads each
entire completed sequence (i.e., true past events together with predicted future events) and learns to
assign higher weights to those which appear more realistic as a whole.
Hybrid models have already proven effective in natural language processing, for example in detecting machine-generated text (Bakhtin et al., 2019) and improving coherence in text generation (Deng et al., 2020). We are the first to develop a model of this kind for time-stamped event sequences. Our model can use any autoregressive event model as its base model, and we choose the state-of-the-art continuous-time Transformer architecture (Yang et al., 2022) as its energy function.
• A family of new training objectives. Our second contribution is a family of training objectives that can estimate the parameters of our proposed model with low computational cost. Our training methods are based on the principle of noise-contrastive estimation, since the log-likelihood of our HYPRO model involves an intractable normalizing constant (due to using energy functions).
• A new efficient inference method. Another contribution is a normalized importance sampling algorithm, which can efficiently draw the predictions of future events over a given time interval from a trained HYPRO model; a generic sketch of this propose-and-reweight recipe follows the list.
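To make the recipe above concrete, here is a minimal sketch of self-normalized importance sampling with an energy-based reweighting. It is only an illustration of the general idea, not the exact algorithm developed later in the paper: the names `sample_future` and `energy` are placeholders, and the weight convention $w \propto \exp(-E)$ is an assumed sign convention.

```python
import numpy as np

def reweighted_predictions(sample_future, energy, history, T, T_prime,
                           num_proposals=64, rng=None):
    """Propose futures with a base autoregressive sampler, then reweight them
    via self-normalized importance sampling with an energy function.

    sample_future : placeholder; draws one continuation of `history` (a list of
                    (time, type) pairs) over the interval (T, T']
    energy        : placeholder; maps a full sequence (past + future) to a scalar,
                    with lower energy assumed to mean "more realistic"
    """
    rng = np.random.default_rng() if rng is None else rng
    proposals = [sample_future(history, T, T_prime) for _ in range(num_proposals)]

    # Self-normalized weights: w_m proportional to exp(-E(past + future_m)).
    energies = np.array([energy(history + prop) for prop in proposals])
    logits = -(energies - energies.min())          # shift for numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()

    # Return the weighted sample set plus one future resampled in proportion to w.
    chosen = proposals[rng.choice(num_proposals, p=weights)]
    return proposals, weights, chosen
```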
2 Technical Background
2.1 Formulation: Generative Modeling of Event Sequences
We are given a fixed time interval $[0, T]$ over which an event sequence is observed. Suppose there are $I$ events in the sequence at times $0 < t_1 < \ldots < t_I \leq T$. We denote the sequence as $x_{[0,T]} = (t_1, k_1), \ldots, (t_I, k_I)$ where each $k_i \in \{1, \ldots, K\}$ is a discrete event type.
Generative models of event sequences are temporal point processes. They are \emph{autoregressive}: events are generated from left to right; the probability of $(t_i, k_i)$ depends on the history of events $x_{[0,t_i)} = (t_1, k_1), \ldots, (t_{i-1}, k_{i-1})$ that were drawn at times $< t_i$. They are \emph{locally normalized}: if we use $p_k(t \mid x_{[0,t)})$ to denote the probability that an event of type $k$ occurs over the infinitesimal interval $[t, t+dt)$, then the probability that nothing occurs will be $1 - \sum_{k=1}^{K} p_k(t \mid x_{[0,t)})$. Specifically, temporal point processes define functions $\lambda_k$ that determine a finite intensity $\lambda_k(t \mid x_{[0,t)}) \geq 0$ for each event type $k$ at each time $t > 0$ such that $p_k(t \mid x_{[0,t)}) = \lambda_k(t \mid x_{[0,t)})\,dt$. Then the log-likelihood of a temporal point process given the entire event sequence $x_{[0,T]}$ is
$$\sum_{i=1}^{I} \log \lambda_{k_i}(t_i \mid x_{[0,t_i)}) - \int_{t=0}^{T} \sum_{k=1}^{K} \lambda_k(t \mid x_{[0,t)})\,dt \tag{1}$$
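For concreteness, the sketch below evaluates Eq. (1) for a single sequence, assuming a user-supplied intensity function. The integral term is approximated by Monte Carlo over uniformly drawn times, which is one common choice; closed-form or quadrature alternatives exist depending on the parameterization.

```python
import numpy as np

def log_likelihood(events, intensity, T, K, num_mc=100, rng=None):
    """Evaluate Eq. (1) for one observed sequence.

    events    : list of (t_i, k_i) pairs with 0 < t_1 < ... < t_I <= T
    intensity : intensity(k, t, history) -> lambda_k(t | x_[0,t)) >= 0
    T         : right end of the observation interval [0, T]
    K         : number of event types
    num_mc    : Monte Carlo points used to approximate the integral term
    """
    rng = np.random.default_rng() if rng is None else rng

    # First term: sum of log-intensities at the observed events.
    log_term = 0.0
    for i, (t_i, k_i) in enumerate(events):
        history = events[:i]                                # x_[0, t_i)
        log_term += np.log(intensity(k_i, t_i, history))

    # Second term: integral of the total intensity over [0, T],
    # approximated by Monte Carlo with uniformly drawn times.
    total = 0.0
    for t in rng.uniform(0.0, T, size=num_mc):
        history = [(s, k) for (s, k) in events if s < t]    # x_[0, t)
        total += sum(intensity(k, t, history) for k in range(1, K + 1))
    integral_term = T * total / num_mc

    return log_term - integral_term
```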
Popular examples of temporal point processes include Poisson processes (Daley & Vere-Jones, 2007)
as well as Hawkes processes (Hawkes, 1971) and their modern neural versions (Du et al., 2016; Mei
& Eisner, 2017; Zuo et al., 2020; Zhang et al., 2020; Yang et al., 2022).
2.2 Task and Challenge: Long-Horizon Prediction and Cascading Errors
We are interested in predicting the future events over an extended time interval $(T, T']$. We call this task \emph{long-horizon prediction} as the boundary $T'$ is so large that (with high probability) many events will happen over $(T, T']$. A principled way to solve this task works as follows: we draw many possible future event sequences over the interval $(T, T']$, and then use this empirical distribution to answer questions such as ``how many events of type $k=3$ will happen over that interval''.
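As an illustration of this empirical-distribution approach, the sketch below estimates the expected number of type-$k$ events over $(T, T']$ from sampled futures. Here `sample_future` is a placeholder for whatever sampler the model provides; one possible instantiation appears in the next sketch.

```python
import numpy as np

def expected_count(sample_future, history, T, T_prime, k_query=3, num_samples=1000):
    """Answer "how many events of type k_query will happen over (T, T']?"
    by averaging over sampled continuations of the observed history."""
    counts = []
    for _ in range(num_samples):
        future = sample_future(history, T, T_prime)         # one possible future
        counts.append(sum(1 for (_, k) in future if k == k_query))
    return np.mean(counts)   # the same samples also give quantiles, histograms, etc.
```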
A serious technical issue arises when we draw each possible future sequence. To draw an event sequence from an autoregressive model, we have to repeatedly draw the next event, append it to the history, and then continue to draw the next event conditioned on the new history. This process is prone to \emph{cascading errors}: any error in a drawn event is likely to cause all the subsequent draws to differ from what they should be, and such errors will accumulate.
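For reference, one standard way to implement such an autoregressive rollout (and a possible instantiation of the `sample_future` placeholder above) is the thinning algorithm. The sketch below assumes a user-supplied upper bound on the total intensity given the current history; in practice the extra arguments would be bound to a trained model, e.g. with `functools.partial`.

```python
import numpy as np

def sample_future(history, T, T_prime, intensity, intensity_bound, K, rng=None):
    """Autoregressive rollout over (T, T'] by thinning.

    intensity       : intensity(k, t, events) -> lambda_k(t | x_[0,t))
    intensity_bound : intensity_bound(events) -> upper bound on the total
                      intensity until the next accepted event
    """
    rng = np.random.default_rng() if rng is None else rng
    events = list(history)       # grows as we append each drawn event
    future, t = [], T
    while True:
        lam_bar = intensity_bound(events)                  # dominating rate
        t = t + rng.exponential(1.0 / lam_bar)             # propose a candidate time
        if t > T_prime:
            break
        lams = np.array([intensity(k, t, events) for k in range(1, K + 1)])
        if rng.uniform() < lams.sum() / lam_bar:           # accept w.p. total / lam_bar
            k = 1 + rng.choice(K, p=lams / lams.sum())     # draw the event type
            events.append((t, k))                          # condition on the new history
            future.append((t, k))
    return future
```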
2.3 Globally Normalized Models: Hope and Difficulties
An ideal fix of this issue is to develop a \emph{globally normalized model} for event sequences. For any time interval $[0, T]$, such a model will give a probability distribution that is normalized over all the possible full sequences on $[0, T]$ rather than over all the possible instantaneous subsequences within each $(t, t+dt)$. Technically, a globally normalized model assigns to each sequence $x_{[0,T]}$ a