WaveBound: Dynamic Error Bounds for Stable Time
Series Forecasting
Youngin Cho* Daejin Kim* Dongmin Kim Mohammad Azam Khan Jaegul Choo
KAIST AI
{choyi0521,kiddj,tommy.dm.kim,azamkhan,jchoo}@kaist.ac.kr
Abstract
Time series forecasting has become a critical task due to its high practicality in
real-world applications such as traffic, energy consumption, economics and fi-
nance, and disease analysis. Recent deep-learning-based approaches have shown
remarkable success in time series forecasting. Nonetheless, due to the dynamics of
time series data, deep networks still suffer from unstable training and overfitting.
Inconsistent patterns appearing in real-world data bias the model toward a
particular pattern, thus limiting generalization. In this work, we introduce
dynamic error bounds on the training loss to address the overfitting issue in time series
forecasting. Specifically, we propose a regularization method called WaveBound
which estimates an adequate error bound of the training loss for each time step and
feature at each iteration. By allowing the model to focus less on unpredictable data,
WaveBound stabilizes the training process, thus significantly improving general-
ization. Through extensive experiments, we show that WaveBound consistently
improves upon the existing models by large margins, including the state-of-the-art
model.
1 Introduction
Time series forecasting has gained a lot of attention due to its high practicality in real-world applications such as traffic [1], energy consumption [2], economics and finance [3], and disease analysis [4].
Recent deep-learning-based approaches, particularly transformer-based methods [5–9], have shown
remarkable success in time series forecasting. Nevertheless, inconsistent patterns and unpredictable
behaviors in real data force the models to fit such patterns, even for unpredictable incidents,
and induce unstable training. The model does not neglect unpredictable cases during training,
but rather receives a huge penalty (i.e., training loss) for them. Ideally, a certain magnitude of training loss
should be tolerated for unpredictable patterns. This implies the need for proper regularization of
forecasting models in time series forecasting.
Recently, Ishida et al. [10] claimed that zero training loss introduces a high bias in training, hence
leading to an overconfident model and a decrease in generalization. To remedy this issue, they proposed
a simple regularization called flooding, which explicitly prevents the training loss from decreasing below
a small constant threshold called the flood level. In this work, we also focus on the drawbacks of zero
training loss in time series forecasting. In time series forecasting, the model is forced to fit
inevitably appearing unpredictable patterns, which mostly generate tremendous errors. However, the
original flooding is not applicable to time series forecasting for the following two
reasons. (i) Unlike image classification, time series forecasting requires a vector output of size equal
to the prediction length times the number of features. In this case, the original flooding considers the
average training loss without dealing with each time step and feature individually.
*Equal contribution
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
[Figure 1: training loss versus forecast step for (a) Flooding (original), (b) Flooding (modified), and (c) WaveBound (ours); panel (a) also marks the average training loss.]
Figure 1: Conceptual examples of the different methods. (a) The original flooding provides a
lower bound on the average loss, rather than considering each time step and feature individually. (b)
Even if lower bounds of the training loss are provided for each time step and feature, a constant-valued
bound cannot reflect the nature of time series forecasting. (c) Our proposed WaveBound
method provides a lower bound of the training loss for each time step and feature. This lower bound
is dynamically adjusted to give a tighter error bound during the training process.
(ii) In time series data, error bounds should be dynamically adjusted for different patterns. Intuitively, a higher error
should be tolerated for unpredictable patterns.
To properly address the overfitting issue in time series forecasting, the difficulty of prediction, i.e.,
how unpredictable the current label is, should be measured during the training procedure. To this end, we
introduce a target network updated with an exponential moving average of the original network,
i.e., the source network. At each iteration, the target network can guide a reasonable level of training
loss to the source network: the larger the error of the target network, the more unpredictable the
pattern. In recent studies, a slow-moving average target network is commonly used to produce
stable targets in the self-supervised setting [11, 12]. By using the training loss of the target network
as our lower bound, we derive a novel regularization method called WaveBound which faithfully
estimates the error bounds for each time step and feature. By dynamically adjusting the error bounds,
our regularization prevents the model from overly fitting to a certain pattern and further improves
generalization. Figure 1 shows the conceptual difference between the original flooding and our
WaveBound method. The original flooding determines the direction of the update step for
all points by comparing the average loss with its flood level. In contrast, WaveBound individually
decides the direction of the update step for each point by using the dynamic error bound of the
training loss. The difference between these methods is further discussed in Section 3. Our main
contributions are threefold:
• We propose a simple yet effective regularization method called WaveBound that dynamically provides the error bounds of the training loss in time series forecasting.
• We show that our proposed regularization method consistently improves upon the existing state-of-the-art time series forecasting model on six real-world benchmarks.
• By conducting extensive experiments, we verify the significance of adjusting the error bounds for each time step, feature, and pattern, thus addressing the overfitting issue in time series forecasting.
2 Preliminary
2.1 Time Series Forecasting
We consider the rolling forecasting setting with a fixed window size [5–7]. The aim of time series
forecasting is to learn a forecaster $g: \mathbb{R}^{L \times K} \to \mathbb{R}^{M \times K}$ which predicts the future series
$y_t = \{z_{t+1}, z_{t+2}, ..., z_{t+M} : z_i \in \mathbb{R}^K\}$ given the past series $x_t = \{z_{t-L+1}, z_{t-L+2}, ..., z_t : z_i \in \mathbb{R}^K\}$ at time $t$,
where $K$ is the feature dimension and $L$ and $M$ are the input length and output length,
respectively. We mainly address error bounding in the multivariate regression problem where the
input series $x$ and output series $y$ jointly come from the underlying density $p(x, y)$. For a given loss
function $\ell$, the risk of $g$ is $R(g) := \mathbb{E}_{(x,y) \sim p(x,y)}[\ell(g(x), y)]$. Since we cannot directly access the
distribution $p$, we instead minimize its empirical version $\hat{R}(g) := \frac{1}{N} \sum_{i=1}^{N} \ell(g(x_i), y_i)$ using training
data $X := \{(x_i, y_i)\}_{i=1}^{N}$. In the analysis, we assume that the errors are independent and identically
distributed. We mainly consider the mean squared error (MSE) loss, which is widely used as an
objective function in recent time series forecasting models [5–7]. Then, the risk can be rewritten as
the sum of the risks at each prediction step and feature:
$$R(g) = \mathbb{E}_{(u,v) \sim p(u,v)}\left[\frac{1}{MK} \|g(u) - v\|^2\right] = \frac{1}{MK} \sum_{j,k} R_{jk}(g), \qquad \hat{R}(g) = \frac{1}{NMK} \sum_{i=1}^{N} \|g(x_i) - y_i\|^2 = \frac{1}{MK} \sum_{j,k} \hat{R}_{jk}(g), \tag{1}$$

where $R_{jk}(g) := \mathbb{E}_{(u,v) \sim p(u,v)} \|g_{jk}(u) - v_{jk}\|^2$ and $\hat{R}_{jk}(g) := \frac{1}{N} \sum_{i=1}^{N} \|g_{jk}(x_i) - (y_i)_{jk}\|^2$.
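As a concrete illustration of the decomposition in Equation (1), the sketch below (PyTorch, with hypothetical tensor names) computes the per-step, per-feature empirical risks $\hat{R}_{jk}$ by averaging squared errors over the batch dimension only; averaging the resulting $M \times K$ grid recovers the usual MSE training loss.

```python
import torch

def per_output_empirical_risk(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Per-step, per-feature empirical risks R_hat_{jk} from Equation (1).

    pred, target: tensors of shape (N, M, K) -- batch, prediction steps, features.
    Returns a tensor of shape (M, K); averaging it over all (j, k) recovers
    the usual MSE training loss R_hat(g).
    """
    return ((pred - target) ** 2).mean(dim=0)  # average over the N samples only

# Example (hypothetical names):
# pred, target = model(x), y                         # shapes (N, M, K)
# risk_jk = per_output_empirical_risk(pred, target)  # shape (M, K)
# overall = risk_jk.mean()                           # equals ((pred - target) ** 2).mean()
```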
2.2 Flooding
To address the overfitting problem, Ishida et al. [10] suggested flooding, which restricts the training
loss to stay above a certain constant. Given the empirical risk $\hat{R}$ and a manually searched lower
bound $b$, called the flood level, we instead minimize the flooded empirical risk,¹ which is defined as

$$\hat{R}_{fl}(g) = |\hat{R}(g) - b| + b. \tag{2}$$
The gradient update of the flooded empirical risk with respect to the model parameters is performed
as that of the empirical risk if $\hat{R}(g) > b$ and is otherwise performed in the opposite direction. The
flooded empirical risk estimator is known to provide a better approximation of the risk than the
empirical risk estimator in terms of MSE if the risk is greater than $b$.
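As an illustration (not the authors' code), a minimal sketch of the flooding objective in Equation (2), assuming a scalar MSE training loss and a manually chosen flood level $b$; all names are hypothetical.

```python
import torch

def flooded_loss(loss: torch.Tensor, b: float) -> torch.Tensor:
    """Flooding (Ishida et al.): keep the scalar training loss above the flood level b.

    When loss > b the gradient matches that of the plain loss; when loss < b the
    absolute value flips its sign, so a gradient step performs ascent instead.
    Adding b back does not change gradients but keeps the reported value comparable.
    """
    return (loss - b).abs() + b

# usage (hypothetical names):
# loss = torch.nn.functional.mse_loss(model(x), y)
# flooded_loss(loss, b=0.01).backward()
```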
For mini-batched optimization, a gradient update of the flooded empirical risk is performed with
respect to each mini-batch. Let $\hat{R}_t(g)$ denote the empirical risk with respect to the $t$-th mini-batch for
$t \in \{1, 2, ..., T\}$. Then, by Jensen's inequality,

$$\hat{R}_{fl}(g) \leq \frac{1}{T} \sum_{t=1}^{T} \left( |\hat{R}_t(g) - b| + b \right). \tag{3}$$
Therefore, the mini-batched optimization minimizes the upper bound of the flooded empirical risk.
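To make the inequality in Equation (3) concrete, the short check below uses made-up per-batch losses to verify that the average of the per-batch flooded risks upper-bounds the flooded risk of the averaged loss; the numbers are purely illustrative.

```python
import numpy as np

b = 0.1                                              # flood level
batch_risks = np.array([0.05, 0.30, 0.08, 0.22])     # hypothetical per-batch losses R_hat_t

lhs = abs(batch_risks.mean() - b) + b                # flooded risk of the full empirical risk
rhs = (np.abs(batch_risks - b) + b).mean()           # what mini-batched training minimizes
assert lhs <= rhs                                    # Jensen: |mean(.)| <= mean(|.|)
print(lhs, rhs)  # 0.1625 vs 0.1975 -- the bound is looser when batch losses vary a lot
```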
3 Method
In this section, we propose a novel regularization called WaveBound, which is specially designed for
time series forecasting. We first discuss the drawbacks of applying the original flooding to time series
forecasting and then introduce a more desirable form of regularization.
3.1 Flooding in Time Series Forecasting
We first discuss how the original flooding may not effectively work for the time series forecasting
problem. We start by rewriting Equation (2) using the risks at each prediction step and feature:

$$\hat{R}_{fl}(g) = \left| \hat{R}(g) - b \right| + b = \left| \frac{1}{MK} \sum_{j,k} \hat{R}_{jk}(g) - b \right| + b. \tag{4}$$
The flooded empirical risk constrains the lower bound of the average of the empirical risk over all prediction
steps and features by a constant value of $b$. However, for the multivariate regression model, this
regularization does not independently bound each $\hat{R}_{jk}(g)$. As a result, the regularization effect is
concentrated on output variables whose $\hat{R}_{jk}(g)$ varies greatly during training.
In this circumstance, a modified version of flooding can be explored by considering the individual
training loss for each time step and feature. This can be done by subtracting the flood level $b$ for each
time step and feature as follows:

$$\hat{R}_{const}(g) = \frac{1}{MK} \sum_{j,k} \left( |\hat{R}_{jk}(g) - b| + b \right). \tag{5}$$
¹The constant $b$ outside the absolute value does not affect the gradient update, but ensures that $\hat{R}_{fl}(g) = \hat{R}(g)$ if $\hat{R}(g) > b$. This property is especially useful in the analysis of the estimation error.
[Figure 2: diagram of the source and target networks; each network estimates its risk, the target network (updated by a slow EMA update) provides the error bound, and the source network minimizes the wave empirical risk.]
Figure 2: Our proposed WaveBound method provides the dynamic error bounds of the training loss
for each time step and feature using the target network. The target network $g_\tau$ is updated with the
EMA of the source network $g_\theta$. At the $j$-th time step and $k$-th feature, the training loss is bounded by
our estimated error bound $\hat{R}_{jk}(g_\tau) - \epsilon$, i.e., gradient ascent is performed instead of gradient
descent when the training loss is below the error bound.
For the remainder of this study, we denote this version of flooding as constant flooding. Compared
to the original flooding, which considers the average of the whole training loss, constant flooding
individually constrains the lower bound of the training loss at each time step and feature by the value
of $b$, as in the sketch below.
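The following sketch of constant flooding as defined in Equation (5) is only an illustration under the assumption that predictions and targets have shape (N, M, K); it is not taken from the paper's implementation.

```python
import torch

def constant_flooded_loss(pred: torch.Tensor, target: torch.Tensor, b: float) -> torch.Tensor:
    """Constant flooding (Equation (5)): bound every R_hat_{jk} below by the same constant b.

    pred, target: (N, M, K). Each of the M*K per-output losses is flooded
    independently, unlike the original flooding which floods only their average.
    """
    risk_jk = ((pred - target) ** 2).mean(dim=0)   # (M, K) per-step, per-feature losses
    return ((risk_jk - b).abs() + b).mean()        # average of elementwise flooded losses
```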
Nonetheless, constant flooding still fails to consider the different difficulties of prediction across different mini-batches.
For constant flooding, it is challenging to properly minimize this variant of the empirical risk in the
batch-wise training process. As in Equation (3), the mini-batched optimization minimizes the upper
bound of the flooded empirical risk. The problem is that the inequality becomes less tight as the
flooded risk terms $|\hat{R}_t(g) - b|$ for $t \in \{1, 2, ..., T\}$ differ significantly. Since time series data
typically contain a lot of unpredictable noise, this happens frequently, as the loss of each batch varies
widely. To ensure the tightness of the inequality, the bound for $\hat{R}_t(g)$ should be adaptively chosen for
each batch.
3.2 WaveBound
As previously mentioned, to properly bound the empirical risk in time series forecasting, the regularization
method should satisfy the following conditions: (i) The regularization should
consider the empirical risk for each time step and feature individually. (ii) For different patterns,
i.e., mini-batches, different error bounds should be searched in the batch-wise training process. To
handle this, we find the error bound for each time step and feature and dynamically adjust it at each
iteration. Since manually searching different bounds for each time step and feature at every iteration
is impractical, we estimate the error bounds for different predictions using an exponential moving
average (EMA) model [13].
Concretely, two networks are employed throughout the training phase: the source network $g_\theta$ and
the target network $g_\tau$, which have the same architecture but different weights $\theta$ and $\tau$, respectively. The
target network estimates the proper lower bounds of errors for the predictions of the source network,
and its weights are updated with the exponential moving average of the weights of the source network:

$$\tau \leftarrow \alpha\tau + (1 - \alpha)\theta, \tag{6}$$

where $\alpha \in [0, 1]$ is a target decay rate. On the other hand, the source network updates its weights $\theta$
using gradient descent in the direction of the gradient of the wave empirical risk $\hat{R}^{wb}(g_\theta)$,
which is defined as
$$\hat{R}^{wb}(g_\theta) = \frac{1}{MK} \sum_{j,k} \hat{R}^{wb}_{jk}(g_\theta), \qquad \hat{R}^{wb}_{jk}(g_\theta) = \left| \hat{R}_{jk}(g_\theta) - \left( \hat{R}_{jk}(g_\tau) - \epsilon \right) \right| + \left( \hat{R}_{jk}(g_\tau) - \epsilon \right), \tag{7}$$
where $\epsilon$ is a hyperparameter indicating how far the error bound of the source network can be from
the error of the target network. Intuitively, the target network guides the lower bound of the training
loss for each time step and feature to prevent the source network from training towards a loss lower
than that bound, i.e., overfitting to a certain pattern. As the exponential moving average model is
known to have the effect of ensembling the source networks and memorizing training data seen in
earlier iterations [13], the target network can robustly estimate the error bound of the source network
against the instability mostly caused by noisy input data. Figure 2 shows how the source network and
the target network operate in our WaveBound method. A summary of WaveBound is provided in
Appendix B.
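The PyTorch-style sketch below puts Equations (6) and (7) together for one training step. It is our reading of the method, not the authors' released code; names such as source, target, optimizer, x, y, eps, and alpha are placeholders.

```python
import copy
import torch

@torch.no_grad()
def ema_update(target: torch.nn.Module, source: torch.nn.Module, alpha: float) -> None:
    """Equation (6): tau <- alpha * tau + (1 - alpha) * theta."""
    for t_param, s_param in zip(target.parameters(), source.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)

def wave_bound_loss(source: torch.nn.Module, target: torch.nn.Module,
                    x: torch.Tensor, y: torch.Tensor, eps: float) -> torch.Tensor:
    """Equation (7): bound each per-output training loss by the target network's loss minus eps."""
    src_risk = ((source(x) - y) ** 2).mean(dim=0)          # (M, K), differentiable
    with torch.no_grad():                                  # error bound: no gradient to the target
        bound = ((target(x) - y) ** 2).mean(dim=0) - eps   # R_hat_{jk}(g_tau) - eps
    return ((src_risk - bound).abs() + bound).mean()

# one training iteration (hypothetical setup):
# target = copy.deepcopy(source)                 # same architecture, separate weights
# loss = wave_bound_loss(source, target, x, y, eps=0.01)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
# ema_update(target, source, alpha=0.999)        # slow-moving target network
```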
Mini-batched optimization. For $t \in \{1, 2, ..., T\}$, let $(\hat{R}^{wb}_t)_{jk}(g)$ and $(\hat{R}_t)_{jk}(g)$ denote the wave
empirical risk and the empirical risk at the $j$-th step and $k$-th feature relative to the $t$-th mini-batch,
respectively. Given the target network $g'$, by Jensen's inequality,

$$\hat{R}^{wb}_{jk}(g) \leq \frac{1}{T} \sum_{t=1}^{T} \left( \left| (\hat{R}_t)_{jk}(g) - (\hat{R}_t)_{jk}(g') + \epsilon \right| + (\hat{R}_t)_{jk}(g') - \epsilon \right) = \frac{1}{T} \sum_{t=1}^{T} (\hat{R}^{wb}_t)_{jk}(g). \tag{8}$$
Therefore, the mini-batched optimization minimizes the upper bound of the wave empirical risk. Note
that if $g$ is close to $g'$, the values of $(\hat{R}_t)_{jk}(g) - (\hat{R}_t)_{jk}(g') + \epsilon$ are similar across mini-batches, which
gives a tight bound in Jensen's inequality. We expect the EMA update to work so that this condition
is met, giving a tight upper bound for the wave empirical risk in the mini-batched optimization.
MSE reduction. We show that the MSE of our suggested wave empirical risk estimator can be
smaller than that of the empirical risk estimator given an appropriate ϵ.
Theorem 1. Fix measurable functions $g$ and $g'$. Let $I := \{(i, j) : i = 1, 2, ..., M,\ j = 1, 2, ..., K\}$,
and let $J(X) := \{(i, j) \in I : \hat{R}_{ij}(g) < \hat{R}_{ij}(g') - \epsilon\}$. If the following two conditions hold:

(a) $\forall (i, j), (k, l) \in I$ such that $(i, j) \neq (k, l)$, $\hat{R}_{ij}(g) - \hat{R}_{ij}(g')$ is independent of $\hat{R}_{kl}(g)$,

(b) $\hat{R}_{ij}(g') < R_{ij}(g) + \epsilon$ for all $(i, j) \in J(X)$ almost surely,

then $\mathrm{MSE}(\hat{R}(g)) \geq \mathrm{MSE}(\hat{R}^{wb}(g))$. Given the condition (a), if we have $\alpha > 0$ such that $\alpha <
R_{ij}(g) - \hat{R}_{ij}(g') + \epsilon$ for all $(i, j) \in J(X)$ almost surely, then

$$\mathrm{MSE}(\hat{R}(g)) - \mathrm{MSE}(\hat{R}^{wb}(g)) \geq 4\alpha^2 \sum_{(i,j) \in I} \Pr\left[\alpha < \hat{R}_{ij}(g') - \hat{R}_{ij}(g) - \epsilon\right]. \tag{9}$$
Proof. Please see Appendix A.
Intuitively, Theorem 1 states that the MSE of the empirical risk estimator can be reduced when
the following conditions hold: (i) the network $g$ has sufficient expressive power so that the loss
difference between $g$ and $g'$ at each output variable is unrelated to the loss at the other output variables
of $g$; (ii) $\hat{R}_{ij}(g') - \epsilon$ likely lies in between $\hat{R}_{ij}(g)$ and $R_{ij}(g)$. It is preferable to have $g'$ be the
EMA model of $g$, since the training loss of the EMA model cannot readily fall below the test loss of
the model. Then, $\epsilon$ can be chosen as a fixed small value so that the training loss of the source model
at each output variable is closely bounded from below by the test loss at that variable.
4 Experiments
4.1 WaveBound with Forecasting Models
In this section, we evaluate our WaveBound method on real-world benchmarks using various time
series forecasting models, including the state-of-the-art models.
Baselines. Since our method can be easily applied to deep-learning-based forecasting models, we
evaluate our regularization applied to several baselines, including the state-of-the-art method. For
the multivariate setting, we select Autoformer [5], Pyraformer [6], Informer [7], LSTNet [14], and
TCN [16]. For the univariate setting, we additionally include N-BEATS [15] as a baseline.
Datasets. We examine the performance of forecasting models on six real-world benchmarks. (1) The
Electricity Transformer Temperature (ETT) [7] dataset contains two years of data from two separate
counties in China, collected from electricity transformers at 1-hour intervals (ETTh1, ETTh2) and
15-minute intervals (ETTm1, ETTm2). Each time step contains an oil temperature and