WaveBound: Dynamic Error Bounds for Stable Time
Series Forecasting
Youngin Cho* Daejin Kim* Dongmin Kim Mohammad Azam Khan Jaegul Choo
KAIST AI
{choyi0521,kiddj,tommy.dm.kim,azamkhan,jchoo}@kaist.ac.kr
Abstract
Time series forecasting has become a critical task due to its high practicality in
real-world applications such as traffic, energy consumption, economics and fi-
nance, and disease analysis. Recent deep-learning-based approaches have shown
remarkable success in time series forecasting. Nonetheless, due to the dynamics of
time series data, deep networks still suffer from unstable training and overfitting.
Inconsistent patterns appearing in real-world data bias the model toward a
particular pattern, thus limiting generalization. In this work, we introduce
dynamic error bounds on the training loss to address the overfitting issue in time series
forecasting. Specifically, we propose a regularization method called WaveBound
which estimates an adequate error bound of the training loss for each time step and
feature at each iteration. By allowing the model to focus less on unpredictable data,
WaveBound stabilizes the training process, thus significantly improving general-
ization. Through extensive experiments, we show that WaveBound consistently
improves upon the existing models by large margins, including the state-of-the-art
model.
1 Introduction
Time series forecasting has gained a lot of attention due to its high practicality in real-world applications such as traffic [1], energy consumption [2], economics and finance [3], and disease analysis [4].
Recent deep-learning-based approaches, particularly transformer-based methods [5–9], have shown
remarkable success in time series forecasting. Nevertheless, inconsistent patterns and unpredictable
behaviors in real data force the models to fit such patterns, even for unpredictable incidents,
and induce unstable training. The model does not neglect unpredictable cases during training,
but rather receives a huge penalty (i.e., training loss) for them. Ideally, a certain magnitude of training loss
should be tolerated for unpredictable patterns. This implies the need for proper regularization of
forecasting models in time series forecasting.
Recently, Ishida et al. [10] claimed that zero training loss introduces a high bias in training, hence
leading to an overconfident model and a decrease in generalization. To remedy this issue, they proposed
a simple regularization called flooding, which explicitly prevents the training loss from decreasing below
a small constant threshold called the flood level. In this work, we also focus on the drawbacks of zero
training loss in time series forecasting. In time series forecasting, the model is forced to fit
inevitably appearing unpredictable patterns, which mostly generate tremendous errors. However, the
original flooding is not applicable to time series forecasting for the following two
reasons. (i) Unlike image classification, time series forecasting requires a vector output of size equal
to the prediction length times the number of features. In this case, the original flooding considers the
average training loss without dealing with each time step and feature individually.
*Equal contribution
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
[Figure 1: training loss versus forecast step for (a) Flooding (original), (b) Flooding (modified), and (c) WaveBound (ours); panel (a) also marks the average training loss.]
Figure 1: Conceptual examples of the different methods. (a) The original flooding provides a
lower bound on the average loss, rather than considering each time step and feature individually. (b)
Even if lower bounds of the training loss are provided for each time step and feature, a constant-valued
bound cannot reflect the nature of time series forecasting. (c) Our proposed WaveBound
method provides a lower bound of the training loss for each time step and feature. This lower bound
is dynamically adjusted to give a tighter error bound during the training process.
(ii) In time series data, error bounds should be dynamically adjusted for different patterns. Intuitively, a higher error
should be tolerated for unpredictable patterns.
To properly address the overfitting issue in time series forecasting, the difficulty of prediction, i.e.,
how unpredictable the current label is, should be measured during the training procedure. To this end, we
introduce a target network updated with an exponential moving average of the original network,
i.e., the source network. At each iteration, the target network can guide a reasonable level of training
loss to the source network: the larger the error of the target network, the more unpredictable the
pattern. In recent studies, a slow-moving average target network is commonly used to produce
stable targets in the self-supervised setting [11, 12]. By using the training loss of the target network
as our lower bound, we derive a novel regularization method called WaveBound which faithfully
estimates the error bounds for each time step and feature. By dynamically adjusting the error bounds,
our regularization prevents the model from overly fitting to a certain pattern and further improves
generalization. Figure 1 shows the conceptual difference between the original flooding and our
WaveBound method. The original flooding determines the direction of the update step for
all points by comparing the average loss with its flood level. In contrast, WaveBound individually
decides the direction of the update step for each point by using the dynamic error bound of the
training loss. The difference between these methods is further discussed in Section 3. Our main
contributions are threefold:
• We propose a simple yet effective regularization method called WaveBound that dynamically provides the error bounds of the training loss in time series forecasting.
• We show that our proposed regularization method consistently improves upon the existing state-of-the-art time series forecasting model on six real-world benchmarks.
• By conducting extensive experiments, we verify the significance of adjusting the error bounds for each time step, feature, and pattern, thus addressing the overfitting issue in time series forecasting.
2 Preliminary
2.1 Time Series Forecasting
We consider the rolling forecasting setting with a fixed window size [5–7]. The aim of time series
forecasting is to learn a forecaster $g: \mathbb{R}^{L \times K} \to \mathbb{R}^{M \times K}$ which predicts the future series
$y_t = \{z_{t+1}, z_{t+2}, ..., z_{t+M} : z_i \in \mathbb{R}^K\}$ given the past series $x_t = \{z_{t-L+1}, z_{t-L+2}, ..., z_t : z_i \in \mathbb{R}^K\}$ at time $t$,
where $K$ is the feature dimension and $L$ and $M$ are the input length and output length,
respectively. We mainly address error bounding in the multivariate regression problem where the
input series $x$ and output series $y$ jointly come from the underlying density $p(x, y)$. For a given loss
function $\ell$, the risk of $g$ is $R(g) := \mathbb{E}_{(x,y) \sim p(x,y)}[\ell(g(x), y)]$. Since we cannot directly access the
distribution $p$, we instead minimize its empirical version $\hat{R}(g) := \frac{1}{N} \sum_{i=1}^{N} \ell(g(x_i), y_i)$ using training
data $X := \{(x_i, y_i)\}_{i=1}^{N}$. In the analysis, we assume that the errors are independent and identically
distributed. We mainly consider the mean squared error (MSE) loss, which is widely used as an
objective function in recent time series forecasting models [5–7]. Then, the risk can be rewritten as
the sum of the risks at each prediction step and feature:
$$R(g) = \mathbb{E}_{(u,v) \sim p(u,v)}\left[\frac{1}{MK} \|g(u) - v\|^2\right] = \frac{1}{MK} \sum_{j,k} R_{jk}(g), \qquad \hat{R}(g) = \frac{1}{NMK} \sum_{i=1}^{N} \|g(x_i) - y_i\|^2 = \frac{1}{MK} \sum_{j,k} \hat{R}_{jk}(g), \tag{1}$$

where $R_{jk}(g) := \mathbb{E}_{(u,v) \sim p(u,v)} \|g_{jk}(u) - v_{jk}\|^2$ and $\hat{R}_{jk}(g) := \frac{1}{N} \sum_{i=1}^{N} \|g_{jk}(x_i) - (y_i)_{jk}\|^2$.
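As a concrete illustration of the decomposition in Equation (1), the sketch below (PyTorch, with hypothetical tensor names) computes the per-step, per-feature empirical risks $\hat{R}_{jk}$ by averaging squared errors over the batch dimension only; averaging the resulting $M \times K$ grid recovers the usual MSE training loss.

```python
import torch

def per_output_empirical_risk(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Per-step, per-feature empirical risks R_hat_{jk} from Equation (1).

    pred, target: tensors of shape (N, M, K) -- batch, prediction steps, features.
    Returns a tensor of shape (M, K); averaging it over all (j, k) recovers
    the usual MSE training loss R_hat(g).
    """
    return ((pred - target) ** 2).mean(dim=0)  # average over the N samples only

# Example (hypothetical names):
# pred, target = model(x), y                         # shapes (N, M, K)
# risk_jk = per_output_empirical_risk(pred, target)  # shape (M, K)
# overall = risk_jk.mean()                           # equals ((pred - target) ** 2).mean()
```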
2.2 Flooding
To address the overfitting problem, Ishida et al. [10] suggested flooding, which restricts the training
loss to stay above a certain constant. Given the empirical risk $\hat{R}$ and a manually searched lower
bound $b$, called the flood level, we instead minimize the flooded empirical risk,¹ which is defined as

$$\hat{R}_{fl}(g) = |\hat{R}(g) - b| + b. \tag{2}$$
The gradient update of the flooded empirical risk with respect to the model parameters is performed
as that of the empirical risk if $\hat{R}(g) > b$ and is otherwise performed in the opposite direction. The
flooded empirical risk estimator is known to provide a better approximation of the risk than the
empirical risk estimator in terms of MSE if the risk is greater than $b$.
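As an illustration (not the authors' code), a minimal sketch of the flooding objective in Equation (2), assuming a scalar MSE training loss and a manually chosen flood level $b$; all names are hypothetical.

```python
import torch

def flooded_loss(loss: torch.Tensor, b: float) -> torch.Tensor:
    """Flooding (Ishida et al.): keep the scalar training loss above the flood level b.

    When loss > b the gradient matches that of the plain loss; when loss < b the
    absolute value flips its sign, so a gradient step performs ascent instead.
    Adding b back does not change gradients but keeps the reported value comparable.
    """
    return (loss - b).abs() + b

# usage (hypothetical names):
# loss = torch.nn.functional.mse_loss(model(x), y)
# flooded_loss(loss, b=0.01).backward()
```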
For mini-batched optimization, a gradient update of the flooded empirical risk is performed with
respect to each mini-batch. Let $\hat{R}_t(g)$ denote the empirical risk with respect to the $t$-th mini-batch for
$t \in \{1, 2, ..., T\}$. Then, by Jensen's inequality,

$$\hat{R}_{fl}(g) \leq \frac{1}{T} \sum_{t=1}^{T} \left( |\hat{R}_t(g) - b| + b \right). \tag{3}$$
Therefore, the mini-batched optimization minimizes the upper bound of the flooded empirical risk.
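To make the inequality in Equation (3) concrete, the short check below uses made-up per-batch losses to verify that the average of the per-batch flooded risks upper-bounds the flooded risk of the averaged loss; the numbers are purely illustrative.

```python
import numpy as np

b = 0.1                                              # flood level
batch_risks = np.array([0.05, 0.30, 0.08, 0.22])     # hypothetical per-batch losses R_hat_t

lhs = abs(batch_risks.mean() - b) + b                # flooded risk of the full empirical risk
rhs = (np.abs(batch_risks - b) + b).mean()           # what mini-batched training minimizes
assert lhs <= rhs                                    # Jensen: |mean(.)| <= mean(|.|)
print(lhs, rhs)  # 0.1625 vs 0.1975 -- the bound is looser when batch losses vary a lot
```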
3 Method
In this section, we propose a novel regularization called WaveBound, which is specially designed for
time series forecasting. We first discuss the drawbacks of applying the original flooding to time series
forecasting and then introduce a more desirable form of regularization.
3.1 Flooding in Time Series Forecasting
We first discuss how the original flooding may not effectively work for the time series forecasting
problem. We start by rewriting Equation (2) using the risks at each prediction step and feature:

$$\hat{R}_{fl}(g) = \left| \hat{R}(g) - b \right| + b = \left| \frac{1}{MK} \sum_{j,k} \hat{R}_{jk}(g) - b \right| + b. \tag{4}$$
The flooded empirical risk constrains the lower bound of the average of the empirical risk over all prediction
steps and features by a constant value of $b$. However, for the multivariate regression model, this
regularization does not independently bound each $\hat{R}_{jk}(g)$. As a result, the regularization effect is
concentrated on output variables whose $\hat{R}_{jk}(g)$ varies greatly during training.
In this circumstance, a modified version of flooding can be explored by considering the individual
training loss for each time step and feature. This can be done by subtracting the flood level $b$ for each
time step and feature as follows:

$$\hat{R}_{const}(g) = \frac{1}{MK} \sum_{j,k} \left( |\hat{R}_{jk}(g) - b| + b \right). \tag{5}$$
¹The constant $b$ outside the absolute value does not affect the gradient update, but ensures that $\hat{R}_{fl}(g) = \hat{R}(g)$ if $\hat{R}(g) > b$. This property is especially useful in the analysis of the estimation error.
[Figure 2: diagram of the source and target networks; each network estimates its risk, the target network (updated by a slow EMA update) provides the error bound, and the source network minimizes the wave empirical risk.]
Figure 2: Our proposed WaveBound method provides the dynamic error bounds of the training loss
for each time step and feature using the target network. The target network $g_\tau$ is updated with the
EMA of the source network $g_\theta$. At the $j$-th time step and $k$-th feature, the training loss is bounded by
our estimated error bound $\hat{R}_{jk}(g_\tau) - \epsilon$, i.e., gradient ascent is performed instead of gradient
descent when the training loss is below the error bound.
For the remainder of this study, we denote this version of flooding as constant flooding. Compared
to the original flooding, which considers the average of the whole training loss, constant flooding
individually constrains the lower bound of the training loss at each time step and feature by the value
of $b$, as in the sketch below.
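The following sketch of constant flooding as defined in Equation (5) is only an illustration under the assumption that predictions and targets have shape (N, M, K); it is not taken from the paper's implementation.

```python
import torch

def constant_flooded_loss(pred: torch.Tensor, target: torch.Tensor, b: float) -> torch.Tensor:
    """Constant flooding (Equation (5)): bound every R_hat_{jk} below by the same constant b.

    pred, target: (N, M, K). Each of the M*K per-output losses is flooded
    independently, unlike the original flooding which floods only their average.
    """
    risk_jk = ((pred - target) ** 2).mean(dim=0)   # (M, K) per-step, per-feature losses
    return ((risk_jk - b).abs() + b).mean()        # average of elementwise flooded losses
```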
Nonetheless, constant flooding still fails to consider the different difficulties of prediction across different mini-batches.
For constant flooding, it is challenging to properly minimize this variant of the empirical risk in the
batch-wise training process. As in Equation (3), the mini-batched optimization minimizes the upper
bound of the flooded empirical risk. The problem is that the inequality becomes less tight as the
flooded risk terms $|\hat{R}_t(g) - b|$ for $t \in \{1, 2, ..., T\}$ differ significantly. Since time series data
typically contain a lot of unpredictable noise, this happens frequently, as the loss of each batch varies
widely. To ensure the tightness of the inequality, the bound for $\hat{R}_t(g)$ should be adaptively chosen for
each batch.
3.2 WaveBound
As previously mentioned, to properly bound the empirical risk in time series forecasting, the regularization
method should satisfy the following conditions: (i) The regularization should
consider the empirical risk for each time step and feature individually. (ii) For different patterns,
i.e., mini-batches, different error bounds should be searched in the batch-wise training process. To
handle this, we find the error bound for each time step and feature and dynamically adjust it at each
iteration. Since manually searching different bounds for each time step and feature at every iteration
is impractical, we estimate the error bounds for different predictions using an exponential moving
average (EMA) model [13].
Concretely, two networks are employed throughout the training phase: the source network $g_\theta$ and
the target network $g_\tau$, which have the same architecture but different weights $\theta$ and $\tau$, respectively. The
target network estimates the proper lower bounds of errors for the predictions of the source network,
and its weights are updated with the exponential moving average of the weights of the source network:

$$\tau \leftarrow \alpha\tau + (1 - \alpha)\theta, \tag{6}$$

where $\alpha \in [0, 1]$ is a target decay rate. On the other hand, the source network updates its weights $\theta$
using gradient descent in the direction of the gradient of the wave empirical risk $\hat{R}^{wb}(g_\theta)$,
which is defined as
$$\hat{R}^{wb}(g_\theta) = \frac{1}{MK} \sum_{j,k} \hat{R}^{wb}_{jk}(g_\theta), \qquad \hat{R}^{wb}_{jk}(g_\theta) = \left| \hat{R}_{jk}(g_\theta) - \left( \hat{R}_{jk}(g_\tau) - \epsilon \right) \right| + \left( \hat{R}_{jk}(g_\tau) - \epsilon \right), \tag{7}$$
where $\epsilon$ is a hyperparameter indicating how far the error bound of the source network can be from
the error of the target network. Intuitively, the target network guides the lower bound of the training
loss for each time step and feature to prevent the source network from training towards a loss lower
than that bound, i.e., overfitting to a certain pattern. As the exponential moving average model is
known to have the effect of ensembling the source networks and memorizing training data seen in
earlier iterations [13], the target network can robustly estimate the error bound of the source network
against the instability mostly caused by noisy input data. Figure 2 shows how the source network and
the target network operate in our WaveBound method. A summary of WaveBound is provided in
Appendix B.
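The PyTorch-style sketch below puts Equations (6) and (7) together for one training step. It is our reading of the method, not the authors' released code; names such as source, target, optimizer, x, y, eps, and alpha are placeholders.

```python
import copy
import torch

@torch.no_grad()
def ema_update(target: torch.nn.Module, source: torch.nn.Module, alpha: float) -> None:
    """Equation (6): tau <- alpha * tau + (1 - alpha) * theta."""
    for t_param, s_param in zip(target.parameters(), source.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)

def wave_bound_loss(source: torch.nn.Module, target: torch.nn.Module,
                    x: torch.Tensor, y: torch.Tensor, eps: float) -> torch.Tensor:
    """Equation (7): bound each per-output training loss by the target network's loss minus eps."""
    src_risk = ((source(x) - y) ** 2).mean(dim=0)          # (M, K), differentiable
    with torch.no_grad():                                  # error bound: no gradient to the target
        bound = ((target(x) - y) ** 2).mean(dim=0) - eps   # R_hat_{jk}(g_tau) - eps
    return ((src_risk - bound).abs() + bound).mean()

# one training iteration (hypothetical setup):
# target = copy.deepcopy(source)                 # same architecture, separate weights
# loss = wave_bound_loss(source, target, x, y, eps=0.01)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
# ema_update(target, source, alpha=0.999)        # slow-moving target network
```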
Mini-batched optimization. For $t \in \{1, 2, ..., T\}$, let $(\hat{R}^{wb}_t)_{jk}(g)$ and $(\hat{R}_t)_{jk}(g)$ denote the wave
empirical risk and the empirical risk at the $j$-th step and $k$-th feature relative to the $t$-th mini-batch,
respectively. Given the target network $g'$, by Jensen's inequality,

$$\hat{R}^{wb}_{jk}(g) \leq \frac{1}{T} \sum_{t=1}^{T} \left( \left| (\hat{R}_t)_{jk}(g) - (\hat{R}_t)_{jk}(g') + \epsilon \right| + (\hat{R}_t)_{jk}(g') - \epsilon \right) = \frac{1}{T} \sum_{t=1}^{T} (\hat{R}^{wb}_t)_{jk}(g). \tag{8}$$
Therefore, the mini-batched optimization minimizes the upper bound of the wave empirical risk. Note
that if $g$ is close to $g'$, the values of $(\hat{R}_t)_{jk}(g) - (\hat{R}_t)_{jk}(g') + \epsilon$ are similar across mini-batches, which
gives a tight bound in Jensen's inequality. We expect the EMA update to work so that this condition
is met, giving a tight upper bound for the wave empirical risk in the mini-batched optimization.
MSE reduction. We show that the MSE of our suggested wave empirical risk estimator can be
smaller than that of the empirical risk estimator given an appropriate ϵ.
Theorem 1. Fix measurable functions $g$ and $g'$. Let $I := \{(i, j) : i = 1, 2, ..., M,\ j = 1, 2, ..., K\}$,
and let $J(X) := \{(i, j) \in I : \hat{R}_{ij}(g) < \hat{R}_{ij}(g') - \epsilon\}$. If the following two conditions hold:

(a) $\forall (i, j), (k, l) \in I$ such that $(i, j) \neq (k, l)$, $\hat{R}_{ij}(g) - \hat{R}_{ij}(g')$ is independent of $\hat{R}_{kl}(g)$,

(b) $\hat{R}_{ij}(g') < R_{ij}(g) + \epsilon$ for all $(i, j) \in J(X)$ almost surely,

then $\mathrm{MSE}(\hat{R}(g)) \geq \mathrm{MSE}(\hat{R}^{wb}(g))$. Given the condition (a), if we have $\alpha > 0$ such that $\alpha <
R_{ij}(g) - \hat{R}_{ij}(g') + \epsilon$ for all $(i, j) \in J(X)$ almost surely, then

$$\mathrm{MSE}(\hat{R}(g)) - \mathrm{MSE}(\hat{R}^{wb}(g)) \geq 4\alpha^2 \sum_{(i,j) \in I} \Pr\left[\alpha < \hat{R}_{ij}(g') - \hat{R}_{ij}(g) - \epsilon\right]. \tag{9}$$
Proof. Please see Appendix A.
Intuitively, Theorem 1 states that the MSE of the empirical risk estimator can be reduced when
the following conditions hold: (i) the network $g$ has sufficient expressive power so that the loss
difference between $g$ and $g'$ at each output variable is unrelated to the loss at the other output variables
of $g$; (ii) $\hat{R}_{ij}(g') - \epsilon$ likely lies in between $\hat{R}_{ij}(g)$ and $R_{ij}(g)$. It is preferable to have $g'$ be the
EMA model of $g$, since the training loss of the EMA model cannot readily fall below the test loss of
the model. Then, $\epsilon$ can be chosen as a fixed small value so that the training loss of the source model
at each output variable is closely bounded from below by the test loss at that variable.
4 Experiments
4.1 WaveBound with Forecasting Models
In this section, we evaluate our WaveBound method on real-world benchmarks using various time
series forecasting models, including the state-of-the-art models.
Baselines. Since our method can be easily applied to deep-learning-based forecasting models, we
evaluate our regularization applied to several baselines, including the state-of-the-art method. For
the multivariate setting, we select Autoformer [5], Pyraformer [6], Informer [7], LSTNet [14], and
TCN [16]. For the univariate setting, we additionally include N-BEATS [15] as a baseline.
Datasets. We examine the performance of forecasting models on six real-world benchmarks. (1) The
Electricity Transformer Temperature (ETT) [7] dataset contains two years of data from two separate
counties in China, collected from electricity transformers at 1-hour intervals (ETTh1, ETTh2) and
15-minute intervals (ETTm1, ETTm2). Each time step contains an oil temperature and