
to recent deep learning models. In this section, we briefly describe recent deep learning models for time-series forecasting. Motivated by the success of recurrent neural networks (RNNs) (Clevert et al., 2016; Li et al., 2018; Yu et al., 2017), many novel deep learning architectures have been developed to improve forecasting performance. To effectively capture long-term dependencies, a known limitation of RNNs, Stoller et al. (2020) have employed convolutional neural networks (CNNs). However, many identical CNN blocks must be stacked to cover long-term dependencies (Zhou et al., 2021). Attention-based models, including Transformer (Vaswani et al., 2017) and Informer (Zhou et al., 2021), have been another popular research direction in time-series forecasting. Although these models effectively capture temporal dependencies, they incur high computational costs and often struggle to extract appropriate temporal information (Wu et al., 2021). To cope with this problem, Wu et al. (2021) and Zhou et al. (2022) have adopted input decomposition, which helps models better encode the appropriate information. Other state-of-the-art models adopt neural memory networks (Kaiser et al., 2017; Sukhbaatar et al., 2015; Madotto et al., 2018; Lee et al., 2022), which refer to historical data stored in memory to generate meaningful representations.
2.2. Training Metrics
Conventionally, the mean squared error (MSE), the $L_p$ norm, and their variants are the mainstream metrics used to optimize forecasting models. However, they are not optimal for training forecasting models (Esling & Agon, 2012) because time-series are temporally continuous. Moreover, the $L_p$ norm provides little information about the temporal correlation among time-series data.
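For concreteness (stated here in generic notation, since the paper's notation is only introduced in Sec. 3.1), the $L_p$ loss between a prediction $\hat{y}$ and a target $y$ of length $n$ is
$$\mathcal{L}_p(\hat{y}, y) = \Big(\sum_{i=1}^{n} |\hat{y}_i - y_i|^p\Big)^{1/p},$$
with MSE corresponding to the scaled squared $L_2$ case, $\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$. Every term compares $\hat{y}_i$ and $y_i$ at the same index only, which is why such metrics are blind to alignment across time.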
To better model temporal dynamics in time-series data, researchers have used differentiable, approximated dynamic time warping (DTW) as an alternative to MSE (Cuturi & Blondel, 2017; Abid & Zou, 2018; Mensch & Blondel, 2018). However, using DTW as a loss function ignores the temporal localization of changes.
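To make the smoothed-DTW idea concrete, the following is a minimal NumPy sketch of the soft-DTW recursion in the spirit of Cuturi & Blondel (2017); the function name, the squared-Euclidean cost, and the plain Python loops are our illustrative choices, and practical implementations are batched and provide analytic gradients:

```python
import numpy as np

def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW discrepancy between 1-D series x and y.

    gamma > 0 controls the smoothing; gamma -> 0 recovers classic DTW.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, m = len(x), len(y)
    D = (x[:, None] - y[None, :]) ** 2   # pairwise squared-Euclidean costs
    R = np.full((n + 1, m + 1), np.inf)  # soft alignment costs of prefixes
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Smoothed minimum over the three DTW moves
            # (match, insertion, deletion), in a stable log-sum-exp form.
            r = np.array([R[i - 1, j - 1], R[i - 1, j], R[i, j - 1]])
            rmin = r.min()
            softmin = rmin - gamma * np.log(np.exp(-(r - rmin) / gamma).sum())
            R[i, j] = D[i - 1, j - 1] + softmin
    return R[n, m]
```

Minimizing such a loss rewards shape alignment, but because the optimal warping path may match points that are far apart in time, it carries little information about when a change occurs, which motivates the temporal term discussed next.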
Recently, Le Guen & Thome (2019) have proposed DILATE, a training metric designed to catch sudden changes in nonstationary signals in a timely manner via a smooth approximation of DTW and a penalized temporal distortion index (TDI). To guarantee timely operation, the penalized TDI imposes a harsh penalty when predictions exhibit high temporal distortion. However, the TDI relies on the DTW path, and DTW is often misaligned because of its sensitivity to noise and scale. Thus, DILATE often loses its advantage on complex data, degrading training.
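For reference, DILATE combines the two terms as a convex combination (our paraphrase of Le Guen & Thome (2019), with $\alpha \in [0, 1]$ a balancing hyperparameter):
$$\mathcal{L}_{\mathrm{DILATE}}(\hat{\mathbf{Y}}, \mathbf{Y}) = \alpha\,\mathcal{L}_{\mathrm{shape}}(\hat{\mathbf{Y}}, \mathbf{Y}) + (1-\alpha)\,\mathcal{L}_{\mathrm{temporal}}(\hat{\mathbf{Y}}, \mathbf{Y}),$$
where $\mathcal{L}_{\mathrm{shape}}$ is the smooth DTW approximation and $\mathcal{L}_{\mathrm{temporal}}$ is the penalized TDI computed from the soft alignment path; any misalignment in that path therefore propagates directly into the temporal term.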
In this paper, we discuss distortions and transformation invariances and design a new loss function that enables models to learn shapes in the data and produce noise-robust forecasting results.
3. Preliminary
In this section, we investigate common distortions, focusing on the goal of time-series forecasting (i.e., modeling temporal dynamics and accurate forecasting). To clarify the concepts of time-series forecasting and related terms, we first define the notations and terms used (Sec. 3.1). We then discuss common distortions in time-series, viewed from the transformation perspective, that need to be considered for building a shape-aware loss function (Sec. 3.2), and describe how other loss functions (e.g., dynamic time warping (DTW) and the temporal distortion index (TDI)) handle shapes during learning (Sec. 3.3). We discuss the conditions for effective time-series forecasting in the next section (Sec. 4.1).
3.1. Notations and Definitions
Let $X_t$ denote a data point at time step $t$. We define the time-series forecasting problem as follows:
Definition 3.1. Given a $T$-length historical time-series $\mathbf{X} = [X_{t-T+1}, \ldots, X_t]$, $X_i \in \mathbb{R}^F$ at time $i$, and a corresponding $T'$-length future time-series $\mathbf{Y} = [Y_{t+1}, \ldots, Y_{t+T'}]$, $Y_i \in \mathbb{R}^C$, time-series forecasting aims to learn the mapping function $f: \mathbb{R}^{T \times F} \rightarrow \mathbb{R}^{T' \times C}$.
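As an illustration of the input/output contract in Definition 3.1, the sketch below shows a toy PyTorch module whose forward pass maps $\mathbb{R}^{T \times F}$ to $\mathbb{R}^{T' \times C}$; the class name and the single linear layer are ours, purely for illustration, and any forecasting architecture could take its place:

```python
import torch
import torch.nn as nn

class Forecaster(nn.Module):
    """Toy mapping f: R^{T x F} -> R^{T' x C} from Definition 3.1."""

    def __init__(self, T: int, F: int, T_out: int, C: int):
        super().__init__()
        self.T_out, self.C = T_out, C
        # One linear map over the flattened history; real models
        # (RNNs, CNNs, Transformers, ...) would replace this layer.
        self.proj = nn.Linear(T * F, T_out * C)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X: (batch, T, F)  ->  Y_hat: (batch, T', C)
        return self.proj(X.flatten(1)).view(-1, self.T_out, self.C)

# Usage: a 96-step history of 7 features -> a 24-step forecast of 1 target.
f = Forecaster(T=96, F=7, T_out=24, C=1)
Y_hat = f(torch.randn(32, 96, 7))  # Y_hat.shape == (32, 24, 1)
```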
To distinguish between the label (i.e., ground truth) and the predicted time-series, we denote the label data as $\mathbf{Y}$ and the prediction as $\hat{\mathbf{Y}}$. Next, we set up two goals for time-series forecasting, which require not only precise but also informative forecasting (Wu et al., 2021; Zhou et al., 2022; Le Guen & Thome, 2019), as follows:
• The mapping function $f$ should be learned to reduce the point-wise distance between $\hat{\mathbf{Y}}$ and $\mathbf{Y}$;
• The output $\hat{\mathbf{Y}}$ should have temporal dynamics similar to those of $\mathbf{Y}$.
Temporal dynamics are informative patterns in a time-series, such as a rise, drop, peak, or plateau. Optimizing for point-wise distance reduction is the conventional approach in the deep learning domain and can be achieved with MAE or MSE. However, real-world problems, such as traffic speed or stock market prediction, also require accurate forecasting of the temporal dynamics. Esling & Agon (2012) likewise emphasized the measurement of temporal dynamics, as “...allowing the recognition of perceptually similar objects even though they are not mathematically identical.” In this paper, we define temporal dynamics as follows:
Definition 3.2. Temporal dynamics (or shapes) are informative periodic and nonperiodic patterns in time-series data.
In this work, we aim to design a shape-aware loss function that satisfies both goals. To this end, we first discuss the distortions that two time-series with similar shapes can have.
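As a simple numerical illustration of the tension between the two goals (a toy example of ours, not from the original text): a quarter-period-shifted copy of a sine wave preserves its shape exactly, yet point-wise MSE scores it worse than a flat, shapeless prediction:

```python
import numpy as np

t = np.linspace(0, 4 * np.pi, 200)
y = np.sin(t)                      # ground-truth signal
y_shifted = np.sin(t - np.pi / 2)  # same shape, delayed by a quarter period
y_flat = np.zeros_like(t)          # constant line with no shape at all

mse = lambda a, b: float(np.mean((a - b) ** 2))
print(mse(y, y_shifted))  # ~1.0: heavily penalized despite identical shape
print(mse(y, y_flat))     # ~0.5: favored despite having no temporal dynamics
```

A shape-aware loss function should instead tolerate such benign distortions while still penalizing the shapeless prediction.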