in modeling and learning the linear and stationary time dependencies with a noise component (De Gooijer & Hyndman, 2006), and the multivariate autoregressive time series model (MAR) (Fountis & Dickey, 1989), which can learn time series patterns accompanied by explanatory variables. Moreover, several statistical methods have been developed to extract the nonlinear signals from the series, such
several statistical methods have been developed to extract the nonlinear signals from the series, such
as the bilinear model (Poskitt & Tremayne, 1986), the threshold autoregressive model (Stark, 1992),
and the autoregressive conditional heteroscedastic (ARCH) model (Engle, 1982). However, these models impose a strict stationarity requirement on the time series, which severely restricts their practical use when most of the influencing factors are unavailable.
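For illustration, the stationarity check that such models typically rely on, followed by a simple autoregressive fit, can be sketched as follows; the synthetic series, lag order, and forecast horizon are illustrative assumptions rather than the setup of any cited study.

```python
# Illustrative sketch (not from the cited works): check stationarity with the
# augmented Dickey-Fuller test, difference if needed, then fit an AR model.
import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(0)
series = rng.standard_normal(200).cumsum()      # synthetic random-walk series

_, p_value, *_ = adfuller(series)               # H0: the series has a unit root
if p_value > 0.05:                              # cannot reject H0 -> likely non-stationary
    series = np.diff(series)                    # first-difference to stabilize the mean

model = AutoReg(series, lags=4).fit()           # AR(4); the lag order is arbitrary here
forecast = model.predict(start=len(series), end=len(series) + 11)  # 12 steps ahead
```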
Since time series prediction is closely related to regression analysis in machine learning, traditional machine learning models (MLs), such as the decision tree (DT) (Galicia, Talavera-Llames, Troncoso, Koprinska, & Martínez-Álvarez, 2019; Lee & Oh, 1996), the support vector machine (SVM), and the k-nearest neighbor (kNN), can be used for time series forecasting (Galicia, et al., 2019).
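To make the link to regression concrete, the following sketch recasts one-step-ahead forecasting as a supervised regression problem by treating a sliding window of lagged values as the feature vector; the window length, kNN regressor, and synthetic data are illustrative assumptions, not the configurations used in the cited studies.

```python
# Illustrative sketch: one-step-ahead forecasting as supervised regression.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def make_windows(series, window=8):
    """Lagged windows become features X; the next value becomes the target y."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

t = np.linspace(0, 20, 300)
series = np.sin(t) + 0.1 * np.random.randn(300)   # synthetic noisy sine wave

X, y = make_windows(series)
model = KNeighborsRegressor(n_neighbors=5).fit(X[:-50], y[:-50])  # hold out the last 50 points
preds = model.predict(X[-50:])                                    # one-step-ahead predictions
```

The same windowed features could be fed to a DT or SVM regressor; only the estimator changes.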
Inspired by the notable achievements of deep learning (DL) in natural language processing (Devlin, Chang, Lee, & Toutanova, 2018), image classification (Krizhevsky, Sutskever, & Hinton, 2012), and reinforcement learning (Silver, et al., 2016), several artificial neural network (ANN) algorithms have drawn attention and, owing to their better prediction accuracy, have become strong contenders to statistical methods in the forecasting community (Zhang, Patuwo, & Hu, 1998).
Significantly, different from MLs, which require hand-crafted features, DLs have great potential to learn complex nonlinear temporal feature interactions among multiple series. Because DLs automatically learn complex data representations of an MTS, they alleviate the need for manual feature engineering and model design (Bengio, Courville, & Vincent, 2013; Lim & Zohren, 2021). Moreover, DLs can better capture both the linear and nonlinear patterns of the data.
Initially, most DLs were developed to model and learn the temporal dependency of time series. For instance, the simplest DL, the recurrent neural network (RNN), can store a large amount of information about the past and dynamically update its hidden state (Rumelhart, et al., 1986; Werbos, 1990; Elman, 1990). To address the weakness of RNNs in managing long-term dependencies, the long short-term memory (LSTM) network (Hochreiter & Schmidhuber, 1997), a variant of the RNN capable of learning long-term dependence, has also been employed for series forecasting (Gers, Schmidhuber, & Cummins, 2000), where it can be organized into separate autoencoder and forecasting sub-models. The LSTM retains the recurrent architecture of the RNN but, unlike the plain RNN, it alleviates the vanishing-gradient problem. The gated recurrent unit (GRU) (Dey & Salem, 2017) is also an important variant of the RNN; its basic idea for learning long-term dependence is consistent with that of the LSTM, but it uses only a reset gate and an update gate. The long- and short-term time-series network (LSTNet) (Lai, Chang, Yang, & Liu, 2018) is designed specifically for MTS forecasting with up to hundreds of time series. LSTNet uses convolutional neural networks (CNNs) to capture short-term patterns and an LSTM (Hochreiter & Schmidhuber, 1997) or GRU (Dey & Salem, 2017) to memorize relatively long-term patterns.
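A highly simplified sketch of this convolution-plus-recurrence design is given below; it only mirrors the general idea of pairing a convolutional layer (short-term patterns) with a GRU (longer-term memory), and the layer sizes, kernel width, and the omission of LSTNet's skip-recurrent and autoregressive components are assumptions made for brevity.

```python
# Illustrative sketch of an LSTNet-style encoder (convolution + GRU), omitting
# the skip-recurrent and autoregressive components of the original model.
import torch
import torch.nn as nn

class ConvGRUForecaster(nn.Module):
    def __init__(self, n_series, hidden=64, kernel=6):
        super().__init__()
        self.conv = nn.Conv1d(n_series, hidden, kernel_size=kernel)  # short-term patterns
        self.gru = nn.GRU(hidden, hidden, batch_first=True)          # longer-term memory
        self.out = nn.Linear(hidden, n_series)                       # one-step-ahead output

    def forward(self, x):                             # x: (batch, time, n_series)
        z = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, hidden, time')
        _, h = self.gru(z.transpose(1, 2))            # final hidden state: (1, batch, hidden)
        return self.out(h[-1])                        # (batch, n_series)

# Usage: 32 windows of 96 time steps over 8 series -> next-step prediction per series.
y_hat = ConvGRUForecaster(n_series=8)(torch.randn(32, 96, 8))
```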
Besides, the attention mechanism (Bahdanau, Cho, & Bengio, 2014; Luong, Pham, & Manning, 2015), originally utilized in encoder-decoder networks, partially solves the problem of integrating correlated units and thus increases the effectiveness of RNNs (Lai, et al., 2018). Temporal pattern attention reviews the information at each step and selects the relevant parts to help generate the outputs (Shih, Sun, & Lee, 2019).
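The following sketch conveys the general idea of attention over recurrent hidden states, where the encoder states are scored against the last state and combined into a weighted context; it is a generic dot-product temporal attention under assumed dimensions, not the exact temporal pattern attention of Shih, Sun, and Lee (2019).

```python
# Illustrative sketch of dot-product attention over GRU hidden states; this is a
# generic temporal attention, not the exact mechanism of the cited papers.
import torch
import torch.nn as nn

class AttentiveGRUForecaster(nn.Module):
    def __init__(self, n_series, hidden=64):
        super().__init__()
        self.gru = nn.GRU(n_series, hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_series)

    def forward(self, x):                              # x: (batch, time, n_series)
        states, _ = self.gru(x)                        # (batch, time, hidden)
        query = states[:, -1:, :]                      # last hidden state as the query
        scores = torch.bmm(query, states.transpose(1, 2))     # (batch, 1, time)
        weights = torch.softmax(scores, dim=-1)               # attention over time steps
        context = torch.bmm(weights, states).squeeze(1)       # weighted sum of states
        return self.out(torch.cat([context, query.squeeze(1)], dim=-1))

# Usage: predict the next value of 8 series from windows of 96 steps.
y_hat = AttentiveGRUForecaster(n_series=8)(torch.randn(32, 96, 8))
```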
Recent studies demonstrate how both the automatic feature-learning capabilities of LSTMs and their ability to handle input sequences can be harnessed in an end-to-end model that drives demand forecasting (Hu & Zheng, 2020).
Besides learning the dynamics of temporal dependence, time series that exhibit spatial