MTSMAE: Masked Autoencoders for Multivariate
Time-Series Forecasting
Peiwang Tang1,2, Xianchao Zhang3,4∗
1Institute of Advanced Technology, University of Science and Technology of China, China
2G60 STI Valley Industry & Innovation Institute, Jiaxing University, China
3Key Laboratory of Medical Electronics and Digital Health of Zhejiang Province,
Jiaxing University, China
4Engineering Research Center of Intelligent Human Health Situation Awareness of Zhejiang Province,
Jiaxing University, China
{tpw}@mail.ustc.edu.cn, {zhangxianchao}@zjxu.edu.cn
Abstract—Large-scale self-supervised pre-trained Transformer architectures have significantly boosted performance on various tasks in natural language processing (NLP) and computer vision (CV). However, there is little research on processing multivariate time-series with pre-trained Transformers; in particular, masking time-series for self-supervised learning remains largely unexplored. Unlike language and images, the information density of time-series makes this line of research more difficult, and the challenge is compounded by the fact that previous patch embedding and masking methods do not transfer directly. In this paper, we propose a patch embedding method tailored to the characteristics of multivariate time-series and present a self-supervised pre-training approach based on Masked Autoencoders (MAE), called MTSMAE, which significantly improves performance over supervised learning without pre-training. We evaluate our method on several common multivariate time-series datasets from different fields and with different characteristics; the experimental results demonstrate that our method significantly outperforms the best currently available methods.
Index Terms—Autoencoder, Pre-Training, Time-Series, Forecasting
I. INTRODUCTION
With the rapid development of deep learning in recent years [1]–[3], training a model is expected to require hundreds of millions of labeled examples [4]. This demand for large-scale data has been addressed by self-supervised pre-training in natural language processing (NLP) and computer vision (CV) [5], [6]. Most of these solutions are based on masked modeling, such as masked language modeling in NLP [7], [8] or masked image modeling in CV [9]–[11]. The idea is conceptually simple: first mask parts of the original data, then learn to recover the masked parts [12], [13]. Masked modeling encourages the model to infer the removed parts from contextual information, so that it learns deep semantics, and it has become the benchmark for self-supervised pre-training in NLP and CV [5], [6]. These pre-trained masked models have been shown to transfer well to various downstream tasks, and one of the simplest and most effective approaches is the masked autoencoder (MAE) [5]. However, despite wide interest in this idea from academia and industry following the success of MAE, progress on autoencoder methods for multivariate time-series data (MTSD) lags behind other fields.
One of the main reasons is that the information density of MTSD differs from that of CV and NLP. The local information of MTSD appears heavily redundant, whereas the multivariate information within each time point is highly specific. Missing values can be recovered easily from adjacent time points with little high-level understanding. To overcome this difference and encourage the learning of more useful features, we follow the idea of the Vision Transformer (ViT) [4]: we split MTSD into patches and mask a higher ratio of random patches than the original MAE, e.g., 85%. This simple strategy works well on MTSD: it effectively reduces redundancy and pushes the model toward an overall understanding that goes beyond low-level information.
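As a concrete illustration, the following sketch shows one way to split a multivariate series into non-overlapping patches along the time axis and mask a high ratio (e.g., 85%) of them at random. The patch length, helper names, and shapes are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
import torch

def patchify(x: torch.Tensor, patch_len: int) -> torch.Tensor:
    """Split a batch of multivariate series (B, L, C) into
    non-overlapping patches of shape (B, N, patch_len * C)."""
    B, L, C = x.shape
    N = L // patch_len                                  # number of patches
    return x[:, :N * patch_len].reshape(B, N, patch_len * C)

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.85):
    """Keep a random subset of patches; return the visible patches,
    the binary mask (1 = masked) and the indices to restore order."""
    B, N, D = patches.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)     # per-patch random scores
    ids_shuffle = noise.argsort(dim=1)                  # ascending: smallest are kept
    ids_restore = ids_shuffle.argsort(dim=1)
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=patches.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)           # back to original patch order
    return visible, mask, ids_restore

# Example: 32 series, 96 time steps, 7 variables, patch length 12
x = torch.randn(32, 96, 7)
patches = patchify(x, patch_len=12)                     # (32, 8, 84)
visible, mask, ids_restore = random_masking(patches)    # keeps ~15% of the patches
```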
Another reason is the design of the decoder. The decoder of an autoencoder maps the latent representation back to the input. In CV, the decoder reconstructs pixel-level representations within patches; in NLP, it predicts missing words. In MTSD, the decoder must recover data that is highly specific and whose dimensionality varies widely across datasets: each time point may have as many as 321 dimensions or as few as 7. We find that the decoder design plays a key role in learning latent representations for MTSD.
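To illustrate why the decoder must adapt to the per-time-point dimensionality, the sketch below shows a minimal lightweight decoder: a shallow stack of Transformer blocks followed by a linear projection whose width is patch_len * n_channels, so the same design covers a 7-dimensional dataset and a 321-dimensional one by changing only the projection size. The depth, width, and class name are assumptions for illustration, not the paper's exact configuration.

```python
import torch.nn as nn

class LightweightDecoder(nn.Module):
    """Shallow Transformer decoder mapping latent patch tokens back to raw
    patch values; only the final projection depends on the data dimension."""
    def __init__(self, d_model: int, patch_len: int, n_channels: int,
                 depth: int = 1, n_heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=2 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Linear(d_model, patch_len * n_channels)

    def forward(self, tokens):                 # tokens: (B, N, d_model)
        return self.proj(self.blocks(tokens))  # (B, N, patch_len * n_channels)

# The same class covers both a 7-variable and a 321-variable dataset:
dec_small = LightweightDecoder(d_model=64, patch_len=12, n_channels=7)
dec_large = LightweightDecoder(d_model=64, patch_len=12, n_channels=321, depth=2)
```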
Based on the above analysis, we propose MTSMAE, a very simple and effective method for MTSD representation learning. The idea is straightforward: during pre-training, we split MTSD into patches, mask random patches of the input, and recover the missing patches; during fine-tuning, we take the encoder trained in the previous step and redesign the input of the decoder. In addition, our encoder operates only on visible patches. Unlike BERT [6], whose decoder is a single MLP layer, we design decoders of different depths for different MTSD; compared with MAE, all of our decoders are lightweight. We conduct extensive experiments on four datasets of three types. The results show that the proposed MTSMAE significantly improves prediction accuracy and outperforms other state-of-the-art models.
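Putting the pieces together, a hedged sketch of the pre-training step could look as follows: the encoder sees only the visible patches, learnable mask tokens are inserted at the masked positions before decoding, and the reconstruction loss is computed only on the masked patches. All module names and hyperparameters are illustrative and reuse the helpers sketched above rather than the paper's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTSMAEPretrainSketch(nn.Module):
    """Illustrative MAE-style pre-training module for multivariate time-series."""
    def __init__(self, patch_dim: int, d_model: int = 64, enc_depth: int = 2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)            # patch embedding
        enc_layer = nn.TransformerEncoderLayer(d_model, 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, enc_depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.decoder = LightweightDecoder(d_model, patch_len=12, n_channels=7)

    def forward(self, patches, mask_ratio=0.85):
        visible, mask, ids_restore = random_masking(patches, mask_ratio)
        latent = self.encoder(self.embed(visible))            # encode visible patches only
        B, N = mask.shape
        mask_tokens = self.mask_token.expand(B, N - latent.shape[1], -1)
        full = torch.cat([latent, mask_tokens], dim=1)        # re-insert masked slots
        full = torch.gather(full, 1, ids_restore.unsqueeze(-1)
                            .expand(-1, -1, full.shape[-1]))  # restore original order
        pred = self.decoder(full)                             # reconstruct all patches
        loss = (F.mse_loss(pred, patches, reduction="none").mean(-1) * mask).sum() / mask.sum()
        return loss                                           # loss on masked patches only

# Usage with the patches from the earlier sketch (patch_dim = 12 * 7 = 84):
model = MTSMAEPretrainSketch(patch_dim=84)
loss = model(patches)
loss.backward()
```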