MTSMAE: Masked Autoencoders for Multivariate
Time-Series Forecasting
Peiwang Tang1,2, Xianchao Zhang3,4∗
1Institute of Advanced Technology, University of Science and Technology of China, China
2G60 STI Valley Industry & Innovation Institute, Jiaxing University, China
3Key Laboratory of Medical Electronics and Digital Health of Zhejiang Province,
Jiaxing University, China
4Engineering Research Center of Intelligent Human Health Situation Awareness of Zhejiang Province,
Jiaxing University, China
{tpw}@mail.ustc.edu.cn, {zhangxianchao}@zjxu.edu.cn
Abstract—Large-scale self-supervised pre-trained Transformer architectures have significantly boosted performance on various tasks in natural language processing (NLP) and computer vision (CV). However, there is little research on processing multivariate time-series with pre-trained Transformers; in particular, masking time-series for self-supervised learning remains largely unexplored. Unlike language and images, the information density of time-series makes this line of research more difficult, and the challenge is compounded by the fact that previous patch embedding and masking methods do not transfer directly. In this paper, we propose a patch embedding method tailored to the characteristics of multivariate time-series and present a self-supervised pre-training approach based on Masked Autoencoders (MAE), called MTSMAE, which significantly improves performance over supervised learning without pre-training. We evaluate our method on several common multivariate time-series datasets from different fields and with different characteristics; the experimental results demonstrate that our method significantly outperforms the best currently available methods.
Index Terms—Autoencoder, Pre-Training, Time-Series, Forecasting
I. INTRODUCTION
With the rapid development of deep learning in recent years [1]–[3], training a model is expected to require hundreds of millions of labeled examples [4]. This demand for large-scale data has been addressed by self-supervised pre-training in natural language processing (NLP) and computer vision (CV) [5], [6]. Most of these solutions are based on masked modeling, such as masked language modeling in NLP [7], [8] or masked image modeling in CV [9]–[11]. The idea is conceptually simple: first mask parts of the original data, then learn to recover the masked parts [12], [13]. Masked modeling encourages the model to infer the removed parts from contextual information, so that it learns deep semantics, and it has become the benchmark for self-supervised pre-training in NLP and CV [5], [6]. These pre-trained masked models have been shown to transfer well to various downstream tasks, and one of the simplest and most effective approaches is the masked autoencoder (MAE) [5]. However, despite wide interest in this idea from academia and industry following the success of MAE, progress on autoencoder methods for multivariate time-series data (MTSD) lags behind other fields.
One of the main reasons is that the information density of MTSD differs from that of CV and NLP. The local information of MTSD appears heavily redundant, whereas the multivariate information within each time point is highly specific. Missing values can be recovered easily from adjacent time points with little high-level understanding. To overcome this difference and encourage the learning of more useful features, we follow the idea of the Vision Transformer (ViT) [4]: we split MTSD into patches and mask a higher ratio of random patches than the original MAE, e.g., 85%. This simple strategy works well on MTSD: it effectively reduces redundancy and pushes the model toward an overall understanding that goes beyond low-level information.
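As a concrete illustration, the following sketch shows one way to split a multivariate series into non-overlapping patches along the time axis and mask a high ratio (e.g., 85%) of them at random. The patch length, helper names, and shapes are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
import torch

def patchify(x: torch.Tensor, patch_len: int) -> torch.Tensor:
    """Split a batch of multivariate series (B, L, C) into
    non-overlapping patches of shape (B, N, patch_len * C)."""
    B, L, C = x.shape
    N = L // patch_len                                  # number of patches
    return x[:, :N * patch_len].reshape(B, N, patch_len * C)

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.85):
    """Keep a random subset of patches; return the visible patches,
    the binary mask (1 = masked) and the indices to restore order."""
    B, N, D = patches.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)     # per-patch random scores
    ids_shuffle = noise.argsort(dim=1)                  # ascending: smallest are kept
    ids_restore = ids_shuffle.argsort(dim=1)
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=patches.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)           # back to original patch order
    return visible, mask, ids_restore

# Example: 32 series, 96 time steps, 7 variables, patch length 12
x = torch.randn(32, 96, 7)
patches = patchify(x, patch_len=12)                     # (32, 8, 84)
visible, mask, ids_restore = random_masking(patches)    # keeps ~15% of the patches
```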
Another reason is the design of the decoder. The decoder of an autoencoder maps the latent representation back to the input. In CV, the decoder reconstructs pixel-level representations within patches; in NLP, it predicts missing words. In MTSD, the decoder must recover data that is highly specific and whose dimensionality varies widely across datasets: each time point may have as many as 321 dimensions or as few as 7. We find that the decoder design plays a key role in learning latent representations for MTSD.
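To illustrate why the decoder must adapt to the per-time-point dimensionality, the sketch below shows a minimal lightweight decoder: a shallow stack of Transformer blocks followed by a linear projection whose width is patch_len * n_channels, so the same design covers a 7-dimensional dataset and a 321-dimensional one by changing only the projection size. The depth, width, and class name are assumptions for illustration, not the paper's exact configuration.

```python
import torch.nn as nn

class LightweightDecoder(nn.Module):
    """Shallow Transformer decoder mapping latent patch tokens back to raw
    patch values; only the final projection depends on the data dimension."""
    def __init__(self, d_model: int, patch_len: int, n_channels: int,
                 depth: int = 1, n_heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=2 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Linear(d_model, patch_len * n_channels)

    def forward(self, tokens):                 # tokens: (B, N, d_model)
        return self.proj(self.blocks(tokens))  # (B, N, patch_len * n_channels)

# The same class covers both a 7-variable and a 321-variable dataset:
dec_small = LightweightDecoder(d_model=64, patch_len=12, n_channels=7)
dec_large = LightweightDecoder(d_model=64, patch_len=12, n_channels=321, depth=2)
```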
Based on the above analysis, we propose MTSMAE, a very simple and effective method for MTSD representation learning. The idea is straightforward: during pre-training, we split MTSD into patches, mask random patches of the input, and recover the missing patches; during fine-tuning, we take the encoder trained in the previous step and redesign the input of the decoder. In addition, our encoder operates only on visible patches. Unlike BERT [6], whose decoder is a single MLP layer, we design decoders of different depths for different MTSD; compared with MAE, all of our decoders are lightweight. We conduct extensive experiments on four datasets of three types. The results show that the proposed MTSMAE significantly improves prediction accuracy and outperforms other state-of-the-art models.
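Putting the pieces together, a hedged sketch of the pre-training step could look as follows: the encoder sees only the visible patches, learnable mask tokens are inserted at the masked positions before decoding, and the reconstruction loss is computed only on the masked patches. All module names and hyperparameters are illustrative and reuse the helpers sketched above rather than the paper's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTSMAEPretrainSketch(nn.Module):
    """Illustrative MAE-style pre-training module for multivariate time-series."""
    def __init__(self, patch_dim: int, d_model: int = 64, enc_depth: int = 2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)            # patch embedding
        enc_layer = nn.TransformerEncoderLayer(d_model, 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, enc_depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.decoder = LightweightDecoder(d_model, patch_len=12, n_channels=7)

    def forward(self, patches, mask_ratio=0.85):
        visible, mask, ids_restore = random_masking(patches, mask_ratio)
        latent = self.encoder(self.embed(visible))            # encode visible patches only
        B, N = mask.shape
        mask_tokens = self.mask_token.expand(B, N - latent.shape[1], -1)
        full = torch.cat([latent, mask_tokens], dim=1)        # re-insert masked slots
        full = torch.gather(full, 1, ids_restore.unsqueeze(-1)
                            .expand(-1, -1, full.shape[-1]))  # restore original order
        pred = self.decoder(full)                             # reconstruct all patches
        loss = (F.mse_loss(pred, patches, reduction="none").mean(-1) * mask).sum() / mask.sum()
        return loss                                           # loss on masked patches only

# Usage with the patches from the earlier sketch (patch_dim = 12 * 7 = 84):
model = MTSMAEPretrainSketch(patch_dim=84)
loss = model(patches)
loss.backward()
```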