modules: a spatial encoder-decoder module and a predictive
learning module, which we call the predictor. The spatial
encoder converts data into a latent space. We first train the
encoder-decoder on static images; for spatiotemporal sequence
prediction, we then simply feed the spatiotemporal data into this
spatial encoder-decoder to obtain a representation tensor at each
time step of the sequence, and this representation tensor replaces
the original image as the predictor's input. The predictor then
generates the representation tensors of future data rather than
raw images. Finally, the representation tensors generated by the
predictor are fed into the spatial decoder to recover the real
spatiotemporal data. For Deficiency II, we adopt the Vector
Quantised-Variational AutoEncoder (VQ-VAE) [16] for this task
instead of a plain CNN. VQ-VAE is an autoencoder with excellent
performance in image generation tasks, and its architecture fits
our needs.
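To make this pipeline concrete, a minimal sketch is given below (assuming PyTorch; `LatentPredictionPipeline` and the submodule interfaces are hypothetical placeholders, not the paper's actual code):

```python
import torch
import torch.nn as nn

class LatentPredictionPipeline(nn.Module):
    """Hypothetical sketch: a pre-trained spatial encoder/decoder pair
    wrapped around a predictor that operates on representation tensors."""
    def __init__(self, encoder: nn.Module, predictor: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder      # pre-trained on static images, then frozen
        self.predictor = predictor  # e.g. a PredRNN- or TCTN-style model
        self.decoder = decoder      # maps representation tensors back to pixels

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t = frames.shape[:2]
        # Encode every frame independently into a representation tensor.
        z = self.encoder(frames.flatten(0, 1)).unflatten(0, (b, t))
        # Predict the representation tensors of future frames, not raw pixels.
        z_future = self.predictor(z)
        bf, tf = z_future.shape[:2]
        # Decode the predicted representations into real spatiotemporal data.
        return self.decoder(z_future.flatten(0, 1)).unflatten(0, (bf, tf))
```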
We use two representative spatiotemporal sequence predic-
tion models, PredRNN and TCTN, as predictors. The former
predicts data with an RNN-like architecture, while the latter
uses a transformer architecture. Modifying these two different
architectures demonstrates the generality of our proposed
approach. We do not make major changes to the structure of
the original models. The main changes are: on the one hand,
the input layer of the original model is modified so that images
are first processed by a pre-trained VQ-VAE encoder before
being fed into the model; on the other hand, the parameters of
the original model are reduced (the number of layers remains
the same, while the hidden units of each layer are halved).
We find experimentally that our approach improves the models'
final results despite the reduction in predictor parameters.
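A minimal sketch of these two modifications, assuming PyTorch (`halve_hidden_units` and `freeze` are illustrative helpers, not the paper's code):

```python
import torch.nn as nn

def halve_hidden_units(base_hidden):
    # Keep the layer count, halve the hidden units per layer
    # (e.g. [128, 128, 128, 128] -> [64, 64, 64, 64]).
    return [h // 2 for h in base_hidden]

def freeze(module: nn.Module) -> nn.Module:
    # The VQ-VAE encoder is pre-trained on static images and then frozen,
    # so only the predictor's parameters are updated during training.
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()
```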
In summary, the contributions of this paper include the
following:
• Applying a modular design that decomposes the sequence
prediction model into two parts, a spatial encoder-decoder
and a predictor, making the model easier and more
efficient to train.
• Utilizing VQ-VAE to encode and generate images, which
improves data generation quality.
II. RELATED WORK
In this section we introduce two different temporal sequence
model architectures and analyze their characteristics; we then
introduce a VAE-family model.
Convolutional RNNs. Models of ConvRNN-like architec-
tures were first proposed by Shi et al. They use convolutional
operations instead of the fully connected layer in FC-LSTM
as a way to better capture spatiotemporal correlations in spa-
tiotemporal sequence data. However, due to their simple archi-
tecture, ambiguity can occur in the generated future frames.
Many variant models of ConvLSTM have been proposed [17]–
[19], among which PredRNN is an improved ConvLSTM
network that is typical of RNN-based approaches. PredRNN
adds a spatiotemporal memory unit to the architecture of
ConvLSTM and allows the data to be passed along horizontal
and vertical directions, enabling it to better capture spatio-
temporal correlations. PredRNN also employs several tech-
niques to improve prediction, such as reverse scheduled
sampling (RSS), which gradually replaces the real data with
predicted data as input during the multi-step iterative prediction
process. The shortcomings of ConvRNNs are that they are
constrained by the RNN architecture, produce only mediocre
results on long sequences, and are computationally intensive.
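To illustrate the core ConvRNN idea, the sketch below shows a minimal ConvLSTM-style cell, in which a single convolution computes all four LSTM gates over the concatenated input and hidden state (assuming PyTorch; this is illustrative, not Shi et al.'s exact implementation):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM-style cell: convolutions replace the fully
    connected gate computations of FC-LSTM, so the hidden and cell
    states keep their spatial structure."""
    def __init__(self, in_ch: int, hid_ch: int, kernel: int = 3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel,
                               padding=kernel // 2)

    def forward(self, x, state):
        h, c = state  # hidden and cell states, both (batch, hid_ch, H, W)
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)  # convolutional analogue of the LSTM update
        h = o * torch.tanh(c)
        return h, (h, c)
```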
Convolutional Transformers. Transformer research has
achieved significant results in both natural language processing
and computer vision; as a result, transformer-based models
have gradually become more common in spatiotemporal
sequence prediction tasks. The standard transformer uses
stacked transformer blocks with spatial attention and temporal
attention for spatiotemporal sequence prediction; however, it
shares a drawback with FC-LSTM: limited by its fully
connected layers, the model cannot capture short-term
dependencies well. ConvTransformer [13] applies the trans-
former structure to the domain of spatiotemporal sequence
data, using multi-headed convolutional attention instead of the
traditional one-dimensional attention mechanism to capture
spatiotemporal features. ConvTransformer performs well at
predicting intermediate content with known context, but only
mediocrely for future information. Inspired by the multi-headed
convolutional attention in ConvTransformer, Yang et al.
proposed TCTN [15], a 3D convolutional transformer-based
spatiotemporal sequence prediction model. TCTN uses a
transformer-based encoder with temporal convolutional layers
to capture both short-term and long-term dependencies,
achieving results comparable to PredRNN on some metrics.
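The flavor of such convolutional attention can be sketched as follows (assuming PyTorch; single-head and simplified for brevity, so this illustrates the idea rather than reproducing ConvTransformer's or TCTN's actual layers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAttention3d(nn.Module):
    """Sketch of convolutional attention: queries, keys, and values are
    produced by 3D convolutions over (T, H, W) instead of the usual
    per-token linear projections, then attention is applied across time."""
    def __init__(self, channels: int, kernel: int = 3):
        super().__init__()
        pad = kernel // 2
        self.q = nn.Conv3d(channels, channels, kernel, padding=pad)
        self.k = nn.Conv3d(channels, channels, kernel, padding=pad)
        self.v = nn.Conv3d(channels, channels, kernel, padding=pad)

    def forward(self, x):
        # x: (batch, channels, T, H, W); attend across the T time steps.
        b, c, t, h, w = x.shape
        flat = lambda y: y.reshape(b, c, t, -1).permute(0, 2, 1, 3).reshape(b, t, -1)
        q, k, v = flat(self.q(x)), flat(self.k(x)), flat(self.v(x))
        attn = F.softmax(q @ k.transpose(1, 2) / (q.shape[-1] ** 0.5), dim=-1)
        out = attn @ v                                    # (b, t, c*h*w)
        return out.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
```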
VQ-VAE. The vector quantised variational autoencoder (VQ-
VAE), also known as a neural discrete representation learning
method, is a highly effective autoencoder (AE) model [16]. It
consists of an encoder and a decoder. In the image generation
task, the image is input to the encoder, which outputs a latent
embedding vector $z$. VQ-VAE also maintains a dictionary
(codebook) of $K$ embedding vectors, each of dimension $D$.
The nearest vector $z'$ is obtained by a nearest-neighbor search
over the dictionary for $z$, and $z'$ is used as the final encoding
result in place of $z$. VQ-VAE then uses a CNN with residual
units to reconstruct the data from the encoding result $z'$.
When training VQ-VAE, since the discrete encoding step has
no gradient, VQ-VAE uses the straight-through estimator to
approximate gradients through the quantization step. The
model is highly capable of generating images, video, and speech.
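The quantization step at the heart of VQ-VAE can be sketched as follows (assuming PyTorch; the codebook size and dimension here are illustrative, not the settings used in [16]):

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Sketch of VQ-VAE quantization: nearest-neighbor lookup in a
    codebook of K vectors of dimension D, with the straight-through
    estimator copying gradients from z' back to z."""
    def __init__(self, num_codes: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1 / num_codes, 1 / num_codes)

    def forward(self, z):
        # z: (batch, dim, H, W) continuous encoder output.
        b, d, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, d)        # (B*H*W, D)
        dist = torch.cdist(flat, self.codebook.weight)      # distances to all codes
        idx = dist.argmin(dim=1)                            # nearest code index
        zq = self.codebook(idx).view(b, h, w, d).permute(0, 3, 1, 2)
        # Straight-through estimator: the forward pass uses z', while the
        # backward pass treats quantization as the identity map.
        return z + (zq - z).detach()
```

The `detach` trick is what implements the straight-through behavior described above: gradients from the decoder flow to the encoder as if the non-differentiable lookup were not there.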
III. METHODS
A. Overview
A spatiotemporal data sequence can be represented as a
length-$S$ sequence $\mathcal{X} = \{X_1, X_2, \ldots, X_S\}$,
where each $X_t \in \mathbb{R}^{W \times H \times D}$,
$t = 1, 2, \ldots, S$, i.e., $X_t$ has spatial dimensions
$H \times W$ and depth $D$. The spatiotemporal model uses
$\mathcal{X}$ to predict the length-$T$ sequence
$\tilde{\mathcal{X}} = \{\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_T\}$
that is continuous after $\mathcal{X}$. This means that the
spatiotemporal pre-