Enhancing Spatiotemporal Prediction Model using
Modular Design and Beyond
Haoyu Pan 1, Hao Wu 2, Tan Yang *,1
1School of Computer, Beijing University of Posts and Telecommunications
2Institute of Advanced Technology, University of Science and Technology of China
haoyupan@bupt.edu.cn, easylearninghao@gmail.com, tyang@bupt.edu.cn
Abstract—Predictive learning uses known states to generate future states over a period of time. Predicting spatiotemporal sequences is challenging because they vary in both time and space. The mainstream approach is to model the spatial and temporal structures simultaneously with an RNN-based or transformer-based architecture, and then to generate future data auto-regressively from the learned representations. Learning spatial and temporal features at the same time introduces a large number of parameters, which makes the model difficult to converge. In this paper, a modular design is proposed that decomposes the spatiotemporal sequence model into two modules: a spatial encoder-decoder and a predictor, which extract spatial features and predict future data, respectively. The spatial encoder-decoder maps the data into a latent embedding space and generates data from that latent space, while the predictor forecasts future embeddings from past ones. By applying this design to current models and performing experiments on the KTH-Action and Moving MNIST datasets, we both improve computational performance and obtain state-of-the-art results.
Index Terms—Spatiotemporal predictive learning, Representation learning, Deep learning
I. INTRODUCTION
Space-time dynamical systems are common in the real world, so research on spatiotemporal predictive learning has received extensive attention. It is widely used in video prediction [1], [2], traffic flow prediction [3]–[5], weather forecasting [6], [7], and physical simulation [8]. Unlike time series prediction, the spatiotemporal sequence at each time step can be regarded as an image (or frame) distributed in a higher-dimensional feature space. Spatiotemporal predictive learning requires sufficient modeling of both temporal and spatial structure [9]. From the temporal perspective, the model must capture both short-term and long-term memory features; from the spatial perspective, it must capture both short-range and long-range information.
In current deep neural network research, recurrent neural network (RNN) models are highly effective at processing time-varying sequences, while convolutional neural networks (CNNs) are widely used to capture the spatial features of images. Shi et al. therefore proposed an end-to-end model that combines the CNN and RNN architectures, called ConvLSTM [10], for spatiotemporal predictive learning. On the other hand, with the advent of the transformer architecture, an increasing number of sequence models have turned to transformers, as in natural language processing (NLP) [11], [12]. Transformers parallelize better than RNN architectures, as demonstrated in NLP and time series tasks, and transformer-based spatiotemporal analysis models [5], [13] have begun to appear. However, both the RNN-based and transformer-based architectures have two deficiencies:
Deficiency I. During training, the model must capture temporal dependence and spatial variation simultaneously. The large number of trainable parameters and the slow computation hinder the convergence of the model.
Deficiency II. In terms of spatiotemporal feature encoding and decoding, pure-CNN architectures perform well at image feature extraction but poorly at image generation. It is even harder for a CNN to model temporal changes and spatial features at the same time, which distorts the frames a CNN generates for video sequences. Pixel-level prediction of spatiotemporal sequences therefore remains a challenging task.
We can decouple a spatiotemporal sequence prediction model into three parts. The first captures the spatial features of the current time step, which we call the spatial encoder. The second combines the spatial information captured by the first module and propagates it through time to predict the spatial features of future frames, which we call the predictor. The last decodes the future spatial features generated by the predictor into real future spatiotemporal data, which we call the spatial decoder. Since the spatial encoder and decoder are inverse operations of each other, we can implement them with an encoder-decoder architecture, which we call the spatial encoder-decoder. For example, both ConvLSTM and ConvTransformer use a pure CNN architecture to encode and decode spatial features.
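For concreteness, the sketch below shows this three-part composition in PyTorch. The module names and interfaces (an encoder, predictor, and decoder passed in as components) are illustrative assumptions of ours, not the implementation used in this paper.

    import torch
    import torch.nn as nn

    class ModularPredictor(nn.Module):
        """Sketch: a spatial encoder-decoder wrapped around a temporal predictor."""
        def __init__(self, encoder: nn.Module, predictor: nn.Module, decoder: nn.Module):
            super().__init__()
            self.encoder = encoder      # frame -> latent embedding
            self.predictor = predictor  # past latents -> future latents
            self.decoder = decoder      # latent embedding -> frame

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (B, S, C, H, W), a length-S observed sequence
            B, S = frames.shape[:2]
            latents = torch.stack(
                [self.encoder(frames[:, t]) for t in range(S)], dim=1)
            future_latents = self.predictor(latents)   # (B, T, ...)
            T = future_latents.shape[1]
            return torch.stack(
                [self.decoder(future_latents[:, t]) for t in range(T)], dim=1)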
In computer systems, complex systems are often disassembled into relatively simple sub-modules and then combined; artificial intelligence systems can do the same. In recent studies, including outstanding current models such as PredRNN [14] and the 3D temporal convolutional transformer (TCTN) [15], the parameters for both the spatially dependent and the temporally dependent components are updated during training, which makes the models inefficient: a large number of parameters must be updated at every training step, so convergence is very slow. We propose ideas to address Deficiencies I and II above.
For Deficiency I, we decompose the model into two modules: a spatial encoder-decoder module and a predictive learning module, which we call the predictor. The spatial encoder maps data into a latent space. We first train a static image encoder-decoder; then, for spatiotemporal sequence prediction, we simply feed the spatiotemporal data to this spatial encoder-decoder to obtain a representation tensor for each time step of the sequence, and use this representation tensor in place of the original image. The predictor consumes these representation tensors to generate the representation tensors of the future data, again instead of the original images. Finally, the representation tensors generated by the predictor are fed to the spatial decoder to recover the real spatiotemporal data. For Deficiency II, we use the Vector Quantised Variational AutoEncoder (VQ-VAE) [16] for this task instead of a plain CNN. VQ-VAE is an autoencoder with excellent performance on image generation tasks, and its architecture fits our needs.
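A minimal sketch of this two-stage pipeline follows, under our own assumptions about the training interfaces: the vqvae forward returning a reconstruction plus a codebook loss, and a vqvae.encode method, are illustrative placeholders rather than the paper's actual API.

    import torch
    import torch.nn.functional as F

    # Stage 1 (sketch): pretrain the spatial encoder-decoder on single frames.
    def pretrain_autoencoder(vqvae, frame_loader, optimizer):
        for frames in frame_loader:                  # frames: (B, C, H, W)
            recon, vq_loss = vqvae(frames)           # assumed: reconstruction + codebook loss
            loss = F.mse_loss(recon, frames) + vq_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Stage 2 (sketch): freeze the encoder, train only the predictor in latent space.
    def train_predictor(vqvae, predictor, seq_loader, optimizer):
        vqvae.eval()
        for past, future in seq_loader:              # (B, S, C, H, W), (B, T, C, H, W)
            with torch.no_grad():                    # encoder parameters stay fixed
                z_past = torch.stack([vqvae.encode(past[:, t])
                                      for t in range(past.shape[1])], dim=1)
                z_future = torch.stack([vqvae.encode(future[:, t])
                                        for t in range(future.shape[1])], dim=1)
            loss = F.mse_loss(predictor(z_past), z_future)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()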
We use two representative spatiotemporal sequence prediction models, PredRNN and TCTN, as the predictor; the former predicts with an RNN-like architecture and the latter with a transformer architecture. Modifying these two different architectures demonstrates the generality of our proposed approach. We do not make major changes to the structure of the original models. The main changes are: on the one hand, the input layer of the original model is modified so that images are first processed by a pre-trained VQ-VAE encoder before being fed into the model; on the other hand, the parameters of the original model are reduced (the number of layers stays the same, while the hidden units of each layer are halved), as illustrated below. We find experimentally that our approach improves the models' final results despite the reduction in predictor parameters.
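Purely as an illustration of this capacity reduction (the configuration names and sizes here are hypothetical, not the paper's reported settings):

    # Hypothetical predictor configurations: layer count unchanged, hidden units halved.
    predictor_cfg_original = dict(num_layers=4, hidden_units_per_layer=128)
    predictor_cfg_ours     = dict(num_layers=4, hidden_units_per_layer=64)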
In summary, the contributions of this paper include the following:
• Applying a modular design that decomposes the sequence prediction model into two parts, a spatial encoder-decoder and a predictor, making the training of the model easier and more efficient.
• Utilizing VQ-VAE to encode and generate images, which improves the data generation quality.
II. RELATED WORK
In this section we introduce two different temporal sequence
model architectures and analyze their characteristics. On the
other hand, a VAE model is introduced.
Convolutional RNNs. Models with ConvRNN-like architectures were first proposed by Shi et al. They replace the fully connected layers in FC-LSTM with convolutional operations to better capture the spatiotemporal correlations in spatiotemporal sequence data. However, due to its simple architecture, ConvLSTM can produce ambiguous (blurry) future frames. Many variants of ConvLSTM have been proposed [17]–[19], among which PredRNN, an improved ConvLSTM network, is representative of the RNN-based approaches. PredRNN adds a spatiotemporal memory unit to the ConvLSTM architecture and allows data to be passed along both horizontal and vertical directions, enabling the model to better capture spatiotemporal correlation. PredRNN also employs techniques to improve prediction, such as reverse scheduled sampling (RSS), which gradually replaces the real data with predicted data as input during multi-step iterative prediction. The shortcomings of ConvRNNs are that they are limited by the RNN architecture, yield mediocre predictions when the sequences are long, and are computationally intensive.
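As a reference point, here is a minimal ConvLSTM cell in PyTorch; this is our own condensed rendering of the standard formulation, not the authors' code.

    import torch
    import torch.nn as nn

    class ConvLSTMCell(nn.Module):
        """Minimal ConvLSTM cell: the LSTM gates are computed with
        convolutions instead of the fully connected layers of FC-LSTM."""
        def __init__(self, in_ch, hidden_ch, kernel=3):
            super().__init__()
            self.hidden_ch = hidden_ch
            # One convolution produces all four gates at once.
            self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch,
                                   kernel, padding=kernel // 2)

        def forward(self, x, state):
            h, c = state                                  # each (B, hidden_ch, H, W)
            i, f, o, g = torch.chunk(
                self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
            i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
            c = f * c + i * torch.tanh(g)                 # cell state update
            h = o * torch.tanh(c)                         # hidden state
            return h, c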
Convolutional Transformers. Transformer research has achieved significant results in both natural language processing and computer vision, and transformer-based models have gradually become more numerous in spatiotemporal sequence prediction tasks. A standard transformer stacks transformer blocks with spatial attention and temporal attention for spatiotemporal sequence prediction; however, it shares the drawback of FC-LSTM: limited by its fully connected layers, the model cannot capture short-term dependency well. ConvTransformer [13] applies the transformer structure to spatiotemporal sequential data, using multi-head convolutional attention instead of the traditional one-dimensional attention mechanism to capture spatiotemporal features. ConvTransformer performs better at predicting intermediate content with known context, but performs mediocrely on future information. Inspired by the multi-head convolutional attention in ConvTransformer, Yang et al. proposed TCTN [15], a 3D convolutional transformer-based spatiotemporal sequence prediction model. TCTN uses a transformer-based encoder with temporal convolutional layers to capture both short-term and long-term dependencies, achieving results comparable to PredRNN on some metrics.
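To make the idea of convolutional attention concrete, here is a loose sketch, not the exact ConvTransformer or TCTN mechanism: queries, keys, and values are produced by 3-D convolutions over the sequence, and attention is then computed across time steps.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConvTemporalAttention(nn.Module):
        """Sketch: Q/K/V come from 3-D convolutions over (T, H, W); each
        time step's feature map is then treated as one attention token."""
        def __init__(self, channels, kernel=3):
            super().__init__()
            pad = kernel // 2
            self.to_q = nn.Conv3d(channels, channels, kernel, padding=pad)
            self.to_k = nn.Conv3d(channels, channels, kernel, padding=pad)
            self.to_v = nn.Conv3d(channels, channels, kernel, padding=pad)

        def forward(self, x):                      # x: (B, C, T, H, W)
            B, C, T, H, W = x.shape
            # Flatten each time step into one token of dimension C*H*W.
            q = self.to_q(x).permute(0, 2, 1, 3, 4).reshape(B, T, -1)
            k = self.to_k(x).permute(0, 2, 1, 3, 4).reshape(B, T, -1)
            v = self.to_v(x).permute(0, 2, 1, 3, 4).reshape(B, T, -1)
            attn = F.softmax(q @ k.transpose(1, 2) / (q.shape[-1] ** 0.5), dim=-1)
            out = (attn @ v).reshape(B, T, C, H, W).permute(0, 2, 1, 3, 4)
            return out                             # (B, C, T, H, W)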
VQ-VAE. The vector quantised variational autoencoder (VQ-VAE), also known as a neural discrete representation learning method, is a very effective autoencoder (AE) model [16]. It consists of an encoder and a decoder. In the image generation task, an image is fed to the encoder, which outputs a latent embedding vector z. VQ-VAE also maintains a dictionary (codebook) of embedding vectors of dimension D. A nearest-neighbor search over the dictionary yields the vector z′ closest to z, and z′ is used as the final encoding result in place of z. VQ-VAE uses a CNN with residual units to reconstruct the data from the encoding result z′, fitting the distribution of the encodings. When training VQ-VAE, since the discrete encoding step has no gradient, VQ-VAE uses the straight-through estimator, which copies the gradient from the decoder input to the encoder output. The model is highly capable of constructing images, videos, and speech.
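A minimal sketch of the quantization step with the straight-through gradient trick follows; it illustrates the standard VQ-VAE mechanism, not the paper's code, and the codebook size is an arbitrary example.

    import torch
    import torch.nn as nn

    class VectorQuantizer(nn.Module):
        """Sketch of VQ-VAE's discrete bottleneck: snap each encoder output
        to its nearest codebook vector, with a straight-through gradient."""
        def __init__(self, num_codes=512, dim=64):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)

        def forward(self, z):                          # z: (B, N, dim) encoder outputs
            flat = z.reshape(-1, z.shape[-1])          # (B*N, dim)
            # Nearest-neighbor search over the codebook.
            dists = torch.cdist(flat, self.codebook.weight)   # (B*N, num_codes)
            idx = dists.argmin(dim=-1).reshape(z.shape[:-1])  # (B, N)
            z_q = self.codebook(idx)                   # quantized vectors z'
            # Straight-through estimator: the forward pass uses z_q, while the
            # backward pass routes the gradient through as if z_q were z.
            z_st = z + (z_q - z).detach()
            return z_st, idx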
III. METHODS
A. Overview
A spatiotemporal data sequence can be represented as a length-$S$ sequence $\mathcal{X} = \{X_1, X_2, \ldots, X_S\}$, where each $X_t \in \mathbb{R}^{W \times H \times D}$, $t = 1, 2, \ldots, S$, meaning that $X_t$ has spatial dimensions $H \times W$ and depth $D$. The task of the spatiotemporal model is to use $\mathcal{X}$ to predict the length-$T$ sequence $\tilde{\mathcal{X}} = \{\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_T\}$ that is continuous after $\mathcal{X}$. This means that the spatiotemporal pre-