modules: a spatial encoder-decoder module and a predictive
learning module, which we call the predictor. The spatial
encoder converts data into a latent space. We first train the
encoder-decoder on static images; for spatiotemporal sequence
prediction, we then simply feed the spatiotemporal data into this
spatial encoder-decoder to obtain a representation tensor at each
time step of the sequence, and this representation tensor replaces
the original image as the predictor's input. The predictor then
generates the representation tensors of future data rather than
raw images. Finally, the representation tensors generated by the
predictor are fed into the spatial decoder to recover the real
spatiotemporal data. For Deficiency II, we adopt the Vector
Quantised-Variational AutoEncoder (VQ-VAE) [16] for this task
instead of a plain CNN. VQ-VAE is an autoencoder with excellent
performance in image generation tasks, and its architecture fits
our needs.
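To make this pipeline concrete, a minimal sketch is given below (assuming PyTorch; `LatentPredictionPipeline` and the submodule interfaces are hypothetical placeholders, not the paper's actual code):

```python
import torch
import torch.nn as nn

class LatentPredictionPipeline(nn.Module):
    """Hypothetical sketch: a pre-trained spatial encoder/decoder pair
    wrapped around a predictor that operates on representation tensors."""
    def __init__(self, encoder: nn.Module, predictor: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder      # pre-trained on static images, then frozen
        self.predictor = predictor  # e.g. a PredRNN- or TCTN-style model
        self.decoder = decoder      # maps representation tensors back to pixels

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t = frames.shape[:2]
        # Encode every frame independently into a representation tensor.
        z = self.encoder(frames.flatten(0, 1)).unflatten(0, (b, t))
        # Predict the representation tensors of future frames, not raw pixels.
        z_future = self.predictor(z)
        bf, tf = z_future.shape[:2]
        # Decode the predicted representations into real spatiotemporal data.
        return self.decoder(z_future.flatten(0, 1)).unflatten(0, (bf, tf))
```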
We use two representative spatiotemporal sequence predic-
tion models, PredRNN and TCTN, as predictors. The former
predicts data with an RNN-like architecture, while the latter
uses a transformer architecture. Modifying these two different
architectures demonstrates the generality of our proposed
approach. We do not make major changes to the structure of
the original models. The main changes are: on the one hand,
the input layer of the original model is modified so that images
are first processed by a pre-trained VQ-VAE encoder before
being fed into the model; on the other hand, the parameters of
the original model are reduced (the number of layers remains
the same, while the hidden units of each layer are halved).
We find experimentally that our approach improves the models'
final results despite the reduction in predictor parameters.
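A minimal sketch of these two modifications, assuming PyTorch (`halve_hidden_units` and `freeze` are illustrative helpers, not the paper's code):

```python
import torch.nn as nn

def halve_hidden_units(base_hidden):
    # Keep the layer count, halve the hidden units per layer
    # (e.g. [128, 128, 128, 128] -> [64, 64, 64, 64]).
    return [h // 2 for h in base_hidden]

def freeze(module: nn.Module) -> nn.Module:
    # The VQ-VAE encoder is pre-trained on static images and then frozen,
    # so only the predictor's parameters are updated during training.
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()
```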
In summary, the contributions of this paper include the
following:
• Applying a modular design that decomposes the sequence
prediction model into two parts, a spatial encoder-decoder
and a predictor, making the model easier and more
efficient to train.
• Utilizing VQ-VAE to encode and generate images, which
improves data generation quality.
II. RELATED WORK
In this section we introduce two different temporal sequence
model architectures and analyze their characteristics; we then
introduce a VAE-family model.
Convolutional RNNs. Models of ConvRNN-like architec-
tures were first proposed by Shi et al. They use convolutional
operations instead of the fully connected layer in FC-LSTM
as a way to better capture spatiotemporal correlations in spa-
tiotemporal sequence data. However, due to their simple archi-
tecture, ambiguity can occur in the generated future frames.
Many variant models of ConvLSTM have been proposed [17]–
[19], among which PredRNN is an improved ConvLSTM
network that is typical of RNN-based approaches. PredRNN
adds a spatiotemporal memory unit to the architecture of
ConvLSTM and allows the data to be passed along horizontal
and vertical directions, enabling it to better capture spatio-
temporal correlations. PredRNN also employs several tech-
niques to improve prediction, such as reverse scheduled
sampling (RSS), which gradually replaces the real data with
predicted data as input during the multi-step iterative prediction
process. The shortcomings of ConvRNNs are that they are
constrained by the RNN architecture, produce only mediocre
results on long sequences, and are computationally intensive.
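To illustrate the core ConvRNN idea, the sketch below shows a minimal ConvLSTM-style cell, in which a single convolution computes all four LSTM gates over the concatenated input and hidden state (assuming PyTorch; this is illustrative, not Shi et al.'s exact implementation):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM-style cell: convolutions replace the fully
    connected gate computations of FC-LSTM, so the hidden and cell
    states keep their spatial structure."""
    def __init__(self, in_ch: int, hid_ch: int, kernel: int = 3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel,
                               padding=kernel // 2)

    def forward(self, x, state):
        h, c = state  # hidden and cell states, both (batch, hid_ch, H, W)
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)  # convolutional analogue of the LSTM update
        h = o * torch.tanh(c)
        return h, (h, c)
```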
Convolutional Transformers. Transformer research has
achieved significant results in both natural language processing
and computer vision; as a result, transformer-based models
have gradually become more common in spatiotemporal
sequence prediction tasks. The standard transformer uses
stacked transformer blocks with spatial attention and temporal
attention for spatiotemporal sequence prediction; however, it
shares a drawback with FC-LSTM: limited by its fully
connected layers, the model cannot capture short-term
dependencies well. ConvTransformer [13] applies the trans-
former structure to the domain of spatiotemporal sequence
data, using multi-headed convolutional attention instead of the
traditional one-dimensional attention mechanism to capture
spatiotemporal features. ConvTransformer performs well at
predicting intermediate content with known context, but only
mediocrely for future information. Inspired by the multi-headed
convolutional attention in ConvTransformer, Yang et al.
proposed TCTN [15], a 3D convolutional transformer-based
spatiotemporal sequence prediction model. TCTN uses a
transformer-based encoder with temporal convolutional layers
to capture both short-term and long-term dependencies,
achieving results comparable to PredRNN on some metrics.
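The flavor of such convolutional attention can be sketched as follows (assuming PyTorch; single-head and simplified for brevity, so this illustrates the idea rather than reproducing ConvTransformer's or TCTN's actual layers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAttention3d(nn.Module):
    """Sketch of convolutional attention: queries, keys, and values are
    produced by 3D convolutions over (T, H, W) instead of the usual
    per-token linear projections, then attention is applied across time."""
    def __init__(self, channels: int, kernel: int = 3):
        super().__init__()
        pad = kernel // 2
        self.q = nn.Conv3d(channels, channels, kernel, padding=pad)
        self.k = nn.Conv3d(channels, channels, kernel, padding=pad)
        self.v = nn.Conv3d(channels, channels, kernel, padding=pad)

    def forward(self, x):
        # x: (batch, channels, T, H, W); attend across the T time steps.
        b, c, t, h, w = x.shape
        flat = lambda y: y.reshape(b, c, t, -1).permute(0, 2, 1, 3).reshape(b, t, -1)
        q, k, v = flat(self.q(x)), flat(self.k(x)), flat(self.v(x))
        attn = F.softmax(q @ k.transpose(1, 2) / (q.shape[-1] ** 0.5), dim=-1)
        out = attn @ v                                    # (b, t, c*h*w)
        return out.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
```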
VQ-VAE. The vector quantised variational autoencoder (VQ-
VAE), also known as a neural discrete representation learning
method, is a highly effective autoencoder (AE) model [16]. It
consists of an encoder and a decoder. In the image generation
task, the image is input to the encoder, which outputs a latent
embedding vector $z$. VQ-VAE also maintains a dictionary
(codebook) of $K$ embedding vectors, each of dimension $D$.
The nearest vector $z'$ is obtained by a nearest-neighbor search
over the dictionary for $z$, and $z'$ is used as the final encoding
result in place of $z$. VQ-VAE then uses a CNN with residual
units to reconstruct the data from the encoding result $z'$.
When training VQ-VAE, since the discrete encoding step has
no gradient, VQ-VAE uses the straight-through estimator to
approximate gradients through the quantization step. The
model is highly capable of generating images, video, and speech.
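The quantization step at the heart of VQ-VAE can be sketched as follows (assuming PyTorch; the codebook size and dimension here are illustrative, not the settings used in [16]):

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Sketch of VQ-VAE quantization: nearest-neighbor lookup in a
    codebook of K vectors of dimension D, with the straight-through
    estimator copying gradients from z' back to z."""
    def __init__(self, num_codes: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1 / num_codes, 1 / num_codes)

    def forward(self, z):
        # z: (batch, dim, H, W) continuous encoder output.
        b, d, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, d)        # (B*H*W, D)
        dist = torch.cdist(flat, self.codebook.weight)      # distances to all codes
        idx = dist.argmin(dim=1)                            # nearest code index
        zq = self.codebook(idx).view(b, h, w, d).permute(0, 3, 1, 2)
        # Straight-through estimator: the forward pass uses z', while the
        # backward pass treats quantization as the identity map.
        return z + (zq - z).detach()
```

The `detach` trick is what implements the straight-through behavior described above: gradients from the decoder flow to the encoder as if the non-differentiable lookup were not there.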
III. METHODS
A. Overview
A spatiotemporal data sequence can be represented as a
length-$S$ sequence $\mathcal{X} = \{X_1, X_2, \ldots, X_S\}$,
where each $X_t \in \mathbb{R}^{W \times H \times D}$,
$t = 1, 2, \ldots, S$, i.e., $X_t$ has spatial dimensions
$H \times W$ and depth $D$. The spatiotemporal model uses
$\mathcal{X}$ to predict the length-$T$ sequence
$\tilde{\mathcal{X}} = \{\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_T\}$
that is continuous after $\mathcal{X}$. This means that the
spatiotemporal pre-