
Figure 5. Quantitative comparisons between TECO and baseline methods in long-horizon temporal consistency, showing LPIPS between generated and ground-truth frames for each timestep. Timestep 0 corresponds to the first predicted frame (conditioning frames are not included in the plot). Our method remains more temporally consistent over hundreds of timesteps of prediction than SOTA models.
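The per-timestep curves in Figure 5 can be reproduced with a short evaluation loop. The sketch below is a minimal illustration, assuming the `lpips` PyPI package and hypothetical tensors `generated` and `ground_truth` of shape (batch, time, 3, H, W) scaled to [-1, 1]; it is not the paper's exact evaluation code.

```python
import torch
import lpips  # pip install lpips; perceptual similarity metric (Zhang et al.)

def per_timestep_lpips(generated, ground_truth, net="alex", device="cuda"):
    """Mean LPIPS between generated and ground-truth frames at each timestep."""
    loss_fn = lpips.LPIPS(net=net).to(device).eval()
    B, T = generated.shape[:2]
    scores = []
    with torch.no_grad():
        for t in range(T):  # timestep 0 is the first *predicted* frame
            d = loss_fn(generated[:, t].to(device), ground_truth[:, t].to(device))
            scores.append(d.view(B).mean().item())
    return scores  # list of length T, one mean LPIPS value per timestep
```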
4. Experiments
4.1. Datasets
We introduce three challenging video datasets to better measure long-range consistency in video prediction, centered around 3D environments in DMLab (Beattie et al., 2016), Minecraft (Guss et al., 2019), and Habitat (Savva et al., 2019), with videos of agents randomly traversing scenes of varying difficulty. These datasets require video prediction models to reproduce previously observed parts of scenes while generating new content for unobserved parts. In contrast, many existing video benchmarks lack strong long-range dependencies, so a model with limited context is sufficient. Refer to Appendix M for further details on each dataset.
DMLab-40k DeepMind Lab is a simulator that procedurally generates random 3D mazes with random floor and wall textures. We generate 40k action-conditioned 64×64 videos of 300 frames in which an agent randomly traverses 7×7 mazes by choosing random points in the maze and navigating to them via the shortest path. We train all models for both action-conditioned and unconditional prediction (by periodically masking out actions) to enable both types of generation. We further discuss the use cases of action-conditioned and unconditional models in Section 4.3.
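As a concrete illustration of this joint training scheme, the minimal sketch below randomly drops the action sequence for a fraction of training videos so the same model learns both action-conditioned and unconditional prediction; the batch layout, mask token, and drop rate are our own assumptions rather than details from the paper.

```python
import numpy as np

ACTION_MASK_TOKEN = -1   # assumed sentinel id meaning "no action given"
UNCOND_PROB = 0.5        # assumed fraction of videos trained unconditionally

def mask_actions(actions, rng, uncond_prob=UNCOND_PROB):
    """Randomly replace per-video action sequences with a mask token.

    actions: int array of shape (batch, time) with discrete action ids.
    Returns a copy where a random subset of videos has all actions masked,
    so the model receives both conditional and unconditional training signal.
    """
    actions = actions.copy()
    drop = rng.random(actions.shape[0]) < uncond_prob   # one decision per video
    actions[drop] = ACTION_MASK_TOKEN
    return actions

# Example usage inside a training step:
rng = np.random.default_rng(0)
batch_actions = rng.integers(0, 6, size=(8, 300))       # 8 videos, 300 steps
masked = mask_actions(batch_actions, rng)
```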
Minecraft-200k This popular game features procedurally generated 3D worlds that contain complex terrain such as hills, forests, rivers, and lakes. We collect 200k action-conditioned videos of length 300 and resolution 128×128 in Minecraft's marsh biome. The player alternates between walking forward for a random number of steps and randomly rotating left or right, so parts of the scene go out of view and come back into view later. We train all models with action conditioning for ease of interpretation and evaluation, though it is generally easy for video models to learn these discrete actions unconditionally.
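The data-collection policy described above amounts to a simple scripted controller. The sketch below is a hypothetical reconstruction; the action names, step ranges, and rotation granularity are our assumptions and are not taken from the released dataset.

```python
import random

# Hypothetical discrete action vocabulary for the scripted Minecraft agent.
FORWARD, TURN_LEFT, TURN_RIGHT = "forward", "turn_left", "turn_right"

def scripted_actions(num_steps=300, min_walk=5, max_walk=20, seed=None):
    """Alternate between walking forward for a random number of steps and
    randomly rotating left or right, so scene content leaves and later
    re-enters the field of view."""
    rng = random.Random(seed)
    actions = []
    while len(actions) < num_steps:
        actions += [FORWARD] * rng.randint(min_walk, max_walk)  # walk segment
        actions.append(rng.choice([TURN_LEFT, TURN_RIGHT]))     # random rotation
    return actions[:num_steps]

# e.g. actions = scripted_actions(num_steps=300, seed=0)
```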
Habitat-200k Habitat is a simulator for rendering trajectories through scans of real 3D scenes. We compile ∼1400 indoor scans from HM3D (Ramakrishnan et al., 2021), MatterPort3D (Chang et al., 2017), and Gibson (Xia et al., 2018) to generate 200k action-conditioned videos of 300 frames at a resolution of 128×128 pixels. We use Habitat's built-in path traversal algorithm to construct action trajectories that move our agent between randomly sampled locations. Similar to DMLab, we train all video models to perform both unconditional and action-conditioned prediction.
Kinetics-600 Kinetics-600 (Carreira & Zisserman, 2017) is a highly complex real-world video dataset, originally proposed for action recognition. The dataset contains ∼400k videos of varying length of up to 300 frames. We evaluate our method on video prediction without actions (as they do not exist), generating 80 future frames conditioned on 20. In addition, we filter out videos shorter than 100 frames, leaving 392k videos that are split for training and evaluation. We use a resolution of 128×128 pixels. Although Kinetics-600 does not have many long-range dependencies, we evaluate our method on this dataset to show that it can scale to complex, natural video.
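For clarity, the sketch below spells out the filtering and conditioning protocol stated above: videos shorter than 100 frames are discarded, and each evaluation clip provides 20 conditioning frames followed by 80 prediction targets. Taking the leading 100 frames of each video is our own simplifying assumption.

```python
MIN_LEN, NUM_COND, NUM_PRED = 100, 20, 80

def keep(video_frames):
    """Filter rule: keep only videos with at least 100 frames."""
    return len(video_frames) >= MIN_LEN

def split_eval_clip(video_frames):
    """Assumed split: first 20 frames condition the model, next 80 are targets."""
    clip = video_frames[:NUM_COND + NUM_PRED]
    return clip[:NUM_COND], clip[NUM_COND:]
```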
4.2. Baselines
We compare against SOTA baselines selected from several different families of models: latent-variable-based variational models, autoregressive likelihood models, and diffusion models. In addition, for efficiency, we train all models on VQ codes produced by a pretrained VQ-GAN for each dataset. For our diffusion baseline, we follow Rombach et al. (2022) and use a VAE instead of a VQ-GAN. Note that we do not include any GANs among our baselines since, to the best of our knowledge, there does not exist a GAN that trains in latent space rather than on raw pixels, an important requirement for scaling to long video sequences.
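To make the "train on VQ codes" setup concrete, the sketch below tokenizes each frame with a pretrained VQ-GAN before any sequence model sees it. Here `vqgan.encode` is a hypothetical interface standing in for whichever pretrained encoder is used; it is not the actual API from the paper or a specific library.

```python
import torch

@torch.no_grad()
def frames_to_codes(vqgan, frames):
    """Encode raw frames into discrete VQ codes so sequence models operate
    in latent space rather than on pixels.

    vqgan:  pretrained model with a hypothetical encode() returning code indices.
    frames: float tensor of shape (batch, time, 3, H, W) in [-1, 1].
    Returns an int tensor of shape (batch, time, h, w) of codebook indices.
    """
    B, T = frames.shape[:2]
    flat = frames.flatten(0, 1)        # (B*T, 3, H, W)
    codes = vqgan.encode(flat)         # assumed output: (B*T, h, w) int indices
    return codes.view(B, T, *codes.shape[1:])

# Downstream sequence models are trained on these code grids instead of raw
# pixels; the diffusion baseline analogously uses continuous VAE latents.
```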
Space-time Transformers We compare TECO to several variants of space-time transformers as depicted in Figure 3: VideoGPT (Yan et al., 2021) (autoregressive over space-time), Phenaki (Villegas et al., 2022) (MaskGit with full space-time attention), MaskViT (Gupta et al., 2022) (MaskGit with axial space-time attention), and Hourglass transformers (Nawrot et al., 2021) (hierarchical autoregressive over space-time). Note that we do not include text conditioning for Phenaki as it is irrelevant in our case. We only evaluate these models on DMLab, as Tables 1 and 2 show that Perceiver-AR (a space-time transformer with improvements specifically for learning long dependencies) is a stronger baseline.