Temporally Consistent Transformers for Video Generation

Wilson Yan 1, Danijar Hafner 2 3, Stephen James 1 4, Pieter Abbeel 1
Abstract
To generate accurate videos, algorithms have to understand the spatial and temporal dependencies in the world. Current algorithms enable accurate predictions over short horizons but tend to suffer from temporal inconsistencies. When generated content goes out of view and is later revisited, the model invents different content instead. Despite this severe limitation, no established benchmarks on complex data exist for rigorously evaluating video generation with long temporal dependencies. In this paper, we curate 3 challenging video datasets with long-range dependencies by rendering walks through 3D scenes of procedural mazes, Minecraft worlds, and indoor scans. We perform a comprehensive evaluation of current models and observe their limitations in temporal consistency. Moreover, we introduce the Temporally Consistent Transformer (TECO), a generative model that substantially improves long-term consistency while also reducing sampling time. By compressing its input sequence into fewer embeddings, applying a temporal transformer, and expanding back using a spatial MaskGit, TECO outperforms existing models across many metrics. Videos are available on the website: https://wilson1yan.github.io/teco
1. Introduction
Recent work in video generation has seen tremendous progress (Ho et al., 2022; Clark et al., 2019; Yan et al., 2021; Le Moing et al., 2021; Ge et al., 2022; Tian et al., 2021; Luc et al., 2020) in producing high-fidelity and diverse samples on complex video data, which can largely be attributed to a combination of increased computational resources and more compute-efficient, high-capacity neural architectures.
1 UC Berkeley  2 University of Toronto  3 DeepMind  4 Dyson Robotics Lab. Correspondence to: Wilson Yan <wilson1.yan@berkeley.edu>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

arXiv:2210.02396v2 [cs.CV] 31 May 2023
Figure 1. TECO generates temporally consistent videos of high fidelity (low LPIPS) over hundreds of frames while offering orders of magnitude faster sampling speed compared to previous video generation models.
However, much of this progress has focused on generating short videos, where models perform well by basing their predictions on only a handful of previous frames.

Video prediction models with short context windows can generate long videos in a sliding-window fashion. While the resulting videos can look impressive at first sight, they lack temporal consistency. We would like models to predict temporally consistent videos, where the same content is generated if a camera pans back to a previously observed location. On the other hand, the model should imagine a new part of the scene for locations that have not yet been observed, and future predictions should remain consistent with this newly imagined part of the scene.
Prior work has investigated techniques for modeling long-term dependencies, such as temporal hierarchies (Saxena et al., 2021) and strided sampling with frame-wise interpolation (Ge et al., 2022; Hong et al., 2022). Other methods train on sparse sets of frames selected from long videos (Harvey et al., 2022; Skorokhodov et al., 2021; Clark et al., 2019; Saito & Saito, 2018; Yu et al., 2022), or model videos via compressed representations (Yan et al., 2021; Rakhimov et al., 2020; Le Moing et al., 2021; Seo et al., 2022; Gupta et al., 2022; Walker et al., 2021). Refer to Appendix L for a more detailed discussion of related work.
Despite this progress, many methods still have difficulty
scaling to datasets with many long-range dependencies.
While Clockwork-VAE (Saxena et al., 2021) trains on long sequences, it is limited by training time (due to recurrence) and is difficult to scale to complex data. On the other hand, transformer-based methods over latent spaces (Yan et al., 2021) scale poorly to long videos due to the quadratic complexity of attention, with long videos containing tens of thousands of tokens. Methods that train on subsets of tokens are limited by truncated backpropagation through time (Hutchins et al., 2022; Rae et al., 2019; Dai et al., 2019) or naive temporal operations (Hawthorne et al., 2022).

Figure 2. TECO generates sharp and consistent video predictions for hundreds of frames on challenging datasets. The figure shows evenly spaced frames of the 264-frame predictions, after conditioning on 36 context frames. From top to bottom, the datasets are DMLab, Minecraft, Habitat, and Kinetics-600.
In addition, there generally do not exist benchmarks for properly evaluating temporal consistency in video generation methods: prior works either focus on generating long videos for which short-term dependencies are sufficient for accurate prediction (Ge et al., 2022; Skorokhodov et al., 2021), and/or rely on metrics such as FVD (Unterthiner et al., 2019) that are more sensitive to image fidelity than to long-range temporal dependencies.
In this paper, we introduce a set of novel long-horizon video generation benchmarks, as well as corresponding evaluation metrics to better capture temporal consistency. In addition, we propose the Temporally Consistent Video Transformer (TECO), a vector-quantized latent dynamics model that effectively models long-term dependencies in a compact representation space using efficient transformers. The key contributions are summarized as follows:
• To better evaluate temporal consistency in video predictions, we propose 3 video datasets with long-range dependencies, along with corresponding metrics, generated from 3D scenes in DMLab (Beattie et al., 2016), Minecraft (Guss et al., 2019), and Habitat (Szot et al., 2021; Savva et al., 2019).

• We benchmark SOTA video generation models on these datasets and analyze the capability of each to learn long-horizon dependencies.

• We introduce TECO, an efficient and scalable video generation model that learns compressed representations to allow for efficient training and generation. We show that TECO performs strongly on a variety of difficult video prediction tasks, and is able to leverage long-term temporal context to generate consistent, high-quality videos while maintaining fast sampling speed.
2. Preliminaries
2.1. VQ-GAN
VQ-GAN (Esser et al., 2021; Van Den Oord et al., 2017) is an autoencoder that learns to compress data into discrete latents, consisting of an encoder $E$, decoder $G$, codebook $C$, and discriminator $D$. Given an image $x \in \mathbb{R}^{H \times W \times 3}$, the encoder $E$ maps $x$ to its latent representation $h \in \mathbb{R}^{H \times W \times D}$, which is quantized by nearest-neighbors lookup in a codebook of embeddings $C = \{e_i\}_{i=1}^{K}$ to produce $z \in \mathbb{R}^{H \times W \times D}$. $z$ is fed through the decoder $G$ to reconstruct $x$. A straight-through estimator (Bengio, 2013) is used to maintain gradient flow through the quantization step. The codebook optimizes the following loss:

$$\mathcal{L}_{\mathrm{VQ}} = \lVert \mathrm{sg}(h) - e \rVert_2^2 + \beta \lVert h - \mathrm{sg}(e) \rVert_2^2 \tag{1}$$

where $\beta = 0.25$ is a hyperparameter, and $e$ is the nearest-neighbors embedding from $C$. For reconstruction, VQ-GAN replaces the original $\ell_2$ loss with a perceptual loss (Zhang et al., 2018), $\mathcal{L}_{\mathrm{LPIPS}}$. Finally, in order to encourage higher-fidelity samples, a patch-level discriminator $D$ is trained to classify between real and reconstructed images, with:

$$\mathcal{L}_{\mathrm{GAN}} = \log D(x) + \log(1 - D(\hat{x})) \tag{2}$$

Overall, VQ-GAN optimizes the following loss:

$$\min_{E,G,C} \max_{D} \; \mathcal{L}_{\mathrm{LPIPS}} + \mathcal{L}_{\mathrm{VQ}} + \lambda \mathcal{L}_{\mathrm{GAN}} \tag{3}$$

where $\lambda = \lVert \nabla_{G_L} \mathcal{L}_{\mathrm{LPIPS}} \rVert_2 \, / \, (\lVert \nabla_{G_L} \mathcal{L}_{\mathrm{GAN}} \rVert_2 + \delta)$ is an adaptive weight, $G_L$ is the last decoder layer, $\delta = 10^{-6}$, and $\mathcal{L}_{\mathrm{LPIPS}}$ is the perceptual distance metric of Zhang et al. (2018).
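For concreteness, the following is a minimal PyTorch-style sketch (ours, not the authors' implementation) of the quantization step: nearest-neighbor codebook lookup, the codebook loss of Equation (1), and the straight-through estimator. The class and variable names, codebook size, and dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VectorQuantizer(nn.Module):
        """Nearest-neighbor codebook lookup with a straight-through estimator."""

        def __init__(self, num_codes: int = 1024, dim: int = 256, beta: float = 0.25):
            super().__init__()
            self.beta = beta
            self.codebook = nn.Embedding(num_codes, dim)
            nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

        def forward(self, h):
            # h: (..., D) continuous encoder outputs, e.g. (B, H, W, D).
            flat = h.reshape(-1, h.shape[-1])
            # Squared distance from every position to every codebook entry.
            dists = (flat.pow(2).sum(1, keepdim=True)
                     - 2 * flat @ self.codebook.weight.t()
                     + self.codebook.weight.pow(2).sum(1))
            idx = dists.argmin(dim=1)                      # nearest code per position
            e = self.codebook(idx).view_as(h)              # quantized embeddings
            # Codebook loss, Eq. (1), up to a constant scale (mse_loss averages
            # over elements rather than summing the squared norm).
            loss = F.mse_loss(e, h.detach()) + self.beta * F.mse_loss(h, e.detach())
            # Straight-through estimator: gradients flow from z back to h.
            z = h + (e - h).detach()
            return z, idx.view(h.shape[:-1]), loss

The perceptual and adversarial terms of Equations (2) and (3) would then be trained on top of this quantized autoencoder.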
2.2. MaskGit
MaskGit (Chang et al., 2022) models distributions over discrete tokens, such as those produced by a VQ-GAN. It generates images with sample quality competitive with autoregressive models at a fraction of the sampling cost by using a masked token prediction objective during training. Formally, we denote by $z \in \mathbb{Z}^{H \times W}$ the discrete latent tokens representing an image. For each training step, we uniformly sample $t \in [0, 1)$ and randomly generate a mask $m \in \{0, 1\}^{H \times W}$ with $N = \gamma H W$ masked values, where $\gamma = \cos\left(\tfrac{\pi}{2} t\right)$. Then, MaskGit learns to predict the masked tokens with the following objective:

$$\mathcal{L}_{\mathrm{mask}} = -\mathbb{E}_{z \in \mathcal{D}}\left[\log p(z \mid z \odot m)\right] \tag{4}$$

During inference, because MaskGit has been trained to model any set of unconditional and conditional probabilities, we can sample any subset of tokens per sampling iteration. Chang et al. (2022) introduce a confidence-based sampling mechanism, whereas other work (Lee et al., 2022) proposes an iterative sample-and-revise approach.
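As a concrete illustration of the cosine schedule and the objective in Equation (4), below is a short sketch of ours in PyTorch; `token_model` stands in for any network that outputs logits over the codebook at every spatial position, and the special `MASK_ID` token is an assumption.

    import math
    import torch
    import torch.nn.functional as F

    MASK_ID = 0  # assumed id of a special [MASK] token in the vocabulary

    def maskgit_training_loss(token_model, z):
        """z: (B, H, W) integer VQ codes; returns the masked-token cross-entropy."""
        B, H, W = z.shape
        t = torch.rand(B, device=z.device)                 # t ~ U[0, 1)
        gamma = torch.cos(0.5 * math.pi * t)               # fraction of tokens to mask
        # Mask the gamma*H*W positions with the lowest random scores per sample.
        scores = torch.rand(B, H * W, device=z.device)
        num_masked = (gamma * H * W).long().clamp(min=1)
        cutoff = torch.gather(scores.sort(dim=1).values, 1, (num_masked - 1).unsqueeze(1))
        mask = (scores <= cutoff).view(B, H, W)            # True = masked
        z_masked = torch.where(mask, torch.full_like(z, MASK_ID), z)
        logits = token_model(z_masked)                     # (B, H, W, vocab_size)
        # Cross-entropy only on the masked positions, as in Eq. (4).
        return F.cross_entropy(logits[mask], z[mask])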
3. TECO
We present the Temporally Consistent Video Transformer (TECO), a video generation model that scales more efficiently to training on longer-horizon videos.
3.1. Architectural Overview
Figure 3. The architectural design of TECO. (a) Prior work on video generation over VQ codes adopts a single spatio-temporal transformer over all codes; for a 300-frame 128 × 128 video compressed to roughly 19K tokens, this amounts to (19K)² ≈ 386M attention links, which is prohibitive for long sequences due to the quadratic complexity of attention. (b) We propose a novel and efficient architecture that aggressively downsamples in space before feeding a temporal transformer (300 × 2² = 1.2K tokens, i.e. (1.2K)² ≈ 1.44M attention links), and then expands back out with a spatial MaskGit applied separately per frame (300 · 64² ≈ 1.22M links). On training sequences of 300 frames, TECO is therefore orders of magnitude more efficient than existing models, allowing the use of larger models for a given compute budget.

Our proposed framework is shown in Figure 3, where $x_{1:T}$ denotes a sequence of video frames. Our primary innovation is a more efficient architecture that can scale to long sequences. Prior SOTA methods (Yan et al., 2021; Ge et al., 2022; Villegas et al., 2022) over VQ codes all train a single spatio-temporal transformer to model every code; however, this becomes prohibitively expensive for sequences containing tens of thousands of tokens. On the other hand, these models have been shown to learn highly multi-modal distributions and to scale well to complex video. As such, we design the TECO architecture to retain these high-capacity scaling properties while ensuring orders of magnitude more efficient training and inference. In the following sections, we motivate each component of our model, with several specific design choices to ensure efficiency and scalability. TECO consists of four components:

$$\begin{aligned} \text{Encoder:} \quad & z_t = E(x_t, x_{t-1}) \\ \text{Temporal Transformer:} \quad & h_t = H(z_{1:t}) \\ \text{Spatial MaskGit:} \quad & p(z_t \mid h_{t-1}) \\ \text{Decoder:} \quad & p(x_t \mid z_t, h_{t-1}) \end{aligned} \tag{5}$$
Encoder  We can achieve compressed representations by leveraging spatio-temporal redundancy in video data. To do this, we learn a CNN encoder $z_t = E(x_t, x_{t-1})$, which encodes the current frame $x_t$ conditioned on the previous frame by channel-wise concatenating $x_{t-1}$, and then quantizes the output using codebook $C$ to produce $z_t$. We apply the VQ loss defined in Equation (1) per timestep. In addition, we $\ell_2$-normalize the codebook and the embeddings to encourage higher codebook usage (Yu et al., 2021). The first frame is concatenated with zeros, and $z_1$ is not quantized, to prevent information loss.
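The frame-conditioned encoding step can be sketched as follows (ours, in PyTorch; the layer sizes are illustrative assumptions, and the special handling of the unquantized first frame is omitted for brevity). It reuses the `VectorQuantizer` sketched in Section 2.1.

    import torch
    import torch.nn as nn

    class FrameConditionedEncoder(nn.Module):
        """z_t = E(x_t, x_{t-1}): encode each frame conditioned on its predecessor."""

        def __init__(self, quantizer, in_channels: int = 3, latent_dim: int = 256):
            super().__init__()
            self.quantizer = quantizer          # e.g. the VectorQuantizer sketch above
            self.net = nn.Sequential(           # illustrative downsampling CNN
                nn.Conv2d(2 * in_channels, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(128, latent_dim, 4, stride=2, padding=1),
            )

        def forward(self, x):
            # x: (B, T, C, H, W). The first frame is paired with zeros, as in the text.
            prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
            pairs = torch.cat([x, prev], dim=2)             # channel-wise concatenation
            B, T = pairs.shape[:2]
            h = self.net(pairs.flatten(0, 1))               # (B*T, D, H', W')
            z, idx, vq_loss = self.quantizer(h.permute(0, 2, 3, 1))
            z = z.permute(0, 3, 1, 2).contiguous()          # back to (B*T, D, H', W')
            return z.unflatten(0, (B, T)), idx.unflatten(0, (B, T)), vq_loss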
Temporal Transformer  Compressed, discrete latents are more lossy and tend to require higher spatial resolutions than continuous latents. Therefore, before modeling temporal information, we apply a single strided convolution to downsample each discrete latent $z_t$, where visually simpler datasets allow for more downsampling and visually complex datasets require less. Afterwards, we learn a large transformer to model temporal dependencies, and then apply a transposed convolution to upsample the representation back to the original resolution of $z_t$. In summary, we use the following architecture:

$$h_t = H(z_{<t}) = \mathrm{ConvT}(\mathrm{Transformer}(\mathrm{Conv}(z_{<t}))) \tag{6}$$
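A minimal sketch of Equation (6) follows (ours; the layer sizes, downsampling factor, and exact causal-masking convention are assumptions rather than the paper's configuration): codes are spatially downsampled with a strided convolution, a causal transformer mixes information across time, and a transposed convolution restores the original resolution.

    import torch
    import torch.nn as nn

    class TemporalTransformer(nn.Module):
        """h = ConvT(Transformer(Conv(z))) with attention that is causal across frames."""

        def __init__(self, dim: int = 256, down: int = 4, n_layers: int = 8, n_heads: int = 8):
            super().__init__()
            self.down = nn.Conv2d(dim, dim, kernel_size=down, stride=down)
            layer = nn.TransformerEncoderLayer(dim, n_heads, 4 * dim, batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, n_layers)
            self.up = nn.ConvTranspose2d(dim, dim, kernel_size=down, stride=down)

        def forward(self, z):
            # z: (B, T, D, H, W) quantized code embeddings for each frame.
            B, T, D, H, W = z.shape
            u = self.down(z.flatten(0, 1))                  # (B*T, D, H/k, W/k)
            h2, w2 = u.shape[-2:]
            tokens = u.unflatten(0, (B, T)).permute(0, 1, 3, 4, 2).reshape(B, T * h2 * w2, D)
            # Boolean mask: True entries are blocked, so tokens of frame t only
            # attend to tokens of frames <= t.
            frame = torch.arange(T, device=z.device).repeat_interleave(h2 * w2)
            blocked = frame[None, :] > frame[:, None]
            out = self.transformer(tokens, mask=blocked)
            u = out.reshape(B, T, h2, w2, D).permute(0, 1, 4, 2, 3).reshape(B * T, D, h2, w2)
            return self.up(u).unflatten(0, (B, T))          # (B, T, D, H, W)

Whether frame t may attend to its own codes (as in this sketch) or only to strictly earlier frames is a detail the sketch does not pin down.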
Decoder  The decoder is an upsampling CNN that reconstructs $\hat{x}_t = D(z_t, h_t)$, where $z_t$ can be interpreted as the posterior of timestep $t$, and $h_t$ is the output of the temporal transformer, which summarizes information from previous timesteps. $z_t$ and $h_t$ are concatenated channel-wise and fed into the decoder. Together with the encoder, the decoder optimizes the following cross-entropy reconstruction loss:

$$\mathcal{L}_{\mathrm{recon}} = -\frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid z_t, h_t) \tag{7}$$

Because the temporal transformer output $h_t$ aggregates information over time, this objective encourages the $z_t$ features to encode the relative information between frames, yielding more compressed codes for efficient modeling over longer sequences.
Spatial MaskGit  Lastly, we use a MaskGit (Chang et al., 2022) to model the prior $p(z_t \mid h_t)$. We show that a MaskGit prior allows not only faster but also higher-quality sampling compared to an autoregressive prior. At every training iteration, we follow prior work and sample a random mask $m_t$, optimizing

$$\mathcal{L}_{\mathrm{prior}} = -\frac{1}{T} \sum_{t=1}^{T} \log p(z_t \mid z_t \odot m_t) \tag{8}$$

where $h_t$ is concatenated channel-wise with the masked $z_t$ to predict the masked tokens. During generation, we follow Lee et al. (2022): we initially generate each frame in chunks of 8 at a time, and then perform 2 revise rounds, re-generating half the tokens each time.

Training Objective  The final objective is the following:

$$\mathcal{L}_{\mathrm{TECO}} = \mathcal{L}_{\mathrm{VQ}} + \mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{prior}} \tag{9}$$
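Schematically, one training step combines the three terms of Equation (9). The sketch below is ours; the component interfaces follow the sketches above and are assumptions, not the authors' exact code.

    def teco_training_step(encoder, temporal, maskgit_prior, decoder, video):
        """One TECO training step over a batch of videos, following Eq. (9)."""
        # 1. Encode frames into quantized codes z_t; includes the VQ loss of Eq. (1).
        z, codes, vq_loss = encoder(video)          # z: (B, T, D, H', W')
        # 2. The temporal transformer summarizes the codes into h_t.
        h = temporal(z)                             # (B, T, D, H', W')
        # 3. The MaskGit prior predicts randomly masked codes given h, Eq. (8).
        prior_loss = maskgit_prior(codes, h)
        # 4. The decoder reconstructs frames from (z_t, h_t), Eq. (7).
        recon_loss = decoder(z, h, video)
        return vq_loss + recon_loss + prior_loss    # L_TECO, Eq. (9)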
3.2. DropLoss
We propose DropLoss, a simple trick that allows for more scalable and efficient training (Figure 4). Due to its architectural design, TECO can be separated into two components: (1) learning temporal representations, consisting of the encoder and the temporal transformer, and (2) predicting future frames, consisting of the dynamics prior and the decoder. We can increase training efficiency by dropping out random timesteps that are not decoded and are thus omitted from the reconstruction loss. For example, given a video of T frames, we compute $h_t$ for all $t \in \{1, \ldots, T\}$, and then compute the losses $\mathcal{L}_{\mathrm{prior}}$ and $\mathcal{L}_{\mathrm{recon}}$ for only 10% of the indices. Because random indices are selected at each iteration, the model still needs to learn to accurately predict all timesteps. This reduces training costs significantly because the decoder and dynamics prior require non-trivial computation. DropLoss is applicable both to a wide class of architectures and to tasks beyond video prediction.
Figure 4. DropLoss improves training scalability on longer sequences by computing the loss only on a random subset of time indices at each training iteration. For TECO, the decoder and MaskGit do not need to be computed for dropped-out timesteps.
Figure 5. Quantitative comparisons between TECO and baseline methods in long-horizon temporal consistency, showing LPIPS between generated and ground-truth frames at each timestep. Timestep 0 corresponds to the first predicted frame (conditioning frames are not included in the plot). Our method remains more temporally consistent over hundreds of timesteps of prediction than SOTA models.
4. Experiments
4.1. Datasets
We introduce three challenging video datasets to better measure long-range consistency in video prediction, centered around 3D environments in DMLab (Beattie et al., 2016), Minecraft (Guss et al., 2019), and Habitat (Savva et al., 2019), with videos of agents randomly traversing scenes of varying difficulty. These datasets require video prediction models to reproduce observed parts of scenes and to newly generate unobserved parts. In contrast, many existing video benchmarks do not have strong long-range dependencies, so a model with limited context is sufficient. Refer to Appendix M for further details on each dataset.
DMLab-40k  DeepMind Lab is a simulator that procedurally generates random 3D mazes with random floor and wall textures. We generate 40k action-conditioned 64 × 64 videos of 300 frames each, in which an agent randomly traverses 7 × 7 mazes by choosing random points in the maze and navigating to them via the shortest path. We train all models for both action-conditioned and unconditional prediction (by periodically masking out actions) to enable both types of generation. We further discuss the use cases of action-conditioned and unconditional models in Section 4.3.
Minecraft-200k  This popular game features procedurally generated 3D worlds that contain complex terrain such as hills, forests, rivers, and lakes. We collect 200k action-conditioned videos of length 300 and resolution 128 × 128 in Minecraft's marsh biome. The player alternates between walking forward for a random number of steps and randomly rotating left or right, so parts of the scene go out of view and come back into view later. We train all models action-conditioned for ease of interpretation and evaluation, though it is generally easy for video models to learn these discrete actions unconditionally.
Habitat-200k  Habitat is a simulator for rendering trajectories through scans of real 3D scenes. We compile 1400 indoor scans from HM3D (Ramakrishnan et al., 2021), MatterPort3D (Chang et al., 2017), and Gibson (Xia et al., 2018) to generate 200k action-conditioned videos of 300 frames at a resolution of 128 × 128 pixels. We use Habitat's in-built path-traversal algorithm to construct action trajectories that move the agent between randomly sampled locations. Similar to DMLab, we train all video models to perform both unconditional and action-conditioned prediction.
Kinetics-600  Kinetics-600 (Carreira & Zisserman, 2017) is a highly complex real-world video dataset, originally proposed for action recognition. The dataset contains 400k videos of varying lengths of up to 300 frames. We evaluate our method on video prediction without actions (as none exist), generating 80 future frames conditioned on 20. In addition, we filter out videos shorter than 100 frames, leaving 392k videos that are split for training and evaluation. We use a resolution of 128 × 128 pixels. Although Kinetics-600 does not have many long-range dependencies, we evaluate our method on this dataset to show that it can scale to complex, natural video.
4.2. Baselines
We compare against SOTA baselines selected from several different families of models: latent-variable-based variational models, autoregressive likelihood models, and diffusion models. In addition, for efficiency, we train all models on VQ codes using a pretrained VQ-GAN for each dataset. For our diffusion baseline, we follow Rombach et al. (2022) and use a VAE instead of a VQ-GAN. Note that we do not include any GAN baselines since, to the best of our knowledge, there is no GAN that trains on a latent space rather than raw pixels, which is important for properly scaling to long video sequences.
Space-time Transformers  We compare TECO to several variants of space-time transformers as depicted in Figure 3: VideoGPT (Yan et al., 2021) (autoregressive over space-time), Phenaki (Villegas et al., 2022) (MaskGit over space-time with full attention), MaskViT (Gupta et al., 2022) (MaskGit over space-time with axial attention), and Hourglass transformers (Nawrot et al., 2021) (hierarchical autoregressive over space-time). Note that we do not include text-conditioning for Phenaki as it is irrelevant in our case. We only evaluate these models on DMLab, as Table 2 and Table 1 show that Perceiver-AR (a space-time transformer with improvements specifically for learning long dependencies) is a stronger baseline.