Temporally Consistent Transformers for Video Generation

Wilson Yan 1, Danijar Hafner 2 3, Stephen James 1 4, Pieter Abbeel 1
Abstract
To generate accurate videos, algorithms have to understand the spatial and temporal dependencies in the world. Current algorithms enable accurate predictions over short horizons but tend to suffer from temporal inconsistencies. When generated content goes out of view and is later revisited, the model invents different content instead. Despite this severe limitation, no established benchmarks on complex data exist for rigorously evaluating video generation with long temporal dependencies. In this paper, we curate 3 challenging video datasets with long-range dependencies by rendering walks through 3D scenes of procedural mazes, Minecraft worlds, and indoor scans. We perform a comprehensive evaluation of current models and observe their limitations in temporal consistency. Moreover, we introduce the Temporally Consistent Transformer (TECO), a generative model that substantially improves long-term consistency while also reducing sampling time. By compressing its input sequence into fewer embeddings, applying a temporal transformer, and expanding back using a spatial MaskGit, TECO outperforms existing models across many metrics. Videos are available on the website: https://wilson1yan.github.io/teco
1. Introduction
Recent work in video generation has seen tremendous progress (Ho et al., 2022; Clark et al., 2019; Yan et al., 2021; Le Moing et al., 2021; Ge et al., 2022; Tian et al., 2021; Luc et al., 2020) in producing high-fidelity and diverse samples on complex video data, which can largely be attributed to a combination of increased computational resources and more compute-efficient, high-capacity neural architectures.
1 UC Berkeley  2 University of Toronto  3 DeepMind  4 Dyson Robotics Lab. Correspondence to: Wilson Yan <wilson1.yan@berkeley.edu>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

arXiv:2210.02396v2 [cs.CV] 31 May 2023
Figure 1. TECO generates temporally consistent videos of high fidelity (low LPIPS) over hundreds of frames while offering orders of magnitude faster sampling speed compared to previous video generation models.
However, much of this progress has focused on generating short videos, where models perform well by basing their predictions on only a handful of previous frames.

Video prediction models with short context windows can generate long videos in a sliding-window fashion. While the resulting videos can look impressive at first sight, they lack temporal consistency. We would like models to predict temporally consistent videos, where the same content is generated if a camera pans back to a previously observed location. On the other hand, the model should imagine a new part of the scene for locations that have not yet been observed, and future predictions should remain consistent with this newly imagined part of the scene.
Prior work has investigated techniques for modeling long-term dependencies, such as temporal hierarchies (Saxena et al., 2021) and strided sampling with frame-wise interpolation (Ge et al., 2022; Hong et al., 2022). Other methods train on sparse sets of frames selected from long videos (Harvey et al., 2022; Skorokhodov et al., 2021; Clark et al., 2019; Saito & Saito, 2018; Yu et al., 2022), or model videos via compressed representations (Yan et al., 2021; Rakhimov et al., 2020; Le Moing et al., 2021; Seo et al., 2022; Gupta et al., 2022; Walker et al., 2021). Refer to Appendix L for a more detailed discussion of related work.
Despite this progress, many methods still have difficulty
scaling to datasets with many long-range dependencies.
While Clockwork-VAE (Saxena et al., 2021) trains on long sequences, it is limited by training time (due to recurrence) and is difficult to scale to complex data. On the other hand, transformer-based methods over latent spaces (Yan et al., 2021) scale poorly to long videos due to the quadratic complexity of attention, with long videos containing tens of thousands of tokens. Methods that train on subsets of tokens are limited by truncated backpropagation through time (Hutchins et al., 2022; Rae et al., 2019; Dai et al., 2019) or naive temporal operations (Hawthorne et al., 2022).

Figure 2. TECO generates sharp and consistent video predictions for hundreds of frames on challenging datasets. The figure shows evenly spaced frames of the 264-frame predictions, after conditioning on 36 context frames. From top to bottom, the datasets are DMLab, Minecraft, Habitat, and Kinetics-600.
In addition, there generally do not exist benchmarks for properly evaluating temporal consistency in video generation methods: prior works either focus on generating long videos for which short-term dependencies are sufficient for accurate prediction (Ge et al., 2022; Skorokhodov et al., 2021), and/or rely on metrics such as FVD (Unterthiner et al., 2019) that are more sensitive to image fidelity than to long-range temporal dependencies.
In this paper, we introduce a set of novel long-horizon video generation benchmarks, as well as corresponding evaluation metrics to better capture temporal consistency. In addition, we propose the Temporally Consistent Video Transformer (TECO), a vector-quantized latent dynamics model that effectively models long-term dependencies in a compact representation space using efficient transformers. The key contributions are summarized as follows:
• To better evaluate temporal consistency in video predictions, we propose 3 video datasets with long-range dependencies, along with corresponding metrics, generated from 3D scenes in DMLab (Beattie et al., 2016), Minecraft (Guss et al., 2019), and Habitat (Szot et al., 2021; Savva et al., 2019).

• We benchmark SOTA video generation models on these datasets and analyze the capability of each to learn long-horizon dependencies.

• We introduce TECO, an efficient and scalable video generation model that learns compressed representations to allow for efficient training and generation. We show that TECO performs strongly on a variety of difficult video prediction tasks, and is able to leverage long-term temporal context to generate consistent, high-quality videos while maintaining fast sampling speed.
2. Preliminaries
2.1. VQ-GAN
VQ-GAN (Esser et al., 2021; Van Den Oord et al., 2017) is an autoencoder that learns to compress data into discrete latents, consisting of an encoder $E$, decoder $G$, codebook $C$, and discriminator $D$. Given an image $x \in \mathbb{R}^{H \times W \times 3}$, the encoder $E$ maps $x$ to its latent representation $h \in \mathbb{R}^{H \times W \times D}$, which is quantized by nearest-neighbors lookup in a codebook of embeddings $C = \{e_i\}_{i=1}^{K}$ to produce $z \in \mathbb{R}^{H \times W \times D}$. $z$ is fed through the decoder $G$ to reconstruct $x$. A straight-through estimator (Bengio, 2013) is used to maintain gradient flow through the quantization step. The codebook optimizes the following loss:

$$\mathcal{L}_{\mathrm{VQ}} = \lVert \mathrm{sg}(h) - e \rVert_2^2 + \beta \lVert h - \mathrm{sg}(e) \rVert_2^2 \tag{1}$$

where $\beta = 0.25$ is a hyperparameter, and $e$ is the nearest-neighbors embedding from $C$. For reconstruction, VQ-GAN replaces the original $\ell_2$ loss with a perceptual loss (Zhang et al., 2018), $\mathcal{L}_{\mathrm{LPIPS}}$. Finally, in order to encourage higher-fidelity samples, a patch-level discriminator $D$ is trained to classify between real and reconstructed images, with:

$$\mathcal{L}_{\mathrm{GAN}} = \log D(x) + \log(1 - D(\hat{x})) \tag{2}$$

Overall, VQ-GAN optimizes the following loss:

$$\min_{E,G,C} \max_{D} \; \mathcal{L}_{\mathrm{LPIPS}} + \mathcal{L}_{\mathrm{VQ}} + \lambda \mathcal{L}_{\mathrm{GAN}} \tag{3}$$

where $\lambda = \lVert \nabla_{G_L} \mathcal{L}_{\mathrm{LPIPS}} \rVert_2 \, / \, (\lVert \nabla_{G_L} \mathcal{L}_{\mathrm{GAN}} \rVert_2 + \delta)$ is an adaptive weight, $G_L$ is the last decoder layer, $\delta = 10^{-6}$, and $\mathcal{L}_{\mathrm{LPIPS}}$ is the perceptual distance metric of Zhang et al. (2018).
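For concreteness, the following is a minimal PyTorch-style sketch (ours, not the authors' implementation) of the quantization step: nearest-neighbor codebook lookup, the codebook loss of Equation (1), and the straight-through estimator. The class and variable names, codebook size, and dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VectorQuantizer(nn.Module):
        """Nearest-neighbor codebook lookup with a straight-through estimator."""

        def __init__(self, num_codes: int = 1024, dim: int = 256, beta: float = 0.25):
            super().__init__()
            self.beta = beta
            self.codebook = nn.Embedding(num_codes, dim)
            nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

        def forward(self, h):
            # h: (..., D) continuous encoder outputs, e.g. (B, H, W, D).
            flat = h.reshape(-1, h.shape[-1])
            # Squared distance from every position to every codebook entry.
            dists = (flat.pow(2).sum(1, keepdim=True)
                     - 2 * flat @ self.codebook.weight.t()
                     + self.codebook.weight.pow(2).sum(1))
            idx = dists.argmin(dim=1)                      # nearest code per position
            e = self.codebook(idx).view_as(h)              # quantized embeddings
            # Codebook loss, Eq. (1), up to a constant scale (mse_loss averages
            # over elements rather than summing the squared norm).
            loss = F.mse_loss(e, h.detach()) + self.beta * F.mse_loss(h, e.detach())
            # Straight-through estimator: gradients flow from z back to h.
            z = h + (e - h).detach()
            return z, idx.view(h.shape[:-1]), loss

The perceptual and adversarial terms of Equations (2) and (3) would then be trained on top of this quantized autoencoder.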
2.2. MaskGit
MaskGit (Chang et al., 2022) models distributions over discrete tokens, such as those produced by a VQ-GAN. It generates images with sample quality competitive with autoregressive models at a fraction of the sampling cost by using a masked token prediction objective during training. Formally, we denote by $z \in \mathbb{Z}^{H \times W}$ the discrete latent tokens representing an image. For each training step, we uniformly sample $t \in [0, 1)$ and randomly generate a mask $m \in \{0, 1\}^{H \times W}$ with $N = \gamma H W$ masked values, where $\gamma = \cos\left(\tfrac{\pi}{2} t\right)$. Then, MaskGit learns to predict the masked tokens with the following objective:

$$\mathcal{L}_{\mathrm{mask}} = -\mathbb{E}_{z \in \mathcal{D}}\left[\log p(z \mid z \odot m)\right] \tag{4}$$

During inference, because MaskGit has been trained to model any set of unconditional and conditional probabilities, we can sample any subset of tokens per sampling iteration. Chang et al. (2022) introduce a confidence-based sampling mechanism, whereas other work (Lee et al., 2022) proposes an iterative sample-and-revise approach.
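As a concrete illustration of the cosine schedule and the objective in Equation (4), below is a short sketch of ours in PyTorch; `token_model` stands in for any network that outputs logits over the codebook at every spatial position, and the special `MASK_ID` token is an assumption.

    import math
    import torch
    import torch.nn.functional as F

    MASK_ID = 0  # assumed id of a special [MASK] token in the vocabulary

    def maskgit_training_loss(token_model, z):
        """z: (B, H, W) integer VQ codes; returns the masked-token cross-entropy."""
        B, H, W = z.shape
        t = torch.rand(B, device=z.device)                 # t ~ U[0, 1)
        gamma = torch.cos(0.5 * math.pi * t)               # fraction of tokens to mask
        # Mask the gamma*H*W positions with the lowest random scores per sample.
        scores = torch.rand(B, H * W, device=z.device)
        num_masked = (gamma * H * W).long().clamp(min=1)
        cutoff = torch.gather(scores.sort(dim=1).values, 1, (num_masked - 1).unsqueeze(1))
        mask = (scores <= cutoff).view(B, H, W)            # True = masked
        z_masked = torch.where(mask, torch.full_like(z, MASK_ID), z)
        logits = token_model(z_masked)                     # (B, H, W, vocab_size)
        # Cross-entropy only on the masked positions, as in Eq. (4).
        return F.cross_entropy(logits[mask], z[mask])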
3. TECO
We present the Temporally Consistent Video Transformer (TECO), a video generation model that scales more efficiently to training on longer-horizon videos.
3.1. Architectural Overview
Figure 3. The architectural design of TECO. (a) Prior work on video generation over VQ codes adopts a single spatio-temporal transformer over all codes; for a 300-frame 128 × 128 video compressed to roughly 19K tokens, this amounts to (19K)² ≈ 386M attention links, which is prohibitive for long sequences due to the quadratic complexity of attention. (b) We propose a novel and efficient architecture that aggressively downsamples in space before feeding a temporal transformer (300 × 2² = 1.2K tokens, i.e. (1.2K)² ≈ 1.44M attention links), and then expands back out with a spatial MaskGit applied separately per frame (300 · 64² ≈ 1.22M links). On training sequences of 300 frames, TECO is therefore orders of magnitude more efficient than existing models, allowing the use of larger models for a given compute budget.

Our proposed framework is shown in Figure 3, where $x_{1:T}$ denotes a sequence of video frames. Our primary innovation is a more efficient architecture that can scale to long sequences. Prior SOTA methods (Yan et al., 2021; Ge et al., 2022; Villegas et al., 2022) over VQ codes all train a single spatio-temporal transformer to model every code; however, this becomes prohibitively expensive for sequences containing tens of thousands of tokens. On the other hand, these models have been shown to learn highly multi-modal distributions and to scale well to complex video. As such, we design the TECO architecture to retain these high-capacity scaling properties while ensuring orders of magnitude more efficient training and inference. In the following sections, we motivate each component of our model, with several specific design choices to ensure efficiency and scalability. TECO consists of four components:

$$\begin{aligned} \text{Encoder:} \quad & z_t = E(x_t, x_{t-1}) \\ \text{Temporal Transformer:} \quad & h_t = H(z_{1:t}) \\ \text{Spatial MaskGit:} \quad & p(z_t \mid h_{t-1}) \\ \text{Decoder:} \quad & p(x_t \mid z_t, h_{t-1}) \end{aligned} \tag{5}$$
Encoder  We can achieve compressed representations by leveraging spatio-temporal redundancy in video data. To do this, we learn a CNN encoder $z_t = E(x_t, x_{t-1})$, which encodes the current frame $x_t$ conditioned on the previous frame by channel-wise concatenating $x_{t-1}$, and then quantizes the output using codebook $C$ to produce $z_t$. We apply the VQ loss defined in Equation (1) per timestep. In addition, we $\ell_2$-normalize the codebook and the embeddings to encourage higher codebook usage (Yu et al., 2021). The first frame is concatenated with zeros, and $z_1$ is not quantized, to prevent information loss.
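The frame-conditioned encoding step can be sketched as follows (ours, in PyTorch; the layer sizes are illustrative assumptions, and the special handling of the unquantized first frame is omitted for brevity). It reuses the `VectorQuantizer` sketched in Section 2.1.

    import torch
    import torch.nn as nn

    class FrameConditionedEncoder(nn.Module):
        """z_t = E(x_t, x_{t-1}): encode each frame conditioned on its predecessor."""

        def __init__(self, quantizer, in_channels: int = 3, latent_dim: int = 256):
            super().__init__()
            self.quantizer = quantizer          # e.g. the VectorQuantizer sketch above
            self.net = nn.Sequential(           # illustrative downsampling CNN
                nn.Conv2d(2 * in_channels, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(128, latent_dim, 4, stride=2, padding=1),
            )

        def forward(self, x):
            # x: (B, T, C, H, W). The first frame is paired with zeros, as in the text.
            prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
            pairs = torch.cat([x, prev], dim=2)             # channel-wise concatenation
            B, T = pairs.shape[:2]
            h = self.net(pairs.flatten(0, 1))               # (B*T, D, H', W')
            z, idx, vq_loss = self.quantizer(h.permute(0, 2, 3, 1))
            z = z.permute(0, 3, 1, 2).contiguous()          # back to (B*T, D, H', W')
            return z.unflatten(0, (B, T)), idx.unflatten(0, (B, T)), vq_loss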
Temporal Transformer  Compressed, discrete latents are more lossy and tend to require higher spatial resolutions than continuous latents. Therefore, before modeling temporal information, we apply a single strided convolution to downsample each discrete latent $z_t$, where visually simpler datasets allow for more downsampling and visually complex datasets require less. Afterwards, we learn a large transformer to model temporal dependencies, and then apply a transposed convolution to upsample the representation back to the original resolution of $z_t$. In summary, we use the following architecture:

$$h_t = H(z_{<t}) = \mathrm{ConvT}(\mathrm{Transformer}(\mathrm{Conv}(z_{<t}))) \tag{6}$$
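A minimal sketch of Equation (6) follows (ours; the layer sizes, downsampling factor, and exact causal-masking convention are assumptions rather than the paper's configuration): codes are spatially downsampled with a strided convolution, a causal transformer mixes information across time, and a transposed convolution restores the original resolution.

    import torch
    import torch.nn as nn

    class TemporalTransformer(nn.Module):
        """h = ConvT(Transformer(Conv(z))) with attention that is causal across frames."""

        def __init__(self, dim: int = 256, down: int = 4, n_layers: int = 8, n_heads: int = 8):
            super().__init__()
            self.down = nn.Conv2d(dim, dim, kernel_size=down, stride=down)
            layer = nn.TransformerEncoderLayer(dim, n_heads, 4 * dim, batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, n_layers)
            self.up = nn.ConvTranspose2d(dim, dim, kernel_size=down, stride=down)

        def forward(self, z):
            # z: (B, T, D, H, W) quantized code embeddings for each frame.
            B, T, D, H, W = z.shape
            u = self.down(z.flatten(0, 1))                  # (B*T, D, H/k, W/k)
            h2, w2 = u.shape[-2:]
            tokens = u.unflatten(0, (B, T)).permute(0, 1, 3, 4, 2).reshape(B, T * h2 * w2, D)
            # Boolean mask: True entries are blocked, so tokens of frame t only
            # attend to tokens of frames <= t.
            frame = torch.arange(T, device=z.device).repeat_interleave(h2 * w2)
            blocked = frame[None, :] > frame[:, None]
            out = self.transformer(tokens, mask=blocked)
            u = out.reshape(B, T, h2, w2, D).permute(0, 1, 4, 2, 3).reshape(B * T, D, h2, w2)
            return self.up(u).unflatten(0, (B, T))          # (B, T, D, H, W)

Whether frame t may attend to its own codes (as in this sketch) or only to strictly earlier frames is a detail the sketch does not pin down.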
Decoder  The decoder is an upsampling CNN that reconstructs $\hat{x}_t = D(z_t, h_t)$, where $z_t$ can be interpreted as the posterior of timestep $t$, and $h_t$ is the output of the temporal transformer, which summarizes information from previous timesteps. $z_t$ and $h_t$ are concatenated channel-wise and fed into the decoder. Together with the encoder, the decoder optimizes the following cross-entropy reconstruction loss:

$$\mathcal{L}_{\mathrm{recon}} = -\frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid z_t, h_t) \tag{7}$$

Because the temporal transformer output $h_t$ aggregates information over time, this objective encourages the $z_t$ features to encode the relative information between frames, yielding more compressed codes for efficient modeling over longer sequences.
Spatial MaskGit  Lastly, we use a MaskGit (Chang et al., 2022) to model the prior $p(z_t \mid h_t)$. We show that a MaskGit prior allows not only faster but also higher-quality sampling compared to an autoregressive prior. At every training iteration, we follow prior work and sample a random mask $m_t$, optimizing

$$\mathcal{L}_{\mathrm{prior}} = -\frac{1}{T} \sum_{t=1}^{T} \log p(z_t \mid z_t \odot m_t) \tag{8}$$

where $h_t$ is concatenated channel-wise with the masked $z_t$ to predict the masked tokens. During generation, we follow Lee et al. (2022): we initially generate each frame in chunks of 8 at a time, and then perform 2 revise rounds, re-generating half the tokens each time.

Training Objective  The final objective is the following:

$$\mathcal{L}_{\mathrm{TECO}} = \mathcal{L}_{\mathrm{VQ}} + \mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{prior}} \tag{9}$$
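Schematically, one training step combines the three terms of Equation (9). The sketch below is ours; the component interfaces follow the sketches above and are assumptions, not the authors' exact code.

    def teco_training_step(encoder, temporal, maskgit_prior, decoder, video):
        """One TECO training step over a batch of videos, following Eq. (9)."""
        # 1. Encode frames into quantized codes z_t; includes the VQ loss of Eq. (1).
        z, codes, vq_loss = encoder(video)          # z: (B, T, D, H', W')
        # 2. The temporal transformer summarizes the codes into h_t.
        h = temporal(z)                             # (B, T, D, H', W')
        # 3. The MaskGit prior predicts randomly masked codes given h, Eq. (8).
        prior_loss = maskgit_prior(codes, h)
        # 4. The decoder reconstructs frames from (z_t, h_t), Eq. (7).
        recon_loss = decoder(z, h, video)
        return vq_loss + recon_loss + prior_loss    # L_TECO, Eq. (9)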
3.2. DropLoss
We propose DropLoss, a simple trick that allows for more scalable and efficient training (Figure 4). Due to its architectural design, TECO can be separated into two components: (1) learning temporal representations, consisting of the encoder and the temporal transformer, and (2) predicting future frames, consisting of the dynamics prior and the decoder. We can increase training efficiency by dropping out random timesteps that are not decoded and are thus omitted from the reconstruction loss. For example, given a video of T frames, we compute $h_t$ for all $t \in \{1, \ldots, T\}$, and then compute the losses $\mathcal{L}_{\mathrm{prior}}$ and $\mathcal{L}_{\mathrm{recon}}$ for only 10% of the indices. Because random indices are selected at each iteration, the model still needs to learn to accurately predict all timesteps. This reduces training costs significantly because the decoder and dynamics prior require non-trivial computation. DropLoss is applicable both to a wide class of architectures and to tasks beyond video prediction.
Figure 4. DropLoss improves training scalability on longer sequences by computing the loss only on a random subset of time indices at each training iteration. For TECO, the decoder and MaskGit do not need to be computed for dropped-out timesteps.
Figure 5. Quantitative comparisons between TECO and baseline methods in long-horizon temporal consistency, showing LPIPS between generated and ground-truth frames at each timestep. Timestep 0 corresponds to the first predicted frame (conditioning frames are not included in the plot). Our method remains more temporally consistent over hundreds of timesteps of prediction than SOTA models.
4. Experiments
4.1. Datasets
We introduce three challenging video datasets to better measure long-range consistency in video prediction, centered around 3D environments in DMLab (Beattie et al., 2016), Minecraft (Guss et al., 2019), and Habitat (Savva et al., 2019), with videos of agents randomly traversing scenes of varying difficulty. These datasets require video prediction models to reproduce observed parts of scenes and to newly generate unobserved parts. In contrast, many existing video benchmarks do not have strong long-range dependencies, so a model with limited context is sufficient. Refer to Appendix M for further details on each dataset.
DMLab-40k  DeepMind Lab is a simulator that procedurally generates random 3D mazes with random floor and wall textures. We generate 40k action-conditioned 64 × 64 videos of 300 frames each, in which an agent randomly traverses 7 × 7 mazes by choosing random points in the maze and navigating to them via the shortest path. We train all models for both action-conditioned and unconditional prediction (by periodically masking out actions) to enable both types of generation. We further discuss the use cases of action-conditioned and unconditional models in Section 4.3.
Minecraft-200k  This popular game features procedurally generated 3D worlds that contain complex terrain such as hills, forests, rivers, and lakes. We collect 200k action-conditioned videos of length 300 and resolution 128 × 128 in Minecraft's marsh biome. The player alternates between walking forward for a random number of steps and randomly rotating left or right, so parts of the scene go out of view and come back into view later. We train all models action-conditioned for ease of interpretation and evaluation, though it is generally easy for video models to learn these discrete actions unconditionally.
Habitat-200k  Habitat is a simulator for rendering trajectories through scans of real 3D scenes. We compile 1400 indoor scans from HM3D (Ramakrishnan et al., 2021), MatterPort3D (Chang et al., 2017), and Gibson (Xia et al., 2018) to generate 200k action-conditioned videos of 300 frames at a resolution of 128 × 128 pixels. We use Habitat's in-built path-traversal algorithm to construct action trajectories that move the agent between randomly sampled locations. Similar to DMLab, we train all video models to perform both unconditional and action-conditioned prediction.
Kinetics-600  Kinetics-600 (Carreira & Zisserman, 2017) is a highly complex real-world video dataset, originally proposed for action recognition. The dataset contains 400k videos of varying lengths of up to 300 frames. We evaluate our method on video prediction without actions (as none exist), generating 80 future frames conditioned on 20. In addition, we filter out videos shorter than 100 frames, leaving 392k videos that are split for training and evaluation. We use a resolution of 128 × 128 pixels. Although Kinetics-600 does not have many long-range dependencies, we evaluate our method on this dataset to show that it can scale to complex, natural video.
4.2. Baselines
We compare against SOTA baselines selected from several different families of models: latent-variable-based variational models, autoregressive likelihood models, and diffusion models. In addition, for efficiency, we train all models on VQ codes using a pretrained VQ-GAN for each dataset. For our diffusion baseline, we follow Rombach et al. (2022) and use a VAE instead of a VQ-GAN. Note that we do not include any GAN baselines since, to the best of our knowledge, there is no GAN that trains on a latent space rather than raw pixels, which is important for properly scaling to long video sequences.
Space-time Transformers  We compare TECO to several variants of space-time transformers as depicted in Figure 3: VideoGPT (Yan et al., 2021) (autoregressive over space-time), Phenaki (Villegas et al., 2022) (MaskGit over space-time with full attention), MaskViT (Gupta et al., 2022) (MaskGit over space-time with axial attention), and Hourglass transformers (Nawrot et al., 2021) (hierarchical autoregressive over space-time). Note that we do not include text-conditioning for Phenaki as it is irrelevant in our case. We only evaluate these models on DMLab, as Table 2 and Table 1 show that Perceiver-AR (a space-time transformer with improvements specifically for learning long dependencies) is a stronger baseline.