IMAGEN VIDEO: HIGH DEFINITION VIDEO
GENERATION WITH DIFFUSION MODELS
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko,
Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, Tim Salimans
Google Research, Brain Team
{jonathanho,williamchan,sahariac,jwhang,ruiqig,agritsenko,
durk,pooleb,mnorouzi,davidfleet,salimans}@google.com
ABSTRACT
We present Imagen Video, a text-conditional video generation system based on a
cascade of video diffusion models. Given a text prompt, Imagen Video generates
high definition videos using a base video generation model and a sequence of
interleaved spatial and temporal video super-resolution models. We describe how
we scale up the system as a high definition text-to-video model, including design
decisions such as the choice of fully-convolutional temporal and spatial
super-resolution models at certain resolutions, and the choice of the
v-parameterization of diffusion models. In addition, we confirm and transfer
findings from previous work on diffusion-based image generation to the video
generation setting. Finally, we apply progressive distillation to our video models
with classifier-free guidance for fast, high quality sampling. We find Imagen Video
not only capable of generating videos of high fidelity, but also having a high
degree of controllability and world knowledge, including the ability to generate
diverse videos and text animations in various artistic styles and with 3D object
understanding. See imagen.research.google/video for samples.
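As an illustration of two ingredients named above, the following is a minimal sketch of the v-parameterization and of classifier-free guidance in their commonly used forms. It is not the implementation used in Imagen Video: the cosine noise schedule, the function names, and the use of plain Python lists in place of video tensors are assumptions made here for illustration only.

```python
import math

def alpha_sigma(t):
    # Assumed variance-preserving cosine schedule with alpha^2 + sigma^2 = 1.
    # The schedule actually used by the paper is not stated in this section.
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)

def diffuse(x, eps, t):
    # Noisy latent z_t = alpha_t * x + sigma_t * eps.
    a, s = alpha_sigma(t)
    return [a * xi + s * ei for xi, ei in zip(x, eps)]

def v_target(x, eps, t):
    # v-parameterization target: v_t = alpha_t * eps - sigma_t * x.
    a, s = alpha_sigma(t)
    return [a * ei - s * xi for xi, ei in zip(x, eps)]

def x_from_v(z, v, t):
    # Recover the clean sample: x = alpha_t * z_t - sigma_t * v_t.
    a, s = alpha_sigma(t)
    return [a * zi - s * vi for zi, vi in zip(z, v)]

def eps_from_v(z, v, t):
    # Recover the noise: eps = sigma_t * z_t + alpha_t * v_t.
    a, s = alpha_sigma(t)
    return [s * zi + a * vi for zi, vi in zip(z, v)]

def classifier_free_guidance(cond, uncond, w):
    # Guided prediction (1 + w) * cond - w * uncond; w = 0 disables guidance.
    return [(1.0 + w) * c - w * u for c, u in zip(cond, uncond)]

# Toy values standing in for a video tensor (hypothetical data).
x = [0.5, -1.0]
eps = [0.3, 0.8]
t = 0.4
z = diffuse(x, eps, t)
v = v_target(x, eps, t)
x_rec = x_from_v(z, v, t)
eps_rec = eps_from_v(z, v, t)
guided = classifier_free_guidance([1.0], [0.0], 2.0)
```

Because alpha_t^2 + sigma_t^2 = 1 under the assumed schedule, both the clean sample and the noise can be recovered exactly from a v-prediction and the noisy latent, which is what the two inverse functions rely on.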
Figure 1: Imagen Video sample for the prompt: A bunch of autumn leaves falling on a calm lake to
form the text ‘Imagen Video’. Smooth. The generated video is at 1280×768 resolution, 5.3 second
duration and 24 frames per second.
1 INTRODUCTION
Generative modeling has made tremendous progress with recent text-to-image systems like
DALL-E 2 (Ramesh et al., 2022), Imagen (Saharia et al., 2022b), Parti (Yu et al., 2022), CogView
(Ding et al., 2021) and Latent Diffusion (Rombach et al., 2022). Diffusion models (Sohl-Dickstein
et al., 2015; Ho et al., 2020) in particular have found considerable success in multiple generative
modeling tasks (Nichol & Dhariwal, 2021; Ho et al., 2022a; Dhariwal & Nichol, 2022) including
density estimation (Kingma et al., 2021), text-to-speech (Chen et al., 2021a; Kong et al., 2021;
Chen et al., 2021b), image-to-image (Saharia et al., 2022c;a; Whang et al., 2022), and text-to-image
(Rombach et al., 2022; Nichol et al., 2021; Ramesh et al., 2022; Saharia et al., 2022b).
Equal contribution.
arXiv:2210.02303v1 [cs.CV] 5 Oct 2022
A colorful professional animated logo for ‘Imagen Video’ written using paint brush in cursive. Smooth animation.
Blue flame transforming into the text “Imagen”. Smooth animation
Wooden figurine surfing on a surfboard in space.
Balloon full of water exploding in extreme slow motion.
Melting pistachio ice cream dripping down the cone.
A british shorthair jumping over a couch.
Coffee pouring into a cup.
Figure 2: Videos generated from various text prompts. Imagen Video produces diverse and
temporally-coherent videos that are well-aligned with the given prompt.
A small hand-crafted wooden boat taking off to space.
A person riding a bike in the sunset.
Drone flythrough interior of Sagrada Familia cathedral
Wooden figurine walking on a treadmill made out of exercise mat.
Origami dancers in white paper, 3D render, ultra-detailed, on white background, studio shot, dancing modern dance.
Campfire at night in a snowy forest with starry sky in the background.
An astronaut riding a horse.
Figure 3: Videos generated from various text prompts. Imagen Video produces diverse and
temporally-coherent videos that are well-aligned with the given prompt.
A person riding a horse in the sunrise.
A happy elephant wearing a birthday hat walking under the sea.
Studio shot of minimal kinetic sculpture made from thin wire shaped like a bird on white background.
A bunch of colorful candies falling into a tray in the shape of text ‘Imagen Video’. Smooth video.
A group of people hiking in a forest.
A goldendoodle playing in a park by a lake.
Incredibly detailed science fiction scene set on an alien planet, view of a marketplace. Pixel art.
Figure 4: Videos generated from various text prompts. Imagen Video produces diverse and
temporally-coherent videos that are well-aligned with the given prompt.