Preprint. Under review.
NOVEL VIEW SYNTHESIS WITH DIFFUSION MODELS
Daniel Watson William Chan Ricardo Martin-Brualla
Jonathan Ho Andrea Tagliasacchi Mohammad Norouzi
Google Research
ABSTRACT
We present 3DiM, a diffusion model for 3D novel view synthesis, which is able
to translate a single input view into consistent and sharp completions across many
views. The core component of 3DiM is a pose-conditional image-to-image dif-
fusion model, which takes a source view and its pose as inputs, and generates a
novel view for a target pose as output. 3DiM can generate multiple views that
are 3D consistent using a novel technique called stochastic conditioning. The out-
put views are generated autoregressively, and during the generation of each novel
view, one selects a random conditioning view from the set of available views at
each denoising step. We demonstrate that stochastic conditioning significantly
improves the 3D consistency of a naïve sampler for an image-to-image diffusion
model, which involves conditioning on a single fixed view. We compare 3DiM
to prior work on the SRN ShapeNet dataset, demonstrating that 3DiM’s gener-
ated completions from a single view achieve much higher fidelity, while being
approximately 3D consistent. We also introduce a new evaluation methodology,
3D consistency scoring, to measure the 3D consistency of a generated object by
training a neural field on the model’s output views. 3DiM is geometry-free, does
not rely on hyper-networks or test-time optimization for novel view synthesis, and
allows a single model to easily scale to a large number of scenes.
Figure 1: Given a single input image on the left, 3DiM performs novel view synthesis and generates the
four views on the right. We trained a single 471M parameter 3DiM on all of ShapeNet (without class-
conditioning) and sample frames with 256 steps (512 score function evaluations with classifier-free guidance).
See the Supplementary Website for video outputs.
1 INTRODUCTION
Diffusion Probabilistic Models (DPMs) (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho
et al., 2020), also known simply as diffusion models, have recently emerged as a powerful family
of generative models, achieving state-of-the-art performance on audio and image synthesis (Chen
et al., 2020; Dhariwal & Nichol, 2021), while admitting better training stability over adversarial
approaches (Goodfellow et al., 2014), as well as likelihood computation, which enables further ap-
plications such as compression and density estimation (Song et al., 2021; Kingma et al., 2021).
Diffusion models have achieved impressive empirical results in a variety of image-to-image translation tasks, including text-to-image generation, super-resolution, inpainting, colorization, uncropping, and artifact removal (Song et al., 2020; Saharia et al., 2021a; Ramesh et al., 2022; Saharia et al., 2022).
One particular image-to-image translation problem where diffusion models have not been investi-
gated is novel view synthesis, where, given a set of images of a given 3D scene, the task is to infer
how the scene looks from novel viewpoints. Before the recent emergence of Scene Representation
Networks (SRN) (Sitzmann et al., 2019) and Neural Radiance Fields (NeRF) (Mildenhall et al.,
2020), state-of-the-art approaches to novel view synthesis were typically built on generative models (Sun et al., 2018) or more classical techniques based on interpolation or disparity estimation (Park et al.,
2017; Zhou et al., 2018). Today, these models have been outperformed by NeRF-class models (Yu
et al., 2021; Niemeyer et al., 2021; Jang & Agapito, 2021), where 3D consistency is guaranteed by
construction, as images are generated by volume rendering of a single underlying 3D representation
(a.k.a. “geometry-aware” models).
Still, these approaches have different limitations. Heavily regularized NeRFs for novel view synthesis from few images, such as RegNeRF (Niemeyer et al., 2021), produce undesired artifacts when given very few images and fail to leverage knowledge from multiple scenes (recall that NeRFs are trained on a single scene, i.e., one model per scene); yet, given one or very few views of a novel scene, a reasonable model must extrapolate to complete the occluded parts of the scene. PixelNeRF (Yu et al., 2021) and VisionNeRF (Lin et al., 2022) address this by training NeRF-like models
conditioned on feature maps that encode the novel input view(s). However, these approaches are
regressive rather than generative, and as a result, they cannot yield different plausible modes and
are prone to blurriness. This type of failure has also been previously observed in regression-based
models (Saharia et al., 2021b). Other works such as CodeNeRF (Jang & Agapito, 2021) and LOLNeRF (Rebain et al., 2021) instead employ test-time optimization to handle novel scenes, but still
have issues with sample quality.
In recent literature, geometry-free approaches (i.e., methods without explicit geometric inductive
biases like those introduced by volume rendering) such as Light Field Networks (LFN) (Sitzmann
et al., 2021) and Scene Representation Transformers (SRT) (Sajjadi et al., 2021) have achieved re-
sults competitive with 3D-aware methods in the “few-shot” setting, where the number of condition-
ing views is limited (i.e., 1-10 images vs. dozens of images as in the usual NeRF setting). Similarly
to our approach, EG3D (Chan et al., 2022) provides approximate 3D consistency by leveraging
generative models. EG3D employs a StyleGAN (Karras et al., 2019) with volumetric rendering,
followed by generative super-resolution (the latter being responsible for the approximation). In
comparison to this complex setup, diffusion not only provides a significantly simpler architecture,
but also a simpler hyper-parameter tuning experience compared to GANs, which are notoriously difficult to tune (Mescheder et al., 2018). Notably, diffusion models have already
seen some success on 3D point-cloud generation (Luo & Hu, 2021; Waibel et al., 2022).
Motivated by these observations and the success of diffusion models in image-to-image tasks, we
introduce 3D Diffusion Models (3DiMs). 3DiMs are image-to-image diffusion models trained on
pairs of images of the same scene, where we assume the poses of the two images are known. Draw-
ing inspiration from Scene Representation Transformers (Sajjadi et al., 2021), 3DiMs are trained to
build a conditional generative model of one view given another view and their poses. Our key dis-
covery is that we can turn this image-to-image model into a model that can produce an entire set of
3D-consistent frames through autoregressive generation, which we enable with our novel stochastic
conditioning sampling algorithm. We cover stochastic conditioning in more detail in Section 2.2 and
provide an illustration in Figure 3. Compared to prior work, 3DiMs are generative (rather than regressive) geometry-free models, allow training to scale to a large number of scenes, and offer a simple end-to-end approach.
We now summarize our core contributions:
1. We introduce 3DiM, a geometry-free image-to-image diffusion model for novel view synthesis.
2. We introduce the stochastic conditioning sampling algorithm, which encourages 3DiM to gener-
ate 3D-consistent outputs.
3. We introduce X-UNet, a new variant of the UNet architecture (Ronneberger et al., 2015) for 3D novel view synthesis, demonstrating that architectural changes are critical for high-fidelity results.
4. We introduce an evaluation scheme for geometry-free view synthesis models, 3D consistency
scoring, that can numerically capture 3D consistency by training neural fields on model outputs.
Figure 2: Pose-conditional image-to-image training – Example training inputs and outputs for
pose-conditional image-to-image diffusion models. Given two frames from a common scene and
their poses (R, t), the task is to undo the noise added to one of the two frames. (*) In practice,
our neural network is trained to predict the Gaussian noise ε used to corrupt the original view; the predicted view is still just a linear combination of the noisy input and the predicted ε.
2 POSE-CONDITIONAL DIFFUSION MODELS
To motivate 3DiMs, let us consider the problem of novel view synthesis given few images from a
probabilistic perspective. Given a complete description of a 3D scene S, for any pose p, the view
x(p) at pose p is fully determined from S, i.e., views are conditionally independent given S. However, we are interested in modeling distributions of the form q(x_1, ..., x_m | x_{m+1}, ..., x_n) without S,
where views are no longer conditionally independent. A concrete example is the following: given
the back of a person’s head, there are multiple plausible views for the front. An image-to-image
model sampling front views given only the back should indeed yield different outputs for each front
view – with no guarantees that they will be consistent with each other – especially if it learns the
data distribution perfectly. Similarly, given a single view of an object that appears small, there is
ambiguity on the pose itself: is it small and close, or simply far away? Thus, given the inherent
ambiguity in the few-shot setting, we need a sampling scheme where generated views can depend
on each other in order to achieve 3D consistency. This contrasts with NeRF approaches, where query rays are conditionally independent given a 3D representation S, an even stronger condition than imposing conditional independence among frames. Such approaches try to learn the richest possible representation for a single scene S, while 3DiM avoids the difficulty of learning a generative model for S altogether.
2.1 IMAGE-TO-IMAGE DIFFUSION MODELS WITH POSE CONDITIONING
Given a data distribution q(x_1, x_2) of pairs of views from a common scene at poses p_1, p_2 ∈ SE(3), we define an isotropic Gaussian process that adds increasing amounts of noise to data samples as the log signal-to-noise ratio λ decreases, following Salimans & Ho (2022):
$$ q\big(z_k^{(\lambda)} \mid x_k\big) := \mathcal{N}\big(z_k^{(\lambda)};\ \sigma(\lambda)^{1/2} x_k,\ \sigma(-\lambda) I\big) \qquad (1) $$
where σ(·) is the sigmoid function. We can apply the reparametrization trick (Kingma & Welling, 2013) and sample from these marginal distributions via
$$ z_k^{(\lambda)} = \sigma(\lambda)^{1/2} x_k + \sigma(-\lambda)^{1/2} \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \qquad (2) $$
Then, given a pair of views, we learn to reverse this process in one of the two frames by minimizing
the objective proposed by Ho et al. (2020), which has been shown to yield much better sample
quality than maximizing the true evidence lower bound (ELBO):
$$ L = \mathbb{E}_{q(x_1, x_2)}\, \mathbb{E}_{\lambda, \epsilon}\, \big\| \epsilon_\theta\big(z_2^{(\lambda)}, x_1, \lambda, p_1, p_2\big) - \epsilon \big\|_2^2 \qquad (3) $$
where ε_θ is a neural network whose task is to denoise the frame z_2^(λ) given a different (clean) frame x_1, and λ is the log signal-to-noise ratio. To make our notation more legible, we slightly abuse notation and from now on will simply write ε_θ(z_2^(λ), x_1).
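To make the objective concrete, here is a minimal PyTorch-style sketch of one training step implementing Eqs. (1)-(3). It is illustrative only: the callable `eps_model`, its argument order, the uniform sampling range for λ, and the tensor shapes are assumptions made for the example rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def training_step(eps_model, x1, x2, p1, p2, lambda_min=-20.0, lambda_max=20.0):
    """One pose-conditional denoising training step (sketch of Eqs. (1)-(3)).

    eps_model: hypothetical callable predicting the noise added to the target view,
               given (noisy target, clean source, log-SNR, source pose, target pose).
    x1, x2:    clean source and target views, shape [B, C, H, W].
    p1, p2:    pose conditioning tensors (e.g. camera rays), treated as opaque here.
    """
    b = x2.shape[0]
    # Sample a log signal-to-noise ratio per example (uniform here only for simplicity;
    # the paper's actual noise schedule may differ).
    lam = torch.empty(b, device=x2.device).uniform_(lambda_min, lambda_max)
    lam_ = lam.view(b, 1, 1, 1)

    # Forward process, Eq. (2): z2 = sigmoid(lam)^0.5 * x2 + sigmoid(-lam)^0.5 * eps.
    eps = torch.randn_like(x2)
    z2 = torch.sigmoid(lam_).sqrt() * x2 + torch.sigmoid(-lam_).sqrt() * eps

    # Eq. (3): epsilon-prediction loss, conditioning on the clean view x1 and both poses.
    eps_pred = eps_model(z2, x1, lam, p1, p2)
    return F.mse_loss(eps_pred, eps)
```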
Figure 3: Stochastic conditioning sampler – There are two main components to our sampling pro-
cedure: 1) the autoregressive generation of multiple frames, and 2) the denoising process to generate
each frame. When generating a new frame, we randomly select a previous frame as the conditioning
frame at each denoising step. We omit the pose inputs in the diagram to avoid overloading the figure,
but they should be understood to be recomputed at each step, depending on the conditioning view
we randomly sample.
2.2 3D CONSISTENCY VIA STOCHASTIC CONDITIONING
Motivation. We begin this section by motivating the need for our stochastic conditioning sampler.
In the ideal situation, we would model our 3D scene frames using the chain rule decomposition:
$$ p(x) = \prod_i p(x_i \mid x_{<i}) \qquad (4) $$
This factorization is ideal, as it models the distribution exactly without making any conditional in-
dependence assumptions. Each frame is generated autoregressively, conditioned on all the previous
frames. However, we found this solution to perform poorly. Due to memory limitations, we can
only condition on a limited number of frames in practice (i.e., a k-Markovian model). We also find that sample quality degrades as we increase the maximum number of input frames k. In order to achieve the best possible sample quality, we thus opt for the bare minimum of k = 2 (i.e., an image-to-image model). Our key discovery is that, with k = 2, we can still achieve approximate
3D consistency. Instead of using a sampler that is Markovian over frames, we leverage the iterative
nature of diffusion sampling by varying the conditioning frame at each denoising step.
Stochastic Conditioning. We now detail our novel stochastic conditioning sampling procedure that
allows us to generate 3D-consistent samples from a 3DiM. We start with a set of conditioning views
X = {x_1, ..., x_k} of a static scene, where typically k = 1 or k is very small. We then generate a new frame by running a modified version of the standard denoising diffusion reverse process for steps λ_min = λ_T < λ_{T−1} < ... < λ_0 = λ_max:
$$ \hat{x}_{k+1} = \frac{1}{\sigma(\lambda_t)^{1/2}} \Big( z_{k+1}^{(\lambda_t)} - \sigma(-\lambda_t)^{1/2}\, \epsilon_\theta\big(z_{k+1}^{(\lambda_t)}, x_i\big) \Big) \qquad (5) $$
$$ z_{k+1}^{(\lambda_{t-1})} \sim q\big(z_{k+1}^{(\lambda_{t-1})} \mid z_{k+1}^{(\lambda_t)}, \hat{x}_{k+1}\big) \qquad (6) $$
where, crucially, i ∼ Uniform({1, ..., k}) is re-sampled at each denoising step. In other words, each individual denoising step is conditioned on a different random view from X (the set that contains the input view(s) and the previously generated samples). Once we finish running this sampling chain and produce a final x_{k+1}, we simply add it to X and repeat this procedure if we want to sample
more frames. Given sufficient denoising steps, stochastic conditioning allows each generated frame
to be guided by all previous frames. See Figure 3 for an illustration. In practice, we use 256
denoising steps, which we find to be sufficient to achieve both high sample quality and approximate
3D consistency. As usual in the literature, the first (noisiest) sample is just a Gaussian, i.e., z_{k+1}^(λ_T) ∼ N(0, I), and at the last step λ_0, we sample noiselessly.
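The sketch below illustrates this sampling loop with the same hypothetical `eps_model` as before. Note one simplification: instead of the stochastic ancestral step of Eq. (6), it uses a deterministic DDIM-style update to move to the next noise level, so it should be read as a simplified variant of the sampler rather than the exact procedure.

```python
import torch

@torch.no_grad()
def stochastic_conditioning_sample(eps_model, views, poses, target_pose,
                                   num_steps=256, lambda_min=-20.0, lambda_max=20.0):
    """Generate one novel view at target_pose, re-drawing the conditioning view
    from `views` at every denoising step (sketch of the sampler in Sec. 2.2).

    views: list of clean frames, each [C, H, W] (input view(s) plus frames
           generated so far); poses: list of their poses.
    """
    device = views[0].device
    # Noise schedule lambda_T = lambda_min (noisiest) ... lambda_0 = lambda_max.
    lambdas = torch.linspace(lambda_min, lambda_max, num_steps, device=device)
    z = torch.randn_like(views[0])[None]          # z^(lambda_T) ~ N(0, I)

    for t in range(num_steps):
        lam_t = lambdas[t]
        # Stochastic conditioning: pick a random conditioning view at every step.
        i = torch.randint(len(views), (1,)).item()
        x_cond, p_cond = views[i][None], poses[i]

        alpha_t = torch.sigmoid(lam_t).sqrt()
        sigma_t = torch.sigmoid(-lam_t).sqrt()
        eps_hat = eps_model(z, x_cond, lam_t.reshape(1), p_cond, target_pose)

        # Eq. (5): predicted clean frame from the current noisy frame.
        x_hat = ((z - sigma_t * eps_hat) / alpha_t).clamp(-1.0, 1.0)  # assumes data in [-1, 1]

        if t == num_steps - 1:                    # last step: return the noiseless prediction
            return x_hat[0]

        # Simplified (DDIM-style, deterministic) move to the next, less noisy level,
        # used here in place of the stochastic ancestral step of Eq. (6).
        lam_next = lambdas[t + 1]
        z = torch.sigmoid(lam_next).sqrt() * x_hat + torch.sigmoid(-lam_next).sqrt() * eps_hat
```

After the function returns, the caller would append the generated frame (and its pose) to the conditioning set before sampling the next target pose, exactly as described above.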
We can interpret stochastic conditioning as a naïve approximation to true autoregressive sampling that works well in practice. True autoregressive sampling would require a score model of the form $\nabla_{z_{k+1}^{(\lambda)}} \log q\big(z_{k+1}^{(\lambda)} \mid x_1, ..., x_k\big)$, but this would strictly require multi-view training data, while we are
ultimately interested in enabling novel view synthesis with as few as two training views per scene.
Figure 4: X-UNet Architecture – We modify the typical UNet architecture used by recent work on diffusion models to accommodate 3D novel view synthesis. We share the same UNet weights between the two input frames, i.e., the clean conditioning view and the denoising target view. We add cross-attention layers to mix information between the input and output views, illustrated in yellow.
2.3 X-UNET
The 3DiM model needs a neural network architecture that takes both the conditioning frame and
the noisy frame as inputs. One natural way to do this is simply to concatenate the two images
along the channel dimension, and use the standard UNet architecture (Ronneberger et al., 2015; Ho et al., 2020). This “Concat-UNet” has found significant success in prior work on image-to-image
diffusion models (Saharia et al., 2021b;a). However, in our early experiments, we found that the
Concat-UNet yields very poor results – there were severe 3D inconsistencies and lack of alignment
to the conditioning image. We hypothesize that, given limited model capacity and training data, it is
difficult to learn complex, nonlinear image transformations that only rely on self-attention. We thus
introduce our X-UNet, whose core changes are (1) sharing parameters to process each of the two
views, and (2) using cross-attention between the two views. We find our X-UNet architecture to be
very effective for 3D novel view synthesis.
We now describe X-UNet in detail. We follow Ho et al. (2020); Song et al. (2020), and use the
UNet (Ronneberger et al., 2015) with residual blocks and self-attention. We also take inspiration
from Video Diffusion Models (Ho et al., 2022) by sharing weights over the two input frames for all
the convolutional and self-attention layers, but with several key differences:
1. We let each frame have its own noise level (recall that the inputs to a DDPM residual block are
feature maps as well as a positional encoding for the noise level). We use a positional encoding
of λmax for the clean frame. Ho et al. (2022) conversely denoise multiple frames simultaneously,
each at the same noise level.
2. Like Ho et al. (2020), we modulate each UNet block via FiLM (Dumoulin et al., 2018), but we use the sum of pose and noise-level positional encodings, as opposed to the noise-level embedding alone. Our pose encodings additionally differ in that they are of the same dimensionality as the frames: they are camera rays, identical to those used by Sajjadi et al. (2021).
3. Instead of attending over “time” after each self-attention layer like Ho et al. (2022), which in
our case would entail only two attention weights, we define a cross-attention layer and let each
frame’s feature maps call this layer to query the other frame’s feature maps.
For more details on our proposed architecture, we refer the reader to the Supplementary Material (Sec. 6). We also provide a comparison to the “Concat-UNet” architecture in Section 3.2.
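To illustrate changes (1) and (2) above, namely weight sharing across the two frames and cross-attention between them, the following is a small PyTorch sketch of a single shared block. It is not the actual X-UNet block: it omits the FiLM modulation by pose and noise-level encodings described above, and all module names, layer sizes, and normalization choices are assumptions made for the example.

```python
import torch
import torch.nn as nn

class SharedBlockWithCrossAttention(nn.Module):
    """Toy two-frame block: shared convolution per frame + cross-attention between frames.

    Both frames are processed by the *same* convolution (weight sharing), and each
    frame's feature map then queries the other frame's feature map via cross-attention.
    This only illustrates the idea; the real X-UNet block differs in many details.
    """

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.cross_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    @staticmethod
    def _tokens(h: torch.Tensor) -> torch.Tensor:
        # [B, C, H, W] -> [B, H*W, C] token sequence for attention.
        return h.flatten(2).transpose(1, 2)

    def forward(self, h_cond: torch.Tensor, h_target: torch.Tensor):
        # Shared convolutional weights over both frames (change 1).
        h_cond, h_target = self.conv(h_cond), self.conv(h_target)
        b, c, height, width = h_target.shape

        q_t, q_c = self._tokens(h_target), self._tokens(h_cond)
        # Each frame queries the other frame's features (change 2: cross-attention).
        t2c, _ = self.cross_attn(self.norm(q_t), self.norm(q_c), self.norm(q_c))
        c2t, _ = self.cross_attn(self.norm(q_c), self.norm(q_t), self.norm(q_t))

        # Residual connection, then reshape back to [B, C, H, W].
        h_target = h_target + t2c.transpose(1, 2).reshape(b, c, height, width)
        h_cond = h_cond + c2t.transpose(1, 2).reshape(b, c, height, width)
        return h_cond, h_target
```

Sharing `self.conv` and `self.cross_attn` between the two frames mirrors change (1), while letting each frame query the other mirrors change (2).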
3 EXPERIMENTS
We benchmark 3DiMs on the SRN ShapeNet dataset (Sitzmann et al., 2019) to allow comparisons
with prior work on novel view synthesis from a single image. This dataset consists of views and
poses of car and chair ShapeNet (Chang et al., 2015) assets, rendered at 128×128 resolution.