Preprint. Under review.
NOVEL VIEW SYNTHESIS WITH DIFFUSION MODELS
Daniel Watson William Chan Ricardo Martin-Brualla
Jonathan Ho Andrea Tagliasacchi Mohammad Norouzi
Google Research
ABSTRACT
We present 3DiM, a diffusion model for 3D novel view synthesis, which is able
to translate a single input view into consistent and sharp completions across many
views. The core component of 3DiM is a pose-conditional image-to-image dif-
fusion model, which takes a source view and its pose as inputs, and generates a
novel view for a target pose as output. 3DiM can generate multiple views that
are 3D consistent using a novel technique called stochastic conditioning. The out-
put views are generated autoregressively, and during the generation of each novel
view, one selects a random conditioning view from the set of available views at
each denoising step. We demonstrate that stochastic conditioning significantly
improves the 3D consistency of a naïve sampler for an image-to-image diffusion
model, which involves conditioning on a single fixed view. We compare 3DiM
to prior work on the SRN ShapeNet dataset, demonstrating that 3DiM’s gener-
ated completions from a single view achieve much higher fidelity, while being
approximately 3D consistent. We also introduce a new evaluation methodology,
3D consistency scoring, to measure the 3D consistency of a generated object by
training a neural field on the model’s output views. 3DiM is geometry-free, does
not rely on hyper-networks or test-time optimization for novel view synthesis, and
allows a single model to easily scale to a large number of scenes.
Figure 1: Given a single input image on the left, 3DiM performs novel view synthesis and generates the
four views on the right. We trained a single 471M parameter 3DiM on all of ShapeNet (without class-
conditioning) and sample frames with 256 steps (512 score function evaluations with classifier-free guidance).
See the Supplementary Website for video outputs.
1 INTRODUCTION
Diffusion Probabilistic Models (DPMs) (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho
et al., 2020), also known simply as diffusion models, have recently emerged as a powerful family
of generative models, achieving state-of-the-art performance on audio and image synthesis (Chen
et al., 2020; Dhariwal & Nichol, 2021), while admitting better training stability over adversarial
approaches (Goodfellow et al., 2014), as well as likelihood computation, which enables further ap-
plications such as compression and density estimation (Song et al., 2021; Kingma et al., 2021).
Diffusion models have achieved impressive empirical results in a variety of image-to-image translation tasks, including text-to-image generation, super-resolution, inpainting, colorization, uncropping, and artifact removal (Song et al., 2020; Saharia et al., 2021a; Ramesh et al., 2022; Saharia et al., 2022).
One particular image-to-image translation problem where diffusion models have not been investi-
gated is novel view synthesis, where, given a set of images of a given 3D scene, the task is to infer
how the scene looks from novel viewpoints. Before the recent emergence of Scene Representation
Networks (SRN) (Sitzmann et al., 2019) and Neural Radiance Fields (NeRF) (Mildenhall et al.,
2020), state-of-the-art approaches to novel view synthesis were typically built on generative models (Sun et al., 2018) or more classical techniques based on interpolation or disparity estimation (Park et al.,
2017; Zhou et al., 2018). Today, these models have been outperformed by NeRF-class models (Yu
et al., 2021; Niemeyer et al., 2021; Jang & Agapito, 2021), where 3D consistency is guaranteed by
construction, as images are generated by volume rendering of a single underlying 3D representation
(a.k.a. “geometry-aware” models).
Still, these approaches have different limitations. Heavily regularized NeRFs for novel view synthesis from few images, such as RegNeRF (Niemeyer et al., 2021), produce undesired artifacts when given very few images and fail to leverage knowledge from multiple scenes (recall that NeRFs are trained on a single scene, i.e., one model per scene); yet, given one or very few views of a novel scene, a reasonable model must extrapolate to complete the occluded parts of the scene. PixelNeRF (Yu et al., 2021) and VisionNeRF (Lin et al., 2022) address this by training NeRF-like models
conditioned on feature maps that encode the novel input view(s). However, these approaches are
regressive rather than generative, and as a result, they cannot yield different plausible modes and
are prone to blurriness. This type of failure has also been previously observed in regression-based
models (Saharia et al., 2021b). Other works such as CodeNeRF (Jang & Agapito, 2021) and LOLNeRF (Rebain et al., 2021) instead employ test-time optimization to handle novel scenes, but still
have issues with sample quality.
In recent literature, geometry-free approaches (i.e., methods without explicit geometric inductive
biases like those introduced by volume rendering) such as Light Field Networks (LFN) (Sitzmann
et al., 2021) and Scene Representation Transformers (SRT) (Sajjadi et al., 2021) have achieved re-
sults competitive with 3D-aware methods in the “few-shot” setting, where the number of condition-
ing views is limited (i.e., 1-10 images vs. dozens of images as in the usual NeRF setting). Similarly
to our approach, EG3D (Chan et al., 2022) provides approximate 3D consistency by leveraging
generative models. EG3D employs a StyleGAN (Karras et al., 2019) with volumetric rendering,
followed by generative super-resolution (the latter being responsible for the approximation). In
comparison to this complex setup, diffusion not only provides a significantly simpler architecture,
but also a simpler hyper-parameter tuning experience compared to GANs, which are notoriously difficult to tune (Mescheder et al., 2018). Notably, diffusion models have already
seen some success on 3D point-cloud generation (Luo & Hu, 2021; Waibel et al., 2022).
Motivated by these observations and the success of diffusion models in image-to-image tasks, we
introduce 3D Diffusion Models (3DiMs). 3DiMs are image-to-image diffusion models trained on
pairs of images of the same scene, where we assume the poses of the two images are known. Draw-
ing inspiration from Scene Representation Transformers (Sajjadi et al., 2021), 3DiMs are trained to
build a conditional generative model of one view given another view and their poses. Our key dis-
covery is that we can turn this image-to-image model into a model that can produce an entire set of
3D-consistent frames through autoregressive generation, which we enable with our novel stochastic
conditioning sampling algorithm. We cover stochastic conditioning in more detail in Section 2.2 and
provide an illustration in Figure 3. Compared to prior work, 3DiMs are generative (rather than regressive) geometry-free models, allow training to scale to a large number of scenes, and offer a simple end-to-end approach.
We now summarize our core contributions:
1. We introduce 3DiM, a geometry-free image-to-image diffusion model for novel view synthesis.
2. We introduce the stochastic conditioning sampling algorithm, which encourages 3DiM to gener-
ate 3D-consistent outputs.
3. We introduce X-UNet, a new variant of the UNet architecture (Ronneberger et al., 2015) for 3D novel view synthesis, demonstrating that architectural changes are critical for high-fidelity results.
4. We introduce an evaluation scheme for geometry-free view synthesis models, 3D consistency
scoring, that can numerically capture 3D consistency by training neural fields on model outputs.
Figure 2: Pose-conditional image-to-image training – Example training inputs and outputs for
pose-conditional image-to-image diffusion models. Given two frames from a common scene and
their poses (R, t), the task is to undo the noise added to one of the two frames. (*) In practice,
our neural network is trained to predict the Gaussian noise ε used to corrupt the original view; the predicted view is still just a linear combination of the noisy input and the predicted ε.
2 POSE-CONDITIONAL DIFFUSION MODELS
To motivate 3DiMs, let us consider the problem of novel view synthesis given few images from a
probabilistic perspective. Given a complete description of a 3D scene S, for any pose p, the view
x(p) at pose p is fully determined from S, i.e., views are conditionally independent given S. However, we are interested in modeling distributions of the form q(x_1, ..., x_m | x_{m+1}, ..., x_n) without S,
where views are no longer conditionally independent. A concrete example is the following: given
the back of a person’s head, there are multiple plausible views for the front. An image-to-image
model sampling front views given only the back should indeed yield different outputs for each front
view – with no guarantees that they will be consistent with each other – especially if it learns the
data distribution perfectly. Similarly, given a single view of an object that appears small, there is
ambiguity on the pose itself: is it small and close, or simply far away? Thus, given the inherent
ambiguity in the few-shot setting, we need a sampling scheme where generated views can depend
on each other in order to achieve 3D consistency. This contrasts with NeRF approaches, where query rays are conditionally independent given a 3D representation S, an even stronger condition than imposing conditional independence among frames. Such approaches try to learn the richest possible representation for a single scene S, while 3DiM avoids the difficulty of learning a generative model for S altogether.
2.1 IMAGE-TO-IMAGE DIFFUSION MODELS WITH POSE CONDITIONING
Given a data distribution q(x_1, x_2) of pairs of views from a common scene at poses p_1, p_2 ∈ SE(3), we define an isotropic Gaussian process that adds increasing amounts of noise to data samples as the log signal-to-noise ratio λ decreases, following Salimans & Ho (2022):
$$ q\big(z_k^{(\lambda)} \mid x_k\big) := \mathcal{N}\big(z_k^{(\lambda)};\ \sigma(\lambda)^{1/2} x_k,\ \sigma(-\lambda) I\big) \qquad (1) $$
where σ(·) is the sigmoid function. We can apply the reparametrization trick (Kingma & Welling, 2013) and sample from these marginal distributions via
$$ z_k^{(\lambda)} = \sigma(\lambda)^{1/2} x_k + \sigma(-\lambda)^{1/2} \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \qquad (2) $$
Then, given a pair of views, we learn to reverse this process in one of the two frames by minimizing
the objective proposed by Ho et al. (2020), which has been shown to yield much better sample
quality than maximizing the true evidence lower bound (ELBO):
$$ L = \mathbb{E}_{q(x_1, x_2)}\, \mathbb{E}_{\lambda, \epsilon}\, \big\| \epsilon_\theta\big(z_2^{(\lambda)}, x_1, \lambda, p_1, p_2\big) - \epsilon \big\|_2^2 \qquad (3) $$
where ε_θ is a neural network whose task is to denoise the frame z_2^(λ) given a different (clean) frame x_1, and λ is the log signal-to-noise ratio. To make our notation more legible, we slightly abuse notation and from now on will simply write ε_θ(z_2^(λ), x_1).
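To make the objective concrete, here is a minimal PyTorch-style sketch of one training step implementing Eqs. (1)-(3). It is illustrative only: the callable `eps_model`, its argument order, the uniform sampling range for λ, and the tensor shapes are assumptions made for the example rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def training_step(eps_model, x1, x2, p1, p2, lambda_min=-20.0, lambda_max=20.0):
    """One pose-conditional denoising training step (sketch of Eqs. (1)-(3)).

    eps_model: hypothetical callable predicting the noise added to the target view,
               given (noisy target, clean source, log-SNR, source pose, target pose).
    x1, x2:    clean source and target views, shape [B, C, H, W].
    p1, p2:    pose conditioning tensors (e.g. camera rays), treated as opaque here.
    """
    b = x2.shape[0]
    # Sample a log signal-to-noise ratio per example (uniform here only for simplicity;
    # the paper's actual noise schedule may differ).
    lam = torch.empty(b, device=x2.device).uniform_(lambda_min, lambda_max)
    lam_ = lam.view(b, 1, 1, 1)

    # Forward process, Eq. (2): z2 = sigmoid(lam)^0.5 * x2 + sigmoid(-lam)^0.5 * eps.
    eps = torch.randn_like(x2)
    z2 = torch.sigmoid(lam_).sqrt() * x2 + torch.sigmoid(-lam_).sqrt() * eps

    # Eq. (3): epsilon-prediction loss, conditioning on the clean view x1 and both poses.
    eps_pred = eps_model(z2, x1, lam, p1, p2)
    return F.mse_loss(eps_pred, eps)
```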
Figure 3: Stochastic conditioning sampler – There are two main components to our sampling pro-
cedure: 1) the autoregressive generation of multiple frames, and 2) the denoising process to generate
each frame. When generating a new frame, we randomly select a previous frame as the conditioning
frame at each denoising step. We omit the pose inputs in the diagram to avoid overloading the figure,
but they should be understood to be recomputed at each step, depending on the conditioning view
we randomly sample.
2.2 3D CONSISTENCY VIA STOCHASTIC CONDITIONING
Motivation. We begin this section by motivating the need for our stochastic conditioning sampler.
In the ideal situation, we would model our 3D scene frames using the chain rule decomposition:
$$ p(x) = \prod_i p(x_i \mid x_{<i}) \qquad (4) $$
This factorization is ideal, as it models the distribution exactly without making any conditional in-
dependence assumptions. Each frame is generated autoregressively, conditioned on all the previous
frames. However, we found this solution to perform poorly. Due to memory limitations, we can
only condition on a limited number of frames in practice (i.e., a k-Markovian model). We also find that sample quality degrades as we increase the maximum number of input frames k. In order to achieve the best possible sample quality, we thus opt for the bare minimum of k = 2 (i.e., an image-to-image model). Our key discovery is that, with k = 2, we can still achieve approximate
3D consistency. Instead of using a sampler that is Markovian over frames, we leverage the iterative
nature of diffusion sampling by varying the conditioning frame at each denoising step.
Stochastic Conditioning. We now detail our novel stochastic conditioning sampling procedure that
allows us to generate 3D-consistent samples from a 3DiM. We start with a set of conditioning views
X = {x_1, ..., x_k} of a static scene, where typically k = 1 or k is very small. We then generate a new frame by running a modified version of the standard denoising diffusion reverse process for steps λ_min = λ_T < λ_{T−1} < ... < λ_0 = λ_max:
$$ \hat{x}_{k+1} = \frac{1}{\sigma(\lambda_t)^{1/2}} \Big( z_{k+1}^{(\lambda_t)} - \sigma(-\lambda_t)^{1/2}\, \epsilon_\theta\big(z_{k+1}^{(\lambda_t)}, x_i\big) \Big) \qquad (5) $$
$$ z_{k+1}^{(\lambda_{t-1})} \sim q\big(z_{k+1}^{(\lambda_{t-1})} \mid z_{k+1}^{(\lambda_t)}, \hat{x}_{k+1}\big) \qquad (6) $$
where, crucially, i ∼ Uniform({1, ..., k}) is re-sampled at each denoising step. In other words, each individual denoising step is conditioned on a different random view from X (the set that contains the input view(s) and the previously generated samples). Once we finish running this sampling chain and produce a final x_{k+1}, we simply add it to X and repeat this procedure if we want to sample
more frames. Given sufficient denoising steps, stochastic conditioning allows each generated frame
to be guided by all previous frames. See Figure 3 for an illustration. In practice, we use 256
denoising steps, which we find to be sufficient to achieve both high sample quality and approximate
3D consistency. As usual in the literature, the first (noisiest) sample is just a Gaussian, i.e., z_{k+1}^(λ_T) ∼ N(0, I), and at the last step λ_0, we sample noiselessly.
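The sketch below illustrates this sampling loop with the same hypothetical `eps_model` as before. Note one simplification: instead of the stochastic ancestral step of Eq. (6), it uses a deterministic DDIM-style update to move to the next noise level, so it should be read as a simplified variant of the sampler rather than the exact procedure.

```python
import torch

@torch.no_grad()
def stochastic_conditioning_sample(eps_model, views, poses, target_pose,
                                   num_steps=256, lambda_min=-20.0, lambda_max=20.0):
    """Generate one novel view at target_pose, re-drawing the conditioning view
    from `views` at every denoising step (sketch of the sampler in Sec. 2.2).

    views: list of clean frames, each [C, H, W] (input view(s) plus frames
           generated so far); poses: list of their poses.
    """
    device = views[0].device
    # Noise schedule lambda_T = lambda_min (noisiest) ... lambda_0 = lambda_max.
    lambdas = torch.linspace(lambda_min, lambda_max, num_steps, device=device)
    z = torch.randn_like(views[0])[None]          # z^(lambda_T) ~ N(0, I)

    for t in range(num_steps):
        lam_t = lambdas[t]
        # Stochastic conditioning: pick a random conditioning view at every step.
        i = torch.randint(len(views), (1,)).item()
        x_cond, p_cond = views[i][None], poses[i]

        alpha_t = torch.sigmoid(lam_t).sqrt()
        sigma_t = torch.sigmoid(-lam_t).sqrt()
        eps_hat = eps_model(z, x_cond, lam_t.reshape(1), p_cond, target_pose)

        # Eq. (5): predicted clean frame from the current noisy frame.
        x_hat = ((z - sigma_t * eps_hat) / alpha_t).clamp(-1.0, 1.0)  # assumes data in [-1, 1]

        if t == num_steps - 1:                    # last step: return the noiseless prediction
            return x_hat[0]

        # Simplified (DDIM-style, deterministic) move to the next, less noisy level,
        # used here in place of the stochastic ancestral step of Eq. (6).
        lam_next = lambdas[t + 1]
        z = torch.sigmoid(lam_next).sqrt() * x_hat + torch.sigmoid(-lam_next).sqrt() * eps_hat
```

After the function returns, the caller would append the generated frame (and its pose) to the conditioning set before sampling the next target pose, exactly as described above.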
We can interpret stochastic conditioning as a naïve approximation to true autoregressive sampling that works well in practice. True autoregressive sampling would require a score model of the form $\nabla_{z_{k+1}^{(\lambda)}} \log q\big(z_{k+1}^{(\lambda)} \mid x_1, ..., x_k\big)$, but this would strictly require multi-view training data, while we are
ultimately interested in enabling novel view synthesis with as few as two training views per scene.
Figure 4: X-UNet Architecture – We modify the typical UNet architecture used by recent work on diffusion models to accommodate 3D novel view synthesis. We share the same UNet weights between the two input frames, i.e., the clean conditioning view and the denoising target view. We add cross-attention layers to mix information between the input and output views, illustrated in yellow.
2.3 X-UNET
The 3DiM model needs a neural network architecture that takes both the conditioning frame and
the noisy frame as inputs. One natural way to do this is simply to concatenate the two images
along the channel dimension, and use the standard UNet architecture (Ronneberger et al., 2015; Ho et al., 2020). This “Concat-UNet” has found significant success in prior work on image-to-image
diffusion models (Saharia et al., 2021b;a). However, in our early experiments, we found that the
Concat-UNet yields very poor results – there were severe 3D inconsistencies and lack of alignment
to the conditioning image. We hypothesize that, given limited model capacity and training data, it is
difficult to learn complex, nonlinear image transformations that only rely on self-attention. We thus
introduce our X-UNet, whose core changes are (1) sharing parameters to process each of the two
views, and (2) using cross-attention between the two views. We find our X-UNet architecture to be
very effective for 3D novel view synthesis.
We now describe X-UNet in detail. We follow Ho et al. (2020); Song et al. (2020), and use the
UNet (Ronneberger et al., 2015) with residual blocks and self-attention. We also take inspiration
from Video Diffusion Models (Ho et al., 2022) by sharing weights over the two input frames for all
the convolutional and self-attention layers, but with several key differences:
1. We let each frame have its own noise level (recall that the inputs to a DDPM residual block are
feature maps as well as a positional encoding for the noise level). We use a positional encoding
of λmax for the clean frame. Ho et al. (2022) conversely denoise multiple frames simultaneously,
each at the same noise level.
2. Like Ho et al. (2020), we modulate each UNet block via FiLM (Dumoulin et al., 2018), but we use the sum of pose and noise-level positional encodings, as opposed to the noise-level embedding alone. Our pose encodings additionally differ in that they are of the same dimensionality as the frames: they are camera rays, identical to those used by Sajjadi et al. (2021).
3. Instead of attending over “time” after each self-attention layer like Ho et al. (2022), which in
our case would entail only two attention weights, we define a cross-attention layer and let each
frame’s feature maps call this layer to query the other frame’s feature maps.
For more details on our proposed architecture, we refer the reader to the Supplementary Material (Sec. 6). We also provide a comparison to the “Concat-UNet” architecture in Section 3.2.
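To illustrate changes (1) and (2) above, namely weight sharing across the two frames and cross-attention between them, the following is a small PyTorch sketch of a single shared block. It is not the actual X-UNet block: it omits the FiLM modulation by pose and noise-level encodings described above, and all module names, layer sizes, and normalization choices are assumptions made for the example.

```python
import torch
import torch.nn as nn

class SharedBlockWithCrossAttention(nn.Module):
    """Toy two-frame block: shared convolution per frame + cross-attention between frames.

    Both frames are processed by the *same* convolution (weight sharing), and each
    frame's feature map then queries the other frame's feature map via cross-attention.
    This only illustrates the idea; the real X-UNet block differs in many details.
    """

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.cross_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    @staticmethod
    def _tokens(h: torch.Tensor) -> torch.Tensor:
        # [B, C, H, W] -> [B, H*W, C] token sequence for attention.
        return h.flatten(2).transpose(1, 2)

    def forward(self, h_cond: torch.Tensor, h_target: torch.Tensor):
        # Shared convolutional weights over both frames (change 1).
        h_cond, h_target = self.conv(h_cond), self.conv(h_target)
        b, c, height, width = h_target.shape

        q_t, q_c = self._tokens(h_target), self._tokens(h_cond)
        # Each frame queries the other frame's features (change 2: cross-attention).
        t2c, _ = self.cross_attn(self.norm(q_t), self.norm(q_c), self.norm(q_c))
        c2t, _ = self.cross_attn(self.norm(q_c), self.norm(q_t), self.norm(q_t))

        # Residual connection, then reshape back to [B, C, H, W].
        h_target = h_target + t2c.transpose(1, 2).reshape(b, c, height, width)
        h_cond = h_cond + c2t.transpose(1, 2).reshape(b, c, height, width)
        return h_cond, h_target
```

Sharing `self.conv` and `self.cross_attn` between the two frames mirrors change (1), while letting each frame query the other mirrors change (2).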
3 EXPERIMENTS
We benchmark 3DiMs on the SRN ShapeNet dataset (Sitzmann et al., 2019) to allow comparisons
with prior work on novel view synthesis from a single image. This dataset consists of views and
poses of car and chair ShapeNet (Chang et al., 2015) assets, rendered at 128×128 resolution.