
how the scene looks from novel viewpoints. Before the recent emergence of Scene Representation
Networks (SRN) (Sitzmann et al., 2019) and Neural Radiance Fields (NeRF) (Mildenhall et al.,
2020), state-of-the-art approaches to novel view synthesis were typically built on generative mod-
els (Sun et al., 2018) or more classical techniques based on interpolation or disparity estimation (Park et al.,
2017; Zhou et al., 2018). Today, these models have been outperformed by NeRF-class models (Yu
et al., 2021; Niemeyer et al., 2021; Jang & Agapito, 2021), where 3D consistency is guaranteed by
construction, as images are generated by volume rendering of a single underlying 3D representation
(a.k.a. “geometry-aware” models).
Still, these approaches suffer from different limitations. Heavily regularized NeRFs for novel view
synthesis from few images, such as RegNeRF (Niemeyer et al., 2021), produce undesired artifacts
when given very few images and cannot leverage knowledge from multiple scenes (recall that NeRFs
are trained on a single scene, i.e., one model per scene); yet, given one or very few views of a novel
scene, a reasonable model must extrapolate to complete the occluded parts of the scene. PixelNeRF
(Yu et al., 2021) and VisionNeRF (Lin et al., 2022) address this by training NeRF-like models
conditioned on feature maps that encode the novel input view(s). However, these approaches are
regressive rather than generative, and as a result, they cannot yield different plausible modes and
are prone to blurriness. This type of failure has been observed previously in regression-based
models (Saharia et al., 2021b). Other works such as CodeNeRF (Jang & Agapito, 2021) and LoL-
NeRF (Rebain et al., 2021) instead employ test-time optimization to handle novel scenes, but still
have issues with sample quality.
In recent literature, geometry-free approaches (i.e., methods without explicit geometric inductive
biases like those introduced by volume rendering) such as Light Field Networks (LFN) (Sitzmann
et al., 2021) and Scene Representation Transformers (SRT) (Sajjadi et al., 2021) have achieved re-
sults competitive with 3D-aware methods in the “few-shot” setting, where the number of condition-
ing views is limited (i.e., 1-10 images vs. dozens of images as in the usual NeRF setting). Similarly
to our approach, EG3D (Chan et al., 2022) provides approximate 3D consistency by leveraging
generative models. EG3D employs a StyleGAN (Karras et al., 2019) with volumetric rendering,
followed by generative super-resolution (the latter being responsible for the approximation). In
comparison to this complex setup, diffusion not only provides a significantly simpler architecture,
but also a simpler hyper-parameter tuning experience than GANs, which are notoriously
difficult to tune (Mescheder et al., 2018). Notably, diffusion models have already
seen some success on 3D point-cloud generation (Luo & Hu, 2021; Waibel et al., 2022).
Motivated by these observations and the success of diffusion models in image-to-image tasks, we
introduce 3D Diffusion Models (3DiMs). 3DiMs are image-to-image diffusion models trained on
pairs of images of the same scene, where we assume the poses of the two images are known. Draw-
ing inspiration from Scene Representation Transformers (Sajjadi et al., 2021), 3DiMs are trained to
build a conditional generative model of one view given another view and their poses. Our key dis-
covery is that we can turn this image-to-image model into a model that can produce an entire set of
3D-consistent frames through autoregressive generation, which we enable with our novel stochastic
conditioning sampling algorithm. We cover stochastic conditioning in more detail in Section 2.2 and
provide an illustration in Figure 3. Compared to prior work, 3DiMs are generative (vs. regressive)
geometry-free models; they allow training to scale to a large number of scenes and offer a simple
end-to-end approach.
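To make the sampling idea above more concrete, below is a minimal Python sketch of autoregressive generation in which the conditioning view is re-drawn from the set of input and previously generated frames at each denoising step. The model interface (sample_prior_noise, denoise_step) and the step count are hypothetical placeholders; the actual stochastic conditioning algorithm is the one specified in Section 2.2.

import random

def sample_novel_views(model, input_views, input_poses, target_poses, num_steps=256):
    # Hypothetical sketch of stochastic conditioning; names and signatures are
    # illustrative only, not the paper's actual interface (see Section 2.2).
    frames, poses = list(input_views), list(input_poses)
    for target_pose in target_poses:
        x = model.sample_prior_noise()  # start the target frame from pure noise
        for t in reversed(range(num_steps)):
            # Re-draw the conditioning frame uniformly at random at every denoising step.
            k = random.randrange(len(frames))
            x = model.denoise_step(x, t, cond_frame=frames[k],
                                   cond_pose=poses[k], target_pose=target_pose)
        frames.append(x)            # the finished frame joins the conditioning set
        poses.append(target_pose)
    return frames[len(input_views):]

Because each completed frame is appended to the conditioning set, later frames can stay consistent with the frames already produced, which is what enables generating an entire set of 3D-consistent views from a single image-to-image model.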
We now summarize our core contributions:
1. We introduce 3DiM, a geometry-free image-to-image diffusion model for novel view synthesis.
2. We introduce the stochastic conditioning sampling algorithm, which encourages 3DiM to gener-
ate 3D-consistent outputs.
3. We introduce X-UNet, a new variant of the UNet architecture (Ronneberger et al., 2015) for 3D
novel view synthesis, demonstrating that architectural changes are critical for high-fidelity results.
4. We introduce an evaluation scheme for geometry-free view synthesis models, 3D consistency
scoring, which numerically captures 3D consistency by training neural fields on model outputs (see
the sketch after this list).
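As a rough illustration of how such a score could be computed, the sketch below fits a neural field to a subset of a model's generated views for a scene and measures how well it explains the held-out generated views. train_neural_field and field.render are placeholders rather than an actual implementation, and held-out PSNR is only one possible choice of metric.

import numpy as np

def consistency_score(generated_views, generated_poses, train_fraction=0.8):
    # Placeholder sketch: a higher held-out PSNR suggests the generated views
    # are more mutually 3D-consistent. train_neural_field / render are hypothetical.
    n_train = int(train_fraction * len(generated_views))
    field = train_neural_field(generated_views[:n_train], generated_poses[:n_train])
    psnrs = []
    for view, pose in zip(generated_views[n_train:], generated_poses[n_train:]):
        rendered = field.render(pose)  # re-render the held-out pose from the fitted field
        mse = np.mean((np.asarray(rendered) - np.asarray(view)) ** 2)
        psnrs.append(-10.0 * np.log10(mse))
    return float(np.mean(psnrs))

The intuition is that if the generated views do not describe a single coherent 3D scene, no neural field can fit them all, and the held-out error will be large.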