A Survey on Deep Generative 3D-aware Image Synthesis
WEIHAO XIA and JING-HAO XUE, University College London, UK
Recent years have seen remarkable progress in deep learning powered visual content creation. This includes
deep generative 3D-aware image synthesis, which produces high-fidelity images in a 3D-consistent manner
while simultaneously capturing compact surfaces of objects from pure image collections, without the need
for any 3D supervision, thus bridging the gap between 2D imagery and 3D reality. The field of computer
vision has recently been captivated by the task of deep generative 3D-aware image synthesis, with hundreds
of papers appearing in top-tier journals and conferences over the past few years (mainly the past two years),
yet a comprehensive survey of this remarkable and swift progress is still lacking. Our survey aims to introduce
new researchers to this topic, provide a useful reference for related works, and stimulate future research
directions through our discussion section. Apart from the presented papers, we aim to constantly update the
latest relevant papers along with their corresponding implementations at https://weihaox.github.io/3D-aware-Gen.
CCS Concepts: • General and reference → Surveys and overviews; • Computing methodologies → Machine learning; Computer vision; Image manipulation.
Additional Key Words and Phrases: 3D-aware image synthesis, deep generative models, implicit neural
representation, generative adversarial network, diffusion probabilistic models
ACM Reference Format:
Weihao Xia and Jing-Hao Xue. 2023. A Survey on Deep Generative 3D-aware Image Synthesis. ACM Comput.
Surv. 1, 1, Article 1 (January 2023), 34 pages. https://doi.org/10.1145/3626193
1 INTRODUCTION
A tremendous amount of progress has been made in deep neural networks, leading to photorealistic
image synthesis. Despite achieving compelling results, most approaches focus on two-dimensional
(2D) images, overlooking the three-dimensional (3D) nature of the physical world. The lack of 3D
structure therefore inevitably limits some of their practical applications. Recent studies have thus
proposed generative models that are 3D-aware. That is, they incorporate 3D information into the
generative models to enhance control (especially in terms of multi-view consistency) over the generated
images. The examples depicted in Fig. 1 elucidate that the objective is to produce high-quality renderings
which maintain consistency across various views. In contrast to 2D generative models, the
recently developed 3D-aware generative models [13, 33] bridge the gap between 2D images and the 3D
physical world. The physical world surrounding us is intrinsically three-dimensional, and images
depict reality under certain conditions of geometry, material, and illumination, making it natural
to model the image generation process in 3D space. As shown in Fig. 2, classical rendering (a)
renders images at certain camera positions given human-designed or scanned 3D shape models;
inverse rendering (b) recovers the underlying intrinsic properties of the 3D physical world from
2D images; 2D image generation (c) is mostly driven by generative models, which have achieved
impressive results in photorealistic image synthesis; and 3D-aware generative models (d) offer the
Authors’ address: Weihao Xia, weihao.xia.21@ucl.ac.uk; Jing-Hao Xue, jinghao.xue@ucl.ac.uk, Department of Statistical
Science, University College London, London, UK.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses,
contact the owner/author(s).
© 2023 Copyright held by the owner/author(s).
0360-0300/2023/1-ART1
https://doi.org/10.1145/3626193
Fig. 1. The illustrative examples of 3D-aware image synthesis demonstrate the objective of this task, which is
to generate high-quality renderings that are consistent across multiple views (top) and potentially provide
detailed geometry (bottom). Typically, deep generative 3D-aware image synthesis methods are trained using
a collection of 2D images, without depending on target-specific shape priors, ground-truth 3D scans, or
multi-view supervision. Examples are sourced from [12].
possibility of replacing the classical rendering pipeline with effective and efficient models learned
directly from images.
Despite the striking progress made recently in research on deep generative 3D-aware
image synthesis, a timely and systematic review of this progress is lacking. In this work, we fill the
gap by presenting a comprehensive survey of the latest research in deep generative 3D-aware
image synthesis methods. We envision that our work will elucidate design considerations and
advanced methods for deep generative 3D-aware image synthesis, present the advantages and
disadvantages of its different kinds, and suggest future research directions. Fig. 3 provides a structured
outline and taxonomy of this survey. Fig. 4 is a chronological overview of representative deep
generative 3D-aware image synthesis methods. We propose to categorize the deep generative
3D-aware image synthesis methods into two primary categories: 3D control of 2D generative
models (Sec. 4) and 3D-aware generative models from single image collections (Sec. 5). Then, every
category is further divided into subcategories depending on the experimental setting or the
specific utilization of 3D information. In particular, 3D control of 2D generative models is further
divided into 1) exploring 3D control in 2D latent spaces (Sec. 4.1), 2) 3D parameters as controls
(Sec. 4.2), and 3) 3D priors as constraints (Sec. 4.3). Sec. 5 summarizes methods that target generating
photorealistic and multi-view-consistent images by learning 3D representations from single-view
image collections. Broadly speaking, this category leverages neural 3D representations to represent
scenes, uses differentiable neural renderers to render them into the image plane, and optimizes the
network parameters by minimizing the difference between rendered images and observed images.

Here, we present a timely, up-to-date overview of the growing field of deep generative 3D-aware
image synthesis. Considering the lack of a comprehensive survey and the increasing interest
and popularity, we believe it necessary to organize one to help computer vision practitioners
with this emerging topic. The purpose of this survey is to provide researchers new to the field
with a comprehensive understanding of deep generative 3D-aware image synthesis methods
and to show their superior performance over status quo approaches. To conclude, we highlight
several open research directions and problems that need further investigation. The scope of this
fast-expanding field is rather extensive and a panoramic review would be challenging. We shall
present only representative methods of deep generative 3D-aware image synthesis rather than
exhaustively listing all the literature. This review can therefore serve as a pedagogical tool, providing
researchers with the key information about typical methods of deep generative 3D-aware image
synthesis. Researchers can use these general guidelines to develop the most appropriate technique
for their own particular study. The main technical contributions of this work are as follows:
[Fig. 2: schematic diagrams comparing classical rendering (a designed or scanned 3D shape is rendered into an image), inverse rendering (an image is decomposed into its 3D properties), 2D generative models ($z$ is mapped to an image), 3D generative models ($z$ and a camera pose $\theta$ are mapped to an image), and 2D generative models with 3D priors as constraints or 3D parameters as controls.]
Fig. 2. Comparison of (a) rendering, (b) inverse rendering, (c) 2D generative models, and (d) 3D generative
models. 3D generative models learn 3D representations first and then render a 2D image at certain viewpoints.
Both 2D and 3D generative models have unconditional and conditional settings. An unconditional generative
model maps a noise input $z$ (and a camera pose in 3D models) to a fake image; a conditional model takes
additional inputs as control signals, which could be an image, text, or a categorical label.
• Systematic taxonomy. We propose a hierarchical taxonomy for deep generative 3D-aware
image synthesis research. We categorize existing models into two main categories: 3D control
of 2D generative models and 3D-aware generative models from image collections.
• Comprehensive review. We provide a comprehensive overview of the existing state-of-the-art
deep generative 3D-aware image synthesis methods. We compare and analyze the main
characteristics and improvements of each type, assessing their strengths and weaknesses.
• Outstanding challenges. We present open research problems and provide some suggestions
for the future development of deep generative 3D-aware image synthesis.
In an attempt to continuously track recent developments in this fast-advancing field, we
provide an accompanying webpage which catalogs papers addressing deep generative 3D-
aware image synthesis, according to our problem-based taxonomy: https://weihaox.github.
io/3D-aware-Gen.
2 BACKGROUND
This section introduces a few important concepts as background. In order to formulate deep
generative 3D-aware image synthesis, we first clarify how 2D and 3D data are expressed, and
how they are converted from one to the other. Moreover, we introduce two key elements involved
in most deep generative 3D-aware image synthesis methods: implicit neural representations and
differentiable neural rendering.
2.1 2D and 3D Data, Rendering and Inverse Rendering
2D images depict a glimpse of the surrounding 3D physical world, with its geometry, materials,
and illumination conditions at that moment. Images are composed of an array of pixels (picture
elements). The 3D reality can be represented in many different ways, each with its own advantages
and disadvantages. Examples of such 3D shape representations include
depth images, point clouds, voxel grids, and meshes. Depth images contain the distance
between the object and the camera at every pixel. The distance encodes 3D geometry information
from a fixed point of view. Layered depth images (LDIs) use several layers of depth maps and their
associated color values to depict a scene. Point clouds comprise vertices in 3D space, represented
by coordinates along the x, y, and z axes. These types of data can be acquired by 3D scanners, such
as LiDARs or RGB-D sensors, from one or more viewpoints. Voxel grids describe a scene or object
using a regular grid in 3D space. A voxel (volume element) in 3D space is analogous to a pixel in
a 2D image. A voxel grid can be created from a point cloud by voxelization, which groups all 3D
points within each voxel. Meshes are a collection of vertices, edges, and polygonal faces. In contrast
to a point cloud, which only provides vertex locations, a mesh also provides the surface information
of an object. Nevertheless, deep learning does not provide a straightforward way to process surface
information. Instead, many techniques resort to sampling points from the surfaces to create a point
cloud from the mesh representation.
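The voxelization step mentioned above can be illustrated with a short NumPy sketch. This is a minimal example under simple assumptions (a binary occupancy grid over an axis-aligned bounding box); the function and variable names are ours for illustration only:

```python
import numpy as np

def voxelize(points: np.ndarray, grid_resolution: int = 32) -> np.ndarray:
    """Convert an (N, 3) point cloud into a binary occupancy voxel grid.

    A voxel is marked occupied if at least one point falls inside it.
    """
    # Normalize points into the unit cube [0, 1)^3.
    mins, maxs = points.min(axis=0), points.max(axis=0)
    normalized = (points - mins) / (maxs - mins + 1e-8)

    # Map each point to integer voxel indices.
    indices = np.clip((normalized * grid_resolution).astype(int),
                      0, grid_resolution - 1)

    grid = np.zeros((grid_resolution,) * 3, dtype=bool)
    grid[indices[:, 0], indices[:, 1], indices[:, 2]] = True
    return grid

# Example: voxelize a random point cloud.
cloud = np.random.rand(1000, 3)
occupancy = voxelize(cloud, grid_resolution=16)
print(occupancy.sum(), "occupied voxels out of", occupancy.size)
```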
[Fig. 3 taxonomy tree, reconstructed as an outline:]
3D Control of 2D Generative Models (Sec. 4)
  - 3D Control in 2D Latent Spaces (Sec. 4.1): 3D Control Latent Directions (Sec. 4.1.1); Pinpointing Predetermined Targets (Sec. 4.1.2)
  - 3D Parameters as Controls (Sec. 4.2): Factors from Pretrained Models (Sec. 4.2.1); Explicit Control over 3D Parameters (Sec. 4.2.2)
  - 3D Priors as Constraints (Sec. 4.3): 3D Prior Knowledge (Sec. 4.3.1); 3D Components into 2D GANs (Sec. 4.3.2)
3D-aware Generative Models (Sec. 5)
  - Unconditional 3D-aware GANs (Sec. 5.1): Efficient and Effective Representations (Sec. 5.1.1); Efficient and Consistent Rendering (Sec. 5.1.2); Broadening Applicable Scenarios (Sec. 5.1.3); User-interactive Editing (Sec. 5.1.4)
  - Unconditional 3D-aware Diffusion Models (Sec. 5.2)
  - Conditional 3D-aware Generative Models (Sec. 5.3): Image; Semantic Label; Textual Description
Fig. 3. A systematic taxonomy proposed in this survey of deep generative 3D-aware image synthesis methods.
The dashed borders at the third level denote preliminaries, applications, or issues discussed in this subcategory.
It should be noted that these methods are not mutually exclusive. For example, a few methods introduce 3D
parameters to improve controllability (Sec. 4.2) while also implementing 3D constraints to improve consistency
across multiple views (Sec. 4.3); EG3D [12] is referenced in Sec. 5.1.1 and Sec. 5.1.2 for its approach to 3D-aware
representations and rendering algorithms.
As shown in Fig. 2(a), images can be obtained by rendering a 3D object or scene under certain
viewpoints and lighting conditions. This forward process is called rendering (image synthesis).
Rendering has been studied in computer graphics and a wide variety of renderers are available for
use. The reverse process, inverse rendering, as shown in Fig. 2(b), is to infer the underlying intrinsic
components of a scene from rendered 2D images. These properties include shape (surface, depth,
normal), material (albedo, reflectivity, shininess), and lighting (direction, intensity), which can be
further used to render photorealistic images. Inverse rendering papers are not classified as
3D-aware image synthesis methods in this survey, as they are not deliberately designed for this
purpose. The 3D-aware image synthesis methods in this survey include a similar inverse rendering process and a
rendering process. In contrast to classical inverse rendering, however, these methods typically do not produce explicit 3D representations
such as meshes or voxels for rendering. They learn neural 3D representations (mostly implicit
functions), render them into images with differentiable neural rendering techniques, and optimize
the network parameters by minimizing the difference between the observed and rendered images.
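This learn-render-compare loop can be summarized with a brief PyTorch-style sketch. It is a generic illustration rather than any specific method; `scene_model` and `differentiable_render` are hypothetical placeholders for a neural scene representation and a differentiable renderer:

```python
import torch

def train_step(scene_model, differentiable_render, images, cameras, optimizer):
    """One optimization step of analysis-by-synthesis.

    scene_model: a neural implicit scene representation (e.g., an MLP).
    differentiable_render: renders the scene from a given camera pose.
    images, cameras: observed 2D images and their camera poses.
    """
    optimizer.zero_grad()
    loss = 0.0
    for image, camera in zip(images, cameras):
        rendered = differentiable_render(scene_model, camera)   # (H, W, 3)
        loss = loss + torch.mean((rendered - image) ** 2)       # photometric loss
    loss.backward()   # gradients flow through the renderer into the scene model
    optimizer.step()
    return loss.item()
```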
2.2 Implicit Scene Representations
In computer vision and computer graphics, 3D shapes are traditionally represented by explicit
representations like depth maps, voxels, point clouds, or meshes. Recent methods propose to represent
3D scenes with neural implicit functions, such as occupancy fields [67], signed distance fields [83],
and radiance fields [68]. The implicit neural representation (INR, also called neural fields or coordinate-based
representations) is a novel way to parameterize signals across a wide range of domains. Taking
images as an example, an INR parameterizes an image as a continuous function that maps pixel
coordinates to RGB colors. The implicit functions are often not analytically tractable and are hence
approximated by neural networks. Here are some popular examples of INRs.
Neural Occupancy Field [67, 78, 84] implicitly represents a 3D surface with the continuous
decision boundary of a neural classifier. This function, approximated with a neural network, assigns
to every location $p \in \mathbb{R}^3$ an occupancy probability between 0 and 1. Given an observation (e.g., an
image or a point cloud) $x \in \mathcal{X}$ and a location $p \in \mathbb{R}^3$, the representation can be simply parameterized
by a neural network $f_\theta$ that takes the pair $(p, x)$ as input and outputs a real number which represents
the probability of occupancy: $f_\theta : \mathbb{R}^3 \times \mathcal{X} \rightarrow [0, 1]$.
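As a minimal illustration of such a classifier, the following PyTorch sketch (an assumed toy architecture, not the network of [67]) maps 3D query points and an observation embedding to occupancy probabilities:

```python
import torch
import torch.nn as nn

class OccupancyNetwork(nn.Module):
    """f_theta : R^3 x X -> [0, 1], a toy occupancy field conditioned on an
    observation embedding (e.g., produced by an image or point-cloud encoder)."""

    def __init__(self, obs_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, points: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) query locations; obs: (B, obs_dim) observation embedding.
        obs = obs.unsqueeze(1).expand(-1, points.shape[1], -1)
        logits = self.mlp(torch.cat([points, obs], dim=-1))
        return torch.sigmoid(logits).squeeze(-1)   # occupancy probability in [0, 1]

# Example: query 1024 random locations for a batch of two observations.
net = OccupancyNetwork()
p = torch.rand(2, 1024, 3)
z = torch.randn(2, 256)
occ = net(p, z)   # (2, 1024) probabilities
```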
Neural Signed Distance Field [83] is a continuous function that models the distance from a
queried location to the nearest point on a shape's surface, whose sign indicates whether this location
is inside (negative) or outside (positive): $SDF(\boldsymbol{x}) = s$, with $\boldsymbol{x} \in \mathbb{R}^3$ and $s \in \mathbb{R}$. The underlying surface is
implicitly described as the zero iso-surface decision boundary of a feed-forward network, $SDF(\cdot) = 0$.
This implicit surface can be rendered by raycasting or by rasterizing a mesh obtained through
marching cubes [63].
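As an example of the mesh-extraction route, the sketch below samples an SDF on a regular grid and extracts its zero level set with scikit-image's marching_cubes; the analytic `sphere_sdf` is a stand-in for a trained network:

```python
import numpy as np
from skimage.measure import marching_cubes

def sphere_sdf(points: np.ndarray, radius: float = 0.5) -> np.ndarray:
    """Analytic signed distance to a sphere; a trained network would go here."""
    return np.linalg.norm(points, axis=-1) - radius

# Sample the SDF on a dense 64^3 grid covering [-1, 1]^3.
n = 64
coords = np.linspace(-1.0, 1.0, n)
grid = np.stack(np.meshgrid(coords, coords, coords, indexing="ij"), axis=-1)
sdf_values = sphere_sdf(grid.reshape(-1, 3)).reshape(n, n, n)

# Extract the zero iso-surface SDF(x) = 0 as a triangle mesh.
verts, faces, normals, _ = marching_cubes(sdf_values, level=0.0,
                                          spacing=(2.0 / (n - 1),) * 3)
print(verts.shape, faces.shape)   # mesh vertices and triangle indices
```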
Neural Radiance Field (NeRF) [68] has attracted growing attention due to its compelling
results in novel view synthesis on complex scenes. It leverages an MLP network to approximate the
radiance field of a static 3D scene and uses the classic volume rendering technique [43] to estimate
the color of each pixel. This function takes as input a 3D location $\boldsymbol{x}$ and a 2D viewing direction $\boldsymbol{d}$, and
outputs a view-dependent emitted RGB color $\boldsymbol{c}$ and a volume density $\sigma$: $f_\theta : (\boldsymbol{x}, \boldsymbol{d}) \rightarrow (\boldsymbol{c}, \sigma)$. It captures
3D geometric details from pure 2D supervision by learning to reconstruct the given views.
There also exist many other implicit functions proposed to represent a scene, such as neural
sparse voxel fields [58] or neural volumes [62].
2.3 Dierentiable Neural Rendering
3D rendering is a function that outputs a 2D image from a 3D scene. Differentiable rendering
provides a differentiable rendering function, that is, it computes the derivatives of that function
with respect to the different parameters of the scene. Once a renderer is differentiable, it can be integrated
into the optimization of neural networks. One use case for differentiable rendering is to compute a
loss in rendered image space so that backpropagation can be applied to train the network. Driven by
the prevalence of NeRF-based methods [68], volume rendering [43] has become the most commonly
used differentiable renderer among the methods that this survey targets. It is naturally differentiable,
and the only input required to optimize the NeRF representation is a set of images with known
camera poses. Given the volume density and color functions, volume rendering is used to obtain the
color $C(\mathbf{r})$ of any camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$, with camera position $\mathbf{o}$ and viewing direction $\mathbf{d}$, using
$$C(\mathbf{r}) = \int_{t_1}^{t_2} T(t) \cdot \sigma(\mathbf{r}(t)) \cdot \mathbf{c}(\mathbf{r}(t), \mathbf{d}) \, dt, \quad \text{where } T(t) = \exp\left(-\int_{t_1}^{t} \sigma(\mathbf{r}(s)) \, ds\right). \qquad (1)$$
$T(t)$ denotes the accumulated transmittance, representing the probability that the ray travels from
$t_1$ to $t$ without being intercepted. The rendered image can be obtained by tracing a camera ray
$\mathbf{r}$ through each pixel of the to-be-synthesized image and computing its color $C(\mathbf{r})$.
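In practice, Eq. (1) is approximated by quadrature over discrete samples along each ray, as in NeRF [68]. The sketch below shows this standard discretization in PyTorch; it is a generic illustration, and the `densities` and `colors` inputs would come from a radiance field such as the one sketched in Sec. 2.2:

```python
import torch

def volume_render(densities: torch.Tensor, colors: torch.Tensor,
                  t_vals: torch.Tensor) -> torch.Tensor:
    """Discretized volume rendering along rays (quadrature form of Eq. (1)).

    densities: (R, S) volume density sigma at S samples along R rays.
    colors:    (R, S, 3) emitted RGB at each sample.
    t_vals:    (R, S) sample distances along each ray.
    Returns:   (R, 3) rendered pixel colors.
    """
    # Distances between adjacent samples; the last interval is treated as very large.
    deltas = t_vals[:, 1:] - t_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)

    # alpha_i = 1 - exp(-sigma_i * delta_i): the opacity of each ray segment.
    alphas = 1.0 - torch.exp(-densities * deltas)

    # T_i = prod_{j < i} (1 - alpha_j): the accumulated transmittance.
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)

    weights = trans * alphas                       # contribution of each sample
    return (weights.unsqueeze(-1) * colors).sum(dim=-2)

# Example: render 1024 rays with 64 samples each.
R, S = 1024, 64
t = torch.linspace(2.0, 6.0, S).expand(R, S)
rgb = volume_render(torch.rand(R, S), torch.rand(R, S, 3), t)   # (1024, 3)
```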
2.4 INR-based Novel View Synthesis
Novel view synthesis [30, 68, 89, 142] is a long-standing problem that involves rendering frames of
scenes from new camera viewpoints. There are existing methods that depend on implicit 3D scene
representations. One of the most representative studies in the field is NeRF, which employs neural
networks to capture the continuous 3D scene structure within the network weights, resulting
in photorealistic synthesis outcomes. These methods usually operate under the Single-Scene