A Survey on Deep Generative 3D-aware Image Synthesis
WEIHAO XIA and JING-HAO XUE, University College London, UK
Recent years have seen remarkable progress in deep learning powered visual content creation. This includes
deep generative 3D-aware image synthesis, which produces high-fidelity images in a 3D-consistent manner
while simultaneously capturing compact surfaces of objects from pure image collections, without the need
for any 3D supervision, thus bridging the gap between 2D imagery and 3D reality. The field of computer
vision has recently been captivated by the task of deep generative 3D-aware image synthesis, with hundreds
of papers appearing in top-tier journals and conferences over the past few years (mainly the past two years),
yet a comprehensive survey of this remarkable and swift progress is still lacking. Our survey aims to introduce
new researchers to this topic, provide a useful reference for related works, and stimulate future research
directions through our discussion section. Apart from the presented papers, we aim to constantly update the
latest relevant papers along with their corresponding implementations at https://weihaox.github.io/3D-aware-Gen.
CCS Concepts: • General and reference → Surveys and overviews; • Computing methodologies → Machine learning; Computer vision; Image manipulation.
Additional Key Words and Phrases: 3D-aware image synthesis, deep generative models, implicit neural
representation, generative adversarial network, diffusion probabilistic models
ACM Reference Format:
Weihao Xia and Jing-Hao Xue. 2023. A Survey on Deep Generative 3D-aware Image Synthesis. ACM Comput.
Surv. 1, 1, Article 1 (January 2023), 34 pages. https://doi.org/10.1145/3626193
1 INTRODUCTION
A tremendous amount of progress has been made in deep neural networks, leading to photorealistic
image synthesis. Despite achieving compelling results, most approaches focus on two-dimensional
(2D) images, overlooking the three-dimensional (3D) nature of the physical world. The lack of 3D
structure therefore inevitably limits some of their practical applications. Recent studies have thus
proposed generative models that are 3D-aware. That is, they incorporate 3D information into the
generative models to enhance control (especially in terms of multi-view consistency) over the generated
images. The examples depicted in Fig. 1 elucidate that the objective is to produce high-quality renderings
which maintain consistency across various views. In contrast to 2D generative models, the
recently developed 3D-aware generative models [13, 33] bridge the gap between 2D images and the 3D
physical world. The physical world surrounding us is intrinsically three-dimensional, and images
depict reality under certain conditions of geometry, material, and illumination, making it natural
to model the image generation process in 3D space. As shown in Fig. 2, classical rendering (a)
renders images at certain camera positions given human-designed or scanned 3D shape models;
inverse rendering (b) recovers the underlying intrinsic properties of the 3D physical world from
2D images; 2D image generation (c) is mostly driven by generative models, which have achieved
impressive results in photorealistic image synthesis; and 3D-aware generative models (d) offer the
Authors’ address: Weihao Xia, weihao.xia.21@ucl.ac.uk; Jing-Hao Xue, jinghao.xue@ucl.ac.uk, Department of Statistical
Science, University College London, London, UK.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses,
contact the owner/author(s).
© 2023 Copyright held by the owner/author(s).
0360-0300/2023/1-ART1
https://doi.org/10.1145/3626193
Fig. 1. The illustrative examples of 3D-aware image synthesis demonstrate the objective of this task, which is
to generate high-quality renderings that are consistent across multiple views (top) and potentially provide
detailed geometry (bottom). Typically, deep generative 3D-aware image synthesis methods are trained using
a collection of 2D images, without depending on target-specific shape priors, ground-truth 3D scans, or
multi-view supervision. Examples are sourced from [12].
possibility of replacing the classical rendering pipeline with effective and efficient models learned
directly from images.
Despite the striking progress made recently in research on deep generative 3D-aware
image synthesis, a timely and systematic review of this progress is lacking. In this work, we fill the
gap by presenting a comprehensive survey of the latest research in deep generative 3D-aware
image synthesis methods. We envision that our work will elucidate design considerations and
advanced methods for deep generative 3D-aware image synthesis, present the advantages and
disadvantages of its different kinds, and suggest future research directions. Fig. 3 provides a structured
outline and taxonomy of this survey. Fig. 4 is a chronological overview of representative deep
generative 3D-aware image synthesis methods. We propose to categorize the deep generative
3D-aware image synthesis methods into two primary categories: 3D control of 2D generative
models (Sec. 4) and 3D-aware generative models from single image collections (Sec. 5). Then, every
category is further divided into subcategories depending on the experimental setting or the
specific utilization of 3D information. In particular, 3D control of 2D generative models is further
divided into 1) exploring 3D control in 2D latent spaces (Sec. 4.1), 2) 3D parameters as controls
(Sec. 4.2), and 3) 3D priors as constraints (Sec. 4.3). Sec. 5 summarizes methods that target generating
photorealistic and multi-view-consistent images by learning 3D representations from single-view
image collections. Broadly speaking, this category leverages neural 3D representations to represent
scenes, uses differentiable neural renderers to render them into the image plane, and optimizes the
network parameters by minimizing the difference between rendered images and observed images.

Here, we present a timely, up-to-date overview of the growing field of deep generative 3D-aware
image synthesis. Considering the lack of a comprehensive survey and the increasing interest
and popularity, we believe it necessary to organize one to help computer vision practitioners
with this emerging topic. The purpose of this survey is to provide researchers new to the field
with a comprehensive understanding of deep generative 3D-aware image synthesis methods
and to show their superior performance over status quo approaches. To conclude, we highlight
several open research directions and problems that need further investigation. The scope of this
fast-expanding field is rather extensive and a panoramic review would be challenging. We shall
present only representative methods of deep generative 3D-aware image synthesis rather than
exhaustively listing all the literature. This review can therefore serve as a pedagogical tool, providing
researchers with the key information about typical methods of deep generative 3D-aware image
synthesis. Researchers can use these general guidelines to develop the most appropriate technique
for their own particular study. The main technical contributions of this work are as follows:
[Fig. 2: schematic diagrams comparing classical rendering (a designed or scanned 3D shape is rendered into an image), inverse rendering (an image is decomposed into its 3D properties), 2D generative models ($z$ is mapped to an image), 3D generative models ($z$ and a camera pose $\theta$ are mapped to an image), and 2D generative models with 3D priors as constraints or 3D parameters as controls.]
Fig. 2. Comparison of (a) rendering, (b) inverse rendering, (c) 2D generative models, and (d) 3D generative
models. 3D generative models learn 3D representations first and then render a 2D image at certain viewpoints.
Both 2D and 3D generative models have unconditional and conditional settings. An unconditional generative
model maps a noise input $z$ (and a camera pose in 3D models) to a fake image; a conditional model takes
additional inputs as control signals, which could be an image, text, or a categorical label.
• Systematic taxonomy. We propose a hierarchical taxonomy for deep generative 3D-aware
image synthesis research. We categorize existing models into two main categories: 3D control
of 2D generative models and 3D-aware generative models from image collections.
• Comprehensive review. We provide a comprehensive overview of the existing state-of-the-art
deep generative 3D-aware image synthesis methods. We compare and analyze the main
characteristics and improvements of each type, assessing their strengths and weaknesses.
• Outstanding challenges. We present open research problems and provide some suggestions
for the future development of deep generative 3D-aware image synthesis.
In an attempt to continuously track recent developments in this fast-advancing field, we
provide an accompanying webpage which catalogs papers addressing deep generative 3D-
aware image synthesis, according to our problem-based taxonomy: https://weihaox.github.
io/3D-aware-Gen.
2 BACKGROUND
This section introduces a few important concepts as background. In order to formulate deep
generative 3D-aware image synthesis, we first clarify how 2D and 3D data are expressed, and
how they are converted from one to the other. Moreover, we introduce two key elements involved
in most deep generative 3D-aware image synthesis methods: implicit neural representations and
differentiable neural rendering.
2.1 2D and 3D Data, Rendering and Inverse Rendering
2D images depict a glimpse of the surrounding 3D physical world, with its geometry, materials,
and illumination conditions at that moment. Images are composed of an array of pixels (picture
elements). The 3D reality can be represented in many different ways, each with its own advantages
and disadvantages. Examples of such 3D shape representations include
depth images, point clouds, voxel grids, and meshes. Depth images contain the distance
between the object and the camera at every pixel. The distance encodes 3D geometry information
from a fixed point of view. Layered depth images (LDIs) use several layers of depth maps and their
associated color values to depict a scene. Point clouds comprise vertices in 3D space, represented
by coordinates along the x, y, and z axes. These types of data can be acquired by 3D scanners, such
as LiDARs or RGB-D sensors, from one or more viewpoints. Voxel grids describe a scene or object
using a regular grid in 3D space. A voxel (volume element) in 3D space is analogous to a pixel in
a 2D image. A voxel grid can be created from a point cloud by voxelization, which groups all 3D
points within each voxel. Meshes are a collection of vertices, edges, and polygonal faces. In contrast
to a point cloud, which only provides vertex locations, a mesh also provides the surface information
of an object. Nevertheless, deep learning does not provide a straightforward way to process surface
information. Instead, many techniques resort to sampling points from the surfaces to create a point
cloud from the mesh representation.
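The voxelization step mentioned above can be illustrated with a short NumPy sketch. This is a minimal example under simple assumptions (a binary occupancy grid over an axis-aligned bounding box); the function and variable names are ours for illustration only:

```python
import numpy as np

def voxelize(points: np.ndarray, grid_resolution: int = 32) -> np.ndarray:
    """Convert an (N, 3) point cloud into a binary occupancy voxel grid.

    A voxel is marked occupied if at least one point falls inside it.
    """
    # Normalize points into the unit cube [0, 1)^3.
    mins, maxs = points.min(axis=0), points.max(axis=0)
    normalized = (points - mins) / (maxs - mins + 1e-8)

    # Map each point to integer voxel indices.
    indices = np.clip((normalized * grid_resolution).astype(int),
                      0, grid_resolution - 1)

    grid = np.zeros((grid_resolution,) * 3, dtype=bool)
    grid[indices[:, 0], indices[:, 1], indices[:, 2]] = True
    return grid

# Example: voxelize a random point cloud.
cloud = np.random.rand(1000, 3)
occupancy = voxelize(cloud, grid_resolution=16)
print(occupancy.sum(), "occupied voxels out of", occupancy.size)
```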
[Fig. 3 taxonomy tree, reconstructed as an outline:]
3D Control of 2D Generative Models (Sec. 4)
  - 3D Control in 2D Latent Spaces (Sec. 4.1): 3D Control Latent Directions (Sec. 4.1.1); Pinpointing Predetermined Targets (Sec. 4.1.2)
  - 3D Parameters as Controls (Sec. 4.2): Factors from Pretrained Models (Sec. 4.2.1); Explicit Control over 3D Parameters (Sec. 4.2.2)
  - 3D Priors as Constraints (Sec. 4.3): 3D Prior Knowledge (Sec. 4.3.1); 3D Components into 2D GANs (Sec. 4.3.2)
3D-aware Generative Models (Sec. 5)
  - Unconditional 3D-aware GANs (Sec. 5.1): Efficient and Effective Representations (Sec. 5.1.1); Efficient and Consistent Rendering (Sec. 5.1.2); Broadening Applicable Scenarios (Sec. 5.1.3); User-interactive Editing (Sec. 5.1.4)
  - Unconditional 3D-aware Diffusion Models (Sec. 5.2)
  - Conditional 3D-aware Generative Models (Sec. 5.3): Image; Semantic Label; Textual Description
Fig. 3. A systematic taxonomy proposed in this survey of deep generative 3D-aware image synthesis methods.
The dashed borders at the third level denote preliminaries, applications, or issues discussed in this subcategory.
It should be noted that these methods are not mutually exclusive. For example, a few methods introduce 3D
parameters to improve controllability (Sec. 4.2) while also implementing 3D constraints to improve consistency
across multiple views (Sec. 4.3); EG3D [12] is referenced in Sec. 5.1.1 and Sec. 5.1.2 for its approach to 3D-aware
representations and rendering algorithms.
As shown in Fig. 2(a), images can be obtained by rendering a 3D object or scene under certain
viewpoints and lighting conditions. This forward process is called rendering (image synthesis).
Rendering has been studied in computer graphics and a wide variety of renderers are available for
use. The reverse process, inverse rendering, as shown in Fig. 2(b), is to infer the underlying intrinsic
components of a scene from rendered 2D images. These properties include shape (surface, depth,
normal), material (albedo, reflectivity, shininess), and lighting (direction, intensity), which can be
further used to render photorealistic images. Inverse rendering papers are not classified as
3D-aware image synthesis methods in this survey, as they are not deliberately designed for this
purpose. The 3D-aware image synthesis methods in this survey include a similar inverse rendering process and a
rendering process. In contrast to classical inverse rendering, however, these methods typically do not produce explicit 3D representations
such as meshes or voxels for rendering. They learn neural 3D representations (mostly implicit
functions), render them into images with differentiable neural rendering techniques, and optimize
the network parameters by minimizing the difference between the observed and rendered images.
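This learn-render-compare loop can be summarized with a brief PyTorch-style sketch. It is a generic illustration rather than any specific method; `scene_model` and `differentiable_render` are hypothetical placeholders for a neural scene representation and a differentiable renderer:

```python
import torch

def train_step(scene_model, differentiable_render, images, cameras, optimizer):
    """One optimization step of analysis-by-synthesis.

    scene_model: a neural implicit scene representation (e.g., an MLP).
    differentiable_render: renders the scene from a given camera pose.
    images, cameras: observed 2D images and their camera poses.
    """
    optimizer.zero_grad()
    loss = 0.0
    for image, camera in zip(images, cameras):
        rendered = differentiable_render(scene_model, camera)   # (H, W, 3)
        loss = loss + torch.mean((rendered - image) ** 2)       # photometric loss
    loss.backward()   # gradients flow through the renderer into the scene model
    optimizer.step()
    return loss.item()
```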
2.2 Implicit Scene Representations
In computer vision and computer graphics, 3D shapes are traditionally represented by explicit
representations like depth maps, voxels, point clouds, or meshes. Recent methods propose to represent
3D scenes with neural implicit functions, such as occupancy fields [67], signed distance fields [83],
and radiance fields [68]. The implicit neural representation (INR, also called neural fields or coordinate-based
representations) is a novel way to parameterize signals across a wide range of domains. Taking
images as an example, an INR parameterizes an image as a continuous function that maps pixel
coordinates to RGB colors. The implicit functions are often not analytically tractable and are hence
approximated by neural networks. Here are some popular examples of INRs.
Neural Occupancy Field [67, 78, 84] implicitly represents a 3D surface with the continuous
decision boundary of a neural classifier. This function, approximated with a neural network, assigns
to every location $p \in \mathbb{R}^3$ an occupancy probability between 0 and 1. Given an observation (e.g., an
image or a point cloud) $x \in \mathcal{X}$ and a location $p \in \mathbb{R}^3$, the representation can be simply parameterized
by a neural network $f_\theta$ that takes the pair $(p, x)$ as input and outputs a real number which represents
the probability of occupancy: $f_\theta : \mathbb{R}^3 \times \mathcal{X} \rightarrow [0, 1]$.
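As a minimal illustration of such a classifier, the following PyTorch sketch (an assumed toy architecture, not the network of [67]) maps 3D query points and an observation embedding to occupancy probabilities:

```python
import torch
import torch.nn as nn

class OccupancyNetwork(nn.Module):
    """f_theta : R^3 x X -> [0, 1], a toy occupancy field conditioned on an
    observation embedding (e.g., produced by an image or point-cloud encoder)."""

    def __init__(self, obs_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, points: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) query locations; obs: (B, obs_dim) observation embedding.
        obs = obs.unsqueeze(1).expand(-1, points.shape[1], -1)
        logits = self.mlp(torch.cat([points, obs], dim=-1))
        return torch.sigmoid(logits).squeeze(-1)   # occupancy probability in [0, 1]

# Example: query 1024 random locations for a batch of two observations.
net = OccupancyNetwork()
p = torch.rand(2, 1024, 3)
z = torch.randn(2, 256)
occ = net(p, z)   # (2, 1024) probabilities
```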
Neural Signed Distance Field [83] is a continuous function that models the distance from a
queried location to the nearest point on a shape's surface, whose sign indicates whether this location
is inside (negative) or outside (positive): $SDF(\boldsymbol{x}) = s$, with $\boldsymbol{x} \in \mathbb{R}^3$ and $s \in \mathbb{R}$. The underlying surface is
implicitly described as the zero iso-surface decision boundary of a feed-forward network, $SDF(\cdot) = 0$.
This implicit surface can be rendered by raycasting or by rasterizing a mesh obtained through
marching cubes [63].
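As an example of the mesh-extraction route, the sketch below samples an SDF on a regular grid and extracts its zero level set with scikit-image's marching_cubes; the analytic `sphere_sdf` is a stand-in for a trained network:

```python
import numpy as np
from skimage.measure import marching_cubes

def sphere_sdf(points: np.ndarray, radius: float = 0.5) -> np.ndarray:
    """Analytic signed distance to a sphere; a trained network would go here."""
    return np.linalg.norm(points, axis=-1) - radius

# Sample the SDF on a dense 64^3 grid covering [-1, 1]^3.
n = 64
coords = np.linspace(-1.0, 1.0, n)
grid = np.stack(np.meshgrid(coords, coords, coords, indexing="ij"), axis=-1)
sdf_values = sphere_sdf(grid.reshape(-1, 3)).reshape(n, n, n)

# Extract the zero iso-surface SDF(x) = 0 as a triangle mesh.
verts, faces, normals, _ = marching_cubes(sdf_values, level=0.0,
                                          spacing=(2.0 / (n - 1),) * 3)
print(verts.shape, faces.shape)   # mesh vertices and triangle indices
```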
Neural Radiance Field (NeRF) [68] has attracted growing attention due to its compelling
results in novel view synthesis on complex scenes. It leverages an MLP network to approximate the
radiance field of a static 3D scene and uses the classic volume rendering technique [43] to estimate
the color of each pixel. This function takes as input a 3D location $\boldsymbol{x}$ and a 2D viewing direction $\boldsymbol{d}$, and
outputs a view-dependent emitted RGB color $\boldsymbol{c}$ and a volume density $\sigma$: $f_\theta : (\boldsymbol{x}, \boldsymbol{d}) \rightarrow (\boldsymbol{c}, \sigma)$. It captures
3D geometric details from pure 2D supervision by learning to reconstruct the given views.
There also exist many other implicit functions proposed to represent a scene, such as neural
sparse voxel fields [58] or neural volumes [62].
2.3 Dierentiable Neural Rendering
3D rendering is a function that outputs a 2D image from a 3D scene. Differentiable rendering
provides a differentiable rendering function, that is, it computes the derivatives of that function
with respect to the different parameters of the scene. Once a renderer is differentiable, it can be integrated
into the optimization of neural networks. One use case for differentiable rendering is to compute a
loss in rendered image space so that backpropagation can be applied to train the network. Driven by
the prevalence of NeRF-based methods [68], volume rendering [43] has become the most commonly
used differentiable renderer among the methods that this survey targets. It is naturally differentiable,
and the only input required to optimize the NeRF representation is a set of images with known
camera poses. Given the volume density and color functions, volume rendering is used to obtain the
color $C(\mathbf{r})$ of any camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$, with camera position $\mathbf{o}$ and viewing direction $\mathbf{d}$, using
$$C(\mathbf{r}) = \int_{t_1}^{t_2} T(t) \cdot \sigma(\mathbf{r}(t)) \cdot \mathbf{c}(\mathbf{r}(t), \mathbf{d}) \, dt, \quad \text{where } T(t) = \exp\left(-\int_{t_1}^{t} \sigma(\mathbf{r}(s)) \, ds\right). \qquad (1)$$
$T(t)$ denotes the accumulated transmittance, representing the probability that the ray travels from
$t_1$ to $t$ without being intercepted. The rendered image can be obtained by tracing a camera ray
$\mathbf{r}$ through each pixel of the to-be-synthesized image and computing its color $C(\mathbf{r})$.
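In practice, Eq. (1) is approximated by quadrature over discrete samples along each ray, as in NeRF [68]. The sketch below shows this standard discretization in PyTorch; it is a generic illustration, and the `densities` and `colors` inputs would come from a radiance field such as the one sketched in Sec. 2.2:

```python
import torch

def volume_render(densities: torch.Tensor, colors: torch.Tensor,
                  t_vals: torch.Tensor) -> torch.Tensor:
    """Discretized volume rendering along rays (quadrature form of Eq. (1)).

    densities: (R, S) volume density sigma at S samples along R rays.
    colors:    (R, S, 3) emitted RGB at each sample.
    t_vals:    (R, S) sample distances along each ray.
    Returns:   (R, 3) rendered pixel colors.
    """
    # Distances between adjacent samples; the last interval is treated as very large.
    deltas = t_vals[:, 1:] - t_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)

    # alpha_i = 1 - exp(-sigma_i * delta_i): the opacity of each ray segment.
    alphas = 1.0 - torch.exp(-densities * deltas)

    # T_i = prod_{j < i} (1 - alpha_j): the accumulated transmittance.
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)

    weights = trans * alphas                       # contribution of each sample
    return (weights.unsqueeze(-1) * colors).sum(dim=-2)

# Example: render 1024 rays with 64 samples each.
R, S = 1024, 64
t = torch.linspace(2.0, 6.0, S).expand(R, S)
rgb = volume_render(torch.rand(R, S), torch.rand(R, S, 3), t)   # (1024, 3)
```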
2.4 INR-based Novel View Synthesis
Novel view synthesis [30, 68, 89, 142] is a long-standing problem that involves rendering frames of
scenes from new camera viewpoints. There are existing methods that depend on implicit 3D scene
representations. One of the most representative studies in the field is NeRF, which employs neural
networks to capture the continuous 3D scene structure within the network weights, resulting
in photorealistic synthesis outcomes. These methods usually operate under the Single-Scene