Photo-realistic Neural Domain Randomization
Sergey Zakharov*1, Rareș Ambruș*1, Vitor Guizilini1, Wadim Kehl2, and Adrien Gaidon1
1Toyota Research Institute, Los Altos, CA
2Woven Planet, Tokyo, Japan
Abstract. Synthetic data is a scalable alternative to manual supervision, but it requires overcoming the sim-to-real domain gap. This discrepancy between virtual and real worlds is addressed by two seemingly opposed approaches: improving the realism of simulation or foregoing realism entirely via domain randomization. In this paper, we show that recent progress in neural rendering enables a new unified approach we call Photo-realistic Neural Domain Randomization (PNDR). We propose to learn a composition of neural networks that acts as a physics-based ray tracer generating high-quality renderings from scene geometry alone. Our approach is modular, composed of different neural networks for materials, lighting, and rendering, thus enabling randomization of different key image generation components in a differentiable pipeline. Once trained, our method can be combined with other methods and used to generate photo-realistic image augmentations online, significantly more efficiently than via traditional ray tracing. We demonstrate the usefulness of PNDR through two downstream tasks: 6D object detection and monocular depth estimation. Our experiments show that training with PNDR enables generalization to novel scenes and significantly outperforms the state of the art in terms of real-world transfer.
1 Introduction
Collecting labelled data for various machine learning tasks is an expensive, error-prone process that does not scale. Instead, simulators hold the promise of unlimited, perfectly annotated data without any human intervention, but often introduce a domain gap that affects real-world performance. Effectively using simulated data requires overcoming the sim-to-real domain gap, which arises due to differences in content or appearance. Domain adaptation methods rely on target data (i.e., real-world data) to bridge that gap [61,64,47,70,75,72,38,18]. A separate paradigm that requires no target data is that of Domain Randomization [59,60], which forgoes expensive, photo-realistic rendering in favor of random scene augmentations. In the context of object detection, CAD models are typically assumed known [22,68,49] and a subset of lighting, textures, materials, and object poses is randomized. Although domain randomization is typically sample-inefficient, its efficiency can be improved via differentiable guided augmentations [67], while content [29,10] and appearance [51,41] gaps can also be addressed by leveraging real data. However, a significant gap remains in terms of the photo-realism of the generated images.
As an alternative, recent work [1,18] has shown that downstream task performance can be improved by increasing the quality of synthetic data. However, generating high-quality, photo-realistic synthetic data is an expensive process that requires access to detailed assets and environments, as well as modeling light sources and materials inside complex graphics pipelines, which are typically not differentiable.
We propose a novel method that brings these two separate paradigms together by generating high-quality synthetic data in a domain randomization framework. We combine intermediate geometry buffers ("G-buffers") generated by modern simulators and game engines with recent advances in neural rendering [53,42,2], and build a neural physics-based ray tracer that models scene materials and light positions for photo-realistic rendering. Our Photo-realistic Neural Domain Randomization (PNDR) pipeline learns to map scene geometry to high-quality renderings and is trained on a small amount of high-quality, photo-realistic synthetic data generated by a traditional ray-tracing simulator. Thanks to its geometric input, PNDR generalizes to novel scenes and novel object configurations. Once trained, PNDR can be integrated into various downstream task training pipelines and used online to generate photo-realistic augmentations. This alleviates the need to resort to expensive simulators to generate additional high-quality image data when training the downstream task. Our method is more efficient in terms of time (PNDR renderings are generated 3 orders of magnitude faster than with traditional simulators) and space (PNDR renderings are generated on the fly during training and therefore require no storage), and it leads to better generalization. Although our proposed pipeline is generic in nature, we quantify the usefulness of our synthetic training for the specific tasks of 6D object detection and monocular depth estimation in a zero-shot setting (i.e., without using any real-world data), and demonstrate that our method presents a distinct improvement over current SoTA approaches.
In summary, our contributions are:

- We unify photo-realistic rendering and domain randomization for synthetic data generation using neural rendering.
- Our learned deferred renderer, RenderNet, allows flexible randomization of physical parameters while being 1,600× faster than comparable ray tracers.
- Our Photo-realistic Neural Domain Randomization (PNDR) approach yields state-of-the-art zero-shot sim-to-real transfer for 6D object detection and monocular depth estimation, almost closing the domain gap.
- We show that realistic physics-based randomization, especially for lighting, is key for out-of-domain generalization.
2 Related Work
Domain Adaptation. Due to the domain gap, models trained on synthetic data suffer performance drops when applied to statistically different unlabelled target datasets. Domain Adaptation is an active area of research [7] with the aim of minimizing the sim-to-real gap.
[Figure 1: block diagram — a material sampler and a light sampler feed the RenderNet encoder-decoder (with MLP), producing direct/indirect and glossy/diffuse buffers that an MLP tone mapper combines into the final rendering.]

Fig. 1: PNDR Architecture. The main component of our domain randomization method is the ray tracer approximator (RenderNet). It takes a G-buffer as well as random material maps and light maps produced by corresponding samplers and generates intermediate light outputs. These outputs are then combined using a tone mapper to generate a final rendering. The lower-right row shows different material and light samples (e.g., roughness, specularity, light position).
Common approaches rely on adversarial learning for feature or pixel adaptation [6,61,14], paired [64] or unpaired [74,47,35] image translation, style transfer [70], refining pseudo-labels [75,72,38], or unsupervised geometric guidance [18].
Domain Randomization. A different approach to closing the sim-to-real gap relies on generating augmentations of the input data through random perturbations of the environment (e.g., lights, materials, background) [59,60,22]. The aim is to learn more discriminative features that generalize to other domains. While simple and inexpensive, this method is sample-inefficient because the randomization is essentially unguided, with many superfluous (or even harmful) augmentations, and it rarely captures the complexity and distribution of real scenes. In contrast, procedurally generating synthetic scenes [50] can preserve the context of real scenes while minimizing the gaps in content [29,10,24] and appearance [51,68,49]. While some of these methods require expensive, bespoke simulators [29,10], pixel-based augmentations can be generated differentiably and combined with the task network to generate adversarial augmentations [67]. Similarly to [67], our pipeline is differentiable; however, while [67] is limited to handcrafted image augmentations whose parameters are sampled from artificial distributions, our method approximates a material-based ray tracer simulating the physical process of light scattering and global illumination, enabling effects such as shadows and diffuse interreflection. Our augmentations are based solely on light and material changes, thus reducing the randomization set to physically plausible augmentations. Moreover, as opposed to [67], we assume no color information for the objects of interest, making our method more practical for real-world applications.
Photo-Realistic Data Generation. Although expensive to generate, high-quality (i.e., photo-realistic) synthetic data can increase model generalization capabilities [1,18]. The task of view synthesis allows the rendering of novel data given a set of input images [57]. Neural Radiance Fields [40] overfit to specific scenes and can generate novel data with very high levels of fidelity, while also accounting for materials and lights [4,56,5]. Alternative methods use point-based differentiable rendering [54,3] and can optimize over scene geometry, the camera model, and various image-formation properties. While these methods overfit to specific scenes, recent self-supervised approaches learn generative models of specific objects [41] and can render novel and controllable complex scenes by exploiting compositionality [43]. While neural volume rendering and point-based techniques can yield impressive results, other methods aim to explicitly model various parts of traditional graphics pipelines [53,42,2,28,58]. Our work is similar to [53] in that we also use intermediate simulation buffers to generate photo-realistic scenes. However, while [53] relies on real data and minimizes a perceptual loss in an adversarial framework, we focus on the task of 6D object detection in a zero-shot setting, using only object CAD model information and no real images.
6D Object Detection. Correspondence-based methods [69,37,27,23,46,48] tend to show superior generalization performance in terms of adapting to different pose distributions. However, they use PnP and RANSAC to estimate poses from correspondences, which makes them non-differentiable. Additionally, they rely heavily on the quality of these correspondences, and errors can result in unreasonable estimates (e.g., behind the camera, or very far away). Conversely, regression-based methods [73,12,33] show superior performance for in-domain pose estimation, but they do not generalize well to out-of-domain settings. To validate our method we implement a correspondence-based object detector, which allows us to also evaluate instance segmentation and object correspondences in addition to the regressed object pose.
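To make the PnP-plus-RANSAC step concrete, below is a minimal sketch using OpenCV. The correspondence arrays and intrinsics are placeholder inputs for illustration, not part of our detector; only the solver calls (cv2.solvePnPRansac, cv2.Rodrigues) are standard API.

```python
import numpy as np
import cv2

# Hypothetical inputs: N predicted 2D-3D correspondences for one object.
# obj_pts: (N, 3) points on the CAD model; img_pts: (N, 2) pixel locations.
obj_pts = np.random.rand(100, 3).astype(np.float32)
img_pts = (np.random.rand(100, 2) * [640, 480]).astype(np.float32)
K = np.array([[572.4, 0.0, 325.3],
              [0.0, 573.6, 242.0],
              [0.0, 0.0, 1.0]], dtype=np.float32)  # placeholder intrinsics

# RANSAC rejects outlier correspondences; PnP solves for the 6D pose.
# This solve is the non-differentiable step noted above.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    obj_pts, img_pts, K, distCoeffs=None,
    reprojectionError=3.0, iterationsCount=150)
if ok:
    R, _ = cv2.Rodrigues(rvec)  # axis-angle to 3x3 rotation matrix
    print("rotation:\n", R, "\ntranslation:", tvec.ravel())
```

The sketch also illustrates the failure mode mentioned above: with poor correspondences, the recovered tvec can land at implausible depths, which is why correspondence quality matters so much for this family of methods.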
3 Photo-realistic Neural Domain Randomization
Our photo-realistic neural domain randomization (PNDR) approach consists of two main components: a neural ray tracer approximator (RenderNet) and sampling blocks for materials and lights. To increase perceptual quality and realism, the network outputs are passed through a non-linear tone-mapping function which yields the final rendering. We now describe the two main components of PNDR. All other implementation and training details are provided in the supplementary material.
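As a high-level illustration of how these components compose during online augmentation, the following PyTorch sketch mirrors Fig. 1. The module names and interfaces (material_sampler, light_sampler, render_net, tone_mapper) are assumptions made for illustration, not the released implementation.

```python
import torch

def pndr_augment(g_buffer, material_sampler, light_sampler, render_net, tone_mapper):
    """One online PNDR augmentation step (sketch; module interfaces are assumed).

    g_buffer: (B, C, H, W) tensor holding scene coordinates X and surface normals N.
    """
    materials = material_sampler(g_buffer)   # randomized albedo/roughness/specularity maps
    lights = light_sampler(g_buffer)         # randomized light direction/distance maps
    net_in = torch.cat([g_buffer, materials, lights], dim=1)
    buffers = render_net(net_in)             # direct/indirect, diffuse/glossy light buffers
    image = tone_mapper(buffers)             # non-linear tone mapping -> final rendering
    return image
```

Because every call draws fresh materials and lights, each downstream training batch sees a new photo-realistic augmentation of the same geometry, with no rendered images stored to disk.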
3.1 Geometric Scene Representation
As a first step, we define a geometric room representation outlining our synthetic environment. We place 3D objects inside an empty room, ensuring no collisions. Next, we assign random materials to both objects and room walls and position a point light source to illuminate the scene (see Fig. 2). The resulting output buffers, consisting of the G-buffer (scene coordinates in camera space X, surface normals map N), material properties (albedo A, roughness R, specularity S), and lighting (light direction map L_dir and light distance map L_dist), are used by our neural ray tracer approximator to generate high-fidelity renderings in real time (2.5 ms per image), as opposed to 4 s per image with a conventional ray tracer.

[Figure 2: example input maps — XYZ, normals, albedo, roughness, specular, light direction, light distance.]

Fig. 2: Geometric scene representation. Visualization of RenderNet's input, consisting of the G-buffer (scene coordinates in camera space X, surface normals map N), material properties (albedo A, roughness R, specularity S), and lighting (light direction map L_dir and light distance map L_dist).
3.2 Neural Ray Tracer Approximator
Our neural ray tracer RenderNet f_R is an encoder-decoder CNN that takes the G-buffer, material properties, and lighting as input and generates a final high-fidelity rendering (see Fig. 1). This is akin to deferred rendering, a common practice in computer graphics [8]. Instead of outputting a final rendering directly, we split the output into direct and indirect light outputs and colors, which can easily be combined to form a final, shaded image. This allows not only for a much more explainable representation, but also for better control over the complexity of the rendering. As a result, our RenderNet f_R is capable of generating photo-realistic images and generalizes well to novel material and light distributions, and even to novel scenes, objects, and poses.
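A minimal encoder-decoder sketch in PyTorch is shown below. Layer widths and the exact output split (direct/indirect crossed with diffuse/glossy, plus a color buffer) are assumptions based on Fig. 1, not the trained architecture.

```python
import torch
import torch.nn as nn

class RenderNetSketch(nn.Module):
    """Illustrative stand-in for f_R: G-buffer + materials + lights -> light buffers."""

    def __init__(self, in_ch=15, base=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, base, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base, base, 4, stride=2, padding=1), nn.ReLU(inplace=True))
        # Instead of one RGB image, predict separate light buffers that a
        # tone mapper later combines (assumed split following Fig. 1).
        self.heads = nn.ModuleDict({
            name: nn.Conv2d(base, 3, 1)
            for name in ["direct_diffuse", "direct_glossy",
                         "indirect_diffuse", "indirect_glossy", "color"]})

    def forward(self, x):
        feat = self.decoder(self.encoder(x))
        return {name: head(feat) for name, head in self.heads.items()}
```

Predicting named buffers rather than a single image is what makes the representation explainable: each head can be inspected, randomized, or ablated independently before the tone mapper combines them.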
Light Modelling. Lighting in ray tracers can often be decomposed into (1) direct lighting, coming from lamps, emitting surfaces, the background, or ambient occlusion after a single reflection or transmission from a surface; and (2) indirect lighting, coming from lamps, emitting surfaces, or the background after more than one reflection or transmission. Simulating indirect lighting approximates realistic energy transfer much more closely and produces better images, but comes at a much higher computational cost. To keep computation reasonable, we render all scenes with a single point light source.
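In equation form, with our single point light as the only emitter, the decomposition above reads as follows (the symbols are ours for exposition; f_tone denotes the tone mapper introduced earlier):

```latex
L(\mathbf{x}) \;=\; \underbrace{L_{\mathrm{direct}}(\mathbf{x})}_{\text{one bounce}}
          \;+\; \underbrace{L_{\mathrm{indirect}}(\mathbf{x})}_{\text{two or more bounces}},
\qquad
I(\mathbf{x}) \;=\; f_{\mathrm{tone}}\!\left(L(\mathbf{x})\right)
```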
Light Sampler. Our light sampler is a uniform random 3D coordinate generator. We limit the light pose space to the upper hemisphere and normalize the position to be at a distance of 1.5 m from the scene center, as defined in our training data. The resulting light source position in scene coordinates is then brought into camera space via a fixed transform. Next, we parametrize
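A minimal sketch of such a light sampler follows. The 1.5 m radius and the fixed scene-to-camera transform come from the text above; the specific uniform-hemisphere sampling scheme is our assumption.

```python
import numpy as np

def sample_light_position(T_scene_to_cam, radius=1.5, rng=None):
    """Uniformly sample a point light on the upper hemisphere of the given radius.

    T_scene_to_cam: fixed 4x4 transform from scene to camera coordinates (assumed).
    """
    rng = rng or np.random.default_rng()
    # A normalized Gaussian draw is uniform on the sphere; folding the z
    # component keeps only the upper hemisphere.
    d = rng.normal(size=3)
    d /= np.linalg.norm(d)
    d[2] = abs(d[2])
    p_scene = radius * d                              # 1.5 m from the scene center
    p_cam = (T_scene_to_cam @ np.append(p_scene, 1.0))[:3]  # bring into camera space
    return p_cam

# Example: with an identity transform, the sample stays in scene coordinates.
print(sample_light_position(np.eye(4)))
```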