Photo-realistic Neural Domain Randomization
Sergey Zakharov*1, Rareș Ambruș*1, Vitor Guizilini1, Wadim Kehl2, and Adrien Gaidon1
1Toyota Research Institute, Los Altos, CA
2Woven Planet, Tokyo, Japan
Abstract. Synthetic data is a scalable alternative to manual supervision, but it requires overcoming the sim-to-real domain gap. This discrepancy between virtual and real worlds is addressed by two seemingly opposed approaches: improving the realism of simulation or foregoing realism entirely via domain randomization. In this paper, we show that recent progress in neural rendering enables a new unified approach we call Photo-realistic Neural Domain Randomization (PNDR). We propose to learn a composition of neural networks that acts as a physics-based ray tracer generating high-quality renderings from scene geometry alone. Our approach is modular, composed of different neural networks for materials, lighting, and rendering, thus enabling randomization of different key image generation components in a differentiable pipeline. Once trained, our method can be combined with other methods and used to generate photo-realistic image augmentations online, significantly more efficiently than via traditional ray tracing. We demonstrate the usefulness of PNDR through two downstream tasks: 6D object detection and monocular depth estimation. Our experiments show that training with PNDR enables generalization to novel scenes and significantly outperforms the state of the art in terms of real-world transfer.
1 Introduction
Collecting labelled data for various machine learning tasks is an expensive, error-prone process that does not scale. Instead, simulators hold the promise of unlimited, perfectly annotated data without any human intervention, but often introduce a domain gap that affects real-world performance. Effectively using simulated data requires overcoming the sim-to-real domain gap, which arises due to differences in content or appearance. Domain adaptation methods rely on target data (i.e., real-world data) to bridge that gap [61,64,47,70,75,72,38,18]. A separate paradigm that requires no target data is that of Domain Randomization [59,60], which forgoes expensive, photo-realistic rendering in favor of random scene augmentations. In the context of object detection, CAD models are typically assumed known [22,68,49] and a subset of lighting, textures, materials, and object poses is randomized. Although domain randomization is typically sample-inefficient, its efficiency can be improved via differentiable guided augmentations [67], while content [29,10] and appearance [51,41] gaps can also be addressed by leveraging real data. However, a significant gap remains in terms of the photo-realism of the generated images.
As an alternative, recent work [1,18] has shown that downstream task performance can be improved by increasing the quality of synthetic data. However, generating high-quality, photo-realistic synthetic data is an expensive process that requires access to detailed assets and environments, as well as modeling light sources and materials inside complex graphics pipelines, which are typically not differentiable.
We propose a novel method that brings these two separate paradigms together by generating high-quality synthetic data in a domain randomization framework. We combine intermediate geometry buffers ("G-buffers") generated by modern simulators and game engines with recent advances in neural rendering [53,42,2], and build a neural physics-based ray tracer that models scene materials and light positions for photo-realistic rendering. Our Photo-realistic Neural Domain Randomization (PNDR) pipeline learns to map scene geometry to high-quality renderings and is trained on a small amount of high-quality, photo-realistic synthetic data generated by a traditional ray-tracing simulator. Thanks to its geometric input, PNDR generalizes to novel scenes and novel object configurations. Once trained, PNDR can be integrated into various downstream task training pipelines and used online to generate photo-realistic augmentations. This alleviates the need to resort to expensive simulators to generate additional high-quality image data when training the downstream task. Our method is more efficient in terms of time (PNDR renderings are generated 3 orders of magnitude faster than with traditional simulators) and space (PNDR renderings are generated on the fly during training and therefore require no storage), and it leads to better generalization. Although our proposed pipeline is generic in nature, we quantify the usefulness of our synthetic training for the specific tasks of 6D object detection and monocular depth estimation in a zero-shot setting (i.e., without using any real-world data), and demonstrate that our method presents a distinct improvement over current SoTA approaches.
In summary, our contributions are:

- We unify photo-realistic rendering and domain randomization for synthetic data generation using neural rendering.
- Our learned deferred renderer, RenderNet, allows flexible randomization of physical parameters while being 1,600× faster than comparable ray tracers.
- Our Photo-realistic Neural Domain Randomization (PNDR) approach yields state-of-the-art zero-shot sim-to-real transfer for 6D object detection and monocular depth estimation, almost closing the domain gap.
- We show that realistic physics-based randomization, especially for lighting, is key for out-of-domain generalization.
2 Related Work
Domain Adaptation. Due to the domain gap, models trained on synthetic data suffer performance drops when applied to statistically different unlabelled target datasets. Domain Adaptation is an active area of research [7] with the aim of minimizing the sim-to-real gap.
[Figure 1: block diagram — a material sampler and a light sampler feed the RenderNet encoder-decoder (with MLP), producing direct/indirect and glossy/diffuse buffers that an MLP tone mapper combines into the final rendering.]

Fig. 1: PNDR Architecture. The main component of our domain randomization method is the ray tracer approximator (RenderNet). It takes a G-buffer as well as random material maps and light maps produced by corresponding samplers and generates intermediate light outputs. These outputs are then combined using a tone mapper to generate a final rendering. The lower-right row shows different material and light samples (e.g., roughness, specularity, light position).
Common approaches rely on adversarial learning for feature or pixel adaptation [6,61,14], paired [64] or unpaired [74,47,35] image translation, style transfer [70], refining pseudo-labels [75,72,38], or unsupervised geometric guidance [18].
Domain Randomization. A different approach to closing the sim-to-real gap relies on generating augmentations of the input data through random perturbations of the environment (e.g., lights, materials, background) [59,60,22]. The aim is to learn more discriminative features that generalize to other domains. While simple and inexpensive, this method is sample-inefficient because the randomization is essentially unguided, with many superfluous (or even harmful) augmentations, and it rarely captures the complexity and distribution of real scenes. In contrast, procedurally generating synthetic scenes [50] can preserve the context of real scenes while minimizing the gaps in content [29,10,24] and appearance [51,68,49]. While some of these methods require expensive, bespoke simulators [29,10], pixel-based augmentations can be generated differentiably and combined with the task network to generate adversarial augmentations [67]. Similarly to [67], our pipeline is differentiable; however, while [67] is limited to handcrafted image augmentations whose parameters are sampled from artificial distributions, our method approximates a material-based ray tracer simulating the physical process of light scattering and global illumination, enabling effects such as shadows and diffuse interreflection. Our augmentations are based solely on light and material changes, thus reducing the randomization set to physically plausible augmentations. Moreover, as opposed to [67], we assume no color information for the objects of interest, making our method more practical for real-world applications.
Photo-Realistic Data Generation. Although expensive to generate, high-quality (i.e., photo-realistic) synthetic data can increase model generalization capabilities [1,18]. The task of view synthesis allows the rendering of novel data given a set of input images [57]. Neural Radiance Fields [40] overfit to specific scenes and can generate novel data with very high levels of fidelity, while also accounting for materials and lights [4,56,5]. Alternative methods use point-based differentiable rendering [54,3] and can optimize over scene geometry, the camera model, and various image-formation properties. While these methods overfit to specific scenes, recent self-supervised approaches learn generative models of specific objects [41] and can render novel and controllable complex scenes by exploiting compositionality [43]. While neural volume rendering and point-based techniques can yield impressive results, other methods aim to explicitly model various parts of traditional graphics pipelines [53,42,2,28,58]. Our work is similar to [53] in that we also use intermediate simulation buffers to generate photo-realistic scenes. However, while [53] relies on real data and minimizes a perceptual loss in an adversarial framework, we focus on the task of 6D object detection in a zero-shot setting, using only object CAD model information and no real images.
6D Object Detection. Correspondence-based methods [69,37,27,23,46,48] tend to show superior generalization performance in terms of adapting to different pose distributions. However, they use PnP and RANSAC to estimate poses from correspondences, which makes them non-differentiable. Additionally, they rely heavily on the quality of these correspondences, and errors can result in unreasonable estimates (e.g., behind the camera, or very far away). Conversely, regression-based methods [73,12,33] show superior performance for in-domain pose estimation, but they do not generalize well to out-of-domain settings. To validate our method we implement a correspondence-based object detector, which allows us to also evaluate instance segmentation and object correspondences in addition to the regressed object pose.
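To make the PnP-plus-RANSAC step concrete, below is a minimal sketch using OpenCV. The correspondence arrays and intrinsics are placeholder inputs for illustration, not part of our detector; only the solver calls (cv2.solvePnPRansac, cv2.Rodrigues) are standard API.

```python
import numpy as np
import cv2

# Hypothetical inputs: N predicted 2D-3D correspondences for one object.
# obj_pts: (N, 3) points on the CAD model; img_pts: (N, 2) pixel locations.
obj_pts = np.random.rand(100, 3).astype(np.float32)
img_pts = (np.random.rand(100, 2) * [640, 480]).astype(np.float32)
K = np.array([[572.4, 0.0, 325.3],
              [0.0, 573.6, 242.0],
              [0.0, 0.0, 1.0]], dtype=np.float32)  # placeholder intrinsics

# RANSAC rejects outlier correspondences; PnP solves for the 6D pose.
# This solve is the non-differentiable step noted above.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    obj_pts, img_pts, K, distCoeffs=None,
    reprojectionError=3.0, iterationsCount=150)
if ok:
    R, _ = cv2.Rodrigues(rvec)  # axis-angle to 3x3 rotation matrix
    print("rotation:\n", R, "\ntranslation:", tvec.ravel())
```

The sketch also illustrates the failure mode mentioned above: with poor correspondences, the recovered tvec can land at implausible depths, which is why correspondence quality matters so much for this family of methods.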
3 Photo-realistic Neural Domain Randomization
Our photo-realistic neural domain randomization (PNDR) approach consists of two main components: a neural ray tracer approximator (RenderNet) and sampling blocks for materials and lights. To increase perceptual quality and realism, the network outputs are passed through a non-linear tone-mapping function which yields the final rendering. We now describe the two main components of PNDR. All other implementation and training details are provided in the supplementary material.
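As a high-level illustration of how these components compose during online augmentation, the following PyTorch sketch mirrors Fig. 1. The module names and interfaces (material_sampler, light_sampler, render_net, tone_mapper) are assumptions made for illustration, not the released implementation.

```python
import torch

def pndr_augment(g_buffer, material_sampler, light_sampler, render_net, tone_mapper):
    """One online PNDR augmentation step (sketch; module interfaces are assumed).

    g_buffer: (B, C, H, W) tensor holding scene coordinates X and surface normals N.
    """
    materials = material_sampler(g_buffer)   # randomized albedo/roughness/specularity maps
    lights = light_sampler(g_buffer)         # randomized light direction/distance maps
    net_in = torch.cat([g_buffer, materials, lights], dim=1)
    buffers = render_net(net_in)             # direct/indirect, diffuse/glossy light buffers
    image = tone_mapper(buffers)             # non-linear tone mapping -> final rendering
    return image
```

Because every call draws fresh materials and lights, each downstream training batch sees a new photo-realistic augmentation of the same geometry, with no rendered images stored to disk.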
3.1 Geometric Scene Representation
As a first step, we define a geometric room representation outlining our synthetic environment. We place 3D objects inside an empty room, ensuring no collisions. Next, we assign random materials to both objects and room walls and position a point light source to illuminate the scene (see Fig. 2). The resulting output buffers, consisting of the G-buffer (scene coordinates in camera space X, surface normals map N), material properties (albedo A, roughness R, specularity S), and lighting (light direction map L_dir and light distance map L_dist), are used by our neural ray tracer approximator to generate high-fidelity renderings in real time (2.5 ms per image), as opposed to 4 s per image with a conventional ray tracer.

[Figure 2: example input maps — XYZ, normals, albedo, roughness, specular, light direction, light distance.]

Fig. 2: Geometric scene representation. Visualization of RenderNet's input, consisting of the G-buffer (scene coordinates in camera space X, surface normals map N), material properties (albedo A, roughness R, specularity S), and lighting (light direction map L_dir and light distance map L_dist).
3.2 Neural Ray Tracer Approximator
Our neural ray tracer RenderNet f_R is an encoder-decoder CNN that takes the G-buffer, material properties, and lighting as input and generates a final high-fidelity rendering (see Fig. 1). This is akin to deferred rendering, a common practice in computer graphics [8]. Instead of outputting a final rendering directly, we split the output into direct and indirect light outputs and colors, which can easily be combined to form a final, shaded image. This allows not only for a much more explainable representation, but also for better control over the complexity of the rendering. As a result, our RenderNet f_R is capable of generating photo-realistic images and generalizes well to novel material and light distributions, and even to novel scenes, objects, and poses.
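A minimal encoder-decoder sketch in PyTorch is shown below. Layer widths and the exact output split (direct/indirect crossed with diffuse/glossy, plus a color buffer) are assumptions based on Fig. 1, not the trained architecture.

```python
import torch
import torch.nn as nn

class RenderNetSketch(nn.Module):
    """Illustrative stand-in for f_R: G-buffer + materials + lights -> light buffers."""

    def __init__(self, in_ch=15, base=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, base, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base, base, 4, stride=2, padding=1), nn.ReLU(inplace=True))
        # Instead of one RGB image, predict separate light buffers that a
        # tone mapper later combines (assumed split following Fig. 1).
        self.heads = nn.ModuleDict({
            name: nn.Conv2d(base, 3, 1)
            for name in ["direct_diffuse", "direct_glossy",
                         "indirect_diffuse", "indirect_glossy", "color"]})

    def forward(self, x):
        feat = self.decoder(self.encoder(x))
        return {name: head(feat) for name, head in self.heads.items()}
```

Predicting named buffers rather than a single image is what makes the representation explainable: each head can be inspected, randomized, or ablated independently before the tone mapper combines them.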
Light Modelling. Lighting in ray tracers can often be decomposed into (1) direct lighting, coming from lamps, emitting surfaces, the background, or ambient occlusion after a single reflection or transmission from a surface; and (2) indirect lighting, coming from lamps, emitting surfaces, or the background after more than one reflection or transmission. Simulating indirect lighting approximates realistic energy transfer much more closely and produces better images, but comes at a much higher computational cost. To keep computation reasonable, we render all scenes with a single point light source.
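In equation form, with our single point light as the only emitter, the decomposition above reads as follows (the symbols are ours for exposition; f_tone denotes the tone mapper introduced earlier):

```latex
L(\mathbf{x}) \;=\; \underbrace{L_{\mathrm{direct}}(\mathbf{x})}_{\text{one bounce}}
          \;+\; \underbrace{L_{\mathrm{indirect}}(\mathbf{x})}_{\text{two or more bounces}},
\qquad
I(\mathbf{x}) \;=\; f_{\mathrm{tone}}\!\left(L(\mathbf{x})\right)
```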
Light Sampler. Our light sampler is a uniform random 3D coordinate generator. We limit the light pose space to the upper hemisphere and normalize the position to be at a distance of 1.5 m from the scene center, as defined in our training data. The resulting light source position in scene coordinates is then brought into camera space via a fixed transform. Next, we parametrize
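A minimal sketch of such a light sampler follows. The 1.5 m radius and the fixed scene-to-camera transform come from the text above; the specific uniform-hemisphere sampling scheme is our assumption.

```python
import numpy as np

def sample_light_position(T_scene_to_cam, radius=1.5, rng=None):
    """Uniformly sample a point light on the upper hemisphere of the given radius.

    T_scene_to_cam: fixed 4x4 transform from scene to camera coordinates (assumed).
    """
    rng = rng or np.random.default_rng()
    # A normalized Gaussian draw is uniform on the sphere; folding the z
    # component keeps only the upper hemisphere.
    d = rng.normal(size=3)
    d /= np.linalg.norm(d)
    d[2] = abs(d[2])
    p_scene = radius * d                              # 1.5 m from the scene center
    p_cam = (T_scene_to_cam @ np.append(p_scene, 1.0))[:3]  # bring into camera space
    return p_cam

# Example: with an identity transform, the sample stays in scene coordinates.
print(sample_light_position(np.eye(4)))
```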