MonoNHR: Monocular Neural Human Renderer
Hongsuk Choi1,3, Gyeongsik Moon2, Matthieu Armando3, Vincent Leroy3,
Kyoung Mu Lee1, Grégory Rogez3
1Dept. of ECE & ASRI, Seoul National University, Korea
2Meta Reality Labs Research 3NAVER LABS Europe
redstonepo@gmail.com, mks0601@fb.com, kyoungmu@snu.ac.kr
{matthieu.armando,vincent.leroy,gregory.rogez}@naverlabs.com
Abstract
Existing neural human rendering methods struggle with a single image input due to the lack of information in invisible areas and the depth ambiguity of pixels in visible areas. In this regard, we propose Monocular Neural Human Renderer (MonoNHR), a novel approach that renders robust free-viewpoint images of an arbitrary human given only a single image. MonoNHR is the first method that (i) renders human subjects never seen during training in a monocular setup, and (ii) is trained in a weakly-supervised manner without geometry supervision. First, we propose to disentangle 3D geometry and texture features and to condition the texture inference on the 3D geometry features. Second, we introduce a Mesh Inpainter module that inpaints the occluded parts exploiting human structural priors such as symmetry. Experiments on the ZJU-MoCap, AIST and HUMBI datasets show that our approach significantly outperforms the recent methods adapted to the monocular case.
1. Introduction
Novel view synthesis from a single image is a very challenging problem, but has many potential applications, e.g. in AR/VR or smartphone-based social networking services. Markerless capture from RGB data has been widely studied as a tool to generate realistic free-viewpoint renderings of humans, but it often requires synchronized and calibrated multi-camera systems. We take a step towards monocular capture of people's appearance and shape, and tackle novel view synthesis of a person observed from a single image, which extends concurrent works to a more general setting.
* equal contribution

Figure 1: Given a single input image of a human (leftmost), MonoNHR generates realistic renderings from novel viewpoints (right). Tested on unseen subjects from HUMBI [58].

Neural human rendering methods, which aim to render people from arbitrary viewpoints, showed promising results for this task. These can generally be grouped into two main categories: those learning subject-specific Neural Radiance Fields (NeRF) [30] to represent the appearance of a particular human [25,35,36,40,46,52], and approaches that estimate neural surface fields [44] using pixel-aligned image features [13,38,39,44,60]. The first ones require a large number of input images [25,52] or multi-view video frames capturing the complete surface of the target [25,35,36,52], while the others rely on detailed geometric ground-truth during training and thus require expensive, therefore small-scale, 3D scan datasets, preventing generalization to unseen human poses and appearances. ARCH [15] and ARCH++ [12], which follow a different approach by learning an occupancy function in some canonical body space, also suffer from this problem. We summarize these modalities in Table 1.
Interestingly, very recent NeRF-based methods, such as pixelNeRF [57], PVA [37], and NHP [20], showed that it is feasible to render free-viewpoint images of humans from a sparse set of views, while allowing generalization to arbitrary subjects. However, these methods are not designed to synthesize occluded surfaces, as they only render surfaces visible in the input views. Thus, these methods struggle with monocular inputs, where more than half the surface of the observed person can be invisible to the camera, making the texture and geometry of the invisible parts largely ambiguous. Furthermore, a depth ambiguity remains inherent to monocular observations. In this paper, we claim that prior knowledge of the human appearance and shape, such as symmetry, color consistency between surfaces, and front-back coherence, should be better exploited for this task.
Based on these observations, we propose Monocular Neural Human Renderer (MonoNHR), a novel NeRF-based architecture that robustly renders free-viewpoint images of an arbitrary human given a single image of that person. We address the issues inherent to a monocular observation in two ways. First, we disentangle the features of 3D geometry and texture by extracting geometry-dedicated features. Different from neural surface field-based methods [38,39], MonoNHR is trained only with multi-view images, without ground-truth (GT) 3D scans. Since we do not consider explicit 3D geometry, extracting geometry-dedicated features is non-trivial. To do so, we design a geometry estimation branch, separated from the texture estimation branch, that is used solely for estimating the density of the radiance field. Second, we introduce a Mesh Inpainter that operates on the SMPL [26] mesh estimated from the input image. It is used only during training, to encourage the backbone network to implicitly learn human priors. Please note that, contrary to SMPL texturing works [56], our method does not rely on this 3D surface and is able to render shapes that largely differ from the SMPL model.
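To make the geometry/texture disentanglement concrete, the sketch below shows one possible way such a two-branch field could be wired in PyTorch: density is predicted from geometry-dedicated features only, and color is conditioned on those features. Module names and dimensions are illustrative assumptions, not the actual MonoNHR implementation (which is described in Section 3).

import torch
import torch.nn as nn

# Hypothetical two-branch radiance field: the geometry branch alone drives density,
# and the texture branch is conditioned on the geometry features.
class TwoBranchField(nn.Module):
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        # geometry branch: pixel-aligned image feature + 3D point -> geometry feature
        self.geo_mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.density_head = nn.Linear(hidden, 1)
        # texture branch: image feature + geometry feature + view direction -> RGB
        self.tex_mlp = nn.Sequential(
            nn.Linear(feat_dim + hidden + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))

    def forward(self, img_feat, xyz, view_dir):
        geo_feat = self.geo_mlp(torch.cat([img_feat, xyz], dim=-1))
        sigma = torch.relu(self.density_head(geo_feat))   # density from geometry features only
        rgb = torch.sigmoid(
            self.tex_mlp(torch.cat([img_feat, geo_feat, view_dir], dim=-1)))
        return sigma, rgb

In such a design, only the geometry branch determines where density mass is placed along a ray, so color supervision cannot distort the geometry features, which is the intent of the disentanglement described above.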
We study the efficacy of MonoNHR on the ZJU-MoCap [36], AIST [23,48] and HUMBI [58] datasets. Experiments show that our method significantly outperforms recent NeRF-based methods on monocular images. To the best of our knowledge, MonoNHR is the first approach specifically designed for novel view synthesis of humans from a monocular image using neural radiance fields. Unlike previous works, it explicitly and effectively handles the many ambiguities inherent to monocular observations. Our contributions are summarized below:
• We present MonoNHR, a novel NeRF-based architecture that robustly renders free-viewpoint images of an arbitrary human from a monocular image. It pushes the boundaries of NeRF-based novel view synthesis research to a more general setting.
• We design the network to handle the specific challenges of the task, such as invisible surface synthesis and depth ambiguity. We tackle the former via a mesh inpainting module. For the latter, we disentangle 3D geometry and texture features, and condition texture inference on the geometry features.
• The proposed system significantly outperforms previous methods on monocular images both quantitatively and qualitatively, and achieves state-of-the-art rendering quality on novel view synthesis benchmarks.

method                             | input             | supervision       | unseen identity
NB [36], Ani-NeRF [35]             | subject code      | multi-view videos | ✗
PIFu [38,39], ARCH/ARCH++ [12,15]  | monocular image   | 3D scans          | ✓
NHP [20]                           | multi-view videos | multi-view images | ✓
MonoNHR (Ours)                     | monocular image   | multi-view images | ✓

Table 1: Comparison of recent neural human rendering methods. MonoNHR is the first work that 1) takes a monocular image as an input, 2) is supervised with multi-view images without 3D scans, and 3) is generalizable to unseen subjects (identities).
2. Related work
Markerless Performance Capture from RGB data often required large acquisition platforms with tens to hundreds of cameras [4,21,47]. The problem has also been approached through the use of sparse setups [14,51]. We refer the reader to [53] for a broader overview. Nevertheless, all the multi-view approaches share the same limitations: synchronizing and calibrating multi-camera systems is cumbersome, requires storing and processing large amounts of data, and is not always feasible in practice. A few recent approaches tackled the monocular case [10,54,55], but they all require pre-scanned templates of the subjects.
Furthermore, most of the performance capture methods [1,4,6,9,28,45] solve the problem in two distinct steps: 1) reconstructing the mesh of the observed subject, and 2) coloring it using available observations, possibly considering lighting information. The main drawback of such a strategy is that the appearance is conditioned on, but also limited by, the geometry, which is inherently noisy and inaccurate, if not incomplete. In this work, we wish to switch paradigms and directly model view-dependent appearance, an idea already introduced in [5]. We follow this line of work by investigating the potential use of NeRF [30] to represent view-dependent appearance without relying on an explicit geometry formulation, using only a single input view.
NeRF-based human rendering methods implicitly encode a dense scene radiance field, in the form of a density and a color for a given 3D query point and viewing direction, via a neural network. One of their advantages is that they do not require 3D supervision, and instead rely on 2D supervision from multi-view images. The main drawback of the original formulation, however, is that a NeRF model has to be optimized for each scene, since it is, in fact, an optimization scheme and not a learning approach per se.
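For reference, the standard NeRF formulation [30] that these methods build on is a field F_\Theta : (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma) mapping a 3D point and a viewing direction to a color and a density, with a pixel color obtained by volume rendering along the camera ray \mathbf{r}(t) = \mathbf{o} + t\mathbf{d}:

\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt,
\qquad T(t) = \exp\!\Big(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds\Big).

Training minimizes a photometric loss \sum_{\mathbf{r}} \lVert \hat{C}(\mathbf{r}) - C(\mathbf{r}) \rVert_2^2 over rays sampled from the multi-view training images, which is the 2D supervision mentioned above.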
One stream of NeRF-based approaches for human rendering builds a subject-specific representation. NHR [52] takes a sequence of point clouds as an input and conditions the rendered novel images using 80 input points of view. NB [36] utilizes a 3D human mesh model (i.e. SMPL [26]) and subject-specific latent codes to construct a 3D latent code volume, which is used for density and color regression of any 3D point bounded by a given 3D human mesh. Other works [2,25,35] deform observation-space 3D points to the canonical 3D space using inverse bone transformations, and learn the neural radiance fields in that space. The canonical 3D space represents the pose-normalized space around the template human mesh. NARF [33] learns neural radiance fields per human part, and trains an autoencoder to encode human appearances using a synthetic human dataset.
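As a point of reference for the canonical-space deformation mentioned above, a common linear-blend-skinning-style inverse mapping takes an observation-space point \mathbf{x}_{\text{obs}} to the canonical space as

\mathbf{x}_{\text{can}} = \Big( \sum_{k=1}^{K} w_k(\mathbf{x}_{\text{obs}})\, G_k \Big)^{-1} \mathbf{x}_{\text{obs}},

where G_k is the rigid transformation of bone k from the canonical to the observed pose and w_k are per-point blend weights. The cited works [2,25,35] differ mainly in how these weights are defined or learned, so this is only a representative form rather than their exact formulation.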
Recently, pixelNeRF [57], IBRNet [49], and SRF [3] proposed to combine an image-based feature encoding and NeRF. Instead of memorizing the scene radiance in a 3D space, their networks estimate it based on pixel-aligned image features. We chose to adopt this strategy, as it allows a single network to be trained across multiple scenes to learn a scene prior, enabling it to generalize to unseen scenes in a feed-forward manner from a sparse set of views. PVA [37], Wang et al. [50], and NHP [20] also apply this approach to human rendering. In particular, NHP targets human body rendering, which is also our interest, using sparse multi-view videos. Given a GT SMPL mesh, it exploits the pixel-aligned image features to construct the 3D latent volume.
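To illustrate what pixel-aligned conditioning involves, the snippet below projects 3D query points into the input view with the camera parameters and bilinearly samples a CNN feature map at the projected pixels. It is a generic sketch under assumed conventions (world-to-camera extrinsics R, t and intrinsics K), not the exact feature extraction of any of the cited methods.

import torch
import torch.nn.functional as F

def pixel_aligned_features(feat_map, xyz, K, R, t):
    """Sample image features at the 2D projections of 3D points.

    feat_map: (1, C, H, W) CNN feature map of the input view
    xyz:      (N, 3) 3D query points in world coordinates
    K, R, t:  intrinsics (3, 3), rotation (3, 3), translation (3,)
    returns:  (N, C) pixel-aligned features
    """
    cam = xyz @ R.T + t                         # world -> camera coordinates
    uv = cam @ K.T                              # homogeneous pixel coordinates
    uv = uv[:, :2] / uv[:, 2:].clamp(min=1e-6)  # (N, 2); assumes points in front of the camera
    H, W = feat_map.shape[-2:]
    # normalize to [-1, 1] for grid_sample (x along width, y along height)
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1)
    grid = grid.view(1, 1, -1, 2)
    sampled = F.grid_sample(feat_map, grid, align_corners=True)  # (1, C, 1, N)
    return sampled[0, :, 0].T                                    # (N, C)

Features sampled this way tie every 3D query to the image evidence observed at its projection, which is what lets a single network generalize across scenes instead of memorizing one radiance field.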
Among the above generalizable NeRF extensions [20,37,50], no work has addressed the challenges inherent to single monocular images, such as occlusions and depth ambiguity. Instead, they attempted to resolve these issues by increasing the input information, such as using more views and temporal information, which in practice can be a limiting or even prohibitive factor for real-world applications. In this paper, we demonstrate the robustness of MonoNHR on single images by comparing it with pixelNeRF [57] and NHP [20], which are generalizable NeRF extensions applicable to monocular images.
Neural surface fields-based human rendering methods are closely related to NeRF representations, since they also aim at learning an implicit function, in this case an indicator of the interior and exterior of the observed shape. They allow for accurate and detailed 3D geometry reconstructions, but they require strong 3D supervision, such as 3D scans. 3D scans are highly costly to obtain at scale, and consequently these methods are trained on small scan datasets and tend to exhibit poor generalization capabilities to unseen human poses and appearances. PIFu [38] and its extensions [13,39] propose to estimate the 3D surface of a human using an implicit function based on pixel-aligned image features, which are obtained by projecting 3D points onto the image plane. Similar to more classical approaches, they first reconstruct 3D surfaces and then condition texture inference on surface reconstruction features, as we do. However, as DoubleField [44] pointed out, the learning space of texture is highly limited around the surface and discontinuous, which hinders the optimization. Zins et al. [61] improve PIFu in a multi-view setting by introducing an attention-based view fusion layer and a context encoding module using 3D convolutions. POSEFusion [24] takes monocular RGBD video frames as an input and learns to fuse multiple surface estimations from different time steps. DoubleField jointly learns neural radiance and surface fields, and uses raw RGB pixel values to render high-resolution images.
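For context, PIFu [38] formulates the surface as a level set of a pixel-aligned implicit function, roughly of the form (notation simplified from the original paper):

f\big(F(\pi(\mathbf{X})),\, z(\mathbf{X})\big) = s,

where \pi(\mathbf{X}) is the 2D projection of a 3D point \mathbf{X}, F(\cdot) the image feature sampled at that pixel, z(\mathbf{X}) the depth of \mathbf{X} in camera coordinates, and s an occupancy value (or an RGB color for the texture counterpart). The ground-truth occupancy and color targets for f are what require the 3D scans discussed above.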
Compared to the above neural surface fields-based human rendering methods, MonoNHR has two clear differences. First, it does not require 3D scans for training and follows the NeRF-based human rendering pipeline, i.e. it uses a weak supervision signal from multi-view images. Although NeRF-based human rendering methods, including ours, use SMPL fits, these fits are much easier to obtain than 3D scans thanks to existing powerful 3D human pose and shape estimation methods [19,31]. On the other hand, special and expensive equipment, e.g. over 100 synchronized multi-view cameras, is necessary to generate accurate 3D scans, making them difficult to obtain at a large scale. Please note that 3D scans obtained from a small number of cameras, e.g. with COLMAP [41,42], are not accurate enough to provide supervision targets for PIFu and its variants, as discussed in [22,36]. In consequence, PIFu and its variants are trained on small-scale datasets and tend not to generalize well to unseen data, especially non-upright standing poses, as discussed in [20,36]. Second, the absence of 3D scans prevents explicit 3D geometry supervision, making the disentanglement of geometry and texture non-trivial. We extract geometry-dedicated features for this disentanglement and use them solely for density estimation, not for RGB estimation.
3. MonoNHR
The overall pipeline of MonoNHR is detailed in Figure 2. It is trained in an end-to-end manner and consists of an image feature backbone, a Mesh Inpainter, a geometry branch, and a texture branch.