pearances without relying on explicit geometry formulation
using only a single input view.
NeRF-based human rendering methods implicitly encode
a dense scene radiance field in the form of a density and
a color for a given 3D query point and viewing direction
via a neural network. One of their advantages is that they
do not require 3D supervision, and instead rely on 2D su-
pervision from multi-view images. The main drawback of
the original formulation, however, is that a NeRF model has to be optimized anew for each scene; it is, in essence, a per-scene optimization scheme rather than a learning approach.
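For reference, the original formulation maps a 3D point \mathbf{x} and a viewing direction \mathbf{d} to a density and a color, and synthesizes a pixel by volume rendering along its camera ray \mathbf{r}(t) = \mathbf{o} + t\mathbf{d}:

F_\Theta(\mathbf{x}, \mathbf{d}) = (\sigma, \mathbf{c}), \qquad \hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt, \qquad T(t) = \exp\Big(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\Big),

so that training only requires a photometric loss between \hat{C}(\mathbf{r}) and the observed pixel colors of multi-view images.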
One stream of NeRF-based approaches for human render-
ing is building a subject-specific representation. NHR [52]
takes a sequence of point clouds as input and renders novel images conditioned on 80 input viewpoints.
NB [36] utilizes a 3D human mesh model (i.e. SMPL [26])
and subject-specific latent codes, to construct the 3D latent
code volume, which is used for density and color regres-
sion of any 3D point bounded by a given 3D human mesh.
Other works [2,25,35] deform observation-space 3D points
to the canonical 3D space using inverse bone transforma-
tions, and learn the neural radiance fields. The canonical
3D space represents the pose normalized space around the
template human mesh. NARF [33] learns neural radiance
fields per human part, and trains an autoencoder to encode
human appearances using a synthetic human dataset.
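The canonicalization in [2,25,35] typically follows inverse linear blend skinning; as a rough sketch (exact weighting schemes vary per method), an observation-space point \mathbf{x}, written in homogeneous coordinates, is mapped to the canonical space via

\mathbf{x}_{\text{can}} = \Big(\sum_{k=1}^{K} w_k(\mathbf{x})\, G_k\Big)^{-1} \mathbf{x},

where G_k denotes the rigid transformation of bone k under the observed pose and w_k(\mathbf{x}) are skinning weights, commonly borrowed from the nearest SMPL vertex.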
Recently, pixelNeRF [57], IBRNet [49], and SRF [3]
proposed to combine an image-based feature encoding and
NeRF. Instead of memorizing the scene radiance in a 3D
space, their networks estimate it based on pixel-aligned im-
age features. We chose to adopt this strategy, as it allows a
single network to be trained across multiple scenes to learn
a scene prior, enabling it to generalize to unseen scenes in a
feed-forward manner from a sparse set of views. PVA [37],
Wang et al. [50], and NHP [20] also apply this approach
to human rendering. In particular, NHP targets human body rendering, which is also our interest, using sparse multi-view videos. Given a GT SMPL mesh, it exploits pixel-aligned image features to construct a 3D latent volume.
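To make this pixel-aligned conditioning concrete, the sketch below (PyTorch-style; the function and tensor names are illustrative, and a skew-free pinhole camera is assumed) projects 3D query points onto the input image and bilinearly samples the backbone feature map, yielding one feature vector per query point that conditions density and color regression instead of a memorized per-scene field.

```python
import torch
import torch.nn.functional as F

def pixel_aligned_features(points, feat_map, K, R, t):
    """Sample image features at the 2D projections of 3D query points.

    points:   (N, 3) 3D query points in world coordinates.
    feat_map: (1, C, H, W) feature map from the image backbone.
    K, R, t:  camera intrinsics (3, 3), rotation (3, 3), translation (3,).
    Returns:  (N, C) pixel-aligned features (simplified sketch).
    """
    # Transform to camera coordinates, then apply a pinhole projection.
    cam = points @ R.T + t                          # (N, 3)
    uv = cam[:, :2] / cam[:, 2:3].clamp(min=1e-6)   # normalized image plane
    uv = uv @ K[:2, :2].T + K[:2, 2]                # pixel coordinates (N, 2)

    # Normalize to [-1, 1] for grid_sample (x along width, y along height).
    H, W = feat_map.shape[-2:]
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1)
    grid = grid.view(1, 1, -1, 2)                   # (1, 1, N, 2)

    # Bilinear sampling gives one feature vector per query point.
    sampled = F.grid_sample(feat_map, grid, align_corners=True)  # (1, C, 1, N)
    return sampled[0, :, 0].T                       # (N, C)
```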
Among the above generalizable NeRF extensions [20,
37,50], no work has addressed the challenges inherent to
single monocular images, such as occlusions and depth am-
biguity. Instead, they attempted to resolve the issues by in-
creasing the input information, such as using more views
and temporal information, which in practice can be a lim-
iting or even prohibitive factor for real-world applications.
In this paper, we demonstrate the robustness of MonoNHR
on single images by comparing it with pixelNeRF [57] and
NHP [20], which are generalizable NeRF extensions and
are applicable to monocular images.
Neural surface fields-based human rendering methods
are closely related to NeRF representations, since they also
aim at learning an implicit function, in this case, an indica-
tor of the interior and exterior of the observed shape. They
allow for accurate and detailed 3D geometry reconstruc-
tions, but they require strong 3D supervision, such as 3D
scans. 3D scans are highly costly to obtain at scale; consequently, such methods are trained on small scan datasets and tend to generalize poorly to unseen human poses and appearances. PIFu [38] and its ex-
tensions [13,39] propose to estimate the 3D surface of a
human using an implicit function based on pixel-aligned
image features. The pixel-aligned image features are ob-
tained by projecting 3D points onto the image plane. Similar to more classical approaches, they first reconstruct 3D surfaces and then condition texture inference on surface reconstruction features, as we do. However, as DoubleField [44] pointed out, the learning space of texture is highly limited around the surface and discontinuous, which hinders optimization. Zins et al. [61] improve PIFu in a
multi-view setting by introducing an attention-based view
fusion layer and a context encoding module using 3D con-
volutions. POSEFusion [24] takes monocular RGBD video frames as input and learns to fuse surface estimates from different time steps. DoubleField jointly learns
neural radiance and surface fields, and uses raw RGB pixel
values to render high-resolution images.
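For contrast with the radiance formulation above, the pixel-aligned implicit function of PIFu can be summarized as predicting an inside/outside indicator rather than density and color,

f\big(\Phi(\pi(\mathbf{x}), I),\, z(\mathbf{x})\big) = \hat{o}(\mathbf{x}) \in [0, 1],

where \pi(\mathbf{x}) is the 2D projection of a 3D point \mathbf{x} onto the image I, \Phi the image feature sampled at that pixel, and z(\mathbf{x}) the depth along the camera axis; the surface is extracted as the 0.5 level set of \hat{o}. Supervising \hat{o} requires ground-truth occupancy labels sampled from 3D scans, which is precisely the dependency discussed above.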
Compared to the above neural surface fields-based hu-
man rendering methods, MonoNHR has two clear differ-
ences. First, it does not require 3D scans for training and
follows the NeRF-based human rendering pipeline, i.e. using
a weak supervision signal from multi-view images. Al-
though NeRF-based human rendering methods, including
ours, use SMPL fits, these are much easier to obtain than 3D scans, thanks to existing powerful 3D human pose and shape estimation methods [19,31]. On the other hand, special and expensive equipment, e.g. over 100 synchronized multi-view cameras, is necessary to generate accurate 3D scans, making them difficult to obtain at a large scale.
Please note that 3D scans obtained from a small number of
cameras, e.g. via COLMAP [41,42], are not accurate enough
to provide supervision targets for PIFu and its variants, as
discussed in [22,36]. In consequence, PIFu and its vari-
ants are trained on small scale datasets and tend not to gen-
eralize well to unseen data, especially non-upright stand-
ing poses, as discussed in [20,36].Second, the absence of
3D scans prevents explicit 3D geometry supervision, mak-
ing disentangling geometry and texture non-trivial. We ex-
tract geometry-dedicated features for disentanglement and
use them for density estimation without RGB estimation.
3. MonoNHR
The overall pipeline of MonoNHR is detailed in Fig-
ure 2. It is trained in an end-to-end manner and consists
of an image feature backbone, a Mesh Inpainter, a geometry
branch, and a texture branch.
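Before detailing each component, the following sketch summarizes how they interact; the module interfaces, signatures, and the role given to the Mesh Inpainter output are illustrative assumptions based on the description above, not the actual implementation. It captures the disentanglement discussed in Section 2: the geometry branch predicts density from geometry-dedicated features without any RGB estimation, and texture inference is conditioned on those features.

```python
def mononhr_forward(image, smpl_mesh, query_points,
                    backbone, mesh_inpainter, geometry_branch, texture_branch):
    """Illustrative forward pass; all interfaces below are assumptions.

    It encodes the idea stated in the text: geometry-dedicated features
    are used for density estimation without RGB estimation, and texture
    inference is conditioned on them, disentangling geometry from texture
    even though no 3D scan supervision is available.
    """
    feat_map = backbone(image)                        # pixel-aligned image features
    mesh_feat = mesh_inpainter(smpl_mesh, feat_map)   # hypothetical: features on SMPL vertices

    geo_feat = geometry_branch.features(query_points, feat_map, mesh_feat)
    sigma = geometry_branch.density(geo_feat)         # density only, no color here

    rgb = texture_branch(query_points, feat_map, mesh_feat, geo_feat)
    return sigma, rgb                                 # composited by volume rendering as usual
```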