Self-Supervised 3D Human Pose Estimation in Static Video Via Neural Rendering

Luca Schmidtke1,2, Benjamin Hou1, Athanasios Vlontzos1, and Bernhard Kainz1,2

1 Imperial College London, UK

2 Friedrich-Alexander-Universität Erlangen-Nürnberg, DE

Abstract. Inferring 3D human pose from 2D images is a challenging and long-standing problem in the field of computer vision with many applications, including motion capture, virtual reality, surveillance, and gait analysis for sports and medicine. We present preliminary results for a method that estimates 3D pose from 2D video containing a single person and a static background, without the need for any manual landmark annotations. We achieve this by formulating a simple yet effective self-supervision task: our model is required to reconstruct a random frame of a video given a frame from another timepoint and a rendered image of a transformed human shape template. Crucially for optimisation, our ray-casting-based rendering pipeline is fully differentiable, enabling end-to-end training based solely on the reconstruction task.

Keywords: self-supervised learning, 3D human pose estimation, 3D pose tracking, motion capture

1 Introduction

Inferring 3D properties of our world from 2D images is an intriguing open problem in computer vision, even more so when no direct supervision is provided in the form of labels. Although this problem is inherently ill-posed, humans are able to derive accurate depth estimates, even when their vision is impaired, from motion cues and semantic prior knowledge about the perceived world around them. This is especially true for human pose estimation. Self-supervised learning has proven to be an effective technique for utilising large amounts of unlabelled video and image sources. On a more fundamental note, self-supervised learning is hypothesised to be an essential component in the emergence of intelligence and cognition. Moreover, self-supervised approaches allow for more flexibility in domains such as the medical sector, where labels are often hard to come by. In this paper we focus on self-supervised 3D pose estimation from monocular video, a key element of a wide range of applications including motion capture, visual surveillance, and gait analysis.

arXiv:2210.04514v1 [cs.CV] 10 Oct 2022

Inspired by previous work, we model pose as a factor of variation across the frames of a video showing a single person in front of a static background. More formally, self-supervision is provided by formulating a conditional image reconstruction task: given a pose that differs from the one in the current image, what would the image look like when conditioned on that pose? Unlike previous work, we choose to represent pose as a 3D template consisting of connected parts, which we transform and project to two-dimensional image space, thereby inferring 3D pose from monocular images without explicit supervision.
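The transform-and-project step can be illustrated with a minimal sketch. It assumes a pinhole camera with a hypothetical focal length and a template limb described by two 3D points that is rotated about the y-axis and shifted in front of the camera. How the template and camera are actually parameterised is not stated in this section, so every name and value below is an illustrative assumption.

```python
import math

def rotate_y(point, angle):
    """Rotate a 3D point about the y-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def translate(point, offset):
    """Shift a 3D point by a 3D offset."""
    return tuple(p + o for p, o in zip(point, offset))

def project_pinhole(point, focal_length=1.0):
    """Perspective projection of a 3D point (assumed pinhole camera at the origin)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal_length * x / z, focal_length * y / z)

def pose_and_project(template, angle, offset):
    """Apply the rigid transform to every template point, then project to 2D."""
    return [project_pinhole(translate(rotate_y(p, angle), offset)) for p in template]

# Hypothetical template part: a unit-length limb along the x-axis.
limb = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
camera_offset = (0.0, 0.0, 4.0)

frontal = pose_and_project(limb, 0.0, camera_offset)
turned = pose_and_project(limb, math.pi / 2, camera_offset)
```

In the frontal pose the limb spans a visible length in the image, whereas after a 90 degree turn both end points project to almost the same pixel. This foreshortening is the kind of 2D evidence from which a 3D rotation can in principle be inferred.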

More specifically, our method builds upon the recent emergence and success of combining deep neural networks with an explicit 3D-to-2D image formation process through fully differentiable rendering pipelines. This inverse-graphics approach follows the analysis-by-synthesis principle of generative models in a broader context: we hope to extract information about the 3D properties of objects in our world by trying to recreate their perceived appearance in 2D images. Popular rendering techniques rely on different representations, including meshes and polygons, point clouds, or implicit surfaces. In our work we make use of volume rendering with a simple occupancy function (or density) combined with a texture field, which together assign an occupancy in $[0, 1]$ and an RGB colour value $c \in \mathbb{R}^3$ to every point defined on a regular 3D grid.
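To make the role of occupancy and colour in volume rendering concrete, the following is a minimal sketch of front-to-back compositing along a single camera ray. It assumes the common emission-absorption style accumulation in which each sample contributes its colour weighted by its occupancy and by the light not yet absorbed by earlier samples. The exact rendering equation of the method is not given in this section, so the function name and the weighting scheme are assumptions for illustration only.

```python
def composite_ray(occupancies, colours):
    """Front-to-back compositing of samples along one ray.

    occupancies: values in [0, 1], ordered from nearest to farthest sample.
    colours: matching (r, g, b) tuples, one per sample.
    Returns the accumulated colour and the remaining transmittance.
    """
    rendered = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    for alpha, colour in zip(occupancies, colours):
        weight = transmittance * alpha  # contribution of this sample
        for k in range(3):
            rendered[k] += weight * colour[k]
        transmittance *= 1.0 - alpha  # light left for samples further away
    return tuple(rendered), transmittance

# Three samples along a ray: empty space, a half-occupied green point,
# and a fully occupied blue point behind it.
pixel, remaining = composite_ray(
    [0.0, 0.5, 1.0],
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
)
```

Because the result is built only from sums and products of occupancies and colours, it is differentiable with respect to both, which is the property that allows gradients of a reconstruction loss to flow back through a renderer of this kind.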

2 Related Work

Monocular 3D Human Pose Estimation. Human pose estimation in general is a long-standing problem in computer vision with a large associated body of work and substantial improvements since the advent of deep-learning-based approaches. Inferring 3D pose from monocular images, however, remains a challenging problem. It is typically tackled by making use of additional cues in the image or video, such as motion or multiple views from synchronised cameras, or by introducing prior knowledge about the hierarchical, part-based structure of the human body.

Lifting from 2D to 3D. Many works break the problem down into first estimating 2D pose and subsequently estimating 3D pose, either directly [19], by leveraging self-supervision through transformation and reprojection [15], or by using a kd-tree to find corresponding pairs of detected 2D pose and stored 3D pose [4].

Motion Cues From Video. Videos provide a rich source of additional temporal information that can be exploited to limit the solution space. [16], [8], [2] and [10] use recurrent architectures in the form of LSTMs or GRUs to incorporate temporal context, while [23] employ temporal convolutions and a reprojection objective.

Multiple Views. Other approaches incorporate images from multiple synchronised cameras to alleviate the ill-posedness of the problem. [22], [31] and [24] fuse multiple 2D heatmaps, while [26] and [27] utilise multi-view consistency as a form of additional supervision in the objective function.

Human Body Prior. Using non-parametric belief propagation, [29] estimate the 2D pose of loosely-linked human body parts from image features and use a mixture of experts to estimate a conditional distribution of 3D poses. Many more recent approaches rely on features extracted from convolutional neural
