Self-Supervised 3D Human Pose Estimation in
Static Video Via Neural Rendering
Luca Schmidtke1,2, Benjamin Hou1, Athanasios Vlontzos1, and
Bernhard Kainz1,2
1Imperial College London, UK
2Friedrich-Alexander-Universität Erlangen-Nürnberg, DE
Abstract. Inferring 3D human pose from 2D images is a challenging
and long-standing problem in the field of computer vision with many
applications including motion capture, virtual reality, surveillance or gait
analysis for sports and medicine. We present preliminary results for a
method to estimate 3D pose from 2D video containing a single person
and a static background without the need for any manual landmark
annotations. We achieve this by formulating a simple yet effective self-
supervision task: our model is required to reconstruct a random frame
of a video given a frame from another timepoint and a rendered image
of a transformed human shape template. Crucially for optimisation, our
ray-casting-based rendering pipeline is fully differentiable, enabling
end-to-end training based solely on the reconstruction task.
Keywords: self-supervised learning, 3D human pose estimation, 3D
pose tracking, motion capture
1 Introduction
Inferring 3D properties of our world from 2D images is an intriguing open
problem in computer vision, even more so when no direct supervision is provided
in the form of labels. Although this problem is inherently ill-posed, humans are
able to derive accurate depth estimates, even when their vision is impaired,
from motion cues and semantic prior knowledge about the perceived world around
them. This is especially true for human pose estimation. Self-supervised
learning has proven to be an effective technique to utilise large amounts of
unlabelled video and image sources. On a more fundamental note, self-supervised
learning is hypothesised to be an essential component in the emergence of
intelligence and cognition. Moreover, self-supervised approaches allow for more
flexibility in domains such as the medical sector, where labels are often hard
to come by. In this paper we focus on self-supervised 3D pose estimation from
monocular video, a key element of a wide range of applications including motion
capture, visual surveillance and gait analysis.
arXiv:2210.04514v1 [cs.CV] 10 Oct 2022
Inspired by previous work, we model pose as a factor of variation across the
frames of a video of a single person in front of a static background. More
formally, self-supervision is provided by formulating a conditional image
reconstruction task: given a pose input different from the current image, what
would that image look like if we condition it on the given pose? Unlike previous
work, we choose to represent pose as a 3D template consisting of connected
parts, which we transform and project to two-dimensional image space, thereby
inferring 3D pose from monocular images without explicit supervision.
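The transform-and-project step can be illustrated with a minimal sketch. It assumes a rigid transformation per template part and a pinhole camera; neither the camera model nor the function names below are taken from the paper, and all of them are hypothetical.

```python
import numpy as np

def rotation_z(angle_rad):
    """Rotation matrix about the z axis (hypothetical helper)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def transform_part(points, rotation, translation):
    """Rigidly transform (N, 3) template points of one body part."""
    return points @ rotation.T + translation

def project_pinhole(points, focal_length=1.0):
    """Perspective projection onto the image plane (assumes z > 0)."""
    z = points[:, 2:3]
    return focal_length * points[:, :2] / z

# Hypothetical limb: the two end points of a unit-length bone along x.
limb = np.array([[0.0, 0.0, 0.0],
                 [1.0, 0.0, 0.0]])

# Rotate the limb by 90 degrees and place it two units in front of the camera.
posed = transform_part(limb, rotation_z(np.pi / 2), np.array([0.0, 0.0, 2.0]))
joints_2d = project_pinhole(posed)
```

Because every operation here is a smooth function of the rotation and translation, an image-space error can in principle be propagated back to the 3D pose parameters, which is what makes the reconstruction task usable as a training signal.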
More specifically, our method builds upon the recent emergence and success
of combining deep neural networks with an explicit 3D-to-2D image formation
process through fully differentiable rendering pipelines. This inverse-graphics
approach follows the analysis-by-synthesis principle of generative models in a
broader context: we hope to extract information about the 3D properties of
objects in our world by trying to recreate their perceived appearance in 2D
images. Popular rendering techniques rely on different representations,
including meshes and polygons, point clouds or implicit surfaces. In our work we
make use of volume rendering with a simple occupancy function, or density,
combined with a texture field. Together they assign an occupancy in $[0,1]$ and
an RGB colour value $c \in \mathbb{R}^3$ to every point defined on a regular 3D
grid.
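As a rough illustration of how occupancy and colour values can be turned into a pixel, the sketch below composites samples along a single ray from front to back. It assumes standard alpha compositing, which may differ from the exact formulation used in this work; sampling the grid along the ray (for example by trilinear interpolation) is omitted, and `composite_ray` is a hypothetical name.

```python
import numpy as np

def composite_ray(occupancy, colours):
    """Front-to-back compositing of S samples along one ray.

    occupancy: (S,) values in [0, 1], ordered from near to far.
    colours:   (S, 3) RGB values at the same sample positions.
    """
    # Fraction of light that reaches sample i without being blocked earlier.
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - occupancy[:-1]]))
    weights = occupancy * transmittance
    rgb = (weights[:, None] * colours).sum(axis=0)
    alpha = weights.sum()  # total opacity accumulated along the ray
    return rgb, alpha

# Four hypothetical samples: empty space, a half-occupied green voxel,
# a fully occupied blue voxel, and a white voxel hidden behind it.
occupancy = np.array([0.0, 0.5, 1.0, 0.7])
colours = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [1.0, 1.0, 1.0]])
rgb, alpha = composite_ray(occupancy, colours)
```

The pixel colour is a product and sum of occupancy and colour values, so it is differentiable with respect to both, which is the property a ray-casting pipeline needs for end-to-end training.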
2 Related Work
Monocular 3D Human Pose Estimation. Human pose estimation in general
is a long-standing problem in computer vision with a large associated body
of work and substantial improvements since the advent of deep-learning-based
approaches. Inferring 3D pose from monocular images, however, remains a
challenging problem. It is tackled by making use of additional cues in the image
or video, such as motion or multiple views from synchronised cameras, or by
introducing prior knowledge about the hierarchical, part-based structure of the
human body.
Lifting from 2D to 3D. Many works break the problem down into first estimating
2D pose and subsequently estimating 3D pose, either directly [19], by leveraging
self-supervision through transformation and reprojection [15], or by using a
kd-tree to find corresponding pairs of detected 2D pose and stored 3D pose [4].
Motion Cues From Video. Videos provide a rich source of additional temporal
information that can be exploited to limit the solution space. [16], [8], [2] and
[10] use recurrent architectures in the form of LSTMs or GRUs to incorporate
temporal context, while [23] employ temporal convolutions and a reprojection
objective.
Multiple Views. Other approaches incorporate images from multiple synchronised
cameras to alleviate the ill-posedness of the problem. [22], [31] and [24]
fuse multiple 2D heatmaps, while [26] and [27] utilise multi-view consistency as
a form of additional supervision in the objective function.
Human Body Prior. Using non-parametric belief propagation, [29] estimate
the 2D pose of loosely-linked human body parts from image features and use
a mixture of experts to estimate a conditional distribution of 3D poses. Many
more recent approaches rely on features extracted from convolutional neural