MM ’22, October 10–14, 2022, Lisboa, Portugal Yunhao Li et al.
CCS CONCEPTS
• Computing methodologies → Motion capture.
KEYWORDS
3D motion in-betweening; inverse kinematics; reinforcement learning;
3D animation
ACM Reference Format:
Yunhao Li, Zhenbo Yu, Yucheng Zhu, Bingbing Ni, Guangtao Zhai, and Wei
Shen. 2022. Skeleton2Humanoid: Animating Simulated Characters for Physically-
plausible Motion In-betweening. In Proceedings of the 30th ACM International
Conference on Multimedia (MM ’22), October 10–14, 2022, Lisboa, Portugal.
ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3503161.3548093
1 INTRODUCTION
Synthesizing virtual human motions that are both accurate and
realistic is a widely explored but challenging task in computer vision
and graphics [48, 49], with applications in digital twins and the
Metaverse. Recently, deep learning has made it possible to generate
accurate human motions and has been applied to various motion
synthesis tasks, such as human motion prediction
[2–4, 11, 16, 17, 53–55, 60], human motion completion [31, 58, 59] and
human motion in-betweening [1, 47, 56, 57]. Although these methods
synthesize accurate body motions with small skeleton joint errors
compared with ground-truth motions, they do not model the laws of
physics. Consequently, the synthesized motions are often physically
implausible. For example, the synthesized feet penetrate the ground,
body joints rotate to impossible angles, whole-body motions are not
smooth, and feet slide back and forth when they should remain static
on the ground. These artifacts significantly limit the application of
motion synthesis to virtual human animation and the emerging Metaverse
because they make the motions look unrealistic to human viewers.
Utilizing humanoid characters in a physics simulator to optimize
motions is a promising solution because the physics simulator can
guarantee the physical plausibility of the generated motions. Prior
works [39, 40, 52] utilized reinforcement learning (RL) to actuate
a humanoid character to imitate various reference mocap data
for creating physical character animation. Inspired by them, recent
works [8, 29] also attempted to utilize RL to imitate motions
synthesized by deep neural networks, in the format of skeletons or
SMPL [9] models, aiming at producing physically-plausible motions for
3D pose estimation. However, these methods are only validated
on simple motions such as walking and talking in the Human3.6m
dataset and do not generalize well to complex or irregular motions.
In addition, RL-based imitation requires transferring synthesized
human skeleton motions to humanoid motions, so the humanoid character
must be carefully designed to exactly match the human skeleton in
terms of both shape and kinematic tree. This requirement prevents
RL-based imitation from transferring motions between skeletons and
humanoids with different shapes and kinematic trees.
To address these issues, we propose Skeleton2Humanoid, a novel
system that improves the physical plausibility of the motions
synthesized by motion synthesis networks through the transfer from
human skeleton motions to humanoid character motions. Our
Skeleton2Humanoid system consists of three sequential stages:
(I) Test Time Motion Synthesis Network Adaptation: We adapt the
motion synthesis network with a few gradient updates on the test data
using two new self-supervised losses, a foot contact consistency loss
and a motion smoothness loss, which improve the physical plausibility
of the predicted motions.
(II) Skeleton to Humanoid Matching: We match the synthesized human
skeleton motions to humanoid character motions by a novel general
analytical inverse kinematics method. Inverse kinematics can convert
human skeleton motions to humanoid motions even when the humanoid's
body structure differs from the human skeleton.
(III) Motion Imitation based on RL: Finally, we animate the humanoid
character to imitate various synthesized motions. Specifically, based
on recent work [26, 29], we propose a curriculum residual force
control humanoid control policy (CRP) by introducing a curriculum
learning paradigm that dynamically adjusts a residual force scale
during RL training, which improves asymptotic RL performance on
imitating various synthesized motions.
To verify the effectiveness of our Skeleton2Humanoid system, we
select the motion in-betweening task for evaluation, as it is a
recently proposed and challenging motion prediction task [1, 47].
Motion in-betweening aims at predicting the transition motions between
the given past keyframes and a provided future keyframe. Experiments
on the challenging LaFAN1 dataset show the superiority of our
Skeleton2Humanoid system.
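As an illustration only, the two self-supervised losses named in stage (I) could take a form like the sketch below. This section does not give their formulas, so the finite-difference acceleration penalty, the contact-weighted foot velocity penalty, and all names (`smoothness_loss`, `foot_contact_loss`, `contact`) are assumptions rather than the formulation used by Skeleton2Humanoid.

```python
from typing import List, Sequence

Vec3 = Sequence[float]


def smoothness_loss(joints: List[List[Vec3]]) -> float:
    """Mean squared second finite difference (acceleration) of joint positions.

    joints[t][j] is the 3D position of joint j at frame t (assumed layout).
    """
    total, count = 0.0, 0
    for t in range(1, len(joints) - 1):
        for j in range(len(joints[t])):
            for k in range(3):
                acc = (joints[t + 1][j][k]
                       - 2.0 * joints[t][j][k]
                       + joints[t - 1][j][k])
                total += acc * acc
                count += 1
    return total / count if count else 0.0


def foot_contact_loss(feet: List[List[Vec3]],
                      contact: List[List[float]]) -> float:
    """Penalise foot movement on frames where the foot is marked as in contact.

    feet[t][f] is the position of foot joint f at frame t;
    contact[t][f] in [0, 1] is an assumed contact label or probability.
    """
    total, count = 0.0, 0
    for t in range(len(feet) - 1):
        for f in range(len(feet[t])):
            sq_step = sum((feet[t + 1][f][k] - feet[t][f][k]) ** 2
                          for k in range(3))
            total += contact[t][f] * sq_step
            count += 1
    return total / count if count else 0.0
```

Under these assumptions, a constant-velocity trajectory has zero smoothness loss, and a foot that moves while marked as in contact is penalised in proportion to its squared displacement. In a test-time adaptation loop, a weighted sum of the two terms would be minimised for a few gradient updates using an automatic-differentiation framework.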
The main contributions of this paper are as follows:
(1) We present Skeleton2Humanoid, a new system that converts human
skeleton motions to humanoid character motions to produce physically
plausible motions.
(2) Our proposed test time adaptation stage further improves the
prediction accuracy and physical plausibility on the large mocap
dataset LaFAN1 for the motion in-betweening task. With test time
adaptation, we set a new benchmark accuracy on the motion
in-betweening task.
(3) Our proposed curriculum residual force control policy enables
finer character control and outperforms prior art on motion imitation.
(4) Our whole Skeleton2Humanoid system significantly improves the
physical plausibility of in-betweening human motions while achieving
comparable motion prediction accuracy.
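To make the idea of a curriculum over the residual force scale concrete, the following is a minimal sketch. This section only states that the scale is adjusted dynamically during RL training; the linear schedule, the start and end values, and the names `residual_force_scale` and `scaled_residual_force` are hypothetical and are not taken from the CRP policy.

```python
from typing import List


def residual_force_scale(step: int, total_steps: int,
                         start_scale: float, end_scale: float) -> float:
    """Linearly interpolate the scale from start_scale to end_scale.

    The linear shape and both endpoints are assumptions for illustration.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start_scale + frac * (end_scale - start_scale)


def scaled_residual_force(raw_force: List[float], scale: float) -> List[float]:
    """Scale the residual force output by the policy before it is applied."""
    return [scale * f for f in raw_force]


# Example: the scale at the start, midpoint, and end of a 100-step curriculum.
schedule = [residual_force_scale(s, 100, 1.0, 0.2) for s in (0, 50, 100)]
```

Keeping the direction of the schedule as a parameter (`start_scale` versus `end_scale`) avoids committing to whether the residual force is relaxed or tightened as training progresses, which this section does not specify.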
2 RELATED WORK
Human/character motion synthesis: Motion synthesis is a general
term covering several tasks, including motion prediction,
in-betweening and completion. Motion prediction aims at predicting
future human motions given past motions. Deterministic motion
prediction estimates a single accurate motion, and prior works
used various network architectures, including recurrent neural
networks [2–4], graph convolution networks [61] or transformers [16],
to model human motions. Stochastic motion prediction produces
diverse future human motions by utilizing generative models such
as VAEs [6, 17, 55, 66] and GANs [12, 14, 65]. Motion completion and
in-betweening aim at filling gaps in a motion under predefined
keyframe constraints. Current works utilize convolution networks
[31, 57, 59, 62], recurrent networks [1, 63] or transformers [47] to
synthesize accurate and consistent results. For instance, Harvey et
al. [1] proposed a transition generation technique based on recurrent
neural networks for the motion in-betweening task. Duan et al. [47]
utilized a transformer architecture to model human motions in a
sequence-to-sequence manner for the motion in-betweening task.
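For readers new to the task, the input/output structure of motion in-betweening can be illustrated with a naive linear interpolation between the last given frame and the target keyframe. This is not one of the learned methods cited above; the function name `lerp_inbetween` and the flat per-frame pose vector are assumptions made for this sketch, and joint rotations would normally be interpolated on the rotation manifold rather than linearly.

```python
from typing import List


def lerp_inbetween(last_pose: List[float], target_pose: List[float],
                   n_missing: int) -> List[List[float]]:
    """Fill n_missing frames between two poses by linear interpolation.

    Each pose is a flat list of coordinates (assumed representation).
    """
    frames = []
    for i in range(1, n_missing + 1):
        alpha = i / (n_missing + 1)  # fraction of the way to the target
        frames.append([(1.0 - alpha) * p + alpha * q
                       for p, q in zip(last_pose, target_pose)])
    return frames


# Example: three transition frames between two 2-D toy poses.
transition = lerp_inbetween([0.0, 0.0], [4.0, 8.0], 3)
```

Learned in-betweening models take the same inputs (past frames and a future keyframe) but predict the missing frames with a network rather than a straight-line blend.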