Cross-identity Video Motion Retargeting with
Joint Transformation and Synthesis
Haomiao Ni¹, Yihao Liu², Sharon X. Huang¹, Yuan Xue²
1The Pennsylvania State University, University Park, PA, USA
2Johns Hopkins University, Baltimore, MD, USA
1{hfn5052, suh972}@psu.edu 2{yliu236, yuanxue}@jhu.edu
Figure 1: Examples of video motion retargeting, where motion from the driving video (1st column in 2nd&3rd rows) is
transferred to the subject in the subject video (1st row). Videos generated by RegionMM [41] for face video and EDN [4] for
dance video are shown in the 2nd column of each block, 2nd&3rd rows. Videos generated by our proposed TS-Net are in the
3rd column of each block, 2nd&3rd rows (highlighted with blue boxes).
Abstract
In this paper, we propose a novel dual-branch
Transformation-Synthesis network (TS-Net), for video mo-
tion retargeting. Given one subject video and one driving
video, TS-Net can produce a new plausible video with the
subject appearance of the subject video and motion pattern
of the driving video. TS-Net consists of a warp-based trans-
formation branch and a warp-free synthesis branch. The
novel design of dual branches combines the strengths of
deformation-grid-based transformation and warp-free gen-
eration for better identity preservation and robustness to oc-
clusion in the synthesized videos. A mask-aware similarity
module is further introduced to the transformation branch
to reduce computational overhead. Experimental results
on face and dance datasets show that TS-Net achieves bet-
ter performance in video motion retargeting than several
state-of-the-art models as well as its single-branch vari-
ants. Our code is available at https://github.com/
nihaomiao/WACV23_TSNet.
1. Introduction
Motion retargeting aims to transfer motion from a driv-
ing video to a target video while maintaining the subject’s
identity of the target video. It has become an important
topic due to its practical applications in special effects, vir-
tual/augmented reality and video editing, etc. Motion retar-
geting in the image domain has been explored extensively
and compelling results have been shown in many tasks,
such as person image generation [1, 27, 32, 37, 41], and
facial expression generation [5, 21, 34, 55]. Often formu-
lated as a guided video synthesis task, motion retargeting
between videos is known to be more challenging than mo-
tion retargeting between images since the temporal dynam-
ics of the motion to be transferred has to be learned [6].
Moreover, synthesizing realistic videos, especially human
motion videos, is more challenging than the generation of
high-quality images because human perception is sensitive
to unnatural temporal changes, and human motion is often
highly articulated [52, 54]. In this paper, we mainly fo-
cus on video motion retargeting between different human
subjects (Fig. 1). Given one subject video and one driving
video, we aim to synthesize a new plausible video with the
same identity of the person from the subject video and the
same motion as the person in the driving video.
Recent works in video motion retargeting [2, 4, 6, 9, 14,
18, 46, 47, 49, 50, 52, 54] have shown impressive progress.
To capture the temporal relationship among video frames,
prior works [6, 49, 50] generated frames via warping subject
frames by motion flow, which is usually extracted by specif-
ically designed warping field estimators, such as FlowNet
[8] or first-order approximation [39]. While warp-based
systems can generally preserve subject identity well, tradi-
tional flow-based warping may suffer from occlusion and
large motion due to its requirement of learning a warp
field with point-to-point correspondence between frames
[15]. Other methods [2, 4, 12, 20, 52, 54] utilized warp-
free (direct) synthesis with a conditional GAN-style struc-
ture [16, 30, 48]. To ease the challenge of direct synthesis,
they often employed feature disentanglement/decomposition
[52] or followed the state-of-the-art generator architectures
[31, 48] to add various connections among inputs, the en-
coder, and the decoder network. Unlike warp-based gen-
eration, direct synthesis is not limited to only using pixels
from reference images, and therefore is easier to synthesize
novel pixels for unseen/occluded objects. However, such
flexibility can also lead to identity leakage [12], i.e., iden-
tity changes in the generated video.
Considering that warp-based synthesis can better pre-
serve identity while warp-free generation helps produce
new pixels, in this paper, we propose a novel video motion
retargeting framework, termed Transformation-Synthesis
Network, or TS-Net for short, to combine their advantages.
TS-Net has a dual branch structure which consists of a
transformation branch and a synthesis branch. The net-
work architectures within the two branches are inherently
different, thus learning via the two branches can be re-
garded as a special multi-view learning case [51]. Unlike
the popular warp-based methods using specially designed
optical flow estimators [38, 39, 41] and inspired by [26],
our proposed transformation branch computes deformation
flow by weighting the regular grid with a spatial similar-
ity matrix between driving mask features and subject im-
age features. The computation of similarity takes multiple
correspondences into consideration; thus it can better alle-
viate occlusion and handle large motion. We also design a
mask-aware similarity to avoid comparing all pairs of points
within the feature maps, which is more efficient than tra-
ditional similarity computation methods. In our synthesis
branch, we use a fully-convolutional fusion network. Fea-
tures of two branches are concatenated and fed to the de-
coder network to generate realistic video frames. Experi-
ments in Sec. 4 show the effectiveness of this simple con-
catenation strategy.
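To make the dual-branch idea concrete, below is a minimal PyTorch-style sketch under stated assumptions: `subject_feat`, `warp_flow`, `synth_feat`, and `decoder` are hypothetical names, the sampling grid is assumed to be already normalized to [-1, 1], and the actual TS-Net additionally passes features through its fusion network before decoding. This illustrates the concatenation strategy, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def dual_branch_fuse(subject_feat, warp_flow, synth_feat, decoder):
    """Sketch of combining the two branches for one frame.

    subject_feat: (B, C, m, m) encoded subject feature
    warp_flow:    (B, m, m, 2) sampling grid with coordinates in [-1, 1]
    synth_feat:   (B, C, m, m) warp-free feature from the synthesis branch
    decoder:      any module mapping (B, 2C, m, m) -> output frame
    """
    # Transformation branch: warp the subject feature with the estimated grid.
    warped = F.grid_sample(subject_feat, warp_flow, align_corners=True)
    # Concatenate warped and directly synthesized features, then decode.
    fused = torch.cat([warped, synth_feat], dim=1)
    return decoder(fused)
```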
Merely based on sparse 2D masks of driving videos, our
proposed TS-Net can consistently achieve state-of-the-art
results for both face and dance videos, successfully model-
ing hair and clothes details and their motion. TS-Net also
handles large motions and preserves identity better when
compared with other state-of-the-art methods, as shown in
Fig. 1. Our contributions are summarized as follows:
1. We propose a novel dual branch video motion retar-
geting network TS-Net to generate identity-preserving
and temporally coherent videos via joint learning of
transformation and synthesis.
2. We utilize a simple yet effective way to estimate the defor-
mation grid based on a similarity matrix. Mask-aware
similarity is adopted to further reduce computation
overhead.
3. Comprehensive experiments on facial motion and
body motion retargeting tasks show that TS-Net can
achieve state-of-the-art results by only using sparse 2D
masks.
2. Related Work
Guided Image Generation. For conditional image gen-
eration, many works focused on generation tasks guided
by specific conditions such as pose-guided person image
synthesis [1, 27, 32, 36, 40, 43] and conditioned facial ex-
pression generation [5, 12, 34]. Pose-guided person image
generation can produce person images in arbitrary poses,
based on a subject image of that person and a novel pose
from the driving image. Ma et al. [27] proposed a two-
stage coarse-to-fine Pose Guided Person Generation Net-
work (PG2), which utilizes pose integration and image re-
finement to generate high-quality person images. Condi-
tioned facial expression generation aims to generate a reen-
acted face which shows the same expression as the driv-
ing face image while preserving the identity of the sub-
ject image. Chen et al. [5] proposed a two-stage frame-
work called PuppeteerGAN, which first performs expres-
sion retargeting by the sketching network and then exe-
cutes appearance transformation by the coloring network.
Though these works have shown promising results, they are
restricted to a specific object category (face or human body).
Several recent works [37, 38, 41, 55] have proposed general
guided image generation in various domains. Most of these
works [38, 39, 41, 44] apply motion flow to image animation be-
cause it can model the physical dynamics. Siarohin et al.
[39] proposed a general self-supervised first-order-motion
model for estimating dense motion flow to animate arbitrary
objects using learned keypoints and local affine transforma-
tions. In [41], the authors further improved their network
by modeling object movement through unsupervised region
detection. Despite building upon a similar motion flow, instead of adopting the complicated modeling in [39, 41], the transformation branch in TS-Net generates the deformation flow by weighting a regular grid with a similarity matrix in feature space, which is simpler and more efficient.
Figure 2: Illustration of the TS-Net generator to generate one frame $\hat{x}$ in the target video.
Video Motion Retargeting. Different from image-based
generation, video motion retargeting is more challenging
due to the additional coherence requirements in the tempo-
ral dimension. Most existing literature focused on specific
domains such as human pose motion retargeting [4, 52],
or facial expression retargeting [9, 12, 20, 49, 50, 54], yet
they may lack generality when applied to multiple domains.
In contrast, our proposed TS-Net can work well on both
face and human body videos. Using off-the-shelf detec-
tors to extract driving motion masks, such as 3D masks
[9, 12, 20], 2D dense mask [46, 47], or 2D sparse mask
[4, 54], is also popular in current video motion retarget-
ing methods. Due to the simplicity of 2D sparse masks,
our proposed TS-Net also utilizes keypoints extracted by
Dlib [22] and OpenPose [3] to synthesize videos of 3D hu-
man face/body. To learn representation and preserve input
information effectively, most recent methods are based on
state-of-the-art generators with U-Net structure and AdaIN
module [12, 20, 46, 54], feature disentanglement/decomposition
[49, 52], or specifically designed motion flow estimators
[6, 46, 50]. On the contrary, our proposed TS-Net uses a
more robust and general GAN generator [19] as backbone to
jointly learn transformation and synthesis. Some previous
works [46, 47] also performed video motion retargeting by
combining warp-based and warp-free generation. However,
their warping flows are always applied to previously gener-
ated frames, which may lead to the accumulation of syn-
thesis artifacts. Our proposed TS-Net instead computes the
warping flow between driving mask and real subject images
in feature space to avoid this issue.
3. Methodology
3.1. Model Architecture
Given a sequence $X = \{x_1, x_2, \dots, x_K\}$ with $K$ subject frames, their corresponding mask sequence $Y = \{y_1, y_2, \dots, y_K\}$, and a mask frame $z$ from the driving video, the TS-Net generator can produce a new video frame $\hat{x}$ with the subject from $X$ and the mask from $z$. Masks are generated by applying off-the-shelf pretrained 2D sparse keypoint detectors, i.e., Dlib [22] for face landmark detection and OpenPose [3] for pose keypoint estimation. As illustrated in Fig. 2, the TS-Net generator consists of two branches: a transformation branch $\Gamma_{\text{tra}}$ and a synthesis branch $\Gamma_{\text{syn}}$ for generating the new video frame using warp-based transformation and direct synthesis, respectively.
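As a concrete illustration of the mask-extraction step, the following sketch rasterizes Dlib's 68 face landmarks into a sparse 2D mask image. The predictor file name refers to the standard pre-trained model that must be downloaded separately, and the exact keypoint rendering used in TS-Net (e.g., isolated points vs. connected landmarks) may differ, so this is only an example of applying an off-the-shelf detector.

```python
import cv2
import dlib
import numpy as np

# Standard Dlib components; the 68-landmark predictor file is downloaded separately.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def face_keypoint_mask(frame_bgr):
    """Rasterize Dlib's 68 face landmarks into a sparse 2D mask image."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    mask = np.zeros(gray.shape, dtype=np.uint8)
    for rect in detector(gray, 1):          # detect faces (with 1x upsampling)
        shape = predictor(gray, rect)       # 68 landmark points for this face
        for i in range(68):
            pt = shape.part(i)
            cv2.circle(mask, (pt.x, pt.y), 2, 255, -1)  # draw each keypoint
    return mask
```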
During training, we concatenate the $K$ subject frames $X$ and their masks $Y$ and feed them to an image encoder to extract subject embedding features $E = \{e_1, e_2, \dots, e_K\}$. A mask encoder encodes the input driving mask $z$ into a driving embedding feature $f$. To reduce the computational cost of matrix multiplication, TS-Net operates in a low-resolution feature space, where the spatial size of $E$ and $f$ is only $1/8^2$ of that of the input frames. We then feed $E$ and $f$ to the transformation branch $\Gamma_{\text{tra}}$ and the synthesis branch $\Gamma_{\text{syn}}$, as described below.
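For intuition on the low-resolution feature space, here is a toy encoder sketch: three stride-2 convolutions give an 8x downsampling in each spatial dimension, so the feature map covers $1/8^2$ of the input area. The channel widths and normalization choices are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SimpleEncoder(nn.Module):
    """Toy encoder: three stride-2 convolutions give an 8x spatial downsampling,
    so a 256x256 input becomes a 32x32 feature map (1/8^2 of the input area).
    Channel widths are illustrative, not taken from the paper."""
    def __init__(self, in_ch, feat_ch=256):
        super().__init__()
        layers, ch = [], in_ch
        for out_ch in (64, 128, feat_ch):
            layers += [nn.Conv2d(ch, out_ch, 3, stride=2, padding=1),
                       nn.InstanceNorm2d(out_ch), nn.ReLU(inplace=True)]
            ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# e.g., a subject frame concatenated with its mask along the channel axis:
# e = SimpleEncoder(in_ch=3 + 1)(torch.cat([frame, mask], dim=1))  # (B, 256, H/8, W/8)
```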
Transformation Branch. Inside $\Gamma_{\text{tra}}$, we implement warp-based transformation using spatial sampling grids [17]. We first compute the cosine similarity matrix $S_k$ between the driving embedding feature $f$ and the $k$-th subject feature $e_k$ as
$$S_k^{pq} = \frac{e_k^p \cdot f^q}{\|e_k^p\|_2\,\|f^q\|_2}, \qquad (1)$$
where $S_k^{pq}$ is the affinity value between $f^q$ at position $q$ in map $f$ and $e_k^p$ at position $p$ in map $e_k$, and $\|\cdot\|_2$ denotes the L2 norm. Suppose that the sizes of the features $f$ and $e_k$ are $m \times m$; then the size of the matrix $S_k$ will be $m^2 \times m^2$, which is
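A minimal sketch of Eq. (1), assuming PyTorch tensors of shape (C, m, m): normalizing each spatial position's feature vector to unit length and taking a matrix product yields the full $m^2 \times m^2$ cosine-similarity matrix. The mask-aware variant described earlier would further restrict the compared positions to those covered by the masks; that detail is omitted here.

```python
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(e_k, f):
    """Eq. (1): pairwise cosine similarity between all positions of two feature maps.

    e_k, f: (C, m, m) feature maps.
    Returns S_k of shape (m*m, m*m), where S_k[p, q] compares e_k at position p
    with f at position q.
    """
    C, m, _ = e_k.shape
    e_flat = F.normalize(e_k.reshape(C, m * m), dim=0)   # unit-norm feature columns
    f_flat = F.normalize(f.reshape(C, m * m), dim=0)
    return e_flat.t() @ f_flat                           # (m^2, m^2) similarity matrix
```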