subjects (Fig. 1). Given one subject video and one driving video, we aim to synthesize a new, plausible video that preserves the identity of the person in the subject video while following the motion of the person in the driving video.
Recent works in video motion retargeting [2, 4, 6, 9, 14,
18, 46, 47, 49, 50, 52, 54] have shown impressive progress.
To capture the temporal relationship among video frames,
prior works [6, 49, 50] generated frames by warping subject frames with motion flow, which is usually extracted by specially designed warp-field estimators such as FlowNet [8] or a first-order approximation [39]. While warp-based
systems can generally preserve subject identity well, traditional flow-based warping may struggle with occlusion and large motion because it requires learning a warp field with point-to-point correspondence between frames [15]. Other methods [2, 4, 12, 20, 52, 54] utilized warp-free (direct) synthesis with a conditional GAN-style structure [16, 30, 48]. To ease the challenge of direct synthesis, they often employed feature disentanglement/decomposition [52] or followed state-of-the-art generator architectures [31, 48] to add various connections among the inputs, the encoder, and the decoder network. Unlike warp-based generation, direct synthesis is not limited to reusing pixels from reference images, and can therefore more easily synthesize novel pixels for unseen/occluded objects. However, such flexibility can also lead to identity leakage [12], i.e., identity changes in the generated video.
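For intuition, the warp-based pipeline above can be pictured as resampling the subject frame along a dense motion flow. The following PyTorch sketch is purely illustrative (it does not reproduce any cited method, and all names are our own placeholders); it also makes clear why a pure warp cannot invent content for occluded regions, since every output pixel is copied from somewhere in the subject frame.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(subject_frame, flow):
    """Warp a subject frame with a dense motion flow (illustrative only).

    subject_frame: (N, C, H, W) pixels from the subject video.
    flow:          (N, 2, H, W) per-pixel (x, y) displacement in pixels,
                   e.g. produced by a flow estimator such as FlowNet.
    """
    n, _, h, w = subject_frame.shape
    device = subject_frame.device
    # Regular sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(torch.linspace(-1.0, 1.0, h, device=device),
                            torch.linspace(-1.0, 1.0, w, device=device),
                            indexing="ij")
    base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, h, w, 2)
    # Convert pixel displacements to normalized offsets and deform the grid.
    offsets = torch.stack((2.0 * flow[:, 0] / max(w - 1, 1),
                           2.0 * flow[:, 1] / max(h - 1, 1)), dim=-1)
    # Every output pixel is bilinearly copied from a source location, so
    # unseen/occluded content cannot be hallucinated by warping alone.
    return F.grid_sample(subject_frame, base_grid + offsets, mode="bilinear",
                         padding_mode="border", align_corners=True)
```

With a zero flow this reduces to an identity mapping of the subject frame; identity is preserved by construction, which is precisely the strength of warp-based synthesis noted above.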
Considering that warp-based synthesis can better pre-
serve identity while warp-free generation helps produce
new pixels, in this paper, we propose a novel video motion
retargeting framework, termed Transformation-Synthesis
Network, or TS-Net for short, to combine their advantages.
TS-Net has a dual branch structure which consists of a
transformation branch and a synthesis branch. The net-
work architectures within the two branches are inherently
different; thus, learning via the two branches can be regarded as a special case of multi-view learning [51]. Unlike popular warp-based methods that rely on specially designed optical flow estimators [38, 39, 41], and inspired by [26],
our proposed transformation branch computes deformation
flow by weighting the regular grid with a spatial similar-
ity matrix between driving mask features and subject im-
age features. The similarity computation takes multiple correspondences into account, so it can better alleviate occlusion and handle large motion. We also design a mask-aware similarity that avoids comparing all pairs of points within the feature maps and is thus more efficient than traditional similarity computation. In our synthesis branch, we use a fully convolutional fusion network. Features of the two branches are concatenated and fed to the decoder network to generate realistic video frames. Experiments in Sec. 4 show the effectiveness of this simple concatenation strategy.
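To make the two branches concrete, the sketch below is our own illustrative PyTorch pseudocode rather than the actual TS-Net implementation; the function and argument names are placeholders, and the optional subj_mask argument only hints at the mask-aware similarity described above. It computes a soft correspondence between driving-mask features and subject features, averages the regular grid with it to obtain the deformation grid, and concatenates the warped subject features with the synthesis-branch features before decoding.

```python
import torch
import torch.nn.functional as F

def similarity_deformation_grid(drive_feat, subj_feat, subj_mask=None):
    """Deformation grid from a spatial similarity matrix (sketch only).

    drive_feat: (N, C, H, W) features of the driving mask.
    subj_feat:  (N, C, H, W) features of the subject image.
    subj_mask:  optional (N, 1, H, W) binary mask; when given, similarity
                is computed only against masked subject locations (a
                simplified stand-in for mask-aware similarity; assumes at
                least one masked location per sample).
    Returns a (N, H, W, 2) sampling grid in normalized [-1, 1] coordinates.
    """
    n, c, h, w = drive_feat.shape
    device = drive_feat.device
    q = drive_feat.flatten(2).transpose(1, 2)            # (N, HW, C)
    k = subj_feat.flatten(2)                             # (N, C, HW)
    logits = q @ k / c ** 0.5                            # (N, HW, HW)
    if subj_mask is not None:
        # Exclude unmasked subject locations from the soft correspondence.
        keep = subj_mask.flatten(2) > 0.5                # (N, 1, HW)
        logits = logits.masked_fill(~keep, float("-inf"))
    sim = torch.softmax(logits, dim=-1)

    # Regular grid over subject locations, normalized to [-1, 1].
    ys, xs = torch.meshgrid(torch.linspace(-1.0, 1.0, h, device=device),
                            torch.linspace(-1.0, 1.0, w, device=device),
                            indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).reshape(1, h * w, 2)

    # Each driving location receives a similarity-weighted average of the
    # grid, i.e. a soft correspondence rather than a point-to-point match.
    deform = sim @ grid.expand(n, -1, -1)                # (N, HW, 2)
    return deform.reshape(n, h, w, 2)

def fuse_branches(subj_feat, drive_feat, synth_feat, decoder, subj_mask=None):
    """Warp subject features with the deformation grid and concatenate them
    with the synthesis-branch features before decoding (sketch only)."""
    grid = similarity_deformation_grid(drive_feat, subj_feat, subj_mask)
    warped = F.grid_sample(subj_feat, grid, align_corners=True)
    return decoder(torch.cat((warped, synth_feat), dim=1))
```

Because each driving location attends to many subject locations, the correspondence is soft rather than point-to-point, which is what lets the transformation branch tolerate occlusion and larger motion.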
Using only sparse 2D masks of the driving videos, our proposed TS-Net consistently achieves state-of-the-art results on both face and dance videos, successfully modeling hair and clothing details and their motion. TS-Net also
handles large motions and preserves identity better when
compared with other state-of-the-art methods, as shown in
Fig. 1. Our contributions are summarized as follows:
1. We propose a novel dual branch video motion retar-
geting network TS-Net to generate identity-preserving
and temporally coherent videos via joint learning of
transformation and synthesis.
2. We utilize a simple yet effective way to estimate the deformation grid based on a similarity matrix. Mask-aware similarity is adopted to further reduce the computation overhead.
3. Comprehensive experiments on facial motion and
body motion retargeting tasks show that TS-Net can
achieve state-of-the-art results by only using sparse 2D
masks.
2. Related Work
Guided Image Generation. For conditional image gen-
eration, many works focused on generation tasks guided
by specific conditions such as pose-guided person image
synthesis [1, 27, 32, 36, 40, 43] and conditioned facial ex-
pression generation [5, 12, 34]. Pose-guided person image
generation can produce person images in arbitrary poses,
based on a subject image of that person and a novel pose
from the driving image. Ma et al. [27] proposed a two-
stage coarse-to-fine Pose Guided Person Generation Net-
work (PG2), which utilizes pose integration and image re-
finement to generate high-quality person images. Condi-
tioned facial expression generation aims to generate a reen-
acted face which shows the same expression as the driv-
ing face image while preserving the identity of the sub-
ject image. Chen et al. [5] proposed a two-stage frame-
work called PuppeteerGAN, which first performs expres-
sion retargeting by the sketching network and then exe-
cutes appearance transformation by the coloring network.
Though these works have shown promising results, they are
restricted to a specific object category (face or human body).
Several recent works [37, 38, 41, 55] have proposed general
guided image generation in various domains. Most works [38, 39, 41, 44] apply motion flow to image animation be-
cause it can model the physical dynamics. Siarohin et al.
[39] proposed a general self-supervised first-order-motion
model for estimating dense motion flow to animate arbitrary
objects using learned keypoints and local affine transforma-
tions. In [41], the authors further improved their network
by modeling object movement through unsupervised region
detection. Despite building upon a similar motion flow, instead of adopting the complicated modeling in [39, 41], the