subjects (Fig. 1). Given one subject video and one driving video, we aim to synthesize a new, plausible video that preserves the identity of the person in the subject video while following the motion of the person in the driving video.
Recent works in video motion retargeting [2, 4, 6, 9, 14,
18, 46, 47, 49, 50, 52, 54] have shown impressive progress.
To capture the temporal relationship among video frames,
prior works [6, 49, 50] generated frames by warping subject frames with motion flow, which is usually extracted by specially designed warp-field estimators such as FlowNet [8] or a first-order approximation [39]. While warp-based
systems can generally preserve subject identity well, traditional flow-based warping may struggle with occlusion and large motion because it requires learning a warp field with point-to-point correspondence between frames [15]. Other methods [2, 4, 12, 20, 52, 54] utilized warp-free (direct) synthesis with a conditional GAN-style structure [16, 30, 48]. To ease the challenge of direct synthesis, they often employed feature disentanglement/decomposition [52] or followed state-of-the-art generator architectures [31, 48] to add various connections among the inputs, the encoder, and the decoder network. Unlike warp-based generation, direct synthesis is not limited to reusing pixels from reference images, and can therefore more easily synthesize novel pixels for unseen/occluded objects. However, such flexibility can also lead to identity leakage [12], i.e., identity changes in the generated video.
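For intuition, the warp-based pipeline above can be pictured as resampling the subject frame along a dense motion flow. The following PyTorch sketch is purely illustrative (it does not reproduce any cited method, and all names are our own placeholders); it also makes clear why a pure warp cannot invent content for occluded regions, since every output pixel is copied from somewhere in the subject frame.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(subject_frame, flow):
    """Warp a subject frame with a dense motion flow (illustrative only).

    subject_frame: (N, C, H, W) pixels from the subject video.
    flow:          (N, 2, H, W) per-pixel (x, y) displacement in pixels,
                   e.g. produced by a flow estimator such as FlowNet.
    """
    n, _, h, w = subject_frame.shape
    device = subject_frame.device
    # Regular sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(torch.linspace(-1.0, 1.0, h, device=device),
                            torch.linspace(-1.0, 1.0, w, device=device),
                            indexing="ij")
    base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, h, w, 2)
    # Convert pixel displacements to normalized offsets and deform the grid.
    offsets = torch.stack((2.0 * flow[:, 0] / max(w - 1, 1),
                           2.0 * flow[:, 1] / max(h - 1, 1)), dim=-1)
    # Every output pixel is bilinearly copied from a source location, so
    # unseen/occluded content cannot be hallucinated by warping alone.
    return F.grid_sample(subject_frame, base_grid + offsets, mode="bilinear",
                         padding_mode="border", align_corners=True)
```

With a zero flow this reduces to an identity mapping of the subject frame; identity is preserved by construction, which is precisely the strength of warp-based synthesis noted above.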
Considering that warp-based synthesis can better pre-
serve identity while warp-free generation helps produce
new pixels, in this paper, we propose a novel video motion
retargeting framework, termed Transformation-Synthesis
Network, or TS-Net for short, to combine their advantages.
TS-Net has a dual branch structure which consists of a
transformation branch and a synthesis branch. The net-
work architectures within the two branches are inherently
different; thus, learning via the two branches can be regarded as a special case of multi-view learning [51]. Unlike popular warp-based methods that rely on specially designed optical flow estimators [38, 39, 41], and inspired by [26],
our proposed transformation branch computes deformation
flow by weighting the regular grid with a spatial similar-
ity matrix between driving mask features and subject im-
age features. The similarity computation takes multiple correspondences into account, so it can better alleviate occlusion and handle large motion. We also design a mask-aware similarity that avoids comparing all pairs of points within the feature maps and is thus more efficient than traditional similarity computation. In our synthesis branch, we use a fully convolutional fusion network. Features of the two branches are concatenated and fed to the decoder network to generate realistic video frames. Experiments in Sec. 4 show the effectiveness of this simple concatenation strategy.
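To make the two branches concrete, the sketch below is our own illustrative PyTorch pseudocode rather than the actual TS-Net implementation; the function and argument names are placeholders, and the optional subj_mask argument only hints at the mask-aware similarity described above. It computes a soft correspondence between driving-mask features and subject features, averages the regular grid with it to obtain the deformation grid, and concatenates the warped subject features with the synthesis-branch features before decoding.

```python
import torch
import torch.nn.functional as F

def similarity_deformation_grid(drive_feat, subj_feat, subj_mask=None):
    """Deformation grid from a spatial similarity matrix (sketch only).

    drive_feat: (N, C, H, W) features of the driving mask.
    subj_feat:  (N, C, H, W) features of the subject image.
    subj_mask:  optional (N, 1, H, W) binary mask; when given, similarity
                is computed only against masked subject locations (a
                simplified stand-in for mask-aware similarity; assumes at
                least one masked location per sample).
    Returns a (N, H, W, 2) sampling grid in normalized [-1, 1] coordinates.
    """
    n, c, h, w = drive_feat.shape
    device = drive_feat.device
    q = drive_feat.flatten(2).transpose(1, 2)            # (N, HW, C)
    k = subj_feat.flatten(2)                             # (N, C, HW)
    logits = q @ k / c ** 0.5                            # (N, HW, HW)
    if subj_mask is not None:
        # Exclude unmasked subject locations from the soft correspondence.
        keep = subj_mask.flatten(2) > 0.5                # (N, 1, HW)
        logits = logits.masked_fill(~keep, float("-inf"))
    sim = torch.softmax(logits, dim=-1)

    # Regular grid over subject locations, normalized to [-1, 1].
    ys, xs = torch.meshgrid(torch.linspace(-1.0, 1.0, h, device=device),
                            torch.linspace(-1.0, 1.0, w, device=device),
                            indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).reshape(1, h * w, 2)

    # Each driving location receives a similarity-weighted average of the
    # grid, i.e. a soft correspondence rather than a point-to-point match.
    deform = sim @ grid.expand(n, -1, -1)                # (N, HW, 2)
    return deform.reshape(n, h, w, 2)

def fuse_branches(subj_feat, drive_feat, synth_feat, decoder, subj_mask=None):
    """Warp subject features with the deformation grid and concatenate them
    with the synthesis-branch features before decoding (sketch only)."""
    grid = similarity_deformation_grid(drive_feat, subj_feat, subj_mask)
    warped = F.grid_sample(subj_feat, grid, align_corners=True)
    return decoder(torch.cat((warped, synth_feat), dim=1))
```

Because each driving location attends to many subject locations, the correspondence is soft rather than point-to-point, which is what lets the transformation branch tolerate occlusion and larger motion.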
Using only sparse 2D masks of the driving videos, our proposed TS-Net consistently achieves state-of-the-art results on both face and dance videos, successfully modeling hair and clothing details and their motion. TS-Net also
handles large motions and preserves identity better when
compared with other state-of-the-art methods, as shown in
Fig. 1. Our contributions are summarized as follows:
1. We propose a novel dual branch video motion retar-
geting network TS-Net to generate identity-preserving
and temporally coherent videos via joint learning of
transformation and synthesis.
2. We utilize a simple yet effective way to estimate the deformation grid based on a similarity matrix. Mask-aware similarity is adopted to further reduce the computation overhead.
3. Comprehensive experiments on facial motion and
body motion retargeting tasks show that TS-Net can
achieve state-of-the-art results by only using sparse 2D
masks.
2. Related Work
Guided Image Generation. For conditional image gen-
eration, many works focused on generation tasks guided
by specific conditions such as pose-guided person image
synthesis [1, 27, 32, 36, 40, 43] and conditioned facial ex-
pression generation [5, 12, 34]. Pose-guided person image
generation can produce person images in arbitrary poses,
based on a subject image of that person and a novel pose
from the driving image. Ma et al. [27] proposed a two-
stage coarse-to-fine Pose Guided Person Generation Net-
work (PG2), which utilizes pose integration and image re-
finement to generate high-quality person images. Condi-
tioned facial expression generation aims to generate a reen-
acted face which shows the same expression as the driv-
ing face image while preserving the identity of the sub-
ject image. Chen et al. [5] proposed a two-stage frame-
work called PuppeteerGAN, which first performs expres-
sion retargeting by the sketching network and then exe-
cutes appearance transformation by the coloring network.
Though these works have shown promising results, they are
restricted to a specific object category (face or human body).
Several recent works [37, 38, 41, 55] have proposed general
guided image generation in various domains. Most works [38, 39, 41, 44] apply motion flow to image animation be-
cause it can model the physical dynamics. Siarohin et al.
[39] proposed a general self-supervised first-order-motion
model for estimating dense motion flow to animate arbitrary
objects using learned keypoints and local affine transforma-
tions. In [41], the authors further improved their network
by modeling object movement through unsupervised region
detection. Despite building upon a similar motion flow, instead of adopting the complicated modeling in [39, 41], the