Match Cutting: Finding Cuts with Smooth Visual Transitions
Boris Chen Amir Ziai Rebecca S. Tucker Yuchen Xie
{bchen, aziai, btucker, yxie}@netflix.com
Netflix Inc.
Los Gatos, CA, USA
Figure 1. Three example match cuts where the framing of the sub-
ject is matched: (left) Forrest Gump (1994), (center) Up (2009),
and (right) 2001: A Space Odyssey (1968).
Abstract
A match cut is a transition between a pair of shots that
uses similar framing, composition, or action to fluidly bring
the viewer from one scene to the next. Match cuts are
frequently used in film, television, and advertising. How-
ever, finding shots that work together is a highly man-
ual and time-consuming process that can take days. We
propose a modular and flexible system to efficiently find
high-quality match cut candidates starting from millions
of shot pairs. We annotate and release a dataset of ap-
proximately 20k labeled pairs that we use to evaluate our
system, using both classification and metric learning ap-
proaches that leverage a variety of image, video, audio,
and audio-visual feature extractors. In addition, we release
code and embeddings for reproducing our experiments at
github.com/netflix/matchcut.
1. Introduction
In film, a shot is a series of frames representing an unin-
terrupted period of time between two cuts [12]. A match cut
is a transition between a pair of shots that uses similar fram-
ing, composition, or action to fluidly bring the viewer from
one scene to the next. It is a powerful visual storytelling
tool used to create a connection between two scenes.
For example, a match cut from a person to their younger
or older self is commonly used in film to signify a flashback
or flash-forward to help build the backstory of a character.
Two example films that used this are Forrest Gump (1994)
[79] and Up (2009) [21] (Fig. 1). Without this technique, a
narrator or character might have to explicitly verbalize that
information, which may ruin the flow of the film.
Figure 2. A match cut from the Star Wars: The Rise of Skywalker
(2019) [1] trailer. The trailer editor took two shots from two different
scenes with similar jump motions and cut them together. The
matched motion gives the illusion of one continuous jump.
A famous example from Stanley Kubrick’s 2001: A Space
Odyssey [43] is also shown in Fig. 1. This iconic match cut
from a spinning bone to a spaceship instantaneously takes
the viewer forward millions of years into the future. It is a
highly artistic edit which suggests that mankind’s evolution
from primates to space technology is natural and inevitable.
Match cuts can use any combination of elements, such
as framing, motion, action, subject matter, audio, lighting,
and color. In this paper, we will specifically address two
types: (1) character frame match cuts, in which the framing
of the character in the first shot aligns with the character in
the second shot, and (2) motion match cuts, where shots are
matched together on the basis of general movement. Mo-
tion match cuts can use common camera movement (pan
left/right, zoom in/out) or motion of subjects. They create
the feeling of smooth transitions between inherently discon-
tinuous shots. An example is shown in Fig. 2.
Match cutting is considered one of the most difficult
video editing techniques [22], because finding a pair of
shots that match well is tedious and time-consuming. For
a feature film, there are approximately 2k shots on average,
which translates to 2M possible shot pairs, the vast majority
of which will not be good match cuts. An editor typically
watches one or more long-form videos and relies on mem-
ory or manual tagging to identify shots that would match to
a reference shot observed earlier. Given the large number of
shot pairs that need to be compared, it is easy to overlook
many desirable match cuts.
Our goal is to make finding match cuts vastly more ef-
ficient by presenting a ranked list of match cut pair can-
didates to the editors, so they are selecting from, e.g., the
top 50 shot pairs most likely to be good match cuts, rather
than millions of random ones. This is a challenging video
editing task that requires complex understanding of visual
composition, motion, action, and sound.
Our contributions in this paper are the following: (1) We
propose a modular and flexible system for generating match
cut candidates. Our system has been successfully utilized
by editors in creating promotional media assets (e.g. trail-
ers) and can also be used in post-production to find matched
shots in large amounts of pre-final video. (2) We release a
dataset of roughly 20k labeled match cut pairs for two types
of match cuts: character framing and motion. (3) We eval-
uate our system using classification and metric learning ap-
proaches that leverage a variety of image, video, audio, and
audio-visual feature extractors. (4) We release code and em-
beddings for reproducing our experiments.
2. Related Work
Computational video editing There is no computational
or algorithmic approach to video editing that matches the
skill and creative vision of a professional editor. However,
a number of methods and techniques have been proposed to
address sub-problems within video editing, particularly the
automation of slow and manual tasks.
Automated video editing techniques for specialized non-
fiction videos have seen success with rules-based methods,
such as those for group meetings [58, 64], educational lec-
tures [31], interviews [8] and social gatherings [5]. Broadly
speaking, these methods combine general film editing con-
ventions (e.g. the speaker should be shown on camera) with
heuristics specific to the subject domain (e.g. for educa-
tional lectures, the white board should be visible).
Computational video editing for fictional works tends to
fall in one of two lines of research: transcript-based ap-
proaches [45, 72, 25, 68] and learning-based approaches
[53]. Leake et al. [45] generate edited video sequences
using a standard film script and multiple takes of the scene,
but their work is specific to dialogue-driven scenes. Two
similar concepts, Write-A-Video [72] and QuickCut [68],
generate video montages using a combination of text and a
video library. Learning-based approaches have seen success
in recent years, notably in Learning to Cut [53], which pro-
poses a method to rank realistic cuts via contrastive learning
[20]. The MovieCuts dataset [54] includes match cuts as a
subtype, though it is by far the smallest category and does
not distinguish between kinds of match cuts. In contrast,
we release a dataset of 20k pairs that differentiates between
frame and motion cuts, with the goal of finding these pairs
from shots throughout the film instead of detecting exist-
ing cuts. Our work advances learning-based computational
video editing by introducing a method to generate and then
rank proposed pairs of match cuts without fixed rules or
transcripts.
Video Representation Learning Self-supervised methods
have dominated much of the progress in multi-modal me-
dia understanding in recent years [76, 47, 24, 37]. CLIP
[57] was an early example of achieving impressive zero-
shot visual classification following self-supervised training
with over 400M image-caption pairs. Similar advances have
been made for audio [29] and video [50] by utilizing differ-
ent augmented views of the same modality [19, 18, 27, 60],
or by learning joint embeddings of short [29, 3, 50] or long-
form [39] videos. Our system leverages such work for
learning video representations that capture matching video
pairs for the task of match cutting.
Movie understanding There is a deep and rich literature
on models that understand and analyze the information in
movies. Many movie-specific datasets [34] have been de-
veloped that have enabled research into a variety of topics
such as human-centric situations [70], story-based retrieval
[6], shot type classification [59], narrative understanding
[6, 10, 44], and trailer analysis [35]. We release a dataset
which contributes a novel and challenging movie under-
standing task.
3. Methodology
In this section, we present a flexible and scalable sys-
tem for finding K matching shot pairs given a video. This
system consists of five steps, as depicted in Fig. 3.
3.1. Preprocessing
The first two steps of our system segment a video into a
sequence of contiguous and non-overlapping shots and re-
move near-duplicate shots. Although we present concrete
implementations for these steps, our system is agnostic to
these choices.
Step 1: Shot segmentation. For each movie m, we run a shot segmentation algorithm to split that title into $n_m$ shots. Let $S^m = \{s^m_i\}_{i=1}^{n_m}$ be the set of shots, where $s^m_i$ corresponds to the i-th shot of the m-th movie. Shot $s^m_i$ consists of an ordered set of frames $F^m_i = \{f^m_{(i,j)}\}_{j=1}^{l^m_i}$, where $f^m_{(i,j)}$ is the j-th frame of $s^m_i$, and $l^m_i$ is the number of frames in $s^m_i$. We use a custom shot segmentation algorithm, but similar results can be achieved with PySceneDetect [15] or TransNetV2 [62].
Step 2: Near-duplicate shot deduplication. Matching
shots should have at least one difference in character, back-
ground, clothing, or character age. Therefore, we remove
near-duplicate shots (e.g. two shots of the same character
in the same scene and framing, but with a slightly different
facial expression).
Figure 3. System diagram for generating candidate match cut pairs. The input is a video file for movie m and the output is K match cut candidates. (1) Video is split into shots using a shot segmentation algorithm. (2) Near-duplicate shots are removed. (3) A tensor representation $r^m_i$ is computed for each shot $s^m_i$ using an encoder. (4) All unique shot pairs are enumerated and a score function sim is used to compute the similarity between shot representations. (5) The top-K pairs with highest similarity are returned. We show an illustrative example with four shots from Moonrise Kingdom (2012) [4] and K = 2.
Our specific methodology for deduplication is as follows: we first extract the center frame $c^m_i$ for each shot $s^m_i$, defined as $c^m_i = f^m_{(i, \lfloor l^m_i / 2 \rfloor)}$. For each center frame, we extract the penultimate embeddings out of MobileNet [33] pretrained on ImageNet [42]. Let $e^m_i = \text{enc}(c^m_i) \in \mathbb{R}^{1024}$ be the embedding for frame $c^m_i$, where enc takes an image and outputs a 1024-dimensional vector.
We define the set of duplicate shot indices for movie m as
$$D^m = \{\, j \mid i, j \in \{1, 2, \dots, n_m\},\ i < j,\ \cos(e^m_i, e^m_j) \geq T_d \,\} \qquad (1)$$
where cos computes the cosine similarity between a pair of embeddings and $T_d$ is the similarity threshold.
Finally, the set of deduplicated shots for movie m can be constructed by excluding the shots corresponding to the indices in $D^m$ as follows: $S^m_d = \{\, s^m_i \mid i \in \{1, 2, \dots, n_m\},\ i \notin D^m \,\}$. We leverage the imagededup [36] library and find that setting $T_d = 0.8$ removes most of the near-duplicates.
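The deduplication rule of Eq. (1) is straightforward to express directly; below is a minimal NumPy sketch that assumes the center-frame embeddings have already been stacked into an (n_m, 1024) matrix, rather than going through the imagededup library.

```python
import numpy as np

def deduplicate(embeddings: np.ndarray, t_d: float = 0.8) -> list:
    """Return the indices of shots to keep after near-duplicate removal.

    embeddings: (n_m, 1024) matrix whose i-th row is e_i, the embedding of
    shot i's center frame.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embeddings)
    # Eq. (1): shot j is a duplicate if some earlier shot i is too similar to it.
    duplicates = {j for j in range(n) for i in range(j) if sims[i, j] >= t_d}
    return [i for i in range(n) if i not in duplicates]
```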
3.2. Shot Pair Ranking
Steps 3-5 score and rank pairs of deduplicated shots fol-
lowing step 2.
Step 3: Shot representation computation. In this step, we compute a tensor representation $r^m_i$ for each shot $s^m_i$. Representations for different shots need to preserve some notion of similarity for matching pairs. Representations can be extracted using any video, image, audio, text, or multi-modal encoders. We present a few such choices in the upcoming sections.
Step 4: Shot pair score computation. In this step, we enumerate all unique shot pairs for movie m,
$$P^m = \{\, (s^m_i, s^m_j) \mid s^m_i, s^m_j \in S^m_d,\ i < j \,\} \qquad (2)$$
and compute a similarity score $\text{sim}(r^m_i, r^m_j) \in \mathbb{R}$ for each pair of shots $(s^m_i, s^m_j)$. This similarity score is used for ranking pairs, where higher-scoring pairs are considered higher quality. The function sim can be any function that takes a pair of tensors and outputs a real scalar. This function can be chosen beforehand (e.g. cosine similarity) or learned through supervision.
Step 5: Top-K pair extraction. This step simply ranks the results from the previous step and returns the top-K pairs.
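Steps 4 and 5 amount to enumerating the pairs in Eq. (2), scoring each pair, and keeping the best K. A minimal sketch, assuming cosine similarity as the predetermined scoring function and one flat vector per shot:

```python
import numpy as np
from itertools import combinations

def top_k_pairs(representations: np.ndarray, k: int) -> list:
    """representations: (n, d) matrix whose i-th row is the representation r_i
    of the i-th deduplicated shot. Returns the k index pairs (i, j), i < j,
    with the highest cosine similarity sim(r_i, r_j)."""
    normed = representations / np.linalg.norm(representations, axis=1, keepdims=True)
    # Step 4: enumerate all unique pairs (Eq. (2)) and score each one.
    scored = [(float(normed[i] @ normed[j]), (i, j))
              for i, j in combinations(range(len(normed)), 2)]
    # Step 5: rank by score and return the top-K pairs.
    scored.sort(key=lambda x: x[0], reverse=True)
    return [pair for _, pair in scored[:k]]
```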
3.3. Heuristics
We define a heuristic h as a specific combination of shot
representation and predetermined scoring function. These
heuristics serve two functions. We use them to generate
candidate pairs for manual annotation by video editors, and
then also to evaluate the annotated data set. Here, evaluate
means that we use the heuristic to rank the candidate pairs
and compute the average precision of that ranked list. More
details about evaluation can be found in Supplementary 8.3.
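For example, the average precision of a heuristic's ranking can be computed with scikit-learn; the labels and scores below are illustrative placeholders, not values from our dataset:

```python
from sklearn.metrics import average_precision_score

# labels[i] = 1 if editors annotated the i-th candidate pair as a match cut.
labels = [1, 0, 1, 0, 0]
# scores[i] = the heuristic's similarity score for the i-th candidate pair.
scores = [0.95, 0.90, 0.72, 0.61, 0.33]
print(average_precision_score(labels, scores))
```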
We leverage four of the heuristics presented in this sec-
tion (h1, h2, h4, and h5) to generate candidate pairs for
annotation in Sec. 4 and report how all of the heuristics
perform on our dataset in Sec. 5.
Heuristic 1 (h1): equal number of faces. One very
crude heuristic for character frame match cutting is to con-
sider pairs where the number of faces between the two shots
is equal. For the shot representation, we extract the center
frame and use a face detection model (Inception-ResNet-v1
[63] pretrained on VGGFace2 [13]) to determine the num-
ber of faces. The scoring function outputs 1 if the two shots
have the same number of faces, and 0 otherwise.
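A minimal sketch of h1 is shown below. It uses the MTCNN detector from the facenet-pytorch package to count faces in each shot's center frame; the specific detector and the file paths are illustrative assumptions rather than the exact production setup.

```python
from PIL import Image
from facenet_pytorch import MTCNN

# keep_all=True makes the detector return every face, not just the largest one.
detector = MTCNN(keep_all=True)

def num_faces(center_frame_path: str) -> int:
    # detect() returns (boxes, probs); boxes is None when no face is found.
    boxes, _ = detector.detect(Image.open(center_frame_path).convert("RGB"))
    return 0 if boxes is None else len(boxes)

def h1_score(center_frame_a: str, center_frame_b: str) -> float:
    # 1 if both center frames contain the same number of faces, 0 otherwise.
    return 1.0 if num_faces(center_frame_a) == num_faces(center_frame_b) else 0.0
```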
Heuristics 2 (h2) and 3 (h3): Instance segmentation.