shot pairs that need to be compared, it is easy to overlook
many desirable match cuts.
Our goal is to make finding match cuts vastly more efficient by presenting editors with a ranked list of candidate match cut pairs, so that they select from, e.g., the top 50 shot pairs most likely to be good match cuts rather than from millions of random ones. This is a challenging video editing task that requires a complex understanding of visual composition, motion, action, and sound.
Our contributions in this paper are the following: (1) We
propose a modular and flexible system for generating match
cut candidates. Our system has been successfully used by editors in creating promotional media assets (e.g. trailers) and can also be used in post-production to find matching shots in large amounts of pre-final video. (2) We release a
dataset of roughly 20k labeled match cut pairs for two types
of match cuts: character framing and motion. (3) We eval-
uate our system using classification and metric learning ap-
proaches that leverage a variety of image, video, audio, and
audio-visual feature extractors. (4) We release code and em-
beddings for reproducing our experiments.
2. Related Work
Computational video editing There is no computational
or algorithmic approach to video editing that matches the
skill and creative vision of a professional editor. However,
a number of methods and techniques have been proposed to
address sub-problems within video editing, particularly the
automation of slow and manual tasks.
Automated video editing techniques for specialized non-fiction videos have seen success with rule-based methods, such as those for group meetings [58, 64], educational lectures [31], interviews [8], and social gatherings [5]. Broadly
speaking, these methods combine general film editing con-
ventions (e.g. the speaker should be shown on camera) with
heuristics specific to the subject domain (e.g. for educa-
tional lectures, the whiteboard should be visible).
Computational video editing for fictional works tends to
fall in one of two lines of research: transcript-based ap-
proaches [45, 72, 25, 68] and learning-based approaches
[53]. Leake et al. [45] generate edited video sequences using a standard film script and multiple takes of a scene, but their work is specific to dialogue-driven scenes. Two
similar concepts, Write-A-Video [72] and QuickCut [68],
generate video montages using a combination of text and a
video library. Learning-based approaches have seen success
in recent years, notably in Learning to Cut [53], which pro-
poses a method to rank realistic cuts via contrastive learning
[20]. The MovieCuts dataset [54] includes match cuts as a
subtype, though it is by far the smallest category and does
not distinguish between kinds of match cuts. In contrast,
we release a dataset of 20k pairs that differentiates between framing and motion match cuts, with the goal of finding such pairs among shots throughout a film rather than detecting existing cuts. Our work advances learning-based computational
video editing by introducing a method to generate and then
rank proposed pairs of match cuts without fixed rules or
transcripts.
Video Representation Learning Self-supervised methods
have dominated much of the progress in multi-modal me-
dia understanding in recent years [76, 47, 24, 37]. CLIP
[57] was an early example, achieving impressive zero-shot visual classification after self-supervised training on over 400M image-caption pairs. Similar advances have
been made for audio [29] and video [50] by utilizing differ-
ent augmented views of the same modality [19, 18, 27, 60],
or by learning joint embeddings of short [29, 3, 50] or long-
form [39] videos. Our system leverages such work for
learning video representations that capture matching video
pairs for the task of match cutting.
Movie understanding There is a deep and rich literature
on models that understand and analyze the information in
movies. Many movie-specific datasets [34] have been de-
veloped that have enabled research into a variety of topics
such as human-centric situations [70], story-based retrieval
[6], shot type classification [59], narrative understanding
[6, 10, 44], and trailer analysis [35]. We release a dataset that contributes a novel and challenging movie understanding task.
3. Methodology
In this section, we present a flexible and scalable system for finding $K$ matching shot pairs given a video. This system consists of five steps, as depicted in Fig. 3.
3.1. Preprocessing
The first two steps of our system segment a video into a
sequence of contiguous and non-overlapping shots and re-
move near-duplicate shots. Although we present concrete
implementations for these steps, our system is agnostic to
these choices.
Step 1: Shot segmentation. For each movie $m$, we run a shot segmentation algorithm to split that title into $n_m$ shots. Let $S_m = \{s^m_i\}_{i=1}^{n_m}$ be the set of shots, where $s^m_i$ corresponds to the $i$-th shot of the $m$-th movie. Shot $s^m_i$ consists of an ordered set of frames $F^m_i = \{f^m_{(i,j)}\}_{j=1}^{l^m_i}$, where $f^m_{(i,j)}$ is the $j$-th frame of $s^m_i$, and $l^m_i$ is the number of frames in $s^m_i$. We use a custom shot segmentation algorithm, but similar results can be achieved with PySceneDetect [15] or TransNetV2 [62].
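As an illustration of this step, the following is a minimal sketch using PySceneDetect's high-level API (v0.6+) as a stand-in for our custom segmenter; the file path is a placeholder and the default ContentDetector settings are an assumption.

    # Minimal sketch of Step 1 using PySceneDetect (v0.6+) as a stand-in
    # for the custom segmenter described above; "movie.mp4" is a placeholder.
    from scenedetect import detect, ContentDetector

    def segment_shots(video_path: str) -> list[tuple[int, int]]:
        """Split a video into n_m contiguous, non-overlapping shots,
        returned as (start_frame, end_frame) index pairs."""
        # detect() returns one (start, end) FrameTimecode pair per shot.
        scene_list = detect(video_path, ContentDetector())
        return [(start.get_frames(), end.get_frames()) for start, end in scene_list]

    shots = segment_shots("movie.mp4")
    print(f"n_m = {len(shots)} shots")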
Step 2: Near-duplicate shot deduplication. Matching
shots should have at least one difference in character, back-
ground, clothing, or character age. Therefore, we remove
near-duplicate shots (e.g. two shots of the same character
in the same scene and framing, but with a slightly different
facial expression).
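As one possible instantiation of this step (we do not prescribe a specific method here; the unit-norm shot embeddings and the 0.95 similarity threshold below are illustrative assumptions), near-duplicates can be removed by greedily dropping any shot whose representative-frame embedding is nearly identical to that of an already-kept shot:

    # Illustrative sketch of Step 2: greedy near-duplicate removal. The shot
    # embeddings (one L2-normalized vector per shot, e.g. from a representative
    # frame) and the 0.95 threshold are assumptions, not a prescribed method.
    import numpy as np

    def deduplicate_shots(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
        """Return indices of shots kept after near-duplicate removal.
        embeddings: (n_shots, d) array of unit-norm shot embeddings."""
        kept: list[int] = []
        for i, emb in enumerate(embeddings):
            # Cosine similarity reduces to a dot product on unit vectors.
            if not kept or float(np.max(embeddings[kept] @ emb)) < threshold:
                kept.append(i)
        return kept

Shots flagged as near-duplicates are discarded before candidate pairs are formed in the subsequent steps.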