shot pairs that need to be compared, it is easy to overlook
many desirable match cuts.
Our goal is to make finding match cuts vastly more efficient by presenting editors with a ranked list of candidate match cut pairs, so that they select from, e.g., the top 50 shot pairs most likely to be good match cuts rather than from millions of random ones. This is a challenging video editing task that requires a complex understanding of visual composition, motion, action, and sound.
Our contributions in this paper are the following: (1) We
propose a modular and flexible system for generating match
cut candidates. Our system has been successfully used by editors in creating promotional media assets (e.g. trailers) and can also be used in post-production to find matching shots in large amounts of pre-final video. (2) We release a
dataset of roughly 20k labeled match cut pairs for two types
of match cuts: character framing and motion. (3) We eval-
uate our system using classification and metric learning ap-
proaches that leverage a variety of image, video, audio, and
audio-visual feature extractors. (4) We release code and em-
beddings for reproducing our experiments.
2. Related Work
Computational video editing There is no computational
or algorithmic approach to video editing that matches the
skill and creative vision of a professional editor. However,
a number of methods and techniques have been proposed to
address sub-problems within video editing, particularly the
automation of slow and manual tasks.
Automated video editing techniques for specialized non-fiction videos have seen success with rule-based methods, such as those for group meetings [58, 64], educational lectures [31], interviews [8], and social gatherings [5]. Broadly
speaking, these methods combine general film editing con-
ventions (e.g. the speaker should be shown on camera) with
heuristics specific to the subject domain (e.g. for educa-
tional lectures, the whiteboard should be visible).
Computational video editing for fictional works tends to
fall in one of two lines of research: transcript-based ap-
proaches [45, 72, 25, 68] and learning-based approaches
[53]. Leake et al. [45] generate edited video sequences using a standard film script and multiple takes of a scene, but their work is specific to dialogue-driven scenes. Two
similar concepts, Write-A-Video [72] and QuickCut [68],
generate video montages using a combination of text and a
video library. Learning-based approaches have seen success
in recent years, notably in Learning to Cut [53], which pro-
poses a method to rank realistic cuts via contrastive learning
[20]. The MovieCuts dataset [54] includes match cuts as a
subtype, though it is by far the smallest category and does
not distinguish between kinds of match cuts. In contrast,
we release a dataset of 20k pairs that differentiates between framing and motion match cuts, with the goal of finding such pairs among shots throughout a film rather than detecting existing cuts. Our work advances learning-based computational
video editing by introducing a method to generate and then
rank proposed pairs of match cuts without fixed rules or
transcripts.
Video Representation Learning Self-supervised methods
have dominated much of the progress in multi-modal me-
dia understanding in recent years [76, 47, 24, 37]. CLIP
[57] was an early example, achieving impressive zero-shot visual classification after self-supervised training on over 400M image-caption pairs. Similar advances have
been made for audio [29] and video [50] by utilizing differ-
ent augmented views of the same modality [19, 18, 27, 60],
or by learning joint embeddings of short [29, 3, 50] or long-
form [39] videos. Our system leverages such work for
learning video representations that capture matching video
pairs for the task of match cutting.
Movie understanding There is a deep and rich literature
on models that understand and analyze the information in
movies. Many movie-specific datasets [34] have been de-
veloped that have enabled research into a variety of topics
such as human-centric situations [70], story-based retrieval
[6], shot type classification [59], narrative understanding
[6, 10, 44], and trailer analysis [35]. We release a dataset that contributes a novel and challenging movie understanding task.
3. Methodology
In this section, we present a flexible and scalable system for finding $K$ matching shot pairs given a video. This system consists of five steps, as depicted in Fig. 3.
3.1. Preprocessing
The first two steps of our system segment a video into a
sequence of contiguous and non-overlapping shots and re-
move near-duplicate shots. Although we present concrete
implementations for these steps, our system is agnostic to
these choices.
Step 1: Shot segmentation. For each movie $m$, we run a shot segmentation algorithm to split that title into $n_m$ shots. Let $S_m = \{s^m_i\}_{i=1}^{n_m}$ be the set of shots, where $s^m_i$ corresponds to the $i$-th shot of the $m$-th movie. Shot $s^m_i$ consists of an ordered set of frames $F^m_i = \{f^m_{(i,j)}\}_{j=1}^{l^m_i}$, where $f^m_{(i,j)}$ is the $j$-th frame of $s^m_i$, and $l^m_i$ is the number of frames in $s^m_i$. We use a custom shot segmentation algorithm, but similar results can be achieved with PySceneDetect [15] or TransNetV2 [62].
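As an illustration of this step, the following is a minimal sketch using PySceneDetect's high-level API (v0.6+) as a stand-in for our custom segmenter; the file path is a placeholder and the default ContentDetector settings are an assumption.

    # Minimal sketch of Step 1 using PySceneDetect (v0.6+) as a stand-in
    # for the custom segmenter described above; "movie.mp4" is a placeholder.
    from scenedetect import detect, ContentDetector

    def segment_shots(video_path: str) -> list[tuple[int, int]]:
        """Split a video into n_m contiguous, non-overlapping shots,
        returned as (start_frame, end_frame) index pairs."""
        # detect() returns one (start, end) FrameTimecode pair per shot.
        scene_list = detect(video_path, ContentDetector())
        return [(start.get_frames(), end.get_frames()) for start, end in scene_list]

    shots = segment_shots("movie.mp4")
    print(f"n_m = {len(shots)} shots")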
Step 2: Near-duplicate shot deduplication. Matching
shots should have at least one difference in character, back-
ground, clothing, or character age. Therefore, we remove
near-duplicate shots (e.g. two shots of the same character
in the same scene and framing, but with a slightly different
facial expression).
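As one possible instantiation of this step (we do not prescribe a specific method here; the unit-norm shot embeddings and the 0.95 similarity threshold below are illustrative assumptions), near-duplicates can be removed by greedily dropping any shot whose representative-frame embedding is nearly identical to that of an already-kept shot:

    # Illustrative sketch of Step 2: greedy near-duplicate removal. The shot
    # embeddings (one L2-normalized vector per shot, e.g. from a representative
    # frame) and the 0.95 threshold are assumptions, not a prescribed method.
    import numpy as np

    def deduplicate_shots(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
        """Return indices of shots kept after near-duplicate removal.
        embeddings: (n_shots, d) array of unit-norm shot embeddings."""
        kept: list[int] = []
        for i, emb in enumerate(embeddings):
            # Cosine similarity reduces to a dot product on unit vectors.
            if not kept or float(np.max(embeddings[kept] @ emb)) < threshold:
                kept.append(i)
        return kept

Shots flagged as near-duplicates are discarded before candidate pairs are formed in the subsequent steps.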