
dressed, as shown in Fig. 1:
Domain mismatch - scene images vs. movie frames: Visual scenes depicted in movies differ from natural scenes due to the increased focus on actors, multiple concurrent activities, and viewpoint variations such as extreme close-ups and wide-angle shots. An example is shown in Fig. 1 (a) for images from the Places2 dataset [57] and movie frames from the Condensed Movies dataset [1].
Lack of completeness in scene taxonomy: Movies depict both real-life and fictional scenarios that span a wide variety of visual scenes. As shown in Fig. 1 (b), certain movie-centric visual scene classes like battlefield, control room, prison, war room, funeral, and casino are absent from existing public scene taxonomies associated with natural scene image and video datasets.
Lack of shot-specific visual scene annotations: Existing datasets like Condensed Movies [1] and VidSitu [46] provide a single visual scene label for an entire movie clip (around 2 minutes long), obtained from descriptions provided as part of the Fandango Movie clips YouTube channel (https://www.youtube.com/channel/UC3gNmTGu-TTbFPpfSs5kNkg). In Fig. 1 (c), the provided description, Johnny Five (Tim Blaney) searches for his humanity in the streets of New York, mentions only the visual scene street, while the initial set of events takes place inside a church. Instead of assigning a single scene label to the entire movie clip, shot-level visual scene annotation can track the scene change from church to street.
In our work, we consider shots within a given movie clip as the fundamental units for visual scene analysis, since a shot consists of a consecutive set of frames depicting the same content, whose starting and ending points are determined by continuous recording with a single camera [25]. Our contributions are as follows:
• Movie-centric scene taxonomy: We develop a movie-centric scene taxonomy by leveraging scene headers (sluglines) from movie scripts and existing video datasets with scene labels like HVU [13].
• Automatic shot tagging: We utilize our generated scene taxonomy to automatically tag around 1.12M shots from 32K movie clips using CLIP [41] based on a frame-wise aggregation scheme (a sketch of this tagging scheme follows this list).
• Multi-label scene classification: We develop multi-label scene classification baselines using the shot-level tagged dataset, called MovieCLIP, and evaluate them on an independent shot-level dataset curated by human experts (a sketch of a baseline classification head also follows this list). The dataset and associated codebase can be accessed at https://sail.usc.edu/mica/MovieCLIP/
• Downstream tasks: We further extract feature representations from the baseline models pretrained on MovieCLIP and explore their applicability in the diverse downstream tasks of multi-label scene classification from web videos [13] and movie genre classification from trailers [9].
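To make the frame-wise aggregation idea above concrete, the following is a minimal sketch of zero-shot shot tagging with the open-source CLIP package. The label subset, the prompt template, the sampled frame paths, and the mean-pooling aggregation are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of frame-wise CLIP shot tagging (the label subset,
# prompt template, and mean-pooling choice are assumptions).
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical subset of a movie-centric scene taxonomy.
SCENE_LABELS = ["battlefield", "control room", "prison",
                "casino", "church", "street"]
TEXT_TOKENS = clip.tokenize(
    [f"a photo of a {c}" for c in SCENE_LABELS]).to(device)

def tag_shot(frame_paths, top_k=3):
    """Tag one shot: score each sampled frame against the taxonomy,
    then mean-pool the per-frame label probabilities across frames."""
    images = torch.stack(
        [preprocess(Image.open(p)) for p in frame_paths]).to(device)
    with torch.no_grad():
        img_feats = model.encode_image(images)
        txt_feats = model.encode_text(TEXT_TOKENS)
        img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
        txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_feats @ txt_feats.T).softmax(dim=-1)  # frames x labels
        shot_scores = probs.mean(dim=0)  # frame-wise aggregation
    scores, idx = shot_scores.topk(top_k)
    return [(SCENE_LABELS[int(i)], float(s)) for i, s in zip(idx, scores)]
```

Applied per shot, e.g., on frames sampled at a fixed rate between detected shot boundaries, the top-scoring labels can serve as weak multi-label tags.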
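Similarly, the multi-label baselines can be read as a sigmoid-output classifier trained with a binary cross-entropy objective over shot-level features. The PyTorch sketch below uses placeholder dimensions; the 512-d feature and the 160-class taxonomy size are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SceneTagHead(nn.Module):
    """Linear multi-label head over pooled shot features
    (dimensions are placeholders, not the paper's configuration)."""
    def __init__(self, feature_dim=512, num_classes=160):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, shot_features):
        return self.fc(shot_features)  # logits; sigmoid applied at inference

head = SceneTagHead()
criterion = nn.BCEWithLogitsLoss()  # standard multi-label objective
feats = torch.randn(8, 512)                     # batch of shot features
labels = torch.randint(0, 2, (8, 160)).float()  # multi-hot scene tags
loss = criterion(head(feats), labels)
loss.backward()
```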
2. Related work
Image datasets for visual scene recognition: Image datasets for scene classification like MIT Indoor67 [40] relied on categorizing a finite set of 67 indoor scene classes. A broad categorization into indoor, outdoor (natural), and outdoor (man-made) groups for 130K images across 397 subcategories was introduced by the SUN dataset [56]. For large-scale scene recognition, the Places dataset [57] was developed with 434 scene labels spanning 10 million images. The scene taxonomy considered in the Places dataset was derived from the SUN dataset, followed by careful merging of similar class pairs. It should be noted that the curation of large-scale visual scene datasets like Places relied on crowdsourced manual annotations over multiple rounds.
Video datasets for visual scene recognition: While there has been considerable progress in action recognition from videos due to the introduction of large-scale datasets like Kinetics [28], ActivityNet [16], AVA [22], and Something-Something [21], only a few large-scale datasets like HVU [13] and Scenes, Objects and Actions (SOA) [43] have focused on scene categorization along with actions and associated objects. SOA was introduced as a multi-task, multi-label dataset of social-media videos across 49 scenes with objects and actions, but its taxonomy curation involved free-form tagging by human annotators followed by automatic cleanup. HVU [13], a recently released public dataset of web videos with 248 scene labels, relied on initial tag generation based on cloud APIs followed by human verification.
Movie-centric visual scene recognition: In the domain of scene recognition from movies, Hollywood Scenes [36] was first introduced with 10 scene classes extracted from headers in movie scripts across 3669 movie clips. A socially grounded approach was explored in MovieGraphs [54], with emphasis on the underlying interactions (relationships/situations) along with spatio-temporal localizations and associated visual scenes (59 classes). For holistic movie understanding tasks, the MovieNet dataset [27] was introduced with the largest movie-centric scene taxonomy, consisting of 90 place (visual scene) tags with segment-wise human annotations of entire movies. Instead of entire movies, short movie clips sourced from the Fandango Movie clips YouTube channel were used for text-video retrieval in the Condensed Movies dataset [1], visual semantic role labeling [46], and pretraining object-centric transformers [53] for long-term video understanding in the LVU dataset [55]. While there is no explicit visual scene labeling, the raw descriptions available on YouTube with the movie clips mention certain visual scene classes.
MovieCLIP, our curated dataset, is built on top of movie