MovieCLIP: Visual Scene Recognition in Movies
Digbalay Bose1, Rajat Hebbar1, Krishna Somandepalli2, Haoyang Zhang1, Yin Cui2, Kree
Cole-McLaughlin2, Huisheng Wang2, and Shrikanth Narayanan1
1University of Southern California, Los Angeles, CA 2Google
1{dbose@,rajatheb@,zhangh21@,shri@ee.}usc.edu 2{ksoman@,yincui@,kree@,huishengw@}google.com
Abstract
Longform media such as movies have complex narra-
tive structures, with events spanning a rich variety of am-
bient visual scenes. Domain specific challenges associated
with visual scenes in movies include transitions, person cov-
erage, and a wide array of real-life and fictional scenar-
ios. Existing visual scene datasets for movies have limited taxonomies and do not consider visual scene transitions within movie clips. In this work, we address the problem
of visual scene recognition in movies by first automatically
curating a new and extensive movie-centric taxonomy of
179 scene labels derived from movie scripts and auxiliary
web-based video datasets. Instead of manual annotations
which can be expensive, we use CLIP to weakly label 1.12
million shots from 32K movie clips based on our proposed
taxonomy. We provide baseline visual models trained on
the weakly labeled dataset called MovieCLIP and evaluate
them on an independent dataset verified by human raters.
We show that leveraging features from models pretrained on
MovieCLIP benefits downstream tasks such as multi-label
scene and genre classification of web videos and movie
trailers.
1. Introduction
Media, in its diverse forms and modalities, is used to create and share narratives across domains including movies, television shows, advertisements, games, news, and user-generated social stories. Movies represent a major form of media content with global reach and societal influence; box office revenues in 2021 were estimated at $4.48 billion across 329 released movies [37]. The computational analysis of media content [49], especially movies, presents
unique challenges due to their long-form narrative struc-
tures with character interactions often spanning diverse vi-
sual scenes and contexts. In cinematic terms, mise-en-scène [4] refers to how the different elements of a film are depicted and arranged in front of the camera. Key components of mise-en-scène include the actors with their individual styles, the visual scenes where the interactions take place, set design including lighting and camera placement, and the accompanying costumes and makeup of the artists. The visual scene is considered a crucial component since it sets the mood and provides the background for the various actions performed by the actors. Visual scenes in movies are often tied to social settings, like weddings, birthday parties, and workplace gatherings, that provide information about character interactions.

Figure 1. Overview diagram highlighting the challenges associated with visual scene recognition in movies: (a) domain mismatch between natural scene images (Source: http://places2.csail.mit.edu/explore.html) and frames from movies for living room; (b) movie-centric visual scene classes like prison and control room that are absent from existing taxonomies; (c) change in visual scene between shots in the same movie clip.
Accurate recognition of visual scenes can help uncover bias in the portrayal of under-represented characters across different scenes, e.g., fewer women shown in offices as compared to kitchens. For content tagging tasks like genre classification, visual scenes provide contextual information, such as battlefield portrayals in action/adventure movies, space shuttles in sci-fi movies, or courtrooms in dramas. However, there are certain inherent challenges in
visual scene recognition for movies that need to be addressed, as shown in Fig. 1:
arXiv:2210.11065v2 [cs.CV] 23 Oct 2022
Domain mismatch - scene images vs. movie frames: Visual scenes depicted in movies are distinct from natural scenes due to the increased focus on actors, multiple concurrent activities, and viewpoint variations such as extreme close-ups and wide-angle shots. An example is shown in Fig. 1(a) for images from the Places2 dataset [57] and movie frames from the Condensed Movies dataset [1].
Lack of completeness in scene taxonomy: Movies depict both real-life and fictional scenarios that span a wide variety of visual scenes. As shown in Fig. 1(b), certain movie-centric visual scene classes like battlefield, control room, prison, war room, funeral, and casino are absent from existing public scene taxonomies associated with natural scene image and video datasets.
Lack of shot-specific visual scene annotations: Existing datasets like Condensed Movies [1] and VidSitu [46] provide a single visual scene label for an entire movie clip (around 2 minutes long), obtained from descriptions provided as part of the YouTube channel Fandango Movie clips 1. In Fig. 1(c), the provided description, "Johnny Five (Tim Blaney) searches for his humanity in the streets of New York.", mentions only the visual scene street, while the initial set of events takes place inside a church. Instead of assigning a single scene label to the entire movie clip, shot-level visual scene annotation can help track the scene change from church to street.
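Shot-level annotation presupposes first segmenting a clip into shots. As a minimal, hypothetical sketch (not the method used in this work), shot boundaries can be detected by thresholding the distance between consecutive frame color histograms; the histogram representation and the threshold value below are illustrative assumptions:

```python
import numpy as np

def shot_boundaries(histograms, threshold=0.5):
    """Return frame indices assumed to start a new shot.

    histograms: list of 1-D color histograms, one per frame.
    threshold: illustrative cut-off on the L1 distance between
    consecutive normalized histograms (range [0, 2]).
    """
    boundaries = [0]  # the first frame always starts a shot
    for i in range(1, len(histograms)):
        prev = np.asarray(histograms[i - 1], dtype=np.float64)
        curr = np.asarray(histograms[i], dtype=np.float64)
        # L1 distance between normalized histograms
        dist = np.abs(prev / prev.sum() - curr / curr.sum()).sum()
        if dist > threshold:
            boundaries.append(i)
    return boundaries
```

In practice, production shot detectors use more robust content-aware cues, but the thresholding idea is the same.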
In our work, we consider shots within a given movie clip as the fundamental units for visual scene analysis, since a shot consists of a consecutive set of frames depicting the same content, with start and end points determined by a continuous recording from a single camera [25]. Our contributions are as follows:
Movie-centric scene taxonomy: We develop a movie-
centric scene taxonomy by leveraging scene headers
(sluglines) from movie scripts and existing video datasets
with scene labels such as HVU [13].
Automatic shot tagging: We utilize our generated scene
taxonomy to automatically tag around 1.12M shots from
32K movie clips using CLIP [41] based on a frame-wise
aggregation scheme.
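A frame-wise aggregation of CLIP similarity scores into shot-level weak labels could be sketched as follows; the softmax normalization, mean pooling across frames, and the 0.5 threshold are illustrative assumptions rather than the paper's exact scheme:

```python
import numpy as np

def weak_shot_labels(frame_scores, label_names, threshold=0.5):
    """frame_scores: (num_frames, num_labels) array of CLIP image-text
    similarity scores for one shot; returns the labels whose mean
    softmax probability across frames meets the threshold."""
    s = np.asarray(frame_scores, dtype=np.float64)
    # numerically stable softmax over labels, computed per frame
    e = np.exp(s - s.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    shot_probs = probs.mean(axis=0)  # aggregate frames into one shot score
    return [name for name, p in zip(label_names, shot_probs) if p >= threshold]
```

In a real pipeline, the per-frame scores would come from cosine similarities between CLIP frame embeddings and text embeddings of taxonomy prompts (e.g., "a photo of a {scene}").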
Multi-label scene classification: We develop multi-label
scene classification baselines using the shot-level tagged
dataset called MovieCLIP and evaluate them on an in-
dependent shot-level dataset curated by human experts.
The dataset and associated codebase can be accessed at
https://sail.usc.edu/mica/MovieCLIP/
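Multi-label scene classifiers of this kind are commonly evaluated with mean average precision (mAP); a minimal sketch, assuming mAP is computed per label and then averaged (the metric choice here is an assumption for illustration):

```python
import numpy as np

def average_precision(y_true, y_score):
    """AP for one label: precision averaged over the ranks of positives."""
    order = np.argsort(-np.asarray(y_score, dtype=np.float64))
    ranked = np.asarray(y_true)[order]
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(y_true, y_score):
    """Mean of per-label APs; y_true, y_score: (num_samples, num_labels)."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    return float(np.mean([average_precision(y_true[:, j], y_score[:, j])
                          for j in range(y_true.shape[1])]))
```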
Downstream tasks: We further extract feature representations from the baseline models pretrained on MovieCLIP and explore their applicability in diverse downstream tasks of multi-label scene and movie genre classification from web videos [13] and trailers [9], respectively.
1 https://www.youtube.com/channel/UC3gNmTGu-TTbFPpfSs5kNkg
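One simple way to turn per-shot features from a pretrained backbone into a clip-level representation for such downstream tasks is mean pooling followed by L2 normalization; this pooling choice is an assumption for illustration, not necessarily the approach used in this work:

```python
import numpy as np

def video_representation(shot_features):
    """Mean-pool per-shot features into one clip-level vector.

    shot_features: (num_shots, feat_dim) array of features from a
    pretrained backbone; returns an L2-normalized (feat_dim,) vector
    suitable as input to a linear classifier."""
    pooled = np.asarray(shot_features, dtype=np.float64).mean(axis=0)
    return pooled / np.linalg.norm(pooled)
```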
2. Related work
Image datasets for visual scene recognition: Image
datasets for scene classification like MIT Indoor67 [40] relied on categorizing a finite set of 67 indoor scene classes.
A broad categorization into indoor, outdoor (natural) and
outdoor (man-made) groups for 130K images across 397
subcategories was introduced by the SUN dataset [56]. For
large scale scene recognition, the Places dataset [57] was
developed with 434 scene labels spanning 10 million im-
ages. The scene taxonomy considered in Places dataset was
derived from the SUN dataset, followed by careful merg-
ing of similar pairs. It should be noted that the curation of
large scale visual scene datasets like Places relied on crowd-
sourced manual annotations over multiple rounds.
Video datasets for visual scene recognition: While there
has been considerable progress in terms of action recogni-
tion capabilities from videos due to introduction of large
scale datasets like Kinetics [28], ActivityNet [16], AVA
[22], and Something-Something [21], only a few large-scale
datasets like HVU [13] and Scenes, Objects and Actions
(SOA) [43] have focused on scene categorization with ac-
tions and associated objects. SOA was introduced as a multi-task, multi-label dataset of social-media videos across 49 scenes with objects and actions, but its taxonomy curation involved free-form tagging by human annotators followed by automatic cleanup. HVU [13], a recently released
public dataset of web videos with 248 scene labels, relied
on initial tag generation based on cloud APIs followed by
human verification.
Movie-centric visual scene recognition: In the domain
of scene recognition from movies, Hollywood scenes [36]
was first introduced with 10 scene classes extracted from
headers in movie scripts across 3669 movie clips. A so-
cially grounded approach was explored in Moviegraphs
[54] with emphasis on the underlying interactions (relation-
ships/situations) along with spatio-temporal localizations
and associated visual scenes (59 classes). For holistic movie
understanding tasks, the Movienet dataset [27] was introduced with the largest movie-centric scene taxonomy, consisting of 90 place (visual scene) tags with segment-wise human annotations of entire movies. Instead of entire movie
data, short movie clips sourced from YouTube channel of
Fandango Movie clips were used for text-video retrieval in
Condensed movies dataset [1], visual semantic role label-
ing [46] and pretraining object-centric transformers [53] for
long-term video understanding in the LVU dataset [55]. While there is no explicit visual scene labeling, the raw descriptions available on YouTube with the movie clips mention certain visual scene classes.
MovieCLIP, our curated dataset, is built on top of movie