
dressed, as shown in Fig. 1:
Domain mismatch - scene images vs. movie frames: Visual scenes depicted in movies differ from natural scenes due to the increased focus on actors, multiple concurrent activities, and viewpoint variations such as extreme close-ups and wide-angle shots. An example is shown in Fig. 1 (a) for images from the Places2 dataset [57] and movie frames from the Condensed Movies dataset [1].
Lack of completeness in scene taxonomy: Movies depict both real-life and fictional scenarios that span a wide variety of visual scenes. As shown in Fig. 1 (b), certain movie-centric visual scene classes like battlefield, control room, prison, war room, funeral, and casino are absent from existing public scene taxonomies associated with natural scene image and video datasets.
Lack of shot-specific visual scene annotations: Existing datasets like Condensed Movies [1] and VidSitu [46] provide a single visual scene label for an entire movie clip (around 2 minutes long), obtained from descriptions provided as part of the Fandango Movie clips YouTube channel (https://www.youtube.com/channel/UC3gNmTGu-TTbFPpfSs5kNkg). In Fig. 1 (c), the provided description, Johnny Five (Tim Blaney) searches for his humanity in the streets of New York, mentions only the visual scene street, while the initial set of events takes place inside a church. Instead of assigning a single scene label to the entire movie clip, shot-level visual scene annotation can track the scene change from church to street.
In our work, we consider shots within a given movie clip as the fundamental units for visual scene analysis, since a shot consists of a consecutive set of frames depicting the same content, whose starting and ending points are determined by continuous recording with a single camera [25]. Our contributions are as follows:
• Movie-centric scene taxonomy: We develop a movie-centric scene taxonomy by leveraging scene headers (sluglines) from movie scripts and existing video datasets with scene labels like HVU [13].
• Automatic shot tagging: We utilize our generated scene taxonomy to automatically tag around 1.12M shots from 32K movie clips using CLIP [41] based on a frame-wise aggregation scheme (a sketch of this tagging scheme follows this list).
• Multi-label scene classification: We develop multi-label scene classification baselines using the shot-level tagged dataset, called MovieCLIP, and evaluate them on an independent shot-level dataset curated by human experts (a sketch of a baseline classification head also follows this list). The dataset and associated codebase can be accessed at https://sail.usc.edu/mica/MovieCLIP/
• Downstream tasks: We further extract feature representations from the baseline models pretrained on MovieCLIP and explore their applicability in the diverse downstream tasks of multi-label scene classification from web videos [13] and movie genre classification from trailers [9].
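To make the frame-wise aggregation idea above concrete, the following is a minimal sketch of zero-shot shot tagging with the open-source CLIP package. The label subset, the prompt template, the sampled frame paths, and the mean-pooling aggregation are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of frame-wise CLIP shot tagging (the label subset,
# prompt template, and mean-pooling choice are assumptions).
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical subset of a movie-centric scene taxonomy.
SCENE_LABELS = ["battlefield", "control room", "prison",
                "casino", "church", "street"]
TEXT_TOKENS = clip.tokenize(
    [f"a photo of a {c}" for c in SCENE_LABELS]).to(device)

def tag_shot(frame_paths, top_k=3):
    """Tag one shot: score each sampled frame against the taxonomy,
    then mean-pool the per-frame label probabilities across frames."""
    images = torch.stack(
        [preprocess(Image.open(p)) for p in frame_paths]).to(device)
    with torch.no_grad():
        img_feats = model.encode_image(images)
        txt_feats = model.encode_text(TEXT_TOKENS)
        img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
        txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_feats @ txt_feats.T).softmax(dim=-1)  # frames x labels
        shot_scores = probs.mean(dim=0)  # frame-wise aggregation
    scores, idx = shot_scores.topk(top_k)
    return [(SCENE_LABELS[int(i)], float(s)) for i, s in zip(idx, scores)]
```

Applied per shot, e.g., on frames sampled at a fixed rate between detected shot boundaries, the top-scoring labels can serve as weak multi-label tags.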
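Similarly, the multi-label baselines can be read as a sigmoid-output classifier trained with a binary cross-entropy objective over shot-level features. The PyTorch sketch below uses placeholder dimensions; the 512-d feature and the 160-class taxonomy size are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SceneTagHead(nn.Module):
    """Linear multi-label head over pooled shot features
    (dimensions are placeholders, not the paper's configuration)."""
    def __init__(self, feature_dim=512, num_classes=160):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, shot_features):
        return self.fc(shot_features)  # logits; sigmoid applied at inference

head = SceneTagHead()
criterion = nn.BCEWithLogitsLoss()  # standard multi-label objective
feats = torch.randn(8, 512)                     # batch of shot features
labels = torch.randint(0, 2, (8, 160)).float()  # multi-hot scene tags
loss = criterion(head(feats), labels)
loss.backward()
```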
2. Related work
Image datasets for visual scene recognition: Image datasets for scene classification like MIT Indoor67 [40] relied on categorizing a finite set of 67 indoor scene classes. A broad categorization into indoor, outdoor (natural), and outdoor (man-made) groups for 130K images across 397 subcategories was introduced by the SUN dataset [56]. For large-scale scene recognition, the Places dataset [57] was developed with 434 scene labels spanning 10 million images. The scene taxonomy considered in the Places dataset was derived from the SUN dataset, followed by careful merging of similar class pairs. It should be noted that the curation of large-scale visual scene datasets like Places relied on crowdsourced manual annotations over multiple rounds.
Video datasets for visual scene recognition: While there has been considerable progress in action recognition from videos due to the introduction of large-scale datasets like Kinetics [28], ActivityNet [16], AVA [22], and Something-Something [21], only a few large-scale datasets like HVU [13] and Scenes, Objects and Actions (SOA) [43] have focused on scene categorization along with actions and associated objects. SOA was introduced as a multi-task, multi-label dataset of social-media videos across 49 scenes with objects and actions, but its taxonomy curation involved free-form tagging by human annotators followed by automatic cleanup. HVU [13], a recently released public dataset of web videos with 248 scene labels, relied on initial tag generation based on cloud APIs followed by human verification.
Movie-centric visual scene recognition: In the domain of scene recognition from movies, Hollywood Scenes [36] was first introduced with 10 scene classes extracted from headers in movie scripts across 3669 movie clips. A socially grounded approach was explored in MovieGraphs [54], with emphasis on the underlying interactions (relationships/situations) along with spatio-temporal localizations and associated visual scenes (59 classes). For holistic movie understanding tasks, the MovieNet dataset [27] was introduced with the largest movie-centric scene taxonomy, consisting of 90 place (visual scene) tags with segment-wise human annotations of entire movies. Instead of entire movies, short movie clips sourced from the Fandango Movie clips YouTube channel were used for text-video retrieval in the Condensed Movies dataset [1], visual semantic role labeling [46], and pretraining object-centric transformers [53] for long-term video understanding in the LVU dataset [55]. While there is no explicit visual scene labeling, the raw descriptions available on YouTube with the movie clips mention certain visual scene classes.
MovieCLIP, our curated dataset, is built on top of movie