supervised manner, [30] generates pseudo temporal event
regions and corresponding pseudo sentence queries by examining noun-verb statistical co-occurrence patterns. However, these pseudo sentences are simple compositions of nouns and verbs (e.g., ‘flip person switch door’), which naturally differ from the form of natural language queries (e.g., ‘person flipped the light switch near the door.’). That is, such contrived noun-verb compositions break the structural and compositional generalization inherent in natural language, which may harm performance [22].
In this paper, we propose a novel language-free training
framework for zero-shot video grounding. Our solution is to
treat the visual feature as pseudo textual information, rather than forcibly generating pseudo sentences. Specifically, we leverage an image-language pretraining model (i.e., CLIP [33]) trained on large-scale web-collected data, which has driven breakthroughs in multi-modal research. We conjecture that textual and visual features can substitute for each other, since CLIP provides a well-aligned visual-language semantic space.
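As a minimal illustration of this conjecture, the following sketch (using the public CLIP package; the model variant, file name, and sentence are placeholders) encodes one video frame and one sentence into CLIP's shared space and compares them directly by cosine similarity:

```python
import torch
import clip
from PIL import Image

# Load a pretrained CLIP model (ViT-B/32 is an illustrative choice).
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode one video frame and one sentence into the shared embedding space.
frame = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["person flipped the light switch near the door."]).to(device)

with torch.no_grad():
    frame_feat = model.encode_image(frame)   # (1, d)
    text_feat = model.encode_text(text)      # (1, d)

# Because the two embeddings are aligned, cosine similarity is meaningful,
# so the frame feature can stand in for the sentence feature downstream.
frame_feat = frame_feat / frame_feat.norm(dim=-1, keepdim=True)
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
similarity = (frame_feat @ text_feat.T).item()
print(f"cosine similarity: {similarity:.3f}")
```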
To this end, we first generate temporal proposals that
contain meaningful events from a given untrimmed video.
With the visual encoder of CLIP, visual features are ex-
tracted from all the frames in the proposal. Then, instead of generating a natural sentence from the proposal, our learnable selection transformer selects a dominant feature that plays the role of the pseudo language feature in the video grounding model. Therefore, our method is freed from generating a high-quality natural language sentence for each proposal. Moreover, since the dominant visual feature is directly used as the pseudo textual feature, our method does not need to compute a textual embedding from a pseudo text label, a time-consuming yet necessary step in training the previous method [30]. Finally, the whole model is trained to predict the time intervals corresponding to the pseudo sentence features, with the generated temporal proposals serving as ground truth.
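As a rough sketch of this pipeline (not our exact architecture; the attention-based pooling below only stands in for the selection transformer, and all dimensions and names are illustrative), the pseudo query feature can be pooled directly from the proposal's CLIP frame features:

```python
import torch
import torch.nn as nn

class DominantFeatureSelector(nn.Module):
    """Illustrative stand-in for the learnable selection transformer:
    attention-pools per-frame CLIP features into one pseudo query feature."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, dim) CLIP visual features of frames in a proposal.
        q = self.query.expand(frame_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, frame_feats, frame_feats)  # (B, 1, dim)
        return pooled.squeeze(1)                            # (B, dim)

# Hypothetical training step: the proposal boundaries act as ground truth,
# and the pooled visual feature acts as the pseudo sentence feature.
selector = DominantFeatureSelector(dim=512)
frame_feats = torch.randn(4, 16, 512)   # CLIP features of 16 frames per proposal
pseudo_query = selector(frame_feats)    # replaces the text encoder output
# grounding_model(video_feats, pseudo_query) -> predicted (start, end); omitted here.
```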
Our contributions are three-fold:
• We introduce a language-free framework for video
grounding that can be an affordable solution to effec-
tively reduce the annotation cost.
• We validate the applicability of the pretrained visual-
language model to the video-language task by provid-
ing extensive experimental analysis.
• Our language-free training framework outperforms the existing method, achieving state-of-the-art performance, and even shows performance comparable to weakly-supervised approaches on the Charades-STA [14] and ActivityNet Captions [19] datasets.
2. Related Work
2.1. Video Grounding
Video grounding is a recently proposed task [1, 14],
which aims to localize the moment in a video that best matches a given language query. Most existing methods follow a fully-supervised setting [9, 22, 24, 29, 34, 44, 51, 52, 53, 54, 57, 59] to model fine-grained semantic relations between video and language. However, such a setting requires precise manual annotation of start and end timestamps, which also introduces subjectivity across different annotators.
Weakly-supervised video grounding has been introduced
to alleviate this burden. Existing works can be catego-
rized into two groups. 1) Multi-instance learning (MIL) [8] based methods [15, 16, 27, 28, 41, 56] learn cross-modal similarity by maximizing scores for positive (matched) video-query pairs and minimizing scores for negative (mismatched) pairs. 2) Reconstruction-based methods [10, 23, 40, 50, 58] rely on the assumption that the video segment that best reconstructs the text query is closest to the ground truth.
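For concreteness, a typical MIL-style objective used by the first group can be sketched as a ranking loss over matched and mismatched video-query pairs; this is a generic formulation, not the loss of any specific method cited above:

```python
import torch
import torch.nn.functional as F

def mil_ranking_loss(video_feats: torch.Tensor,
                     query_feats: torch.Tensor,
                     margin: float = 0.2) -> torch.Tensor:
    """Generic MIL-style ranking loss: pull matched video-query pairs together
    and push mismatched (negative) pairs apart by at least `margin`."""
    v = F.normalize(video_feats, dim=-1)   # (B, d)
    q = F.normalize(query_feats, dim=-1)   # (B, d)
    sim = v @ q.T                          # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)          # matched pairs lie on the diagonal
    mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Hinge over every mismatched pair in the batch, in both directions.
    loss_v = F.relu(margin + sim - pos)[mask].mean()
    loss_q = F.relu(margin + sim.T - pos)[mask].mean()
    return loss_v + loss_q
```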
However, while weakly-supervised approaches have succeeded in lowering the cost of temporal annotation, the cost of collecting text queries remains problematic. Several works [25, 30] considered an unsupervised setting that does not require paired annotations. [25] proposed a deep semantic clus-
tering network, which first aggregates distinct semantic fea-
tures from the whole query set and then generates pseudo la-
bels to provide pseudo supervision for training. [30] gener-
ated pseudo labels of temporal boundaries and correspond-
ing query sentences. They first utilized a temporal similarity
matrix to find temporal event proposals, then used an off-
the-shelf object detector and fine-tuned RoBERTa [26] to
make a structure-less pseudo query. However, such a structure-less pseudo query, composed only of nouns and verbs, can be interpreted in multiple ways due to the systematic compositionality [5, 13] of natural language. In addition, uninformative words in the query make it hard for the model to capture what the query originally intends to convey. Furthermore, verbs inferred from detected objects are only loosely tied to the video, in the sense that they are not predicted directly from the visual content, which leads to inaccurate pseudo queries.
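The temporal-similarity step mentioned above can be illustrated with a toy sketch (our own simplification, not the actual procedure of [30]): compute a frame-by-frame self-similarity matrix and cut a new event proposal whenever a frame drifts away from the current segment:

```python
import torch
import torch.nn.functional as F

def event_proposals(frame_feats: torch.Tensor, threshold: float = 0.8):
    """Toy proposal generation from a temporal self-similarity matrix.
    frame_feats: (T, d) per-frame features; returns [(start, end)] index pairs."""
    f = F.normalize(frame_feats, dim=-1)
    sim = f @ f.T                                  # (T, T) self-similarity matrix
    proposals, start = [], 0
    for t in range(1, f.size(0)):
        # Start a new segment when frame t is no longer similar to the segment start.
        if sim[start, t] < threshold:
            proposals.append((start, t - 1))
            start = t
    proposals.append((start, f.size(0) - 1))
    return proposals
```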
2.2. Language-free Paradigm
As recent trends shift from uni-modal learning to multi-
modal learning, vision-language related tasks have attracted
attention. Since the number of modalities to be processed has doubled, it becomes difficult to obtain high-quality vision-language training pairs. Several works [30, 60] proposed a
so-called ‘language-free paradigm’ to address this problem,
which means training without language data in the vision-