Language-free Training for Zero-shot Video Grounding
Dahye Kim1   Jungin Park1   Jiyoung Lee2   Seongheon Park1   Kwanghoon Sohn1,3*
1Yonsei University 2NAVER AI Lab 3Korea Institute of Science and Technology (KIST)
{dadaday, newrun, sam121796, khsohn}@yonsei.ac.kr lee.j@navercorp.com
Abstract
Given an untrimmed video and a language query depicting a specific temporal moment in the video, video grounding aims to localize the time interval by understanding the text and video simultaneously. One of the most challenging issues is the extremely time- and cost-consuming collection of annotations, including video captions in natural language form and their corresponding temporal regions. In this paper, we present a simple yet novel training framework for video grounding in the zero-shot setting, which learns a network with only video data and without any annotation. Inspired by the recent language-free paradigm, i.e., training without language data, we train the network without forcing the generation of fake (pseudo) text queries in natural language form. Specifically, we propose a method for learning a video grounding model by selecting a temporal interval as a hypothetical correct answer and treating the visual feature selected by our method within the interval as a language feature, with the help of the well-aligned visual-language space of CLIP. Extensive experiments demonstrate the effectiveness of our language-free training framework, which outperforms the existing zero-shot video grounding method and even several weakly-supervised approaches by large margins on two standard datasets.
1. Introduction
In our daily life, we surf, think, and learn through loads of videos. By extension, we wish to search for the information we want in the videos. Video grounding (also called video moment retrieval) with natural language query aims to help such video search by automatically localizing a temporal moment for various applications such as video surveillance [7] and smart video search [37, 38].
*Corresponding author

A major challenge of video grounding is the exorbitant cost of constructing time interval annotations aligned to text that must also be collected. Although recent fully-supervised video grounding (FSVG) methods [24, 39] have shown remarkable performance on datasets of limited size [14, 19], there is still room for improvement through scaled-up training. In this field especially, large-scale training data is required to cover numerous video domains (e.g., instructional videos, movies, and so forth). However, building annotations for video at the billion scale of image-language datasets such as LAION-5B [35] is an impractical solution.

Figure 1. Given a video and a language query, video grounding aims to retrieve the time interval corresponding to the language query in the video. (a) Video grounding: an example query, "The person opens the bag.", is localized to the interval 0.5s-6.4s. (b) Annotation types depending on the setting: FSVG trains with both time intervals and language queries, WSVG trains with language queries only, and ZSVG trains with neither; a language query is given at test time in all settings. In this paper, we address the zero-shot video grounding (ZSVG) problem, which is the most challenging setting and cannot use any annotations for training.
To address the burden of annotations, researchers have proposed weakly-supervised video grounding (WSVG) methods [15, 23, 28], which use only coarse video-level descriptions for training. But they still require paired video-language data, showing limited applicability in the open world. Recently, zero-shot video grounding (ZSVG) has been proposed in [30]. As illustrated in Fig. 1, ZSVG utilizes only videos to learn the video grounding model in the training stage. To learn the localizing capability in a semi-supervised manner, [30] generates pseudo temporal event regions and corresponding pseudo sentence queries by examining noun-verb statistical co-occurrence patterns. However, the pseudo sentences are built upon a composition of nouns and verbs (e.g., ‘flip person switch door’), which is naturally different from the form of a natural language query (e.g., ‘person flipped the light switch near the door.’). Namely, contrived sentences with a simple composition of nouns and verbs break the structural and compositional generalization inherent in natural language, which might harm performance [22].
In this paper, we propose a novel language-free training framework for zero-shot video grounding. Our solution is to treat the visual feature as pseudo textual information, rather than forcing the generation of pseudo sentences in natural language form. Specifically, we leverage an image-language pretraining model (i.e., CLIP [33]) trained on large-scale web-collected data, which has brought a breakthrough to the multi-modal research field. We conjecture that text and visual features can replace each other without trouble because CLIP provides a well-aligned visual-language semantic space.
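As an illustration of this conjecture, the minimal sketch below shows how CLIP places the two modalities in one space, so that a normalized visual feature can be compared with (or stand in for) a text feature. It assumes the openai `clip` package and a hypothetical local frame image `frame.jpg`; it is not part of our method, only a sanity check of the shared embedding space.

```python
# Minimal sketch: CLIP image and text features live in a shared space,
# so a (normalized) visual feature can be compared against text features.
# Assumes the `clip` package from the openai/CLIP repository and a local
# frame image "frame.jpg" (hypothetical file name).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["the person opens the bag",
                       "a dog runs in the park"]).to(device)

with torch.no_grad():
    v = model.encode_image(image)          # (1, 512) visual feature
    t = model.encode_text(texts)           # (2, 512) text features
    v = v / v.norm(dim=-1, keepdim=True)   # L2-normalize both modalities
    t = t / t.norm(dim=-1, keepdim=True)
    sim = v @ t.T                          # cosine similarities

print(sim)  # the matching caption should score highest
```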
To this end, we first generate temporal proposals that contain meaningful events from a given untrimmed video. With the visual encoder of CLIP, visual features are extracted from all the frames in each proposal. Then, instead of generating a natural sentence from the proposal, our learnable selection transformer picks a dominant feature that plays the role of the pseudo language feature in the video grounding model. Therefore, our method is free from the need to generate a high-quality natural language sentence for each proposal. Moreover, since the dominant visual feature is directly used as the pseudo textual feature, our method has no need to produce a textual embedding from a pseudo text label, which is a time-consuming yet necessary step in the training of the previous method [30]. Finally, the whole model is learned to predict the time intervals corresponding to the pseudo sentence features, with the generated temporal proposals serving as ground truth.
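For concreteness, the following is a minimal sketch of one such language-free training step under our reading of the pipeline: sample a temporal proposal, encode its frames with CLIP, select a dominant frame feature as the pseudo language feature, and train the grounding model to regress the proposal interval. `SelectionTransformer`, `GroundingModel`, and the masking heuristic are assumed placeholders for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class SelectionTransformer(nn.Module):
    """Attends over CLIP frame features and pools one 'dominant' feature."""
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.score = nn.Linear(dim, 1)

    def forward(self, frame_feats):               # (B, T, D)
        h = self.encoder(frame_feats)
        w = self.score(h).softmax(dim=1)          # attention weights over time
        return (w * h).sum(dim=1)                 # (B, D) pseudo language feature

class GroundingModel(nn.Module):
    """Toy grounding head: cross-attends the pseudo query over the video
    and regresses a normalized (start, end) interval."""
    def __init__(self, dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 2), nn.Sigmoid())

    def forward(self, video_feats, query):        # (B, T, D), (B, D)
        ctx, _ = self.attn(query.unsqueeze(1), video_feats, video_feats)
        return self.head(ctx.squeeze(1))          # (B, 2)

def training_step(grounding_model, selector, optimizer, video_feats, proposal):
    """video_feats: (B, T, D) CLIP features of all frames.
    proposal: (B, 2) normalized (start, end) of a sampled pseudo event,
    which serves as the ground-truth interval."""
    T = video_feats.size(1)
    # Keep only the frames inside the proposal when selecting the pseudo query.
    pos = torch.arange(T, device=video_feats.device).float() / T
    inside = (pos[None] >= proposal[:, :1]) & (pos[None] <= proposal[:, 1:])
    pseudo_query = selector(video_feats * inside.unsqueeze(-1).float())

    pred = grounding_model(video_feats, pseudo_query)   # predicted interval
    loss = nn.functional.l1_loss(pred, proposal)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice one would sample many proposals per video and use a richer localization loss, but the sketch captures the core idea: the selected visual feature is consumed exactly where a text embedding would be.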
Our contributions are summarized three-fold:
• We introduce a language-free framework for video grounding that can be an affordable solution to effectively reduce the annotation cost.
• We validate the applicability of the pretrained visual-language model to the video-language task by providing extensive experimental analysis.
• Our language-free training framework outperforms the existing method, achieving state-of-the-art performance, and even shows comparable performance with weakly-supervised approaches on the Charades-STA [14] and ActivityNet Captions [19] datasets.
2. Related Work
2.1. Video Grounding
Video grounding is a recently proposed task [1, 14], which aims to find the best moment in a video grounded on a language query. Most existing methods followed the fully-supervised setting [9, 22, 24, 29, 34, 44, 51, 52, 53, 54, 57, 59] to model fine-grained semantic relations between video and language. However, such a setting requires precise annotations of the start and end timestamps, so manual annotation of temporal boundaries was required, which also introduced subjectivity across different annotators.
Weakly-supervised video grounding has been introduced to alleviate this burden. Existing works can be categorized into two groups. 1) Multi-instance learning (MIL) [8] based methods [15, 16, 27, 28, 41, 56] utilized similarity scores, maximizing scores between positive samples and minimizing scores between negative samples. 2) Reconstruction-based methods [10, 23, 40, 50, 58] used the assumption that the video segment that best reconstructs the text query is close to the ground truth.
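To make the MIL-style objective concrete, the sketch below shows a generic video-level ranking loss of this kind. It is our own illustrative simplification, not the exact formulation of any cited method; `video_emb` and `text_emb` are assumed to be precomputed video-level and sentence-level embeddings.

```python
# Generic MIL-style ranking loss: the similarity of a video with its paired
# (positive) sentence should exceed its similarity with sentences from other
# videos (negatives) by a margin. Illustrative only.
import torch
import torch.nn.functional as F

def mil_ranking_loss(video_emb, text_emb, margin=0.2):
    """video_emb, text_emb: (B, D) embeddings; row i of each forms the positive pair."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.T                                   # (B, B) pairwise similarities
    pos = sim.diag().unsqueeze(1)                   # (B, 1) positive scores
    mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim[mask].view(sim.size(0), -1)           # (B, B-1) negative scores
    return F.relu(margin + neg - pos).mean()        # hinge: positives beat negatives
```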
However, while weakly-supervised approaches were successful in lowering the cost of temporal annotation, the cost of text queries remains problematic. Several works [25, 30] considered an unsupervised setting that does not access the paired annotations. [25] proposed a deep semantic clustering network, which first aggregates distinct semantic features from the whole query set and then generates pseudo labels to provide pseudo supervision for training. [30] generated pseudo labels of temporal boundaries and corresponding query sentences: they first utilized a temporal similarity matrix to find temporal event proposals, then used an off-the-shelf object detector and a fine-tuned RoBERTa [26] to make a structure-less pseudo query. However, a structure-less pseudo query, especially one composed only of nouns and verbs, can be interpreted with several meanings due to the systematic compositionality [5, 13] of natural language. In addition, uninformative words in the query make it hard for the model to distinguish the exact meaning the query originally intended. Furthermore, verbs inferred from detected objects are only loosely bound to the video, in the sense that the verbs are not predicted directly from the video, which leads to the generation of inaccurate pseudo queries.
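For illustration, the similarity-matrix proposal mining mentioned above can be sketched as follows. The thresholded grouping heuristic is our own simplification for exposition and not the algorithm of [30]: adjacent frames whose features stay mutually similar are grouped into one candidate event.

```python
# Illustrative sketch of mining temporal event proposals from a frame-level
# self-similarity matrix (simplified; not the procedure of [30]).
import torch

def event_proposals(frame_feats, threshold=0.8, min_len=3):
    """frame_feats: (T, D) per-frame features.
    Returns a list of (start, end) frame-index proposals."""
    f = torch.nn.functional.normalize(frame_feats, dim=-1)
    sim = f @ f.T                                    # (T, T) self-similarity matrix
    proposals, start = [], 0
    for t in range(1, f.size(0)):
        # Start a new segment when frame t drifts away from the segment's start.
        if sim[start, t] < threshold:
            if t - start >= min_len:
                proposals.append((start, t - 1))
            start = t
    if f.size(0) - start >= min_len:
        proposals.append((start, f.size(0) - 1))
    return proposals
```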
2.2. Language-free Paradigm
As recent trends shift from uni-modal learning to multi-modal learning, vision-language related tasks have attracted attention. Since the number of modalities to be processed has doubled, it becomes difficult to obtain high-quality vision-language training pairs. Several works [30, 60] proposed a so-called ‘language-free paradigm’ to address this problem, which means training without language data in the vision-