Fine-grained Semantic Alignment Network for Weakly Supervised
Temporal Language Grounding
Yuechen Wang Wengang Zhou Houqiang Li
University of Science and Technology of China
wyc9725@mail.ustc.edu.cn, {zhwg, lihq}@ustc.edu.cn
Abstract
Temporal language grounding (TLG) aims to
localize a video segment in an untrimmed
video based on a natural language descrip-
tion. To alleviate the expensive cost of man-
ual annotations for temporal boundary labels,
we are dedicated to the weakly supervised
setting, where only video-level descriptions
are provided for training. Most of the ex-
isting weakly supervised methods generate a
candidate segment set and learn cross-modal
alignment through a MIL-based framework.
However, the temporal structure of the video
as well as the complicated semantics in the
sentence are lost during the learning. In
this work, we propose a novel candidate-
free framework: Fine-grained Semantic Align-
ment Network (FSAN), for weakly supervised
TLG. Instead of viewing the sentence and can-
didate moments as a whole, FSAN learns
token-by-clip cross-modal semantic alignment
by an iterative cross-modal interaction mod-
ule, generates a fine-grained cross-modal se-
mantic alignment map, and performs ground-
ing directly on top of the map. Extensive
experiments are conducted on two widely-
used benchmarks: ActivityNet-Captions, and
DiDeMo, where our FSAN achieves state-of-
the-art performance.
1 Introduction
Given an untrimmed video and a natural language
sentence, Temporal Language Grounding (TLG)
aims to localize the temporal boundaries of the
video segment described by a referred sentence.
TLG is a challenging problem with great impor-
tance in various multimedia applications, e.g.,
video retrieval (Shao et al.,2018), visual question
answering (Tapaswi et al.,2016;Antol et al.,2015;
Yu et al.,2020), and visual reasoning (Yang et al.,
2018). Since its first proposal (Gao et al.,2017;
Hendricks et al., 2017), tremendous progress has
Figure 1: Illustration of the fine-grained semantic alignment map for temporal language grounding. In the example, the query “person takes a phone off a desk” is aligned token-by-clip with the video (axes: tokens × clips), and the described segment corresponds to clips (1, 3).
been made on this problem (Wu and Han,2018;
Chen et al.,2018;Ge et al.,2019;Yuan et al.,
2018;Zhang et al.,2019a;He et al.,2019;Wang
et al.,2019;Zhang et al.,2020b;Ning et al.,2021).
Despite the achievements with supervised learn-
ing, the temporal boundaries for every sentence
query need to be manually annotated for training,
which is expensive, time-consuming, and poten-
tially noisy. On the other hand, it is much easier to
collect a large amount of video-level descriptions
without detailed temporal annotations, since video-
level descriptions naturally appear with videos si-
multaneously on the Internet (e.g., YouTube). To
this end, some prior works are dedicated to weakly
supervised setting, where only video-level descrip-
tions are provided, without temporal labels.
Most of the previous weakly supervised meth-
ods follow a Multiple Instance Learning (MIL)
paradigm, which samples matched and non-
matched video-sentence pairs, and learns a match-
ing classifier to implicitly learn the cross-modal
alignment. However, during the matching classi-
fication, the input sentence is often treated as a
single feature query, neglecting the complicated
linguistic semantics. VLANet (Ma et al.,2020)
treats tokens in the input sentence separately, and
performs cross-modal attention on token-moment
pairs, where the moment candidates are carefully
selected by a surrogate proposal selection module
to reduce computation cost. SCN (Lin et al.,2020)
proposes to generate and select moment candidates
and performs semantic completion for the sentence
to rank selected candidates. Nevertheless, the gen-
eration and selection process of moment candidates
also involves high computational costs. In addition,
the moment candidates are considered separately,
while the temporal structure of the video is also im-
portant for grounding. Figure 1 shows an example
to localize the query “person takes a phone off a
desk” in the given video. If the model views the
sentence as a whole and performs matching classifi-
cation, it is hard to learn undistinguished words like
off ” during the training. However, the neglected
words may play important roles to determine the
temporal boundaries of the described moment.
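For concreteness, the MIL-style matching objective adopted by these prior methods can be sketched roughly as follows; this is a simplified illustration using a hinge ranking loss over matched and mismatched video-sentence pairs, and the exact formulation varies across methods.

```python
import torch
import torch.nn.functional as F

def mil_matching_loss(score_pos, score_neg, margin=0.2):
    """Hinge ranking loss over matched / non-matched video-sentence pairs.

    score_pos: (B,) matching scores for videos paired with their own description.
    score_neg: (B,) scores for the same videos paired with other videos' descriptions.
    The classifier is pushed to rank matched pairs above mismatched ones, which
    only implicitly supervises the cross-modal alignment.
    """
    return F.relu(margin - score_pos + score_neg).mean()

# Toy usage: the scores would come from a video-sentence matching classifier.
loss = mil_matching_loss(torch.rand(8), torch.rand(8))
```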
In this paper, we propose a novel framework
named Fine-grained Semantic Alignment Net-
work (FSAN), for weakly supervised temporal lan-
guage grounding. The core idea of FSAN is to
learn token-by-clip cross-modal semantic align-
ment, represented as a token-clip map, and to ground the sentence on the video directly based on it. Specifi-
cally, given an untrimmed video and a description
sentence, we first extract their features by visual
encoder and textual encoder independently. Then,
an Iterative Cross-modal Interaction Module is de-
vised to learn the correspondence between visual
and linguistic representations. To make temporal
predictions for grounding, we further devise a se-
mantic alignment-based grounding module. Based
on the learned cross-modal interacted features, a
token-by-clip semantic alignment map is generated,
where the $(i, j)$-th element of the map indicates the relevance between the $i$-th token in the sentence and the $j$-th clip in the video. Finally, an alignment-based grounding module predicts the grounding result corresponding to the input sentence.
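To make the map concrete, the sketch below computes a token-by-clip relevance map as a normalized dot product between token and clip features. The cosine-similarity choice is an illustrative assumption; in FSAN the map is derived from the cross-modal interacted features rather than from raw encoder outputs.

```python
import torch
import torch.nn.functional as F

def semantic_alignment_map(tokens, clips):
    """Token-by-clip relevance map.

    tokens: (N_s, d) context-aware token features
    clips:  (N_v, d) clip features
    Returns an (N_s, N_v) map whose (i, j)-th entry scores the
    relevance between the i-th token and the j-th clip.
    """
    tokens = F.normalize(tokens, dim=-1)   # unit-normalize so the dot product is a cosine similarity
    clips = F.normalize(clips, dim=-1)
    return tokens @ clips.t()

# Toy usage with matching feature dimensions (d_s = d_v).
sap = semantic_alignment_map(torch.randn(7, 256), torch.randn(6, 256))
print(sap.shape)  # torch.Size([7, 6])
```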
Instead of aggregating sentence semantics into
one representation and generating video moment
candidates, FSAN learns a fine-grained cross-
modal alignment map that helps to retain both the
temporal structure among video clips and the com-
plicated semantics in the sentence. Furthermore,
the grounding module in FSAN makes predictions
mainly based on the cross-modal alignment map,
which alleviates the computation cost of candidate
moment representation generation. We demon-
strate the effectiveness of the proposed method
on two widely-used benchmarks: ActivityNet-
Captions (Krishna et al.,2017) and DiDeMo (Hen-
dricks et al.,2017), where state-of-the-art perfor-
mance is achieved by FSAN.
2 Related Work
2.1 Temporal Language Grounding
Temporal language grounding was first proposed (Gao et al., 2017; Hendricks et al., 2017) as a new and challenging task, which requires deep interaction between the visual and linguistic modalities. Pre-
vious methods have explored this task in a fully
supervised setting (Gao et al.,2017;Hendricks
et al.,2017;Chen et al.,2018;Ge et al.,2019;Xu
et al.,2019;Chen and Jiang,2019;Yuan et al.,
2018;Zhang et al.,2019b,a;Lu et al.,2019). Most
of them follow a two-stage paradigm: generating
candidate moments with sliding windows and sub-
sequently matching them against the language query. Reinforce-
ment learning has also been leveraged for temporal
language grounding (He et al.,2019;Wang et al.,
2019;Cao et al.,2020).
Despite the boom of fully supervised methods,
it is very time-consuming and labor-intensive to
annotate temporal boundaries for a large number
of videos. Moreover, due to inconsistency among annotators, temporal labels are often ambiguous for models to learn. To alleviate the cost of fine-grained annotation, the weakly supervised setting has been explored recently (Mithun et al., 2019; Gao
et al.,2019;Lin et al.,2020;Ma et al.,2020;Zhang
et al., 2020c). TGA (Mithun et al., 2019)
maps video candidate features and query features
into a latent space to learn cross-modal similarity.
In (Ma et al.,2020), a video-language attention net-
work is proposed to learn cross-modal alignment
between language tokens and video segment candi-
dates. In contrast, our FSAN dispenses with candidate generation and learns fine-grained token-by-clip semantic alignment.
2.2 Transformer in Language and Vision
Since it was first proposed for machine translation (Vaswani et al., 2017), the transformer has become a prevailing architecture in NLP. The basic building block of the transformer is the multi-head atten-
tion module, which aggregates information from
the whole input in both the encoder and the decoder. The transformer demonstrates su-
perior performance in language model pretrain-
ing methods (Devlin et al.,2019;Radford et al.,
Figure 2: The architecture of FSAN. It consists of four main components: (1) a text encoder, (2) a visual encoder, (3) an iterative cross-modal interaction module composed of stacked inner-modal and cross-modal attention blocks with FC layers, and (4) an alignment-based grounding module. The example query shown in the figure is “Brown dog runs at the camera”.
2018;Yang et al.,2019), and achieves competi-
tive performance on diverse NLP problems. Re-
cently, transformer has been introduced to various
computer vision tasks, such as image classifica-
tion (Chen et al.,2020b), image generation (Par-
mar et al.,2018), object detection (Carion et al.,
2020), semantic segmentation (Wang et al.,2021a),
tracking (Wang et al., 2021b), etc. Compared with CNNs, the attention mechanism captures more global dependencies; therefore, the transformer also shows strong performance in low-level tasks (Chen et al., 2020a). The transformer has also proven effective in the multi-modal area, including multi-modal repre-
sentations (Zhang et al.,2020a;Tan and Bansal,
2019;Su et al.,2020;Sun et al.,2019) and applica-
tions (Shi et al.,2020;Ju et al.,2020;Liang et al.,
2020). Inspired by the great success, we devise
an iterative cross-modal interaction module mainly
based on the multi-head attention mechanism.
3 Our Approach
Given an untrimmed video and a text-sentence
query, a temporal grounding model aims to localize
the most relevant moment in the video, represented
by its beginning and ending timestamps. In this pa-
per, we consider the weakly supervised setting, i.e.,
for each video $V$, a textual query $S$ is provided for
training. The query sentence describes a specific
moment in the video, yet the temporal boundaries
are not provided for training. In the inference stage,
the weakly trained model is required to predict the
beginning and ending timestamps of the video mo-
ment that corresponds to the input sentence $S$.
We present a novel framework named Fine-
grained Semantic Alignment Network (FSAN) for
the temporal language grounding problem. As
shown in Figure 2, given a video and text query,
we first encode them separately. The resulting rep-
resentations then interact with each other through
an iterative cross-modal interaction module. The
outputs are used to learn a Semantic Alignment
Map (SAP) between the two modalities. Finally,
the SAP is fed into an alignment-based grounding
module to predict scores for all possible moments.
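As a rough sketch of how such an iterative interaction module could be organized, the PyTorch snippet below follows the block structure suggested in Figure 2 (inner-modal and cross-modal multi-head attention followed by FC layers). The hidden size, number of heads, number of iterations, and the ordering of the attention blocks are illustrative assumptions rather than the configuration used in FSAN.

```python
import torch
import torch.nn as nn

class InteractionBlock(nn.Module):
    """One iteration: inner-modal self-attention followed by cross-modal attention."""
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.self_txt = nn.MultiheadAttention(d, heads, batch_first=True)
        self.self_vid = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_txt = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_vid = nn.MultiheadAttention(d, heads, batch_first=True)
        self.fc_txt = nn.Linear(d, d)
        self.fc_vid = nn.Linear(d, d)

    def forward(self, txt, vid):
        # Inner-modal attention: each modality attends to itself.
        txt = txt + self.self_txt(txt, txt, txt)[0]
        vid = vid + self.self_vid(vid, vid, vid)[0]
        # Cross-modal attention: tokens attend to clips and vice versa.
        txt = txt + self.cross_txt(txt, vid, vid)[0]
        vid = vid + self.cross_vid(vid, txt, txt)[0]
        return self.fc_txt(txt), self.fc_vid(vid)

class IterativeInteraction(nn.Module):
    def __init__(self, d=256, heads=4, num_iters=3):
        super().__init__()
        self.blocks = nn.ModuleList([InteractionBlock(d, heads) for _ in range(num_iters)])

    def forward(self, txt, vid):
        for blk in self.blocks:
            txt, vid = blk(txt, vid)
        return txt, vid

# txt: (B, N_s, d) token features; vid: (B, N_v, d) clip features.
txt, vid = IterativeInteraction()(torch.randn(2, 7, 256), torch.randn(2, 6, 256))
```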
In the following subsections, we will first intro-
duce the visual and language encoder, then describe
the Iterative Cross-Modal Interaction Module. Fi-
nally, we will elaborate on the semantic alignment
map and the grounding module based on it.
3.1 Input Representation
Language Encoder.
We use a standard trans-
former encoder (Vaswani et al.,2017) to extract the
semantic information for the input query sentence $S$. Each token in the input query is first embedded using GloVe (Pennington et al., 2014). The resulting vectors are mapped to dimension $d_s$ by a linear layer and fed into a transformer encoder to obtain context-aware token features $S = \{w_i\}_{i=1}^{N_s}$, where $N_s$ is the number of tokens and $w_k \in \mathbb{R}^{d_s}$ denotes the feature of the $k$-th token in the sentence.
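A minimal sketch of this encoder is given below, assuming the pre-trained GloVe vectors are available as an embedding matrix and using a standard PyTorch transformer encoder; the layer count, head number, and hidden size are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LanguageEncoder(nn.Module):
    def __init__(self, glove_weights, d_s=256, num_layers=2, heads=4):
        super().__init__()
        # glove_weights: (vocab_size, 300) pre-trained GloVe embedding matrix.
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.proj = nn.Linear(glove_weights.size(1), d_s)  # map GloVe dim -> d_s
        layer = nn.TransformerEncoderLayer(d_model=d_s, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids):
        # token_ids: (B, N_s) word indices of the query sentence.
        x = self.proj(self.embed(token_ids))  # (B, N_s, d_s)
        return self.encoder(x)                # context-aware token features

# Toy usage with a random stand-in for the GloVe matrix (1000 words x 300 dims).
enc = LanguageEncoder(torch.randn(1000, 300))
feats = enc(torch.randint(0, 1000, (2, 7)))   # (2, 7, 256)
```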
Video Encoder.
For the input videos, we extract
visual features using a pretrained feature extractor
and then apply temporal pooling on the frame features to divide the video into $N_v$ clips. Hence, the video can be represented by $V = \{v_j\}_{j=1}^{N_v}$, where $v_j \in \mathbb{R}^{d_v}$ denotes the feature of the $j$-th video clip and $d_v = d_s$ is the dimension of the visual features. Experimental results show that the computation cost is considerably reduced by the temporal pooling.
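One simple way to realize this step is to average-pool the frame features into a fixed number of clips and project them to the shared dimension. The sketch below is an assumption about such an implementation; the actual feature extractor, clip count, and pooling scheme used in the experiments are not specified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoEncoder(nn.Module):
    def __init__(self, feat_dim=1024, d=256, num_clips=64):
        super().__init__()
        self.num_clips = num_clips           # N_v
        self.proj = nn.Linear(feat_dim, d)   # map extractor features to d_v = d_s

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim) features from a pretrained extractor
        # (e.g., a 3D CNN); T is the number of frames or snippets.
        x = frame_feats.transpose(1, 2)               # (B, feat_dim, T)
        x = F.adaptive_avg_pool1d(x, self.num_clips)  # temporal pooling to N_v clips
        return self.proj(x.transpose(1, 2))           # (B, N_v, d)

clips = VideoEncoder()(torch.randn(2, 500, 1024))      # (2, 64, 256)
```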