pairs, where the moment candidates are carefully
selected by a surrogate proposal selection module
to reduce the computation cost. SCN (Lin et al., 2020) proposes to generate and select moment candidates and performs semantic completion for the sentence to rank the selected candidates. Nevertheless, the generation and selection of moment candidates also incurs a high computational cost. In addition,
the moment candidates are considered separately, even though the temporal structure of the video is also important for grounding. Figure 1 shows an example of localizing the query “person takes a phone off a desk” in the given video. If the model views the sentence as a whole and performs matching classification, it can hardly learn less distinctive words such as “off” during training. However, these neglected words may play an important role in determining the temporal boundaries of the described moment.
In this paper, we propose a novel framework named Fine-grained Semantic Alignment Network (FSAN) for weakly supervised temporal language grounding. The core idea of FSAN is to learn token-by-clip cross-modal semantic alignment, represented as a token-clip map, and to ground the sentence in the video directly based on this map. Specifically, given an untrimmed video and a description sentence, we first extract their features with a visual encoder and a textual encoder independently. Then,
an Iterative Cross-modal Interaction Module is de-
vised to learn the correspondence between visual
and linguistic representations. To make temporal
predictions for grounding, we further devise a se-
mantic alignment-based grounding module. Based
on the learned cross-modal interaction features, a token-by-clip semantic alignment map is generated, where the (i, j)-th element of the map indicates the relevance between the i-th token in the sentence and the j-th clip in the video. Finally, an alignment-based
grounding module predicts the grounding result
corresponding to the input sentence.
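As a concrete illustration of the token-by-clip alignment map described above, the following PyTorch sketch computes a map of shape (L, N) whose (i, j)-th entry scores the relevance of the i-th token to the j-th clip. The projection layers, feature dimensions, and cosine-similarity scoring are illustrative assumptions rather than the exact design of FSAN's modules.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenClipAlignment(nn.Module):
    # Toy token-by-clip alignment: A[i, j] scores the relevance of the
    # i-th query token to the j-th video clip. The layer sizes and the
    # cosine-similarity scoring are hypothetical choices.
    def __init__(self, token_dim, clip_dim, shared_dim=256):
        super().__init__()
        self.token_proj = nn.Linear(token_dim, shared_dim)
        self.clip_proj = nn.Linear(clip_dim, shared_dim)

    def forward(self, tokens, clips):
        # tokens: (L, token_dim) contextualized word features
        # clips:  (N, clip_dim)  clip features after cross-modal interaction
        t = F.normalize(self.token_proj(tokens), dim=-1)  # (L, shared_dim)
        v = F.normalize(self.clip_proj(clips), dim=-1)    # (N, shared_dim)
        return t @ v.t()                                  # (L, N) alignment map

# Example: 8 query tokens, 32 video clips.
align = TokenClipAlignment(token_dim=300, clip_dim=500)
align_map = align(torch.randn(8, 300), torch.randn(32, 500))
print(align_map.shape)  # torch.Size([8, 32])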
Instead of aggregating sentence semantics into
one representation and generating video moment
candidates, FSAN learns a fine-grained cross-
modal alignment map that helps to retain both the
temporal structure among video clips and the com-
plicated semantics in the sentence. Furthermore,
the grounding module in FSAN makes predictions
mainly based on the cross-modal alignment map,
which reduces the computational cost of generating candidate moment representations. We demon-
strate the effectiveness of the proposed method
on two widely used benchmarks: ActivityNet-Captions (Krishna et al., 2017) and DiDeMo (Hendricks et al., 2017), on which FSAN achieves state-of-the-art performance.
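The excerpt does not detail how the grounding module converts the alignment map into temporal boundaries. As a minimal sketch of why no candidate moments need to be enumerated, one could average token-level relevance per clip and take the longest run of high-scoring clips; the thresholding rule below is purely illustrative and not FSAN's actual prediction head.

import torch

def ground_from_alignment(align_map, threshold=0.5):
    # align_map: (L, N) token-by-clip relevance scores.
    # Hypothetical reduction to a temporal span: average over tokens,
    # keep clips above a fraction of the peak score, and return the
    # longest contiguous run of kept clips (inclusive clip indices).
    clip_scores = align_map.mean(dim=0)                  # (N,) per-clip relevance
    keep = clip_scores >= threshold * clip_scores.max()
    best_len, best_start, run_start = 0, 0, None
    for j, on in enumerate(keep.tolist() + [False]):     # sentinel closes the last run
        if on and run_start is None:
            run_start = j
        elif not on and run_start is not None:
            if j - run_start > best_len:
                best_len, best_start = j - run_start, run_start
            run_start = None
    return best_start, best_start + max(best_len - 1, 0)

start_clip, end_clip = ground_from_alignment(torch.rand(8, 32))

Because the span is read off the map directly, no moment candidates have to be generated or ranked, which matches the motivation above.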
2 Related Work
2.1 Temporal Language Grounding
Temporal language grounding was proposed (Gao et al., 2017; Hendricks et al., 2017) as a new and challenging task that requires deep interaction between the visual and linguistic modalities. Previous methods have explored this task in a fully supervised setting (Gao et al., 2017; Hendricks et al., 2017; Chen et al., 2018; Ge et al., 2019; Xu et al., 2019; Chen and Jiang, 2019; Yuan et al., 2018; Zhang et al., 2019b,a; Lu et al., 2019). Most of them follow a two-stage paradigm: generating candidate moments with sliding windows and subsequently matching them against the language query. Reinforcement learning has also been leveraged for temporal language grounding (He et al., 2019; Wang et al., 2019; Cao et al., 2020).
Despite the success of fully supervised methods, annotating temporal boundaries for large numbers of videos is time-consuming and labor-intensive. Moreover, due to inconsistency among annotators, temporal labels are often ambiguous for models to learn from. To alleviate the cost of fine-grained annotation, the weakly supervised setting has been explored recently (Mithun et al., 2019; Gao et al., 2019; Lin et al., 2020; Ma et al., 2020; Zhang et al., 2020c). TGA (Mithun et al., 2019) maps video candidate features and query features into a latent space to learn cross-modal similarity.
In (Ma et al., 2020), a video-language attention network is proposed to learn cross-modal alignment between language tokens and video segment candidates. In contrast, our FSAN avoids candidate generation altogether and learns fine-grained token-by-clip semantic alignment.
2.2 Transformer in Language and Vision
Since it was first proposed by Vaswani et al. (2017) for machine translation, the transformer has become a prevailing architecture in NLP. The basic building block of the transformer is the multi-head attention module, which aggregates information from the whole input in both the encoder and the decoder. Transformers also demonstrate superior performance in language model pretraining methods (Devlin et al., 2019; Radford et al.,