pairs, where the moment candidates are carefully
selected by a surrogate proposal selection module
to reduce the computation cost. SCN (Lin et al., 2020) proposes to generate and select moment candidates and performs semantic completion for the sentence to rank the selected candidates. Nevertheless, the generation and selection of moment candidates also incurs a high computational cost. In addition,
the moment candidates are considered separately, even though the temporal structure of the video is also important for grounding. Figure 1 shows an example of localizing the query “person takes a phone off a desk” in the given video. If the model views the sentence as a whole and performs matching classification, it can hardly learn less distinctive words such as “off” during training. However, these neglected words may play an important role in determining the temporal boundaries of the described moment.
In this paper, we propose a novel framework named Fine-grained Semantic Alignment Network (FSAN) for weakly supervised temporal language grounding. The core idea of FSAN is to learn token-by-clip cross-modal semantic alignment, represented as a token-clip map, and to ground the sentence in the video directly based on this map. Specifically, given an untrimmed video and a description sentence, we first extract their features with a visual encoder and a textual encoder independently. Then,
an Iterative Cross-modal Interaction Module is de-
vised to learn the correspondence between visual
and linguistic representations. To make temporal
predictions for grounding, we further devise a se-
mantic alignment-based grounding module. Based
on the learned cross-modal interaction features, a token-by-clip semantic alignment map is generated, where the (i, j)-th element of the map indicates the relevance between the i-th token in the sentence and the j-th clip in the video. Finally, an alignment-based
grounding module predicts the grounding result
corresponding to the input sentence.
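As a concrete illustration of the token-by-clip alignment map described above, the following PyTorch sketch computes a map of shape (L, N) whose (i, j)-th entry scores the relevance of the i-th token to the j-th clip. The projection layers, feature dimensions, and cosine-similarity scoring are illustrative assumptions rather than the exact design of FSAN's modules.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenClipAlignment(nn.Module):
    # Toy token-by-clip alignment: A[i, j] scores the relevance of the
    # i-th query token to the j-th video clip. The layer sizes and the
    # cosine-similarity scoring are hypothetical choices.
    def __init__(self, token_dim, clip_dim, shared_dim=256):
        super().__init__()
        self.token_proj = nn.Linear(token_dim, shared_dim)
        self.clip_proj = nn.Linear(clip_dim, shared_dim)

    def forward(self, tokens, clips):
        # tokens: (L, token_dim) contextualized word features
        # clips:  (N, clip_dim)  clip features after cross-modal interaction
        t = F.normalize(self.token_proj(tokens), dim=-1)  # (L, shared_dim)
        v = F.normalize(self.clip_proj(clips), dim=-1)    # (N, shared_dim)
        return t @ v.t()                                  # (L, N) alignment map

# Example: 8 query tokens, 32 video clips.
align = TokenClipAlignment(token_dim=300, clip_dim=500)
align_map = align(torch.randn(8, 300), torch.randn(32, 500))
print(align_map.shape)  # torch.Size([8, 32])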
Instead of aggregating sentence semantics into
one representation and generating video moment
candidates, FSAN learns a fine-grained cross-
modal alignment map that helps to retain both the
temporal structure among video clips and the com-
plicated semantics in the sentence. Furthermore,
the grounding module in FSAN makes predictions
mainly based on the cross-modal alignment map,
which reduces the computational cost of generating candidate moment representations. We demon-
strate the effectiveness of the proposed method
on two widely used benchmarks: ActivityNet-Captions (Krishna et al., 2017) and DiDeMo (Hendricks et al., 2017), on which FSAN achieves state-of-the-art performance.
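The excerpt does not detail how the grounding module converts the alignment map into temporal boundaries. As a minimal sketch of why no candidate moments need to be enumerated, one could average token-level relevance per clip and take the longest run of high-scoring clips; the thresholding rule below is purely illustrative and not FSAN's actual prediction head.

import torch

def ground_from_alignment(align_map, threshold=0.5):
    # align_map: (L, N) token-by-clip relevance scores.
    # Hypothetical reduction to a temporal span: average over tokens,
    # keep clips above a fraction of the peak score, and return the
    # longest contiguous run of kept clips (inclusive clip indices).
    clip_scores = align_map.mean(dim=0)                  # (N,) per-clip relevance
    keep = clip_scores >= threshold * clip_scores.max()
    best_len, best_start, run_start = 0, 0, None
    for j, on in enumerate(keep.tolist() + [False]):     # sentinel closes the last run
        if on and run_start is None:
            run_start = j
        elif not on and run_start is not None:
            if j - run_start > best_len:
                best_len, best_start = j - run_start, run_start
            run_start = None
    return best_start, best_start + max(best_len - 1, 0)

start_clip, end_clip = ground_from_alignment(torch.rand(8, 32))

Because the span is read off the map directly, no moment candidates have to be generated or ranked, which matches the motivation above.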
2 Related Work
2.1 Temporal Language Grounding
Temporal language grounding was proposed (Gao et al., 2017; Hendricks et al., 2017) as a new and challenging task that requires deep interaction between the visual and linguistic modalities. Previous methods have explored this task in a fully supervised setting (Gao et al., 2017; Hendricks et al., 2017; Chen et al., 2018; Ge et al., 2019; Xu et al., 2019; Chen and Jiang, 2019; Yuan et al., 2018; Zhang et al., 2019b,a; Lu et al., 2019). Most of them follow a two-stage paradigm: generating candidate moments with sliding windows and subsequently matching them against the language query. Reinforcement learning has also been leveraged for temporal language grounding (He et al., 2019; Wang et al., 2019; Cao et al., 2020).
Despite the success of fully supervised methods, annotating temporal boundaries for large numbers of videos is time-consuming and labor-intensive. Moreover, due to inconsistency among annotators, temporal labels are often ambiguous for models to learn from. To alleviate the cost of fine-grained annotation, the weakly supervised setting has been explored recently (Mithun et al., 2019; Gao et al., 2019; Lin et al., 2020; Ma et al., 2020; Zhang et al., 2020c). TGA (Mithun et al., 2019) maps video candidate features and query features into a latent space to learn cross-modal similarity.
In (Ma et al., 2020), a video-language attention network is proposed to learn cross-modal alignment between language tokens and video segment candidates. In contrast, our FSAN avoids candidate generation altogether and learns fine-grained token-by-clip semantic alignment.
2.2 Transformer in Language and Vision
Since it was first proposed by Vaswani et al. (2017) for machine translation, the transformer has become a prevailing architecture in NLP. The basic building block of the transformer is the multi-head attention module, which aggregates information from the whole input in both the encoder and the decoder. Transformers also demonstrate superior performance in language model pretraining methods (Devlin et al., 2019; Radford et al.,