Weakly-Supervised Temporal Article Grounding
Long Chen, Yulei Niu, Brian Chen, Xudong Lin, Guangxing Han,
Christopher Thomas, Hammad Ayyubi, Heng Ji, and Shih-Fu Chang
Columbia University Virginia Tech University of Illinois at Urbana-Champaign
{cl3695, yn2338, bc2754, xl2798, gh2561, ha2578, sc250}@columbia.edu
chris@cs.vt.edu, hengji@illinois.edu
Abstract
Given a long untrimmed video and natural language queries, video grounding (VG) aims to temporally localize the semantically-aligned video segments. Almost all existing VG work holds two simple but unrealistic assumptions: 1) All query sentences can be grounded in the corresponding video. 2) All query sentences for the same video are always at the same semantic scale. Unfortunately, both assumptions make today's VG models fail to work in practice. For example, in real-world multimodal assets (e.g., news articles), most of the sentences in an article cannot be grounded in their affiliated videos, and they typically have rich hierarchical relations (at different semantic scales). To this end, we propose a new challenging grounding task: Weakly-Supervised temporal Article Grounding (WSAG). Specifically, given an article and a relevant video, WSAG aims to localize all “groundable” sentences to the video, and these sentences are possibly at different semantic scales. Accordingly, we collect the first WSAG dataset to facilitate this task: YouwikiHow, which borrows the inherent multi-scale descriptions in wikiHow articles and plentiful YouTube videos. In addition, we propose a simple but effective method DualMIL for WSAG, which consists of a two-level MIL¹ loss and a single-/cross-sentence constraint loss. These training objectives are carefully designed for these relaxed assumptions. Extensive ablations have verified the effectiveness of DualMIL².
1 Introduction
Video Grounding (VG), i.e., localizing video segments that semantically correspond to (coreference relation) query sentences, is one of the fundamental tasks in multimodal understanding. Further, video grounding can serve as an indispensable technique for many downstream applications, such as text-oriented highlight detection (Lei et al., 2021), video retrieval (Miech et al., 2020), or video question answering (Ye et al., 2017; Xiao et al., 2022).

¹ MIL: Multiple Instance Learning.
² Codes: https://github.com/zjuchenlong/WSAG.
Figure 1: (a) Single sentence grounding: The query is a single sentence. (b) Multi-sentence grounding: The queries are multiple sentences. (c) Article grounding: The query is an article, which consists of multiple sentences at different scales (e.g., How to Make Pancakes). High-level and low-level sentences are denoted with corresponding formats. ✓ and ✗ denote that a sentence can or cannot be grounded to the video, respectively.
Early VG efforts mainly focus on single sentence grounding (Gao et al., 2017; Hendricks et al., 2017) (cf. Figure 1(a)). Thanks to advanced representation learning and multimodal fusion techniques, single sentence VG has achieved unprecedented progress over the recent years (Cao et al., 2021). The next step towards general VG is to ground multiple sentences to the same video (cf. Figure 1(b)). A straightforward solution for multi-sentence VG is utilizing the single sentence VG model for each sentence individually. Since these query sentences associated with the same video are always semantically related, recent multi-sentence VG methods directly ground all queries simultaneously by considering their temporal order or semantic relations (Bao et al., 2021; Shi et al., 2021).
Figure 2: The only supervision for WSAG is a wikiHow article (e.g., How to Make Lemonade) and some corresponding YouTube videos about the same task.
Unfortunately, all existing VG attempts hold two simple but unrealistic assumptions: 1) All query sentences can be grounded in the corresponding video. Although this assumption is acceptable for the VG task itself, it greatly limits the usage of VG models in real-world multimodal assets. For example, in news articles, most of the sentences in an article cannot be grounded in their affiliated videos. 2) All query sentences for the same video are always at the same semantic scale. By “same scale”, we mean that all VG models overlook the hierarchical (or subevent) relations (Aldawsari and Finlayson, 2019; Yao et al., 2020) between these query sentences. For example, in Figure 1(c), the sentence “Stir gently, leaving some small clumps of dry ingredients in the batter” (S2) is one of the subevents of “Add the butter and milk to the mix” (S1), i.e., S1 and S2 are at different semantic scales. Thus, the second assumption makes current VG models fail to perceive the semantic scales, and achieve unsatisfactory performance with multi-scale queries.
To this end, we propose a more realistic but challenging grounding task: Article Grounding (AG), which relaxes both above-mentioned assumptions. Specifically, given a video and a relevant article (i.e., a sequence of sentences), AG requires the model to localize only “groundable” sentences to video segments, and these sentences are possibly at different semantic scales. To further avoid the manual annotations for the large-scale training set, in this paper, we consider a more meaningful setting: weakly-supervised AG (WSAG). As shown in Figure 2, the only supervision for WSAG is that the given video and article are about the same task³.
Since there is no prior work on WSAG, we collect a new dataset, YouwikiHow, to benchmark the research. YouwikiHow is built on top of wikiHow articles and YouTube videos⁴. In particular, we group a wikiHow article and an arbitrary video about the same task as a document-level pair (cf. Figure 2). For the training set, we conduct a set of carefully designed operations to control the quality of training samples, e.g., task filtering or sentence simplification. For the test set, we directly borrow the manual step grounding annotations in the existing CrossTask (Zhukov et al., 2019) dataset and propagate them to wikiHow article sentences.

³ A task means the same topic with clear and specific steps.
In addition, we propose a simple but effective dual-loss-constrained, MIL-based method for WSAG, dubbed DualMIL. Specifically, for the first assumption, we relax the widely-used Multiple Instance Learning (MIL) loss into a two-level MIL loss. By “two-level”, we mean that we regard all sentences for each article (sentence-level) and all proposals for each sentence (segment-level) as the “bag” at two different levels. Then, we obtain the global video-article matching score by aggregating all matching scores over the two-level bag. This two-level MIL inherently allows some queries that cannot be grounded in the video. Meanwhile, to avoid obtaining many highly-overlapping segments, we propose a single-sentence constraint to suppress the proposals whose neighbor proposals have higher matching scores with the query.
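To make the two-level bag idea concrete, the following is a minimal PyTorch sketch of one plausible form of these two objectives. It assumes max-pooling over proposals for the segment-level bag, top-k averaging over sentences for the sentence-level bag, a binary cross-entropy loss on the resulting video-article score, and a neighbor-domination penalty for the single-sentence constraint; the aggregation functions, loss forms, and all names are illustrative assumptions, not the released DualMIL implementation.

```python
import torch
import torch.nn.functional as F

def two_level_mil_loss(scores, article_mask, proposal_mask, match_label, k=3):
    """Sketch of a two-level MIL objective (illustrative, not the authors' exact loss).

    scores:        (S, P) matching scores (logits) between S article sentences
                   and P segment proposals of one video.
    article_mask:  (S,)  1 for real sentences, 0 for padding.
    proposal_mask: (P,)  1 for valid proposals, 0 for padding.
    match_label:   1.0 if the (video, article) pair is matched, else 0.0.
    """
    neg_inf = torch.finfo(scores.dtype).min
    masked = scores.masked_fill(proposal_mask[None, :] == 0, neg_inf)

    # Segment-level bag: each sentence keeps only its best-matching proposal.
    sent_scores, _ = masked.max(dim=1)                          # (S,)

    # Sentence-level bag: only the top-k sentences need to match the video,
    # so sentences that are not groundable do not hurt the article score.
    sent_scores = sent_scores.masked_fill(article_mask == 0, neg_inf)
    k = max(1, min(k, int(article_mask.sum())))
    article_score = sent_scores.topk(k).values.mean()           # scalar

    # Video-article matching loss on the aggregated bag score.
    target = scores.new_tensor(match_label)
    return F.binary_cross_entropy_with_logits(article_score, target)


def single_sentence_constraint(scores, window=1):
    """Sketch of the single-sentence constraint: suppress proposals whose
    temporal neighbors match the same sentence more strongly (the exact form
    in DualMIL may differ). Assumes proposals are ordered along the timeline."""
    probs = scores.sigmoid()                                     # (S, P)
    neighbor_max = torch.full_like(probs, float("-inf"))
    for offset in range(1, window + 1):
        # Best score among left / right neighbors at distance `offset`.
        neighbor_max[:, offset:] = torch.maximum(neighbor_max[:, offset:],
                                                 probs[:, :-offset])
        neighbor_max[:, :-offset] = torch.maximum(neighbor_max[:, :-offset],
                                                  probs[:, offset:])
    # Push down proposals that are dominated by a neighboring proposal.
    dominated = (neighbor_max > probs).float().detach()
    return (probs * dominated).mean()
```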
For the second assumption, we enhance the model's ability to perceive queries at different semantic scales by considering the hierarchical relations across sentences. In particular, we assume that, for highly matched proposals, a high-level sentence should be more likely to be grounded than its low-level sentences, and we propose a cross-sentence constraint loss accordingly. We show the effectiveness of DualMIL over state-of-the-art methods through extensive ablations.
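The cross-sentence constraint can likewise be sketched as a margin-style penalty: wherever a proposal matches a low-level sentence strongly, its high-level (parent) sentence should match it at least as strongly. The hinge form, the per-proposal weighting, and the `parent_index` bookkeeping below are assumptions for illustration only.

```python
import torch.nn.functional as F

def cross_sentence_constraint(scores, parent_index, is_low_level, margin=0.0):
    """Sketch of a hierarchy-aware constraint (illustrative, not the paper's
    exact loss): on strongly matched proposals, a high-level (summary) sentence
    should score at least as high as any of its low-level (detail) sentences.

    scores:       (S, P) sentence-proposal matching scores (logits).
    parent_index: (S,)  for each low-level sentence, the row index of its
                        high-level parent; arbitrary for high-level rows.
    is_low_level: (S,)  boolean mask marking low-level (detail) sentences.
    """
    probs = scores.sigmoid()                        # (S, P)
    child = probs[is_low_level]                     # (S_low, P)
    parent = probs[parent_index[is_low_level]]      # (S_low, P)

    # Violation: a proposal matches the detail better than its summary.
    violation = F.relu(child - parent + margin)     # (S_low, P)

    # Weight by the child's confidence so the constraint mainly acts on
    # highly matched proposals.
    weight = child.detach()
    return (weight * violation).sum() / weight.sum().clamp(min=1e-6)
```

In a full training loop, such a term would simply be added to the two-level MIL loss and the single-sentence constraint with trade-off weights.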
In summary, we make three contributions:
1. To the best of our knowledge, we are the first work to discuss the two unrealistic assumptions: all query sentences are groundable, and all query sentences are at the same semantic scale. Meanwhile, we propose a meaningful WSAG task.
2. To benchmark the research, we collect the first WSAG dataset: YouwikiHow.
3. We further propose a simple but effective method, DualMIL, for WSAG, which consists of three different model-agnostic training objectives.

⁴ https://www.wikihow.com/ & https://www.youtube.com/.
2 Related Work
Single Sentence & Multi-Sentence VG. Mainstream solutions for single sentence VG can be coarsely categorized into two groups: 1) Top-down methods (Hendricks et al., 2017; Gao et al., 2017; Zhang et al., 2019, 2020b; Chen et al., 2018; Yuan et al., 2019a, 2021; Wang et al., 2020; Xiao et al., 2021b,a; Liu et al., 2021b,a; Lan et al., 2022): They first cut the given video into a set of segment proposals with different durations, and then calculate matching scores between the query and all segment proposals. Their performance heavily relies on predefined rules for proposal settings (e.g., temporal sizes). 2) Bottom-up methods (Yuan et al., 2019b; Lu et al., 2019; Zeng et al., 2020; Chen et al., 2020a, 2018; Zhang et al., 2020a): They directly predict the two temporal boundaries of the target segment by regarding the query as a conditional input. Compared to their top-down counterparts, bottom-up methods always fail to consider the global context between the two boundaries (i.e., inside the segment). In this paper, we follow the top-down framework, and our DualMIL is model-agnostic.
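For reference, the proposal step of a top-down pipeline can be as simple as enumerating multi-scale sliding windows over the video timeline; the sketch below (window sizes, stride ratio, and function name are arbitrary choices for illustration) produces the candidate segments whose matching scores a top-down grounder would then rank.

```python
from typing import List, Sequence, Tuple

def sliding_window_proposals(num_clips: int,
                             window_sizes: Sequence[int] = (4, 8, 16, 32),
                             stride_ratio: float = 0.5) -> List[Tuple[int, int]]:
    """Enumerate multi-scale temporal segment proposals over a video made of
    `num_clips` fixed-length clips. Returns (start, end) clip indices with the
    end exclusive. Window sizes and stride are illustrative defaults."""
    proposals = []
    for size in window_sizes:
        if size > num_clips:
            continue
        stride = max(1, int(size * stride_ratio))
        for start in range(0, num_clips - size + 1, stride):
            proposals.append((start, start + size))
    # Always include the whole video as one coarse proposal.
    proposals.append((0, num_clips))
    return sorted(set(proposals))

# Example: a 64-clip video yields a few dozen candidate segments.
print(len(sliding_window_proposals(64)))
```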
Existing multi-sentence VG work all makes one assumption: the query sentences are ordered according to their corresponding segments. This is an unrealistic and artificial setting. In contrast, real-world articles generally do not meet this strict requirement, and most of their sentences are not even groundable in the affiliated videos. In this paper, we adopt more realistic assumptions for the multi-sentence VG problem.
Weakly-Supervised VG. Since the agreements on manually annotated target segments tend to be low (Otani et al., 2020), a surge of efforts aims to solve this challenging task in a weakly-supervised manner, i.e., there are only video-level supervisions at the training stage. Currently, there are two typical frameworks: 1) MIL-based (Gao et al., 2019; Mithun et al., 2019; Chen et al., 2020b; Ma et al., 2020; Zhang et al., 2020c,d; Tan et al., 2021): They first calculate the matching scores between the query sentence and all segment proposals, and then aggregate the scores of multiple proposals as the score of the whole “bag”. State-of-the-art MIL-based methods usually focus on designing better positive/negative bag selections. 2) Reconstruction-based (Duan et al., 2018; Lin et al., 2020): They utilize the consistency between the dual tasks of sentence localization and caption generation, and infer the final grounding results from intermediate attention weights. Among them, the most related work to ours is CRM (Huang et al., 2021), which considers both the multi-sentence and weakly-supervised settings. Compared to CRM, our setting is more challenging: a) sentences are from different scales; b) not all sentences are groundable; and c) the sentence order is not consistent with the ground-truth segment order.
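As a point of reference for how such MIL-based methods usually form the bag score, a minimal single-level version might look like the following; the max-pooling plus margin-ranking form is a common generic choice, not the exact loss of any cited method.

```python
import torch.nn.functional as F

def mil_ranking_loss(pos_scores, neg_scores, margin=0.5):
    """Sketch of a generic single-level MIL objective for weakly-supervised VG:
    the query's bag score on its matched video should exceed its bag score on
    an unmatched (negative) video by a margin.

    pos_scores: (P,)  proposal scores for the query on the matched video.
    neg_scores: (P',) proposal scores for the same query on a negative video.
    """
    pos_bag = pos_scores.max()   # best proposal = the bag score
    neg_bag = neg_scores.max()
    return F.relu(margin - pos_bag + neg_bag)
```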
Multi-Scale VL Benchmarks. With the development of large-scale annotation tools, hundreds of video-language (VL) datasets have been proposed. To the best of our knowledge, three (types of) VL datasets have also considered the multiple-semantic-scale issue: 1) TACoS Multi-Level (Rohrbach et al., 2014): It provides three-level summaries for videos. However, its middle-level sentences are more like extractive summarization (instead of abstractive). Thus, the grounding results for different-scale sentences may be the same. 2) Movie-related datasets (Xiong et al., 2019; Huang et al., 2020; Bain et al., 2020): They typically have multiple levels of sentences to describe videos, such as overview, storyline, plot, and synopsis. They have two characteristics: a) numerous sentences are abstract descriptions, i.e., they do not have exact grounding temporal boundaries; b) the high-level summaries are more like highlights or salient events. 3) COIN (Tang et al., 2019): It defines multi-level predefined steps. Thus, it sacrifices the ability to ground any open-ended queries.
3 Dataset: YouwikiHow
We built the YouwikiHow dataset from wikiHow articles and YouTube videos. As shown in Figure 2, we group a wikiHow article and any video about the same task as a pair. Thanks to the inherent hierarchical structure of wikiHow articles, we can easily obtain sentences at different scales: high-level summaries and low-level details. As in Figure 2, “Pour the lemon juice into the pitcher.” is a high-level sentence summary, and “You may add the pulp if .... along with the seeds.” is a low-level sentence detail of this summary. In this section, we first introduce the details of dataset construction, and then compare YouwikiHow to existing VG benchmarks.
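As a rough picture of how such document-level pairs might be organized (the field names and two-scale layout below are illustrative assumptions, not the released data format), each training example carries only the task-level pairing signal, while the hierarchy between summary and detail sentences is preserved:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ArticleSentence:
    text: str
    is_high_level: bool          # True for a step summary, False for a detail
    parent: int = -1             # index of the summary this detail belongs to

@dataclass
class WSAGPair:
    """One weakly-supervised training pair: an article and a video that only
    share a task label; no segment-level alignment is given."""
    task: str
    video_id: str
    sentences: List[ArticleSentence]

# Hypothetical example for the task in Figure 2.
pair = WSAGPair(
    task="How to Make Lemonade",
    video_id="youtube:<some_video_id>",
    sentences=[
        ArticleSentence("Pour the lemon juice into the pitcher.", True),
        ArticleSentence("You may add the pulp if you like a thicker lemonade, "
                        "or you may discard it along with the seeds.", False, parent=0),
    ],
)
```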
3.1 Dataset Construction
3.1.1 Training Set
Initial Visual Tasks. Each wikiHow article describes a sequence of steps to instruct humans to perform a certain “task”, and these tasks range from physical-world interactions to abstract mental well-being improvement. In YouwikiHow, we follow (Miech et al., 2019) and only focus on “visual tasks”.