2 Related Work
Single Sentence & Multi-Sentence VG.
Mainstream solutions for single-sentence VG can be coarsely categorized into two groups: 1) Top-down Methods (Hendricks et al., 2017; Gao et al., 2017; Zhang et al., 2019, 2020b; Chen et al., 2018; Yuan et al., 2019a, 2021; Wang et al., 2020; Xiao et al., 2021b,a; Liu et al., 2021b,a; Lan et al., 2022): They first cut the given video into a set of segment proposals with different durations, and then calculate matching scores between the query and all segment proposals. Their performance heavily relies on predefined rules for proposal settings (e.g., temporal sizes). 2) Bottom-up Methods (Yuan et al., 2019b; Lu et al., 2019; Zeng et al., 2020; Chen et al., 2020a, 2018; Zhang et al., 2020a): They directly predict the two temporal boundaries of the target segment by regarding the query as a conditional input (a minimal sketch contrasting the two paradigms follows this paragraph). Compared to their top-down counterparts, bottom-up methods typically fail to consider the global context between the two boundaries (i.e., inside the segment). In this paper, we follow the top-down framework, and our DualMIL is model-agnostic.
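To make the distinction concrete, below is a minimal, framework-agnostic sketch of the two paradigms. All tensor names, dimensions, and the fusion scheme are illustrative assumptions, not the DualMIL implementation or any specific cited method.

```python
# Minimal sketch contrasting the two single-sentence VG paradigms.
# Shapes, names, and fusion choices are illustrative assumptions.
import torch

T, D = 128, 256                    # number of video clips, feature dim (assumed)
video = torch.randn(T, D)          # per-clip video features
query = torch.randn(D)             # pooled sentence-query feature

def top_down(video, query, sizes=(8, 16, 32), stride=4):
    """Enumerate multi-scale segment proposals; return the best-scoring one."""
    proposals, scores = [], []
    for w in sizes:                # predefined temporal sizes (the "rules" above)
        for s in range(0, video.size(0) - w + 1, stride):
            seg = video[s:s + w].mean(0)   # pool features inside the proposal
            proposals.append((s, s + w))
            scores.append(torch.cosine_similarity(seg, query, dim=0))
    best = torch.stack(scores).argmax().item()
    return proposals[best]

def bottom_up(video, query):
    """Directly predict start/end boundaries, query as conditional input."""
    fused = video * query                           # multiplicative conditioning (assumed)
    start_head = torch.nn.Linear(video.size(1), 1)  # untrained heads, for shape only
    end_head = torch.nn.Linear(video.size(1), 1)
    start = start_head(fused).squeeze(-1).argmax().item()
    end = start + end_head(fused[start:]).squeeze(-1).argmax().item()  # end >= start
    return (start, end + 1)
```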
All existing multi-sentence VG work makes one assumption: the query sentences are ordered consistently with their corresponding segments. This is an unrealistic and artificial setting. In contrast, real-world articles rarely meet this strict requirement, and most of their sentences are not even groundable in the affiliated videos. In this paper, we adopt more realistic assumptions for the multi-sentence VG problem.
Weakly-Supervised VG.
Since the agreement on manually annotated target segments tends to be low (Otani et al., 2020), a surge of efforts aims to solve this challenging task in a weakly-supervised manner, i.e., with only video-level supervision at the training stage. Currently, there are two typical frameworks: 1) MIL-based (Gao et al., 2019; Mithun et al., 2019; Chen et al., 2020b; Ma et al., 2020; Zhang et al., 2020c,d; Tan et al., 2021): They first calculate the matching scores between the query sentence and all segment proposals, and then aggregate the scores of multiple proposals into the score of the whole “bag” (see the sketch after this paragraph). State-of-the-art MIL-based methods usually focus on designing better positive/negative bag selections. 2) Reconstruction-based (Duan et al., 2018; Lin et al., 2020): They utilize the consistency between the dual tasks of sentence localization and caption generation, and infer the final grounding results from intermediate attention weights. Among them, the work most related to ours is CRM (Huang et al., 2021), which considers both the multi-sentence and weakly-supervised settings. Compared to CRM, our setting is more challenging: a) sentences come from different scales; b) not all sentences are groundable; and c) sentence order is not consistent with the ground-truth segment order.
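As a concrete illustration of the MIL-based framework, the following is a generic sketch (not any specific cited method): a video-level “bag” score is obtained by aggregating per-proposal matching scores, here with a top-k mean. The aggregation choice, names, and shapes are assumptions.

```python
# Generic MIL-style bag scoring sketch. Top-k mean is one common
# aggregation; max or attention-weighted sums are alternatives.
import torch

def bag_score(proposal_feats, query_feat, k=3):
    """proposal_feats: (P, D) pooled proposal features; query_feat: (D,).
    Returns a scalar video-level ("bag") matching score."""
    scores = torch.cosine_similarity(
        proposal_feats, query_feat.unsqueeze(0), dim=-1)  # (P,) per-proposal scores
    topk = scores.topk(min(k, scores.numel())).values     # keep strongest proposals
    return topk.mean()                                    # aggregate into the bag score

def mil_ranking_loss(pos_bag, neg_bag, margin=0.2):
    """Weak supervision: the paired (video, sentence) bag should outscore a
    bag built from a mismatched video/sentence (standard hinge form)."""
    return torch.clamp(margin - pos_bag + neg_bag, min=0.0)
```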
Multi-Scale VL Benchmarks.
With the development of large-scale annotation tools, hundreds of video-language (VL) datasets have been proposed. To the best of our knowledge, three (types of) VL datasets have also considered the issue of multiple semantic scales: 1) TACoS Multi-Level (Rohrbach et al., 2014): It provides three-level summaries for videos. However, its middle-level sentences are closer to extractive summaries (rather than abstractive ones). Thus, the grounding results for different-scale sentences may be identical. 2) Movie-related datasets (Xiong et al., 2019; Huang et al., 2020; Bain et al., 2020): They typically provide multi-level sentences to describe videos, such as overviews, storylines, plots, and synopses. They have two characteristics: a) numerous sentences are abstract descriptions, i.e., they do not have exact temporal grounding boundaries; and b) the high-level summaries are more like highlights or salient events. 3) COIN (Tang et al., 2019): It defines a fixed set of multi-level predefined steps, and thus sacrifices the ability to ground open-ended queries.
3 Dataset: YouwikiHow
We built the YouwikiHow dataset from wikiHow articles and YouTube videos. As shown in Figure 2, we group a wikiHow article and any video about the same task as a pair. Thanks to the inherent hierarchical structure of wikiHow articles, we can easily obtain sentences at different scales: high-level summaries and low-level details (a schematic sketch of this two-scale structure follows below). As in Figure 2, “Pour the lemon juice into the pitcher.” is a high-level sentence summary, and “You may add the pulp if .... along with the seeds.” is a low-level sentence detail of this summary. In this section, we first introduce the details of the dataset construction, and then compare YouwikiHow to existing VG benchmarks.
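The following is a schematic sketch of the two-scale sentence extraction implied by the wikiHow article hierarchy. The Step/Article classes and the pairing comment are illustrative assumptions, not the released construction pipeline.

```python
# Schematic sketch of two-scale sentence extraction from a wikiHow-style
# article hierarchy. Classes and names are illustrative assumptions.
from dataclasses import dataclass
from typing import Iterator, List, Tuple

@dataclass
class Step:
    headline: str        # high-level summary sentence
    details: List[str]   # low-level detail sentences elaborating the headline

@dataclass
class Article:
    task: str            # e.g., "Make Lemonade"
    steps: List[Step]

def two_scale_queries(article: Article) -> Iterator[Tuple[str, str]]:
    """Yield (scale, sentence) pairs for every step in the article."""
    for step in article.steps:
        yield ("high", step.headline)
        for sent in step.details:
            yield ("low", sent)

# Pairing: any YouTube video about the same task is grouped with the
# article; the video retrieval/matching details are omitted here.
```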
3.1 Dataset Construction
3.1.1 Training Set
Initial Visual Tasks.
Each wikiHow article describes a sequence of steps that instruct humans to perform a certain “task”, and these tasks range from physical-world interactions to abstract mental well-being improvement. In YouwikiHow, we follow Miech et al. (2019) and only focus on “visual