2 Related Work
Single Sentence & Multi-Sentence VG.
Mainstream solutions for single-sentence VG can be coarsely categorized into two groups: 1) Top-down Methods (Hendricks et al., 2017; Gao et al., 2017; Zhang et al., 2019, 2020b; Chen et al., 2018; Yuan et al., 2019a, 2021; Wang et al., 2020; Xiao et al., 2021b,a; Liu et al., 2021b,a; Lan et al., 2022): They first cut the given video into a set of segment proposals with different durations, and then calculate matching scores between the query and all segment proposals. Their performance heavily relies on predefined rules for proposal settings (e.g., temporal sizes). 2) Bottom-up Methods (Yuan et al., 2019b; Lu et al., 2019; Zeng et al., 2020; Chen et al., 2020a, 2018; Zhang et al., 2020a): They directly predict the two temporal boundaries of the target segment by regarding the query as a conditional input (a minimal sketch contrasting the two paradigms follows this paragraph). Compared to their top-down counterparts, bottom-up methods typically fail to consider the global context between the two boundaries (i.e., inside the segment). In this paper, we follow the top-down framework, and our DualMIL is model-agnostic.
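To make the distinction concrete, below is a minimal, framework-agnostic sketch of the two paradigms. All tensor names, dimensions, and the fusion scheme are illustrative assumptions, not the DualMIL implementation or any specific cited method.

```python
# Minimal sketch contrasting the two single-sentence VG paradigms.
# Shapes, names, and fusion choices are illustrative assumptions.
import torch

T, D = 128, 256                    # number of video clips, feature dim (assumed)
video = torch.randn(T, D)          # per-clip video features
query = torch.randn(D)             # pooled sentence-query feature

def top_down(video, query, sizes=(8, 16, 32), stride=4):
    """Enumerate multi-scale segment proposals; return the best-scoring one."""
    proposals, scores = [], []
    for w in sizes:                # predefined temporal sizes (the "rules" above)
        for s in range(0, video.size(0) - w + 1, stride):
            seg = video[s:s + w].mean(0)   # pool features inside the proposal
            proposals.append((s, s + w))
            scores.append(torch.cosine_similarity(seg, query, dim=0))
    best = torch.stack(scores).argmax().item()
    return proposals[best]

def bottom_up(video, query):
    """Directly predict start/end boundaries, query as conditional input."""
    fused = video * query                           # multiplicative conditioning (assumed)
    start_head = torch.nn.Linear(video.size(1), 1)  # untrained heads, for shape only
    end_head = torch.nn.Linear(video.size(1), 1)
    start = start_head(fused).squeeze(-1).argmax().item()
    end = start + end_head(fused[start:]).squeeze(-1).argmax().item()  # end >= start
    return (start, end + 1)
```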
All existing multi-sentence VG work makes one assumption: the query sentences are ordered consistently with their corresponding segments. This is an unrealistic and artificial setting. In contrast, real-world articles rarely meet this strict requirement, and most of their sentences are not even groundable in the affiliated videos. In this paper, we adopt more realistic assumptions for the multi-sentence VG problem.
Weakly-Supervised VG.
Since the agreement on manually annotated target segments tends to be low (Otani et al., 2020), a surge of efforts aims to solve this challenging task in a weakly-supervised manner, i.e., with only video-level supervision at the training stage. Currently, there are two typical frameworks: 1) MIL-based (Gao et al., 2019; Mithun et al., 2019; Chen et al., 2020b; Ma et al., 2020; Zhang et al., 2020c,d; Tan et al., 2021): They first calculate the matching scores between the query sentence and all segment proposals, and then aggregate the scores of multiple proposals into the score of the whole “bag” (see the sketch after this paragraph). State-of-the-art MIL-based methods usually focus on designing better positive/negative bag selections. 2) Reconstruction-based (Duan et al., 2018; Lin et al., 2020): They utilize the consistency between the dual tasks of sentence localization and caption generation, and infer the final grounding results from intermediate attention weights. Among them, the work most related to ours is CRM (Huang et al., 2021), which considers both the multi-sentence and weakly-supervised settings. Compared to CRM, our setting is more challenging: a) sentences come from different scales; b) not all sentences are groundable; and c) sentence order is not consistent with the ground-truth segment order.
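As a concrete illustration of the MIL-based framework, the following is a generic sketch (not any specific cited method): a video-level “bag” score is obtained by aggregating per-proposal matching scores, here with a top-k mean. The aggregation choice, names, and shapes are assumptions.

```python
# Generic MIL-style bag scoring sketch. Top-k mean is one common
# aggregation; max or attention-weighted sums are alternatives.
import torch

def bag_score(proposal_feats, query_feat, k=3):
    """proposal_feats: (P, D) pooled proposal features; query_feat: (D,).
    Returns a scalar video-level ("bag") matching score."""
    scores = torch.cosine_similarity(
        proposal_feats, query_feat.unsqueeze(0), dim=-1)  # (P,) per-proposal scores
    topk = scores.topk(min(k, scores.numel())).values     # keep strongest proposals
    return topk.mean()                                    # aggregate into the bag score

def mil_ranking_loss(pos_bag, neg_bag, margin=0.2):
    """Weak supervision: the paired (video, sentence) bag should outscore a
    bag built from a mismatched video/sentence (standard hinge form)."""
    return torch.clamp(margin - pos_bag + neg_bag, min=0.0)
```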
Multi-Scale VL Benchmarks.
With the development of large-scale annotation tools, hundreds of video-language (VL) datasets have been proposed. To the best of our knowledge, three (types of) VL datasets have also considered the issue of multiple semantic scales: 1) TACoS Multi-Level (Rohrbach et al., 2014): It provides three-level summaries for videos. However, its middle-level sentences are closer to extractive summaries (rather than abstractive ones). Thus, the grounding results for different-scale sentences may be identical. 2) Movie-related datasets (Xiong et al., 2019; Huang et al., 2020; Bain et al., 2020): They typically provide multi-level sentences to describe videos, such as overviews, storylines, plots, and synopses. They have two characteristics: a) numerous sentences are abstract descriptions, i.e., they do not have exact temporal grounding boundaries; and b) the high-level summaries are more like highlights or salient events. 3) COIN (Tang et al., 2019): It defines a fixed set of multi-level predefined steps, and thus sacrifices the ability to ground open-ended queries.
3 Dataset: YouwikiHow
We built the YouwikiHow dataset from wikiHow articles and YouTube videos. As shown in Figure 2, we group a wikiHow article and any video about the same task as a pair. Thanks to the inherent hierarchical structure of wikiHow articles, we can easily obtain sentences at different scales: high-level summaries and low-level details (a schematic sketch of this two-scale structure follows below). As in Figure 2, “Pour the lemon juice into the pitcher.” is a high-level sentence summary, and “You may add the pulp if .... along with the seeds.” is a low-level sentence detail of this summary. In this section, we first introduce the details of the dataset construction, and then compare YouwikiHow to existing VG benchmarks.
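The following is a schematic sketch of the two-scale sentence extraction implied by the wikiHow article hierarchy. The Step/Article classes and the pairing comment are illustrative assumptions, not the released construction pipeline.

```python
# Schematic sketch of two-scale sentence extraction from a wikiHow-style
# article hierarchy. Classes and names are illustrative assumptions.
from dataclasses import dataclass
from typing import Iterator, List, Tuple

@dataclass
class Step:
    headline: str        # high-level summary sentence
    details: List[str]   # low-level detail sentences elaborating the headline

@dataclass
class Article:
    task: str            # e.g., "Make Lemonade"
    steps: List[Step]

def two_scale_queries(article: Article) -> Iterator[Tuple[str, str]]:
    """Yield (scale, sentence) pairs for every step in the article."""
    for step in article.steps:
        yield ("high", step.headline)
        for sent in step.details:
            yield ("low", sent)

# Pairing: any YouTube video about the same task is grouped with the
# article; the video retrieval/matching details are omitted here.
```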
3.1 Dataset Construction
3.1.1 Training Set
Initial Visual Tasks.
Each wikiHow article describes a sequence of steps that instruct humans to perform a certain “task”, and these tasks range from physical-world interactions to abstract mental well-being improvement. In YouwikiHow, we follow Miech et al. (2019) and only focus on “visual