supervised manner, [30] generates pseudo temporal event
regions and corresponding pseudo sentence queries by examining noun-verb statistical co-occurrence patterns. However, these pseudo sentences are simple compositions of nouns and verbs (e.g., ‘flip person switch door’), which naturally differ from the form of natural language queries (e.g., ‘person flipped the light switch near the door.’). That is, such contrived noun-verb compositions break the structural and compositional generalization inherent in natural language, which may harm performance [22].
In this paper, we propose a novel language-free training
framework for zero-shot video grounding. Our solution is to
treat the visual feature as pseudo textual information, rather than forcibly generating pseudo sentences. Specifically, we leverage an image-language pretraining model (i.e., CLIP [33]) trained on large-scale web-collected data, which has driven breakthroughs in multi-modal research. We conjecture that textual and visual features can substitute for each other, since CLIP provides a well-aligned visual-language semantic space.
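As a minimal illustration of this conjecture, the following sketch (using the public CLIP package; the model variant, file name, and sentence are placeholders) encodes one video frame and one sentence into CLIP's shared space and compares them directly by cosine similarity:

```python
import torch
import clip
from PIL import Image

# Load a pretrained CLIP model (ViT-B/32 is an illustrative choice).
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode one video frame and one sentence into the shared embedding space.
frame = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["person flipped the light switch near the door."]).to(device)

with torch.no_grad():
    frame_feat = model.encode_image(frame)   # (1, d)
    text_feat = model.encode_text(text)      # (1, d)

# Because the two embeddings are aligned, cosine similarity is meaningful,
# so the frame feature can stand in for the sentence feature downstream.
frame_feat = frame_feat / frame_feat.norm(dim=-1, keepdim=True)
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
similarity = (frame_feat @ text_feat.T).item()
print(f"cosine similarity: {similarity:.3f}")
```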
To this end, we first generate temporal proposals that
contain meaningful events from a given untrimmed video.
With the visual encoder of CLIP, visual features are ex-
tracted from all the frames in the proposal. Then, instead of generating a natural sentence from the proposal, our learnable selection transformer selects a dominant feature that plays the role of the pseudo language feature in the video grounding model. Therefore, our method is freed from generating a high-quality natural language sentence for each proposal. Moreover, since the dominant visual feature is directly used as the pseudo textual feature, our method does not need to compute a textual embedding from a pseudo text label, a time-consuming yet necessary step in training the previous method [30]. Finally, the whole model is trained to predict the time intervals corresponding to the pseudo sentence features, with the generated temporal proposals serving as ground truth.
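As a rough sketch of this pipeline (not our exact architecture; the attention-based pooling below only stands in for the selection transformer, and all dimensions and names are illustrative), the pseudo query feature can be pooled directly from the proposal's CLIP frame features:

```python
import torch
import torch.nn as nn

class DominantFeatureSelector(nn.Module):
    """Illustrative stand-in for the learnable selection transformer:
    attention-pools per-frame CLIP features into one pseudo query feature."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, dim) CLIP visual features of frames in a proposal.
        q = self.query.expand(frame_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, frame_feats, frame_feats)  # (B, 1, dim)
        return pooled.squeeze(1)                            # (B, dim)

# Hypothetical training step: the proposal boundaries act as ground truth,
# and the pooled visual feature acts as the pseudo sentence feature.
selector = DominantFeatureSelector(dim=512)
frame_feats = torch.randn(4, 16, 512)   # CLIP features of 16 frames per proposal
pseudo_query = selector(frame_feats)    # replaces the text encoder output
# grounding_model(video_feats, pseudo_query) -> predicted (start, end); omitted here.
```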
Our contributions are three-fold:
• We introduce a language-free framework for video
grounding that can be an affordable solution to effec-
tively reduce the annotation cost.
• We validate the applicability of the pretrained visual-
language model to the video-language task by provid-
ing extensive experimental analysis.
• Our language-free training framework outperforms the existing method, achieving state-of-the-art performance, and even shows performance comparable to weakly-supervised approaches on the Charades-STA [14] and ActivityNet Captions [19] datasets.
2. Related Work
2.1. Video Grounding
Video grounding is a recently proposed task [1, 14],
which aims to localize the moment in a video that best matches a given language query. Most existing methods follow a fully-supervised setting [9, 22, 24, 29, 34, 44, 51, 52, 53, 54, 57, 59] to model fine-grained semantic relations between video and language. However, such a setting requires precise manual annotation of start and end timestamps, which also introduces subjectivity across different annotators.
Weakly-supervised video grounding has been introduced
to alleviate this burden. Existing works can be catego-
rized into two groups. 1) Multi-instance learning (MIL) [8] based methods [15, 16, 27, 28, 41, 56] learn cross-modal similarity by maximizing scores for positive (matched) video-query pairs and minimizing scores for negative (mismatched) pairs. 2) Reconstruction-based methods [10, 23, 40, 50, 58] rely on the assumption that the video segment that best reconstructs the text query is closest to the ground truth.
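For concreteness, a typical MIL-style objective used by the first group can be sketched as a ranking loss over matched and mismatched video-query pairs; this is a generic formulation, not the loss of any specific method cited above:

```python
import torch
import torch.nn.functional as F

def mil_ranking_loss(video_feats: torch.Tensor,
                     query_feats: torch.Tensor,
                     margin: float = 0.2) -> torch.Tensor:
    """Generic MIL-style ranking loss: pull matched video-query pairs together
    and push mismatched (negative) pairs apart by at least `margin`."""
    v = F.normalize(video_feats, dim=-1)   # (B, d)
    q = F.normalize(query_feats, dim=-1)   # (B, d)
    sim = v @ q.T                          # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)          # matched pairs lie on the diagonal
    mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Hinge over every mismatched pair in the batch, in both directions.
    loss_v = F.relu(margin + sim - pos)[mask].mean()
    loss_q = F.relu(margin + sim.T - pos)[mask].mean()
    return loss_v + loss_q
```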
However, while weakly-supervised approaches have succeeded in lowering the cost of temporal annotation, the cost of collecting text queries remains problematic. Several works [25, 30] considered an unsupervised setting that does not require paired annotations. [25] proposed a deep semantic clus-
tering network, which first aggregates distinct semantic fea-
tures from the whole query set and then generates pseudo la-
bels to provide pseudo supervision for training. [30] gener-
ated pseudo labels of temporal boundaries and correspond-
ing query sentences. They first utilized a temporal similarity
matrix to find temporal event proposals, then used an off-
the-shelf object detector and fine-tuned RoBERTa [26] to
make a structure-less pseudo query. However, such a structure-less pseudo query, composed only of nouns and verbs, can be interpreted in multiple ways due to the systematic compositionality [5, 13] of natural language. In addition, uninformative words in the query make it hard for the model to capture what the query originally intends to convey. Furthermore, verbs inferred from detected objects are only loosely tied to the video, in the sense that they are not predicted directly from the visual content, which leads to inaccurate pseudo queries.
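The temporal-similarity step mentioned above can be illustrated with a toy sketch (our own simplification, not the actual procedure of [30]): compute a frame-by-frame self-similarity matrix and cut a new event proposal whenever a frame drifts away from the current segment:

```python
import torch
import torch.nn.functional as F

def event_proposals(frame_feats: torch.Tensor, threshold: float = 0.8):
    """Toy proposal generation from a temporal self-similarity matrix.
    frame_feats: (T, d) per-frame features; returns [(start, end)] index pairs."""
    f = F.normalize(frame_feats, dim=-1)
    sim = f @ f.T                                  # (T, T) self-similarity matrix
    proposals, start = [], 0
    for t in range(1, f.size(0)):
        # Start a new segment when frame t is no longer similar to the segment start.
        if sim[start, t] < threshold:
            proposals.append((start, t - 1))
            start = t
    proposals.append((start, f.size(0) - 1))
    return proposals
```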
2.2. Language-free Paradigm
As recent trends shift from uni-modal learning to multi-
modal learning, vision-language related tasks have attracted
attention. Since the number of modalities to be processed has doubled, it becomes difficult to obtain high-quality vision-language training pairs. Several works [30, 60] proposed a
so-called ‘language-free paradigm’ to address this problem,
which means training without language data in the vision-