trastive loss on frame-text pairs, in order to miti-
gate the problem of weak correspondence in video-
language representation learning. Existing video-
language learning approaches (Miech et al., 2020; Xu et al., 2021) only optimise pair-level alignment
but do not explicitly learn which part of a video
contributes to its alignment with the text. FineCo
focuses on aligning relevant frames with the text. It
is inspired by the text-based temporal localisation
task (Zhang et al., 2020a); however, the motivation of FineCo is different: to learn better video-level representations by adding a frame-level contrastive
learning signal to the pair-level objective, with no
need for temporal annotation within a video-text
pair.
We hypothesise that FineCo is particularly ben-
eficial for long videos, where each video pro-
vides more information and only a small propor-
tion of frames will be relevant to its text coun-
terpart, as shown in Figure 1. FineCo is able to
model frame-text similarity through fine-grained
contrastive learning, where the most informative
frames are paired with the text as positive pairs and the remaining frames as negatives. It then
explicitly contrasts the selected informative frames
against the noisy frames, without the need for
frame-text annotations. This frame-level distilla-
tion provides a strong learning signal, which en-
courages the alignment of semantically equivalent
video-text pairs. The fine-grained contrastive loss
derives its learning signal from pair-level annotations alone and is trained in an end-to-end manner. This
combination of pair-level learning signal and frame-
level contrastive loss is novel and effective, and
boosts the performance on two important video-
language benchmark tasks, especially in text-video
retrieval with longer videos. We devised FineCo by building on the recently proposed and well-performing VideoCLIP (Xu et al., 2021), in which a video clip is represented as a sequence of frame features.
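To make this concrete, below is a minimal sketch (in PyTorch-style Python) of how such a frame-level contrastive term could be computed for a single video-text pair. This is not the authors' released implementation; the function name, the top-k positive selection and the temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F

def frame_level_contrastive_loss(frame_feats, text_feat, top_k=4, temperature=0.07):
    """Hypothetical frame-level contrastive term for one video-text pair.

    frame_feats: (num_frames, dim) frame embeddings of the video clip.
    text_feat:   (dim,) embedding of the paired sentence.
    """
    # Cosine similarity between every frame and the sentence.
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    sims = frame_feats @ text_feat / temperature        # (num_frames,)

    # Frames most similar to the text are treated as positives,
    # the remaining (noisy) frames as negatives.
    pos_idx = sims.topk(top_k).indices
    pos_mask = torch.zeros_like(sims, dtype=torch.bool)
    pos_mask[pos_idx] = True

    # InfoNCE-style objective over frames: pull informative frames
    # towards the text, push the noisy frames away.
    log_prob = sims - torch.logsumexp(sims, dim=0)       # log-softmax over all frames
    return -log_prob[pos_mask].mean()

In FineCo, a frame-level term of this kind would be added to the standard pair-level contrastive objective and the whole model trained end-to-end.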
Our contributions are summarised as follows:
(1) We propose FineCo, an approach trained with
fine-grained contrastive loss to mitigate the weak
correspondence problem in video-text pairs; (2)
We use FineCo to distil a video clip by sampling
frames that are relevant to its text counterpart ac-
cording to frame-text similarities; (3) On text-video
retrieval and video question answering benchmarks,
we show that FineCo achieves state-of-the-art per-
formance on YouCookII and MSR-VTT MC (mul-
tiple choice).
2 Related Work
Contrastive Learning
The use of contrastive
loss (Gutmann and Hyvärinen, 2010) has become
the dominant paradigm for learning video-language
representations. The aim is to maximise the sim-
ilarity of video-text pairs that are aligned to each
other (positive pairs) while pushing away irrele-
vant (negative) pairs. However, the semantic align-
ment between most video-text pairs is weak, which
makes it difficult to ground textual information on
the videos. To mitigate this pair-level weak alignment issue, MIL-NCE (Miech et al., 2020) treats multiple temporally surrounding captions as positive pairs and combines multiple instance learning (MIL) (Dietterich et al., 1997) with a contrastive loss to reduce noise in cross-modal correspondences. The main idea is to match a video against multiple contextual sentences, instead of comparing it against a single sentence.
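For reference, the pair-level objective underlying these methods is typically an InfoNCE-style loss; a generic formulation (the cited works differ in their exact choice of similarity function, positives and negatives) is

\[
\mathcal{L}_{\text{NCE}} = -\sum_{i=1}^{B} \log
\frac{\exp\big(\mathrm{sim}(v_i, t_i)/\tau\big)}
{\sum_{j=1}^{B} \exp\big(\mathrm{sim}(v_i, t_j)/\tau\big)},
\]

where $v_i$ and $t_i$ are the video and text embeddings of the $i$-th pair in a batch of size $B$, $\mathrm{sim}(\cdot,\cdot)$ is a similarity function (e.g. a dot product), and $\tau$ is a temperature. Roughly speaking, MIL-NCE replaces the single positive $t_i$ with a set of temporally neighbouring captions in the numerator.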
To alleviate the issue that semantically equivalent
videos and texts from different pairs may be taken
as dissimilar in contrastive learning, support-set (Patrick et al., 2021) introduces a generative captioning approach over a set of visual candidates, which ensures that video-language representations do not over-specialise to individual samples.
MIL-NCE and support-set focus on pair-level con-
trastive signals to align relevant video-text pairs.
However, even within a positive video-text pair, the
video is likely to contain many irrelevant frames.
Therefore, it can be beneficial to distil the video
such that only the relevant frames, i.e. those which
have similar content to the text, are selected for
cross-modal learning.
Video-language Learning
Many approaches (Sun et al., 2019; Zhu and Yang, 2020; Gabeur et al., 2020; Li et al., 2020a; Miech et al., 2020; Ging et al., 2020; Luo et al., 2020) have shown promising results
for video-language learning with pre-training fol-
lowed by fine-tuning. This strategy has become
very prominent since the release of BERT (Devlin et al., 2019) and many image-text pre-training frameworks (Tan and Bansal, 2019; Li et al., 2019, 2020b; Zhang et al., 2021; Chen et al., 2020; Zhang et al., 2019; Kim et al., 2021; Li et al., 2021, 2022). The release of datasets such as HowTo100M
(Miech et al., 2019) and WebVid-2M (Bain et al.,
2021) has enabled large-scale pre-training on un-
labelled video-text pairs to improve representation