Contrastive Video-Language Learning with Fine-grained Frame Sampling
Zixu Wang1, Yujie Zhong2, Yishu Miao3, Lin Ma2, Lucia Specia1
1Language and Multimodal AI Lab (LAMA), Imperial College London
2Meituan Inc., 3Haiper.ai
zixu.wang@imperial.ac.uk, jaszhong@hotmail.com, yishu.miao@haiper.ai
forest.linma@gmail.com, l.specia@imperial.ac.uk
Abstract
Despite recent progress in video and language representation learning, the weak or sparse correspondence between the two modalities remains a bottleneck in the area. Most video-language models are trained via a pair-level loss to predict whether a pair of video and text is aligned. However, even in paired video-text segments, only a subset of the frames is semantically relevant to the corresponding text, with the remainder representing noise; the ratio of noisy frames is higher for longer videos. We propose FineCo (Fine-grained Contrastive Loss for Frame Sampling), an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames. It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence. Building on the well-established VideoCLIP model as a starting point, FineCo achieves state-of-the-art performance on YouCookII, a text-video retrieval benchmark with long videos. FineCo also achieves competitive results on text-video retrieval (MSR-VTT), and video question answering datasets (MSR-VTT QA and MSR-VTT MC) with shorter videos.
1 Introduction
Human perception is multimodal, including visual, textual, and auditory information. To achieve human-level perceptual ability, intelligent systems need to understand and interpret these multimodal signals and summarise the relevant information in them. Learning from video and language data has received significant attention in recent multimodal machine learning work for downstream tasks that require joint understanding of video and textual information, including text-video retrieval (Lin et al., 2014; Liu et al., 2019; Miech et al., 2018; Wang et al., 2016; Bain et al., 2021), video question answering (Fan et al., 2019; Yang et al., 2021; Huang et al., 2020; Jiang et al., 2020; Le et al., 2020; Lei et al., 2021), and video captioning (Ging et al., 2020; Luo et al., 2020; Zhang et al., 2020b). In most of this work, contrastive learning (Gutmann and Hyvärinen, 2010) is used as the training objective.

Figure 1: Illustration of the weak correspondence problem in video-language learning. Given a pair of video and its text (e.g. caption, instruction, or transcription), only a subset of the frames (here indicated by coloured bounding boxes) is semantically aligned to the textual content. The remaining frames represent irrelevant visual information and will not contribute to language grounding on videos. (Example texts shown in the figure: "flip the pancakes when the edge turns brown"; "mince the tuna and add it to a bowl".)
The aim of a cross-modal contrastive loss is to maximise the similarity between an aligned video-text pair while minimising the similarity for all other pairs. One issue with the standard cross-modal contrastive loss is that it focuses on pair-level alignment but ignores the negative effects of irrelevant frames that are present in a single video clip, even in a pair of aligned video and text. We define irrelevant frames as those with no or little shared semantics with the text. These irrelevant frames may negatively affect the contribution of frames that are semantically similar to the text, which further results in a less informative video representation. Therefore, we posit that frame-level learning is a better strategy for video-language tasks.
In this paper, we propose FineCo, an approach that has a frame selector to sample relevant frames in a video and is trained with a fine-grained contrastive loss on frame-text pairs, in order to mitigate the problem of weak correspondence in video-language representation learning. Existing video-language learning approaches (Miech et al., 2020; Xu et al., 2021) only optimise pair-level alignment but do not explicitly learn which part of a video contributes to its alignment with the text. FineCo focuses on aligning relevant frames with the text. It is inspired by the text-based temporal localisation task (Zhang et al., 2020a); however, the motivation of FineCo is different: to learn a better video-level representation by adding a frame-level contrastive learning signal to the pair-level objective, with no need for temporal annotation within a video-text pair.
We hypothesise that FineCo is particularly beneficial for long videos, where each video provides more information and only a small proportion of frames will be relevant to its text counterpart, as shown in Figure 1. FineCo is able to model frame-text similarity through fine-grained contrastive learning, where the most informative frames are paired with the text as positive pairs and the remaining frames as negatives. It then explicitly contrasts the selected informative frames against the noisy frames, without the need for frame-text annotations. This frame-level distillation provides a strong learning signal, which encourages the alignment of semantically equivalent video-text pairs. The fine-grained contrastive loss abstracts the learning signal from pair-level annotations and is trained in an end-to-end manner. This combination of a pair-level learning signal and a frame-level contrastive loss is novel and effective, and boosts the performance on two important video-language benchmark tasks, especially text-video retrieval with longer videos. We devised FineCo by building on the recently proposed and well-performing VideoCLIP (Xu et al., 2021), in which a video clip is represented as a sequence of frame features.
Our contributions are summarised as follows: (1) We propose FineCo, an approach trained with a fine-grained contrastive loss to mitigate the weak correspondence problem in video-text pairs; (2) We use FineCo to distil a video clip by sampling frames that are relevant to its text counterpart according to frame-text similarities; (3) On text-video retrieval and video question answering benchmarks, we show that FineCo achieves state-of-the-art performance on YouCookII and MSR-VTT MC (multiple choice).
2 Related Work
Contrastive Learning
The use of contrastive loss (Gutmann and Hyvärinen, 2010) has become the dominant paradigm for learning video-language representations. The aim is to maximise the similarity of video-text pairs that are aligned to each other (positive pairs) while pushing away irrelevant (negative) pairs. However, the semantic alignment between most video-text pairs is weak, which makes it difficult to ground textual information on the videos. In order to mitigate the pair-level weak alignment issue, MIL-NCE (Miech et al., 2020) leverages multiple surrounding captions as the positive pairs and makes use of multiple instance learning (MIL) (Dietterich et al., 1997) with contrastive loss to mitigate noise in cross-modal correspondences. The main idea is to consider multiple contextual sentences for matching a video, instead of only comparing a video against a single sentence. To alleviate the issue that semantically equivalent videos and texts from different pairs may be taken as dissimilar in contrastive learning, support-set (Patrick et al., 2021) introduces a generative approach for captioning over a set of visual candidates, which ensures that the video-language representation does not over-specialise to individual samples. MIL-NCE and support-set focus on pair-level contrastive signals to align relevant video-text pairs. However, even within a positive video-text pair, the video is likely to contain many irrelevant frames. Therefore, it can be beneficial to distil the video such that only the relevant frames, i.e. those which have similar content to the text, are selected for cross-modal learning.
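To make the MIL-NCE idea concrete, the following is a minimal PyTorch sketch (not the authors' implementation; the tensor shapes, mask construction, and temperature value are illustrative assumptions) of a contrastive objective in which each video is scored against several candidate captions and the positive candidates are pooled inside the logarithm:

```python
import torch
import torch.nn.functional as F

def mil_nce_style_loss(video_emb, text_emb, pos_mask, temperature=0.07):
    """Sketch of a MIL-NCE-style objective.

    video_emb: (B, D) video embeddings.
    text_emb:  (B, P, D) candidate caption embeddings per video
               (e.g. temporally nearby transcriptions).
    pos_mask:  (B, B, P) boolean, True where caption (j, p) is a
               positive candidate for video i.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity of every video to every candidate caption: (B, B, P)
    sims = torch.einsum("id,jpd->ijp", video_emb, text_emb) / temperature
    exp_sims = sims.exp()

    # Positives are summed inside the log (the MIL part); all other
    # captions in the batch act as negatives.
    pos = (exp_sims * pos_mask).sum(dim=(1, 2))
    denom = exp_sims.sum(dim=(1, 2))
    return -(pos / denom).log().mean()
```

Pooling the positive candidates inside the log means a video only needs to match one of its nearby captions well, which softens the effect of misaligned transcriptions.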
Video-language Learning
Prior work (Sun et al., 2019; Zhu and Yang, 2020; Gabeur et al., 2020; Li et al., 2020a; Miech et al., 2020; Ging et al., 2020; Luo et al., 2020) has shown promising results for video-language learning with pre-training followed by fine-tuning. This strategy has become very prominent since the release of BERT (Devlin et al., 2019) and many image-text pre-training frameworks (Tan and Bansal, 2019; Li et al., 2019, 2020b; Zhang et al., 2021; Chen et al., 2020; Zhang et al., 2019; Kim et al., 2021; Li et al., 2021, 2022). The release of datasets such as HowTo100M (Miech et al., 2019) and WebVid-2M (Bain et al., 2021) has enabled large-scale pre-training on unlabelled video-text pairs to improve representation learning of video and language. Many approaches (Miech et al., 2020; Zhu and Yang, 2020; Patrick et al., 2021) use HowTo100M as their pre-training dataset. FiT (Bain et al., 2021) uses WebVid-2M and Google Conceptual Captions (CC3M) to take advantage of the large collection of video-text and image-text pairs for pre-training. However, large pre-training datasets rely on loosely aligned video-text pairs, without any fine-grained supervision on alignment. This makes it difficult to learn the cross-modal cues present in the given video-text pairs. It is also computationally expensive to improve video-language representation learning, given that videos can contain a large number of frames, especially longer videos. ClipBERT (Lei et al., 2021) randomly samples a few frames from a video for video-language representation learning. Their motivation is to minimise the memory and computation costs of processing the full sequence of frames. This sampling strategy is over-simplistic and can thus be improved by better approaches that select frames based on their relevance to the paired text.
3 FineCo
3.1 Preliminaries
The most widely used objective function for video-language learning is the contrastive loss, specifically the softmax version of noise-contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010). It is formulated as

\[
\sum_{i=1}^{n} \log \frac{e^{f(x_i)^\top g(y_i)}}{e^{f(x_i)^\top g(y_i)} + \sum_{(x'_i, y'_i) \in N_i} e^{f(x'_i)^\top g(y'_i)}} \tag{1}
\]

where $x_i$ denotes a video clip and $y_i$ represents the corresponding text (e.g. a caption, an instruction, or a transcription); $f$ and $g$ are the video encoder and the text encoder, respectively; $e^{f(x_i)^\top g(y_i)}$ denotes the similarity of a positive video-text pair, calculated as the exponentiated dot product of the video representation $f(x_i)$ and the text representation $g(y_i)$; and $N_i$ is the set of negative video-text pairs $(x'_i, y'_i)$ that are not aligned.
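As a point of reference, here is a minimal PyTorch sketch of this pair-level objective, assuming precomputed clip and text embeddings and using the other pairs in the batch as the negative set $N_i$; the function name and the batch-negatives choice are illustrative assumptions rather than details taken from the paper:

```python
import torch

def pair_level_nce(video_emb, text_emb):
    """Pair-level NCE over a batch (cf. Eq. 1).

    video_emb: (B, D) clip embeddings f(x_i).
    text_emb:  (B, D) text embeddings g(y_i).
    Returns the loss to minimise, i.e. the negative (batch-averaged)
    version of Eq. 1.
    """
    # Dot-product similarity between every clip and every text: (B, B)
    logits = video_emb @ text_emb.t()
    # Diagonal entries are the aligned pairs; off-diagonal entries
    # play the role of the negative set N_i.
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    return torch.nn.functional.cross_entropy(logits, targets)
```

Minimising this cross-entropy over the batch is equivalent, up to sign and averaging, to maximising Eq. 1.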
This contrastive loss leverages the pair-level similarity of video and text, but ignores the fact that weak video-language correspondence does not stem only from entirely negative pairs of video and text, but also from frame-level noise, which occurs even when a video-text pair is aligned as a whole. The standard contrastive loss does not explicitly model frame-text relevance, i.e. it does not differentiate between frames that are semantically equivalent to the corresponding text and frames that are not. It can thus suffer from learning from noisy signals, particularly in long videos with varied scenes.
3.2 Fine-grained Contrastive Learning
A video consists of a sequence of frames. For video-language learning, the video is paired with a text which describes or refers to some of the content of the video. For most tasks, only some of the visual information has an equivalent textual signal, e.g. a video description is only a summary of the visual information. To sample and optimise for the relevant visual information from a video, we propose a fine-grained contrastive loss to distil each video-text pair.
Formally, a video-text pair is denoted as $(x, y)$, where $x$ is a video clip consisting of a sequence of video frames $\{x_1, x_2, \ldots, x_K\}$, with $K$ the number of frames in the video clip, and $y$ is the paired text. We assume that a video $x$ contains a set of $C$ positive frames $P(x)$ and a set of $(K - C)$ negative frames $N(x)$, where the positive frames contain information relevant to the text while the negative frames are noisy/irrelevant ones. The aim is to maximise the joint probability of relevant frame-text pairs $(x_k, y)$ by exponentiating the similarity of the two representations:

\[
p(x_k, y) = h(f(x_k), g(y)) \propto e^{\mathrm{sim}(f(x_k), g(y))} \tag{2}
\]
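As an illustration, exponentiated frame-text scores of this form can be computed as in the sketch below, where sim is assumed to be a dot product with an optional temperature; the similarity actually used by the paper comes from the frame selector described in Section 3.2.2:

```python
import torch

def frame_text_scores(frame_emb, text_emb, temperature=1.0):
    """Exponentiated frame-text similarities, one score per frame (cf. Eq. 2).

    frame_emb: (K, D) frame representations f(x_1), ..., f(x_K).
    text_emb:  (D,)   text representation g(y).
    """
    sims = frame_emb @ text_emb / temperature  # (K,) sim(f(x_k), g(y))
    return sims.exp()                          # proportional to p(x_k, y)
```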
3.2.1 Objective Function
Given $n$ pairs of video and text representations, where the $i$-th pair is denoted as $f(x_i) = \{f(x_{i1}), f(x_{i2}), \ldots, f(x_{iK})\}$ and $g(y_i)$, our fine-grained contrastive loss $\mathcal{L}$ is defined as:

\[
A_i = \sum_{x_{ik} \in P(x_i)} e^{\mathrm{sim}(f(x_{ik}), g(y_i))}, \qquad
B_i = \sum_{x'_{ik} \in N(x_i)} e^{\mathrm{sim}(f(x'_{ik}), g(y_i))}
\]
\[
\mathcal{L} = \sum_{i=1}^{n} \log \frac{A_i}{A_i + B_i} \tag{3}
\]
where $P(x_i)$ contains the positive frames in a video, i.e. those with higher similarities to the text representation $g(y_i)$, and $N(x_i)$ is the set of remaining frames in the same video, i.e. the negative frames. The similarity is calculated by our frame selector (FS) (Section 3.2.2) with the frame
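To make the objective concrete, the following is a minimal PyTorch sketch of Eq. 3 for a batch of videos; taking the top-$C$ most similar frames as $P(x_i)$ is a stand-in assumption for the frame selector FS, and names such as num_positives are illustrative:

```python
import torch

def fine_grained_loss(frame_emb, text_emb, num_positives):
    """Fine-grained frame-level contrastive loss (cf. Eq. 3).

    frame_emb: (B, K, D) frame representations f(x_ik) per video.
    text_emb:  (B, D)    text representations g(y_i).
    num_positives: C, the number of frames kept as positives per video.
    Returns the loss to minimise (negative of Eq. 3, batch-averaged).
    """
    # Frame-text similarities sim(f(x_ik), g(y_i)): (B, K)
    sims = torch.einsum("bkd,bd->bk", frame_emb, text_emb)
    exp_sims = sims.exp()

    # Stand-in for the frame selector: the C most similar frames per
    # video form P(x_i); the remaining K - C frames form N(x_i).
    top_c = sims.topk(num_positives, dim=1).indices
    pos_mask = torch.zeros_like(sims, dtype=torch.bool)
    pos_mask.scatter_(1, top_c, True)

    a = (exp_sims * pos_mask).sum(dim=1)      # A_i
    b = (exp_sims * (~pos_mask)).sum(dim=1)   # B_i
    return -(a / (a + b)).log().mean()
```

Because $A_i$ and $B_i$ share the same denominator, increasing the similarity of the selected frames to the text directly reduces the relative weight of the noisy frames.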