Contrastive Video-Language Learning with Fine-grained Frame Sampling
Zixu Wang1, Yujie Zhong2, Yishu Miao3, Lin Ma2, Lucia Specia1
1Language and Multimodal AI Lab (LAMA), Imperial College London
2Meituan Inc., 3Haiper.ai
zixu.wang@imperial.ac.uk, jaszhong@hotmail.com, yishu.miao@haiper.ai
forest.linma@gmail.com, l.specia@imperial.ac.uk
Abstract
Despite recent progress in video and language representation learning, the weak or sparse correspondence between the two modalities remains a bottleneck in the area. Most video-language models are trained via a pair-level loss to predict whether a pair of video and text is aligned. However, even in paired video-text segments, only a subset of the frames is semantically relevant to the corresponding text, with the remainder representing noise; the ratio of noisy frames is higher for longer videos. We propose FineCo (Fine-grained Contrastive Loss for Frame Sampling), an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames. It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence. Building on the well-established VideoCLIP model as a starting point, FineCo achieves state-of-the-art performance on YouCookII, a text-video retrieval benchmark with long videos. FineCo also achieves competitive results on text-video retrieval (MSR-VTT), and video question answering datasets (MSR-VTT QA and MSR-VTT MC) with shorter videos.
1 Introduction
Human perception is multimodal, including visual, textual, and auditory information. To achieve human-level perceptual ability, intelligent systems need to understand and interpret these multimodal signals and summarise the relevant information in them. Learning from video and language data has received significant attention in recent multimodal machine learning work for downstream tasks that require joint understanding of video and textual information, including text-video retrieval (Lin et al., 2014; Liu et al., 2019; Miech et al., 2018; Wang et al., 2016; Bain et al., 2021), video question answering (Fan et al., 2019; Yang et al., 2021; Huang et al., 2020; Jiang et al., 2020; Le et al., 2020; Lei et al., 2021), and video captioning (Ging et al., 2020; Luo et al., 2020; Zhang et al., 2020b). In most of this work, contrastive learning (Gutmann and Hyvärinen, 2010) is used as the training objective.

Figure 1: Illustration of the weak correspondence problem in video-language learning. Given a pair of video and its text (e.g. caption, instruction, or transcription), only a subset of the frames (here indicated by coloured bounding boxes) is semantically aligned to the textual content. The remaining frames represent irrelevant visual information and will not contribute to language grounding on videos. (Example texts shown in the figure: "flip the pancakes when the edge turns brown"; "mince the tuna and add it to a bowl".)
The aim of a cross-modal contrastive loss is to maximise the similarity between an aligned video-text pair while minimising the similarity for all other pairs. One issue with the standard cross-modal contrastive loss is that it focuses on pair-level alignment but ignores the negative effects of irrelevant frames that are present in a single video clip, even in a pair of aligned video and text. We define irrelevant frames as those with no or little shared semantics with the text. These irrelevant frames may negatively affect the contribution of frames that are semantically similar to the text, which further results in a less informative video representation. Therefore, we posit that frame-level learning is a better strategy for video-language tasks.
In this paper, we propose FineCo, an approach that has a frame selector to sample relevant frames in a video and is trained with a fine-grained contrastive loss on frame-text pairs, in order to mitigate the problem of weak correspondence in video-language representation learning. Existing video-language learning approaches (Miech et al., 2020; Xu et al., 2021) only optimise pair-level alignment but do not explicitly learn which part of a video contributes to its alignment with the text. FineCo focuses on aligning relevant frames with the text. It is inspired by the text-based temporal localisation task (Zhang et al., 2020a); however, the motivation of FineCo is different: to learn a better video-level representation by adding a frame-level contrastive learning signal to the pair-level objective, with no need for temporal annotation within a video-text pair.
We hypothesise that FineCo is particularly beneficial for long videos, where each video provides more information and only a small proportion of frames will be relevant to its text counterpart, as shown in Figure 1. FineCo is able to model frame-text similarity through fine-grained contrastive learning, where the most informative frames are paired with the text as positive pairs and the remaining frames as negatives. It then explicitly contrasts the selected informative frames against the noisy frames, without the need for frame-text annotations. This frame-level distillation provides a strong learning signal, which encourages the alignment of semantically equivalent video-text pairs. The fine-grained contrastive loss abstracts the learning signal from pair-level annotations and is trained in an end-to-end manner. This combination of a pair-level learning signal and a frame-level contrastive loss is novel and effective, and boosts the performance on two important video-language benchmark tasks, especially text-video retrieval with longer videos. We devised FineCo by building on the recently proposed and well-performing VideoCLIP (Xu et al., 2021), in which a video clip is represented as a sequence of frame features.
Our contributions are summarised as follows: (1) We propose FineCo, an approach trained with a fine-grained contrastive loss to mitigate the weak correspondence problem in video-text pairs; (2) We use FineCo to distil a video clip by sampling frames that are relevant to its text counterpart according to frame-text similarities; (3) On text-video retrieval and video question answering benchmarks, we show that FineCo achieves state-of-the-art performance on YouCookII and MSR-VTT MC (multiple choice).
2 Related Work
Contrastive Learning
The use of contrastive loss (Gutmann and Hyvärinen, 2010) has become the dominant paradigm for learning video-language representations. The aim is to maximise the similarity of video-text pairs that are aligned to each other (positive pairs) while pushing away irrelevant (negative) pairs. However, the semantic alignment between most video-text pairs is weak, which makes it difficult to ground textual information on the videos. In order to mitigate the pair-level weak alignment issue, MIL-NCE (Miech et al., 2020) leverages multiple surrounding captions as the positive pairs and makes use of multiple instance learning (MIL) (Dietterich et al., 1997) with contrastive loss to mitigate noise in cross-modal correspondences. The main idea is to consider multiple contextual sentences for matching a video, instead of only comparing a video against a single sentence. To alleviate the issue that semantically equivalent videos and texts from different pairs may be taken as dissimilar in contrastive learning, support-set (Patrick et al., 2021) introduces a generative approach for captioning over a set of visual candidates, which ensures that the video-language representation does not over-specialise to individual samples. MIL-NCE and support-set focus on pair-level contrastive signals to align relevant video-text pairs. However, even within a positive video-text pair, the video is likely to contain many irrelevant frames. Therefore, it can be beneficial to distil the video such that only the relevant frames, i.e. those which have similar content to the text, are selected for cross-modal learning.
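To make the MIL-NCE idea concrete, the following is a minimal PyTorch sketch (not the authors' implementation; the tensor shapes, mask construction, and temperature value are illustrative assumptions) of a contrastive objective in which each video is scored against several candidate captions and the positive candidates are pooled inside the logarithm:

```python
import torch
import torch.nn.functional as F

def mil_nce_style_loss(video_emb, text_emb, pos_mask, temperature=0.07):
    """Sketch of a MIL-NCE-style objective.

    video_emb: (B, D) video embeddings.
    text_emb:  (B, P, D) candidate caption embeddings per video
               (e.g. temporally nearby transcriptions).
    pos_mask:  (B, B, P) boolean, True where caption (j, p) is a
               positive candidate for video i.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity of every video to every candidate caption: (B, B, P)
    sims = torch.einsum("id,jpd->ijp", video_emb, text_emb) / temperature
    exp_sims = sims.exp()

    # Positives are summed inside the log (the MIL part); all other
    # captions in the batch act as negatives.
    pos = (exp_sims * pos_mask).sum(dim=(1, 2))
    denom = exp_sims.sum(dim=(1, 2))
    return -(pos / denom).log().mean()
```

Pooling the positive candidates inside the log means a video only needs to match one of its nearby captions well, which softens the effect of misaligned transcriptions.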
Video-language Learning
Prior work (Sun et al., 2019; Zhu and Yang, 2020; Gabeur et al., 2020; Li et al., 2020a; Miech et al., 2020; Ging et al., 2020; Luo et al., 2020) has shown promising results for video-language learning with pre-training followed by fine-tuning. This strategy has become very prominent since the release of BERT (Devlin et al., 2019) and many image-text pre-training frameworks (Tan and Bansal, 2019; Li et al., 2019, 2020b; Zhang et al., 2021; Chen et al., 2020; Zhang et al., 2019; Kim et al., 2021; Li et al., 2021, 2022). The release of datasets such as HowTo100M (Miech et al., 2019) and WebVid-2M (Bain et al., 2021) has enabled large-scale pre-training on unlabelled video-text pairs to improve representation learning of video and language. Many approaches (Miech et al., 2020; Zhu and Yang, 2020; Patrick et al., 2021) use HowTo100M as their pre-training dataset. FiT (Bain et al., 2021) uses WebVid-2M and Google Conceptual Captions (CC3M) to take advantage of the large collection of video-text and image-text pairs for pre-training. However, large pre-training datasets rely on loosely aligned video-text pairs, without any fine-grained supervision on alignment. This makes it difficult to learn the cross-modal cues present in the given video-text pairs. It is also computationally expensive to improve video-language representation learning, given that videos can contain a large number of frames, especially longer videos. ClipBERT (Lei et al., 2021) randomly samples a few frames from a video for video-language representation learning. Their motivation is to minimise the memory and computation costs of processing the full sequence of frames. This sampling strategy is over-simplistic and can thus be improved by better approaches that select frames based on their relevance to the paired text.
3 FineCo
3.1 Preliminaries
The most widely used objective function for video-language learning is the contrastive loss, specifically the softmax version of noise-contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010). It is formulated as

\[
\sum_{i=1}^{n} \log \frac{e^{f(x_i)^\top g(y_i)}}{e^{f(x_i)^\top g(y_i)} + \sum_{(x'_i, y'_i) \in N_i} e^{f(x'_i)^\top g(y'_i)}} \tag{1}
\]

where $x_i$ denotes a video clip and $y_i$ represents the corresponding text (e.g. a caption, an instruction, or a transcription); $f$ and $g$ are the video encoder and the text encoder, respectively; $e^{f(x_i)^\top g(y_i)}$ denotes the similarity of a positive video-text pair, calculated as the exponentiated dot product of the video representation $f(x_i)$ and the text representation $g(y_i)$; and $N_i$ is the set of negative video-text pairs $(x'_i, y'_i)$ that are not aligned.
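As a point of reference, here is a minimal PyTorch sketch of this pair-level objective, assuming precomputed clip and text embeddings and using the other pairs in the batch as the negative set $N_i$; the function name and the batch-negatives choice are illustrative assumptions rather than details taken from the paper:

```python
import torch

def pair_level_nce(video_emb, text_emb):
    """Pair-level NCE over a batch (cf. Eq. 1).

    video_emb: (B, D) clip embeddings f(x_i).
    text_emb:  (B, D) text embeddings g(y_i).
    Returns the loss to minimise, i.e. the negative (batch-averaged)
    version of Eq. 1.
    """
    # Dot-product similarity between every clip and every text: (B, B)
    logits = video_emb @ text_emb.t()
    # Diagonal entries are the aligned pairs; off-diagonal entries
    # play the role of the negative set N_i.
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    return torch.nn.functional.cross_entropy(logits, targets)
```

Minimising this cross-entropy over the batch is equivalent, up to sign and averaging, to maximising Eq. 1.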
This contrastive loss leverages the pair-level similarity of video and text, but ignores the fact that weak video-language correspondence does not stem only from entirely negative pairs of video and text, but also from frame-level noise, which occurs even when a video-text pair is aligned as a whole. The standard contrastive loss does not explicitly model frame-text relevance, i.e. it does not differentiate between frames that are semantically equivalent to the corresponding text and frames that are not. It can thus suffer from learning from noisy signals, particularly in long videos with varied scenes.
3.2 Fine-grained Contrastive Learning
A video consists of a sequence of frames. For video-language learning, the video is paired with a text which describes or refers to some of the content of the video. For most tasks, only some of the visual information has an equivalent textual signal, e.g. a video description is only a summary of the visual information. To sample and optimise for the relevant visual information from a video, we propose a fine-grained contrastive loss to distil each video-text pair.
Formally, a video-text pair is denoted as $(x, y)$, where $x$ is a video clip consisting of a sequence of video frames $\{x_1, x_2, \ldots, x_K\}$, with $K$ the number of frames in the video clip, and $y$ is the paired text. We assume that a video $x$ contains a set of $C$ positive frames $P(x)$ and a set of $(K - C)$ negative frames $N(x)$, where the positive frames contain information relevant to the text while the negative frames are noisy/irrelevant ones. The aim is to maximise the joint probability of relevant frame-text pairs $(x_k, y)$ by exponentiating the similarity of the two representations:

\[
p(x_k, y) = h(f(x_k), g(y)) \propto e^{\mathrm{sim}(f(x_k), g(y))} \tag{2}
\]
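As an illustration, exponentiated frame-text scores of this form can be computed as in the sketch below, where sim is assumed to be a dot product with an optional temperature; the similarity actually used by the paper comes from the frame selector described in Section 3.2.2:

```python
import torch

def frame_text_scores(frame_emb, text_emb, temperature=1.0):
    """Exponentiated frame-text similarities, one score per frame (cf. Eq. 2).

    frame_emb: (K, D) frame representations f(x_1), ..., f(x_K).
    text_emb:  (D,)   text representation g(y).
    """
    sims = frame_emb @ text_emb / temperature  # (K,) sim(f(x_k), g(y))
    return sims.exp()                          # proportional to p(x_k, y)
```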
3.2.1 Objective Function
Given $n$ pairs of video and text representations, where the $i$-th pair is denoted as $f(x_i) = \{f(x_{i1}), f(x_{i2}), \ldots, f(x_{iK})\}$ and $g(y_i)$, our fine-grained contrastive loss $\mathcal{L}$ is defined as:

\[
A_i = \sum_{x_{ik} \in P(x_i)} e^{\mathrm{sim}(f(x_{ik}), g(y_i))}, \qquad
B_i = \sum_{x'_{ik} \in N(x_i)} e^{\mathrm{sim}(f(x'_{ik}), g(y_i))}
\]
\[
\mathcal{L} = \sum_{i=1}^{n} \log \frac{A_i}{A_i + B_i} \tag{3}
\]
where $P(x_i)$ contains the positive frames in a video, i.e. those with higher similarities to the text representation $g(y_i)$, and $N(x_i)$ is the set of remaining frames in the same video, i.e. the negative frames. The similarity is calculated by our frame selector (FS) (Section 3.2.2) with the frame
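To make the objective concrete, the following is a minimal PyTorch sketch of Eq. 3 for a batch of videos; taking the top-$C$ most similar frames as $P(x_i)$ is a stand-in assumption for the frame selector FS, and names such as num_positives are illustrative:

```python
import torch

def fine_grained_loss(frame_emb, text_emb, num_positives):
    """Fine-grained frame-level contrastive loss (cf. Eq. 3).

    frame_emb: (B, K, D) frame representations f(x_ik) per video.
    text_emb:  (B, D)    text representations g(y_i).
    num_positives: C, the number of frames kept as positives per video.
    Returns the loss to minimise (negative of Eq. 3, batch-averaged).
    """
    # Frame-text similarities sim(f(x_ik), g(y_i)): (B, K)
    sims = torch.einsum("bkd,bd->bk", frame_emb, text_emb)
    exp_sims = sims.exp()

    # Stand-in for the frame selector: the C most similar frames per
    # video form P(x_i); the remaining K - C frames form N(x_i).
    top_c = sims.topk(num_positives, dim=1).indices
    pos_mask = torch.zeros_like(sims, dtype=torch.bool)
    pos_mask.scatter_(1, top_c, True)

    a = (exp_sims * pos_mask).sum(dim=1)      # A_i
    b = (exp_sims * (~pos_mask)).sum(dim=1)   # B_i
    return -(a / (a + b)).log().mean()
```

Because $A_i$ and $B_i$ share the same denominator, increasing the similarity of the selected frames to the text directly reduces the relative weight of the noisy frames.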