Our contributions are summarized as follows: (1) We are the first to study end-to-end long-form video-
language pre-training with large-scale video-paragraph data to benefit long-form video understanding.
(2) We propose an MTC loss to capture the temporal relationship between clips and sentences while
improving the joint representation of long-form video and language. (3) We design an HTWA mechanism for the video Transformer backbone, which captures long-range dependencies in long-form videos effectively and efficiently. (4) We verify the effectiveness of our LF-VILA model
on a wide range of downstream long-form video-language understanding tasks. Our model achieves
state-of-the-art performance on four paragraph-to-video retrieval tasks and three long-form video
question-answering tasks.
2 Related Work
2.1 Video Representation
Most previous video encoders use 3D-CNN-based backbones [7, 47, 52]. These models show promising performance on short-form video understanding tasks such as action classification and detection [7, 6, 17]. However, CNNs have a limited receptive field and cannot effectively capture long-range dependencies. Recent works have extended the Vision Transformer [13] to video representation and demonstrated the benefit of long-range temporal learning [5, 33]. To reduce the computational cost, TimeSformer [5] introduces factorized space-time attention, while Video Swin-Transformer [32] restricts self-attention to local 3D windows. However, TimeSformer [5] remains computationally expensive when the number of input frames grows large, and Video Swin-Transformer [33] adopts a fixed-size temporal window that is not suitable for long-duration videos. We propose hierarchical temporal window attention to learn long-range dependencies in long-form videos effectively while reducing the computational cost.
2.2 Long-form Video Understanding
Long-form video understanding is less explored in previous studies. Some works use long-term context to improve recognition performance [42, 50]. Typical long-form video understanding tasks include shot or event boundary detection [4] and temporal action detection [6], but these tasks cannot reveal a model's high-level understanding ability. Jointly understanding long-form videos with language is a way to discover the rich semantics contained in videos, and many benchmarks have been proposed recently, such as paragraph-to-video retrieval [1, 2, 23, 37] and long-form video question answering [27, 28, 30, 58]. Previous works that explore these tasks mostly use pre-extracted features, which hinders performance because the features are sub-optimal [16, 27, 60, 62]. We study end-to-end long-form video-language pre-training and transfer it to long-form video-language understanding tasks.
2.3 Video-Language Pre-training
Inspired by the success of image-language pre-training [10, 19, 20, 21, 22, 39, 55], video-language pre-training has also been explored recently. However, these works mainly focus on short-form videos [3, 34, 36, 54]. Some works use a 3D-CNN as the video backbone [34, 36]. To utilize the advances of Transformers, other works use sparsely sampled frames to reduce the computational cost [3, 54]. One key factor for learning good representations is using a contrastive loss to align multi-modal features [3, 34, 36, 54, 59]. We further design a multimodal temporal contrastive loss to conduct fine-grained alignment between long-form videos and paragraphs (a simplified sketch is given at the end of this subsection). The power of a pre-trained model depends largely on the amount of training data; some works build large-scale video-language datasets [36, 54, 59], and we build a long-form video-paragraph dataset based on HD-VILA-100M [54]. Several works have explored long-form video-language pre-training: HERO [27] uses pre-extracted features, while MERLOT [59] uses an image encoder to encode frames separately, which ignores joint spatial-temporal representation. Different from them, we use a video Transformer backbone and perform end-to-end pre-training on a large-scale long-form video-paragraph dataset.
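As referenced above, the fine-grained alignment idea can be illustrated with a symmetric InfoNCE-style objective over temporally matched clip-sentence pairs. This is only a simplified sketch under our own assumptions (the function name, the pairing of the i-th clip with the i-th sentence, and the temperature value are illustrative); the actual MTC loss is designed to capture the temporal relationship between clips and sentences and is described in Sec. 3.

```python
# A simplified, symmetric InfoNCE-style sketch of clip-sentence alignment
# within one video-paragraph pair. Illustrative only -- not the exact MTC loss.
import torch
import torch.nn.functional as F


def clip_sentence_contrastive_loss(clip_emb: torch.Tensor,
                                   sent_emb: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """clip_emb, sent_emb: (N, dim) embeddings of N temporally matched
    clip-sentence pairs; the i-th clip is the positive for the i-th sentence
    and all other clips serve as negatives (an assumed setup)."""
    clip_emb = F.normalize(clip_emb, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)
    logits = clip_emb @ sent_emb.t() / temperature  # (N, N) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: clip-to-sentence and sentence-to-clip directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```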
3 Approach
In this section, we first show the overall architecture of the proposed Long-Form VIdeo-LAnguage
pre-training model (LF-VILA) in Sec. 3.1. Then we explain our proposed Multimodal Temporal