Long-Form Video-Language Pre-Training with
Multimodal Temporal Contrastive Learning
Yuchong Sun1, Hongwei Xue2, Ruihua Song1†, Bei Liu3†, Huan Yang3, Jianlong Fu3
1Renmin University of China, Beijing, China
2University of Science and Technology of China, Hefei, China
3Microsoft Research, Beijing, China
1{ycsun, rsong}@ruc.edu.cn, 2gh051120@mail.ustc.edu.cn, 3{bei.liu, huayan, jianf}@microsoft.com
Abstract
Large-scale video-language pre-training has shown significant improvement in
video-language understanding tasks. Previous studies of video-language pre-
training mainly focus on short-form videos (i.e., within 30 seconds) and sentences,
leaving long-form video-language pre-training rarely explored. Directly learning
representation from long-form videos and language may benefit many long-form
video-language understanding tasks. However, it is challenging due to the difficulty
of modeling long-range relationships and the heavy computational burden caused
by more frames. In this paper, we introduce a Long-Form VIdeo-LAnguage pre-training model (LF-VILA) and train it on a large-scale long-form video and
paragraph dataset constructed from an existing public dataset. To effectively capture
the rich temporal dynamics and to better align video and language in an efficient
end-to-end manner, we introduce two novel designs in our LF-VILA model. We
first propose a Multimodal Temporal Contrastive (MTC) loss to learn the tem-
poral relation across different modalities by encouraging fine-grained alignment
between long-form videos and paragraphs. Second, we propose a Hierarchical
Temporal Window Attention (HTWA) mechanism to effectively capture long-range
dependency while reducing the computational cost of the Transformer. We fine-tune the pre-trained LF-VILA model on seven downstream long-form video-language understanding tasks, covering paragraph-to-video retrieval and long-form video question-answering, and achieve new state-of-the-art performance. Specifically, our model achieves a 16.1% relative improvement on the ActivityNet paragraph-to-video retrieval task and a 2.4% relative improvement on the How2QA task. We release our code, dataset, and
pre-trained models at https://github.com/microsoft/XPretrain.
1 Introduction
In recent years, research on video understanding has attracted extensive attention due to the huge
amount of videos available everywhere in our daily life. Previous research works on video under-
standing [14, 15, 43, 47, 61] mainly focus on short-form video (i.e., <30 seconds) analysis, and the semantics are limited to certain types (e.g., actions, scenes). However, there are many long-form videos (i.e., >30 seconds) [51] in real scenarios. Human-annotated labels (e.g., actions) can hardly cover the rich semantic and dynamic information contained in those videos. On the other hand,
the video-language pre-training paradigm provides a way to learn cross-modal representation from
This work was performed when Yuchong Sun and Hongwei Xue were visiting Microsoft Research as
research interns.
†Ruihua Song and Bei Liu are the corresponding authors.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
[Figure 1: five frames sampled at 0'00'', 0'11'', 0'28'', 0'52'', and 1'08'' from a long-form video, each paired with the sentence of the paragraph that describes it.]
Figure 1: An example of a long-form video-paragraph pair with several clips and sentences. It contains a complicated storyline and rich temporal dynamics. Each sentence describes only a short clip, and understanding the whole video requires long-range spatial-temporal reasoning.
video and language pairs and shows promising results on various high-level video understanding tasks involving language [3, 27, 54, 59]. However, these studies mainly focus on short-form videos.
In this paper, we explore directly exploiting long-form video and language pairs for pre-training to
benefit a wide range of long-form video-language understanding tasks.
Although long-form video-language joint learning has been explored in downstream tasks [16, 27, 28, 30, 58, 60, 62], these works either use pre-extracted video features, which leads to sub-optimal performance, or utilize an image encoder to extract frame features, which fails to model the long-range dependency in long-form videos. Recent works [3, 5, 33] have shown that a video Transformer [48] backbone helps to capture long-range dependency in an end-to-end fashion. An intuitive way for long-form video-language pre-training is to adopt a video-Transformer-based short-form video-language pre-training model [3, 54] with long-form data. However, there are two main challenges in such a design.
First, long-form videos often contain more complicated storylines and richer temporal dynamics as
shown in Fig. 1. Simply aligning video and paragraph using a vanilla contrastive loss like previous
models [3, 54] will ignore the temporal relation between clips and sentences, thus hindering the
quality of the learned representation. Second, feeding more frames into a Transformer-based video encoder will largely increase the computational cost due to the self-attention operation.
To overcome the above challenges, we propose a Long-Form VIdeo-LAnguage pre-training model
(LF-VILA) with two novel designs. First, to better align long-form video-language pairs and learn the
temporal relationship between visual and language modalities, we propose a Multimodal Temporal
Contrastive (MTC) loss that learns temporal alignment between video clips and single sentences.
MTC encourages the similarity between the two modalities to be consistent with their temporal relationship. In other words, the embedding distance between a video clip and a temporally close sentence should be smaller than its distance to sentences that are far away in time. Combined with the global alignment between video and paragraph, MTC ensures that the model captures the temporal relation between video clips and single sentences and further helps to improve the quality of the joint representation.
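To make this intuition concrete, the following PyTorch-style sketch implements a temporal ranking constraint in the spirit of MTC; it is illustrative only, not the paper's exact formulation, and the cosine-distance choice, the margin value, and all tensor names are assumptions.

```python
import torch
import torch.nn.functional as F

def temporal_contrastive_sketch(clip_emb, sent_emb, margin=0.1):
    """Illustrative temporal ranking loss (not the exact MTC formulation).

    clip_emb, sent_emb: (M, D) tensors of M clip and M sentence embeddings,
    both in temporal order. For clip i and two sentences j, k with
    |i - j| < |i - k|, the distance d(i, j) should be smaller than d(i, k).
    """
    clip_emb = F.normalize(clip_emb, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)
    dist = 1.0 - clip_emb @ sent_emb.t()                # (M, M) cosine distances
    M = dist.size(0)
    idx = torch.arange(M, device=dist.device)
    tdist = (idx[:, None] - idx[None, :]).abs()         # temporal distances (M, M)

    losses = []
    for i in range(M):
        # pairs (j, k) where sentence j is temporally closer to clip i than sentence k
        closer = tdist[i][:, None] < tdist[i][None, :]
        # hinge violation when d(i, j) is not smaller than d(i, k) by the margin
        viol = F.relu(margin + dist[i][:, None] - dist[i][None, :])
        losses.append(viol[closer].mean())
    return torch.stack(losses).mean()
```

In the full model, such a fine-grained term is combined with the global video-paragraph alignment loss, so temporal ordering and overall alignment are learned jointly.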
Second, to utilize the advantage of the Transformer for capturing long-range dependency while
efficiently processing more frames for end-to-end training, we propose a Hierarchical Temporal
Window Attention (HTWA) mechanism. As shown in Fig. 1, the frames sparsely sampled from a
long-form video have large spatial and motion gaps, thus directly computing self-attention on all
frames in all layers of the Transformer is inefficient and unnecessary. Instead, we only learn the
attention between adjacent frames in the first few layers that focus more on details of spatial and
temporal information. Then we gradually expand the window size in the following layers, where the
high-level representation enables the model to better capture the relation between frames far apart.
The computational cost is largely reduced with the proposed HTWA mechanism.
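As a rough back-of-the-envelope illustration (using the notation introduced later in Sec. 3.1, where a video has M clips of N frames with H×W patches each, and assuming the cost is dominated by the quadratic token-token attention term), restricting self-attention to temporal windows of w frames changes the dominant cost roughly as follows; the exact savings depend on the layer-wise window schedule and are not taken from the paper:

```latex
\underbrace{\mathcal{O}\big((MNHW)^2\big)}_{\text{full space-time attention}}
\;\longrightarrow\;
\frac{MN}{w}\cdot\mathcal{O}\big((wHW)^2\big)
= \underbrace{\mathcal{O}\big(MN\,w\,(HW)^2\big)}_{\text{temporal windows of } w \text{ frames}}
```

so the cost grows linearly rather than quadratically in the total number of frames as long as w is kept small in the early layers.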
We conduct experiments and evaluate LF-VILA on seven downstream long-form video-language
understanding tasks of paragraph-to-video retrieval and long-form video question-answering. We
surpass the state-of-the-art models pre-trained on short videos by a large margin. Our results
demonstrate the benefit of modeling long-range dependency for long-form videos. We also verify the
effectiveness of our proposed MTC loss and HTWA mechanism through ablation studies.
Our contributions are summarized as follows: (1) We are the first to study end-to-end long-form video-
language pre-training with large-scale video-paragraph data to benefit long-form video understanding.
(2) We propose an MTC loss to capture the temporal relationship between clips and sentences while
improving the joint representation of long-form video and language. (3) We design an HTWA
mechanism for the video Transformer backbone, which can capture the long-range dependency in
long-form videos effectively and efficiently. (4) We verify the effectiveness of our LF-VILA model
on a wide range of downstream long-form video-language understanding tasks. Our model achieves
state-of-the-art performance on four paragraph-to-video retrieval tasks and three long-form video
question-answering tasks.
2 Related Work
2.1 Video Representation
Most previous video encoders utilize 3D-CNN based backbones [7, 47, 52]. These models show promising performance on short-form video understanding tasks, such as action classification and detection [7, 6, 17]. However, CNNs have a limited receptive field and cannot effectively capture long-range dependency. Recent works have extended the Vision Transformer [13] for video representation and demonstrated the benefit of long-range temporal learning [5, 33]. To reduce the computational cost, TimeSformer [5] introduces a factorized space-time attention, while Video Swin-Transformer [32] restricts self-attention to a local 3D window. However, TimeSformer [5] is still computationally expensive when the number of input frames becomes large, and Video Swin-Transformer [33] adopts a fixed-size temporal window, which is not suitable for videos of long duration. We propose hierarchical temporal window attention to effectively learn the long-range dependency in long-form videos while reducing the computational cost.
2.2 Long-form Video Understanding
Long-form video understanding is less explored in previous studies. Some works use long-term
context for improving recognition performance [42, 50]. Typical long-form video understanding tasks include shot or event boundary detection [4] and temporal action detection [6], but these tasks cannot reveal a model's high-level understanding ability. Jointly understanding long-form videos with language is a way to discover the rich semantics contained in videos, and many benchmarks have been proposed recently, such as paragraph-to-video retrieval [1, 2, 23, 37] and long-form video question-answering [27, 28, 30, 58]. Previous works that explore these tasks mostly use pre-extracted features, which hinders performance because the features are sub-optimal [16, 27, 60, 62]. We study end-to-end long-form video-language pre-training and transfer it to long-form video-language understanding tasks.
2.3 Video-Language Pre-training
Inspired by the success of image-language pre-training [10, 19, 20, 21, 22, 39, 55], video-language pre-training has also been explored recently. However, these works mainly focus on short-form videos [3, 34, 36, 54]. Some works use a 3D-CNN as the video backbone [34, 36]. To utilize the advancement of Transformers, some works use sparsely sampled frames to reduce the computation requirements [3, 54]. One key factor for learning good representation is using a contrastive loss to align multi-modal features [3, 34, 36, 54, 59]. We further design a multimodal temporal contrastive loss to conduct fine-grained alignment between long-form videos and paragraphs. The power of a pre-training model largely depends on the amount of training data, so some works built large-scale video-language datasets [36, 54, 59]; we build a long-form video-paragraph dataset based on HD-VILA-100M [54]. Several works have explored long-form video-language pre-training: HERO [27] uses pre-extracted features, while MERLOT [59] uses an image encoder to separately encode frames, which ignores joint spatial-temporal representation. Different from them, we use a video Transformer backbone and end-to-end pre-training on a large-scale long-form video-paragraph dataset.
3 Approach
In this section, we first show the overall architecture of the proposed Long-Form VIdeo-LAnguage
pre-training model (LF-VILA) in Sec. 3.1. Then we explain our proposed Multimodal Temporal
[Figure 2: (a) the LF-VILA framework with a text encoder, a video encoder, and a cross-modal encoder, pre-trained with the L_global, L_vtm, L_mtc, and L_mlm objectives; (b) the MTC loss in the multimodal space, where the distance between a clip and a temporally closer sentence is smaller than its distance to a farther one.]
Figure 2: The framework of (a) Long-Form VIdeo-LAnguage pre-training model (LF-VILA) and
illustration of (b) Multimodal Temporal Contrastive (MTC) learning. (a) LF-VILA consists of a text
encoder, a video encoder, and a cross-modal encoder. In the text encoder, attention is first computed within each sentence and then over the whole paragraph. The pink boxes in the video encoder illustrate the proposed Hierarchical Temporal Window Attention (HTWA) mechanism. (b) The MTC loss aligns two sequences of representations (e.g., clip and sentence representations in our case): the distance between two elements' representations is smaller when they are closer in time.
Contrastive (MTC) loss for learning the cross-modal temporal relationship in Sec. 3.2, followed by the designed Hierarchical Temporal Window Attention (HTWA) mechanism for an efficient video encoder in Sec. 3.3. Finally, we introduce the pre-training pipeline with the target pre-training tasks in Sec. 3.4.
3.1 Model Architecture
As illustrated in Fig. 2, our proposed Long-Form VIdeo-LAnguage pre-training model (LF-VILA)
consists of three parts: a video encoder E_V, a text encoder E_T, and a cross-modal encoder E_C. With a video and a paragraph as input, we first pass them to the video encoder E_V and the text encoder E_T for embedding learning, respectively. Then we concatenate visual and language embeddings as input to the cross-modal encoder E_C, where further cross-modal joint learning is conducted. The details of these encoders are as follows.
Text Encoder.
The text encoder E_T is based on the Transformer network [48]. We divide it into two parts: sentence-level and paragraph-level encoding. In the first several layers, self-attention is conducted within word tokens from the same sentence, so that each sentence embedding is learned individually. In higher layers, we add segment embeddings to distinguish sentences, and the attention computation is extended to all word tokens of the paragraph to output the paragraph representation.
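One simple way to realize this two-stage encoding is with attention masks: a block-diagonal mask that confines attention to tokens of the same sentence in the lower layers, and a full mask in the higher layers. The PyTorch sketch below illustrates this idea; the sentence-id format and the function names are assumptions, not the authors' implementation.

```python
import torch

def sentence_level_mask(sent_ids: torch.Tensor) -> torch.Tensor:
    """Block-diagonal mask for the lower (sentence-level) layers.

    sent_ids: (L,) tensor giving, for each word token, the index of the sentence
    it belongs to. Returns an (L, L) boolean mask where True means attention is allowed.
    """
    return sent_ids[:, None] == sent_ids[None, :]

def paragraph_level_mask(sent_ids: torch.Tensor) -> torch.Tensor:
    """Full mask for the higher (paragraph-level) layers: every token attends to all tokens."""
    L = sent_ids.size(0)
    return torch.ones(L, L, dtype=torch.bool)

# Example: a 3-sentence paragraph with 3, 2, and 3 word tokens respectively.
sent_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
low_mask = sentence_level_mask(sent_ids)    # block-diagonal: within-sentence attention only
high_mask = paragraph_level_mask(sent_ids)  # all-True: paragraph-wide attention
```

The segment embeddings mentioned above would be added to the token embeddings before the paragraph-level layers so that the model can still tell which sentence each token came from.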
Video Encoder.
Our video encoder is also built by stacking Transformer layers. In particular, we design a Hierarchical Temporal Window Attention (HTWA) mechanism for efficient attention computation. Given a long-form video that has M clips, we sample N frames from each clip and divide each raw frame into H×W patches. Then the M×N×H×W patches are encoded by E_V. With our designed HTWA mechanism, the temporal window is gradually expanded, so that we obtain hierarchical feature maps with different temporal receptive fields. In addition, the video features in the middle layer, whose temporal window size equals the number of frames per clip N, can be utilized as the clip representations for fine-grained alignment with sentences.
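The sketch below shows the basic pattern behind such window-restricted attention: frames are grouped into temporal windows whose size grows with depth, and self-attention is computed only among the patch tokens inside the same window. It is a minimal PyTorch illustration under assumed shapes and an assumed window schedule; it omits the spatial attention, patch merging, and other details of the actual backbone.

```python
import torch
import torch.nn as nn

def temporal_window_attention(x, attn: nn.MultiheadAttention, window: int):
    """Self-attention restricted to temporal windows of `window` frames.

    x: (T, P, D) patch tokens for T frames, each with P patches of dimension D.
    Frames are grouped into T // window windows; attention is computed only among
    the window * P tokens inside each window. Illustrative sketch only.
    """
    T, P, D = x.shape
    assert T % window == 0, "assume T is divisible by the window size for simplicity"
    xw = x.reshape(T // window, window * P, D)        # each window is one attention batch
    out, _ = attn(xw, xw, xw, need_weights=False)     # joint attention inside the window
    return out.reshape(T, P, D)

# Hierarchical schedule: small windows in early layers, larger windows in later layers,
# so the temporal receptive field grows with depth (e.g., 2 -> 4 -> 8 frames).
T, P, D = 16, 49, 256
x = torch.randn(T, P, D)
layers = [nn.MultiheadAttention(D, num_heads=4, batch_first=True) for _ in range(3)]
for attn, window in zip(layers, [2, 4, 8]):
    x = x + temporal_window_attention(x, attn, window)   # residual connection
```

With a window size equal to N in the middle layers, the tokens inside each window cover exactly one clip, so pooling them (one possible choice) would yield the clip representations used for fine-grained alignment with sentences.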
Cross-modal Encoder.
The cross-modal encoder E_C consists of Transformer layers. Visual and language embeddings from the outputs of E_V and E_T are concatenated as the input to E_C. Self-attention is used to capture the joint relation between the visual and language modalities. E_C outputs the representation of the [CLS] token and of each textual and visual token.
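A bare-bones PyTorch sketch of this fusion step is given below: a learnable [CLS] token is prepended, the text and video token sequences are concatenated, and a standard Transformer encoder applies self-attention over the joint sequence. The concatenation order, layer count, and dimensions are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CrossModalEncoderSketch(nn.Module):
    """Illustrative cross-modal encoder: joint self-attention over concatenated
    visual and language embeddings plus a learnable [CLS] token."""

    def __init__(self, dim: int = 256, depth: int = 2, heads: int = 4):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, video_tokens, text_tokens):
        # video_tokens: (B, Lv, D) from E_V; text_tokens: (B, Lt, D) from E_T
        B = video_tokens.size(0)
        cls = self.cls.expand(B, -1, -1)
        joint = torch.cat([cls, text_tokens, video_tokens], dim=1)  # (B, 1 + Lt + Lv, D)
        out = self.encoder(joint)
        return out[:, 0], out[:, 1:]   # [CLS] representation and per-token outputs

# Example with random features standing in for the outputs of E_V and E_T.
enc = CrossModalEncoderSketch()
cls_rep, token_reps = enc(torch.randn(2, 32, 256), torch.randn(2, 20, 256))
```

The [CLS] output is a natural place to attach a video-text matching objective, while the per-token outputs can serve masked language modeling, matching the objectives shown in Figure 2.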