Our contributions are summarized as follows: (1) We are the first to study end-to-end long-form video-
language pre-training with large-scale video-paragraph data to benefit long-form video understanding.
(2) We propose an MTC loss to capture the temporal relationship between clips and sentences while
improving the joint representation of long-form video and language. (3) We design an HTWA mechanism for the video Transformer backbone, which captures long-range dependencies in long-form videos effectively and efficiently. (4) We verify the effectiveness of our LF-VILA model
on a wide range of downstream long-form video-language understanding tasks. Our model achieves
state-of-the-art performance on four paragraph-to-video retrieval tasks and three long-form video
question-answering tasks.
2 Related Work
2.1 Video Representation
Most previous video encoders use 3D-CNN-based backbones [7, 47, 52]. These models show promising performance on short-form video understanding tasks such as action classification and detection [7, 6, 17]. However, CNNs have a limited receptive field and cannot effectively capture long-range dependencies. Recent works have extended the Vision Transformer [13] to video representation and demonstrated the benefit of long-range temporal learning [5, 33]. To reduce the computational cost, TimeSformer [5] introduces factorized space-time attention, while Video Swin-Transformer [32] restricts self-attention to local 3D windows. However, TimeSformer [5] remains computationally expensive when the number of input frames grows large, and Video Swin-Transformer [33] adopts a fixed-size temporal window that is not suitable for long-duration videos. We propose hierarchical temporal window attention to learn long-range dependencies in long-form videos effectively while reducing the computational cost.
2.2 Long-form Video Understanding
Long-form video understanding is less explored in previous studies. Some works use long-term context to improve recognition performance [42, 50]. Typical long-form video understanding tasks include shot or event boundary detection [4] and temporal action detection [6], but these tasks cannot reveal a model's high-level understanding ability. Jointly understanding long-form videos with language is a way to discover the rich semantics contained in videos, and many benchmarks have been proposed recently, such as paragraph-to-video retrieval [1, 2, 23, 37] and long-form video question answering [27, 28, 30, 58]. Previous works that explore these tasks mostly use pre-extracted features, which hinders performance because the features are sub-optimal [16, 27, 60, 62]. We study end-to-end long-form video-language pre-training and transfer it to long-form video-language understanding tasks.
2.3 Video-Language Pre-training
Inspired by the success of image-language pre-training [10, 19, 20, 21, 22, 39, 55], video-language pre-training has also been explored recently. However, these works mainly focus on short-form videos [3, 34, 36, 54]. Some works use a 3D-CNN as the video backbone [34, 36]. To utilize the advances of Transformers, other works use sparsely sampled frames to reduce the computational cost [3, 54]. One key factor for learning good representations is using a contrastive loss to align multi-modal features [3, 34, 36, 54, 59]. We further design a multimodal temporal contrastive loss to conduct fine-grained alignment between long-form videos and paragraphs (a simplified sketch is given at the end of this subsection). The power of a pre-trained model depends largely on the amount of training data; some works build large-scale video-language datasets [36, 54, 59], and we build a long-form video-paragraph dataset based on HD-VILA-100M [54]. Several works have explored long-form video-language pre-training: HERO [27] uses pre-extracted features, while MERLOT [59] uses an image encoder to encode frames separately, which ignores joint spatial-temporal representation. Different from them, we use a video Transformer backbone and perform end-to-end pre-training on a large-scale long-form video-paragraph dataset.
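As referenced above, the fine-grained alignment idea can be illustrated with a symmetric InfoNCE-style objective over temporally matched clip-sentence pairs. This is only a simplified sketch under our own assumptions (the function name, the pairing of the i-th clip with the i-th sentence, and the temperature value are illustrative); the actual MTC loss is designed to capture the temporal relationship between clips and sentences and is described in Sec. 3.

```python
# A simplified, symmetric InfoNCE-style sketch of clip-sentence alignment
# within one video-paragraph pair. Illustrative only -- not the exact MTC loss.
import torch
import torch.nn.functional as F


def clip_sentence_contrastive_loss(clip_emb: torch.Tensor,
                                   sent_emb: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """clip_emb, sent_emb: (N, dim) embeddings of N temporally matched
    clip-sentence pairs; the i-th clip is the positive for the i-th sentence
    and all other clips serve as negatives (an assumed setup)."""
    clip_emb = F.normalize(clip_emb, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)
    logits = clip_emb @ sent_emb.t() / temperature  # (N, N) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: clip-to-sentence and sentence-to-clip directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```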
3 Approach
In this section, we first show the overall architecture of the proposed Long-Form VIdeo-LAnguage
pre-training model (LF-VILA) in Sec. 3.1. Then we explain our proposed Multimodal Temporal