
former architecture for video classification that relied on
the factorization of the spatial and temporal dimensions
of the input. Bertasius et al. (2021) proposed the ap-
plication of self-attention between space and time sepa-
rately. More recently, VST (Liu et al., 2021b) proposed a
pure transformer architecture for video classification that
is able to surpass the factorized models in efficiency and
accuracy by taking advantage of the spatiotemporal locality of videos. Training a model to recognise images or videos from scratch requires large amounts of data and computational resources, in particular memory. In this paper, we study the accuracy of the VST in a transfer-learning scenario that requires around 4x less memory than training from scratch. We aim to understand whether the Video Swin Transformer generalizes well enough to be used in an out-of-domain setting. We evaluate the performance of our approach on FCVID (Jiang et al., 2016), a large-scale in-the-wild dataset, and on Something-Something, a dataset of humans performing actions. Our approach has the advantage of only requiring a single layer to be trained to achieve the reported results. This paper is organized as follows: Section 2 describes the VST model. Section 3 describes the datasets relevant for the experiments. Section 4 describes our setup. Section 5 presents the conclusions from the experiments and proposals for future work.
2. Video Swin Transformer
Video Swin Transformer (VST) (Liu et al., 2021b)
is an adaptation of the Swin Transformer (Liu et al.,
2021a) and currently achieves state-of-the-art results on
the video classification task on Kinetics-400 (Kay et al.,
2017), Kinetics-600 (Carreira et al., 2018), and Something-
Something (Goyal et al., 2017).
The architecture of the model is shown in figure 1. The
input is a video, defined by a tensor with dimensions T×
H×W×3, where T corresponds to the time dimension,
i.e. the number of frames, and H×W×3 to the number of
pixels in each frame. VST applies 3-dimensional patching
to the video (Dosovitskiy et al., 2020). Each patch has size
2×4×4×3. Patching the video results in T/2 × H/4 × W/4 patches that represent the whole video. Each patch is represented by a C-dimensional vector.
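As a concrete illustration, this patch partition can be implemented as a strided 3D convolution. The following is a minimal PyTorch sketch, not the authors' implementation; the names and the embedding dimension C = 96 are illustrative assumptions.

import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    # Maps each 2x4x4x3 patch of the input video to a C-dimensional vector by
    # applying a 3D convolution whose kernel size and stride equal the patch size.
    def __init__(self, in_channels=3, embed_dim=96, patch_size=(2, 4, 4)):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, video):
        # video: (batch, 3, T, H, W) -> patch features: (batch, C, T/2, H/4, W/4)
        return self.proj(video)

clip = torch.randn(1, 3, 32, 224, 224)      # a 32-frame 224x224 RGB clip
patches = PatchEmbed3D()(clip)              # shape: (1, 96, 16, 56, 56)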
The Patch Merging layers placed between consecutive VST blocks in Figure 1 are responsible for merging each group of 2×2 spatially neighboring patches by concatenating their features and then applying a linear transformation that reduces the concatenated features to half of their dimension.
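A minimal PyTorch sketch of such a merging layer is shown below; it follows the Swin-style formulation (2×2 neighbors concatenated from C to 4C features, then projected to 2C) and is only illustrative.

import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    # Merges each 2x2 group of spatially neighboring patches by concatenating
    # their features (C -> 4C) and projecting them back to 2C with a linear layer.
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):
        # x: (batch, T, H, W, C) with H and W even
        x = torch.cat([x[:, :, 0::2, 0::2, :], x[:, :, 1::2, 0::2, :],
                       x[:, :, 0::2, 1::2, :], x[:, :, 1::2, 1::2, :]], dim=-1)
        return self.reduction(self.norm(x))   # (batch, T, H/2, W/2, 2C)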
The main building block in this architecture is the VST block, represented in Figure 2. This block has the structure of a standard transformer block in which the multi-head self-attention module is replaced with a 3D shifted-window-based multi-head self-attention module.
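The overall block can be outlined as follows; this is a hypothetical PyTorch sketch, with window_attention standing in for the 3D shifted-window attention module described next.

import torch.nn as nn

class VSTBlock(nn.Module):
    # Standard transformer block in which the attention module is a 3D
    # (shifted) window-based multi-head self-attention, passed in as
    # `window_attention`.
    def __init__(self, dim, window_attention, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = window_attention
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):
        x = x + self.attn(self.norm1(x))   # windowed self-attention + residual
        x = x + self.mlp(self.norm2(x))    # feed-forward network + residual
        return x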
Given a video composed of T′×H′×W′ patches and a 3D window of size P×M×M, the window is used to partition the video patches into T′/P × H′/M × W′/M non-overlapping 3D windows.
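For illustration, this partitioning amounts to a reshape of the patch grid. A minimal sketch, assuming a PyTorch tensor of patch features whose dimensions are divisible by the window size:

def window_partition(x, window_size):
    # x: (batch, T', H', W', C) patch features; window_size: (P, M, M).
    # Returns the tokens of each non-overlapping 3D window:
    # (batch * T'/P * H'/M * W'/M, P*M*M, C).
    B, T, H, W, C = x.shape
    P, M, _ = window_size
    x = x.view(B, T // P, P, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, P * M * M, C)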
After the first VST layer, the window partitioning con-
figuration is shifted along the temporal, the height, and
the width axis by (P/2, M/2, M/2) patches from the preceding
layer. This shifted attention architecture was originally
developed by Liu et al. (2021a) and introduces connec-
tions between the neighboring non-overlapping 3D win-
dows from the previous layer.
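In practice this shift can be realized as a cyclic roll of the patch grid before the window partition. A sketch under the same assumptions as above; note that the real implementation additionally masks attention across the wrapped-around boundaries.

import torch

def cyclic_shift(x, window_size):
    # x: (batch, T', H', W', C). Roll the patch grid by half a window along the
    # temporal, height, and width axes so that the next window partition is
    # shifted by (P/2, M/2, M/2) patches with respect to the previous layer.
    P, M, _ = window_size
    return torch.roll(x, shifts=(-(P // 2), -(M // 2), -(M // 2)), dims=(1, 2, 3))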
A relative position bias B, which was shown to be advantageous (Liu et al., 2021a), is also added in each head of the self-attention blocks. The attention computed by these blocks is expressed by Equation 1, where Q, K, V ∈ R^(PM²×d) are the query, key, and value matrices, d is the dimension of the features, and PM² is the number of tokens in a 3D window.

Attention(Q, K, V) = SoftMax(QK^T / √d + B) V    (1)
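A simplified PyTorch sketch of Equation 1 over a batch of 3D windows is shown below. Unlike the actual implementation, which parameterizes B through a table indexed by the relative coordinates of token pairs, this sketch learns a full per-head bias matrix directly; all names are illustrative.

import torch
import torch.nn as nn

class WindowAttention3D(nn.Module):
    # Multi-head self-attention within a 3D window of P*M*M tokens, with a
    # learned relative position bias B added to the attention logits (Eq. 1).
    def __init__(self, dim, window_size, num_heads):
        super().__init__()
        P, M, _ = window_size
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.bias = nn.Parameter(torch.zeros(num_heads, P * M * M, P * M * M))

    def forward(self, x):
        # x: (num_windows, P*M*M, dim)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)               # each: (B, heads, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale + self.bias
        attn = attn.softmax(dim=-1)                        # SoftMax(QK^T/sqrt(d) + B)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)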
It is worth mentioning that the last dense layer of the model simply performs classification over the set of possible classes in the dataset used.
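In the transfer-learning scenario described in the introduction, only one layer is trained; a natural reading is that this is the final dense classification layer, which can be replaced for the target dataset while the pretrained backbone stays frozen. A minimal sketch under that assumption, with illustrative names and sizes:

import torch
import torch.nn as nn

def build_transfer_head(backbone: nn.Module, feature_dim: int, num_classes: int):
    # Freeze the pretrained VST backbone and create a new dense layer that
    # classifies its pooled features over the target dataset's classes.
    for p in backbone.parameters():
        p.requires_grad = False
    head = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.Adam(head.parameters())   # only the head is updated
    return head, optimizer

# e.g. head, opt = build_transfer_head(pretrained_vst, feature_dim=768, num_classes=239)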
3. Datasets
There are three datasets relevant to this work, namely Kinetics-400, FCVID, and Something-Something. These datasets are described below.
3.1. Kinetics
Kinetics is a collection of three datasets: Kinetics-400 (Kay et al., 2017), Kinetics-600 (Carreira et al., 2018), and
Kinetics-700 (Carreira et al., 2019). Each of these datasets
contains a set of URLs to 10 second YouTube videos and
one label describing the contents of the video. Kinet-
ics covers 400/600/700 human action classes, depending
on the version of the dataset. These classes include sev-
eral human-to-object interactions such as steering a car,
as well as human-to-human interactions, such as shaking
hands.
3.2. FCVID
FCVID (Jiang et al., 2016) is a large-scale in-the-
wild dataset containing a total of 91,223 videos col-
lected from YouTube with an average duration of 167
seconds. These videos are organized in 239 different
categories such as social events, procedural events, ob-
jects and scenes. The videos were collected by perform-
ing YouTube searches using the categories as keywords.
These categories were used as the labels for the videos
and were revised by humans. Videos longer than 30
minutes were removed. Each video of FCVID can have
one or more labels. For instance, one video can be classified as "river" and "bridge" at the same time. Some classes appear to be subclasses of another class, such as dog/playingFrisbeeWithDog or river/waterfall, despite the authors not providing any class hierarchy.