Pattern Recognition Letters
journal homepage: www.elsevier.com
Transfer-learning for video classification: Video Swin Transformer on multiple
domains
Daniel Oliveira a,b, David Martins de Matos a,b
a INESC-ID Lisboa, R. Alves Redol 9, 1000-029 Lisboa, Portugal
b Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais, 1049-001 Lisboa, Portugal
e-mail: daniel.oliveira@inesc-id.pt (Daniel Oliveira), david.matos@inesc-id.pt (David Martins de Matos)
Article history:
Keywords: Video Classification, Action Classification, Transformers, Video Transformers, Video Swin Transformer, Transfer Learning, FCVID, Kinetics, Something-Something
ABSTRACT

The computer vision community has seen a shift from convolutional-based to pure transformer architectures for both image and video tasks. Training a transformer from zero for these tasks usually requires a lot of data and computational resources. Video Swin Transformer (VST) is a pure-transformer model developed for video classification which achieves state-of-the-art results in accuracy and efficiency on several datasets. In this paper, we aim to understand if VST generalizes well enough to be used in an out-of-domain setting. We study the performance of VST on two large-scale datasets, namely FCVID and Something-Something, using a transfer-learning approach from Kinetics-400, which requires around 4x less memory than training from scratch. We then break down the results to understand where VST fails the most and in which scenarios the transfer-learning approach is viable. Our experiments show an 85% top-1 accuracy on FCVID without retraining the whole model, which is equal to the state of the art for the dataset, and a 21% accuracy on Something-Something. The experiments also suggest that the performance of VST decreases on average when the video duration increases, which seems to be a consequence of a design choice of the model. From the results, we conclude that VST generalizes well enough to classify out-of-domain videos without retraining when the target classes are of the same type as the classes used to train the model. We observed this effect when we performed transfer learning from Kinetics-400 to FCVID, where both datasets target mostly objects. On the other hand, if the classes are not of the same type, then the accuracy after the transfer-learning approach is expected to be poor. We observed this effect when we performed transfer learning from Kinetics-400, where the classes represent mostly objects, to Something-Something, where the classes represent mostly actions.
© 2025 Elsevier Ltd. All rights reserved.
1. Introduction
Recognizing and understanding the contents of images and videos is a crucial problem for many applications, such as image and video retrieval, smart advertising, and allowing artificial agents to perceive the surrounding world, among others. Convolutional neural networks (CNN) have been widely used for video classification, namely 3D convolution (Tran et al., 2014), which is an extension of 2D convolution. Recently, we have seen a shift from convolution-based architectures to transformer-based architectures. This shift started for image classification with the introduction of ViT (Dosovitskiy et al., 2020), a visual transformer for image classification. Later, Arnab et al. (2021) proposed a pure transformer architecture for video classification that relied on the factorization of the spatial and temporal dimensions of the input. Bertasius et al. (2021) proposed the application of self-attention separately over space and time. More recently, VST (Liu et al., 2021b) proposed a pure transformer architecture for video classification that is able to surpass the factorized models in efficiency and accuracy by taking advantage of the spatiotemporal locality of videos.

Training a model to recognize images or videos from zero requires a lot of data and computational resources, namely memory. In this paper, we study the accuracy of VST in a transfer-learning scenario, which requires around 4x less memory than training from scratch. We aim to understand if Video Swin Transformer generalizes well enough to be used in an out-of-domain setting. We evaluate the performance of our approach on FCVID (Jiang et al., 2016), a large-scale in-the-wild dataset, and on Something-Something, a dataset of humans performing actions. Our approach has the advantage of only requiring the training of one layer to achieve the stated results.

This paper is organized as follows: Section 2 describes the VST model. Section 3 describes the datasets relevant for the experiments. Section 4 describes our setup. Section 5 presents the conclusions from the experiments and proposals for future work.
2. Video Swin Transformer
Video Swin Transformer (VST) (Liu et al., 2021b)
is an adaptation of the Swin Transformer (Liu et al.,
2021a) and currently achieves state-of-the-art results on
the video classification task on Kinetics-400 (Kay et al.,
2017), Kinetics-600 (Carreira et al., 2018), and Something-
Something (Goyal et al., 2017).
The architecture of the model is shown in figure 1. The input is a video, defined by a tensor with dimensions T×H×W×3, where T corresponds to the time dimension, i.e. the number of frames, and H×W×3 to the number of pixels in each frame. VST applies 3-dimensional patching to the video (Dosovitskiy et al., 2020). Each patch has size 2×4×4×3, so patching the video results in (T/2)×(H/4)×(W/4) patches that represent the whole video. Each patch is represented by a C-dimensional vector.
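As a concrete illustration of this step, the sketch below (our own, not the authors' code) implements the 3D patch partition as a 3D convolution whose kernel and stride equal the 2×4×4 patch size; the embedding dimension C = 96 is an assumed example value.

import torch
import torch.nn as nn

# Minimal sketch of the 3D patch partition described above: a Conv3d with
# kernel = stride = patch size maps a T x H x W x 3 video to a grid of
# (T/2) x (H/4) x (W/4) patch tokens, each a C-dimensional vector.
class PatchEmbed3D(nn.Module):
    def __init__(self, patch_size=(2, 4, 4), in_channels=3, embed_dim=96):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, video):                      # video: (B, 3, T, H, W)
        tokens = self.proj(video)                  # (B, C, T/2, H/4, W/4)
        return tokens.flatten(2).transpose(1, 2)   # (B, num_patches, C)

x = torch.randn(1, 3, 32, 224, 224)                # a 32-frame 224x224 clip
print(PatchEmbed3D()(x).shape)                     # torch.Size([1, 50176, 96])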
The Patch Merging layers placed between consecutive VST blocks in fig. 1 are responsible for merging groups of 2×2 neighboring patches and then applying a linear transformation that reduces the concatenated features to half of their dimension.
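For reference, the following is a minimal sketch (ours, simplified) of such a patch-merging layer, under the assumption that the 2×2 merge is applied spatially and that the linear projection maps the concatenated 4C features to 2C.

import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    # Concatenates each spatial 2x2 group of patches (4C features) and linearly
    # projects the result to 2C features, i.e. half of the concatenated dimension.
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                          # x: (B, T, H, W, C)
        x = torch.cat([x[:, :, 0::2, 0::2, :], x[:, :, 1::2, 0::2, :],
                       x[:, :, 0::2, 1::2, :], x[:, :, 1::2, 1::2, :]], dim=-1)
        return self.reduction(self.norm(x))        # (B, T, H/2, W/2, 2C)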
The main block in this architecture is the VST block, represented in Figure 2. This block has the structure of a standard transformer block in which the multi-head self-attention module is replaced with a 3D shifted-window-based multi-head self-attention module.
Given a video composed of T×H×W patches and a 3D window of size P×M×M, the window is used to partition the video patches into (T/P)×(H/M)×(W/M) non-overlapping 3D windows.
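For concreteness, the helper below (ours) performs this non-overlapping 3D window partition, assuming that the patch grid dimensions are divisible by the window size; the window size (8, 7, 7) in the example is an assumed value.

import torch

def window_partition_3d(x, window_size):
    # x: (B, T, H, W, C) patch grid; window_size: (P, M, M).
    # Returns (num_windows * B, P*M*M, C): the tokens of each 3D window.
    B, T, H, W, C = x.shape
    P, M, _ = window_size
    x = x.view(B, T // P, P, H // M, M, W // M, M, C)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).contiguous()
    return x.view(-1, P * M * M, C)

tokens = torch.randn(1, 16, 56, 56, 96)            # (T/2, H/4, W/4) grid from above
windows = window_partition_3d(tokens, (8, 7, 7))
print(windows.shape)                               # torch.Size([128, 392, 96])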
After the first VST layer, the window partitioning configuration is shifted along the temporal, height, and width axes by (P/2, M/2, M/2) patches with respect to the preceding layer. This shifted attention architecture was originally developed by Liu et al. (2021a) and introduces connections between the neighboring non-overlapping 3D windows of the previous layer.
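One common way to realize this shift, used by Liu et al. (2021a) for 2D windows, is a cyclic roll of the patch grid before partitioning; the fragment below (our illustration, with the same assumed window size as above) shows the idea.

import torch

# Cyclically roll the patch grid by (P/2, M/2, M/2) along the time, height, and
# width axes, so that the new windows straddle the previous layer's window borders.
P, M = 8, 7                                        # assumed window size
tokens = torch.randn(1, 16, 56, 56, 96)            # (B, T', H', W', C) patch grid
shifted = torch.roll(tokens, shifts=(-(P // 2), -(M // 2), -(M // 2)), dims=(1, 2, 3))
# 'shifted' can then be split with the same window_partition_3d helper as before.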
A relative position bias, shown to be advantageous by Liu et al. (2021a), is also added in each head of the self-attention blocks. The attention computed by these blocks is expressed by equation 1, where Q, K, V ∈ R^(PM²×d) are the query, key, and value matrices, d is the dimension of the features, and PM² is the number of tokens in a 3D window.

Attention(Q, K, V) = softmax(QK^T/√d + B)V    (1)
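A minimal single-head sketch of equation 1 (ours; the window size and feature dimension are example values) is given below, where bias plays the role of the relative position bias B.

import torch
import torch.nn.functional as F

def window_attention(q, k, v, bias):
    # q, k, v: (num_windows, P*M*M, d); bias: (P*M*M, P*M*M) relative position bias B.
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5    # Q K^T / sqrt(d)
    return F.softmax(scores + bias, dim=-1) @ v    # softmax(Q K^T / sqrt(d) + B) V

n_tokens = 8 * 7 * 7                               # tokens in one 3D window (P*M*M)
q = k = v = torch.randn(4, n_tokens, 32)           # 4 windows, feature dimension 32
bias = torch.randn(n_tokens, n_tokens)
print(window_attention(q, k, v, bias).shape)       # torch.Size([4, 392, 32])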
It is worth mentioning that the last dense layer of the model simply performs classification over the set of possible classes of the dataset used.
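Since only this final layer depends on the target label set, it is the natural layer to retrain in a transfer-learning setting such as ours. The sketch below (our illustration; backbone and feature_dim are placeholders rather than the actual VST interface) freezes the pre-trained weights and attaches a new classification layer for the target dataset.

import torch.nn as nn

# Keep the Kinetics-400 pre-trained backbone frozen and train only a new final
# dense layer for the target dataset. The backbone is assumed to output a
# feature vector of size feature_dim per video.
def replace_head(backbone, feature_dim, num_target_classes):
    for p in backbone.parameters():
        p.requires_grad = False                    # freeze pre-trained weights
    return nn.Sequential(backbone, nn.Linear(feature_dim, num_target_classes))

# e.g. FCVID has 239 categories:
# model = replace_head(pretrained_vst_features, feature_dim=768, num_target_classes=239)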
3. Datasets
There are three datasets relevant to this work, namely Kinetics-400, FCVID, and Something-Something; they are described below.
3.1. Kinetics
Kinetics is a collection of three datasets: Kinetics-400 (Kay et al., 2017), Kinetics-600 (Carreira et al., 2018), and Kinetics-700 (Carreira et al., 2019). Each of these datasets contains a set of URLs to 10-second YouTube videos and one label describing the contents of each video. Kinetics covers 400/600/700 human action classes, depending on the version of the dataset. These classes include several human-to-object interactions, such as steering a car, as well as human-to-human interactions, such as shaking hands.
3.2. FCVID
FCVID (Jiang et al., 2016) is a large-scale in-the-wild dataset containing a total of 91,223 videos collected from YouTube, with an average duration of 167 seconds. These videos are organized into 239 different categories, such as social events, procedural events, objects, and scenes. The videos were collected by performing YouTube searches using the categories as keywords. These categories were used as the labels for the videos and were revised by humans. Videos longer than 30 minutes were removed. Each video of FCVID can have one or more labels; one video, for instance, can be classified as "river" and "bridge" at the same time. Some classes appear to be subclasses of another class, such as dog/playingFrisbeeWithDog or river/waterfall, despite the authors not providing any class hierarchy.