
former architecture for video classification that relied on
the factorization of the spatial and temporal dimensions
of the input. Bertasius et al. (2021) proposed the ap-
plication of self-attention between space and time sepa-
rately. More recently, VST (Liu et al., 2021b) proposed a
pure transformer architecture for video classification that
is able to surpass the factorized models in efficiency and
accuracy by taking advantage of the spatiotemporal locality of videos. Training a model to recognise images or videos from scratch requires large amounts of data and computational resources, in particular memory. In this paper, we study the accuracy of the VST in a transfer-learning scenario that requires around 4x less memory than training from scratch. We aim to understand whether the Video Swin Transformer generalizes well enough to be used in an out-of-domain setting. We evaluate the performance of our approach on FCVID (Jiang et al., 2016), a large-scale in-the-wild dataset, and on Something-Something, a dataset of humans performing actions. Our approach has the advantage of only requiring a single layer to be trained to achieve the reported results. This paper is organized as follows: Section 2 describes the VST model. Section 3 describes the datasets relevant for the experiments. Section 4 describes our setup. Section 5 presents the conclusions from the experiments and proposals for future work.
2. Video Swin Transformer
Video Swin Transformer (VST) (Liu et al., 2021b)
is an adaptation of the Swin Transformer (Liu et al.,
2021a) and currently achieves state-of-the-art results on
the video classification task on Kinetics-400 (Kay et al.,
2017), Kinetics-600 (Carreira et al., 2018), and Something-
Something (Goyal et al., 2017).
The architecture of the model is shown in figure 1. The
input is a video, defined by a tensor with dimensions T×
H×W×3, where T corresponds to the time dimension,
i.e. the number of frames, and H×W×3 to the number of
pixels in each frame. VST applies 3-dimensional patching
to the video (Dosovitskiy et al., 2020). Each patch has size
2×4×4×3. Patching the video results in T/2 × H/4 × W/4 patches that represent the whole video. Each patch is represented by a C-dimensional vector.
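As a concrete illustration, this patch partition can be implemented as a strided 3D convolution. The following is a minimal PyTorch sketch, not the authors' implementation; the names and the embedding dimension C = 96 are illustrative assumptions.

import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    # Maps each 2x4x4x3 patch of the input video to a C-dimensional vector by
    # applying a 3D convolution whose kernel size and stride equal the patch size.
    def __init__(self, in_channels=3, embed_dim=96, patch_size=(2, 4, 4)):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, video):
        # video: (batch, 3, T, H, W) -> patch features: (batch, C, T/2, H/4, W/4)
        return self.proj(video)

clip = torch.randn(1, 3, 32, 224, 224)      # a 32-frame 224x224 RGB clip
patches = PatchEmbed3D()(clip)              # shape: (1, 96, 16, 56, 56)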
The Patch Merging layers placed between consecutive VST blocks in Figure 1 are responsible for merging each group of 2×2 spatially neighboring patches by concatenating their features and then applying a linear transformation that reduces the concatenated features to half of their dimension.
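A minimal PyTorch sketch of such a merging layer is shown below; it follows the Swin-style formulation (2×2 neighbors concatenated from C to 4C features, then projected to 2C) and is only illustrative.

import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    # Merges each 2x2 group of spatially neighboring patches by concatenating
    # their features (C -> 4C) and projecting them back to 2C with a linear layer.
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):
        # x: (batch, T, H, W, C) with H and W even
        x = torch.cat([x[:, :, 0::2, 0::2, :], x[:, :, 1::2, 0::2, :],
                       x[:, :, 0::2, 1::2, :], x[:, :, 1::2, 1::2, :]], dim=-1)
        return self.reduction(self.norm(x))   # (batch, T, H/2, W/2, 2C)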
The main building block in this architecture is the VST block, represented in Figure 2. This block has the structure of a standard transformer block in which the multi-head self-attention module is replaced with a 3D shifted-window-based multi-head self-attention module.
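The overall block can be outlined as follows; this is a hypothetical PyTorch sketch, with window_attention standing in for the 3D shifted-window attention module described next.

import torch.nn as nn

class VSTBlock(nn.Module):
    # Standard transformer block in which the attention module is a 3D
    # (shifted) window-based multi-head self-attention, passed in as
    # `window_attention`.
    def __init__(self, dim, window_attention, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = window_attention
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):
        x = x + self.attn(self.norm1(x))   # windowed self-attention + residual
        x = x + self.mlp(self.norm2(x))    # feed-forward network + residual
        return x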
Given a video composed of T′×H′×W′ patches and a 3D window of size P×M×M, the window is used to partition the video patches into T′/P × H′/M × W′/M non-overlapping 3D windows.
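For illustration, this partitioning amounts to a reshape of the patch grid. A minimal sketch, assuming a PyTorch tensor of patch features whose dimensions are divisible by the window size:

def window_partition(x, window_size):
    # x: (batch, T', H', W', C) patch features; window_size: (P, M, M).
    # Returns the tokens of each non-overlapping 3D window:
    # (batch * T'/P * H'/M * W'/M, P*M*M, C).
    B, T, H, W, C = x.shape
    P, M, _ = window_size
    x = x.view(B, T // P, P, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, P * M * M, C)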
After the first VST layer, the window partitioning con-
figuration is shifted along the temporal, the height, and
the width axis by (P/2, M/2, M/2) patches from the preceding
layer. This shifted attention architecture was originally
developed by Liu et al. (2021a) and introduces connec-
tions between the neighboring non-overlapping 3D win-
dows from the previous layer.
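In practice this shift can be realized as a cyclic roll of the patch grid before the window partition. A sketch under the same assumptions as above; note that the real implementation additionally masks attention across the wrapped-around boundaries.

import torch

def cyclic_shift(x, window_size):
    # x: (batch, T', H', W', C). Roll the patch grid by half a window along the
    # temporal, height, and width axes so that the next window partition is
    # shifted by (P/2, M/2, M/2) patches with respect to the previous layer.
    P, M, _ = window_size
    return torch.roll(x, shifts=(-(P // 2), -(M // 2), -(M // 2)), dims=(1, 2, 3))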
A relative position bias B, which was shown to be advantageous (Liu et al., 2021a), is also added in each head of the self-attention blocks. The attention computed by these blocks is expressed by Equation 1, where Q, K, V ∈ R^(PM²×d) are the query, key, and value matrices, d is the dimension of the features, and PM² is the number of tokens in a 3D window.

Attention(Q, K, V) = SoftMax(QK^T / √d + B) V    (1)
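A simplified PyTorch sketch of Equation 1 over a batch of 3D windows is shown below. Unlike the actual implementation, which parameterizes B through a table indexed by the relative coordinates of token pairs, this sketch learns a full per-head bias matrix directly; all names are illustrative.

import torch
import torch.nn as nn

class WindowAttention3D(nn.Module):
    # Multi-head self-attention within a 3D window of P*M*M tokens, with a
    # learned relative position bias B added to the attention logits (Eq. 1).
    def __init__(self, dim, window_size, num_heads):
        super().__init__()
        P, M, _ = window_size
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.bias = nn.Parameter(torch.zeros(num_heads, P * M * M, P * M * M))

    def forward(self, x):
        # x: (num_windows, P*M*M, dim)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)               # each: (B, heads, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale + self.bias
        attn = attn.softmax(dim=-1)                        # SoftMax(QK^T/sqrt(d) + B)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)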
It is worth mentioning that the last dense layer of the model simply performs classification over the set of possible classes in the dataset used.
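In the transfer-learning scenario described in the introduction, only one layer is trained; a natural reading is that this is the final dense classification layer, which can be replaced for the target dataset while the pretrained backbone stays frozen. A minimal sketch under that assumption, with illustrative names and sizes:

import torch
import torch.nn as nn

def build_transfer_head(backbone: nn.Module, feature_dim: int, num_classes: int):
    # Freeze the pretrained VST backbone and create a new dense layer that
    # classifies its pooled features over the target dataset's classes.
    for p in backbone.parameters():
        p.requires_grad = False
    head = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.Adam(head.parameters())   # only the head is updated
    return head, optimizer

# e.g. head, opt = build_transfer_head(pretrained_vst, feature_dim=768, num_classes=239)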
3. Datasets
There are three datasets relevant to this work, namely Kinetics-400, FCVID, and Something-Something. These datasets are described below.
3.1. Kinetics
Kinetics is a collection of three datasets: Kinetics-400 (Kay et al., 2017), Kinetics-600 (Carreira et al., 2018), and
Kinetics-700 (Carreira et al., 2019). Each of these datasets
contains a set of URLs to 10 second YouTube videos and
one label describing the contents of the video. Kinet-
ics covers 400/600/700 human action classes, depending
on the version of the dataset. These classes include sev-
eral human-to-object interactions such as steering a car,
as well as human-to-human interactions, such as shaking
hands.
3.2. FCVID
FCVID (Jiang et al., 2016) is a large-scale in-the-
wild dataset containing a total of 91,223 videos col-
lected from YouTube with an average duration of 167
seconds. These videos are organized in 239 different
categories such as social events, procedural events, ob-
jects and scenes. The videos were collected by perform-
ing YouTube searches using the categories as keywords.
These categories were used as the labels for the videos
and were revised by humans. Videos longer than 30
minutes were removed. Each video of FCVID can have
one or more labels. For instance, one video can be classified as "river" and "bridge" at the same time. Some classes appear to be subclasses of another class, such as dog/playingFrisbeeWithDog or river/waterfall, despite the authors not providing any class hierarchy.