Introducing Vision Transformer for Alzheimer's Disease
classification task with 3D input
Zilun Zhang1, Farzad Khalvati1,2,3,4,5 for the Alzheimer's Disease
Neuroimaging Initiative
*
1Department of Mechanical and Industrial Engineering, University of Toronto
2Department of Medical Imaging, University of Toronto
3Institute of Medical Science, University of Toronto
4Department of Computer Science, University of Toronto
5Department of Diagnostic Imaging, Neurosciences & Mental Health Research
Program, The Hospital for Sick Children
Abstract. Many high-performance classification models utilize complex
CNN-based architectures for Alzheimer’s Disease classification. We aim to
investigate two relevant questions regarding classification of Alzheimer’s
Disease using MRI: “Do Vision Transformer-based models perform better
than CNN-based models?” and “Is it possible to use a shallow 3D CNN-based
model to obtain satisfying results?” To achieve these goals, we propose two
models that can take in and process 3D MRI scans: the Convolutional Voxel
Vision Transformer (CVVT) architecture and ConvNet3D-4, a shallow 4-
block 3D CNN-based model. Our results indicate that a shallow 3D CNN-
based model is sufficient to achieve good classification results for
Alzheimer’s Disease using MRI scans.
Keywords: Alzheimer’s Disease · Convolutional Neural Network · Vision
Transformer · ADNI.
1 Introduction
Alzheimer’s disease (AD) is a neurodegenerative disorder that usually starts
slowly and progressively worsens. It is the cause of 60–80% of dementia cases
[1]. Memory loss is mild in its early stages, but with late-stage Alzheimer’s,
individuals lose the ability to respond to the environment [2]. A hallmark of
*
Data used in preparation of this article were obtained from the Alzheimer's Disease
Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators
within the ADNI contributed to the design and implementation of ADNI and/or
provided data but did not participate in analysis or writing of this report. A complete
listing of ADNI investigators can be found at:
http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
Alzheimer’s disease is the presence of beta-amyloid, the aggregates of proteins
that build up in the spaces between nerve cells [2]. There is no cure for AD, and
the only treatment is the removal of amyloid from the brain to reduce cognitive
and functional decline [2]. As of 2015, there were approximately 29.8 million
people worldwide with AD [3].
In the context of AD, patients can be categorized as AD, Mild Cognitive Impairment
(MCI), or Cognitively Normal (CN). Machine learning models can help diagnose AD
by classifying a patient into one of these three categories (AD, MCI, or CN) using data
from different modalities (e.g., medical imaging, genomic, and clinical data),
either separately or by combining these modalities. Machine learning models
can also predict whether MCI subjects will progress to AD (denoted as pMCI) or remain
stable (denoted as sMCI) [4][5].
In this paper, we focus on using medical imaging data, specifically MRI scans, to
develop classification models for AD vs. CN. For this task, two major pipelines are
widely used in medical imaging. The first one exploits radiomic features from MRI
scans and trains a separate classifier to classify if a scan belongs to the AD or CN
class. The second pipeline trains an end-to-end deep learning model
(including a feature extractor and a classifier) to extract deep learning features and
then classify the MRI scans. The former mainly uses predefined, hand-crafted
features (e.g., size, shape, intensity, and texture information) with feature
reduction methods and machine learning classifiers. Li et al. extract texture and
wavelet features from each region of interest of the MRI scans and classify the
features using SVM and Random Forest, obtaining accuracies of 0.861 and 0.835,
respectively [6]. Feng et al. use radiomic features of the amygdala with multivariable
logistic regression to obtain an accuracy of 0.90 [7]. Li et al. apply wavelet bandpass
filtering and select texture radiomic features to calculate high-order radiomic
features; an SVM classifier is then used, achieving a classification accuracy of 0.915 [8].
Du et al. use radiomic features from segmented hippocampus MRI with an SVM to
classify different types of AD with an accuracy of 0.77 [9]. All of these are subject-level
models combined with ROI techniques. A shortcoming of radiomics models is
that they can only consider a small number of pre-defined features. Since 2012,
Convolutional Neural Network (CNN) models with deep learning techniques have
been widely applied to different applications in medical imaging, and many of them
can achieve higher accuracy and precision than models using radiomic features.
CNN models are capable of learning useful features without pre-defining them.
However, CNN-based deep learning models tend to have lower interpretability
compared to radiomics, and they do not generalize well when trained on small
datasets. In addition, the CNN structure struggles to capture non-local information
within the image due to the limited size of the CNN’s receptive field [10].
Most recently, Vision Transformer (ViT) based models [11] have shown great
potential on classification [11]–[13] and detection tasks [14], [15] for commonly
used benchmarks such as ImageNet [16], and MSCOCO [17]. However, they have
rarely been used in classification tasks related to medical imaging, and particularly
AD classification tasks using MRI scans. This may be due to several reasons. First,
most ViT-based models require a large amount of data to train (e.g., the JFT-300M
dataset has 300 million images [18], and ImageNet-21K [19] has 12 million images)
because these models lack inductive bias compared to CNN-based models.
Inductive bias is a set of assumptions that the learner uses to predict outputs for
given inputs. For example, in a CNN, the inductive bias could consist of local visual
structures (such as edges and corners), scale-invariance, and translation-
invariance. Moreover, due to the large model size of ViT, the training process
becomes unstable and hard to converge. Second, most ViT-based models are
designed for 2D images. However, MRI scans are 3D images, and it is difficult to
apply a ViT-based model directly to MRI scans.
There are several reasons to hypothesize that ViT might perform better on MRI
scans; chiefly, good representations of AD features should not focus only on
local regions of the MRI scan. On the contrary, representations fused from
different regions of an MRI scan may provide additional information about the
associations between these regions, which makes them more important for the
characterization of AD. ViT introduces a sophisticated framework for organizing and
utilizing the attention module to extract non-local features. Traditional CNN-based
models have difficulty capturing global relationships within an image due to the
constraint on the size of the receptive field at limited model depth. In contrast, the
attention map calculated from query, key, and value pairs can provide
non-local information to the ViT model, resulting in better features. For CNN-based
models, the problem of the receptive field size is more severe for 3D images
(MRI scans) because a 3D-CNN model has many more parameters
than a 2D-CNN model of the same depth. If the model size is fixed, the 3D-CNN
model will have a much smaller receptive field, which degrades the
model's capacity.
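To make this concrete, the following is a minimal PyTorch sketch of the scaled dot-product attention at the core of ViT; the token count, embedding size, and projection matrices are illustrative assumptions, not the configuration used in our models. Because the attention map relates every token to every other token, the effective receptive field is global regardless of spatial distance:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence of visual tokens.

    x: (num_tokens, d_model) token embeddings
    w_q, w_k, w_v: (d_model, d_head) projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v                      # project tokens to queries, keys, values
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # (num_tokens, num_tokens) attention map
    return attn @ v                                          # each output mixes information from all tokens

# Hypothetical sizes: 65 tokens (64 patch tokens + 1 class token), 128-dim embeddings
x = torch.randn(65, 128)
w_q, w_k, w_v = (torch.randn(128, 64) for _ in range(3))
out = scaled_dot_product_attention(x, w_q, w_k, w_v)         # (65, 64)
```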
The ViT model seems promising for the classification of AD, but it was not
originally designed for 3D images. To this end, we implemented a 3D version of ViT
that can take 3D MRI scans as input and extract non-local features within the
Transformer Encoder. We also designed a convolutional patch embedding module so
that this 3D ViT integrates the advantages of CNNs. To compare with the ViT-based
model, we developed a very shallow 3D-CNN model with Leaky ReLU activation
layers and various normalization layers. We evaluated the performance of the
models using the public ADNI [20] dataset, which is summarized in Section 3.1.
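As a rough sketch of what such a convolutional patch embedding can look like for volumetric input (the patch size, embedding dimension, and input shape below are illustrative assumptions rather than the exact CVVT configuration), a 3D convolution whose stride equals its kernel size splits a scan into non-overlapping voxel patches and projects each patch to a token:

```python
import torch
import torch.nn as nn

class ConvVoxelPatchEmbedding(nn.Module):
    """Turn a 3D MRI volume into a sequence of visual tokens with a 3D convolution."""

    def __init__(self, in_channels=1, embed_dim=256, patch_size=16):
        super().__init__()
        # Conv3d with kernel_size == stride splits the volume into non-overlapping
        # voxel patches and linearly projects each patch to an embedding vector.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 1, D, H, W)
        x = self.proj(x)                       # (B, embed_dim, D/p, H/p, W/p)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, embed_dim) token sequence

# Hypothetical input: a batch of two 128x128x128 MRI volumes
tokens = ConvVoxelPatchEmbedding()(torch.randn(2, 1, 128, 128, 128))
print(tokens.shape)                            # torch.Size([2, 512, 256])
```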
This paper makes several contributions. First, we designed a ViT-based model,
the Convolutional Voxel Vision Transformer (CVVT), which takes 3D images (MRI
scans) and theoretically exploits the internal associations between different
regions of the image. Although the idea of this Convolutional Voxel Vision
Transformer is exciting, we found that this model does not generalize well when trained
on insufficient amounts of data, resulting in moderate performance on the ADNI
test set. Second, we proposed a shallow 4-block 3D-CNN model, ConvNet3D-4,
with Leaky ReLU activation layers and Instance Normalization, which makes the
model suitable for limited data and small batch sizes. The ConvNet3D-4 model
achieves state-of-the-art results in the AD vs. CN classification task.
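To illustrate the second contribution, the following is a hedged PyTorch sketch of a 4-block 3D CNN of this kind; the channel widths, kernel sizes, pooling, and classifier head are assumptions for the sketch and not necessarily the exact ConvNet3D-4 configuration:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One shallow block: 3D convolution, Instance Normalization, Leaky ReLU, pooling."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.InstanceNorm3d(out_ch),   # normalizes per sample, so small batch sizes are not a problem
        nn.LeakyReLU(inplace=True),
        nn.MaxPool3d(2),
    )

class ShallowConvNet3D(nn.Module):
    """A 4-block 3D CNN for whole-scan AD vs. CN classification (illustrative)."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 16), conv_block(16, 32),
            conv_block(32, 64), conv_block(64, 128),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                                  nn.Linear(128, num_classes))

    def forward(self, x):            # x: (B, 1, D, H, W)
        return self.head(self.features(x))

logits = ShallowConvNet3D()(torch.randn(1, 1, 96, 96, 96))   # hypothetical scan size
```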
2 Related Work
2.1 CNN models for Alzheimer’s Disease Classification
Most classification architectures using CNN models for AD diagnosis can be
divided into three types: 2D slice-level CNN, 3D patch-level CNN (often involving
a Region of Interest, ROI), and 3D subject-level CNN [5].
2D slice-level CNN models are designed for 2D images. Their inputs are 4D
tensors with the shape of batch size, number of channels, image width, and image
height. CT and MRI scans are 3D images, so many researchers choose to slice the 3D
tensors into 2D tensors and train the model on these 2D tensors. There are several
advantages to doing this: first, more data can be generated by the slicing operation;
second, mature 2D-CNN models, such as ResNet [21], VGGNet [22], and
InceptionNet [23], can be applied; third, 2D-CNN models consume significantly less
GPU RAM compared to 3D-CNN models. Valliani et al. [24] first applied a ResNet
pre-trained on ImageNet to the axial slices of MRI scans at the subject level
with affine data augmentation and concluded that the ResNet structure is better
than a shallow ConvNet (accuracy of 0.813 vs. 0.738). Islam et al. [25] and Qiu et
al. [26] investigated the effect of slice selection on the classification result (accuracy
of 0.833).
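For concreteness, the sketch below shows the 2D slice-level idea with an ImageNet-pretrained ResNet from torchvision; the slice axis, input size, channel replication, and the naive mean aggregation over slices are simplifying assumptions for illustration:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# ImageNet-pretrained 2D backbone with a 2-class head (AD vs. CN); torchvision >= 0.13 weights API assumed
model = resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 2)

volume = torch.randn(96, 224, 224)                 # hypothetical scan: (depth, H, W)
slices = volume.unsqueeze(1).repeat(1, 3, 1, 1)    # treat each axial slice as a 3-channel image
slice_logits = model(slices)                       # (96, 2): one prediction per slice
subject_logits = slice_logits.mean(dim=0)          # naive subject-level aggregation over slices
```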
3D-CNN models have an irreplaceable advantage: they exploit the spatial
relationship of voxels explicitly, which is crucial for isotropic scans. 3D-CNN
models are designed for 3D images (tensors) that have a 5D shape of batch size,
number of channels, image width, image height, and image depth. The input for a 3D
patch-level CNN consists of a set of 3D patches extracted from an image (scan).
Cheng et al. [27] chose 27 large patches with a size of 50×41×40 voxels in each MRI
scan and trained 27 CNN models. Then they ensembled the features to make a
prediction at subject level (accuracy of 0.872). Mingxia et al. [28] used a smaller
patch size (19×19×19 voxels) based on anatomical landmarks (accuracy of 0.911).
The advantage of patch-based models is that leveraging information from patches
requires less GPU memory compared to models using the entire 3D
image. The disadvantage is that the model relies on the architecture design for
connecting all patches. Moreover, most of the patches used in these models are not
informative because they contain parts of the brain that may not be affected by the
disease. Therefore, region of interest (ROI) based CNN models have been
developed. Examples of these models include the 2D+ϵ CNN [29] (28×28 patches made
of three neighboring 2D slices in the hippocampus, accuracy of 0.828) and the 2.5D
CNN [30] (concatenation of three 2D slices along the three possible planes, sagittal,
coronal, and axial, using 32×32 patches to train the model, with an accuracy of
0.888).
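As an illustration of the patch-level input, the sketch below crops fixed-size cubic patches around a list of voxel coordinates; the volume shape, patch size, and landmark coordinates are hypothetical:

```python
import torch

def extract_patch(volume, center, size=19):
    """Crop a cubic 3D patch of size^3 voxels centered at `center` from a (D, H, W) volume."""
    half = size // 2
    slices = tuple(slice(c - half, c - half + size) for c in center)
    return volume[slices]

scan = torch.randn(121, 145, 121)                      # hypothetical MRI volume (D, H, W)
landmarks = [(60, 70, 40), (60, 70, 80)]               # hypothetical anatomical landmark voxels
patches = torch.stack([extract_patch(scan, c) for c in landmarks])
print(patches.shape)                                   # torch.Size([2, 19, 19, 19])
```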
As for the 3D subject-level CNN, it takes the entire MRI scan of a subject at once
and the classification is performed at the subject level. Korolev et al. [31] (ADNI
dataset, with 111 scans, accuracy of 0.80), Senanayake et al. [32] (ADNI dataset,
with 322 scans, accuracy of 0.78), and Cheng et al. [33] (ADNI dataset, with 193
scans, accuracy of 0.896) used the entire MRI scans as input to classical deep
learning models such as VGG and ResNet. The advantage is that the spatial
information is fully integrated. However, these models require a large amount of GPU
memory and achieved only suboptimal classification accuracy.
2.2 Transformer and Vision Transformer
The Transformer [34] is a powerful model in the field of Natural Language
Processing with an encoder-decoder structure. Since the Vision Transformer
adopts only its encoder [11], we focus on the structure of the Transformer Encoder,
which is shown in the right part of Figure 1. The Transformer Encoder takes
language tokens (tokenized sentences) with positional encoding as input and
exploits the inner relationships among tokens using a stack of pre-defined network
structures, including the self-attention mechanism (Multi-Head Attention block),
the normalization layer (Norm block), and the feed-forward network (MLP block),
to obtain encoded features.
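A minimal PyTorch sketch of one such encoder block is shown below; the dimensions are illustrative assumptions, and the pre-norm arrangement follows common ViT implementations rather than any exact layout implied by Figure 1:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: Norm, Multi-Head Attention, Norm, MLP, with residual connections."""

    def __init__(self, d_model=256, n_heads=8, mlp_dim=1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, mlp_dim), nn.GELU(),
                                 nn.Linear(mlp_dim, d_model))

    def forward(self, tokens):                                        # tokens: (B, num_tokens, d_model)
        h = self.norm1(tokens)
        tokens = tokens + self.attn(h, h, h, need_weights=False)[0]   # self-attention + residual
        return tokens + self.mlp(self.norm2(tokens))                  # feed-forward + residual

encoded = EncoderBlock()(torch.randn(2, 65, 256))                     # e.g., 64 patch tokens + 1 class token
```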
Unlike the original Transformer, which uses language tokens (obtained from
sentences) with fixed positional encoding as input, the Vision Transformer uses visual
tokens (obtained from images) with learnable positional encoding as input. The
method for obtaining visual tokens is presented in the left part of Figure 1. For a
given image, ViT first divides it into patches and flattens each patch into
a vector. After applying a linear projection to the patch vectors and concatenating
an additional class embedding vector with the result, a learnable positional
embedding is added. The resulting vectors are sent to the Transformer Encoder as
visual tokens.
Fig.1. Vision Transformer Structure
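A minimal sketch of this tokenization step for a 2D image follows; the image size, patch size, and embedding dimension are illustrative assumptions:

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)          # (B, C, H, W), hypothetical input
patch, d_model = 16, 256
num_patches = (224 // patch) ** 2            # 14 x 14 = 196 patches

# 1. Divide the image into patches and flatten each patch into a vector.
patches = image.unfold(2, patch, patch).unfold(3, patch, patch)          # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, -1)  # (1, 196, 768)

# 2. Linearly project the patch vectors, prepend a learnable class embedding,
#    and add a learnable positional embedding.
proj = nn.Linear(3 * patch * patch, d_model)
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))

tokens = torch.cat([cls_token, proj(patches)], dim=1) + pos_embed        # (1, 197, 256) visual tokens
```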