Introducing Vision Transformer for Alzheimer's Disease
classification task with 3D input
Zilun Zhang1, Farzad Khalvati1,2,3,4,5 for the Alzheimer's Disease
Neuroimaging Initiative
*
1Department of Mechanical and Industrial Engineering, University of Toronto
2Department of Medical Imaging, University of Toronto
3Institute of Medical Science, University of Toronto
4Department of Computer Science, University of Toronto
5Department of Diagnostic Imaging, Neurosciences & Mental Health Research
Program, The Hospital for Sick Children
Abstract. Many high-performance classification models utilize complex
CNN-based architectures for Alzheimer’s Disease classification. We aim to
investigate two relevant questions regarding classification of Alzheimer’s
Disease using MRI: “Do Vision Transformer-based models perform better
than CNN-based models?” and “Is it possible to use a shallow 3D CNN-based
model to obtain satisfying results?” To achieve these goals, we propose two
models that can take in and process 3D MRI scans: the Convolutional Voxel
Vision Transformer (CVVT) architecture and ConvNet3D-4, a shallow 4-
block 3D CNN-based model. Our results indicate that a shallow 3D CNN-
based model is sufficient to achieve good classification results for
Alzheimer’s Disease using MRI scans.
Keywords: Alzheimer’s Disease · Convolutional Neural Network · Vision
Transformer · ADNI.
1 Introduction
Alzheimer’s disease (AD) is a neurodegenerative disorder that usually starts
slowly and progressively worsens. It is the cause of 60–80% of dementia cases
[1]. Memory loss is mild in its early stages, but with late-stage Alzheimer’s,
individuals lose the ability to respond to the environment [2]. A hallmark of
*
Data used in preparation of this article were obtained from the Alzheimer's Disease
Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators
within the ADNI contributed to the design and implementation of ADNI and/or
provided data but did not participate in analysis or writing of this report. A complete
listing of ADNI investigators can be found at:
http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
Alzheimer’s disease is the presence of beta-amyloid, the aggregates of proteins
that build up in the spaces between nerve cells [2]. There is no cure for AD, and
the only treatment is the removal of amyloid from the brain to reduce cognitive
and functional decline [2]. As of 2015, there were approximately 29.8 million
people worldwide with AD [3].
In the context of AD, patients can be categorized as AD, Mild Cognitive Impairment
(MCI), or Cognitively Normal (CN). Machine learning models can help diagnose AD
by classifying a patient into one of these three categories (AD, MCI, or CN) using data
from different modalities (e.g., medical imaging, genomic, and clinical data),
either separately or by combining these modalities. Machine learning models
can also predict whether MCI subjects will progress to AD (denoted as pMCI) or remain
stable (denoted as sMCI) [4][5].
In this paper, we focus on using medical imaging data, specifically MRI scans, to
develop classification models for AD vs. CN. For this task, two major pipelines are
widely used in medical imaging. The first one exploits radiomic features from MRI
scans and trains a separate classifier to classify if a scan belongs to the AD or CN
class. The second pipeline trains an end-to-end deep learning model
(including a feature extractor and a classifier) to extract deep learning features and
then classify the MRI scans. The former mainly uses predefined, hand-crafted
features (e.g., size, shape, intensity, and texture information) with feature
reduction methods and machine learning classifiers. Li et al. extract texture and
wavelet features from each region of interest of the MRI scans and classify the
features using SVM and Random Forest, obtaining accuracies of 0.861 and 0.835,
respectively [6]. Feng et al. use radiomic features of the amygdala with multivariable
logistic regression to obtain an accuracy of 0.90 [7]. Li et al. apply wavelet bandpass
filtering and select texture radiomic features to calculate high-order radiomic
features; an SVM classifier is then used, achieving a classification accuracy of 0.915 [8].
Du et al. use radiomic features from segmented hippocampus MRI with an SVM to
classify different types of AD with an accuracy of 0.77 [9]. All of these are subject-level
models combined with ROI techniques. A shortcoming of radiomics models is
that they can only consider a small number of pre-defined features. Since 2012,
Convolutional Neural Network (CNN) models with deep learning techniques have
been widely applied to different applications in medical imaging, and many of them
can achieve higher accuracy and precision than models using radiomic features.
CNN models are capable of learning useful features without pre-defining them.
However, CNN-based deep learning models tend to have lower interpretability
compared to radiomics, and they do not generalize well when trained on small
datasets. In addition, the CNN structure struggles to capture non-local information
within the image due to the limited size of the CNN’s receptive field [10].
Most recently, Vision Transformer (ViT) based models [11] have shown great
potential on classification [11]–[13] and detection tasks [14], [15] for commonly
used benchmarks such as ImageNet [16], and MSCOCO [17]. However, they have
rarely been used in classification tasks related to medical imaging, and particularly
AD classification tasks using MRI scans. This may be due to several reasons. First,
most ViT-based models require a large amount of data to train (e.g., the JFT-300M
dataset has 300 million images [18], and ImageNet-21K [19] has 12 million images)
because these models lack inductive bias compared to CNN-based models.
Inductive bias is a set of assumptions that the learner uses to predict outputs for
given inputs. For example, in a CNN, the inductive bias could consist of local visual
structures (such as edges and corners), scale-invariance, and translation-
invariance. Moreover, due to the large model size of ViT, the training process
becomes unstable and hard to converge. Second, most ViT-based models are
designed for 2D images. However, MRI scans are 3D images, and it is difficult to
apply a ViT-based model directly to MRI scans.
There are several reasons to hypothesize that ViT might perform better on MRI
scans; chiefly, good representations of AD features should not focus only on
local regions of the MRI scan. On the contrary, representations fused from
different regions of an MRI scan may provide additional information about the
associations between these regions, which makes them more important for the
characterization of AD. ViT introduces a sophisticated framework for organizing and
utilizing the attention module to extract non-local features. Traditional CNN-based
models have difficulty capturing global relationships within an image due to the
constraint on the size of the receptive field at limited model depth. In contrast, the
attention map calculated from query, key, and value pairs can provide
non-local information to the ViT model, resulting in better features. For CNN-based
models, the problem of the receptive field size is more severe for 3D images
(MRI scans) because a 3D-CNN model has many more parameters
than a 2D-CNN model of the same depth. If the model size is fixed, the 3D-CNN
model will have a much smaller receptive field, which degrades the
model's capacity.
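To make this concrete, the following is a minimal PyTorch sketch of the scaled dot-product attention at the core of ViT; the token count, embedding size, and projection matrices are illustrative assumptions, not the configuration used in our models. Because the attention map relates every token to every other token, the effective receptive field is global regardless of spatial distance:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence of visual tokens.

    x: (num_tokens, d_model) token embeddings
    w_q, w_k, w_v: (d_model, d_head) projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v                      # project tokens to queries, keys, values
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # (num_tokens, num_tokens) attention map
    return attn @ v                                          # each output mixes information from all tokens

# Hypothetical sizes: 65 tokens (64 patch tokens + 1 class token), 128-dim embeddings
x = torch.randn(65, 128)
w_q, w_k, w_v = (torch.randn(128, 64) for _ in range(3))
out = scaled_dot_product_attention(x, w_q, w_k, w_v)         # (65, 64)
```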
The ViT model seems promising for the classification of AD, but it was not
originally designed for 3D images. To this end, we implemented a 3D version of ViT
that can take 3D MRI scans as input and extract non-local features within the
Transformer Encoder. We also designed a convolutional patch embedding module so
that this 3D ViT integrates the advantages of CNNs. To compare with the ViT-based
model, we developed a very shallow 3D-CNN model with Leaky ReLU activation
layers and various normalization layers. We evaluated the performance of the
models using the public ADNI [20] dataset, which is summarized in Section 3.1.
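As a rough sketch of what such a convolutional patch embedding can look like for volumetric input (the patch size, embedding dimension, and input shape below are illustrative assumptions rather than the exact CVVT configuration), a 3D convolution whose stride equals its kernel size splits a scan into non-overlapping voxel patches and projects each patch to a token:

```python
import torch
import torch.nn as nn

class ConvVoxelPatchEmbedding(nn.Module):
    """Turn a 3D MRI volume into a sequence of visual tokens with a 3D convolution."""

    def __init__(self, in_channels=1, embed_dim=256, patch_size=16):
        super().__init__()
        # Conv3d with kernel_size == stride splits the volume into non-overlapping
        # voxel patches and linearly projects each patch to an embedding vector.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 1, D, H, W)
        x = self.proj(x)                       # (B, embed_dim, D/p, H/p, W/p)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, embed_dim) token sequence

# Hypothetical input: a batch of two 128x128x128 MRI volumes
tokens = ConvVoxelPatchEmbedding()(torch.randn(2, 1, 128, 128, 128))
print(tokens.shape)                            # torch.Size([2, 512, 256])
```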
This paper makes several contributions. First, we designed a ViT-based model,
the Convolutional Voxel Vision Transformer (CVVT), which takes 3D images (MRI
scans) and theoretically exploits the internal associations between different
regions of the image. Although the idea of this Convolutional Voxel Vision
Transformer is exciting, we found that this model does not generalize well when trained
on insufficient amounts of data, resulting in moderate performance on the ADNI
test set. Second, we proposed a shallow 4-block 3D-CNN model, ConvNet3D-4,
with Leaky ReLU activation layers and Instance Normalization, which makes the
model suitable for limited data and small batch sizes. The ConvNet3D-4 model
achieves state-of-the-art results in the AD vs. CN classification task.
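To illustrate the second contribution, the following is a hedged PyTorch sketch of a 4-block 3D CNN of this kind; the channel widths, kernel sizes, pooling, and classifier head are assumptions for the sketch and not necessarily the exact ConvNet3D-4 configuration:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One shallow block: 3D convolution, Instance Normalization, Leaky ReLU, pooling."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.InstanceNorm3d(out_ch),   # normalizes per sample, so small batch sizes are not a problem
        nn.LeakyReLU(inplace=True),
        nn.MaxPool3d(2),
    )

class ShallowConvNet3D(nn.Module):
    """A 4-block 3D CNN for whole-scan AD vs. CN classification (illustrative)."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 16), conv_block(16, 32),
            conv_block(32, 64), conv_block(64, 128),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                                  nn.Linear(128, num_classes))

    def forward(self, x):            # x: (B, 1, D, H, W)
        return self.head(self.features(x))

logits = ShallowConvNet3D()(torch.randn(1, 1, 96, 96, 96))   # hypothetical scan size
```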
2 Related Work
2.1 CNN models for Alzheimer’s Disease Classification
Most classification architectures using CNN models for AD diagnosis can be
divided into three types: 2D slice-level CNN, 3D patch-level CNN (often involving
a Region of Interest, ROI), and 3D subject-level CNN [5].
2D slice-level CNN models are designed for 2D images. Their inputs are 4D
tensors with the shape of batch size, number of channels, image width, and image
height. CT and MRI scans are 3D images, so many researchers choose to slice the 3D
tensors into 2D tensors and train the model on these 2D tensors. There are several
advantages to doing this: first, more data can be generated by the slicing operation;
second, mature 2D-CNN models, such as ResNet [21], VGGNet [22], and
InceptionNet [23], can be applied; third, 2D-CNN models consume significantly less
GPU RAM compared to 3D-CNN models. Valliani et al. [24] first applied a ResNet
pre-trained on ImageNet to the axial slices of MRI scans at the subject level
with affine data augmentation and concluded that the ResNet structure is better
than a shallow ConvNet (accuracy of 0.813 vs. 0.738). Islam et al. [25] and Qiu et
al. [26] investigated the effect of slice selection on the classification result (accuracy
of 0.833).
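For concreteness, the sketch below shows the 2D slice-level idea with an ImageNet-pretrained ResNet from torchvision; the slice axis, input size, channel replication, and the naive mean aggregation over slices are simplifying assumptions for illustration:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# ImageNet-pretrained 2D backbone with a 2-class head (AD vs. CN); torchvision >= 0.13 weights API assumed
model = resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 2)

volume = torch.randn(96, 224, 224)                 # hypothetical scan: (depth, H, W)
slices = volume.unsqueeze(1).repeat(1, 3, 1, 1)    # treat each axial slice as a 3-channel image
slice_logits = model(slices)                       # (96, 2): one prediction per slice
subject_logits = slice_logits.mean(dim=0)          # naive subject-level aggregation over slices
```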
3D-CNN models have an irreplaceable advantage: they exploit the spatial
relationship of voxels explicitly, which is crucial for isotropic scans. 3D-CNN
models are designed for 3D images (tensors) that have a 5D shape of batch size,
number of channels, image width, image height, and image depth. The input for a 3D
patch-level CNN consists of a set of 3D patches extracted from an image (scan).
Cheng et al. [27] chose 27 large patches with a size of 50×41×40 voxels in each MRI
scan and trained 27 CNN models. Then they ensembled the features to make a
prediction at subject level (accuracy of 0.872). Mingxia et al. [28] used a smaller
patch size (19×19×19 voxels) based on anatomical landmarks (accuracy of 0.911).
The advantage of patch-based models is that leveraging information from patches
requires less GPU memory compared to models using the entire 3D
image. The disadvantage is that the model relies on the architecture design for
connecting all patches. Moreover, most of the patches used in these models are not
informative because they contain parts of the brain that may not be affected by the
disease. Therefore, region of interest (ROI) based CNN models have been
developed. Examples of these models include the 2D+ϵ CNN [29] (28×28 patches made
of three neighboring 2D slices in the hippocampus, accuracy of 0.828) and the 2.5D
CNN [30] (concatenation of three 2D slices along the three possible planes, sagittal,
coronal, and axial, using 32×32 patches to train the model, with an accuracy of
0.888).
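As an illustration of the patch-level input, the sketch below crops fixed-size cubic patches around a list of voxel coordinates; the volume shape, patch size, and landmark coordinates are hypothetical:

```python
import torch

def extract_patch(volume, center, size=19):
    """Crop a cubic 3D patch of size^3 voxels centered at `center` from a (D, H, W) volume."""
    half = size // 2
    slices = tuple(slice(c - half, c - half + size) for c in center)
    return volume[slices]

scan = torch.randn(121, 145, 121)                      # hypothetical MRI volume (D, H, W)
landmarks = [(60, 70, 40), (60, 70, 80)]               # hypothetical anatomical landmark voxels
patches = torch.stack([extract_patch(scan, c) for c in landmarks])
print(patches.shape)                                   # torch.Size([2, 19, 19, 19])
```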
As for the 3D subject-level CNN, it takes the entire MRI scan of a subject at once
and the classification is performed at the subject level. Korolev et al. [31] (ADNI
dataset, with 111 scans, accuracy of 0.80), Senanayake et al. [32] (ADNI dataset,
with 322 scans, accuracy of 0.78), and Cheng et al. [33] (ADNI dataset, with 193
scans, accuracy of 0.896) used the entire MRI scans as input to classical deep
learning models such as VGG and ResNet. The advantage is that the spatial
information is fully integrated. However, these models require a large amount of GPU
memory and achieved only suboptimal classification accuracy.
2.2 Transformer and Vision Transformer
The Transformer [34] is a powerful model in the field of Natural Language
Processing with an encoder-decoder structure. Since the Vision Transformer
adopts only its encoder [11], we focus on the structure of the Transformer Encoder,
which is shown in the right part of Figure 1. The Transformer Encoder takes
language tokens (tokenized sentences) with positional encoding as input and
exploits the inner relationships among tokens using a stack of pre-defined network
structures, including the self-attention mechanism (Multi-Head Attention block),
the normalization layer (Norm block), and the feed-forward network (MLP block),
to obtain encoded features.
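A minimal PyTorch sketch of one such encoder block is shown below; the dimensions are illustrative assumptions, and the pre-norm arrangement follows common ViT implementations rather than any exact layout implied by Figure 1:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: Norm, Multi-Head Attention, Norm, MLP, with residual connections."""

    def __init__(self, d_model=256, n_heads=8, mlp_dim=1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, mlp_dim), nn.GELU(),
                                 nn.Linear(mlp_dim, d_model))

    def forward(self, tokens):                                        # tokens: (B, num_tokens, d_model)
        h = self.norm1(tokens)
        tokens = tokens + self.attn(h, h, h, need_weights=False)[0]   # self-attention + residual
        return tokens + self.mlp(self.norm2(tokens))                  # feed-forward + residual

encoded = EncoderBlock()(torch.randn(2, 65, 256))                     # e.g., 64 patch tokens + 1 class token
```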
Unlike the original Transformer, which uses language tokens (obtained from
sentences) with fixed positional encoding as input, the Vision Transformer uses visual
tokens (obtained from images) with learnable positional encoding as input. The
method for obtaining visual tokens is presented in the left part of Figure 1. For a
given image, ViT first divides it into patches and flattens each patch into
a vector. After applying a linear projection to the patch vectors and concatenating
an additional class embedding vector with the result, a learnable positional
embedding is added. The resulting vectors are sent to the Transformer Encoder as
visual tokens.
Fig.1. Vision Transformer Structure
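A minimal sketch of this tokenization step for a 2D image follows; the image size, patch size, and embedding dimension are illustrative assumptions:

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)          # (B, C, H, W), hypothetical input
patch, d_model = 16, 256
num_patches = (224 // patch) ** 2            # 14 x 14 = 196 patches

# 1. Divide the image into patches and flatten each patch into a vector.
patches = image.unfold(2, patch, patch).unfold(3, patch, patch)          # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, -1)  # (1, 196, 768)

# 2. Linearly project the patch vectors, prepend a learnable class embedding,
#    and add a learnable positional embedding.
proj = nn.Linear(3 * patch * patch, d_model)
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))

tokens = torch.cat([cls_token, proj(patches)], dim=1) + pos_embed        # (1, 197, 256) visual tokens
```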