SPATIO-TEMPORAL HYBRID FUSION OF CAE AND SWIN TRANSFORMERS FOR LUNG
CANCER MALIGNANCY PREDICTION
Sadaf Khademi†, Shahin Heidarian‡, Parnian Afshar†, Farnoosh Naderkhani†, Anastasia Oikonomou††,
Konstantinos N. Plataniotis‡‡, and Arash Mohammadi†
†Concordia Institute for Information Systems Engineering, Concordia University, Montreal, Canada
‡Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada
††Department of Medical Imaging, Sunnybrook Health Sciences Centre, Toronto, Canada
‡‡Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada
ABSTRACT
The paper proposes a novel hybrid Discovery Radiomics framework
that simultaneously integrates temporal and spatial features extracted
from non-thin chest Computed Tomography (CT) slices to predict
Lung Adenocarcinoma (LUAC) malignancy with minimum expert
involvement. Lung cancer is the leading cause of mortality from can-
cer worldwide and has various histologic types, among which LUAC
has recently been the most prevalent. LUACs are classified as pre-
invasive, minimally invasive, and invasive adenocarcinomas. Timely
and accurate knowledge of lung nodule malignancy leads to a
proper treatment plan and reduces the risk of unnecessary or late
surgeries. Currently, chest CT scan is the primary imaging modality
to assess and predict the invasiveness of LUACs. However, the radi-
ologists’ analysis based on CT images is subjective and suffers from
a low accuracy compared to the ground truth pathological reviews
provided after surgical resections. The proposed hybrid framework,
referred to as the CAET-SWin, consists of two parallel paths: (i) The
Convolutional Auto-Encoder (CAE) Transformer path that extracts
and captures informative features related to inter-slice relations via
a modified Transformer architecture, and; (ii) The Shifted Window
(SWin) Transformer path, which is a hierarchical vision transformer
that extracts nodules’ related spatial features from a volumetric CT
scan. Extracted temporal (from the CAE Transformer path) and spatial (from the SWin path) features are then fused through a fusion path to classify LUACs.
Experimental results on our in-house dataset of 114 pathologically
proven Sub-Solid Nodules (SSNs) demonstrate that the CAET-SWin
significantly improves the reliability of the invasiveness prediction task
while achieving an accuracy of 82.65%, sensitivity of 83.66%, and
specificity of 81.66% using 10-fold cross-validation.
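As an aside, the following minimal sketch illustrates the general late-fusion idea described above: temporal features from the CAE Transformer path and spatial features from the SWin path, assumed to be pre-extracted as fixed-length vectors, are concatenated and classified by a small head. All names, dimensions, and the PyTorch head are hypothetical and do not reproduce the authors' exact architecture.

```python
# Illustrative late-fusion classifier (hypothetical names and dimensions); it assumes
# the temporal (CAE Transformer path) and spatial (SWin path) features have already
# been extracted as fixed-length vectors per CT scan.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, temporal_dim=256, spatial_dim=768, num_classes=2):
        super().__init__()
        # Concatenate the two feature vectors and classify with a small MLP head.
        self.head = nn.Sequential(
            nn.Linear(temporal_dim + spatial_dim, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_classes),
        )

    def forward(self, temporal_feat, spatial_feat):
        fused = torch.cat([temporal_feat, spatial_feat], dim=-1)
        return self.head(fused)

# Stand-in features for a batch of 4 CT scans.
model = FusionClassifier()
logits = model(torch.randn(4, 256), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```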
Index Terms—Lung Adenocarcinoma, Lung Nodule Invasive-
ness, Transformer, Subsolid Nodule, Self-Attention.
1. INTRODUCTION
Lung Cancer (LC) is the deadliest and least-funded cancer world-
wide [1, 2]. Non-small-cell LC is the major type of LC, and Lung
Adenocarcinoma (LUAC) is the most prevalent histologic sub-
type [3]. Lung nodules manifesting as Ground Glass (GG) or Sub-
solid Nodules (SSNs) on Computed Tomography (CT) scans have
a higher risk of malignancy than other incidentally detected small
solid nodules. SSNs are often diagnosed as adenocarcinoma and
are generally classified into pure GG nodules and part-solid nodules
according to their appearance on the lung window settings [4, 5].
Timely and accurate differentiation of LUACs is of
utmost importance to guide a proper treatment plan, as in some
cases, a pre-invasive or minimally invasive SSN can be monitored
with regular follow-up CT scans, whereas invasive lesions should
undergo immediate surgical resection if they are deemed eligible.
Most often, the SSN type is diagnosed based on pathological findings obtained after surgical resection, which is not desirable
for prior treatment planning. Currently, radiologists use chest CT
scans to assess the invasiveness of the SSNs based on their imag-
ing findings and patterns prior to making decisions regarding the
appropriate treatment. Such visual approaches, however, are time-
consuming, subjective, and error-prone. So far, many studies have
used high-resolution and thin-slice (<1.5mm) CT images for
SSN classification, which require longer analysis times, as well as
more reconstruction time [6, 7]. However, lung nodules are mostly
identified from CT scans performed for various clinical purposes and acquired using routine standard or low-dose scanning protocols with
non-thin slice thicknesses (up to 5mm) [8]. In addition, recent
lung cancer screening recommendations suggest using low-dose CT scans with thicker slice thicknesses (up to 2.5mm) [9, 10].
Capitalizing on the above discussion, the necessity of developing
an automated invasiveness assessment framework that performs
well regardless of technical settings has recently arisen among the
research community and healthcare professionals.
Related Works: Generally speaking, existing works on the SSN in-
vasiveness assessment can be categorized into two main classes: (i)
Radiomics-based, and; (ii) Deep Learning-based frameworks, also
referred to as Discovery Radiomics [11]. In the former class, data-
characterization algorithms extract quantitative features from nodule
masks and the original CT images, which are then analyzed using
statistical or conventional Machine Learning (ML) models [12, 13].
As an example of such frameworks, a histogram-based model is de-
veloped in [8] to predict the invasiveness of primary adenocarcinoma
SSNs from non-thin CT scans of 109 pathologically labeled SSNs.
In this study, a set of histogram-based and morphological features
along with additional features extracted via functional Principal Component Analysis (PCA) is fed to a linear logistic regression model.
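For intuition, a minimal, purely illustrative sketch of such a Radiomics-style pipeline is given below, with plain PCA standing in for the functional PCA of [8]; the feature definitions, synthetic stand-in data, and dimensions are hypothetical and not the exact setup of the cited study.

```python
# Illustrative Radiomics-style pipeline: hand-crafted histogram features per nodule,
# PCA for additional components, and a linear logistic regression classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def histogram_features(nodule_hu, bins=16, hu_range=(-1000, 400)):
    """Normalized intensity histogram plus simple summary statistics (in HU)."""
    hist, _ = np.histogram(nodule_hu, bins=bins, range=hu_range, density=True)
    stats = [nodule_hu.mean(), nodule_hu.std(), np.percentile(nodule_hu, 90)]
    return np.concatenate([hist, stats])

# Synthetic stand-in data: 109 nodules, each a flattened array of voxel intensities,
# with placeholder binary invasiveness labels.
rng = np.random.default_rng(0)
nodules = [rng.normal(-600, 150, size=500) for _ in range(109)]
X = np.stack([histogram_features(n) for n in nodules])
y = rng.integers(0, 2, size=109)

clf = make_pipeline(StandardScaler(), PCA(n_components=8), LogisticRegression(max_iter=1000))
print(cross_val_score(clf, X, y, cv=10).mean())  # 10-fold cross-validated accuracy
```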
Discovery Radiomics approaches, on the other hand, extract infor-
mative and discriminative features in an automated fashion. Existing
deep models working with volumetric CT scans can be classified into two categories: (i) 3D-based solutions [14], where the whole 3D volume of CT images is fed to the model. Processing a large 3D CT scan at once, however, results in extensive computational complexity, requiring more computational resources and enormous training
datasets, and; (ii) 2D-based solutions [15–17], where individual 2D
CT slices are first analyzed, which are then fused via an aggregation