SAC ’23, March 27-April 2, 2023, Tallinn, Estonia Mariana-Iuliana Georgescu, Radu Tudor Ionescu, and Andreea-Iuliana Miron
weakly correlated with each other, by leveraging the Dice
score between the outputs of various models.
• We provide empirical evidence showing that our diversity-promoting
ensemble leads to superior performance levels compared with
individual models and the conventional strategy of selecting the
top-scoring models.
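As a rough illustration of how the Dice score can quantify the agreement between the outputs of two models, the sketch below computes it for two binary masks. The function name, the flat-list mask representation, and the convention used for two empty masks are assumptions made for this example, not details taken from the paper.

```python
def dice_score(mask_a, mask_b):
    """Dice coefficient between two binary masks given as flat sequences of 0/1."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of elements")
    # Pixels marked as foreground by both masks.
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    # Total number of foreground pixels in the two masks.
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Assumption: two empty masks are treated as identical.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Hypothetical outputs of two models on the same six pixels.
pred_model_1 = [0, 1, 1, 0, 1, 0]
pred_model_2 = [0, 1, 0, 0, 1, 1]
print(dice_score(pred_model_1, pred_model_2))
```

A high score between two models indicates that their outputs largely overlap, while a low score suggests that the models disagree and are therefore less correlated.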
2 RELATED WORK
Medical image segmentation can be divided into two tasks, with
respect to the input image. Indeed, there are works that tackle the
segmentation task on 2D images [18, 22, 29, 32], while others rely
on 3D images [2, 3, 5, 10, 14, 15, 18, 19, 25, 30, 32]. The works using
2D images as input naturally produce 2D slices as output, while
the works using entire 3D volumes as input produce 3D volumes
as output.
Perhaps the most popular architecture for 2D segmentation is
U-Net [22]. U-Net is a fully convolutional (conv) network designed for
medical image segmentation. The architecture follows a “U” shape
and is composed of a contracting and an expansive path. Each step
of the expansive path is composed of an upsampling operation, a
convolution layer which halves the number of feature maps, and a
concatenation with the corresponding cropped feature maps from
the contracting path. Seo et al. [25] proposed the mU-Net model,
a modified version of the U-Net architecture. mU-Net [25] adds a
residual path to the deconvolution operations, and an additional
convolutional layer to the skip connections in order to extract
high-level global features of small objects.
Chen et al. [3] proposed the voxel-wise residual network
(VoxResNet), a 3D CNN formed of 25 layers with residual connections.
Multimodal and multi-level contextual information is introduced
into the VoxResNet model. The multimodal information is added
by concatenating multimodal data before giving it as input to the
model. To improve the 3D segmentation performance of brain
lesions, Kamnitsas et al. [15] employed a 3D CNN comprising 11
layers with parallel convolutional pathways for multi-scale
processing. Rather than modifying the layers of their architecture,
Zhao et al. [32] inserted a lesion-related spatial attention mechanism
into the network.
In order to help physicians obtain better segmentation results,
Luo et al. [18] proposed interactive segmentation to further improve
the performance of CNN models, even on unseen objects.
Closer to our study, the work of Gibson et al. [10] shares the same
target task, being focused on multi-organ abdominal segmentation.
Gibson et al. [10] presented a registration-free approach based on
Dense V-Networks for multi-organ abdominal segmentation of 3D
images. They also proposed a batch-wise spatial dropout to lower
the memory usage and processing time of dropout.
Different from the aforementioned works, which are trained
in a fully-supervised learning setting, there are several works [6, 34]
proposing weakly-supervised learning frameworks. Zhou et al. [34]
found that data sets having only one organ annotated as the
positive class, leaving the other organs as part of the background,
attain misleading results in multi-organ segmentation, since the
background class contains many organs. In order to alleviate this
problem, Zhou et al. [34] proposed a prior-aware neural network,
incorporating anatomical priors on abdominal organ sizes into the
training objective.
Similar to our approach proposing an ensemble of multiple
networks to improve the segmentation results, Lyksborg [19] proposed
to use a model for each of the axial, sagittal and coronal planes,
fusing the corresponding segmentations into a single 3D segmentation.
Baldeon et al. [2] proposed AdaEn-Net, an ensemble of networks
that boosts the segmentation performance. AdaEn-Net [2] firstly
employs an ensemble of 2D and 3D models to predict the output
segmentation. Then, it trains the 2D-3D ensemble on 𝑘-folds,
obtaining 𝑘 models. The final segmentation mask is the average of
the 𝑘 models forming the final ensemble.
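The averaging step described above can be sketched as follows. This is only an illustration of averaging 𝑘 per-pixel predictions, not the AdaEn-Net implementation: the function name, the flat-list representation of probability maps, and the 0.5 threshold used to binarize the averaged map are all assumptions.

```python
def average_masks(prob_masks, threshold=0.5):
    """Average k per-pixel foreground probability maps and binarize the result."""
    if not prob_masks:
        raise ValueError("need at least one probability map")
    k = len(prob_masks)
    n = len(prob_masks[0])
    if any(len(m) != n for m in prob_masks):
        raise ValueError("all probability maps must have the same size")
    # Per-pixel mean over the k models.
    mean = [sum(m[i] for m in prob_masks) / k for i in range(n)]
    # Assumption: a pixel is foreground if its mean probability reaches the threshold.
    return [1 if p >= threshold else 0 for p in mean]


# Hypothetical outputs of k = 3 models on the same four pixels.
fold_outputs = [
    [0.9, 0.2, 0.6, 0.1],
    [0.8, 0.4, 0.3, 0.2],
    [0.7, 0.3, 0.9, 0.0],
]
print(average_masks(fold_outputs))
```

Note that plain averaging treats all models equally, regardless of how similar their outputs are.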
Different from previous works, such as [2, 19], which directly
combined models into ensembles without taking into account their
output correlation, we propose a novel ensemble creation algorithm
which promotes the diversity among the models comprising the
ensemble.
3 METHOD
3.1 Neural Architectures
To address our medical image segmentation task, we employ the
well-known U-Net architecture [22]. The U-Net architecture is a
fully convolutional network that belongs to the family of
encoder-decoder neural networks. In the encoding part, the spatial
information is downsampled through convolution and pooling operations.
In the decoding part, the spatial information is upsampled back to
the original size via transposed convolutions. High-resolution
features from the encoder are passed through skip connections and
concatenated to the corresponding features from the decoder, thus
infusing high-resolution information into the decoder. The
introduction of skip connections gives the network its “U” shape. We
further present our changes to the U-Net model, leading to a total
of nine distinct model variants forming the basis of our ensemble.
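To make the encoder-decoder data flow more concrete, the sketch below traces only the tensor shapes (channels, height, width) through a U-Net-like network, without performing any convolution. The function name, the base width of 64 channels, the depth of 4, and the rule of doubling or halving the channels at each level are assumptions borrowed from the commonly used U-Net configuration; they do not describe the backbone-based variants used in this work.

```python
def unet_shapes(size, base_channels=64, depth=4):
    """Trace (channels, height, width) through a U-Net-like encoder-decoder."""
    if size % (2 ** depth) != 0:
        raise ValueError("size must be divisible by 2**depth")
    encoder = []
    channels, res = base_channels, size
    for _ in range(depth):
        # Features at this level are kept for the skip connection.
        encoder.append((channels, res, res))
        # Assumption: pooling halves the resolution, the next level doubles channels.
        channels, res = channels * 2, res // 2
    bottleneck = (channels, res, res)
    decoder = []
    for skip_channels, _, _ in reversed(encoder):
        # Upsampling doubles the resolution and halves the channels.
        channels, res = channels // 2, res * 2
        # Skip connection: concatenate encoder features along the channel axis.
        decoder.append((channels + skip_channels, res, res))
        # Subsequent convolutions are assumed to reduce this back to `channels`.
    return encoder, bottleneck, decoder


encoder, bottleneck, decoder = unet_shapes(256)
print("encoder:", encoder)
print("bottleneck:", bottleneck)
print("decoder (after concatenation):", decoder)
```

The decoder shapes show that, after each concatenation, the feature maps carry both the upsampled decoder channels and the high-resolution encoder channels at the same spatial size.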
3.1.1 Backbone Variations. To build an ensemble of a diverse set of
models, we first introduce variations in terms of the backbone
architecture. Therefore, we try the following three encoder architectures:
ResNet-34 [12], EfficientNet-B0 [27], and EfficientNet-B1 [27]. We
choose ResNet-34 due to its fairly good trade-off between running
time and accuracy level. The reason behind adding EfficientNet-B0
and EfficientNet-B1 into our study is the superior performance
levels of these models compared to ResNet-34.
The residual network (ResNet) architecture was proposed by
He et al. [12]. ResNet models are composed of residual blocks. A
residual block consists of a few stacked conv layers and a skip
connection from the first layer to the last layer of the block. Skip
connections allow the training of very deep neural networks,
alleviating the vanishing gradient problem. He et al. [12] proposed five
ResNet variants of different depth, namely ResNet-18, ResNet-34,
ResNet-50, ResNet-101 and ResNet-152. Among these, we select
ResNet-34 to serve as backbone for some of our U-Net models.
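The idea of a residual block can be illustrated with a minimal sketch in which the block input is added to the output of the stacked layers. The function name and the use of plain Python callables on lists in place of conv layers are assumptions made for readability; normalization and activation layers found in actual ResNet blocks are omitted.

```python
def residual_block(x, layers):
    """Apply the stacked layers to x, then add x back through the skip connection."""
    out = x
    for layer in layers:
        out = layer(out)
    if len(out) != len(x):
        raise ValueError("the stacked layers must preserve the input size")
    # Skip connection: element-wise sum of the block input and the layer output.
    return [o + xi for o, xi in zip(out, x)]


# Hypothetical stand-ins for two stacked conv layers.
layers = [
    lambda v: [2.0 * t for t in v],
    lambda v: [t - 1.0 for t in v],
]
print(residual_block([1.0, 2.0, 3.0], layers))
```

Because the input is added to the output, the block reduces to the identity mapping whenever the stacked layers output zeros, and the gradient with respect to the input always contains an identity term, which is the property that helps very deep networks train.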
The EfficientNet architecture was introduced by Tan et al. [27]
to efficiently scale convolutional neural networks. Tan et al. [27]
demonstrated that, in order to obtain better performance under
a certain computational budget, all three components of the
network, namely the depth, the width and the resolution, should be