
UNet-2022: Exploring Dynamics in Non-isomorphic Architecture
Jiansen Guo1*, Hong-Yu Zhou2*, Liansheng Wang1, Yizhou Yu2
1School of Informatics, Xiamen University
2Department of Computer Science, The University of Hong Kong
jsguo@stu.xmu.edu.cn, whuzhouhongyu@gmail.com, lswang@xmu.edu.cn, yizhouy@acm.org
*First two authors contributed equally.
Abstract
Most recent medical image segmentation models are hybrid, integrating self-attention and convolution layers into a non-isomorphic architecture. However, one potential drawback of these approaches is that they fail to provide an intuitive explanation of why this hybrid combination is beneficial, making it difficult for subsequent work to build on top of them. To address this issue, we first analyze the differences between the weight allocation mechanisms of self-attention and convolution. Based on this analysis, we propose to construct a parallel non-isomorphic block that combines the advantages of self-attention and convolution through simple parallelization. We name the resulting U-shaped segmentation model UNet-2022. In experiments, UNet-2022 clearly outperforms its counterparts on a range of segmentation tasks, including abdominal multi-organ segmentation, automatic cardiac diagnosis, neural structure segmentation, and skin lesion segmentation, sometimes surpassing the best-performing baseline by 4%. In particular, UNet-2022 surpasses nnUNet, currently the most widely recognized segmentation model, by large margins. These results indicate the potential of UNet-2022 to become the model of choice for medical image segmentation. Code is available at https://bit.ly/3ggyD5G.
1. Introduction
The last decade has witnessed the dominance of deep convolutional neural networks (DCNNs) in computer vision. However, in 2020, ViT [10] showed that Transformers [28], which were initially developed for natural language processing (NLP), can perform as well as DCNNs in vision tasks. Nonetheless, as pointed out by [10, 39], due to the lack of the locality inductive bias built into DCNNs, Transformers do not generalize well when trained on insufficient amounts of data. To tackle this issue, Swin Transformer [21] proposed to limit the self-attention computation to non-overlapping local windows. This design greatly reduces the computational cost and enables Swin Transformer to outperform well-established DCNNs, such as ResNet [13], in a wide range of vision tasks.
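To make the window-attention idea concrete, below is a minimal PyTorch sketch (our illustration, not the official Swin Transformer code; the function name window_partition and all shapes are assumptions for exposition) of how a feature map is split into non-overlapping local windows, within which self-attention is then computed independently:

import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    # Split a (B, H, W, C) feature map into (num_windows * B, ws * ws, C),
    # assuming H and W are divisible by window_size.
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    # Bring the two window-grid axes together, then flatten each window.
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

x = torch.randn(1, 56, 56, 96)                # e.g., an early-stage feature map
windows = window_partition(x, window_size=7)  # -> (64, 49, 96)

Because attention is restricted to each 49-token window, its cost grows linearly with the number of windows rather than quadratically with the total number of positions.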
On the other hand, image segmentation has long been one of the fundamental tasks in medical image analysis. As the most widely adopted segmentation tools, UNet [25] and most of its variants [15, 18, 40] were built upon DCNNs. With the prevalence of vision transformers in 2021, the medical imaging community started to incorporate the self-attention module into U-shaped segmentation models for performance boosting [6, 8, 9, 14, 20, 24, 31–33, 37]. The core idea behind these approaches is to construct a non-isomorphic U-shaped architecture by integrating self-attention with convolution. Although these methods achieved progress on different medical imaging tasks, most of them fail to provide an intuitive explanation of why this combination can be effective. Accordingly, it is unclear how to better exploit the advantages of self-attention and convolution to build stronger segmentation networks.
Let us briefly review the weight allocation mechanisms of self-attention and convolution, respectively. As is well known, the key characteristic that led to the success of Transformers is the self-attention mechanism [28]. In vision transformers [10, 21], self-attention relates representations at different positions by employing a dynamic weight allocation mechanism. In practice, self-attention first computes the similarity between visual representations at different positions. Based on the resulting similarity matrix, dynamic weights are computed and assigned to the representations at different spatial positions. Thus, as shown in Fig. 1(a), in self-attention different positions receive different weights, while all channels at the same position share the same weight. One potential problem of this design is that the assigned weights are not dynamic along the channel dimension, preventing self-attention from capturing the internal variance among different channels.
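The following single-head sketch (a simplified illustration in our own notation; the projections Wq, Wk, Wv and the token count N are assumptions for exposition) makes this weight-sharing property explicit: the attention matrix holds one dynamic weight per pair of positions and is broadcast identically over all channels:

import torch
import torch.nn.functional as F

N, C = 49, 96                # number of positions (tokens) and channels
x = torch.randn(N, C)

# Hypothetical projection weights; in a real block these are learned.
Wq, Wk, Wv = (torch.randn(C, C) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv

A = F.softmax(q @ k.T / C ** 0.5, dim=-1)  # (N, N): one weight per position pair
out = A @ v                                # (N, C): row i of A reweights positions

# Each row of A multiplies every channel of a given position by the same
# scalar: weights are dynamic across positions but uniform across channels.

In other words, the softmax produces position-wise dynamics only; nothing in this computation adapts the weights per channel, which is exactly the limitation discussed above.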
In contrast, DCNNs rely on extra learnable convolution kernels to aggregate spatial representations. As