UNet-2022: Exploring Dynamics in Non-isomorphic Architecture
Jiansen Guo1*, Hong-Yu Zhou2*, Liansheng Wang1, Yizhou Yu2
1School of Informatics, Xiamen University
2Department of Computer Science, The University of Hong Kong
jsguo@stu.xmu.edu.cn, whuzhouhongyu@gmail.com, lswang@xmu.edu.cn, yizhouy@acm.org
*First two authors contributed equally.
Abstract
Recent medical image segmentation models are mostly hybrid, integrating self-attention and convolution layers into a non-isomorphic architecture. However, one potential drawback of these approaches is that they fail to provide an intuitive explanation of why this hybrid combination is beneficial, making it difficult for subsequent work to improve upon them. To address this issue, we first analyze the differences between the weight allocation mechanisms of self-attention and convolution. Based on this analysis, we propose to construct a parallel non-isomorphic block that combines the advantages of self-attention and convolution through simple parallelization. We name the resulting U-shaped segmentation model UNet-2022. In experiments, UNet-2022 clearly outperforms its counterparts on a range of segmentation tasks, including abdominal multi-organ segmentation, automatic cardiac diagnosis, neural structure segmentation, and skin lesion segmentation, sometimes surpassing the best-performing baseline by 4%. In particular, UNet-2022 surpasses nnUNet, the most widely recognized segmentation model at present, by large margins. These results indicate the potential of UNet-2022 to become the model of choice for medical image segmentation. Code is available at https://bit.ly/3ggyD5G.
1. Introduction
The last decade has witnessed the dominance of deep
convolutional neural networks (DCNNs) in computer vi-
sion. However, in 2020, ViT [10] showed that Transform-
ers [28], which were initially developed for natural lan-
guage processing (NLP), can perform as well as DCNNs in vision tasks. Nonetheless, as pointed out by [10, 39], Transformers lack the locality inductive bias of DCNNs and therefore do not generalize well when trained on insufficient amounts of data. To tackle this issue, the Swin Transformer [21] limits the self-attention computation to non-overlapping local windows. This design greatly reduces the computational cost and enables the Swin Transformer to outperform well-established DCNNs, such as ResNet [13], in a wide range of vision tasks.
On the other hand, image segmentation has long been among the fundamental tasks in medical image analysis. As the most widely adopted segmentation tools, UNet [25] and most of its variants [15, 18, 40] were built upon DCNNs. With the rise of vision transformers in 2021, the medical imaging community started to incorporate the self-attention module into U-shaped segmentation models for performance boosting [6, 8, 9, 14, 20, 24, 31–33, 37]. The core idea behind these approaches is to construct a non-isomorphic U-shaped architecture by integrating self-attention with convolution. Although these methods have made progress on different medical imaging tasks, most of them fail to provide an intuitive explanation of why this combination is effective. Accordingly, it is unclear how to better exploit the advantages of self-attention and convolution to build better segmentation networks.
Let us briefly review the weight allocation mechanisms of self-attention and convolution. As is well known, the key characteristic behind the success of Transformers is the self-attention mechanism [28]. In vision transformers [10, 21], self-attention relates representations at different positions through a dynamic weight allocation mechanism. In practice, self-attention first computes the similarity between visual representations at different positions. Based on the resulting similarity matrix, dynamic weights are computed and assigned to the representations at different spatial positions. Thus, as shown in Fig. 1(a), different positions receive different weights, while all channels at the same position share the same weight. One potential problem of this design is that the assigned weights are not dynamic along the channel dimension, preventing self-attention from capturing the internal variance among different channels.
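To make the spatially-dynamic, channel-shared property concrete, the following sketch (our illustration rather than the paper's code; the shapes and random projection matrices are arbitrary) runs single-head self-attention over N positions and shows that the resulting (N, N) attention matrix depends on the input but is applied identically to every channel.

```python
import torch
import torch.nn.functional as F

# Illustrative single-head self-attention over N spatial positions with C channels.
N, C = 16, 32
x = torch.randn(N, C)                                 # one token per position

Wq, Wk, Wv = (torch.randn(C, C) for _ in range(3))    # hypothetical projections
q, k, v = x @ Wq, x @ Wk, x @ Wv

# Similarities between positions yield input-dependent (dynamic) weights.
attn = F.softmax(q @ k.t() / C ** 0.5, dim=-1)        # (N, N), varies with the input
out = attn @ v                                        # weighted sum over positions

# attn has no channel axis: the same (N, N) weight matrix multiplies every
# channel of v, i.e. weights are dynamic across space but shared across channels.
print(attn.shape, out.shape)                          # torch.Size([16, 16]) torch.Size([16, 32])
```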
Figure 1: Illustration of the weight allocation mechanisms in self-attention (a), convolution (b), and our module (c). Different colors denote different weights. In self-attention, the weight matrix is dynamic across space but shared across different channels. In contrast, the weight matrix of convolution is shared across space but varies across channels. Our module integrates the advantages of self-attention and convolution by assigning dynamic weights to both positions and channels.

On the other hand, DCNNs rely on extra learnable convolution kernels to aggregate spatial representations. As
shown in Fig. 1(b), the same set of convolution kernel weights is shared across all spatial positions, while dynamic weights are assigned to different channels. As a result, compared to self-attention, convolution can better exploit the representations that emerge in different channels, but it lacks the ability to describe complex spatial patterns.
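As a counterpart to the attention sketch above, this snippet (again only an illustration; the channel count and kernel size are arbitrary) inspects a depth-wise convolution: each channel has its own learned kernel, yet that kernel is reused at every spatial position and does not depend on the input.

```python
import torch
import torch.nn as nn

# Illustrative depth-wise 3x3 convolution on a (C, H, W) feature map.
C, H, W = 32, 8, 8
x = torch.randn(1, C, H, W)

dwconv = nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C)

# One kernel per channel (weights differ across channels), but the same kernel
# is reused at every spatial position and is independent of the input content.
print(dwconv.weight.shape)   # torch.Size([32, 1, 3, 3])
print(dwconv(x).shape)       # torch.Size([1, 32, 8, 8])
```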
In the above, we analyzed the difference between self-attention and convolution from the perspective of weight allocation. From this analysis, we see that the two strategies have distinct but complementary characteristics. Based on this insight, we introduce a non-isomorphic block that includes self-attention and convolution as two parallel modules. The proposed block implements a novel weight allocation mechanism that introduces dynamics along both the spatial and channel dimensions (cf. Fig. 1). In practice, we find that this embarrassingly simple combination performs surprisingly well, outperforming previous state-of-the-art medical segmentation models by large margins on various segmentation tasks. Moreover, to reduce the risk of overfitting, we use depth-wise convolution (DWConv) to decrease the number of weight parameters, which we empirically found performs slightly better than naive convolution. To summarize, our contributions are as follows:
• We provide an intuitive explanation of why self-attention and convolution can be complementary to each other. The core difference lies in the dynamics of the weight allocation mechanism: self-attention stresses the importance of spatial dynamics but ignores channel dynamics, whereas convolution assigns dynamic weights to different channels instead of spatial positions.

• We propose a new weight allocation mechanism that introduces dynamic weights along both the spatial and channel dimensions. Its implementation is frustratingly simple, comprising parallel, independent self-attention and convolution modules (a minimal sketch of such a block is given after this list). The resulting non-isomorphic block assigns dynamic weights to different spatial positions and channels, making it capable of capturing the complex patterns that emerge in feature maps.

• The resulting UNet-2022 clearly outperforms nnUNet, currently the best generic medical image segmentation model, on a range of medical image segmentation tasks, including abdominal multi-organ segmentation, automatic cardiac diagnosis, neural structure segmentation, and skin lesion segmentation. For instance, UNet-2022 surpasses nnUNet by nearly 4% on multi-organ segmentation with a much smaller input size.
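The sketch below shows one way such a parallel block could be wired, assuming global (non-windowed) multi-head attention, a 3x3 depth-wise convolution, and simple additive fusion; the class name and hyperparameters are ours for illustration, and the actual UNet-2022 block (windowed attention, normalization, MLP, and fusion details) follows the released code.

```python
import torch
import torch.nn as nn

class ParallelBlockSketch(nn.Module):
    """Illustrative only: self-attention and depth-wise convolution applied in
    parallel to the same input and fused by addition, so that weights become
    dynamic across both space (attention branch) and channels (conv branch)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))    # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)     # dynamic across space
        attn_out = attn_out.transpose(1, 2).reshape(b, c, h, w)
        conv_out = self.dwconv(x)                           # dynamic across channels
        return x + attn_out + conv_out                      # parallel fusion (residual)

x = torch.randn(2, 32, 16, 16)
print(ParallelBlockSketch(32)(x).shape)                     # torch.Size([2, 32, 16, 16])
```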
2. Related work
Vision Transformer. The Transformer was proposed in the NLP domain [28] and is able to model long-range dependencies between the elements of a sequence, which can make up for the deficiency of CNNs in capturing global information. ViT [10] was the first to apply the Transformer to vision tasks and achieved impressive performance, comparable to or even better than traditional DCNNs. However, ViT requires more data during training due to the lack of the locality inductive bias. To address this issue, a number of Transformer-based models were developed. Building on ViT, DeiT [26] introduced stronger data augmentation to regularize vision transformers. Besides, DeiT employed knowledge distillation to help train vision transformers, where the teacher network helps the student network incorporate the locality inductive bias. Swin Transformer [21] built a hierarchical vision transformer with a window-based self-attention mechanism, which enhances the ability to capture local features and reduces the computational complexity of self-attention. Such local self-attention is better suited to dense prediction tasks such as semantic segmentation. However, most of these