
UNet-2022: Exploring Dynamics in Non-isomorphic Architecture
Jiansen Guo1*, Hong-Yu Zhou2*, Liansheng Wang1, Yizhou Yu2
1School of Informatics, Xiamen University
2Department of Computer Science, The University of Hong Kong
jsguo@stu.xmu.edu.cn, whuzhouhongyu@gmail.com, lswang@xmu.edu.cn, yizhouy@acm.org
*First two authors contributed equally.
Abstract
Most recent medical image segmentation models are hybrid, integrating self-attention and convolution layers into a non-isomorphic architecture. However, one potential drawback of these approaches is that they fail to provide an intuitive explanation of why this hybrid combination is beneficial, making it difficult for subsequent work to build on top of them. To address this issue, we first analyze the differences between the weight allocation mechanisms of self-attention and convolution. Based on this analysis, we propose to construct a parallel non-isomorphic block that combines the advantages of self-attention and convolution through simple parallelization. We name the resulting U-shaped segmentation model UNet-2022. In experiments, UNet-2022 clearly outperforms its counterparts on a range of segmentation tasks, including abdominal multi-organ segmentation, automatic cardiac diagnosis, neural structure segmentation, and skin lesion segmentation, sometimes surpassing the best-performing baseline by 4%. In particular, UNet-2022 surpasses nnUNet, currently the most widely recognized segmentation model, by large margins. These results indicate the potential of UNet-2022 to become the model of choice for medical image segmentation. Code is available at https://bit.ly/3ggyD5G.
1. Introduction
The last decade has witnessed the dominance of deep convolutional neural networks (DCNNs) in computer vision. However, in 2020, ViT [10] showed that Transformers [28], which were initially developed for natural language processing (NLP), can perform as well as DCNNs in vision tasks. Nonetheless, as pointed out by [10, 39], due to the lack of the locality inductive bias built into DCNNs, Transformers do not generalize well when trained on insufficient amounts of data. To tackle this issue, Swin Transformer [21] proposed to limit the self-attention computation to non-overlapping local windows. This design greatly reduces the computational cost and enables Swin Transformer to outperform well-established DCNNs, such as ResNet [13], in a wide range of vision tasks.
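To make the window-attention idea concrete, below is a minimal PyTorch sketch (our illustration, not the official Swin Transformer code; the function name window_partition and all shapes are assumptions for exposition) of how a feature map is split into non-overlapping local windows, within which self-attention is then computed independently:

import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    # Split a (B, H, W, C) feature map into (num_windows * B, ws * ws, C),
    # assuming H and W are divisible by window_size.
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    # Bring the two window-grid axes together, then flatten each window.
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

x = torch.randn(1, 56, 56, 96)                # e.g., an early-stage feature map
windows = window_partition(x, window_size=7)  # -> (64, 49, 96)

Because attention is restricted to each 49-token window, its cost grows linearly with the number of windows rather than quadratically with the total number of positions.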
On the other hand, image segmentation has long been one of the fundamental tasks in medical image analysis. As the most widely adopted segmentation tools, UNet [25] and most of its variants [15, 18, 40] were built upon DCNNs. With the prevalence of vision transformers in 2021, the medical imaging community started to incorporate the self-attention module into U-shaped segmentation models for performance boosting [6, 8, 9, 14, 20, 24, 31–33, 37]. The core idea behind these approaches is to construct a non-isomorphic U-shaped architecture by integrating self-attention with convolution. Although these methods achieved progress on different medical imaging tasks, most of them fail to provide an intuitive explanation of why this combination can be effective. Accordingly, it is unclear how to better exploit the advantages of self-attention and convolution to build stronger segmentation networks.
Let us briefly review the weight allocation mechanisms of self-attention and convolution, respectively. As is well known, the key characteristic that led to the success of Transformers is the self-attention mechanism [28]. In vision transformers [10, 21], self-attention relates representations at different positions by employing a dynamic weight allocation mechanism. In practice, self-attention first computes the similarity between visual representations at different positions. Based on the resulting similarity matrix, dynamic weights are computed and assigned to the representations at different spatial positions. Thus, as shown in Fig. 1(a), in self-attention different positions receive different weights, while all channels at the same position share the same weight. One potential problem of this design is that the assigned weights are not dynamic along the channel dimension, preventing self-attention from capturing the internal variance among different channels.
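The following single-head sketch (a simplified illustration in our own notation; the projections Wq, Wk, Wv and the token count N are assumptions for exposition) makes this weight-sharing property explicit: the attention matrix holds one dynamic weight per pair of positions and is broadcast identically over all channels:

import torch
import torch.nn.functional as F

N, C = 49, 96                # number of positions (tokens) and channels
x = torch.randn(N, C)

# Hypothetical projection weights; in a real block these are learned.
Wq, Wk, Wv = (torch.randn(C, C) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv

A = F.softmax(q @ k.T / C ** 0.5, dim=-1)  # (N, N): one weight per position pair
out = A @ v                                # (N, C): row i of A reweights positions

# Each row of A multiplies every channel of a given position by the same
# scalar: weights are dynamic across positions but uniform across channels.

In other words, the softmax produces position-wise dynamics only; nothing in this computation adapts the weights per channel, which is exactly the limitation discussed above.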
In contrast, DCNNs rely on extra learnable convolution kernels to aggregate spatial representations. As