MEW-UNET: MULTI-AXIS REPRESENTATION LEARNING IN FREQUENCY DOMAIN FOR MEDICAL IMAGE SEGMENTATION

Jiacheng Ruan, Mingye Xie, Suncheng Xiang*, Ting Liu, Yuzhuo Fu*

Shanghai Jiao Tong University, Shanghai, China

ABSTRACT

Recently, the Vision Transformer (ViT) has been widely used in various fields of computer vision because its self-attention mechanism, applied in the spatial domain, models global knowledge. In medical image segmentation (MIS) in particular, many works combine ViT with CNNs, and some use purely ViT-based models. However, these works improve models in the spatial domain while ignoring the importance of frequency domain information. Therefore, we propose the Multi-axis External Weights UNet (MEW-UNet) for MIS, which is based on the U-shape architecture and replaces self-attention in ViT with our Multi-axis External Weights block. Specifically, our block performs a Fourier transform along the three axes of the input feature and applies external weights, generated by our Weights Generator, in the frequency domain. An inverse Fourier transform then maps the features back to the spatial domain. We evaluate our model on four datasets and achieve state-of-the-art performance. In particular, on the Synapse dataset, our method outperforms MT-UNet by 10.15 mm in terms of HD95. Code is available at https://github.com/JCruan519/MEW-UNet.

Index Terms: Medical image segmentation, Deep learning, Multi-axis External Weights

1. INTRODUCTION

Medical image segmentation (MIS) helps medical staff locate lesion areas and improves the efficiency of clinical treatment, which gives it great practical value. In recent years, UNet [1], an encoder-decoder model based on the U-shape architecture, has been widely used for MIS. Owing to its strong scalability, many works build on the U-shape architecture. For example, UNet++ [2] reduces the semantic gap between the encoder and decoder by introducing dense connections. Att-UNet [3] introduces a gating mechanism that makes the model focus on the targets.

The above improvements are all based on CNNs, and the inherent locality of the convolution operation makes it hard for these networks to capture global information. ViT [4], through its self-attention (SA) mechanism, improves long-range dependency modeling and focuses on holistic image semantics, which benefits dense prediction tasks such as image segmentation. Recent improvements can therefore be divided into two types. On the one hand, hybrid structures based on CNNs and ViTs are widely used. For example, UCTransNet [5] replaces the skip connection in UNet with the CTrans module, alleviating the problem that skip connections between the encoder and decoder may lead to incompatible features. MT-UNet [6] uses CNNs at shallow levels and Local-Global SA at deep levels, combined with an external attention mechanism, to obtain richer representations. On the other hand, pure ViTs are also applied to MIS. For instance, Swin-UNet [7] achieves better performance by replacing the convolution operations in UNet with Swin Transformer blocks.

* Suncheng Xiang and Yuzhuo Fu are the co-corresponding authors. This work was partially supported by the National Natural Science Foundation of China (Grant No. 61977045).

Although the above models have achieved good results, they all operate in the spatial domain, and few works explore MIS in the frequency domain. Frequency domain information can help the model distinguish the lesion area from the background more clearly, which is indispensable for MIS. In general computer vision, GFNet [8] uses the 2D discrete Fourier transform (2D DFT) to map features from the spatial domain to the frequency domain and learns representations with filters in the frequency domain, which is enlightening. However, it extracts frequency domain information along a single axis only, resulting in incomplete global information. In addition, it ignores the importance of local information in image feature extraction.
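To illustrate the frequency domain filtering idea described above, the following NumPy sketch applies a 2D DFT over the spatial axes of a feature map, multiplies the result element-wise by a filter, and inverts the transform. The function name `global_filter`, the (H, W, C) layout, and the use of a real-input FFT are assumptions made for this example and are not taken from GFNet's code. By the convolution theorem, the element-wise product in the frequency domain corresponds to a circular convolution in the spatial domain with a kernel as large as the feature map, which is why such a filter mixes information globally.

```python
import numpy as np

def global_filter(x, weight):
    """Filter a real feature map in the frequency domain.

    x:      (H, W, C) real array (layout is an assumption of this sketch).
    weight: (H, W // 2 + 1, C) complex array, or anything broadcastable to it.
    """
    h, w, _ = x.shape
    freq = np.fft.rfft2(x, axes=(0, 1))      # spatial -> frequency domain
    freq = freq * weight                     # element-wise filtering
    return np.fft.irfft2(freq, s=(h, w), axes=(0, 1))  # back to spatial domain

def circular_conv(x, kernel):
    """Direct circular convolution over the two spatial axes, for comparison."""
    h, w = kernel.shape
    out = np.zeros_like(x)
    for a in range(h):
        for b in range(w):
            # np.roll shifts the map so that out[i, j] picks up x[i - a, j - b]
            out += kernel[a, b] * np.roll(x, (a, b), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 2))
kernel = rng.standard_normal((4, 6))          # kernel spans the whole map
weight = np.fft.rfft2(kernel)[..., None]      # same filter, frequency domain
same = np.allclose(global_filter(x, weight), circular_conv(x, kernel))
```

In this sketch `same` compares the two routes to the same result. A learnable filter would replace the fixed `weight` with trainable parameters.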

To address these problems, we propose the Multi-axis External Weights (MEW) mechanism, which simultaneously obtains more comprehensive global and local information. Specifically, the feature map is divided into four parts along the channel dimension. For the first three branches, we transform the features into the frequency domain using the 2D DFT along three different axes. We then multiply the frequency domain maps by the corresponding learnable weights to obtain frequency domain information and global knowledge. For the remaining branch, a depthwise separable (DW) convolution [9] is applied to obtain local information. Replacing SA in ViT with our MEW then yields the Multi-axis External Weights Block (MEWB).
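The four-branch structure described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the released implementation: the choice of the three axis pairs (H-W, C-H, C-W), the plain weight arrays standing in for the output of the Weights Generator, the 3x3 kernel size, taking the real part after the inverse transform, and all function names are hypothetical choices made for this example.

```python
import numpy as np

def freq_branch(x, weight, axes):
    """2D DFT over `axes`, element-wise weighting, then inverse 2D DFT."""
    freq = np.fft.fft2(x, axes=axes)
    # Keeping only the real part is an assumption of this sketch.
    return np.fft.ifft2(freq * weight, axes=axes).real

def dw_separable_conv(x, dw_kernel, pw_weight):
    """Depthwise 3x3 convolution (zero padding) followed by a pointwise 1x1.

    x: (C, H, W), dw_kernel: (C, 3, 3), pw_weight: (C_out, C).
    """
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += dw_kernel[:, i, j, None, None] * padded[:, i:i + h, j:j + w]
    return np.einsum("oc,chw->ohw", pw_weight, out)

def mew(x, w_hw, w_ch, w_cw, dw_kernel, pw_weight):
    """Apply the four branches to x of shape (C, H, W), C divisible by 4."""
    x1, x2, x3, x4 = np.split(x, 4, axis=0)         # split along channels
    y1 = freq_branch(x1, w_hw, axes=(1, 2))         # H-W plane
    y2 = freq_branch(x2, w_ch, axes=(0, 1))         # C-H plane
    y3 = freq_branch(x3, w_cw, axes=(0, 2))         # C-W plane
    y4 = dw_separable_conv(x4, dw_kernel, pw_weight)  # local branch
    return np.concatenate([y1, y2, y3, y4], axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5, 6))                  # C=8, H=5, W=6
shape = (2, 5, 6)                                   # one chunk of C/4 channels
weights = [rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
           for _ in range(3)]
y = mew(x, *weights, rng.standard_normal((2, 3, 3)), rng.standard_normal((2, 2)))
```

The output `y` has the same shape as the input, so a block built this way could occupy the position that self-attention holds in a ViT block.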

arXiv:2210.14007v1 [eess.IV] 25 Oct 2022
