
MEW-UNET: MULTI-AXIS REPRESENTATION LEARNING IN FREQUENCY DOMAIN FOR
MEDICAL IMAGE SEGMENTATION
Jiacheng Ruan, Mingye Xie, Suncheng Xiang*, Ting Liu, Yuzhuo Fu*
Shanghai Jiao Tong University, Shanghai, China
ABSTRACT
Recently, the Vision Transformer (ViT) has been widely used in
various fields of computer vision because it applies a self-attention
mechanism in the spatial domain to model global knowledge. In medical
image segmentation (MIS) in particular, many works combine ViT with
CNNs, and some directly use pure ViT-based models. However, recent
works improve models in the spatial domain while ignoring the
importance of frequency domain information. Therefore, we propose the
Multi-axis External Weights UNet (MEW-UNet) for MIS, built on the
U-shape architecture by replacing self-attention in ViT with our
Multi-axis External Weights block. Specifically, our block performs a
Fourier transform on the three axes of the input feature and assigns
external weights in the frequency domain, which are generated by our
Weights Generator. Then, an inverse Fourier transform is performed to
change the features back to the spatial domain. We evaluate our model
on four datasets and achieve state-of-the-art performance. In
particular, on the Synapse dataset, our method outperforms MT-UNet by
10.15 mm in terms of HD95. Code is available at
https://github.com/JCruan519/MEW-UNet.
Index Terms: Medical image segmentation, Deep learning, Multi-axis
External Weights
1. INTRODUCTION
Medical image segmentation (MIS) can assist medical staff in locating
lesion areas and improve the efficiency of clinical treatment, which
gives it great practical value. In recent years, UNet [1], an
encoder-decoder model based on the U-shape architecture, has been
widely used for MIS. Owing to its strong scalability, many works build
on the U-shape architecture. For example, UNet++ [2] reduces the
semantic gap between the encoder and decoder by introducing dense
connections. Att-UNet [3] introduces a gating mechanism to make the
model focus on the targets.
The above improvements are all based on CNNs, and the inherent
locality of the convolution operation makes such networks poor at
capturing global information. ViT [4], owing to its self-attention
mechanism (SA), improves long-range dependency modeling and focuses on
holistic image semantic information, which benefits dense prediction
tasks such as image segmentation. Recent improvements can therefore be
divided into two types. On the one hand, hybrid structures based on
CNNs and ViTs are widely used. For example, UCTransNet [5] replaces
the skip connection in UNet with the CTrans module, alleviating the
problem that the skip connection between the encoder and decoder may
lead to incompatible features. MT-UNet [6] uses CNN at shallow levels
and Local-Global SA at deep levels, combined with the external
attention mechanism, to obtain richer representation information. On
the other hand, pure ViT models are also applied to MIS. For instance,
Swin-UNet [7] achieves better performance by replacing the convolution
operation in UNet with the Swin Transformer Block.
* Suncheng Xiang and Yuzhuo Fu are the co-corresponding authors.
This work was partially supported by the National Natural Science
Foundation of China (Grant No. 61977045).
Although the above models have achieved good results, they all operate
in the spatial domain, and few works explore MIS in the frequency
domain. Frequency domain information can help the model distinguish
the lesion area from the background more clearly, which is
indispensable for MIS. In general vision, GFNet [8] uses the 2D
discrete Fourier transform (2D DFT) to change features from the
spatial domain to the frequency domain and learns representations with
filters in the frequency domain, which is instructive. However, it
extracts frequency domain information along only a single axis,
resulting in incomplete global information. In addition, it ignores
the importance of local information in image feature extraction.
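As a rough illustration of the frequency-domain filtering idea attributed to GFNet above, the following NumPy sketch transforms a feature map with a 2D DFT, multiplies it elementwise by a weight array, and transforms it back to the spatial domain. The function name `global_filter`, the (H, W, C) layout, and the use of the real-input FFT are assumptions made for illustration; they are not details taken from GFNet [8].

```python
import numpy as np

def global_filter(x, weight):
    """Filter a real feature map in the frequency domain (illustrative sketch).

    x:      real array of shape (H, W, C)
    weight: complex array of shape (H, W // 2 + 1, C), standing in for
            a learnable frequency-domain filter
    """
    h, w, _ = x.shape
    # 2D DFT over the two spatial axes (real-input variant keeps half of W).
    freq = np.fft.rfft2(x, axes=(0, 1), norm="ortho")
    # Elementwise multiplication applies the filter to every frequency.
    freq = freq * weight
    # Inverse 2D DFT returns the features to the spatial domain.
    return np.fft.irfft2(freq, s=(h, w), axes=(0, 1), norm="ortho")

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))
# An all-ones filter leaves every frequency untouched.
identity = np.ones((8, 5, 4), dtype=complex)
y = global_filter(x, identity)
```

With the all-ones filter the output should match the input up to floating-point error, which is a convenient sanity check before substituting trained weights.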
To address these problems, we propose the Multi-axis External Weights
mechanism (MEW), which can simultaneously obtain more comprehensive
global and local information. Specifically, the feature map is divided
into four parts along the channel dimension. For the first three
branches, we transform the features into the frequency domain using
the 2D DFT along three different axes. Then, we multiply the frequency
domain maps by the corresponding learnable weights to obtain frequency
domain information and global knowledge. In addition, a depthwise
separable (DW) convolution operation [9] is applied to the remaining
branch to obtain local information. After that, by replacing SA in ViT
with our MEW, the Multi-axis External Weights Block (MEWB) is
obtained. Fi-
arXiv:2210.14007v1 [eess.IV] 25 Oct 2022
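The four-branch MEW mechanism described in the introduction can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: "three different axes" is interpreted here as the three axis pairs (H, W), (C, H), and (C, W) of a (C, H, W) feature map; the weights are plain arrays rather than outputs of the Weights Generator; and a zero-padded 3x3 depthwise convolution stands in for the DW convolution branch. All function names are hypothetical.

```python
import numpy as np

def freq_branch(x, weight, axes):
    """2D DFT over an axis pair, elementwise weighting, inverse 2D DFT."""
    freq = np.fft.fft2(x, axes=axes, norm="ortho")
    # Keeping only the real part is a simplification of this sketch.
    return np.fft.ifft2(freq * weight, axes=axes, norm="ortho").real

def depthwise_conv3x3(x, kernels):
    """Zero-padded 3x3 depthwise cross-correlation.

    x: (C, H, W); kernels: (C, 3, 3), one kernel per channel.
    """
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += kernels[:, i, j, None, None] * padded[:, i:i + h, j:j + w]
    return out

def mew_sketch(x, weights, kernels):
    """x: (C, H, W) with C divisible by 4.

    weights: three complex arrays, each shaped like one channel chunk.
    kernels: (C // 4, 3, 3) depthwise kernels for the local branch.
    """
    # Split the feature map into four parts along the channel dimension.
    x_hw, x_ch, x_cw, x_local = np.split(x, 4, axis=0)
    branches = [
        freq_branch(x_hw, weights[0], axes=(1, 2)),  # assumed (H, W) pair
        freq_branch(x_ch, weights[1], axes=(0, 1)),  # assumed (C, H) pair
        freq_branch(x_cw, weights[2], axes=(0, 2)),  # assumed (C, W) pair
        depthwise_conv3x3(x_local, kernels),         # local information
    ]
    return np.concatenate(branches, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6, 6))
# All-ones weights and a centre-only kernel make every branch an identity.
weights = [np.ones((2, 6, 6), dtype=complex) for _ in range(3)]
kernels = np.zeros((2, 3, 3))
kernels[:, 1, 1] = 1.0
y = mew_sketch(x, weights, kernels)
```

With identity weights and kernels the output should reproduce the input, which checks that the split, the three transforms, and the concatenation are mutually consistent.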