APAUNet: Axis Projection Attention UNet for Small Target in 3D Medical Segmentation
Yuncheng Jiang1,2,3,*, Zixun Zhang1,2,3,*, Shixi Qin1,2,3, Yao Guo4, Zhen Li2,1,3,†, and Shuguang Cui2,1,5

1 FNii, CUHK-Shenzhen, Guangdong, China
{yunchengjiang@link.,zixunzhang@link.,lizhen@}cuhk.edu.cn
2 SSE, CUHK-Shenzhen, Guangdong, China
3 SRIBD, CUHK-Shenzhen, Guangdong, China
4 Shanghai Jiao Tong University, Shanghai, China
5 Pengcheng Laboratory, Shenzhen, Guangdong, China

* Equal contributions. Code is available at github.com/zx33/APAUNet.
† Corresponding Author
Abstract. In 3D medical image segmentation, small target segmentation is crucial for diagnosis but still faces challenges. In this paper, we propose the Axis Projection Attention UNet, named APAUNet, for 3D medical image segmentation, especially for small targets. Considering the large proportion of the background in the 3D feature space, we introduce a projection strategy that projects the 3D features onto three orthogonal 2D planes to capture contextual attention from different views. In this way, we can filter out redundant feature information and mitigate the loss of critical information for small lesions in 3D scans. We then utilize a dimension hybridization strategy to fuse the 3D features with attention from different axes and merge them by a weighted summation to adaptively learn the importance of different perspectives. Finally, in the APA decoder, we concatenate both high- and low-resolution features in the 2D projection process, thereby obtaining more precise multi-scale information, which is vital for small lesion segmentation. Quantitative and qualitative experimental results on two public datasets (BTCV and MSD) demonstrate that our proposed APAUNet outperforms other methods. Concretely, our APAUNet achieves an average Dice score of 87.84 on BTCV, 84.48 on MSD-Liver, and 69.13 on MSD-Pancreas, and significantly surpasses the previous SOTA methods on small targets.

Keywords: 3D medical segmentation · Axis Projection Attention.
1 Introduction
Medical image segmentation, which aims to automatically and accurately diagnose lesion and organ regions in either 2D or 3D medical images, is one of the critical steps in developing image-guided diagnostic and surgical systems. In practice, compared to large targets such as organs, small targets like tumors or polyps are more important for diagnosis, but are also more prone to being ignored. In this paper, we focus on 3D medical image segmentation (CT/MRI), with an emphasis on small lesions. This task is challenging mainly due to the following two aspects: 1) severe class imbalance between the foreground (lesions) and the background (entire 3D scans); 2) large variances in the shape, location, and size of organs/lesions.

Fig. 1. Target shape samples and size distribution of the MSD and Synapse multi-organ segmentation datasets. (a) Six example organs: BTCV-RightKidney, BTCV-Gallbladder, BTCV-Pancreas, BTCV-LeftKidney, MSD-Liver (tumour), and MSD-Pancreas (tumour). (b) The target size distribution: the x-axis is the target size interval, and the y-axis is the proportion (%) of corresponding samples in the whole dataset. The left part shows the relative proportion (%) of the target size to the whole input, while the right part shows the absolute size of the target with an interval step of 32 voxels. It can be observed that the relative target sizes of most samples in all six categories are less than 0.6%, with various shapes.
Recent progress in medical image segmentation has mainly been based on UNet [1], which applies a U-shaped structure with skip-connections to merge multi-scale features. However, due to the inductive bias towards locality of convolutions, U-shaped networks still suffer from limited representation ability. Some studies utilized a coarse-to-fine segmentation framework [2]. These approaches refine the final segmentation in the fine stage by shrinking the input features to the region of interest (ROI) predicted in the coarse stage. Also, instead of using vanilla 3D convolutions, some works explored a 2.5D fashion [3], performing 2D convolutions over the xy-plane at the low-level layers of the network and 3D convolutions at the high-level layers. Other works attempted to use an ensemble of 2D and 3D strategies, which fuses 2D predictions from different views with 3D predictions to get better results [4], or refines 2D predictions using 3D convolutions [5]. Besides, inspired by the great success of Transformers, some works explored the feasibility of applying self-attention to medical images by integrating CNN-based architectures with Transformer-like modules [6,7,8] to capture patch-level contextual information.
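To make the 2.5D fashion mentioned above concrete, the sketch below illustrates the general idea only, not the exact architecture of [3]: 2D convolutions are applied slice-wise along the depth axis in the early layers, while later layers use full 3D convolutions. All module names and channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoPointFiveDStem(nn.Module):
    """Illustrative 2.5D block: 2D convs on each axial (xy) slice at the
    low-level stage, followed by a 3D conv at the high-level stage.
    Channel sizes are arbitrary placeholders, not taken from [3]."""

    def __init__(self, in_ch=1, mid_ch=16, out_ch=32):
        super().__init__()
        self.conv2d = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1)
        self.conv3d = nn.Conv3d(mid_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):                # x: (B, C, D, H, W)
        b, c, d, h, w = x.shape
        # Fold the depth axis into the batch so each slice is a 2D image.
        slices = x.permute(0, 2, 1, 3, 4).reshape(b * d, c, h, w)
        feats = torch.relu(self.conv2d(slices))
        # Restore the volume layout, then apply a 3D convolution.
        feats = feats.reshape(b, d, -1, h, w).permute(0, 2, 1, 3, 4)
        return torch.relu(self.conv3d(feats))

x = torch.randn(2, 1, 16, 64, 64)        # a toy CT patch
print(TwoPointFiveDStem()(x).shape)      # torch.Size([2, 32, 16, 64, 64])
```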
Although previous methods have achieved remarkable progress, some issues remain: 1) The 2.5D methods and the ensembles of 2D and 3D methods still suffer from limited representation ability, since their 2D phases only extract features from two axes while ignoring the information from the third axis, which worsens the final segmentation prediction. Also, the two-stage designs are difficult to train end-to-end and require more computational resources. 2) Transformer-like models incur a higher computational cost for self-attention and thus have limited applications in 3D scenarios. Moreover, these models only learn the attentive interactions between patches, ignoring the local patterns inside each patch. 3) In addition, the imbalance between target and background has been largely ignored, even though it is vital for 3D medical segmentation. As shown in Fig. 1, on the MSD challenge and BTCV datasets, the majority of tumour and small organ targets are smaller than 0.6% of the whole 3D scan, with various shapes.

Fig. 2. In our APAUNet, we first project the 3D features onto three orthogonal 2D planes (sagittal, axial, and coronal attention over the D, H, and W axes) to capture local contextual attention from the three 2D perspectives, and then fuse them with the original 3D features. Finally, we adaptively fuse the features by weighted summation.
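Below is a minimal PyTorch sketch of the pipeline in Fig. 2, under our own simplifying assumptions: the 3D features are projected by average pooling along one axis, a lightweight 2D attention map is computed per view and broadcast back over the collapsed axis, and the three attended volumes are merged by learnable scalar weights. The actual APA blocks are more elaborate; the pooling, the single 3x3 convolution, and the sigmoid gating here are all placeholder choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AxisProjectionAttention(nn.Module):
    """Sketch of axis projection attention: project a 3D feature map onto
    the three orthogonal planes, estimate a 2D attention map per view,
    re-expand it over the collapsed axis, and fuse the three attended
    volumes with learnable weights. A simplified stand-in for the APA
    block, not the paper's exact module."""

    def __init__(self, channels):
        super().__init__()
        # One tiny 2D conv per view to turn a projected map into attention.
        self.att = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1)
             for _ in range(3)]
        )
        self.gamma = nn.Parameter(torch.ones(3) / 3)  # learnable fusion weights

    def forward(self, x):                     # x: (B, C, D, H, W)
        outs = []
        for i, axis in enumerate((2, 3, 4)):  # collapse D, H, W in turn
            proj = x.mean(dim=axis)           # orthogonal 2D projection
            att = torch.sigmoid(self.att[i](proj))
            att = att.unsqueeze(axis)         # broadcast back over the axis
            outs.append(x * att)              # hybridize 2D attention with 3D features
        # Weighted summation over the sagittal/axial/coronal branches.
        return sum(g * o for g, o in zip(self.gamma, outs))

feat = torch.randn(1, 8, 32, 32, 32)
print(AxisProjectionAttention(8)(feat).shape)  # torch.Size([1, 8, 32, 32, 32])
```

The learnable weights let the network discover which view is most informative for a given dataset, which is the "adaptively learn the importance of different perspectives" behaviour described above.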
In this paper, we propose an Axis Projection Attention (APA) UNet, named APAUNet, which utilizes an orthogonal projection strategy and a dimension hybridization strategy to overcome the aforementioned challenges. Specifically, our APAUNet follows the established design of 3D-UNet but replaces its main functional component, the 3D convolution based encoder/decoder layers, with our APA encoder/decoder modules. In the APA encoder, the initial 3D feature maps are projected onto three orthogonal 2D planes, i.e., the sagittal, axial, and coronal views. Such a projection operation mitigates the loss of critical information for small lesions in 3D scans. For instance, the foreground-background area ratio of the original 3D features is O(1/n^3) before the projection, but after projection the ratio is promoted to O(1/n^2): a 10x10x10 lesion in a 100^3 scan covers only 0.1% of the volume, yet roughly 1% of each projected 2D plane. Afterwards, we extract local contextual 2D attention along the projected features to perform asymmetric feature extraction, and fuse it with the original 3D features. Eventually, the fused features of the three axes are summed into the final output with three learnable factors, as shown in Fig. 2. Correspondingly, our APA decoder follows the same philosophy as the APA encoder but takes input features from two resolution levels. In this way, the decoder can effectively leverage the contextual information of multi-scale features. Furthermore, we also utilize an oversampling strategy to ensure the occurrence of foreground in each batch during the training process.
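The oversampling strategy is not detailed here; the snippet below shows one common way to realize it (our assumption, similar in spirit to nnU-Net-style foreground oversampling): with some probability, a training patch is re-centered on a random foreground voxel, so lesion voxels appear regularly in every batch. The function name, the patch size, and fg_prob are hypothetical.

```python
import numpy as np

def sample_patch(volume, label, patch=(96, 96, 96), fg_prob=0.5, rng=np.random):
    """Crop a training patch, forcing it to contain foreground with
    probability fg_prob. A hypothetical helper, not the paper's exact
    sampler; assumes the volume is at least patch-sized per axis."""
    D, H, W = volume.shape
    fg = np.argwhere(label > 0)               # coordinates of lesion voxels
    if len(fg) > 0 and rng.rand() < fg_prob:
        center = fg[rng.randint(len(fg))]     # center the crop on a lesion voxel
    else:
        center = [rng.randint(s) for s in (D, H, W)]
    starts = [int(np.clip(c - p // 2, 0, s - p))  # keep the crop inside bounds
              for c, p, s in zip(center, patch, (D, H, W))]
    sl = tuple(slice(st, st + p) for st, p in zip(starts, patch))
    return volume[sl], label[sl]
```

Here fg_prob controls the oversampling rate; guaranteeing foreground in each batch keeps the loss signal for tiny targets from vanishing under the severe class imbalance discussed above.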
In summary, our contributions are threefold: (1) We propose the Axis Projection Attention UNet. APAUNet utilizes the orthogonal projection strategy to enhance asymmetric projection attention and feature extraction. (2) We introduce a novel dimension hybridization strategy to fuse 2D and 3D attention maps for better contextual representation in both encoder and decoder blocks. Besides, we further leverage a multi-resolution fusion strategy in the decoder blocks for context enhancement. (3) Extensive experiments on the Synapse multi-organ segmentation (BTCV) [9] and Medical Segmentation Decathlon (MSD) challenge [10] datasets demonstrate the effectiveness and efficiency of our APAUNet, especially on small targets.
2 Related Work
2.1 CNN-based Medical Image Segmentation
CNNs, serving as the standard models for medical image segmentation, have been extensively studied in the past. The typical U-shaped network, U-Net [1], which consists of a symmetric encoder-decoder network with skip-connections, has become a common choice for medical image analysis. Afterwards, different variations of U-Net were proposed, such as Res-UNet [11] and Dense-UNet [12]. Besides, there are also some studies using AutoML to search for UNet architectures or an ensemble of 2D and 3D features, e.g., C2FNAS [13] uses a two-stage NAS to search for the 3D architecture, and [4] utilizes a meta learner to learn the ensemble of 2D and 3D features. Although these architectures have achieved remarkable progress in various 2D and 3D medical image segmentation tasks, they lack the capability to learn global context and long-range spatial dependencies, even when followed by down-sampling operations. This leads to degraded performance on the challenging task of small lesion segmentation.
2.2 Attention Mechanism for Medical Imaging
Attention mechanisms have been widely applied to segmentation networks and can be categorized into two branches. The first branch is hard attention, which typically uses a coarse-to-fine framework for segmentation tasks. [2] exploited two parallel FCNs to first detect the ROI of the input features, then conducted fine-grained segmentation over the cropped ROI patches for volumetric medical image segmentation. RA-UNet [14] introduced a residual attention module that adaptively combines multi-level features, which precisely extracts the liver region and then segments tumours within it. However, these hard attention methods usually need extensive trainable parameters and can be difficult to converge, which makes them inefficient for 3D medical segmentation tasks. The second branch is the adoption of the self-attention mechanism. One of the early attempts was the Attention U-Net [15], which utilized an attention gate to suppress irrelevant regions of the feature map while highlighting salient features. UTNet [6] adopted an efficient self-attention encoder and decoder to alleviate the computational cost for 2D medical image segmentation. UNETR [7] further employed a pure transformer by introducing a multi-head self-attention mechanism into the 3D-UNet structure, taking advantage of both Transformers and