APAUNet: Axis Projection Attention UNet for Small Target in 3D Medical Segmentation
Yuncheng Jiang1,2,3,*, Zixun Zhang1,2,3,*, Shixi Qin1,2,3, Yao Guo4, Zhen Li2,1,3,†, and Shuguang Cui2,1,5

1 FNii, CUHK-Shenzhen, Guangdong, China
{yunchengjiang@link.,zixunzhang@link.,lizhen@}cuhk.edu.cn
2 SSE, CUHK-Shenzhen, Guangdong, China
3 SRIBD, CUHK-Shenzhen, Guangdong, China
4 Shanghai Jiao Tong University, Shanghai, China
5 Pengcheng Laboratory, Shenzhen, Guangdong, China

* Equal contributions. Code is available at github.com/zx33/APAUNet.
† Corresponding Author
Abstract. In 3D medical image segmentation, small target segmentation is crucial for diagnosis but still faces challenges. In this paper, we propose the Axis Projection Attention UNet, named APAUNet, for 3D medical image segmentation, especially for small targets. Considering the large proportion of the background in the 3D feature space, we introduce a projection strategy that projects the 3D features onto three orthogonal 2D planes to capture contextual attention from different views. In this way, we can filter out redundant feature information and mitigate the loss of critical information for small lesions in 3D scans. We then utilize a dimension hybridization strategy to fuse the 3D features with attention from different axes and merge them by a weighted summation to adaptively learn the importance of different perspectives. Finally, in the APA decoder, we concatenate both high- and low-resolution features in the 2D projection process, thereby obtaining more precise multi-scale information, which is vital for small lesion segmentation. Quantitative and qualitative experimental results on two public datasets (BTCV and MSD) demonstrate that our proposed APAUNet outperforms other methods. Concretely, our APAUNet achieves an average Dice score of 87.84 on BTCV, 84.48 on MSD-Liver, and 69.13 on MSD-Pancreas, and significantly surpasses the previous SOTA methods on small targets.

Keywords: 3D medical segmentation · Axis Projection Attention.
1 Introduction
Medical image segmentation, which aims to automatically and accurately diagnose lesion and organ regions in either 2D or 3D medical images, is one of the critical steps in developing image-guided diagnostic and surgical systems. In practice, compared to large targets such as organs, small targets like tumors or polyps are more important for diagnosis, but are also more prone to being ignored. In this paper, we focus on 3D medical image segmentation (CT/MRI), with an emphasis on small lesions. This task is challenging mainly due to the following two aspects: 1) severe class imbalance between the foreground (lesions) and the background (entire 3D scans); 2) large variances in the shape, location, and size of organs/lesions.

Fig. 1. Target shape samples and size distribution of the MSD and Synapse multi-organ segmentation datasets. (a) Six example organs: BTCV-RightKidney, BTCV-Gallbladder, BTCV-Pancreas, BTCV-LeftKidney, MSD-Liver (tumour), and MSD-Pancreas (tumour). (b) The target size distribution: the x-axis is the target size interval, and the y-axis is the proportion (%) of corresponding samples in the whole dataset. The left part shows the relative proportion (%) of the target size to the whole input, while the right part shows the absolute size of the target with an interval step of 32 voxels. It can be observed that the relative target sizes of most samples in all six categories are less than 0.6%, with various shapes.
Recent progress in medical image segmentation has mainly been based on UNet [1], which applies a U-shaped structure with skip-connections to merge multi-scale features. However, due to the inductive bias towards locality of convolutions, U-shaped networks still suffer from limited representation ability. Some studies utilized a coarse-to-fine segmentation framework [2]. These approaches refine the final segmentation in the fine stage by shrinking the input features to the region of interest (ROI) predicted in the coarse stage. Also, instead of using vanilla 3D convolutions, some works explored a 2.5D fashion [3], performing 2D convolutions over the xy-plane at the low-level layers of the network and 3D convolutions at the high-level layers. Other works attempted to use an ensemble of 2D and 3D strategies, which fuses 2D predictions from different views with 3D predictions to get better results [4], or refines 2D predictions using 3D convolutions [5]. Besides, inspired by the great success of Transformers, some works explored the feasibility of applying self-attention to medical images by integrating CNN-based architectures with Transformer-like modules [6,7,8] to capture patch-level contextual information.
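To make the 2.5D fashion mentioned above concrete, the sketch below illustrates the general idea only, not the exact architecture of [3]: 2D convolutions are applied slice-wise along the depth axis in the early layers, while later layers use full 3D convolutions. All module names and channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoPointFiveDStem(nn.Module):
    """Illustrative 2.5D block: 2D convs on each axial (xy) slice at the
    low-level stage, followed by a 3D conv at the high-level stage.
    Channel sizes are arbitrary placeholders, not taken from [3]."""

    def __init__(self, in_ch=1, mid_ch=16, out_ch=32):
        super().__init__()
        self.conv2d = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1)
        self.conv3d = nn.Conv3d(mid_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):                # x: (B, C, D, H, W)
        b, c, d, h, w = x.shape
        # Fold the depth axis into the batch so each slice is a 2D image.
        slices = x.permute(0, 2, 1, 3, 4).reshape(b * d, c, h, w)
        feats = torch.relu(self.conv2d(slices))
        # Restore the volume layout, then apply a 3D convolution.
        feats = feats.reshape(b, d, -1, h, w).permute(0, 2, 1, 3, 4)
        return torch.relu(self.conv3d(feats))

x = torch.randn(2, 1, 16, 64, 64)        # a toy CT patch
print(TwoPointFiveDStem()(x).shape)      # torch.Size([2, 32, 16, 64, 64])
```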
Although previous methods have achieved remarkable progress, some issues remain: 1) The 2.5D methods and the ensembles of 2D and 3D methods still suffer from limited representation ability, since their 2D phases only extract features from two axes while ignoring the information from the third axis, which worsens the final segmentation prediction. Also, the two-stage designs are difficult to train end-to-end and require more computational resources. 2) Transformer-like models incur a higher computational cost for self-attention and thus have limited applications in 3D scenarios. Moreover, these models only learn the attentive interactions between patches, ignoring the local patterns inside each patch. 3) In addition, the imbalance between target and background has been largely ignored, even though it is vital for 3D medical segmentation. As shown in Fig. 1, on the MSD challenge and BTCV datasets, the majority of tumour and small organ targets are smaller than 0.6% of the whole 3D scan, with various shapes.

Fig. 2. In our APAUNet, we first project the 3D features onto three orthogonal 2D planes (sagittal, axial, and coronal attention over the D, H, and W axes) to capture local contextual attention from the three 2D perspectives, and then fuse them with the original 3D features. Finally, we adaptively fuse the features by weighted summation.
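Below is a minimal PyTorch sketch of the pipeline in Fig. 2, under our own simplifying assumptions: the 3D features are projected by average pooling along one axis, a lightweight 2D attention map is computed per view and broadcast back over the collapsed axis, and the three attended volumes are merged by learnable scalar weights. The actual APA blocks are more elaborate; the pooling, the single 3x3 convolution, and the sigmoid gating here are all placeholder choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AxisProjectionAttention(nn.Module):
    """Sketch of axis projection attention: project a 3D feature map onto
    the three orthogonal planes, estimate a 2D attention map per view,
    re-expand it over the collapsed axis, and fuse the three attended
    volumes with learnable weights. A simplified stand-in for the APA
    block, not the paper's exact module."""

    def __init__(self, channels):
        super().__init__()
        # One tiny 2D conv per view to turn a projected map into attention.
        self.att = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1)
             for _ in range(3)]
        )
        self.gamma = nn.Parameter(torch.ones(3) / 3)  # learnable fusion weights

    def forward(self, x):                     # x: (B, C, D, H, W)
        outs = []
        for i, axis in enumerate((2, 3, 4)):  # collapse D, H, W in turn
            proj = x.mean(dim=axis)           # orthogonal 2D projection
            att = torch.sigmoid(self.att[i](proj))
            att = att.unsqueeze(axis)         # broadcast back over the axis
            outs.append(x * att)              # hybridize 2D attention with 3D features
        # Weighted summation over the sagittal/axial/coronal branches.
        return sum(g * o for g, o in zip(self.gamma, outs))

feat = torch.randn(1, 8, 32, 32, 32)
print(AxisProjectionAttention(8)(feat).shape)  # torch.Size([1, 8, 32, 32, 32])
```

The learnable weights let the network discover which view is most informative for a given dataset, which is the "adaptively learn the importance of different perspectives" behaviour described above.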
In this paper, we propose an Axis Projection Attention (APA) UNet, named APAUNet, which utilizes an orthogonal projection strategy and a dimension hybridization strategy to overcome the aforementioned challenges. Specifically, our APAUNet follows the established design of 3D-UNet but replaces its main functional component, the 3D convolution based encoder/decoder layers, with our APA encoder/decoder modules. In the APA encoder, the initial 3D feature maps are projected onto three orthogonal 2D planes, i.e., the sagittal, axial, and coronal views. Such a projection operation mitigates the loss of critical information for small lesions in 3D scans. For instance, the foreground-background area ratio of the original 3D features is O(1/n^3) before the projection, but after projection the ratio is promoted to O(1/n^2): a 10x10x10 lesion in a 100^3 scan covers only 0.1% of the volume, yet roughly 1% of each projected 2D plane. Afterwards, we extract local contextual 2D attention along the projected features to perform asymmetric feature extraction, and fuse it with the original 3D features. Eventually, the fused features of the three axes are summed into the final output with three learnable factors, as shown in Fig. 2. Correspondingly, our APA decoder follows the same philosophy as the APA encoder but takes input features from two resolution levels. In this way, the decoder can effectively leverage the contextual information of multi-scale features. Furthermore, we also utilize an oversampling strategy to ensure the occurrence of foreground in each batch during the training process.
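The oversampling strategy is not detailed here; the snippet below shows one common way to realize it (our assumption, similar in spirit to nnU-Net-style foreground oversampling): with some probability, a training patch is re-centered on a random foreground voxel, so lesion voxels appear regularly in every batch. The function name, the patch size, and fg_prob are hypothetical.

```python
import numpy as np

def sample_patch(volume, label, patch=(96, 96, 96), fg_prob=0.5, rng=np.random):
    """Crop a training patch, forcing it to contain foreground with
    probability fg_prob. A hypothetical helper, not the paper's exact
    sampler; assumes the volume is at least patch-sized per axis."""
    D, H, W = volume.shape
    fg = np.argwhere(label > 0)               # coordinates of lesion voxels
    if len(fg) > 0 and rng.rand() < fg_prob:
        center = fg[rng.randint(len(fg))]     # center the crop on a lesion voxel
    else:
        center = [rng.randint(s) for s in (D, H, W)]
    starts = [int(np.clip(c - p // 2, 0, s - p))  # keep the crop inside bounds
              for c, p, s in zip(center, patch, (D, H, W))]
    sl = tuple(slice(st, st + p) for st, p in zip(starts, patch))
    return volume[sl], label[sl]
```

Here fg_prob controls the oversampling rate; guaranteeing foreground in each batch keeps the loss signal for tiny targets from vanishing under the severe class imbalance discussed above.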
In summary, our contributions are threefold: (1) We propose the Axis Projection Attention UNet. APAUNet utilizes the orthogonal projection strategy to enhance asymmetric projection attention and feature extraction. (2) We introduce a novel dimension hybridization strategy to fuse 2D and 3D attention maps for better contextual representation in both encoder and decoder blocks. Besides, we further leverage a multi-resolution fusion strategy in the decoder blocks for context enhancement. (3) Extensive experiments on the Synapse multi-organ segmentation (BTCV) [9] and Medical Segmentation Decathlon (MSD) challenge [10] datasets demonstrate the effectiveness and efficiency of our APAUNet, especially on small targets.
2 Related Work
2.1 CNN-based Medical Image Segmentation
CNNs, serving as the standard models for medical image segmentation, have been extensively studied in the past. The typical U-shaped network, U-Net [1], which consists of a symmetric encoder-decoder network with skip-connections, has become a common choice for medical image analysis. Afterwards, different variations of U-Net were proposed, such as Res-UNet [11] and Dense-UNet [12]. Besides, there are also some studies using AutoML to search for UNet architectures or an ensemble of 2D and 3D features, e.g., C2FNAS [13] uses a two-stage NAS to search for the 3D architecture, and [4] utilizes a meta learner to learn the ensemble of 2D and 3D features. Although these architectures have achieved remarkable progress in various 2D and 3D medical image segmentation tasks, they lack the capability to learn global context and long-range spatial dependencies, even when followed by down-sampling operations. This leads to degraded performance on the challenging task of small lesion segmentation.
2.2 Attention Mechanism for Medical Imaging
Attention mechanisms have been widely applied to segmentation networks and can be categorized into two branches. The first branch is hard attention, which typically uses a coarse-to-fine framework for segmentation tasks. [2] exploited two parallel FCNs to first detect the ROI of the input features, then conducted fine-grained segmentation over the cropped ROI patches for volumetric medical image segmentation. RA-UNet [14] introduced a residual attention module that adaptively combines multi-level features, which precisely extracts the liver region and then segments tumours within it. However, these hard attention methods usually need extensive trainable parameters and can be difficult to converge, which makes them inefficient for 3D medical segmentation tasks. The second branch is the adoption of the self-attention mechanism. One of the early attempts was the Attention U-Net [15], which utilized an attention gate to suppress irrelevant regions of the feature map while highlighting salient features. UTNet [6] adopted an efficient self-attention encoder and decoder to alleviate the computational cost for 2D medical image segmentation. UNETR [7] further employed a pure transformer by introducing a multi-head self-attention mechanism into the 3D-UNet structure, taking advantage of both Transformers and