these issues, especially the fully convolutional network (FCN)-based U-Net [19] (an encoder-decoder architecture with skip connections that preserve details and extract local visual features) and its variants [14,23,26]. Despite good progress, these methods often struggle to capture long-range relationships and global context information [2] due to the inherent inductive bias of convolutional operations. Researchers have therefore turned to ViT [5], powered by self-attention (SA), for more possibilities: TransUNet [2] first adapts ViT to medical image segmentation by attaching several transformer (multi-head SA) layers to an FCN-based encoder to better capture global context from high-level feature maps. TransFuse [25] and MedT [21] combine FCN and Transformer branches to capture global dependencies and low-level spatial details more effectively. Swin-UNet [1] is the first U-shaped network built purely on the more efficient Swin Transformer [12] and outperforms FCN-based methods. UNETR [6] and SwinUNETR [20] extend Transformer architectures to 3D inputs.
In spite of the improved performance of the aforementioned ViT-based networks, these methods rely on standard or shifted-window SA, which is fine-grained local SA and may overlook interactions between local and global contexts [24,18]. As reported in [20], even when pre-trained on a massive amount of medical data via self-supervised learning, performance on prostate segmentation from high-resolution MRI images with good soft-tissue contrast remains unsatisfactory, let alone from lower-quality CT images. Additionally, the unclear prostate boundary in CT images, a consequence of low soft-tissue contrast, is not properly addressed [7,22].
Recently, the Focal Transformer [24] was proposed for general computer vision tasks; it leverages focal self-attention to incorporate both fine-grained local and coarse-grained global interactions. Each token attends to its closest surrounding tokens at fine granularity and to tokens far away at coarse granularity; thus, focal SA can capture both short- and long-range visual dependencies efficiently and effectively.
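To make this mechanism concrete, the following is a minimal, single-head PyTorch sketch of focal-style attention. It uses a single pooled focal level with an illustrative window size and pooling factor, and omits relative position bias and multi-head projection; it is a simplification of the published focal SA, not the exact FocalUNETR implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedFocalAttention(nn.Module):
    """Didactic single-head sketch of focal self-attention.

    Each query token attends (i) at fine granularity to all tokens inside its
    local window and (ii) at coarse granularity to a pooled summary of the
    whole feature map. H and W must be divisible by `window_size` and
    `pool_size`; multiple focal levels and relative position bias are omitted.
    """

    def __init__(self, dim, window_size=7, pool_size=4):
        super().__init__()
        self.window_size = window_size
        self.pool_size = pool_size
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, H, W, C)
        B, H, W, C = x.shape
        ws = self.window_size
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Fine level: partition into non-overlapping ws x ws windows.
        def windows(t):                                     # -> (B*nW, ws*ws, C)
            t = t.view(B, H // ws, ws, W // ws, ws, C)
            return t.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

        q_win, k_fine, v_fine = windows(q), windows(k), windows(v)

        # Coarse level: average-pool the whole map and share it across windows.
        def pooled(t):                                      # -> (B, Np, C)
            t = F.avg_pool2d(t.permute(0, 3, 1, 2), self.pool_size)
            return t.flatten(2).transpose(1, 2)

        n_win = (H // ws) * (W // ws)
        k_coarse = pooled(k).repeat_interleave(n_win, dim=0)
        v_coarse = pooled(v).repeat_interleave(n_win, dim=0)

        # Attend jointly over fine (local) and coarse (global) tokens.
        k_all = torch.cat([k_fine, k_coarse], dim=1)
        v_all = torch.cat([v_fine, v_coarse], dim=1)
        attn = ((q_win * self.scale) @ k_all.transpose(-2, -1)).softmax(dim=-1)
        out = attn @ v_all                                  # (B*nW, ws*ws, C)

        # Merge windows back to the (B, H, W, C) layout.
        out = out.view(B, H // ws, W // ws, ws, ws, C)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return self.proj(out)
```

For example, an input of shape (1, 28, 28, 96) with window_size=7 and pool_size=4 yields 16 windows of 49 fine tokens, each of which additionally attends to 49 pooled coarse tokens summarizing the full map.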
Inspired by this work, we propose FocalUNETR (Focal U-NEt TRansformers), a novel focal transformer architecture for CT-based medical image segmentation (Fig. 1A). Although prior works such as Psi-Net [15] incorporate additional decoders for boundary detection and distance-map estimation, they either lack the capacity to capture global context effectively owing to their FCN-based design or overlook the importance of modeling the randomness of the boundary, which is particularly pronounced for prostate segmentation in CT images with poor soft-tissue contrast. In contrast, our approach adopts a multi-task learning strategy that applies a Gaussian kernel over the boundary of the ground-truth segmentation mask [11] to form an auxiliary boundary-aware contour regression task (Fig. 1B). This auxiliary task serves as a regularization term for the main task of generating the segmentation mask and enhances the model's generalizability by addressing the challenge of unclear boundaries in low-contrast CT images.
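As an illustration of how such an auxiliary target can be constructed, the sketch below blurs the one-pixel boundary of a binary ground-truth mask with a Gaussian kernel to obtain a soft contour map; the helper name, the choice of sigma, and the auxiliary loss weighting are hypothetical and indicate only one plausible realization of the idea.

```python
import numpy as np
from scipy.ndimage import binary_erosion, gaussian_filter

def soft_contour_target(mask: np.ndarray, sigma: float = 1.6) -> np.ndarray:
    """Turn a binary ground-truth mask into a soft boundary map.

    The one-pixel-wide boundary (mask minus its erosion) is smoothed with a
    Gaussian kernel, so values decay with distance from the true contour.
    `sigma` is an illustrative choice, not a value prescribed by the paper.
    """
    mask = mask.astype(bool)
    boundary = mask ^ binary_erosion(mask)
    soft = gaussian_filter(boundary.astype(np.float32), sigma=sigma)
    return soft / (soft.max() + 1e-8)        # normalize to [0, 1]

# Hypothetical multi-task objective: the contour regression regularizes the
# main segmentation loss, with lambda_aux an illustrative trade-off weight.
# total_loss = seg_loss(pred_mask, gt_mask) \
#              + lambda_aux * mse_loss(pred_contour, soft_contour_target(gt_mask))
```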
In this paper, we make several new contributions. First, we develop a novel
focal transformer model (FocalUNETR) for CT-based prostate segmentation,