these issues, especially the fully convolutional network (FCN)-based U-Net [19] (an encoder-decoder architecture with skip connections that preserve details and extract local visual features) and its variants [14,23,26]. Despite good progress, these methods often struggle to capture long-range relationships and global context information [2] due to the inherent inductive bias of convolutional operations. Researchers have therefore turned to ViT [5], powered by self-attention (SA), for more possibilities: TransUNet [2] first adapts ViT to medical image segmentation by attaching several transformer (multi-head SA) layers to an FCN-based encoder to better capture global context from high-level feature maps. TransFuse [25] and MedT [21] combine FCN and Transformer branches to capture global dependencies and low-level spatial details more effectively. Swin-UNet [1] is the first U-shaped network built purely on the more efficient Swin Transformer [12] and outperforms FCN-based methods. UNETR [6] and SwinUNETR [20] extend Transformer architectures to 3D inputs.
In spite of the improved performance of the aforementioned ViT-based networks, these methods rely on standard or shifted-window SA, which is fine-grained local SA and may overlook interactions between local and global contexts [24,18]. As reported in [20], even when pre-trained on a massive amount of medical data via self-supervised learning, performance on prostate segmentation from high-resolution MRI images with good soft-tissue contrast remains unsatisfactory, let alone from lower-quality CT images. Additionally, the unclear prostate boundary in CT images, a consequence of low soft-tissue contrast, is not properly addressed [7,22].
Recently, the Focal Transformer [24] was proposed for general computer vision tasks; it leverages focal self-attention to incorporate both fine-grained local and coarse-grained global interactions. Each token attends to its closest surrounding tokens at fine granularity and to tokens far away at coarse granularity; thus, focal SA can capture both short- and long-range visual dependencies efficiently and effectively.
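To make this mechanism concrete, the following is a minimal, single-head PyTorch sketch of focal-style attention. It uses a single pooled focal level with an illustrative window size and pooling factor, and omits relative position bias and multi-head projection; it is a simplification of the published focal SA, not the exact FocalUNETR implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedFocalAttention(nn.Module):
    """Didactic single-head sketch of focal self-attention.

    Each query token attends (i) at fine granularity to all tokens inside its
    local window and (ii) at coarse granularity to a pooled summary of the
    whole feature map. H and W must be divisible by `window_size` and
    `pool_size`; multiple focal levels and relative position bias are omitted.
    """

    def __init__(self, dim, window_size=7, pool_size=4):
        super().__init__()
        self.window_size = window_size
        self.pool_size = pool_size
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, H, W, C)
        B, H, W, C = x.shape
        ws = self.window_size
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Fine level: partition into non-overlapping ws x ws windows.
        def windows(t):                                     # -> (B*nW, ws*ws, C)
            t = t.view(B, H // ws, ws, W // ws, ws, C)
            return t.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

        q_win, k_fine, v_fine = windows(q), windows(k), windows(v)

        # Coarse level: average-pool the whole map and share it across windows.
        def pooled(t):                                      # -> (B, Np, C)
            t = F.avg_pool2d(t.permute(0, 3, 1, 2), self.pool_size)
            return t.flatten(2).transpose(1, 2)

        n_win = (H // ws) * (W // ws)
        k_coarse = pooled(k).repeat_interleave(n_win, dim=0)
        v_coarse = pooled(v).repeat_interleave(n_win, dim=0)

        # Attend jointly over fine (local) and coarse (global) tokens.
        k_all = torch.cat([k_fine, k_coarse], dim=1)
        v_all = torch.cat([v_fine, v_coarse], dim=1)
        attn = ((q_win * self.scale) @ k_all.transpose(-2, -1)).softmax(dim=-1)
        out = attn @ v_all                                  # (B*nW, ws*ws, C)

        # Merge windows back to the (B, H, W, C) layout.
        out = out.view(B, H // ws, W // ws, ws, ws, C)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return self.proj(out)
```

For example, an input of shape (1, 28, 28, 96) with window_size=7 and pool_size=4 yields 16 windows of 49 fine tokens, each of which additionally attends to 49 pooled coarse tokens summarizing the full map.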
Inspired by this work, we propose FocalUNETR (Focal U-NEt TRansformers), a novel focal transformer architecture for CT-based medical image segmentation (Fig. 1A). Although prior works such as Psi-Net [15] incorporate additional decoders for boundary detection and distance-map estimation, they either lack the capacity to capture global context effectively owing to their FCN-based design or overlook the importance of modeling the randomness of the boundary, which is particularly pronounced for prostate segmentation in CT images with poor soft-tissue contrast. In contrast, our approach adopts a multi-task learning strategy that applies a Gaussian kernel over the boundary of the ground-truth segmentation mask [11] to form an auxiliary boundary-aware contour regression task (Fig. 1B). This auxiliary task serves as a regularization term for the main task of generating the segmentation mask and enhances the model's generalizability by addressing the challenge of unclear boundaries in low-contrast CT images.
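As an illustration of how such an auxiliary target can be constructed, the sketch below blurs the one-pixel boundary of a binary ground-truth mask with a Gaussian kernel to obtain a soft contour map; the helper name, the choice of sigma, and the auxiliary loss weighting are hypothetical and indicate only one plausible realization of the idea.

```python
import numpy as np
from scipy.ndimage import binary_erosion, gaussian_filter

def soft_contour_target(mask: np.ndarray, sigma: float = 1.6) -> np.ndarray:
    """Turn a binary ground-truth mask into a soft boundary map.

    The one-pixel-wide boundary (mask minus its erosion) is smoothed with a
    Gaussian kernel, so values decay with distance from the true contour.
    `sigma` is an illustrative choice, not a value prescribed by the paper.
    """
    mask = mask.astype(bool)
    boundary = mask ^ binary_erosion(mask)
    soft = gaussian_filter(boundary.astype(np.float32), sigma=sigma)
    return soft / (soft.max() + 1e-8)        # normalize to [0, 1]

# Hypothetical multi-task objective: the contour regression regularizes the
# main segmentation loss, with lambda_aux an illustrative trade-off weight.
# total_loss = seg_loss(pred_mask, gt_mask) \
#              + lambda_aux * mse_loss(pred_contour, soft_contour_target(gt_mask))
```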
In this paper, we make several new contributions. First, we develop a novel
focal transformer model (FocalUNETR) for CT-based prostate segmentation,