
signal intensity. It is difficult to differentiate the parotid tumor from this anatomical structure in such cases. In addition,
the size of parotid tumors is highly variable, ranging from less than one millimeter to several centimeters in radius.
When segmenting small tumors, the model is likely to confuse them with vascular or muscular tissue. The
comparisons in Fig. 1(e) to 1(h) highlight many anatomical structures whose intensity and shape resemble those of the
parotid tumor, making it very hard for the model to focus on the ground-truth tumor. Moreover, the parotid gland
region often contains a small amount of signal from the facial nerve, as seen in Fig. 1(e), which further increases the
difficulty of automatic segmentation. In summary, unlike that of most organs and lesions, the automatic segmentation
of parotid tumors is a challenging task that requires the introduction of prior anatomical knowledge to improve the
robustness and reliability of the model.
On the other hand, parotid tumors comprise a large number of types, and the signal intensity, morphology, and
size of the different tumor types in MRI differ considerably. It is therefore difficult for a deep learning-based model to
learn robust tumor-related features. As seen in Fig. 1(c) and 1(d), some tumors exhibit higher signal intensities,
making them difficult to distinguish from the parotid gland and to compare against other tumor types.
However, experienced radiologists achieve good consistency in parotid tumor segmentation. The critical factor
is their effective extraction and combination of multimodal image information, which enables accurate manual
segmentation. A parotid MRI examination produces multiple images, among which the most informative and
commonly used modalities are T1-weighted, T2-weighted, and STIR images. Although anatomical variations are very
common in individual MRI modalities, it is rare for the parotid anatomy and the tumor to show abnormal morphology
and signal intensity in all three modalities. The expert can therefore reach a final decision by comprehensively
observing and comparing the different modalities. In summary, a deep model for parotid tumor segmentation
needs to learn cross-modal representations from multimodal MRI and fuse the features of the three modalities
to improve its performance.
Therefore, this paper develops an anatomy-aware framework for the automatic segmentation of parotid tumors from
multimodal MRI that leverages rich anatomical prior knowledge. The framework consists of a Transformer-based
segmentation network (PT-Net) and an anatomy-aware loss function.
First, we propose PT-Net, a novel Transformer-based coarse-to-fine multimodal fusion network for parotid tumor
segmentation. The encoder of the network is built on the Transformer, while the decoder is a CNN-based architecture.
Such a design has been shown to balance local feature extraction and global information modeling. Unlike
existing multimodal fusion approaches, the encoder extracts and merges contextual information from the three
modality-specific parotid MRI sequences at different scales, which better captures cross-modality and multi-scale
tumor information. The decoder stacks the feature maps of the different modalities and calibrates the multimodal
information with a channel attention mechanism. Experiments demonstrate that our method has significant advantages
over highly competitive baseline methods in parotid tumor segmentation.
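To make the fusion design concrete, the following is a minimal PyTorch sketch of stacking same-scale modality features and recalibrating them with channel attention. It is an illustration under our reading of the description, not the exact PT-Net implementation; all module names are hypothetical, and a 2D squeeze-and-excitation block stands in for the paper's attention mechanism.

```python
# Hypothetical sketch of channel-attention fusion over stacked modality features;
# not the authors' PT-Net code.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style recalibration of stacked modality features."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze: global context per channel
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # excitation: per-channel weights
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.fc(x)


class MultimodalFusionBlock(nn.Module):
    """Stack same-scale features from T1, T2, and STIR, then recalibrate and merge."""

    def __init__(self, channels: int):
        super().__init__()
        self.attn = ChannelAttention(3 * channels)
        self.merge = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, f_t1, f_t2, f_stir):
        stacked = torch.cat([f_t1, f_t2, f_stir], dim=1)  # channel-wise stacking
        return self.merge(self.attn(stacked))             # attention-weighted fusion


if __name__ == "__main__":
    # Same-scale feature maps from three modality-specific encoder streams.
    f_t1, f_t2, f_stir = (torch.randn(1, 64, 32, 32) for _ in range(3))
    fused = MultimodalFusionBlock(64)(f_t1, f_t2, f_stir)
    print(fused.shape)  # torch.Size([1, 64, 32, 32])
```

In this sketch, one such block would be applied at each decoder scale, so the fusion proceeds from coarse to fine feature maps.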
Second, this paper presents the anatomy-aware loss, which guides the deep model to distinguish parotid anatomical
structures from tumors. Considering that segmentation models are prone to being misled into wrong predictions by
irrelevant anatomy, we develop this novel distance-based loss function. In contrast to previous methods (Kervadec
et al., 2019; Karimi and Salcudean, 2019), the anatomy-aware loss computes the distance between the center
coordinates of the binary masks of the model prediction and the ground truth. This loss function can therefore force
the model to recognize anatomical structures far from the ground truth and predict the correct tumor location.
It is worth noting that, compared with other distance-based loss functions applied to medical image segmentation, our
anatomy-aware loss does not require additional computation and has high training stability.
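Concretely, a minimal sketch of this centroid-distance term is given below, under our reading of the description; the authors' exact formulation may differ. A soft, probability-weighted centroid is used here so that the term stays differentiable when the prediction is a probability map.

```python
# Hypothetical sketch of a centroid-distance penalty between predicted and
# ground-truth masks; not the authors' exact anatomy-aware loss.
import torch


def centroid(mask: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Probability-weighted center of mass of a (H, W) mask."""
    h, w = mask.shape
    ys = torch.arange(h, dtype=mask.dtype, device=mask.device)
    xs = torch.arange(w, dtype=mask.dtype, device=mask.device)
    total = mask.sum() + eps
    cy = (mask.sum(dim=1) * ys).sum() / total  # weighted mean row index
    cx = (mask.sum(dim=0) * xs).sum() / total  # weighted mean column index
    return torch.stack([cy, cx])


def anatomy_aware_term(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Euclidean distance between the mask centers: large when the prediction
    is centered on anatomy far from the ground-truth tumor."""
    return torch.norm(centroid(pred) - centroid(target))
```

Such a term would typically be added to a standard overlap loss (e.g., Dice) with a weighting coefficient, penalizing predictions whose center drifts onto tumor-like anatomy.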
Based on experimental results on MRI scans of 187 parotid tumor patients, we demonstrate the effectiveness of
the proposed PT-Net and the anatomy-aware loss. This approach has the potential to reduce the annotation burden
associated with large-scale parotid tumor image datasets and to mitigate the limited availability of high-quality
labels provided by experienced radiologists.
The main contributions of this paper are summarized as follows:
1. We study automatic parotid tumor segmentation for the first time and propose a segmentation framework with
high performance and robustness.
2. We propose a Transformer-based segmentation network that fuses multimodal information from coarse to fine.
The proposed PT-Net captures compact and high-level tumor features through the self-attention mechanism.
3. This work presents the anatomy-aware loss function, which exploits prior anatomical knowledge in parotid MRI
to reduce segmentation errors caused by anatomical structures that resemble tumors.