
building masks in SAR images. Li et al. [28] designed a multi-
modal cross attention network (MCANet) to extract multi-
scale attention maps by fusing SAR and EO images. Cha
et al. [4] formulated multi-modal representation learning in
contrastive multi-view coding by considering three modalities
(i.e., EO image, SAR image, and label mask) as different
data augmentation techniques. Jain et al. [20] proposed a
self-supervised method to learn invariant feature embeddings
between SAR images and multi-spectral images.
It is worth noting that collecting pairs of SAR and EO images for training these multi-modal segmentation methods offline is not difficult. However, when these multi-modal methods are tested in many real cases (e.g., night, cloud cover, etc.) where only SAR images but no clear EO images can be captured, their performance degrades significantly, as illustrated in Fig. 1 and discussed in Section I. Unlike these multi-modal segmentation methods in the literature, the proposed HFD-Net in this work focuses on a novel segmentation configuration, where pairs of SAR and EO images are used for network training, but only SAR images without EO images are used for testing.
III. METHODOLOGY
In this section, we propose the Heterogeneous Feature
Distillation Network (HFD-Net) for SAR image semantic
segmentation, where the heterogeneous feature distillation
model is explored for heterogeneous feature transfer and the
heterogeneous feature alignment module is explored for multi-
scale feature aggregation. Firstly, the architecture of the HFD-
Net is described. Then, we present the heterogeneous feature
distillation model and the heterogeneous feature alignment
module respectively in detail. Finally, the model training and
total loss function are introduced.
A. Architecture
The HFD-Net, whose architecture is shown in Fig. 2,
consists of a pre-trained teacher model for segmenting EO
images, a student model for segmenting SAR images, and
a designed heterogeneous feature distillation model (HFDM)
for transferring latent EO features from the teacher model
to the student model. The teacher model takes EO images
as its inputs, while the student model takes SAR images as
its inputs. The two models share an identical architecture, composed of a backbone segmentation network and a designed heterogeneous feature alignment module (HFAM) for multi-scale feature aggregation, but have different parameter configurations. Here, we simply use DeepLabv3+ [7] as the
backbone segmentation network, which consists of a typical
ResNet-101 [13] encoder and a three-block decoder.
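This two-branch setup can be sketched schematically as follows. The code is a toy stand-in, not the actual DeepLabv3+/HFAM implementation: one architecture definition is instantiated twice, so the teacher and student share structure but keep independent parameters.

```python
import numpy as np

def make_segmenter(rng, n_classes=4, feat_ch=16):
    """Stand-in for the shared architecture (DeepLabv3+ with a ResNet-101
    encoder and a three-block decoder in the paper); every call returns a
    fresh, independent parameter set."""
    return {
        "encoder": rng.normal(size=(3, feat_ch)),          # toy encoder weights
        "decoder": rng.normal(size=(feat_ch, n_classes)),  # toy decoder weights
    }

rng = np.random.default_rng(0)
teacher = make_segmenter(rng)  # EO branch; frozen after pre-training
student = make_segmenter(rng)  # SAR branch; trained with distillation
```

The key design point is that the branches are structurally identical but parametrically independent, so the teacher can be frozen while the student is trained.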
At the training stage, the teacher model, whose inputs are EO scene images, is first trained on an EO image segmentation task, and it is expected to extract EO features that preserve the semantic information of the input scene images. The parameters of the teacher model are fixed after it has been trained. Then, the student model, with SAR images as inputs, is trained on a SAR image segmentation task. During the training of the student model, both the ground-truth segmentation maps and the learned EO features from the teacher model are jointly utilized as supervision to guide the student model to learn heterogeneous features, by simultaneously minimizing the basic segmentation loss $L_S$ and the designed heterogeneous distillation loss $L_D$.
At the testing stage, only the student model is used to segment an arbitrary input SAR image without its corresponding EO image. In the following subsections, the HFDM and the HFAM are introduced respectively.
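The joint supervision of the student can be sketched in NumPy. This is only an illustration: the exact forms of $L_S$ and $L_D$ are defined later in the paper, so here pixel-wise cross-entropy and a KL-style divergence over temperature-softened feature maps are assumed, and the weighting factor is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def segmentation_loss(logits, labels):
    """Assumed L_S: pixel-wise cross-entropy; logits (K, H, W), labels (H, W)."""
    p = softmax(logits, axis=0)
    h, w = labels.shape
    picked = p[labels, np.arange(h)[:, None], np.arange(w)]  # prob. of true class
    return -np.log(picked + 1e-12).mean()

def distillation_loss(q_s, q_t, T=2.0):
    """Assumed L_D: KL divergence between temperature-softmaxed teacher and
    student feature maps (an illustrative choice, not the paper's exact form)."""
    ps = softmax(q_s / T, axis=0)
    pt = softmax(q_t / T, axis=0)
    return (pt * (np.log(pt + 1e-12) - np.log(ps + 1e-12))).sum(axis=0).mean()

# Toy tensors standing in for network outputs during student training.
seg_logits = rng.normal(size=(4, 8, 8))      # student segmentation logits
labels     = rng.integers(0, 4, size=(8, 8)) # ground-truth segmentation map
q_s        = rng.normal(size=(16, 8, 8))     # student D3 features
q_t        = rng.normal(size=(16, 8, 8))     # frozen-teacher D3 features

lam = 1.0  # hypothetical balancing weight
total = segmentation_loss(seg_logits, labels) + lam * distillation_loss(q_s, q_t)
```

Only `total` would be minimized with respect to the student parameters; the teacher features `q_t` act purely as fixed supervision.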
B. Heterogeneous Feature Distillation Model
The heterogeneous feature distillation model (HFDM) is designed to transfer latent EO features, which preserve semantic information, from the teacher model to the student model for segmenting SAR images. It is noted that the teacher and student models in existing knowledge distillation techniques [15], [41] generally deal with an identical task on homogeneous images. Unlike these works, the teacher and student models in the HFD-Net focus on two similar but different tasks (one is EO image segmentation, while the other is SAR image segmentation). Hence, we design a special architecture with a heterogeneous distillation loss term for the HFDM so that the EO knowledge from the teacher model can be distilled to the SAR-segmentation student model, as shown in Fig. 3.
As seen from Fig. 3, the HFDM consists of two Sigmoid operators, two Tsoftmax operators and a designed heterogeneous distillation loss $L_D$. Given an input pair of EO and SAR images, we denote the extracted set of EO feature maps from the $D_3$ block in the teacher model as $q^t = \{q^t_c \mid q^t_c \in \mathbb{R}^{H \times W},\, c = 1, 2, \ldots, C_q\}$, where $C_q$ is the number of channels in the third decoder block $D_3$ and $\{H, W\}$ is the size of each feature map $q^t_c$. Similarly, we also denote the extracted set of SAR feature maps from the $D_3$ block in the student model as $q^s = \{q^s_c \mid q^s_c \in \mathbb{R}^{H \times W},\, c = 1, 2, \ldots, C_q\}$. The HFDM is used to enforce the student model to output SAR features $q^s$ that are as similar as possible to the extracted EO features $q^t$ from the teacher model.
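The channel-wise matching that the HFDM performs on these feature sets (Sigmoid normalization followed by a temperature-scaled softmax across the $C_q$ channels, as detailed below) can be sketched in NumPy; the shapes are illustrative:

```python
import numpy as np

def tsoftmax(feats, T):
    """Channel-wise temperature softmax: feats has shape (C, H, W).
    At every pixel, the C channel responses are turned into a probability
    distribution; T = 1 recovers the plain softmax."""
    z = feats / T
    z = z - z.max(axis=0, keepdims=True)  # numerical stability (softmax-invariant)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def hfdm_probability_maps(q, T=2.0):
    """Sigmoid-normalize each feature map into (0, 1), then apply the
    temperature softmax across channels to get pixel-level probability maps.
    Applied identically to teacher features q^t and student features q^s."""
    q_bar = 1.0 / (1.0 + np.exp(-q))  # the Sigmoid operator
    return tsoftmax(q_bar, T)
```

Because the same two operators are applied to both branches, the resulting probability maps are directly comparable by the distillation loss.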
Firstly, two identical Sigmoid operators are implemented to normalize the elements of each feature map $q_c$ (indicating both $q^t_c$ and $q^s_c$) into $(0, 1)$ respectively. Then, each normalized feature map $\bar{q}_c$ (indicating both $\bar{q}^t_c$ in $\bar{q}^t$ and $\bar{q}^s_c$ in $\bar{q}^s$) is transformed into a pixel-level probability map $\hat{q}_c$ (indicating both $\hat{q}^t_c$ in $\hat{q}^t$ and $\hat{q}^s_c$ in $\hat{q}^s$) respectively by implementing the following Tsoftmax operator:

$$\hat{q}_c = \mathrm{Tsoftmax}(\bar{q}_c; T) = \frac{\exp(\bar{q}_c / T)}{\sum_{j=1}^{C_q} \exp(\bar{q}_j / T)} \qquad (1)$$
where $T$ is a preset temperature constant (the Tsoftmax function degrades to the commonly used Softmax function when $T$ is set to 1, as illustrated in [15]). Finally, the heterogeneous distillation loss $L_D$, which measures the difference between the obtained pixel-level probability maps, is designed to distill latent features from the EO-segmentation teacher model to the SAR-segmentation student model. The