understudied for semantic segmentation, with just a handful
of existing works [14, 16, 26]. Prior works either fail to
fully address target-target alignment [26] or tackle it only
at the high computational cost of explicit style-transfer
[14, 16]. We argue that explicit interactions
between a pair of target domains are essential in MTDA for
minimizing the domain gap across target domains.
To this end, in this paper we present a novel MTDA
framework for semantic segmentation that employs a self-
training strategy based on pseudo-labeling to induce bet-
ter synergy between different domains. Self-training is a
widely used technique that compares different predictions
obtained from a single image to enforce consistency in the
network's predictions. In our proposed method,
illustrated in Fig. 1 (a), we use an original image from one
target domain (yellow box) as the view that generates
the pseudo-label, while the second prediction is obtained
from the very same target image stylized with an image
from a different target domain (green box). Given this
stylized view, the network is then trained to predict the
pseudo-label obtained from the original view.
Unlike [14], we use implicit stylization that does not need
any externally trained style-transfer network, making our
self-training end-to-end. Self-training not only helps the
network improve the quality of its representations but also
implicitly aligns target-target pairs through cross-domain
interactions.
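To make the mechanism concrete, the minimal NumPy sketch below illustrates the two ingredients described above: a channel-statistics swap in feature space as the implicit stylization, and a cross-entropy between the stylized-view prediction and the pseudo-label from the original view. The function names, tensor shapes, and the specific AdaIN-style statistic swap are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def stylize_features(content, style, eps=1e-5):
    """Implicit feature-space stylization (assumed AdaIN-style swap):
    re-normalize the content feature map with the channel-wise
    mean/std of a feature map from another target domain.
    content, style: (C, H, W) -> stylized features (C, H, W)."""
    c_mu = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mu = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    return (content - c_mu) / c_std * s_std + s_mu

def pseudo_label(logits):
    """Hard pseudo-label from the prediction on the original view.
    logits: (K, H, W) -> label map (H, W)."""
    return logits.argmax(axis=0)

def consistency_loss(stylized_logits, labels):
    """Cross-entropy between the prediction on the stylized view
    and the pseudo-label obtained from the original view."""
    K, H, W = stylized_logits.shape
    z = stylized_logits - stylized_logits.max(axis=0, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    return -log_p[labels, np.arange(H)[:, None], np.arange(W)].mean()
```

In this sketch the stylized features keep the spatial layout (and hence the pixel labels) of the original image while adopting the other target domain's channel statistics, which is what lets the pseudo-label from the original view supervise the stylized view.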
While our proposed self-training is well-suited for
MTDA, it can still be susceptible to noisy pseudo-labels. To
prevent the network from overfitting to noisy pseudo-labels
when the domain-shift is large, we devise a cross-domain
cooperative rectification strategy that captures the disagree-
ment in predictions from different classifiers. Specifi-
cally, our proposed method uses the predictions from mul-
tiple domain-specific classifiers to estimate the quality of
pseudo-labels (see Fig. 1 (b)), which are then weighted
accordingly during self-training. Thus, interactions be-
tween all the target domains are further leveraged with
our proposed framework, which we call Co-operative Self-
Training (CoaST) for MTDA.
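As a sketch of this rectification idea, the NumPy snippet below derives per-pixel weights from the disagreement between two domain-specific classifiers and uses them to down-weight unreliable pseudo-labels. The symmetric-KL disagreement measure and the exp(-d) weighting are our own plausible choices for illustration, not necessarily the exact formulation used in CoaST.

```python
import numpy as np

def softmax(logits, axis=0):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def rectification_weights(logits_a, logits_b, eps=1e-8):
    """Per-pixel weights in (0, 1] from the disagreement between two
    domain-specific classifiers: symmetric KL divergence mapped
    through exp(-d) (one plausible instantiation).
    logits_*: (K, H, W) -> weights (H, W)."""
    p = softmax(logits_a)
    q = softmax(logits_b)
    kl_pq = (p * np.log((p + eps) / (q + eps))).sum(axis=0)
    kl_qp = (q * np.log((q + eps) / (p + eps))).sum(axis=0)
    return np.exp(-0.5 * (kl_pq + kl_qp))

def weighted_self_training_loss(log_probs, labels, weights):
    """Pseudo-label cross-entropy in which pixels with high
    classifier disagreement contribute less. log_probs: (K, H, W)."""
    H, W = labels.shape
    nll = -log_probs[labels, np.arange(H)[:, None], np.arange(W)]
    return (weights * nll).sum() / (weights.sum() + 1e-8)
```

When the two classifiers agree, the weights approach 1 and the pixel is trained on normally; pixels where the domain-specific classifiers disagree are softly suppressed rather than hard-thresholded.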
Contributions. In summary, our contributions are threefold:
(i) We propose a self-training approach for MTDA
that synergistically combines pseudo-labeling and feature
stylization to induce better cooperation between domains;
(ii) To reduce the impact of noisy pseudo-labels in self-
training, we propose cross-domain cooperative objective
rectification that uses predictions from multiple domain-
specific classifiers for better estimating the quality of
pseudo-labels; and (iii) We conduct experiments on several
standard MTDA benchmarks and advance the state-of-the-
art performance by non-trivial margins.
2. Related Works
Our proposed method is most closely related to self-training
and style-transfer, which we discuss below.
Self-training for Domain Adaptation. Self-training in
single-target domain adaptation (STDA) is a popular tech-
nique that involves generating pseudo-labels for the unla-
beled target data and then iteratively training the model
on the most confident labels. To that end, a plethora of
UDA methods for semantic segmentation have been pro-
posed [15, 17, 19, 36, 43, 44, 48] that use self-training due
to its efficiency and simplicity. However, due to the char-
acteristic error-prone nature of the pseudo-labeling strategy,
the pseudo-labels cannot always be trusted and need a selec-
tion or correction mechanism. Most self-training methods
differ in the manner in which the pseudo-labels are gener-
ated and selected. For instance, Zou et al. [48] proposed
a class-balanced self-training strategy and used spatial pri-
ors, whereas in [41, 42] class-dependent centroids are em-
ployed to generate pseudo-labels. Most relevant to our ap-
proach are self-training methods [27, 43, 44] that rectify
the pseudo-labels by measuring the uncertainty in predic-
tions. Our proposed CoaST also draws inspiration from
the STDA method [44], but instead of ad-hoc auxiliary clas-
sifiers, we use different stylized versions of the same image
and different target domain-specific classifiers, to compute
the rectification weights. The majority of STDA self-
training methods do not trivially allow target-target interac-
tions, which are crucial for MTDA.
Style-Transfer for Domain Adaptation. Style-transfer is
another popular technique in STDA that relies on transfer-
ring style (appearance) to make a source-domain image look
like a target image, or vice versa. Assuming the semantic
content of the image, and hence its pixel labels, remains
unchanged during stylization, target-like source images can
be used to train a model for the target domain.
Thus, the main task becomes modeling the style and con-
tent in an image through an encoder-decoder-like network.
In the context of STDA in semantic segmentation, Hoff-
man et al. [13] proposed CyCADA, which incorporates cyclic
reconstruction and semantic consistency to learn a classi-
fier for the target data. Inspired by CyCADA, a multitude
of STDA methods [2, 5, 18, 20, 30, 37, 39, 47] have been
proposed which use style-transfer in conjunction with other
techniques. Learning a good encoder-decoder style-transfer
network introduces additional training overhead, and its
success is greatly limited by the reconstruction quality. Al-
ternatively, style-transfer can be performed in the feature
space of the encoder without explicitly generating the styl-
ized image [29, 46]. CrossNorm [29] explores this solution
in the context of domain generalization to learn robust fea-
tures. In CoaST, we adapt CrossNorm to our self-training
mechanism by transferring style across target domains to