Cooperative Self-Training for Multi-Target Adaptive Semantic Segmentation
Yangsong Zhang1,3, Subhankar Roy2,4, Hongtao Lu3, Elisa Ricci2,4, Stéphane Lathuilière1
1LTCI, Télécom Paris, Institut Polytechnique de Paris 2University of Trento, Trento, Italy
3Shanghai Jiao Tong University, Shanghai, China 4Fondazione Bruno Kessler, Trento, Italy
yangsong.zhang.zys@gmail.com
Abstract
In this work we address multi-target domain adaptation
(MTDA) in semantic segmentation, which consists in adapt-
ing a single model from an annotated source dataset to mul-
tiple unannotated target datasets that differ in their under-
lying data distributions. To address MTDA, we propose a
self-training strategy that employs pseudo-labels to induce
cooperation among multiple domain-specific classifiers. We
employ feature stylization as an efficient way to generate
image views that form an integral part of self-training. Ad-
ditionally, to prevent the network from overfitting to noisy
pseudo-labels, we devise a rectification strategy that lever-
ages the predictions from different classifiers to estimate the
quality of pseudo-labels. Our extensive experiments on nu-
merous settings, based on four different semantic segmen-
tation datasets, validate the effectiveness of the proposed
self-training strategy and show that our method outper-
forms state-of-the-art MTDA approaches. Code available
at: https://github.com/Mael-zys/CoaST.
1. Introduction
Semantic segmentation is a key task in computer vision
that consists in learning to predict semantic labels for im-
age pixels. Given its importance in many real world ap-
plications, segmentation is widely studied and significant
progress has been made [1, 3, 4] in the supervised regime.
Much of the recent success can be attributed to the availabil-
ity of large, curated, and annotated datasets [7, 21, 45]. As
obtaining labeled data in semantic segmentation is costly
and tedious, pre-trained models are often deployed in test
environments without fine-tuning. Unfortunately, these
models fail when the test samples are drawn from a distribu-
tion which is different from the training distribution. This
phenomenon is known as the domain shift [31] problem.
To mitigate the domain-shift between the training (source)
and test (target) distributions, Unsupervised Domain Adap-
tation (UDA) methods [8] have been proposed.
Figure 1: (a) Cooperative Self-training: our proposed method for Multi-Target Domain
Adaptation (MTDA). Feature stylization is performed to
favor consistency across classifiers via pseudo-labelling.
Classifier consistency is used to estimate pseudo-label qual-
ity and rectify the training loss. (b) Cooperative Rectification: the uncertainty
map estimated from an input image and used for loss rec-
tification (dark blue for high confidence). We observe that
low confidence regions often correspond to errors.
Although a vast majority of UDA methods have been
proposed for semantic segmentation in the single source and
single target setting, in practical applications the assump-
tion of a single target domain easily becomes vacuous. This is because the real world is more complex, and target data can come from several different data distributions. For example, in autonomous driving applications, the vehicle might encounter cloudy, rainy, and sunny weather conditions over the span of a very short journey. In such cases, the vehicle would need to switch among various adapted models, each specialized for a certain weather condition. To prevent cumbersome
deployment operations one can instead train and deploy a
single model for all the target environments, which is other-
wise known as Multi-Target Domain Adaptation (MTDA).
While in the context of object recognition MTDA has been
explored in several works [6, 11, 23, 25, 38], it is heavily
understudied for semantic segmentation, with just a handful
of existing works [14, 16, 26]. The prior works are either
sub-optimal at fully addressing the target-target alignment
[26] or tackle it at a high computation overhead of explicit
style-transfer [14, 16]. We argue that explicit interactions
between a pair of target domains are essential in MTDA for
minimizing the domain gap across target domains.
To this end, in this paper we present a novel MTDA
framework for semantic segmentation that employs a self-
training strategy based on pseudo-labeling to induce bet-
ter synergy between different domains. Self-training is a
widely used technique consisting in comparing different
predictions obtained from a single image to impose con-
sistency in the network’s predictions. In our proposed method,
illustrated in Fig. 1 (a), we use an original image from one
target domain (in yellow box) as the view that generates
the pseudo-label; while the second prediction is obtained
with the very same target image but stylized with an im-
age coming from a different target domain (in green box).
Given this stylized feature, the network is then asked to
predict the pseudo-label obtained from the original view.
Unlike [14] we use implicit stylization that does not need
any externally trained style-transfer network, making our
self-training end-to-end. Self-training not only helps the
network to improve the quality of representations but also
helps in implicit alignment between target-target pairs due
to cross-domain interactions.
While our proposed self-training is well-suited for
MTDA, it can still be susceptible to noisy pseudo-labels. To
prevent the network from overfitting to noisy pseudo-labels
when the domain-shift is large, we devise a cross-domain
cooperative rectification strategy that captures the disagree-
ment in predictions from different classifiers. Specifi-
cally, our proposed method uses the predictions from mul-
tiple domain-specific classifiers to estimate the quality of
pseudo-labels (see Fig. 1 (b)), which are then weighted
accordingly during self-training. Thus, interactions be-
tween all the target domains are further leveraged with
our proposed framework, which we call Cooperative Self-
Training (CoaST) for MTDA.
Contributions. In summary, our contributions are three-
fold: (i) We propose a self-training approach for MTDA
that synergistically combines pseudo-labeling and feature
stylization to induce better cooperation between domains;
(ii) To reduce the impact of noisy pseudo-labels in self-
training, we propose cross-domain cooperative objective
rectification that uses predictions from multiple domain-
specific classifiers for better estimating the quality of
pseudo-labels; and (iii) We conduct experiments on several
standard MTDA benchmarks and advance the state-of-the-
art performance by non-trivial margins.
2. Related Works
Our proposed method is most related to self-training and
style-transfer, which we discuss below.
Self-training for Domain Adaptation. Self-training in
single-target domain adaptation (STDA) is a popular tech-
nique that involves generating pseudo-labels for the unla-
beled target data and then iteratively training the model
on the most confident labels. To that end, a plethora of
UDA methods for semantic segmentation have been pro-
posed [15, 17, 19, 36, 43, 44, 48] that use self-training due
to its efficiency and simplicity. However, due to the char-
acteristic error-prone nature of the pseudo-labeling strategy,
the pseudo-labels cannot always be trusted and need a selec-
tion or correction mechanism. Most self-training methods
differ in the manner in which the pseudo-labels are gener-
ated and selected. For instance, Zou et al. [48] proposed
a class-balanced self-training strategy and used spatial pri-
ors, whereas in [41, 42] class-dependent centroids are em-
ployed to generate pseudo-labels. Most relevant to our ap-
proach are self-training methods [27, 43, 44] that rectify
the pseudo-labels by measuring the uncertainty in predic-
tions. Our proposed CoaST also derives inspiration from
the STDA method [44], but instead of ad-hoc auxiliary clas-
sifiers, we use different stylized versions of the same image
and different target domain-specific classifiers, to compute
the rectification weights. The majority of the STDA self-
training methods do not trivially allow target-target interac-
tions, which are crucial for MTDA.
Style-Transfer for Domain Adaptation. Style-transfer is yet another popular technique in STDA; it relies on transferring style (appearance) to make a source domain image look like a target image, or vice versa. Assuming that the semantic content of the image, and hence the pixel labels, remain unchanged in the stylization process, target-like source images can be used to train a model for the target domain.
Thus, the main task becomes modeling the style and con-
tent in an image through an encoder-decoder-like network.
In the context of STDA in semantic segmentation, Hoff-
man et al. [13] proposed CyCADA, that incorporates cyclic
reconstruction and semantic consistency to learn a classi-
fier for the target data. Inspired by CyCADA, a multitude
of STDA methods [2, 5, 18, 20, 30, 37, 39, 47] have been
proposed which use style-transfer in conjunction with other
techniques. Learning a good encoder-decoder style-transfer
network introduces additional training overheads and the
success is greatly limited by the reconstruction quality. Al-
ternatively, style-transfer can be performed in the feature
space of the encoder without explicitly generating the styl-
ized image [29, 46]. CrossNorm [29] explores this solution
in the context of domain generalization to learn robust fea-
tures. In CoaST, we adapt CrossNorm to our self-training
mechanism by transferring style across target domains to
induce better synergy.
Multi-target Domain Adaptation. MTDA for semantic
segmentation is an under-explored field with just a hand-
ful of existing works [14, 16, 26]. For instance, Saporta et
al. [26] proposed an adversarial framework where source-
target and target-target alignment is achieved through ded-
icated discriminators. They also introduced a multi-target
knowledge transfer (MTKT) approach where knowledge
distillation (KD) [12] is used to learn a domain-agnostic
classifier from multiple domain-specific experts. On the
other hand, CCL [14] and ADAS [16] rely on explicit
style-transfer to tackle MTDA in semantic segmentation.
Much like other style-transfer based STDA methods, [16]
uses an external network for explicitly transferring styles
between domains. Instead, we rely on implicit style-transfer
making our proposed CoaST easy to implement and end-to-
end trainable. Additionally, we introduce a cooperative rec-
tification technique which prevents overfitting to imperfect pseudo-labels, making our method more robust. We empirically demonstrate our advantage over [14, 16, 26] through
numerous experiments.
3. Methods
In this section we formally define the MTDA task and
then we present the details of our proposed Cooperative
Self-Training (CoaST) framework.
3.1. Preliminaries
Problem Definition and Notations. In the multi-target domain adaptation (MTDA) task, we assume that we have at our disposal $N_S$ labeled instances from a source domain data set $D_S = \{(x_n^S, y_n^S)\}_{n=1}^{N_S}$, where $x^S \in \mathbb{R}^{H \times W \times 3}$ are input images with their corresponding one-hot ground-truth labels $y^S \in \mathbb{R}^{H \times W \times K}$, assigned to each pixel in the $H \times W$ spatial grid and belonging to one of the $K$ semantic classes. Moreover, there are a total of $M$ unlabeled target domains $\{T_1, \ldots, T_M\}$, where each target domain $T_i$ comprises an unlabeled data set $D_{T_i} = \{x_n^{T_i}\}_{n=1}^{N_{T_i}}$, with $x^{T_i} \in \mathbb{R}^{H \times W \times 3}$ representing the target images and $N_{T_i}$ being the number of unlabeled instances. Following standard MTDA protocols, we assume that the marginal distributions between every pair of available domains differ, under the constraint that the underlying semantic concepts remain the same. The goal of MTDA is to learn a single network $f = C \circ \Phi$ using $\{\bigcup_{i=1}^{M} D_{T_i}\} \cup D_S$ that can segment samples from any target domain, where $C$ and $\Phi$ are the classifier and the backbone encoder networks, respectively. While we consider that the domain information is known at training time, the domain labels of the images during inference are unknown.
Overall Framework. To address MTDA, we operate in two stages. In the first stage we aim to learn target domain-specific classifiers with adversarial adaptation [32, 35], which aligns features between a given source-target domain pair. The first stage yields network parameters that enable better alignment in the subsequent stage. In this
second stage, we adopt a pseudo-label based cooperative
self-training strategy to further align the target domains. In
particular, our proposed self-training strategy enforces con-
sistency among the target domain-specific classifiers, allow-
ing maximal interaction among the different target domains.
Importantly, our cooperative training also incorporates a
threshold-free rectification term that prevents overfitting to
noisy pseudo-labels. Finally, we use knowledge distillation
to distill all the learned information from domain-specific
classifiers to a domain-agnostic classifier that can be used
to segment a test image from any target domain, thereby
alleviating the need for domain-id during inference.
Adversarial Warm-up. This marks the first stage, where we follow [26] for initializing our framework, in order to obtain an encoder network $\Phi$ that is shared among all the target domains, and $M$ distinct target domain-specific classifiers $\{C^{T_i} \mid i \in \{1, \ldots, M\}\}$. Concurrently, we also initialize $M$ target domain-specific discriminators $\{D^{T_i} \mid i \in \{1, \ldots, M\}\}$ to learn a classifier that is invariant for a specific source-target pair. To recap, in the adversarial warm-up stage the discriminator $D^{T_i}$ is trained to distinguish between the source and target $T_i$ predictions, whereas the network $f^{T_i} = C^{T_i} \circ \Phi$ is trained to fool $D^{T_i}$. Note that, unlike the original work in [10], the output from the classifier is given as an input to the domain discriminator [26, 32]. Additionally, for the source samples we employ the standard supervised cross-entropy loss, which is used to train every $f^{T_i}$. Overall, for a given source-target pair $(S, T_i)$ the discriminator $D^{T_i}$ is trained with the objective:

$$\mathcal{L}_{D^{T_i}} = \mathcal{L}_{bce}\big(D^{T_i}(C^{T_i}(\Phi(x^S))), 1\big) + \mathcal{L}_{bce}\big(D^{T_i}(C^{T_i}(\Phi(x^{T_i}))), 0\big), \quad (1)$$

where $\mathcal{L}_{bce}$ stands for the binary cross-entropy loss. Simultaneously, the network $f^{T_i}$ is trained with the source segmentation loss and the adversarial loss:

$$\mathcal{L}_{f^{T_i}} = \mathcal{L}_{ce}\big(C^{T_i}(\Phi(x^S)), y^S\big) + \lambda_{adv}\,\mathcal{L}_{bce}\big(D^{T_i}(C^{T_i}(\Phi(x^{T_i}))), 1\big), \quad (2)$$

where $\mathcal{L}_{ce}$ is the supervised cross-entropy loss for the source data and $\lambda_{adv}$ is a hyperparameter that balances the two losses. In the adversarial warm-up stage we alternately minimize $\mathcal{L}_{D^{T_i}}$ and $\mathcal{L}_{f^{T_i}}$ for every source-target pair.
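To make the alternating optimization concrete, below is a minimal PyTorch-style sketch of one warm-up step for a single source-target pair. Only the structure of Eqs. (1)-(2) comes from the text; all module and variable names, and the value of `lambda_adv`, are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def warmup_step(encoder, classifier_i, discriminator_i, opt_f, opt_d,
                x_src, y_src, x_tgt, lambda_adv=0.001):
    """One alternating warm-up step for a single source-target pair (S, T_i).

    y_src holds per-pixel class indices of shape (N, H, W); the discriminator
    receives softmax maps, as the classifier output is its input here.
    """
    # --- Discriminator update, Eq. (1): source -> 1, target -> 0 ---
    with torch.no_grad():  # freeze f^{T_i} while updating D^{T_i}
        p_src = torch.softmax(classifier_i(encoder(x_src)), dim=1)
        p_tgt = torch.softmax(classifier_i(encoder(x_tgt)), dim=1)
    d_src, d_tgt = discriminator_i(p_src), discriminator_i(p_tgt)
    loss_d = F.binary_cross_entropy_with_logits(d_src, torch.ones_like(d_src)) \
           + F.binary_cross_entropy_with_logits(d_tgt, torch.zeros_like(d_tgt))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- Network update, Eq. (2): source segmentation CE + fooling loss ---
    logits_src = classifier_i(encoder(x_src))
    loss_ce = F.cross_entropy(logits_src, y_src)
    p_tgt = torch.softmax(classifier_i(encoder(x_tgt)), dim=1)
    d_tgt = discriminator_i(p_tgt)
    loss_adv = F.binary_cross_entropy_with_logits(d_tgt, torch.ones_like(d_tgt))
    loss_f = loss_ce + lambda_adv * loss_adv
    opt_f.zero_grad(); loss_f.backward(); opt_f.step()
    return loss_d.item(), loss_f.item()
```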
3.2. Cooperative Self-Training (CoaST)
The goal of the second stage is to refine the image rep-
resentation learned in the adversarial warm-up stage. We
devise a self-training approach based on pseudo-labels
that iteratively improves the predictions of the model
on the unlabeled data.
Figure 2: Illustration of the proposed CoaST approach in the case of two target domains. Domain-specific classifiers are
distilled to learn a domain-agnostic classifier. Style-transfer is used in the encoder network to induce cooperation between
the different classifiers and rectify the pseudo-labeling losses.
Pseudo-labelling. In our framework for MTDA, we have $M$ specialized target domain-specific classifiers, with each classifier $C^{T_i}$ trained to handle data coming from the corresponding domain $T_i$. We exploit these specialized classifiers to generate pseudo-labels (PLs) for the target samples in their respective target domains. Specifically, given the $n$-th image $x_n^{T_i}$ from the target domain $T_i$, we use the network $f^{T_i}$ to predict the segmentation map $[\hat{p}_n^{T_i}(k)]_{k \in [H] \times [W] \times [K]} = C^{T_i}(\Phi(x_n^{T_i}))$ and compute the pseudo-label as:

$$\hat{y}_n^{T_i} = e_k\Big(\operatorname*{argmax}_{k}\,[\hat{p}_n^{T_i}(k)]_{k \in [H] \times [W] \times [K]}\Big), \quad (3)$$

where $e_k(\cdot)$ denotes the one-hot encoding operator and $\hat{y}_n^{T_i} \in \mathbb{R}^{H \times W \times K}$. The PL is computed at the beginning of the second stage and is updated every $n_b$ iterations. This PL is then used to self-supervise the corresponding $f^{T_i}$ network with a cross-entropy loss:

$$\mathcal{L}_{pl} = \mathcal{L}_{ce}(\hat{p}_n^{T_i}, \hat{y}_n^{T_i}). \quad (4)$$
However, this formulation suffers from two main issues.
First, the PLs act only on the same domain-specific clas-
sifier $C^{T_i}$ corresponding to the domain of input images.
Hence, it does not induce any synergy between the different
classifiers. Second, since the PLs can be noisy, using the
pseudo-labeling objective in Eq. (4) can lead to detrimental
behaviour. To address these two issues and further benefit
from our PLs, we introduce a self-training technique that is
realized by leveraging feature stylization [29].
Style-Transfer for Cooperative Self-Training. To bene-
fit from the self-training objective in Eq. (4), one requires
to obtain the prediction for a view $t(x_n^{T_i})$ and enforce it to match $\hat{y}_n^{T_i}$, where $t(\cdot)$ is any stochastic transformation. Indeed, such a consistency-based
training strategy has successfully been applied in the semi-
supervised learning literature [28]. However, finding opti-
mal transformations is not trivial and varies between data
sets and even tasks. In this work, we resort to a data-driven
transformation policy that is based on style-transfer [33].
Style-transfer consists in transferring the "style" (appearance) from one image to another. Concretely, in our case, the transformation $t(\cdot)$ is a style-transfer operation that essentially applies the style of an image $x^{T_j}$ to the image $x^{T_i}$, where $i \neq j$. The style-transformed image $x^{T_{i \rightarrow j}}$ can in essence be regarded as a virtual image that appears to come from $T_j$ while having the content structure of $T_i$. Therefore, for the $n$-th sample $x_n^{T_{i \rightarrow j}}$ we obtain the prediction from $f^{T_j}$ and optimize it to be close to $\hat{y}_n^{T_i}$. In this way our PL from
a given target domain-specific classifier can be used to su-
pervise another domain-specific classifier, enforcing better
consistency between pairs of target domains. Moreover, we
thereafter show how style-transfer is instrumental in recti-
fying the objective in Eq. (4) according to an estimated con-
fidence score. We now describe how we use style-transfer
to improve self-training in the MTDA setting.
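Before those details, the overall wiring of this cross-domain loss can be sketched as follows, assuming the encoder is split into a `stem` (up to the layer where stylization is applied) and a `head`; the `stylize` operation is the feature-level style transfer described next, and every name here is an illustrative assumption rather than the authors' code.

```python
import torch.nn.functional as F

def cross_domain_pl_loss(stem, head, classifier_j, stylize, x_i, x_j, pl_i):
    """Match f^{T_j}'s prediction on the virtual view x^{T_{i->j}} to the
    pseudo-label computed on the original x^{T_i} (a sketch, not exact code)."""
    feat_content = stem(x_i)                     # content features from domain T_i
    feat_style = stem(x_j).detach()              # style statistics source from T_j
    feat_ij = stylize(feat_content, feat_style)  # virtual features for x^{T_{i->j}}
    logits = classifier_j(head(feat_ij))         # predict with the T_j classifier
    return F.cross_entropy(logits, pl_i)         # supervise with the T_i pseudo-label
```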
Style-transfer in the pixel space, with a separately trained encoder-decoder network, has recently been used for MTDA [16]. To avoid such costly, and often
sub-optimal, image generation with the pixel-space style-
transfer methods, we perform style-transfer in the inter-
mediate feature space of the encoder network. In particu-
lar, we adapt cross normalization (CrossNorm) [29] in our
MTDA setting and use it as a means of exchanging fea-
ture statistics, and hence style, across different domains.
More precisely, our Cross-Domain Normalization (Cross-
DoNorm) performs style-transfer by exchanging style vec-
tors between two target domain images, which are com-
puted from the channel-wise mean and standard deviation
of the feature maps. Exchanging style vectors is deemed sufficient for style-transfer by prior work [33], which shows
that these statistics encode the image style and that style-
transfer can be obtained through a simple re-normalization.
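Such a re-normalization admits a very compact implementation. The sketch below exchanges channel-wise statistics in the spirit of CrossNorm [29] and AdaIN-style stylization [33], assuming NCHW feature maps; the exact layer at which it is applied is an implementation choice we leave open here.

```python
import torch

def cross_donorm(feat_content, feat_style, eps=1e-5):
    """Feature-level style transfer by statistics exchange (a minimal sketch)."""
    mu_c = feat_content.mean(dim=(2, 3), keepdim=True)       # per-channel mean
    std_c = feat_content.std(dim=(2, 3), keepdim=True) + eps  # per-channel std
    mu_s = feat_style.mean(dim=(2, 3), keepdim=True)
    std_s = feat_style.std(dim=(2, 3), keepdim=True) + eps
    # Strip the content image's style, then apply the style image's statistics.
    return (feat_content - mu_c) / std_c * std_s + mu_s
```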
Given a pair of images $(x^{T_i}, x^{T_j})$ from the target domains $T_i$ and $T_j$, we extract their corresponding features