understudied for semantic segmentation, with just a handful
of existing works [14, 16, 26]. Prior works either fail to
fully address target-target alignment [26] or tackle it only
at the high computational cost of explicit style-transfer
[14, 16]. We argue that explicit interactions
between a pair of target domains are essential in MTDA for
minimizing the domain gap across target domains.
To this end, in this paper we present a novel MTDA
framework for semantic segmentation that employs a self-
training strategy based on pseudo-labeling to induce bet-
ter synergy between different domains. Self-training is a
widely used technique that compares different predictions
obtained from a single image to enforce consistency in the
network's predictions. In our proposed method,
illustrated in Fig. 1 (a), we use an original image from one
target domain (yellow box) as the view that generates
the pseudo-label, while the second prediction is obtained
from the very same target image stylized with an image
from a different target domain (green box). Given this
stylized view, the network is then trained to predict the
pseudo-label obtained from the original view.
Unlike [14], we use implicit stylization that does not need
any externally trained style-transfer network, making our
self-training end-to-end. Self-training not only helps the
network improve the quality of its representations but also
implicitly aligns target-target pairs through cross-domain
interactions.
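To make the mechanism concrete, the minimal NumPy sketch below illustrates the two ingredients described above: a channel-statistics swap in feature space as the implicit stylization, and a cross-entropy between the stylized-view prediction and the pseudo-label from the original view. The function names, tensor shapes, and the specific AdaIN-style statistic swap are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def stylize_features(content, style, eps=1e-5):
    """Implicit feature-space stylization (assumed AdaIN-style swap):
    re-normalize the content feature map with the channel-wise
    mean/std of a feature map from another target domain.
    content, style: (C, H, W) -> stylized features (C, H, W)."""
    c_mu = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mu = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    return (content - c_mu) / c_std * s_std + s_mu

def pseudo_label(logits):
    """Hard pseudo-label from the prediction on the original view.
    logits: (K, H, W) -> label map (H, W)."""
    return logits.argmax(axis=0)

def consistency_loss(stylized_logits, labels):
    """Cross-entropy between the prediction on the stylized view
    and the pseudo-label obtained from the original view."""
    K, H, W = stylized_logits.shape
    z = stylized_logits - stylized_logits.max(axis=0, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    return -log_p[labels, np.arange(H)[:, None], np.arange(W)].mean()
```

In this sketch the stylized features keep the spatial layout (and hence the pixel labels) of the original image while adopting the other target domain's channel statistics, which is what lets the pseudo-label from the original view supervise the stylized view.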
While our proposed self-training is well-suited for
MTDA, it can still be susceptible to noisy pseudo-labels. To
prevent the network from overfitting to noisy pseudo-labels
when the domain-shift is large, we devise a cross-domain
cooperative rectification strategy that captures the disagree-
ment in predictions from different classifiers. Specifi-
cally, our proposed method uses the predictions from mul-
tiple domain-specific classifiers to estimate the quality of
pseudo-labels (see Fig. 1 (b)), which are then weighted
accordingly during self-training. Thus, interactions be-
tween all the target domains are further leveraged with
our proposed framework, which we call Co-operative Self-
Training (CoaST) for MTDA.
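As a sketch of this rectification idea, the NumPy snippet below derives per-pixel weights from the disagreement between two domain-specific classifiers and uses them to down-weight unreliable pseudo-labels. The symmetric-KL disagreement measure and the exp(-d) weighting are our own plausible choices for illustration, not necessarily the exact formulation used in CoaST.

```python
import numpy as np

def softmax(logits, axis=0):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def rectification_weights(logits_a, logits_b, eps=1e-8):
    """Per-pixel weights in (0, 1] from the disagreement between two
    domain-specific classifiers: symmetric KL divergence mapped
    through exp(-d) (one plausible instantiation).
    logits_*: (K, H, W) -> weights (H, W)."""
    p = softmax(logits_a)
    q = softmax(logits_b)
    kl_pq = (p * np.log((p + eps) / (q + eps))).sum(axis=0)
    kl_qp = (q * np.log((q + eps) / (p + eps))).sum(axis=0)
    return np.exp(-0.5 * (kl_pq + kl_qp))

def weighted_self_training_loss(log_probs, labels, weights):
    """Pseudo-label cross-entropy in which pixels with high
    classifier disagreement contribute less. log_probs: (K, H, W)."""
    H, W = labels.shape
    nll = -log_probs[labels, np.arange(H)[:, None], np.arange(W)]
    return (weights * nll).sum() / (weights.sum() + 1e-8)
```

When the two classifiers agree, the weights approach 1 and the pixel is trained on normally; pixels where the domain-specific classifiers disagree are softly suppressed rather than hard-thresholded.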
Contributions. In summary, our contributions are threefold:
(i) We propose a self-training approach for MTDA
that synergistically combines pseudo-labeling and feature
stylization to induce better cooperation between domains;
(ii) To reduce the impact of noisy pseudo-labels in self-
training, we propose cross-domain cooperative objective
rectification that uses predictions from multiple domain-
specific classifiers for better estimating the quality of
pseudo-labels; and (iii) We conduct experiments on several
standard MTDA benchmarks and advance the state-of-the-
art performance by non-trivial margins.
2. Related Works
Our proposed method is most closely related to self-training
and style-transfer, which we discuss below.
Self-training for Domain Adaptation. Self-training in
single-target domain adaptation (STDA) is a popular tech-
nique that involves generating pseudo-labels for the unla-
beled target data and then iteratively training the model
on the most confident labels. To that end, a plethora of
UDA methods for semantic segmentation have been pro-
posed [15, 17, 19, 36, 43, 44, 48] that use self-training due
to its efficiency and simplicity. However, due to the char-
acteristic error-prone nature of the pseudo-labeling strategy,
the pseudo-labels cannot always be trusted and need a selec-
tion or correction mechanism. Most self-training methods
differ in the manner in which the pseudo-labels are gener-
ated and selected. For instance, Zou et al. [48] proposed
a class-balanced self-training strategy and used spatial pri-
ors, whereas in [41, 42] class-dependent centroids are em-
ployed to generate pseudo-labels. Most relevant to our ap-
proach are self-training methods [27, 43, 44] that rectify
the pseudo-labels by measuring the uncertainty in predic-
tions. Our proposed CoaST also draws inspiration from
the STDA method [44], but instead of ad-hoc auxiliary clas-
sifiers, we use different stylized versions of the same image
and different target domain-specific classifiers, to compute
the rectification weights. The majority of STDA self-
training methods do not trivially allow target-target interac-
tions, which are crucial for MTDA.
Style-Transfer for Domain Adaptation. Style-transfer is
another popular technique in STDA that relies on transfer-
ring style (appearance) to make a source-domain image look
like a target image, or vice versa. Assuming the semantic
content of the image, and hence its pixel labels, remains
unchanged during stylization, target-like source images can
be used to train a model for the target domain.
Thus, the main task becomes modeling the style and con-
tent in an image through an encoder-decoder-like network.
In the context of STDA in semantic segmentation, Hoff-
man et al. [13] proposed CyCADA, which incorporates cyclic
reconstruction and semantic consistency to learn a classi-
fier for the target data. Inspired by CyCADA, a multitude
of STDA methods [2, 5, 18, 20, 30, 37, 39, 47] have been
proposed which use style-transfer in conjunction with other
techniques. Learning a good encoder-decoder style-transfer
network introduces additional training overhead, and its
success is greatly limited by the reconstruction quality. Al-
ternatively, style-transfer can be performed in the feature
space of the encoder without explicitly generating the styl-
ized image [29, 46]. CrossNorm [29] explores this solution
in the context of domain generalization to learn robust fea-
tures. In CoaST, we adapt CrossNorm to our self-training
mechanism by transferring style across target domains to