
features. We formulate test-time adaptation as unsupervised knowledge distillation [34] that learns knowledge from the MoE. Therefore, we treat M as the teacher and aim to distill its knowledge into a student prediction network f(·; θ) to achieve adaptation. To do so, we sample a batch of unlabeled data x from a target domain and pass it to M to query the domain-specific knowledge {M_i(x)}_{i=1}^N. That knowledge is then forwarded to a knowledge aggregator A(·; φ), which is learned to capture the interconnection among domain knowledge and to aggregate the information from the MoE. The output of A(·; φ) is treated as the supervision signal to update f(x; θ). Once the adapted θ′ is obtained, f(·; θ′) is used to make predictions for the rest of the data in that domain. The overall framework follows the effective few-shot learning paradigm, where x is treated as an unlabeled support set [74, 65, 25].
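One plausible instantiation of the knowledge aggregator A(·; φ), sketched below, is a set encoder that lets the N expert outputs attend to one another before being pooled into a single supervision signal per sample; the transformer-based design and all hyper-parameters here are illustrative assumptions rather than details fixed by this section.

```python
# Sketch of a knowledge aggregator A(.; phi): it ingests the N experts' features
# for a batch and returns one aggregated feature vector per sample.
# The transformer encoder is an assumed design choice for illustration.
import torch.nn as nn

class KnowledgeAggregator(nn.Module):
    def __init__(self, feat_dim=512, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, expert_feats):
        # expert_feats: (batch, N_experts, feat_dim) -- one feature vector per expert
        mixed = self.encoder(expert_feats)   # expert features attend to each other
        return mixed.mean(dim=1)             # (batch, feat_dim): aggregated supervision signal
```

Any permutation-aware architecture over the N expert features would fit this interface; what matters for the rest of the section is only that A(·; φ) maps the set of expert outputs to one supervision signal per sample.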
Algorithm 1 Training for Meta-DMoE
Require: {D_S^i}_{i=1}^N: data of source domains; α, β: learning rates; B: meta batch size
 1: // Pretrain domain-specific MoE models
 2: for i = 1, ..., N do
 3:     Train the domain-specific model M_i using D_S^i.
 4: end for
 5: // Meta-train aggregator A(·; φ) and student model f(·; θ_e, θ_c)
 6: Initialize: φ, θ_e, θ_c
 7: while not converged do
 8:     Sample a batch of B source domains {D_S^b}_{b=1}^B; reset batch loss L_B = 0
 9:     for each D_S^b do
10:         Sample support and query sets: (x^SU), (x^Q, y^Q) ∼ D_S^b
11:         M′_e(x^SU) = {M^i_e(x^SU)}_{i=1}^N, masking M^i_e(x^SU) with 0 if b = i
12:         Perform adaptation via knowledge distillation from MoE:
13:             θ′_e = θ_e − α ∇_{θ_e} ‖A(M′_e(x^SU); φ) − f(x^SU; θ_e)‖_2
14:         Evaluate the adapted θ′_e using the query set and accumulate the loss:
15:             L_B = L_B + L_CE(y^Q, f(x^Q; θ′_e, θ_c))
16:     end for
17:     Update φ, θ_e, θ_c for the current meta batch:
18:         (φ, θ_e, θ_c) ← (φ, θ_e, θ_c) − β ∇_{(φ, θ_e, θ_c)} L_B
19: end while
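For concreteness, the following PyTorch-style sketch mirrors one inner episode of Alg. 1 (L10–L15). It assumes PyTorch ≥ 2.0 for torch.func.functional_call, and the names experts, aggregator, feat_net (θ_e), and classifier (θ_c) are illustrative placeholders rather than the released implementation.

```python
# Sketch of one meta-training episode (Alg. 1, L10-L15), under the assumptions above.
import torch
import torch.nn.functional as F
from torch.func import functional_call

def episode_loss(experts, aggregator, feat_net, classifier,
                 x_su, x_q, y_q, held_out_idx, alpha):
    # Query the frozen experts' feature extractors on the unlabeled support set.
    with torch.no_grad():
        feats = [m(x_su) for m in experts]                        # each: (B, D)
    feats[held_out_idx] = torch.zeros_like(feats[held_out_idx])   # mask expert b (Alg. 1, L11)
    target = aggregator(torch.stack(feats, dim=1))                # A(M'_e(x^SU); phi): (B, D)

    # Inner loop (Eq. 2): one distillation step on theta_e, keeping the graph so
    # that the meta-gradient can later flow through the adaptation.
    theta_e = dict(feat_net.named_parameters())
    dist_loss = torch.norm(target - functional_call(feat_net, theta_e, (x_su,)), p=2)
    grads = torch.autograd.grad(dist_loss, list(theta_e.values()), create_graph=True)
    theta_e_prime = {k: p - alpha * g for (k, p), g in zip(theta_e.items(), grads)}

    # Outer objective: cross-entropy of the adapted student on the labeled query set.
    logits = classifier(functional_call(feat_net, theta_e_prime, (x_q,)))
    return F.cross_entropy(logits, y_q)
```

Summing B such episode losses into L_B and taking one optimizer step with learning rate β over (φ, θ_e, θ_c) reproduces L16–L18; create_graph=True is what allows the meta-gradient to flow back through the inner adaptation step.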
Training Meta-DMoE. Properly training (θ, φ) is critical to improve generalization on unseen domains. In our framework, A(·; φ) acts as a mechanism that explores and mixes the knowledge from multiple source domains. The conventional knowledge distillation process requires large numbers of data samples and learning iterations [34, 2]; such repetitive large-scale training is inapplicable in real-world applications. To mitigate these challenges, we follow the meta-learning paradigm [25]. Such bi-level optimization enforces A(·; φ) to learn beyond any specific knowledge [85] and allows the student prediction network f(·; θ) to achieve fast adaptation. Specifically, we first split the data samples in each source domain D_S^i into disjoint support and query sets. The unlabeled support set (x^SU) is used to perform adaptation via knowledge distillation, while the labeled query set (x^Q, y^Q) is used to evaluate the adapted parameters and thus explicitly test the generalization on unseen data.
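One simple way to construct such an episode from a source-domain dataset is sketched below; the dataset interface and split sizes are assumptions for illustration only.

```python
# Sketch: sample a disjoint unlabeled support set and labeled query set from one
# source domain, assumed to be a torch.utils.data.Dataset of (image, label) pairs.
import torch
from torch.utils.data import Subset

def sample_episode(domain_dataset, n_support=64, n_query=64):
    # The random permutation guarantees the two sets are disjoint.
    idx = torch.randperm(len(domain_dataset)).tolist()
    support = Subset(domain_dataset, idx[:n_support])
    query = Subset(domain_dataset, idx[n_support:n_support + n_query])

    x_su = torch.stack([x for x, _ in support])   # support labels are discarded (unlabeled)
    x_q = torch.stack([x for x, _ in query])
    y_q = torch.tensor([y for _, y in query])
    return x_su, (x_q, y_q)
```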
The student prediction network f(·; θ) can be decoupled into a feature extractor θ_e and a classifier θ_c. Unsupervised knowledge distillation can be achieved via the softened outputs [34] or intermediate features [84] from M. The former allows the whole student network θ = (θ_e, θ_c) to be adapted, while the latter allows part or all of θ_e to adapt to x, depending on the features utilized. We follow [56] and only adapt θ_e in the inner loop while keeping θ_c fixed. Thus, the adaptation process is achieved by distilling the knowledge via the aggregated features:
\mathrm{DIST}(x^{SU}, M_e, \phi, \theta_e) = \theta'_e = \theta_e - \alpha \nabla_{\theta_e} \left\| A(M_e(x^{SU}); \phi) - f(x^{SU}; \theta_e) \right\|_2, \qquad (2)
where α denotes the adaptation learning rate, M_e denotes the feature extractors of the MoE models, which extract the features before the classifier, and ‖·‖_2 measures the L2 distance. The goal is to obtain an updated θ′_e such that the features extracted by f(x^SU; θ′_e) are closer to the aggregated features.
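At test time (or as a first-order reading of Eq. 2), this update can be applied in place, with only θ_e receiving the gradient step and θ_c left untouched. A minimal sketch, where moe_extractors, aggregator, and feat_net are placeholder PyTorch modules rather than the authors' code:

```python
# Sketch of the DIST step (Eq. 2) applied in place at test time.
import torch

def dist(x_su, moe_extractors, aggregator, feat_net, alpha=1e-3):
    # Aggregated teacher features A(M_e(x^SU); phi); the experts stay frozen.
    with torch.no_grad():
        expert_feats = torch.stack([m_e(x_su) for m_e in moe_extractors], dim=1)
        target = aggregator(expert_feats)

    # L2 feature-matching loss between the aggregated features and f(x^SU; theta_e).
    loss = torch.norm(target - feat_net(x_su), p=2)
    grads = torch.autograd.grad(loss, list(feat_net.parameters()))

    # One gradient step on theta_e only; the classifier theta_c is left untouched.
    with torch.no_grad():
        for p, g in zip(feat_net.parameters(), grads):
            p -= alpha * g                    # theta_e' = theta_e - alpha * grad
    return feat_net                           # now parameterized by theta_e'
```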
The overall learning objective of Meta-DMoE is to minimize the following expected loss:

\arg\min_{\theta_e, \theta_c, \phi} \sum_{D_S^j \in D_S} \; \sum_{\substack{(x^{SU}) \in D_S^j \\ (x^Q, y^Q) \in D_S^j}} L_{CE}\big(y^Q, f(x^Q; \theta'_e, \theta_c)\big), \quad \text{where } \theta'_e = \mathrm{DIST}(x^{SU}, M_e, \phi, \theta_e), \qquad (3)
where L_CE is the cross-entropy loss. Alg. 1 presents our full training procedure. To smooth the meta-gradient and stabilize training, we process a batch of episodes before each meta-update. Since the training domains overlap between MoE pretraining and meta-training, we simulate test-time out-of-distribution conditions by excluding the corresponding expert model in each episode. To do so, we multiply its features by 0 to mask them out; M′_e in L11 of Alg. 1 denotes this operation. Therefore, the adaptation is enforced to use the knowledge aggregated from the other domains.
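Concretely, the masking in L11 only zeroes the held-out expert's slice before aggregation; a minimal sketch, assuming the expert features are stacked along a dedicated expert dimension as in the earlier sketches:

```python
# Sketch: exclude the expert pretrained on the sampled domain b, so adaptation
# must rely on knowledge aggregated from the other domains.
import torch

def mask_in_domain_expert(expert_feats, b):
    # expert_feats: (batch, N_experts, feat_dim); b: index of the sampled source domain.
    masked = expert_feats.clone()
    masked[:, b, :] = 0.0         # the in-domain expert contributes nothing to A(.; phi)
    return masked
```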