From Mimicking to Integrating:
Knowledge Integration for Pre-Trained Language Models
Lei Li1, Yankai Lin2,3, Xuancheng Ren1, Guangxiang Zhao1, Peng Li4, Jie Zhou5, Xu Sun1
1MOE Key Lab of Computational Linguistics, School of Computer Science, Peking University
2Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
3Beijing Key Laboratory of Big Data Management and Analysis Methods , Beijing, China
4Institute for AI Industry Research (AIR), Tsinghua University, China
5Pattern Recognition Center, WeChat AI, Tencent Inc., China
lilei@stu.pku.edu.cn xusun@pku.edu.cn
Abstract
Investigating better ways to reuse the released
pre-trained language models (PLMs) can sig-
nificantly reduce the computational cost and
the potential environmental side-effects. This
paper explores a novel PLM reuse paradigm,
Knowledge Integration (KI). Without human
annotations available, KI aims to merge the
knowledge from different teacher-PLMs, each
of which specializes in a different classifica-
tion problem, into a versatile student model.
To achieve this, we first derive the cor-
relation between virtual golden supervision
and teacher predictions. We then design
a Model Uncertainty–aware Knowledge In-
tegration (MUKI) framework to recover the
golden supervision for the student. Specifi-
cally, MUKI adopts Monte-Carlo Dropout to
estimate model uncertainty for the supervision
integration. An instance-wise re-weighting
mechanism based on the margin of uncertainty
scores is further incorporated, to deal with the
potential conflicting supervision from teachers.
Experimental results demonstrate that MUKI
achieves substantial improvements over base-
lines on benchmark datasets. Further analy-
sis shows that MUKI can generalize well for
merging teacher models with heterogeneous
architectures, and even teachers trained on cross-
lingual datasets.1
1 Introduction
Large-scale pre-trained language models (PLMs),
such as BERT (Devlin et al.,2019), RoBERTa (Liu
et al.,2019) and T5 (Raffel et al.,2020) have re-
cently achieved promising results after fine-tuning
on various natural language processing (NLP) tasks.
Many fine-tuned PLMs are generously released to
facilitate research and deployment. Reusing
these PLMs can greatly reduce the computational
1Our code is available at https://github.com/lancopku/MUKI. Part of the work was done while Yankai Lin and Peng Li were working at Tencent.
Figure 1: Comparison of knowledge distillation (KD)
and knowledge integration (KI). KD assumes that the
student performs predictions on the identical label set
with the teacher, while KI trains a student model that
is capable of performing classification over the union
label set of teacher models.
cost of retraining the PLM from scratch and alle-
viate the potential environmental side-effects like
carbon footprints (Strubell et al.,2019), thus mak-
ing NLP systems greener (Schwartz et al.,2020). A
commonly adopted model reuse paradigm is knowl-
edge distillation (Hinton et al.,2015;Romero et al.,
2015), where a student model learns to mimic a
teacher model by aligning its outputs to that of the
teacher. Though achieving promising results with PLMs (Sun et al., 2019; Jiao et al., 2020), in this paradigm the student is restricted to performing the same task as the teacher model, which limits the re-utilization of the abundant available PLMs fine-tuned on different tasks, e.g., models fine-tuned on various label sets or even different datasets.
In this paper, we generalize the idea of KD
from mimicking teachers to integrating knowledge
from teachers, and propose Knowledge Integra-
tion (KI) for PLMs. Given multiple fine-tuned
teacher-PLMs, each of which is capable of per-
forming classification over a unique label set, KI
aims to train a versatile student that can make pre-
dictions over the union of teacher label sets. As the
labeled data for training the teachers may not be
publicly released due to data privacy issues, we as-
sume no human annotations are available during KI.
arXiv:2210.05230v1 [cs.CL] 11 Oct 2022
The benefits of KI are two-fold. First, compared to KD, KI can make full use of released PLMs specializing in different tasks. Second, the ability of the versatile student, i.e., its label set coverage, can be improved over time by integrating newly released teacher models. Figure 1 illustrates the main difference between KD and KI.
As no annotations are available, the core chal-
lenge of KI lies in the integration of outputs from
teachers to form golden supervision, i.e., the class
probability distribution over the union label set, for
guiding the student. Through theoretical deriva-
tion, we first build the bridge between the teacher
predictions and the golden supervision, which in-
dicates that the key to recovering such supervi-
sion is to identify the adequate teacher for each
instance. However, due to the over-confidence problem of PLMs (Desai and Durrett, 2020), selecting qualified teachers for unlabeled instances is non-trivial, and our exploration shows that prediction entropy is a misleading indicator. Inspired by Monte-Carlo
Dropout (Gal and Ghahramani,2016), we inject pa-
rameter perturbations to the teacher models during
inference and then estimate the model uncertainties
over averaged predictions for indicating the possi-
ble correct teacher model. Our Model Uncertainty–
aware Knowledge Integration (MUKI) framework
is then proposed based on the estimated model un-
certainty. Specifically, the golden supervision is
approximated by either taking the outputs of the
most confident teacher, or softly integrating dif-
ferent teacher predictions according to the relative
importance of each teacher. Furthermore, for in-
stances on which teachers achieve close uncertainty
scores, we introduce a re-weighting mechanism
based on the margin of uncertainty scores, to down-weight the contribution of instances with potentially conflicting supervision signals.
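The margin-based re-weighting can be sketched as below. The exact weighting function is not given in this excerpt, so the linear margin form, the normalization by the maximum score, and the function name are all illustrative assumptions:

```python
def instance_weight(uncertainties):
    """Down-weight instances whose two most-certain teachers nearly tie.

    uncertainties: per-teacher uncertainty scores (lower = more confident).
    Returns a weight in [0, 1] derived from the margin between the two
    lowest scores; the normalization choice here is illustrative.
    """
    ranked = sorted(uncertainties)
    margin = ranked[1] - ranked[0]
    top = max(uncertainties)
    return margin / top if top > 0 else 0.0

# A clear winner yields a high weight; near-conflicting teachers a low one.
assert instance_weight([0.1, 0.9]) > instance_weight([0.45, 0.5])
```

Instances with near-tied teachers then contribute less to the student's training loss, reducing the impact of potentially conflicting supervision.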
Experimental results show that MUKI can suc-
cessfully achieve the goal of knowledge integra-
tion, significantly outperforming baseline methods,
and even obtaining comparable results with models
trained with labeled data. Further analysis shows
that MUKI can produce supervision close to the
golden one and generalize well for merging knowl-
edge from heterogeneous teachers with different
architectures, or even cross-lingual teacher models.
The main contributions of this work can be sum-
marized as follows: (1) We explore knowledge
integration for PLMs, which is capable of mak-
ing full use of released PLMs with different label
sets and has great extensibility. (2) We present
MUKI, a generalizable KI framework, which in-
tegrates the knowledge from teachers according
to model uncertainty estimated via Monte-Carlo
Dropout and re-weights the instance contribution
based on the uncertainty margin. (3) Experimental
results demonstrate that MUKI is effective and gen-
eralizable, significantly outperforming baselines.
2 Knowledge Integration for PLMs
In this section, we first give the task formulation for
knowledge integration, followed by the elaboration
on the proposed MUKI framework.
2.1 Problem Formulation
Given $N$ teacher PLMs $\mathcal{TS} = \{T_1, \ldots, T_N\}$, where each teacher $T_i$ specializes in a specific classification problem, i.e., a set of classes $\mathcal{Y}_i$, knowledge integration aims to train a student model $S$ to perform predictions over the comprehensive class set $\mathcal{Y} = \bigcup_{i=1}^{N} \mathcal{Y}_i$, with an unlabeled dataset $\mathcal{D}$. We assume that for each instance in $\mathcal{D}$ there is at least one teacher capable of handling it, and we focus on a practical setting where the teacher specialties are totally disjoint, i.e., $\mathcal{Y}_i \cap \mathcal{Y}_j = \emptyset$ for $i \neq j$, as merging teachers with overlapping classes can be easily converted into the disjoint situation.
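As a concrete sketch of this setup, the snippet below builds the union label space and per-teacher local-to-global index maps; the function and variable names are illustrative, not from the paper's released code, and the particular class split is an example:

```python
def build_global_label_space(teacher_label_sets):
    """Map each teacher's local class indices into the union label set.

    teacher_label_sets: list of per-teacher class-name lists, assumed
    pairwise disjoint (the setting studied in the paper).
    Returns (union class list, per-teacher {local index: global index} maps).
    """
    global_classes = []
    index_maps = []
    for labels in teacher_label_sets:
        mapping = {}
        for local_idx, name in enumerate(labels):
            mapping[local_idx] = len(global_classes)
            global_classes.append(name)
        index_maps.append(mapping)
    return global_classes, index_maps

# E.g., the four AG News topics split between two teachers (split illustrative):
teachers = [["World", "Sports"], ["Business", "Sci/Tech"]]
classes, maps = build_global_label_space(teachers)
```

The student's output layer is sized to `len(classes)`, and each teacher's predictions are later placed into the union space via its index map.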
2.2 Model Uncertainty–Aware Knowledge
Integration
As there are no annotated data available due to the
data privacy issue, we need to construct supervision
for guiding the student. Given the golden label distribution $\mathcal{T}(x)$ for each instance $x$ over $\mathcal{Y}$, we can train the student by minimizing the KL-divergence:

$$\mathcal{L} = \sum_{x \in \mathcal{D}} \mathrm{KL}\left(S(x) \,\|\, \mathcal{T}(x)\right), \quad (1)$$

where $S(x)$ denotes the output distribution of the student for input $x$. As we operate only at the level of output distributions, this framework is generalizable to PLMs that differ in model architecture and training data distribution. To estimate the golden supervision
$\mathcal{T}(x)$, we first derive the correlation between $\mathcal{T}(x)$ and the prediction $T_i(x)$ of teacher model $T_i$. Specifically, as teacher $T_i$ specializes in label set $\mathcal{Y}_i$, it can only predict $T_i(y|x)$ for instance $x$ when $y \in \mathcal{Y}_i$.
Therefore, the correlation between $T_i(y|x)$ and the global probability $\mathcal{T}(y|x)$ over the full class set can be derived as:

$$T_i(y|x) = \mathcal{T}(y \mid x, y \in \mathcal{Y}_i) \quad (2)$$
$$= \frac{\mathcal{T}(y, y \in \mathcal{Y}_i \mid x)}{\mathcal{T}(y \in \mathcal{Y}_i \mid x)}. \quad (3)$$

Figure 2: Model uncertainty (normalized) distributions evaluated with 1,000 instances randomly sampled from the AG News dataset. The vanilla prediction entropy distributions of the two teacher models overlap greatly (left), while Monte-Carlo Dropout produces a more accurate uncertainty approximation for distinguishing the correct teacher model (right). Best viewed in color.
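Rearranging Eqs. (2)–(3) gives $\mathcal{T}(y|x) = T_i(y|x) \cdot \mathcal{T}(y \in \mathcal{Y}_i \mid x)$: the global probability of a class is the teacher's local probability scaled by the estimated specialty membership. A minimal numeric sketch in plain Python, where the membership weights stand in for the uncertainty-based estimates introduced below and all names are illustrative:

```python
import math

def integrate(teacher_probs, membership):
    """Compose a distribution over the union label set (Eqs. 2-3 rearranged):
    each teacher's local probabilities are scaled by its estimated
    membership probability T(y in Y_i | x) and concatenated."""
    global_dist = []
    for probs, weight in zip(teacher_probs, membership):
        global_dist.extend(weight * p for p in probs)
    return global_dist

def kl(p, q):
    """KL(p || q), the per-instance student objective of Eq. (1)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Teacher 1 owns global classes {0, 1}; teacher 2 owns {2, 3}. The instance
# is judged far more likely to lie in teacher 1's specialty (0.9 vs. 0.1).
target = integrate([[0.7, 0.3], [0.6, 0.4]], membership=[0.9, 0.1])
student = [0.25, 0.25, 0.25, 0.25]
loss = kl(student, target)  # drives the student toward the integrated target
```

Since the membership weights sum to one, the composed target is itself a valid probability distribution over the union label set.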
The above derivation indicates that we can recover the golden probability distribution by (1) obtaining the teacher predictions, and (2) estimating the denominator, i.e., how likely instance $x$ lies in the specialty $\mathcal{Y}_i$ of teacher $T_i$. As instances associated with classes not in $\mathcal{Y}_i$ can be treated as out-of-distribution data for teacher $T_i$, the teacher's predictions on these instances would be more uncertain than on in-distribution instances (Hendrycks and Gimpel, 2017). We thus propose to approximate the denominator from the opposite direction, i.e., estimating how likely the instance does not belong to teacher $T_i$ via model uncertainty. In the following, we first explore different uncertainty estimations for recovering the golden supervision, and then introduce how we incorporate teacher predictions according to the estimated uncertainty scores.
2.2.1 Uncertainty Estimation
A naïve estimation directly takes statistics such as the entropy of the predicted class distribution.
However, due to the over-confidence issue of over-parameterized models like PLMs (Guo et al., 2017; Desai and Durrett, 2020), this simple estimation
can be unreliable. We investigate this by first split-
ting the instances of the AG News dataset (Zhang
et al.,2015) into two sets with disjoint labels, and
then fine-tuning teacher models on each set sepa-
rately. For each instance, there is a correct teacher
that is capable of handling it and a wrong teacher
that is not qualified for processing it. We plot
the prediction entropy distributions of the correct
teacher and the wrong teacher in the left part of
Figure 2. We find that the wrong teacher also produces confident predictions, with nearly zero uncertainty scores, even for instances outside its specialty, exhibiting a great overlap with the correct teacher. This indicates that relying on this simple metric misleads the identification of the adequate teacher. To remedy this,
tification of the adequate teacher. To remedy this,
inspired by recent progress in Bayesian neural net-
works (Blundell et al.,2015;Gal and Ghahramani,
2016), we propose to add small perturbations to
the model weights during inference to find out the
correct teacher model. The intuition is that, since an in-specialty instance is well fitted by the parameters of the qualified teacher, that teacher produces consistently confident predictions across multiple forward passes even with slightly perturbed parameters. On the contrary, small perturbations to the weights of the wrong teacher lead to drastic changes in the output probabilities, resulting in more uncertain predictions on average.
estimate the model uncertainty more accurately ac-
cording to the average predictions under parameter
perturbations. Specifically, we adopt Monte-Carlo
Dropout (Gal and Ghahramani,2016), where the
output distribution of an instance $x$ with $T_i$ is calculated as:

$$p_i(y \mid x, \mathcal{D}) \approx \frac{1}{K} \sum_{k=1}^{K} p_i\left(y \mid W^i_k, x\right) \quad (4)$$
$$= \frac{1}{K} \sum_{k=1}^{K} T_i\left(x; W^i_k\right), \quad (5)$$
where $W^i_k$ denotes the $k$-th masked weights of $T_i$ sampled from the Dropout distribution (Srivastava et al., 2014), and $K$ is the sampling number. The model uncertainty of teacher model $T_i$ can thus be summarized as the entropy of the averaged probability distribution $p_i$:

$$u_i = H(p_i) = -\sum_{y=1}^{|\mathcal{Y}_i|} p_i^{y} \log p_i^{y}. \quad (6)$$
As shown in the right part of Figure 2, the uncer-
tainty distributions of the correct teacher and the
wrong teacher model estimated via Monte-Carlo
Dropout exhibit a clearer difference than vanilla
prediction entropy, indicating its great potential for
guiding the probability combination.