From Mimicking to Integrating:
Knowledge Integration for Pre-Trained Language Models
Lei Li1, Yankai Lin2,3, Xuancheng Ren1, Guangxiang Zhao1, Peng Li4, Jie Zhou5, Xu Sun1
1MOE Key Lab of Computational Linguistics, School of Computer Science, Peking University
2Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
3Beijing Key Laboratory of Big Data Management and Analysis Methods , Beijing, China
4Institute for AI Industry Research (AIR), Tsinghua University, China
5Pattern Recognition Center, WeChat AI, Tencent Inc., China
lilei@stu.pku.edu.cn xusun@pku.edu.cn
Abstract
Investigating better ways to reuse the released
pre-trained language models (PLMs) can sig-
nificantly reduce the computational cost and
the potential environmental side-effects. This
paper explores a novel PLM reuse paradigm,
Knowledge Integration (KI). Without human
annotations available, KI aims to merge the
knowledge from different teacher-PLMs, each
of which specializes in a different classifica-
tion problem, into a versatile student model.
To achieve this, we first derive the cor-
relation between virtual golden supervision
and teacher predictions. We then design
a Model Uncertainty–aware Knowledge In-
tegration (MUKI) framework to recover the
golden supervision for the student. Specifi-
cally, MUKI adopts Monte-Carlo Dropout to
estimate model uncertainty for the supervision
integration. An instance-wise re-weighting
mechanism based on the margin of uncertainty
scores is further incorporated, to deal with the
potential conflicting supervision from teachers.
Experimental results demonstrate that MUKI
achieves substantial improvements over base-
lines on benchmark datasets. Further analy-
sis shows that MUKI can generalize well for
merging teacher models with heterogeneous
architectures, and even teachers trained on cross-
lingual datasets.1
1 Introduction
Large-scale pre-trained language models (PLMs),
such as BERT (Devlin et al.,2019), RoBERTa (Liu
et al.,2019) and T5 (Raffel et al.,2020) have re-
cently achieved promising results after fine-tuning
on various natural language processing (NLP) tasks.
Many fine-tuned PLMs are generously released to
facilitate research and deployment. Reusing
these PLMs can greatly reduce the computational
1Our code is available at https://github.com/lancopku/MUKI. Part of the work was done while Yankai Lin and Peng Li were working at Tencent.
Figure 1: Comparison of knowledge distillation (KD)
and knowledge integration (KI). KD assumes that the
student performs predictions on the identical label set
with the teacher, while KI trains a student model that
is capable of performing classification over the union
label set of teacher models.
cost of retraining the PLM from scratch and alle-
viate the potential environmental side-effects like
carbon footprints (Strubell et al.,2019), thus mak-
ing NLP systems greener (Schwartz et al.,2020). A
commonly adopted model reuse paradigm is knowl-
edge distillation (Hinton et al.,2015;Romero et al.,
2015), where a student model learns to mimic a
teacher model by aligning its outputs to that of the
teacher. Though achieving promising results with PLMs (Sun et al., 2019; Jiao et al., 2020), in this paradigm the student is restricted to performing the same task as the teacher model, which limits the re-utilization of the abundant available PLMs fine-tuned on different tasks, e.g., models fine-tuned on various label sets or even different datasets.
In this paper, we generalize the idea of KD
from mimicking teachers to integrating knowledge
from teachers, and propose Knowledge Integra-
tion (KI) for PLMs. Given multiple fine-tuned
teacher-PLMs, each of which is capable of per-
forming classification over a unique label set, KI
aims to train a versatile student that can make pre-
dictions over the union of teacher label sets. As the
labeled data for training the teachers may not be
publicly released due to data privacy issues, we as-
sume no human annotations are available during KI.
arXiv:2210.05230v1 [cs.CL] 11 Oct 2022
The benefits of KI are two-fold. First, compared to KD, KI can make full use of released PLMs specializing in different tasks. Second, the ability of the versatile student, i.e., its label set coverage, can be improved over time by integrating newly released teacher models. Figure 1 illustrates the main difference between KD and KI.
As no annotations are available, the core chal-
lenge of KI lies in the integration of outputs from
teachers to form golden supervision, i.e., the class
probability distribution over the union label set, for
guiding the student. Through theoretical deriva-
tion, we first build the bridge between the teacher
predictions and the golden supervision, which in-
dicates that the key to recovering such supervi-
sion is to identify the adequate teacher for each
instance. However, due to the over-confidence problem of PLMs (Desai and Durrett, 2020), selecting qualified teachers for unlabeled instances is non-trivial, and our exploration shows that prediction entropy is a misleading indicator. Inspired by Monte-Carlo
Dropout (Gal and Ghahramani,2016), we inject pa-
rameter perturbations to the teacher models during
inference and then estimate the model uncertainties
over averaged predictions for indicating the possi-
ble correct teacher model. Our Model Uncertainty–
aware Knowledge Integration (MUKI) framework
is then proposed based on the estimated model un-
certainty. Specifically, the golden supervision is
approximated by either taking the outputs of the
most confident teacher, or softly integrating dif-
ferent teacher predictions according to the relative
importance of each teacher. Furthermore, for in-
stances on which teachers achieve close uncertainty
scores, we introduce a re-weighting mechanism
based on the margin of uncertainty scores, to down-weight the contribution of instances with potentially conflicting supervision signals.
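The margin-based re-weighting can be sketched as below. The exact weighting function is not given in this excerpt, so the linear margin form, the normalization by the maximum score, and the function name are all illustrative assumptions:

```python
def instance_weight(uncertainties):
    """Down-weight instances whose two most-certain teachers nearly tie.

    uncertainties: per-teacher uncertainty scores (lower = more confident).
    Returns a weight in [0, 1] derived from the margin between the two
    lowest scores; the normalization choice here is illustrative.
    """
    ranked = sorted(uncertainties)
    margin = ranked[1] - ranked[0]
    top = max(uncertainties)
    return margin / top if top > 0 else 0.0

# A clear winner yields a high weight; near-conflicting teachers a low one.
assert instance_weight([0.1, 0.9]) > instance_weight([0.45, 0.5])
```

Instances with near-tied teachers then contribute less to the student's training loss, reducing the impact of potentially conflicting supervision.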
Experimental results show that MUKI can suc-
cessfully achieve the goal of knowledge integra-
tion, significantly outperforming baseline methods,
and even obtaining comparable results with models
trained with labeled data. Further analysis shows
that MUKI can produce supervision close to the
golden one and generalize well for merging knowl-
edge from heterogeneous teachers with different
architectures, or even cross-lingual teacher models.
The main contributions of this work can be sum-
marized as follows: (1) We explore knowledge
integration for PLMs, which is capable of mak-
ing full use of released PLMs with different label
sets and has great extensibility. (2) We present
MUKI, a generalizable KI framework, which in-
tegrates the knowledge from teachers according
to model uncertainty estimated via Monte-Carlo
Dropout and re-weights the instance contribution
based on the uncertainty margin. (3) Experimental
results demonstrate that MUKI is effective and gen-
eralizable, significantly outperforming baselines.
2 Knowledge Integration for PLMs
In this section, we first give the task formulation for
knowledge integration, followed by the elaboration
on the proposed MUKI framework.
2.1 Problem Formulation
Given $N$ teacher PLMs $\mathcal{TS} = \{T_1, \ldots, T_N\}$, where each teacher $T_i$ specializes in a specific classification problem, i.e., a set of classes $\mathcal{Y}_i$, knowledge integration aims to train a student model $S$ to perform predictions over the comprehensive class set $\mathcal{Y} = \bigcup_{i=1}^{N} \mathcal{Y}_i$, with an unlabeled dataset $\mathcal{D}$. We assume that for each instance in $\mathcal{D}$ there is at least one teacher capable of handling it, and we focus on a practical setting where the teacher specialties are totally disjoint, i.e., $\mathcal{Y}_i \cap \mathcal{Y}_j = \emptyset$ for $i \neq j$, as merging teachers with overlapping classes can be easily converted into the disjoint situation.
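As a concrete sketch of this setup, the snippet below builds the union label space and per-teacher local-to-global index maps; the function and variable names are illustrative, not from the paper's released code, and the particular class split is an example:

```python
def build_global_label_space(teacher_label_sets):
    """Map each teacher's local class indices into the union label set.

    teacher_label_sets: list of per-teacher class-name lists, assumed
    pairwise disjoint (the setting studied in the paper).
    Returns (union class list, per-teacher {local index: global index} maps).
    """
    global_classes = []
    index_maps = []
    for labels in teacher_label_sets:
        mapping = {}
        for local_idx, name in enumerate(labels):
            mapping[local_idx] = len(global_classes)
            global_classes.append(name)
        index_maps.append(mapping)
    return global_classes, index_maps

# E.g., the four AG News topics split between two teachers (split illustrative):
teachers = [["World", "Sports"], ["Business", "Sci/Tech"]]
classes, maps = build_global_label_space(teachers)
```

The student's output layer is sized to `len(classes)`, and each teacher's predictions are later placed into the union space via its index map.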
2.2 Model Uncertainty–Aware Knowledge
Integration
As there are no annotated data available due to the
data privacy issue, we need to construct supervision
for guiding the student. Given the golden label distribution $\mathcal{T}(x)$ for each instance $x$ over $\mathcal{Y}$, we can train the student by minimizing the KL-divergence:

$$\mathcal{L} = \sum_{x \in \mathcal{D}} \mathrm{KL}\left(S(x) \,\|\, \mathcal{T}(x)\right), \quad (1)$$

where $S(x)$ denotes the output distribution of the student for input $x$. As we operate only at the level of output distributions, this framework is generalizable to PLMs that differ in model architecture and training data distribution. To estimate the golden supervision
$\mathcal{T}(x)$, we first derive the correlation between $\mathcal{T}(x)$ and the prediction $T_i(x)$ of teacher model $T_i$. Specifically, as teacher $T_i$ specializes in label set $\mathcal{Y}_i$, it can only predict $T_i(y|x)$ for instance $x$ when $y \in \mathcal{Y}_i$.
Therefore, the correlation between $T_i(y|x)$ and the global probability $\mathcal{T}(y|x)$ over the full class set can be derived as:

$$T_i(y|x) = \mathcal{T}(y \mid x, y \in \mathcal{Y}_i) \quad (2)$$
$$= \frac{\mathcal{T}(y, y \in \mathcal{Y}_i \mid x)}{\mathcal{T}(y \in \mathcal{Y}_i \mid x)}. \quad (3)$$

Figure 2: Model uncertainty (normalized) distributions evaluated with 1,000 instances randomly sampled from the AG News dataset. The vanilla prediction entropy distributions of the two teacher models overlap greatly (left), while Monte-Carlo Dropout produces a more accurate uncertainty approximation for distinguishing the correct teacher model (right). Best viewed in color.
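Rearranging Eqs. (2)–(3) gives $\mathcal{T}(y|x) = T_i(y|x) \cdot \mathcal{T}(y \in \mathcal{Y}_i \mid x)$: the global probability of a class is the teacher's local probability scaled by the estimated specialty membership. A minimal numeric sketch in plain Python, where the membership weights stand in for the uncertainty-based estimates introduced below and all names are illustrative:

```python
import math

def integrate(teacher_probs, membership):
    """Compose a distribution over the union label set (Eqs. 2-3 rearranged):
    each teacher's local probabilities are scaled by its estimated
    membership probability T(y in Y_i | x) and concatenated."""
    global_dist = []
    for probs, weight in zip(teacher_probs, membership):
        global_dist.extend(weight * p for p in probs)
    return global_dist

def kl(p, q):
    """KL(p || q), the per-instance student objective of Eq. (1)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Teacher 1 owns global classes {0, 1}; teacher 2 owns {2, 3}. The instance
# is judged far more likely to lie in teacher 1's specialty (0.9 vs. 0.1).
target = integrate([[0.7, 0.3], [0.6, 0.4]], membership=[0.9, 0.1])
student = [0.25, 0.25, 0.25, 0.25]
loss = kl(student, target)  # drives the student toward the integrated target
```

Since the membership weights sum to one, the composed target is itself a valid probability distribution over the union label set.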
The above derivation indicates that we can recover the golden probability distribution by (1) obtaining the teacher predictions, and (2) estimating the denominator, i.e., how likely instance $x$ lies in the specialty $\mathcal{Y}_i$ of teacher $T_i$. As instances associated with classes not in $\mathcal{Y}_i$ can be treated as out-of-distribution data for teacher $T_i$, the teacher's predictions on these instances would be more uncertain than on in-distribution instances (Hendrycks and Gimpel, 2017). We thus propose to approximate the denominator from the opposite direction, i.e., estimating how likely the instance does not belong to teacher $T_i$ via model uncertainty. In the following, we first explore different uncertainty estimations for recovering the golden supervision, and then introduce how we incorporate teacher predictions according to the estimated uncertainty scores.
2.2.1 Uncertainty Estimation
A naïve estimation directly takes statistics such as the entropy of the predicted class distribution.
However, due to the over-confidence issue of over-parameterized models like PLMs (Guo et al., 2017; Desai and Durrett, 2020), this simple estimation
can be unreliable. We investigate this by first split-
ting the instances of the AG News dataset (Zhang
et al.,2015) into two sets with disjoint labels, and
then fine-tuning teacher models on each set sepa-
rately. For each instance, there is a correct teacher
that is capable of handling it and a wrong teacher
that is not qualified for processing it. We plot
the prediction entropy distributions of the correct
teacher and the wrong teacher in the left part of
Figure 2. We find that the wrong teacher also produces confident predictions, with nearly zero uncertainty scores, even for instances outside its specialty, exhibiting a great overlap with the correct teacher. This indicates that relying on this simple metric misleads the identification of the adequate teacher. To remedy this,
tification of the adequate teacher. To remedy this,
inspired by recent progress in Bayesian neural net-
works (Blundell et al.,2015;Gal and Ghahramani,
2016), we propose to add small perturbations to
the model weights during inference to find out the
correct teacher model. The intuition is that, since an in-specialty instance is well fitted by the parameters of the qualified teacher, that teacher produces consistently confident predictions across multiple forward passes even with slightly perturbed parameters. On the contrary, small perturbations to the weights of the wrong teacher lead to drastic changes in the output probabilities, resulting in more uncertain predictions on average.
estimate the model uncertainty more accurately ac-
cording to the average predictions under parameter
perturbations. Specifically, we adopt Monte-Carlo
Dropout (Gal and Ghahramani,2016), where the
output distribution of an instance $x$ with $T_i$ is calculated as:

$$p_i(y \mid x, \mathcal{D}) \approx \frac{1}{K} \sum_{k=1}^{K} p_i\left(y \mid W^i_k, x\right) \quad (4)$$
$$= \frac{1}{K} \sum_{k=1}^{K} T_i\left(x; W^i_k\right), \quad (5)$$
where $W^i_k$ denotes the $k$-th masked weights of $T_i$ sampled from the Dropout distribution (Srivastava et al., 2014), and $K$ is the sampling number. The model uncertainty of teacher model $T_i$ can thus be summarized as the entropy of the averaged probability distribution $p_i$:

$$u_i = H(p_i) = -\sum_{y=1}^{|\mathcal{Y}_i|} p_i^{y} \log p_i^{y}. \quad (6)$$
As shown in the right part of Figure 2, the uncer-
tainty distributions of the correct teacher and the
wrong teacher model estimated via Monte-Carlo
Dropout exhibit a clearer difference than vanilla
prediction entropy, indicating its great potential for
guiding the probability combination.