
From Mimicking to Integrating:
Knowledge Integration for Pre-Trained Language Models
Lei Li1, Yankai Lin2,3, Xuancheng Ren1, Guangxiang Zhao1, Peng Li4, Jie Zhou5, Xu Sun1
1MOE Key Lab of Computational Linguistics, School of Computer Science, Peking University
2Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
3Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing, China
4Institute for AI Industry Research (AIR), Tsinghua University, China
5Pattern Recognition Center, WeChat AI, Tencent Inc., China
lilei@stu.pku.edu.cn xusun@pku.edu.cn
Abstract
Investigating better ways to reuse released pre-trained language models (PLMs) can significantly reduce the computational cost and the potential environmental side-effects. This paper explores a novel PLM reuse paradigm, Knowledge Integration (KI). Without human annotations available, KI aims to merge the knowledge from different teacher-PLMs, each of which specializes in a different classification problem, into a versatile student model. To achieve this, we first derive the correlation between virtual golden supervision and teacher predictions. We then design a Model Uncertainty-aware Knowledge Integration (MUKI) framework to recover the golden supervision for the student. Specifically, MUKI adopts Monte-Carlo Dropout to estimate model uncertainty for the supervision integration. An instance-wise re-weighting mechanism based on the margin of uncertainty scores is further incorporated to handle potentially conflicting supervision from teachers. Experimental results demonstrate that MUKI achieves substantial improvements over baselines on benchmark datasets. Further analysis shows that MUKI generalizes well to merging teacher models with heterogeneous architectures, and even teachers trained on cross-lingual datasets.1
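As a concrete illustration of the components named above, the sketch below shows one way the Monte-Carlo Dropout uncertainty estimate and the margin-based instance weight could be computed for a HuggingFace-style PLM classifier. The entropy score, the sigmoid mapping, and all function names here are illustrative assumptions for this sketch, not the released MUKI implementation.

import torch

@torch.no_grad()
def mc_dropout_uncertainty(model, inputs, n_samples=10):
    """Average several stochastic forward passes with dropout kept active;
    the entropy of the averaged distribution serves as an uncertainty score."""
    model.train()  # keep dropout layers stochastic at inference time
    probs = torch.stack(
        [torch.softmax(model(**inputs).logits, dim=-1) for _ in range(n_samples)]
    )                                # (n_samples, batch, num_labels)
    mean_probs = probs.mean(dim=0)   # (batch, num_labels)
    entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)
    return mean_probs, entropy       # lower entropy = higher confidence

def margin_weight(uncertainty_a, uncertainty_b, temperature=1.0):
    """Map the margin between two teachers' uncertainty scores to an
    instance-wise weight in [0.5, 1): a large margin means one teacher is
    clearly more reliable for the instance, a small margin flags a conflict."""
    margin = (uncertainty_a - uncertainty_b).abs()
    return torch.sigmoid(margin / temperature)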
1 Introduction
Large-scale pre-trained language models (PLMs), such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and T5 (Raffel et al., 2020), have recently achieved promising results after fine-tuning on various natural language processing (NLP) tasks. Many fine-tuned PLMs are generously released to facilitate research and deployment.
1 Our code is available at https://github.com/lancopku/MUKI. Part of the work was done while Yankai Lin and Peng Li were working at Tencent.
[Figure 1 schematic: left, knowledge distillation, with a teacher PLM, unlabeled data, and a student PLM over the same label set L; right, knowledge integration, with teacher PLMs 1 and 2 over label sets L1 and L2, unlabeled data, and a versatile student PLM over L1 ∪ L2.]
Figure 1: Comparison of knowledge distillation (KD) and knowledge integration (KI). KD assumes that the student performs predictions on the same label set as the teacher, while KI trains a student model that is capable of performing classification over the union label set of the teacher models.
Reusing these PLMs can greatly reduce the computational cost of retraining a PLM from scratch and alleviate potential environmental side-effects such as carbon footprints (Strubell et al., 2019), thus making NLP systems greener (Schwartz et al., 2020). A commonly adopted model reuse paradigm is knowledge distillation (KD) (Hinton et al., 2015; Romero et al., 2015), where a student model learns to mimic a teacher model by aligning its outputs to those of the teacher. Although this achieves promising results with PLMs (Sun et al., 2019; Jiao et al., 2020), the student is restricted to performing the same task as the teacher, which limits the reuse of the abundant available PLMs fine-tuned on different tasks, e.g., models fine-tuned on various label sets or even on different datasets.
In this paper, we generalize the idea of KD from mimicking teachers to integrating knowledge from teachers, and propose Knowledge Integration (KI) for PLMs. Given multiple fine-tuned teacher-PLMs, each of which is capable of performing classification over a unique label set, KI aims to train a versatile student that can make predictions over the union of the teacher label sets. As the labeled data for training the teachers may not be publicly released due to data privacy issues, we assume no human annotations are available during KI.
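To make this setting concrete, the following minimal sketch shows one way supervision over the union label set could be assembled from two teachers on unlabeled data; the concatenate-and-reweight scheme and the function names are illustrative assumptions, not the MUKI method derived in the following sections.

import torch

def integrate_teacher_predictions(logits_a, logits_b, weight_a):
    """Build a soft target over the union label set L1 ∪ L2 by concatenating
    the two teachers' class distributions, scaled per instance by the trust
    placed in teacher 1 (e.g. a weight derived from model uncertainty)."""
    probs_a = torch.softmax(logits_a, dim=-1) * weight_a.unsqueeze(-1)          # (batch, |L1|)
    probs_b = torch.softmax(logits_b, dim=-1) * (1.0 - weight_a).unsqueeze(-1)  # (batch, |L2|)
    return torch.cat([probs_a, probs_b], dim=-1)                                # (batch, |L1| + |L2|)

def ki_objective(student_logits, integrated_target):
    """Cross-entropy between the student's distribution over L1 ∪ L2 and the
    integrated soft target; no human annotations are involved."""
    log_probs = torch.log_softmax(student_logits, dim=-1)
    return -(integrated_target * log_probs).sum(dim=-1).mean()

A student trained with such targets can then predict over L1 ∪ L2, even though neither teacher ever saw the other's labels.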