Distilling Causal Effect from Miscellaneous Other-Class
for Continual Named Entity Recognition
Junhao Zheng, Zhanxian Liang, Haibin Chen, Qianli Ma*
School of Computer Science and Engineering,
South China University of Technology, Guangzhou, China
junhaozheng47@outlook.com,qianlima@scut.edu.cn
Abstract
Continual Learning for Named Entity Recognition (CL-NER) aims to learn a growing number of entity types over time from a stream of data. However, simply learning Other-Class in the same way as new entity types amplifies catastrophic forgetting and leads to a substantial performance drop. The main cause is that Other-Class samples usually contain old entity types, and the old knowledge in these Other-Class samples is not preserved properly. Through causal inference, we identify that the forgetting is caused by the missing causal effect from the old data. To this end, we propose a unified causal framework to retrieve the causality from both new entity types and Other-Class. Furthermore, we apply curriculum learning to mitigate the impact of label noise and introduce a self-adaptive weight for balancing the causal effects between new entity types and Other-Class. Experimental results on three benchmark datasets show that our method outperforms the state-of-the-art method by a large margin. Moreover, our method can be combined with existing state-of-the-art methods to further improve performance in CL-NER.¹
1 Introduction
Named Entity Recognition (NER) is a vital task in various NLP applications (Ma and Hovy, 2016). Traditional NER aims at extracting entities from unstructured text and classifying them into a fixed set of entity types (e.g., Person, Location, Organization, etc.). However, in many real-world scenarios, the training data are streamed, and the NER systems are required to recognize new entity types to support new functionalities, which can be formulated into the paradigm of continual learning (CL, a.k.a. incremental learning or lifelong learning) (Thrun, 1998; Parisi et al., 2019). For instance, voice assistants such as Siri or Alexa are often required to extract new entity types (e.g., Song, Band) for grasping new intents (e.g., GetMusic) (Monaikul et al., 2021).

*Corresponding author
¹Our codes are publicly available at https://github.com/zzz47zzz/CFNER

Figure 1: An illustration of Other-class in CL-NER. Suppose that a model learns four entity types in CoNLL2003 sequentially. “LOC”: Location; “MISC”: Miscellaneous; “ORG”: Organisation; “PER”: Person.
However, as is well known, continual learning faces a serious challenge called catastrophic forgetting when learning new knowledge (McCloskey and Cohen, 1989; Robins, 1995; Goodfellow et al., 2013; Kirkpatrick et al., 2017). More specifically, simply fine-tuning a NER system on new data usually leads to a substantial performance drop on previous data. In contrast, a child can naturally learn new concepts (e.g., Song and Band) without forgetting the learned concepts (e.g., Person and Location). Therefore, continual learning for NER (CL-NER) is a ubiquitous issue and a big challenge in achieving human-level intelligence.
In the standard setting of continual learning, only new entity types are recognized by the model in each CL step. For CL-NER, the new dataset contains not only new entity types but also Other-class tokens which do not belong to any new entity type. For instance, about 89% of the tokens belong to Other-class in OntoNotes5 (Hovy et al., 2006). Unlike accuracy-oriented tasks such as image/text classification, NER inevitably introduces a vast number of Other-class samples into the training data. As a result, the model becomes strongly biased towards Other-class (Li et al., 2020). Even worse, the meaning of Other-class varies along with the continual learning process. For example, “Europe” is tagged as Location if and only if the entity type Location is learned in the current CL step. Otherwise, the token “Europe” will be tagged as Other-class. An illustration is given in Figure 1 to demonstrate Other-class in CL-NER. In a nutshell, the continually changing meaning of Other-class as well as the imbalance between entity and Other-class tokens amplify the forgetting problem in CL-NER.

Figure 2: An illustration of the impact of Other-class samples on OntoNotes5. We consider two scenarios with different extra annotation levels on Other-class samples: (1) annotate all recognized entity types on the data in the current CL step (Current); (2) no extra annotations on Other-class samples (None).
Figure 2 is an illustration of the impact of Other-class samples. We divide the training set into 18 disjoint splits, and each split corresponds to one entity type to learn. Then, we only retain the labels of the corresponding entity type in each split, while the other tokens are tagged as Other-class. Next, the NER model learns the 18 entity types one after another, as in CL. To eliminate the impact of forgetting, we assume that all previously seen training data can be stored. Figure 2 shows two scenarios in which Other-class samples are additionally annotated with ground-truth labels or not. The results show that ignoring the different meanings of Other-class affects the performance dramatically. The main cause is that Other-class contains old entities. From another perspective, the old entities in Other-class are similar to the reserved samples of old classes in the data replay strategy (Rebuffi et al., 2017). Therefore, we raise a question: how can we learn from Other-class samples for anti-forgetting in CL-NER?
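For concreteness, the label-masking protocol behind Figure 2 can be sketched as follows. This is a minimal illustration under our own assumptions (plain tags without BIO prefixes, a hypothetical `splits` mapping from entity types to disjoint subsets of the training set), not the exact preprocessing in the released code.

```python
from typing import Dict, List, Tuple

Sentence = Tuple[List[str], List[str]]  # (tokens, gold tags)

def mask_to_single_type(sentences: List[Sentence], kept_type: str) -> List[Sentence]:
    """Keep only labels of `kept_type`; every other token becomes Other-class ("O")."""
    return [
        (tokens, [tag if tag == kept_type else "O" for tag in tags])
        for tokens, tags in sentences
    ]

# Hypothetical CL stream: disjoint splits of OntoNotes5, one entity type per split.
entity_types: List[str] = ["PERSON", "GPE", "ORG", "DATE"]  # ... 18 types in total
splits: Dict[str, List[Sentence]] = {t: [] for t in entity_types}  # filled from the training set

cl_stream = [mask_to_single_type(splits[t], t) for t in entity_types]
```

Under the “Current” scenario of Figure 2, the masking step would instead retain all entity types recognized up to the current step rather than a single type.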
In this study, we address this question with a Causal Framework for CL-NER (CFNER) based on causal inference (Glymour et al., 2016; Schölkopf, 2022). Through causal lenses, we determine that the crux of CL-NER lies in establishing causal links from the old data to new entity types and Other-class. To achieve this, we utilize the old model (i.e., the NER model trained on old entity types) to recognize old entities in Other-class samples and distill causal effects (Glymour et al., 2016) from both new entity types and Other-class simultaneously. In this way, the causality of Other-class can be learned to preserve old knowledge, while the different meanings of Other-class can be captured dynamically. In addition, we design a curriculum learning (Bengio et al., 2009) strategy to enhance the causal effect from Other-class by mitigating the label noise generated by the old model. Moreover, we introduce a self-adaptive weight to dynamically balance the causal effects from Other-class and new entity types. Extensive experiments on three benchmark NER datasets, i.e., OntoNotes5, i2b2 (Murphy et al., 2010) and CoNLL2003 (Sang and De Meulder, 2003), validate the effectiveness of the proposed method. The experimental results show that our method significantly outperforms the previous state-of-the-art method in CL-NER. The main contributions are summarized as follows:

• We frame CL-NER into a causal graph (Pearl, 2009) and propose a unified causal framework to retrieve the causalities from both Other-class and new entity types.

• We are the first to distill causal effects from Other-class for anti-forgetting in CL, and we propose a curriculum learning strategy and a self-adaptive weight to enhance the causal effect in Other-class.

• Through extensive experiments, we show that our method achieves state-of-the-art performance in CL-NER and can be implemented as a plug-and-play module to further improve the performance of other CL methods.
2 Related Work
2.1 Continual Learning for NER
Despite the fast development of CL in computer vision, most of these methods (Douillard et al., 2020; Rebuffi et al., 2017; Hou et al., 2019) are devised for accuracy-oriented tasks such as image classification and fail to preserve the old knowledge in Other-class samples. In our experiments, we find that simply applying these methods to CL-NER does not lead to satisfactory performance.
In CL-NER, a straightforward solution for learning old knowledge from Other-class samples is self-training (Rosenberg et al., 2005; De Lange et al., 2019). In each CL step, the old model is used to annotate the Other-class samples in the new dataset. Next, a new NER model is trained to recognize both old and new entity types in the dataset. The main disadvantage of self-training is that the errors caused by wrong predictions of the old model are propagated to the new model (Monaikul et al., 2021). Monaikul et al. (2021) proposed a method based on knowledge distillation (Hinton et al., 2015) called ExtendNER, where the old model acts as a teacher and the new model acts as a student. Compared with self-training, this distillation-based method takes the uncertainty of the old model's predictions into consideration and reaches state-of-the-art performance in CL-NER.
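To make the contrast with self-training concrete, the sketch below shows a token-level distillation loss in this spirit: the frozen old model provides soft targets for the old types, while gold labels supervise the new types and Other-class. The layout of the label space (old types occupying the first columns of the new classifier) and the temperature are our own assumptions, not the exact formulation of Monaikul et al. (2021).

```python
import torch
import torch.nn.functional as F

def distillation_step(new_logits, old_logits, gold_labels, n_old_types, temperature=2.0):
    """One training step mixing KD on old types with cross-entropy on new labels.

    new_logits:  (num_tokens, n_old + n_new) scores from the new model
    old_logits:  (num_tokens, n_old)         scores from the frozen old model
    gold_labels: (num_tokens,)               gold labels for new types and Other-class
    """
    # Soft targets keep the old model's uncertainty, unlike hard pseudo-labels
    # in self-training.
    soft_targets = F.softmax(old_logits / temperature, dim=-1)
    log_probs_old = F.log_softmax(new_logits[:, :n_old_types] / temperature, dim=-1)
    kd_loss = F.kl_div(log_probs_old, soft_targets, reduction="batchmean") * temperature ** 2

    ce_loss = F.cross_entropy(new_logits, gold_labels)
    return ce_loss + kd_loss
```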
Recently, Das et al. (2022) alleviated the problem of Other-class tokens in few-shot NER by contrastive learning and pretraining techniques. Unlike them, our method explicitly alleviates the problem brought by Other-class tokens through a causal framework in CL-NER.
2.2 Causal Inference
Causal inference (Glymour et al., 2016; Schölkopf, 2022) has recently been introduced to various computer vision and NLP tasks, such as semantic segmentation (Zhang et al., 2020), long-tailed classification (Tang et al., 2020; Nan et al., 2021), distantly supervised NER (Zhang et al., 2021) and neural dialogue generation (Zhu et al., 2020). Hu et al. (2021) first applied causal inference to CL and pointed out that the vanishing old-data effect leads to forgetting. Inspired by the causal view in Hu et al. (2021), we mitigate the forgetting problem in CL-NER by mining the old knowledge in Other-class samples.
3 Causal Views on (Anti-) Forgetting
In this section, we explain (anti-)forgetting in CL from a causal perspective. First, we model the causalities among data, features, and predictions between any two consecutive CL steps with a causal graph (Pearl, 2009) to identify the forgetting problem. The causal graph is a directed acyclic graph whose nodes are variables and whose directed edges are causalities between nodes. Next, we introduce how causal effects are utilized for anti-forgetting.
3.1 Causal Graph
Figure 3a shows the causal graph of CL-NER when no anti-forgetting techniques are used. Specifically, we denote the old data as $S$; the new data as $D$; the features of the new data extracted by the old model and the new model as $X_0$ and $X$, respectively; and the prediction on the new data as $\hat{Y}$ (i.e., the probability distribution (scores)). The causality between nodes is as follows: (1) $D \rightarrow X \rightarrow \hat{Y}$: $D \rightarrow X$ represents that the feature $X$ is extracted by the backbone model (e.g., BERT (Devlin et al., 2019)), and $X \rightarrow \hat{Y}$ indicates that the prediction $\hat{Y}$ is obtained by feeding the feature $X$ to the classifier (e.g., a fully-connected layer); (2) $S \rightarrow X_0 \leftarrow D$: these links represent that the old feature representation of the new data, $X_0$, is determined by the new data $D$ and the old model trained on the old data $S$. Figure 3a shows that forgetting happens because there are no causal links between $S$ and $\hat{Y}$. More explanations about forgetting in CL-NER are provided in Appendix A.
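As a rough illustration of these variables, the snippet below extracts the two feature views of the same batch of new data $D$: $X_0$ from a frozen copy of the old backbone and $X$ from the trainable new backbone, with a linear classifier on top of $X$ producing $\hat{Y}$. The checkpoint names, sample sentence, and label-set size are placeholders for illustration, not the paper's configuration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustration of the causal-graph variables on one batch of new data D.
# In an actual CL run, the old backbone would be the frozen checkpoint
# from the previous step rather than a fresh pretrained model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
old_backbone = AutoModel.from_pretrained("bert-base-cased").eval()  # stands in for the old model (trained on S)
new_backbone = AutoModel.from_pretrained("bert-base-cased")         # trainable new model

batch = tokenizer(["Europe backed the new climate accord ."], return_tensors="pt")

with torch.no_grad():
    X0 = old_backbone(**batch).last_hidden_state  # collider X0: features of D under the old model
X = new_backbone(**batch).last_hidden_state       # X: features of D under the new model

classifier = torch.nn.Linear(X.size(-1), 5)  # 5 is an arbitrary label-set size
Y_hat = classifier(X)                        # prediction \hat{Y} (token-level scores)
```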
3.2 Colliding Effects
In order to build causal paths from $S$ to $\hat{Y}$, a naive solution is to store (a fraction of) the old data, so that a causal link $S \rightarrow D$ is built. However, storing old data contradicts the scenario of CL to some extent. To deal with this dilemma, Hu et al. (2021) proposed to add a causal path between the old and new data by using the Colliding Effect (Glymour et al., 2016). Consequently, $S$ and $D$ become correlated with each other when we control the collider $X_0$. Here is an intuitive example: the causal graph $\text{sprinkler} \rightarrow \text{pavement} \leftarrow \text{weather}$ represents that the pavement's condition (wet/dry) is determined by both the weather (rainy/sunny) and the sprinkler (on/off). Typically, the weather and the sprinkler are independent of each other. However, if we observe that the pavement is wet and know that the sprinkler is off, we can infer that the weather is likely to be rainy, and vice versa.
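The colliding effect is easy to verify numerically. The toy simulation below (our own illustration, not from the paper) draws independent sprinkler and weather variables and shows that conditioning on the collider, a wet pavement, makes them negatively correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two independent causes of the collider "pavement".
sprinkler = rng.random(n) < 0.3  # sprinkler on with probability 0.3
rainy = rng.random(n) < 0.4      # rainy with probability 0.4
wet = sprinkler | rainy          # the pavement is wet if either cause is active

# Unconditionally, the two causes are (nearly) uncorrelated.
print(np.corrcoef(sprinkler, rainy)[0, 1])                 # ~0.0

# Conditioning on the collider (wet pavement) induces a negative correlation:
# knowing the sprinkler is off makes rain more likely, and vice versa.
print(np.corrcoef(sprinkler[wet], rainy[wet])[0, 1])       # noticeably negative
```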
4 A Causal Framework for CL-NER
In this section, we frame CL-NER into a causal graph and identify that learning the causality in Other-class is crucial for CL-NER. Based on the characteristics of CL-NER, we propose a unified causal framework to retrieve the causalities from both Other-class and new entity types. We are the first to distill causal effects from Other-class for anti-forgetting in CL. Furthermore, we introduce