self-training (Rosenberg et al., 2005; De Lange
et al., 2019). In each CL step, the old model is
used to annotate the Other-class samples in the
new dataset. Next, a new NER model is trained
to recognize both old and new entity types in the
dataset. The main disadvantage of self-training
is that the errors caused by wrong predictions of
the old model are propagated to the new model
(Monaikul et al., 2021). Monaikul et al. (2021)
proposed a method based on knowledge distilla-
tion (Hinton et al., 2015) called ExtendNER where
the old model acts as a teacher and the new model
acts as a student. Compared with self-training, this distillation-based method takes the uncertainty of the old model's predictions into consideration and achieves state-of-the-art performance in CL-NER.
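The contrast between the two training signals can be made concrete. The sketch below is our own minimal illustration, not the ExtendNER implementation: the loss form, temperature value, and helper names are assumptions. Self-training keeps only the old model's argmax as a hard pseudo-label, while distillation trains the student against the teacher's full (temperature-softened) distribution, preserving its uncertainty:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over a list of logits.
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def self_training_target(teacher_logits):
    # Self-training: the old model's argmax becomes a hard pseudo-label,
    # discarding how confident the old model actually was.
    k = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    return [1.0 if i == k else 0.0 for i in range(len(teacher_logits))]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Distillation: cross-entropy between the teacher's softened distribution
    # and the student's, so near-ties in the teacher stay near-ties.
    p = softmax(teacher_logits, T)  # soft targets keep the uncertainty
    q = softmax(student_logits, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# A token where the old model is nearly tied between two old entity types:
teacher = [2.0, 1.8, -1.0]
print(self_training_target(teacher))            # hard label: all mass on one type
print([round(x, 3) for x in softmax(teacher, 2.0)])  # soft target: near-tie preserved
```

The hard label erases the near-tie between the first two types; the soft target keeps substantial probability on both, which is the uncertainty the distillation loss propagates to the student.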
Recently, Das et al. (2022) alleviated the problem of Other-tokens in few-shot NER via contrastive learning and pretraining techniques. Unlike them, our method explicitly alleviates the problem brought by Other-class tokens through a causal framework in CL-NER.
2.2 Causal Inference
Causal inference (Glymour et al., 2016; Schölkopf,
2022) has been recently introduced to various com-
puter vision and NLP tasks, such as semantic seg-
mentation (Zhang et al., 2020), long-tailed classifi-
cation (Tang et al., 2020; Nan et al., 2021), distantly
supervised NER (Zhang et al., 2021) and neural dia-
logue generation (Zhu et al., 2020). Hu et al. (2021)
first applied causal inference in CL and pointed out
that the vanishing old data effect leads to forgetting.
Inspired by the causal view in (Hu et al., 2021),
we mitigate the forgetting problem in CL-NER by
mining the old knowledge in Other-class samples.
3 Causal Views on (Anti-) Forgetting
In this section, we explain the (anti-) forgetting
in CL from a causal perspective. First, we model
the causalities among data, feature, and prediction
at any consecutive CL step with a causal graph
(Pearl, 2009) to identify the forgetting problem.
A causal graph is a directed acyclic graph whose nodes are variables and whose directed edges denote causal relations between them. Next, we introduce how causal effects are utilized for anti-forgetting.
3.1 Causal Graph
Figure 3a shows the causal graph of CL-NER when
no anti-forgetting techniques are used. Specifically, we denote the old data as S; the new data as D; the features of the new data extracted from the old and new models as X0 and X, respectively; and the prediction on the new data as Ŷ (i.e., the probability distribution (scores)). The causality between nodes is as follows:
(1) D→X→Ŷ: D→X represents that the feature X is extracted by the backbone model (e.g., BERT (Devlin et al., 2019)), and X→Ŷ indicates that the prediction Ŷ is obtained by applying the classifier (e.g., a fully-connected layer) to the feature X; (2) S→X0←D: these links represent that the old feature representation of the new data, X0, is determined by the new data D and by the old model trained on the old data S. Figure 3a shows that forgetting happens because there is no causal path from S to Ŷ. Further explanation of forgetting in CL-NER is provided in Appendix A.
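The graph just described is small enough to check mechanically. The sketch below (our own illustration; variable names are ours) encodes the edges of Figure 3a and verifies that S has no directed path to Ŷ, which is the formal sense in which the old data's effect on the prediction vanishes:

```python
# Causal graph of Figure 3a: edges point from cause to effect.
edges = {
    "D":  ["X", "X0"],   # new data determines both feature representations
    "S":  ["X0"],        # old data shapes X0 only through the old model
    "X":  ["Y_hat"],     # the classifier maps the new feature to the prediction
    "X0": [],            # X0 is a collider with no outgoing edge
    "Y_hat": [],
}

def has_directed_path(edges, src, dst):
    # Depth-first search for a directed path src -> ... -> dst.
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return False

print(has_directed_path(edges, "S", "Y_hat"))  # False: S cannot influence the prediction
print(has_directed_path(edges, "D", "Y_hat"))  # True: only D drives the prediction
```

Every influence of S dead-ends at the collider X0, so the prediction is driven by D alone; this is the gap the colliding effect in the next subsection is designed to close.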
3.2 Colliding Effects
To build causal paths from S to Ŷ, a naive solution is to store (a fraction of) the old data, so that a causal link S→D is built. However, storing old data contradicts the scenario of CL to some extent. To resolve this dilemma, Hu et al. (2021) proposed to add a causal path S↔D between the old and new data by using the Colliding Effect (Glymour et al., 2016). Consequently, S and D become correlated with each other when we control the collider X0. Here is an intuitive example: the causal graph sprinkler→pavement←weather represents that
represents
the pavement’s condition (wet/dry) is determined
by both the weather (rainy/sunny) and the sprinkler
(on/off). Typically, the weather and the sprinkler
are independent of each other. However, if we ob-
serve that the pavement is wet and know that the
sprinkler is off, we can infer that the weather is
likely to be rainy, and vice versa.
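The sprinkler example can be checked numerically. In the sketch below (our own illustration, with arbitrarily chosen marginal probabilities), the two causes are sampled independently, yet conditioning on the collider makes one cause informative about the other:

```python
import random

random.seed(0)
samples = []
for _ in range(100_000):
    rainy = random.random() < 0.3          # weather, independent of the sprinkler
    sprinkler_on = random.random() < 0.4   # sprinkler, independent of the weather
    wet = rainy or sprinkler_on            # the collider: pavement condition
    samples.append((rainy, sprinkler_on, wet))

def p_rainy(cond):
    # Estimate P(rainy | cond) from the samples satisfying cond.
    subset = [s for s in samples if cond(s)]
    return sum(s[0] for s in subset) / len(subset)

print(round(p_rainy(lambda s: True), 2))               # prior: ~0.30
print(round(p_rainy(lambda s: not s[1]), 2))           # given sprinkler off: still ~0.30
print(round(p_rainy(lambda s: s[2] and not s[1]), 2))  # given wet AND sprinkler off: 1.0
```

Knowing the sprinkler is off alone tells us nothing about the weather, but once we also control the collider (the pavement is wet), rain becomes certain. The same mechanism lets controlling X0 correlate S with D.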
4 A Causal Framework for CL-NER
In this section, we frame CL-NER in a causal graph and identify that learning the causality in the Other class is crucial for CL-NER. Based on the characteristics of CL-NER, we propose a unified causal framework to retrieve the causalities from both the Other class and the new entity types. We are the first to distill causal effects from the Other class for anti-forgetting in CL. Furthermore, we introduce