self-training (Rosenberg et al., 2005; De Lange
et al., 2019). In each CL step, the old model is
used to annotate the Other-class samples in the
new dataset. Next, a new NER model is trained
to recognize both old and new entity types in the
dataset. The main disadvantage of self-training
is that the errors caused by wrong predictions of
the old model are propagated to the new model
(Monaikul et al., 2021). Monaikul et al. (2021)
proposed a method based on knowledge distilla-
tion (Hinton et al., 2015) called ExtendNER where
the old model acts as a teacher and the new model
acts as a student. Compared with self-training, this distillation-based method takes the uncertainty of the old model's predictions into consideration and achieves state-of-the-art performance in CL-NER.
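The contrast between the two training signals can be made concrete. The sketch below is our own minimal illustration, not the ExtendNER implementation: the loss form, temperature value, and helper names are assumptions. Self-training keeps only the old model's argmax as a hard pseudo-label, while distillation trains the student against the teacher's full (temperature-softened) distribution, preserving its uncertainty:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over a list of logits.
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def self_training_target(teacher_logits):
    # Self-training: the old model's argmax becomes a hard pseudo-label,
    # discarding how confident the old model actually was.
    k = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    return [1.0 if i == k else 0.0 for i in range(len(teacher_logits))]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Distillation: cross-entropy between the teacher's softened distribution
    # and the student's, so near-ties in the teacher stay near-ties.
    p = softmax(teacher_logits, T)  # soft targets keep the uncertainty
    q = softmax(student_logits, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# A token where the old model is nearly tied between two old entity types:
teacher = [2.0, 1.8, -1.0]
print(self_training_target(teacher))            # hard label: all mass on one type
print([round(x, 3) for x in softmax(teacher, 2.0)])  # soft target: near-tie preserved
```

The hard label erases the near-tie between the first two types; the soft target keeps substantial probability on both, which is the uncertainty the distillation loss propagates to the student.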
Recently, Das et al. (2022) alleviated the problem of Other-tokens in few-shot NER via contrastive learning and pretraining techniques. Unlike them, our method explicitly alleviates the problem brought by Other-class tokens through a causal framework in CL-NER.
2.2 Causal Inference
Causal inference (Glymour et al., 2016; Schölkopf,
2022) has been recently introduced to various com-
puter vision and NLP tasks, such as semantic seg-
mentation (Zhang et al., 2020), long-tailed classifi-
cation (Tang et al., 2020; Nan et al., 2021), distantly
supervised NER (Zhang et al., 2021) and neural dia-
logue generation (Zhu et al., 2020). Hu et al. (2021)
first applied causal inference in CL and pointed out
that the vanishing old data effect leads to forgetting.
Inspired by the causal view in (Hu et al., 2021),
we mitigate the forgetting problem in CL-NER by
mining the old knowledge in Other-class samples.
3 Causal Views on (Anti-) Forgetting
In this section, we explain the (anti-) forgetting
in CL from a causal perspective. First, we model
the causalities among data, feature, and prediction
at any consecutive CL step with a causal graph
(Pearl, 2009) to identify the forgetting problem.
A causal graph is a directed acyclic graph whose nodes are variables and whose directed edges denote causal relations between them. Next, we introduce how causal effects are utilized for anti-forgetting.
3.1 Causal Graph
Figure 3a shows the causal graph of CL-NER when
no anti-forgetting techniques are used. Specifically, we denote the old data as S; the new data as D; the features of the new data extracted from the old and new models as X0 and X, respectively; and the prediction on the new data as Ŷ (i.e., the probability distribution (scores)). The causality between nodes is as follows:
(1) D→X→Ŷ: D→X represents that the feature X is extracted by the backbone model (e.g., BERT (Devlin et al., 2019)), and X→Ŷ indicates that the prediction Ŷ is obtained by applying the classifier (e.g., a fully-connected layer) to the feature X; (2) S→X0←D: these links represent that the old feature representation of the new data, X0, is determined by the new data D and by the old model trained on the old data S. Figure 3a shows that forgetting happens because there is no causal path from S to Ŷ. Further explanation of forgetting in CL-NER is provided in Appendix A.
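The graph just described is small enough to check mechanically. The sketch below (our own illustration; variable names are ours) encodes the edges of Figure 3a and verifies that S has no directed path to Ŷ, which is the formal sense in which the old data's effect on the prediction vanishes:

```python
# Causal graph of Figure 3a: edges point from cause to effect.
edges = {
    "D":  ["X", "X0"],   # new data determines both feature representations
    "S":  ["X0"],        # old data shapes X0 only through the old model
    "X":  ["Y_hat"],     # the classifier maps the new feature to the prediction
    "X0": [],            # X0 is a collider with no outgoing edge
    "Y_hat": [],
}

def has_directed_path(edges, src, dst):
    # Depth-first search for a directed path src -> ... -> dst.
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return False

print(has_directed_path(edges, "S", "Y_hat"))  # False: S cannot influence the prediction
print(has_directed_path(edges, "D", "Y_hat"))  # True: only D drives the prediction
```

Every influence of S dead-ends at the collider X0, so the prediction is driven by D alone; this is the gap the colliding effect in the next subsection is designed to close.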
3.2 Colliding Effects
To build causal paths from S to Ŷ, a naive solution is to store (a fraction of) the old data, so that a causal link S→D is built. However, storing old data contradicts the scenario of CL to some extent. To resolve this dilemma, Hu et al. (2021) proposed to add a causal path S↔D between the old and new data by using the Colliding Effect (Glymour et al., 2016). Consequently, S and D become correlated with each other when we control the collider X0. Here is an intuitive example: the causal graph sprinkler→pavement←weather represents that
represents
the pavement’s condition (wet/dry) is determined
by both the weather (rainy/sunny) and the sprinkler
(on/off). Typically, the weather and the sprinkler
are independent of each other. However, if we ob-
serve that the pavement is wet and know that the
sprinkler is off, we can infer that the weather is
likely to be rainy, and vice versa.
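The sprinkler example can be checked numerically. In the sketch below (our own illustration, with arbitrarily chosen marginal probabilities), the two causes are sampled independently, yet conditioning on the collider makes one cause informative about the other:

```python
import random

random.seed(0)
samples = []
for _ in range(100_000):
    rainy = random.random() < 0.3          # weather, independent of the sprinkler
    sprinkler_on = random.random() < 0.4   # sprinkler, independent of the weather
    wet = rainy or sprinkler_on            # the collider: pavement condition
    samples.append((rainy, sprinkler_on, wet))

def p_rainy(cond):
    # Estimate P(rainy | cond) from the samples satisfying cond.
    subset = [s for s in samples if cond(s)]
    return sum(s[0] for s in subset) / len(subset)

print(round(p_rainy(lambda s: True), 2))               # prior: ~0.30
print(round(p_rainy(lambda s: not s[1]), 2))           # given sprinkler off: still ~0.30
print(round(p_rainy(lambda s: s[2] and not s[1]), 2))  # given wet AND sprinkler off: 1.0
```

Knowing the sprinkler is off alone tells us nothing about the weather, but once we also control the collider (the pavement is wet), rain becomes certain. The same mechanism lets controlling X0 correlate S with D.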
4 A Causal Framework for CL-NER
In this section, we frame CL-NER in a causal graph and identify that learning the causality in the Other class is crucial for CL-NER. Based on the characteristics of CL-NER, we propose a unified causal framework to retrieve the causalities from both the Other class and the new entity types. We are the first to distill causal effects from the Other class for anti-forgetting in CL. Furthermore, we introduce