Improving Continual Relation Extraction through
Prototypical Contrastive Learning
Chengwei Hu1, Deqing Yang1, Haoliang Jin1, Zhen Chen1, Yanghua Xiao2
1School of Data Science, Fudan University, Shanghai, China
2School of Computer Science, Fudan University, Shanghai, China
{cwhu20, yangdeqing, shawyh}@fudan.edu.cn
{hljin21, zhenchen21}@m.fudan.edu.cn
Abstract
Continual relation extraction (CRE) aims to extract relations from data that arrives continuously and iteratively, of which the major challenge is the catastrophic forgetting of old tasks. In order to alleviate this critical problem for enhanced CRE performance, we propose a novel Continual Relation Extraction framework with Contrastive Learning, namely CRECL, which is built with a classification network and a prototypical contrastive network to achieve the class-incremental learning of CRE. Specifically, in the contrastive network a given instance is contrasted with the prototype of each candidate relation stored in the memory module. Such a contrastive learning scheme makes the data distributions of all tasks more distinguishable, so as to further alleviate catastrophic forgetting. Our experimental results not only demonstrate CRECL's advantage over the state-of-the-art baselines on two public datasets, but also verify the effectiveness of CRECL's contrastive learning in improving CRE performance.
1 Introduction
In some scenarios of relation extraction (RE), massive new data including new relations emerges continuously, which cannot be handled by traditional RE methods. To handle this situation, continual relation extraction (CRE) (Wang et al., 2019) was proposed. Due to limited storage and computing resources, it is impractical to store all training data of previous tasks. As new tasks are learned and new relations emerge constantly, the model tends to forget the existing knowledge about old relations. Therefore, the problem of catastrophic forgetting damages CRE performance severely (Hassabis et al., 2017; Thrun and Mitchell, 1995).
This work is supported by Shanghai Science and Technology Innovation Action Plan (China) No. 21511100401.

Figure 1: The data distribution map (better viewed in color) after training a classification model on an old task and then a new task. Data of many different relations (different colors) from the old (dots) and new (crosses) tasks are mixed due to catastrophic forgetting, making it hard to distinguish the new task's relations from the old task's relations.

In recent years, some efforts have focused on alleviating catastrophic forgetting in CRE, which can be divided into consolidation-based methods (Kirkpatrick et al., 2017), dynamic architecture methods (Chen et al., 2015; Fernando et al., 2017) and memory-based methods (Chaudhry et al., 2018; Han et al., 2020; Cui et al., 2021). Despite
these methods' effectiveness on CRE, most of them have not taken full advantage of the negative relation information in all tasks to alleviate catastrophic forgetting more thoroughly, resulting in suboptimal CRE performance.
Through our empirical studies, we found that
the catastrophic forgetting of a model results in the
indistinguishability between the data (instances)
distributions of all tasks, making it hard to distin-
guish the relations of all tasks. We illustrate it with
the data distribution map after training a relation
classification model for a new task, as shown in
Figure 1, where the dots and crosses represent the
data of the old and new task respectively, and dif-
ferent colors represent different relations. It shows
that the data points of different colors in either
dot group (old task) or cross group (new task) are
distinguishable. However, many dots and crosses
are mixed, making it hard to discriminate the new
task’s relations from the old task’s relations. There-
fore, making the data distributions of all tasks more
distinguishable is crucial to achieve better CRE.
To address the above issue, in this paper we propose a novel Continual Relation Extraction framework with Contrastive Learning, namely CRECL,
which is built with a classification network and a
contrastive network. In order to fully leverage the
information of negative relations to make the data
distributions of all tasks more distinguishable, we
design a prototypical contrastive learning scheme.
Specifically, in the contrastive network of CRECL,
a given instance is contrasted with the prototype of
each candidate relation stored in the memory mod-
ule. Such sufficient comparisons ensure the align-
ment and uniformity between the data distributions
of old and new tasks. Therefore, the catastrophic
forgetting in CRECL is alleviated more thoroughly,
resulting in enhanced CRE performance. In addition, different from the classification over a fixed (relation) class set as in (Han et al., 2020; Cui et al., 2021), CRECL achieves class-incremental learning of CRE, which is more feasible in real-world CRE scenarios.
Our contributions in this paper are summarized
as follows:
1. We propose a novel CRE framework CRECL
that combines a classification network and a pro-
totypical contrastive network to fully alleviate the
problem of catastrophic forgetting.
2. With the contrast-based mechanism, our CRECL can effectively achieve class-incremental learning, which is more practical in real-world CRE scenarios.
3. Our extensive experiments justify CRECL's advantage over the state-of-the-art (SOTA) models on two benchmark datasets, TACRED and FewRel. Furthermore, we provide deep insights into the reasons for the compared models' distinct performance.
2 Related Work
In this section, we briefly introduce continual learn-
ing and contrastive learning which are both related
to our work.
Continual learning (Delange et al., 2021; Parisi et al., 2019) focuses on learning from a continuous stream of data. Continual learning models are able to accumulate knowledge across different tasks without retraining from scratch. The major challenge in continual learning is to alleviate catastrophic forgetting, i.e., the phenomenon that performance on previous tasks declines significantly over time as new tasks come in. To overcome catastrophic forgetting, most recent works can be divided into three categories.
1) Regularization-based methods impose constraints on the update of parameters. For example, the LwF approach (Li and Hoiem, 2016) enforces the network for previously learned tasks to stay similar to the network for the current task by knowledge distillation. However, LwF depends heavily on the data of the new task and its relatedness to prior tasks. EWC (Kirkpatrick et al., 2016) adopts a quadratic penalty on the difference between the parameters for old and new tasks (see the penalty form after this paragraph). It models the parameter relevance with respect to the training data as a posterior distribution, which is estimated by a Laplace approximation with the precision determined by the Fisher Information Matrix. WA (Zhao et al., 2020) maintains discrimination and fairness between the new and old tasks by adjusting the parameters of the last layer. 2) Dynamic architecture methods change models' architectural properties upon new data by dynamically accommodating new neural resources, such as an increased number of neurons. For example, PackNet (Mallya and Lazebnik, 2017) iteratively assigns parameter subsets to consecutive tasks by constituting pruning masks, which fixes the task parameter subset for future tasks. DER (Yan et al., 2021) proposes a novel two-stage learning approach to obtain a more effective dynamically expandable representation. 3) Memory-based methods explicitly retrain the model on a limited subset of stored samples during the training on new tasks. For example, iCaRL (Rebuffi et al., 2017) focuses on learning in a class-incremental way; it selects and stores the samples closest to the feature mean of each class. During training, a distillation loss between the targets obtained from the previous and current model predictions is added to the overall loss to preserve previously learned knowledge. RP-CRE (Cui et al., 2021) introduces a novel pluggable attention-based memory module to automatically calculate old tasks' weights when learning new tasks.
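For concreteness, the EWC penalty mentioned in category 1) is commonly written in the following standard form; this is the generic formulation from the EWC paper, not a CRE-specific derivation:

\[
\mathcal{L}(\theta) = \mathcal{L}_{new}(\theta) + \sum_{i} \frac{\lambda}{2} F_i \left(\theta_i - \theta^{*}_{old,i}\right)^2,
\]

where $\mathcal{L}_{new}$ is the loss on the current task, $\theta^{*}_{old}$ are the parameters learned on previous tasks, $F_i$ is the $i$-th diagonal entry of the Fisher Information Matrix, and $\lambda$ controls how strongly old knowledge is preserved.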
Since classification-based approaches require a fixed relation schema in the classification layer, they have a non-negligible drawback in class-incremental learning. Many researchers leverage metric learning to solve this problem. (Wang et al., 2019; Wu et al., 2021) utilize sentence alignment models based on the Margin Ranking Loss (Nayyeri et al., 2019), but lack the intrinsic ability to perform hard positive/negative mining, resulting in poor performance.
[Figure 2 diagram omitted: it depicts a shared BERT-based encoding layer with a dropout layer over an input instance (e.g., "[E11] Charlie Chaplin [E12] was born in [E21] London [E22]"), a classification network (Linear, GELU, Softmax) and a contrastive network (a projector with Linear, Norm and GELU layers, K-means over typical instances, relation prototypes drawn from the memory, and instance-prototype similarity), with the overall loss formed as a weighted combination of the two networks' losses.]
Figure 2: The overall structure of our proposed CRECL. The framework is built with a shared encoding layer, a
classification network and a contrastive network.
Recently, contrastive learning has been widely adopted in self-supervised learning frameworks in many fields, including computer vision and natural language processing. Contrastive learning is a discriminative scheme that aims to pull similar samples closer together and push dissimilar samples farther apart. (Wang and Liu, 2021) proves that contrastive learning can promote the alignment and stability of data distributions, and (Khosla et al., 2020) verifies that using modern batch contrastive approaches, such as the InfoNCE loss (Oord et al., 2018), outperforms traditional contrastive losses, such as the margin ranking loss, and also achieves good results in supervised contrastive learning tasks.
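For reference, the InfoNCE objective mentioned above is commonly written as follows (the standard formulation of Oord et al., 2018, not necessarily the exact loss used in CRECL):

\[
\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_i^{+}) / \tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(z_i, z_j) / \tau\right)},
\]

where $z_i^{+}$ is the positive sample for the anchor $z_i$, the denominator sums over the positive and $N-1$ negatives, $\mathrm{sim}(\cdot, \cdot)$ is typically the cosine similarity, and $\tau$ is a temperature hyperparameter.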
3 Methodology
3.1 Task Formalization
The CRE task aims to identify the relation between two entities expressed by one sentence in the task sequence. Formally, given a sequence of $K$ tasks $\{T_1, T_2, \ldots, T_K\}$, suppose $D_k$ and $R_k$ denote the instance set and relation class set of the $k$-th task $T_k$, respectively. $D_k$ contains $N_k$ instances $\{(x_1, t_1, y_1), \ldots, (x_{N_k}, t_{N_k}, y_{N_k})\}$, where instance $(x_i, t_i, y_i), 1 \le i \le N_k$, represents that the relation of entity pair $t_i$ in sentence $x_i$ is $y_i \in R_k$. One CRE model should perform well on all historical tasks up to $T_k$, denoted as $\tilde{T}_k = \bigcup_{i=1}^{k} T_i$, of which the relation class set is $\tilde{R}_k = \bigcup_{i=1}^{k} R_i$. We also adopt an episodic memory module $M_r = \{(x_1, t_1, r), \ldots, (x_L, t_L, r)\}$ to store typical instances of relation $r$, similar to (Han et al., 2020; Cui et al., 2021), where $L$ is the memory size (number of typical instances). The overall episodic memory for the observed relations in all tasks is $\tilde{M}_k = \bigcup_{r \in \tilde{R}_k} M_r$.
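As a minimal illustration of the notation above, the episodic memory can be organized as a mapping from relation labels to their $L$ stored typical instances. The following Python sketch is only illustrative; the class and method names are ours, not the authors' code:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# An instance is (sentence x_i, entity pair t_i, relation label y_i).
Instance = Tuple[str, Tuple[str, str], str]

@dataclass
class EpisodicMemory:
    """Hypothetical sketch of the memory module: M_r holds up to L typical instances of relation r."""
    memory_size: int  # L, the number of typical instances kept per relation
    store: Dict[str, List[Instance]] = field(default_factory=dict)

    def add_relation(self, relation: str, typical_instances: List[Instance]) -> None:
        # Keep at most L typical instances for relation r.
        self.store[relation] = typical_instances[: self.memory_size]

    def observed_relations(self) -> List[str]:
        # \tilde{R}_k: all relation classes observed so far.
        return list(self.store.keys())

    def all_instances(self) -> List[Instance]:
        # \tilde{M}_k: the union of M_r over all observed relations.
        return [inst for insts in self.store.values() for inst in insts]
```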
3.2 Framework Overview
The overall structure of our CRECL is depicted
in Figure 2, which has two major components,
i.e., a classification network and a contrastive net-
work. The procedure of learning the current task in
CRECL is described in Alg. 1.
Suppose the current task is $T_k$. First, the representation of each instance in $T_k$ is obtained through the encoder and dropout layer shared by the two networks. In the classification network, each instance's relation is predicted based on its representation (lines 1-3). Then, we apply the K-means algorithm over the instance representations to select $L$ typical instances for each relation in $T_k$, which are used to generate the relation prototypes and stored into the memory $\tilde{M}_k$ for the subsequent contrast (lines 4-13). There are two training processes in the contrastive network. The first is to compare the current task's instances with the stored relation prototypes of $\tilde{T}_k$ (lines 14-17). The second is to compare each typical instance with all relation prototypes, which are both stored in $\tilde{M}_k$ (lines 18-24). These two training procedures ensure that each compared instance keeps its distance from sufficient negative relations in $\tilde{R}_k$. Therefore, the data distributions of $\tilde{R}_k$ are distinguishable enough to alleviate CRECL's catastrophic forgetting of old tasks. Next, we detail the operations in CRECL.
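To make the selection and prototype steps concrete, below is a minimal Python sketch (assuming scikit-learn and already-encoded instance embeddings; the function name and the mean-based prototype are our assumptions, not necessarily the paper's exact procedure) of how $L$ typical instances of one relation could be chosen with K-means and turned into a relation prototype:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_typical_and_prototype(embeddings: np.ndarray, L: int):
    """Pick L typical instances of one relation via K-means and build its prototype.

    embeddings: (N, d) array of encoder representations for instances of this relation.
    Returns (indices of the selected typical instances, prototype vector).
    Illustrative sketch only; duplicate selections across clusters are not handled.
    """
    n_clusters = min(L, len(embeddings))
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)

    typical_indices = []
    for center in kmeans.cluster_centers_:
        # Keep the instance closest to each cluster centre as a "typical" instance.
        idx = int(np.argmin(np.linalg.norm(embeddings - center, axis=1)))
        typical_indices.append(idx)

    # A simple prototype: the mean of the selected typical instances' embeddings.
    prototype = embeddings[typical_indices].mean(axis=0)
    return typical_indices, prototype
```

The stored typical instances and prototypes can then be compared against new instances in the contrastive network, as described in the two training processes above.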