techniques, seeking to improve the faithfulness of
neural TEMPREL extractors as described in §1.
Bias Mitigation in NLP.
Methods for mitigating
prediction biases can be categorized into retraining
and inference methods (Sun et al., 2019). Retraining
methods address the bias at early stages or at its source.
For instance, Zhang et al. (2017) mask entities
with special tokens to prevent relation extraction
models from learning shortcuts from entity
names, whereas several works conduct data augmentation
(Park et al., 2018; Alzantot et al., 2018;
Jin et al., 2020; Wu et al., 2022) or sample reweighting
(Lin et al., 2017; Liu et al., 2021a)
to reduce biases during training. However, masking
leads to the loss of semantic information
and to performance degradation, and it is costly to
manipulate data or to find properly unbiased data for
temporal reasoning. Directly debiasing the training
process may also hinder model generalization
on out-of-distribution (OOD) data (Wang et al., 2022).
Therefore, inspired by several recent studies
on debiasing text classification and entity-centric
information extraction (Qian et al., 2021; Nan et al., 2021),
our work adopts counterfactual inference to
measure and control prediction biases based on
automatically generated counterfactual examples.
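As a rough illustration of this idea (a minimal sketch under assumed interfaces, not the exact formulation introduced in §4.3), counterfactual inference compares the model's prediction on the original input with its prediction on an automatically constructed counterfactual copy, and subtracts the latter as an estimate of the bias:

```python
import torch
import torch.nn.functional as F

def counterfactual_debias(model, batch, counterfactual_batch, lam=1.0):
    # Hedged sketch: `model`, `batch`, and `counterfactual_batch` are
    # hypothetical names. `counterfactual_batch` is an automatically
    # generated copy of the input in which the evidence for the relation
    # (e.g., the event context) has been removed or masked, so any
    # remaining prediction reflects the model's bias.
    with torch.no_grad():
        factual_logits = model(**batch)               # prediction from the full input
        bias_logits = model(**counterfactual_batch)   # prediction from the counterfactual input
    # Subtract the (scaled) bias estimate from the factual prediction.
    debiased_logits = factual_logits - lam * bias_logits
    return F.softmax(debiased_logits, dim=-1)
```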
Selective Prediction.
Neural models have become
increasingly accurate with the advances of deep
learning. In the meantime, however, they should
also indicate when their predictions are likely to
be inaccurate in real-world scenarios. A series
of recent studies have focused on resolving model
miscalibration by measuring how closely model
confidences match empirical likelihoods. Among
them, computationally expensive Bayesian (Gal
and Ghahramani, 2016; Küppers et al., 2021) and
non-Bayesian ensemble (Lakshminarayanan et al., 2017;
Beluch et al., 2018) methods have been
adopted to yield high-quality predictive uncertainty
estimates. Other methods use the uncertainty
reflected in model parameters to assess confidence,
including sharpness (Kuleshov et al., 2018) and softmax response
(Hendrycks and Gimpel, 2017; Xin et al., 2021).
Another class of methods adjusts the model's output
probability distribution by altering the training loss
via label smoothing (Szegedy et al., 2016) or a
Dirichlet prior (Malinin and Gales, 2018, 2019).
In addition, temperature scaling (Guo et al., 2017)
serves as a simple yet effective post-hoc calibration
technique. In this paper, we model
TEMPREL's with a Dirichlet prior during learning, and
during inference we employ temperature scaling to
recalibrate the model's confidence measure after
bias mitigation.
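For reference, temperature scaling fits a single scalar $T > 0$ on held-out data and rescales test-time logits by $1/T$. The following is a minimal sketch (assuming PyTorch tensors of validation logits and labels), not the paper's full recalibration procedure, which is combined with bias mitigation in §4.4:

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    # Learn a single scalar T > 0 on held-out validation data by
    # minimizing the negative log-likelihood of logits / T
    # (post-hoc calibration in the spirit of Guo et al., 2017).
    log_t = torch.zeros(1, requires_grad=True)   # T = exp(log_t) stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

# At test time, calibrated probabilities are softmax(test_logits / T, dim=-1).
```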
3 Preliminaries
A document $D$ is represented as a sequence of tokens $D = [w_1, \cdots, e_1, \cdots, e_2, \cdots, w_m]$, where some tokens belong to the set of annotated event triggers, i.e., $\mathcal{E}_D = \{e_1, e_2, \cdots, e_n\}$, and the rest are other lexemes. For a pair of events $(e_i, e_j)$, the task of TEMPREL extraction is to predict a relation $r$ from $\mathcal{R} \cup \{\text{VAGUE}\}$, where $\mathcal{R}$ denotes the set of TEMPREL's. An event pair is labeled VAGUE if the text does not express any determinable relation that belongs to $\mathcal{R}$. Let $y^{(i,j)}$ denote the model-predicted categorical distribution over $\mathcal{R}$.
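As a concrete illustration of this input layout (an assumed data structure, not the paper's code; the relation inventory shown is only illustrative, since the actual contents of $\mathcal{R}$ depend on the benchmark), an instance for the local classifier could be represented as:

```python
from dataclasses import dataclass
from typing import List

# Illustrative relation inventory; the real label set depends on the benchmark.
RELATIONS = ["BEFORE", "AFTER", "EQUAL", "VAGUE"]

@dataclass
class TemprelInstance:
    tokens: List[str]   # the document D as a token sequence
    event_i: int        # position of trigger e_i in `tokens`
    event_j: int        # position of trigger e_j in `tokens`
    label: str          # gold relation r from R ∪ {VAGUE}
```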
In order to provide a confidence estimate $y$ that is as close as possible to the true probability, we first describe three separate factors (Malinin and Gales, 2018) that contribute to the predictive uncertainty of an AI system, namely epistemic uncertainty, aleatoric uncertainty, and distributional uncertainty. Epistemic uncertainty refers to the degree of uncertainty in estimating model parameters from the training data, whereas aleatoric uncertainty results from the data's innate complexity. Distributional uncertainty arises when the model cannot make accurate predictions due to a lack of familiarity with the test data.
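To make the distinction concrete, a Dirichlet parameterization over class probabilities allows these sources to be separated numerically. The following is a minimal sketch of the standard decomposition from Malinin and Gales (2018), not the paper's own implementation (see §4.2):

```python
import torch

def dirichlet_uncertainties(alpha):
    # alpha: concentration parameters of a Dirichlet over the class simplex,
    # shape (num_classes,), all entries > 0.
    alpha0 = alpha.sum()
    p = alpha / alpha0                                  # expected categorical distribution
    total = -(p * p.log()).sum()                        # H[E[p]]: total uncertainty
    # E[H[p]]: expected data (aleatoric) uncertainty under the Dirichlet.
    aleatoric = -(p * (torch.digamma(alpha + 1.0) - torch.digamma(alpha0 + 1.0))).sum()
    distributional = total - aleatoric                  # mutual-information term
    return total, aleatoric, distributional
```

A sharp Dirichlet (large concentrations) yields a small mutual-information term, whereas a flat, low-concentration Dirichlet yields a large one, which is the behavior exploited to flag inputs the model is unfamiliar with.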
We argue that the way existing TEMPREL extractors handle VAGUE relations is problematic, since they typically merge VAGUE into $\mathcal{R}$. In fact, VAGUE relations are complicated exception cases in the IE task, yet the annotation of such exceptions is never close to exhaustive in benchmarks, or is not given at all (Naik et al., 2019). In this work, we consider VAGUE relations as a source of distributional uncertainty and model them separately. Details are introduced in §4.2.
4 Methods
In this section, we first present how we obtain event representations and the categorical distribution $y$ in a local classifier for TEMPREL (§4.1). Then we introduce the proposed learning and inference techniques to improve model faithfulness from the perspectives of selective prediction (§4.2) and prediction bias mitigation (§4.3), before combining these two techniques with temperature scaling and introducing the OOD detection method in §4.4.