Extracting or Guessing? Improving Faithfulness of Event Temporal Relation Extraction

Haoyu Wang1, Hongming Zhang2, Yuqian Deng1, Jacob R. Gardner1, Dan Roth1 & Muhao Chen3
1Department of Computer and Information Science, UPenn
2Tencent AI Lab, Seattle
3Department of Computer Science, USC
{why16gzl, yuqiand, jacobrg, danroth}@seas.upenn.edu;
hongmzhang@tencent.com; muhaoche@usc.edu
Abstract
In this paper, we seek to improve the faithfulness of TEMPREL extraction models from two perspectives. The first perspective is to extract genuinely based on contextual description. To achieve this, we propose to conduct counterfactual analysis to attenuate the effects of two significant types of training biases: the event trigger bias and the frequent label bias. We also add tense information into event representations to explicitly place an emphasis on the contextual description. The second perspective is to provide proper uncertainty estimation and abstain from extraction when no relation is described in the text. By parameterization of Dirichlet Prior over the model-predicted categorical distribution, we improve the model estimates of the correctness likelihood and make TEMPREL predictions more selective. We also employ temperature scaling to recalibrate the model confidence measure after bias mitigation. Through experimental analysis on MATRES, MATRES-DS, and TDDiscourse, we demonstrate that our model extracts TEMPRELs and timelines more faithfully compared to SOTA methods, especially under distribution shifts.
1 Introduction
Event temporal relation (TEMPREL) extraction is an essential step towards understanding narrative text, such as stories, novels, news, and guideline articles. With a robust temporal relation extractor, one can easily construct a storyline from text and capture the trend of temporally connected event mentions. TEMPREL extraction is also broadly beneficial to various downstream tasks including clinical narrative processing (Jindal and Roth, 2013; Bethard et al., 2016), question answering (Llorens et al., 2015; Meng et al., 2017; Stricker, 2021), and schema induction (Chambers and Jurafsky, 2009; Wen et al., 2021; Li et al., 2021).
A) I went to e1:SEE the doctor. However, I was more seriously e2:SICK. ⇒ e1 AFTER e2
B) Microsoft said it has e3:IDENTIFIED three companies for the China program to run through June. The company also e4:GIVES each participating startup in the Seattle program $20,000 to create software. ⇒ e3 BEFORE e4
Figure 1: Examples of unfaithful extractions. BEFORE and AFTER that follow the arrows denote the TEMPRELs extracted from the sentences by Zhou et al. (2021).

Most existing TEMPREL extraction models are developed with data-driven machine learning approaches, for which recent studies also incorporate
advanced learning and inference techniques such as structured prediction (Ning et al., 2017, 2018b; Han et al., 2019; Wang et al., 2020; Tan et al., 2021), graph representation (Mathur et al., 2021; Zhang et al., 2022), data augmentation (Ballesteros et al., 2020; Trong et al., 2022), and indirect supervision (Zhao et al., 2021; Zhou et al., 2021). These models are prevalently built upon pretrained language models (PLMs) and fine-tuned on a small set of annotated documents, e.g., TimeBank-Dense (Cassidy et al., 2014), MATRES (Ning et al., 2018c), and TDDiscourse (Naik et al., 2019).
Though these recent approaches have achieved promising evaluation results on benchmarks, whether they provide faithful extraction is an unexplored problem. The faithfulness of a relation extraction system is not simply about how much accuracy a system can offer; a faithful extractor should also be judged by the validity and reliability of its extraction process. Specifically, when there is a TEMPREL to extract, a faithful extractor should genuinely obtain what is described in the context rather than give trivial guesses from the surface names of events or the most frequent labels. Besides, when there is no relation described in the context, the system should selectively abstain from prediction.
We observe that in recent models, biases from prior knowledge in PLMs and statistically skewed training data often lead to unfaithful extractions (see Fig. 1). Example A exhibits a case where the model adheres to the prior knowledge that people usually see the doctor after getting sick, but in this context getting sick is obviously a consequence of seeing the doctor. In Example B, BEFORE is extracted due to statistical biases learned from training data: BEFORE is not only the most frequent TEMPREL between identify and give, but is also the most frequent TEMPREL between the first and second event in narrative order (Gee and Grosjean, 1984). However, with a closer inspection, it can be noticed that the two events in Example B are involved in different programs, one in the China program and the other in the Seattle program. Therefore, the system should abstain from prediction and give VAGUE as output.
In this paper, we seek to improve the faithfulness of TEMPREL extraction models from two perspectives. The first perspective is to guide the model to genuinely extract the described TEMPREL based on a relation-mentioning context. To achieve this goal, we conduct counterfactual analysis (Niu et al., 2021) to capture and attenuate the effects of two typical types of training biases: event bias, caused by treating event trigger names as shortcuts for TEMPREL prediction, and label bias, which causes the model prediction to lean towards more frequent training labels. We also propose to affix tense information to event mentions to explicitly place an emphasis on the contextual description.
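To make the idea concrete, the following is a minimal sketch of how counterfactual analysis of the event trigger bias could be wired into a classifier. The masking-based counterfactual input, the function name debiased_logits, the assumed model interface, and the weight lambda_cf are illustrative assumptions, not the exact formulation developed later in §4.3.

def debiased_logits(model, tokens, event_spans, lambda_cf=1.0):
    # Illustrative sketch only; the paper's exact counterfactual formulation is in §4.3.
    # Assumption: the bias carried by event trigger names alone can be estimated by
    # masking out every non-trigger token, then subtracted from the factual logits.
    trigger_idx = {i for span in event_spans for i in span}

    # Factual prediction: the full contextual description is visible to the model.
    factual = model(tokens, event_spans)                # logits, shape: (num_labels,)

    # Counterfactual input: keep only the trigger names, mask the rest of the context.
    masked = [t if i in trigger_idx else "[MASK]" for i, t in enumerate(tokens)]
    counterfactual = model(masked, event_spans)         # event-trigger bias estimate

    # Attenuate the shortcut: remove the (weighted) effect of the trigger-only prediction.
    return factual - lambda_cf * counterfactual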
The second perspective is to teach the model to abstain from extraction when no relation is described in the text. To know when to abstain, the model needs to have a good estimate of the correctness likelihood. By incorporating Dirichlet Prior (Malinin and Gales, 2018, 2019) in the training phase of current TEMPREL extraction models, we improve the predictive uncertainty estimation of the models and make the TEMPREL predictions more selective. Furthermore, since the counterfactual analysis component (from the first perspective) may shift the model-predicted categorical distribution, we also employ temperature scaling (Guo et al., 2017) in inference to allow for a recalibrated confidence measure of the model.
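As a rough illustration of what a Dirichlet Prior parameterization can look like in training, the sketch below treats the exponentiated logits as concentration parameters and maximizes the expected log-likelihood of the gold label under the induced Dirichlet. This assumed form follows Malinin and Gales (2018) in spirit but is not necessarily the exact objective used in §4.2.

import torch

def dirichlet_expected_nll(logits, target, eps=1e-8):
    # Sketch (assumed form): alpha = exp(logits) parameterizes a Dirichlet over the
    # label simplex; we minimize the negative expected log-probability of the gold
    # label, E_{p ~ Dir(alpha)}[log p_target] = digamma(alpha_target) - digamma(alpha_0).
    alpha = logits.exp() + eps                       # concentration parameters, shape (B, K)
    alpha0 = alpha.sum(dim=-1)                       # Dirichlet precision, shape (B,)
    alpha_target = alpha.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    expected_log_p = torch.digamma(alpha_target) - torch.digamma(alpha0)
    return -expected_log_p.mean()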
The technical contributions of our work are twofold. First, to the best of our knowledge, this is the first study on the faithfulness issue of event-centric information extraction. Evidently, the development of a faithful TEMPREL extraction system contributes to more robust and reliable machine comprehension of events and narratives. Second, we propose training and inference techniques that can be easily plugged into existing neural TEMPREL extractors and effectively improve model faithfulness by mitigating prediction shortcuts and enhancing the capability of selective prediction.

Our contributions are verified with TEMPREL extraction experiments conducted on MATRES (Ning et al., 2018c), TDDiscourse (Naik et al., 2019), and a distribution-shifted version of MATRES (MATRES-DS). In particular, we evaluate how precise and selective our TEMPREL extraction method is on in-distribution data, and how well it generalizes under distribution shift. Experimental results demonstrate that the techniques explored within the two aforementioned perspectives bring about promising results in improving the faithfulness of current models. In addition, we also apply our method to the task of timeline construction (Do et al., 2012), showing that faithful TEMPREL extraction greatly benefits the accurate construction of timelines.
2 Related Work
Event TEMPREL Extraction.
Recent event TEMPREL extraction approaches are mainly built on PLMs to obtain representations of event mentions and are improved with various learning and inference methodologies. To improve the quality of event representations, Mathur et al. (2021) embrace rhetorical discourse features and temporal arguments; Trong et al. (2022) select optimal context sentences via reinforcement learning to achieve SOTA performance; while Liu et al. (2021b), Mathur et al. (2021), and Zhang et al. (2022) employ graph neural networks to avoid complex feature engineering. From the learning perspective, Ning et al. (2018a), Ballesteros et al. (2020), and Wang et al. (2020) enrich the models with auxiliary training tasks to provide complementary supervision signals, while Ning et al. (2018b), Zhao et al. (2021), and Zhou et al. (2021) bring into play distant supervision from heuristic cues and patterns.

Nevertheless, recent data-driven models risk amplifying biases present in the pretraining and task training data when making predictions (Zhao et al., 2017). To rectify the models' biases towards prior knowledge in PLMs and shortcuts learned from biased training examples, our work proposes several training and inference techniques, seeking to improve the faithfulness of neural TEMPREL extractors as described in §1.
Bias Mitigation in NLP.
Methods for mitigating prediction biases can be categorized into retraining and inference methods (Sun et al., 2019). Retraining methods address the bias in early stages or at its source. For instance, Zhang et al. (2017) mask the entities with special tokens to prevent relation extraction models from learning shortcuts from entity names, whereas several works conduct data augmentation (Park et al., 2018; Alzantot et al., 2018; Jin et al., 2020; Wu et al., 2022) or sample reweighting (Lin et al., 2017; Liu et al., 2021a) to reduce biases in training. However, masking would result in the loss of semantic information and performance degradation, and it is costly to manipulate data or find proper unbiased data for temporal reasoning. Directly debiasing the training process may also hinder model generalization on out-of-distribution (OOD) data (Wang et al., 2022). Therefore, inspired by several recent studies on debiasing text classification or entity-centric information extraction (Qian et al., 2021; Nan et al., 2021), our work adopts counterfactual inference to measure and control prediction biases based on automatically generated counterfactual examples.
Selective Prediction.
Neural models have become increasingly accurate with the advances of deep learning. In the meantime, however, they should also indicate when their predictions are likely to be inaccurate in real-world scenarios. A series of recent studies have focused on resolving model miscalibration by measuring how closely the model confidences match empirical likelihoods. Among them, computationally expensive Bayesian (Gal and Ghahramani, 2016; Küppers et al., 2021) and non-Bayesian ensemble (Lakshminarayanan et al., 2017; Beluch et al., 2018) methods have been adopted to yield high-quality predictive uncertainty estimates. Other methods have been proposed to assess confidence using the uncertainty reflected from model parameters, including sharpness (Kuleshov et al., 2018) and softmax response (Hendrycks and Gimpel, 2017; Xin et al., 2021). Another class of methods adjusts the models' output probability distribution by altering the loss function in training via label smoothing (Szegedy et al., 2016) or a Dirichlet Prior (Malinin and Gales, 2018, 2019). Besides, temperature scaling (Guo et al., 2017) serves as a simple yet effective post-hoc calibration technique. In this paper, we model TEMPRELs with a Dirichlet Prior in learning, and during inference we employ temperature scaling to recalibrate the confidence measure of the model after bias mitigation.
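For concreteness, a minimal post-hoc temperature-scaling sketch is shown below: a single scalar T is fit on held-out logits so that softmax(logits / T) is better calibrated. The optimizer settings and function name are assumptions for illustration; in our setting T would be fit after bias mitigation.

import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels):
    # Temperature scaling (Guo et al., 2017): learn one scalar T > 0 on a held-out set
    # by minimizing the negative log-likelihood of softmax(logits / T).
    log_t = torch.zeros(1, requires_grad=True)
    optimizer = torch.optim.LBFGS([log_t], lr=0.05, max_iter=100)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()   # apply at inference as softmax(logits / T)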
3 Preliminaries
A document $D$ is represented as a sequence of tokens $D = [w_1, \cdots, e_1, \cdots, e_2, \cdots, w_m]$, where some tokens belong to the set of annotated event triggers, i.e., $\mathcal{E}_D = \{e_1, e_2, \cdots, e_n\}$, and the rest are other lexemes. For a pair of events $(e_i, e_j)$, the task of TEMPREL extraction is to predict a relation $r$ from $\mathcal{R} \cup \{\text{VAGUE}\}$, where $\mathcal{R}$ denotes the set of TEMPRELs. An event pair is labeled VAGUE if the text does not express any determinable relation that belongs to $\mathcal{R}$. Let $y^{(i,j)}$ denote the model-predicted categorical distribution over $\mathcal{R}$.
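To ground this notation, a minimal sketch of how one instance of the task could be represented is given below; the dataclass, its field names, and the MATRES-style relation inventory are assumptions for illustration only.

from dataclasses import dataclass
from typing import List, Tuple

# Assumed MATRES-style relation inventory R; VAGUE is handled separately from R.
RELATIONS = ["BEFORE", "AFTER", "EQUAL"]

@dataclass
class TemprelInstance:
    tokens: List[str]             # the document D = [w_1, ..., e_1, ..., e_2, ..., w_m]
    trigger_positions: List[int]  # positions of the annotated event triggers E_D in tokens
    pair: Tuple[int, int]         # the event pair (e_i, e_j), given as trigger positions
    label: str                    # a relation r in R, or "VAGUE" if none is expressed

# For each pair, the model outputs y^(i,j): a categorical distribution over R;
# whether to abstain and output VAGUE is decided separately (see §4.2).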
In order to provide a confidence estimate $y$ that is as close as possible to the true probability, we first describe three separate factors (Malinin and Gales, 2018) that contribute to the predictive uncertainty of an AI system, namely epistemic uncertainty, aleatoric uncertainty, and distributional uncertainty. Epistemic uncertainty refers to the degree of uncertainty in estimating model parameters based on training data, whereas aleatoric uncertainty results from the data's innate complexity. Distributional uncertainty arises when the model cannot make accurate predictions due to a lack of familiarity with the test data.
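Under a Dirichlet parameterization of the output distribution (as adopted in this paper), these notions can be separated with standard quantities from Malinin and Gales (2018). The sketch below, which assumes alpha = exp(logits), computes the total, expected data (aleatoric), and distributional (mutual-information) uncertainty per event pair; a high distributional-uncertainty score could then serve as a signal for abstaining.

import torch

def dirichlet_uncertainties(logits, eps=1e-8):
    # Sketch (assumed parameterization alpha = exp(logits)), following Malinin and
    # Gales (2018): split predictive uncertainty into total, expected data (aleatoric),
    # and distributional (mutual-information) components.
    alpha = logits.exp() + eps
    alpha0 = alpha.sum(dim=-1, keepdim=True)
    p = alpha / alpha0                                     # expected categorical distribution

    total = -(p * p.clamp_min(eps).log()).sum(dim=-1)      # entropy of the expected distribution
    expected_data = -(p * (torch.digamma(alpha + 1.0)
                           - torch.digamma(alpha0 + 1.0))).sum(dim=-1)
    distributional = total - expected_data                 # mutual information between label and p
    return total, expected_data, distributional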
We argue that the way existing TEMPREL extractors handle VAGUE relations is problematic, since they typically merge VAGUE into $\mathcal{R}$. In fact, VAGUE relations are complicated exception cases in the IE task, yet the annotation of such exceptions is never close to exhaustive in benchmarks, or is not given at all (Naik et al., 2019). In this work, we consider VAGUE relations as a source of distributional uncertainty and model them separately. Details are introduced in §4.2.
4 Methods
In this section, we first present how we obtain event representations and the categorical distribution $y$ in a local classifier for TEMPREL (§4.1). Then we introduce the proposed learning and inference techniques to improve model faithfulness from the perspectives of selective prediction (§4.2) and prediction bias mitigation (§4.3), before we combine these two techniques with temperature scaling and introduce the OOD detection method in §4.4.