techniques, seeking to improve the faithfulness of
neural TEMPREL extractors as described in §1.
Bias Mitigation in NLP.
Methods for mitigating
prediction biases can be categorized into retraining
and inference methods (Sun et al., 2019). Retraining
methods address the bias at early stages or at its source.
For instance, Zhang et al. (2017) mask entities
with special tokens to prevent relation extraction
models from learning shortcuts from entity
names, whereas several works conduct data augmentation
(Park et al., 2018; Alzantot et al., 2018;
Jin et al., 2020; Wu et al., 2022) or sample reweighting
(Lin et al., 2017; Liu et al., 2021a)
to reduce biases during training. However, masking
leads to the loss of semantic information
and to performance degradation, and it is costly to
manipulate data or to find properly unbiased data for
temporal reasoning. Directly debiasing the training
process may also hinder model generalization
on out-of-distribution (OOD) data (Wang et al., 2022).
Therefore, inspired by several recent studies
on debiasing text classification and entity-centric
information extraction (Qian et al., 2021; Nan et al., 2021),
our work adopts counterfactual inference to
measure and control prediction biases based on
automatically generated counterfactual examples.
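As a rough illustration of this idea (a minimal sketch under assumed interfaces, not the exact formulation introduced in §4.3), counterfactual inference compares the model's prediction on the original input with its prediction on an automatically constructed counterfactual copy, and subtracts the latter as an estimate of the bias:

```python
import torch
import torch.nn.functional as F

def counterfactual_debias(model, batch, counterfactual_batch, lam=1.0):
    # Hedged sketch: `model`, `batch`, and `counterfactual_batch` are
    # hypothetical names. `counterfactual_batch` is an automatically
    # generated copy of the input in which the evidence for the relation
    # (e.g., the event context) has been removed or masked, so any
    # remaining prediction reflects the model's bias.
    with torch.no_grad():
        factual_logits = model(**batch)               # prediction from the full input
        bias_logits = model(**counterfactual_batch)   # prediction from the counterfactual input
    # Subtract the (scaled) bias estimate from the factual prediction.
    debiased_logits = factual_logits - lam * bias_logits
    return F.softmax(debiased_logits, dim=-1)
```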
Selective Prediction.
Neural models have become
increasingly accurate with the advances of deep
learning. In the meantime, however, they should
also indicate when their predictions are likely to
be inaccurate in real-world scenarios. A series
of recent studies have focused on resolving model
miscalibration by measuring how closely model
confidences match empirical likelihoods. Among
them, computationally expensive Bayesian (Gal
and Ghahramani, 2016; Küppers et al., 2021) and
non-Bayesian ensemble (Lakshminarayanan et al., 2017;
Beluch et al., 2018) methods have been
adopted to yield high-quality predictive uncertainty
estimates. Other methods use the uncertainty
reflected in model parameters to assess confidence,
including sharpness (Kuleshov et al., 2018) and softmax response
(Hendrycks and Gimpel, 2017; Xin et al., 2021).
Another class of methods adjusts the model's output
probability distribution by altering the training loss
via label smoothing (Szegedy et al., 2016) or a
Dirichlet prior (Malinin and Gales, 2018, 2019).
In addition, temperature scaling (Guo et al., 2017)
serves as a simple yet effective post-hoc calibration
technique. In this paper, we model
TEMPREL's with a Dirichlet prior during learning, and
during inference we employ temperature scaling to
recalibrate the model's confidence measure after
bias mitigation.
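For reference, temperature scaling fits a single scalar $T > 0$ on held-out data and rescales test-time logits by $1/T$. The following is a minimal sketch (assuming PyTorch tensors of validation logits and labels), not the paper's full recalibration procedure, which is combined with bias mitigation in §4.4:

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    # Learn a single scalar T > 0 on held-out validation data by
    # minimizing the negative log-likelihood of logits / T
    # (post-hoc calibration in the spirit of Guo et al., 2017).
    log_t = torch.zeros(1, requires_grad=True)   # T = exp(log_t) stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

# At test time, calibrated probabilities are softmax(test_logits / T, dim=-1).
```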
3 Preliminaries
A document $D$ is represented as a sequence of tokens $D = [w_1, \cdots, e_1, \cdots, e_2, \cdots, w_m]$, where some tokens belong to the set of annotated event triggers, i.e., $\mathcal{E}_D = \{e_1, e_2, \cdots, e_n\}$, and the rest are other lexemes. For a pair of events $(e_i, e_j)$, the task of TEMPREL extraction is to predict a relation $r$ from $\mathcal{R} \cup \{\text{VAGUE}\}$, where $\mathcal{R}$ denotes the set of TEMPREL's. An event pair is labeled VAGUE if the text does not express any determinable relation that belongs to $\mathcal{R}$. Let $y^{(i,j)}$ denote the model-predicted categorical distribution over $\mathcal{R}$.
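As a concrete illustration of this input layout (an assumed data structure, not the paper's code; the relation inventory shown is only illustrative, since the actual contents of $\mathcal{R}$ depend on the benchmark), an instance for the local classifier could be represented as:

```python
from dataclasses import dataclass
from typing import List

# Illustrative relation inventory; the real label set depends on the benchmark.
RELATIONS = ["BEFORE", "AFTER", "EQUAL", "VAGUE"]

@dataclass
class TemprelInstance:
    tokens: List[str]   # the document D as a token sequence
    event_i: int        # position of trigger e_i in `tokens`
    event_j: int        # position of trigger e_j in `tokens`
    label: str          # gold relation r from R ∪ {VAGUE}
```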
In order to provide a confidence estimate $y$ that is as close as possible to the true probability, we first describe three separate factors (Malinin and Gales, 2018) that contribute to the predictive uncertainty of an AI system, namely epistemic uncertainty, aleatoric uncertainty, and distributional uncertainty. Epistemic uncertainty refers to the degree of uncertainty in estimating model parameters from the training data, whereas aleatoric uncertainty results from the data's innate complexity. Distributional uncertainty arises when the model cannot make accurate predictions due to a lack of familiarity with the test data.
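To make the distinction concrete, a Dirichlet parameterization over class probabilities allows these sources to be separated numerically. The following is a minimal sketch of the standard decomposition from Malinin and Gales (2018), not the paper's own implementation (see §4.2):

```python
import torch

def dirichlet_uncertainties(alpha):
    # alpha: concentration parameters of a Dirichlet over the class simplex,
    # shape (num_classes,), all entries > 0.
    alpha0 = alpha.sum()
    p = alpha / alpha0                                  # expected categorical distribution
    total = -(p * p.log()).sum()                        # H[E[p]]: total uncertainty
    # E[H[p]]: expected data (aleatoric) uncertainty under the Dirichlet.
    aleatoric = -(p * (torch.digamma(alpha + 1.0) - torch.digamma(alpha0 + 1.0))).sum()
    distributional = total - aleatoric                  # mutual-information term
    return total, aleatoric, distributional
```

A sharp Dirichlet (large concentrations) yields a small mutual-information term, whereas a flat, low-concentration Dirichlet yields a large one, which is the behavior exploited to flag inputs the model is unfamiliar with.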
We argue that the way existing TEMPREL extractors handle VAGUE relations is problematic, since they typically merge VAGUE into $\mathcal{R}$. In fact, VAGUE relations are complicated exception cases in the IE task, yet the annotation of such exceptions is never close to exhaustive in benchmarks, or is not given at all (Naik et al., 2019). In this work, we consider VAGUE relations as a source of distributional uncertainty and model them separately. Details are introduced in §4.2.
4 Methods
In this section, we first present how we obtain event representations and the categorical distribution $y$ in a local classifier for TEMPREL (§4.1). Then we introduce the proposed learning and inference techniques to improve model faithfulness from the perspectives of selective prediction (§4.2) and prediction bias mitigation (§4.3), before combining these two techniques with temperature scaling and introducing the OOD detection method in §4.4.