
3 Computational Challenges
The computational costs of even approximate instance-level influence can be prohibitive in practice, especially with the large pretrained language models that now dominate NLP. Computing and storing inverse Hessians of the loss has O(p^3) and O(p^2) complexity, respectively, where p is the number of parameters in the model (commonly ~100M-100B for deep models). Even ignoring the Hessian, one still needs the gradient with respect to each training example in D; one could attempt to pre-compute these, but storing the results requires O(|D|p) memory.
The alternative is therefore to recompute these for each new test point z_i. For segment-level influence these costs are compounded because we need influence with respect to every segment within a training example, multiplying complexities by T^2, where T is the average length of a training example. Consequently, it is practically infeasible to calculate segment influence per Equation 9.
Prior work by Pezeshkpour et al. (2021) showed that for instance-level influence, restricting the set of parameters considered when taking the gradient (e.g., to those in the classification layer) is reasonable, in that it does not substantially affect the induced rankings over training points with respect to influence. Similarly, ignoring the Hessian term does not significantly affect rankings by influence. These two simplifications dramatically improve efficiency, and we adopt them here.
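Under these two simplifications, instance-level influence reduces to a dot product between test- and train-example gradients over the retained parameters. The following is a minimal sketch (not the paper's implementation); the gradient vectors and the function name `approx_influence` are hypothetical stand-ins for per-example gradients restricted to the classification layer:

```python
import numpy as np

def approx_influence(grad_test, grad_train):
    """Hessian-free influence approximation: the dot product of test- and
    train-example gradients over the retained (classification-layer)
    parameters. A larger score means the training example's gradient
    aligns more closely with the test example's."""
    return float(np.dot(grad_test, grad_train))

# Illustrative gradients; p is tiny here, vs. the ~100M-100B of real models.
rng = np.random.default_rng(0)
p = 8
g_test = rng.normal(size=p)
g_train = [rng.normal(size=p) for _ in range(5)]

# Rank training examples by approximate influence on the test point.
scores = [approx_influence(g_test, g) for g in g_train]
ranking = np.argsort(scores)[::-1]
```

Since the rankings, not the raw influence values, are what these simplifications are shown to preserve, ordering training points by this score is the relevant output.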
Consider a sequence tagging model built on top of a deep encoder F, e.g., BERT (Devlin et al., 2018). In the context of a linear chain CRF on top of F, the standard score function for this model is $s(y_i, x_i) = \sum_{t=1}^{T_i} \mathbf{y}_{it}^\top W F(x_i)_t + \mathbf{y}_{i(t-1)}^\top T \mathbf{y}_{it}$, where T is a matrix of class transition scores and $\mathbf{y}_{it}$ is the one-hot representation of label $y_{it}$. A CRF layer consumes these scores and computes the probability of a label sequence as $p(y_i \mid x_i) = e^{s(y_i, x_i)} / \sum_{y' \in \mathcal{Y}^{T_i}} e^{s(y', x_i)}$. In this work we consider the gradient only with respect to the W and T parameters above, and not any parameters associated with F.
Further, we consider influence only with respect to individual token outputs in training samples, rather than every possible segment, i.e., we only consider single-token segments. This further reduces the T^2 terms in our complexity to T.
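The score function above can be computed directly from one-hot label vectors. A small numpy sketch, with hypothetical shapes and the illustrative function name `crf_score` (in practice the emission and transition parameters W and T would be learned):

```python
import numpy as np

def crf_score(Y, feats, W, T):
    """Score a label sequence under a linear-chain CRF.

    Y:     (L, C) one-hot label vectors y_it
    feats: (L, d) encoder outputs F(x_i)_t
    W:     (C, d) emission weights
    T:     (C, C) transition scores T[prev, cur]
    """
    # Emission term: sum_t y_it^T W F(x_i)_t
    emission = np.einsum('tc,cd,td->', Y, W, feats)
    # Transition term: sum_t y_i(t-1)^T T y_it
    transition = sum(Y[t - 1] @ T @ Y[t] for t in range(1, len(Y)))
    return emission + transition
```

Because only W and T enter this score directly, restricting gradients to these parameters (as done here) means the encoder F is treated as a fixed feature extractor for influence computation.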
4 Experimental Aims and Setup
We evaluate segment influence in terms of: (1) Approximation validity, or how well the approximation proposed in Equation 9 corresponds to the exact influence value (Equation 6); and (2) Utility, i.e., whether segment influence might help identify problematic training examples for sequence tagging tasks. In this paper we consider only NER tasks, using the following datasets.
CoNLL (Tjong Kim Sang and De Meulder, 2003). An NER dataset containing news stories from Reuters, labeled with four entity types (PER, LOC, ORG, and MISC) and demarcated using the beginning-inside-outside (BIO) tagging scheme (Ramshaw and Marcus, 1999). The dataset is divided into train, validation, and test splits comprising 879, 194, and 197 documents, respectively.
EBM-NLP (Nye et al., 2018). A corpus of annotated medical article abstracts describing clinical randomized controlled trials. 'Entities' here are spans of tokens that describe the patient Population enrolled, the Interventions compared, and the Outcomes measured ('PICO' elements). This dataset includes a test set of 191 abstracts labeled by medical professionals, and train/validation sets of 3836/958 abstracts labeled via Amazon Mechanical Turk.
For NER models we use a representative modern transformer, BigBird Base (Zaheer et al., 2020), as our encoder F, using final layer hidden states as contextualized token representations. Dependencies between output labels are modeled using a CRF. We provide training details in Appendix C.
In addition to segment- and instance-level influence, we also evaluate, where applicable, segment nearest neighbor as an attribution method, which works as follows. We retrieve from the training dataset the segments whose feature embeddings are most similar to that of the test segment (we again consider only single-token segments here, so we do not need to worry about embedding multi-token segments). We consider both dot product and cosine similarity, and report results for the version that yields the best performance in each experiment.
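This retrieval step can be sketched as follows, assuming precomputed single-token embeddings; the function name `nearest_segments` and all shapes are illustrative rather than taken from the paper:

```python
import numpy as np

def nearest_segments(query, train_embs, k=3, metric='cosine'):
    """Return indices of the k training-token embeddings most similar
    to the query embedding, under dot-product or cosine similarity."""
    sims = train_embs @ query                      # dot-product similarity
    if metric == 'cosine':
        norms = np.linalg.norm(train_embs, axis=1) * np.linalg.norm(query)
        sims = sims / np.clip(norms, 1e-12, None)  # guard zero vectors
    return np.argsort(sims)[::-1][:k]              # most similar first
```

Note that the two metrics can induce different rankings (dot product favors large-norm embeddings), which is why both are evaluated and the better-performing one reported per experiment.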
5 Validating Segment Influence
In this first set of experiments we aim to (i) verify that the proposed approximation correlates with the exact influence (Section 2.2.2), and (ii) ascertain that, for synthetic tasks which we construct, segment influence returns the a priori expected training segment