Influence Functions for Sequence Tagging Models
Sarthak Jain
Northeastern University
jain.sar@northeastern.edu
Varun Manjunatha
Adobe Research
vmanjuna@adobe.com
Byron C. Wallace
Northeastern University
b.wallace@northeastern.edu
Ani Nenkova
Adobe Research
nenkova@adobe.com
Abstract
Many language tasks (e.g., Named Entity
Recognition, Part-of-Speech tagging, and Se-
mantic Role Labeling) are naturally framed as
sequence tagging problems. However, there
has been comparatively little work on inter-
pretability methods for sequence tagging mod-
els. In this paper, we extend influence func-
tions — which aim to trace predictions back to
the training points that informed them — to se-
quence tagging tasks. We define the influence
of a training instance segment as the effect that
perturbing the labels within this segment has
on a test segment level prediction. We provide
an efficient approximation to compute this,
and show that it tracks with the true segment
influence, measured empirically. We show the
practical utility of segment influence by using
the method to identify systematic annotation
errors in two named entity recognition corpora.
Code to reproduce our results is available
at https://github.com/successar/
Segment_Influence_Functions.
1 Introduction
Instance attribution methods aim to identify train-
ing examples that most informed a particular (test)
prediction. The influence of training point $k$ on test point $i$ is typically formalized as the change in loss that would be observed for point $i$ if example $k$ were removed from the training set (Koh and Liang, 2017). Heuristic alternatives have also been developed to measure the importance of training samples during prediction, such as retrieving training examples similar to a test item (Pezeshkpour et al., 2021; Ilyas et al., 2022; Guo et al., 2021).
Influence functions can facilitate dataset debugging by helping to surface training samples which exhibit artifacts (Han et al., 2020). But on language tasks, most work on identifying training samples influential to a particular prediction has focused on classification tasks (Koh and Liang, 2017; Han et al., 2020; Pezeshkpour et al., 2021). It is not
[Figure 1 graphic: a test sentence "Mancester United won the game" (entity predicted as LOC), annotated with the question "What training segment influenced this prediction?", connected via Influence$(z_k^{[a,b]}, z_i^{[c,d]})$ to a training sentence "Real Madrid scored three goals" in which "Real Madrid" (an ORG) is labeled as LOC.]
Figure 1: We propose and evaluate influence functions for sequence tagging tasks, which retrieve snippets (from token $a$ to $b$) in train samples that most influenced predictions for test tokens $c$ through $d$. Here this reveals a training example in which an ORG is problematically marked as a LOC, leading to the observed error.
immediately clear how we can extend such methods to structured prediction problems such as named entity recognition (NER).
In this work we address this gap, presenting new
methods for characterizing token-level influence
for structured predictions (specifically sequence
tagging tasks), and evaluating their use across il-
lustrative datasets. More specifically, we focus
on NER, one of the most common sequence tag-
ging tasks. We extend influence functions to detect
important training examples, i.e., those that most
influenced the prediction of a specific entity, as
opposed to being most influential with respect to
the entire predicted label sequence. We call this
extension segment influence.
Segment influence can help one perform fine-
grained analysis of why specific segments of text
were incorrectly labeled (as opposed to the entire
sequence). Consider, for example, Figure 1. This
shows a common issue in the CoNLL NER dataset:
city names contained in soccer club titles tend to
be mislabeled as location, rather than organization.
arXiv:2210.14177v1 [cs.CL] 25 Oct 2022
This in turn leads to similar mispredictions in the
test set. For the shown test example we can use
segment influence to ask which entities within the
training examples most informed the prediction
made for the entity ‘Manchester United’? In principle, segment influence can directly recover the
entities responsible for this systematic mislabeling.
Our main contributions are as follows. (1) We present a new method to approximately compute token-level influence for outputs produced by sequence tagging models; (2) We evaluate whether this approximation corresponds to exact influence values in linear models, and whether the method recovers intuitively correct training examples in synthetically constructed cases; and (3) We establish the practical utility of approximating structured influence by using the method to identify systematic annotation errors in NER corpora.
2 Influence for Sequence Tagging
Consider a standard sequence tagging task in which the aim is to estimate the parameters $\theta$ of a function $f_\theta$ which assigns to each token $x_{it} \in V$ in an input sequence $x_i$ (of length $T_i$) a label $y_{it}$ from a label set $Y$. Denote the training dataset by $D$, where $D = \{(x_i = \{x_{it}\}_{t=1}^{T_i},\ y_i = \{y_{it}\}_{t=1}^{T_i})\}$.
Define $f_\theta$ as a model that yields conditional probability estimates for sequence label assignments: $p_\theta(y_i|x_i)$. Given parameter estimates $\hat{\theta}$, we can make a prediction for a test instance $x_i$ by selecting the most likely $y$ under this model: $\hat{y}_i = \mathrm{argmax}_y\, p_{\hat{\theta}}(y|x_i)$. In structured prediction tasks we assume that the label $y_{it}$ depends in part on the labels $y_i \setminus y_{it}$, given the input $x_i$. In linear chain sequence tagging, this dependence can be formalized as a graphical model in which adjacent labels are connected; the most common realization of such a model is perhaps the Conditional Random Field (CRF; Lafferty et al. 2001).
We typically estimate $\theta$ by minimizing the negative log-likelihood of the training dataset $D$:

$$\mathrm{argmin}_\theta\ -\frac{1}{|D|} \sum_{(x_i, y_i) \in D} \log p_\theta(y_i|x_i) \quad (1)$$

For brevity, we will also write the loss (negative log-likelihood) of an example $z_i = (x_i, y_i)$ as $\mathcal{L}(z_i, \theta) = -\log p_\theta(y_i|x_i)$ and the overall loss over the training set as $\mathcal{L}(D, \theta) = \frac{1}{|D|} \sum_{z_i \in D} \mathcal{L}(z_i, \theta)$.
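To make the objective concrete, here is a minimal numerical sketch of the losses in Equation 1, using a toy tagger that labels each token independently given its features. This is a deliberately simplified stand-in for the CRF models discussed below, and all function names are illustrative:

```python
import numpy as np

def token_log_probs(theta, x):
    # Per-token label log-probabilities under a toy linear tagger:
    # theta is an (F, Y) weight matrix, x a (T, F) feature matrix.
    scores = x @ theta
    scores = scores - scores.max(axis=1, keepdims=True)
    return scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))

def example_loss(theta, x, y):
    # L(z_i, theta) = -log p_theta(y_i | x_i) (tokens treated independently).
    lp = token_log_probs(theta, x)
    return -lp[np.arange(len(y)), np.asarray(y)].sum()

def dataset_loss(theta, D):
    # L(D, theta): average negative log-likelihood over the training set.
    return sum(example_loss(theta, x, y) for x, y in D) / len(D)
```

With zero weights every label is equally likely, so each token contributes $\log |Y|$ to the loss.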
2.1 Background: Influence Functions in ML
Influence Functions (Koh and Liang, 2017) retrieve training samples $z_k$ deemed "influential" to the prediction made for a specific test sample $x_i$: $\hat{y}_i = f_{\hat{\theta}}(x_i)$. The exact influence of a training example $z_k$ on a test example $z_i = (x_i, y_i)$ is defined as the change in the loss on $z_i$ that would be incurred under parameter estimates if the training sample $z_k$ were removed prior to training, i.e., $\mathcal{L}(z_i, \hat{\theta}_{-z_k}) - \mathcal{L}(z_i, \hat{\theta})$.
In practice this is prohibitively expensive to compute. Koh and Liang (2017) proposed an approximation. The idea is to measure the change in the loss on $z_i$ observed when the loss associated with train sample $z_k$ is slightly upweighted by some $\epsilon$. Explicitly computing the effect of such an $\epsilon$-perturbation is not feasible. Koh and Liang (2017) provide an efficient mechanism to approximate this (reproduced in Appendix A.1): $\mathcal{I}(z_i, z_k) = -\nabla_\theta \mathcal{L}(z_i, \hat{\theta})^\top [\nabla^2_\theta \mathcal{L}(D, \hat{\theta})]^{-1} \nabla_\theta \mathcal{L}(z_k, \hat{\theta})$, where $\nabla^2_\theta \mathcal{L}(D, \hat{\theta})$ is the Hessian of the loss $\mathcal{L}(D, \hat{\theta})$ over the dataset with respect to $\theta$.
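The shape of this influence score can be sketched with the same toy independent-token tagger, replacing the inverse Hessian with the identity matrix, a common simplification (the paper adopts it in Section 3). The model and function names are illustrative, not the authors' implementation:

```python
import numpy as np

def grad_example_loss(theta, x, y):
    # Gradient of -log p_theta(y|x) for a toy independent-token tagger:
    # d/dtheta = x^T (softmax(x @ theta) - onehot(y)).
    scores = x @ theta
    scores = scores - scores.max(axis=1, keepdims=True)
    p = np.exp(scores)
    p = p / p.sum(axis=1, keepdims=True)
    p[np.arange(len(y)), np.asarray(y)] -= 1.0
    return x.T @ p

def influence(theta, z_test, z_train):
    # I(z_i, z_k) ~ -grad L(z_i)^T H^{-1} grad L(z_k), with the Hessian H
    # replaced by the identity (a Hessian-free simplification).
    g_test = grad_example_loss(theta, *z_test).ravel()
    g_train = grad_example_loss(theta, *z_train).ravel()
    return -float(g_test @ g_train)
```

Under this sign convention an example is maximally harmful to itself being removed: its self-influence is always non-positive, while a training example whose labels contradict the test labels scores positive.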
Sequence tagging tasks by definition involve multiple predictions (and labels) per instance, and it is therefore natural to consider finer-grained influence. In particular, we would like to quantify the effect of segments of labels for $z_k$ on a specific segment of the predicted output for $z_i$. For example, if we mispredict a particular entity within $x_i$, we may want to identify the train sample segment(s) most responsible for this error, especially if the model makes systematic errors that might be rectified by cleaning $D$.
2.2 Segment Level Influence
We provide machinery to compute segment level influence. We want to quantify the impact of training tokens $x_k^{[a,b]}$ (with labels $y_k^{[a,b]}$), $1 \le a, b \le T_k$, on the loss of a segment of test point $z_i$. In NER, these segments may correspond to entities.
2.2.1 Exact Segment Influence
We define the exact influence of a segment $[a, b]$ within training example $z_k$ on a segment $[c, d]$ of a test example $z_i = (x_i, y_i)$ as the change in loss that would be observed for reference token labels in segment $[c, d]$ of $z_i$, had we excluded the labels for segment $[a, b]$ within $z_k$ from the training data. To make the above definition precise, we first need to define how training is to be performed when only partial annotations are available for a given train example (i.e., where a segment has been "removed"). We also need to formally define the change in loss for a segment of a test example.
Start with training under partial annotations. Consider a training example $z_k = (x_k = \{x_{k1}, \ldots, x_{kT_k}\}, y_k = \{y_{k1}, \ldots, y_{kT_k}\})$. Assume we did not have labels for segment $[a, b]$ in $y_k$, i.e., labels $\{y_{ka}, \ldots, y_{kb}\}$ were missing. Denote such a partial label sequence by $y_k^{\setminus[a,b]} = y_k \setminus \{y_{ka}, \ldots, y_{kb}\}$, and let $y_k^{[a,b]} = \{y_{ka}, \ldots, y_{kb}\}$. A natural way to handle such cases is to marginalize over all possible label assignments to the segment $[a, b]$ when computing the likelihood of this training example (Tsuboi et al., 2008):

$$p_\theta(y_k^{\setminus[a,b]}|x_k) = \sum_{y'_a \in Y} \cdots \sum_{y'_b \in Y} p_\theta(y_k^{\setminus[a,b]} \cup \{y'_a, \ldots, y'_b\}|x_k) \quad (2)$$
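The marginalization in Equation 2 can be sketched by brute force for a small label set. The toy model below labels tokens independently, so this only illustrates the bookkeeping; a real CRF would use a constrained forward pass instead (per Tsuboi et al. 2008):

```python
import itertools
import numpy as np

def sequence_log_prob(theta, x, y):
    # log p_theta(y|x) for a toy independent-token tagger (CRF stand-in).
    scores = x @ theta
    scores = scores - scores.max(axis=1, keepdims=True)
    lp = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return lp[np.arange(len(y)), np.asarray(y)].sum()

def marginal_log_prob(theta, x, y, a, b):
    # Eq 2: log p_theta(y^{\[a,b]} | x), summing over every possible label
    # assignment to positions a..b (brute-force enumeration).
    n_labels = theta.shape[1]
    total = -np.inf
    for fill in itertools.product(range(n_labels), repeat=b - a + 1):
        y_full = np.array(y)
        y_full[a:b + 1] = fill
        total = np.logaddexp(total, sequence_log_prob(theta, x, y_full))
    return total
```

Marginalizing out the entire sequence must sum the probability of every labeling, giving log 1 = 0; this is a useful sanity check on any such implementation.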
Denote the marginal loss of this partially annotated sequence as $\mathcal{ML}(z_k^{[a,b]}, \theta) = -\log p_\theta(y_k^{\setminus[a,b]}|x_k)$.
We can also write this marginal loss as the difference between the joint loss of $y_k$ and the conditional loss of the segment $y_k^{[a,b]}$. This second form is more intuitive when we move to approximate exact influence values via $\epsilon$-upweighting.

$$\log p_\theta(y_k^{\setminus[a,b]}|x_k) = \log p_\theta(y_k|x_k) - \log p_\theta(y_k^{[a,b]}|y_k^{\setminus[a,b]}, x_k) \quad (3)$$

$$\mathcal{ML}(z_k^{[a,b]}, \theta) = \mathcal{L}(z_k, \theta) - \mathcal{CL}(z_k^{[a,b]}, \theta) \quad (4)$$

where we have also defined the conditional loss of the segment as $\mathcal{CL}(z_k^{[a,b]}, \theta) = -\log p_\theta(y_k^{[a,b]}|y_k^{\setminus[a,b]}, x_k)$.
Next we define the change in loss for a segment of a test example $z_i = (x_i, y_i)$. We define the loss for the segment $[c, d]$ of the output $y_i$ as the conditional loss of the segment $[c, d]$: $\mathcal{CL}(z_i^{[c,d]}, \theta)$.
Given the above definitions, we can concretize the notion of the exact influence as follows:

1. Retrain the model without the segment $[a, b]$ of training example $z_k$:

$$\hat{\theta}[z_k^{[a,b]}] = \mathrm{argmin}_\theta\ \frac{1}{|D|} \sum_{z_l \in D} \mathcal{L}(z_l, \theta) - \frac{1}{|D|}\big(\mathcal{L}(z_k, \theta) - \mathcal{ML}(z_k^{[a,b]}, \theta)\big) \quad (5)$$

Comparing Equations 4 and 5, we see that removing the effect of segment $[a, b]$ of $z_k$ amounts to subtracting the conditional loss of the segment $\mathcal{CL}(z_k^{[a,b]}, \theta)$ from the original loss $\mathcal{L}(D, \theta)$.
2. Compute the difference between the conditional loss of segment $[c, d]$ of test example $z_i$ under the new parameter estimates $\hat{\theta}[z_k^{[a,b]}]$ and the original estimates $\hat{\theta}$ trained using the objective in Equation 1:

$$\text{Exact-Influence}(z_k^{[a,b]}, z_i^{[c,d]}) = \mathcal{CL}(z_i^{[c,d]}, \hat{\theta}[z_k^{[a,b]}]) - \mathcal{CL}(z_i^{[c,d]}, \hat{\theta}) \quad (6)$$
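This two-step recipe can be sketched end to end with a toy independent-token tagger, for which "marginalizing out" a segment (Equation 5) reduces to simply dropping those tokens' loss terms. The code illustrates the definition only; it is not the authors' training setup, and all names are illustrative:

```python
import numpy as np

def fit(D, n_feats, n_labels, steps=300, lr=0.5, drop=None):
    # Gradient-descent fit of a toy independent-token tagger. If
    # drop=(k, a, b), tokens a..b of example k contribute no gradient,
    # which is what the marginalized objective of Eq 5 reduces to here.
    theta = np.zeros((n_feats, n_labels))
    for _ in range(steps):
        grad = np.zeros_like(theta)
        for j, (x, y) in enumerate(D):
            scores = x @ theta
            scores = scores - scores.max(axis=1, keepdims=True)
            p = np.exp(scores)
            p = p / p.sum(axis=1, keepdims=True)
            p[np.arange(len(y)), np.asarray(y)] -= 1.0
            if drop is not None and j == drop[0]:
                p[drop[1]:drop[2] + 1] = 0.0  # exclude the segment's labels
            grad += x.T @ p
        theta -= lr * grad / len(D)
    return theta

def conditional_loss(theta, x, y, c, d):
    scores = x @ theta
    scores = scores - scores.max(axis=1, keepdims=True)
    lp = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    idx = np.arange(c, d + 1)
    return -lp[idx, np.asarray(y)[idx]].sum()

def exact_segment_influence(D, n_feats, n_labels, train_seg, z_test, test_seg):
    # Eq 6: retrain without the training segment, then compare the test
    # segment's conditional loss under both parameter estimates.
    theta_full = fit(D, n_feats, n_labels)
    theta_drop = fit(D, n_feats, n_labels, drop=train_seg)
    x, y = z_test
    c, d = test_seg
    return (conditional_loss(theta_drop, x, y, c, d)
            - conditional_loss(theta_full, x, y, c, d))
```

Removing the only training token that supports a test label should make the test segment's loss go up, i.e., yield positive exact influence.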
2.2.2 Approximating Segment Influence
Above we have derived a means to calculate exact segment level influence values. But in practice retraining the model (step 1) is not feasible. Here we instead present an $\epsilon$-upweighting method, analogous to the approximation to instance influence (Koh and Liang, 2017), for computing segment influence. The idea is to compute the change in model parameters if we incur a slight additional penalty $\epsilon\, \mathcal{CL}(z_k^{[a,b]}, \theta)$ for the segment $[a, b]$:

$$\hat{\theta}_\epsilon[z_k^{[a,b]}] = \mathrm{argmin}_\theta\ \frac{1}{|D|} \sum_{z_l \in D} \mathcal{L}(z_l, \theta) + \epsilon\, \mathcal{CL}(z_k^{[a,b]}, \theta) \quad (7)$$
A first order approximation to the difference in the model parameters near $\epsilon = 0$ is given by:

$$\frac{d\hat{\theta}_\epsilon[z_k^{[a,b]}]}{d\epsilon}\Big|_{\epsilon=0} = -[\nabla^2_\theta \mathcal{L}(D, \hat{\theta})]^{-1} \nabla_\theta \mathcal{CL}(z_k^{[a,b]}, \hat{\theta}) \quad (8)$$

We can apply the chain rule to measure the change in the conditional loss over segment $[c, d]$ of test example $z_i$ due to this upweighting:

$$\mathcal{I}(z_i^{[c,d]}, z_k^{[a,b]}) = \frac{d\, \mathcal{CL}(z_i^{[c,d]}, \hat{\theta}_\epsilon[z_k^{[a,b]}])}{d\epsilon}\Big|_{\epsilon=0} = -\nabla_\theta \mathcal{CL}(z_i^{[c,d]}, \hat{\theta})^\top [\nabla^2_\theta \mathcal{L}(D, \hat{\theta})]^{-1} \nabla_\theta \mathcal{CL}(z_k^{[a,b]}, \hat{\theta}) \quad (9)$$
This definition provides us with an approximation to the exact influence defined in the previous section for a segment of a training example on a segment of a test example. A derivation showing all the steps can be found in Appendix A.2. We have assumed that we can take the gradient of the conditional loss over a segment, which is possible for a CRF tagger (derivation in Appendix B), but may be non-trivial for other models.
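A hedged sketch of Equation 9 for the toy independent-token tagger, with the Hessian replaced by the identity (a simplification discussed in Section 3). For a CRF the conditional-loss gradient would instead come from a constrained forward-backward pass (Appendix B); names here are illustrative:

```python
import numpy as np

def grad_conditional_loss(theta, x, y, a, b):
    # Gradient of CL(z^{[a,b]}, theta) for a toy independent-token tagger:
    # only positions a..b contribute to the conditional loss.
    scores = x @ theta
    scores = scores - scores.max(axis=1, keepdims=True)
    p = np.exp(scores)
    p = p / p.sum(axis=1, keepdims=True)
    p[np.arange(len(y)), np.asarray(y)] -= 1.0
    mask = np.zeros((len(y), 1))
    mask[a:b + 1] = 1.0
    return x.T @ (p * mask)

def segment_influence(theta, z_test, test_seg, z_train, train_seg):
    # Eq 9 with the Hessian replaced by the identity: a negated dot
    # product of the two segments' conditional-loss gradients.
    gi = grad_conditional_loss(theta, *z_test, *test_seg).ravel()
    gk = grad_conditional_loss(theta, *z_train, *train_seg).ravel()
    return -float(gi @ gk)
```

As with instance influence, a segment's self-influence is non-positive, and a training segment whose label contradicts the test segment scores positive.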
3 Computational Challenges
The computation costs of even approximate instance level influence can be prohibitive in practice, especially with the large pretrained language models that now dominate NLP. Computing and storing inverse Hessians of the loss has $O(p^3)$ and $O(p^2)$ complexity, where $p$ is the number of parameters in the model (commonly 100M-100B for deep models). Even ignoring the Hessian, one still needs the gradient with respect to each training example in $D$; one could attempt to pre-compute these, but storing the results requires $O(|D|\, p)$ memory.

The alternative is therefore to recompute these for each new test point $z_i$. For segment level influence these costs are compounded because we need influence with respect to every segment within a training example, multiplying complexities by $T^2$, where $T$ is the average length of a training example. Consequently, it is practically infeasible to calculate segment influence per Equation 9.
Prior work by Pezeshkpour et al. (2021) showed
that for instance-level influence, considering a re-
stricted set of parameters (e.g., those in the classifi-
cation layer) when taking the gradient is reasonable
for influence in that this does not much affect the
induced rankings over train points with respect to
influence. Similarly, ignoring the Hessian term
does not significantly affect rankings by influence.
These two simplifications dramatically improve ef-
ficiency, and we adopt them here.
Consider a sequence tagging model built on top of a deep encoder $F$, e.g., BERT (Devlin et al., 2018). In the context of a linear chain CRF on top of $F$, the standard score function for this model is: $s(y_i, x_i) = \sum_{t=1}^{T_i} y_{it}^\top W F(x_i)_t + y_{i(t-1)}^\top T y_{it}$, where $T$ is a matrix of class transition scores and $y_{it}$ is the one-hot representation of label $y_{it}$. A CRF layer consumes these scores and computes the probability of a label sequence as $p(y_i|x_i) = \frac{e^{s(y_i, x_i)}}{\sum_{y' \in Y^{T_i}} e^{s(y', x_i)}}$. In this work we consider the gradient only with respect to the $W$ and $T$ parameters above and not any parameters associated with $F$.

Further, we consider influence only with respect to individual token outputs in training samples, rather than every possible segment; i.e., we only consider single-token segments. This further reduces the $T^2$ terms in our complexity to $T$.
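The score function and CRF probability above can be sketched directly, with the partition function computed by brute-force enumeration over a tiny sequence. This is illustrative only; real implementations compute the partition function with the forward algorithm, and the emission matrix here stands in for $W F(x)$:

```python
import itertools
import numpy as np

def crf_score(emissions, transitions, y):
    # s(y, x) = sum_t emissions[t, y_t] + sum_{t>=2} transitions[y_{t-1}, y_t].
    y = np.asarray(y)
    score = emissions[np.arange(len(y)), y].sum()
    score += transitions[y[:-1], y[1:]].sum()
    return score

def crf_log_prob(emissions, transitions, y):
    # log p(y|x), with the partition function log Z computed by brute
    # force over all |Y|^T label sequences.
    T, Y = emissions.shape
    all_scores = [crf_score(emissions, transitions, seq)
                  for seq in itertools.product(range(Y), repeat=T)]
    log_Z = np.logaddexp.reduce(all_scores)
    return crf_score(emissions, transitions, y) - log_Z
```

Summing the probabilities of all label sequences must give exactly one, which provides a quick correctness check for the normalization.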
4 Experimental Aims and Setup
We evaluate segment influence in terms of: (1) Approximation validity, or how well the approximation proposed in Equation 9 corresponds to the exact influence value (Equation 6); and (2) Utility, i.e., whether segment influence might help identify problematic training examples for sequence tagging tasks. In this paper we consider only NER tasks, using the following datasets.
CoNLL (Tjong Kim Sang and De Meulder, 2003). An NER dataset containing news stories from Reuters, labeled with four entity types: PER, LOC, ORG and MISC, and demarcated using the beginning-inside-out (BIO) tagging scheme (Ramshaw and Marcus, 1999). The dataset is divided into train, validation, and test splits comprising 879, 194, and 197 documents, respectively.
EBM-NLP (Nye et al., 2018). A corpus of annotated medical article abstracts describing clinical randomized controlled trials. 'Entities' here are spans of tokens that describe the patient Population enrolled, the Interventions compared, and the Outcomes measured ('PICO' elements). This dataset includes a test set of 191 abstracts labeled by medical professionals, and a train/val set of 3836/958 abstracts labeled via Amazon Mechanical Turk.
For NER models we use a representative modern transformer, BigBird Base (Zaheer et al., 2020), as our encoder $F$, using final layer hidden states as contextualized token representations. Dependencies between output labels are modeled using a CRF. We provide training details in Appendix C.
In addition to segment and instance level influence, we also evaluate, where applicable, segment nearest neighbor as an attribution method, which works as follows. We retrieve from the training dataset the segments whose feature embeddings are most similar to the embedding of the test segment (we again consider only single-token segments here, so we do not need to worry about embedding multi-token segments). We consider both dot product and cosine similarity and report results for the version that gives the best performance in each experiment.
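A minimal sketch of this single-token nearest-neighbor retrieval under either similarity metric; the function name and array shapes are illustrative, not taken from the paper's code:

```python
import numpy as np

def nearest_token_segments(test_vec, train_vecs, k=5, metric="cosine"):
    # Rank single-token training segments by the similarity of their
    # feature embeddings to a test token embedding.
    # train_vecs: (N, d) matrix of training token embeddings; test_vec: (d,).
    if metric == "cosine":
        q = test_vec / np.linalg.norm(test_vec)
        M = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
        sims = M @ q
    else:  # plain dot product
        sims = train_vecs @ test_vec
    order = np.argsort(-sims)[:k]
    return order, sims[order]
```

Note that the two metrics can disagree: dot product favors long embedding vectors, while cosine similarity depends only on direction.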
5 Validating Segment Influence
In this first set of experiments we aim to (i) verify that the proposed approximation correlates with the exact influence (2.2.2), and (ii) ascertain that, for synthetic tasks which we construct, segment influence returns the a priori expected training segment