Understanding Influence Functions and Datamodels
via Harmonic Analysis
Nikunj Saunshi Arushi Gupta Mark Braverman Sanjeev Arora
{nsaunshi, arushig, mbraverm, arora}@cs.princeton.edu
Department of Computer Science, Princeton University
October 4, 2022
Abstract
Influence functions estimate the effect of individual training data points on the model's predictions on test data; they were adapted to deep learning by Koh and Liang [2017]. They have been used for detecting data poisoning, detecting helpful and harmful examples, estimating the influence of groups of datapoints, etc. Recently, Ilyas et al. [2022] introduced a linear regression method they termed datamodels to predict the effect of training points on outputs on test data. The current paper seeks to provide a better theoretical understanding of such interesting empirical phenomena. The primary tool is harmonic analysis and the idea of noise stability. Contributions include: (a) an exact characterization of the learnt datamodel in terms of Fourier coefficients; (b) an efficient method to estimate the residual error and quality of the optimum linear datamodel without having to train the datamodel; (c) new insights into when influences of groups of datapoints may or may not add up linearly.
1 Introduction
It is often of great interest to quantify how the presence or absence of a particular training data point affects the trained model's performance on test data points. Influence functions are a classical idea for this [Jaeckel, 1972, Hampel, 1974, Cook, 1977] that has recently been adapted to modern deep models and large datasets by Koh and Liang [2017]. Influence functions have been applied to explain predictions and produce confidence intervals [Schulam and Saria, 2019], investigate model bias [Brunet et al., 2019, Wang et al., 2019], estimate Shapley values [Jia et al., 2019, Ghorbani and Zou, 2019], improve human trust [Zhou et al., 2019], and craft data poisoning attacks [Koh et al., 2019].
Influence actually has different formalizations. The classic calculus-based estimate (henceforth referred to as continuous influence) involves conceptualizing the training loss as a weighted sum over training datapoints, where the weight of a particular datapoint $z$ can be varied infinitesimally. Using gradients and the Hessian, one obtains an expression for the rate of change in the test error (or other functions) of a test point $z'$ with respect to (infinitesimal) changes to the weight of $z$. Though the estimate is derived only for infinitesimal changes to the weight of $z$ in the training set, in practice it has also been employed as a reasonable estimate for the discrete notion of influence, which is the effect of completely adding/removing the data point from the training dataset [Koh and Liang, 2017]. Informally speaking, this discrete influence is defined as $f(S \cup \{i\}) - f(S)$, where $f$ is some function of the test points, $S$ is a training dataset, and $i$ is the index of a training point. (This can be noisy, so several papers use the expected influence of $i$ by taking the expectation over a random choice of $S$ of a certain size; see Section 2.) Koh and Liang [2017] as well as subsequent papers have used continuous influence to estimate the effect of decidedly non-infinitesimal changes to the dataset, such as changing the training set by adding or deleting entire groups of datapoints [Koh et al., 2019]. Recently, Bae et al. [2022] show mathematical
reasons why this is not well-founded, and give a clearer explanation (and alternative implementation)
of Koh-Liang style estimators.
Yet another idea related to influence functions is the linear datamodels of Ilyas et al. [2022]. By training many models on subsets containing a $p$ fraction of the datapoints in the training set, the authors show that some interesting measures of test error (defined using logit values) behave as follows: the measure $f(x)$ is well-approximable as a (sparse) linear expression $\theta_0 + \sum_i \theta_i x_i$, where $x$ is a binary vector denoting a sample of a $p$ fraction of the training datapoints, with $x_i = 1$ indicating presence of the $i$-th training point and $x_i = -1$ denoting absence. The coefficients $\theta_i$ are estimated via lasso regression. The surprise here is that $f(x)$, which is the result of deep learning on dataset $x$, is well-approximated by $\theta_0 + \sum_i \theta_i x_i$. The authors note that the $\theta_i$'s can be viewed as heuristic estimates for the discrete influence of the $i$-th datapoint.
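As a concrete illustration, here is a minimal sketch of the datamodel-fitting pipeline. The function `f` below is a synthetic stand-in for the real procedure, in which each evaluation of $f(\bar{x})$ is a full deep-learning training run followed by a test-margin measurement; the sparsity pattern, sample counts, and regularization strength are our illustrative assumptions, not the settings of Ilyas et al. [2022].

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, m, p = 200, 2000, 0.5   # training-set size, number of sampled subsets, inclusion prob.

# Synthetic stand-in for "train a model on subset x̄ and measure a test margin":
# a sparse linear signal plus a small pairwise interaction and noise.
true_theta = np.zeros(N)
true_theta[:10] = rng.normal(size=10)
def f(xbar):
    return 1.0 + xbar @ true_theta + 0.05 * xbar[0] * xbar[1] + 0.01 * rng.normal()

masks = (rng.random((m, N)) < p).astype(float)   # rows are binary vectors x̄ in {0,1}^N
ys = np.array([f(xb) for xb in masks])

datamodel = Lasso(alpha=0.005).fit(masks, ys)    # sparse fit of θ0 + Σ_i θ_i x̄_i
print(np.round(datamodel.coef_[:12], 2))         # ≈ true_theta on the first ten coordinates
```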
The current paper seeks to provide a better theoretical understanding of the above-mentioned phenomena concerning discrete influence functions. At first sight this quest appears difficult. The calculus definition of influence functions (which, as mentioned, is also used in practice to estimate the discrete notions of influence) involves Hessians and gradients evaluated on the trained net, and thus one imagines that any explanation for properties of influence functions must await a better mathematical understanding of datasets, net architectures, and training algorithms.

Surprisingly, we show that the explanation for many observed properties turns out to be fairly generic. Our chief technical tool is harmonic analysis, and especially the theory of noise stability of functions (see O'Donnell [2014] for an excellent survey).
1.1 Our Conceptual framework (Discrete Influence)
Training data points are numbered 1 through $N$, but the model is trained on a random subset of the data points, where each data point is included in the subset independently with probability $p$. (This is precisely the setting in linear datamodels.) For notational ease and consistency with harmonic analysis, we denote this subset by $x \in \{-1,+1\}^N$, where $+1$ means the corresponding data point was included. We are interested in some quantity $f(x)$ associated with the trained model on one or more test data points. Note that $f$ is a probabilistic function of $x$ due to stochasticity in deep net training (SGD, dropout, data augmentation, etc.), but one can average over the stochastic choices and think of $f$ as a deterministic function $f: \{\pm 1\}^N \to \mathbb{R}$. (In practice, this means we estimate $f(x)$ by repeating the training on $x$, say, 10 to 50 times.)
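A minimal sketch of this setup, with a hypothetical `noisy_train_and_eval` standing in for one stochastic deep-learning run:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 0.5

def sample_subset():
    """Draw x ~ B_p over {-1,+1}^N: +1 means the training point is included."""
    return np.where(rng.random(N) < p, 1.0, -1.0)

# Hypothetical stand-in for one stochastic training run on subset x
# (SGD noise, augmentation, ...); in practice a full deep-learning run.
def noisy_train_and_eval(x):
    return float(np.sum(x == 1)) / N + 0.1 * rng.normal()

def estimate_f(x, n_runs=20):
    """Average over training stochasticity so f can be treated as deterministic."""
    return float(np.mean([noisy_train_and_eval(x) for _ in range(n_runs)]))

print(estimate_f(sample_subset()))
```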
This scenario is close to the classical study of Boolean functions via harmonic analysis, except that our function is real-valued. Using those tools, we provide the following new mathematical understanding:
1. We give reasons for the existence of the datamodels of Ilyas et al. [2022], the phenomenon that functions related to test error are well-approximable by a linear function $\theta_0 + \sum_i \theta_i x_i$. See Section 3.1.

2. Section 2 gives exact characterizations of the $\theta_i$'s for datamodels with/without regularization. (Earlier, Ilyas et al. [2022] noted this for a special case: $p = 0.5$ with $\ell_2$ regularization.)

3. Using our framework, we give a new algorithm to estimate the degree to which a test function $f$ is well-approximated by a linear datamodel, without having to train the datamodel per se. See Section 3.2, where our method needs only $O(1/\epsilon^3)$ samples instead of $O(N/\epsilon^2)$.

4. We study group influence, which quantifies the effect of adding or deleting a set $I$ of datapoints to $x$. Ilyas et al. [2022] note that this can often be well-approximated by linearly adding the individual influences of the points in $I$. Section 4 clarifies simple settings where linearity would fail, by a factor exponentially large in $|I|$, and also discusses potential reasons for the observed linearity.
1.2 Other related work
Narasimhan et al. [2015] investigate when influence is PAC-learnable. Basu et al. [2020b] use second-order influence functions and find that they make better predictions than first-order influence functions. Cohen et al. [2020] use influence functions to detect adversarial examples. Kong et al. [2021] propose an influence-based relabeling function that can relabel harmful examples to improve generalization instead of just discarding them. Zhang and Zhang [2022] use Neural Tangent Kernels to understand influence functions rigorously for highly overparametrized nets.
Pruthi et al. [2020] give another notion of influence by tracing the effect of data points on the loss throughout gradient descent. Chen et al. [2020] define multi-stage influence functions to trace influence all the way back to pre-training, to find which samples were most helpful during pre-training. Basu et al. [2020a] find that influence functions are fragile, in the sense that the quality of influence estimates depends on the architecture and training procedure. Alaa and Van Der Schaar [2020] use higher-order influence functions to characterize uncertainty in a jackknife estimate. Teso et al. [2021] introduce CINCER, which uses influence functions to identify suspicious pairs of examples for interactive label cleaning. Rahaman et al. [2019] use harmonic analysis to decompose a neural network into a piecewise linear Fourier series, finding that neural networks exhibit spectral bias.
Other instance-based interpretability techniques include Representer Point Selection [Yeh et al., 2018], Grad-Cos [Charpiat et al., 2019], Grad-Dot [Hanawa et al., 2020], MMD-Critic [Kim et al., 2016], and unconditional counterfactual explanations [Wachter et al., 2017].

Variants on influence functions have also been proposed, including those using Fisher kernels [Khanna et al., 2019], tricks for faster and more scalable inference [Guo et al., 2021, Schioppa et al., 2022], and identifying relevant training samples with relative influence [Barshan et al., 2020].
Discrete influence played a prominent role in the surprising discovery of the long-tail phenomenon in Feldman [2020], Feldman and Zhang [2020]: the experimental finding that in large datasets like ImageNet, a significant fraction of training points are atypical, in the sense that the model does not easily learn to classify them correctly if the point is removed from the training set.
2 Harmonic analysis, influence functions and datamodels
In this section we introduce notation for the standard harmonic analysis of functions on the hypercube [O'Donnell, 2014], and establish connections between the corresponding Fourier coefficients, the discrete influence of data points, and the linear datamodels of Ilyas et al. [2022].
2.1 Preliminaries: harmonic analysis
In the conceptual framework of Section 1.1, let $[N] := \{1, 2, 3, \ldots, N\}$. Viewing $f: \{\pm 1\}^N \to \mathbb{R}$ as a vector in $\mathbb{R}^{2^N}$, for any distribution $\mathcal{D}$ on $\{\pm 1\}^N$, the set of all such functions can be treated as a vector space with inner product defined as $\langle f, g \rangle_{\mathcal{D}} = \mathbb{E}_{x \sim \mathcal{D}}[f(x) g(x)]$, leading to a norm defined as $\|f\|_{\mathcal{D}} = \sqrt{\mathbb{E}_{x \sim \mathcal{D}}[f(x)^2]}$. Harmonic analysis involves identifying special orthonormal bases for this vector space. We are interested in $f$'s values at or near $p$-biased points $x \in \{\pm 1\}^N$, where $x$ is viewed as a random variable whose coordinates are each independently set to $+1$ with probability $p$. We denote this distribution by $\mathcal{B}_p$. Properties of $f$ in this setting are best studied using the orthonormal basis functions $\{\varphi_S : S \subseteq [N]\}$ defined as $\varphi_S(x) = \prod_{i \in S} \frac{x_i - \mu}{\sigma}$, where $\mu = 2p - 1$ and $\sigma^2 = 4p(1-p)$ are the mean and variance of each coordinate of $x$. Orthonormality implies that $\mathbb{E}_x[\varphi_S(x)] = 0$ when $S \neq \emptyset$ and $\langle \varphi_S, \varphi_{S'} \rangle_{\mathcal{B}_p} = \mathbb{1}\{S = S'\}$. Then every $f: \{\pm 1\}^N \to \mathbb{R}$ can be expressed as $f = \sum_{S \subseteq [N]} \hat{f}_S \varphi_S$. Our discussion will often refer to the $\hat{f}_S$'s as the "Fourier" coefficients of $f$, when the orthonormal basis is clear from context.
This also implies Parseval's identity: $\sum_S \hat{f}_S^2 = \|f\|_{\mathcal{B}_p}^2$. For any vector $z \in \mathbb{R}^d$, we use $z_i$ to denote its $i$-th coordinate (with 1-indexing). For a matrix $A \in \mathbb{R}^{d \times r}$, $A_{i,:} \in \mathbb{R}^r$ and $A_{:,j} \in \mathbb{R}^d$ denote its $i$-th row and $j$-th column respectively. We use $\|\cdot\|$ to denote the Euclidean norm when not otherwise specified.
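The following sketch makes these definitions concrete: it enumerates $\{\pm 1\}^N$ exactly for a small $N$ and checks orthonormality of the $\varphi_S$ under $\mathcal{B}_p$ numerically. All names here are our own, not from the paper.

```python
import numpy as np
from itertools import combinations, product

p, N = 0.3, 4   # small N so {-1,+1}^N can be enumerated exactly
mu, sigma = 2 * p - 1, 2 * np.sqrt(p * (1 - p))

# All x in {-1,+1}^N together with their B_p probabilities.
xs = np.array(list(product([-1.0, 1.0], repeat=N)))
probs = np.prod(np.where(xs == 1, p, 1 - p), axis=1)

def phi(S, x):
    """Basis function phi_S(x) = prod_{i in S} (x_i - mu) / sigma."""
    return np.prod([(x[:, i] - mu) / sigma for i in S], axis=0) if S else np.ones(len(x))

# Check orthonormality: <phi_S, phi_T>_{B_p} = 1 iff S == T, else 0.
sets = [S for k in range(N + 1) for S in combinations(range(N), k)]
G = np.array([[np.sum(probs * phi(S, xs) * phi(T, xs)) for T in sets] for S in sets])
assert np.allclose(G, np.eye(len(sets)))   # the Gram matrix is the identity
```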
2.2 Influence functions
We use the notion of the influence of a single point from Feldman and Zhang [2020], Ilyas et al. [2022]. The influence of the $i$-th coordinate on $f$ at $x$ is defined as $\mathrm{Inf}_i(f(x)) = f(x|_{i \to 1}) - f(x|_{i \to -1})$, where $x$ is sampled as a $p$-biased training set and $x|_{i \to 1}$ is $x$ with the $i$-th coordinate set to 1.
Proposition 2.1 (Individual influence). The "leave-one-out" influences satisfy the following:
$$\mathrm{Inf}_i(f(x)) = \frac{2}{\sigma} \sum_{S \ni i} \hat{f}_S \, \varphi_{S \setminus \{i\}}(x), \qquad \mathrm{Inf}_i(f) := \mathbb{E}_x[\mathrm{Inf}_i(f(x))] = \frac{2}{\sigma} \hat{f}_{\{i\}}. \tag{1}$$
Thus the degree-1 Fourier coefficients are directly related to the average influence of individual points. Similar results can be shown for other definitions of single-point influence: $\mathbb{E}_x[f(x) - f(x|_{i \to -1})]$ and $\mathbb{E}_x[f(x|_{i \to 1}) - f(x)]$ are equal to $p \, \mathrm{Inf}_i(f)$ and $(1-p) \, \mathrm{Inf}_i(f)$ respectively. The proof follows by observing that $f(x|_{i \to 1}) - f(x|_{i \to -1}) = \sum_{S \ni i} \hat{f}_S \left( \frac{1 - \mu}{\sigma} - \frac{-1 - \mu}{\sigma} \right) \varphi_{S \setminus \{i\}}(x)$. The only term that is not zero in expectation is the one for $S = \{i\}$, thus proving the result. Section 4 deals with the influence of adding or deleting larger subsets of points.
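A short numerical check of Proposition 2.1, reusing the exact-enumeration setup above: the expected leave-one-out influence of coordinate $i$, computed directly from the definition, matches $\frac{2}{\sigma}\hat{f}_{\{i\}}$ for an arbitrary $f$. This is our own verification sketch, not code from the paper.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
p, N = 0.3, 4
mu, sigma = 2 * p - 1, 2 * np.sqrt(p * (1 - p))

# Rows of xs are ordered so that row j is the binary expansion of j
# (coordinate 0 is the most significant bit, with -1 <-> 0 and +1 <-> 1).
xs = np.array(list(product([-1.0, 1.0], repeat=N)))
probs = np.prod(np.where(xs == 1, p, 1 - p), axis=1)
f = rng.normal(size=len(xs))   # an arbitrary function f: {-1,+1}^N -> R

i = 2
fhat_i = np.sum(probs * f * (xs[:, i] - mu) / sigma)   # degree-1 coefficient f̂_{i}

# E_x[Inf_i(f(x))] straight from the definition: set coordinate i up/down.
bit = 1 << (N - 1 - i)
up, dn = np.arange(len(xs)) | bit, np.arange(len(xs)) & ~bit
inf_i = np.sum(probs * (f[up] - f[dn]))

assert np.isclose(inf_i, (2 / sigma) * fhat_i)   # Proposition 2.1
```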
Continuous vs Discrete Influence. Koh and Liang [2017] utilize a continuous notion of influence: train a model using dataset $x \in \{\pm 1\}^N$, and then treat the $i$-th coordinate of $x$ as a continuous variable $x_i$. Compute $\frac{df}{dx_i}$ at $x_i = 1$ using gradients and Hessians of the loss at the end of training. This is called the influence of the $i$-th datapoint on $f$. As mentioned in Section 1, in several other contexts one uses the discrete influence (1), which has a better connection to harmonic analysis. Experiments in Koh and Liang [2017] suggest that in a variety of models, the continuous influence closely tracks the discrete influence.
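As a sketch of why the two notions can track each other, consider ridge regression, where both are exactly computable. The toy comparison below is our own construction, not the Koh-Liang deep-net estimator, though it uses the same gradient/Hessian formula: perturb the weight of point $i$ by $-1/n$ and compare against true leave-one-out retraining.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1e-2

X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
x_t, y_t = rng.normal(size=d), 0.0   # a test point

def ridge(X, y):
    """Minimizer of (1/n)·Σ_j (w·x_j - y_j)^2 + lam·||w||^2."""
    return np.linalg.solve(X.T @ X / len(y) + lam * np.eye(d), X.T @ y / len(y))

def test_loss(w):
    return (w @ x_t - y_t) ** 2

w = ridge(X, y)
H = 2 * X.T @ X / n + 2 * lam * np.eye(d)   # Hessian of the training objective

for i in range(3):
    # Continuous influence: first-order predicted change in test loss when
    # the weight of point i is perturbed by -1/n (i.e. the point is removed).
    grad_i = 2 * (X[i] @ w - y[i]) * X[i]
    grad_t = 2 * (w @ x_t - y_t) * x_t
    pred = grad_t @ np.linalg.solve(H, grad_i) / n
    # Discrete influence: actually retrain without point i.
    w_loo = ridge(np.delete(X, i, 0), np.delete(y, i, 0))
    print(f"point {i}: continuous {pred:+.5f}  discrete {test_loss(w_loo) - test_loss(w):+.5f}")
```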
2.3 Linear datamodels from a Harmonic Analysis Lens
Next we turn to the phenomenon, observed by Ilyas et al. [2022], that a function $f(x)$ related to the average test error¹ (where $x$ is the training set) often turns out to be approximable by a linear function $\theta_0 + \sum_i \theta_i \bar{x}_i$, where $\bar{x} \in \{0,1\}^N$ is the binary version of $x \in \{\pm 1\}^N$. It is important to note that this approximation (when it exists) holds only in a least-squares sense, meaning that the following is small: $\mathbb{E}_x[(f(x) - \theta_0 - \sum_i \theta_i \bar{x}_i)^2]$, where the expectation is over $p$-biased $x$.

¹The average is not over test error but over the difference between the correct-label logits and the top logit among the rest, at the output layer of the deep net.

The authors suggest that $\theta_i$ can be seen as an estimate of the average discrete influence of variable $i$. While this is intuitive, they do not give a general proof (their Lemma 2 proves it for $p = 1/2$ with $\ell_2$ regularization). The following result exactly characterizes the solutions for arbitrary $p$ and with both $\ell_1$ and $\ell_2$ regularization.
Theorem 2.2 (Characterizing solutions to linear datamodels). Denote the quality of a linear datamodel $\theta \in \mathbb{R}^{N+1}$ on the $p$-biased distribution $\mathcal{B}_p$ over training sets by
$$R(\theta) := \mathbb{E}_{x \sim \mathcal{B}_p}\left[\left(f(x) - \theta_0 - \sum_{i=1}^{N} \theta_i \bar{x}_i\right)^2\right], \tag{2}$$
where $\bar{x} \in \{0,1\}^N$ is the binary version of $x \in \{\pm 1\}^N$. Then, for $\mu = 2p - 1$ and $\sigma = \sqrt{4p(1-p)}$, the following are true about the optimal datamodels with and without regularization:
(a) The unregularized minimizer $\theta^\star = \arg\min_\theta R(\theta)$ satisfies
$$\theta^\star_i = \frac{2}{\sigma} \hat{f}_{\{i\}}, \qquad \theta^\star_0 = \hat{f}_{\emptyset} - \frac{(\mu + 1)}{2} \sum_i \theta^\star_i. \tag{3}$$
Furthermore, the residual error is the sum of squares of all Fourier coefficients of order 2 or higher:
$$R(\theta^\star) = B_{\geq 2} := \sum_{S \subseteq [N] : |S| \geq 2} \hat{f}_S^2. \tag{4}$$

(b) The minimizer with $\ell_2$ regularization, $\theta^\star(\lambda, \ell_2) = \arg\min_\theta \left\{ R(\theta) + \lambda \|\theta_{1:N}\|_2^2 \right\}$, satisfies
$$\theta^\star(\lambda, \ell_2)_i = \frac{2}{\sigma} \left(1 + \frac{4\lambda}{\sigma^2}\right)^{-1} \hat{f}_{\{i\}}. \tag{5}$$

(c) The minimizer with $\ell_1$ regularization, $\theta^\star(\lambda, \ell_1) = \arg\min_\theta \left\{ R(\theta) + \lambda \|\theta_{1:N}\|_1 \right\}$, satisfies
$$\theta^\star(\lambda, \ell_1)_i = \frac{2}{\sigma} \left( \left[\hat{f}_{\{i\}} - \frac{\lambda}{\sigma}\right]_+ - \left[-\hat{f}_{\{i\}} - \frac{\lambda}{\sigma}\right]_+ \right) = \frac{2}{\sigma} \, \mathrm{sign}\left(\hat{f}_{\{i\}}\right) \left[\left|\hat{f}_{\{i\}}\right| - \frac{\lambda}{\sigma}\right]_+, \tag{6}$$
where $[z]_+ = z \cdot \mathbb{1}\{z > 0\}$ is the standard ReLU operation.
This result shows that the optimal linear datamodels under the various regularization schemes, for any $p \in (0, 1)$, are directly related to the first-order Fourier coefficients at $p$. Given that the average discrete influences, from Equation (1), are also the first-order coefficients, this result directly establishes a connection between datamodels and influences. Result (c) suggests that $\ell_1$ regularization has the effect of clipping the Fourier coefficients such that those with small magnitude are set to 0, thus encouraging sparse datamodels. Furthermore, Equation (4) gives a simple expression for the residual of the best linear fit, which we utilize for our efficient residual estimation procedure in Section 3.2.
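Theorem 2.2(a) can be verified numerically by exact enumeration for small $N$: solve the weighted least-squares problem (2) directly and compare against the closed forms (3) and (4). The sketch below is our own check, not code from the paper.

```python
import numpy as np
from itertools import combinations, product

rng = np.random.default_rng(2)
p, N = 0.3, 4
mu, sigma = 2 * p - 1, 2 * np.sqrt(p * (1 - p))

xs = np.array(list(product([-1.0, 1.0], repeat=N)))
probs = np.prod(np.where(xs == 1, p, 1 - p), axis=1)
f = rng.normal(size=len(xs))   # an arbitrary f: {-1,+1}^N -> R

def phi(S, x):
    return np.prod([(x[:, i] - mu) / sigma for i in S], axis=0) if S else np.ones(len(x))

fhat = {S: np.sum(probs * f * phi(S, xs))
        for k in range(N + 1) for S in combinations(range(N), k)}

# Solve the weighted least-squares problem (2) directly over x̄ in {0,1}^N.
Xbar = np.hstack([np.ones((len(xs), 1)), (xs + 1) / 2])
W = np.diag(probs)
theta = np.linalg.solve(Xbar.T @ W @ Xbar, Xbar.T @ W @ f)

# Theorem 2.2(a): θ_i = (2/σ)·f̂_{i}, θ_0 = f̂_∅ - ((μ+1)/2)·Σ_i θ_i, and the
# residual equals the total squared Fourier weight on sets of size >= 2.
assert np.allclose(theta[1:], [2 / sigma * fhat[(i,)] for i in range(N)])
assert np.isclose(theta[0], fhat[()] - (mu + 1) / 2 * theta[1:].sum())
resid = np.sum(probs * (f - Xbar @ theta) ** 2)
assert np.isclose(resid, sum(v ** 2 for S, v in fhat.items() if len(S) >= 2))
```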
The proof of the full result is presented in Appendix A.1; here we present a proof sketch of result (a) to highlight the role of the Fourier basis and coefficients.
Proof sketch for Theorem 2.2(a). Since $\{\varphi_S\}_{S \subseteq [N]}$ is an orthonormal basis for the inner product space with $\langle f, g \rangle_{\mathcal{B}_p} = \mathbb{E}_{x \sim \mathcal{B}_p}[f(x) g(x)]$, we can rewrite $R(\theta)$ as follows:
$$R(\theta) = \mathbb{E}_{x \sim \mathcal{B}_p}\left[\left(f(x) - \theta_0 - \sum_i \theta_i \bar{x}_i\right)^2\right] = \mathbb{E}_{x \sim \mathcal{B}_p}\left[\left(f(x) - \bar{\theta}_0 - \sum_i \bar{\theta}_i \varphi_{\{i\}}(x)\right)^2\right] = \left\| f - \bar{\theta}_0 - \sum_i \bar{\theta}_i \varphi_{\{i\}} \right\|_{\mathcal{B}_p}^2,$$
where $\|\cdot\|_{\mathcal{B}_p} := \sqrt{\langle \cdot, \cdot \rangle_{\mathcal{B}_p}}$, $\bar{\theta}_i = (\sigma/2)\, \theta_i$, and $\bar{\theta}_0 = \theta_0 + \frac{(\mu+1)}{2} \sum_i \theta_i$; the second equality uses $\bar{x}_i = \frac{1 + x_i}{2} = \frac{\mu + 1 + \sigma \varphi_{\{i\}}(x)}{2}$. Due to the orthonormality of $\{\varphi_S\}_{S \subseteq [N]}$, the minimizer $\bar{\theta}^\star$ corresponds to the projection of $f$ onto the span of $\{\varphi_S\}_{|S| \leq 1}$, which gives $\bar{\theta}_i = \hat{f}_{\{i\}}$ and thus $\theta_i = \frac{2}{\sigma} \hat{f}_{\{i\}}$. Furthermore, the residual is the squared norm of the projection of $f$ onto the orthogonal subspace $\mathrm{span}\{\varphi_S\}_{|S| \geq 2}$, which is precisely $\sum_{|S| \geq 2} \hat{f}_S^2$.