Understanding Influence Functions and Datamodels
via Harmonic Analysis
Nikunj Saunshi Arushi Gupta Mark Braverman Sanjeev Arora
{nsaunshi, arushig, mbraverm, arora}@cs.princeton.edu
Department of Computer Science, Princeton University
October 4, 2022
Abstract
Influence functions estimate the effect of individual training data points on the model's predictions on test data; they were adapted to deep learning by Koh and Liang [2017]. They have been used for detecting data poisoning, detecting helpful and harmful examples, estimating the influence of groups of datapoints, etc. Recently, Ilyas et al. [2022] introduced a linear regression method they termed datamodels to predict the effect of training points on outputs on test data. The current paper seeks to provide a better theoretical understanding of such interesting empirical phenomena. The primary tool is harmonic analysis and the idea of noise stability. Contributions include: (a) an exact characterization of the learnt datamodel in terms of Fourier coefficients; (b) an efficient method to estimate the residual error and quality of the optimum linear datamodel without having to train the datamodel; (c) new insights into when influences of groups of datapoints may or may not add up linearly.
1 Introduction
It is often of great interest to quantify how the presence or absence of a particular training data point affects the trained model's performance on test data points. Influence functions are a classical idea for this [Jaeckel, 1972, Hampel, 1974, Cook, 1977] that has recently been adapted to modern deep models and large datasets by Koh and Liang [2017]. Influence functions have been applied to explain predictions and produce confidence intervals [Schulam and Saria, 2019], investigate model bias [Brunet et al., 2019, Wang et al., 2019], estimate Shapley values [Jia et al., 2019, Ghorbani and Zou, 2019], improve human trust [Zhou et al., 2019], and craft data poisoning attacks [Koh et al., 2019].
Influence actually has different formalizations. The classic calculus-based estimate (henceforth referred to as continuous influence) involves conceptualizing the training loss as a weighted sum over training datapoints, where the weight of a particular datapoint $z$ can be varied infinitesimally. Using gradients and the Hessian, one obtains an expression for the rate of change in the test error (or other functions) of a test point $z'$ with respect to (infinitesimal) changes to the weight of $z$. Though the estimate is derived only for infinitesimal changes to the weight of $z$ in the training set, in practice it has also been employed as a reasonable estimate for the discrete notion of influence, which is the effect of completely adding/removing the data point from the training dataset [Koh and Liang, 2017]. Informally speaking, this discrete influence is defined as $f(S \cup \{i\}) - f(S)$, where $f$ is some function of the test points, $S$ is a training dataset, and $i$ is the index of a training point. (This can be noisy, so several papers use the expected influence of $i$ by taking the expectation over a random choice of $S$ of a certain size; see Section 2.) Koh and Liang [2017] as well as subsequent papers have used continuous influence to estimate the effect of decidedly non-infinitesimal changes to the dataset, such as changing the training set by adding or deleting entire groups of datapoints [Koh et al., 2019]. Recently, Bae et al. [2022] show mathematical
reasons why this is not well-founded, and give a clearer explanation (and alternative implementation)
of Koh-Liang style estimators.
Yet another idea related to influence functions is the linear datamodels of Ilyas et al. [2022]. By training many models on subsets containing a $p$ fraction of the datapoints in the training set, the authors show that some interesting measures of test error (defined using logit values) behave as follows: the measure $f(x)$ is well-approximable as a (sparse) linear expression $\theta_0 + \sum_i \theta_i x_i$, where $x$ is a binary vector denoting a sample of a $p$ fraction of the training datapoints, with $x_i = 1$ indicating presence of the $i$-th training point and $x_i = -1$ denoting absence. The coefficients $\theta_i$ are estimated via lasso regression. The surprise here is that $f(x)$, which is the result of deep learning on dataset $x$, is well-approximated by $\theta_0 + \sum_i \theta_i x_i$. The authors note that the $\theta_i$'s can be viewed as heuristic estimates for the discrete influence of the $i$-th datapoint.
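As a concrete illustration, here is a minimal sketch of the datamodel-fitting pipeline. The function `f` below is a synthetic stand-in for the real procedure, in which each evaluation of $f(\bar{x})$ is a full deep-learning training run followed by a test-margin measurement; the sparsity pattern, sample counts, and regularization strength are our illustrative assumptions, not the settings of Ilyas et al. [2022].

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, m, p = 200, 2000, 0.5   # training-set size, number of sampled subsets, inclusion prob.

# Synthetic stand-in for "train a model on subset x̄ and measure a test margin":
# a sparse linear signal plus a small pairwise interaction and noise.
true_theta = np.zeros(N)
true_theta[:10] = rng.normal(size=10)
def f(xbar):
    return 1.0 + xbar @ true_theta + 0.05 * xbar[0] * xbar[1] + 0.01 * rng.normal()

masks = (rng.random((m, N)) < p).astype(float)   # rows are binary vectors x̄ in {0,1}^N
ys = np.array([f(xb) for xb in masks])

datamodel = Lasso(alpha=0.005).fit(masks, ys)    # sparse fit of θ0 + Σ_i θ_i x̄_i
print(np.round(datamodel.coef_[:12], 2))         # ≈ true_theta on the first ten coordinates
```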
The current paper seeks to provide a better theoretical understanding of the above-mentioned phenomena concerning discrete influence functions. At first sight this quest appears difficult. The calculus definition of influence functions (which, as mentioned, is also used in practice to estimate the discrete notions of influence) involves Hessians and gradients evaluated on the trained net, and thus one imagines that any explanation for properties of influence functions must await a better mathematical understanding of datasets, net architectures, and training algorithms.

Surprisingly, we show that the explanation for many observed properties turns out to be fairly generic. Our chief technical tool is harmonic analysis, and especially the theory of noise stability of functions (see O'Donnell [2014] for an excellent survey).
1.1 Our Conceptual framework (Discrete Influence)
Training data points are numbered 1 through $N$, but the model is trained on a random subset of the data points, where each data point is included in the subset independently with probability $p$. (This is precisely the setting in linear datamodels.) For notational ease and consistency with harmonic analysis, we denote this subset by $x \in \{-1,+1\}^N$, where $+1$ means the corresponding data point was included. We are interested in some quantity $f(x)$ associated with the trained model on one or more test data points. Note that $f$ is a probabilistic function of $x$ due to stochasticity in deep net training (SGD, dropout, data augmentation, etc.), but one can average over the stochastic choices and think of $f$ as a deterministic function $f: \{\pm 1\}^N \to \mathbb{R}$. (In practice, this means we estimate $f(x)$ by repeating the training on $x$, say, 10 to 50 times.)
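A minimal sketch of this setup, with a hypothetical `noisy_train_and_eval` standing in for one stochastic deep-learning run:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 0.5

def sample_subset():
    """Draw x ~ B_p over {-1,+1}^N: +1 means the training point is included."""
    return np.where(rng.random(N) < p, 1.0, -1.0)

# Hypothetical stand-in for one stochastic training run on subset x
# (SGD noise, augmentation, ...); in practice a full deep-learning run.
def noisy_train_and_eval(x):
    return float(np.sum(x == 1)) / N + 0.1 * rng.normal()

def estimate_f(x, n_runs=20):
    """Average over training stochasticity so f can be treated as deterministic."""
    return float(np.mean([noisy_train_and_eval(x) for _ in range(n_runs)]))

print(estimate_f(sample_subset()))
```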
This scenario is close to the classical study of Boolean functions via harmonic analysis, except that our function is real-valued. Using those tools, we provide the following new mathematical understanding:
1. We give reasons for the existence of the datamodels of Ilyas et al. [2022], the phenomenon that functions related to test error are well-approximable by a linear function $\theta_0 + \sum_i \theta_i x_i$. See Section 3.1.

2. Section 2 gives exact characterizations of the $\theta_i$'s for datamodels with/without regularization. (Earlier, Ilyas et al. [2022] noted this for a special case: $p = 0.5$ with $\ell_2$ regularization.)

3. Using our framework, we give a new algorithm to estimate the degree to which a test function $f$ is well-approximated by a linear datamodel, without having to train the datamodel per se. See Section 3.2, where our method needs only $O(1/\epsilon^3)$ samples instead of $O(N/\epsilon^2)$.

4. We study group influence, which quantifies the effect of adding or deleting a set $I$ of datapoints to $x$. Ilyas et al. [2022] note that this can often be well-approximated by linearly adding the individual influences of the points in $I$. Section 4 clarifies simple settings where linearity would fail, by a factor exponentially large in $|I|$, and also discusses potential reasons for the observed linearity.
1.2 Other related work
Narasimhan et al. [2015] investigate when influence is PAC-learnable. Basu et al. [2020b] use second-order influence functions and find that they make better predictions than first-order influence functions. Cohen et al. [2020] use influence functions to detect adversarial examples. Kong et al. [2021] propose an influence-based relabeling function that can relabel harmful examples to improve generalization instead of just discarding them. Zhang and Zhang [2022] use Neural Tangent Kernels to understand influence functions rigorously for highly overparametrized nets.
Pruthi et al. [2020] give another notion of influence by tracing the effect of data points on the loss throughout gradient descent. Chen et al. [2020] define multi-stage influence functions to trace influence all the way back to pre-training, to find which samples were most helpful during pre-training. Basu et al. [2020a] find that influence functions are fragile, in the sense that the quality of influence estimates depends on the architecture and training procedure. Alaa and Van Der Schaar [2020] use higher-order influence functions to characterize uncertainty in a jackknife estimate. Teso et al. [2021] introduce CINCER, which uses influence functions to identify suspicious pairs of examples for interactive label cleaning. Rahaman et al. [2019] use harmonic analysis to decompose a neural network into a piecewise linear Fourier series, finding that neural networks exhibit spectral bias.
Other instance-based interpretability techniques include Representer Point Selection [Yeh et al., 2018], Grad-Cos [Charpiat et al., 2019], Grad-Dot [Hanawa et al., 2020], MMD-Critic [Kim et al., 2016], and unconditional counterfactual explanations [Wachter et al., 2017].

Variants on influence functions have also been proposed, including those using Fisher kernels [Khanna et al., 2019], tricks for faster and more scalable inference [Guo et al., 2021, Schioppa et al., 2022], and identifying relevant training samples with relative influence [Barshan et al., 2020].
Discrete influence played a prominent role in the surprising discovery of the long-tail phenomenon in Feldman [2020], Feldman and Zhang [2020]: the experimental finding that in large datasets like ImageNet, a significant fraction of training points are atypical, in the sense that the model does not easily learn to classify them correctly if the point is removed from the training set.
2 Harmonic analysis, influence functions and datamodels
In this section we introduce notation for the standard harmonic analysis of functions on the hypercube [O'Donnell, 2014], and establish connections between the corresponding Fourier coefficients, the discrete influence of data points, and the linear datamodels of Ilyas et al. [2022].
2.1 Preliminaries: harmonic analysis
In the conceptual framework of Section 1.1, let $[N] := \{1, 2, 3, \ldots, N\}$. Viewing $f: \{\pm 1\}^N \to \mathbb{R}$ as a vector in $\mathbb{R}^{2^N}$, for any distribution $\mathcal{D}$ on $\{\pm 1\}^N$, the set of all such functions can be treated as a vector space with inner product defined as $\langle f, g \rangle_{\mathcal{D}} = \mathbb{E}_{x \sim \mathcal{D}}[f(x) g(x)]$, leading to a norm defined as $\|f\|_{\mathcal{D}} = \sqrt{\mathbb{E}_{x \sim \mathcal{D}}[f(x)^2]}$. Harmonic analysis involves identifying special orthonormal bases for this vector space. We are interested in $f$'s values at or near $p$-biased points $x \in \{\pm 1\}^N$, where $x$ is viewed as a random variable whose coordinates are each independently set to $+1$ with probability $p$. We denote this distribution by $\mathcal{B}_p$. Properties of $f$ in this setting are best studied using the orthonormal basis functions $\{\varphi_S : S \subseteq [N]\}$ defined as $\varphi_S(x) = \prod_{i \in S} \frac{x_i - \mu}{\sigma}$, where $\mu = 2p - 1$ and $\sigma^2 = 4p(1-p)$ are the mean and variance of each coordinate of $x$. Orthonormality implies that $\mathbb{E}_x[\varphi_S(x)] = 0$ when $S \neq \emptyset$ and $\langle \varphi_S, \varphi_{S'} \rangle_{\mathcal{B}_p} = \mathbb{1}\{S = S'\}$. Then every $f: \{\pm 1\}^N \to \mathbb{R}$ can be expressed as $f = \sum_{S \subseteq [N]} \hat{f}_S \varphi_S$. Our discussion will often refer to the $\hat{f}_S$'s as the "Fourier" coefficients of $f$, when the orthonormal basis is clear from context.
This also implies Parseval's identity: $\sum_S \hat{f}_S^2 = \|f\|_{\mathcal{B}_p}^2$. For any vector $z \in \mathbb{R}^d$, we use $z_i$ to denote its $i$-th coordinate (with 1-indexing). For a matrix $A \in \mathbb{R}^{d \times r}$, $A_{i,:} \in \mathbb{R}^r$ and $A_{:,j} \in \mathbb{R}^d$ denote its $i$-th row and $j$-th column respectively. We use $\|\cdot\|$ to denote the Euclidean norm when not otherwise specified.
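The following sketch makes these definitions concrete: it enumerates $\{\pm 1\}^N$ exactly for a small $N$ and checks orthonormality of the $\varphi_S$ under $\mathcal{B}_p$ numerically. All names here are our own, not from the paper.

```python
import numpy as np
from itertools import combinations, product

p, N = 0.3, 4   # small N so {-1,+1}^N can be enumerated exactly
mu, sigma = 2 * p - 1, 2 * np.sqrt(p * (1 - p))

# All x in {-1,+1}^N together with their B_p probabilities.
xs = np.array(list(product([-1.0, 1.0], repeat=N)))
probs = np.prod(np.where(xs == 1, p, 1 - p), axis=1)

def phi(S, x):
    """Basis function phi_S(x) = prod_{i in S} (x_i - mu) / sigma."""
    return np.prod([(x[:, i] - mu) / sigma for i in S], axis=0) if S else np.ones(len(x))

# Check orthonormality: <phi_S, phi_T>_{B_p} = 1 iff S == T, else 0.
sets = [S for k in range(N + 1) for S in combinations(range(N), k)]
G = np.array([[np.sum(probs * phi(S, xs) * phi(T, xs)) for T in sets] for S in sets])
assert np.allclose(G, np.eye(len(sets)))   # the Gram matrix is the identity
```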
2.2 Influence functions
We use the notion of the influence of a single point from Feldman and Zhang [2020], Ilyas et al. [2022]. The influence of the $i$-th coordinate on $f$ at $x$ is defined as $\mathrm{Inf}_i(f(x)) = f(x|_{i \to 1}) - f(x|_{i \to -1})$, where $x$ is sampled as a $p$-biased training set and $x|_{i \to 1}$ is $x$ with the $i$-th coordinate set to 1.
Proposition 2.1 (Individual influence). The "leave-one-out" influences satisfy the following:
$$\mathrm{Inf}_i(f(x)) = \frac{2}{\sigma} \sum_{S \ni i} \hat{f}_S \, \varphi_{S \setminus \{i\}}(x), \qquad \mathrm{Inf}_i(f) := \mathbb{E}_x[\mathrm{Inf}_i(f(x))] = \frac{2}{\sigma} \hat{f}_{\{i\}}. \tag{1}$$
Thus the degree-1 Fourier coefficients are directly related to the average influence of individual points. Similar results can be shown for other definitions of single-point influence: $\mathbb{E}_x[f(x) - f(x|_{i \to -1})]$ and $\mathbb{E}_x[f(x|_{i \to 1}) - f(x)]$ are equal to $p \, \mathrm{Inf}_i(f)$ and $(1-p) \, \mathrm{Inf}_i(f)$ respectively. The proof follows by observing that $f(x|_{i \to 1}) - f(x|_{i \to -1}) = \sum_{S \ni i} \hat{f}_S \left( \frac{1 - \mu}{\sigma} - \frac{-1 - \mu}{\sigma} \right) \varphi_{S \setminus \{i\}}(x)$. The only term that is not zero in expectation is the one for $S = \{i\}$, thus proving the result. Section 4 deals with the influence of adding or deleting larger subsets of points.
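A short numerical check of Proposition 2.1, reusing the exact-enumeration setup above: the expected leave-one-out influence of coordinate $i$, computed directly from the definition, matches $\frac{2}{\sigma}\hat{f}_{\{i\}}$ for an arbitrary $f$. This is our own verification sketch, not code from the paper.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
p, N = 0.3, 4
mu, sigma = 2 * p - 1, 2 * np.sqrt(p * (1 - p))

# Rows of xs are ordered so that row j is the binary expansion of j
# (coordinate 0 is the most significant bit, with -1 <-> 0 and +1 <-> 1).
xs = np.array(list(product([-1.0, 1.0], repeat=N)))
probs = np.prod(np.where(xs == 1, p, 1 - p), axis=1)
f = rng.normal(size=len(xs))   # an arbitrary function f: {-1,+1}^N -> R

i = 2
fhat_i = np.sum(probs * f * (xs[:, i] - mu) / sigma)   # degree-1 coefficient f̂_{i}

# E_x[Inf_i(f(x))] straight from the definition: set coordinate i up/down.
bit = 1 << (N - 1 - i)
up, dn = np.arange(len(xs)) | bit, np.arange(len(xs)) & ~bit
inf_i = np.sum(probs * (f[up] - f[dn]))

assert np.isclose(inf_i, (2 / sigma) * fhat_i)   # Proposition 2.1
```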
Continuous vs Discrete Influence. Koh and Liang [2017] utilize a continuous notion of influence: train a model using dataset $x \in \{\pm 1\}^N$, and then treat the $i$-th coordinate of $x$ as a continuous variable $x_i$. Compute $\frac{df}{dx_i}$ at $x_i = 1$ using gradients and Hessians of the loss at the end of training. This is called the influence of the $i$-th datapoint on $f$. As mentioned in Section 1, in several other contexts one uses the discrete influence (1), which has a better connection to harmonic analysis. Experiments in Koh and Liang [2017] suggest that in a variety of models, the continuous influence closely tracks the discrete influence.
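As a sketch of why the two notions can track each other, consider ridge regression, where both are exactly computable. The toy comparison below is our own construction, not the Koh-Liang deep-net estimator, though it uses the same gradient/Hessian formula: perturb the weight of point $i$ by $-1/n$ and compare against true leave-one-out retraining.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1e-2

X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
x_t, y_t = rng.normal(size=d), 0.0   # a test point

def ridge(X, y):
    """Minimizer of (1/n)·Σ_j (w·x_j - y_j)^2 + lam·||w||^2."""
    return np.linalg.solve(X.T @ X / len(y) + lam * np.eye(d), X.T @ y / len(y))

def test_loss(w):
    return (w @ x_t - y_t) ** 2

w = ridge(X, y)
H = 2 * X.T @ X / n + 2 * lam * np.eye(d)   # Hessian of the training objective

for i in range(3):
    # Continuous influence: first-order predicted change in test loss when
    # the weight of point i is perturbed by -1/n (i.e. the point is removed).
    grad_i = 2 * (X[i] @ w - y[i]) * X[i]
    grad_t = 2 * (w @ x_t - y_t) * x_t
    pred = grad_t @ np.linalg.solve(H, grad_i) / n
    # Discrete influence: actually retrain without point i.
    w_loo = ridge(np.delete(X, i, 0), np.delete(y, i, 0))
    print(f"point {i}: continuous {pred:+.5f}  discrete {test_loss(w_loo) - test_loss(w):+.5f}")
```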
2.3 Linear datamodels from a Harmonic Analysis Lens
Next we turn to the phenomenon, observed by Ilyas et al. [2022], that a function $f(x)$ related to the average test error¹ (where $x$ is the training set) often turns out to be approximable by a linear function $\theta_0 + \sum_i \theta_i \bar{x}_i$, where $\bar{x} \in \{0,1\}^N$ is the binary version of $x \in \{\pm 1\}^N$. It is important to note that this approximation (when it exists) holds only in a least-squares sense, meaning that the following is small: $\mathbb{E}_x[(f(x) - \theta_0 - \sum_i \theta_i \bar{x}_i)^2]$, where the expectation is over $p$-biased $x$.

¹The average is not over test error but over the difference between the correct-label logits and the top logit among the rest, at the output layer of the deep net.

The authors suggest that $\theta_i$ can be seen as an estimate of the average discrete influence of variable $i$. While this is intuitive, they do not give a general proof (their Lemma 2 proves it for $p = 1/2$ with $\ell_2$ regularization). The following result exactly characterizes the solutions for arbitrary $p$ and with both $\ell_1$ and $\ell_2$ regularization.
Theorem 2.2 (Characterizing solutions to linear datamodels). Denote the quality of a linear datamodel $\theta \in \mathbb{R}^{N+1}$ on the $p$-biased distribution $\mathcal{B}_p$ over training sets by
$$R(\theta) := \mathbb{E}_{x \sim \mathcal{B}_p}\left[\left(f(x) - \theta_0 - \sum_{i=1}^{N} \theta_i \bar{x}_i\right)^2\right], \tag{2}$$
where $\bar{x} \in \{0,1\}^N$ is the binary version of $x \in \{\pm 1\}^N$. Then, for $\mu = 2p - 1$ and $\sigma = \sqrt{4p(1-p)}$, the following are true about the optimal datamodels with and without regularization:
(a) The unregularized minimizer $\theta^\star = \arg\min_\theta R(\theta)$ satisfies
$$\theta^\star_i = \frac{2}{\sigma} \hat{f}_{\{i\}}, \qquad \theta^\star_0 = \hat{f}_{\emptyset} - \frac{(\mu + 1)}{2} \sum_i \theta^\star_i. \tag{3}$$
Furthermore, the residual error is the sum of squares of all Fourier coefficients of order 2 or higher:
$$R(\theta^\star) = B_{\geq 2} := \sum_{S \subseteq [N] : |S| \geq 2} \hat{f}_S^2. \tag{4}$$

(b) The minimizer with $\ell_2$ regularization, $\theta^\star(\lambda, \ell_2) = \arg\min_\theta \left\{ R(\theta) + \lambda \|\theta_{1:N}\|_2^2 \right\}$, satisfies
$$\theta^\star(\lambda, \ell_2)_i = \frac{2}{\sigma} \left(1 + \frac{4\lambda}{\sigma^2}\right)^{-1} \hat{f}_{\{i\}}. \tag{5}$$

(c) The minimizer with $\ell_1$ regularization, $\theta^\star(\lambda, \ell_1) = \arg\min_\theta \left\{ R(\theta) + \lambda \|\theta_{1:N}\|_1 \right\}$, satisfies
$$\theta^\star(\lambda, \ell_1)_i = \frac{2}{\sigma} \left( \left[\hat{f}_{\{i\}} - \frac{\lambda}{\sigma}\right]_+ - \left[-\hat{f}_{\{i\}} - \frac{\lambda}{\sigma}\right]_+ \right) = \frac{2}{\sigma} \, \mathrm{sign}\left(\hat{f}_{\{i\}}\right) \left[\left|\hat{f}_{\{i\}}\right| - \frac{\lambda}{\sigma}\right]_+, \tag{6}$$
where $[z]_+ = z \cdot \mathbb{1}\{z > 0\}$ is the standard ReLU operation.
This result shows that the optimal linear datamodels under the various regularization schemes, for any $p \in (0, 1)$, are directly related to the first-order Fourier coefficients at $p$. Given that the average discrete influences, from Equation (1), are also the first-order coefficients, this result directly establishes a connection between datamodels and influences. Result (c) suggests that $\ell_1$ regularization has the effect of clipping the Fourier coefficients such that those with small magnitude are set to 0, thus encouraging sparse datamodels. Furthermore, Equation (4) gives a simple expression for the residual of the best linear fit, which we utilize for our efficient residual estimation procedure in Section 3.2.
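Theorem 2.2(a) can be verified numerically by exact enumeration for small $N$: solve the weighted least-squares problem (2) directly and compare against the closed forms (3) and (4). The sketch below is our own check, not code from the paper.

```python
import numpy as np
from itertools import combinations, product

rng = np.random.default_rng(2)
p, N = 0.3, 4
mu, sigma = 2 * p - 1, 2 * np.sqrt(p * (1 - p))

xs = np.array(list(product([-1.0, 1.0], repeat=N)))
probs = np.prod(np.where(xs == 1, p, 1 - p), axis=1)
f = rng.normal(size=len(xs))   # an arbitrary f: {-1,+1}^N -> R

def phi(S, x):
    return np.prod([(x[:, i] - mu) / sigma for i in S], axis=0) if S else np.ones(len(x))

fhat = {S: np.sum(probs * f * phi(S, xs))
        for k in range(N + 1) for S in combinations(range(N), k)}

# Solve the weighted least-squares problem (2) directly over x̄ in {0,1}^N.
Xbar = np.hstack([np.ones((len(xs), 1)), (xs + 1) / 2])
W = np.diag(probs)
theta = np.linalg.solve(Xbar.T @ W @ Xbar, Xbar.T @ W @ f)

# Theorem 2.2(a): θ_i = (2/σ)·f̂_{i}, θ_0 = f̂_∅ - ((μ+1)/2)·Σ_i θ_i, and the
# residual equals the total squared Fourier weight on sets of size >= 2.
assert np.allclose(theta[1:], [2 / sigma * fhat[(i,)] for i in range(N)])
assert np.isclose(theta[0], fhat[()] - (mu + 1) / 2 * theta[1:].sum())
resid = np.sum(probs * (f - Xbar @ theta) ** 2)
assert np.isclose(resid, sum(v ** 2 for S, v in fhat.items() if len(S) >= 2))
```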
The proof of the full result is presented in Appendix A.1; here we present a proof sketch of result (a) to highlight the role of the Fourier basis and coefficients.
Proof sketch for Theorem 2.2(a). Since $\{\varphi_S\}_{S \subseteq [N]}$ is an orthonormal basis for the inner product space with $\langle f, g \rangle_{\mathcal{B}_p} = \mathbb{E}_{x \sim \mathcal{B}_p}[f(x) g(x)]$, we can rewrite $R(\theta)$ as follows:
$$R(\theta) = \mathbb{E}_{x \sim \mathcal{B}_p}\left[\left(f(x) - \theta_0 - \sum_i \theta_i \bar{x}_i\right)^2\right] = \mathbb{E}_{x \sim \mathcal{B}_p}\left[\left(f(x) - \bar{\theta}_0 - \sum_i \bar{\theta}_i \varphi_{\{i\}}(x)\right)^2\right] = \left\| f - \bar{\theta}_0 - \sum_i \bar{\theta}_i \varphi_{\{i\}} \right\|_{\mathcal{B}_p}^2,$$
where $\|\cdot\|_{\mathcal{B}_p} := \sqrt{\langle \cdot, \cdot \rangle_{\mathcal{B}_p}}$, $\bar{\theta}_i = (\sigma/2)\, \theta_i$, and $\bar{\theta}_0 = \theta_0 + \frac{(\mu+1)}{2} \sum_i \theta_i$; the second equality uses $\bar{x}_i = \frac{1 + x_i}{2} = \frac{\mu + 1 + \sigma \varphi_{\{i\}}(x)}{2}$. Due to the orthonormality of $\{\varphi_S\}_{S \subseteq [N]}$, the minimizer $\bar{\theta}^\star$ corresponds to the projection of $f$ onto the span of $\{\varphi_S\}_{|S| \leq 1}$, which gives $\bar{\theta}_i = \hat{f}_{\{i\}}$ and thus $\theta_i = \frac{2}{\sigma} \hat{f}_{\{i\}}$. Furthermore, the residual is the squared norm of the projection of $f$ onto the orthogonal subspace $\mathrm{span}\{\varphi_S\}_{|S| \geq 2}$, which is precisely $\sum_{|S| \geq 2} \hat{f}_S^2$.