Efficient Dataset Distillation
using Random Feature Approximation
Noel Loo, Ramin Hasani, Alexander Amini, Daniela Rus
Computer Science and Artificial Intelligence Lab (CSAIL)
Massachusetts Institute of Technology (MIT)
{loo, rhasani, amini, rus} @mit.edu
Abstract
Dataset distillation compresses large datasets into smaller synthetic coresets which
retain performance with the aim of reducing the storage and computational burden
of processing the entire dataset. Today's best-performing algorithm, Kernel Inducing Points (KIP), which makes use of the correspondence between infinite-width neural networks and kernel ridge regression, is prohibitively slow due to the exact computation of the neural tangent kernel matrix, which scales as $O(|S|^2)$, with $|S|$ being the coreset size. To improve this, we propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel, which reduces the kernel matrix computation to $O(|S|)$. Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU. Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets, both in kernel regression and finite-width network training. We demonstrate the effectiveness of our approach on tasks involving model interpretability and privacy preservation.¹
1 Introduction
[Figure 1: (left) bar charts comparing KIP and RFAD (ours) on CIFAR-10: time per iteration 460 s vs. 2.4 s (191.6 times faster) and accuracy 67% vs. 73.1% (+6.1% gain); (right) an example distilled CelebA dataset with one image per class (Male, Female); training on these 2 images gives 92% classification accuracy.]
Figure 1: RFAD provides over a 100-fold speedup over the state-of-the-art algorithm Kernel-Inducing Points (KIP) [Nguyen et al., 2021a], while exceeding its performance on CIFAR-10. (right) Example distilled synthetic sets, one image per class.
Coreset algorithms aim to summarize large datasets into significantly smaller datasets that still accurately represent the full dataset on downstream tasks [Jubran et al., 2019]. There are myriad applications of these smaller datasets, including speeding up model training [Mirzasoleiman et al., 2020], reducing catastrophic forgetting [Aljundi et al., 2019, Rebuffi et al., 2017, Borsos et al., 2020], and enhancing interpretability [Kim et al., 2016, Bien and Tibshirani, 2011]. While most coreset selection techniques aim to select representative data points from the dataset, recent work has looked at generating synthetic data points instead, a process known as dataset distillation [Wang et al., 2018, Bohdal et al., 2020, Sucholutsky and Schonlau, 2019, Zhao et al., 2021, Zhao and Bilen, 2021b, Nguyen et al., 2021b]. These synthetic datasets have the benefit of admitting continuous gradient-based optimization techniques rather than combinatorial methods and are not limited to the set of images and labels given by the dataset, providing added flexibility and performance.
¹ Code is available at https://github.com/yolky/RFAD
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
A large variety of applications would benefit from an efficient dataset distillation algorithm. For instance, kernel methods [Vinyals et al., 2016, Kaya and Bilge, 2019, Snell et al., 2017, Ghorbani et al., 2020, Refinetti et al., 2021] usually demand a large support set in order to achieve good prediction performance at inference time; an efficient dataset distillation pipeline can supply a compact one. Moreover, distilling a synthetic version of sensitive data helps preserve privacy: a support set can be provided to an end user for downstream applications without disclosing the original data. Lastly, for resource-hungry applications such as continual learning [Borsos et al., 2020], neural architecture search [Shleifer and Prokop, 2019], and automated machine learning [Hutter et al., 2019], generating a support set on which models can be fit efficiently is very helpful.
Recently, a dataset distillation method called Kernel-Inducing Points (KIP) [Nguyen et al., 2021a,b] showed strong performance on neural network classification tasks. KIP uses Neural Tangent Kernel (NTK) ridge regression to compute exactly the outputs of an infinite-width neural network trained on the support set. Although the method established the state of the art for dataset distillation in terms of accuracy, its computational cost is very high due to the exact calculation of the NTK. The algorithm therefore has limited applicability.
In this paper, we build on the prior work on KIP and develop a new algorithm for dataset distillation called RFAD, which achieves similar accuracy to KIP at a substantially lower computational cost. The key insight is a new kernel-inducing-point method that improves the complexity from $O(|S|^2)$ (where $|S|$ is the support-set size) to $O(|S|)$. To this end, we make three major contributions:
I. We develop RFAD, a fast, accurate, and scalable algorithm for dataset distillation in neural network classification tasks.
II. We improve the time performance of KIP [Nguyen et al., 2021a,b] by over two orders of magnitude while retaining or improving its accuracy. This speedup comes from leveraging a random-feature approximation of the Neural Network Gaussian Process (NNGP) kernel by instantiating random neural networks.
III. We show the effectiveness of RFAD in efficient dataset distillation tasks, enhancing model interpretability and privacy preservation.
2 Background and Related Work
Coresets and Dataset Distillation.
Coresets are subsets of data chosen so that models trained on them perform competitively with models trained directly on the full data. Standard coreset selection algorithms use importance sampling to find coresets [Har-Peled and Mazumdar, 2004, Lucic et al., 2017, Cohen et al., 2017]. Beyond random selection methods, new algorithms have been introduced inspired by catastrophic forgetting [Toneva et al., 2019] and mean matching (adding samples to the coreset to match the mean of the original dataset) [Chen et al., 2010, Rebuffi et al., 2017, Castro et al., 2018, Belouadah and Popescu, 2020, Schölkopf et al., 1999]. An overview of how coresets approximate point sets is provided in Phillips [2016].
More recently, and aligned with coreset selection methods, new algorithms have been developed to distill a synthetic dataset from a given dataset, such that fitting to this synthetic set provides performance on par with training on the original dataset [Wang et al., 2018]. To this end, these dataset condensation (or distillation) algorithms use gradient matching [Zhao et al., 2021, Maclaurin et al., 2015, Lorraine et al., 2020], differentiable siamese augmentation [Zhao and Bilen, 2021b], and distribution matching [Zhao and Bilen, 2021a]. Dataset distillation has also been applied to the labels rather than the images [Bohdal et al., 2020]. Recently, a novel algorithm called Kernel Inducing Points (KIP) [Nguyen et al., 2021a,b] was introduced; it performs very well at distilling synthetic sets by using neural tangent kernel ridge regression (KRR). KIP, like other algorithms, is computationally expensive; here, we propose a new method that significantly improves its complexity.
Infinite-Width Neural Networks. Single-layer infinite-width randomly initialized neural networks correspond to Gaussian processes [Neal, 1996], allowing closed-form exact training of Bayesian neural networks for regression. Recently, this has been extended to deep fully-connected networks [Lee et al., 2018, de G. Matthews et al., 2018], convolutional networks [Novak et al., 2019, Garriga-Alonso et al., 2019], attention-based networks [Hron et al., 2020], and even arbitrary neural architectures [Yang, 2019], with the corresponding GP kernel being the NNGP kernel. Likewise, for infinite-width neural networks trained with gradient descent, the training process simplifies dramatically, corresponding to kernel ridge regression when trained with MSE loss, with the corresponding kernel being the Neural Tangent Kernel (NTK) [Jacot et al., 2018, Arora et al., 2019, Loo et al., 2022]. These two kernels are closely related: the NNGP kernel forms the leading term of the NTK, representing the effect of the final-layer weights. Calculation of kernel entries typically scales as $O(HWD)$ for convolutional networks, with $H, W$ being the image height and width and $D$ the network depth, and as $O(H^2W^2D)$ for architectures with global average pooling [Arora et al., 2019]. This, combined with the need to compute and invert the $N \times N$ kernel matrix for kernel ridge regression, typically makes these methods intractable for large datasets [Snelson and Ghahramani, 2006, Titsias, 2009].
Random Feature Methods. Every kernel corresponds to a dot product for some feature map: $k(x, x') = \varphi(x)^T \varphi(x')$. Random feature methods aim to approximate the feature vector with a finite-dimensional random feature vector, the most notable example being Random Fourier Features [Rahimi and Recht, 2007]. Typically, this limits the rank of the kernel matrix, enabling faster matrix inversion and allowing kernel methods to scale to large datasets. Recently, these random feature methods have been used to speed up NTKs and NNGPs [Zandieh et al., 2021, Novak et al., 2022, 2019] at inference or for neural architecture search [Peng et al., 2020]. In this work, we focus on the NNGP approximation described in Novak et al. [2019], as it only requires network forward passes and is model agnostic, allowing flexible usage across different architectures without the more complex machinery needed to compute the approximations found in Zandieh et al. [2021] and Novak et al. [2022].
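As a concrete (if generic) illustration of the random-feature idea, the NumPy sketch below approximates a Gaussian (RBF) kernel with Random Fourier Features [Rahimi and Recht, 2007]; the feature dimension D and bandwidth sigma are illustrative choices, and this is not the NNGP approximation used later in the paper.

```python
import numpy as np

def rff_features(X, D=2048, sigma=1.0, seed=0):
    """Map rows of X to D random Fourier features whose dot products
    approximate the RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], D))  # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)                # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Feature dot products approximate the exact kernel matrix.
X = np.random.default_rng(1).normal(size=(5, 10))
Phi = rff_features(X)
K_approx = Phi @ Phi.T
K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)  # sigma = 1
print(np.abs(K_approx - K_exact).max())  # small approximation error
```

Because the feature matrix has rank at most D, downstream kernel-matrix operations scale with D rather than with the squared number of data points.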
3 Algorithm Setup and Design
In this section, we first provide a high-level background on the KIP algorithm. We then sequentially
outline our modifications leading to the RFAD algorithm.
3.1 Revisiting KIP
The Kernel-Inducing Point algorithm [Nguyen et al., 2021a,b], or KIP, is a dataset distillation technique that uses the NTK kernel ridge-regression correspondence to compute exactly the outputs of an infinite-width neural network trained on the support set, bypassing the need to ever compute gradients or back-propagate on any finite network. Let $X_T, y_T$ correspond to the images and one-hot vector labels of the training dataset and let $X_S, y_S$ be the corresponding images and labels for the support set, which we aim to optimize. The outputs of a trained neural network are $f(X_T) = K_{TS}(K_{SS} + \lambda I)^{-1} y_S$, with $K$ being the kernel matrices calculated using the NTK, having $T \times S$ and $S \times S$ entries for $K_{TS}$ and $K_{SS}$, respectively; $\lambda$ is a small regularization parameter. KIP then optimizes $\mathcal{L}_{MSE} = \lVert y_T - f(X_T) \rVert_2^2$ directly. The key bottleneck is the computation of these kernel matrices, requiring $O(TS \cdot HWD)$ time and memory, necessitating the use of hundreds of GPUs working in parallel. Additionally, the use of the MSE loss is suboptimal.
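For concreteness, the kernel ridge regression prediction that KIP differentiates through can be written in a few lines. The NumPy sketch below uses a placeholder linear kernel purely for illustration; KIP itself uses the NTK (and RFAD the approximate NNGP) to fill in $K_{TS}$ and $K_{SS}$.

```python
import numpy as np

def krr_predict(K_TS, K_SS, y_S, lam=1e-6):
    """Kernel ridge regression output f(X_T) = K_TS (K_SS + lam I)^{-1} y_S."""
    # Solve the linear system rather than forming the inverse explicitly.
    return K_TS @ np.linalg.solve(K_SS + lam * np.eye(K_SS.shape[0]), y_S)

# Toy usage with a placeholder linear kernel.
rng = np.random.default_rng(0)
X_T, X_S = rng.normal(size=(100, 32)), rng.normal(size=(10, 32))
y_S = rng.normal(size=(10, 3))              # e.g. labels for 3 classes
K_TS, K_SS = X_T @ X_S.T, X_S @ X_S.T
f_T = krr_predict(K_TS, K_SS, y_S)          # shape (100, 3)
```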
3.2 Replacing the NTK Kernel with an NNGP Kernel
We first replace the NTK used in the kernel regression of KIP with an NNGP kernel. While this change alone would yield a speedup, as the NNGP kernel is less computationally intensive to compute [Novak et al., 2020], we primarily do this because the NNGP kernel admits a simple random feature approximation, with advantages described later in this section. We first justify the appropriateness of this modification.
First, we note that in the computation of the NTK ($\Theta$), the NNGP kernel ($K$) forms the leading term, as shown in Table 1 in Appendix D of Novak et al. [2020], which outlines the NTK and NNGP kernel computation rules for various layers of a neural network. For fully connected (FC) layers, which are the typical final layer in neural network architectures, the remaining terms are suppressed by a matrix of expected derivatives with respect to activations, $\dot{K}$, as seen from the recursion for the NTK of an FC network [Novak et al., 2020]: $\Theta^l = K^l + \dot{K}^l \odot \Theta^{l-1}$. For ReLU activations, the entries of this derivative matrix are upper bounded by $1$, so the remaining terms must have a decaying contribution. We verify that our algorithms still provide good performance under the NTK and for finite networks trained with gradient descent, justifying this approximation.
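To make the recursion concrete, the sketch below evaluates the standard fully-connected ReLU NNGP/NTK recursion for a single pair of inputs, assuming He-style weight variance $\sigma_w^2 = 2$ and no biases (illustrative assumptions; the experiments use convolutional kernels). With $\sigma_w^2 = 2$, every entry of $\dot{K}$ equals $1 - \theta/\pi \le 1$, so contributions from earlier layers are progressively damped.

```python
import numpy as np

def fc_relu_nngp_ntk(x1, x2, depth=3, sw2=2.0):
    """NNGP kernel K and NTK Theta between x1 and x2 for a depth-layer
    fully-connected ReLU network, via Theta^l = K^l + Kdot^l * Theta^(l-1)."""
    d = len(x1)
    k11, k22, k12 = sw2 * x1 @ x1 / d, sw2 * x2 @ x2 / d, sw2 * x1 @ x2 / d
    theta = k12                                   # Theta^1 = K^1
    for _ in range(depth - 1):
        cos_t = np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0)
        t = np.arccos(cos_t)
        kdot = sw2 * (np.pi - t) / (2 * np.pi)    # entries <= 1 when sw2 = 2
        k12 = sw2 * np.sqrt(k11 * k22) * (np.sin(t) + (np.pi - t) * cos_t) / (2 * np.pi)
        k11, k22 = sw2 * k11 / 2, sw2 * k22 / 2   # diagonal entries (t = 0)
        theta = k12 + kdot * theta
    return k12, theta

x1, x2 = np.array([1.0, 0.0]), np.array([0.6, 0.8])   # unit-norm toy inputs
K, Theta = fc_relu_nngp_ntk(x1, x2)
print(K, Theta)   # K is the first term of Theta; older terms are damped by Kdot
```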
Algorithm 1 Dataset distillation with NNGP random features
Require: Training set and labels $X_T, y_T$; randomly initialized coreset and labels $X_S, y_S$; random network count $N$; random network output dimension $M$; batch size $|B|$; random network initialization distribution $p(\theta)$; regularization coefficient $\lambda$; learning rate $\eta$
while loss not converged do
    Sample a batch from the training set: $X_B, y_B \sim p(X_T, y_T)$
    Sample $N$ random networks, each with output dimension $M$, from $p(\theta)$: $\theta_1, \ldots, \theta_N \sim p(\theta)$
    Compute random features for the batch with the random nets: $\hat{\Phi}(X_B) \leftarrow \frac{1}{\sqrt{NM}}[f_{\theta_1}(X_B), \ldots, f_{\theta_N}(X_B)]^T \in \mathbb{R}^{NM \times |B|}$
    Compute random features for the support set with the random nets: $\hat{\Phi}(X_S) \leftarrow \frac{1}{\sqrt{NM}}[f_{\theta_1}(X_S), \ldots, f_{\theta_N}(X_S)]^T \in \mathbb{R}^{NM \times |S|}$
    Compute kernel matrices: $\hat{K}_{BS} \leftarrow \hat{\Phi}(X_B)^T \hat{\Phi}(X_S)$, $\hat{K}_{SS} \leftarrow \hat{\Phi}(X_S)^T \hat{\Phi}(X_S)$
    Calculate trained network output on the batch: $\hat{y}_B \leftarrow \hat{K}_{BS}(\hat{K}_{SS} + \lambda I_{|S|})^{-1} y_S$
    Calculate the loss: $L = \mathcal{L}(y_B, \hat{y}_B)$
    Update the coreset: $X_S \leftarrow X_S - \eta \frac{\partial L}{\partial X_S}$, $y_S \leftarrow y_S - \eta \frac{\partial L}{\partial y_S}$
end while
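To complement the pseudocode, here is a minimal PyTorch sketch of a single RFAD update. It mirrors Algorithm 1 but is not the released implementation; the small ConvNet, its default PyTorch initialization (a faithful version would sample weights from the NNGP prior $p(\theta)$), the feature counts $N = 4$, $M = 16$, and the fixed temperature are illustrative simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_random_net(channels=32, num_heads=16, image_size=32, in_ch=3):
    """A small random ConvNet f_theta with M = num_heads output heads."""
    return nn.Sequential(
        nn.Conv2d(in_ch, channels, 3, padding=1), nn.ReLU(), nn.AvgPool2d(2),
        nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(), nn.AvgPool2d(2),
        nn.Flatten(),
        nn.Linear(channels * (image_size // 4) ** 2, num_heads),
    )

def random_features(x, nets, num_heads):
    """Stack the outputs of N random nets into NM features, scaled by 1/sqrt(NM)."""
    feats = torch.cat([net(x) for net in nets], dim=1)          # (batch, N*M)
    return feats / (len(nets) * num_heads) ** 0.5

def rfad_step(X_S, y_S, X_B, y_B, opt, N=4, M=16, lam=1e-6, tau=1.0):
    nets = [make_random_net(num_heads=M) for _ in range(N)]     # fresh random nets each step
    for net in nets:
        net.requires_grad_(False)                               # only the coreset is optimized
    phi_B = random_features(X_B, nets, M)                       # (|B|, NM)
    phi_S = random_features(X_S, nets, M)                       # (|S|, NM)
    K_BS, K_SS = phi_B @ phi_S.T, phi_S @ phi_S.T
    y_hat = K_BS @ torch.linalg.solve(K_SS + lam * torch.eye(K_SS.shape[0]), y_S)
    loss = F.cross_entropy(y_hat / tau, y_B)                    # Platt-scaled loss (Section 3.4); tau fixed here, learned in the paper
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage: distill a random "dataset" into 10 synthetic images.
torch.manual_seed(0)
X_S = torch.randn(10, 3, 32, 32, requires_grad=True)            # synthetic images
y_S = (torch.eye(10) * 1.1 - 0.1).requires_grad_(True)          # learnable support labels
opt = torch.optim.Adam([X_S, y_S], lr=1e-2)
X_B, y_B = torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))
print(rfad_step(X_S, y_S, X_B, y_B, opt))
```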
3.3 Replacing NNGP with an Empirical NNGP
When we sample from a Gaussian process $f \sim \mathcal{GP}(0, K)$, this suggests a natural finite feature map corresponding to scaled draws from the GP: $\hat{\varphi}(x) = \frac{1}{\sqrt{N}}[f_1(x), \ldots, f_N(x)]^T$. For most GPs, this insight is not relevant, as sampling from a GP typically requires a Cholesky decomposition of the kernel matrix, requiring its computation in the first place [Rasmussen and Williams, 2006]. However, for the NNGP we can generate approximate samples of $f$ by instantiating random neural networks, $f_i(x) = f_{\theta_i}(x)$, $\theta_i \sim p(\theta)$, for some initialization distribution $p(\theta)$. Moreover, for a given neural network, we can define $f_i$ to be a vector of dimension $M$ by using a network with multiple output heads, meaning that with $N$ networks we have $NM$ features. For our purposes, we typically use $N = 8$ and $M = 4096$, giving 32768 total features. For the convolutional architectures we consider, this corresponds to $C = 256$ convolutional channels per layer. Even with this relatively large number of features, we still see a significant computational speedup over the exact calculation.
To sample $f \sim \mathcal{GP}(0, K)$ exactly, we would have to instantiate random infinite-width neural networks, whereas in practice we can only sample finite ones. This discrepancy incurs an $O(1/C)$ bias in our kernel matrix entries, with $C$ being the width-relevant parameter (i.e., the number of convolutional channels) [Yaida, 2020]. However, the mean of the random features has $O(1/(NC))$ variance [Daniely et al., 2016], meaning that in practice the variance, rather than the bias, dominates the error. This has been noted empirically in Novak et al. [2019], and we verify that the finite-width bias does not significantly affect performance in Appendix I, showing that we can achieve reasonable performance with as little as one convolutional channel.
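This behaviour can be checked directly on a one-hidden-layer ReLU network, whose NNGP kernel has a closed form (the arc-cosine kernel). The NumPy sketch below, assuming Gaussian He-style weights, compares the exact kernel entry with the empirical average over $N$ random width-$C$ networks; increasing $N$ shrinks the variance (in this single-hidden-layer case the estimator is in fact unbiased, so the $O(1/C)$ bias only appears with depth).

```python
import numpy as np

def relu_nngp_exact(x1, x2, sw2=2.0):
    """Closed-form NNGP entry for a one-hidden-layer ReLU network (arc-cosine kernel)."""
    d = len(x1)
    s11, s22, s12 = sw2 * x1 @ x1 / d, sw2 * x2 @ x2 / d, sw2 * x1 @ x2 / d
    t = np.arccos(np.clip(s12 / np.sqrt(s11 * s22), -1.0, 1.0))
    return np.sqrt(s11 * s22) * (np.sin(t) + (np.pi - t) * np.cos(t)) / (2 * np.pi)

def relu_nngp_empirical(x1, x2, C=64, N=8, sw2=2.0, seed=0):
    """Average f(x1) f(x2) over N random width-C nets f(x) = v @ relu(W x) / sqrt(C)."""
    rng = np.random.default_rng(seed)
    d = len(x1)
    total = 0.0
    for _ in range(N):
        W = rng.normal(scale=np.sqrt(sw2 / d), size=(C, d))   # hidden weights
        v = rng.normal(size=C)                                # readout weights
        f1 = v @ np.maximum(W @ x1, 0.0) / np.sqrt(C)
        f2 = v @ np.maximum(W @ x2, 0.0) / np.sqrt(C)
        total += f1 * f2
    return total / N

x1, x2 = np.array([1.0, 0.0, 0.0]), np.array([0.5, 0.5, np.sqrt(0.5)])
print(relu_nngp_exact(x1, x2))                   # exact kernel entry
print(relu_nngp_empirical(x1, x2, N=8))          # noisy estimate from 8 random nets
print(relu_nngp_empirical(x1, x2, N=512))        # more networks -> lower variance
```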
The time cost of computing these random features is linear in the training set and coreset sizes, $|T|$ and $|S|$. With the relatively low cost of matrix multiplication, this means that constructing the kernel matrices $K_{TS}$ and $K_{SS}$ has $O(|T| + |S|)$ and $O(|S|)$ time complexity, respectively, as opposed to $O(|T||S|)$ and $O(|S|^2)$ with KIP. Noting that the cost of matrix inversion is small compared to random feature construction, our total runtime is reduced to linear in the coreset size. We empirically verify this linear time complexity in Section 4.1 and provide a more detailed discussion in Appendix C.
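As a rough back-of-the-envelope comparison (sizes are illustrative, and the per-unit costs of a kernel-entry evaluation and a network forward pass differ), the scaling difference is easy to see by counting operations:

```python
# Illustrative operation counts for building K_TS and K_SS.
T, S = 50_000, 500      # training-set and coreset sizes
N = 8                   # random networks per step

exact_entries = T * S + S * S   # pairwise NTK/NNGP entries, O(|T||S| + |S|^2)
rfa_passes = N * (T + S)        # network forward passes, O(|T| + |S|)
print(exact_entries, rfa_passes)  # 25,250,000 vs. 404,000
```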
3.4 Loss Function for Dataset Distillation
We noted earlier that $\mathcal{L}_{MSE}$ is not well suited to the dataset distillation setting. In particular, there are two key problems:
Over-influence of already correctly classified data points. Consider two-way classification, with the label $1$ corresponding to the positive class and $-1$ corresponding to the negative class. Let $x_1$ and $x_2$ be items in the training set whose labels are both $1$, and let $f_{KRR}(x) = K_{x,S}(K_{SS} + \lambda I)^{-1} y_S$ be the KRR output on $x$ given our support set $X_S$. If $f_{KRR}(x_1) = 5$ and $f_{KRR}(x_2) = -1$, then the resulting MSE errors on $x_1$ and $x_2$ are $16$ and $4$, respectively. Notably, $x_1$ incurs a larger loss and results in a larger gradient on $X_S$ than $x_2$, despite $x_1$ being correctly classified and $x_2$ being incorrectly classified. In the heavily constrained dataset distillation setting, fitting both data points simultaneously is not possible, leading to underfitting of the data in terms of classification in order to better fit already correctly labeled data points in terms of regression.
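A two-line check of the example above makes the imbalance explicit, using the per-example MSE gradient $\frac{d}{df}(f - y)^2 = 2(f - y)$:

```python
y, f_x1, f_x2 = 1.0, 5.0, -1.0
print((f_x1 - y) ** 2, 2 * (f_x1 - y))  # loss 16.0, gradient magnitude 8.0 (correctly classified)
print((f_x2 - y) ** 2, 2 * (f_x2 - y))  # loss  4.0, gradient magnitude 4.0 (misclassified)
```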
Table 1: Kernel distillation results on five datasets with varying support set sizes. Bolded numbers indicate the best performance with fixed labels, and underlined numbers indicate the best performance with learned labels. Note that DC and DSA use fixed labels. (n = 4)

                                      Fixed Labels                            Learned Labels
Dataset         Img/Cls   DC          DSA         KIP         RFAD (ours)     KIP         RFAD (ours)
MNIST           1         91.7±0.5    88.7±0.6    95.2±0.2    96.7±0.2        97.3±0.1    97.2±0.2
                10        97.4±0.2    97.8±0.1    98.4±0.0    99.0±0.1        99.1±0.1    99.1±0.0
                50        98.8±0.1    99.2±0.1    99.1±0.0    99.1±0.0        99.4±0.1    99.1±0.0
Fashion-MNIST   1         70.5±0.6    70.6±0.6    78.9±0.2    81.6±0.6        82.9±0.2    84.6±0.2
                10        82.3±0.4    84.6±0.3    87.6±0.1    90.0±0.1        91.0±0.1    90.3±0.2
                50        83.6±0.4    88.7±0.2    90.0±0.1    91.3±0.1        92.4±0.1    91.4±0.1
SVHN            1         31.2±1.4    27.5±1.4    48.1±0.7    51.4±1.3        64.3±0.4    57.4±0.8
                10        76.1±0.6    79.2±0.5    75.8±0.1    77.2±0.3        81.1±0.5    78.2±0.5
                50        82.3±0.3    84.4±0.4    81.3±0.2    81.8±0.2        84.3±0.1    82.4±0.1
CIFAR-10        1         28.3±0.5    28.8±0.7    59.1±0.4    61.1±0.7        64.7±0.2    61.4±0.8
                10        44.9±0.5    52.1±0.5    67.0±0.4    73.1±0.1        75.6±0.2    73.7±0.2
                50        53.9±0.5    60.6±0.5    71.7±0.2    76.1±0.3        80.6±0.1    76.6±0.3
CIFAR-100       1         12.8±0.3    13.9±0.3    31.8±0.3    36.0±0.4        34.9±0.1    44.1±0.1
                10        25.2±0.3    32.3±0.3    46.0±0.2    44.9±0.2        49.5±0.3    46.8±0.2
Unclear probabilistic interpretation of MSE for classification. This prevents the regression outputs from being used directly in calibration-sensitive environments, necessitating the use of transformation functions in tasks such as GP classification [Williams and Barber, 1998, Milios et al., 2018].
To address these two issues, we adopt Platt scaling [Platt, 2000], applying a cross-entropy loss to the labels instead of an MSE one: $\mathcal{L}_{platt} = \text{x-entropy}(y_T, f(X_T)/\tau)$, where $\tau$ is a positive learned temperature scaling parameter. Unlike typical Platt scaling, we learn $\tau$ jointly with our support set instead of tuning it post hoc on a separate validation set. $f(X_T)$ is still calculated using the same KRR formula. Accordingly, this corresponds to training a network with MSE loss but, at inference time, scaling the outputs by $\tau^{-1}$ and applying a softmax to obtain a categorical distribution. Unlike typical GP classification, we ignore the variance of our predictions and take only the mean.
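A minimal PyTorch sketch of this loss is below; parameterizing the learned temperature through a softplus to keep $\tau$ positive is an illustrative choice, not something prescribed by the method.

```python
import torch
import torch.nn.functional as F

raw_tau = torch.zeros(1, requires_grad=True)     # learned jointly with X_S, y_S

def platt_loss(f_XT, y_T, raw_tau):
    """Cross-entropy on temperature-scaled KRR outputs: x-entropy(y_T, f(X_T) / tau)."""
    tau = F.softplus(raw_tau) + 1e-8             # keep the temperature positive
    return F.cross_entropy(f_XT / tau, y_T)

f_XT = torch.randn(64, 10)                       # stand-in for K_TS (K_SS + lam I)^{-1} y_S
y_T = torch.randint(0, 10, (64,))                # integer class labels
loss = platt_loss(f_XT, y_T, raw_tau)
loss.backward()                                  # gradients flow into raw_tau (and, in RFAD, into X_S and y_S)
```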
The combination of these three changes, namely using the NNGP kernel instead of the NTK, applying a random-feature approximation of the NNGP, and Platt scaling, results in our RFAD algorithm, which is given in Algorithm 1.
4 Experiments with RFAD
Here, we perform experiments to evaluate the performance of RFAD in dataset distillation tasks.
Benchmarks.
We applied our algorithm to five datasets: MNIST, Fashion-MNIST, SVHN, CIFAR-10, and CIFAR-100 [LeCun et al., 2010, Xiao et al., 2017, Netzer et al., 2011, Krizhevsky et al., 2009], distilling each dataset to coresets with 1, 10, or 50 images per class.
Network Structure and Training Setup. Similar to previous work on dataset distillation, we used standard ConvNet architectures with three convolutional layers, average pooling, and ReLU activations [Zhao et al., 2021, 2020, Nguyen et al., 2021b]. Similar to KIP [Nguyen et al., 2021b], we do not use instancenorm layers because of the lack of an infinite-width analog. During training, we used $N = 8$ random models, each with $C = 256$ convolutional channels per layer, and at test time we evaluated the datasets using the exact NNGP kernel computed with the neural-tangents library [Novak et al., 2020]. We consider both the fixed and learned label configurations, with Platt scaling applied and no data augmentation. We used the regularized Zero Component Analysis (ZCA) preprocessing