Efficient Dataset Distillation
using Random Feature Approximation
Noel Loo, Ramin Hasani, Alexander Amini, Daniela Rus
Computer Science and Artificial Intelligence Lab (CSAIL)
Massachusetts Institute of Technology (MIT)
{loo, rhasani, amini, rus} @mit.edu
Abstract
Dataset distillation compresses large datasets into smaller synthetic coresets which
retain performance with the aim of reducing the storage and computational burden
of processing the entire dataset. Today's best-performing algorithm, Kernel Inducing Points (KIP), which makes use of the correspondence between infinite-width neural networks and kernel ridge regression, is prohibitively slow due to the exact computation of the neural tangent kernel matrix, which scales as $O(|S|^2)$, with $|S|$ being the coreset size. To improve this, we propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel, which reduces the kernel matrix computation to $O(|S|)$. Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU. Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets, both in kernel regression and finite-width network training. We demonstrate the effectiveness of our approach on tasks involving model interpretability and privacy preservation.¹
1 Introduction
[Figure 1: (left) bar charts comparing KIP and RFAD (ours) on CIFAR-10: time per iteration 460 s vs. 2.4 s (191.6 times faster) and accuracy 67% vs. 73.1% (+6.1% gain); (right) an example distilled CelebA dataset with one image per class (Male, Female); training on these 2 images gives 92% classification accuracy.]
Figure 1: RFAD provides over a 100-fold speedup over the state-of-the-art algorithm Kernel-Inducing Points (KIP) [Nguyen et al., 2021a], while exceeding its performance on CIFAR-10. (right) Example distilled synthetic sets, one image per class.
Coreset algorithms aim to summarize large datasets into significantly smaller datasets that still accurately represent the full dataset on downstream tasks [Jubran et al., 2019]. There are myriad applications of these smaller datasets, including speeding up model training [Mirzasoleiman et al., 2020], reducing catastrophic forgetting [Aljundi et al., 2019, Rebuffi et al., 2017, Borsos et al., 2020], and enhancing interpretability [Kim et al., 2016, Bien and Tibshirani, 2011]. While most coreset selection techniques aim to select representative data points from the dataset, recent work has looked at generating synthetic data points instead, a process known as dataset distillation [Wang et al., 2018, Bohdal et al., 2020, Sucholutsky and Schonlau, 2019, Zhao et al., 2021, Zhao and Bilen, 2021b, Nguyen et al., 2021b]. These synthetic datasets have the benefit of admitting continuous gradient-based optimization techniques rather than combinatorial methods and are not limited to the set of images and labels given by the dataset, providing added flexibility and performance.
¹ Code is available at https://github.com/yolky/RFAD
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
A large variety of applications would benefit from an efficient dataset distillation algorithm. For instance, kernel methods [Vinyals et al., 2016, Kaya and Bilge, 2019, Snell et al., 2017, Ghorbani et al., 2020, Refinetti et al., 2021] usually demand a large support set in order to achieve good prediction performance at inference time; an efficient dataset distillation pipeline can supply a compact one. Moreover, distilling a synthetic version of sensitive data helps preserve privacy: a support set can be provided to an end user for downstream applications without disclosing the original data. Lastly, for resource-hungry applications such as continual learning [Borsos et al., 2020], neural architecture search [Shleifer and Prokop, 2019], and automated machine learning [Hutter et al., 2019], generating a support set on which models can be fit efficiently is very helpful.
Recently, a dataset distillation method called Kernel-Inducing Points (KIP) [Nguyen et al., 2021a,b] showed strong performance on neural network classification tasks. KIP uses Neural Tangent Kernel (NTK) ridge regression to compute exactly the outputs of an infinite-width neural network trained on the support set. Although the method established the state of the art for dataset distillation in terms of accuracy, its computational cost is very high due to the exact calculation of the NTK. The algorithm therefore has limited applicability.
In this paper, we build on the prior work on KIP and develop a new algorithm for dataset distillation called RFAD, which achieves similar accuracy to KIP at a substantially lower computational cost. The key insight is a new kernel-inducing-point method that improves the complexity from $O(|S|^2)$ (where $|S|$ is the support-set size) to $O(|S|)$. To this end, we make three major contributions:
I. We develop RFAD, a fast, accurate, and scalable algorithm for dataset distillation in neural network classification tasks.
II. We improve the time performance of KIP [Nguyen et al., 2021a,b] by over two orders of magnitude while retaining or improving its accuracy. This speedup comes from leveraging a random-feature approximation of the Neural Network Gaussian Process (NNGP) kernel by instantiating random neural networks.
III. We show the effectiveness of RFAD in efficient dataset distillation tasks, enhancing model interpretability and privacy preservation.
2 Background and Related Work
Coresets and Dataset Distillation.
Coresets are subsets of data chosen so that models trained on them perform competitively with models trained directly on the full data. Standard coreset selection algorithms use importance sampling to find coresets [Har-Peled and Mazumdar, 2004, Lucic et al., 2017, Cohen et al., 2017]. Beyond random selection methods, new algorithms have been introduced inspired by catastrophic forgetting [Toneva et al., 2019] and mean matching (adding samples to the coreset to match the mean of the original dataset) [Chen et al., 2010, Rebuffi et al., 2017, Castro et al., 2018, Belouadah and Popescu, 2020, Schölkopf et al., 1999]. An overview of how coresets approximate point sets is provided in Phillips [2016].
More recently, and aligned with coreset selection methods, new algorithms have been developed to distill a synthetic dataset from a given dataset, such that fitting to this synthetic set provides performance on par with training on the original dataset [Wang et al., 2018]. To this end, these dataset condensation (or distillation) algorithms use gradient matching [Zhao et al., 2021, Maclaurin et al., 2015, Lorraine et al., 2020], differentiable siamese augmentation [Zhao and Bilen, 2021b], and distribution matching [Zhao and Bilen, 2021a]. Dataset distillation has also been applied to the labels rather than the images [Bohdal et al., 2020]. Recently, a novel algorithm called Kernel Inducing Points (KIP) [Nguyen et al., 2021a,b] was introduced; it performs very well at distilling synthetic sets by using neural tangent kernel ridge regression (KRR). KIP, like other algorithms, is computationally expensive; here, we propose a new method that significantly improves its complexity.
Infinite-Width Neural Networks. Single-layer infinite-width randomly initialized neural networks correspond to Gaussian processes [Neal, 1996], allowing closed-form exact training of Bayesian neural networks for regression. Recently, this has been extended to deep fully-connected networks [Lee et al., 2018, de G. Matthews et al., 2018], convolutional networks [Novak et al., 2019, Garriga-Alonso et al., 2019], attention-based networks [Hron et al., 2020], and even arbitrary neural architectures [Yang, 2019], with the corresponding GP kernel being the NNGP kernel. Likewise, for infinite-width neural networks trained with gradient descent, the training process simplifies dramatically, corresponding to kernel ridge regression when trained with MSE loss, with the corresponding kernel being the Neural Tangent Kernel (NTK) [Jacot et al., 2018, Arora et al., 2019, Loo et al., 2022]. These two kernels are closely related: the NNGP kernel forms the leading term of the NTK, representing the effect of the final-layer weights. Calculation of kernel entries typically scales as $O(HWD)$ for convolutional networks, with $H, W$ being the image height and width and $D$ the network depth, and as $O(H^2W^2D)$ for architectures with global average pooling [Arora et al., 2019]. This, combined with the need to compute and invert the $N \times N$ kernel matrix for kernel ridge regression, typically makes these methods intractable for large datasets [Snelson and Ghahramani, 2006, Titsias, 2009].
Random Feature Methods. Every kernel corresponds to a dot product for some feature map: $k(x, x') = \varphi(x)^T \varphi(x')$. Random feature methods aim to approximate the feature vector with a finite-dimensional random feature vector, the most notable example being Random Fourier Features [Rahimi and Recht, 2007]. Typically, this limits the rank of the kernel matrix, enabling faster matrix inversion and allowing kernel methods to scale to large datasets. Recently, these random feature methods have been used to speed up NTKs and NNGPs [Zandieh et al., 2021, Novak et al., 2022, 2019] at inference or for neural architecture search [Peng et al., 2020]. In this work, we focus on the NNGP approximation described in Novak et al. [2019], as it only requires network forward passes and is model agnostic, allowing flexible usage across different architectures without the more complex machinery needed to compute the approximations found in Zandieh et al. [2021] and Novak et al. [2022].
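As a concrete (if generic) illustration of the random-feature idea, the NumPy sketch below approximates a Gaussian (RBF) kernel with Random Fourier Features [Rahimi and Recht, 2007]; the feature dimension D and bandwidth sigma are illustrative choices, and this is not the NNGP approximation used later in the paper.

```python
import numpy as np

def rff_features(X, D=2048, sigma=1.0, seed=0):
    """Map rows of X to D random Fourier features whose dot products
    approximate the RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], D))  # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)                # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Feature dot products approximate the exact kernel matrix.
X = np.random.default_rng(1).normal(size=(5, 10))
Phi = rff_features(X)
K_approx = Phi @ Phi.T
K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)  # sigma = 1
print(np.abs(K_approx - K_exact).max())  # small approximation error
```

Because the feature matrix has rank at most D, downstream kernel-matrix operations scale with D rather than with the squared number of data points.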
3 Algorithm Setup and Design
In this section, we first provide a high-level background on the KIP algorithm. We then sequentially
outline our modifications leading to the RFAD algorithm.
3.1 Revisiting KIP
The Kernel-Inducing Point algorithm [Nguyen et al., 2021a,b], or KIP, is a dataset distillation technique that uses the NTK kernel ridge-regression correspondence to compute exactly the outputs of an infinite-width neural network trained on the support set, bypassing the need to ever compute gradients or back-propagate on any finite network. Let $X_T, y_T$ correspond to the images and one-hot vector labels of the training dataset and let $X_S, y_S$ be the corresponding images and labels for the support set, which we aim to optimize. The outputs of a trained neural network are $f(X_T) = K_{TS}(K_{SS} + \lambda I)^{-1} y_S$, with $K$ being the kernel matrices calculated using the NTK, having $T \times S$ and $S \times S$ entries for $K_{TS}$ and $K_{SS}$, respectively; $\lambda$ is a small regularization parameter. KIP then optimizes $\mathcal{L}_{MSE} = \lVert y_T - f(X_T) \rVert_2^2$ directly. The key bottleneck is the computation of these kernel matrices, requiring $O(TS \cdot HWD)$ time and memory, necessitating the use of hundreds of GPUs working in parallel. Additionally, the use of the MSE loss is suboptimal.
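For concreteness, the kernel ridge regression prediction that KIP differentiates through can be written in a few lines. The NumPy sketch below uses a placeholder linear kernel purely for illustration; KIP itself uses the NTK (and RFAD the approximate NNGP) to fill in $K_{TS}$ and $K_{SS}$.

```python
import numpy as np

def krr_predict(K_TS, K_SS, y_S, lam=1e-6):
    """Kernel ridge regression output f(X_T) = K_TS (K_SS + lam I)^{-1} y_S."""
    # Solve the linear system rather than forming the inverse explicitly.
    return K_TS @ np.linalg.solve(K_SS + lam * np.eye(K_SS.shape[0]), y_S)

# Toy usage with a placeholder linear kernel.
rng = np.random.default_rng(0)
X_T, X_S = rng.normal(size=(100, 32)), rng.normal(size=(10, 32))
y_S = rng.normal(size=(10, 3))              # e.g. labels for 3 classes
K_TS, K_SS = X_T @ X_S.T, X_S @ X_S.T
f_T = krr_predict(K_TS, K_SS, y_S)          # shape (100, 3)
```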
3.2 Replacing the NTK Kernel with an NNGP Kernel
We first replace the NTK used in the kernel regression of KIP with an NNGP kernel. While this change alone would yield a speedup, as the NNGP kernel is less computationally intensive to compute [Novak et al., 2020], we primarily do this because the NNGP kernel admits a simple random feature approximation, with advantages described later in this section. We first justify the appropriateness of this modification.
First, we note that in the computation of the NTK ($\Theta$), the NNGP kernel ($K$) forms the leading term, as shown in Table 1 in Appendix D of Novak et al. [2020], which outlines the NTK and NNGP kernel computation rules for various layers of a neural network. For fully connected (FC) layers, which are the typical final layer in neural network architectures, the remaining terms are suppressed by a matrix of expected derivatives with respect to activations, $\dot{K}$, as seen from the recursion for the NTK of an FC network [Novak et al., 2020]: $\Theta^l = K^l + \dot{K}^l \odot \Theta^{l-1}$. For ReLU activations, the entries of this derivative matrix are upper bounded by $1$, so the remaining terms must have a decaying contribution. We verify that our algorithms still provide good performance under the NTK and for finite networks trained with gradient descent, justifying this approximation.
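To make the recursion concrete, the sketch below evaluates the standard fully-connected ReLU NNGP/NTK recursion for a single pair of inputs, assuming He-style weight variance $\sigma_w^2 = 2$ and no biases (illustrative assumptions; the experiments use convolutional kernels). With $\sigma_w^2 = 2$, every entry of $\dot{K}$ equals $1 - \theta/\pi \le 1$, so contributions from earlier layers are progressively damped.

```python
import numpy as np

def fc_relu_nngp_ntk(x1, x2, depth=3, sw2=2.0):
    """NNGP kernel K and NTK Theta between x1 and x2 for a depth-layer
    fully-connected ReLU network, via Theta^l = K^l + Kdot^l * Theta^(l-1)."""
    d = len(x1)
    k11, k22, k12 = sw2 * x1 @ x1 / d, sw2 * x2 @ x2 / d, sw2 * x1 @ x2 / d
    theta = k12                                   # Theta^1 = K^1
    for _ in range(depth - 1):
        cos_t = np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0)
        t = np.arccos(cos_t)
        kdot = sw2 * (np.pi - t) / (2 * np.pi)    # entries <= 1 when sw2 = 2
        k12 = sw2 * np.sqrt(k11 * k22) * (np.sin(t) + (np.pi - t) * cos_t) / (2 * np.pi)
        k11, k22 = sw2 * k11 / 2, sw2 * k22 / 2   # diagonal entries (t = 0)
        theta = k12 + kdot * theta
    return k12, theta

x1, x2 = np.array([1.0, 0.0]), np.array([0.6, 0.8])   # unit-norm toy inputs
K, Theta = fc_relu_nngp_ntk(x1, x2)
print(K, Theta)   # K is the first term of Theta; older terms are damped by Kdot
```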
Algorithm 1 Dataset distillation with NNGP random features
Require: Training set and labels $X_T, y_T$; randomly initialized coreset and labels $X_S, y_S$; random network count $N$; random network output dimension $M$; batch size $|B|$; random network initialization distribution $p(\theta)$; regularization coefficient $\lambda$; learning rate $\eta$
while loss not converged do
    Sample a batch from the training set: $X_B, y_B \sim p(X_T, y_T)$
    Sample $N$ random networks, each with output dimension $M$, from $p(\theta)$: $\theta_1, \ldots, \theta_N \sim p(\theta)$
    Compute random features for the batch with the random nets: $\hat{\Phi}(X_B) \leftarrow \frac{1}{\sqrt{NM}}[f_{\theta_1}(X_B), \ldots, f_{\theta_N}(X_B)]^T \in \mathbb{R}^{NM \times |B|}$
    Compute random features for the support set with the random nets: $\hat{\Phi}(X_S) \leftarrow \frac{1}{\sqrt{NM}}[f_{\theta_1}(X_S), \ldots, f_{\theta_N}(X_S)]^T \in \mathbb{R}^{NM \times |S|}$
    Compute kernel matrices: $\hat{K}_{BS} \leftarrow \hat{\Phi}(X_B)^T \hat{\Phi}(X_S)$, $\hat{K}_{SS} \leftarrow \hat{\Phi}(X_S)^T \hat{\Phi}(X_S)$
    Calculate trained network output on the batch: $\hat{y}_B \leftarrow \hat{K}_{BS}(\hat{K}_{SS} + \lambda I_{|S|})^{-1} y_S$
    Calculate the loss: $L = \mathcal{L}(y_B, \hat{y}_B)$
    Update the coreset: $X_S \leftarrow X_S - \eta \frac{\partial L}{\partial X_S}$, $y_S \leftarrow y_S - \eta \frac{\partial L}{\partial y_S}$
end while
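To complement the pseudocode, here is a minimal PyTorch sketch of a single RFAD update. It mirrors Algorithm 1 but is not the released implementation; the small ConvNet, its default PyTorch initialization (a faithful version would sample weights from the NNGP prior $p(\theta)$), the feature counts $N = 4$, $M = 16$, and the fixed temperature are illustrative simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_random_net(channels=32, num_heads=16, image_size=32, in_ch=3):
    """A small random ConvNet f_theta with M = num_heads output heads."""
    return nn.Sequential(
        nn.Conv2d(in_ch, channels, 3, padding=1), nn.ReLU(), nn.AvgPool2d(2),
        nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(), nn.AvgPool2d(2),
        nn.Flatten(),
        nn.Linear(channels * (image_size // 4) ** 2, num_heads),
    )

def random_features(x, nets, num_heads):
    """Stack the outputs of N random nets into NM features, scaled by 1/sqrt(NM)."""
    feats = torch.cat([net(x) for net in nets], dim=1)          # (batch, N*M)
    return feats / (len(nets) * num_heads) ** 0.5

def rfad_step(X_S, y_S, X_B, y_B, opt, N=4, M=16, lam=1e-6, tau=1.0):
    nets = [make_random_net(num_heads=M) for _ in range(N)]     # fresh random nets each step
    for net in nets:
        net.requires_grad_(False)                               # only the coreset is optimized
    phi_B = random_features(X_B, nets, M)                       # (|B|, NM)
    phi_S = random_features(X_S, nets, M)                       # (|S|, NM)
    K_BS, K_SS = phi_B @ phi_S.T, phi_S @ phi_S.T
    y_hat = K_BS @ torch.linalg.solve(K_SS + lam * torch.eye(K_SS.shape[0]), y_S)
    loss = F.cross_entropy(y_hat / tau, y_B)                    # Platt-scaled loss (Section 3.4); tau fixed here, learned in the paper
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage: distill a random "dataset" into 10 synthetic images.
torch.manual_seed(0)
X_S = torch.randn(10, 3, 32, 32, requires_grad=True)            # synthetic images
y_S = (torch.eye(10) * 1.1 - 0.1).requires_grad_(True)          # learnable support labels
opt = torch.optim.Adam([X_S, y_S], lr=1e-2)
X_B, y_B = torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))
print(rfad_step(X_S, y_S, X_B, y_B, opt))
```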
3.3 Replacing NNGP with an Empirical NNGP
When we sample from a Gaussian process $f \sim \mathcal{GP}(0, K)$, this suggests a natural finite feature map corresponding to scaled draws from the GP: $\hat{\varphi}(x) = \frac{1}{\sqrt{N}}[f_1(x), \ldots, f_N(x)]^T$. For most GPs, this insight is not relevant, as sampling from a GP typically requires a Cholesky decomposition of the kernel matrix, requiring its computation in the first place [Rasmussen and Williams, 2006]. However, for the NNGP we can generate approximate samples of $f$ by instantiating random neural networks, $f_i(x) = f_{\theta_i}(x)$, $\theta_i \sim p(\theta)$, for some initialization distribution $p(\theta)$. Moreover, for a given neural network, we can define $f_i$ to be a vector of dimension $M$ by using a network with multiple output heads, meaning that with $N$ networks we have $NM$ features. For our purposes, we typically use $N = 8$ and $M = 4096$, giving 32768 total features. For the convolutional architectures we consider, this corresponds to $C = 256$ convolutional channels per layer. Even with this relatively large number of features, we still see a significant computational speedup over the exact calculation.
To sample $f \sim \mathcal{GP}(0, K)$ exactly, we would have to instantiate random infinite-width neural networks, whereas in practice we can only sample finite ones. This discrepancy incurs an $O(1/C)$ bias in our kernel matrix entries, with $C$ being the width-relevant parameter (i.e., the number of convolutional channels) [Yaida, 2020]. However, the mean of the random features has $O(1/(NC))$ variance [Daniely et al., 2016], meaning that in practice the variance, rather than the bias, dominates the error. This has been noted empirically in Novak et al. [2019], and we verify that the finite-width bias does not significantly affect performance in Appendix I, showing that we can achieve reasonable performance with as little as one convolutional channel.
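This behaviour can be checked directly on a one-hidden-layer ReLU network, whose NNGP kernel has a closed form (the arc-cosine kernel). The NumPy sketch below, assuming Gaussian He-style weights, compares the exact kernel entry with the empirical average over $N$ random width-$C$ networks; increasing $N$ shrinks the variance (in this single-hidden-layer case the estimator is in fact unbiased, so the $O(1/C)$ bias only appears with depth).

```python
import numpy as np

def relu_nngp_exact(x1, x2, sw2=2.0):
    """Closed-form NNGP entry for a one-hidden-layer ReLU network (arc-cosine kernel)."""
    d = len(x1)
    s11, s22, s12 = sw2 * x1 @ x1 / d, sw2 * x2 @ x2 / d, sw2 * x1 @ x2 / d
    t = np.arccos(np.clip(s12 / np.sqrt(s11 * s22), -1.0, 1.0))
    return np.sqrt(s11 * s22) * (np.sin(t) + (np.pi - t) * np.cos(t)) / (2 * np.pi)

def relu_nngp_empirical(x1, x2, C=64, N=8, sw2=2.0, seed=0):
    """Average f(x1) f(x2) over N random width-C nets f(x) = v @ relu(W x) / sqrt(C)."""
    rng = np.random.default_rng(seed)
    d = len(x1)
    total = 0.0
    for _ in range(N):
        W = rng.normal(scale=np.sqrt(sw2 / d), size=(C, d))   # hidden weights
        v = rng.normal(size=C)                                # readout weights
        f1 = v @ np.maximum(W @ x1, 0.0) / np.sqrt(C)
        f2 = v @ np.maximum(W @ x2, 0.0) / np.sqrt(C)
        total += f1 * f2
    return total / N

x1, x2 = np.array([1.0, 0.0, 0.0]), np.array([0.5, 0.5, np.sqrt(0.5)])
print(relu_nngp_exact(x1, x2))                   # exact kernel entry
print(relu_nngp_empirical(x1, x2, N=8))          # noisy estimate from 8 random nets
print(relu_nngp_empirical(x1, x2, N=512))        # more networks -> lower variance
```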
The time cost of computing these random features is linear in the training set and coreset sizes, $|T|$ and $|S|$. With the relatively low cost of matrix multiplication, this means that constructing the kernel matrices $K_{TS}$ and $K_{SS}$ has $O(|T| + |S|)$ and $O(|S|)$ time complexity, respectively, as opposed to $O(|T||S|)$ and $O(|S|^2)$ with KIP. Noting that the cost of matrix inversion is small compared to random feature construction, our total runtime is reduced to linear in the coreset size. We empirically verify this linear time complexity in Section 4.1 and provide a more detailed discussion in Appendix C.
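As a rough back-of-the-envelope comparison (sizes are illustrative, and the per-unit costs of a kernel-entry evaluation and a network forward pass differ), the scaling difference is easy to see by counting operations:

```python
# Illustrative operation counts for building K_TS and K_SS.
T, S = 50_000, 500      # training-set and coreset sizes
N = 8                   # random networks per step

exact_entries = T * S + S * S   # pairwise NTK/NNGP entries, O(|T||S| + |S|^2)
rfa_passes = N * (T + S)        # network forward passes, O(|T| + |S|)
print(exact_entries, rfa_passes)  # 25,250,000 vs. 404,000
```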
3.4 Loss Function for Dataset Distillation
We noted earlier that $\mathcal{L}_{MSE}$ is not well suited to the dataset distillation setting. In particular, there are two key problems:
Over-influence of already correctly classified data points. Consider two-way classification, with the label $1$ corresponding to the positive class and $-1$ corresponding to the negative class. Let $x_1$ and $x_2$ be items in the training set whose labels are both $1$, and let $f_{KRR}(x) = K_{x,S}(K_{SS} + \lambda I)^{-1} y_S$ be the KRR output on $x$ given our support set $X_S$. If $f_{KRR}(x_1) = 5$ and $f_{KRR}(x_2) = -1$, then the resulting MSE errors on $x_1$ and $x_2$ are $16$ and $4$, respectively. Notably, $x_1$ incurs a larger loss and results in a larger gradient on $X_S$ than $x_2$, despite $x_1$ being correctly classified and $x_2$ being incorrectly classified. In the heavily constrained dataset distillation setting, fitting both data points simultaneously is not possible, leading to underfitting of the data in terms of classification in order to better fit already correctly labeled data points in terms of regression.
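A two-line check of the example above makes the imbalance explicit, using the per-example MSE gradient $\frac{d}{df}(f - y)^2 = 2(f - y)$:

```python
y, f_x1, f_x2 = 1.0, 5.0, -1.0
print((f_x1 - y) ** 2, 2 * (f_x1 - y))  # loss 16.0, gradient magnitude 8.0 (correctly classified)
print((f_x2 - y) ** 2, 2 * (f_x2 - y))  # loss  4.0, gradient magnitude 4.0 (misclassified)
```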
Table 1: Kernel distillation results on five datasets with varying support set sizes. Bolded numbers indicate the best performance with fixed labels, and underlined numbers indicate the best performance with learned labels. Note that DC and DSA use fixed labels. (n = 4)

                                      Fixed Labels                            Learned Labels
Dataset         Img/Cls   DC          DSA         KIP         RFAD (ours)     KIP         RFAD (ours)
MNIST           1         91.7±0.5    88.7±0.6    95.2±0.2    96.7±0.2        97.3±0.1    97.2±0.2
                10        97.4±0.2    97.8±0.1    98.4±0.0    99.0±0.1        99.1±0.1    99.1±0.0
                50        98.8±0.1    99.2±0.1    99.1±0.0    99.1±0.0        99.4±0.1    99.1±0.0
Fashion-MNIST   1         70.5±0.6    70.6±0.6    78.9±0.2    81.6±0.6        82.9±0.2    84.6±0.2
                10        82.3±0.4    84.6±0.3    87.6±0.1    90.0±0.1        91.0±0.1    90.3±0.2
                50        83.6±0.4    88.7±0.2    90.0±0.1    91.3±0.1        92.4±0.1    91.4±0.1
SVHN            1         31.2±1.4    27.5±1.4    48.1±0.7    51.4±1.3        64.3±0.4    57.4±0.8
                10        76.1±0.6    79.2±0.5    75.8±0.1    77.2±0.3        81.1±0.5    78.2±0.5
                50        82.3±0.3    84.4±0.4    81.3±0.2    81.8±0.2        84.3±0.1    82.4±0.1
CIFAR-10        1         28.3±0.5    28.8±0.7    59.1±0.4    61.1±0.7        64.7±0.2    61.4±0.8
                10        44.9±0.5    52.1±0.5    67.0±0.4    73.1±0.1        75.6±0.2    73.7±0.2
                50        53.9±0.5    60.6±0.5    71.7±0.2    76.1±0.3        80.6±0.1    76.6±0.3
CIFAR-100       1         12.8±0.3    13.9±0.3    31.8±0.3    36.0±0.4        34.9±0.1    44.1±0.1
                10        25.2±0.3    32.3±0.3    46.0±0.2    44.9±0.2        49.5±0.3    46.8±0.2
Unclear probabilistic interpretation of MSE for classification. This prevents the regression outputs from being used directly in calibration-sensitive environments, necessitating the use of transformation functions in tasks such as GP classification [Williams and Barber, 1998, Milios et al., 2018].
To address these two issues, we adopt Platt scaling [Platt, 2000], applying a cross-entropy loss to the labels instead of an MSE one: $\mathcal{L}_{platt} = \text{x-entropy}(y_T, f(X_T)/\tau)$, where $\tau$ is a positive learned temperature scaling parameter. Unlike typical Platt scaling, we learn $\tau$ jointly with our support set instead of tuning it post hoc on a separate validation set. $f(X_T)$ is still calculated using the same KRR formula. Accordingly, this corresponds to training a network with MSE loss but, at inference time, scaling the outputs by $\tau^{-1}$ and applying a softmax to obtain a categorical distribution. Unlike typical GP classification, we ignore the variance of our predictions and take only the mean.
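A minimal PyTorch sketch of this loss is below; parameterizing the learned temperature through a softplus to keep $\tau$ positive is an illustrative choice, not something prescribed by the method.

```python
import torch
import torch.nn.functional as F

raw_tau = torch.zeros(1, requires_grad=True)     # learned jointly with X_S, y_S

def platt_loss(f_XT, y_T, raw_tau):
    """Cross-entropy on temperature-scaled KRR outputs: x-entropy(y_T, f(X_T) / tau)."""
    tau = F.softplus(raw_tau) + 1e-8             # keep the temperature positive
    return F.cross_entropy(f_XT / tau, y_T)

f_XT = torch.randn(64, 10)                       # stand-in for K_TS (K_SS + lam I)^{-1} y_S
y_T = torch.randint(0, 10, (64,))                # integer class labels
loss = platt_loss(f_XT, y_T, raw_tau)
loss.backward()                                  # gradients flow into raw_tau (and, in RFAD, into X_S and y_S)
```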
The combination of these three changes, namely using the NNGP kernel instead of the NTK, applying a random-feature approximation of the NNGP, and Platt scaling, results in our RFAD algorithm, which is given in Algorithm 1.
4 Experiments with RFAD
Here, we perform experiments to evaluate the performance of RFAD in dataset distillation tasks.
Benchmarks.
We applied our algorithm to five datasets: MNIST, Fashion-MNIST, SVHN, CIFAR-10, and CIFAR-100 [LeCun et al., 2010, Xiao et al., 2017, Netzer et al., 2011, Krizhevsky et al., 2009], distilling each dataset to coresets with 1, 10, or 50 images per class.
Network Structure and Training Setup. Similar to previous work on dataset distillation, we used standard ConvNet architectures with three convolutional layers, average pooling, and ReLU activations [Zhao et al., 2021, 2020, Nguyen et al., 2021b]. Similar to KIP [Nguyen et al., 2021b], we do not use instancenorm layers because of the lack of an infinite-width analog. During training, we used $N = 8$ random models, each with $C = 256$ convolutional channels per layer, and at test time we evaluated the datasets using the exact NNGP kernel computed with the neural-tangents library [Novak et al., 2020]. We consider both the fixed and learned label configurations, with Platt scaling applied and no data augmentation. We used the regularized Zero Component Analysis (ZCA) preprocessing