A wide variety of applications benefit from an efficient dataset distillation algorithm. For
instance, kernel methods [Vinyals et al., 2016, Kaya and Bilge, 2019, Snell et al., 2017, Ghorbani
et al., 2020, Refinetti et al., 2021] usually demand a large support set to achieve good prediction
performance at inference time; an efficient dataset distillation pipeline can provide a compact support set that serves this purpose.
Moreover, distilling a synthetic version of sensitive data helps preserve privacy: a support set can
be provided to an end user for downstream applications without disclosing the original data. Lastly, for
resource-hungry applications such as continual learning [Borsos et al., 2020], neural architecture
search [Shleifer and Prokop, 2019], and automated machine learning [Hutter et al., 2019], generating
a support set on which models can be fit efficiently is highly valuable.
Recently, a dataset distillation method called Kernel Inducing Points (KIP) [Nguyen et al., 2021a,b]
has shown strong performance on neural network classification tasks. KIP uses Neural Tangent Kernel
(NTK) ridge-regression to compute, exactly, the outputs of an infinite-width neural network
trained on the support set. Although the method established the state of the art for dataset distillation
in terms of accuracy, its computational cost is very high due to the exact calculation of
the NTK, which limits the algorithm's applicability.
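To make the kernel ridge-regression step concrete, the following is a minimal sketch (in NumPy) of how predictions are formed from a distilled support set: labels of target points are a kernel-weighted combination of the support labels. The function name, the generic kernel_fn argument, and the regularizer reg are illustrative placeholders, not KIP's actual implementation; KIP additionally backpropagates a loss on such predictions into the support images themselves.

```python
import numpy as np

def krr_predict(kernel_fn, X_s, y_s, X_t, reg=1e-6):
    """Kernel ridge-regression prediction from a (distilled) support set.

    kernel_fn(A, B) should return the len(A) x len(B) kernel matrix,
    e.g. an NTK or NNGP kernel evaluated between two image batches.
    """
    K_ss = kernel_fn(X_s, X_s)                      # support-support block, |S| x |S|
    K_ts = kernel_fn(X_t, X_s)                      # target-support block,  |T| x |S|
    alpha = np.linalg.solve(K_ss + reg * np.eye(len(X_s)), y_s)
    return K_ts @ alpha                             # predicted label scores for X_t
```

Forming the |S| x |S| support block K_ss exactly is the source of the quadratic cost in the support-set size that motivates our approach.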
In this paper, we build on KIP and develop a new algorithm for dataset distillation, called RFAD,
that attains similar accuracy while running significantly faster. The key insight is a new
kernel-inducing-point method that improves the complexity from O(|S|^2) to O(|S|), where |S| is the
support-set size. To this end, we make three major contributions:
I. We develop RFAD, a fast, accurate, and scalable algorithm for dataset distillation in neural network classification tasks.
II. We improve the time performance of KIP [Nguyen et al., 2021a,b] by over two orders of magnitude while retaining or improving its accuracy. This speedup comes from leveraging a random-feature approximation of the Neural Network Gaussian Process (NNGP) kernel by instantiating random neural networks (see the sketch after this list).
III. We show the effectiveness of RFAD in efficient dataset distillation tasks, enhancing model interpretability and privacy preservation.
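As a rough illustration of the random-feature idea in contribution II, the sketch below estimates a single-hidden-layer, fully-connected NNGP kernel by instantiating several random ReLU networks and taking inner products of their concatenated hidden activations. The architecture, widths, counts, and scaling are simplified assumptions for illustration only; RFAD's actual construction (e.g., for convolutional networks) is described in the remainder of the paper.

```python
import numpy as np

def random_relu_features(X, n_models=8, width=512, seed=0):
    """Concatenate hidden activations of several random ReLU networks.

    With weights drawn as N(0, 1/d), the inner product of the scaled
    features is a Monte Carlo estimate of the one-hidden-layer NNGP kernel.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    feats = []
    for _ in range(n_models):
        W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, width))
        feats.append(np.maximum(X @ W, 0.0))        # ReLU hidden activations
    return np.concatenate(feats, axis=1) / np.sqrt(n_models * width)

# Approximate kernel blocks via explicit features: only |S| feature
# evaluations are needed for the support set, instead of |S|^2 kernel entries.
X_support = np.random.default_rng(1).normal(size=(10, 784))  # toy "distilled" images
X_target = np.random.default_rng(2).normal(size=(5, 784))
phi_s = random_relu_features(X_support)             # same seed -> same random
phi_t = random_relu_features(X_target)              # networks for both batches
K_ss_approx = phi_s @ phi_s.T                       # approx. |S| x |S| kernel
K_ts_approx = phi_t @ phi_s.T                       # approx. |T| x |S| kernel
```

Because each support image needs only a single forward pass through the random networks, the cost of building these kernel blocks scales linearly in |S| rather than quadratically.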
2 Background and Related Work
Coresets and Dataset Distillation.
Coresets are subsets of the data chosen so that models trained on
them perform competitively with models trained on the full dataset. Standard coreset
selection algorithms use importance sampling to find coresets [Har-Peled and Mazumdar, 2004,
Lucic et al., 2017, Cohen et al., 2017]. Beyond such sampling-based methods, newer algorithms have been
introduced that are inspired by catastrophic forgetting [Toneva et al., 2019] and by mean matching, i.e., adding samples
to the coreset so that its mean matches that of the original dataset [Chen et al., 2010, Rebuffi et al., 2017, Castro et al., 2018,
Belouadah and Popescu, 2020, Scholkopf et al., 1999]. An overview of coresets for point approximation is provided in Phillips [2016].
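For intuition, here is a deliberately simplified greedy sketch of mean matching: at every step it adds the sample that brings the coreset mean closest to the full-data mean, in the spirit of herding-style selection. The function and its details are illustrative assumptions rather than any of the cited algorithms.

```python
import numpy as np

def mean_matching_coreset(X, k):
    """Greedily pick k rows of X whose mean best matches the full-data mean."""
    target = X.mean(axis=0)
    chosen, running_sum = [], np.zeros(X.shape[1])
    for step in range(1, k + 1):
        # Candidate coreset means if each point were added next.
        cand_means = (running_sum + X) / step
        errors = np.linalg.norm(cand_means - target, axis=1)
        errors[chosen] = np.inf                     # do not pick a point twice
        best = int(np.argmin(errors))
        chosen.append(best)
        running_sum += X[best]
    return chosen

# Example: select 10 representative points from 1,000 random 2-D samples.
X = np.random.default_rng(0).normal(size=(1000, 2))
print(mean_matching_coreset(X, 10))
```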
More recently, and in parallel with coreset selection methods, new algorithms have been developed that distill
a synthetic dataset from a given dataset, such that fitting to this synthetic set yields performance on
par with training on the original dataset [Wang et al., 2018]. To this end, these dataset condensation
(or distillation) algorithms use gradient matching [Zhao et al., 2021, Maclaurin et al., 2015, Lorraine
et al., 2020], differentiable siamese augmentation [Zhao and Bilen, 2021b], or distribution
matching [Zhao and Bilen, 2021a]. Dataset distillation has also been applied to the labels rather
than the images [Bohdal et al., 2020]. Recently, a novel algorithm called Kernel Inducing Points
(KIP) [Nguyen et al., 2021a,b] was introduced that performs very well at distilling synthetic
sets by using neural tangent kernel ridge-regression (KRR). Like the other algorithms, however, KIP is
computationally expensive. Here, we propose a new method that significantly improves its complexity.
Infinite Width Neural Networks. Single-layer infinite-width randomly initialized neural networks
correspond to Gaussian processes [Neal, 1996], allowing for closed-form, exact training of Bayesian
neural networks for regression. Recently, this has been extended to deep fully-connected networks
[Lee et al., 2018, de G. Matthews et al., 2018], convolutional networks [Novak et al., 2019,
Garriga-Alonso et al., 2019], attention-based networks [Hron et al., 2020], and even arbitrary neural
architectures [Yang, 2019], with the corresponding GP kernel being the NNGP kernel. Likewise, for
infinite-width neural networks trained with gradient descent, the training process simplifies dramati-