
On the other hand, dataset distillation [28] aims to distill a large-scale dataset into a small synthetic dataset such that the test performance of a model trained on the distilled data is comparable to that of a model trained on the original data. Such methods have achieved remarkable performance in real-world problems with high-dimensional data and deep neural networks. Although dataset distillation methods have a similar motivation to pseudocoresets, the two approaches optimize different objectives. Whereas pseudocoresets minimize the divergence between the coreset and full-data posteriors, dataset distillation considers a non-Bayesian setting with heuristically chosen objective functions, such as matching the gradients of loss functions [36] or matching the parameters obtained from optimization trajectories [8]. Such objectives have no direct Bayesian analogue due to their heuristic nature, and stronger theoretical grounding could advance both our understanding of dataset distillation and the empirical performance of Bayesian pseudocoreset methods.
In this paper, we provide a unifying view of Bayesian pseudocoresets and dataset distillation to take advantage of both approaches. We first study various choices of divergence measures as objectives for learning pseudocoresets. While Manousakas et al. [19] minimize the reverse KL divergence between the pseudocoreset posterior and the full-data posterior, we show that alternative divergence measures, such as the forward KL divergence and the Wasserstein distance, are also effective in practice. Based on this perspective, we re-interpret existing dataset distillation algorithms [36, 8] as approximations to Bayesian pseudocoresets learned by minimizing the reverse KL divergence and the Wasserstein distance, with specific choices of variational approximations for the coreset posteriors. This connection justifies the heuristically chosen learning objectives of dataset distillation algorithms and, at the same time, provides a theoretical basis for using the distilled datasets obtained from such procedures for Bayesian inference. Conversely, Bayesian pseudocoresets benefit from this connection by borrowing various ideas and tricks used in dataset distillation algorithms to make them scale to complex tasks. For instance, the variational posteriors we identify by recasting dataset distillation algorithms as Bayesian pseudocoreset learning can be used for Bayesian pseudocoresets with various choices of divergence measures. We can also adapt the idea of reusing pre-computed optimization trajectories [8], which has already been shown to work at large scales.
We empirically compare pseudocoreset algorithms based on three different divergence measures on high-dimensional image classification tasks. Our results demonstrate that we can efficiently and scalably construct Bayesian pseudocoresets with which MCMC algorithms such as Hamiltonian Monte Carlo (HMC) [10, 22] or Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) [9] accurately approximate the full posterior.
2 Background
2.1 Bayesian Pseudocoresets
Denote the observed data as $x = \{x_n\}_{n=1}^N$. Given a probabilistic model indexed by a parameter $\theta$ with some prior distribution $\pi_0$, we are interested in the posterior conditioned on the full data $x$,
$$
\pi_x(\theta) = \frac{1}{Z(x)} \exp\left(\sum_{n=1}^{N} f(x_n, \theta)\right) \pi_0(\theta) := \frac{1}{Z(x)} \exp\left(\mathbf{1}_N^\top f(x, \theta)\right) \pi_0(\theta), \tag{1}
$$
where $f(x, \theta) := [f(x_1, \theta), \dots, f(x_N, \theta)]^\top$, $\mathbf{1}_N$ is the $N$-dimensional one vector, and $Z(x) = \int \exp(\mathbf{1}_N^\top f(x, \theta))\, d\theta$ is a partition function. The posterior $\pi_x$ is usually intractable to compute due to $Z(x)$, so we employ approximate inference algorithms.
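To make the notation concrete, the following is a minimal sketch (not taken from the paper) of evaluating the unnormalized log posterior in Eq. (1); the isotropic Gaussian prior and the toy Gaussian per-example log-likelihood are hypothetical placeholders, and all function names are ours.

```python
# A minimal sketch: the unnormalized log posterior of Eq. (1),
# 1_N^T f(x, theta) + log pi_0(theta), with the partition function Z(x) dropped.
# The prior and per-example log-likelihood are hypothetical placeholders.
import jax
import jax.numpy as jnp

def log_prior(theta, sigma0=1.0):
    # log pi_0(theta): isotropic Gaussian prior, up to an additive constant
    return -0.5 * jnp.sum(theta ** 2) / sigma0 ** 2

def f(x_n, theta):
    # f(x_n, theta): per-example log-likelihood; here a toy Gaussian observation model
    return -0.5 * jnp.sum((x_n - theta) ** 2)

def log_joint(x, theta):
    # 1_N^T f(x, theta) + log pi_0(theta); x has shape (N, D), theta has shape (D,)
    return jnp.sum(jax.vmap(f, in_axes=(0, None))(x, theta)) + log_prior(theta)
```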
Bayesian pseudocoreset methods [19] aim to construct a synthetic dataset called a pseudocoreset, $u = \{u_m\}_{m=1}^M$ with $M \ll N$, such that the posterior of $\theta$ conditioned on it approximates the original posterior $\pi_x$. The pseudocoreset posterior is written as¹,
$$
\pi_u(\theta) = \frac{1}{Z(u)} \exp\left(\sum_{m=1}^{M} f(u_m, \theta)\right) \pi_0(\theta) = \frac{1}{Z(u)} \exp\left(\mathbf{1}_M^\top f(u, \theta)\right) \pi_0(\theta), \tag{2}
$$
¹Manousakas et al. [19] considered two sets of parameters, the coreset elements $u$ and their weights $w = \{w_m\}_{m=1}^M$. We empirically found that the weights $w$ have a negligible impact on performance, so we only learn the coreset elements.
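Continuing the sketch above (same placeholder $f$ and prior, which are our assumptions rather than the paper's), the pseudocoreset potential in Eq. (2) simply sums over the $M \ll N$ learnable synthetic points $u$ instead of the full data; because it is differentiable in $u$, the synthetic points can be optimized with gradient-based methods under a chosen divergence between $\pi_u$ and $\pi_x$.

```python
# Sketch continued (same placeholder f and log_prior as above): the unnormalized
# pseudocoreset log posterior of Eq. (2), 1_M^T f(u, theta) + log pi_0(theta).
def log_joint_coreset(u, theta):
    # u has shape (M, D) with M << N; Z(u) is again dropped
    return jnp.sum(jax.vmap(f, in_axes=(0, None))(u, theta)) + log_prior(theta)

# The potential is differentiable in u, so u can be updated by gradient descent on a
# divergence between pi_u and pi_x; here we only illustrate differentiation w.r.t. u.
grad_wrt_u = jax.grad(log_joint_coreset, argnums=0)
```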