
On Divergence Measures for Bayesian Pseudocoresets
Balhae Kim1, Jungwon Choi1, Seanie Lee1, Yoonho Lee2, Jung-Woo Ha3, Juho Lee1,4
KAIST1, Stanford University2, NAVER AI Lab3, AITRICS4
{balhaekim, jungwon.choi, lsnfamily02}@kaist.ac.kr,
yoonho@stanford.edu, jungwoo.ha@navercorp.com, juholee@kaist.ac.kr
Abstract
A Bayesian pseudocoreset is a small synthetic dataset for which the posterior
over parameters approximates that of the original dataset. While promising, the
scalability of Bayesian pseudocoresets has not yet been validated in realistic problems
such as image classification with deep neural networks. On the other hand, dataset
distillation methods similarly construct a small dataset such that the optimization
using the synthetic dataset converges to a solution with performance competitive
with optimization using full data. Although dataset distillation has been empirically
verified in large-scale settings, the framework is restricted to point estimates, and
its adaptation to Bayesian inference has not been explored. This paper casts
two representative dataset distillation algorithms as approximations to methods for
constructing pseudocoresets by minimizing specific divergence measures: reverse
KL divergence and Wasserstein distance. Furthermore, we provide a unifying view
of such divergence measures in Bayesian pseudocoreset construction. Finally, we
propose a novel Bayesian pseudocoreset algorithm based on minimizing forward
KL divergence. Our empirical results demonstrate that the pseudocoresets con-
structed from these methods reflect the true posterior even in high-dimensional
Bayesian inference problems.
1 Introduction
One of the main ingredients of modern statistical machine learning is large-scale data, which, when handled correctly, enables the learning of powerful and flexible models such as deep neural networks [3, 32]. However, training a neural network on larger datasets usually requires many nontrivial choices, ranging from architecture selection to hyperparameter optimization and infrastructure challenges [33]. Aside from requiring enormous computing resources for training, large-scale data often contains sensitive information that must not be shared publicly [11]. This motivates the construction of coresets [29, 14]: sparse subsets of a large-scale dataset that summarize the essential features of the original dataset. In the context of Bayesian inference, the posterior conditioned on a good coreset should be similar to the exact posterior conditioned on the full dataset.
Coreset construction becomes more difficult with high-dimensional data. A previous work [19] shows that even the best coreset becomes suboptimal at large scales: the divergence between the posterior conditioned on the optimal coreset and the exact posterior grows with the dimension of the data. Motivated by this weakness of strict coresets, Manousakas et al. [19] propose to construct a Bayesian pseudocoreset: a synthetic dataset represented as a set of learnable parameters and trained to minimize the divergence between the coreset posterior and the full-data posterior. Compared to coresets, pseudocoresets scale much better with data dimension and achieve significantly lower posterior approximation error. However, the KL divergence minimization in pseudocoreset learning requires constructing approximate posteriors on the fly during the learning process, making this approach hard to apply to high-dimensional models such as deep neural networks.
On the other hand, dataset distillation [28] aims to distill a large-scale dataset into a small synthetic dataset such that the test performance of a model trained on the distilled data is comparable to that of a model trained on the original data. Such methods have achieved remarkable performance on real-world problems with high-dimensional data and deep neural networks. Although dataset distillation methods share their motivation with pseudocoresets, the two approaches optimize different objectives. Whereas pseudocoresets minimize the divergence between the coreset and full-data posteriors, dataset distillation considers a non-Bayesian setting with heuristically chosen objective functions, such as matching the gradients of loss functions [36] or matching the parameters obtained from optimization trajectories [8]. Such objectives have no direct Bayesian analogue due to their heuristic nature, and more theoretical grounding could advance both our understanding of dataset distillation and the empirical performance of Bayesian pseudocoreset methods.
In this paper, we provide a unifying view of Bayesian pseudocoresets and dataset distillation to take advantage of both approaches. We first study various choices of divergence measures as objectives for learning pseudocoresets. While Manousakas et al. [19] minimize the reverse KL divergence between the pseudocoreset posterior and the full-data posterior, we show that alternative divergence measures such as the forward KL divergence and the Wasserstein distance are also effective in practice. Based on this perspective, we re-interpret existing dataset distillation algorithms [36, 8] as approximations to Bayesian pseudocoresets learned by minimizing reverse KL divergence and Wasserstein distance, with specific choices of variational approximations for the coreset posteriors. This connection justifies the heuristically chosen learning objectives of dataset distillation algorithms and, at the same time, provides a theoretical grounding for using the distilled datasets obtained from such procedures for Bayesian inference. Conversely, Bayesian pseudocoresets benefit from this connection by borrowing various ideas and tricks from dataset distillation algorithms to scale to complex tasks. For instance, the variational posteriors we identify by recasting dataset distillation algorithms as Bayesian pseudocoreset learning can be used for Bayesian pseudocoresets with various choices of divergence measures. We can also adopt the idea of reusing pre-computed optimization trajectories [8], which has already been shown to work at large scales.
We empirically compare pseudocoreset algorithms based on three different divergence measures on high-dimensional image classification tasks. Our results demonstrate that we can efficiently and scalably construct Bayesian pseudocoresets with which MCMC algorithms such as Hamiltonian Monte Carlo (HMC) [10, 22] or Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) [9] accurately approximate the full posterior.
2 Background
2.1 Bayesian Pseudocoresets
Denote the observed data as $x = \{x_n\}_{n=1}^{N}$. Given a probabilistic model indexed by a parameter $\theta$ with some prior distribution $\pi_0$, we are interested in the posterior conditioned on the full data $x$,

$$\pi_x(\theta) = \frac{1}{Z(x)} \exp\!\left(\sum_{n=1}^{N} f(x_n, \theta)\right) \pi_0(\theta) =: \frac{1}{Z(x)} \exp\!\left(\mathbf{1}_N^\top f(x, \theta)\right) \pi_0(\theta), \qquad (1)$$

where $f(x, \theta) := [f(x_1, \theta), \dots, f(x_N, \theta)]^\top$, $\mathbf{1}_N$ is the $N$-dimensional one vector, and $Z(x) = \int \exp(\mathbf{1}_N^\top f(x, \theta))\, \pi_0(\theta)\, \mathrm{d}\theta$ is the partition function. The posterior $\pi_x$ is usually intractable to compute due to $Z(x)$, so we employ approximate inference algorithms. Bayesian pseudocoreset methods [19] aim to construct a synthetic dataset called a pseudocoreset, $u = \{u_m\}_{m=1}^{M}$ with $M \ll N$, such that the posterior of $\theta$ conditioned on it approximates the original posterior $\pi_x$. The pseudocoreset posterior is written as¹

$$\pi_u(\theta) = \frac{1}{Z(u)} \exp\!\left(\sum_{m=1}^{M} f(u_m, \theta)\right) \pi_0(\theta) = \frac{1}{Z(u)} \exp\!\left(\mathbf{1}_M^\top f(u, \theta)\right) \pi_0(\theta), \qquad (2)$$
¹ Manousakas et al. [19] considered two sets of learnable parameters: the coreset elements $u$ and their weights $w = \{w_m\}_{m=1}^{M}$. We empirically found that the weights $w$ have a negligible impact on performance, so we only learn the coreset elements.
where $f(u, \theta)$, $\mathbf{1}_M$, and $Z(u)$ are defined analogously. Manousakas et al. [19] suggest learning $u$ by minimizing the KL divergence between $\pi_u$ and $\pi_x$,

$$u^* = \operatorname*{arg\,min}_{u} D_{\mathrm{KL}}[\pi_u \,\|\, \pi_x], \qquad (3)$$

and show that the gradient of this objective is computed as

$$\nabla_{u_m} D_{\mathrm{KL}}[\pi_u \,\|\, \pi_x] = -\operatorname{Cov}_{\pi_u(\theta)}\!\left(\nabla_{u_m} f(u_m, \theta),\; \mathbf{1}_N^\top f(x, \theta) - \mathbf{1}_M^\top f(u, \theta)\right). \qquad (4)$$

The expectation over the pseudocoreset posterior $\pi_u$ in the above equation is further approximated by Monte-Carlo estimation with an approximate posterior $q_u \approx \pi_u$, obtained, for instance, from a Laplace approximation.
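To make the estimator concrete, here is a minimal PyTorch sketch of Eq. 4 (our own illustration, not the authors' released code). It assumes an isotropic Gaussian surrogate $q_u = \mathcal{N}(\theta_{\mathrm{MAP}}, \sigma^2 I)$ for the approximate posterior and a user-supplied `log_lik(data, theta)` returning per-example log-likelihoods $f(\cdot, \theta)$; the names `theta_map`, `sigma`, and `n_samples` are our assumptions.

```python
import torch

def rkl_coreset_grad(u, x, log_lik, theta_map, sigma=0.01, n_samples=8):
    # Monte-Carlo estimate of Eq. (4): grad_u D_KL[pi_u || pi_x] is the negative
    # covariance, under q_u ~= pi_u, between grad_u f(u_m, theta) and the scalar
    # potential difference 1_N^T f(x, theta) - 1_M^T f(u, theta).
    gs, hs = [], []
    for _ in range(n_samples):
        theta = theta_map + sigma * torch.randn_like(theta_map)  # theta ~ q_u
        u_ = u.detach().requires_grad_(True)
        fu = log_lik(u_, theta).sum()                 # 1_M^T f(u, theta)
        g = torch.autograd.grad(fu, u_)[0]            # grad_u f(u_m, theta), stacked over m
        h = (log_lik(x, theta).sum() - fu).detach()   # potential difference (scalar)
        gs.append(g)
        hs.append(h)
    G, H = torch.stack(gs), torch.stack(hs)
    H = H.view(-1, *([1] * (G.dim() - 1)))            # broadcast over coreset shape
    return -((G * H).mean(0) - G.mean(0) * H.mean(0)) # empirical -Cov(g, h)
```

The resulting gradient can be plugged into any first-order optimizer on $u$; the Laplace approximation mentioned above corresponds to one particular choice of the surrogate's mean and covariance.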
2.2 Dataset Distillation
With a similar motivation to Bayesian pseudocoresets, dataset distillation methods construct a small synthetic dataset $u$, but from the perspective of matching optimal parameters. That is, in dataset distillation, we find $u$ such that

$$u^* = \operatorname*{arg\,min}_{u} d(\theta^*_x, \theta^*_u), \qquad \theta^*_x = \operatorname*{arg\,min}_{\theta} \ell(x, \theta), \qquad \theta^*_u = \operatorname*{arg\,min}_{\theta} \ell(u, \theta), \qquad (5)$$

with a loss function $\ell(\cdot, \theta)$ and a distance function $d(\cdot, \cdot)$ comparing two parameters. In general, optimizing $d(\theta^*_x, \theta^*_u)$ is intractable since it is a bilevel optimization problem requiring $\theta^*_x$ and $\theta^*_u$, so existing works relax the problem in several ways, such as using kernel-based approaches [23, 24] or solving the inner optimization only over a short horizon [36, 34, 8]. Below, we briefly review two representative methods: Dataset Condensation (DC) [36] and Matching Training Trajectories (MTT) [8].
Dataset Condensation
Instead of fully unrolling the inner optimization process, DC computes the matching objective at every inner step, starting from some random parameter $\theta^{(0)}$:

$$u^* = \operatorname*{arg\,min}_{u} \mathbb{E}_{\theta^{(0)}}\!\left[\sum_{t=0}^{T} d_{\cos}\!\left(\nabla_\theta \ell(x, \theta^{(t)}),\, \nabla_\theta \ell(u, \theta^{(t)})\right)\right], \qquad (6)$$

$$\theta^{(t)} = \theta^{(t-1)} - \eta \nabla_\theta \ell(u, \theta^{(t-1)}) \quad \text{for } t \geq 1, \qquad (7)$$

where $d_{\cos}$ denotes the cosine-similarity-based distance between two gradients. To implement this objective, we build a trajectory of parameters starting from a random $\theta^{(0)}$ with the current distilled dataset $u$. We then compute the gradient matching objective for each layer at each step of the trajectory to update $u$. Despite its appealing simplicity, this algorithm may suffer from short-horizon bias due to the limited trajectory length.
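The following PyTorch sketch illustrates one outer iteration of this procedure under Eqs. 6-7. It is our simplified rendering rather than the reference DC implementation; `net`, `loss_fn`, the labels `y_u`/`y_x`, and the step sizes are assumed interfaces.

```python
import torch

def dc_outer_iteration(u, y_u, x, y_x, net, loss_fn, eta=0.01, lr_u=0.1, T=10):
    # One outer iteration of gradient matching (Eqs. 6-7): at each inner step,
    # match the gradients of the real-data and distilled-data losses layer-wise,
    # update the distilled images, then take the inner SGD step on u only.
    u = u.detach().requires_grad_(True)
    params = list(net.parameters())
    for t in range(T):
        g_x = torch.autograd.grad(loss_fn(net(x), y_x), params)  # treated as constants
        g_u = torch.autograd.grad(loss_fn(net(u), y_u), params, create_graph=True)
        match = sum(1.0 - torch.cosine_similarity(gx.flatten(), gu.flatten(), dim=0)
                    for gx, gu in zip(g_x, g_u))                 # d_cos, summed over layers
        grad_u = torch.autograd.grad(match, u)[0]
        with torch.no_grad():
            u -= lr_u * grad_u                                   # outer update of u
        g_step = torch.autograd.grad(loss_fn(net(u), y_u), params)
        with torch.no_grad():
            for p, g in zip(params, g_step):
                p -= eta * g                                     # inner SGD step, Eq. (7)
    return u.detach()
```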
Matching Training Trajectories
MTT extends the single-step gradient matching of DC and considers longer inner optimization trajectories. Starting from a randomly initialized parameter $\theta^{(0)}$, MTT takes multiple inner optimization steps with both $u$ and $x$, and minimizes the normalized $L_2$ distance between the two parameters at the end of the learning trajectories:

$$u^* = \operatorname*{arg\,min}_{u} \mathbb{E}_{\theta^{(0)}}\!\left[d_2\!\left(\theta^{(L_x)}_x, \theta^{(L_u)}_u\right)\right], \qquad \theta^{(0)}_x = \theta^{(0)}_u = \theta^{(0)}, \qquad (8)$$

$$\theta^{(t)}_x = \theta^{(t-1)}_x - \eta_x \nabla_\theta \ell(x, \theta^{(t-1)}_x) \quad \text{for } t = 1, \dots, L_x, \qquad (9)$$

$$\theta^{(t)}_u = \theta^{(t-1)}_u - \eta_u \nabla_\theta \ell(u, \theta^{(t-1)}_u) \quad \text{for } t = 1, \dots, L_u, \qquad (10)$$

where $d_2(\cdot, \cdot)$ is the normalized $L_2$ distance between parameters. Unlike DC, this algorithm starts from a common initial parameter $\theta^{(0)}$, keeps track of two separate optimization trajectories (one with $u$ and another with $x$), and performs the outer update of $u$ only after $L_x$ and $L_u$ inner steps. To reduce the heavy computational cost of the inner updates with the full dataset $x$, MTT employs mini-batch SGD for the inner steps, in addition to saving and re-using precomputed SGD trajectories to train $u$. The use of precomputed SGD trajectories is similar to that of replay buffers in reinforcement learning [17, 20, 1], and allows MTT to handle long inner-loop sequences. While MTT significantly improves over DC, presumably due to its longer inner optimization horizon, the algorithm requires large memory even for a moderate number of inner steps $L_u$, because it requires backpropagation through all inner steps with $u$.
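Below is a minimal sketch of one MTT outer step following Eqs. 8-10. It assumes a functional loss `loss_flat(theta, u, y_u)` that evaluates $\ell(u, \theta)$ at a flattened parameter vector, and a buffer `expert_traj` holding a saved full-data SGD trajectory $\theta^{(t)}_x$; these names and the particular normalization are our assumptions, not the published implementation.

```python
import torch

def mtt_outer_step(u, y_u, loss_flat, expert_traj, t0, L_u, L_x,
                   eta_u=0.01, lr_u=0.1):
    # One outer step of MTT (Eqs. 8-10): unroll L_u differentiable SGD steps on
    # the distilled data from a saved expert checkpoint, then match the endpoint
    # against the expert parameters L_x steps further along the saved trajectory.
    u = u.detach().requires_grad_(True)
    theta = expert_traj[t0].clone().requires_grad_(True)   # common init theta^(0)
    target = expert_traj[t0 + L_x]                         # theta_x^(L_x) from the buffer
    for _ in range(L_u):                                   # student trajectory, Eq. (10)
        g = torch.autograd.grad(loss_flat(theta, u, y_u), theta, create_graph=True)[0]
        theta = theta - eta_u * g
    # Squared L2 distance between endpoints, normalized by how far the expert moved.
    d2 = (theta - target).pow(2).sum() / (expert_traj[t0] - target).pow(2).sum()
    grad_u = torch.autograd.grad(d2, u)[0]                 # backprop through all L_u steps
    with torch.no_grad():
        u -= lr_u * grad_u
    return u.detach()
```

The backpropagation through all $L_u$ unrolled steps in this sketch is exactly the memory bottleneck noted above.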
3 On Divergence Measures for Bayesian Pseudocoresets
While the Bayesian pseudocoreset was originally obtained by minimizing the reverse KL divergence, in principle we can minimize an arbitrary divergence measure $D(\cdot, \cdot)$ to achieve the same goal:

$$u^* = \operatorname*{arg\,min}_{u} D(\pi_x, \pi_u). \qquad (11)$$

In this section, we study three different choices of divergence measure, along with their relation to dataset distillation methods.
3.1 Reverse KL Divergence
Manousakas et al. [19] proposed to minimize the reverse KL divergence, with a Laplace approximation used to obtain the pseudocoreset posterior $\pi_u$. Here, we show that it is possible to recast Dataset Condensation (DC) as reverse KL divergence minimization with an alternative choice of variational posterior. Let $\theta^{(t)}$ be a parameter in a learning trajectory with loss function $\ell(u, \theta) := -\mathbf{1}_M^\top f(u, \theta)$. Provided that $\theta^{(t)}$ is sufficiently close to a local optimum, we can construct a naïve approximation of $\pi_u$ as

$$q_u(\theta) = \mathcal{N}(\theta; \theta^{(t)}, \Sigma), \qquad \theta^{(t)} = \theta^{(t-1)} - \eta \nabla_\theta \ell(u, \theta^{(t-1)}). \qquad (12)$$

Under this setting, we can show that one gradient descent step with respect to the reverse KL divergence $D_{\mathrm{KL}}[\pi_u \,\|\, \pi_x]$ is equivalent to one step of gradient matching in DC when the parameter being optimized is close to a local optimum. We note that while the Gaussian posterior is a simple approximation class, it is commonly used due to its demonstrated usefulness in the literature [18], for example in connection with the Bernstein-von Mises theorem [16, 27]. However, we also note that Gaussian approximations are not necessary for establishing the connection between dataset distillation and Bayesian pseudocoresets; they are simply a special case that simplifies the analysis.

Proposition 3.1. Let $q_u(\theta)$ be set as in Eq. 12. Assume $\theta^{(t-1)}$ is close enough to a local optimum, so that we have $\|\theta^{(t-1)} - \theta^{(t)}\| \ll 1$. Then we have

$$\nabla_u D_{\mathrm{KL}}[\pi_u \,\|\, \pi_x] \approx -\eta\, \nabla_u\!\left(\nabla_\theta \ell(x, \theta^{(t-1)})^\top \nabla_\theta \ell(u, \theta^{(t-1)})\right). \qquad (13)$$

The proof is given in Appendix A. In other words, when sufficiently close to a local optimum, DC can be interpreted as a special instance of Bayesian pseudocoreset construction via reverse KL minimization with a Gaussian approximation of the pseudocoreset posterior $\pi_u$. Throughout the paper, we refer to the Bayesian pseudocoreset with reverse KL minimization as BPC-rKL. To implement Algorithm 1 of [19] for image datasets in a simple way, we approximated $\pi_u$ with a Gaussian distribution; please refer to Appendix B.2 for the detailed algorithm.
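As a concrete illustration of Proposition 3.1, the approximate BPC-rKL gradient can be computed by differentiating the negative inner product of the two loss gradients with respect to $u$. This is our own sketch under the same assumed interfaces as before; note that the only difference from the DC update is the unnormalized inner product in place of the cosine-based distance, which is consistent with the proposition's claim.

```python
import torch

def bpc_rkl_grad_approx(u, y_u, x, y_x, net, loss_fn, eta=0.01):
    # Proposition 3.1: near a local optimum,
    #   grad_u D_KL[pi_u || pi_x] ~= -eta * grad_u ( g_x^T g_u ),
    # where g_x and g_u are gradients of the real- and distilled-data losses.
    u = u.detach().requires_grad_(True)
    params = list(net.parameters())
    g_x = torch.autograd.grad(loss_fn(net(x), y_x), params)                    # constants
    g_u = torch.autograd.grad(loss_fn(net(u), y_u), params, create_graph=True)
    inner = sum((gx * gu).sum() for gx, gu in zip(g_x, g_u))                   # g_x^T g_u
    return -eta * torch.autograd.grad(inner, u)[0]
```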
3.2 Wasserstein Distance
We can instead choose $D(\cdot, \cdot)$ to be the Wasserstein distance between $\pi_u$ and $\pi_x$. While the Wasserstein distance is intractable in general, by approximating both $\pi_u$ and $\pi_x$ with Gaussian distributions we can obtain a closed-form expression for this distance. Specifically, let $\theta_u$ and $\theta_x$ be two parameters sufficiently close to local optima. Then we can construct crude approximations of $\pi_u$ and $\pi_x$ as

$$\pi_u(\theta) \approx q_u(\theta) := \mathcal{N}(\theta; \theta_u, \Sigma_u), \qquad \pi_x(\theta) \approx q_x(\theta) := \mathcal{N}(\theta; \theta_x, \Sigma_x). \qquad (14)$$

The squared 2-Wasserstein distance between these two distributions is then computed as

$$W_2^2(q_u, q_x) = \|\theta_u - \theta_x\|_2^2 + \operatorname{Tr}\!\left(\Sigma_x + \Sigma_u - 2\left(\Sigma_u^{1/2} \Sigma_x \Sigma_u^{1/2}\right)^{1/2}\right). \qquad (15)$$

Now consider a single outer step of MTT. Starting from a common initialization $\theta^{(0)}$, the algorithm takes gradient steps with both $x$ and $u$ to obtain $\theta^{(L_x)}_x$ and $\theta^{(L_u)}_u$. If we set $q_u(\theta) = \mathcal{N}(\theta; \theta^{(L_u)}_u, \Sigma)$ and $q_x(\theta) = \mathcal{N}(\theta; \theta^{(L_x)}_x, \Sigma)$ with a common covariance $\Sigma$, then the Wasserstein distance reduces to $\|\theta^{(L_x)}_x - \theta^{(L_u)}_u\|_2^2$, which is proportional to the $L_2$ objective optimized in MTT. Analogously to BPC-rKL for the reverse KL divergence, we refer to the Bayesian pseudocoreset with Wasserstein distance minimization as BPC-W.
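For intuition, Eq. 15 can be evaluated numerically for small covariance matrices; the following numpy/scipy sketch (illustrative only; the matrix square roots are impractical at neural-network scale, which is precisely what the shared-covariance simplification sidesteps) also verifies the reduction to the squared mean difference when $\Sigma_u = \Sigma_x$.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_sq(mu_u, cov_u, mu_x, cov_x):
    # Squared 2-Wasserstein distance between two Gaussians, Eq. (15).
    root_u = sqrtm(cov_u)
    cross = np.real(sqrtm(root_u @ cov_x @ root_u))  # (Sig_u^{1/2} Sig_x Sig_u^{1/2})^{1/2}
    mean_term = np.sum((mu_u - mu_x) ** 2)
    trace_term = np.trace(cov_x + cov_u - 2.0 * cross)
    return mean_term + trace_term

# With a shared covariance the trace term vanishes and the distance reduces to
# ||theta_u - theta_x||_2^2, recovering the (unnormalized) MTT objective.
cov = 0.3 * np.eye(4)
mu_u, mu_x = np.zeros(4), np.ones(4)
assert np.isclose(gaussian_w2_sq(mu_u, cov, mu_x, cov), 4.0)
```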