
On Divergence Measures for Bayesian Pseudocoresets
Balhae Kim1, Jungwon Choi1, Seanie Lee1, Yoonho Lee2, Jung-Woo Ha3, Juho Lee1,4
KAIST1, Stanford University2, NAVER AI Lab3, AITRICS4
{balhaekim, jungwon.choi, lsnfamily02}@kaist.ac.kr,
yoonho@stanford.edu, jungwoo.ha@navercorp.com, juholee@kaist.ac.kr
Abstract
A Bayesian pseudocoreset is a small synthetic dataset for which the posterior
over parameters approximates that of the original dataset. While promising, the
scalability of Bayesian pseudocoresets has not yet been validated in realistic problems
such as image classification with deep neural networks. On the other hand, dataset
distillation methods similarly construct a small dataset such that the optimization
using the synthetic dataset converges to a solution with performance competitive
with optimization using full data. Although dataset distillation has been empirically
verified in large-scale settings, the framework is restricted to point estimates, and
its adaptation to Bayesian inference has not been explored. This paper casts
two representative dataset distillation algorithms as approximations to methods for
constructing pseudocoresets by minimizing specific divergence measures: reverse
KL divergence and Wasserstein distance. Furthermore, we provide a unifying view
of such divergence measures in Bayesian pseudocoreset construction. Finally, we
propose a novel Bayesian pseudocoreset algorithm based on minimizing forward
KL divergence. Our empirical results demonstrate that the pseudocoresets con-
structed from these methods reflect the true posterior even in high-dimensional
Bayesian inference problems.
1 Introduction
One of the main ingredients of modern statistical machine learning is large-scale data, which, when handled correctly, enables the learning of powerful and flexible models such as deep neural networks [3, 32]. However, training a neural network on larger datasets usually requires many nontrivial choices, ranging from architecture selection to hyperparameter optimization and infrastructure challenges [33]. Aside from requiring enormous computing resources for training, large-scale data often contains sensitive information that must not be shared publicly [11]. This motivates the construction of coresets [29, 14]: sparse subsets of a large-scale dataset that summarize the essential features of the original dataset. In the context of Bayesian inference, the posterior conditioned on a good coreset should be similar to the exact posterior conditioned on the full dataset.
Coreset construction becomes more difficult with high-dimensional data. A previous work [19] shows that even the best coreset becomes suboptimal at large scales: the divergence between the posterior conditioned on the optimal coreset and the exact posterior grows with the dimension of the data. Motivated by this weakness of strict coresets, Manousakas et al. [19] propose to construct a Bayesian pseudocoreset: a synthetic dataset represented as a set of learnable parameters and trained to minimize the divergence between the coreset posterior and the full-data posterior. Compared to coresets, pseudocoresets scale much better with data dimension and achieve significantly lower posterior approximation error. However, the KL divergence minimization in pseudocoreset learning requires constructing approximate posteriors on the fly during the learning process, making this approach hard to apply to high-dimensional models such as deep neural networks.
On the other hand, dataset distillation [28] aims to distill a large-scale dataset into a small synthetic dataset such that the test performance of a model trained on the distilled data is comparable to that of a model trained on the original data. Such methods have achieved remarkable performance on real-world problems with high-dimensional data and deep neural networks. Although dataset distillation methods share their motivation with pseudocoresets, the two approaches optimize different objectives. Whereas pseudocoresets minimize the divergence between the coreset and full-data posteriors, dataset distillation considers a non-Bayesian setting with heuristically chosen objective functions, such as matching the gradients of loss functions [36] or matching the parameters obtained from optimization trajectories [8]. Such objectives have no direct Bayesian analogue due to their heuristic nature, and more theoretical grounding could advance both our understanding of dataset distillation and the empirical performance of Bayesian pseudocoreset methods.
In this paper, we provide a unifying view of Bayesian pseudocoresets and dataset distillation to take advantage of both approaches. We first study various choices of divergence measures as objectives for learning pseudocoresets. While Manousakas et al. [19] minimize the reverse KL divergence between the pseudocoreset posterior and the full-data posterior, we show that alternative divergence measures such as the forward KL divergence and the Wasserstein distance are also effective in practice. Based on this perspective, we re-interpret existing dataset distillation algorithms [36, 8] as approximations to Bayesian pseudocoresets learned by minimizing reverse KL divergence and Wasserstein distance, with specific choices of variational approximations for the coreset posteriors. This connection justifies the heuristically chosen learning objectives of dataset distillation algorithms and, at the same time, provides a theoretical grounding for using the distilled datasets obtained from such procedures for Bayesian inference. Conversely, Bayesian pseudocoresets benefit from this connection by borrowing various ideas and tricks from dataset distillation algorithms to scale to complex tasks. For instance, the variational posteriors we identify by recasting dataset distillation algorithms as Bayesian pseudocoreset learning can be used for Bayesian pseudocoresets with various choices of divergence measures. We can also adopt the idea of reusing pre-computed optimization trajectories [8], which has already been shown to work at large scales.
We empirically compare pseudocoreset algorithms based on three different divergence measures on high-dimensional image classification tasks. Our results demonstrate that we can efficiently and scalably construct Bayesian pseudocoresets with which MCMC algorithms such as Hamiltonian Monte Carlo (HMC) [10, 22] or Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) [9] accurately approximate the full posterior.
2 Background
2.1 Bayesian Pseudocoresets
Denote the observed data as $x = \{x_n\}_{n=1}^{N}$. Given a probabilistic model indexed by a parameter $\theta$ with some prior distribution $\pi_0$, we are interested in the posterior conditioned on the full data $x$,

$$\pi_x(\theta) = \frac{1}{Z(x)} \exp\!\left(\sum_{n=1}^{N} f(x_n, \theta)\right) \pi_0(\theta) =: \frac{1}{Z(x)} \exp\!\left(\mathbf{1}_N^\top f(x, \theta)\right) \pi_0(\theta), \qquad (1)$$

where $f(x, \theta) := [f(x_1, \theta), \dots, f(x_N, \theta)]^\top$, $\mathbf{1}_N$ is the $N$-dimensional one vector, and $Z(x) = \int \exp(\mathbf{1}_N^\top f(x, \theta))\, \pi_0(\theta)\, \mathrm{d}\theta$ is the partition function. The posterior $\pi_x$ is usually intractable to compute due to $Z(x)$, so we employ approximate inference algorithms. Bayesian pseudocoreset methods [19] aim to construct a synthetic dataset called a pseudocoreset, $u = \{u_m\}_{m=1}^{M}$ with $M \ll N$, such that the posterior of $\theta$ conditioned on it approximates the original posterior $\pi_x$. The pseudocoreset posterior is written as¹

$$\pi_u(\theta) = \frac{1}{Z(u)} \exp\!\left(\sum_{m=1}^{M} f(u_m, \theta)\right) \pi_0(\theta) = \frac{1}{Z(u)} \exp\!\left(\mathbf{1}_M^\top f(u, \theta)\right) \pi_0(\theta), \qquad (2)$$
¹ Manousakas et al. [19] considered two sets of learnable parameters: the coreset elements $u$ and their weights $w = \{w_m\}_{m=1}^{M}$. We empirically found that the weights $w$ have a negligible impact on performance, so we only learn the coreset elements.
where $f(u, \theta)$, $\mathbf{1}_M$, and $Z(u)$ are defined analogously. Manousakas et al. [19] suggest learning $u$ by minimizing the KL divergence between $\pi_u$ and $\pi_x$,

$$u^* = \operatorname*{arg\,min}_{u} D_{\mathrm{KL}}[\pi_u \,\|\, \pi_x], \qquad (3)$$

and show that the gradient of this objective is computed as

$$\nabla_{u_m} D_{\mathrm{KL}}[\pi_u \,\|\, \pi_x] = -\operatorname{Cov}_{\pi_u(\theta)}\!\left(\nabla_{u_m} f(u_m, \theta),\; \mathbf{1}_N^\top f(x, \theta) - \mathbf{1}_M^\top f(u, \theta)\right). \qquad (4)$$

The expectation over the pseudocoreset posterior $\pi_u$ in the above equation is further approximated by Monte-Carlo estimation with an approximate posterior $q_u \approx \pi_u$, obtained, for instance, from a Laplace approximation.
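To make the estimator concrete, here is a minimal PyTorch sketch of Eq. 4 (our own illustration, not the authors' released code). It assumes an isotropic Gaussian surrogate $q_u = \mathcal{N}(\theta_{\mathrm{MAP}}, \sigma^2 I)$ for the approximate posterior and a user-supplied `log_lik(data, theta)` returning per-example log-likelihoods $f(\cdot, \theta)$; the names `theta_map`, `sigma`, and `n_samples` are our assumptions.

```python
import torch

def rkl_coreset_grad(u, x, log_lik, theta_map, sigma=0.01, n_samples=8):
    # Monte-Carlo estimate of Eq. (4): grad_u D_KL[pi_u || pi_x] is the negative
    # covariance, under q_u ~= pi_u, between grad_u f(u_m, theta) and the scalar
    # potential difference 1_N^T f(x, theta) - 1_M^T f(u, theta).
    gs, hs = [], []
    for _ in range(n_samples):
        theta = theta_map + sigma * torch.randn_like(theta_map)  # theta ~ q_u
        u_ = u.detach().requires_grad_(True)
        fu = log_lik(u_, theta).sum()                 # 1_M^T f(u, theta)
        g = torch.autograd.grad(fu, u_)[0]            # grad_u f(u_m, theta), stacked over m
        h = (log_lik(x, theta).sum() - fu).detach()   # potential difference (scalar)
        gs.append(g)
        hs.append(h)
    G, H = torch.stack(gs), torch.stack(hs)
    H = H.view(-1, *([1] * (G.dim() - 1)))            # broadcast over coreset shape
    return -((G * H).mean(0) - G.mean(0) * H.mean(0)) # empirical -Cov(g, h)
```

The resulting gradient can be plugged into any first-order optimizer on $u$; the Laplace approximation mentioned above corresponds to one particular choice of the surrogate's mean and covariance.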
2.2 Dataset Distillation
With a similar motivation to Bayesian pseudocoresets, dataset distillation methods construct a small synthetic dataset $u$, but from the perspective of matching optimal parameters. That is, in dataset distillation, we find $u$ such that

$$u^* = \operatorname*{arg\,min}_{u} d(\theta^*_x, \theta^*_u), \qquad \theta^*_x = \operatorname*{arg\,min}_{\theta} \ell(x, \theta), \qquad \theta^*_u = \operatorname*{arg\,min}_{\theta} \ell(u, \theta), \qquad (5)$$

with a loss function $\ell(\cdot, \theta)$ and a distance function $d(\cdot, \cdot)$ comparing two parameters. In general, optimizing $d(\theta^*_x, \theta^*_u)$ is intractable since it is a bilevel optimization problem requiring $\theta^*_x$ and $\theta^*_u$, so existing works relax the problem in several ways, such as using kernel-based approaches [23, 24] or solving the inner optimization only over a short horizon [36, 34, 8]. Below, we briefly review two representative methods: Dataset Condensation (DC) [36] and Matching Training Trajectories (MTT) [8].
Dataset Condensation
Instead of fully unrolling the inner optimization process, DC computes the matching objective at every inner step, starting from some random parameter $\theta^{(0)}$:

$$u^* = \operatorname*{arg\,min}_{u} \mathbb{E}_{\theta^{(0)}}\!\left[\sum_{t=0}^{T} d_{\cos}\!\left(\nabla_\theta \ell(x, \theta^{(t)}),\, \nabla_\theta \ell(u, \theta^{(t)})\right)\right], \qquad (6)$$

$$\theta^{(t)} = \theta^{(t-1)} - \eta \nabla_\theta \ell(u, \theta^{(t-1)}) \quad \text{for } t \geq 1, \qquad (7)$$

where $d_{\cos}$ denotes the cosine-similarity-based distance between two gradients. To implement this objective, we build a trajectory of parameters starting from a random $\theta^{(0)}$ with the current distilled dataset $u$. We then compute the gradient matching objective for each layer at each step of the trajectory to update $u$. Despite its appealing simplicity, this algorithm may suffer from short-horizon bias due to the limited trajectory length.
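The following PyTorch sketch illustrates one outer iteration of this procedure under Eqs. 6-7. It is our simplified rendering rather than the reference DC implementation; `net`, `loss_fn`, the labels `y_u`/`y_x`, and the step sizes are assumed interfaces.

```python
import torch

def dc_outer_iteration(u, y_u, x, y_x, net, loss_fn, eta=0.01, lr_u=0.1, T=10):
    # One outer iteration of gradient matching (Eqs. 6-7): at each inner step,
    # match the gradients of the real-data and distilled-data losses layer-wise,
    # update the distilled images, then take the inner SGD step on u only.
    u = u.detach().requires_grad_(True)
    params = list(net.parameters())
    for t in range(T):
        g_x = torch.autograd.grad(loss_fn(net(x), y_x), params)  # treated as constants
        g_u = torch.autograd.grad(loss_fn(net(u), y_u), params, create_graph=True)
        match = sum(1.0 - torch.cosine_similarity(gx.flatten(), gu.flatten(), dim=0)
                    for gx, gu in zip(g_x, g_u))                 # d_cos, summed over layers
        grad_u = torch.autograd.grad(match, u)[0]
        with torch.no_grad():
            u -= lr_u * grad_u                                   # outer update of u
        g_step = torch.autograd.grad(loss_fn(net(u), y_u), params)
        with torch.no_grad():
            for p, g in zip(params, g_step):
                p -= eta * g                                     # inner SGD step, Eq. (7)
    return u.detach()
```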
Matching Training Trajectories
MTT extends the single-step gradient matching of DC and considers longer inner optimization trajectories. Starting from a randomly initialized parameter $\theta^{(0)}$, MTT takes multiple inner optimization steps with both $u$ and $x$, and minimizes the normalized $L_2$ distance between the two parameters at the end of the learning trajectories:

$$u^* = \operatorname*{arg\,min}_{u} \mathbb{E}_{\theta^{(0)}}\!\left[d_2\!\left(\theta^{(L_x)}_x, \theta^{(L_u)}_u\right)\right], \qquad \theta^{(0)}_x = \theta^{(0)}_u = \theta^{(0)}, \qquad (8)$$

$$\theta^{(t)}_x = \theta^{(t-1)}_x - \eta_x \nabla_\theta \ell(x, \theta^{(t-1)}_x) \quad \text{for } t = 1, \dots, L_x, \qquad (9)$$

$$\theta^{(t)}_u = \theta^{(t-1)}_u - \eta_u \nabla_\theta \ell(u, \theta^{(t-1)}_u) \quad \text{for } t = 1, \dots, L_u, \qquad (10)$$

where $d_2(\cdot, \cdot)$ is the normalized $L_2$ distance between parameters. Unlike DC, this algorithm starts from a common initial parameter $\theta^{(0)}$, keeps track of two separate optimization trajectories (one with $u$ and another with $x$), and performs the outer update of $u$ only after $L_x$ and $L_u$ inner steps. To reduce the heavy computational cost of the inner updates with the full dataset $x$, MTT employs mini-batch SGD for the inner steps, in addition to saving and re-using precomputed SGD trajectories to train $u$. The use of precomputed SGD trajectories is similar to that of replay buffers in reinforcement learning [17, 20, 1], and allows MTT to handle long inner-loop sequences. While MTT significantly improves over DC, presumably due to its longer inner optimization horizon, the algorithm requires large memory even for a moderate number of inner steps $L_u$, because it requires backpropagation through all inner steps with $u$.
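Below is a minimal sketch of one MTT outer step following Eqs. 8-10. It assumes a functional loss `loss_flat(theta, u, y_u)` that evaluates $\ell(u, \theta)$ at a flattened parameter vector, and a buffer `expert_traj` holding a saved full-data SGD trajectory $\theta^{(t)}_x$; these names and the particular normalization are our assumptions, not the published implementation.

```python
import torch

def mtt_outer_step(u, y_u, loss_flat, expert_traj, t0, L_u, L_x,
                   eta_u=0.01, lr_u=0.1):
    # One outer step of MTT (Eqs. 8-10): unroll L_u differentiable SGD steps on
    # the distilled data from a saved expert checkpoint, then match the endpoint
    # against the expert parameters L_x steps further along the saved trajectory.
    u = u.detach().requires_grad_(True)
    theta = expert_traj[t0].clone().requires_grad_(True)   # common init theta^(0)
    target = expert_traj[t0 + L_x]                         # theta_x^(L_x) from the buffer
    for _ in range(L_u):                                   # student trajectory, Eq. (10)
        g = torch.autograd.grad(loss_flat(theta, u, y_u), theta, create_graph=True)[0]
        theta = theta - eta_u * g
    # Squared L2 distance between endpoints, normalized by how far the expert moved.
    d2 = (theta - target).pow(2).sum() / (expert_traj[t0] - target).pow(2).sum()
    grad_u = torch.autograd.grad(d2, u)[0]                 # backprop through all L_u steps
    with torch.no_grad():
        u -= lr_u * grad_u
    return u.detach()
```

The backpropagation through all $L_u$ unrolled steps in this sketch is exactly the memory bottleneck noted above.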
3 On Divergence Measures for Bayesian Pseudocoresets
While the Bayesian pseudocoreset was originally obtained by minimizing the reverse KL divergence, in principle we can minimize an arbitrary divergence measure $D(\cdot, \cdot)$ to achieve the same goal:

$$u^* = \operatorname*{arg\,min}_{u} D(\pi_x, \pi_u). \qquad (11)$$

In this section, we study three different choices of divergence measure, along with their relation to dataset distillation methods.
3.1 Reverse KL Divergence
Manousakas et al. [19] proposed to minimize the reverse KL divergence, with a Laplace approximation used to obtain the pseudocoreset posterior $\pi_u$. Here, we show that it is possible to recast Dataset Condensation (DC) as reverse KL divergence minimization with an alternative choice of variational posterior. Let $\theta^{(t)}$ be a parameter in a learning trajectory with loss function $\ell(u, \theta) := -\mathbf{1}_M^\top f(u, \theta)$. Provided that $\theta^{(t)}$ is sufficiently close to a local optimum, we can construct a naïve approximation of $\pi_u$ as

$$q_u(\theta) = \mathcal{N}(\theta; \theta^{(t)}, \Sigma), \qquad \theta^{(t)} = \theta^{(t-1)} - \eta \nabla_\theta \ell(u, \theta^{(t-1)}). \qquad (12)$$

Under this setting, we can show that one gradient descent step with respect to the reverse KL divergence $D_{\mathrm{KL}}[\pi_u \,\|\, \pi_x]$ is equivalent to one step of gradient matching in DC when the parameter being optimized is close to a local optimum. We note that while the Gaussian posterior is a simple approximation class, it is commonly used due to its demonstrated usefulness in the literature [18], for example in connection with the Bernstein-von Mises theorem [16, 27]. However, we also note that Gaussian approximations are not necessary for establishing the connection between dataset distillation and Bayesian pseudocoresets; they are simply a special case that simplifies the analysis.

Proposition 3.1. Let $q_u(\theta)$ be set as in Eq. 12. Assume $\theta^{(t-1)}$ is close enough to a local optimum, so that we have $\|\theta^{(t-1)} - \theta^{(t)}\| \ll 1$. Then we have

$$\nabla_u D_{\mathrm{KL}}[\pi_u \,\|\, \pi_x] \approx -\eta\, \nabla_u\!\left(\nabla_\theta \ell(x, \theta^{(t-1)})^\top \nabla_\theta \ell(u, \theta^{(t-1)})\right). \qquad (13)$$

The proof is given in Appendix A. In other words, when sufficiently close to a local optimum, DC can be interpreted as a special instance of Bayesian pseudocoreset construction via reverse KL minimization with a Gaussian approximation of the pseudocoreset posterior $\pi_u$. Throughout the paper, we refer to the Bayesian pseudocoreset with reverse KL minimization as BPC-rKL. To implement Algorithm 1 of [19] for image datasets in a simple way, we approximated $\pi_u$ with a Gaussian distribution; please refer to Appendix B.2 for the detailed algorithm.
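As a concrete illustration of Proposition 3.1, the approximate BPC-rKL gradient can be computed by differentiating the negative inner product of the two loss gradients with respect to $u$. This is our own sketch under the same assumed interfaces as before; note that the only difference from the DC update is the unnormalized inner product in place of the cosine-based distance, which is consistent with the proposition's claim.

```python
import torch

def bpc_rkl_grad_approx(u, y_u, x, y_x, net, loss_fn, eta=0.01):
    # Proposition 3.1: near a local optimum,
    #   grad_u D_KL[pi_u || pi_x] ~= -eta * grad_u ( g_x^T g_u ),
    # where g_x and g_u are gradients of the real- and distilled-data losses.
    u = u.detach().requires_grad_(True)
    params = list(net.parameters())
    g_x = torch.autograd.grad(loss_fn(net(x), y_x), params)                    # constants
    g_u = torch.autograd.grad(loss_fn(net(u), y_u), params, create_graph=True)
    inner = sum((gx * gu).sum() for gx, gu in zip(g_x, g_u))                   # g_x^T g_u
    return -eta * torch.autograd.grad(inner, u)[0]
```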
3.2 Wasserstein Distance
We can instead choose $D(\cdot, \cdot)$ to be the Wasserstein distance between $\pi_u$ and $\pi_x$. While the Wasserstein distance is intractable in general, by approximating both $\pi_u$ and $\pi_x$ with Gaussian distributions we can obtain a closed-form expression for this distance. Specifically, let $\theta_u$ and $\theta_x$ be two parameters sufficiently close to local optima. Then we can construct crude approximations of $\pi_u$ and $\pi_x$ as

$$\pi_u(\theta) \approx q_u(\theta) := \mathcal{N}(\theta; \theta_u, \Sigma_u), \qquad \pi_x(\theta) \approx q_x(\theta) := \mathcal{N}(\theta; \theta_x, \Sigma_x). \qquad (14)$$

The squared 2-Wasserstein distance between these two distributions is then computed as

$$W_2^2(q_u, q_x) = \|\theta_u - \theta_x\|_2^2 + \operatorname{Tr}\!\left(\Sigma_x + \Sigma_u - 2\left(\Sigma_u^{1/2} \Sigma_x \Sigma_u^{1/2}\right)^{1/2}\right). \qquad (15)$$

Now consider a single outer step of MTT. Starting from a common initialization $\theta^{(0)}$, the algorithm takes gradient steps with both $x$ and $u$ to obtain $\theta^{(L_x)}_x$ and $\theta^{(L_u)}_u$. If we set $q_u(\theta) = \mathcal{N}(\theta; \theta^{(L_u)}_u, \Sigma)$ and $q_x(\theta) = \mathcal{N}(\theta; \theta^{(L_x)}_x, \Sigma)$ with a common covariance $\Sigma$, then the Wasserstein distance reduces to $\|\theta^{(L_x)}_x - \theta^{(L_u)}_u\|_2^2$, which is proportional to the $L_2$ objective optimized in MTT. Analogously to BPC-rKL for the reverse KL divergence, we refer to the Bayesian pseudocoreset with Wasserstein distance minimization as BPC-W.
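For intuition, Eq. 15 can be evaluated numerically for small covariance matrices; the following numpy/scipy sketch (illustrative only; the matrix square roots are impractical at neural-network scale, which is precisely what the shared-covariance simplification sidesteps) also verifies the reduction to the squared mean difference when $\Sigma_u = \Sigma_x$.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_sq(mu_u, cov_u, mu_x, cov_x):
    # Squared 2-Wasserstein distance between two Gaussians, Eq. (15).
    root_u = sqrtm(cov_u)
    cross = np.real(sqrtm(root_u @ cov_x @ root_u))  # (Sig_u^{1/2} Sig_x Sig_u^{1/2})^{1/2}
    mean_term = np.sum((mu_u - mu_x) ** 2)
    trace_term = np.trace(cov_x + cov_u - 2.0 * cross)
    return mean_term + trace_term

# With a shared covariance the trace term vanishes and the distance reduces to
# ||theta_u - theta_x||_2^2, recovering the (unnormalized) MTT objective.
cov = 0.3 * np.eye(4)
mu_u, mu_x = np.zeros(4), np.ones(4)
assert np.isclose(gaussian_w2_sq(mu_u, cov, mu_x, cov), 4.0)
```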