Efficient Bayesian Updates for Deep Learning via
Laplace Approximations
Denis Huseljic∗ Marek Herde Lukas Rauch Paul Hahn Zhixin Huang
Daniel Kottke Stephan Vogt Bernhard Sick
University of Kassel, Intelligent Embedded Systems
dhuseljic@uni-kassel.de
Abstract
Since training deep neural networks takes significant computational resources,
extending the training dataset with new data is difficult, as it typically requires
complete retraining. Moreover, specific applications do not allow costly retraining
due to time or computational constraints. We address this issue by proposing a novel Bayesian update method for deep neural networks based on a last-layer Laplace approximation. Concretely, we leverage second-order optimization techniques on
the Gaussian posterior distribution of a Laplace approximation, computing the
inverse Hessian matrix in closed form. This way, our method allows for fast and
effective updates upon the arrival of new data in a stationary setting. A large-scale
evaluation study across different data modalities confirms that our updates are a
fast and competitive alternative to costly retraining. Furthermore, we demonstrate the method's applicability in a deep active learning scenario by using our update to improve existing selection strategies.
1 Introduction
Training deep neural networks (DNNs) often demands significant time and computational resources.
Moreover, when extending a dataset with new data, DNNs require complete retraining, which involves
both the new and previously used data to prevent issues such as catastrophic forgetting [42]. Although improved generalization performance typically justifies exhaustive retraining, it may not always be feasible immediately after new data arrives. Key scenarios where retraining is beneficial but impractical include: (1) when data arrives sequentially and an immediate update is necessary, e.g., in active learning (AL) [48] or when working on data streams [46], (2) when only limited computational resources are available, e.g., training on embedded hardware [51], and (3) when there are privacy concerns, which restrict sharing new data with distributed computing units [51]. In these scenarios, it is crucial to devise methods that allow DNNs to incorporate new data quickly and effectively.
While numerous studies have developed methods to update DNNs effectively [19, 42, 50], these methods typically assume a non-stationary setting in which the respective data distribution changes over time [52]. As a result, these methods often focus on situations where new data arrives in relatively large amounts [42], neglecting scenarios where an immediate update with a small number of instances might be beneficial. In contrast, we assume a stationary setting with the goal of immediately updating the model upon the arrival of new data, even in the case of a single data point.
∗Corresponding Author
Preprint. Under review.
arXiv:2210.06112v2 [cs.LG] 12 Jul 2024

[Figure 1 panels: (a) Original Model, (b) MC-based Update (Fast), (c) Our Update (Fast), (d) Retraining (Slow)]
Figure 1: Comparison of different BNNs on the two moons dataset. (a) The original model resulting from training on the original dataset. (b) The typical MC-based update applied to the original model [21, 50]. (c) Our update applied to the original model. (d) The model resulting from retraining on both the original and the new dataset.

As an example scenario, we consider AL throughout this article. A popular approach in AL is to use a one-step-look-ahead to select instances that significantly change model predictions [44, 50]. This approach assesses how model predictions would change when including a specific instance in the labeled pool and updating the model. Consequently, it requires a highly efficient update method, as it involves updates for all candidates of a large dataset. To make this feasible, several AL selection strategies utilize a Bayesian neural network (BNN) and rely on Monte Carlo (MC)-based updates [50]. However, while this update is fast, it falls short in performance compared to a retrained model (cf. Fig. 1). In such a scenario, a fast and effective update would enhance these strategies.
In this article, we propose a novel Bayesian update method. Like [50, 42, 21], we consider BNNs since they are particularly well-suited for our setting. The posterior distribution can be updated in a theoretically sound way by treating it as the prior distribution when new data becomes available. However, instead of relying on MC-based BNNs (e.g., MC-Dropout), we transform an arbitrary DNN into a BNN by employing a last-layer Laplace approximation (LA) [7], giving us a closed-form
expression for the posterior distribution. Then, we leverage second-order optimization techniques
on the Gaussian posterior distribution of the LA. To ensure low computational complexity, we
compute the required inverse Hessian in closed form. The resulting update is both fast – it can be
used in the previously described scenarios – and effective – its performance closely aligns with
retraining. Moreover, our update can be combined with more recently developed BNNs such as
Spectral Normalized Neural Gaussian Process (SNGP) [30].
Extensive studies across different data modalities, including image and text datasets, demonstrate
that our updates outperform the typically employed MC-based ones [21, 50] in terms of speed and
performance. Further, to demonstrate the applicability in one of the previously mentioned scenarios,
we revisit AL and propose a simple and effective idea to improve existing selection strategies by
immediately making use of acquired labels.
Contributions
• We propose a novel update method for DNNs by employing a last-layer LA and second-order optimization techniques, suitable for scenarios where data arrives in small quantities and immediate model updates are required.
• We conduct a comprehensive evaluation of our update method across different data modalities, demonstrating superior performance and speed compared to the MC-based updates.
• We propose a simple framework to improve existing AL strategies employing our updates.
2 Related Work
Similar to our setting, continual learning [8] updates models by exclusively training with data from a new task, addressing the challenge of retaining knowledge from previously learned tasks. Popular techniques [42, 19] use conventional first-order optimization methods, incorporating a regularization term to counteract catastrophic forgetting. More specifically, [42] and [19] derive a regularization term from an LA that penalizes large deviations from prior knowledge. Ebrahimi et al. [11] exploit uncertainty estimates of BNNs to dynamically adjust learning rates during training. Unlike our method, these approaches require training over multiple epochs, preventing immediate updates.
Additionally, they require large amounts of new data (thousands of data points) per task, whereas our update method is designed for smaller datasets, ranging from a single to hundreds of data points.
More closely related to our work is online learning [17], which aims to sequentially and efficiently update models from incoming data streams. Traditional approaches often focus on linear [57, 6] or shallow [22, 45] models with maximum-margin classification. However, applying online learning to DNNs remains difficult due to issues such as convergence, vanishing gradients, and large model sizes [46, 27]. To address these challenges, Sahoo et al. [46] proposed a method that modifies a DNN's architecture to facilitate updates. We argue that this approach is restrictive in state-of-the-art settings, given the increasing reliance on pretrained architectures. Most similar to our setting is the work on Bayesian online inference by Kirsch et al. [21]. The core idea is to sample hypotheses, e.g., via MC-Dropout, from the posterior distribution of a BNN and to weight their importance according to the respective likelihoods of sequentially arriving data. We refer to these types of updates as MC-based updates. Their empirical results raised concerns regarding the applicability of such MC-based updates in high-dimensional parameter spaces.
BNNs [53, 12] induce a prior distribution over their parameters, i.e., weights, and learn a posterior distribution given training data. Predictions are made by marginalizing this posterior. BNN types differ mainly in their probabilistic model and in how they sample from the posterior distribution [18]. MC-Dropout [13], one of the most prominent BNN types, builds on dropout, a regularization technique performed during training. Applying dropout during evaluation, called MC-Dropout, yields a distribution over predictions that corresponds to a variational distribution in the parameter space. Due to its simplicity and training efficiency, MC-Dropout is often used for comparison. However, its inference is inefficient, and its predictions may not properly represent uncertainty estimates [37]. Deep ensembles [25] consist of multiple DNNs. Combined with regularization, these DNNs can be interpreted as samples (different modes) from the parameters' posterior distribution. Ensembles typically provide better uncertainty estimates than MC-Dropout but require more computational capacity during training [37]. A BNN obtained via LA [43] is computationally more efficient than MC-Dropout and deep ensembles. It specifies an approximate Gaussian posterior distribution, where the maximum a posteriori (MAP) estimate defines the mean and the inverse of the negative log posterior's Hessian at the MAP estimate corresponds to the covariance matrix. As computing this Hessian is expensive for large DNNs, LA is often applied only to the last layer [7]. SNGP [30] combines LA with random Fourier features [38] and spectral normalization [34] to approximate a Gaussian process.
AL poses significant challenges in practice due to the computational cost of the retraining required in each cycle. The emergence of one-step-look-ahead strategies [44, 50, 56] further highlights the importance of efficient model updates. Notably, BEMPS [50] evaluates all possible candidates (i.e., instances) in a dataset to select those that are anticipated to alter the model's predictions substantially. However, this strategy utilizes an MC-based BNN with deep ensembles, relying on MC-based updates. It assumes that all hypotheses are equally likely to explain the data before an update step. While this update is fast, its performance is inferior to that of a retrained model.
3 Fast Bayesian Updates for Deep Neural Networks
In this section, we present our new Bayesian update method. First, we introduce the general concept
of Bayesian updates and the commonly applied MC-based Bayesian updates [21, 50]. Afterward, we
propose our novel method focusing on an efficient update of the Gaussian posterior distribution via
last-layer LAs. For an introduction to LA, we refer to [7].
3.1 Bayesian Updates
We focus on classification problems with instance space $\mathcal{X}$ and label space $\mathcal{Y} = \{0, \dots, K-1\}$. The primary goal in our setting is to efficiently incorporate the information of new instance-label pairs $\mathcal{D}' = \{(x_n, y_n)\}_{n=1}^{N} \subset \mathcal{X} \times \mathcal{Y}$ into a BNN trained on a dataset $\mathcal{D} \subset \mathcal{X} \times \mathcal{Y}$. Retraining the entire network on the extended dataset $\mathcal{D} \cup \mathcal{D}'$ results in high computational cost for a large dataset $\mathcal{D}$. Conversely, using only the new data can cause catastrophic forgetting [42].
Figure 2: The left plot shows the predicted probabilities of the positive class for each hypothesis (colored lines) drawn from a BNN as well as the mean (black solid line) and standard deviation (black dashed line) of its predictive distribution. The right plot shows updated weights for each hypothesis and the predictive distribution after observing additional instances (green).

For this purpose, we employ BNNs [12] with Bayesian updates [35] as an efficient alternative to retraining. The main idea of BNNs is to estimate the posterior distribution $p(\omega \mid \mathcal{D})$ over the parameters $\omega$ given the observed training data $\mathcal{D}$ using Bayes' theorem. The obtained posterior distribution over the parameters can then be used to specify the predictive distribution over a new instance's class membership via marginalization:
$$p(y \mid x, \mathcal{D}) = \mathbb{E}_{p(\omega \mid \mathcal{D})}\big[p(y \mid x, \omega)\big] = \int p(y \mid x, \omega)\, p(\omega \mid \mathcal{D})\, \mathrm{d}\omega. \quad (1)$$
Thereby, the likelihood $p(y \mid x, \omega) = [\operatorname{softmax}(f_{\omega}(x))]_y$ denotes the probabilistic output of a DNN with parameters $\omega$, where $f_{\omega}\colon \mathcal{X} \to \mathbb{R}^K$ is a function outputting class-wise logits.²
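For DNNs, this integral is typically intractable; MC-based BNNs approximate it by averaging over $M$ samples drawn from an approximate posterior $q(\omega \mid \mathcal{D})$:
$$p(y \mid x, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} p(y \mid x, \omega_m), \qquad \omega_m \sim q(\omega \mid \mathcal{D}).$$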
The formulation in Eq. (1) provides a theoretically sound way to obtain updated predictions. In particular, this is because the probabilistic outputs $p(y \mid x, \omega)$ do not directly depend on the training data $\mathcal{D}$. Consequently, to obtain an updated predictive distribution, we do not need to update the parameters $\omega$ directly but only the posterior distribution $p(\omega \mid \mathcal{D})$. The updated posterior distribution $p(\omega \mid \mathcal{D}, \mathcal{D}')$ is found through Bayes' theorem, where the current posterior distribution $p(\omega \mid \mathcal{D})$ is considered the prior and multiplied with the likelihood $p(y \mid x, \omega)$ per instance-label pair $(x, y) \in \mathcal{D}'$. As instances in $\mathcal{D}$ and $\mathcal{D}'$ are assumed to be independently distributed, we can simplify the likelihood and reformulate the parameter distribution as follows³:
$$p(\omega \mid \mathcal{D}, \mathcal{D}') \propto p(\omega \mid \mathcal{D})\, p(\mathcal{D}' \mid \mathcal{D}, \omega) \overset{\text{i.i.d.}}{=} p(\omega \mid \mathcal{D})\, p(\mathcal{D}' \mid \omega) = p(\omega \mid \mathcal{D}) \prod_{(x, y) \in \mathcal{D}'} p(y \mid x, \omega). \quad (2)$$
We refer to Eq. (2) as the Bayesian update.
The most common realization [21, 50] of this update is through MC-based BNNs, such as MC-Dropout and deep ensembles. These BNNs rely on samples (or hypotheses) $\omega_1, \dots, \omega_M$ drawn from an approximate posterior $q(\omega \mid \mathcal{D})$. Research [56, 50] assumes that all hypotheses are equally likely to explain the observed data and have the same probability before updating. By updating the posterior distribution through Eq. (2), hypotheses that are more likely given the new data receive higher weights. We refer to these as MC-based updates, with a formal definition given in Appendix A. Figure 2 illustrates this concept, where different hypotheses $\omega_1, \dots, \omega_M \sim q(\omega \mid \mathcal{D})$ are shown. Each hypothesis represents a possible true solution for the learning task (white instances). When new data (green instances) arrives, we weigh each hypothesis by its likelihood of explaining the new data and obtain an updated prediction without retraining. This results in an updated predictive distribution, shown in bold in Fig. 2 (right).
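To make this reweighting concrete, the following Python sketch shows one possible implementation; the function mc_bayesian_update and its array arguments are illustrative assumptions rather than the formal definition from Appendix A, and hypotheses are assumed to start with uniform weights as described above.

import numpy as np

def mc_bayesian_update(probs_new, probs_query, labels_new):
    """Reweight M posterior samples (hypotheses) by their likelihood on new data.

    probs_new:   array (M, N, K), class probabilities of each hypothesis for the N new instances
    probs_query: array (M, Q, K), class probabilities of each hypothesis for Q query instances
    labels_new:  array (N,), integer labels of the new instances

    Returns the updated predictive distribution of shape (Q, K), i.e., a
    likelihood-weighted average of the hypotheses' predictions (cf. Eq. (2)).
    """
    num_new = probs_new.shape[1]
    # Log-likelihood of the new data under each hypothesis (i.i.d. assumption).
    log_lik = np.log(probs_new[:, np.arange(num_new), labels_new] + 1e-12).sum(axis=1)
    # Normalize to importance weights, starting from uniform weights 1/M.
    log_w = log_lik - log_lik.max()
    weights = np.exp(log_w) / np.exp(log_w).sum()
    # Likelihood-weighted predictive distribution instead of a plain average.
    return np.einsum("m,mqk->qk", weights, probs_query)

Hypotheses that explain the new labels poorly receive near-zero weight, mirroring the behavior sketched in Fig. 2 (right); the update itself requires only forward passes, no gradient steps.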
3.2 Fast Approximations of Bayesian Updates for Deep Neural Networks
Our update method is based on a combination of two concepts. First, instead of MC-based BNNs, we
suggest using LAs on the last layer of a DNN. Second, we directly modify the approximate posterior
distribution of the LA, providing a much more flexible way to adapt it to new data than reweighting.
In the following, we explain each component in detail. For now, we focus on binary classification with $K = 2$ and refer to Appendix C for an extension to multi-class classification.
Last-layer LA: LAs approximate the (intractable) posterior distribution $p(\omega \mid \mathcal{D})$ with a Gaussian centered on the maximum a posteriori (MAP) estimate, whose covariance is the inverse of the negative Hessian of the log posterior evaluated at the MAP estimate.
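As a rough illustration for the binary case, the following minimal NumPy sketch computes such a last-layer Gaussian posterior, assuming a sigmoid likelihood and an isotropic Gaussian prior; the function last_layer_laplace and its arguments are illustrative and do not reflect our exact implementation.

import numpy as np

def last_layer_laplace(features, w_map, prior_precision=1.0):
    """Closed-form Gaussian posterior over the last-layer weights (binary classification).

    features:        array (N, D), penultimate-layer features of the training data
    w_map:           array (D,), MAP estimate of the last-layer weights from standard training
    prior_precision: scalar precision of an isotropic Gaussian prior on the weights

    Returns the mean (the MAP estimate) and the covariance of the Laplace
    approximation, i.e., the inverse Hessian of the negative log posterior at w_map.
    """
    probs = 1.0 / (1.0 + np.exp(-(features @ w_map)))   # sigmoid outputs at the MAP estimate
    s = probs * (1.0 - probs)                            # per-instance curvature of the log-likelihood
    hessian = (features * s[:, None]).T @ features + prior_precision * np.eye(features.shape[1])
    covariance = np.linalg.inv(hessian)
    return w_map, covariance

Because both the Hessian and its inverse are available in closed form for this last-layer model, the resulting Gaussian can later be adapted to new data without retraining the underlying feature extractor.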
²We denote the $i$-th element of a vector $b$ as $[b]_i = b_i$.
³We denote $p(y_1, \dots, y_N \mid x_1, \dots, x_N, \omega)$ with $\mathcal{D}' = \{(x_n, y_n)\}_{n=1}^{N}$ as $p(\mathcal{D}' \mid \omega)$.