Efficient Bayesian Updates for Deep Learning via
Laplace Approximations
Denis Huseljic∗ Marek Herde Lukas Rauch Paul Hahn Zhixin Huang
Daniel Kottke Stephan Vogt Bernhard Sick
University of Kassel, Intelligent Embedded Systems
dhuseljic@uni-kassel.de
Abstract
Since training deep neural networks takes significant computational resources,
extending the training dataset with new data is difficult, as it typically requires
complete retraining. Moreover, specific applications do not allow costly retraining
due to time or computational constraints. We address this issue by proposing a novel Bayesian update method for deep neural networks based on a last-layer Laplace approximation. Concretely, we leverage second-order optimization techniques on
the Gaussian posterior distribution of a Laplace approximation, computing the
inverse Hessian matrix in closed form. This way, our method allows for fast and
effective updates upon the arrival of new data in a stationary setting. A large-scale
evaluation study across different data modalities confirms that our updates are a
fast and competitive alternative to costly retraining. Furthermore, we demonstrate the method's applicability in a deep active learning scenario by using our update to improve existing selection strategies.
1 Introduction
Training deep neural networks (DNNs) often demands significant time and computational resources.
Moreover, when extending a dataset with new data, DNNs require complete retraining, which involves
both the new and previously used data to prevent issues such as catastrophic forgetting [42]. Although improved generalization performance typically justifies exhaustive retraining, it may not always be feasible immediately after new data arrives. Key scenarios where retraining is beneficial but impractical include: (1) when data arrives sequentially and an immediate update is necessary, e.g., in active learning (AL) [48] or when working on data streams [46], (2) when only limited computational resources are available, e.g., training on embedded hardware [51], and (3) when there are privacy concerns, which restrict sharing new data with distributed computing units [51]. In these scenarios, it is crucial to devise methods that allow DNNs to incorporate new data quickly and effectively.
While numerous studies have developed methods to update DNNs effectively [19, 42, 50], these methods typically assume a non-stationary setting in which the respective data distribution changes over time [52]. As a result, these methods often focus on situations where new data arrives in relatively large amounts [42], neglecting scenarios where an immediate update with a small number of instances might be beneficial. In contrast, we assume a stationary setting with the goal of immediately updating the model upon the arrival of new data, even in the case of a single data point.
∗Corresponding Author
Preprint. Under review.
arXiv:2210.06112v2 [cs.LG] 12 Jul 2024

[Figure 1 panels: (a) Original Model, (b) MC-based Update (Fast), (c) Our Update (Fast), (d) Retraining (Slow)]
Figure 1: Comparison of different BNNs on the two moons dataset. (a) The original model resulting from training on the original dataset. (b) The typical MC-based update applied to the original model [21, 50]. (c) Our update applied to the original model. (d) The model resulting from retraining on both the original and the new dataset.

As an example scenario, we consider AL throughout this article. A popular approach in AL is to use a one-step-look-ahead to select instances that significantly change model predictions [44, 50]. This approach assesses how model predictions would change when including a specific instance in the labeled pool and updating the model. Consequently, it requires a highly efficient update method, as it involves updates for all candidates of a large dataset. To make this feasible, several AL selection strategies utilize a Bayesian neural network (BNN) and rely on Monte Carlo (MC)-based updates [50]. However, while this update is fast, it falls short in performance compared to a retrained model (cf. Fig. 1). In such a scenario, a fast and effective update would enhance these strategies.
In this article, we propose a novel Bayesian update method. Like [50, 42, 21], we consider BNNs since they are particularly well-suited for our setting. The posterior distribution can be updated in a theoretically sound way by treating it as the prior distribution when new data becomes available. However, instead of relying on MC-based BNNs (e.g., MC-Dropout), we transform an arbitrary DNN into a BNN by employing a last-layer Laplace approximation (LA) [7], giving us a closed-form
expression for the posterior distribution. Then, we leverage second-order optimization techniques
on the Gaussian posterior distribution of the LA. To ensure low computational complexity, we
compute the required inverse Hessian in closed form. The resulting update is both fast – it can be
used in the previously described scenarios – and effective – its performance closely aligns with
retraining. Moreover, our update can be combined with more recently developed BNNs such as
Spectral Normalized Neural Gaussian Process (SNGP) [30].
Extensive studies across different data modalities, including image and text datasets, demonstrate
that our updates outperform the typically employed MC-based ones [21, 50] in terms of speed and
performance. Further, to demonstrate the applicability in one of the previously mentioned scenarios,
we revisit AL and propose a simple and effective idea to improve existing selection strategies by
immediately making use of acquired labels.
Contributions
• We propose a novel update method for DNNs by employing a last-layer LA and second-order optimization techniques, suitable for scenarios where data arrives in small quantities and immediate model updates are required.
• We conduct a comprehensive evaluation of our update method across different data modalities, demonstrating superior performance and speed compared to the MC-based updates.
• We propose a simple framework to improve existing AL strategies employing our updates.
2 Related Work
Similar to our setting, continual learning [8] updates models by exclusively training with data from a new task, addressing the challenge of retaining knowledge from previously learned tasks. Popular techniques [42, 19] use conventional first-order optimization methods, incorporating a regularization term to counteract catastrophic forgetting. More specifically, [42] and [19] derive a regularization term from an LA that penalizes large deviations from prior knowledge. Ebrahimi et al. [11] exploit uncertainty estimates of BNNs to dynamically adjust learning rates during training. Unlike our method, these approaches require training over multiple epochs, preventing immediate updates.
Additionally, they require large amounts of new data (thousands of data points) per task, whereas our update method is designed for smaller datasets, ranging from a single to hundreds of data points.
More closely related to our work is online learning [17], which aims to sequentially and efficiently update models from incoming data streams. Traditional approaches often focus on linear [57, 6] or shallow [22, 45] models with maximum-margin classification. However, applying online learning to DNNs remains difficult due to issues such as convergence, vanishing gradients, and large model sizes [46, 27]. To address these challenges, Sahoo et al. [46] proposed a method that modifies a DNN's architecture to facilitate updates. We argue that this approach is restrictive in state-of-the-art settings, given the increasing reliance on pretrained architectures. Most similar to our setting is the work on Bayesian online inference by Kirsch et al. [21]. The core idea is to sample hypotheses, e.g., via MC-Dropout, from the posterior distribution of a BNN and to weight their importance according to the respective likelihoods of sequentially arriving data. We refer to these types of updates as MC-based updates. Their empirical results raised concerns regarding the applicability of such MC-based updates in high-dimensional parameter spaces.
BNNs [53, 12] induce a prior distribution over their parameters, i.e., weights, and learn a posterior distribution given training data. Predictions are made by marginalizing this posterior. BNN types differ mainly in their probabilistic model and in how they sample from the posterior distribution [18]. MC-Dropout [13], one of the most prominent BNN types, builds on dropout, a regularization technique performed during training. Applying dropout during evaluation, called MC-Dropout, yields a distribution over predictions that corresponds to a variational distribution in the parameter space. Due to its simplicity and training efficiency, MC-Dropout is often used for comparison. However, its inference is inefficient, and its predictions may not properly represent uncertainty estimates [37]. Deep ensembles [25] consist of multiple DNNs. Combined with regularization, these DNNs can be interpreted as samples (different modes) from the parameters' posterior distribution. Ensembles typically provide better uncertainty estimates than MC-Dropout but require more computational capacity during training [37]. A BNN obtained via LA [43] is computationally more efficient than MC-Dropout and deep ensembles. It specifies an approximate Gaussian posterior distribution, where the maximum a posteriori (MAP) estimate defines the mean and the inverse of the negative log posterior's Hessian at the MAP estimate corresponds to the covariance matrix. As computing this Hessian is expensive for large DNNs, LA is often applied only to the last layer [7]. SNGP [30] combines LA with random Fourier features [38] and spectral normalization [34] to approximate a Gaussian process.
AL poses significant challenges in practice due to the computational cost of the retraining required in each cycle. The emergence of one-step-look-ahead strategies [44, 50, 56] further highlights the importance of efficient model updates. Notably, BEMPS [50] evaluates all possible candidates (i.e., instances) in a dataset to select those that are anticipated to alter the model's predictions substantially. However, this strategy utilizes an MC-based BNN with deep ensembles, relying on MC-based updates. It assumes that all hypotheses are equally likely to explain the data before an update step. While this update is fast, its performance is inferior to that of a retrained model.
3 Fast Bayesian Updates for Deep Neural Networks
In this section, we present our new Bayesian update method. First, we introduce the general concept
of Bayesian updates and the commonly applied MC-based Bayesian updates [21, 50]. Afterward, we
propose our novel method focusing on an efficient update of the Gaussian posterior distribution via
last-layer LAs. For an introduction to LA, we refer to [7].
3.1 Bayesian Updates
We focus on classification problems with instance space $\mathcal{X}$ and label space $\mathcal{Y} = \{0, \dots, K-1\}$. The primary goal in our setting is to efficiently incorporate the information of new instance-label pairs $\mathcal{D}' = \{(x_n, y_n)\}_{n=1}^{N} \subset \mathcal{X} \times \mathcal{Y}$ into a BNN trained on a dataset $\mathcal{D} \subset \mathcal{X} \times \mathcal{Y}$. Retraining the entire network on the extended dataset $\mathcal{D} \cup \mathcal{D}'$ results in high computational cost for a large dataset $\mathcal{D}$. Conversely, using only the new data can cause catastrophic forgetting [42].
Figure 2: The left plot shows the predicted probabilities of the positive class for each hypothesis (colored lines) drawn from a BNN as well as the mean (black solid line) and standard deviation (black dashed line) of its predictive distribution. The right plot shows updated weights for each hypothesis and the predictive distribution after observing additional instances (green).

For this purpose, we employ BNNs [12] with Bayesian updates [35] as an efficient alternative to retraining. The main idea of BNNs is to estimate the posterior distribution $p(\omega \mid \mathcal{D})$ over the parameters $\omega$ given the observed training data $\mathcal{D}$ using Bayes' theorem. The obtained posterior distribution over the parameters can then be used to specify the predictive distribution over a new instance's class membership via marginalization:
$$p(y \mid x, \mathcal{D}) = \mathbb{E}_{p(\omega \mid \mathcal{D})}\big[p(y \mid x, \omega)\big] = \int p(y \mid x, \omega)\, p(\omega \mid \mathcal{D})\, \mathrm{d}\omega. \quad (1)$$
Thereby, the likelihood $p(y \mid x, \omega) = [\operatorname{softmax}(f_{\omega}(x))]_y$ denotes the probabilistic output of a DNN with parameters $\omega$, where $f_{\omega}\colon \mathcal{X} \to \mathbb{R}^K$ is a function outputting class-wise logits.²
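For DNNs, this integral is typically intractable; MC-based BNNs approximate it by averaging over $M$ samples drawn from an approximate posterior $q(\omega \mid \mathcal{D})$:
$$p(y \mid x, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} p(y \mid x, \omega_m), \qquad \omega_m \sim q(\omega \mid \mathcal{D}).$$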
The formulation in Eq. (1) provides a theoretically sound way to obtain updated predictions. In particular, this is because the probabilistic outputs $p(y \mid x, \omega)$ do not directly depend on the training data $\mathcal{D}$. Consequently, to obtain an updated predictive distribution, we do not need to update the parameters $\omega$ directly but only the posterior distribution $p(\omega \mid \mathcal{D})$. The updated posterior distribution $p(\omega \mid \mathcal{D}, \mathcal{D}')$ is found through Bayes' theorem, where the current posterior distribution $p(\omega \mid \mathcal{D})$ is considered the prior and multiplied with the likelihood $p(y \mid x, \omega)$ per instance-label pair $(x, y) \in \mathcal{D}'$. As instances in $\mathcal{D}$ and $\mathcal{D}'$ are assumed to be independently distributed, we can simplify the likelihood and reformulate the parameter distribution as follows³:
$$p(\omega \mid \mathcal{D}, \mathcal{D}') \propto p(\omega \mid \mathcal{D})\, p(\mathcal{D}' \mid \mathcal{D}, \omega) \overset{\text{i.i.d.}}{=} p(\omega \mid \mathcal{D})\, p(\mathcal{D}' \mid \omega) = p(\omega \mid \mathcal{D}) \prod_{(x, y) \in \mathcal{D}'} p(y \mid x, \omega). \quad (2)$$
We refer to Eq. (2) as the Bayesian update.
The most common realization [21, 50] of this update is through MC-based BNNs, such as MC-Dropout and deep ensembles. These BNNs rely on samples (or hypotheses) $\omega_1, \dots, \omega_M$ drawn from an approximate posterior $q(\omega \mid \mathcal{D})$. Research [56, 50] assumes that all hypotheses are equally likely to explain the observed data and have the same probability before updating. By updating the posterior distribution through Eq. (2), hypotheses that are more likely given the new data receive higher weights. We refer to these as MC-based updates, with a formal definition given in Appendix A. Figure 2 illustrates this concept, where different hypotheses $\omega_1, \dots, \omega_M \sim q(\omega \mid \mathcal{D})$ are shown. Each hypothesis represents a possible true solution for the learning task (white instances). When new data (green instances) arrives, we weigh each hypothesis by its likelihood of explaining the new data and obtain an updated prediction without retraining. This results in an updated predictive distribution, shown in bold in Fig. 2 (right).
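To make this reweighting concrete, the following Python sketch shows one possible implementation; the function mc_bayesian_update and its array arguments are illustrative assumptions rather than the formal definition from Appendix A, and hypotheses are assumed to start with uniform weights as described above.

import numpy as np

def mc_bayesian_update(probs_new, probs_query, labels_new):
    """Reweight M posterior samples (hypotheses) by their likelihood on new data.

    probs_new:   array (M, N, K), class probabilities of each hypothesis for the N new instances
    probs_query: array (M, Q, K), class probabilities of each hypothesis for Q query instances
    labels_new:  array (N,), integer labels of the new instances

    Returns the updated predictive distribution of shape (Q, K), i.e., a
    likelihood-weighted average of the hypotheses' predictions (cf. Eq. (2)).
    """
    num_new = probs_new.shape[1]
    # Log-likelihood of the new data under each hypothesis (i.i.d. assumption).
    log_lik = np.log(probs_new[:, np.arange(num_new), labels_new] + 1e-12).sum(axis=1)
    # Normalize to importance weights, starting from uniform weights 1/M.
    log_w = log_lik - log_lik.max()
    weights = np.exp(log_w) / np.exp(log_w).sum()
    # Likelihood-weighted predictive distribution instead of a plain average.
    return np.einsum("m,mqk->qk", weights, probs_query)

Hypotheses that explain the new labels poorly receive near-zero weight, mirroring the behavior sketched in Fig. 2 (right); the update itself requires only forward passes, no gradient steps.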
3.2 Fast Approximations of Bayesian Updates for Deep Neural Networks
Our update method is based on a combination of two concepts. First, instead of MC-based BNNs, we
suggest using LAs on the last layer of a DNN. Second, we directly modify the approximate posterior
distribution of the LA, providing a much more flexible way to adapt it to new data than reweighting.
In the following, we explain each component in detail. For now, we focus on binary classification with $K = 2$ and refer to Appendix C for an extension to multi-class classification.
Last-layer LA: LAs approximate the (intractable) posterior distribution $p(\omega \mid \mathcal{D})$ with a Gaussian centered on the maximum a posteriori (MAP) estimate, whose covariance is the inverse of the negative Hessian of the log posterior evaluated at the MAP estimate.
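As a rough illustration for the binary case, the following minimal NumPy sketch computes such a last-layer Gaussian posterior, assuming a sigmoid likelihood and an isotropic Gaussian prior; the function last_layer_laplace and its arguments are illustrative and do not reflect our exact implementation.

import numpy as np

def last_layer_laplace(features, w_map, prior_precision=1.0):
    """Closed-form Gaussian posterior over the last-layer weights (binary classification).

    features:        array (N, D), penultimate-layer features of the training data
    w_map:           array (D,), MAP estimate of the last-layer weights from standard training
    prior_precision: scalar precision of an isotropic Gaussian prior on the weights

    Returns the mean (the MAP estimate) and the covariance of the Laplace
    approximation, i.e., the inverse Hessian of the negative log posterior at w_map.
    """
    probs = 1.0 / (1.0 + np.exp(-(features @ w_map)))   # sigmoid outputs at the MAP estimate
    s = probs * (1.0 - probs)                            # per-instance curvature of the log-likelihood
    hessian = (features * s[:, None]).T @ features + prior_precision * np.eye(features.shape[1])
    covariance = np.linalg.inv(hessian)
    return w_map, covariance

Because both the Hessian and its inverse are available in closed form for this last-layer model, the resulting Gaussian can later be adapted to new data without retraining the underlying feature extractor.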
²We denote the $i$-th element of a vector $b$ as $[b]_i = b_i$.
³We denote $p(y_1, \dots, y_N \mid x_1, \dots, x_N, \omega)$ with $\mathcal{D}' = \{(x_n, y_n)\}_{n=1}^{N}$ as $p(\mathcal{D}' \mid \omega)$.