Additionally, they require large amounts of new data (thousands of data points) per task, whereas our update method is designed for smaller datasets, ranging from a single data point to hundreds.
More closely related to our work is online learning [17], which aims to sequentially and efficiently
update models from incoming data streams. Traditional approaches often focus on linear [57, 6] or shallow [22, 45] models with maximum-margin classification. However, applying online learning to DNNs remains difficult due to issues such as convergence, vanishing gradients, and large model sizes [46, 27]. To address these challenges, Sahoo et al. [46] proposed a method that modifies a DNN's architecture to facilitate updates. We argue that this approach is restrictive in state-of-the-art settings, given the increasing reliance on pretrained architectures. Most similar to our setting is the work on Bayesian online inference by Kirsch et al. [21]. The core idea is to sample hypotheses, e.g., via MC-Dropout, from the posterior distribution of a BNN and weight their importance according to the respective likelihoods for sequentially arriving data. We refer to these types of updates as
MC-based updates. However, their empirical results raised concerns regarding the applicability of such MC-based updates in high-dimensional parameter spaces.
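For intuition, the following is a minimal sketch of such an MC-based update; it is not the exact procedure of [21], and the function name, the interface taking hypotheses as callables, and the evaluation batch are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mc_bayesian_update(hypotheses, x_new, y_new, x_test):
    """Reweight sampled hypotheses (e.g., ensemble members or MC-Dropout
    passes with fixed masks) by their likelihood on newly arriving labels
    and return the updated predictive distribution for x_test."""
    log_w, test_probs = [], []
    for h in hypotheses:                                    # each h maps inputs to logits
        log_p = F.log_softmax(h(x_new), dim=-1)             # per-class log-likelihoods
        log_w.append(log_p.gather(1, y_new.view(-1, 1)).sum())  # log p(new data | h)
        test_probs.append(F.softmax(h(x_test), dim=-1))     # predictions of hypothesis h
    w = torch.softmax(torch.stack(log_w), dim=0)            # normalized importance weights
    # Updated predictive distribution: likelihood-weighted average over hypotheses.
    return torch.einsum('s,snk->nk', w, torch.stack(test_probs))
```

No gradient steps are taken; the update merely reweights already sampled hypotheses, which makes it cheap but also limits its expressiveness in high-dimensional parameter spaces.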
BNNs [53, 12] induce a prior distribution over their parameters, i.e., weights, and learn a posterior distribution given training data. Predictions are made by marginalizing this posterior. BNN types differ mainly in their probabilistic model and in how they sample from the posterior distribution [18]. MC-Dropout [13], one of the most prominent BNN types, builds on dropout, a regularization technique applied during training. Keeping dropout active during evaluation, called MC-Dropout, yields a distribution over predictions that corresponds to a variational distribution in the parameter space. Due to its simplicity and training efficiency, MC-Dropout is often used for comparison. However, its inference is inefficient, and its predictions may not properly represent uncertainty estimates [37]. Deep ensembles [25] consist of multiple DNNs. Combined with regularization, these DNNs are samples (different modes) of the parameters' posterior distribution. Ensembles typically provide better uncertainty estimates than MC-Dropout but require more computational capacity during training [37]. A BNN obtained via LA [43] is computationally more efficient than MC-Dropout and deep ensembles. It specifies an approximate Gaussian posterior distribution, where the maximum a posteriori (MAP) estimate defines the mean and the inverse of the negative log posterior's Hessian at the MAP estimate corresponds to the covariance matrix. As computing this Hessian is expensive for large DNNs, LA is often applied only to the last layer [7]. SNGP [30] combines LA with random Fourier features [38] and spectral normalization [34] to approximate a Gaussian process.
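To make the last-layer LA concrete, the following minimal sketch shows the standard construction under simplifying assumptions (a dense generalized Gauss-Newton Hessian over the last-layer weights only, an isotropic Gaussian prior, and MC sampling for the predictive); all function names and arguments are illustrative, and this is not the update method proposed below.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def last_layer_laplace(features, W_map, prior_precision=1.0):
    """Gaussian posterior over the last-layer weights of a softmax classifier:
    mean = MAP estimate, covariance = inverse Hessian of the negative log
    posterior (GGN form) at the MAP. `features` are penultimate activations
    of shape (N, D); `W_map` has shape (K, D)."""
    N, D = features.shape
    K = W_map.shape[0]
    probs = F.softmax(features @ W_map.T, dim=-1)              # (N, K) MAP predictions
    H = prior_precision * torch.eye(K * D)                     # Hessian of the negative log prior
    for phi, p in zip(features, probs):
        lam = torch.diag(p) - torch.outer(p, p)                # per-sample curvature (K, K)
        H += torch.kron(lam, torch.outer(phi, phi))            # accumulate data-term Hessian
    return W_map.flatten(), torch.linalg.inv(H)                # posterior mean and covariance

@torch.no_grad()
def laplace_predictive(features, mean, cov, n_samples=100):
    """Monte Carlo predictive: sample last-layer weights and average softmax outputs."""
    D = features.shape[1]
    dist = torch.distributions.MultivariateNormal(mean, covariance_matrix=cov)
    W = dist.sample((n_samples,)).reshape(n_samples, -1, D)    # (S, K, D)
    logits = torch.einsum('skd,nd->snk', W, features)
    return F.softmax(logits, dim=-1).mean(dim=0)               # (N, K) averaged predictive
```

Because the Hessian has size (K·D) × (K·D) for the last layer only, both fitting and sampling remain tractable even with a large backbone.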
AL poses significant challenges in practice due to the computational cost of the retraining required per cycle. The emergence of one-step-look-ahead strategies [44, 50, 56] further highlights the importance of efficient model updates. Notably, BEMPS [50] evaluates all possible candidates (i.e., instances) in a dataset to select those that are anticipated to alter the model's predictions substantially. However, this strategy employs an MC-based BNN with deep ensembles and relies on MC-based updates, which assume that all hypotheses are equally likely to explain the data before an update step. While this update is fast, its performance is inferior to that of a retrained model.
3 Fast Bayesian Updates for Deep Neural Networks
In this section, we present our new Bayesian update method. First, we introduce the general concept
of Bayesian updates and the commonly applied MC-based Bayesian updates [21, 50]. Afterward, we propose our novel method focusing on an efficient update of the Gaussian posterior distribution via last-layer LAs. For an introduction to LA, we refer to [7].
3.1 Bayesian Updates
We focus on classification problems with instance space $\mathcal{X}$ and label space $\mathcal{Y} = \{0, \dots, K-1\}$. The primary goal in our setting is to efficiently incorporate the information of new instance-label pairs $\mathcal{D}^{\oplus} = \{(x_n, y_n)\}_{n=1}^{N} \subset \mathcal{X} \times \mathcal{Y}$ into a BNN trained on a dataset $\mathcal{D} \subset \mathcal{X} \times \mathcal{Y}$. Retraining the entire network on the extended dataset $\mathcal{D} \cup \mathcal{D}^{\oplus}$ results in high computational cost for a large dataset $\mathcal{D}$. Conversely, using only the new data can cause catastrophic forgetting [42].
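For intuition, and using the parameter posterior notation introduced formally in the next paragraph, a Bayesian update treats the posterior learned from $\mathcal{D}$ as the prior for the new data $\mathcal{D}^{\oplus}$:
$$
p(\omega \mid \mathcal{D} \cup \mathcal{D}^{\oplus}) \;\propto\; p(\mathcal{D}^{\oplus} \mid \omega)\, p(\omega \mid \mathcal{D}),
$$
a standard identity that holds when $\mathcal{D}^{\oplus}$ and $\mathcal{D}$ are conditionally independent given the parameters $\omega$.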
For this purpose, we employ BNNs [12] with Bayesian updates [35] as an efficient alternative to retraining. The main idea of BNNs is to estimate the posterior distribution $p(\omega \mid \mathcal{D})$ over the parameters $\omega \in \Omega$ given the observed training data $\mathcal{D}$ using Bayes' theorem. The obtained posterior distribution over the parameters can then be used to specify the predictive distribution over a new instance's class