Hypernetwork approach to Bayesian MAML
P. Borycki, P. Kubacki, M. Przewięźlikowski, T. Kuśmierczyk, J. Tabor, P. Spurek
Faculty of Mathematics and Computer Science,
Jagiellonian University, Kraków, Poland
przemyslaw.spurek@gmail.com
Abstract
The main goal of Few-Shot learning algorithms is to enable learning from small
amounts of data. One of the most popular and elegant Few-Shot learning ap-
proaches is Model-Agnostic Meta-Learning (MAML). The main idea behind this
method is to learn the shared universal weights of a meta-model, which are then
adapted for specific tasks. However, the method suffers from over-fitting and
poorly quantifies uncertainty due to limited data size. Bayesian approaches could,
in principle, alleviate these shortcomings by learning weight distributions in place
of point-wise weights. Unfortunately, previous modifications of MAML are limited by the simplicity of Gaussian posteriors, by MAML-like gradient-based weight updates, or by the identical structure enforced for universal and adapted weights.
In this paper, we propose a novel framework for Bayesian MAML called
BayesianHMAML, which employs Hypernetworks for weight updates. It learns the universal weights point-wise, but a probabilistic structure is added when adapting to specific tasks. In such a framework, we can use simple Gaussian distributions
or more complicated posteriors induced by Continuous Normalizing Flows.
1 Introduction
Few-Shot learning models easily adapt to previously unseen tasks based on a few labeled samples.
One of the most popular and elegant among them is Model-Agnostic Meta-Learning (MAML) [14].
The main idea behind this method is to produce universal weights which can be rapidly updated
to solve new small tasks (see the first plot in Fig. 1). However, limited data sets lead to two
main problems. First, the method tends to overfit to training data, preventing us from using deep
architectures with large numbers of weights. Second, it lacks good quantification of uncertainty,
e.g., the model does not know how reliable its predictions are. Both problems can be addressed by
employing Bayesian Neural Networks (BNNs) [26], which learn distributions in place of point-wise estimates.
There exist a few Bayesian modifications of the classical MAML algorithm. Bayesian MAML [54], Amortized Bayesian Meta-Learning [38], PACOH [41, 40], FO-MAML [30], MLAP-M [1], and Meta-Mixture [22] learn distributions for the common universal weights, which are then updated to per-task local weight distributions. The above modifications of MAML, similar to the original MAML,
rely on gradient-based updates. Weights specialized for small tasks are obtained by taking a fixed
number of gradient steps from the standard universal weights. Such a procedure needs two levels of
Bayesian regularization and the universal distribution is usually employed as a prior for the per-task
specializations (see the second plot in Fig. 1). However, the hierarchical structure complicates the
optimization procedure and limits updates in the MAML procedure.
The paper presents BayesianHMAML – a new framework for Bayesian Few-Shot learning. It simplifies the weight-adapting procedure explained above and, thanks to the use of hypernetworks, enables learning more complicated posterior updates. Similar to the previous approaches, the final weight posteriors are obtained by updating from the universal weights.
[Figure 1: four schematic panels, left to right: MAML ($\theta \to \theta'_i$), BayesianMAML ($\theta \sim \mathcal{N}(\mu,\sigma)$, $\theta'_i \sim \mathcal{N}(\mu_i,\sigma_i)\,|\,\mathcal{N}(\mu,\sigma)$), BayesianHMAML (Gaussian) ($\theta \to \theta'_i \sim \mathcal{N}(\mu_i,\sigma_i)$), BayesianHMAML (CNF) ($\theta \to \theta'_i \sim \mathrm{CNF}_\theta$).]

Figure 1: Comparison of four models: MAML [14], BayesianMAML [54], as well as BayesianHMAML-G and BayesianHMAML-CNF. In the classic MAML, we have universal weights $\theta$, which are adapted to $\theta'_i$ for individual tasks $\mathcal{T}_i$. In BayesianMAML, the posterior distributions for individual small tasks are obtained in a few gradient-based updates from the universal distribution. In BayesianHMAML-G, we learn point-wise universal weights similar to MAML, but parameters of the specialized Gaussian posteriors are produced by a hypernetwork. Unlike BayesianMAML, the per-task distributions do not share a common prior distribution. In BayesianHMAML-CNF, the hypernetwork conditions a CNF, which can model arbitrary non-Gaussian posteriors.
However, we avoid learning the aforementioned hierarchical structure by modeling the universal weights point-wise. The probabilistic structure is added only later, when specializing the model for a specific task. In BayesianHMAML, updates from the universal weights to the per-task specialized ones are generated by hypernetworks instead of the previously used gradient-based optimization. Because hypernetworks
can easily model more complex structures, they allow for better adaptations. In particular, we tested
the standard Gaussian posteriors (see the third plot in Fig. 1) against more general posteriors induced
by Continuous Normalizing Flows (CNF) [19] (see the right-most plot in Fig. 1).
To the best of our knowledge, BayesianHMAML is the first approach that uses hypernetworks with
Bayesian learning for Few-Shot learning tasks. Our contributions can be summarized as follows:
• We introduce a novel framework for Bayesian Few-Shot learning, which simplifies the updating procedure and allows using complicated posterior distributions.
• Compared to the previous Bayesian modifications of MAML, BayesianHMAML employs the hypernetwork architecture to produce significantly more flexible weight updates.
• We implement two versions of the model: BayesianHMAML-G, with a classical Gaussian posterior, and a generalized BayesianHMAML-CNF, relying on Conditional Normalizing Flows.
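To make the adaptation mechanism concrete, the sketch below illustrates the BayesianHMAML-G idea under simplifying assumptions: a hypernetwork maps an aggregated support-set embedding to Gaussian posterior parameters for the classification-head weights, and adapted weights are sampled with the reparameterization trick. The class name `GaussianUpdateHypernet`, the layer sizes, and the mean aggregation are illustrative choices, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class GaussianUpdateHypernet(nn.Module):
    """Illustrative hypernetwork: maps a support-set embedding to Gaussian
    posterior parameters for the (flattened) classification-head weights."""
    def __init__(self, embed_dim: int, head_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * head_dim),   # outputs (delta_mu, log_sigma)
        )

    def forward(self, support_embeddings: torch.Tensor):
        # Aggregate the support set into a single permutation-invariant vector
        # (here: a simple mean; the paper sorts embeddings by class first).
        summary = support_embeddings.mean(dim=0)
        delta_mu, log_sigma = self.net(summary).chunk(2, dim=-1)
        return delta_mu, log_sigma

def adapt_head(theta_head: torch.Tensor, hypernet, support_embeddings):
    """Sample task-specific head weights: theta_i ~ N(theta + delta_mu, sigma^2).
    theta_head is the flattened point-wise universal head weight vector."""
    delta_mu, log_sigma = hypernet(support_embeddings)
    eps = torch.randn_like(theta_head)
    return theta_head + delta_mu + torch.exp(log_sigma) * eps
```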
2 Background
This section introduces all the notions necessary for understanding our method. We start by presenting
the background and notation for Few-Shot learning. Then, we describe how the MAML algorithm
works and introduce the general idea of Hypernetworks dedicated to MAML updates. Finally, we
briefly explain Conditional Normalizing Flows.
The terminology describing the Few-Shot learning setup is inconsistent, owing to colliding definitions used in the literature. Here, we use the nomenclature derived from the Meta-Learning literature, which is the most prevalent at the time of writing [50, 45].
Let $\mathcal{S}=\{(\mathbf{x}_l,\mathbf{y}_l)\}_{l=1}^{L}$ be a support set containing $L$ input-output pairs with classes distributed uniformly. In the One-Shot scenario, each class is represented by a single example, and $L=K$, where $K$ is the number of the considered classes in the given task. In the Few-Shot scenarios, each class usually has from 2 to 5 representatives in the support set $\mathcal{S}$.
Let $\mathcal{Q}=\{(\mathbf{x}_m,\mathbf{y}_m)\}_{m=1}^{M}$ be a query set (sometimes referred to in the literature as a target set), with $M$ examples, where $M$ is typically an order of magnitude greater than $K$. Support and query sets are grouped in a task $\mathcal{T}=\{\mathcal{S},\mathcal{Q}\}$. During training, Few-Shot models are presented with randomly selected tasks from the training set $\mathcal{D}=\{\mathcal{T}_n\}_{n=1}^{N}$. During inference, we consider a task $\mathcal{T}=\{\mathcal{S},\mathcal{X}\}$, where $\mathcal{S}$ is a support set with known classes and $\mathcal{X}$ is a set of unlabeled query inputs. The goal is to predict the class labels for the query inputs $\mathbf{x}\in\mathcal{X}$, assuming the support set $\mathcal{S}$ and using the model trained on the data $\mathcal{D}$.
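As a concrete illustration of this setup, the snippet below builds a single $K$-way task with a support and a query set. It assumes `dataset` is a plain list of (input, label) pairs and uses a simple Python sampler rather than the episodic data pipelines used in practice.

```python
import random
from collections import defaultdict

def sample_task(dataset, n_way=5, k_shot=5, m_query=15):
    """Build one N-way K-shot task (sketch): each class contributes k_shot
    support and m_query query examples; assumes every sampled class has at
    least k_shot + m_query examples available."""
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)
    classes = random.sample(list(by_class), n_way)
    support, query = [], []
    for new_label, c in enumerate(classes):
        xs = random.sample(by_class[c], k_shot + m_query)
        support += [(x, new_label) for x in xs[:k_shot]]   # S: K examples per class
        query   += [(x, new_label) for x in xs[k_shot:]]   # Q: M examples per class
    return support, query
```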
Model-Agnostic Meta-Learning (MAML). MAML [14] is one of the standard algorithms for Few-Shot learning, which learns the parameters of a model so that it can adapt to a new task in a few gradient steps. For the model, we use a neural network $f_\theta$ parameterized by weights $\theta$. Its architecture consists of a feature extractor (backbone) $E(\cdot)$ and a fully connected layer. The universal weights $\theta = (\theta_E, \theta_H)$ include $\theta_E$ for the feature extractor and $\theta_H$ for the classification head.
When adapting to a new task $\mathcal{T}_i=\{\mathcal{S}_i,\mathcal{Q}_i\}$, the parameters $\theta$ are updated to $\theta'_i$. Such an update is achieved in one or more gradient descent updates on $\mathcal{T}_i$. In the simplest case of one gradient update, the parameters are updated as follows:
$$\theta'_i = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta),$$
where $\alpha$ is a step size hyperparameter. The loss function for a data set $\mathcal{D}$ is cross-entropy. The meta-optimization across tasks is performed via stochastic gradient descent (SGD):
$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta'_i}),$$
where $\beta$ is the meta step size (see Fig. 1).
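The two update equations can be read directly as code. The sketch below performs a single inner-loop adaptation step and returns the query loss for one task, using `torch.func.functional_call` to evaluate the network under the adapted parameters; multiple inner steps and batching over tasks are omitted for brevity.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def maml_step(model, theta, support, query, alpha=0.01):
    """One MAML inner/outer computation for a single task (sketch).
    theta: dict of parameter tensors with requires_grad=True;
    support/query: (inputs, labels) tuples."""
    xs, ys = support
    xq, yq = query
    # Inner loop: theta'_i = theta - alpha * grad L_support(theta)
    support_loss = F.cross_entropy(functional_call(model, theta, (xs,)), ys)
    grads = torch.autograd.grad(support_loss, list(theta.values()), create_graph=True)
    theta_prime = {name: p - alpha * g for (name, p), g in zip(theta.items(), grads)}
    # Outer objective: query loss evaluated with the adapted parameters.
    return F.cross_entropy(functional_call(model, theta_prime, (xq,)), yq)

# Meta-update over a batch of tasks (beta is the meta step size):
# meta_loss = sum(maml_step(model, theta, S_i, Q_i) for (S_i, Q_i) in tasks)
# meta_loss.backward(); optimizer.step()
```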
Hypernetwork approach to MAML. HyperMAML [33] is a generalization of the MAML algorithm, which uses non-gradient-based updates generated by hypernetworks [20]. Analogously to MAML, it considers a model represented by a function $f_\theta$ with parameters $\theta$. When adapting to a new task $\mathcal{T}_i$, the parameters of the model $\theta$ become $\theta'_i$. Contrary to MAML, in HyperMAML the updated parameters $\theta'_i$ are computed using a hypernetwork $H_\phi$ as
$$\theta'_i = \theta + H_\phi(\mathcal{S}_i, \theta).$$
The hypernetwork $H_\phi$ is a neural network consisting of a feature extractor $E(\cdot)$, which transforms support sets into a lower-dimensional representation, and fully connected layers that aggregate the representation. To achieve permutation invariance, the embeddings are sorted according to their respective classes before aggregation.
Similarly to MAML, the universal weights $\theta$ consist of the feature extractor's weights $\theta_E$ and the classification head's weights $\theta_H$, i.e., $\theta = (\theta_E, \theta_H)$. However, HyperMAML keeps $\theta_E$ shared between tasks and updates only $\theta_H$, i.e.,
$$\theta'_i = (\theta_{E,i}, \theta_{H,i}) = (\theta_E, \theta_H + H_\phi(\mathcal{S}_i, \theta)).$$
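For illustration, a minimal HyperMAML-style update might look as follows: the hypernetwork consumes support embeddings produced by the shared feature extractor and emits an additive update for the flattened classification head. The layer sizes and the mean aggregation are assumptions for this sketch, not the architecture from [33].

```python
import torch
import torch.nn as nn

class HeadUpdateHypernet(nn.Module):
    """Illustrative H_phi: produces an additive update for the classification
    head from support-set embeddings (sizes are illustrative assumptions)."""
    def __init__(self, embed_dim: int, head_numel: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, head_numel),
        )

    def forward(self, support_embeddings: torch.Tensor) -> torch.Tensor:
        # Embeddings are assumed to arrive sorted by class; we aggregate them
        # with a simple mean here for brevity.
        return self.mlp(support_embeddings.mean(dim=0))

# theta_H_i = theta_H + H_phi(S_i, theta): only the head is task-adapted,
# while the feature extractor's weights theta_E stay shared across tasks.
# update = hypernet(encoder(support_inputs))
# adapted_head = theta_H.flatten() + update
```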
Continuous Normalizing Flows (CNF). The idea of normalizing flows [11] relies on the transformation of a simple prior probability distribution $P_Z$ (usually a Gaussian one) defined in the latent space $\mathcal{Z}$ into a complex one in the output space $\mathcal{Y}$ through a series of invertible mappings: $F_\eta = F_K \circ \ldots \circ F_1 : \mathcal{Z} \to \mathcal{Y}$. The log-probability density of the output variable is given by the change of variables formula
$$\log P_Y(y;\eta) = \log P_Z(z) - \sum_{k=1}^{K} \log \left| \det \frac{\partial F_k}{\partial z_{k-1}} \right|,$$
where $z = F_\eta^{-1}(y)$ and $P_Y(y;\eta)$ denotes the probability density function induced by the normalizing flow with parameters $\eta$. The intermediate layers $F_k$ must be designed so that both the inverse map and the determinant of the Jacobian are computable.
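A one-layer affine flow makes the change-of-variables formula easy to verify numerically. The snippet below is an illustrative example (not code from the paper) that checks the formula against the closed-form Gaussian density.

```python
import numpy as np
from scipy.stats import norm

# One affine layer F(z) = a*z + b applied to a standard Gaussian prior P_Z.
a, b = 2.0, 1.0
y = 3.0
z = (y - b) / a                              # z = F^{-1}(y)
# Change of variables: log P_Y(y) = log P_Z(z) - log|det dF/dz| = log P_Z(z) - log|a|
log_py = norm.logpdf(z) - np.log(abs(a))
# Check against the exact density of Y = aZ + b, i.e. Y ~ N(b, a^2):
assert np.isclose(log_py, norm.logpdf(y, loc=b, scale=abs(a)))
```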
The continuous normalizing flow [8] is a modification of the above approach, where instead of a discrete sequence of iterations, we allow the transformation to be defined by a solution to a differential equation $\frac{\partial z(t)}{\partial t} = g(z(t), t)$, where $g$ is a neural network that has an unrestricted architecture. The CNF $F_\eta : \mathcal{Z} \to \mathcal{Y}$ is a solution of the differential equation with the initial value problem $z(t_0) = y$, $\frac{\partial z(t)}{\partial t} = g_\eta(z(t), t)$. In such a case, we have
$$F_\eta(z) = F_\eta(z(t_0)) = z(t_0) + \int_{t_0}^{t_1} g_\eta(z(t), t)\,dt, \qquad F_\eta^{-1}(y) = y + \int_{t_1}^{t_0} g_\eta(z(t), t)\,dt.$$
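The following sketch integrates such a flow with a fixed-step Euler scheme to illustrate the forward map $F_\eta$. It omits the log-density (Jacobian trace) term and the adaptive ODE solvers used in practice, and the network $g_\eta$ is an arbitrary small MLP chosen purely for illustration.

```python
import torch
import torch.nn as nn

class TinyCNF(nn.Module):
    """Minimal sketch of a CNF forward map: integrate dz/dt = g_eta(z, t)
    from t0 to t1 with a fixed-step Euler scheme (no log-density tracking)."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z0: torch.Tensor, t0: float = 0.0, t1: float = 1.0, steps: int = 20):
        z, dt = z0, (t1 - t0) / steps
        for k in range(steps):
            t = torch.full_like(z[..., :1], t0 + k * dt)       # broadcast time coordinate
            z = z + dt * self.g(torch.cat([z, t], dim=-1))      # Euler step
        return z

# The inverse map integrates the same vector field from t1 back to t0 (negate dt);
# practical CNFs use an adaptive ODE solver and also integrate the trace of the
# Jacobian to obtain the change in log-density.
```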