Hypernetwork approach to Bayesian MAML
P. Borycki, P. Kubacki, M. Przewięźlikowski, T. Kuśmierczyk, J. Tabor, P. Spurek
Faculty of Mathematics and Computer Science,
Jagiellonian University, Kraków, Poland
przemyslaw.spurek@gmail.com
Abstract
The main goal of Few-Shot learning algorithms is to enable learning from small
amounts of data. One of the most popular and elegant Few-Shot learning ap-
proaches is Model-Agnostic Meta-Learning (MAML). The main idea behind this
method is to learn the shared universal weights of a meta-model, which are then
adapted for specific tasks. However, the method suffers from over-fitting and
poorly quantifies uncertainty due to limited data size. Bayesian approaches could,
in principle, alleviate these shortcomings by learning weight distributions in place
of point-wise weights. Unfortunately, previous modifications of MAML are limited by the simplicity of Gaussian posteriors, by MAML-like gradient-based weight updates, or by the identical structure enforced for universal and adapted weights.
In this paper, we propose a novel framework for Bayesian MAML called
BayesianHMAML, which employs Hypernetworks for weight updates. It learns the universal weights point-wise, but a probabilistic structure is added when adapting to specific tasks. In such a framework, we can use simple Gaussian distributions
or more complicated posteriors induced by Continuous Normalizing Flows.
1 Introduction
Few-Shot learning models easily adapt to previously unseen tasks based on a few labeled samples.
One of the most popular and elegant among them is Model-Agnostic Meta-Learning (MAML) [14].
The main idea behind this method is to produce universal weights which can be rapidly updated
to solve new small tasks (see the first plot in Fig. 1). However, limited data sets lead to two
main problems. First, the method tends to overfit to training data, preventing us from using deep
architectures with large numbers of weights. Second, it lacks good quantification of uncertainty,
e.g., the model does not know how reliable its predictions are. Both problems can be addressed by
employing Bayesian Neural Networks (BNNs) [26], which learn distributions in place of point-wise estimates.
There exist a few Bayesian modifications of the classical MAML algorithm. Bayesian MAML [54], Amortized Bayesian Meta-Learning [38], PACOH [41, 40], FO-MAML [30], MLAP-M [1], and Meta-Mixture [22] learn distributions for the common universal weights, which are then updated to per-task local weight distributions. The above modifications of MAML, similar to the original MAML,
rely on gradient-based updates. Weights specialized for small tasks are obtained by taking a fixed
number of gradient steps from the standard universal weights. Such a procedure needs two levels of
Bayesian regularization and the universal distribution is usually employed as a prior for the per-task
specializations (see the second plot in Fig. 1). However, the hierarchical structure complicates the
optimization procedure and limits updates in the MAML procedure.
The paper presents BayesianHMAML – a new framework for Bayesian Few-Shot learning. It simplifies the weight-adapting procedure explained above and, thanks to the use of hypernetworks, enables learning more complicated posterior updates. Similar to the previous approaches, the final weight posteriors are obtained by updating from the universal weights.
[Figure 1: four schematic panels, left to right: MAML ($\theta \to \theta'_i$), BayesianMAML ($\theta \sim \mathcal{N}(\mu,\sigma)$, $\theta'_i \sim \mathcal{N}(\mu_i,\sigma_i)\,|\,\mathcal{N}(\mu,\sigma)$), BayesianHMAML (Gaussian) ($\theta \to \theta'_i \sim \mathcal{N}(\mu_i,\sigma_i)$), BayesianHMAML (CNF) ($\theta \to \theta'_i \sim \mathrm{CNF}_\theta$).]

Figure 1: Comparison of four models: MAML [14], BayesianMAML [54], as well as BayesianHMAML-G and BayesianHMAML-CNF. In the classic MAML, we have universal weights $\theta$, which are adapted to $\theta'_i$ for individual tasks $\mathcal{T}_i$. In BayesianMAML, the posterior distributions for individual small tasks are obtained in a few gradient-based updates from the universal distribution. In BayesianHMAML-G, we learn point-wise universal weights similar to MAML, but parameters of the specialized Gaussian posteriors are produced by a hypernetwork. Unlike BayesianMAML, the per-task distributions do not share a common prior distribution. In BayesianHMAML-CNF, the hypernetwork conditions a CNF, which can model arbitrary non-Gaussian posteriors.
However, we avoid learning the aforementioned hierarchical structure by modeling the universal weights point-wise. The probabilistic structure is added only later, when specializing the model for a specific task. In BayesianHMAML, updates from the universal weights to the per-task specialized ones are generated by hypernetworks instead of the previously used gradient-based optimization. Because hypernetworks
can easily model more complex structures, they allow for better adaptations. In particular, we tested
the standard Gaussian posteriors (see the third plot in Fig. 1) against more general posteriors induced
by Continuous Normalizing Flows (CNF) [19] (see the right-most plot in Fig. 1).
To the best of our knowledge, BayesianHMAML is the first approach that uses hypernetworks with
Bayesian learning for Few-Shot learning tasks. Our contributions can be summarized as follows:
• We introduce a novel framework for Bayesian Few-Shot learning, which simplifies the updating procedure and allows using complicated posterior distributions.
• Compared to the previous Bayesian modifications of MAML, BayesianHMAML employs the hypernetwork architecture to produce significantly more flexible weight updates.
• We implement two versions of the model: BayesianHMAML-G, with a classical Gaussian posterior, and a generalized BayesianHMAML-CNF, relying on Conditional Normalizing Flows.
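To make the adaptation mechanism concrete, the sketch below illustrates the BayesianHMAML-G idea under simplifying assumptions: a hypernetwork maps an aggregated support-set embedding to Gaussian posterior parameters for the classification-head weights, and adapted weights are sampled with the reparameterization trick. The class name `GaussianUpdateHypernet`, the layer sizes, and the mean aggregation are illustrative choices, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class GaussianUpdateHypernet(nn.Module):
    """Illustrative hypernetwork: maps a support-set embedding to Gaussian
    posterior parameters for the (flattened) classification-head weights."""
    def __init__(self, embed_dim: int, head_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * head_dim),   # outputs (delta_mu, log_sigma)
        )

    def forward(self, support_embeddings: torch.Tensor):
        # Aggregate the support set into a single permutation-invariant vector
        # (here: a simple mean; the paper sorts embeddings by class first).
        summary = support_embeddings.mean(dim=0)
        delta_mu, log_sigma = self.net(summary).chunk(2, dim=-1)
        return delta_mu, log_sigma

def adapt_head(theta_head: torch.Tensor, hypernet, support_embeddings):
    """Sample task-specific head weights: theta_i ~ N(theta + delta_mu, sigma^2).
    theta_head is the flattened point-wise universal head weight vector."""
    delta_mu, log_sigma = hypernet(support_embeddings)
    eps = torch.randn_like(theta_head)
    return theta_head + delta_mu + torch.exp(log_sigma) * eps
```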
2 Background
This section introduces all the notions necessary for understanding our method. We start by presenting
the background and notation for Few-Shot learning. Then, we describe how the MAML algorithm
works and introduce the general idea of Hypernetworks dedicated to MAML updates. Finally, we
briefly explain Conditional Normalizing Flows.
The terminology describing the Few-Shot learning setup is inconsistent, owing to colliding definitions used in the literature. Here, we use the nomenclature derived from the Meta-Learning literature, which is the most prevalent at the time of writing [50, 45].
Let $\mathcal{S}=\{(\mathbf{x}_l,\mathbf{y}_l)\}_{l=1}^{L}$ be a support set containing $L$ input-output pairs with classes distributed uniformly. In the One-Shot scenario, each class is represented by a single example, and $L=K$, where $K$ is the number of the considered classes in the given task. In the Few-Shot scenarios, each class usually has from 2 to 5 representatives in the support set $\mathcal{S}$.
Let $\mathcal{Q}=\{(\mathbf{x}_m,\mathbf{y}_m)\}_{m=1}^{M}$ be a query set (sometimes referred to in the literature as a target set), with $M$ examples, where $M$ is typically an order of magnitude greater than $K$. Support and query sets are grouped in a task $\mathcal{T}=\{\mathcal{S},\mathcal{Q}\}$. During training, Few-Shot models are presented with randomly selected tasks from the training set $\mathcal{D}=\{\mathcal{T}_n\}_{n=1}^{N}$. During inference, we consider a task $\mathcal{T}=\{\mathcal{S},\mathcal{X}\}$, where $\mathcal{S}$ is a support set with known classes and $\mathcal{X}$ is a set of unlabeled query inputs. The goal is to predict the class labels for the query inputs $\mathbf{x}\in\mathcal{X}$, assuming the support set $\mathcal{S}$ and using the model trained on the data $\mathcal{D}$.
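As a concrete illustration of this setup, the snippet below builds a single $K$-way task with a support and a query set. It assumes `dataset` is a plain list of (input, label) pairs and uses a simple Python sampler rather than the episodic data pipelines used in practice.

```python
import random
from collections import defaultdict

def sample_task(dataset, n_way=5, k_shot=5, m_query=15):
    """Build one N-way K-shot task (sketch): each class contributes k_shot
    support and m_query query examples; assumes every sampled class has at
    least k_shot + m_query examples available."""
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)
    classes = random.sample(list(by_class), n_way)
    support, query = [], []
    for new_label, c in enumerate(classes):
        xs = random.sample(by_class[c], k_shot + m_query)
        support += [(x, new_label) for x in xs[:k_shot]]   # S: K examples per class
        query   += [(x, new_label) for x in xs[k_shot:]]   # Q: M examples per class
    return support, query
```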
Model-Agnostic Meta-Learning (MAML). MAML [14] is one of the standard algorithms for Few-Shot learning, which learns the parameters of a model so that it can adapt to a new task in a few gradient steps. For the model, we use a neural network $f_\theta$ parameterized by weights $\theta$. Its architecture consists of a feature extractor (backbone) $E(\cdot)$ and a fully connected layer. The universal weights $\theta = (\theta_E, \theta_H)$ include $\theta_E$ for the feature extractor and $\theta_H$ for the classification head.
When adapting to a new task $\mathcal{T}_i=\{\mathcal{S}_i,\mathcal{Q}_i\}$, the parameters $\theta$ are updated to $\theta'_i$. Such an update is achieved in one or more gradient descent updates on $\mathcal{T}_i$. In the simplest case of one gradient update, the parameters are updated as follows:
$$\theta'_i = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta),$$
where $\alpha$ is a step size hyperparameter. The loss function for a data set $\mathcal{D}$ is cross-entropy. The meta-optimization across tasks is performed via stochastic gradient descent (SGD):
$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta'_i}),$$
where $\beta$ is the meta step size (see Fig. 1).
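The two update equations can be read directly as code. The sketch below performs a single inner-loop adaptation step and returns the query loss for one task, using `torch.func.functional_call` to evaluate the network under the adapted parameters; multiple inner steps and batching over tasks are omitted for brevity.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def maml_step(model, theta, support, query, alpha=0.01):
    """One MAML inner/outer computation for a single task (sketch).
    theta: dict of parameter tensors with requires_grad=True;
    support/query: (inputs, labels) tuples."""
    xs, ys = support
    xq, yq = query
    # Inner loop: theta'_i = theta - alpha * grad L_support(theta)
    support_loss = F.cross_entropy(functional_call(model, theta, (xs,)), ys)
    grads = torch.autograd.grad(support_loss, list(theta.values()), create_graph=True)
    theta_prime = {name: p - alpha * g for (name, p), g in zip(theta.items(), grads)}
    # Outer objective: query loss evaluated with the adapted parameters.
    return F.cross_entropy(functional_call(model, theta_prime, (xq,)), yq)

# Meta-update over a batch of tasks (beta is the meta step size):
# meta_loss = sum(maml_step(model, theta, S_i, Q_i) for (S_i, Q_i) in tasks)
# meta_loss.backward(); optimizer.step()
```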
Hypernetwork approach to MAML. HyperMAML [33] is a generalization of the MAML algorithm, which uses non-gradient-based updates generated by hypernetworks [20]. Analogously to MAML, it considers a model represented by a function $f_\theta$ with parameters $\theta$. When adapting to a new task $\mathcal{T}_i$, the parameters of the model $\theta$ become $\theta'_i$. Contrary to MAML, in HyperMAML the updated parameters $\theta'_i$ are computed using a hypernetwork $H_\phi$ as
$$\theta'_i = \theta + H_\phi(\mathcal{S}_i, \theta).$$
The hypernetwork $H_\phi$ is a neural network consisting of a feature extractor $E(\cdot)$, which transforms support sets into a lower-dimensional representation, and fully connected layers that aggregate the representation. To achieve permutation invariance, the embeddings are sorted according to their respective classes before aggregation.
Similarly to MAML, the universal weights $\theta$ consist of the feature extractor's weights $\theta_E$ and the classification head's weights $\theta_H$, i.e., $\theta = (\theta_E, \theta_H)$. However, HyperMAML keeps $\theta_E$ shared between tasks and updates only $\theta_H$, i.e.,
$$\theta'_i = (\theta_{E,i}, \theta_{H,i}) = (\theta_E, \theta_H + H_\phi(\mathcal{S}_i, \theta)).$$
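For illustration, a minimal HyperMAML-style update might look as follows: the hypernetwork consumes support embeddings produced by the shared feature extractor and emits an additive update for the flattened classification head. The layer sizes and the mean aggregation are assumptions for this sketch, not the architecture from [33].

```python
import torch
import torch.nn as nn

class HeadUpdateHypernet(nn.Module):
    """Illustrative H_phi: produces an additive update for the classification
    head from support-set embeddings (sizes are illustrative assumptions)."""
    def __init__(self, embed_dim: int, head_numel: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, head_numel),
        )

    def forward(self, support_embeddings: torch.Tensor) -> torch.Tensor:
        # Embeddings are assumed to arrive sorted by class; we aggregate them
        # with a simple mean here for brevity.
        return self.mlp(support_embeddings.mean(dim=0))

# theta_H_i = theta_H + H_phi(S_i, theta): only the head is task-adapted,
# while the feature extractor's weights theta_E stay shared across tasks.
# update = hypernet(encoder(support_inputs))
# adapted_head = theta_H.flatten() + update
```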
Continuous Normalizing Flows (CNF). The idea of normalizing flows [11] relies on the transformation of a simple prior probability distribution $P_Z$ (usually a Gaussian one) defined in the latent space $\mathcal{Z}$ into a complex one in the output space $\mathcal{Y}$ through a series of invertible mappings: $F_\eta = F_K \circ \ldots \circ F_1 : \mathcal{Z} \to \mathcal{Y}$. The log-probability density of the output variable is given by the change of variables formula
$$\log P_Y(y;\eta) = \log P_Z(z) - \sum_{k=1}^{K} \log \left| \det \frac{\partial F_k}{\partial z_{k-1}} \right|,$$
where $z = F_\eta^{-1}(y)$ and $P_Y(y;\eta)$ denotes the probability density function induced by the normalizing flow with parameters $\eta$. The intermediate layers $F_k$ must be designed so that both the inverse map and the determinant of the Jacobian are computable.
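A one-layer affine flow makes the change-of-variables formula easy to verify numerically. The snippet below is an illustrative example (not code from the paper) that checks the formula against the closed-form Gaussian density.

```python
import numpy as np
from scipy.stats import norm

# One affine layer F(z) = a*z + b applied to a standard Gaussian prior P_Z.
a, b = 2.0, 1.0
y = 3.0
z = (y - b) / a                              # z = F^{-1}(y)
# Change of variables: log P_Y(y) = log P_Z(z) - log|det dF/dz| = log P_Z(z) - log|a|
log_py = norm.logpdf(z) - np.log(abs(a))
# Check against the exact density of Y = aZ + b, i.e. Y ~ N(b, a^2):
assert np.isclose(log_py, norm.logpdf(y, loc=b, scale=abs(a)))
```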
The continuous normalizing flow [8] is a modification of the above approach, where instead of a discrete sequence of iterations, we allow the transformation to be defined by a solution to a differential equation $\frac{\partial z(t)}{\partial t} = g(z(t), t)$, where $g$ is a neural network that has an unrestricted architecture. The CNF $F_\eta : \mathcal{Z} \to \mathcal{Y}$ is a solution of the differential equation with the initial value problem $z(t_0) = y$, $\frac{\partial z(t)}{\partial t} = g_\eta(z(t), t)$. In such a case, we have
$$F_\eta(z) = F_\eta(z(t_0)) = z(t_0) + \int_{t_0}^{t_1} g_\eta(z(t), t)\,dt, \qquad F_\eta^{-1}(y) = y + \int_{t_1}^{t_0} g_\eta(z(t), t)\,dt.$$
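The following sketch integrates such a flow with a fixed-step Euler scheme to illustrate the forward map $F_\eta$. It omits the log-density (Jacobian trace) term and the adaptive ODE solvers used in practice, and the network $g_\eta$ is an arbitrary small MLP chosen purely for illustration.

```python
import torch
import torch.nn as nn

class TinyCNF(nn.Module):
    """Minimal sketch of a CNF forward map: integrate dz/dt = g_eta(z, t)
    from t0 to t1 with a fixed-step Euler scheme (no log-density tracking)."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z0: torch.Tensor, t0: float = 0.0, t1: float = 1.0, steps: int = 20):
        z, dt = z0, (t1 - t0) / steps
        for k in range(steps):
            t = torch.full_like(z[..., :1], t0 + k * dt)       # broadcast time coordinate
            z = z + dt * self.g(torch.cat([z, t], dim=-1))      # Euler step
        return z

# The inverse map integrates the same vector field from t1 back to t0 (negate dt);
# practical CNFs use an adaptive ODE solver and also integrate the trace of the
# Jacobian to obtain the change in log-density.
```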