
Imposing a unit covariance. One simple way to tackle this issue is to remove $\Sigma$ from the
learnable parameters $\xi$, i.e., to fix it to the identity $\Sigma = I_P$. In this case, $\xi = (\theta_0, \mu)$. This
simplification is computationally convenient, but it comes at the cost of model expressivity, as we lose a degree of freedom in
how we can optimize our learned prior distribution $\tilde{p}_\xi(f)$. In particular, we can no longer choose a
prior over the weights of our model that captures correlations between elements of the feature map.
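Concretely, with $\Sigma = I_P$ the weight-space term of the prior predictive covariance (equation 4) simplifies to
$$J(\theta_0, X_i)\, \Sigma\, J(\theta_0, X_i)^\top = J(\theta_0, X_i)\, J(\theta_0, X_i)^\top,$$
so no $P \times P$ covariance ever has to be stored or optimized.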
Learning a low-dimensional representation of the covariance. An alternative is to learn a low-
rank representation of $\Sigma$, allowing for a learnable weight-space prior covariance that can encode
correlations. Specifically, we consider a covariance of the form $\Sigma = Q^\top \mathrm{diag}(s^2)\, Q$, where $Q$ is a
fixed projection matrix onto an $s$-dimensional subspace of $\mathbb{R}^P$, while $s^2$ is learnable. In this case, the
parameters that are learned are $\xi = (\theta_0, \mu, s)$. We define $S := \mathrm{diag}(s^2)$. The computation of the
covariance of the prior predictive (equation 4) can then be broken down into two steps:
$$A := J(\theta_0, X_i)\, Q^\top,$$
$$J(\theta_0, X_i)\, \Sigma\, J(\theta_0, X_i)^\top = A S A^\top,$$
which requires a memory footprint of $O(P(s + N_y K))$ if we include the storage of the Jacobian.
Because $N_y K \ll P$ in typical deep learning contexts, it suffices that $s \ll P$ for this new
representation of the covariance to remain tractable.
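As an illustration, here is a minimal JAX-style sketch of this two-step computation; the array names (jac for $J(\theta_0, X_i)$, Q, s) and shapes are assumptions made for the example, not the exact implementation.

import jax.numpy as jnp

def low_rank_prior_cov(jac, Q, s):
    # jac: (Ny*K, P)  Jacobian of the network outputs at theta_0
    # Q:   (s_dim, P) fixed projection onto an s_dim-dimensional subspace of R^P
    # s:   (s_dim,)   learnable scales, with S = diag(s**2)
    A = jac @ Q.T            # step 1: project the features, shape (Ny*K, s_dim)
    return (A * s**2) @ A.T  # step 2: A S A^T, shape (Ny*K, Ny*K)

Only matrices with at most $P \cdot \max(s, N_y K)$ entries are materialized, matching the stated memory footprint.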
A trade-off between feature-map expressiveness and learning a rich prior over the weights.
Note that even though a low-dimensional representation of $\Sigma$ enriches the prior distribution over the
weights, it also restrains the expressiveness of the feature map in the kernel by projecting the $P$-
dimensional features $J(\theta_0, X)$ onto a subspace of size $s \ll P$ via $Q$. This presents a trade-off: we
can use the full feature map, but limit the weight-space prior covariance to the identity matrix
by keeping $\Sigma = I$ (case UNLIMITD-I). Alternatively, we can learn a low-rank representation
of $\Sigma$ by randomly choosing $s$ orthogonal directions in $\mathbb{R}^P$, with the risk that these directions limit the
expressiveness of the feature map if they are not relevant to the problem under consideration
(case UNLIMITD-R). As a compromise between these two cases, we can choose the projection
matrix more carefully and project onto the most impactful subspace of the full feature map; in
this way, we reap the benefits of a tunable prior covariance while minimizing the useful features
that the projection drops. To select this subspace, we construct the projection matrix from the
top $s$ eigenvectors of the Fisher information matrix (FIM) evaluated on the training dataset $D$ (case
UNLIMITD-F). Recent work has shown that the FIM of deep neural networks tends to exhibit rapid
spectral decay (Sharma et al., 2021), which suggests that keeping only a few of its top eigenvectors
is enough to encode an expressive, task-tailored prior. See Appendix A.1 for more details.
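One possible way to realize this construction is sketched below, under the assumption that the FIM is approximated by $J_D^\top J_D$ for the Jacobians stacked over $D$; in that case its top eigenvectors coincide with the top right-singular vectors of the stacked Jacobian, so a thin SVD suffices. The names (jac_train, s_dim) are illustrative, not the authors' implementation.

import jax.numpy as jnp

def fim_projection(jac_train, s_dim):
    # jac_train: (M, P) Jacobians stacked over the training dataset D
    # If the FIM is approximated as F = jac_train.T @ jac_train, the eigenvectors
    # of F are the right-singular vectors of jac_train, so a thin SVD avoids
    # ever forming the P x P matrix F.
    _, _, vt = jnp.linalg.svd(jac_train, full_matrices=False)
    return vt[:s_dim]  # (s_dim, P): a candidate projection matrix Q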
4.1.2 GENERALIZING THE STRUCTURE TO A MIXTURE OF GAUSSIANS
When learning on multiple clusters of tasks, $p(f)$ can become multimodal, and thus can no longer be
accurately described by a single GP. Instead, we can capture this multimodality by structuring $\tilde{p}_\xi(f)$
as a mixture of Gaussian processes.
Building a more general structure. We assume that at train time a task $T_i$ comes from any of the
clusters $\{C_j\}_{j=1}^{\alpha}$ with equal probability. Thus, we choose to construct $\tilde{p}_\xi(f)$ as an equal-weighted
mixture of $\alpha$ Gaussian processes.
For each element of the mixture, the structure is similar to the single-cluster case, with the pa-
rameters of the $j$-th cluster's weight-space prior given by $(\mu_j, \Sigma_j)$. We choose to share both the
projection matrix $Q$ and the linearization point $\theta_0$ (and hence the feature map $\phi(\cdot) = J(\theta_0, \cdot)$)
across the clusters. This yields improved computational efficiency, as we can compute
the projected features once, simultaneously, for all clusters. The resulting parameters are $\xi_\alpha =
(\theta_0, Q, (\mu_1, s_1), \ldots, (\mu_\alpha, s_\alpha))$.
This can be viewed as a mixture of linear regression models, with a common feature map but sepa-
rate, independent prior distributions over the weights of each cluster. These separate distributions
are encoded using the low-dimensional representations $S_j$ of each $\Sigma_j$. Notice that this is a gener-
alization of the single-cluster case: when $\alpha = 1$, $\tilde{p}_\xi(f)$ becomes a Gaussian and $\xi_\alpha = \xi$.²
²In theory, it is possible to drop $Q$ and extend the identity-covariance case to the multi-cluster setting;
however, this leads to each cluster having an identical covariance function, and is thus not effective at modeling
heterogeneous behaviors among clusters.
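To make the shared-feature structure concrete, a hypothetical sketch of the per-cluster prior predictive covariances is given below, reusing the projected features across all $\alpha$ clusters; the container of per-cluster scales and all names are assumptions made for illustration.

import jax.numpy as jnp

def mixture_prior_covs(jac, Q, scales):
    # jac:    (Ny*K, P)       Jacobian at the shared linearization point theta_0
    # Q:      (s_dim, P)      projection matrix shared across all clusters
    # scales: (alpha, s_dim)  per-cluster scales s_j, with S_j = diag(s_j**2)
    A = jac @ Q.T  # projected features, computed once for all clusters
    return jnp.stack([(A * s**2) @ A.T for s in scales])  # (alpha, Ny*K, Ny*K)

Under the equal-weighted mixture, the prior predictive density of a task is then the average of the $\alpha$ Gaussian densities built from these covariances and the corresponding cluster means.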