UNCERTAINTY-AWARE META-LEARNING FOR MULTI-MODAL TASK DISTRIBUTIONS
Cesar Almecija1, Apoorva Sharma2& Navid Azizan1
1Massachusetts Institute of Technology
2NVIDIA Research
{almecija,azizan}@mit.edu,apoorvas@nvidia.com
ABSTRACT
Meta-learning or learning to learn is a popular approach for learning new tasks
with limited data (i.e., few-shot learning) by leveraging the commonalities among
different tasks. However, meta-learned models can perform poorly when context
data is limited, or when data is drawn from an out-of-distribution (OoD) task.
Especially in safety-critical settings, this necessitates an uncertainty-aware ap-
proach to meta-learning. In addition, the often multimodal nature of task distri-
butions can pose unique challenges to meta-learning methods. In this work, we
present UNLIMITD (uncertainty-aware meta-learning for multimodal task distri-
butions), a novel method for meta-learning that (1) makes probabilistic predictions
on in-distribution tasks efficiently, (2) is capable of detecting OoD context data at
test time, and (3) performs on heterogeneous, multimodal task distributions. To
achieve this goal, we take a probabilistic perspective and train a parametric, tune-
able distribution over tasks on the meta-dataset. We construct this distribution by
performing Bayesian inference on a linearized neural network, leveraging Gaus-
sian process theory. We demonstrate that UNLIMITD’s predictions compare fa-
vorably to, and outperform in most cases, the standard baselines, especially in the
low-data regime. Furthermore, we show that UNLIMITD is effective in detecting
data from OoD tasks. Finally, we confirm that both of these findings continue to
hold in the multimodal task-distribution setting. 1
1 INTRODUCTION
Learning to learn is essential to human intelligence but remains an open area of research in machine learning. Meta-learning has emerged as a popular approach to enable models to perform well on new tasks using limited data. It involves first a meta-training process, during which the model learns valuable features from a set of tasks. Then, at test time, using only a few datapoints from a new, unseen task, the model (1) adapts to this new task (i.e., performs few-shot learning with context data), and then (2) infers by making predictions on new, unseen query inputs from the same task. A popular baseline for
meta-learning, which has attracted a large amount of attention, is Model-Agnostic Meta-Learning
(MAML) (Finn et al., 2017), in which the adaptation process consists of fine-tuning the parameters
of the model via gradient descent.
However, meta-learning methods can often struggle in several ways when deployed in challenging
real-world scenarios. First, when context data is too limited to fully identify the test-time task,
accurate prediction can be challenging. As these predictions can be untrustworthy, this necessitates
the development of meta-learning methods that can express uncertainty during adaptation (Yoon
et al., 2018; Harrison et al., 2018). In addition, meta-learning models may not successfully adapt
to “unusual” tasks, i.e., when test-time context data is drawn from an out-of-distribution (OoD)
task not well represented in the training dataset (Jeong & Kim, 2020; Iwata & Kumagai, 2022).
Finally, special care has to be taken when learning tasks that have a large degree of heterogeneity.
An important example is the case of tasks with a multimodal distribution, i.e., when there are no common features shared across all the tasks, but the tasks can be broken down into subsets (modes) such that tasks within the same subset share common features (Vuorio et al., 2019).
This work was completed prior to Apoorva starting at NVIDIA Research.
1An implementation is available at https://github.com/azizanlab/unlimitd
[Figure 1: schematic showing data sampled from $p(f)$, used during meta-training to construct $\tilde{p}_\xi(f)$; panels show a data example and examples of functions drawn from $\tilde{p}_\xi(f)$.]
Figure 1: The true task distribution $p(f)$ can be multimodal, i.e., contain multiple clusters of tasks (e.g., lines and sines). Our approach UNLIMITD fits $p(f)$ with a parametric, tuneable distribution $\tilde{p}_\xi(f)$ yielded by Bayesian linear regression on a linearized neural network.
Our contributions. We present UNLIMITD (uncertainty-aware meta-learning for multimodal
task distributions), a novel meta-learning method that leverages probabilistic tools to address the
aforementioned issues. Specifically, UNLIMITD models the true distribution of tasks with a learn-
able distribution constructed over a linearized neural network and uses analytic Bayesian inference
to perform uncertainty-aware adaptation. We present three variants (namely, UNLIMITD-I, UNLIMITD-R, and UNLIMITD-F) that reflect a trade-off between learning a rich prior distribution over the
weights and maintaining the full expressivity of the network; we show that UNLIMITD-F strikes a
balance between the two, making it the most appealing variant. Finally, we demonstrate that (1) our
method allows for efficient probabilistic predictions on in-distribution tasks that compare favorably to, and in most cases outperform, the existing baselines, (2) it is effective in detecting context data
from OoD tasks at test time, and that (3) both these findings continue to hold in the multimodal
task-distribution setting.
The rest of the paper is organized as follows. Section 2 formalizes the problem. Section 3 presents
background information on the linearization of neural networks and Bayesian linear regression. We
detail our approach and its three variants in Section 4. We discuss related work in detail in Section 5.
Finally, we present our experimental results concerning the performance of UNLIMITD in Section 6
and conclude in Section 7.
2 PROBLEM STATEMENT
A task $\mathcal{T}_i$ consists of a function $f_i$ from which data is drawn. At test time, the prediction steps are broken down into (1) adaptation, that is, identifying $f_i$ using $K$ context datapoints $(\mathbf{X}_i, \mathbf{Y}_i)$ from the task, and (2) inference, that is, making predictions for $f_i$ on the query inputs $\mathbf{X}_i^*$. The predictions can later be compared with the query ground truths $\mathbf{Y}_i^*$ to estimate the quality of the prediction, for example in terms of mean squared error (MSE). Meta-training consists of learning valuable features from a cluster of tasks, which is a set of similar tasks (e.g., sines with different phases and amplitudes but the same frequency), so that at test time the predictions can be accurate on tasks from the same cluster. We take a probabilistic, functional perspective and represent a cluster by $p(f)$, a theoretical distribution over the function space that describes the probability of a task belonging to the cluster. Learning $p(f)$ is appealing, as it allows for performing OoD detection in addition to making predictions. Adaptation amounts to computing the conditional distribution given test context data, and one can obtain an uncertainty metric by evaluating the negative log-likelihood (NLL) of the context data under $p(f)$.
Thus, our goal is to construct a parametric, learnable functional distribution $\tilde{p}_\xi(f)$ that approaches the theoretical distribution $p(f)$, with a structure that allows tractable conditioning and likelihood computation, even in deep learning contexts. In practice, however, we are not given $p(f)$, but only a meta-training dataset $\mathcal{D}$ that we assume is sampled from $p(f)$: $\mathcal{D} = \{(\widetilde{\mathbf{X}}_i, \widetilde{\mathbf{Y}}_i)\}_{i=1}^{N}$, where $N$ is the number of tasks available during training, and $(\widetilde{\mathbf{X}}_i, \widetilde{\mathbf{Y}}_i) \sim \mathcal{T}_i$ is the entire pool of data from which we can draw subsets of context data $(\mathbf{X}_i, \mathbf{Y}_i)$. Consequently, in the meta-training phase, we aim to optimize $\tilde{p}_\xi(f)$ to capture properties of $p(f)$, using only the samples in $\mathcal{D}$.
Once we have $\tilde{p}_\xi(f)$, we can evaluate it both in terms of how it performs for few-shot learning (by comparing the predictions with the ground truths in terms of MSE), as well as for OoD detection (by measuring how well the NLL of context data serves to classify in-distribution tasks against OoD tasks, measured via the AUC-ROC score).
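To make the OoD-detection metric concrete, the following is a minimal sketch (in JAX, with illustrative variable names and made-up numbers of our own, not results or code from the paper) of how the AUC-ROC score can be computed from per-task context NLLs via the rank-based (Mann-Whitney) formulation, treating a higher NLL as evidence that the context data is OoD.
\begin{verbatim}
import jax.numpy as jnp

def auc_roc(nll_ood, nll_in):
    # Probability that a randomly chosen OoD task receives a higher NLL than a
    # randomly chosen in-distribution task (ties count as one half).
    wins = (nll_ood[:, None] > nll_in[None, :]).sum()
    ties = (nll_ood[:, None] == nll_in[None, :]).sum()
    return (wins + 0.5 * ties) / (nll_ood.size * nll_in.size)

# Illustrative numbers only:
nll_in = jnp.array([10.2, 12.5, 9.8, 11.0])
nll_ood = jnp.array([25.1, 19.4, 30.2, 22.7])
print(auc_roc(nll_ood, nll_in))  # 1.0: perfectly separated in this toy example
\end{verbatim}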
3 BACKGROUND
3.1 BAYESIAN LINEAR REGRESSION AND GAUSSIAN PROCESSES
Efficient Bayesian meta-learning requires a tractable inference process at test time. In general, this is possible analytically only in a few cases. One of them is Bayesian linear regression with Gaussian noise and a Gaussian prior on the weights. Viewed from a nonparametric, functional perspective, this model is equivalent to a Gaussian process (GP) (Rasmussen & Williams, 2005).
Let $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_K) \in \mathbb{R}^{N_x \times K}$ be a batch of $K$ $N_x$-dimensional inputs, and let $\mathbf{y} = (\mathbf{y}_1, \ldots, \mathbf{y}_K) \in \mathbb{R}^{N_y K}$ be a vectorized batch of $N_y$-dimensional outputs. In the Bayesian linear regression model, these quantities are related according to $\mathbf{y} = \phi(\mathbf{X})^\top \hat{\boldsymbol{\theta}} + \boldsymbol{\varepsilon} \in \mathbb{R}^{N_y K}$, where $\hat{\boldsymbol{\theta}} \in \mathbb{R}^P$ are the weights of the model, and the inputs are mapped via $\phi: \mathbb{R}^{N_x \times K} \to \mathbb{R}^{P \times N_y K}$. Notice how this is a generalization of the usual one-dimensional linear regression ($N_y = 1$).

If we assume a Gaussian prior on the weights $\hat{\boldsymbol{\theta}} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and Gaussian noise $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_\varepsilon)$ with $\boldsymbol{\Sigma}_\varepsilon = \sigma_\varepsilon^2 \mathbf{I}$, then the model describes a multivariate Gaussian distribution on $\mathbf{y}$ for any $\mathbf{X}$. Equivalently, this means that this model describes a GP distribution over functions, with mean and covariance function (or kernel)
$$\mu_{\mathrm{prior}}(\mathbf{x}_t) = \phi(\mathbf{x}_t)^\top \boldsymbol{\mu}, \qquad \mathrm{cov}_{\mathrm{prior}}(\mathbf{x}_{t_1}, \mathbf{x}_{t_2}) = \phi(\mathbf{x}_{t_1})^\top \boldsymbol{\Sigma}\, \phi(\mathbf{x}_{t_2}) + \boldsymbol{\Sigma}_\varepsilon =: k_{\boldsymbol{\Sigma}}(\mathbf{x}_{t_1}, \mathbf{x}_{t_2}) + \boldsymbol{\Sigma}_\varepsilon. \quad (1)$$
This GP enables tractable computation of the likelihood of any batch of data $(\mathbf{X}, \mathbf{Y})$ given this distribution over functions. The structure of this distribution is governed by the feature map $\phi$ and the prior over the weights, specified by $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$.
This distribution can also easily be conditioned to perform inference. Given a batch of data $(\mathbf{X}, \mathbf{Y})$, the posterior predictive distribution is also a GP, with an updated mean and covariance function
$$\mu_{\mathrm{post}}(\mathbf{x}_t) = k_{\boldsymbol{\Sigma}}(\mathbf{x}_t, \mathbf{X}) \left(k_{\boldsymbol{\Sigma}}(\mathbf{X}, \mathbf{X}) + \boldsymbol{\Sigma}_\varepsilon\right)^{-1} \mathbf{Y},$$
$$\mathrm{cov}_{\mathrm{post}}(\mathbf{x}_{t_1}, \mathbf{x}_{t_2}) = k_{\boldsymbol{\Sigma}}(\mathbf{x}_{t_1}, \mathbf{x}_{t_2}) - k_{\boldsymbol{\Sigma}}(\mathbf{x}_{t_1}, \mathbf{X}) \left(k_{\boldsymbol{\Sigma}}(\mathbf{X}, \mathbf{X}) + \boldsymbol{\Sigma}_\varepsilon\right)^{-1} k_{\boldsymbol{\Sigma}}(\mathbf{X}, \mathbf{x}_{t_2}). \quad (2)$$
Here, $\mu_{\mathrm{post}}(\mathbf{X}^*)$ represents our model's adapted predictions for the test data, which we can compare to $\mathbf{Y}^*$ to evaluate the quality of our predictions, for example, via mean squared error (assuming that test data is clean, following Rasmussen & Williams (2005)). The diagonal of $\mathrm{cov}_{\mathrm{post}}(\mathbf{X}^*, \mathbf{X}^*)$ can be interpreted as a per-input level of confidence that captures the ambiguity in making predictions with only a limited amount of context data.
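As a concrete illustration of equations (1) and (2), the following JAX sketch implements the prior and posterior of the feature-space GP for the scalar-output case ($N_y = 1$). The toy random-feature map and all variable names are our own assumptions rather than the paper's released code, and the posterior follows equation (2) as written (a zero prior mean; with a nonzero mean one conditions on the residual $\mathbf{Y} - \phi(\mathbf{X})^\top \boldsymbol{\mu}$).
\begin{verbatim}
import jax
import jax.numpy as jnp

def kernel(phi, X1, X2, Sigma):
    # k_Sigma(X1, X2) = phi(X1)^T Sigma phi(X2); here phi(X) has shape (K, P).
    return phi(X1) @ Sigma @ phi(X2).T

def prior(phi, X, mu, Sigma, noise_var):
    # Equation (1): prior predictive mean and covariance on a batch X.
    mean = phi(X) @ mu
    cov = kernel(phi, X, X, Sigma) + noise_var * jnp.eye(X.shape[0])
    return mean, cov

def posterior(phi, X_ctx, Y_ctx, X_query, Sigma, noise_var):
    # Equation (2): GP posterior after conditioning on context data (zero prior mean).
    K_cc = kernel(phi, X_ctx, X_ctx, Sigma) + noise_var * jnp.eye(X_ctx.shape[0])
    K_qc = kernel(phi, X_query, X_ctx, Sigma)
    K_qq = kernel(phi, X_query, X_query, Sigma)
    mean = K_qc @ jnp.linalg.solve(K_cc, Y_ctx)
    cov = K_qq - K_qc @ jnp.linalg.solve(K_cc, K_qc.T)
    return mean, cov

# Toy usage with a random cosine feature map standing in for the Jacobian features:
W = jax.random.normal(jax.random.PRNGKey(0), (16,))
phi = lambda X: jnp.cos(X[:, None] * W[None, :])          # shape (K, P=16)
X_ctx, Y_ctx = jnp.array([0.0, 1.0, 2.0]), jnp.array([0.1, 0.9, 0.2])
mean, cov = posterior(phi, X_ctx, Y_ctx, jnp.linspace(0.0, 2.0, 5), jnp.eye(16), 1e-2)
\end{verbatim}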
3.2 THE LINEARIZATION OF A NEURAL NETWORK YIELDS AN EXPRESSIVE LINEAR
REGRESSION MODEL
As discussed, the choice of feature map φplays an important role in specifying a linear regression
model. In the deep learning context, recent work has demonstrated that the linear model obtained
when linearizing a deep neural network with respect to its weights at initialization, wherein the
Jacobian of the network operates as the feature map, can well approximate the training behavior of
wide nonlinear deep neural networks (Jacot et al., 2018; Azizan et al., 2021; Liu et al., 2020; Neal,
1996; Lee et al., 2018).
Let $f$ be a neural network $f: (\boldsymbol{\theta}, \mathbf{x}_t) \mapsto \mathbf{y}_t$, where $\boldsymbol{\theta} \in \mathbb{R}^P$ are the parameters of the model, $\mathbf{x}_t \in \mathbb{R}^{N_x}$ is an input and $\mathbf{y}_t \in \mathbb{R}^{N_y}$ an output. The linearized network (w.r.t. the parameters) around $\boldsymbol{\theta}_0$ is
$$f(\boldsymbol{\theta}, \mathbf{x}_t) \approx f(\boldsymbol{\theta}_0, \mathbf{x}_t) + J_{\boldsymbol{\theta}}(f)(\boldsymbol{\theta}_0, \mathbf{x}_t)\,(\boldsymbol{\theta} - \boldsymbol{\theta}_0),$$
where $J_{\boldsymbol{\theta}}(f)(\cdot,\cdot): \mathbb{R}^P \times \mathbb{R}^{N_x} \to \mathbb{R}^{N_y \times P}$ is the Jacobian of the network (w.r.t. the parameters).
In the case where the model accepts a batch of $K$ inputs $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_K)$ and returns $\mathbf{Y} = (\mathbf{y}_1, \ldots, \mathbf{y}_K)$, we generalize $f$ to $g: \mathbb{R}^P \times \mathbb{R}^{N_x \times K} \to \mathbb{R}^{N_y \times K}$, with $\mathbf{Y} = g(\boldsymbol{\theta}, \mathbf{X})$. Consequently, we generalize the linearization:
$$g(\boldsymbol{\theta}, \mathbf{X}) \approx g(\boldsymbol{\theta}_0, \mathbf{X}) + J(\boldsymbol{\theta}_0, \mathbf{X})\,(\boldsymbol{\theta} - \boldsymbol{\theta}_0),$$
where $J(\cdot,\cdot): \mathbb{R}^P \times \mathbb{R}^{N_x \times K} \to \mathbb{R}^{N_y K \times P}$ is a shorthand for $J_{\boldsymbol{\theta}}(g)(\cdot,\cdot)$. Note that we have implicitly vectorized the outputs, and throughout the work, we will interchange the matrices in $\mathbb{R}^{N_y \times K}$ and the vectorized matrices in $\mathbb{R}^{N_y K}$.

This linearization can be viewed as the $N_y K$-dimensional linear regression
$$\mathbf{z} = \phi_{\boldsymbol{\theta}_0}(\mathbf{X})^\top \hat{\boldsymbol{\theta}} \in \mathbb{R}^{N_y K}, \quad (3)$$
where the feature map $\phi_{\boldsymbol{\theta}_0}(\cdot): \mathbb{R}^{N_x \times K} \to \mathbb{R}^{P \times N_y K}$ is the transposed Jacobian $J(\boldsymbol{\theta}_0, \cdot)^\top$. The parameters of this linear regression $\hat{\boldsymbol{\theta}} = (\boldsymbol{\theta} - \boldsymbol{\theta}_0)$ are the correction to the parameters chosen as the linearization point. Equivalently, this can be seen as a kernel regression with the kernel $k_{\boldsymbol{\theta}_0}(\mathbf{X}_1, \mathbf{X}_2) = J(\boldsymbol{\theta}_0, \mathbf{X}_1)\, J(\boldsymbol{\theta}_0, \mathbf{X}_2)^\top$, which is commonly referred to as the Neural Tangent Kernel (NTK) of the network. Note that the NTK depends on the linearization point $\boldsymbol{\theta}_0$. Building on these ideas, Maddox et al. (2021) show that the NTK obtained via linearizing a DNN after it has been trained on a task yields a GP that is well-suited for adaptation and fine-tuning to new, similar tasks. Furthermore, they show that networks trained on similar tasks tend to have similar Jacobians, suggesting that neural network linearization can yield an effective model for multi-task contexts such as meta-learning. In this work, we leverage these insights to construct our parametric functional distribution $\tilde{p}_\xi(f)$ via linearizing a neural network model.
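To make the Jacobian-as-feature-map view of equation (3) concrete, here is a short JAX sketch for a scalar-output network; the tiny MLP, its flat parameter layout, and the function names are illustrative assumptions rather than the paper's implementation.
\begin{verbatim}
import jax
import jax.numpy as jnp

def mlp(theta, x):
    # Tiny one-hidden-layer MLP with scalar input/output; theta is a flat vector (P = 97).
    W1, b1, W2, b2 = theta[:32], theta[32:64], theta[64:96], theta[96]
    return jnp.dot(W2, jnp.tanh(W1 * x + b1)) + b2

def jacobian_features(theta0, X):
    # phi_{theta0}(X): one row per input, each row is df(theta0, x)/dtheta (N_y = 1).
    return jax.vmap(lambda x: jax.grad(mlp, argnums=0)(theta0, x))(X)   # (K, P)

def ntk(theta0, X1, X2):
    # k_{theta0}(X1, X2) = J(theta0, X1) J(theta0, X2)^T, the NTK at theta0.
    return jacobian_features(theta0, X1) @ jacobian_features(theta0, X2).T

theta0 = 0.1 * jax.random.normal(jax.random.PRNGKey(0), (97,))
X = jnp.linspace(-1.0, 1.0, 5)
print(ntk(theta0, X, X).shape)   # (5, 5)
\end{verbatim}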
4 OUR APPROACH: UNLIMITD
In this section, we describe our meta-learning algorithm UNLIMITD and the construction of a
parametric functional distribution $\tilde{p}_\xi(f)$ that can model the true underlying distribution over tasks $p(f)$. First, we focus on the single-cluster case, where a Gaussian process structure on $\tilde{p}_\xi(f)$ can effectively model the true distribution of tasks, and detail how we can leverage meta-training data $\mathcal{D}$ from a single cluster of tasks to train the parameters $\xi$ of our model. Next, we will generalize our approach to the multimodal setting, with more than one cluster of tasks. Here, we construct $\tilde{p}_\xi(f)$ as a mixture of GPs and develop a training approach that can automatically identify the clusters present in the training dataset without requiring the meta-training dataset to contain any additional structure such as cluster labels.
4.1 TRACTABLY STRUCTURING THE PRIOR PREDICTIVE DISTRIBUTION OVER FUNCTIONS
VIA A GAUSSIAN DISTRIBUTION OVER THE WEIGHTS
In our approach, we choose $\tilde{p}_\xi(f)$ to be the GP distribution over functions that arises from a Gaussian prior on the weights of the linearization of a neural network (equation 3). Consider a particular task $\mathcal{T}_i$ and a batch of $K$ context data $(\mathbf{X}_i, \mathbf{Y}_i)$. The resulting prior predictive distribution, derived from equation 1 after evaluating on the context inputs, is $\mathbf{Y} \mid \mathbf{X}_i \sim \mathcal{N}(\boldsymbol{\mu}_{\mathbf{Y}|\mathbf{X}_i}, \boldsymbol{\Sigma}_{\mathbf{Y}|\mathbf{X}_i})$, where
$$\boldsymbol{\mu}_{\mathbf{Y}|\mathbf{X}_i} = J(\boldsymbol{\theta}_0, \mathbf{X}_i)\,\boldsymbol{\mu}, \qquad \boldsymbol{\Sigma}_{\mathbf{Y}|\mathbf{X}_i} = J(\boldsymbol{\theta}_0, \mathbf{X}_i)\,\boldsymbol{\Sigma}\,J(\boldsymbol{\theta}_0, \mathbf{X}_i)^\top + \boldsymbol{\Sigma}_\varepsilon. \quad (4)$$
In this setup, the parameters $\xi$ of $\tilde{p}_\xi(f)$ that we wish to optimize are the linearization point $\boldsymbol{\theta}_0$, and the parameters of the prior over the weights $(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Given this Gaussian prior, it is straightforward to compute the joint NLL of the context labels $\mathbf{Y}_i$,
$$\mathrm{NLL}(\mathbf{X}_i, \mathbf{Y}_i) = \frac{1}{2}\left( \left\lVert \mathbf{Y}_i - \boldsymbol{\mu}_{\mathbf{Y}|\mathbf{X}_i} \right\rVert^2_{\boldsymbol{\Sigma}^{-1}_{\mathbf{Y}|\mathbf{X}_i}} + \log \det \boldsymbol{\Sigma}_{\mathbf{Y}|\mathbf{X}_i} + N_y K \log 2\pi \right). \quad (5)$$
The NLL (a) serves as a loss function quantifying the quality of $\xi$ during training and (b) serves as an uncertainty signal at test time to evaluate whether context data $(\mathbf{X}_i, \mathbf{Y}_i)$ is OoD. Given this model, adaptation is tractable as we can condition this GP on the context data analytically. In addition, we can efficiently make probabilistic predictions by evaluating the mean and covariance of the resulting posterior predictive distribution on the query inputs, using equation 2.
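The following JAX sketch spells out equations (4) and (5) for a single task, assuming the vectorized Jacobian $J(\boldsymbol{\theta}_0, \mathbf{X}_i)$ has already been computed as a matrix of shape $(N_y K, P)$; names and shapes are our own assumptions. During meta-training, one would differentiate this NLL with respect to the learnable parameters (e.g., with `jax.grad`), and at test time the same quantity serves as the OoD score for a batch of context data.
\begin{verbatim}
import jax.numpy as jnp

def prior_predictive(jac, mu, Sigma, noise_var):
    # Equation (4): Gaussian prior predictive over the context labels Y | X_i.
    mean = jac @ mu
    cov = jac @ Sigma @ jac.T + noise_var * jnp.eye(jac.shape[0])
    return mean, cov

def nll(jac, Y, mu, Sigma, noise_var):
    # Equation (5): joint NLL of the (vectorized) context labels Y under eq. (4).
    mean, cov = prior_predictive(jac, mu, Sigma, noise_var)
    resid = Y - mean
    quad = resid @ jnp.linalg.solve(cov, resid)
    _, logdet = jnp.linalg.slogdet(cov)
    return 0.5 * (quad + logdet + Y.size * jnp.log(2.0 * jnp.pi))
\end{verbatim}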
4.1.1 PARAMETERIZING THE PRIOR COVARIANCE OVER THE WEIGHTS
When working with deep neural networks, the number of weights $P$ can surpass $10^6$. While it remains tractable to deal with $\boldsymbol{\theta}_0$ and $\boldsymbol{\mu}$, whose memory footprint grows linearly with $P$, it can quickly become intractable to make computations with (let alone store) a dense prior covariance matrix over the weights $\boldsymbol{\Sigma} \in \mathbb{R}^{P \times P}$. Thus, we must impose some structural assumptions on the prior covariance to scale to deep neural network models.
Imposing a unit covariance. One simple way to tackle this issue would be to remove $\boldsymbol{\Sigma}$ from the learnable parameters $\xi$, i.e., fixing it to be the identity $\boldsymbol{\Sigma} = \mathbf{I}_P$. In this case, $\xi = (\boldsymbol{\theta}_0, \boldsymbol{\mu})$. This computational benefit comes at the cost of model expressivity, as we lose a degree of freedom in how we can optimize our learned prior distribution $\tilde{p}_\xi(f)$. In particular, we are unable to choose a prior over the weights of our model that captures correlations between elements of the feature map.
Learning a low-dimensional representation of the covariance. An alternative is to learn a low-rank representation of $\boldsymbol{\Sigma}$, allowing for a learnable weight-space prior covariance that can encode correlations. Specifically, we consider a covariance of the form $\boldsymbol{\Sigma} = \mathbf{Q}^\top \mathrm{diag}(\mathbf{s}^2)\,\mathbf{Q}$, where $\mathbf{Q}$ is a fixed projection matrix onto an $s$-dimensional subspace of $\mathbb{R}^P$, while $\mathbf{s}^2$ is learnable. In this case, the parameters that are learned are $\xi = (\boldsymbol{\theta}_0, \boldsymbol{\mu}, \mathbf{s})$. We define $\mathbf{S} := \mathrm{diag}(\mathbf{s}^2)$. The computation of the covariance of the prior predictive (equation 4) could then be broken down into two steps:
$$\mathbf{A} := J(\boldsymbol{\theta}_0, \mathbf{X}_i)\,\mathbf{Q}^\top, \qquad J(\boldsymbol{\theta}_0, \mathbf{X}_i)\,\boldsymbol{\Sigma}\,J(\boldsymbol{\theta}_0, \mathbf{X}_i)^\top = \mathbf{A}\mathbf{S}\mathbf{A}^\top,$$
which requires a memory footprint of $O(P(s + N_y K))$, if we include the storage of the Jacobian. Because $N_y K \ll P$ in typical deep learning contexts, it suffices that $s \ll P$ so that it becomes tractable to deal with this new representation of the covariance.
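As a sketch (again with our own variable names), the two-step computation above never materializes the $P \times P$ covariance: only the projected features $\mathbf{A} = J(\boldsymbol{\theta}_0, \mathbf{X}_i)\,\mathbf{Q}^\top$ are formed.
\begin{verbatim}
import jax.numpy as jnp

def projected_prior_cov(jac, Q, s, noise_var):
    # jac: (N_y*K, P) Jacobian; Q: (s_dim, P) fixed projection; s: (s_dim,) learnable scales.
    A = jac @ Q.T                              # projected features, shape (N_y*K, s_dim)
    cov = (A * (s ** 2)) @ A.T                 # A @ diag(s^2) @ A^T = J Sigma J^T
    return cov + noise_var * jnp.eye(jac.shape[0])
\end{verbatim}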
A trade-off between feature-map expressiveness and learning a rich prior over the weights. Note that even if a low-dimensional representation of $\boldsymbol{\Sigma}$ enriches the prior distribution over the weights, it also restrains the expressiveness of the feature map in the kernel by projecting the $P$-dimensional features $J(\boldsymbol{\theta}_0, \mathbf{X})$ onto a subspace of size $s \ll P$ via $\mathbf{Q}$. This presents a trade-off: we can use the full feature map, but limit the weight-space prior covariance to be the identity matrix by keeping $\boldsymbol{\Sigma} = \mathbf{I}$ (case UNLIMITD-I). Alternatively, we could learn a low-rank representation of $\boldsymbol{\Sigma}$ by randomly choosing $s$ orthogonal directions in $\mathbb{R}^P$, with the risk that these directions limit the expressiveness of the feature map if they are not relevant to the problem under consideration (case UNLIMITD-R). As a compromise between these two cases, we can choose the projection matrix more intelligently and project onto the most impactful subspace of the full feature map; in this way, we can reap the benefits of a tuneable prior covariance while minimizing the useful features that the projection drops. To select this subspace, we construct the projection map from the top $s$ eigenvectors of the Fisher information matrix (FIM) evaluated on the training dataset $\mathcal{D}$ (case UNLIMITD-F). Recent work has shown that the FIM for deep neural networks tends to have rapid spectral decay (Sharma et al., 2021), which suggests that keeping only a few of the top eigenvectors of the FIM is enough to encode an expressive, task-tailored prior. See Appendix A.1 for more details.
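For UNLIMITD-F, one simple way to realize such a projection is sketched below: stack per-example Jacobian rows over the training data and keep the top right singular vectors, which are the leading eigenvectors of the empirical $J^\top J$ approximation of the FIM. The exact FIM estimator used by the paper is specified in its Appendix A.1; this snippet is only an illustration with assumed names and shapes.
\begin{verbatim}
import jax.numpy as jnp

def fim_top_eigvecs(jac_stack, s_dim):
    # jac_stack: (M, P) Jacobian rows collected over the meta-training data.
    # Right singular vectors of J are eigenvectors of J^T J (a FIM approximation).
    _, _, Vt = jnp.linalg.svd(jac_stack, full_matrices=False)
    return Vt[:s_dim]                          # Q, shape (s_dim, P)
\end{verbatim}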
4.1.2 GENERALIZING THE STRUCTURE TO A MIXTURE OF GAUSSIANS
When learning on multiple clusters of tasks, $p(f)$ can become multimodal, and thus cannot be accurately described by a single GP. Instead, we can capture this multimodality by structuring $\tilde{p}_\xi(f)$ as a mixture of Gaussian processes.
Building a more general structure. We assume that at train time, a task $\mathcal{T}_i$ comes from any cluster $\{C_j\}_{j=1}^{\alpha}$ with equal probability. Thus, we choose to construct $\tilde{p}_\xi(f)$ as an equal-weighted mixture of $\alpha$ Gaussian processes.

For each element of the mixture, the structure is similar to the single-cluster case, where the parameters of the cluster's weight-space prior are given by $(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$. We choose to have both the projection matrix $\mathbf{Q}$ and the linearization point $\boldsymbol{\theta}_0$ (and hence, the feature map $\phi(\cdot) = J(\boldsymbol{\theta}_0, \cdot)$) shared across the clusters. This yields improved computational efficiency, as we can compute the projected features once, simultaneously, for all clusters. This yields the parameters $\xi_\alpha = (\boldsymbol{\theta}_0, \mathbf{Q}, (\boldsymbol{\mu}_1, \mathbf{s}_1), \ldots, (\boldsymbol{\mu}_\alpha, \mathbf{s}_\alpha))$.

This can be viewed as a mixture of linear regression models, with a common feature map but separate, independent prior distributions over the weights for each cluster. These separate distributions are encoded using the low-dimensional representations $\mathbf{S}_j$ for each $\boldsymbol{\Sigma}_j$. Notice how this is a generalization of the single-cluster case, for when $\alpha = 1$, $\tilde{p}_\xi(f)$ becomes a Gaussian and $\xi_\alpha = \xi$.²
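As an illustration of how such a mixture can be scored on a task's context data, the sketch below evaluates the NLL of an equal-weighted mixture of the $\alpha$ cluster GPs, reusing the shared projected features. This is our own hedged reading of the construction, with assumed names and shapes; the paper's precise multimodal training objective is developed in the remainder of the text.
\begin{verbatim}
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def mixture_nll(jac, Y, Q, mus, ss, noise_var):
    # -log( (1/alpha) * sum_j N(Y; J mu_j, A S_j A^T + noise) ), with A = J Q^T shared.
    A = jac @ Q.T
    log_probs = []
    for mu_j, s_j in zip(mus, ss):             # one Gaussian per cluster C_j
        cov = (A * (s_j ** 2)) @ A.T + noise_var * jnp.eye(jac.shape[0])
        resid = Y - jac @ mu_j
        quad = resid @ jnp.linalg.solve(cov, resid)
        _, logdet = jnp.linalg.slogdet(cov)
        log_probs.append(-0.5 * (quad + logdet + Y.size * jnp.log(2.0 * jnp.pi)))
    return -(logsumexp(jnp.stack(log_probs)) - jnp.log(len(mus)))
\end{verbatim}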
²In theory, it is possible to drop $\mathbf{Q}$ and extend the identity-covariance case to the multi-cluster setting; however, this leads to each cluster having an identical covariance function, and thus is not effective at modeling heterogeneous behaviors among clusters.