Multi-Modal Fusion by Meta-Initialization
Matthew T. Jackson
Department of Engineering Science
University of Oxford
jackson@robots.ox.ac.uk
Shreshth A. Malik*
Department of Engineering Science
University of Oxford
shreshth@robots.ox.ac.uk
Michael T. Matthews
Department of Computer Science
University College London
Yousuf Mohamed-Ahmed
Department of Computer Science
University College London
Abstract
When experience is scarce, models may have insufficient information to adapt
to a new task. In this case, auxiliary information—such as a textual descrip-
tion of the task—can enable improved task inference and adaptation. In this
work, we propose an extension to the Model-Agnostic Meta-Learning algorithm
(MAML), which allows the model to adapt using auxiliary information as well as
task experience. Our method, Fusion by Meta-Initialization (FuMI), conditions
the model initialization on auxiliary information using a hypernetwork, rather than
learning a single, task-agnostic initialization. Furthermore, motivated by the short-
comings of existing multi-modal few-shot learning benchmarks, we constructed
iNat-Anim—a large-scale image classification dataset with succinct and visually
pertinent textual class descriptions. On iNat-Anim, FuMI significantly outper-
forms uni-modal baselines such as MAML in the few-shot regime. The code for
this project and a dataset exploration tool for iNat-Anim are publicly available at
https://github.com/s-a-malik/multi-few.
1 Introduction
Learning effectively in resource-constrained environments is an open challenge in machine learning
[1, 2, 3]. Yet humans are capable of rapidly learning new tasks from limited experience, in part by
drawing on auxiliary information about the task. This information can be particularly helpful in
the few-shot regime, as it can highlight features that have not been seen directly in task experience,
but are necessary to solve the task. For example, Figure 1 shows an example image classification
task where a text description of the class contains discriminative information that is not contained in
the training (support) images. Designing algorithms that can incorporate auxiliary information into
meta-learning approaches has consequently attracted much attention [4, 5, 6, 7, 8, 9, 10].
Model-agnostic meta-learning (MAML) [1] is a popular method for few-shot learning. However, it
cannot incorporate auxiliary task information. In this work, we propose Fusion by Meta-Initialization
(FuMI), an extension of MAML which uses a hypernetwork [11] to learn a mapping from auxiliary
task information to a parameter initialization. While MAML learns an initialization that facilitates
rapid learning across all tasks, FuMI conditions the initialization on the specific task to enable
improved adaptation.
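As an illustrative sketch of this idea (not the paper's implementation; the dimensions, the linear hypernetwork, and the toy data below are our own assumptions), the following conditions a linear classifier's initialization on a task text embedding, then adapts it with a MAML-style inner loop on the support set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: text-embedding size, input-feature size, number of classes.
T_DIM, X_DIM, K = 8, 4, 3

# Hypernetwork parameters phi: here simply a linear map from the task's text
# embedding to a flat parameter vector for a K-way linear classifier.
phi = rng.normal(scale=0.1, size=(T_DIM, X_DIM * K))

def hyper_init(text_emb):
    """g_phi: auxiliary text embedding -> task-conditioned initialization."""
    return (text_emb @ phi).reshape(X_DIM, K)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def inner_adapt(W, x_support, y_support, lr=0.2, steps=20):
    """MAML-style inner loop: gradient steps on the support-set cross-entropy."""
    for _ in range(steps):
        p = softmax(x_support @ W)
        onehot = np.eye(K)[y_support]
        grad = x_support.T @ (p - onehot) / len(y_support)  # dL/dW
        W = W - lr * grad
    return W

# One synthetic task: a text embedding for the class descriptions,
# plus N=2 support shots per class.
text_emb = rng.normal(size=T_DIM)
x_support = rng.normal(size=(2 * K, X_DIM))
y_support = np.repeat(np.arange(K), 2)

W0 = hyper_init(text_emb)                       # conditioned initialization
W_adapted = inner_adapt(W0, x_support, y_support)  # task adaptation
```

In the full method, the hypernetwork would be trained jointly with the model across tasks through the meta-objective; this sketch shows only a single task's inner loop.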
Existing multi-modal few-shot learning benchmarks largely rely on hand-crafted feature vectors for
each class [12, 13], or use noisy language descriptions from sources such as Wikipedia [14, 15].
*Equal contribution
Preprint. Under review.
arXiv:2210.04843v1 [cs.LG] 10 Oct 2022
Figure 1: An example few-shot learning task, using images and class descriptions from our proposed
dataset, iNat-Anim. Here, we see the class description contains information (the colour of the bird’s
breast) which is not found in the class images (as they are all turned away).
For this reason, we release iNat-Anim—a large animal species image classification dataset with
high quality descriptions of visual features. On this benchmark, we find that FuMI significantly
outperforms MAML in the very-few-shot regime.
2 Background
In the meta-learning framework [1], we suppose tasks are drawn from a task distribution $p(\mathcal{T})$. At meta-train time, the model $f_\theta$ is evaluated on a series of tasks $\mathcal{T}_i \in \mathcal{D}_{\text{train}}$, where $\mathcal{D}_{\text{train}}$ is a finite set of samples from $p(\mathcal{T})$. This gives task loss $\mathcal{L}_{\mathcal{T}_i}$, which is used to update the model parameters $\theta$ in accordance with the meta-learning algorithm. At meta-test time, the trained model is evaluated on all tasks in $\mathcal{D}_{\text{test}}$, another set of samples from $p(\mathcal{T})$.
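The meta-train loop described above can be written as a generic skeleton (a minimal sketch under our own assumptions; the actual parameter update depends on the chosen meta-learning algorithm, and the toy scalar instantiation below is purely illustrative):

```python
import random

def meta_train(theta, train_tasks, task_loss, meta_step, epochs=3, batch=2):
    """Sample tasks T_i from D_train, compute each task loss L_Ti, and update
    theta with the meta-learning algorithm's step rule."""
    for _ in range(epochs):
        for task in random.sample(train_tasks, k=min(batch, len(train_tasks))):
            loss = task_loss(theta, task)
            theta = meta_step(theta, loss, task)
    return theta

# Toy instantiation: theta is a scalar, each "task" is a target value,
# the task loss is squared error, and the step is plain gradient descent
# (this particular step rule ignores the precomputed loss value).
tasks = [1.0, 1.0, 1.0]
theta0 = 5.0
sq_loss = lambda theta, t: (theta - t) ** 2
sgd_step = lambda theta, loss, t: theta - 0.1 * 2.0 * (theta - t)

random.seed(0)
theta_final = meta_train(theta0, tasks, sq_loss, sgd_step)
```

Swapping in a MAML-style `meta_step` (inner-loop adaptation followed by an outer gradient through the adapted parameters) recovers the setting used in this paper.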
In an $N$-shot, $K$-way multi-modal classification problem², a task $\mathcal{T} = (\mathcal{S}, \mathcal{Q})$ is defined by a support set $\mathcal{S} = \{(\{x_{i,j}\}_{j=1}^{N}, t_i, y_i)\}_{i=1}^{K}$ and a query set $\mathcal{Q} = \{(\{x_{i,j}\}_{j=1}^{M}, y_i)\}_{i=1}^{K}$, where $M$ is the number of query shots. The support set contains $N$ samples and auxiliary class information $t_i$ for each of the $K$ classes, which are used by the meta-learner to train an adapted model. Once this has been trained, the adapted model is evaluated on the unseen query set, giving task loss $\mathcal{L}_{\mathcal{Q}}$. In the context of our work, $t_i$ denotes the textual description of the class $y_i$, meaning each class has a textual description and $N$ support images. Figure 1 shows an example task using the notation outlined here.
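Concretely, a task under this notation can be represented by a simple container (a hypothetical sketch; the field names and the toy 1-shot, 2-way example are our own, not from the paper):

```python
from dataclasses import dataclass
from typing import List, Tuple

Image = List[float]  # stand-in for an image or image-feature vector

@dataclass
class SupportClass:
    images: List[Image]  # the N support shots {x_ij} for class i
    description: str     # auxiliary text t_i
    label: int           # class label y_i

@dataclass
class Task:
    support: List[SupportClass]     # S: K entries, one per class
    query: List[Tuple[Image, int]]  # Q: (image, label) query pairs

# A toy 1-shot, 2-way task with single-value "images".
task = Task(
    support=[
        SupportClass(images=[[0.1]], description="a small red bird", label=0),
        SupportClass(images=[[0.9]], description="a large blue fish", label=1),
    ],
    query=[([0.2], 0), ([0.8], 1)],
)
```

A meta-learner in this setting consumes `task.support` (images plus descriptions) to adapt, and is scored on `task.query` alone.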
3 Data
Existing Multi-Modal Few-shot Benchmarks.
While there are a number of popular uni-modal
few-shot learning benchmarks [16, 17, 18], multi-modal benchmarks are less common. Some
works simply extend few-shot benchmarks by using the class label as auxiliary information [6, 19].
Benchmarks explicitly incorporating auxiliary modalities include Animals with Attributes (AWA)
[12] and Caltech-UCSD-Birds (CUB) [13] which augment images of animals/birds with hand-crafted
class attributes. While semantic class features can be highly discriminative, they require manual
labelling and are thus difficult to obtain at scale. Recent work instead takes the more general approach of using natural language descriptions, for example by augmenting CUB with Wikipedia articles
[14, 15]. However, these articles are subject to change and visual information is sparse, thus reducing
the relative benefit of the auxiliary information.
The iNat-Anim Dataset.
Motivated by these shortcomings, we constructed the iNat-Anim³ dataset.
iNat-Anim consists of 195,605 images across 673 animal species, which is orders of magnitude larger
than existing benchmarks (AWA and CUB). The images are a subset of the iNaturalist 2021 CVPR
challenge [20] and have been augmented with textual descriptions from Animalia [21] to provide
²For consistency with our dataset, the problem setting formulation is for classification. However, our method can also be applied to regression and reinforcement learning.
³https://doi.org/10.5281/zenodo.6703088