Multi-Modal Fusion by Meta-Initialization
Matthew T. Jackson
Department of Engineering Science
University of Oxford
jackson@robots.ox.ac.uk
Shreshth A. Malik*
Department of Engineering Science
University of Oxford
shreshth@robots.ox.ac.uk
Michael T. Matthews
Department of Computer Science
University College London
Yousuf Mohamed-Ahmed
Department of Computer Science
University College London
Abstract
When experience is scarce, models may have insufficient information to adapt
to a new task. In this case, auxiliary information—such as a textual descrip-
tion of the task—can enable improved task inference and adaptation. In this
work, we propose an extension to the Model-Agnostic Meta-Learning algorithm
(MAML), which allows the model to adapt using auxiliary information as well as
task experience. Our method, Fusion by Meta-Initialization (FuMI), conditions
the model initialization on auxiliary information using a hypernetwork, rather than
learning a single, task-agnostic initialization. Furthermore, motivated by the short-
comings of existing multi-modal few-shot learning benchmarks, we constructed
iNat-Anim—a large-scale image classification dataset with succinct and visually
pertinent textual class descriptions. On iNat-Anim, FuMI significantly outper-
forms uni-modal baselines such as MAML in the few-shot regime. The code for
this project and a dataset exploration tool for iNat-Anim are publicly available at
https://github.com/s-a-malik/multi-few.
1 Introduction
Learning effectively in resource-constrained environments is an open challenge in machine learning
[1, 2, 3]. Yet humans are capable of rapidly learning new tasks from limited experience, in part by
drawing on auxiliary information about the task. This information can be particularly helpful in
the few-shot regime, as it can highlight features that have not been seen directly in task experience,
but are necessary to solve the task. For example, Figure 1 shows an example image classification
task where a text description of the class contains discriminative information that is not contained in
the training (support) images. Designing algorithms that can incorporate auxiliary information into
meta-learning approaches has consequently attracted much attention [4, 5, 6, 7, 8, 9, 10].
Model-agnostic meta-learning (MAML) [1] is a popular method for few-shot learning. However, it
cannot incorporate auxiliary task information. In this work, we propose Fusion by Meta-Initialization
(FuMI), an extension of MAML which uses a hypernetwork [11] to learn a mapping from auxiliary
task information to a parameter initialization. While MAML learns an initialization that facilitates
rapid learning across all tasks, FuMI conditions the initialization on the specific task to enable
improved adaptation.
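As an illustrative sketch of this idea (not the paper's implementation; the dimensions, the linear hypernetwork, and the toy data below are our own assumptions), the following conditions a linear classifier's initialization on a task text embedding, then adapts it with a MAML-style inner loop on the support set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: text-embedding size, input-feature size, number of classes.
T_DIM, X_DIM, K = 8, 4, 3

# Hypernetwork parameters phi: here simply a linear map from the task's text
# embedding to a flat parameter vector for a K-way linear classifier.
phi = rng.normal(scale=0.1, size=(T_DIM, X_DIM * K))

def hyper_init(text_emb):
    """g_phi: auxiliary text embedding -> task-conditioned initialization."""
    return (text_emb @ phi).reshape(X_DIM, K)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def inner_adapt(W, x_support, y_support, lr=0.2, steps=20):
    """MAML-style inner loop: gradient steps on the support-set cross-entropy."""
    for _ in range(steps):
        p = softmax(x_support @ W)
        onehot = np.eye(K)[y_support]
        grad = x_support.T @ (p - onehot) / len(y_support)  # dL/dW
        W = W - lr * grad
    return W

# One synthetic task: a text embedding for the class descriptions,
# plus N=2 support shots per class.
text_emb = rng.normal(size=T_DIM)
x_support = rng.normal(size=(2 * K, X_DIM))
y_support = np.repeat(np.arange(K), 2)

W0 = hyper_init(text_emb)                       # conditioned initialization
W_adapted = inner_adapt(W0, x_support, y_support)  # task adaptation
```

In the full method, the hypernetwork would be trained jointly with the model across tasks through the meta-objective; this sketch shows only a single task's inner loop.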
Existing multi-modal few-shot learning benchmarks largely rely on hand-crafted feature vectors for
each class [12, 13], or use noisy language descriptions from sources such as Wikipedia [14, 15].
*Equal contribution
Preprint. Under review.
arXiv:2210.04843v1 [cs.LG] 10 Oct 2022
Figure 1: An example few-shot learning task, using images and class descriptions from our proposed
dataset, iNat-Anim. Here, we see the class description contains information (the colour of the bird’s
breast) which is not found in the class images (as they are all turned away).
For this reason, we release iNat-Anim—a large animal species image classification dataset with
high quality descriptions of visual features. On this benchmark, we find that FuMI significantly
outperforms MAML in the very-few-shot regime.
2 Background
In the meta-learning framework [1], we suppose tasks are drawn from a task distribution $p(\mathcal{T})$. At meta-train time, the model $f_\theta$ is evaluated on a series of tasks $\mathcal{T}_i \in \mathcal{D}_{\text{train}}$, where $\mathcal{D}_{\text{train}}$ is a finite set of samples from $p(\mathcal{T})$. This gives task loss $\mathcal{L}_{\mathcal{T}_i}$, which is used to update the model parameters $\theta$ in accordance with the meta-learning algorithm. At meta-test time, the trained model is evaluated on all tasks in $\mathcal{D}_{\text{test}}$, another set of samples from $p(\mathcal{T})$.
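The meta-train loop described above can be written as a generic skeleton (a minimal sketch under our own assumptions; the actual parameter update depends on the chosen meta-learning algorithm, and the toy scalar instantiation below is purely illustrative):

```python
import random

def meta_train(theta, train_tasks, task_loss, meta_step, epochs=3, batch=2):
    """Sample tasks T_i from D_train, compute each task loss L_Ti, and update
    theta with the meta-learning algorithm's step rule."""
    for _ in range(epochs):
        for task in random.sample(train_tasks, k=min(batch, len(train_tasks))):
            loss = task_loss(theta, task)
            theta = meta_step(theta, loss, task)
    return theta

# Toy instantiation: theta is a scalar, each "task" is a target value,
# the task loss is squared error, and the step is plain gradient descent
# (this particular step rule ignores the precomputed loss value).
tasks = [1.0, 1.0, 1.0]
theta0 = 5.0
sq_loss = lambda theta, t: (theta - t) ** 2
sgd_step = lambda theta, loss, t: theta - 0.1 * 2.0 * (theta - t)

random.seed(0)
theta_final = meta_train(theta0, tasks, sq_loss, sgd_step)
```

Swapping in a MAML-style `meta_step` (inner-loop adaptation followed by an outer gradient through the adapted parameters) recovers the setting used in this paper.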
In an $N$-shot, $K$-way multi-modal classification problem², a task $\mathcal{T} = (\mathcal{S}, \mathcal{Q})$ is defined by a support set $\mathcal{S} = \{(\{x_{i,j}\}_{j=1}^{N}, t_i, y_i)\}_{i=1}^{K}$ and a query set $\mathcal{Q} = \{(\{x_{i,j}\}_{j=1}^{M}, y_i)\}_{i=1}^{K}$, where $M$ is the number of query shots. The support set contains $N$ samples and auxiliary class information $t_i$ for each of the $K$ classes, which are used by the meta-learner to train an adapted model. Once this has been trained, the adapted model is evaluated on the unseen query set, giving task loss $\mathcal{L}_{\mathcal{Q}}$. In the context of our work, $t_i$ denotes the textual description of the class $y_i$, meaning each class has a textual description and $N$ support images. Figure 1 shows an example task using the notation outlined here.
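Concretely, a task under this notation can be represented by a simple container (a hypothetical sketch; the field names and the toy 1-shot, 2-way example are our own, not from the paper):

```python
from dataclasses import dataclass
from typing import List, Tuple

Image = List[float]  # stand-in for an image or image-feature vector

@dataclass
class SupportClass:
    images: List[Image]  # the N support shots {x_ij} for class i
    description: str     # auxiliary text t_i
    label: int           # class label y_i

@dataclass
class Task:
    support: List[SupportClass]     # S: K entries, one per class
    query: List[Tuple[Image, int]]  # Q: (image, label) query pairs

# A toy 1-shot, 2-way task with single-value "images".
task = Task(
    support=[
        SupportClass(images=[[0.1]], description="a small red bird", label=0),
        SupportClass(images=[[0.9]], description="a large blue fish", label=1),
    ],
    query=[([0.2], 0), ([0.8], 1)],
)
```

A meta-learner in this setting consumes `task.support` (images plus descriptions) to adapt, and is scored on `task.query` alone.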
3 Data
Existing Multi-Modal Few-shot Benchmarks.
While there are a number of popular uni-modal
few-shot learning benchmarks [16, 17, 18], multi-modal benchmarks are less common. Some
works simply extend few-shot benchmarks by using the class label as auxiliary information [6, 19].
Benchmarks explicitly incorporating auxiliary modalities include Animals with Attributes (AWA)
[12] and Caltech-UCSD-Birds (CUB) [13] which augment images of animals/birds with hand-crafted
class attributes. While semantic class features can be highly discriminative, they require manual
labelling and are thus difficult to obtain at scale. Recent work instead takes the more general approach of using natural language descriptions, for example by augmenting CUB with Wikipedia articles
[14, 15]. However, these articles are subject to change and visual information is sparse, thus reducing
the relative benefit of the auxiliary information.
The iNat-Anim Dataset.
Motivated by these shortcomings, we constructed the iNat-Anim³ dataset.
iNat-Anim consists of 195,605 images across 673 animal species, which is orders of magnitude larger
than existing benchmarks (AWA and CUB). The images are a subset of the iNaturalist 2021 CVPR
challenge [20] and have been augmented with textual descriptions from Animalia [21] to provide
²For consistency with our dataset, the problem setting formulation is for classification. However, our method can also be applied to regression and reinforcement learning.
³https://doi.org/10.5281/zenodo.6703088