challenge, few-shot learning (FSL) has been introduced for image classification, enabling models to learn and generalize from limited data.
The main paradigm of FSL is to train a model on the base classes and require it to accurately classify the novel classes given only a limited number of examples, a setting that remains constrained by data scarcity. Various initial lines of work study the problem of few-shot learning for image classification [22,18,9,6] and establish strong baselines to improve upon. Meta-learning was for a time the predominant approach to FSL. More recently, however, several works have adopted the standard supervised setting [20] together with various self-supervised approaches [15,10,19] to improve results. It should be noted that identifying visual categories using only class labels (numerical IDs) severely limits the contextual information available about each category, since only a limited number of examples are provided. Addressing this gap, a recent line of works [24,17,12,2] uses semantic features as prior knowledge or as an auxiliary training mechanism to enhance FSL performance. RS-FSL [2] is the most recent of these, leveraging categorical descriptions to perform few-shot image classification. In contrast, our method utilizes contrastive multimodal alignment for FSL, which, to the best of our knowledge, has not previously been used in the literature. Further, our approach aligns visual and semantic attributes at the feature level, whereas RS-FSL predicts the descriptions from a hybrid prototype. The goal of our work is to capture detailed semantic features and feed them to the visual feature extractor, which can then easily adapt to novel categories with very few examples.
In this work we study the effectiveness of contrastive learning, which has been shown to perform well [5,4] in standard self-supervised learning and has also been adapted to the multimodal setting [14,13,1]. We utilize a simple contrastive learning objective as an auxiliary training mechanism, in addition to the standard FSL baseline, to provide contextual knowledge to the model via a semantic prototype generated by a designated semantic feature extractor. We align the semantic and visual prototypes of each class during a training episode and employ the contrastive learning objective so that corresponding prototypes, regardless of modality, are embedded close to each other in the multimodal embedding space. This provides the visual feature extractor with prior knowledge of the semantic attributes of each visual category, which is crucial in few-shot image classification.
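To make the alignment concrete, the following is a minimal sketch of the auxiliary objective described above, assuming PyTorch and a symmetric InfoNCE-style loss over an episode's class prototypes; the function and variable names (e.g., contrastive_alignment_loss, temperature, lambda_align) are illustrative assumptions rather than our released implementation.

```python
# Illustrative sketch of episode-level visual-semantic contrastive alignment.
# All names and hyperparameter values here are assumptions for exposition.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(visual_protos, semantic_protos, temperature=0.1):
    """Pull together the visual and semantic prototypes of the same class
    and push apart prototypes of different classes.

    visual_protos:   (N, D) class prototypes from the visual encoder
                     (e.g., mean of support-set embeddings per class).
    semantic_protos: (N, D) class prototypes from the semantic encoder
                     (e.g., encoded class names or descriptions).
    """
    v = F.normalize(visual_protos, dim=-1)
    s = F.normalize(semantic_protos, dim=-1)

    logits = v @ s.t() / temperature            # (N, N) cross-modal similarities
    targets = torch.arange(v.size(0), device=v.device)

    # Symmetric InfoNCE: visual-to-semantic and semantic-to-visual directions.
    loss_v2s = F.cross_entropy(logits, targets)
    loss_s2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2s + loss_s2v)


if __name__ == "__main__":
    # Example: a 5-way episode with 512-dimensional prototype embeddings.
    vis = torch.randn(5, 512)   # visual prototypes for the episode's classes
    sem = torch.randn(5, 512)   # semantic prototypes for the same classes
    aux_loss = contrastive_alignment_loss(vis, sem)
    # In training, this term would be weighted and added to the FSL loss,
    # e.g. total_loss = fsl_loss + lambda_align * aux_loss (lambda_align is
    # a hypothetical weighting hyperparameter).
    print(aux_loss.item())
```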
The major contributions of this approach can be summarized as follows:
– We show that a simple contrastive alignment of visual and semantic feature vectors in the embedding space yields a generalizable visual understanding for few-shot image classification.
– We introduce an auxiliary contrastive learning objective on top of an existing FSL approach; our method is therefore generic and can be plugged into any FSL baseline.
– Our experimental results on two standard FSL benchmarks show that multimodal contrastive alignment improves the performance of standard FSL baselines.