
Visual - Semantic Contrastive Alignment for
Few-Shot Image Classification
Mohamed Afham, Ranga Rodrigo
Dept. of Electronic and Telecommunication Engineering, University of Moratuwa,
Sri Lanka
afhamaflal9@gmail.com
Abstract. Few-shot learning aims to train and optimize a model that
can adapt to unseen visual classes with only a few labeled examples.
Existing few-shot learning (FSL) methods rely heavily on visual data
alone and thus fail to capture the semantic attributes needed to learn a
more generalized version of a visual concept from very few examples.
However, human visual learning is known to benefit immensely from
inputs from multiple modalities such as vision, language, and audio.
Inspired by the way humans encapsulate existing knowledge of a visual
category in the form of language, we introduce a contrastive alignment
mechanism for visual and semantic feature vectors to learn more
generalized visual concepts for few-shot learning. Our method simply adds
an auxiliary contrastive learning objective, which captures the contextual
knowledge of a visual category from a strong textual encoder, to the
existing training mechanism. Hence, the approach is generic and can be
plugged into any existing FSL method. The pre-trained semantic feature
extractor (learned from large-scale text corpora) that we use provides
strong contextual prior knowledge to assist FSL. Experimental results on
popular FSL datasets show that our approach is generic in nature and
provides a strong boost to existing FSL baselines.
Keywords: Few-Shot Image Classification, Vision-Language Learning,
Contrastive Learning
arXiv:2210.11000v1 [cs.CV] 20 Oct 2022
1 Introduction
In recent years, deep neural networks have outperformed humans on image
classification when supported by enormous numbers of labeled samples, a
regime that is at odds with how humans learn. Humans possess a fast adaptive
capacity for recognizing novel classes from a handful of annotated samples.
For example, a child can easily generalize the concept of a cat and quickly
recognize cats in reality after seeing only one picture in a book or on the
Internet. In contrast, existing data-driven deep learning algorithms lag far
behind humans in versatility and generalization ability. Therefore,
constructing human-like algorithms that perform visual recognition under data
scarcity has important practical value and has attracted extensive research
interest. To overcome this challenge, few-shot learning (FSL) was introduced
for image classification; it learns and generalizes from limited data.
The main paradigm of FSL is to train a model on base classes and require it
to accurately classify novel classes from a limited number of examples, so the
model remains constrained by data scarcity. Several early lines of work studied
few-shot learning for image classification [22,18,9,6] and established strong
baselines to build on. Meta-learning was the predominant approach to FSL at the
time. More recent works adopted a standard supervised setting [20] along with
various self-supervised approaches [15,10,19] to improve results. However,
identifying visual categories only by class labels (numerical IDs) seriously
limits the contextual features of a category, since only a limited number of
examples are provided. Identifying this gap, a recent line of work [24,17,12,2]
uses semantic features as prior knowledge or as an auxiliary training mechanism
to enhance FSL performance. RS-FSL [2] is the most recent of these and
leverages categorical descriptions for few-shot image classification. To the
best of our knowledge, however, contrastive multimodal alignment, which our
method uses, has not previously been applied to FSL. Further, our approach
investigates both visual and semantic attributes at the feature level, while
RS-FSL predicts the descriptions using the hybrid prototype. The goal of our
work is to capture detailed semantic features and feed them to the visual
feature extractor, which can then be easily adapted to novel categories with
very few examples.
In this work we study the effectiveness of contrastive learning, which has
been shown to perform well [5,4] in standard self-supervised learning and has
also been adapted to the multimodal setting [14,13,1]. We use a simple
contrastive learning objective as an auxiliary training mechanism, in addition
to the standard FSL baseline, to provide contextual knowledge to the model via
the semantic prototype generated by a designated semantic feature extractor.
We align the semantic and visual prototypes of each class during a training
episode and employ the contrastive learning objective so that corresponding
prototypes, regardless of modality, are embedded close to each other in the
multimodal embedding space. This gives the visual feature extractor prior
knowledge of the semantic attributes of the visual category, which is crucial
in few-shot image classification.
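The prototype alignment described above can be illustrated with a symmetric InfoNCE-style loss over the class prototypes of one episode. This is only a minimal sketch, not the authors' implementation: this section does not give the exact loss, so the cosine similarity, the symmetric two-direction form, the temperature value, and all function names are assumptions made for illustration.

```python
import math
import random

def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cross_entropy_on_diagonal(logits):
    """Mean of -log softmax(row_i)[i]: row i should select column i."""
    total = 0.0
    for i, row in enumerate(logits):
        top = max(row)  # subtract the row max for numerical stability
        log_sum = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_alignment_loss(visual_protos, semantic_protos, temperature=0.1):
    """Assumed symmetric contrastive loss over N class prototypes.

    Row i of each input is taken to describe the same class, so matching
    visual/semantic prototypes are positives and all other pairs negatives.
    """
    v = [normalize(p) for p in visual_protos]
    s = [normalize(p) for p in semantic_protos]
    logits = [[sum(a * b for a, b in zip(vi, sj)) / temperature for sj in s]
              for vi in v]
    transposed = [list(col) for col in zip(*logits)]
    visual_to_semantic = cross_entropy_on_diagonal(logits)
    semantic_to_visual = cross_entropy_on_diagonal(transposed)
    return 0.5 * (visual_to_semantic + semantic_to_visual)

# Toy 5-way episode with hypothetical 64-dimensional prototypes.
rng = random.Random(0)
semantic = [[rng.gauss(0, 1) for _ in range(64)] for _ in range(5)]
aligned = [[x + 0.05 * rng.gauss(0, 1) for x in row] for row in semantic]
shuffled = aligned[1:] + aligned[:1]  # each visual prototype paired with the wrong class

loss_aligned = contrastive_alignment_loss(aligned, semantic)
loss_shuffled = contrastive_alignment_loss(shuffled, semantic)
print(loss_aligned < loss_shuffled)
```

In this toy setup the loss is lower when each visual prototype sits near the semantic prototype of its own class than when the pairing is shuffled, which is the behavior the alignment objective is meant to encourage.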
The major contributions of this approach can be summarized as follows:
- We show that a simple contrastive alignment of visual and semantic feature
  vectors in the embedding space yields a generalizable visual understanding
  for few-shot image classification.
- We introduce an auxiliary contrastive learning objective on top of the
  existing FSL approach; hence our method is generic and can be plugged into
  any of the FSL baselines.
- Our experimental results on two standard FSL benchmarks show that
  multimodal contrastive alignment improves the performance of the standard
  baselines on the FSL problem.
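One way to read the "auxiliary objective on top of the existing FSL approach" is as a weighted sum of the baseline loss and the alignment loss. The form below is an assumption for illustration only: the symbols and the weighting factor $\lambda$ are not defined in this section.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{FSL}} + \lambda \, \mathcal{L}_{\text{align}},
  \qquad \lambda \ge 0
```

Under this reading, $\mathcal{L}_{\text{FSL}}$ is whatever objective the chosen baseline already optimizes, and setting $\lambda = 0$ recovers that baseline unchanged, which is consistent with the claim that the method can be plugged into any FSL baseline.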