Text2Model: Text-based Model Induction
for Zero-shot Image Classification
Ohad Amosy
Bar Ilan University, Israel
amosy3@gmail.com
Tomer Volk
Eilam Shapira
Eyal Ben-David
Roi Reichart
Technion - IIT, Israel
Gal Chechik
Bar Ilan University, Israel
NVIDIA Research, Israel
Abstract
We address the challenge of building task-
agnostic classifiers using only text descriptions,
demonstrating a unified approach to image clas-
sification, 3D point cloud classification, and
action recognition from scenes. Unlike ap-
proaches that learn a fixed representation of the
output classes, we generate at inference time
a model tailored to a query classification task.
To generate task-based zero-shot classifiers, we
train a hypernetwork that receives class descrip-
tions and outputs a multi-class model. The hy-
pernetwork is designed to be equivariant with
respect to the set of descriptions and the clas-
sification layer, thus obeying the symmetries
of the problem and improving generalization.
Our approach generates non-linear classifiers,
handles rich textual descriptions, and may be
adapted to produce lightweight models efficient
enough for on-device applications. We evaluate
this approach in a series of zero-shot classifica-
tion tasks, for image, point-cloud, and action
recognition, using a range of text descriptions:
From single words to rich descriptions. Our
results demonstrate strong improvements over
previous approaches, showing that zero-shot
learning can be applied with little training data.
Furthermore, we conduct an analysis with foundational vision and language models, demonstrating that they struggle to generalize when descriptions specify which attributes a class lacks.
1 Introduction
We explore the challenge of zero-shot image clas-
sification by leveraging text descriptions. This ap-
proach pushes the boundaries of conventional clas-
sification methods by demanding that models cate-
gorize images into specific classes based solely on
written descriptions, without having previously en-
countered these classes during training.¹

¹ We note that our definitions of “zero shot” or “zero shot learning” are slightly different from the ones used in the context of text-only language models.

Figure 1: The text-to-model (T2M) setup. (a) Classification tasks are described in rich language. (b) Traditional zero-shot methods produce static representations, shared for all tasks. (c) T2M generates task-specific representations and classifiers. This allows T2M to extract task-specific discriminative features.

In various domains, numerous attempts have been made to achieve zero-shot classification capacity (§2). Unfortunately, as we now explain, existing studies are limited in two major ways: (1) Query-dependence; and (2) Richness of language descriptions.
First, Query-dependence. To illustrate the is-
sue, consider a popular family of zero-shot learn-
ing (ZSL) approaches, which maps text (like class
labels) and images to a shared space (Globerson
et al., 2004; Zhang and Saligrama, 2015; Zhang et al., 2017a; Sung et al., 2018; Pahde et al., 2021).
To classify a new image from an unseen class, one
finds the closest class label in the shared space.
The problem with this family of shared-space ap-
proaches is that the learned representation (and the
kNN classifier that it induces) remains "frozen" af-
ter training, and is not tuned to the classification
task given at inference time. For instance, furry
toys would be mapped to the same shared represen-
tation regardless of whether they are to be distin-
guished from other toys, or from other furry things
(see Figure 1). The same limitation also hinders
another family of ZSL approaches, which synthe-
size samples from unseen classes at inference time
using conditional generative models, and use these
samples with kNN classification (Elhoseiny and
Elfeki, 2019; Jha et al., 2021). Some approaches address the query-dependence limitation by assuming that test descriptions are known during training (Han et al., 2021; Schonfeld et al., 2019), or by (costly) training a classifier or generator at inference time (Xian et al., 2018; Schonfeld et al., 2019).
Instead, here we learn a model that produces task-
dependent classifiers and representations without
test-time training.
The second limitation is language richness. Nat-
ural language can be used to describe classes in
complex ways. Most notably, people use nega-
tive terms, like "dogs without fur", to distinguish
class members from other items. Previous work
could only handle limited richness of language descriptions. For instance, it cannot adequately represent textual descriptions with negative terms (Akata et al., 2015; Xie et al., 2021b,a; Elhoseiny and Elfeki, 2019; Jha et al., 2021). In this paper,
we wish to handle the inherent linguistic richness
of natural language.
An alternative approach to address zero-shot
image recognition tasks involves leveraging large
generative vision and language models (e.g.,
GPT4Vision). These foundational models, trained
on extensive datasets, exhibit high performance in
zero and few-shot scenarios. However, these mod-
els are associated with certain limitations: (1) They
entail significant computational expenses in both
training and inference. (2) Their training is specific
to particular domains (e.g., vision and language)
and may not extend seamlessly to other modal-
ities (e.g., 3D data and language). (3) Remark-
ably, even state-of-the-art foundational models en-
counter challenges when confronted with tasks in-
volving uncommon descriptions, as demonstrated
in §5.3.
In addition to the limitations posed by large
generative models, there is a growing demand for
smaller, more efficient models that can run on edge
devices with limited computational power, such
as mobile phones, embedded systems, or drones.
Giant models that require cloud-based infrastruc-
tures are often computationally expensive and not
suitable for real-time, on-device applications. Fur-
thermore, some companies are unable to rely on
cloud computing due to privacy concerns or le-
gal regulations that mandate keeping sensitive user
data within their local networks (on-premises). Our
approach addresses these needs by enabling the au-
tomatic generation of task-specific models that are
lightweight and capable of running on weaker de-
vices without requiring cloud resources.
Here, we describe a novel deep network architec-
ture and a learning workflow that address these
two aspects: (1) generating a discriminative model
tuned to requested classes at query time and (2)
supporting rich language and negative terms.
To achieve these properties, we propose an ap-
proach based on hypernetworks (HNs) (Ha et al.,
2016). An HN is a deep network that emits the
weights of another deep network (see Figure 2 for
an illustration). Here, the HN receives a set of class
descriptions and emits a multi-class model that can
classify images according to these classes. Interest-
ingly, this text-image ZSL setup has an important
symmetric structure. In essence, if the order of in-
put descriptions is permuted, one would expect the
same classifiers to be produced, reflecting the same
permutation applied to the outputs. This property
is called equivariance, and it can be leveraged to
design better architectures (Finzi et al., 2020; Cohen et al., 2019; Kondor and Trivedi, 2018; Finzi et al., 2021). Taking invariance and equivariance into account has been shown to provide significant benefits for learning in spaces with symmetries like sets (Zaheer et al., 2017; Maron et al., 2020; Amosy et al., 2024), graphs (Herzig et al., 2018; Wu et al., 2020), and deep weight spaces (Navon et al.,
2023). In general, however, HNs are not always
permutation equivariant. We design invariant and
equivariant layers and describe an HN architecture
that respects the symmetries of the problem, and
term it T2M-HN: a text-to-model hypernetwork.
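To make the equivariance requirement concrete, the following minimal PyTorch sketch maps a set of class-description embeddings to the rows of a linear classification head; permuting the input descriptions permutes the generated rows in the same way. This is an illustration only, not the released T2M-HN architecture: the layer sizes and the DeepSets-style mixing layer are assumptions.

```python
# Illustrative sketch only: a set-equivariant hypernetwork that maps k class-description
# embeddings to the k rows of a linear classification head. Dimensions are assumptions.
import torch
import torch.nn as nn

class EquivariantT2MHead(nn.Module):
    def __init__(self, text_dim=768, hidden_dim=256, feat_dim=512):
        super().__init__()
        self.local = nn.Linear(text_dim, hidden_dim)    # acts on each description separately
        self.global_ = nn.Linear(text_dim, hidden_dim)  # acts on the mean over the whole set
        self.to_row = nn.Linear(hidden_dim, feat_dim)   # emits one classifier row per class

    def forward(self, desc_emb):                        # desc_emb: (k, text_dim)
        # DeepSets-style layer: per-element term plus a permutation-invariant context term.
        h = self.local(desc_emb) + self.global_(desc_emb.mean(dim=0, keepdim=True))
        return self.to_row(torch.relu(h))               # (k, feat_dim): weights of a k-way head

hn = EquivariantT2MHead()
emb = torch.randn(5, 768)                               # 5 class-description embeddings
W = hn(emb)                                             # generated classifier weights, (5, 512)

perm = torch.randperm(5)
# Permuting the inputs permutes the generated classifier rows accordingly (equivariance).
assert torch.allclose(hn(emb[perm]), W[perm], atol=1e-5)
```

A full T2M-HN also generates the weights of deeper, non-linear on-demand models; the sketch only exhibits the symmetry structure described above.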
We put the versatility of T2M-HN to the test
across an array of zero-shot classification tasks,
spanning diverse data types including images, 3D
point clouds, and 3D skeletal data for action recog-
nition. Our framework exhibits a remarkable ability
to incorporate various forms of class descriptions
including long and short texts, as well as class
names. Notably, T2M-HN surpasses the perfor-
mance of previous state-of-the-art methods in all
of these setups.
Our paper offers four key contributions: (1) It
identifies limitations of existing ZSL methods that
rely on fixed representations and distance-based
classifiers for text and image data. It proposes task-
dependent representations as an alternative; (2) It
introduces the Text-to-Model (T2M) approach for
generating deep classification models from textual
descriptions; (3) It investigates equivariance and
invariance properties of T2M models and designs
T2M-HN, an architecture based on HNs that adheres to the symmetries of the learning problem; and (4) It shows T2M-HN’s success in a range of zero-shot tasks, including image and point-cloud classification and action recognition, using diverse text descriptions, surpassing current leading methods in all tasks.

Table 1: An illustration of the diverse tasks within the AwA dataset (Lampert et al., 2009); the sample data type is animal images. Appendix A contains illustrations for the remaining datasets.
Class name: (1) “Moose”; (2) “Elephant”
Long: (1) “An animal of the deer family with humped shoulders, long legs, and a large head with antlers.”; (2) “A plant-eating mammal with a long trunk, large ears, and thick, grey skin.”
Negative: (1) “An animal without stripes and not gray”; (2) “An animal without fur and without horns”
Attribute: (1) “Animals with fur”; (2) “Animals with long trunk”

Figure 2: The text-to-model learning problem and our architecture. Our model (yellow box) receives a set of class descriptions as input and outputs weights $w$ for a downstream on-demand model (orange). The model has two main blocks: a pretrained text encoder and a hypernetwork that obeys certain invariance and equivariance symmetries. The hypernetwork receives a set of dense descriptors to produce weights for the on-demand model.
2 Related work
In this section, we cover previous approaches that leverage textual descriptions to classify images of unseen classes.
Zero-shot learning (ZSL). The core challenge
in ZSL lies in recognizing images of unseen classes
based on their semantic associations with seen
classes. This association is sometimes learned us-
ing human-annotated attributes (Li et al., 2019; Song et al., 2018; Morgado and Vasconcelos, 2017; Annadani and Biswas, 2018). Another source of information for learning semantic associations is to use textual descriptions. Three main sources were used in the literature to obtain text descriptions of classes: (1) using class names as descriptions (Zhang et al., 2017a; Frome et al., 2013; Changpinyo et al., 2017; Cheraghian et al., 2022); (2) using encyclopedia articles that describe the class (Lei Ba et al., 2015; Elhoseiny et al., 2017; Qin et al., 2020; Bujwid and Sullivan, 2021; Paz-Argaman et al., 2020; Zhu et al., 2018); and (3) providing per-image descriptions manually annotated by domain experts (Reed et al., 2016; Patterson and Hays, 2012; Wah et al., 2011). These can then be
aggregated into class-level descriptions.
Shared space ZSL. One popular approach to
ZSL is to learn a joint visual-semantic represen-
tation, using either attributes or natural text de-
scriptions. Some studies project visual features
onto the textual space (Frome et al., 2013; Lampert et al., 2013; Xie et al., 2021b), others learn a mapping from a textual to a visual space (Zhang et al., 2017a; Pahde et al., 2021), and some project both images and texts into a new shared space (Akata et al., 2015; Atzmon and Chechik, 2018; Sung et al., 2018; Zhang and Saligrama, 2015; Atzmon and Chechik, 2019; Atzmon et al., 2020; Samuel et al., 2021; Xie et al., 2021a; Radford et al., 2021). Once
both image and text can be encoded in the same
space, classifying an image from a new class can
be achieved without further training by first en-
coding the image and then selecting the nearest
class in the shared space. In comparison, instead
of nearest-neighbor-based classification, our ap-
proach is learned in a discriminative way.
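As a minimal sketch of this nearest-class inference rule (illustrative only; `image_encoder` and `text_encoder` are placeholder callables, not components of any cited method):

```python
# Sketch of shared-space zero-shot inference; encoders are placeholders mapping inputs
# into a common embedding space of dimension d.
import torch
import torch.nn.functional as F

def shared_space_predict(image, class_descriptions, image_encoder, text_encoder):
    img = F.normalize(image_encoder(image), dim=-1)              # (d,) image embedding
    txt = F.normalize(text_encoder(class_descriptions), dim=-1)  # (k, d) class embeddings
    return int(torch.argmax(txt @ img))                          # index of the nearest class
```

Note that `txt` depends only on each class description in isolation; it does not change when the set of candidate classes changes, which is exactly the query-dependence limitation discussed in §1.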
Generation-based ZSL. Another line of ZSL
studies uses generative models like GANs to gener-
ate representations of samples from unseen classes
(Elhoseiny and Elfeki, 2019; Jha et al., 2021). Such
generative approaches have been applied in two
settings. Some studies assume they have access
to test-class descriptions (attributes or text) during
model training. Hence, they can train a classifier
over test-class images, generated by leveraging the
test-class descriptions (Liu et al., 2018; Schonfeld et al., 2019; Han et al., 2021). Other studies assume
access to test-class descriptions only at test time.
Hence, they map the test-class descriptions to the
shared space of training classes and apply a nearest-
neighbor inference mechanism. In this work, we as-
sume that any information about test classes is only
available at test time. As a result, ZSL methods
assuming train-time access to information about
the test classes are beyond our scope.² Yet, works
assuming only test-time access to test-class infor-
mation form some of our baselines (Elhoseiny and
Elfeki, 2019; Jha et al., 2021).
Hypernetworks (HNs, Ha et al. (2016)) were
applied to many computer vision and NLP prob-
lems, including ZSL (Yin et al., 2022), federated learning (Amosy et al., 2024), domain adaptation (Volk et al., 2022), language modeling (Suarez, 2017), machine translation (Platanios et al., 2018)
and many more. Here we use HNs for text-based
ZSL. The work by Lei Ba et al. (2015) also pre-
dicts model weights from textual descriptions, but
differs in two key ways. (1) They learn a constant
representation of each class; our method uses the
context of all the classes in a task to predict data
representation. (2) They predict weights of a linear
architecture; our T2M-HN applies to deeper ones.
Large vision-language models (LVLMs). CLIP (Radford et al., 2021), BLIP2 (Li et al., 2023) and
GPT4Vision show remarkable zero-shot capabil-
ities for vision-and-language tasks. A key differ-
ence between those approaches and this paper is
that CLIP and BLIP2 (the training approach of
GPT4Vision remains undisclosed) were trained on
massive multimodal data. In contrast, our approach
leverages the semantic compositionality of lan-
guage models, without requiring paired image-text
data. Furthermore, such large models are costly in
both training and inference. They demand substan-
tial resources, time and specialized knowledge that
is not accessible to most of the research community.
We successfully applied T2M-HN in domains lacking large multimodal data, such as 3D point cloud object recognition and skeleton sequence action recognition. The drawback is that the T2M-HN representation might react to language differences that do not matter for visual tasks.

² While these algorithms could in principle be re-trained when new classes are presented at test time (e.g., in a continual learning (Ring, 1995) setup), this would result in a costly and inefficient inference mechanism, and possibly also in catastrophic forgetting (McCloskey and Cohen, 1989). We hence do not include them in our experiments.
3 Problem formulation
Our objective is to learn a mapping $\tau$ from a set of $k$ natural language descriptions into the space of a $k$-class image classifier. Here, we address the case where the architecture of the downstream classifier is fixed and given in advance, but this assumption can be relaxed as in Litany et al. (2022).

Formally, let $S_k = \{s_1, \ldots, s_k\}$ be a set of $k$ class descriptions drawn from a distribution $P_k$, where $s_j$ is a text description of the $j$-th class.

Let $\tau$ be a model parameterized by a set of parameters $\phi$. It takes the descriptors and produces a set of parameters $W$ of a $k$-class classification model $f(\cdot\,; W)$. Therefore, we have $\tau_\phi : \{s_1, \ldots, s_k\} \rightarrow \mathbb{R}^d$, where $d$ is the dimension of $W$, that is, the number of parameters of $f(\cdot\,; W)$, and we denote $W = \tau_\phi(S_k)$.

Let $l : \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}_+$ be a loss function, and let $\{x_i, y_i\}_{i=1}^{n}$ be a labeled dataset from a distribution $P$ over $\mathcal{X} \times \mathcal{Y}$. For $k$-class classification, $\mathcal{Y} = \{1, \ldots, k\}$. We can explicitly write the loss in terms of $\phi$ as follows:

$$l(y_i, \hat{y}_i) = l(y_i, f(x_i; W)) = l\big(y_i, f(x_i; \tau_\phi(S_k))\big). \qquad (1)$$

See also Figure 2 and note that $\tau = h \circ g$. The goal of T2M is to minimize

$$\phi^{*} = \arg\min_{\phi} \; \mathbb{E}_{S_k \sim P_k} \, \mathbb{E}_{(x,y) \sim P} \Big[ l\big(y, f(x; \tau_\phi(S_k))\big) \Big]. \qquad (2)$$

The training objective becomes

$$\phi^{*} = \arg\min_{\phi} \sum_{j} \sum_{i} l\big(y_i, f(x_i; \tau_\phi(S_k^{j}))\big), \qquad (3)$$

where the sum over $j$ means summing over all description sets in the training set.
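As a concrete reading of Eq. (3), the sketch below runs one training step: it encodes a sampled description set $S_k$, generates classifier weights $W = \tau_\phi(S_k)$ with the hypernetwork, and backpropagates a cross-entropy loss. This is an illustration under stated assumptions (component names are placeholders, the pretrained text encoder is frozen, and only a linear head on top of fixed image features is generated), not the authors' implementation.

```python
# Sketch of one training step for the T2M objective in Eq. (3); component names are assumptions.
import torch
import torch.nn.functional as F

def train_step(hypernet, feature_net, text_encoder, optimizer, task):
    """task = (descriptions, images, labels): one k-class episode from the training split."""
    descriptions, images, labels = task
    with torch.no_grad():
        desc_emb = text_encoder(descriptions)   # g: text -> dense descriptors, (k, text_dim)
    W = hypernet(desc_emb)                      # h: descriptors -> classifier weights, (k, feat_dim)
    feats = feature_net(images)                 # image features, (n, feat_dim)
    logits = feats @ W.t()                      # k-way scores per image, (n, k)
    loss = F.cross_entropy(logits, labels)      # l(y_i, f(x_i; tau_phi(S_k)))
    optimizer.zero_grad()
    loss.backward()                             # gradients flow into hypernet and feature_net;
    optimizer.step()                            # the text encoder is frozen above (an assumption)
    return loss.item()
```

Looping this step over description sets sampled from the training split corresponds to the outer sum over $j$ in Eq. (3); the sketch generates only a linear head for brevity, whereas T2M-HN can emit deeper, non-linear on-demand models.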
4 Our approach
We first describe our approach, based on HNs. We
then discuss the symmetries of the problem, and an
architecture that can leverage these symmetries.