Text2Model: Text-based Model Induction
for Zero-shot Image Classification
Ohad Amosy
Bar Ilan University, Israel
amosy3@gmail.com
Tomer Volk
Eilam Shapira
Eyal Ben-David
Roi Reichart
Technion - IIT, Israel
Gal Chechik
Bar Ilan University, Israel
NVIDIA Research, Israel
Abstract
We address the challenge of building task-
agnostic classifiers using only text descriptions,
demonstrating a unified approach to image clas-
sification, 3D point cloud classification, and
action recognition from scenes. Unlike ap-
proaches that learn a fixed representation of the
output classes, we generate at inference time
a model tailored to a query classification task.
To generate task-based zero-shot classifiers, we
train a hypernetwork that receives class descrip-
tions and outputs a multi-class model. The hy-
pernetwork is designed to be equivariant with
respect to the set of descriptions and the clas-
sification layer, thus obeying the symmetries
of the problem and improving generalization.
Our approach generates non-linear classifiers,
handles rich textual descriptions, and may be
adapted to produce lightweight models efficient
enough for on-device applications. We evaluate
this approach in a series of zero-shot classifica-
tion tasks, for image, point-cloud, and action
recognition, using a range of text descriptions:
From single words to rich descriptions. Our
results demonstrate strong improvements over
previous approaches, showing that zero-shot
learning can be applied with little training data.
Furthermore, we conduct an analysis with foundational vision and language models, demonstrating that they struggle to generalize when descriptions specify which attributes a class lacks.
1 Introduction
We explore the challenge of zero-shot image clas-
sification by leveraging text descriptions. This ap-
proach pushes the boundaries of conventional clas-
sification methods by demanding that models cate-
gorize images into specific classes based solely on
written descriptions, without having previously en-
countered these classes during training.¹

¹ We note that our definitions of “zero shot” or “zero shot learning” are slightly different from the ones used in the context of text-only language models.

Figure 1: The text-to-model (T2M) setup. (a) Classification tasks are described in rich language. (b) Traditional zero-shot methods produce static representations, shared for all tasks. (c) T2M generates task-specific representations and classifiers. This allows T2M to extract task-specific discriminative features.

In various domains, numerous attempts have been made to achieve zero-shot classification capacity (§2). Unfortunately, as we now explain, existing studies are limited in two major ways: (1) Query-dependence; and (2) Richness of language descriptions.
First, Query-dependence. To illustrate the is-
sue, consider a popular family of zero-shot learn-
ing (ZSL) approaches, which maps text (like class
labels) and images to a shared space (Globerson
et al., 2004; Zhang and Saligrama, 2015; Zhang et al., 2017a; Sung et al., 2018; Pahde et al., 2021).
To classify a new image from an unseen class, one
finds the closest class label in the shared space.
The problem with this family of shared-space ap-
proaches is that the learned representation (and the
kNN classifier that it induces) remains "frozen" af-
ter training, and is not tuned to the classification
task given at inference time. For instance, furry
toys would be mapped to the same shared represen-
tation regardless of whether they are to be distin-
guished from other toys, or from other furry things
(see Figure 1). The same limitation also hinders
another family of ZSL approaches, which synthe-
size samples from unseen classes at inference time
using conditional generative models, and use these
samples with kNN classification (Elhoseiny and
Elfeki, 2019; Jha et al., 2021). Some approaches address the query-dependence limitation by assuming that test descriptions are known during training (Han et al., 2021; Schonfeld et al., 2019), or by (costly) training a classifier or generator at inference time (Xian et al., 2018; Schonfeld et al., 2019).
Instead, here we learn a model that produces task-
dependent classifiers and representations without
test-time training.
The second limitation is language richness. Nat-
ural language can be used to describe classes in
complex ways. Most notably, people use nega-
tive terms, like "dogs without fur", to distinguish
class members from other items. Previous work
could only handle limited richness of language descriptions. For instance, it cannot adequately represent textual descriptions with negative terms (Akata et al., 2015; Xie et al., 2021b,a; Elhoseiny and Elfeki, 2019; Jha et al., 2021). In this paper,
we wish to handle the inherent linguistic richness
of natural language.
An alternative approach to address zero-shot
image recognition tasks involves leveraging large
generative vision and language models (e.g.,
GPT4Vision). These foundational models, trained
on extensive datasets, exhibit high performance in
zero and few-shot scenarios. However, these mod-
els are associated with certain limitations: (1) They
entail significant computational expenses in both
training and inference. (2) Their training is specific
to particular domains (e.g., vision and language)
and may not extend seamlessly to other modal-
ities (e.g., 3D data and language). (3) Remark-
ably, even state-of-the-art foundational models en-
counter challenges when confronted with tasks in-
volving uncommon descriptions, as demonstrated
in §5.3.
In addition to the limitations posed by large
generative models, there is a growing demand for
smaller, more efficient models that can run on edge
devices with limited computational power, such
as mobile phones, embedded systems, or drones.
Giant models that require cloud-based infrastruc-
tures are often computationally expensive and not
suitable for real-time, on-device applications. Fur-
thermore, some companies are unable to rely on
cloud computing due to privacy concerns or le-
gal regulations that mandate keeping sensitive user
data within their local networks (on-premises). Our
approach addresses these needs by enabling the au-
tomatic generation of task-specific models that are
lightweight and capable of running on weaker de-
vices without requiring cloud resources.
Here, we describe a novel deep network architec-
ture and a learning workflow that address these
two aspects: (1) generating a discriminative model
tuned to requested classes at query time and (2)
supporting rich language and negative terms.
To achieve these properties, we propose an ap-
proach based on hypernetworks (HNs) (Ha et al.,
2016). An HN is a deep network that emits the
weights of another deep network (see Figure 2 for
an illustration). Here, the HN receives a set of class
descriptions and emits a multi-class model that can
classify images according to these classes. Interest-
ingly, this text-image ZSL setup has an important
symmetric structure. In essence, if the order of in-
put descriptions is permuted, one would expect the
same classifiers to be produced, reflecting the same
permutation applied to the outputs. This property
is called equivariance, and it can be leveraged to
design better architectures (Finzi et al., 2020; Cohen et al., 2019; Kondor and Trivedi, 2018; Finzi et al., 2021). Taking invariance and equivariance into account has been shown to provide significant benefits for learning in spaces with symmetries like sets (Zaheer et al., 2017; Maron et al., 2020; Amosy et al., 2024), graphs (Herzig et al., 2018; Wu et al., 2020), and deep weight spaces (Navon et al.,
2023). In general, however, HNs are not always
permutation equivariant. We design invariant and
equivariant layers and describe an HN architecture
that respects the symmetries of the problem, and
term it T2M-HN: a text-to-model hypernetwork.
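To make the equivariance requirement concrete, the following minimal PyTorch sketch maps a set of class-description embeddings to the rows of a linear classification head; permuting the input descriptions permutes the generated rows in the same way. This is an illustration only, not the released T2M-HN architecture: the layer sizes and the DeepSets-style mixing layer are assumptions.

```python
# Illustrative sketch only: a set-equivariant hypernetwork that maps k class-description
# embeddings to the k rows of a linear classification head. Dimensions are assumptions.
import torch
import torch.nn as nn

class EquivariantT2MHead(nn.Module):
    def __init__(self, text_dim=768, hidden_dim=256, feat_dim=512):
        super().__init__()
        self.local = nn.Linear(text_dim, hidden_dim)    # acts on each description separately
        self.global_ = nn.Linear(text_dim, hidden_dim)  # acts on the mean over the whole set
        self.to_row = nn.Linear(hidden_dim, feat_dim)   # emits one classifier row per class

    def forward(self, desc_emb):                        # desc_emb: (k, text_dim)
        # DeepSets-style layer: per-element term plus a permutation-invariant context term.
        h = self.local(desc_emb) + self.global_(desc_emb.mean(dim=0, keepdim=True))
        return self.to_row(torch.relu(h))               # (k, feat_dim): weights of a k-way head

hn = EquivariantT2MHead()
emb = torch.randn(5, 768)                               # 5 class-description embeddings
W = hn(emb)                                             # generated classifier weights, (5, 512)

perm = torch.randperm(5)
# Permuting the inputs permutes the generated classifier rows accordingly (equivariance).
assert torch.allclose(hn(emb[perm]), W[perm], atol=1e-5)
```

A full T2M-HN also generates the weights of deeper, non-linear on-demand models; the sketch only exhibits the symmetry structure described above.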
We put the versatility of T2M-HN to the test
across an array of zero-shot classification tasks,
spanning diverse data types including images, 3D
point clouds, and 3D skeletal data for action recog-
nition. Our framework exhibits a remarkable ability
to incorporate various forms of class descriptions
including long and short texts, as well as class
names. Notably, T2M-HN surpasses the perfor-
mance of previous state-of-the-art methods in all
of these setups.
Our paper offers four key contributions: (1) It
identifies limitations of existing ZSL methods that
rely on fixed representations and distance-based
classifiers for text and image data. It proposes task-
dependent representations as an alternative; (2) It
introduces the Text-to-Model (T2M) approach for
generating deep classification models from textual
descriptions; (3) It investigates equivariance and
invariance properties of T2M models and designs
T2M-HN, an architecture based on HNs that adheres to the symmetries of the learning problem; and (4) It shows T2M-HN’s success in a range of zero-shot tasks, including image and point-cloud classification and action recognition, using diverse text descriptions, surpassing current leading methods in all tasks.

Table 1: An illustration of the diverse tasks within the AwA dataset (Lampert et al., 2009); the sample data type is animal images. Appendix A contains illustrations for the remaining datasets.
Class name: (1) “Moose”; (2) “Elephant”
Long: (1) “An animal of the deer family with humped shoulders, long legs, and a large head with antlers.”; (2) “A plant-eating mammal with a long trunk, large ears, and thick, grey skin.”
Negative: (1) “An animal without stripes and not gray”; (2) “An animal without fur and without horns”
Attribute: (1) “Animals with fur”; (2) “Animals with long trunk”

Figure 2: The text-to-model learning problem and our architecture. Our model (yellow box) receives a set of class descriptions as input and outputs weights $w$ for a downstream on-demand model (orange). The model has two main blocks: a pretrained text encoder and a hypernetwork that obeys certain invariance and equivariance symmetries. The hypernetwork receives a set of dense descriptors to produce weights for the on-demand model.
2 Related work
In this section, we cover previous approaches that leverage textual descriptions to classify images of unseen classes.
Zero-shot learning (ZSL). The core challenge
in ZSL lies in recognizing images of unseen classes
based on their semantic associations with seen
classes. This association is sometimes learned us-
ing human-annotated attributes (Li et al., 2019; Song et al., 2018; Morgado and Vasconcelos, 2017; Annadani and Biswas, 2018). Another source of information for learning semantic associations is to use textual descriptions. Three main sources were used in the literature to obtain text descriptions of classes: (1) using class names as descriptions (Zhang et al., 2017a; Frome et al., 2013; Changpinyo et al., 2017; Cheraghian et al., 2022); (2) using encyclopedia articles that describe the class (Lei Ba et al., 2015; Elhoseiny et al., 2017; Qin et al., 2020; Bujwid and Sullivan, 2021; Paz-Argaman et al., 2020; Zhu et al., 2018); and (3) providing per-image descriptions manually annotated by domain experts (Reed et al., 2016; Patterson and Hays, 2012; Wah et al., 2011). These can then be
aggregated into class-level descriptions.
Shared space ZSL. One popular approach to
ZSL is to learn a joint visual-semantic represen-
tation, using either attributes or natural text de-
scriptions. Some studies project visual features
onto the textual space (Frome et al., 2013; Lampert et al., 2013; Xie et al., 2021b), others learn a mapping from a textual to a visual space (Zhang et al., 2017a; Pahde et al., 2021), and some project both images and texts into a new shared space (Akata et al., 2015; Atzmon and Chechik, 2018; Sung et al., 2018; Zhang and Saligrama, 2015; Atzmon and Chechik, 2019; Atzmon et al., 2020; Samuel et al., 2021; Xie et al., 2021a; Radford et al., 2021). Once
both image and text can be encoded in the same
space, classifying an image from a new class can
be achieved without further training by first en-
coding the image and then selecting the nearest
class in the shared space. In comparison, instead
of nearest-neighbor-based classification, our ap-
proach is learned in a discriminative way.
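As a minimal sketch of this nearest-class inference rule (illustrative only; `image_encoder` and `text_encoder` are placeholder callables, not components of any cited method):

```python
# Sketch of shared-space zero-shot inference; encoders are placeholders mapping inputs
# into a common embedding space of dimension d.
import torch
import torch.nn.functional as F

def shared_space_predict(image, class_descriptions, image_encoder, text_encoder):
    img = F.normalize(image_encoder(image), dim=-1)              # (d,) image embedding
    txt = F.normalize(text_encoder(class_descriptions), dim=-1)  # (k, d) class embeddings
    return int(torch.argmax(txt @ img))                          # index of the nearest class
```

Note that `txt` depends only on each class description in isolation; it does not change when the set of candidate classes changes, which is exactly the query-dependence limitation discussed in §1.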
Generation-based ZSL. Another line of ZSL
studies uses generative models like GANs to gener-
ate representations of samples from unseen classes
(Elhoseiny and Elfeki, 2019; Jha et al., 2021). Such
generative approaches have been applied in two
settings. Some studies assume they have access
to test-class descriptions (attributes or text) during
model training. Hence, they can train a classifier
over test-class images, generated by leveraging the
test-class descriptions (Liu et al., 2018; Schonfeld et al., 2019; Han et al., 2021). Other studies assume
access to test-class descriptions only at test time.
Hence, they map the test-class descriptions to the
shared space of training classes and apply a nearest-
neighbor inference mechanism. In this work, we as-
sume that any information about test classes is only
available at test time. As a result, ZSL methods
assuming train-time access to information about
the test classes are beyond our scope.² Yet, works
assuming only test-time access to test-class infor-
mation form some of our baselines (Elhoseiny and
Elfeki, 2019; Jha et al., 2021).
Hypernetworks (HNs, Ha et al. (2016)) were
applied to many computer vision and NLP prob-
lems, including ZSL (Yin et al., 2022), federated learning (Amosy et al., 2024), domain adaptation (Volk et al., 2022), language modeling (Suarez, 2017), machine translation (Platanios et al., 2018)
and many more. Here we use HNs for text-based
ZSL. The work by Lei Ba et al. (2015) also pre-
dicts model weights from textual descriptions, but
differs in two key ways. (1) They learn a constant
representation of each class; our method uses the
context of all the classes in a task to predict data
representation. (2) They predict weights of a linear
architecture; our T2M-HN applies to deeper ones.
Large vision-language models (LVLMs). CLIP (Radford et al., 2021), BLIP2 (Li et al., 2023) and
GPT4Vision show remarkable zero-shot capabil-
ities for vision-and-language tasks. A key differ-
ence between those approaches and this paper is
that CLIP and BLIP2 (the training approach of
GPT4Vision remains undisclosed) were trained on
massive multimodal data. In contrast, our approach
leverages the semantic compositionality of lan-
guage models, without requiring paired image-text
data. Furthermore, such large models are costly in
both training and inference. They demand substan-
tial resources, time and specialized knowledge that
is not accessible to most of the research community.
We successfully applied T2M-HN in domains lacking large multimodal data, such as 3D point cloud object recognition and skeleton sequence action recognition. The drawback is that the T2M-HN representation might react to language differences that do not matter for visual tasks.

² While these algorithms could in principle be re-trained when new classes are presented at test time (e.g., in a continual learning (Ring, 1995) setup), this would result in a costly and inefficient inference mechanism, and possibly also in catastrophic forgetting (McCloskey and Cohen, 1989). We hence do not include them in our experiments.
3 Problem formulation
Our objective is to learn a mapping $\tau$ from a set of $k$ natural language descriptions into the space of a $k$-class image classifier. Here, we address the case where the architecture of the downstream classifier is fixed and given in advance, but this assumption can be relaxed as in Litany et al. (2022).

Formally, let $S_k = \{s_1, \ldots, s_k\}$ be a set of $k$ class descriptions drawn from a distribution $P_k$, where $s_j$ is a text description of the $j$-th class.

Let $\tau$ be a model parameterized by a set of parameters $\phi$. It takes the descriptors and produces a set of parameters $W$ of a $k$-class classification model $f(\cdot\,; W)$. Therefore, we have $\tau_\phi : \{s_1, \ldots, s_k\} \rightarrow \mathbb{R}^d$, where $d$ is the dimension of $W$, that is, the number of parameters of $f(\cdot\,; W)$, and we denote $W = \tau_\phi(S_k)$.

Let $l : \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}_+$ be a loss function, and let $\{x_i, y_i\}_{i=1}^{n}$ be a labeled dataset from a distribution $P$ over $\mathcal{X} \times \mathcal{Y}$. For $k$-class classification, $\mathcal{Y} = \{1, \ldots, k\}$. We can explicitly write the loss in terms of $\phi$ as follows:

$$l(y_i, \hat{y}_i) = l(y_i, f(x_i; W)) = l\big(y_i, f(x_i; \tau_\phi(S_k))\big). \qquad (1)$$

See also Figure 2 and note that $\tau = h \circ g$. The goal of T2M is to minimize

$$\phi^{*} = \arg\min_{\phi} \; \mathbb{E}_{S_k \sim P_k} \, \mathbb{E}_{(x,y) \sim P} \Big[ l\big(y, f(x; \tau_\phi(S_k))\big) \Big]. \qquad (2)$$

The training objective becomes

$$\phi^{*} = \arg\min_{\phi} \sum_{j} \sum_{i} l\big(y_i, f(x_i; \tau_\phi(S_k^{j}))\big), \qquad (3)$$

where the sum over $j$ means summing over all description sets in the training set.
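As a concrete reading of Eq. (3), the sketch below runs one training step: it encodes a sampled description set $S_k$, generates classifier weights $W = \tau_\phi(S_k)$ with the hypernetwork, and backpropagates a cross-entropy loss. This is an illustration under stated assumptions (component names are placeholders, the pretrained text encoder is frozen, and only a linear head on top of fixed image features is generated), not the authors' implementation.

```python
# Sketch of one training step for the T2M objective in Eq. (3); component names are assumptions.
import torch
import torch.nn.functional as F

def train_step(hypernet, feature_net, text_encoder, optimizer, task):
    """task = (descriptions, images, labels): one k-class episode from the training split."""
    descriptions, images, labels = task
    with torch.no_grad():
        desc_emb = text_encoder(descriptions)   # g: text -> dense descriptors, (k, text_dim)
    W = hypernet(desc_emb)                      # h: descriptors -> classifier weights, (k, feat_dim)
    feats = feature_net(images)                 # image features, (n, feat_dim)
    logits = feats @ W.t()                      # k-way scores per image, (n, k)
    loss = F.cross_entropy(logits, labels)      # l(y_i, f(x_i; tau_phi(S_k)))
    optimizer.zero_grad()
    loss.backward()                             # gradients flow into hypernet and feature_net;
    optimizer.step()                            # the text encoder is frozen above (an assumption)
    return loss.item()
```

Looping this step over description sets sampled from the training split corresponds to the outer sum over $j$ in Eq. (3); the sketch generates only a linear head for brevity, whereas T2M-HN can emit deeper, non-linear on-demand models.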
4 Our approach
We first describe our approach, based on HNs. We
then discuss the symmetries of the problem, and an
architecture that can leverage these symmetries.