Elfeki, 2019; Jha et al., 2021). Some approaches
address the query-dependence limitation by assuming that test descriptions are known during training (Han et al., 2021; Schonfeld et al., 2019), or by (costly) training a classifier or generator at inference time (Xian et al., 2018; Schonfeld et al., 2019).
Instead, here we learn a model that produces task-
dependent classifiers and representations without
test-time training.
The second limitation is language richness. Nat-
ural language can be used to describe classes in
complex ways. Most notably, people use nega-
tive terms, like "dogs without fur", to distinguish
class members from other items. Previous work can only handle language descriptions of limited richness; for instance, it cannot adequately represent textual descriptions that contain negative terms (Akata et al., 2015; Xie et al., 2021b,a; Elhoseiny and Elfeki, 2019; Jha et al., 2021). In this paper,
we wish to handle the inherent linguistic richness
of natural language.
An alternative approach to address zero-shot
image recognition tasks involves leveraging large
generative vision and language models (e.g.,
GPT4Vision). These foundation models, trained on extensive datasets, exhibit strong performance in zero- and few-shot scenarios. However, these models have several limitations: (1) They
entail significant computational expenses in both
training and inference. (2) Their training is specific
to particular domains (e.g., vision and language)
and may not extend seamlessly to other modal-
ities (e.g., 3D data and language). (3) Remarkably, even state-of-the-art foundation models en-
counter challenges when confronted with tasks in-
volving uncommon descriptions, as demonstrated
in §5.3.
In addition to the limitations posed by large
generative models, there is a growing demand for
smaller, more efficient models that can run on edge
devices with limited computational power, such
as mobile phones, embedded systems, or drones.
Giant models that require cloud-based infrastruc-
tures are often computationally expensive and not
suitable for real-time, on-device applications. Fur-
thermore, some companies are unable to rely on
cloud computing due to privacy concerns or le-
gal regulations that mandate keeping sensitive user
data within their local networks (on-premises). Our
approach addresses these needs by enabling the au-
tomatic generation of task-specific models that are
lightweight and capable of running on weaker de-
vices without requiring cloud resources.
Here, we describe a novel deep network architecture and a learning workflow that address these two aspects: (1) generating a discriminative model
tuned to requested classes at query time and (2)
supporting rich language and negative terms.
To achieve these properties, we propose an ap-
proach based on hypernetworks (HNs) (Ha et al., 2016). An HN is a deep network that emits the weights of another deep network (see Figure 2 for
an illustration). Here, the HN receives a set of class
descriptions and emits a multi-class model that can
classify images according to these classes. Interest-
ingly, this text-image ZSL setup has an important symmetric structure: if the order of the input descriptions is permuted, the same classifiers should be produced, with the same permutation applied to the outputs. This property is called equivariance, and it can be leveraged to design better architectures (Finzi et al., 2020; Cohen et al., 2019; Kondor and Trivedi, 2018; Finzi et al., 2021). Taking invariance and equivariance
into account has been shown to provide significant
benefits for learning in spaces with symmetries
like sets (Zaheer et al., 2017; Maron et al., 2020; Amosy et al., 2024), graphs (Herzig et al., 2018; Wu et al., 2020), and deep weight spaces (Navon et al.,
2023). In general, however, HNs are not always
permutation equivariant. We design invariant and
equivariant layers and describe an HN architecture
that respects the symmetries of the problem, and
term it T2M-HN: a text-to-model hypernetwork.
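To make this concrete, the sketch below shows a minimal permutation-equivariant text-to-model hypernetwork in PyTorch. It is an illustration in the spirit of T2M-HN rather than the paper's exact architecture: the DeepSets-style mixing layer, the layer sizes, and all names are assumptions. Formally, equivariance means that for any permutation π of the k description embeddings D, the hypernetwork satisfies HN(πD) = πHN(D).

```python
# Minimal sketch (assumed architecture, not the paper's exact one):
# a permutation-equivariant hypernetwork mapping k class-description
# embeddings to the weights of a k-way linear image classifier.
import torch
import torch.nn as nn

class EquivariantLayer(nn.Module):
    """DeepSets-style layer: y_i = A x_i + B mean_j x_j.
    Permuting the input rows permutes the output rows identically."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.item = nn.Linear(d_in, d_out)             # per-class transform
        self.mix = nn.Linear(d_in, d_out, bias=False)  # set-level context

    def forward(self, x):             # x: (k, d_in), one row per class
        return self.item(x) + self.mix(x.mean(dim=0, keepdim=True))

class TextToModelHN(nn.Module):
    """Emits a k-way linear classifier (W, b) from k description embeddings."""
    def __init__(self, d_text, d_img, d_hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            EquivariantLayer(d_text, d_hidden), nn.ReLU(),
            EquivariantLayer(d_hidden, d_img + 1),     # weight row + bias
        )

    def forward(self, desc_emb):      # desc_emb: (k, d_text)
        wb = self.net(desc_emb)       # (k, d_img + 1)
        return wb[:, :-1], wb[:, -1]

# Usage: generate a classifier for the query's classes, then score images.
hn = TextToModelHN(d_text=512, d_img=512)
desc = torch.randn(4, 512)            # 4 encoded class descriptions
imgs = torch.randn(8, 512)            # 8 encoded test images
W, b = hn(desc)
logits = imgs @ W.T + b               # (8, 4) class scores, no test-time training

# Equivariance check: permuting the descriptions permutes the classifiers.
perm = torch.randperm(4)
W_p, _ = hn(desc[perm])
assert torch.allclose(W_p, W[perm], atol=1e-5)
```

Because the mean is permutation-invariant and the per-item map is shared across rows, each layer, and hence the whole network, is permutation equivariant by construction.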
We put the versatility of T2M-HN to the test
across an array of zero-shot classification tasks,
spanning diverse data types including images, 3D
point clouds, and 3D skeletal data for action recog-
nition. Our framework exhibits a remarkable ability
to incorporate various forms of class descriptions, including long and short texts as well as class
names. Notably, T2M-HN surpasses the perfor-
mance of previous state-of-the-art methods in all
of these setups.
Our paper offers four key contributions: (1) It
identifies limitations of existing ZSL methods that
rely on fixed representations and distance-based
classifiers for text and image data. It proposes task-
dependent representations as an alternative; (2) It
introduces the Text-to-Model (T2M) approach for
generating deep classification models from textual
descriptions; (3) It investigates equivariance and
invariance properties of T2M models and designs
T2M-HN, an architecture based on HNs that ad-