successful methodology should be able to learn all aspects
of the complex interaction of objects and their states.
One simple solution to the above-mentioned challenges is to
disentangle states from objects. If states can be completely
disentangled from objects, new combinations of states and
objects can be predicted easily. This approach has been
studied in some recent works [2, 30] on synthetic or simple
datasets like UT-Zappos [40, 41]. A recent work introduces
a contrastive loss-based methodology [14]
for real-world datasets. The approach in [14] uses a gener-
ative network to generate novel samples in order to bridge
the domain gap between seen and unseen samples. In the
real world, states can be imagined as transformation func-
tions of objects. A dry car changes its visual appearance
after the state wet is applied. The approaches in [24, 15]
have studied states as transformation functions.
Other methodologies have tried to exploit all avail-
able information to learn useful correlations [28, 22]. Re-
cent works have explored learning dependency structures
between word embeddings and propagating them to unseen
classes [23, 18].
In the real world, compositions do not occur in isolation.
They are intricately entangled with each other and full of
noise. No prior approach has taken a holistic view of the
complex interactions between compositions or their primitives.
While the approach in [23] exploits the shared information
between compositions that share primitives, it still ignores
the interactions between compositions that do not share any
primitives; for example, a coiled plate may be found in a
cluttered kitchen, or a narrow road may be seen in an ancient
city. We study
a more holistic view of interactions between compositions.
We argue that there is a hidden interdependency structure
between compositions. An overview of our approach is
shown in Figure 1. We exploit a self-attention mechanism
to uncover the hidden interdependency structure between
primitives of compositions and to propagate knowledge between
them. Our contributions are as follows:
• We propose a multi-modal approach that learns to
embed related compositions closer together and unrelated
compositions farther apart.
• We propose a methodology that learns the hidden inter-
dependency structure between compositions by learn-
ing critical propagation routes and propagates knowl-
edge between compositions through these routes, as
illustrated in the sketch below.
• Unlike [23], our approach does not require prior
knowledge of how the compositions are related.
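To make this idea concrete, the following minimal sketch (an
illustration under assumed tensor names and dimensions, not our
exact architecture) shows how self-attention over primitive
embeddings yields soft, learned propagation routes:

    import torch
    import torch.nn as nn

    # Assumed inputs: one word embedding per primitive (every state and object).
    num_primitives, d = 10, 300                 # illustrative sizes
    prim_emb = torch.randn(num_primitives, d)   # stand-in for word embeddings

    # Single-head self-attention; the attention weights act as soft,
    # learned propagation routes between primitives.
    attn = nn.MultiheadAttention(embed_dim=d, num_heads=1, batch_first=True)
    x = prim_emb.unsqueeze(0)                   # (1, num_primitives, d)
    propagated, routes = attn(x, x, x)          # routes: (1, num_primitives, num_primitives)
    # Each updated primitive embedding is a weighted mixture of all primitives,
    # so knowledge can flow even between compositions that share no primitive.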
2. Related Work
Recent works have exploited the fundamental nature of
states and objects to build novel algorithms. This includes
reasoning over the effect of states on objects [15, 24].
The approach introduced in [24] considers states as linear
functions that transform objects into compositions. These
linear functions can add a state to a composition or remove a
state from the composition by inverting the linear function.
The approach in [15] also considers the symmetry of states.
Both approaches [24, 15] exploit group-theory principles
such as closure, associativity, commutativity, and invertibility,
and both use a triplet loss [9] as their objective
function. In contrast with [24], [15] uses a coupling
and decoupling network to add or remove a state from a
composition. Other approaches have tried to exploit the
relationship between states and objects instead of treating
states as transformation functions [2, 30, 12, 18, 22, 28].
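As a rough illustration of the transformation-function view of
[24, 15] above (a minimal sketch under assumed embedding sizes,
not their exact formulation), a state can be modelled as a
learnable linear map applied to an object embedding and removed
again by inverting it, with a triplet loss tying the image
embedding to the correct composition:

    import torch
    import torch.nn as nn

    d = 300                                     # word-embedding size (assumed)
    T_wet = nn.Parameter(torch.eye(d))          # learnable linear map for the state "wet"
    car = torch.randn(d)                        # embedding of the object "car" (assumed given)

    wet_car = T_wet @ car                            # apply the state: car -> wet car
    car_again = torch.linalg.inv(T_wet) @ wet_car    # invertibility: remove the state

    # Triplet objective as in [24, 15]: the image embedding (anchor) should
    # lie closer to the correct composition than to a mismatched one.
    triplet = nn.TripletMarginLoss(margin=1.0)
    img, mismatch = torch.randn(1, d), torch.randn(1, d)
    loss = triplet(img, wet_car.unsqueeze(0), mismatch)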
The approach in [22] argues that context is essential to
model, i.e., the red in red wine is different from the red in
a red tomato. It further argues that compositional classifiers
lie on a smooth plane where they can be modelled, and proposes
to build them from primitive classifiers: SVMs are pre-trained
for the primitives and then fed into a transformation network
that translates them into the compositional space. The
transformation network is a three-layer non-linear Multi-Layer
Perceptron (MLP). Final predictions are retrieved by a
simple dot product between the output of the transformation
network and the image embedding. One recent work
uses word embeddings of primitives and a simple MLP that
projects them into a shared semantic space [18] (Compcos).
Compcos [18] proposes to replace logits with the cosine
similarity between the image embedding and the projected
word embeddings. Another recent work proposes to
replace the multi-layer perceptron with a Graph Convolutional
Network (GCN) [12] to model the interactions between
compositions (CGE). CGE [23, 19] argues that composi-
tions are entangled by their shared primitives and proposes
to use a GCN that propagates knowledge through entangled
compositions. CGE [23] utilizes a dot-product-based
compatibility function between compositional nodes and
image embeddings to calculate the scores of compositions.
While CGE [23] uses the average of state and object embeddings
as the compositional embedding, Compcos [18] uses the
concatenation of state and object embeddings to represent
a composition.
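A minimal sketch of these two compatibility functions follows
(embedding sizes and the single-layer projection are assumptions
for illustration; in CGE the composition embeddings would
additionally be refined by the GCN):

    import torch
    import torch.nn.functional as F

    d = 300                                          # word-embedding size (assumed)
    state_emb, obj_emb = torch.randn(d), torch.randn(d)
    img_emb = torch.randn(d)                         # image features projected to d (assumed)

    # CGE-style: composition node initialised as the average of its primitives,
    # scored by a dot product with the image embedding.
    comp_avg = 0.5 * (state_emb + obj_emb)
    score_cge = img_emb @ comp_avg

    # Compcos-style: concatenate state and object, project into the shared
    # space (a single linear layer stands in for the MLP), and score with
    # cosine similarity instead of raw logits.
    project = torch.nn.Linear(2 * d, d)
    comp_proj = project(torch.cat([state_emb, obj_emb]))
    score_compcos = F.cosine_similarity(img_emb, comp_proj, dim=0)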
Another view of the CZSL problem is to disentangle
states from objects [2, 30, 14]. The approach in [2] proposes
to exploit causality [32, 43, 39, 4, 7, 27, 26] to disentangle
states and objects. The causal view in [2] assumes that
compositions are not the effect of images but rather
the cause of images, and that a do-intervention on primitives
generates a distribution from which a given image can
be sampled. Another recent work proposes to learn
independent prototypes of states and objects by enforcing
independence between the representations of states and
objects [30]. This approach [30] further exploits a GCN to