Learning Attention Propagation for Compositional Zero-Shot Learning
Muhammad Gul Zain Ali Khan1,3,4    Muhammad Ferjad Naeem2    Luc Van Gool2
A. Pagani1,3    Didier Stricker1,3,4    Muhammad Zeshan Afzal1,3,4
1DFKI, 2ETH Zürich, 3TU Kaiserslautern, 4MindGarage
Abstract
Compositional zero-shot learning aims to recognize unseen compositions of seen visual primitives of object classes and their states. While all primitives (states and objects) are observable during training in some combination, their complex interaction makes this task especially hard. For example, wet changes the visual appearance of a dog very differently from a bicycle. Furthermore, we argue that relationships between compositions go beyond shared states or objects. A cluttered office can contain a busy table; even though these compositions share neither a state nor an object, the presence of a busy table can guide the presence of a cluttered office. We propose a novel method called Compositional Attention Propagated Embedding (CAPE) as a solution. The key intuition behind our method is that a rich dependency structure exists between compositions, arising from complex interactions of primitives in addition to other dependencies between compositions. CAPE learns to identify this structure and propagates knowledge through it to learn class embeddings for all seen and unseen compositions. In the challenging generalized compositional zero-shot setting, we show that our method outperforms previous baselines and sets a new state of the art on three publicly available benchmarks.
1. Introduction
Dog species differ considerably from each other. However, when presented with an unseen dog species, we humans can recognize its states without hesitation. A child that has seen a wet car can recognize a wet dog regardless of the vast difference in appearance. Humans excel at recognizing previously unseen compositions of states and objects. This remarkable ability arises from our capacity to reason about various aspects of objects and then generalize them to previously unseen objects. In zero-shot learning, the goal is to predict unseen classes after training on a set of seen classes and a description of all the classes. In the most common configuration of zero-shot learning, a vector of attributes is provided for all classes.
Figure 1. Overview of our approach: compositions are fed to the CAPE-Propagator, whose output is projected, together with the output of the image embedder, into a shared semantic space. The CAPE-Propagator exploits a self-attention mechanism to learn the interdependency structure between compositions by identifying critical propagation routes. In the shared semantic space, compositions that are similar to each other are placed close together and far away from other compositions.
The task is to learn a mapping between class-description vectors and images that generalizes to unseen classes [34, 33, 25]. Although deep neural networks have been modelled after the human mind [34, 33, 25], they struggle to perform well in zero-shot learning. In this paper, we study a special setting of zero-shot learning called Compositional Zero-Shot Learning (CZSL).
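For concreteness, the standard zero-shot setup just described is often summarized with a compatibility function between image features and class descriptions. The formulation below is generic to this line of work [34, 33, 25]; the symbols are illustrative rather than notation taken from this paper.

```latex
% Generic zero-shot learning formulation (illustrative notation):
% \theta(x) is an image embedding, \phi(y) a class-attribute vector,
% and W a learned cross-modal mapping.
f(x, y) = \theta(x)^{\top} W \, \phi(y), \qquad
\hat{y} = \arg\max_{y \in \mathcal{Y}_{\mathrm{unseen}}} f(x, y)
```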
The “Compositional” in Compositional Zero-Shot Learning derives from the composition of object primitives and their states. At training time, all primitives (state, object) are provided in some combination, but not all compositions. The goal is to predict novel compositions of primitives at test time. This poses several challenges because of the complex interaction between objects and their possible states. For example, a wet car is very different from a wet dog in visual appearance. Furthermore, a composition can also be abstract, e.g., old city or ancient tower. In real-world settings, multiple valid compositions can be found in one image, e.g., a clean desk in an image of a wet dog. A successful methodology should be able to learn all aspects of the complex interaction of objects and their states.
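Formally, the setting sketched above is usually stated as follows; this is the standard CZSL notation rather than wording from this paper.

```latex
% Standard CZSL setup (illustrative notation):
% \mathcal{S} is the set of states, \mathcal{O} the set of objects.
\mathcal{C} = \mathcal{S} \times \mathcal{O}, \qquad
\mathcal{C}_{\mathrm{seen}} \subset \mathcal{C}, \qquad
\mathcal{C}_{\mathrm{unseen}} = \mathcal{C} \setminus \mathcal{C}_{\mathrm{seen}}
```

Every state and object appears in at least one training composition from C_seen, while test images may depict compositions from C_unseen (or from both sets, in the generalized setting).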
One simple solution to the above-mentioned challenges is to disentangle states from objects. If states can be completely disentangled from objects, new combinations of states and objects can be predicted easily. This approach has been studied in some recent works [2, 30] on synthetic or simple datasets like UT-Zappos [40, 41]. In recent work, a contrastive-loss-based methodology was introduced [14] for real-world datasets. The approach in [14] uses a generative network to generate novel samples in order to bridge the domain gap between seen and unseen samples. In the real world, states can be viewed as transformation functions of objects: a dry car changes its visual appearance after the state wet is applied. The approaches in [24, 15] have studied states as such transformation functions. Other methodologies have tried to exploit all available information to learn useful correlations [28, 22]. Recent works have explored learning dependency structures between word embeddings and propagating them to unseen classes [23, 18].
In the real world, compositions do not occur in isolation. They are intricately entangled with each other and surrounded by noise. No existing approach takes a holistic view of the complex interactions between compositions or their primitives. While the approach in [23] exploits the information shared between compositions that have common primitives, it ignores the interaction of compositions that do not share primitives: a coiled plate can be found in a cluttered kitchen, and a narrow road can be seen in an ancient city. We study a more holistic view of interactions between compositions. We argue that there is a hidden interdependency structure between compositions. An overview of our approach is shown in Figure 1. We exploit a self-attention mechanism to explore the hidden interdependency structure between primitives of compositions and to propagate knowledge between them (a minimal sketch follows the list below). Our contributions are as follows:
• We propose a multi-modal approach that learns to embed related compositions close together and unrelated compositions far apart.
• We propose a methodology that learns the hidden interdependency structure between compositions by learning critical propagation routes, and propagates knowledge between compositions through these routes.
• Unlike [23], our approach does not require prior knowledge of how the compositions are related.
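To make the mechanism concrete, the sketch below shows how self-attention over composition embeddings can play the role of learned propagation routes, with the output projected into a shared semantic space and scored against an image embedding. This is a minimal illustration of the idea, not the exact CAPE architecture; all dimensions, module names, and the cosine scoring are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPropagator(nn.Module):
    """Minimal sketch: self-attention propagates information between
    composition embeddings; compositions and images are then compared
    in a shared semantic space. Illustrative only, not the exact CAPE model."""

    def __init__(self, word_dim=300, img_dim=512, shared_dim=512, heads=4):
        super().__init__()
        # Attention weights act as learned propagation routes between compositions.
        self.attn = nn.MultiheadAttention(word_dim, heads, batch_first=True)
        self.to_shared = nn.Linear(word_dim, shared_dim)  # compositions -> shared space
        self.img_proj = nn.Linear(img_dim, shared_dim)    # image -> shared space

    def forward(self, comp_emb, img_feat):
        # comp_emb: (num_compositions, word_dim) embeddings of all compositions
        # img_feat: (batch, img_dim) image features from a backbone
        c = comp_emb.unsqueeze(0)                   # (1, N, word_dim)
        c, routes = self.attn(c, c, c)              # propagate between compositions
        c = F.normalize(self.to_shared(c.squeeze(0)), dim=-1)  # (N, shared_dim)
        v = F.normalize(self.img_proj(img_feat), dim=-1)        # (B, shared_dim)
        return v @ c.t()                            # cosine scores over compositions
```

Training with cross-entropy over these scores for seen compositions would shape both the propagation routes (the attention weights) and the shared space, and the same forward pass yields scores for unseen compositions at test time.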
2. Related Work
Recent works have exploited the fundamental nature of states and objects to build novel algorithms. This includes reasoning about the effect of states on objects [15, 24]. The approach introduced in [24] considers states as linear functions that transform objects into compositions. These linear functions can add a state to a composition, or remove a state from it by inverting the linear function. The approach in [15] additionally considers the symmetry of states. Both approaches [24, 15] exploit group-theoretic principles such as closure, associativity, commutativity and invertibility, and both use a triplet loss [9] as the objective function. In contrast with [24], [15] uses a coupling and decoupling network to add or remove a state from a composition.
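A minimal sketch of this transformation view is given below, assuming one learnable linear map per state and a standard triplet objective; the cited works differ in important details (coupling/decoupling networks, symmetry terms), so this is only an illustration, and the dimensions and random tensors are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

emb_dim, num_states = 512, 115          # illustrative assumptions
# One learnable matrix per state, initialized to the identity so inverses exist.
state_maps = nn.Parameter(torch.stack([torch.eye(emb_dim) for _ in range(num_states)]))

def apply_state(obj_emb, s):
    # Transform an object embedding into a composition embedding.
    return obj_emb @ state_maps[s].t()

def remove_state(comp_emb, s):
    # Invertibility: removing a state is applying the inverse map.
    return comp_emb @ torch.inverse(state_maps[s]).t()

# Triplet loss pulls an image embedding toward the correct transformed
# object embedding and pushes it away from a mismatched composition.
anchor = torch.randn(8, emb_dim)                    # image embeddings (batch of 8)
positive = apply_state(torch.randn(8, emb_dim), 3)  # correct (state, object) pair
negative = apply_state(torch.randn(8, emb_dim), 7)  # mismatched state
loss = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)
```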
Other approaches have tried to exploit the relationship between states and objects instead of treating states as transformation functions [2, 30, 12, 18, 22, 28].
The approach in [22] argues that context is essential to model: red in red wine differs from red in red tomato. It further argues that compositional classifiers lie in a smooth plane where they can be modelled, and proposes to model the classifiers of primitives using SVMs. These classifiers are pre-trained and then fed into a transformation network that translates them into the compositional space; the transformation network is a three-layer non-linear Multi-Layer Perceptron (MLP). Final predictions are obtained by a simple dot product between the output of the transformation network and the image embedding.
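A rough sketch of that transform-and-score pipeline follows; the SVM pre-training is omitted, and the dimensions and random tensors standing in for trained classifiers are assumptions for illustration.

```python
import torch
import torch.nn as nn

clf_dim, hid, img_dim = 512, 1024, 512   # illustrative dimensions

transform_net = nn.Sequential(           # three-layer non-linear MLP
    nn.Linear(2 * clf_dim, hid), nn.ReLU(),
    nn.Linear(hid, hid), nn.ReLU(),
    nn.Linear(hid, img_dim),
)

state_clf = torch.randn(clf_dim)     # stand-in for a pre-trained SVM weight vector
object_clf = torch.randn(clf_dim)    # stand-in for a pre-trained SVM weight vector
comp_clf = transform_net(torch.cat([state_clf, object_clf]))  # compositional classifier
score = comp_clf @ torch.randn(img_dim)   # dot product with an image embedding
```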
One recent work uses word embeddings of primitives together with a simple MLP that projects the embeddings into a shared semantic space (Compcos [18]). Compcos [18] proposes to replace logits with the cosine similarity between the image embedding and the projected word embeddings. Another recent work proposes to replace the multi-layer perceptron with a Graph Convolutional Network (GCN) [12] to model the interaction between compositions (CGE [23, 19]). CGE argues that compositions are entangled through their shared primitives and proposes a GCN that propagates knowledge through the entangled compositions. CGE [23] uses a dot-product-based compatibility function between compositional nodes and image embeddings to compute composition scores. While CGE [23] uses the average of state and object embeddings as compositional embeddings, Compcos [18] uses the concatenation of state and object embeddings to represent a composition.
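The two composition representations and compatibility functions just described can be contrasted in a few lines; the embedding sources, the projection MLP, and the dimensions below are placeholders rather than the cited implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 300                                              # word-embedding dimension (assumed)
state_emb, obj_emb = torch.randn(d), torch.randn(d)  # primitive word vectors
img = torch.randn(d)                                 # image embedding in the shared space

# Compcos-style: concatenate the primitives, project with an MLP,
# and use cosine similarity as the logit.
project = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))
score_cos = F.cosine_similarity(project(torch.cat([state_emb, obj_emb])), img, dim=0)

# CGE-style: the averaged primitives initialize a GCN node (GCN omitted here),
# and a plain dot product serves as the compatibility function.
score_dot = 0.5 * (state_emb + obj_emb) @ img
```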
Another view on solving the CZSL problem is to disentangle states from objects [2, 30, 14]. The approach in [2] proposes to exploit causality [32, 43, 39, 4, 7, 27, 26] to disentangle states and objects. The causal view in [2] assumes that compositions are not the effect of images but rather their cause, and that a do-intervention on primitives generates a distribution from which a given image can be sampled. Another recent work proposes to learn independent prototypes of states and objects by enforcing independence between the representations of states and objects [30]. This approach [30] further exploits a GCN to