successful methodology should be able to learn all aspects
of the complex interaction of objects and their states.
One simple solution to the above-mentioned challenges is to
disentangle states from objects. If states can be completely
disentangled from objects, new combinations of states and
objects can be predicted easily. This approach has been
studied in some recent works [2, 30] on synthetic or simple
datasets like UT-Zappos [40, 41]. A recent work introduces
a contrastive loss-based methodology [14]
for real-world datasets. The approach in [14] uses a gener-
ative network to generate novel samples in order to bridge
the domain gap between seen and unseen samples. In the
real world, states can be imagined as transformation func-
tions of objects. A dry car changes its visual appearance
after the state wet is applied. The approaches in [24, 15]
have studied states as transformation functions.
Other methodologies have tried to exploit all avail-
able information to learn useful correlations [28, 22]. Re-
cent works have explored learning dependency structures
between word embeddings and propagating them to unseen
classes [23, 18].
In the real world, compositions do not occur in isolation.
They are intricately entangled with each other and full of
noise. No prior approach has taken a holistic view of the
complex interactions between compositions or their primitives.
While the approach in [23] exploits the shared information
between compositions that share primitives, it still ignores
the interactions between compositions that do not share any
primitives; for example, a coiled plate may be found in a
cluttered kitchen, or a narrow road may be seen in an ancient
city. We study
a more holistic view of interactions between compositions.
We argue that there is a hidden interdependency structure
between compositions. An overview of our approach is
shown in Figure 1. We exploit a self-attention mechanism
to uncover the hidden interdependency structure between
primitives of compositions and to propagate knowledge between
them. Our contributions are as follows:
• We propose a multi-modal approach that learns to
embed related compositions closer together and unrelated
compositions farther apart.
• We propose a methodology that learns the hidden inter-
dependency structure between compositions by learn-
ing critical propagation routes and propagates knowl-
edge between compositions through these routes, as
illustrated in the sketch below.
• Unlike [23], our approach does not require prior
knowledge of how the compositions are related.
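To make this idea concrete, the following minimal sketch (an
illustration under assumed tensor names and dimensions, not our
exact architecture) shows how self-attention over primitive
embeddings yields soft, learned propagation routes:

    import torch
    import torch.nn as nn

    # Assumed inputs: one word embedding per primitive (every state and object).
    num_primitives, d = 10, 300                 # illustrative sizes
    prim_emb = torch.randn(num_primitives, d)   # stand-in for word embeddings

    # Single-head self-attention; the attention weights act as soft,
    # learned propagation routes between primitives.
    attn = nn.MultiheadAttention(embed_dim=d, num_heads=1, batch_first=True)
    x = prim_emb.unsqueeze(0)                   # (1, num_primitives, d)
    propagated, routes = attn(x, x, x)          # routes: (1, num_primitives, num_primitives)
    # Each updated primitive embedding is a weighted mixture of all primitives,
    # so knowledge can flow even between compositions that share no primitive.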
2. Related Work
Recent works have exploited the fundamental nature of
states and objects to build novel algorithms. This includes
reasoning over the effect of states on objects [15, 24].
The approach introduced in [24] considers states as linear
functions that transform objects into compositions. These
linear functions can add a state to a composition or remove a
state from the composition by inverting the linear function.
The approach in [15] also considers the symmetry of states.
Both approaches [24, 15] exploit group-theory principles
such as closure, associativity, commutativity, and invertibility,
and both use a triplet loss [9] as their objective
function. In contrast with [24], [15] uses a coupling
and decoupling network to add or remove a state from a
composition. Other approaches have tried to exploit the
relationship between states and objects instead of treating
states as transformation functions [2, 30, 12, 18, 22, 28].
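As a rough illustration of the transformation-function view of
[24, 15] above (a minimal sketch under assumed embedding sizes,
not their exact formulation), a state can be modelled as a
learnable linear map applied to an object embedding and removed
again by inverting it, with a triplet loss tying the image
embedding to the correct composition:

    import torch
    import torch.nn as nn

    d = 300                                     # word-embedding size (assumed)
    T_wet = nn.Parameter(torch.eye(d))          # learnable linear map for the state "wet"
    car = torch.randn(d)                        # embedding of the object "car" (assumed given)

    wet_car = T_wet @ car                            # apply the state: car -> wet car
    car_again = torch.linalg.inv(T_wet) @ wet_car    # invertibility: remove the state

    # Triplet objective as in [24, 15]: the image embedding (anchor) should
    # lie closer to the correct composition than to a mismatched one.
    triplet = nn.TripletMarginLoss(margin=1.0)
    img, mismatch = torch.randn(1, d), torch.randn(1, d)
    loss = triplet(img, wet_car.unsqueeze(0), mismatch)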
The approach in [22] argues that context is essential to
model, i.e., the red in red wine is different from the red in
a red tomato. It further argues that compositional classifiers
lie on a smooth plane where they can be modelled, and proposes
to build them from primitive classifiers: SVMs are pre-trained
for the primitives and then fed into a transformation network
that translates them into the compositional space. The
transformation network is a three-layer non-linear Multi-Layer
Perceptron (MLP). Final predictions are retrieved by a
simple dot product between the output of the transformation
network and the image embedding. One recent work
uses word embeddings of primitives and a simple MLP that
projects them into a shared semantic space [18] (Compcos).
Compcos [18] proposes to replace logits with the cosine
similarity between the image embedding and the projected
word embeddings. Another recent work proposes to
replace the multi-layer perceptron with a Graph Convolutional
Network (GCN) [12] to model the interactions between
compositions (CGE). CGE [23, 19] argues that composi-
tions are entangled by their shared primitives and proposes
to use a GCN that propagates knowledge through entangled
compositions. CGE [23] utilizes a dot-product-based
compatibility function between compositional nodes and
image embeddings to calculate the scores of compositions.
While CGE [23] uses the average of state and object embeddings
as the compositional embedding, Compcos [18] uses the
concatenation of state and object embeddings to represent
a composition.
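A minimal sketch of these two compatibility functions follows
(embedding sizes and the single-layer projection are assumptions
for illustration; in CGE the composition embeddings would
additionally be refined by the GCN):

    import torch
    import torch.nn.functional as F

    d = 300                                          # word-embedding size (assumed)
    state_emb, obj_emb = torch.randn(d), torch.randn(d)
    img_emb = torch.randn(d)                         # image features projected to d (assumed)

    # CGE-style: composition node initialised as the average of its primitives,
    # scored by a dot product with the image embedding.
    comp_avg = 0.5 * (state_emb + obj_emb)
    score_cge = img_emb @ comp_avg

    # Compcos-style: concatenate state and object, project into the shared
    # space (a single linear layer stands in for the MLP), and score with
    # cosine similarity instead of raw logits.
    project = torch.nn.Linear(2 * d, d)
    comp_proj = project(torch.cat([state_emb, obj_emb]))
    score_compcos = F.cosine_similarity(img_emb, comp_proj, dim=0)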
Another view of the CZSL problem is to disentangle
states from objects [2, 30, 14]. The approach in [2] proposes
to exploit causality [32, 43, 39, 4, 7, 27, 26] to disentangle
states and objects. The causal view in [2] assumes that
compositions are not the effect of images but rather
the cause of images, and that a do-intervention on primitives
generates a distribution from which a given image can
be sampled. Another recent work proposes to learn
independent prototypes of states and objects by enforcing
independence between the representations of states and
objects [30]. This approach [30] further exploits a GCN to