
Proof. From Definition 4, a representation space is not relation-agnostic iff ∃ z_l ∈ Z_L : n(z_l) ∩ n(z_g) ≠ ∅. Under this condition, the classifier only has the information from {z_g} ∪ (Z_L \ {z_l}) instead of the required {z_g} ∪ Z_L. Thus, for instances of P_FGVC, according to Definition 1, removing the relation-agnostic nature from the representation space of f would cause a downstream classifier to produce misclassifications across the instances of k-distinguishable categories, leading to a violation of Axiom 1. Hence, f can only learn relation-agnostic representations.
We can thus conclude from Lemma 3 and Proposition 1 that the necessary and sufficient conditions for a learner to capture the complete label information I(x; y) are to consider both (1) the relation-agnostic information z and (2) the cross-view relational information r.
Proposition 2. An encoder f trained to learn relation-agnostic representations z of datapoints x cannot be used to model the relationship r between the global and local views of x.
Proof. f : x_v → z_v is a unary function that takes as input a (global or local) view x_v of an image x and produces view-specific (Lemma 3) representations z_v for a downstream function g : z_v → y. For f to model the cross-view relationships, it must output the same vector r irrespective of whether x_v = g or x_v = l ∈ L, i.e., whether x_v is a global or a local view of the input image x (the view-unification property of ξ). However, Lemma 3 prevents this from happening by requiring the output space of f to be relation-agnostic. Hence, f cannot be used to model r.
Thus, to bridge the information gap, a learner must have distinct sub-models that individually satisfy
the properties of being relation-agnostic and relation-aware. Only such a learner could qualify as
being sufficient for an instance of P_FGVC.
Intuition: In this section, we have effectively proven that the properties of relation-agnosticity and relation-awareness are dual to each other. We show that while relation-agnosticity is not sufficient, it is a necessary condition for encoding the complete label information I(x; y). We also show that a disjoint encoder cannot be used to model the two properties alone without violating one of the necessary criteria. The requirement of a separate, relation-aware sub-model follows from here.
3.4 Learning Relation-Agnostic and Relation-Aware Representations
Figure 1 depicts the end-to-end design of our framework. Derived from our theoretical findings, it comprises both the relation-agnostic encoder f and the cross-view relational function ξ, expressed as a composition of the Attribute Summarization Transformer (AST) and a network for view-unification, ρ. Below, we elaborate on each of these components.
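To make this composition concrete, below is a minimal sketch of how the components could be wired together; the module interfaces, the stacking of the local representations, and the concatenation of z_g with r before classification are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class RARFramework(nn.Module):
    """Hypothetical composition of the components described above: a
    relation-agnostic encoder f, the AST, and a view-unification network
    rho (the wiring shown here is an assumption for illustration)."""

    def __init__(self, f: nn.Module, ast: nn.Module, rho: nn.Module, classifier: nn.Module):
        super().__init__()
        self.f = f            # relation-agnostic CNN encoder
        self.ast = ast        # Attribute Summarization Transformer
        self.rho = rho        # view-unification network
        self.classifier = classifier

    def forward(self, g, locals_):
        # Relation-agnostic representations, encoded independently per view.
        z_g = self.f(g)                                               # (B, D)
        z_locals = torch.stack([self.f(l) for l in locals_], dim=1)   # (B, k, D)

        # Permutation-invariant summary of the local views, followed by
        # view unification to obtain the relational embedding r.
        z_summary = self.ast(z_locals)                                # (B, D)
        r = self.rho(z_g, z_summary)                                  # (B, D)

        # The downstream classifier sees both the relation-agnostic and
        # the relational information (assumed concatenation).
        return self.classifier(torch.cat([z_g, r], dim=-1))
```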
Relation-Agnostic Representations: We follow recent literature [44, 49] for localizing the object of interest in the input image x and obtaining the global view g by thresholding the final-layer activations of a CNN encoder f and detecting the largest connected component in the thresholded feature map. We obtain the set of local views {l_1, l_2, ..., l_k} as sub-crops of g (more details in Section 4.1). Following the primary requirement of Proposition 1, we produce relation-agnostic representations by propagating g and l_i through a CNN encoder f that independently encodes the two view families as z_g = f(g) and z_{l_i} = f(l_i).
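As a rough illustration of this localization step, the following sketch thresholds a channel-averaged activation map at its mean, keeps the largest connected component, and maps the resulting bounding box back to image coordinates; the threshold choice, the random sub-crop sampling, and the crop ratio are assumptions made for the example (the actual settings are given in Section 4.1).

```python
import numpy as np
from scipy import ndimage

def extract_global_view(image, feat_map, thresh=None):
    """Sketch of global-view localization: threshold the final-layer
    activation map of f, keep the largest connected component, and crop
    the corresponding image region. image: (H, W, 3) array; feat_map:
    (C, h, w) array of final-layer activations."""
    act = feat_map.mean(axis=0)                 # (h, w) channel-averaged activations
    if thresh is None:
        thresh = act.mean()                     # assumed threshold: mean activation
    mask = act > thresh

    labels, num = ndimage.label(mask)           # connected components
    if num == 0:
        return image                            # fall back to the full image
    sizes = ndimage.sum(mask, labels, range(1, num + 1))
    largest = labels == (np.argmax(sizes) + 1)

    ys, xs = np.where(largest)
    H, W = image.shape[:2]
    h, w = act.shape
    # Map feature-map coordinates back to image coordinates.
    y0, y1 = int(ys.min() * H / h), int((ys.max() + 1) * H / h)
    x0, x1 = int(xs.min() * W / w), int((xs.max() + 1) * W / w)
    return image[y0:y1, x0:x1]

def extract_local_views(global_view, k=4, crop_ratio=0.5):
    """Sketch: k random sub-crops of the global view serve as local views
    (the sampling scheme here is an assumption)."""
    H, W = global_view.shape[:2]
    ch, cw = int(H * crop_ratio), int(W * crop_ratio)
    rng = np.random.default_rng()
    crops = []
    for _ in range(k):
        y = rng.integers(0, H - ch + 1)
        x = rng.integers(0, W - cw + 1)
        crops.append(global_view[y:y + ch, x:x + cw])
    return crops
```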
Relational Embeddings: The second requirement, according to Proposition 1, for completely learning I(x; y) is to minimize I(x; r | z), i.e., the uncertainty about the relational information r encoded in x, given a relation-agnostic representation z. However, according to Proposition 2, we cannot do so using the relation-agnostic encoder f. Contrary to existing relational learning literature [32, 34] that assumes the nature of relationships to be known beforehand, we take a novel approach that models cross-view relationships as learnable representations of the input x. We follow the definition of the relationship-modelling function ξ : (g, L) → r, which takes as input the relation-agnostic representations of the global view z_g and of the set of local views Z_L = {z_{l_1}, z_{l_2}, ..., z_{l_k}}, and outputs a relationship vector r satisfying the View-Unification and Permutation Invariance properties.
We satisfy the Permutation Invariance property by aggregating the local representations via a novel Attribute Summarization Transformer (AST). We form a matrix whose columns constitute a learnable summary embedding z_L followed by the local representations z_{l_i}, as Z'_L = [z_L, z_{l_1}, z_{l_2}, ..., z_{l_k}]. We compute the self-attention output z'_* for each column z_* in Z'_L as z'_* = a · Z_L W, where a = σ(z_* W_q) · (Z_L W)^T / √D, and D is the embedding dimension. By iteratively performing
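A single attention step of the AST, as described above, could be sketched as follows; the shared projection W for keys and values, applying the softmax to the scaled query-key products, and reading the summary from the first output column are assumptions for illustration rather than the exact AST implementation.

```python
import torch
import torch.nn as nn

class ASTLayer(nn.Module):
    """One attention step of the Attribute Summarization Transformer
    (illustrative sketch; the AST applies such steps iteratively)."""

    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)            # W_q
        self.w = nn.Linear(dim, dim, bias=False)               # shared W (assumption)
        self.summary = nn.Parameter(torch.randn(1, 1, dim))    # learnable z_L

    def forward(self, z_locals: torch.Tensor) -> torch.Tensor:
        # z_locals: (B, k, D) relation-agnostic local representations.
        B, k, D = z_locals.shape
        z_prime = torch.cat([self.summary.expand(B, -1, -1), z_locals], dim=1)  # Z'_L: (B, k+1, D)

        q = self.w_q(z_prime)        # queries for every column z_* of Z'_L
        kv = self.w(z_locals)        # Z_L W, used as keys and values (assumption)

        # Attention weights a, scaled by sqrt(D) and normalized (assumed softmax).
        attn = torch.softmax(q @ kv.transpose(1, 2) / D ** 0.5, dim=-1)  # (B, k+1, k)
        z_out = attn @ kv            # z'_* = a · Z_L W

        # The updated summary column is a permutation-invariant aggregate of
        # the local views: permuting z_locals only permutes the keys/values,
        # leaving the summary's weighted sum unchanged.
        return z_out[:, 0]           # (B, D) summary embedding
```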