Relational Proxies: Emergent Relationships as Fine-Grained Discriminators

Abhra Chaudhuri¹, Massimiliano Mancini², Zeynep Akata²,³,⁴, Anjan Dutta⁵

¹University of Exeter  ²University of Tübingen  ³MPI for Informatics  ⁴MPI for Intelligent Systems  ⁵University of Surrey
Abstract
Fine-grained categories that largely share the same set of parts cannot be discriminated based on part information alone, as they mostly differ in the way the local parts relate to the overall global structure of the object. We propose Relational Proxies, a novel approach that leverages the relational information between the global and local views of an object for encoding its semantic label. Starting with a rigorous formalization of the notion of distinguishability between fine-grained categories, we prove the necessary and sufficient conditions that a model must satisfy in order to learn the underlying decision boundaries in the fine-grained setting. We design Relational Proxies based on our theoretical findings and evaluate it on seven challenging fine-grained benchmark datasets, achieving state-of-the-art results on all of them and surpassing all existing works by a margin exceeding 4% in some cases. We also experimentally validate our theory on fine-grained distinguishability and obtain consistent results across multiple benchmarks. Implementation is available at https://github.com/abhrac/relational-proxies.
1 Introduction
Fine-grained visual categorization (FGVC) primarily requires identifying category-specific, discriminative local attributes [50, 45, 21]. However, the relationship of the attributes with the global view of the object is also known to encode semantic information [6, 5]. Such a relationship can be thought of as the way in which local attributes combine to form the overall object. When two categories share a large number of local attributes, this cross-view relational information becomes the only discriminator. To illustrate this with an intuitive example, Figure 1 shows two fine-grained categories of birds, the White-faced Plover (left and top-right) and the Kentish Plover (bottom-right). Along with color and texture information, the two categories share a large number of local features like beak, head, body, tail and wings. Given such constraints of largely overlapping attribute sets, relational information like the distance between the head and the body, or the angular orientation of the legs with respect to the body, remains as the only available discriminator. We thus conjecture that the way the global structure (view) of the object arises out of its local parts (views) must be an emergent [31] property of the object, which is implicitly encoded as the cross-view relationship. However, all existing methods that consider both global and local information do so in a relation-agnostic manner, i.e., without considering cross-view relationships (we formalize relation-agnosticity in Section 3).

We hypothesize that when two categories largely share the same set of local attributes and differ only in the way the attributes combine to generate the global view of the object, relation-agnostic approaches do not capture the full semantic information in an input image. To prove our hypothesis, we develop a rigorous formalization of the notion of distinguishability in the fine-grained setting. Via our theoretical framework, we identify the necessary and sufficient conditions that a learner must satisfy to completely learn a distribution of fine-grained categories. Specifically, we prove that a learner must harness both view-specific (relation-agnostic) and cross-view (relation-aware) information in an input image. We also prove that it is not possible to design a single encoder that can achieve both of these objectives simultaneously. Based on our theoretical findings, we design a learner that separately computes metric space embeddings for the relation-agnostic and relation-aware components in an input image, through class representative vectors that we call Relational Proxies.

To summarize, we: (1) provide a theoretically rigorous formulation of the FGVC task and formally prove the necessary and sufficient conditions a learner must satisfy for FGVC, (2) introduce a plug-and-play extension on top of conventional CNNs that helps leverage relationships between global and local views of an object in the representation space for obtaining a complete encoding of the fine-grained semantic information in an input image, (3) achieve state-of-the-art results on all benchmark FGVC datasets with significant accuracy gains.

A. Chaudhuri is with the Department of Computer Science at the University of Exeter. M. Mancini and Z. Akata are with the Cluster of Excellence Machine Learning at the University of Tübingen. A. Dutta is with the Institute for People-Centred AI at the University of Surrey.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.02149v1 [cs.CV] 5 Oct 2022
2 Related Work
Fine-grained visual categorization
Prior works have demonstrated the importance of learning localized image features for FGVC [1, 51, 23], with extensions exploiting the relationship between multiple images and between network layers [25]. The high intra-class and low inter-class variations in FGVC datasets can be tackled by designing appropriate inductive biases like normalized object poses [4] or via more data-driven methods like deep metric learning [7]. Analysing part-specific features along with the global context has been demonstrated through part detection based on activation regions in CNN feature maps [16, 49] or via context-aware attention pooling [3]. CNNs can also be modified in novel ways for FGVC by incorporating boosting [28], kernel pooling [8], or by randomly masking out a group of correlated channels during training [10]. Vision Transformers [41], with their ability to attend to specific informative image patches, have also shown great promise in FGVC [43, 13, 24]. To the best of our knowledge, we are the first to provide a rigorous theoretical foundation for FGVC and design a cross-view relational metric learning formulation based on the same.
Relation modelling in deep learning
Modelling relationships between entities has proven to be a useful approach in many areas of deep learning, including deep reinforcement learning [48], object detection [15], question answering [36], graph representation learning [2], few-shot learning [38] and knowledge distillation [32]. The usefulness of modelling relationships between different views of the same image has been demonstrated in the self-supervised context by [34]. All the above works either leverage or aim to learn relationships between entities, the nature of which is assumed to be known a priori. Our work breaks free from such assumptions by modelling cross-view relationships as learnable representations that optimize the end-task of FGVC.
Proxy-based deep metric learning
Motivated by the fact that pairwise losses for deep metric learning incur a significant computational overhead leading to slow convergence, the idea of using proxies for learning metric spaces was first proposed in [29] and enhanced in [39]. Proxies can also be used to emulate properties of pairwise losses by capturing data-to-data relations (instead of just data-to-proxy) leveraging the relative hardness of datapoints [18], by making data representations follow the semantic hierarchy inherent in real-world classes [46], or by regularizing sample distributions around proxies to follow a non-isotropic distribution [35]. However, all the above works perform proxy-based metric learning directly on data representations. In contrast, our approach is designed to learn class proxies that can be used not only to capture isolated, view-specific (local/global) information for the underlying class, but also to learn the cross-view relationships such that they form embeddings in a metric space.
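To make the proxy idea concrete, the sketch below implements a generic ProxyNCA-style objective; the function name, scale hyperparameter, and tensor shapes are illustrative assumptions and are not taken from [29], [39], or the method proposed in this paper.

```python
import torch
import torch.nn.functional as F

def proxy_nca_loss(embeddings, labels, proxies, scale=32.0):
    """ProxyNCA-style loss: each sample is pulled towards the proxy of its own
    class and pushed away from the proxies of all other classes.

    embeddings: (B, D) sample embeddings
    labels:     (B,)   integer class labels
    proxies:    (C, D) learnable class proxies (one per class)
    """
    embeddings = F.normalize(embeddings, dim=-1)
    proxies = F.normalize(proxies, dim=-1)
    # Cosine similarity between every sample and every proxy: (B, C)
    logits = scale * embeddings @ proxies.t()
    # Cross-entropy over data-to-proxy similarities plays the role of the
    # pairwise losses it replaces, at a fraction of the cost.
    return F.cross_entropy(logits, labels)

# Usage (illustrative):
# proxies = torch.nn.Parameter(torch.randn(num_classes, emb_dim))
# loss = proxy_nca_loss(model(images), labels, proxies)
```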
3 Relational Proxies
Consider an image x ∈ X with a label y ∈ Y. Let g = c_g(x) and L = {l_1, l_2, ..., l_k} = c_l(x) be the global view and the set of local views of an image x respectively, where c_g and c_l are cropping functions applied on x to obtain such views. Let f be an encoder that takes as input v ∈ {g} ∪ L and maps it to a latent-space representation z ∈ R^d, where d is the representation dimensionality. Specifically, the representations of the global view g and the local views L obtained from f are then denoted by z_g = f(g) and Z_L = {f(l) : l ∈ L} = {z_{l_1}, z_{l_2}, ..., z_{l_k}} respectively. Let R : (g, L) → r be a random variable that encodes the relationships r between the global (g) and the set of local (L) views.
[Figure 1 schematic: the input x is split into a global crop (c_g) and local crops (c_l); both are encoded by f into z_g and z_{l_1}, ..., z_{l_k}; the AST and ρ compose ξ, producing the relational embedding r; proxies p_1, p_2 align same-category and discriminate different-category embeddings under L_rproxy.]
Figure 1: We start by encoding the global and local views using a relation-agnostic encoder f. We then compute the cross-view relational embedding r between the global z_g and the summary of local representations z_L. The AST, in conjunction with ρ, forms the cross-view relational function ξ. Finally, the learning of our Relational Proxies is conditioned by both view-specific (z_L and z_g) and cross-view relational (r) information. Minimizing L_rproxy helps to align representations from the same category, while discriminating across different categories in a metric space.
3.1 Problem Definition
We leverage the qualitative consistency in the definition of the fine-grained visual categorization (FGVC) problem in the relevant literature [25, 49, 13, 3] to formalize the same in more quantitative terms as follows.
Definition 1 (k-distinguishability). Two categories C_1 and C_2 are said to be k-distinguishable iff, along with the global view, a classifier needs at least k local features to tell them apart, i.e., the true hypothesis can only distinguish between C_1 and C_2 if it has access to the complete set {z_g} ∪ Z_L, and it fails to distinguish between C_1 and C_2 if it only has access to {z_g} ∪ Z_L \ {z_l}, for any z_l ∈ Z_L.

The notion of k-distinguishability formalizes what it means for two categories to be distinguishable only in the fine-grained setting and not in the coarse-grained one. Given the concept of k-distinguishability, the definition of the FGVC problem follows directly:
Definition 2 (Fine-Grained Visual Categorization Problem - P_FGVC). A categorization problem is said to belong to the P_FGVC family iff there exists at least one pair of categories C_1 and C_2 such that they are k-distinguishable.
Unless otherwise stated, all datapoints (x, y) are considered to be sampled from k-distinguishable categories of an instance of P_FGVC. In the subsequent sections, we prove that for a learner to completely model the class distribution for an instance of P_FGVC, it must, alongside the view-specific representations z_g and Z_L, also learn a function ξ that models the cross-view relationship between the global and the local views. Thus, to model R, a function ξ must satisfy the following properties: (1) View-Unification: it maps the set of all views {g, l ∈ L} of an image x to a single output r; (2) Permutation Invariance: it produces the same output irrespective of the order of the local attributes, i.e., ξ(z_g, {z_{l_1}, z_{l_2}, ..., z_{l_k}}) = ξ(z_g, {z_{l_π(1)}, z_{l_π(2)}, ..., z_{l_π(k)}}) for every permutation π, where z_g and z_{l_i} are the representations of the global and the local views respectively, obtained from f. We provide more details on the necessity of these properties in Section 6.1 of the Appendix.
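A minimal sketch of a function satisfying both properties is given below: averaging the local embeddings gives permutation invariance, and fusing the result with the global embedding into a single vector gives view-unification. This is only an illustration of the two properties, not the ξ used by our method (which is described in Section 3.4).

```python
import torch
import torch.nn as nn

class SimpleXi(nn.Module):
    """Illustrative ξ: (z_g, {z_{l_i}}) -> r, permutation-invariant in the local views."""
    def __init__(self, d: int):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, z_g: torch.Tensor, Z_L: list) -> torch.Tensor:
        # The mean over the set of local embeddings is invariant to their order.
        z_mean = torch.stack(Z_L, dim=0).mean(dim=0)
        # A single output r for all views of the image (view-unification).
        return self.fuse(torch.cat([z_g, z_mean], dim=-1))

# Permutation-invariance check (illustrative):
# r1 = xi(z_g, [z1, z2, z3]); r2 = xi(z_g, [z3, z1, z2])
# assert torch.allclose(r1, r2)
```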
3.2 Relation-Agnostic Representations and Information Gap
In this section, we formally study the nature of the representation spaces learned by models that do not consider cross-view relational information in the context of P_FGVC. We term such representations "relation-agnostic" and prove via Proposition 1 that they suffer from an Information Gap, and are thus unable to capture the complete label information encoded in an input image.
Definition 3 (Relation-Agnostic Representations - Information Theoretic). An encoder is said to produce relation-agnostic representations if it independently encodes the global view g and the local views l ∈ L of x without considering their relationship information r.
Lemma 1. Given a relation-agnostic representation z of x, the conditional mutual information between x and y given z can be reduced to I(x; r | z).
Proof. Given a relation-agnostic representation z of x, the only uncertainty that remains about the label information y can be quantified as the cross-view relational information r, i.e., I(x; y | z) = I(x; r). The proof of this statement is given in Identity 1 of the Appendix.

Intuitively, the conditional mutual information between x and y given z, i.e., I(x; y | z), represents the information for predicting y from x that z is unable to capture. Since z is relation-agnostic, the only uncertainty that remains in x after z is the cross-view relationship between the global and the local views, i.e., r. Therefore, we can write I(x; y | z) = I(x; r). Using this equality and further factorizing I(x; r) using the chain rule for mutual information, we get:

I(x; y | z) = I(x; r) = I(x; r | z) + I(r; z) = I(x; r | z),

the latter equality following from Definition 3, which implies that I(r; z) = 0, since z does not explicitly model the local-to-global relationships r.
Lemma 2. The mutual information between x and its relation-agnostic representation z does not change with the knowledge of r.
Proof. Following the chain rule [12], the mutual information between x and z, i.e., I(x; z), can be expressed as I(x; z | r) + I(z; r). However, since z is relation-agnostic (Definition 3), I(z; r) = 0. Thus, I(x; z) = I(x; z | r) + I(z; r) = I(x; z | r).
Proposition 1. For a relation-agnostic representation z of x, the label information encoded in z is strictly upper-bounded by the label information in x, i.e., I(x; y) > I(z; y), by an amount I(x; r | z).
Proof. The mutual information I(x; y) between a datapoint x and its ground-truth label y can be expressed as I(x; y | z) + I(x; z) based on the chain rule. Here, I(x; y | z) represents the information for predicting y from x that z is unable to capture, while I(x; z) denotes the predictive information that z does capture from x. We can thus rewrite I(x; y) using Lemma 1 and Lemma 2 as:

I(x; y) = I(x; y | z) + I(x; z) = I(x; r | z) + I(x; z | r).    (1)

Now, using the chain rule of mutual information, I(z; y) = I(z; y | x) + I(z; x). However, as a consequence of the data processing inequality [12], I(z; y | x) = 0 (since z cannot encode any more information about y than x). Applying this and Lemma 2 to Equation (1):

I(x; y) = I(x; r | z) + I(x; z | r) = I(x; r | z) + I(x; z) = I(x; r | z) + I(z; y).

Therefore, I(x; y) > I(z; y), by an amount I(x; r | z).
Intuition:
By establishing a strict upper-bound, Proposition 1 shows that relation-agnostic encoders
cannot fully capture the label information in an input image. The quantity they are unable to capture
is given by I(x;r|z), which we call the Information Gap.
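For convenience, the chain of equalities used in the proof can be collected in one place; the block below is only a compact restatement of Lemmas 1 and 2 and the data-processing step above, not an additional result.

```latex
\begin{align*}
I(x;y) &= I(x;y\mid z) + I(x;z)        && \text{(chain rule)}\\
       &= I(x;r\mid z) + I(x;z\mid r)  && \text{(Lemma 1; Lemma 2)}\\
       &= I(x;r\mid z) + I(x;z)        && \text{(Lemma 2)}\\
       &= I(x;r\mid z) + I(z;y)        && \text{(}I(z;y\mid x)=0\text{ by data processing)}\\
\Rightarrow\quad I(x;y) - I(z;y) &= I(x;r\mid z) \quad \text{(the Information Gap).}
\end{align*}
```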
3.3 Sufficient Learner
Proposition 1 states that the information gap exists if the representation space happens to be relation-agnostic. We now explore whether relation-agnostic representations really need to be learned in the first place. From there, we identify the necessary and sufficient conditions for a complete learning of I(x; y), and derive the requirements for a learner to achieve it.
Definition 4 (Relation-Agnostic Representations - Geometric). Let n_ε(·) represent the ε-neighbourhood around a point in the limit ε → 0 (the choice of ε determines the degree of relation-agnosticity of the representation space). A representation space is relation-agnostic if and only if ∀ z_l ∈ Z_L : n_ε(z_l) ∩ n_ε(z_g) = ∅.
An intuitive explanation of Definition 4 can be found in Section 6.3 of the Appendix.
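As a rough numerical counterpart to Definition 4, the check below verifies that every local embedding lies outside the ε-neighbourhood of the global embedding; the Euclidean metric and the default value of ε are assumptions made purely for illustration.

```python
import torch

def is_relation_agnostic(z_g: torch.Tensor, Z_L: torch.Tensor, eps: float = 1e-3) -> bool:
    """Geometric check from Definition 4: the ε-neighbourhoods of z_g and every z_l
    must be disjoint, i.e. ||z_l - z_g|| > 2ε for all local embeddings.

    z_g: (d,) global embedding; Z_L: (k, d) local embeddings.
    """
    dists = torch.linalg.vector_norm(Z_L - z_g.unsqueeze(0), dim=-1)  # (k,)
    # Two ε-balls are disjoint iff their centres are more than 2ε apart.
    return bool((dists > 2 * eps).all())
```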
Axiom 1. f learns representations z such that a classifier operating on the domain of z learns a distribution ŷ, minimizing its cross-entropy with the true distribution, −Σ_i y_i log(ŷ_i), where i denotes the i-th class.
Lemma 3. For an instance of P_FGVC, the representation space learned by f is relation-agnostic, i.e., the global view g and the set of local views l ∈ L are mapped to disjoint locations in the representation space.
Proof. From Definition 4, a representation space is not relation-agnostic iff ∃ z_l ∈ Z_L : n_ε(z_l) ∩ n_ε(z_g) ≠ ∅. Under this condition, the classifier only has the information from {z_g} ∪ Z_L \ {z_l} instead of the required {z_g} ∪ Z_L. Thus, for instances of P_FGVC, according to Definition 1, removing the relation-agnostic nature from the representation space of f would cause a downstream classifier to produce misclassifications across the instances of k-distinguishable categories, leading to a violation of Axiom 1. Hence, f can only learn relation-agnostic representations.
We can thus conclude from Lemma 3 and Proposition 1 that the necessary and sufficient conditions for a learner to capture the complete label information I(x; y) are to consider both (1) the relation-agnostic information z and (2) the cross-view relational information r.
Proposition 2. An encoder f trained to learn relation-agnostic representations z of datapoints x cannot be used to model the relationship r between the global and local views of x.
Proof. f : x_v → z_v is a unary function that takes as input a (global or local) view x_v of an image x and produces view-specific (Lemma 3) representations z_v for a downstream function g : z_v → y. For f to model the cross-view relationships, it must output the same vector r irrespective of whether x_v = g or x_v = l ∈ L, i.e., whether x_v is a global or a local view of the input image x (the View-Unification property of ξ). However, Lemma 3 prevents this from happening by requiring the output space of f to be relation-agnostic. Hence, f cannot be used to model r.
Thus, to bridge the information gap, a learner must have distinct sub-models that individually satisfy the properties of being relation-agnostic and relation-aware. Only such a learner could qualify as being sufficient for an instance of P_FGVC.
Intuition: In this section, we have effectively proven that the properties of relation-agnosticity and relation-awareness are dual to each other. We show that while relation-agnosticity is not sufficient, it is a necessary condition for encoding the complete label information I(x; y). We also show that a single encoder cannot model both properties without violating one of the necessary criteria. The requirement of a separate, relation-aware sub-model follows from here.
3.4 Learning Relation-Agnostic and Relation-Aware Representations
Figure 1 depicts the end-to-end design of our framework. Derived from our theoretical findings, it comprises both the relation-agnostic encoder f and the cross-view relational function ξ, expressed as a composition of the Attribute Summarization Transformer (AST) and a network for view-unification, ρ. Below, we elaborate on each of these components.
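Before detailing the individual components, the following structural sketch shows how they could compose in a single forward pass; the module names, the concatenation-based fusion inside ρ, and the tensor shapes are placeholder assumptions, with the actual components elaborated below.

```python
import torch
import torch.nn as nn

class RelationalProxiesPipeline(nn.Module):
    """Structural sketch of Figure 1; component internals are described below."""
    def __init__(self, f: nn.Module, ast: nn.Module, rho: nn.Module):
        super().__init__()
        self.f = f        # relation-agnostic encoder, shared across all views
        self.ast = ast    # Attribute Summarization Transformer
        self.rho = rho    # view-unification network

    def forward(self, g: torch.Tensor, local_views: list):
        z_g = self.f(g)                                             # (B, d) global embedding
        Z_L = torch.stack([self.f(l) for l in local_views], dim=1)  # (B, k, d) local embeddings
        z_L = self.ast(Z_L)                                         # (B, d) local summary
        r = self.rho(torch.cat([z_g, z_L], dim=-1))                 # cross-view relation, ξ = ρ ∘ AST
        return z_g, Z_L, r
```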
Relation-Agnostic Representations: We follow recent literature [44, 49] for localizing the object of interest in the input image x and obtaining the global view g, by thresholding the final-layer activations of a CNN encoder f and detecting the largest connected component in the thresholded feature map. We obtain the set of local views {l_1, l_2, ..., l_k} as sub-crops of g (more details in Section 4.1). Following the primary requirement of Proposition 1, we produce relation-agnostic representations by propagating g and l_i through a CNN encoder f that independently encodes the two view families as z_g = f(g) and z_{l_i} = f(l_i).
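A minimal sketch of this activation-based localization is shown below; the channel-averaging, the mean-based threshold, and the use of SciPy's connected-component labelling are assumptions for illustration rather than the exact procedure of [44, 49].

```python
import numpy as np
import torch
from scipy import ndimage

def global_crop_from_activations(x: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
    """x: (3, H, W) image; feat: (C, h, w) final-layer CNN activations for x.
    Returns the crop of x covering the largest connected high-activation region."""
    act = feat.mean(dim=0)                       # (h, w) channel-averaged activation map
    mask = (act > act.mean()).cpu().numpy()      # threshold (assumed: per-image mean)
    labels, n = ndimage.label(mask)              # connected components
    if n == 0:
        return x                                 # fall back to the full image
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    largest = labels == (int(np.argmax(sizes)) + 1)
    rows, cols = np.where(largest)
    # Map feature-map coordinates back to image coordinates.
    H, W = x.shape[-2:]
    h, w = act.shape
    r0, r1 = int(rows.min() * H / h), int((rows.max() + 1) * H / h)
    c0, c1 = int(cols.min() * W / w), int((cols.max() + 1) * W / w)
    return x[:, r0:r1, c0:c1]
```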
Relational Embeddings: The second requirement, according to Proposition 1, for completely learning I(x; y) is to minimize I(x; r | z), i.e., the uncertainty about the relational information r encoded in x, given a relation-agnostic representation z. However, according to Proposition 2, we cannot do so using the relation-agnostic encoder f. Contrary to existing relational learning literature [32, 34] that assumes the nature of relationships to be known beforehand, we take a novel approach that models cross-view relationships as learnable representations of the input x. We follow the definition of the relationship-modelling function ξ : (g, L) → r, which takes as input the relation-agnostic representations of the global view z_g and the set of local views Z_L = {z_{l_1}, z_{l_2}, ..., z_{l_k}}, and outputs a relationship vector r, satisfying the View-Unification and Permutation Invariance properties.
We satisfy the Permutation Invariance property by aggregating the local representations via a novel Attribute Summarization Transformer (AST). We form a matrix whose columns constitute a learnable summary embedding z_L followed by the local representations z_{l_i}, as Z'_L = [z_L, z_{l_1}, z_{l_2}, ..., z_{l_k}]. We compute the self-attention output z' for each column z in Z'_L as z' = a · Z'_L W_v, where a = σ(z W_q) · (Z'_L W_k)^T / √D, and D is the embedding dimension. By iteratively performing this self-attention over Z'_L, the learnable summary embedding z_L aggregates the local representations in a permutation-invariant manner.
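The snippet below sketches a single attention layer of the kind described above, with a learnable summary token prepended to the local embeddings; it uses standard scaled dot-product attention and a single layer, which are simplifying assumptions rather than the exact AST.

```python
import torch
import torch.nn as nn

class ASTLayer(nn.Module):
    """One self-attention layer over [z_L, z_{l_1}, ..., z_{l_k}] (illustrative sketch)."""
    def __init__(self, d: int):
        super().__init__()
        self.summary = nn.Parameter(torch.randn(1, 1, d))   # learnable summary embedding z_L
        self.Wq = nn.Linear(d, d, bias=False)
        self.Wk = nn.Linear(d, d, bias=False)
        self.Wv = nn.Linear(d, d, bias=False)

    def forward(self, Z_L: torch.Tensor) -> torch.Tensor:
        """Z_L: (B, k, d) local embeddings -> (B, d) attribute summary."""
        B, _, d = Z_L.shape
        Z = torch.cat([self.summary.expand(B, -1, -1), Z_L], dim=1)       # (B, k+1, d)
        q, k, v = self.Wq(Z), self.Wk(Z), self.Wv(Z)
        attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # (B, k+1, k+1)
        Z_out = attn @ v                                                  # (B, k+1, d)
        return Z_out[:, 0]   # the updated summary token serves as the local summary
```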