
Proof. From Definition 4, a representation space is not relation-agnostic iff ∃ z_l ∈ Z_L : n(z_l) ∩ n(z_g) ≠ ∅. Under this condition, the classifier only has the information from {z_g} ∪ (Z_L \ {z_l}) instead of the required {z_g} ∪ Z_L. Thus, for instances of P_FGVC, according to Definition 1, removing the relation-agnostic nature from the representation space of f would cause a downstream classifier to produce misclassifications across the instances of k-distinguishable categories, leading to a violation of Axiom 1. Hence, f can only learn relation-agnostic representations.
We can thus conclude from Lemma 3 and Proposition 1 that the necessary and sufficient conditions for a learner to capture the complete label information I(x; y) are to consider both (1) the relation-agnostic information z and (2) the cross-view relational information r.
Proposition 2. An encoder f trained to learn relation-agnostic representations z of datapoints x cannot be used to model the relationship r between the global and local views of x.
Proof. f : x_v → z_v is a unary function that takes as input a (global or local) view x_v of an image x and produces view-specific (Lemma 3) representations z_v for a downstream function g : z_v → y. For f to model the cross-view relationships, it must output the same vector r irrespective of whether x_v = g or x_v = l ∈ L, i.e., whether x_v is a global or a local view of the input image x (the view-unification property of ξ). However, Lemma 3 prevents this from happening by requiring the output space of f to be relation-agnostic. Hence, f cannot be used to model r.
Thus, to bridge the information gap, a learner must have distinct sub-models that individually satisfy
the properties of being relation-agnostic and relation-aware. Only such a learner could qualify as
being sufficient for an instance of P_FGVC.
Intuition: In this section, we have effectively proven that the properties of relation-agnosticity and relation-awareness are dual to each other. We show that while relation-agnosticity is not sufficient, it is a necessary condition for encoding the complete label information I(x; y). We also show that a disjoint encoder cannot be used to model the two properties alone without violating one of the necessary criteria. The requirement of a separate, relation-aware sub-model follows from here.
3.4 Learning Relation-Agnostic and Relation-Aware Representations
Figure 1 depicts the end-to-end design of our framework. Derived from our theoretical findings, it comprises both the relation-agnostic encoder f and the cross-view relational function ξ, expressed as a composition of the Attribute Summarization Transformer (AST) and a network for view-unification, ρ. Below, we elaborate on each of these components.
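To make this composition concrete, below is a minimal sketch of how the components could be wired together; the module interfaces, the stacking of the local representations, and the concatenation of z_g with r before classification are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class RARFramework(nn.Module):
    """Hypothetical composition of the components described above: a
    relation-agnostic encoder f, the AST, and a view-unification network
    rho (the wiring shown here is an assumption for illustration)."""

    def __init__(self, f: nn.Module, ast: nn.Module, rho: nn.Module, classifier: nn.Module):
        super().__init__()
        self.f = f            # relation-agnostic CNN encoder
        self.ast = ast        # Attribute Summarization Transformer
        self.rho = rho        # view-unification network
        self.classifier = classifier

    def forward(self, g, locals_):
        # Relation-agnostic representations, encoded independently per view.
        z_g = self.f(g)                                               # (B, D)
        z_locals = torch.stack([self.f(l) for l in locals_], dim=1)   # (B, k, D)

        # Permutation-invariant summary of the local views, followed by
        # view unification to obtain the relational embedding r.
        z_summary = self.ast(z_locals)                                # (B, D)
        r = self.rho(z_g, z_summary)                                  # (B, D)

        # The downstream classifier sees both the relation-agnostic and
        # the relational information (assumed concatenation).
        return self.classifier(torch.cat([z_g, r], dim=-1))
```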
Relation-Agnostic Representations: We follow recent literature [44, 49] for localizing the object of interest in the input image x and obtaining the global view g by thresholding the final-layer activations of a CNN encoder f and detecting the largest connected component in the thresholded feature map. We obtain the set of local views {l_1, l_2, ..., l_k} as sub-crops of g (more details in Section 4.1). Following the primary requirement of Proposition 1, we produce relation-agnostic representations by propagating g and l_i through a CNN encoder f that independently encodes the two view families as z_g = f(g) and z_{l_i} = f(l_i).
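As a rough illustration of this localization step, the following sketch thresholds a channel-averaged activation map at its mean, keeps the largest connected component, and maps the resulting bounding box back to image coordinates; the threshold choice, the random sub-crop sampling, and the crop ratio are assumptions made for the example (the actual settings are given in Section 4.1).

```python
import numpy as np
from scipy import ndimage

def extract_global_view(image, feat_map, thresh=None):
    """Sketch of global-view localization: threshold the final-layer
    activation map of f, keep the largest connected component, and crop
    the corresponding image region. image: (H, W, 3) array; feat_map:
    (C, h, w) array of final-layer activations."""
    act = feat_map.mean(axis=0)                 # (h, w) channel-averaged activations
    if thresh is None:
        thresh = act.mean()                     # assumed threshold: mean activation
    mask = act > thresh

    labels, num = ndimage.label(mask)           # connected components
    if num == 0:
        return image                            # fall back to the full image
    sizes = ndimage.sum(mask, labels, range(1, num + 1))
    largest = labels == (np.argmax(sizes) + 1)

    ys, xs = np.where(largest)
    H, W = image.shape[:2]
    h, w = act.shape
    # Map feature-map coordinates back to image coordinates.
    y0, y1 = int(ys.min() * H / h), int((ys.max() + 1) * H / h)
    x0, x1 = int(xs.min() * W / w), int((xs.max() + 1) * W / w)
    return image[y0:y1, x0:x1]

def extract_local_views(global_view, k=4, crop_ratio=0.5):
    """Sketch: k random sub-crops of the global view serve as local views
    (the sampling scheme here is an assumption)."""
    H, W = global_view.shape[:2]
    ch, cw = int(H * crop_ratio), int(W * crop_ratio)
    rng = np.random.default_rng()
    crops = []
    for _ in range(k):
        y = rng.integers(0, H - ch + 1)
        x = rng.integers(0, W - cw + 1)
        crops.append(global_view[y:y + ch, x:x + cw])
    return crops
```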
Relational Embeddings: The second requirement, according to Proposition 1, for completely learning I(x; y) is to minimize I(x; r | z), i.e., the uncertainty about the relational information r encoded in x, given a relation-agnostic representation z. However, according to Proposition 2, we cannot do so using the relation-agnostic encoder f. Contrary to existing relational learning literature [32, 34] that assumes the nature of relationships to be known beforehand, we take a novel approach that models cross-view relationships as learnable representations of the input x. We follow the definition of the relationship-modelling function ξ : (g, L) → r, which takes as input the relation-agnostic representations of the global view z_g and of the set of local views Z_L = {z_{l_1}, z_{l_2}, ..., z_{l_k}}, and outputs a relationship vector r satisfying the View-Unification and Permutation Invariance properties.
We satisfy the Permutation Invariance property by aggregating the local representations via a novel Attribute Summarization Transformer (AST). We form a matrix whose columns constitute a learnable summary embedding z_L followed by the local representations z_{l_i}, as Z'_L = [z_L, z_{l_1}, z_{l_2}, ..., z_{l_k}]. We compute the self-attention output z'_* for each column z_* in Z'_L as z'_* = a · Z_L W, where a = σ(z_* W_q) · (Z_L W)^T / √D, and D is the embedding dimension. By iteratively performing
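A single attention step of the AST, as described above, could be sketched as follows; the shared projection W for keys and values, applying the softmax to the scaled query-key products, and reading the summary from the first output column are assumptions for illustration rather than the exact AST implementation.

```python
import torch
import torch.nn as nn

class ASTLayer(nn.Module):
    """One attention step of the Attribute Summarization Transformer
    (illustrative sketch; the AST applies such steps iteratively)."""

    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)            # W_q
        self.w = nn.Linear(dim, dim, bias=False)               # shared W (assumption)
        self.summary = nn.Parameter(torch.randn(1, 1, dim))    # learnable z_L

    def forward(self, z_locals: torch.Tensor) -> torch.Tensor:
        # z_locals: (B, k, D) relation-agnostic local representations.
        B, k, D = z_locals.shape
        z_prime = torch.cat([self.summary.expand(B, -1, -1), z_locals], dim=1)  # Z'_L: (B, k+1, D)

        q = self.w_q(z_prime)        # queries for every column z_* of Z'_L
        kv = self.w(z_locals)        # Z_L W, used as keys and values (assumption)

        # Attention weights a, scaled by sqrt(D) and normalized (assumed softmax).
        attn = torch.softmax(q @ kv.transpose(1, 2) / D ** 0.5, dim=-1)  # (B, k+1, k)
        z_out = attn @ kv            # z'_* = a · Z_L W

        # The updated summary column is a permutation-invariant aggregate of
        # the local views: permuting z_locals only permutes the keys/values,
        # leaving the summary's weighted sum unchanged.
        return z_out[:, 0]           # (B, D) summary embedding
```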