
Ridge-CCA. Our distance is most closely related to ridge-CCA, introduced by [Vin76] as a
regularized version of Canonical Correlation Analysis (CCA) for the case where the covariance
matrices $\Sigma_\phi$ or $\Sigma_\psi$ are close to singular. More specifically, for any
$\lambda \ge 0$, define the matrix
$C_\lambda := \Sigma_\phi^{-\lambda} \Sigma_{\phi\psi} \Sigma_\psi^{-\lambda} \Sigma_{\psi\phi}$;
the ridge-CCA similarity measure is defined as $\rho_{\lambda\text{-CCA}} = \mathrm{tr}(C_\lambda)$.
Hence, we readily see from Proposition 1 that GULP and ridge-CCA describe the same geometry over
representations. To see this, recall that Lemma 1 provides an isometric embedding
$f \mapsto f(X)^\top \Sigma_f^{-\lambda} f(X')$ of representation maps into $L^2(P_X^{\otimes 2})$.
While GULP is the distance on this Hilbert space, ridge-CCA is the inner product.
Ridge-CCA was briefly considered in the seminal work [KNLH19] but discarded because of (i) its lack
of interpretability and (ii) the absence of a rule to select $\lambda$. We argue that, in fact, our
prediction-driven derivation of GULP gives a clear and compelling interpretation of this geometry
(and suggests several extensions; see Section 5). Moreover, we show that the tunability of
$\lambda$ is a desirable feature: it makes it possible to represent the space of representations at
various resolutions, giving various levels of information; for example, in Figure 6, a higher
$\lambda$ leads to a coarser clustering structure.
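
The ridge-CCA similarity $\mathrm{tr}(C_\lambda)$ can be estimated from paired samples with a plug-in computation. The sketch below is illustrative only: it assumes that $\Sigma^{-\lambda}$ is read as the ridge inverse $(\Sigma + \lambda I)^{-1}$ (this section does not restate the definition), that representations are centered empirically, and the function name `ridge_cca_similarity` is hypothetical.

```python
import numpy as np


def ridge_cca_similarity(phi, psi, lam):
    """Plug-in estimate of tr(C_lambda) from n paired samples.

    phi, psi: arrays of shape (n, k_phi) and (n, k_psi); row i holds the
    two representations of the same input x_i.
    ASSUMPTION: Sigma^{-lam} is interpreted as (Sigma + lam * I)^{-1}.
    """
    n = phi.shape[0]
    phi = phi - phi.mean(axis=0)  # center each representation
    psi = psi - psi.mean(axis=0)
    s_phi = phi.T @ phi / n       # empirical Sigma_phi
    s_psi = psi.T @ psi / n       # empirical Sigma_psi
    s_cross = phi.T @ psi / n     # empirical Sigma_{phi psi}
    # (Sigma_phi + lam I)^{-1} Sigma_{phi psi}, via a linear solve
    left = np.linalg.solve(s_phi + lam * np.eye(s_phi.shape[0]), s_cross)
    # (Sigma_psi + lam I)^{-1} Sigma_{psi phi}
    right = np.linalg.solve(s_psi + lam * np.eye(s_psi.shape[0]), s_cross.T)
    return float(np.trace(left @ right))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    phi = rng.normal(size=(200, 3))
    psi = phi @ rng.normal(size=(3, 3)) + 0.5 * rng.normal(size=(200, 3))
    for lam in (0.0, 0.1, 1.0, 10.0):
        print(lam, ridge_cca_similarity(phi, psi, lam))
```

Using `np.linalg.solve` avoids forming explicit inverses, which matters precisely in the near-singular regime that motivates the ridge term.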

CCA. Due to the connection with ridge-CCA, our GULP distance is related to (unregularized) CCA when
$\lambda = 0$. Specifically, defining
$C := \Sigma_\phi^{-1} \Sigma_{\phi\psi} \Sigma_\psi^{-1} \Sigma_{\psi\phi}$, the mean-squared-CCA
similarity measure is given by (see [Eat07, Def. 10.2]):
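\[
\rho_{\mathrm{CCA}}(\phi, \psi) := \frac{\mathrm{tr}(C)}{k}
= 1 - \frac{1}{2k}\, \mathbb{E}\big(\phi(X)^\top \Sigma_\phi^{-1} \phi(X') - \psi(X)^\top \Sigma_\psi^{-1} \psi(X')\big)^2,
\]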
where $X$ is an independent copy of $X'$; the last identity can be checked directly. From Lemma 1 it
can be seen that the squared GULP distance $d_0^2(\phi, \psi)$ with $\lambda = 0$ is an affine
function of $\rho_{\mathrm{CCA}}$.
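
The identity relating $\mathrm{tr}(C)/k$ to the expected squared difference can be verified by expanding the square. The following is a sketch under assumptions not restated in this section: $\phi$ and $\psi$ are centered and both take values in $\mathbb{R}^k$, $\Sigma_\phi = \mathbb{E}[\phi(X)\phi(X)^\top]$ and $\Sigma_\psi$ are invertible, and $\Sigma_{\phi\psi} = \mathbb{E}[\phi(X)\psi(X)^\top] = \Sigma_{\psi\phi}^\top$. Each step uses the cyclic property of the trace and the independence of $X$ and $X'$.

```latex
\begin{align*}
\mathbb{E}\big(\phi(X)^\top \Sigma_\phi^{-1} \phi(X')\big)^2
  &= \mathrm{tr}\big(\Sigma_\phi^{-1}\, \mathbb{E}[\phi(X')\phi(X')^\top]\,
                     \Sigma_\phi^{-1}\, \mathbb{E}[\phi(X)\phi(X)^\top]\big)
   = \mathrm{tr}(I_k) = k, \\
\mathbb{E}\big[\phi(X)^\top \Sigma_\phi^{-1} \phi(X')\; \psi(X')^\top \Sigma_\psi^{-1} \psi(X)\big]
  &= \mathrm{tr}\big(\Sigma_\phi^{-1} \Sigma_{\phi\psi} \Sigma_\psi^{-1} \Sigma_{\psi\phi}\big)
   = \mathrm{tr}(C), \\
\mathbb{E}\big(\phi(X)^\top \Sigma_\phi^{-1} \phi(X') - \psi(X)^\top \Sigma_\psi^{-1} \psi(X')\big)^2
  &= k - 2\,\mathrm{tr}(C) + k = 2k - 2\,\mathrm{tr}(C).
\end{align*}
```

The first line applies equally to $\psi$, which gives the second $k$ in the last line. Substituting the last line yields $1 - \frac{1}{2k}\big(2k - 2\,\mathrm{tr}(C)\big) = \mathrm{tr}(C)/k$.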
It can be checked that $\rho_{\mathrm{CCA}}$ takes values in $[0, 1]$, which has led researchers to
simply propose $1 - \rho_{\mathrm{CCA}}$ as a dissimilarity measure. Interestingly, this choice
turns out to produce a valid (squared) metric, i.e., a dissimilarity measure whose square root
satisfies the triangle inequality. Indeed, we get that
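\[
d_{\mathrm{CCA}}^2(\phi, \psi) = 1 - \rho_{\mathrm{CCA}}(\phi, \psi)
= \frac{1}{2k}\, \mathbb{E}\big(K(\tilde\phi(X), \tilde\phi(X')) - K(\tilde\psi(X), \tilde\psi(X'))\big)^2,
\]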
where $K(u, v) = u^\top v$ is the linear kernel over $\mathbb{R}^d$, and
$\tilde\phi := \Sigma_\phi^{-1/2}\phi$ and $\tilde\psi := \Sigma_\psi^{-1/2}\psi$ are the whitened
versions of $\phi$ and $\psi$, respectively. These identities have two consequences: (i) we see
from Lemma 1 that $d_{\mathrm{CCA}}$ corresponds to the GULP distance with $\lambda = 0$ up to a
scaling factor and (ii) $d_{\mathrm{CCA}}$ is a valid pseudometric on the space of representations,
since we just exhibited an isometry $T : \tilde f \mapsto K(\tilde f(X), \tilde f(X'))$ into
$L^2(P_X^{\otimes 2})$. We show in Appendix A.2 that $d_{\mathrm{CCA}}(\phi, \psi) = 0$ iff
$\psi(X) = A\phi(X)$ a.s. for some matrix $A$. Note that the invariance of $\rho_{\mathrm{CCA}}$ to
linear transformations was previously known and criticized in [KNLH19] as arguably too strong.
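
The two expressions for $d_{\mathrm{CCA}}^2$ and the invariance to linear maps can be checked numerically on the empirical distribution of a sample. This is a minimal sketch with hypothetical helper names (`whiten`, `d2_cca`, `rho_cca`). It assumes both representations have the same dimension $k$, that the empirical covariances are invertible, and that the expectation over $(X, X')$ is taken over all ordered pairs of sample points.

```python
import numpy as np


def whiten(feats):
    """Center the rows and map them through Sigma^{-1/2} (empirical covariance)."""
    centered = feats - feats.mean(axis=0)
    cov = centered.T @ centered / centered.shape[0]
    evals, evecs = np.linalg.eigh(cov)  # assumes cov is positive definite
    return centered @ (evecs / np.sqrt(evals)) @ evecs.T


def d2_cca(phi, psi):
    """(1/2k) * E[(K(phi~(X), phi~(X')) - K(psi~(X), psi~(X')))^2]."""
    k = phi.shape[1]
    w_phi, w_psi = whiten(phi), whiten(psi)
    gram_diff = w_phi @ w_phi.T - w_psi @ w_psi.T  # linear-kernel Gram matrices
    return float(np.mean(gram_diff ** 2) / (2 * k))


def rho_cca(phi, psi):
    """tr(C) / k, computed as a squared Frobenius norm of the whitened cross-covariance."""
    k = phi.shape[1]
    cross = whiten(phi).T @ whiten(psi) / phi.shape[0]
    return float(np.sum(cross ** 2) / k)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k = 500, 4
    phi = rng.normal(size=(n, k))
    psi = rng.normal(size=(n, k))  # unrelated representation
    # A well-conditioned invertible matrix A (orthogonal times diagonal).
    q, _ = np.linalg.qr(rng.normal(size=(k, k)))
    a = q @ np.diag(np.linspace(0.5, 3.0, k))
    print("d2_cca(phi, psi)        :", d2_cca(phi, psi))
    print("1 - rho_cca(phi, psi)   :", 1.0 - rho_cca(phi, psi))
    print("d2_cca(phi, A phi)      :", d2_cca(phi, phi @ a.T))
```

On the empirical measure the two routes agree up to floating-point error, and the distance between $\phi$ and $A\phi$ vanishes, which is the invariance discussed above.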

CKA. In fact, thanks to the additional structure of the Hilbert space $L^2(P_X^{\otimes 2})$, the
$d_{\mathrm{CCA}}$ distance comes with an inner product
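\[
\langle T(\tilde\phi), T(\tilde\psi) \rangle_{\mathrm{CCA}}
:= \frac{1}{2k}\, \mathbb{E}\big[K(\tilde\phi(X), \tilde\phi(X'))\, K(\tilde\psi(X), \tilde\psi(X'))\big].
\]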
This observation allows us to connect CCA with CKA, another measure of similarity between
distributions that is borrowed from the classical literature on kernel methods [CSTEK01, CMR12] and
that was recently made popular by [KNLH19]. Under our normalization assumptions, CKA is a measure
of similarity given by
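\[
\rho_{\mathrm{CKA}}(\phi, \psi)
= \frac{\mathbb{E}\big[K(\phi(X), \phi(X'))\, K(\psi(X), \psi(X'))\big]}
       {\sqrt{\mathbb{E}\big[K(\phi(X), \phi(X'))^2\big]\, \mathbb{E}\big[K(\psi(X), \psi(X'))^2\big]}}
= \frac{\langle T(\phi), T(\psi) \rangle_{\mathrm{CCA}}}
       {\|T(\phi)\|_{\mathrm{CCA}}\, \|T(\psi)\|_{\mathrm{CCA}}}
= \cos\big(\angle(T(\phi), T(\psi))\big),
\]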
where $\|T\|_{\mathrm{CCA}}^2 = \langle T, T \rangle_{\mathrm{CCA}}$ and $\angle$ denotes the angle
in the geometry induced by $\langle \cdot, \cdot \rangle_{\mathrm{CCA}}$. In turn,
$d_{\mathrm{CKA}}^2$ is chosen as $d_{\mathrm{CKA}}^2 = 1 - \rho_{\mathrm{CKA}}$, which does not
yield a pseudometric. This observation highlights two major differences between CCA and CKA: the
first measures inner products and works with whitened representations, while the second measures
angles and works with raw representations. As illustrated in the experiments of Section 4 as well
as in [DDS21], this additional whitening step appears to be detrimental to the overall qualities of
this distance measure.
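
The cosine form of $\rho_{\mathrm{CKA}}$ translates directly into a computation on linear-kernel Gram matrices of the raw (unwhitened) representations. The sketch below is an illustration, not the authors' implementation: the name `linear_cka` is hypothetical, the "normalization assumptions" are taken to mean empirical centering only, and expectations over $(X, X')$ are averages over all ordered pairs of sample points.

```python
import numpy as np


def linear_cka(phi, psi):
    """Cosine between the linear-kernel Gram matrices of two representations."""
    phi = phi - phi.mean(axis=0)  # ASSUMPTION: centering is the only normalization
    psi = psi - psi.mean(axis=0)
    gram_phi = phi @ phi.T        # K(phi(x_i), phi(x_j)) for all pairs (i, j)
    gram_psi = psi @ psi.T
    numerator = np.mean(gram_phi * gram_psi)
    denominator = np.sqrt(np.mean(gram_phi ** 2) * np.mean(gram_psi ** 2))
    return float(numerator / denominator)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k = 500, 4
    phi = rng.normal(size=(n, k))
    q, _ = np.linalg.qr(rng.normal(size=(k, k)))  # random orthogonal matrix
    stretched = phi * np.array([0.5, 1.0, 2.0, 3.0])  # invertible, non-orthogonal map
    print("same representation  :", linear_cka(phi, phi))
    print("rotated and rescaled :", linear_cka(phi, 2.5 * phi @ q))
    print("anisotropic stretch  :", linear_cka(phi, stretched))
```

Because no whitening is applied, the score is unchanged by rotations and isotropic rescaling but drops under an anisotropic linear map, in contrast with the full linear invariance of $\rho_{\mathrm{CCA}}$.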

The fact that GULP with $\lambda = 0$ recovers $d_{\mathrm{CCA}}$ (i.e., $d_0^2 = 2k\, d_{\mathrm{CCA}}^2$)
is illustrated in Figure 2. As shown, although GULP has a roughly monotone relationship with CKA,
they remain quite different.