EXACT FIRST MOMENTS OF THE RV COEFFICIENT BY
INVARIANT ORTHOGONAL INTEGRATION
François Bavaud
University of Lausanne, Switzerland
fbavaud@unil.ch
October 4, 2022
ABSTRACT
The RV coefficient measures the similarity between two multivariate configurations, and its significance testing has attracted various proposals in recent decades. We present a new approach, invariant orthogonal integration, which yields the exact first four moments of the RV coefficient under the null hypothesis. It consists in averaging the respective orientations of the two configurations over the Haar measure, and can be applied to any multivariate setting endowed with Euclidean distances between the observations. Our proposal also covers the weighted setting of observations of unequal importance, where the exchangeability assumption justifying the usual permutation tests breaks down.
The proposed RV moments are expressed as simple functions of the kernel eigenvalues occurring in the weighted multidimensional scaling of the two configurations. The expressions for the third and fourth moments appear to be original. The first three moments can be obtained by elementary means, but computing the fourth moment requires a more sophisticated apparatus, the Weingarten calculus for orthogonal groups. The central role of standard kernels and their spectral moments is emphasized.
Keywords: RV coefficient · weighted multidimensional scaling · spectral moments · invariant orthogonal integration · Weingarten calculus
1 Introduction
The RV coefficient is a well-known measure of similarity between two datasets, each consisting of multivariate profiles measured on the same $n$ observations or objects. This contribution proposes a new approach, invariant orthogonal integration, which yields the exact first four moments of the RV coefficient under the null hypothesis of absence of relation between the two datasets. The main results, theorem 1 and corollary 1, are presented in section 3.1. The approach is fully nonparametric, and allows the handling of weighted objects, typically made of aggregates such as regions, documents or species, which abound in multivariate analysis.
In the present distance-based, data-analytic approach, datasets are constituted by weighted configurations, specified by the object weights together with their pairwise dissimilarities, assumed to be squared Euclidean. Factorial coordinates, reproducing the dissimilarities and permitting a maximum compression of the configuration inertia, are obtained by weighted multidimensional scaling. The latter, seldom presented in the literature and hence briefly recalled in section 2.1, is a direct generalization of classical scaling. The central step is the spectral decomposition of the matrix of weighted centered scalar products, or kernel. It decomposes the eigenspace into a trivial one-dimensional part, determined by the object weights and common to both configurations, and a non-trivial part of dimension $n-1$, orthogonal to the square root of the weights. The weighted RV coefficient is obtained as the normalized scalar product between the kernels of the two configurations (section 2.2), and turns out to be equivalent to its original definition expressed by cross-covariances (Escoufier, 1973; Robert and Escoufier, 1976).
arXiv:2210.00639v1 [math.ST] 2 Oct 2022
After recalling the above preliminaries, somewhat lengthy but necessary, the heart of this contribution can be uncovered: invariant orthogonal integration consists in computing the expected null moments of the RV coefficient by averaging the orientations of one configuration with respect to the other over the invariant Haar orthogonal measure in the non-trivial eigenspace, that is, by orthogonal transformation of, say, the first eigenspace (section 3.2). It constitutes a distinct alternative, with different outcomes, to the traditional permutation approach, whose exchangeability assumption breaks down for weighted objects: typically, the profile dispersion is expected to be larger for lighter objects (Bavaud, 2013), and the $n$ object scores cannot follow the same distribution. The present approach also yields a novel significance test for the RV coefficient (equation 16), taking into account skewness and kurtosis corrections to the usual normal approximation.
Computing the moments of the RV coefficient requires evaluating the orthogonal coefficients (23), constituted by Haar expectations of orthogonal monomials. Low-order moments can be computed, with increasing difficulty, by elementary means (section 3.3), but the fourth-order moment requires a more systematic approach (section 3.6), provided by the Weingarten calculus developed in random matrix theory and free probability. Both procedures yield the same results for low-order moments (section 3.7), which is both expected and reassuring.
The first RV moment (11) coincides with all known proposals. The second centered RV moment (12) is simpler than its permutation analog, and underlines the effective dimensionality of a configuration. The third centered RV moment (13) is particularly enlightening: the RV skewness is simply proportional to the product of the spectral skewnesses of both configurations, thus elucidating the often-noticed positive skewness of the RV coefficient. The fourth centered RV moment (9), (14) is also simple to express and to compute, yet more difficult to interpret.
2 Euclidean configurations in a weighted setting: a concise reminder
2.1 Weighted multidimensional scaling and standard kernels
Consider $n$ objects endowed with positive weights $f_i>0$ with $\sum_{i=1}^n f_i=1$, as well as with pairwise dissimilarities $D=(D_{ij})$ between pairs of objects. The $n\times n$ matrix $D$ is assumed to be squared Euclidean, that is, of the form $D_{ij}=\|x_i-x_j\|^2$ for $x_i,x_j\in\mathbb{R}^r$, with $r\le n-1$. The pair $(f,D)$ constitutes a weighted configuration, with $f_i=1/n$ for unweighted configurations.
Weighted multidimensional scaling aims at determining object coordinates $X=(x_{i\alpha})\in\mathbb{R}^{n\times r}$ reproducing the dissimilarities $D$ while expressing a maximum amount of dispersion or inertia (3) in low dimensions. It is performed by the following weighted generalization of the well-known Torgerson–Gower scaling procedure (see e.g. Borg and Groenen, 2005): first, define $\Pi=\mathrm{diag}(f)$, as well as the weighted centering matrix $H=I_n-\mathbf{1}_n f^\top$, which obeys $H^2=H$. However, $H^\top\ne H$, unless $f$ is uniform.
Second, compute the matrix $B$ of scalar products by double centering: $B=-\frac{1}{2}\,H D H^\top$. Third, define the $n\times n$ kernel $K$ as the matrix of weighted scalar products:
$$K=\sqrt{\Pi}\,B\,\sqrt{\Pi}\;,\qquad\text{that is}\qquad K_{ij}=\sqrt{f_i f_j}\,B_{ij}\;.$$
Fourth, perform the spectral decomposition, with $\hat{U}$ orthogonal and $\hat{\Lambda}$ diagonal:
$$K=\hat{U}\hat{\Lambda}\hat{U}^\top\qquad \hat{U}\hat{U}^\top=\hat{U}^\top\hat{U}=I_n\qquad \hat{\Lambda}=\mathrm{diag}(\lambda)\;.\qquad (1)$$
By construction, $K$ possesses one trivial eigenvalue $\lambda_0=0$ associated with the eigenvector $\sqrt{f}$, and $n-1$ non-negative eigenvalues ordered decreasingly as $\lambda_1\ge\lambda_2\ge\dots\ge\lambda_{n-1}\ge 0$, among which $r=\mathrm{rank}(K)$ are strictly positive.
From now on the trivial eigenspace will be discarded: set $\hat{U}=(\sqrt{f}\,|\,U)$, where $U\in\mathbb{R}^{n\times(n-1)}$ and $\Lambda=\mathrm{diag}(\lambda_1,\dots,\lambda_{n-1})$. Direct substitution from (1) yields
$$K=U\Lambda U^\top\qquad UU^\top=I_n-\sqrt{f}\sqrt{f}^\top\qquad U^\top U=I_{n-1}\qquad U^\top\sqrt{f}=0\;.\qquad (2)$$
Finally, the searched-for coordinates are obtained as $X=\Pi^{-\frac12}U\Lambda^{\frac12}$, that is, $x_{i\alpha}=u_{i\alpha}\sqrt{\lambda_\alpha}/\sqrt{f_i}$. One verifies easily that
$$D_{ij}=\sum_{\alpha=1}^{n-1}(x_{i\alpha}-x_{j\alpha})^2\qquad \Delta=\frac12\sum_{i,j=1}^{n} f_i f_j D_{ij}=\mathrm{Tr}(K)=\sum_{\alpha=1}^{n-1}\lambda_\alpha\;.\qquad (3)$$
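The scaling steps above can be sketched numerically. The following minimal Python illustration, on arbitrary synthetic data (all names and values are illustrative, not from the paper), checks that the resulting factorial coordinates reproduce $D$ and that the inertia identity (3) holds.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 6, 3
X0 = rng.standard_normal((n, r))             # arbitrary coordinates generating D
f = rng.dirichlet(np.ones(n))                # positive weights summing to 1

# squared Euclidean dissimilarities D_ij = ||x_i - x_j||^2
D = ((X0[:, None, :] - X0[None, :, :]) ** 2).sum(axis=2)

H = np.eye(n) - np.outer(np.ones(n), f)      # weighted centering matrix
B = -0.5 * H @ D @ H.T                       # double centering
K = np.sqrt(np.outer(f, f)) * B              # kernel K_ij = sqrt(f_i f_j) B_ij

# spectral decomposition; the trivial eigenvalue 0 belongs to eigenvector sqrt(f)
lam, U_hat = np.linalg.eigh(K)
order = np.argsort(lam)[::-1]                # decreasing eigenvalues
lam, U_hat = lam[order], U_hat[:, order]
U, Lam = U_hat[:, : n - 1], np.maximum(lam[: n - 1], 0.0)

# factorial coordinates x_ia = u_ia * sqrt(lambda_a) / sqrt(f_i)
X = (U * np.sqrt(Lam)) / np.sqrt(f)[:, None]

# the coordinates reproduce D, and the inertia equals Tr(K) -- equation (3)
D_rec = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
assert np.allclose(D_rec, D)
assert np.isclose(0.5 * f @ D @ f, np.trace(K))
```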
Figure 1: Two weighted configurations $(f,D_X)$ (left) and $(f,D_Y)$ (right) embedded in $\mathbb{R}^{n-1}$
The kernels considered here are positive semi-definite and obey in addition $K\sqrt{f}=0_n$. We call them standard kernels. They can be related to the weighted version of the centered kernels of machine learning (see e.g. Cortes et al., 2012). To each weighted configuration $(f,D)$ corresponds a unique standard kernel $K$, and conversely.
The matrix $K_0=I_n-\sqrt{f}\sqrt{f}^\top$ appearing in (2) constitutes a standard kernel, referred to as the neutral kernel in view of the property $K_0K=KK_0=K$ for any standard kernel $K$. The corresponding dissimilarities are the weighted discrete distances
$$D^0_{ij}=\begin{cases}\dfrac{1}{f_i}+\dfrac{1}{f_j} & \text{for } i\ne j\\[4pt] 0 & \text{otherwise.}\end{cases}$$
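As an illustration, the correspondence between the weighted discrete distances $D^0$ and the neutral kernel, and the neutrality property itself, can be checked numerically. The sketch below (arbitrary synthetic weights; not code from the paper) runs the scaling steps of section 2.1 on $D^0$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
f = rng.dirichlet(np.ones(n))                # positive weights summing to 1
s = np.sqrt(f)

# weighted discrete distances: D0_ij = 1/f_i + 1/f_j off the diagonal, 0 on it
D0 = 1.0 / f[:, None] + 1.0 / f[None, :]
np.fill_diagonal(D0, 0.0)

# the scaling steps of section 2.1 applied to D0 recover the neutral kernel K0
H = np.eye(n) - np.outer(np.ones(n), f)
B = -0.5 * H @ D0 @ H.T
K0 = np.sqrt(np.outer(f, f)) * B
assert np.allclose(K0, np.eye(n) - np.outer(s, s))

# neutrality: K0 K = K K0 = K for any standard kernel K (here built via K0)
C = rng.standard_normal((n, n))
K = K0 @ (C @ C.T) @ K0                      # standard: psd and K @ sqrt(f) = 0
assert np.allclose(K0 @ K, K) and np.allclose(K @ K0, K)
```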
2.2 The RV coefficient
Consider two weighted configurations $(f,D_X)$ and $(f,D_Y)$ endowed with the same weights $f$, or equivalently two standard kernels $K_X$ and $K_Y$ (Figure 1). Their similarity can be measured by the weighted RV coefficient defined as
$$\mathrm{RV}=\mathrm{RV}_{XY}=\frac{\mathrm{Tr}(K_X K_Y)}{\sqrt{\mathrm{Tr}(K_X^2)\,\mathrm{Tr}(K_Y^2)}}\qquad (4)$$
which constitutes the cosine similarity between the vectorized matrices $K_X$ and $K_Y$. As a consequence, $\mathrm{RV}_{XY}\ge 0$ (since $K_X$ and $K_Y$ are positive semi-definite), $\mathrm{RV}_{XY}\le 1$ (by the Cauchy–Schwarz inequality), and $\mathrm{RV}_{XX}=1$.
Quantity (4) is a straightforward weighted generalization of the RV coefficient (Escoufier, 1973; Robert and Escoufier, 1976): consider multivariate features $X\in\mathbb{R}^{n\times p}$ and $Y\in\mathbb{R}^{n\times q}$, directly entering into the definition of $D_X$ and $D_Y$ as coordinates, or equivalently as $K_X=\sqrt{\Pi}\,X_c X_c^\top\sqrt{\Pi}$ and $K_Y=\sqrt{\Pi}\,Y_c Y_c^\top\sqrt{\Pi}$, where $X_c=HX$ and $Y_c=HY$ are the centered scores.
The weighted covariances are $\Sigma_{XX}=X_c^\top\Pi X_c$ and $\Sigma_{YY}=Y_c^\top\Pi Y_c$. The cross-covariances are $\Sigma_{XY}=X_c^\top\Pi Y_c$ and $\Sigma_{YX}=Y_c^\top\Pi X_c=\Sigma_{XY}^\top$. The original RV coefficient is defined in the feature space as
$$\mathrm{RV}_{XY}=\frac{\mathrm{Tr}(\Sigma_{XY}\Sigma_{YX})}{\sqrt{\mathrm{Tr}(\Sigma_{XX}^2)\,\mathrm{Tr}(\Sigma_{YY}^2)}}\;.\qquad (5)$$
Proving the identity of (4) and (5) is easy.
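Indeed, substituting $K_X=\sqrt{\Pi}\,X_cX_c^\top\sqrt{\Pi}$ and $K_Y=\sqrt{\Pi}\,Y_cY_c^\top\sqrt{\Pi}$ and using the cyclic invariance of the trace turns each term of (4) into the corresponding term of (5). A minimal numerical check, on arbitrary synthetic features:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 7, 3, 2
f = rng.dirichlet(np.ones(n))                # common object weights
X = rng.standard_normal((n, p))              # features of the first configuration
Y = rng.standard_normal((n, q))              # features of the second configuration

Pi = np.diag(f)
H = np.eye(n) - np.outer(np.ones(n), f)      # weighted centering matrix
Xc, Yc = H @ X, H @ Y                        # centered scores

# kernel form (4)
sPi = np.sqrt(Pi)
KX = sPi @ Xc @ Xc.T @ sPi
KY = sPi @ Yc @ Yc.T @ sPi
rv_kernel = np.trace(KX @ KY) / np.sqrt(np.trace(KX @ KX) * np.trace(KY @ KY))

# covariance form (5)
Sxy = Xc.T @ Pi @ Yc
Sxx = Xc.T @ Pi @ Xc
Syy = Yc.T @ Pi @ Yc
rv_cov = np.trace(Sxy @ Sxy.T) / np.sqrt(np.trace(Sxx @ Sxx) * np.trace(Syy @ Syy))

assert np.isclose(rv_kernel, rv_cov)
```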
3 Computing the moments of the RV coefficient by invariant orthogonal integration
3.1 Main result and significance testing
Define the CV coefficient by the quantity $\mathrm{CV}=\mathrm{Tr}(K_X K_Y)$.