Robust Graph Structure Learning via
Multiple Statistical Tests
Yaohua Wang
Alibaba Group
xiachen.wyh@alibaba-inc.com
Fangyi Zhang *
Queensland University of Technology
Centre for Robotics (QCR)
fangyi.zhang@qut.edu.au
Ming Lin
Amazon
minglamz@amazon.com
Senzhang Wang
Central South University
szwang@csu.edu.cn
Xiuyu Sun
Alibaba Group
xiuyu.sxy@alibaba-inc.com
Rong Jin
Twitter
rongjinemail@gmail.com
Abstract
Graph structure learning aims to learn the connectivity of a graph from data. It is particularly important for many computer vision related tasks since no explicit graph structure is available for images in most cases. A natural way to construct a graph among images is to treat each image as a node and assign pairwise image similarities as weights to the corresponding edges. It is well known that pairwise similarities between images are sensitive to the noise in feature representations, leading to unreliable graph structures. We address this problem from the viewpoint of statistical tests. By viewing the feature vector of each node as an independent sample, the decision of whether to create an edge between two nodes based on their similarity in feature representation can be thought of as a single statistical test. To improve the robustness of this decision, multiple samples are drawn and integrated by multiple statistical tests to generate a more reliable similarity measure, and consequently a more reliable graph structure. The corresponding elegant matrix form, named B-Attention, is designed for efficiency. The effectiveness of multiple tests for graph structure learning is verified both theoretically and empirically on multiple clustering and ReID benchmark datasets. Source codes are available at https://github.com/Thomas-wyh/B-Attention.
1 Introduction
Graph structure learning plays an important role in Graph Convolutional Networks (GCNs). It seeks
to discover underlying graph structures for better graph representation learning when graph structures
are unavailable, noisy or corrupted. These issues are typical in vision tasks [58, 63, 62, 20, 42, 50, 57, 13, 39, 67, 47]. As an example, there is no explicit graph structure between the photos in your phone albums, so the graph-based approaches [20, 42, 57] in photo management software seek to learn graph structures.
Equal contribution.
Work done while at Alibaba.
Corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
A popular research direction considers graph structure learning as similarity measure learning upon the node embedding space [60]. Such methods learn weighted adjacency matrices [20, 42, 9, 10, 43, 64, 28, 8, 25, 35] as graphs and feed them to GCNs to learn representations.
However, the main problem with the above weight-based methods is that the weights are not always accurate and reliable: two nodes with a weaker association may receive a higher weight than two nodes with a stronger association. The root of this problem is closely related to the noise in the high-dimensional feature vectors extracted by deep learning models, as similarities between nodes are usually decided based on their feature vectors. The direct consequence of using an inaccurate graph structure is the so-called feature pollution [57], leading to a performance degradation in downstream tasks. In our empirical studies, we found that close to 1/3 of the total weight is assigned to noisy edges when unnormalized cosine similarities between feature vectors are used to calculate weights.
Many methods have been developed to improve the similarity measure between nodes in a graph, including adding learnable parameters [9, 28, 8], introducing nonlinearity [10, 43, 64], Gaussian kernel functions [25, 35] and self-attention [53, 12, 20, 42]. Despite this progress, they are limited to modifying the similarity function, without explicitly addressing the noise in feature vectors, a more fundamental issue for graph structure learning. One way to capture the noise in feature vectors is to view the feature vector of each node as an independent sample from some unknown distribution associated with that node. As a result, the similarity score between two nodes can be thought of as the expectation of a test that decides whether the two nodes share the same distribution. Since only one vector is sampled for each node in the above methods, the statistical test is essentially based on a single sample from each node, making the calculated weights inaccurate and unreliable.
A possible solution for addressing the noise in feature vectors is to run multiple tests, a common approach to reduce variance in sampling and decision making [4, 11, 26]. Multiple samples are drawn for each node, and each sample can be used to run an independent test to decide whether two nodes share the same distribution. Based on the results from multiple tests, we develop a novel and robust similarity measure that significantly improves the robustness and reliability of the calculated weights. This paper shows how to accomplish multiple tests in the absence of graph structures and theoretically proves why the proposed approach produces a better similarity measure under specified conditions. This similarity measure is also crafted into an efficient and elegant matrix form, named B-Attention. B-Attention utilizes the fourth-order statistics of the features, a significant difference from popular attention mechanisms. Benefiting from the better similarity measure produced by multiple tests, graph structures can be significantly improved, resulting in better graph representations and therefore boosting the performance of downstream tasks. We empirically verify the effectiveness of the proposed method on clustering and ReID tasks.
In summary, this paper makes the following three major contributions: (1) To the best of our knowledge, this is the first paper to address the problem of inaccurate edge weights via statistical tests in graph structure learning over images. (2) A novel and robust similarity measure is proposed based on multiple tests, proven theoretically to be superior and designed in an elegant matrix form to improve graph structure quality. (3) Extensive experiments further verify the robustness of the graph structure learning method against noise. State-of-the-art (SOTA) performances are achieved on multiple clustering and ReID benchmarks with the learned graph structures.
2 Methodology
To handle the noise in feature vectors, we model the feature vector of each node as a sample drawn from the underlying distribution associated with that node, and the similarity score between two nodes as the expectation of a statistical test. The test compares their similarity to a given threshold to decide whether the two nodes share the same distribution, and that decision can be captured by a binary Bernoulli random variable. The comparison in the statistical test may lead to error edges, i.e., noisy edges (an edge connecting two nodes of different categories) or missing edges (two nodes of the same category that should be connected but are not). The superiority of a similarity measure can be revealed by comparing its probability of creating an error edge. The proposed similarity measure is introduced in Section 2.2, and its further design in matrix form is presented in Section 2.3.
2.1 Preliminary
Chernoff Bounds. Sums of random variables have been studied for a long time [4, 11, 26]. Chernoff Bounds [11] provide sharply decreasing bounds, exponential in form, on the tail distributions of sums of independent random variables. They are a widely used fundamental tool in mathematics and computer science [32, 41]. An often used and looser form of the Chernoff Bounds is as follows.
Lemma 1 (Chernoff Bounds [11]). Suppose $X_i$ is a binary random variable taking values in $\{0, 1\}$ with $P(X_i = 1) = p_i$. Define $X = \sum_{i=1}^{n} X_i$, where $\{X_1, X_2, \cdots, X_n\}$ are independently sampled, and let $\mu = E[X]$. Then for any $0 < \epsilon < 1$,
$$P(X \geq (1+\epsilon)\mu) \leq e^{-\frac{\epsilon^2}{3}\mu}, \qquad P(X \leq (1-\epsilon)\mu) \leq e^{-\frac{\epsilon^2}{3}\mu}.$$
It shows that with a high probability $1-\delta$, we have $X \leq (1+\epsilon)\mu$, or $X \geq (1-\epsilon)\mu$, where $\delta = e^{-\frac{\epsilon^2}{3}\mu}$.
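To make the bound concrete, the following is a minimal NumPy sketch (ours, not from the paper) that compares the empirical tail probability of a sum of Bernoulli variables with the Chernoff estimate; the values of n, p and eps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, eps = 200, 0.3, 0.2      # illustrative values, not from the paper
mu = n * p                      # mean of X = sum of n Bernoulli(p) variables

# Draw 100k realizations of X.
X = rng.binomial(n, p, size=100_000)

empirical = np.mean(X >= (1 + eps) * mu)     # empirical tail probability
bound = np.exp(-(eps ** 2) * mu / 3)         # Chernoff upper bound

print(f"P(X >= (1+eps)*mu) ~= {empirical:.4f} <= {bound:.4f}")
```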
GCNs. A GCN network normally consists of multiple GCN layers. Given an input graph $G(\mathbf{X}, \mathbf{A})$ with $L$ vertices, the output of a normal GCN layer [34] is:
$$\mathbf{X}' = \sigma\left(\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{X} \mathbf{W}_l\right), \quad \text{where } \tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}, \; \tilde{\mathbf{D}}_{ii} = \sum_{j=1}^{L} \tilde{\mathbf{A}}_{i,j}, \tag{1}$$
$\mathbf{X} \in \mathbb{R}^{L \times M}$ is the input vertex features with $M$ dimensions, and $\mathbf{A} \in \mathbb{R}^{L \times L}$ is their adjacency matrix; $\mathbf{I} \in \mathbb{R}^{L \times L}$ is an identity matrix, and $\tilde{\mathbf{D}} \in \mathbb{R}^{L \times L}$ is the diagonal degree matrix of $\tilde{\mathbf{A}}$; $\mathbf{W}_l \in \mathbb{R}^{M \times M'}$ is a trainable matrix, and $\sigma(\cdot)$ is an activation function such as ReLU [19]; $\mathbf{X}' \in \mathbb{R}^{L \times M'}$ is the output vertex features with $M'$ dimensions, with $M' = M$ in this work.
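For concreteness, a GCN layer as in Equation 1 could be sketched in PyTorch as follows; this is a generic implementation of [34] under our own naming, not the authors' released code.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """A plain GCN layer implementing Equation 1 (sketch; names are ours)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # trainable W_l

    def forward(self, X: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # X: (L, M) vertex features, A: (L, L) adjacency matrix.
        A_tilde = A + torch.eye(A.size(0), device=A.device)    # add self-loops
        d = A_tilde.sum(dim=1)                                 # degree vector
        D_inv_sqrt = torch.diag(d.clamp(min=1e-12).pow(-0.5))  # D^{-1/2}
        A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt              # normalized adjacency
        return torch.relu(A_hat @ self.W(X))                   # sigma = ReLU
```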
2.2 Multiple tests on similarity measure
Let $\mathcal{V} = \{v_1, v_2, \dots, v_N\}$ be the collection of $N$ nodes. Each node is assumed to be in one of two categories, $+1$ or $-1$, for simplicity. For each node $v_i$, the subset of candidate nodes to connect to it is denoted by $\mathcal{V}_i$. This subset can be decided by a simple criterion such as $k$ nearest neighbours. When calculating the proposed similarity between $v_i$ and $v_j \in \mathcal{V}_i$, the common candidate nodes between $v_i$ and $v_j$, i.e., $\mathcal{V}_{i,j} = \mathcal{V}_i \cap \mathcal{V}_j$, are first identified. The nodes in $\mathcal{V}_{i,j}$ are treated as the multiple samples drawn for the comparison of $v_i$ and $v_j$. Then, each $v_k \in \mathcal{V}_{i,j}$ performs the single test twice, once for $v_k$ and $v_i$ and once for $v_k$ and $v_j$. The single test here is a comparison using the pairwise similarity based on the features of $v_k$ and $v_i$, or $v_k$ and $v_j$. The pairwise similarity can be cosine similarity, Gaussian kernel similarity or any other form, and is denoted by Sim-S. The proposed similarity score between $v_i$ and $v_j$ increases by one if the two tests yield $C(v_i) = C(v_k)$ and $C(v_k) = C(v_j)$, and by zero otherwise, where $C(v)$ is the category of node $v$. The procedure contains multiple single tests, hence the name multiple tests; the corresponding similarity measure is denoted by Sim-M.
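To illustrate the procedure, here is a minimal NumPy sketch of Sim-M with cosine similarity as the underlying Sim-S; the neighbourhood size k and the single-test threshold are illustrative assumptions rather than values prescribed by the paper.

```python
import numpy as np

def knn_candidates(Xn: np.ndarray, k: int) -> list:
    """Candidate set V_i for each node: its k nearest neighbours by cosine
    similarity. Xn is assumed to be row-wise L2-normalized."""
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)                 # exclude self-matches
    return [set(np.argsort(-S[i])[:k]) for i in range(len(Xn))]

def sim_m(Xn: np.ndarray, i: int, j: int, cands: list, thresh: float = 0.5) -> float:
    """Sim-M between nodes i and j: the fraction of common candidates v_k whose
    single test (Sim-S above a threshold) passes with BOTH v_i and v_j."""
    common = cands[i] & cands[j]                 # V_{i,j} = V_i ∩ V_j
    if not common:
        return 0.0
    votes = sum(float(Xn[k] @ Xn[i] > thresh and Xn[k] @ Xn[j] > thresh)
                for k in common)
    return votes / len(common)                   # average over m = |V_{i,j}| tests
```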
For the convenience of analysis, a few assumptions are further introduced. First, the size of $\mathcal{V}_{i,j}$ is $m$, and among the nodes in $\mathcal{V}_{i,j}$, $\alpha m$ nodes share the same category as $v_i$ and $(1-\alpha)m$ nodes belong to the opposite category, with $\alpha > \frac{1}{2}$. For any node $v_j \in \mathcal{V}_i$, its probability of being connected with $v_i$, according to the statistical test, is $q \in [0, 1]$ if $C(v_i) \neq C(v_j)$, and the probability increases to $p > q$ if $C(v_i) = C(v_j)$. This section will show that, if $m$ is sufficiently large (i.e., enough tests can be run), there exists an appropriate threshold for Sim-M such that the number of error edges can be reduced significantly compared to Sim-S.
Proposition 1. For the single test method, in expectation, the number of noisy edges is $E\{N_{noisy}\} = (1-\alpha)qm$ and the number of missing edges is $E\{N_{miss}\} = \alpha(1-p)m$.
The theorem below shows that both $E\{N_{noisy}\}$ and $E\{N_{miss}\}$ can be reduced by a significant portion when $m$ is large enough.
Theorem 1. Suppose $\gamma > 1$ and
$$m > \frac{3\left((p+q)^2 + (2\alpha-1)(p^2-q^2)\right)^2}{pq\left((p-q)^2 + (2\alpha-1)(p^2-q^2)\right)^2} \log\left(\frac{\gamma}{\min((1-\alpha)q, \, \alpha(1-p))}\right). \tag{2}$$
Then $E\{N_{noisy}\} \leq (1-\alpha)qm/\gamma$ and $E\{N_{miss}\} \leq \alpha(1-p)m/\gamma$.
Proof. Consider node $v_i$ and one of its candidate nodes $v_j \in \mathcal{V}_i$. Let $S^k_{i,j}$ denote the similarity score between $v_i$ and $v_j$ based on the statistical test with respect to $v_k$.
[Figure 1 depicts the two cases: $E[S^k_{i,j}] = \alpha \cdot pq + (1-\alpha) \cdot qp = pq$ if $C(v_i) \neq C(v_j)$, and $E[S^k_{i,j}] = \alpha \cdot p^2 + (1-\alpha) \cdot q^2$ if $C(v_i) = C(v_j)$.]
Figure 1: Illustration of $E[S^k_{i,j}]$ in Sim-M on two nodes from different categories and from the same category. Nodes of different colors are from different categories.
Figure 2: Illustration of the B-Attention mechanism. The self-attention part is the same as that in Transformers. The Q-Attention part generates $\mathbf{A}_X$ and then attends to it to generate $\mathbf{A}_{qart}$. The two output attention maps are fused as the final output $\mathbf{A}_{band}$.
As shown in Figure 1, $v_k$ may be of the same category as $v_i$ or not (denoted by $v'_k$ for clarity of illustration). Evidently, $E[S^k_{i,j}] = pq$ if $C(v_i) \neq C(v_j)$, and $E[S^k_{i,j}] = \alpha p^2 + (1-\alpha)q^2$ if $C(v_i) = C(v_j)$.
The Sim-M between $v_i$ and $v_j$ is defined as:
$$S_{i,j} = \frac{1}{m} \sum_{k \in \mathcal{V}_{i,j}} S^k_{i,j}. \tag{3}$$
Using the Chernoff Bounds [11], with a probability $1-\delta$ we have
$$S_{i,j} = \frac{1}{m}\sum_{k \in \mathcal{V}_{i,j}} S^k_{i,j} \leq \left(1 + \sqrt{\tfrac{3\log(1/\delta)}{pqm}}\right) pq, \quad C(v_i) \neq C(v_j),$$
$$S_{i,j} = \frac{1}{m}\sum_{k \in \mathcal{V}_{i,j}} S^k_{i,j} \geq \left(1 - \sqrt{\tfrac{3\log(1/\delta)}{(\alpha p^2 + (1-\alpha)q^2)m}}\right) \left(\alpha p^2 + (1-\alpha)q^2\right), \quad C(v_i) = C(v_j). \tag{4}$$
By assuming that $m$ is large enough, i.e.,
$$\sqrt{\frac{3\log(1/\delta)}{pqm}} < \frac{(p-q)^2 + (2\alpha-1)(p^2-q^2)}{(p+q)^2 + (2\alpha-1)(p^2-q^2)}, \tag{5}$$
and by choosing the similarity threshold
$$S_t = \left(1 + \sqrt{\tfrac{3\log(1/\delta)}{pqm}}\right) \frac{pq}{2} + \left(1 - \sqrt{\tfrac{3\log(1/\delta)}{(\alpha p^2 + (1-\alpha)q^2)m}}\right) \frac{\alpha p^2 + (1-\alpha)q^2}{2}, \tag{6}$$
we will miss a correct edge only with a probability $\delta$, and introduce a noisy edge only with a probability $\delta$. We complete the proof by substituting $\delta$ in Equation 5 with
$$\delta \triangleq \frac{1}{\gamma}\min\left((1-\alpha)q, \, \alpha(1-p)\right). \tag{7}$$
We can thus guarantee that the Sim-M between two nodes from the same category is strictly larger than that between two nodes from different categories, leading to the distillation of all noisy edges.
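As a hypothetical numeric illustration (the values below are ours, not the paper's): with $p = 0.8$, $q = 0.2$ and $\alpha = 0.75$, the expected single-test score is $pq = 0.16$ for nodes of different categories versus $\alpha p^2 + (1-\alpha)q^2 = 0.49$ for nodes of the same category, and the right-hand side of Equation 5 is $(0.36 + 0.30)/(1.00 + 0.30) \approx 0.51$. For $\gamma = 10$, Equation 2 then requires roughly $m > 3\log(200)/(0.16 \times 0.51^2) \approx 382$, i.e., a few hundred common candidates already suffice to cut the expected number of error edges by an order of magnitude.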
2.3 Multiple tests on GCNs
With the better similarity measure prototype above, we further craft it into an elegant and efficient matrix form for graph structure learning in the real world. The analysis in Theorem 1 can be easily extended to real-valued similarities, instead of binary similarities, by using the generalized version of the Chernoff inequality detailed in Section E of the Appendix. Given the L2-normalized original node features $\mathbf{X}$, the idea of Sim-S can be implemented by the inner product of features, i.e., $\mathbf{A}_X = \mathbf{X}\mathbf{X}^T$. The twice-run tests on each node can be achieved by $\mathbf{A}_X$ multiplying itself. Inspired by the design of popular Transformers [53], some learnable parameters are also introduced. The result contains the fourth-order statistics of the features, a significant difference from self-attention, which only contains second-order statistics. If self-attention is regarded as a Duet of Attention, our method is analogous to a Quartet of Attention (Q-Attention for short). The vectorized representation of Equation 3, the Q-Attention map $\mathbf{A}_{qart} \in \mathbb{R}^{L \times L}$, is then implemented as follows:
$$\mathbf{A}_{qart} = \mathbf{Q}_{qart}\mathbf{K}_{qart}^T, \quad \text{where } \mathbf{Q}_{qart} = \mathbf{A}_X\mathbf{W}^Q_{qart}, \; \mathbf{K}_{qart} = \mathbf{A}_X\mathbf{W}^K_{qart}, \tag{8}$$
$\mathbf{W}^Q_{qart} \in \mathbb{R}^{L \times L'}$ and $\mathbf{W}^K_{qart} \in \mathbb{R}^{L \times L'}$ are learnable weights, with $L' = L$ in this work. L2 normalization is also applied to $\mathbf{A}_X$.
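A minimal PyTorch sketch of Equation 8 might look as follows (class and variable names are our assumptions; see the repository above for the actual implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QAttention(nn.Module):
    """Q-Attention (Equation 8): attention over the similarity map A_X = X X^T."""
    def __init__(self, num_nodes: int):
        super().__init__()
        self.Wq = nn.Linear(num_nodes, num_nodes, bias=False)  # W^Q_qart, L' = L
        self.Wk = nn.Linear(num_nodes, num_nodes, bias=False)  # W^K_qart

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        X = F.normalize(X, dim=1)          # L2-normalize node features
        A_X = X @ X.T                      # Sim-S as inner products, shape (L, L)
        A_X = F.normalize(A_X, dim=1)      # L2 normalization applied to A_X
        Q, K = self.Wq(A_X), self.Wk(A_X)
        return Q @ K.T                     # A_qart: fourth-order in the features
```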
Motivated by the further enhancements obtained from combining intrinsic and implicit graph structures in the literature [35, 8, 36], self-attention can play an intrinsic role to fuse with the Q-Attention in vision tasks, although no intrinsic graph structure exists between images at all. The self-attention part is the same as that in Transformers [53]. The attention map $\mathbf{A}_{self} \in \mathbb{R}^{L \times L}$ is:
$$\mathbf{A}_{self} = \frac{\mathbf{Q}_{self}\mathbf{K}_{self}^T}{\sqrt{M_d}}, \quad \text{where } \mathbf{Q}_{self} = \mathbf{X}\mathbf{W}^Q_{self}, \; \mathbf{K}_{self} = \mathbf{X}\mathbf{W}^K_{self}, \tag{9}$$
$\mathbf{W}^Q_{self} \in \mathbb{R}^{M \times M_d}$ and $\mathbf{W}^K_{self} \in \mathbb{R}^{M \times M_d}$ are the learnable weights, and $M_d$ is the dimension of the query and key features. The combined form blends the duet and quartet of attention, hence the name Band of Attention (B-Attention for short). An overview of the B-Attention mechanism is shown in Figure 2.
The fusion of $\mathbf{A}_{self}$ and $\mathbf{A}_{qart}$ is defined as (where $\mathrm{softmax}(\cdot)$ is a softmax function):
$$\mathbf{A}_{band} = \mathrm{softmax}(\theta_{qart}\mathbf{A}_{qart} + \theta_{self}\mathbf{A}_{self}). \tag{10}$$
When combined with the neural network layer in Equation 1, the final form is:
$$\mathbf{X}' = \sigma(\mathbf{A}_{band}\mathbf{X}\mathbf{W}_l). \tag{11}$$
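Building on the QAttention sketch above, Equations 9-11 could be fused as follows; modeling $\theta_{qart}$ and $\theta_{self}$ as learnable scalars is our assumption for illustration, not a detail stated in this section.

```python
import torch
import torch.nn as nn

class BAttention(nn.Module):
    """B-Attention sketch: fuse self-attention (Eq. 9) with Q-Attention (Eq. 8),
    then apply the fused map as in Equations 10 and 11.
    Assumes the QAttention class from the previous sketch is in scope."""
    def __init__(self, num_nodes: int, feat_dim: int, attn_dim: int):
        super().__init__()
        self.q_attn = QAttention(num_nodes)                  # Q-Attention from above
        self.Wq = nn.Linear(feat_dim, attn_dim, bias=False)  # W^Q_self
        self.Wk = nn.Linear(feat_dim, attn_dim, bias=False)  # W^K_self
        self.Wl = nn.Linear(feat_dim, feat_dim, bias=False)  # W_l of Equation 11
        self.theta_qart = nn.Parameter(torch.ones(1))        # θ_qart (assumed scalar)
        self.theta_self = nn.Parameter(torch.ones(1))        # θ_self (assumed scalar)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        A_qart = self.q_attn(X)                              # Equation 8
        Q, K = self.Wq(X), self.Wk(X)
        A_self = Q @ K.T / (K.size(1) ** 0.5)                # Equation 9
        A_band = torch.softmax(self.theta_qart * A_qart
                               + self.theta_self * A_self, dim=1)  # Equation 10
        return torch.relu(self.Wl(A_band @ X))               # Equation 11

# Usage sketch: a 64-node sub-graph with 256-d features.
# layer = BAttention(num_nodes=64, feat_dim=256, attn_dim=128)
# X_out = layer(torch.randn(64, 256))
```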
In real implementations, considering that the graph size can be very large for large-scale datasets, a sub-graph sampling strategy [57, 55] is adopted to make the method more computationally efficient and scalable. In particular, $k_{seed}$ nodes are first selected as seeds from the dataset in each sampling step. Then the seed nodes together with their $k$ nearest neighbours ($k$NN) [14] form the nodes of a sub-graph. As a result, Equations 8 and 9 are applied to the sub-graph during training and inference. For a dataset of size $N_d$, it takes about $N_d/k_{seed}$ iterations to process the whole dataset on average. The pseudo code for self-attention and Q-Attention is detailed in Section D of the Appendix.
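For illustration, a sub-graph sampling loop along these lines might be sketched as follows (the helper names are ours; the actual sampler follows [57, 55]):

```python
import numpy as np

def sample_subgraphs(X: np.ndarray, k_seed: int, k: int, rng=None):
    """Yield node-index arrays of sub-graphs: k_seed random seeds plus their kNN."""
    rng = rng or np.random.default_rng()
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize features
    N = len(X)
    for _ in range(int(np.ceil(N / k_seed))):          # ~N_d / k_seed iterations
        seeds = rng.choice(N, size=k_seed, replace=False)
        sims = Xn[seeds] @ Xn.T                        # cosine similarity to all nodes
        knn = np.argsort(-sims, axis=1)[:, :k]         # k nearest neighbours per seed
        yield np.unique(np.concatenate([seeds, knn.ravel()]))
```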
3 Experiments
Experiments are designed at three levels. First, experiments are conducted to analyse the advantage of Sim-M over Sim-S and its robustness to noise. Second, the model armed with multiple tests on GCNs, i.e., B-Attention, is further investigated in comparison with self-attention and a commonly used form of GCNs in terms of robustness and superiority. Ablation experiments are then conducted to study how each part of B-Attention contributes to and influences the performance. Finally, with the robust graph structure learned by B-Attention, comparison experiments on downstream graph-based clustering and ReID tasks are conducted to evaluate the benefits of the proposed approach.
3.1 Experimental setups
Datasets. Experiments are conducted on commonly used visual datasets⁴ with three different types of objects, i.e., MS-Celeb [21] (human faces), MSMT17 [59] (human bodies) and VeRi-776 [37] (vehicles), to verify the generalization of the proposed method. In these datasets, each

⁴DECLARATION: Considering the ethical and privacy issues, all the datasets used in this paper are feature vectors which cannot be restored to images.