Robust Graph Structure Learning via
Multiple Statistical Tests
Yaohua Wang
Alibaba Group
xiachen.wyh@alibaba-inc.com
Fangyi Zhang *
Queensland University of Technology
Centre for Robotics (QCR)
fangyi.zhang@qut.edu.au
Ming Lin
Amazon
minglamz@amazon.com
Senzhang Wang
Central South University
szwang@csu.edu.cn
Xiuyu Sun
Alibaba Group
xiuyu.sxy@alibaba-inc.com
Rong Jin
Twitter
rongjinemail@gmail.com
Abstract
Graph structure learning aims to learn the connectivity of a graph from data. It is particularly important for many computer vision related tasks since no explicit graph structure is available for images in most cases. A natural way to construct a graph among images is to treat each image as a node and assign pairwise image similarities as weights to the corresponding edges. It is well known that pairwise similarities between images are sensitive to the noise in feature representations, leading to unreliable graph structures. We address this problem from the viewpoint of statistical tests. By viewing the feature vector of each node as an independent sample, the decision of whether to create an edge between two nodes based on their similarity in feature representation can be thought of as a single statistical test. To improve the robustness of this decision, multiple samples are drawn and integrated by multiple statistical tests to generate a more reliable similarity measure, and consequently a more reliable graph structure. The corresponding elegant matrix form, named B-Attention, is designed for efficiency. The effectiveness of multiple tests for graph structure learning is verified both theoretically and empirically on multiple clustering and ReID benchmark datasets. Source codes are available at https://github.com/Thomas-wyh/B-Attention.
1 Introduction
Graph structure learning plays an important role in Graph Convolutional Networks (GCNs). It seeks
to discover underlying graph structures for better graph representation learning when graph structures
are unavailable, noisy or corrupted. These issues are typical in vision tasks [58, 63, 62, 20, 42, 50, 57, 13, 39, 67, 47]. As an example, there is no explicit graph structure between the photos in your phone albums, so the graph-based approaches [20, 42, 57] in photo management software seek to learn graph structures.
Equal contribution.
Work done while at Alibaba.
Corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
A popular research direction considers graph structure learning as similarity measure learning upon the node embedding space [60]. Such methods learn weighted adjacency matrices [20, 42, 9, 10, 43, 64, 28, 8, 25, 35] as graphs and feed them to GCNs to learn representations.
However, the main problem with the above weight-based methods is that the weights are not always accurate and reliable: two nodes with a weaker association may receive a higher weight than two nodes with a stronger association. The root of this problem is closely related to the noise in the high-dimensional feature vectors extracted by deep learning models, as similarities between nodes are usually decided based on their feature vectors. The direct consequence of using an inaccurate graph structure is the so-called feature pollution [57], leading to a performance degradation in downstream tasks. In our empirical studies, we found that close to 1/3 of the total weight is assigned to noisy edges when unnormalized cosine similarities between feature vectors are used to calculate weights.
Many methods have been developed to improve the similarity measure between nodes in a graph, including adding learnable parameters [9, 28, 8], introducing nonlinearity [10, 43, 64], Gaussian kernel functions [25, 35] and self-attention [53, 12, 20, 42]. Despite this progress, they are limited to modifying the similarity function, without explicitly addressing the noise in feature vectors, a more fundamental issue for graph structure learning. One way to capture the noise in feature vectors is to view the feature vector of each node as an independent sample from some unknown distribution associated with that node. As a result, the similarity score between two nodes can be thought of as the expectation of a test that decides whether the two nodes share the same distribution. Since only one vector is sampled for each node in the above methods, the statistical test is essentially based on a single sample from each node, making the calculated weights inaccurate and unreliable.
A possible solution for addressing the noise in feature vectors is to run multiple tests, a common approach to reduce variance in sampling and decision making [4, 11, 26]. Multiple samples are drawn for each node, and each sample can be used to run an independent test to decide whether two nodes share the same distribution. Based on the results from multiple tests, we develop a novel and robust similarity measure that significantly improves the robustness and reliability of the calculated weights. This paper shows how to accomplish multiple tests in the absence of graph structures and theoretically proves why the proposed approach produces a better similarity measure under specified conditions. This similarity measure is also crafted into an efficient and elegant matrix form, named B-Attention. B-Attention utilizes the fourth-order statistics of the features, a significant difference from popular attention mechanisms. Benefiting from the better similarity measure produced by multiple tests, graph structures can be significantly improved, resulting in better graph representations and therefore boosting the performance of downstream tasks. We empirically verify the effectiveness of the proposed method on clustering and ReID tasks.
In summary, this paper makes the following three major contributions: (1) To the best of our knowledge, this is the first paper to address the problem of inaccurate edge weights via statistical tests in graph structure learning over images. (2) A novel and robust similarity measure is proposed based on multiple tests, proven theoretically to be superior and designed in an elegant matrix form to improve graph structure quality. (3) Extensive experiments further verify the robustness of the graph structure learning method against noise. State-of-the-art (SOTA) performances are achieved on multiple clustering and ReID benchmarks with the learned graph structures.
2 Methodology
To handle the noise in feature vectors, we model the feature vector of each node as a sample drawn from the underlying distribution associated with that node, and the similarity score between two nodes as the expectation of a statistical test. The test compares their similarity to a given threshold to decide whether the two nodes share the same distribution, and that decision can be captured by a binary Bernoulli random variable. The comparison in the statistical test may lead to error edges, i.e., noisy edges (an edge connecting two nodes of different categories) or missing edges (two nodes of the same category that should be connected but are not). The superiority of a similarity measure can be revealed by comparing its probability of creating an error edge. The proposed similarity measure is introduced in Section 2.2, and its further design in matrix form is presented in Section 2.3.
2.1 Preliminary
Chernoff Bounds. Sums of random variables have been studied for a long time [4, 11, 26]. Chernoff Bounds [11] provide sharply decreasing bounds, exponential in form, on the tail distributions of sums of independent random variables. They are a widely used fundamental tool in mathematics and computer science [32, 41]. An often used and looser form of the Chernoff Bounds is as follows.
Lemma 1 (Chernoff Bounds [11]). Suppose $X_i$ is a binary random variable taking values in $\{0, 1\}$ with $P(X_i = 1) = p_i$. Define $X = \sum_{i=1}^{n} X_i$, where $\{X_1, X_2, \cdots, X_n\}$ are independently sampled, and let $\mu = E[X]$. Then for any $0 < \epsilon < 1$,
$$P(X \geq (1+\epsilon)\mu) \leq e^{-\frac{\epsilon^2}{3}\mu}, \qquad P(X \leq (1-\epsilon)\mu) \leq e^{-\frac{\epsilon^2}{3}\mu}.$$
It shows that with a high probability $1-\delta$, we have $X \leq (1+\epsilon)\mu$, or $X \geq (1-\epsilon)\mu$, where $\delta = e^{-\frac{\epsilon^2}{3}\mu}$.
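To make the bound concrete, the following is a minimal NumPy sketch (ours, not from the paper) that compares the empirical tail probability of a sum of Bernoulli variables with the Chernoff estimate; the values of n, p and eps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, eps = 200, 0.3, 0.2      # illustrative values, not from the paper
mu = n * p                      # mean of X = sum of n Bernoulli(p) variables

# Draw 100k realizations of X.
X = rng.binomial(n, p, size=100_000)

empirical = np.mean(X >= (1 + eps) * mu)     # empirical tail probability
bound = np.exp(-(eps ** 2) * mu / 3)         # Chernoff upper bound

print(f"P(X >= (1+eps)*mu) ~= {empirical:.4f} <= {bound:.4f}")
```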
GCNs. A GCN network normally consists of multiple GCN layers. Given an input graph $G(\mathbf{X}, \mathbf{A})$ with $L$ vertices, the output of a normal GCN layer [34] is:
$$\mathbf{X}' = \sigma\left(\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{X} \mathbf{W}_l\right), \quad \text{where } \tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}, \; \tilde{\mathbf{D}}_{ii} = \sum_{j=1}^{L} \tilde{\mathbf{A}}_{i,j}, \tag{1}$$
$\mathbf{X} \in \mathbb{R}^{L \times M}$ is the input vertex features with $M$ dimensions, and $\mathbf{A} \in \mathbb{R}^{L \times L}$ is their adjacency matrix; $\mathbf{I} \in \mathbb{R}^{L \times L}$ is an identity matrix, and $\tilde{\mathbf{D}} \in \mathbb{R}^{L \times L}$ is the diagonal degree matrix of $\tilde{\mathbf{A}}$; $\mathbf{W}_l \in \mathbb{R}^{M \times M'}$ is a trainable matrix, and $\sigma(\cdot)$ is an activation function such as ReLU [19]; $\mathbf{X}' \in \mathbb{R}^{L \times M'}$ is the output vertex features with $M'$ dimensions, with $M' = M$ in this work.
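For concreteness, a GCN layer as in Equation 1 could be sketched in PyTorch as follows; this is a generic implementation of [34] under our own naming, not the authors' released code.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """A plain GCN layer implementing Equation 1 (sketch; names are ours)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # trainable W_l

    def forward(self, X: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # X: (L, M) vertex features, A: (L, L) adjacency matrix.
        A_tilde = A + torch.eye(A.size(0), device=A.device)    # add self-loops
        d = A_tilde.sum(dim=1)                                 # degree vector
        D_inv_sqrt = torch.diag(d.clamp(min=1e-12).pow(-0.5))  # D^{-1/2}
        A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt              # normalized adjacency
        return torch.relu(A_hat @ self.W(X))                   # sigma = ReLU
```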
2.2 Multiple tests on similarity measure
Let $\mathcal{V} = \{v_1, v_2, \dots, v_N\}$ be the collection of $N$ nodes. Each node is assumed to be in one of two categories, $+1$ or $-1$, for simplicity. For each node $v_i$, the subset of candidate nodes to connect to it is denoted by $\mathcal{V}_i$. This subset can be decided by a simple criterion such as $k$ nearest neighbours. When calculating the proposed similarity between $v_i$ and $v_j \in \mathcal{V}_i$, the common candidate nodes between $v_i$ and $v_j$, i.e., $\mathcal{V}_{i,j} = \mathcal{V}_i \cap \mathcal{V}_j$, are first identified. The nodes in $\mathcal{V}_{i,j}$ are treated as the multiple samples drawn for the comparison of $v_i$ and $v_j$. Then, each $v_k \in \mathcal{V}_{i,j}$ performs the single test twice, once for $v_k$ and $v_i$ and once for $v_k$ and $v_j$. The single test here is a comparison using the pairwise similarity based on the features of $v_k$ and $v_i$, or $v_k$ and $v_j$. The pairwise similarity can be cosine similarity, Gaussian kernel similarity or any other form, and is denoted by Sim-S. The proposed similarity score between $v_i$ and $v_j$ increases by one if the two tests yield $C(v_i) = C(v_k)$ and $C(v_k) = C(v_j)$, and by zero otherwise, where $C(v)$ is the category of node $v$. The procedure contains multiple single tests, hence the name multiple tests; the corresponding similarity measure is denoted by Sim-M.
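To illustrate the procedure, here is a minimal NumPy sketch of Sim-M with cosine similarity as the underlying Sim-S; the neighbourhood size k and the single-test threshold are illustrative assumptions rather than values prescribed by the paper.

```python
import numpy as np

def knn_candidates(Xn: np.ndarray, k: int) -> list:
    """Candidate set V_i for each node: its k nearest neighbours by cosine
    similarity. Xn is assumed to be row-wise L2-normalized."""
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)                 # exclude self-matches
    return [set(np.argsort(-S[i])[:k]) for i in range(len(Xn))]

def sim_m(Xn: np.ndarray, i: int, j: int, cands: list, thresh: float = 0.5) -> float:
    """Sim-M between nodes i and j: the fraction of common candidates v_k whose
    single test (Sim-S above a threshold) passes with BOTH v_i and v_j."""
    common = cands[i] & cands[j]                 # V_{i,j} = V_i ∩ V_j
    if not common:
        return 0.0
    votes = sum(float(Xn[k] @ Xn[i] > thresh and Xn[k] @ Xn[j] > thresh)
                for k in common)
    return votes / len(common)                   # average over m = |V_{i,j}| tests
```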
For the convenience of analysis, a few assumptions are further introduced. First, the size of $\mathcal{V}_{i,j}$ is $m$, and among the nodes in $\mathcal{V}_{i,j}$, $\alpha m$ nodes share the same category as $v_i$ and $(1-\alpha)m$ nodes belong to the opposite category, with $\alpha > \frac{1}{2}$. For any node $v_j \in \mathcal{V}_i$, its probability of being connected with $v_i$, according to the statistical test, is $q \in [0, 1]$ if $C(v_i) \neq C(v_j)$, and the probability increases to $p > q$ if $C(v_i) = C(v_j)$. This section will show that, if $m$ is sufficiently large (i.e., enough tests can be run), there exists an appropriate threshold for Sim-M such that the number of error edges can be reduced significantly compared to Sim-S.
Proposition 1. For the single test method, in expectation, the number of noisy edges is $E\{N_{noisy}\} = (1-\alpha)qm$ and the number of missing edges is $E\{N_{miss}\} = \alpha(1-p)m$.
The theorem below shows that both $E\{N_{noisy}\}$ and $E\{N_{miss}\}$ can be reduced by a significant portion when $m$ is large enough.
Theorem 1. Suppose $\gamma > 1$ and
$$m > \frac{3\left((p+q)^2 + (2\alpha-1)(p^2-q^2)\right)^2}{pq\left((p-q)^2 + (2\alpha-1)(p^2-q^2)\right)^2} \log\left(\frac{\gamma}{\min((1-\alpha)q, \, \alpha(1-p))}\right). \tag{2}$$
Then $E\{N_{noisy}\} \leq (1-\alpha)qm/\gamma$ and $E\{N_{miss}\} \leq \alpha(1-p)m/\gamma$.
Proof. Consider node $v_i$ and one of its candidate nodes $v_j \in \mathcal{V}_i$. Let $S^k_{i,j}$ denote the similarity score between $v_i$ and $v_j$ based on the statistical test with respect to $v_k$.
[Figure 1 depicts the two cases: $E[S^k_{i,j}] = \alpha \cdot pq + (1-\alpha) \cdot qp = pq$ if $C(v_i) \neq C(v_j)$, and $E[S^k_{i,j}] = \alpha \cdot p^2 + (1-\alpha) \cdot q^2$ if $C(v_i) = C(v_j)$.]
Figure 1: Illustration of $E[S^k_{i,j}]$ in Sim-M on two nodes from different categories and from the same category. Nodes of different colors are from different categories.
Figure 2: Illustration of the B-Attention mechanism. The self-attention part is the same as that in Transformers. The Q-Attention part generates $\mathbf{A}_X$ and then attends to it to generate $\mathbf{A}_{qart}$. The two output attention maps are fused as the final output $\mathbf{A}_{band}$.
As shown in Figure 1, $v_k$ may be of the same category as $v_i$ or not (denoted by $v'_k$ for clarity of illustration). Evidently, $E[S^k_{i,j}] = pq$ if $C(v_i) \neq C(v_j)$, and $E[S^k_{i,j}] = \alpha p^2 + (1-\alpha)q^2$ if $C(v_i) = C(v_j)$.
The Sim-M between $v_i$ and $v_j$ is defined as:
$$S_{i,j} = \frac{1}{m} \sum_{k \in \mathcal{V}_{i,j}} S^k_{i,j}. \tag{3}$$
Using the Chernoff Bounds [11], with a probability $1-\delta$ we have
$$S_{i,j} = \frac{1}{m}\sum_{k \in \mathcal{V}_{i,j}} S^k_{i,j} \leq \left(1 + \sqrt{\tfrac{3\log(1/\delta)}{pqm}}\right) pq, \quad C(v_i) \neq C(v_j),$$
$$S_{i,j} = \frac{1}{m}\sum_{k \in \mathcal{V}_{i,j}} S^k_{i,j} \geq \left(1 - \sqrt{\tfrac{3\log(1/\delta)}{(\alpha p^2 + (1-\alpha)q^2)m}}\right) \left(\alpha p^2 + (1-\alpha)q^2\right), \quad C(v_i) = C(v_j). \tag{4}$$
By assuming that $m$ is large enough, i.e.,
$$\sqrt{\frac{3\log(1/\delta)}{pqm}} < \frac{(p-q)^2 + (2\alpha-1)(p^2-q^2)}{(p+q)^2 + (2\alpha-1)(p^2-q^2)}, \tag{5}$$
and by choosing the similarity threshold
$$S_t = \left(1 + \sqrt{\tfrac{3\log(1/\delta)}{pqm}}\right) \frac{pq}{2} + \left(1 - \sqrt{\tfrac{3\log(1/\delta)}{(\alpha p^2 + (1-\alpha)q^2)m}}\right) \frac{\alpha p^2 + (1-\alpha)q^2}{2}, \tag{6}$$
we will miss a correct edge only with a probability $\delta$, and introduce a noisy edge only with a probability $\delta$. We complete the proof by substituting $\delta$ in Equation 5 with
$$\delta \triangleq \frac{1}{\gamma}\min\left((1-\alpha)q, \, \alpha(1-p)\right). \tag{7}$$
We can thus guarantee that the Sim-M between two nodes from the same category is strictly larger than that between two nodes from different categories, leading to the distillation of all noisy edges.
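As a hypothetical numeric illustration (the values below are ours, not the paper's): with $p = 0.8$, $q = 0.2$ and $\alpha = 0.75$, the expected single-test score is $pq = 0.16$ for nodes of different categories versus $\alpha p^2 + (1-\alpha)q^2 = 0.49$ for nodes of the same category, and the right-hand side of Equation 5 is $(0.36 + 0.30)/(1.00 + 0.30) \approx 0.51$. For $\gamma = 10$, Equation 2 then requires roughly $m > 3\log(200)/(0.16 \times 0.51^2) \approx 382$, i.e., a few hundred common candidates already suffice to cut the expected number of error edges by an order of magnitude.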
2.3 Multiple tests on GCNs
With the better similarity measure prototype above, we further craft it into an elegant and efficient matrix form for graph structure learning in the real world. The analysis in Theorem 1 can be easily extended to real-valued similarities, instead of binary similarities, by using the generalized version of the Chernoff inequality detailed in Section E of the Appendix. Given the L2-normalized original node features $\mathbf{X}$, the idea of Sim-S can be implemented by the inner product of features, i.e., $\mathbf{A}_X = \mathbf{X}\mathbf{X}^T$. The twice-run tests on each node can be achieved by $\mathbf{A}_X$ multiplying itself. Inspired by the design of popular Transformers [53], some learnable parameters are also introduced. The result contains the fourth-order statistics of the features, a significant difference from self-attention, which only contains second-order statistics. If self-attention is regarded as a Duet of Attention, our method is analogous to a Quartet of Attention (Q-Attention for short). The vectorized representation of Equation 3, the Q-Attention map $\mathbf{A}_{qart} \in \mathbb{R}^{L \times L}$, is then implemented as follows:
$$\mathbf{A}_{qart} = \mathbf{Q}_{qart}\mathbf{K}_{qart}^T, \quad \text{where } \mathbf{Q}_{qart} = \mathbf{A}_X\mathbf{W}^Q_{qart}, \; \mathbf{K}_{qart} = \mathbf{A}_X\mathbf{W}^K_{qart}, \tag{8}$$
$\mathbf{W}^Q_{qart} \in \mathbb{R}^{L \times L'}$ and $\mathbf{W}^K_{qart} \in \mathbb{R}^{L \times L'}$ are learnable weights, with $L' = L$ in this work. L2 normalization is also applied to $\mathbf{A}_X$.
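A minimal PyTorch sketch of Equation 8 might look as follows (class and variable names are our assumptions; see the repository above for the actual implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QAttention(nn.Module):
    """Q-Attention (Equation 8): attention over the similarity map A_X = X X^T."""
    def __init__(self, num_nodes: int):
        super().__init__()
        self.Wq = nn.Linear(num_nodes, num_nodes, bias=False)  # W^Q_qart, L' = L
        self.Wk = nn.Linear(num_nodes, num_nodes, bias=False)  # W^K_qart

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        X = F.normalize(X, dim=1)          # L2-normalize node features
        A_X = X @ X.T                      # Sim-S as inner products, shape (L, L)
        A_X = F.normalize(A_X, dim=1)      # L2 normalization applied to A_X
        Q, K = self.Wq(A_X), self.Wk(A_X)
        return Q @ K.T                     # A_qart: fourth-order in the features
```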
Motivated by the further enhancements obtained from combining intrinsic and implicit graph structures in the literature [35, 8, 36], self-attention can play an intrinsic role to fuse with the Q-Attention in vision tasks, although no intrinsic graph structure exists between images at all. The self-attention part is the same as that in Transformers [53]. The attention map $\mathbf{A}_{self} \in \mathbb{R}^{L \times L}$ is:
$$\mathbf{A}_{self} = \frac{\mathbf{Q}_{self}\mathbf{K}_{self}^T}{\sqrt{M_d}}, \quad \text{where } \mathbf{Q}_{self} = \mathbf{X}\mathbf{W}^Q_{self}, \; \mathbf{K}_{self} = \mathbf{X}\mathbf{W}^K_{self}, \tag{9}$$
$\mathbf{W}^Q_{self} \in \mathbb{R}^{M \times M_d}$ and $\mathbf{W}^K_{self} \in \mathbb{R}^{M \times M_d}$ are the learnable weights, and $M_d$ is the dimension of the query and key features. The combined form blends the duet and quartet of attention, hence the name Band of Attention (B-Attention for short). An overview of the B-Attention mechanism is shown in Figure 2.
The fusion of $\mathbf{A}_{self}$ and $\mathbf{A}_{qart}$ is defined as (where $\mathrm{softmax}(\cdot)$ is a softmax function):
$$\mathbf{A}_{band} = \mathrm{softmax}(\theta_{qart}\mathbf{A}_{qart} + \theta_{self}\mathbf{A}_{self}). \tag{10}$$
When combined with the neural network layer in Equation 1, the final form is:
$$\mathbf{X}' = \sigma(\mathbf{A}_{band}\mathbf{X}\mathbf{W}_l). \tag{11}$$
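Building on the QAttention sketch above, Equations 9-11 could be fused as follows; modeling $\theta_{qart}$ and $\theta_{self}$ as learnable scalars is our assumption for illustration, not a detail stated in this section.

```python
import torch
import torch.nn as nn

class BAttention(nn.Module):
    """B-Attention sketch: fuse self-attention (Eq. 9) with Q-Attention (Eq. 8),
    then apply the fused map as in Equations 10 and 11.
    Assumes the QAttention class from the previous sketch is in scope."""
    def __init__(self, num_nodes: int, feat_dim: int, attn_dim: int):
        super().__init__()
        self.q_attn = QAttention(num_nodes)                  # Q-Attention from above
        self.Wq = nn.Linear(feat_dim, attn_dim, bias=False)  # W^Q_self
        self.Wk = nn.Linear(feat_dim, attn_dim, bias=False)  # W^K_self
        self.Wl = nn.Linear(feat_dim, feat_dim, bias=False)  # W_l of Equation 11
        self.theta_qart = nn.Parameter(torch.ones(1))        # θ_qart (assumed scalar)
        self.theta_self = nn.Parameter(torch.ones(1))        # θ_self (assumed scalar)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        A_qart = self.q_attn(X)                              # Equation 8
        Q, K = self.Wq(X), self.Wk(X)
        A_self = Q @ K.T / (K.size(1) ** 0.5)                # Equation 9
        A_band = torch.softmax(self.theta_qart * A_qart
                               + self.theta_self * A_self, dim=1)  # Equation 10
        return torch.relu(self.Wl(A_band @ X))               # Equation 11

# Usage sketch: a 64-node sub-graph with 256-d features.
# layer = BAttention(num_nodes=64, feat_dim=256, attn_dim=128)
# X_out = layer(torch.randn(64, 256))
```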
In real implementations, considering that the graph size can be very large for large-scale datasets, a sub-graph sampling strategy [57, 55] is adopted to make the method more computationally efficient and scalable. In particular, $k_{seed}$ nodes are first selected as seeds from the dataset in each sampling step. Then the seed nodes together with their $k$ nearest neighbours ($k$NN) [14] form the nodes of a sub-graph. As a result, Equations 8 and 9 are applied to the sub-graph during training and inference. For a dataset of size $N_d$, it takes about $N_d/k_{seed}$ iterations to process the whole dataset on average. The pseudo code for self-attention and Q-Attention is detailed in Section D of the Appendix.
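For illustration, a sub-graph sampling loop along these lines might be sketched as follows (the helper names are ours; the actual sampler follows [57, 55]):

```python
import numpy as np

def sample_subgraphs(X: np.ndarray, k_seed: int, k: int, rng=None):
    """Yield node-index arrays of sub-graphs: k_seed random seeds plus their kNN."""
    rng = rng or np.random.default_rng()
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize features
    N = len(X)
    for _ in range(int(np.ceil(N / k_seed))):          # ~N_d / k_seed iterations
        seeds = rng.choice(N, size=k_seed, replace=False)
        sims = Xn[seeds] @ Xn.T                        # cosine similarity to all nodes
        knn = np.argsort(-sims, axis=1)[:, :k]         # k nearest neighbours per seed
        yield np.unique(np.concatenate([seeds, knn.ravel()]))
```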
3 Experiments
Experiments are designed at three levels. First, experiments are conducted to analyse the advantage of Sim-M over Sim-S and its robustness to noise. Second, the model armed with multiple tests on GCNs, i.e., B-Attention, is further investigated in comparison with self-attention and a commonly used form of GCNs in terms of robustness and superiority. Ablation experiments are then conducted to study how each part of B-Attention contributes to and influences the performance. Finally, with the robust graph structure learned by B-Attention, comparison experiments on downstream graph-based clustering and ReID tasks are conducted to evaluate the benefits of the proposed approach.
3.1 Experimental setups
Datasets. Experiments are conducted on commonly used visual datasets⁴ with three different types of objects, i.e., MS-Celeb [21] (human faces), MSMT17 [59] (human bodies) and VeRi-776 [37] (vehicles), to verify the generalization of the proposed method. In these datasets, each

⁴DECLARATION: Considering the ethical and privacy issues, all the datasets used in this paper are feature vectors which cannot be restored to images.