learning as similarity measure learning upon the node embedding space [60]. They learn weighted
adjacency matrices [20, 42, 9, 10, 43, 64, 28, 8, 25, 35] as graphs and feed them to GCNs to learn
representations.
However, the main problem with the above weight-based methods is that the weights are not always
accurate and reliable: two nodes with weaker associations may receive a higher weight score than
two with stronger associations. The root of this problem lies in the noise in the high-dimensional
feature vectors extracted by the deep learning model, since similarities between nodes are usually
decided based on their feature vectors. The direct consequence of using an inaccurate graph
structure is so-called feature pollution [57], leading to performance degradation in downstream
tasks. In our empirical studies, we found that close to 1/3 of the total weights are assigned to noisy edges
when unnormalized cosine similarities between feature vectors are used to calculate the weights.
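As a toy illustration of this effect (our own construction, not the paper's experiment), the fraction of weight mass landing on cross-category edges can be measured on synthetic features; the class centroids, noise scale, and labels below are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: two categories, each with a centroid; node features are
# noisy high-dimensional samples around their category centroid.
d, n_per_class = 64, 50
centroids = rng.normal(size=(2, d))
labels = np.repeat([0, 1], n_per_class)
feats = centroids[labels] + 1.5 * rng.normal(size=(2 * n_per_class, d))

# Unnormalized cosine similarity (i.e., inner product) as edge weights.
weights = feats @ feats.T
np.fill_diagonal(weights, 0.0)
weights = np.clip(weights, 0.0, None)  # keep only positive affinities

# A "noisy edge" connects nodes from different categories; measure the
# fraction of the total weight mass assigned to such edges.
noisy = labels[:, None] != labels[None, :]
noisy_fraction = weights[noisy].sum() / weights.sum()
print(f"weight mass on noisy edges: {noisy_fraction:.2f}")
```

With stronger per-dimension noise, an increasing share of the weight mass leaks onto cross-category edges, which is the failure mode described above.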
Many methods have been developed to improve the similarity measure between nodes in a graph, including
adding learnable parameters [9, 28, 8], introducing nonlinearity [10, 43, 64], Gaussian kernel
functions [25, 35] and self-attention [53, 12, 20, 42]. Despite this progress, they are limited to modifying
the similarity function, without explicitly addressing the noise in feature vectors, a more fundamental
issue for graph structure learning. One way to capture the noise in feature vectors is to view the
feature vector of each node as an independent sample from some unknown distribution associated
with that node. As a result, the similarity score between two nodes can be thought of as the expectation
of a test that decides whether the two nodes share the same distribution. Since only one vector is sampled for
each node in the above methods, the statistical test is essentially based on a single sample from each
node, making the calculated weights inaccurate and unreliable.
A possible solution for addressing the noise in feature vectors is to run multiple tests, a common
approach to reducing variance in sampling and decision making [4, 11, 26]. Multiple samples are
drawn for each node, and each sample can be used to run an independent test to decide whether two
nodes share the same distribution. Based on the results from multiple tests, we develop a novel
similarity measure that significantly improves the robustness and reliability of the calculated
weights. This paper shows how to accomplish multiple tests in the absence of graph structures and
theoretically proves why the proposed approach produces a better similarity measure under specified
conditions. The similarity measure is also crafted into an efficient and elegant matrix form, named
B-Attention. B-Attention utilizes fourth-order statistics of the features, a significant
difference from popular attention mechanisms. Benefiting from the better similarity measure produced by multiple
tests, graph structures can be significantly improved, resulting in better graph representations and
therefore boosting the performance of downstream tasks. We empirically verify the effectiveness of
the proposed method on the tasks of clustering and ReID.
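To make the multiple-tests idea concrete, here is a minimal sketch (our own construction, not the paper's B-Attention): each node is modeled as a Gaussian whose mean and noise scale are assumed for illustration, k samples per node drive k independent threshold tests, and the averaged Bernoulli outcomes form the similarity score.

```python
import numpy as np

rng = np.random.default_rng(1)

def multi_test_similarity(mu_a, mu_b, sigma, k, tau):
    """Average of k independent Bernoulli tests 1[cos(x_a, x_b) > tau],
    where x_a, x_b are fresh samples from each node's distribution."""
    xa = mu_a + sigma * rng.normal(size=(k, mu_a.size))
    xb = mu_b + sigma * rng.normal(size=(k, mu_b.size))
    cos = np.sum(xa * xb, axis=1) / (
        np.linalg.norm(xa, axis=1) * np.linalg.norm(xb, axis=1))
    return (cos > tau).mean()

# Assumed toy parameters: dimension, noise level, and test threshold.
d, sigma, tau = 32, 0.8, 0.3
mu = rng.normal(size=d)
same = multi_test_similarity(mu, mu, sigma, k=200, tau=tau)
diff = multi_test_similarity(mu, rng.normal(size=d), sigma, k=200, tau=tau)
print(same, diff)
```

Averaging over k tests shrinks the variance of the score by roughly a factor of k relative to a single-sample test, which is the source of the robustness claimed above.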
In summary, this paper makes the following three major contributions: (1) To the best of our
knowledge, this is the first paper to address the inaccurate-edge-weight problem via statistical tests for the
task of graph structure learning over images. (2) A novel and robust similarity measure is proposed
based on multiple tests; it is theoretically proven to be superior and is designed in an elegant matrix form
to improve graph structure quality. (3) Extensive experiments further verify the robustness of the
graph structure learning method against noise. State-of-the-art (SOTA) performance is achieved on
multiple clustering and ReID benchmarks with the learned graph structures.
2 Methodology
To handle the noise in feature vectors, we model the feature vector of each node as a sample drawn
from the underlying distribution associated with that node, and the similarity score between two
nodes as the expectation of a statistical test. The test compares their similarity to a given threshold
to decide whether the two nodes share the same distribution, and that decision can be captured by a binary
Bernoulli random variable. The comparison in the statistical test may lead to error edges, i.e., noisy
edges (an edge connecting two nodes of different categories) or missing edges (two nodes of the same
category that should be connected but are not). The superiority of a similarity measure can be revealed
by comparing the probabilities of creating an error edge. The proposed similarity measure will be
introduced in Section 2.2, and its further design in matrix form will be presented in Section 2.3.
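The two error types can be estimated by Monte Carlo under a toy single-test model; the Gaussian node distributions, noise level, and threshold below are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy parameters: dimension, noise scale, threshold, and trial count.
d, sigma, tau, trials = 32, 0.8, 0.4, 20_000
mu0, mu1 = rng.normal(size=d), rng.normal(size=d)  # two category distributions

def pass_rate(mu_a, mu_b):
    """Fraction of single-sample threshold tests that declare an edge."""
    xa = mu_a + sigma * rng.normal(size=(trials, d))
    xb = mu_b + sigma * rng.normal(size=(trials, d))
    cos = np.sum(xa * xb, axis=1) / (
        np.linalg.norm(xa, axis=1) * np.linalg.norm(xb, axis=1))
    return (cos > tau).mean()

p_noisy = pass_rate(mu0, mu1)          # cross-category pair passes the test
p_missing = 1.0 - pass_rate(mu0, mu0)  # same-category pair fails the test
print(p_noisy, p_missing)
```

Comparing these two error probabilities across candidate similarity measures is exactly the comparison the section uses to establish superiority.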
2.1 Preliminary
Chernoff Bounds.
Sums of random variables have been studied for a long time [4, 11, 26]. Chernoff
Bounds [11] provide sharply decreasing bounds, exponential in form, on tail distributions
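As a quick numerical illustration of this exponential decay (a sketch with assumed n, p, and delta, not tied to the paper's analysis), the standard multiplicative Chernoff bound P(S >= (1 + delta) * mu) <= exp(-delta^2 * mu / 3), valid for 0 < delta <= 1, can be checked against simulation:

```python
import numpy as np

rng = np.random.default_rng(2)

# S is a sum of n i.i.d. Bernoulli(p) variables with mean mu = n * p.
n, p, delta = 200, 0.3, 0.5
mu = n * p

sums = (rng.random((100_000, n)) < p).sum(axis=1)   # 100k simulated sums
empirical = (sums >= (1 + delta) * mu).mean()       # empirical tail mass
bound = np.exp(-delta**2 * mu / 3)                  # Chernoff upper bound

print(f"empirical tail: {empirical:.2e}, Chernoff bound: {bound:.2e}")
```

The empirical tail probability sits well below the bound, and both shrink exponentially as mu grows, which is the property exploited in the analysis that follows.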