
There have been various methods proposed to solve UNOD
[9]. They can be roughly divided into two categories, namely,
non-deep and deep-learning-based methods. Non-deep meth-
ods usually leverage traditional machine learning methods
such as matrix factorization [10], density-based clustering
[11], and relational learning [12] to encode the graph informa-
tion and detect outliers. However, these methods fail to address
the computational challenge with high-dimension data [13].
With the rapid prevalence of Graph Neural Networks (GNNs)
[14], more and more methods are based on deep learning
[15] and GNNs nowadays [3]. For example, DOMINANT [4]
employs two GNN autoencoders to reconstruct the attribute
information and structure information. According to the results
reported in [9], most deep-learning-based methods have a
much better performance than non-deep methods in detecting
injected outliers. To unify the outlier injection process, [9]
adopts the outlier injection approach from [4] as the standard
injection approach.
Challenge. Although the recent deep-learning-based methods
have achieved an excellent performance in UNOD, we find
that the most widely-used outlier injection approach, which
is adopted by [4, 9, 16, 17, 18, 19, 20, 21, 22, 23, 24],
will cause a serious data leakage issue. Here, we refer to
the data leakage [25] in machine learning, which means the
information strongly associated with the labels is leaked to
the training dataset. After employing this approach to inject
outliers, structural outliers will have a larger node degree than
the average while attribute vectors of contextual outliers will
have a larger L2-norm (a.k.a. Euclidean norm) than expected.
As a result, a simple solution only utilizing node degree or
L2-norm of attribute vectors as the outlier score to detect the
corresponding type of outliers can acquire a quite satisfying
performance as shown in Fig 2. The metric of AUC [26] is
adopted here to measure the detection performance. Under
such a serious data leakage issue in injected datasets, most
existing algorithms cannot have a better performance than the
simple solution. In addition, it is observed that existing algo-
rithms have a performance drop in varied injection settings, in
which the data leakage issue caused by the current injection
approach is alleviated. Therefore, it is urgent to find out the
cause of data leakage and reduce its impact. On the other hand,
it is also necessary to exploit an effective UNOD algorithm,
which has a superior performance and is robust to the data
leakage issue. Moreover, the balance between structural and
contextual outliers detection performance is little considered
in existing works [9]. An algorithm with unbalanced detection
may only have detection ability for a certain type of outliers.
It is found that existing algorithms focus more on contextual
outliers than structural outliers when detecting them. To gain
more feasible algorithms, comprehensive metrics for balance
evaluation should be devised.
Our Solution. In this paper, we are devoted to analyzing the
cause of data leakage and devising a superior outlier detection
method for UNOD to achieve better performance in both the
current injection setting and varied injection settings.
Fig. 2. After injecting outliers in four datasets, node degree is employed
to detect structural outliers while L2-norm is employed to detect contextual
outliers. Both of them, compared to the random detector, achieve unexpectedly
high scores.
In particular, we propose a novel variance-based model
to detect structural outliers, which adopts the variance of
representations of neighbor nodes to detect structural outliers.
To the best of our knowledge, it is the first time to employ
the neighbor variance to detect outliers. Existing algorithms
are based on either reconstruction of adjacency matrix [4, 18]
or contrastive learning [16, 23] to detect structural outliers.
According to the definition, the essence of a structural outlier
is its inconsistent neighbors that come from different commu-
nities. However, existing algorithms cannot directly capture
this essence. In this case, we devise a deep graph model to
measure the inconsistency among neighbors by the variance
of latent representations of neighbors, which captures the
essence and gives a better utility for detection. On top of this,
a new framework, Variance-based Graph Outlier Detection
(VGOD), is also proposed to detect two types of outliers with
a variance-based model and an attribute reconstruction model.
To address the balance issue, we separately train two models
to avoid overtraining and normalize two types of outlier scores
to eliminate the scale difference. To evaluate the detection
balance on two outlier types, we introduce a new metric to
measure the gap in the performance score. The experiments are
conducted on 5 real-world datasets and the results demonstrate
that VGOD achieves the best detection performance.
Contribution. Our major contributions are as follows.
1) To the best of our knowledge, we are the first to identify
the data leakage issue in the most widely-used outlier
injection approach.
2) We analyze the cause of data leakage in depth, give
suggestions for the future design of the outlier injection
approach, and purpose a new approach for injecting
structural outliers.
3) We propose a novel variance-based model and a new
VGOD framework, which outperforms existing algo-
rithms in detecting outliers and alleviates the issue of
the balance of detection.
4) Extensive experiments are conducted to demonstrate that
our approach achieves the best detection performance
and the detection balance.