Unsupervised Graph Outlier Detection Problem Revisit New Insight and Superior Method Yihong Huangy Liping Wangy Fan Zhangz Xuemin Linx

2025-05-06 0 0 1.78MB 15 页 10玖币

侵权投诉

Unsupervised Graph Outlier Detection: Problem

Revisit, New Insight, and Superior Method

Yihong Huang†, Liping Wang†, Fan Zhang‡, Xuemin Lin§

†East China Normal University, ‡Guangzhou University, §Shanghai Jiao tong University

hyh957947142@gmail.com, lipingwang@sei.ecnu.edu.cn

zhangf@gzhu.edu.cn, lxue@cse.unsw.edu.au

Abstract—A large number of studies on Graph Outlier De-

tection (GOD) have emerged in recent years due to its wide

applications, in which Unsupervised Node Outlier Detection

(UNOD) on attributed networks is an important area. UNOD

focuses on detecting two kinds of typical outliers in graphs: the

structural outlier and the contextual outlier. Most existing works

conduct experiments based on datasets with injected outliers.

However, we ﬁnd that the most widely-used outlier injection

approach has a serious data leakage issue. By only utilizing

such data leakage, a simple approach can achieve state-of-the-art

performance in detecting outliers. In addition, we observe that

existing algorithms have a performance drop with the mitigated

data leakage issue. The other major issue is on balanced detection

performance between the two types of outliers, which has not

been considered by existing studies.

In this paper, we analyze the cause of the data leakage issue

in depth since the injection approach is a building block to

advance UNOD. Moreover, we devise a novel variance-based

model to detect structural outliers, which outperforms existing

algorithms signiﬁcantly and is more robust at kinds of injection

settings. On top of this, we propose a new framework, Variance-

based Graph Outlier Detection (VGOD), which combines our

variance-based model and attribute reconstruction model to

detect outliers in a balanced way. Finally, we conduct extensive

experiments to demonstrate the effectiveness and efﬁciency of

VGOD. The results on 5 real-world datasets validate that VGOD

achieves not only the best performance in detecting outliers but

also a balanced detection performance between structural and

contextual outliers.

Index Terms—Graph Outlier Detection; Graph Neural Net-

work; Unsupervised Graph learning; Attributed Networks

I. INTRODUCTION

Graph Outlier Detection [1] (GOD, a.k.a. graph anomaly

detection) is a fundamental graph mining task. It has various

applications in high-impact domains and complex systems,

such as ﬁnancial fraudster identiﬁcation [2]. The detection

objects of GOD can be classiﬁed into different levels like

node, edge, community, and graph [3]. For example, detecting

abnormal users in a social network is the node-level GOD

task while detecting abnormal molecules can be regarded as a

graph-level GOD task.

Due to the high cost or unavailability of manually la-

beling the ground truth outliers, a large number of existing

GOD approaches are carried out in an unsupervised manner

[4, 5], which aims to detect the instances that signiﬁcantly

deviate from the majority of instances in the graph [6].

Attributed networks (a.k.a. attributed graphs) are a powerful

data representation for many real-world complex systems (e.g.

a social network with user proﬁles), in which entities can

be represented as nodes with their attribute information; the

interaction or relationship between entities can be represented

as edges [7]. In recent years, the study of Unsupervised Node

Outlier Detection (UNOD) on attributed networks has been

blooming due to its wide applications [3, 8, 9]. Different

from traditional global outlier detection and time series outlier

detection, it deﬁnes two new typical types of outliers on

attributed networks, namely, structural outlier and contextual

outlier [4].

(a) Structural Outlier (b) Contextual Outlier

Normal Node Outlier Node Node Attributes

community A

community B

Fig. 1. An example of structural and contextual outliers in UNOD.

In Fig 1, there is a toy example for these two kinds

of basic outliers. Particularly, structural outliers are those

nodes structurally connected to different communities, i.e.,

their structural neighborhood is inconsistent. In other words,

a structural outlier has normal attributes while it may have

several abnormal links. For example, those people from dif-

ferent communities but have a strong connection with each

other can be regarded as structural outliers. As shown in Fig

1(a), there are two communities outlined with orange circles

and structural outliers are those nodes with abnormal links

to other communities. On the other hand, a contextual outlier

has a consistent neighborhood structure while its attributes are

corrupted, noisy, or signiﬁcantly different from its neighbor

nodes’ attributes. For example, in Fig 1(b), suppose that the

node in red is a football player while the nodes in green

are music teachers. In this case, the node in red is regarded

as a contextual outlier since it has a vast difference from

its neighbors. In the real world, datasets are much more

complicated than this toy example and it is difﬁcult to measure

the degree of inconsistency among nodes.

arXiv:2210.12941v3 [cs.LG] 9 Jan 2023

There have been various methods proposed to solve UNOD

[9]. They can be roughly divided into two categories, namely,

non-deep and deep-learning-based methods. Non-deep meth-

ods usually leverage traditional machine learning methods

such as matrix factorization [10], density-based clustering

[11], and relational learning [12] to encode the graph informa-

tion and detect outliers. However, these methods fail to address

the computational challenge with high-dimension data [13].

With the rapid prevalence of Graph Neural Networks (GNNs)

[14], more and more methods are based on deep learning

[15] and GNNs nowadays [3]. For example, DOMINANT [4]

employs two GNN autoencoders to reconstruct the attribute

information and structure information. According to the results

reported in [9], most deep-learning-based methods have a

much better performance than non-deep methods in detecting

injected outliers. To unify the outlier injection process, [9]

adopts the outlier injection approach from [4] as the standard

injection approach.

Challenge. Although the recent deep-learning-based methods

have achieved an excellent performance in UNOD, we ﬁnd

that the most widely-used outlier injection approach, which

is adopted by [4, 9, 16, 17, 18, 19, 20, 21, 22, 23, 24],

will cause a serious data leakage issue. Here, we refer to

the data leakage [25] in machine learning, which means the

information strongly associated with the labels is leaked to

the training dataset. After employing this approach to inject

outliers, structural outliers will have a larger node degree than

the average while attribute vectors of contextual outliers will

have a larger L2-norm (a.k.a. Euclidean norm) than expected.

As a result, a simple solution only utilizing node degree or

L2-norm of attribute vectors as the outlier score to detect the

corresponding type of outliers can acquire a quite satisfying

performance as shown in Fig 2. The metric of AUC [26] is

adopted here to measure the detection performance. Under

such a serious data leakage issue in injected datasets, most

existing algorithms cannot have a better performance than the

simple solution. In addition, it is observed that existing algo-

rithms have a performance drop in varied injection settings, in

which the data leakage issue caused by the current injection

approach is alleviated. Therefore, it is urgent to ﬁnd out the

cause of data leakage and reduce its impact. On the other hand,

it is also necessary to exploit an effective UNOD algorithm,

which has a superior performance and is robust to the data

leakage issue. Moreover, the balance between structural and

contextual outliers detection performance is little considered

in existing works [9]. An algorithm with unbalanced detection

may only have detection ability for a certain type of outliers.

It is found that existing algorithms focus more on contextual

outliers than structural outliers when detecting them. To gain

more feasible algorithms, comprehensive metrics for balance

evaluation should be devised.

Our Solution. In this paper, we are devoted to analyzing the

cause of data leakage and devising a superior outlier detection

method for UNOD to achieve better performance in both the

current injection setting and varied injection settings.

Fig. 2. After injecting outliers in four datasets, node degree is employed

to detect structural outliers while L2-norm is employed to detect contextual

outliers. Both of them, compared to the random detector, achieve unexpectedly

high scores.

In particular, we propose a novel variance-based model

to detect structural outliers, which adopts the variance of

representations of neighbor nodes to detect structural outliers.

To the best of our knowledge, it is the ﬁrst time to employ

the neighbor variance to detect outliers. Existing algorithms

are based on either reconstruction of adjacency matrix [4, 18]

or contrastive learning [16, 23] to detect structural outliers.

According to the deﬁnition, the essence of a structural outlier

is its inconsistent neighbors that come from different commu-

nities. However, existing algorithms cannot directly capture

this essence. In this case, we devise a deep graph model to

measure the inconsistency among neighbors by the variance

of latent representations of neighbors, which captures the

essence and gives a better utility for detection. On top of this,

a new framework, Variance-based Graph Outlier Detection

(VGOD), is also proposed to detect two types of outliers with

a variance-based model and an attribute reconstruction model.

To address the balance issue, we separately train two models

to avoid overtraining and normalize two types of outlier scores

to eliminate the scale difference. To evaluate the detection

balance on two outlier types, we introduce a new metric to

measure the gap in the performance score. The experiments are

conducted on 5 real-world datasets and the results demonstrate

that VGOD achieves the best detection performance.

Contribution. Our major contributions are as follows.

1) To the best of our knowledge, we are the ﬁrst to identify

the data leakage issue in the most widely-used outlier

injection approach.

2) We analyze the cause of data leakage in depth, give

suggestions for the future design of the outlier injection

approach, and purpose a new approach for injecting

structural outliers.

3) We propose a novel variance-based model and a new

VGOD framework, which outperforms existing algo-

rithms in detecting outliers and alleviates the issue of

the balance of detection.

4) Extensive experiments are conducted to demonstrate that

our approach achieves the best detection performance

and the detection balance.

II. RELATED WORK

A. Graph Neural Network

GNNs [14] are a group of neural network models which

utilize the graph structure for network representation learning

and various tasks. Among GNNs, GCN [27] is one of the most

inﬂuential models, which extends the convolutional operation

in sequence or grid data to graph-structured data. Furthermore,

to aggregate messages from neighbors more ﬂexibly, GAT [28]

introduces an attention mechanism to learn the importance

of each neighbor node. On the other hand, GraphSage [29]

adopts the sampling-based method to aggregate the neighbor

information to work in large-scale graphs. From a topological

learning perspective, GIN [30] is a more expressive model

than GCN and can achieve the same discriminative power

as the 1-WL graph isomorphism test [31]. In our proposed

framework, GNN plays a vital role in the network embedding

representation of nodes. Generally, the GNN module in our

framework can be set to any type of the above-mentioned

GNNs.

B. Unsupervised Node Outlier Detection on Attributed Net-

works

UNOD on attributed networks has attracted considerable

research interest in recent years due to its wide application in

complex systems. Radar [32] utilizes the residuals of attribute

information and its coherence graph structure for outlier de-

tection. ANOMALOUS [33] conducts attribute selection and

outlier detection jointly based on CUR decomposition and

residual analysis. However, these methods have computation

limitations in high-dimension attributes due to their shallow

mechanisms.

Quite a few studies based on the deep-learning technique

have emerged recently [3]. Dominant [4] builds deep autoen-

coders on top of GCN layers to reconstruct the adjacency

and attribute matrices. AnomalyDAE [18] employs dual au-

toencoder architecture with cross-modality interactions and

the attention mechanism to reconstruct the adjacency and

attribute matrices. CoLA [16] and SL-GAD [23] perform

the UNOD task via contrastive self-supervised learning and

random walk to embed nodes. AEGIS [17] studies UNOD in

the inductive setting by utilizing generative adversarial ideas to

generate potential outliers. DONE [34] employs deep unsuper-

vised autoencoders to generate the network embedding which

eliminates the effects of outliers at the same time. CONAD

[19] adopts four data augmentation strategies and contrastive

learning for outlier detection. GUIDE [21] replaces adjacency

reconstruction with higher-order structure reconstruction to

detect structural outliers. Under the manner of outlier injection,

all these above deep methods show superior performance than

non-deep methods in detecting these two types of outliers. To

evaluate UNOD algorithms, [9] adopts the most widely-used

outlier injection approach from [4] as the standard injection

method and provides uniﬁed benchmarks for UNOD, which

facilitates fairness for comparing different methods.

Current UNOD methods have achieved an excellent per-

formance. However, as demonstrated in Fig 2, the widely-

used outlier injection approach exists a data leakage issue.

To our surprise, simply using the combination of L2-norm

and node degree to detect outliers can achieve state-of-the-

art performance. Therefore, our work focuses on analyzing

the cause of the data leakage issue and designing a superior

method. In addition, as mentioned in [9] that no current

method has a balanced detection performance on two outlier

types, we also consider the balance issue in our method.

III. PRELIMINARY

In this section, we formally present some concepts which

are used throughout this paper and deﬁne the problem. We

use lowercase letters (e.g. a), bold lowercase letters (e.g. x),

uppercase letters (e.g. X), and calligraphic fonts (e.g. V) to

denote scalars, vectors, matrices, and sets, respectively.

A. Graph Neural Network

GNNs stack L layers of message-passing layers. Each layer

performs a message passing through the given graph structure.

After the initial node feature h0∈Rd0is transformed by L

layers, the vector representation hL∈RdLis learned for each

node v. Most message-passing layers can be expressed using

the following rule:

h(l)

v=σ(Ψ(l)(AGG({Φ(l)(h(l−1)

u), u ∈ Nv∪ {v}}))) (1)

where σ(·)is the active function, Ψ(l)(·)and Φ(l)(·)de-

note differentiable functions such as Multi-Layer Perceptrons

(MLP). AGG(·)denotes a differentiable, permutation invariant

function (e.g. sum, mean, max) and Nvdenotes node v’s

direct linked neighbors.

Here, we introduce three commonly used GNNs, namely

GCN, GAT, and GIN.

Graph Convolution Network (GCN) [27] is the most

widely-used GNN module, which adopts the propagation rule:

H(l+1) =σ(ˆ

AH(l)W(l))(2)

where ˆ

Ais the symmetric normalized adjacency matrix, H(l)

is the lth hidden layer node representation, and W(l)is the

parameters in the lth hidden layer.

Graph Attention Network (GAT) [28] ﬂexibly aggregates

messages from neighbors with calculated weight αij (vs.

average weight adopted by GCN) of each edge hi, jias

αij =exp(LeakyReLU (aT[Whi||Whj]))

Pk∈Niexp(LeakyReLU (aT[Whi||Whk])) (3)

where aand Ware the learnable weights. Layer mask (l)is

omitted for simplicity.

Graph Isomorphism Network (GIN) [30] is the expres-

sively more powerful GNN model, which follows the rule to

propagate messages as

H(l)=σ(Ψ(l)(A+ (1 + )·I)H(l−1))(4)

where can be a ﬁxed or learnable scalar parameter, and Iand

Ais the identity matrix and adjacency matrix, respectively.

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

UnsupervisedGraphOutlierDetection:ProblemRevisit,NewInsight,andSuperiorMethodYihongHuangy,LipingWangy,FanZhangz,XueminLinxyEastChinaNormalUniversity,zGuangzhouUniversity,xShanghaiJiaotongUniversityhyh957947142@gmail.com,lipingwang@sei.ecnu.edu.cnzhangf@gzhu.edu.cn,lxue@cse.unsw.edu.auAbstractAlarge...

展开>> 收起<<

Unsupervised Graph Outlier Detection Problem Revisit New Insight and Superior Method Yihong Huangy Liping Wangy Fan Zhangz Xuemin Linx.pdf

共15页,预览3页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

Unsupervised Graph Outlier Detection Problem Revisit New Insight and Superior Method Yihong Huangy Liping Wangy Fan Zhangz Xuemin Linx

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: