Path Association Rule Mining

2025-05-02 0 0 9.1MB 11 页 10玖币
侵权投诉
Path Association Rule Mining
Yuya Sasaki
sasaki@ist.osaka-u.ac.jp
Osaka University
ABSTRACT
Graph association rule mining is a data mining technique used for
discovering regularities in graph data. In this study, we propose
a novel concept, path association rule mining, to discover the cor-
relations of path patterns that frequently appear in a given graph.
Reachability path patterns (i.e., existence of paths from a vertex
to another vertex) are applied in our concept to discover diverse
regularities. We show that the problem is NP-hard, and we develop
an ecient algorithm in which the anti-monotonic property is used
on path patterns. Subsequently, we develop approximation and
parallelization techniques to eciently and scalably discover rules.
We use real-life graphs to experimentally verify the eectiveness
of path association rules and the eciency of our algorithm.
1 INTRODUCTION
Association rule mining is data mining technique used for dis-
covering regularities between itemsets in large databases [
1
,
49
].
Association rules are represented as
𝑋𝑌
, where
𝑋
and
𝑌
are
the disjoint and called antecedent and consequent, respectively.
Association rules have attracted research attention because of their
usefulness in many applications such as market analysis [
27
], web
mining [29], and bioinformatics [32].
Regularities between entities in large-scale graphs are a critical
topic of research. Graph association rule mining on a single large
graph, which is an extended form of general association rule mining,
exhibits considerable potential for studying regularities [
14
,
18
,
19
,
45
]. Graph association rules are represented as
𝐺𝑋𝐺𝑌
,
where
𝐺𝑋
and
𝐺𝑌
are graph patterns. Since graphs are widely
used in many applications, association rules on graphs have a large
opportunity to discovery valuable insights and knowledge.
Applications on graph association rule mining.
Graph associ-
ation rule mining exhibits numerous application scenarios.
Social analysis
: Social analysis is critical for understanding and im-
proving our life. Social relationships can be modeled using graphs,
and graph association rule mining can be used to discover regular-
ities between the status of people and their relationship patterns.
For example, social relationships aect to our health [
25
] and hap-
piness [
23
]. We can nd interesting rules, for example, “if people
are happy, they are healthy and have friends.
Recommendation
: In current recommendation systems, user be-
havior, such as viewing and clicking [
11
,
31
], toward particular
products are tracked. Bipartite graphs are used to model user be-
havior to products. Graph association rule mining can be used nd
potential customers by nding regularities among user behavior,
products, and proles.
Social discrimination checking
: Social discrimination in graphs should
be solved for building fair machine learning models [
10
]. Machine
learning models for graph data can be suered from such discrimi-
natory bias as same as transactional data. Since graph association
rules can discover regularities that include discriminatory bias, they
can be used to remove discriminatory bias in the graph data.
Motivation and challenge.
Graph association rule mining is fun-
damental for graph analysis, but only a few techniques have been
developed [
14
,
18
,
19
,
45
]. Existing methods cannot be applied to
the aforementioned applications due to their own semantics (see
Sec. 7 in detail).
Existing works have four obstacles to apply the aforementioned
applications. First, in all existing works, a set of vertices in conse-
quent (i.e.,
𝐺𝑌
) must be a subset of those in the antecedent (i.e.,
𝐺𝑋
).
Thus, they cannot discover association rules such that
𝐺𝑋
and
𝐺𝑌
involve distinct vertices. Second, since
𝐺𝑋
and
𝐺𝑌
are restrictive
(e.g., a single edge or no common edges in
𝐺𝑋
and
𝐺𝑌
), they cannot
nd the dierence of vertex attributes. Third, each rule and discov-
ery algorithm assumes a single graph type, for example, graphs
with labeled vertices and unlabeled edges, and thus does not handle
more generalized graphs than that it assumes (e.g., if discovery
algorithms assume graphs with unlabeled edges, it cannot handle
ones with labeled edges). Fourth, their antecedent and consequent
are basic graph pattern, so it does not capture the connectivity
between vertices.
Therefore, we need a novel graph association rule mining tech-
nique. We introduce three challenges that have not been addressed
in existing studies. First, simple association rules are required to
exibly capture regularities that can represent (1) no restriction of
vertices in antecedent and consequent and (2) connectivity between
vertices. Second, an ecient and scalable mining algorithm is re-
quired to accurately nd all frequent patterns in general property
graphs that include vertex attributes and edge labels. Finally, such
rules should be eective and provide insight and knowledge, and
oer conventional metrics such as support, condence, and lift.
Contribution.
We propose a novel and simple concept, path as-
sociation rule mining, and develop an ecient and scalable algo-
rithm for discovering path association rules. Path association rule
mining aims to discover regularities between path patterns; path
patterns are sequences of vertex attribute sets and edge labels.
Reachablity patterns can be applied as path patterns that repre-
sent label-constrained reachability between two vertices in directed
edges. This mining nds rules
𝑝𝑋𝑝𝑌
such that more than
𝜃
vertices become sources of path patterns
𝑝𝑋
and
𝑝𝑌
, where
𝜃
is
a given minimum support. This rule can naturally oer standard
metrics, such as absolute/relative support, condence, and lift.
Example 1.1. Figure 1 illustrates an example of path associa-
tion rule mining. A given graph includes 12 vertices, 15 edges, 4
types of edge labels, and 8 types of vertex attributes. From this
graph, we mine path association rules with
𝜃
is one. Path pattern
⟨{Male},Follows,{Art}
is frequent because vertices
𝑣8
and
𝑣9
be-
come the sources of the path patterns (we say that
𝑣8
and
𝑣9
match
with the path pattern). Similarly, path pattern
⟨{CS},BelongTo,{Uni}
arXiv:2210.13136v1 [cs.DB] 24 Oct 2022
Yuya Sasaki
Follows
Art
BelongTo LocatedIn Likes
CS Chem
CS
Art
Art
CS
Length 1 simple path
Museum Uni City Male Female
CS CS
Length 2 simple path Kleene path
Art CS
Art
Rule
Art
CS
Chem
CS
CS CS
Art
CS CS CS
Art
CS CS CS
CS
Art
CS
CS
Art
CS
CS
Art
p1 p2 p3 p4 p5 p6 p7p8 p9
e1e2
e5e6
e3e4
e7
e9
e8
e12
e11
e10
e13 e14 e15
p10 p11 p12 p13
v1v2
v3v4
v9
v10
v5v6
v7v8v11
v12
r1 r2 r3 r4 r5 r6
Figure 1: Path association rules
matches
𝑣8
and
𝑣9
.
⟨{CS},Follows,{Art}⟩ ⇒ ⟨{Male},BelongTo,
{Uni}is discovered as a path association rule.
The path association rule problem is NP-hard. Therefore, we
develop an algorithm to eciently nd the set of frequent path
patterns and rules. Our algorithm employs an Apriori approach
using anti-monotonic property that path association rules hold to
prune the candidate of frequent path patterns and rules. We can
prune path patterns in two ways, namely vertically, which stops
extending the length of path patterns, and horizontally, which stops
extending the number of attributes in path patterns. Our algorithm
can exactly compute support, condence, and other measures of
rules. Furthermore, we develop parallelization and approximation
techniques to scalably accelerate nding association rules. In our
parallelization technique, we estimate costs of vertices to compute
path patterns, and assign the set of vertices into threads according
to the costs. We reduce the computation costs in two approximate
methods; reducing candidates of sux of path patterns and sam-
pling vertices to be computed.
The contributions of the study are summarized as follows:
Novel Concept
: We propose path association rule mining
to capture regularities between path patterns. We can apply
reachability path patterns, which diers from existing graph
association rule mining.
Ecient and Scalable Algorithm
: We develop an ecient
and scalable algorithm that parallely nds the rules with
pruning infrequent path patterns. Our algorithm can apply
approximation techniques for more eciency.
Discoveries in Real-World Graphs
: We demonstrate that
path association rule mining is eective in knowledge ex-
traction and bias checking.
Reproducibility
. The code and datasets used in this study are
available at github. Since codes on the existing works [
14
,
16
,
18
,
19
,
45
]
are not open publicly, our code is the rst open software on graph
association rule mining on a single large graph. Our supplementary
le in the github includes proofs of our theorems and lemmas, and
additional related studies and experimental results.
2 OUR CONCEPT
We propose a novel concept, path association rule mining to ef-
fectively determine the regulations between attributes of vertices
connected by edge labels. Its remarkable feature compared with
existing graph association rules is that it captures correlations of
distinct path patterns from the same vertices, which are often useful
in many applications. In addition, the path association rule mining
discovers rules on general property graphs, which widely cover
many graph types (e.g., labeled graphs).
2.1 Notations
We consider graph
G=(V,E,L,A)
where
V
is a nite set of
vertices,
E V × L × V
is the set of edges with label
∈ L
,
L
is
a nite set of edge labels, and
A
is a nite set of attributes. Edge
𝑒∈ E
is represented by a triple
(𝑣, ℓ𝑒, 𝑣)
where an edge from vertex
𝑣
to vertex
𝑣
with edge label
𝑒
. Attribute
𝑎∈ A
is a categorical
value that represents a feature of vertex. We dene
𝐴⊆ A
as a set
of attributes and each 𝑣∈ V has a set of attributes 𝐴(𝑣).
Path is a sequence of vertices and edges
𝑣0, 𝑒0, 𝑣1, . . . , 𝑒𝑛1, 𝑣𝑛
where
𝑛
is a length of the path. We dene
𝑣0
and
𝑣𝑛
as the source
and target of path, respectively.
Example 2.1. In Figure 1,
L={Follows,BelongTo,LocatedIn,
Likes}
and
A={Museum,Uni,City,Male,Female,CS,Chem,Art}
.
𝑣8, 𝑒9, 𝑣5, 𝑒3, 𝑣2
is a two length path where
𝑣8
and
𝑣2
are the source
and target of the path, respectively.
2.2 Path Association Rule
We here dene path association rules that we propose in this paper,
after dening path patterns.
Path pattern:
We dene two path patterns, namely simple and
reachability path patterns.
Simple path pattern is a sequence of attribute sets and edge
labels
𝐴0, ℓ0, 𝐴1, . . . , 𝑛1, 𝐴𝑛
where
𝐴𝑖⊆ A
and
𝑖∈ L
.
Given vertex
𝑣
and simple path pattern
𝑝
,
𝑣
matches
𝑝
if
𝑣
is
a source of a path that satises 𝐴𝑖𝐴(𝑣𝑖)and 𝑖=𝑒𝑖for all 𝑖.
Reachability path pattern is a pair of attribute sets and edge
label
𝐴0, ℓ, 𝐴1
, where
𝐴0, 𝐴1⊆ A
and
∈ L
. Given
Path Association Rule Mining
vertex
𝑣
and reachability path pattern
𝑝
,
𝑣
matches
𝑝
if
𝑣
is a
source of a path that satises
𝐴0𝐴(𝑣0)
,
𝐴1𝐴(𝑣𝑛)
, and
=𝑒𝑖for all 𝑖.
We denote by
V (𝑝)
the set of all matched vertices of path pattern
𝑝
and by
|V (𝑝)|
the number of matched vertices of
𝑝
. We call a
path pattern whose all attributes sets include a single attribute unit
path pattern.
Denition 2.2 (Dominance). Given two path patterns
𝑝=𝐴0, ℓ0, 𝐴1,
. . . , 𝑛1, 𝐴𝑛
and
𝑝=𝐴
0, ℓ
0, 𝐴
1, . . . , ℓ
𝑚1, 𝐴
𝑚
,
𝑝
dominates
𝑝
if
𝑚𝑛
, and
𝑖=𝑖
and
𝐴
𝑖𝐴𝑖
for all
𝑖=
0to
𝑚
. We denote
𝑝𝑝
when 𝑝is dominated by 𝑝.
Intuitively, dominating path patterns are more complex than
their dominated path patterns.
Example 2.3. In Figure 1,
𝑣𝐻
matches simple path patterns
⟨{CS,
Male},Follows,{CS,female}⟩
and reachability path patterns
⟨{CS,Male},
Follows,{CS,Male}⟩
.
V ({Male},Follows,{Male}⟩)
is
{𝑣8, 𝑣9, 𝑣12}
.
Path association rule:
We dene path association rules as fol-
lows.
Denition 2.4. A path association rule
𝑟
is dened by
𝑝𝑋𝑝𝑌
,
where
𝑝𝑋
and
𝑝𝑌
are path patterns. We refer to
𝑝𝑋
and
𝑝𝑌
as the
antecedent and consequent of rule, respectively.
The path association rule evaluates frequencies and condential
probabilities of vertices that match both
𝑝𝑋
and
𝑝𝑌
. We apply
homorphily semantics on path association rules. Thus, we allow that
a single path is shared by both 𝑝𝑋and 𝑝𝑌.
Example 2.5. Given a rule
⟨{CS},Follows,{Art}⟩ ⇒ ⟨{CS},
BelongTo,{Uni}⟩
in Figure 1,
𝑣8
and
𝑣9
are matched with this rule.
2.3 Metrics of association rules
The path association rule oers general metrics of association rules.
We can naturally dene support, condence, and lift for path asso-
ciation rules.
Support: The support of path association rules indicates how many
vertices are applied into sources of paths. We dene two support
metrics of absolute and relative supports. Most of graph association
rule mining does not oer relative support (see Sec 7) because it is
hard to compute the possible maximum number of matched graph
patterns. The absolute support is dened as follows:
ASupp(𝑝𝑋𝑝𝑌)=|V (𝑝𝑋) V (𝑝𝑌)|.
Since the maximum value of
|V (𝑝)|
is the number of vertices,
the relative support is dened as follows:
RSupp(𝑝𝑋𝑝𝑌)=|V (𝑝𝑋) V (𝑝𝑌)|
|V| .
Condence: The condence of path association rules indicates
the probability that vertices satises
𝑝𝑋
if they satisfy
𝑝𝑌
. The
condence is dened as follows:
Conf (𝑝𝑋𝑝𝑌)=|V (𝑝𝑋) V (𝑝𝑌)|
|V (𝑝𝑋)| .
Lift: Lift is not oered by most of graph association rule mining as
well as the relative support. We can dene lift as follows:
Li(𝑝𝑋𝑝𝑌)=|V (𝑝𝑋) V (𝑝𝑌)| · |V|
|V (𝑝𝑋)| · |V(𝑝𝑌)| .
Others: Similarly, we can dene other measures.
Example 2.6. We explain the metrics by using Figure 1.
ASupp(r1)
,
RSupp(r1)
,
Conf (r1)
, and
Li(r1)
are 2, 0.16, 1.0, and 6, respectively.
2.4 Problem Denition
We dene a problem that we solve in this paper.
Denition 2.7 (Path association rule mining problem). Given graph
G
, minimum (absolute or relative) support
𝜃
, maximum path length
𝑘
, the path association rule mining problem aims to nd all rules
with their metrics, where
𝑟
satises (1)
supp(𝑟)
is larger than
𝜃
, (2)
lengths of paths are at most
𝑘
, and (3)
𝑝𝑋
is not dominated
𝑝𝑌
and
vice versa. We say that 𝑝is frequent if |V (𝑝) | >𝜃.
Theorem 2.8. Path association rule mining problem is NP-hard.
Proof sketch: We show that the problem is NP-hard through reduc-
tion from as frequent subsequence enumeration problem, which
is NP-hard [
47
]. Sequences and subsequences are sub classes of
graphs and path patterns, respectively. Thus, the path association
rule mining problem is NP-hard.
Remarks1:
The path association rule is general compared with
conventional association rule mining for itemsets. Each vertex and
its attributes can be considered as a transaction and items of the
transaction, respectively. If no edges exist in graphs, path associ-
ation rule mining is equivalent to conventional association rule
mining for itemsets.
Remarks2:
In recent real-life query logs for Wikidata [
44
], more
than 90 % of queries are path patterns [
4
,
5
]. From this fact, graph
analysis often utilizes path patterns instead of graph patterns to un-
derstand real-world relationships. Therefore, path association rule
mining can be useful in real-world analysis on many applications.
3 OUR ALGORITHM
We present an ecient algorithm for path association rule mining.
3.1 Overview
For solving the path association rule mining problem, we need to
enumerate frequent path patterns and vertices that match the path
patterns. To reduce the candidates of path patterns, our algorithm
employs an Apriori strategy which initially nds non-complex
frequent path patterns and generates candidates of frequent com-
plex path patterns form the found path patterns, based on anti-
monotonic properties.
Anti-monotonic properties.
Anti-monotonic properties are lever-
aged to reduce candidates of path patterns eectively. Path patterns
naturally satisfy anti-monotonic properties as follows:
Theorem 3.1. If 𝑝𝑝,V (𝑝) V (𝑝).
Proof: When
𝑝
is dominated by
𝑝
,
𝑝
is not longer than
𝑝
, all
edge labels on
𝑝
and
𝑝
are the same order, and attribute sets on
𝑝
are subset or equal to those on
𝑝
. All vertices that match
𝑝
are matched with path patterns whose attributes are subset of
𝑝
.
Therefore, if path pattern 𝑝𝑝,V (𝑝) V (𝑝).
摘要:

PathAssociationRuleMiningYuyaSasakisasaki@ist.osaka-u.ac.jpOsakaUniversityABSTRACTGraphassociationruleminingisadataminingtechniqueusedfordiscoveringregularitiesingraphdata.Inthisstudy,weproposeanovelconcept,pathassociationrulemining,todiscoverthecor-relationsofpathpatternsthatfrequentlyappearinagive...

展开>> 收起<<
Path Association Rule Mining.pdf

共11页,预览3页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:11 页 大小:9.1MB 格式:PDF 时间:2025-05-02

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 11
客服
关注