HYPERGRAPH ARTIFICIAL BENCHMARK FOR COMMUNITY DETECTION HABCD A P REPRINT

2025-05-08 0 0 1.04MB 23 页 10玖币
侵权投诉
HYPERGRAPH ARTIFICIAL BENCHMARK FOR COMMUNITY
DETECTION (H–ABCD)
A PREPRINT
Bogumił Kami´
nskiPaweł PrałatFrançois Théberge
June 14, 2023
ABSTRACT
The Artificial Benchmark for Community Detection (ABCD) graph is a recently introduced random
graph model with community structure and power-law distribution for both degrees and community
sizes. The model generates graphs with similar properties as the well-known LFR one, and its main
parameter
ξ
can be tuned to mimic its counterpart in the LFR model, the mixing parameter
µ
. In this
paper, we introduce hypergraph counterpart of the ABCD model, h–ABCD, which also produces
random hypergraph with distributions of ground-truth community sizes and degrees following power-
law. As in the original ABCD, the new model h–ABCD can produce hypergraphs with various levels
of noise. More importantly, the model is flexible and can mimic any desired level of homogeneity of
hyperedges that fall into one community. As a result, it can be used as a suitable, synthetic playground
for analyzing and tuning hypergraph community detection algorithms.
1 Introduction
Many networks that are currently modelled as graphs would be more accurately modelled as hypergraphs. This includes
the collaboration network in which nodes correspond to researchers and hyperedges correspond to papers that consist
of nodes associated with researchers that co-authorship a given paper. After many years of intense research using
graph theory in modelling and mining complex networks [
51
,
34
,
27
,
40
], hypergraphs start gaining considerable
traction [
11
,
8
,
6
,
9
]. Standard but important questions in network science are revisited in the context of hypergraphs.
This includes various aspects related to community detection in networks [
37
,
38
,
45
,
44
,
19
,
63
,
62
,
10
,
16
,
2
] which
we concentrate on in this paper. However, hypergraphs also create brand new questions which did not have their
counterparts for graphs. For example, how hyperedges overlap in empirical hypergraphs [
50
]? Or how the existing
patterns in a hypergraph affect the formation of new hyperedges [35]?
Despite the fact that currently there is a vivid discussion around hypergraphs, the theory and tools are still not sufficiently
developed to allow most problems, including clustering, to be tackled directly within this context. Indeed, researchers
and practitioners often create the 2-section graph of a hypergraph of interest (that is, replace each hyperedge with a
clique) and apply classical tools designed for graphs. After moving to the 2-section graph, one clearly loses some
information about hyperedges of size greater than two and so there is a common belief that one can do better by using
the knowledge of the original hypergraph.
As mentioned earlier, there are some recent attempts to directly deal with hypergraphs in the context of clustering.
In [
64
], methods from spectral clustering are generalized to hypergraphs; [
22
] proposes a framework to infer missing
hyperedges and detect overlapping communities, while in [
17
], extensions of non-backtracking spectral clustering are
proposed in the context of hypergraphs. In Kumar et al. [
45
,
44
], the authors reduce the problem to graphs but use
original hypergraphs to iteratively adjust weights to encourage some hyperedges to be included in some cluster but
Decision Analysis and Support Unit, SGH Warsaw School of Economics, Warsaw, Poland; e-mail:
bogumil.kaminski@sgh.waw.pl
Department of Mathematics, Toronto Metropolitan University, Toronto, ON, Canada; e-mail:
pralat@ryerson.ca
. Part of
this work was done while the author was visiting the Simons Institute for the Theory of Computing.
Tutte Institute for Mathematics and Computing, Ottawa, ON, Canada; email: theberge@ieee.org
arXiv:2210.15009v3 [cs.SI] 12 Jun 2023
discourage other ones (this process can be viewed as separating signal from noise). Moreover, in [
37
,
38
] a number of
extensions of the classic null model for graphs are proposed that can potentially be used by hypergraph algorithms.
Unfortunately, there are many ways such extensions can be done depending on how often nodes in one community share
hyperedges with nodes from other communities. This is something that varies between networks at hand and usually
depends on the hyperedge sizes. Let us come back to the collaboration network we discussed earlier. Hyperedges
associated with papers written by mathematicians might be more homogeneous and smaller in comparison with those
written by medical doctors who tend to work in large and multidisciplinary teams. Moreover, in general, papers with a
large number of co-authors tend to be less homogeneous, and other patterns can be identified [
35
]. A good clustering
algorithm should be able to automatically decide which extension should be used. However, in order to be able to design
and properly tune such algorithms, one needs to have a synthetic random hypergraph model that is able to simulate
various scenarios.
The hypergraph model, h–ABCD, is our response to this need from both practitioners and academia. This random
graph model, similarly to the original ABCD model, produces hypergraphs with community structure and power-law
distribution for both degrees and community sizes. Indeed, as with graphs, power-law distribution is a common
feature present in many real-world hypergraphs; for example, [
25
] shows power-law distributions for several real-world
hypergraph decompositions, and we have checked that in the presented cases the hypergraph degree distributions also
follow power-law distributions.
The paper is structured as follows. In Section 2, we briefly discuss the history of ABCD, the “parent” of the h–ABCD
model introduced in this paper, before summarizing other synthetic hypergraphs. We focus on models important in the
context of community detection and explain the differences between the existing models and our model. The h–ABCD
model is defined in Section 3. Experiments highlighting important properties of the model are discussed in Section 4.
In Section 5, we conclude the paper with a few future research directions.
2 Existing Models
2.1 ABCD graph Models
There are very few datasets with ground-truth identified and labelled. As a result, there is need for synthetic random
graph models with community structure that resemble real-world networks in order to benchmark and tune clustering
algorithms that are unsupervised by nature. The LFR (Lancichinetti, Fortunato, Radicchi) model [
48
,
46
] generates
networks with communities and at the same time it allows for the heterogeneity in the distributions of both node degrees
and of community sizes. It became a standard and extensively used method for generating artificial networks.
The Artificial Benchmark for Community Detection (ABCD) [
39
] was recently introduced and implemented
§
, including
a fast implementation
that uses multiple threads (ABCDe) [
42
]. Undirected variant of LFR and ABCD produce graphs
with comparable properties but ABCD/ABCDe is faster than LFR and can be easily tuned to allow the user to make
a smooth transition between the two extremes: pure (disjoint) communities and random graph with no community
structure. Moreover, it is easier to analyze theoretically—for example, in [
36
] various theoretical asymptotic properties
of the ABCD model are investigated including the modularity function that, despite some known issues such as the
“resolution limit” reported in [
28
], is an important graph property of networks in the context of community detection.
Finally, the building blocks in the model are flexible and may be adjusted to satisfy different needs. For example, the
original ABCD model was recently adjusted to include potential outliers in [
41
] resulting in ABCD+o model. Adjusting
the model to hypergraphs is much more complex but doable.
2.2 Other Hypergraph Models
The classical configuration model, which was first introduced by Bollobás [
13
], is a standard model producing graphs
with a given degree sequence. It was generalized to higher order structures many times [
18
,
26
]. One early example
is the folksonomy, a tripartite structure of users, resources, and tags, that was modelled as hypergraphs generated
via configuration-type model in [
29
]. The variant of the configuration model in [
23
,
24
] generalizes it even further,
namely, to simplicial complexes. Simplicial complexes are attractive tools when studying topological aspects of discrete
data [
15
], but subset inclusion is a strong property, not suitable from our application point of view, namely, community
detection. In general, configuration models do not pay attention to labels of nodes and so cannot produce graphs with
community structure. Having said that, they can be used as an ingredient of models which produce networks with
§https://github.com/bkamins/ABCDGraphGenerator.jl/
https://github.com/tolcz/ABCDeGraphGenerator.jl/
2
communities. The configuration model was used in the original ABCD model and we will use its generalization again
for h–ABCD.
The Chung-Lu model for graphs (see [
20
] for details) gives an efficient and simple way to generate graphs with an
expected degree sequence, but without community structure. It can also be easily generalized for bipartite graphs, as
described in [
3
]; the bipartite representation is often used to model hypergraphs. In the same paper, a generalization of
the BTER (block two-level Erd˝
os-Rényi) model is proposed, also for bipartite graphs. In the standard BTER model,
given degree distribution is preserved as well as the degree-wise clustering coefficients. With bipartite graphs, there are
no 3-cycles, so a new metamorphosis coefficient is introduced, which is based on 4-cycles, and the model preserves this
coefficient in a degree-wise matter.
The stochastic block model (SBM), first introduced in [
33
], is one of the most important graph models in community
detection and clustering. One benefit of it is that, being a generative model we can formally study the probability of
inferring the ground truth. This is what distinguishes it from the configuration and Chung-Lu models. There are many
variants and various applications of the SBM. For more details, we direct the reader to [
1
], one of the many surveys
on this model. In particular, one can generalize the SBM to hypergraphs as it was done in, for example, [
30
,
43
,
14
],
while in [
49
], a generalization of SBM is proposed for bipartite networks. Most SBM-type models have ground-truth
communities embedded into the model but do not produce graphs with realistic degree distributions; a variant of SBM
is proposed in [
56
] to approximate power-law degree distribution but as far as we know, there is no such model for
hypergraphs.
In [
57
], an interesting model is introduced in which hyperedges can be generated via a sampling method, given the
number of communities, node memberships, and affinity parameters between the communities. The node degrees and
hyperedge sizes can either be given explicitly or sampled from the model. This approach is more complex than the one
taken by the family of ABCD models. The authors of [
57
] report that generating sparse hypergraphs with
105
nodes
takes roughly half an hour, which is orders of magnitude slower than with h–ABCD.
In [
22
], the probabilistic Hypergraph-MT model for hypergraphs is presented and is shown to be useful for inferring
missing hyperedges as well as detecting overlapping communities. While the model as presented does not explicitly
provide a benchmark, in the conclusion of the paper, the authors mention that their model can also be used to sample
synthetic data with hypergraph structure.
Other hypergraphs models include core-periphery structure model [
55
,
59
], preferential attachment model with power-
law degree distribution and high modularity [31], and entropy-based models [58].
Finally, the authors of [
32
] propose the HyGen model that seems to be similar to the one introduced by us in this paper.
Their algorithm is shown to be fast and scalable using MPI standard for its distributed generation. (MPI, message passing
interface is a standardized means of exchanging messages between multiple computers running a parallel program
across distributed memory.) The HyGen model uses the idea of building independently community and background
hypergraphs that was also considered in the original ABCD graph generator [
42
] and is used in our h–ABCD model.
The main difference between the two models is that HyGen assumes that hyperedges belonging to communities are
completely contained in them, that is, all members of a community hyperedge belong to one community. Our h–ABCD
algorithm allows for community hyperedges to be only partially contained within a community as long as majority of
their members belong to that community. This feature is a significant advantage of h–ABCD over HyGen as most
complex networks have non-strict hyperedges. See Section 3 for more details of community and background hypergraph
generation.
2.3 Distinctive Features of h–ABCD
The goal of this research project is to introduce (and efficiently implement) a scalable synthetic hypergraph benchmark
model with community structure and power-law degree distribution as well as community sizes. Since none of the
existing hypergraph models satisfy all of the desired properties, we moved back to graph models and tried to adjust one
of them to our needs.
h–ABCD model that is proposed in this paper produces networks that have power-law degree distributions and
distributions of community sizes. Alternatively, the user may easily inject the two distributions as parameters of the
model. The user may control the level of noise which covers all spectrum of possibilities of community strength.
The model returns the ground-truth communities that then may be used to benchmark and tune community detection
algorithms. The model, by design, is fast but it is also efficiently implemented in Julia language. (Julia is a high-
level, high-performance, dynamic programming language that recently gains a lot of interest in scientific computing
applications [
12
].) As a result, it is by orders of magnitude faster than other models that we are aware of. Finally,
because of its simplicity, its properties as well as various processes (such as spreading of information, anomalies
3
detection, etc.) can be analyzed theoretically. Such studies, complemented by simulations that are typically easier to
perform, often uncover important mechanisms that are present in real-world complex networks. One such spectacular
example is a study of the preferential attachment model that explains the following phenomena: “rich-get-richer”
mechanisms are responsible for creating power-law distributions that are typically observed in complex networks [
4
].
Analysis of theoretical properties of h-ABCD are left as future work.
3 Definition of the Model
3.1 Parameters of the Model
Let us start with introducing parameters of the model. The h–ABCD model is governed by parameters summarized in
Table 1. Note, in particular, that the number of hyperedges,
m
, is not a parameter of the model. It is a random variable
that depends on the number of nodes
n
, the degree distribution of the graph, and the distribution of hyperedge sizes
qd
. This random variable is well concentrated around its expectation but it is not a parameter provided as an input.
Similarly, the number of communities,
, is not a parameter of the model but a random variable that depends on the
number of nodes
n
and the distribution of community sizes. The process of determining of these values, as well as a
detailed explanation of the interpretation of the presented parameters, is discussed further in this section.
parameter range description
nNnumber of nodes
γ(2,3) power-law exponent of degree distribution
δNminimum degree at least δ
DNmaximum degree at most D
β(1,2) power-law exponent of distribution of community sizes
sN\[δ]community sizes at least s
SN\[s1] community sizes at most S
ξ(0,1) level of noise (fraction of non-community hyperedges)
LNsize of largest hyperedges
qd[0,1] fraction of hyperedges that are of size d;PL
d=1 qd= 1
wc,d [0,1]
fraction of community hyperedges of size
d
that have exactly
c
within-
community nodes; Pd
c=d/2+1 wc,d = 1
Table 1: Parameters of the h–ABCD model
The model generates a hypergraph on
n
nodes. The degree distribution follows power-law with exponent
γ
, minimum
and maximum value equal to
δ
and, respectively,
D
. Community sizes are between
s
and
S
, and also follow power-
law distribution, but this time with exponent
β
. The suggested range of values for parameters
γ
and
β
are chosen
according to experimental values commonly observed in complex networks not only represented as graphs [
5
,
54
]
but also as hypergraphs [
25
]. Parameter
ξ
is responsible for the level of noise. If
ξ= 0
, then each hyperedge is a
community hyperedge meaning that majority of its nodes belong to one community. On the other extreme, if
ξ= 1
,
then communities do not play any roles and hyperedges are simply “sprinkled” across the entire hypergraph that we
will refer to as background hypergraph. Vector
(q1, . . . , qL)
determines the distribution of the number of hyperedges of
a given size.
Finally, parameters
wc,d
specify how many nodes from its own community a given community hyperedge should have.
We call a community hyperedge to be of a
(c, d)
type if it has size
d
and exactly
c
of its nodes belong to one of the
communities. Note, in particular, that we require that a community hyperedge must have more than a half of its nodes
from the community. Therefore, wc,d is defined for d/2< c d, where d[L].
The model is flexible and may accept any family of parameters
wc,d
satisfying specific needs of the users, but here is a
list of three standard options implemented in the code (see Table 2 for example calculations):
majority model: wc,d is uniform for all admissible values of c, that is, for any d/2< c d,
wc,d =1
(d− ⌊d/2)=1
d/2,
linear model: wc,d is proportional to cfor all admissible values of c, that is, for any d/2< c d,
wc,d =2c
(d+d/2+ 1)(d− ⌊d/2)=2c
(d+d/2+ 1)d/2,
4
majority wc,d
cd
1 2 3 4 5
1 1
2 1 1/2
3 1/2 1/2 1/3
4 1/2 1/3
5 1/3
linear wc,d
cd
1 2 3 4 5
1 1
2 1 2/5
3 3/5 3/7 3/12
4 4/7 4/12
5 5/12
strict wc,d
cd
12345
1 1
2 1 0
3 1 0 0
4 1 0
5 1
Table 2: Example values of wc,d for majority,linear, and strict weights for d[5].
strict model: only “pure” hyperedges are allowed, that is
wd,d = 1 and wc,d = 0 for d/2< c < d.
In the above formulas and later in the paper, for a given xR,xis used to denote the floor of x(that is, the largest
integer not larger than x) and xto denote its ceiling (that is, the smallest integer not smaller than x).
3.2 The Big Picture
We summarize the main steps followed to generate the hypergraph h–ABCD below. More details are provided right
after.
1. Sample degrees of nodes.
2.
Sample community sizes ensuring that the desired properties hold, in particular, that the sum of community
sizes equals the number of nodes n.
3. Compute the number of hyperedges of size one and then generate them.
4.
For each node, compute what fraction of its degree should be assigned to community hyperedges and what
fraction remains to be used for background hyperedges.
5.
Assign nodes to communities ensuring that they fit into them (for example, nodes of very high degree cannot be
assigned to small communities, as they would not be able to form simple hyperedges within such communities).
One of such admissible assignments is selected randomly.
6. Generate community hypergraphs.
7. Generate the background hypergraph.
8.
Resolve potential problems with infeasible hyperedges (either being multisets or duplicates) by executing the
rewiring process.
3.3 Definition of the Model—Details
Simple Hypergraphs vs. Multi-hypergraphs
Two variants of the model are considered in this paper, and both of them are implemented and available at the associated
GitHub repository
||
. The first variant (that is assumed to be used by default) produces simple hypergraphs whereas the
second one generates multi-hypergraphs. “Simple” in the context of hypergraphs means that hyperedges are sets of
nodes but multi-sets are not allowed. Indeed, since all edges in a graph are of size two, loops in the context of graphs
are multi-sets of size two in which one node is repeated twice. In particular, there are no multi-sets of size two which
correspond to loops in the graph terminology. Similarly, no hyperedges can be repeated so, in particular, there are no
parallel edges in the language of graphs (two identical hyperedges of size two in the language of hypergraphs).
Degree Distribution
In the context of hypergraphs, the degree of a node
vV
of a hypergraph
G= (V, E)
is defined to be the number
of hyperedges this node belongs to, regardless of their sizes. (For multi-hypergraphs, the number of occurrences of a
node in a hyperedge as well as the number of repetitions of a hyperedge are taken into account.) The volume
vol(V)
of
G
is defined to be the sum of degrees over all nodes in the hypergraph. Hence, if there are
md
hyperedges of size
||https://github.com/bkamins/ABCDHypergraphGenerator.jl
5
摘要:

HYPERGRAPHARTIFICIALBENCHMARKFORCOMMUNITYDETECTION(H–ABCD)APREPRINTBogumiłKami´nski∗PawełPrałat†FrançoisThéberge‡June14,2023ABSTRACTTheArtificialBenchmarkforCommunityDetection(ABCD)graphisarecentlyintroducedrandomgraphmodelwithcommunitystructureandpower-lawdistributionforbothdegreesandcommunitysizes...

展开>> 收起<<
HYPERGRAPH ARTIFICIAL BENCHMARK FOR COMMUNITY DETECTION HABCD A P REPRINT.pdf

共23页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:23 页 大小:1.04MB 格式:PDF 时间:2025-05-08

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 23
客服
关注