HYPERGRAPH ARTIFICIAL BENCHMARK FOR COMMUNITY DETECTION HABCD A P REPRINT

2025-05-08 0 0 1.04MB 23 页 10玖币

侵权投诉

HYPERGRAPH ARTIFICIAL BENCHMARK FOR COMMUNITY

DETECTION (H–ABCD)

A PREPRINT

Bogumił Kami´

nski∗Paweł Prałat†François Théberge‡

June 14, 2023

ABSTRACT

The Artiﬁcial Benchmark for Community Detection (ABCD) graph is a recently introduced random

graph model with community structure and power-law distribution for both degrees and community

sizes. The model generates graphs with similar properties as the well-known LFR one, and its main

parameter

can be tuned to mimic its counterpart in the LFR model, the mixing parameter

. In this

paper, we introduce hypergraph counterpart of the ABCD model, h–ABCD, which also produces

random hypergraph with distributions of ground-truth community sizes and degrees following power-

law. As in the original ABCD, the new model h–ABCD can produce hypergraphs with various levels

of noise. More importantly, the model is ﬂexible and can mimic any desired level of homogeneity of

hyperedges that fall into one community. As a result, it can be used as a suitable, synthetic playground

for analyzing and tuning hypergraph community detection algorithms.

1 Introduction

Many networks that are currently modelled as graphs would be more accurately modelled as hypergraphs. This includes

the collaboration network in which nodes correspond to researchers and hyperedges correspond to papers that consist

of nodes associated with researchers that co-authorship a given paper. After many years of intense research using

graph theory in modelling and mining complex networks [

], hypergraphs start gaining considerable

traction [

]. Standard but important questions in network science are revisited in the context of hypergraphs.

This includes various aspects related to community detection in networks [

] which

we concentrate on in this paper. However, hypergraphs also create brand new questions which did not have their

counterparts for graphs. For example, how hyperedges overlap in empirical hypergraphs [

]? Or how the existing

patterns in a hypergraph affect the formation of new hyperedges [35]?

Despite the fact that currently there is a vivid discussion around hypergraphs, the theory and tools are still not sufﬁciently

developed to allow most problems, including clustering, to be tackled directly within this context. Indeed, researchers

and practitioners often create the 2-section graph of a hypergraph of interest (that is, replace each hyperedge with a

clique) and apply classical tools designed for graphs. After moving to the 2-section graph, one clearly loses some

information about hyperedges of size greater than two and so there is a common belief that one can do better by using

the knowledge of the original hypergraph.

As mentioned earlier, there are some recent attempts to directly deal with hypergraphs in the context of clustering.

In [

], methods from spectral clustering are generalized to hypergraphs; [

] proposes a framework to infer missing

hyperedges and detect overlapping communities, while in [

], extensions of non-backtracking spectral clustering are

proposed in the context of hypergraphs. In Kumar et al. [

], the authors reduce the problem to graphs but use

original hypergraphs to iteratively adjust weights to encourage some hyperedges to be included in some cluster but

∗

Decision Analysis and Support Unit, SGH Warsaw School of Economics, Warsaw, Poland; e-mail:

bogumil.kaminski@sgh.waw.pl

†

Department of Mathematics, Toronto Metropolitan University, Toronto, ON, Canada; e-mail:

pralat@ryerson.ca

. Part of

this work was done while the author was visiting the Simons Institute for the Theory of Computing.

‡Tutte Institute for Mathematics and Computing, Ottawa, ON, Canada; email: theberge@ieee.org

arXiv:2210.15009v3 [cs.SI] 12 Jun 2023

discourage other ones (this process can be viewed as separating signal from noise). Moreover, in [

] a number of

extensions of the classic null model for graphs are proposed that can potentially be used by hypergraph algorithms.

Unfortunately, there are many ways such extensions can be done depending on how often nodes in one community share

hyperedges with nodes from other communities. This is something that varies between networks at hand and usually

depends on the hyperedge sizes. Let us come back to the collaboration network we discussed earlier. Hyperedges

associated with papers written by mathematicians might be more homogeneous and smaller in comparison with those

written by medical doctors who tend to work in large and multidisciplinary teams. Moreover, in general, papers with a

large number of co-authors tend to be less homogeneous, and other patterns can be identiﬁed [

]. A good clustering

algorithm should be able to automatically decide which extension should be used. However, in order to be able to design

and properly tune such algorithms, one needs to have a synthetic random hypergraph model that is able to simulate

various scenarios.

The hypergraph model, h–ABCD, is our response to this need from both practitioners and academia. This random

graph model, similarly to the original ABCD model, produces hypergraphs with community structure and power-law

distribution for both degrees and community sizes. Indeed, as with graphs, power-law distribution is a common

feature present in many real-world hypergraphs; for example, [

] shows power-law distributions for several real-world

hypergraph decompositions, and we have checked that in the presented cases the hypergraph degree distributions also

follow power-law distributions.

The paper is structured as follows. In Section 2, we brieﬂy discuss the history of ABCD, the “parent” of the h–ABCD

model introduced in this paper, before summarizing other synthetic hypergraphs. We focus on models important in the

context of community detection and explain the differences between the existing models and our model. The h–ABCD

model is deﬁned in Section 3. Experiments highlighting important properties of the model are discussed in Section 4.

In Section 5, we conclude the paper with a few future research directions.

2 Existing Models

2.1 ABCD graph Models

There are very few datasets with ground-truth identiﬁed and labelled. As a result, there is need for synthetic random

graph models with community structure that resemble real-world networks in order to benchmark and tune clustering

algorithms that are unsupervised by nature. The LFR (Lancichinetti, Fortunato, Radicchi) model [

] generates

networks with communities and at the same time it allows for the heterogeneity in the distributions of both node degrees

and of community sizes. It became a standard and extensively used method for generating artiﬁcial networks.

The Artiﬁcial Benchmark for Community Detection (ABCD) [

] was recently introduced and implemented

, including

a fast implementation

that uses multiple threads (ABCDe) [

]. Undirected variant of LFR and ABCD produce graphs

with comparable properties but ABCD/ABCDe is faster than LFR and can be easily tuned to allow the user to make

a smooth transition between the two extremes: pure (disjoint) communities and random graph with no community

structure. Moreover, it is easier to analyze theoretically—for example, in [

] various theoretical asymptotic properties

of the ABCD model are investigated including the modularity function that, despite some known issues such as the

“resolution limit” reported in [

], is an important graph property of networks in the context of community detection.

Finally, the building blocks in the model are ﬂexible and may be adjusted to satisfy different needs. For example, the

original ABCD model was recently adjusted to include potential outliers in [

] resulting in ABCD+o model. Adjusting

the model to hypergraphs is much more complex but doable.

2.2 Other Hypergraph Models

The classical conﬁguration model, which was ﬁrst introduced by Bollobás [

], is a standard model producing graphs

with a given degree sequence. It was generalized to higher order structures many times [

]. One early example

is the folksonomy, a tripartite structure of users, resources, and tags, that was modelled as hypergraphs generated

via conﬁguration-type model in [

]. The variant of the conﬁguration model in [

] generalizes it even further,

namely, to simplicial complexes. Simplicial complexes are attractive tools when studying topological aspects of discrete

data [

], but subset inclusion is a strong property, not suitable from our application point of view, namely, community

detection. In general, conﬁguration models do not pay attention to labels of nodes and so cannot produce graphs with

community structure. Having said that, they can be used as an ingredient of models which produce networks with

§https://github.com/bkamins/ABCDGraphGenerator.jl/

¶https://github.com/tolcz/ABCDeGraphGenerator.jl/

communities. The conﬁguration model was used in the original ABCD model and we will use its generalization again

for h–ABCD.

The Chung-Lu model for graphs (see [

] for details) gives an efﬁcient and simple way to generate graphs with an

expected degree sequence, but without community structure. It can also be easily generalized for bipartite graphs, as

described in [

]; the bipartite representation is often used to model hypergraphs. In the same paper, a generalization of

the BTER (block two-level Erd˝

os-Rényi) model is proposed, also for bipartite graphs. In the standard BTER model,

given degree distribution is preserved as well as the degree-wise clustering coefﬁcients. With bipartite graphs, there are

no 3-cycles, so a new metamorphosis coefﬁcient is introduced, which is based on 4-cycles, and the model preserves this

coefﬁcient in a degree-wise matter.

The stochastic block model (SBM), ﬁrst introduced in [

], is one of the most important graph models in community

detection and clustering. One beneﬁt of it is that, being a generative model we can formally study the probability of

inferring the ground truth. This is what distinguishes it from the conﬁguration and Chung-Lu models. There are many

variants and various applications of the SBM. For more details, we direct the reader to [

], one of the many surveys

on this model. In particular, one can generalize the SBM to hypergraphs as it was done in, for example, [

while in [

], a generalization of SBM is proposed for bipartite networks. Most SBM-type models have ground-truth

communities embedded into the model but do not produce graphs with realistic degree distributions; a variant of SBM

is proposed in [

] to approximate power-law degree distribution but as far as we know, there is no such model for

hypergraphs.

In [

], an interesting model is introduced in which hyperedges can be generated via a sampling method, given the

number of communities, node memberships, and afﬁnity parameters between the communities. The node degrees and

hyperedge sizes can either be given explicitly or sampled from the model. This approach is more complex than the one

taken by the family of ABCD models. The authors of [

] report that generating sparse hypergraphs with

105

nodes

takes roughly half an hour, which is orders of magnitude slower than with h–ABCD.

In [

], the probabilistic Hypergraph-MT model for hypergraphs is presented and is shown to be useful for inferring

missing hyperedges as well as detecting overlapping communities. While the model as presented does not explicitly

provide a benchmark, in the conclusion of the paper, the authors mention that their model can also be used to sample

synthetic data with hypergraph structure.

Other hypergraphs models include core-periphery structure model [

], preferential attachment model with power-

law degree distribution and high modularity [31], and entropy-based models [58].

Finally, the authors of [

] propose the HyGen model that seems to be similar to the one introduced by us in this paper.

Their algorithm is shown to be fast and scalable using MPI standard for its distributed generation. (MPI, message passing

interface is a standardized means of exchanging messages between multiple computers running a parallel program

across distributed memory.) The HyGen model uses the idea of building independently community and background

hypergraphs that was also considered in the original ABCD graph generator [

] and is used in our h–ABCD model.

The main difference between the two models is that HyGen assumes that hyperedges belonging to communities are

completely contained in them, that is, all members of a community hyperedge belong to one community. Our h–ABCD

algorithm allows for community hyperedges to be only partially contained within a community as long as majority of

their members belong to that community. This feature is a signiﬁcant advantage of h–ABCD over HyGen as most

complex networks have non-strict hyperedges. See Section 3 for more details of community and background hypergraph

generation.

2.3 Distinctive Features of h–ABCD

The goal of this research project is to introduce (and efﬁciently implement) a scalable synthetic hypergraph benchmark

model with community structure and power-law degree distribution as well as community sizes. Since none of the

existing hypergraph models satisfy all of the desired properties, we moved back to graph models and tried to adjust one

of them to our needs.

h–ABCD model that is proposed in this paper produces networks that have power-law degree distributions and

distributions of community sizes. Alternatively, the user may easily inject the two distributions as parameters of the

model. The user may control the level of noise which covers all spectrum of possibilities of community strength.

The model returns the ground-truth communities that then may be used to benchmark and tune community detection

algorithms. The model, by design, is fast but it is also efﬁciently implemented in Julia language. (Julia is a high-

level, high-performance, dynamic programming language that recently gains a lot of interest in scientiﬁc computing

applications [

].) As a result, it is by orders of magnitude faster than other models that we are aware of. Finally,

because of its simplicity, its properties as well as various processes (such as spreading of information, anomalies

detection, etc.) can be analyzed theoretically. Such studies, complemented by simulations that are typically easier to

perform, often uncover important mechanisms that are present in real-world complex networks. One such spectacular

example is a study of the preferential attachment model that explains the following phenomena: “rich-get-richer”

mechanisms are responsible for creating power-law distributions that are typically observed in complex networks [

Analysis of theoretical properties of h-ABCD are left as future work.

3 Deﬁnition of the Model

3.1 Parameters of the Model

Let us start with introducing parameters of the model. The h–ABCD model is governed by parameters summarized in

Table 1. Note, in particular, that the number of hyperedges,

, is not a parameter of the model. It is a random variable

that depends on the number of nodes

, the degree distribution of the graph, and the distribution of hyperedge sizes

. This random variable is well concentrated around its expectation but it is not a parameter provided as an input.

Similarly, the number of communities,

ℓ

, is not a parameter of the model but a random variable that depends on the

number of nodes

and the distribution of community sizes. The process of determining of these values, as well as a

detailed explanation of the interpretation of the presented parameters, is discussed further in this section.

parameter range description

nNnumber of nodes

γ(2,3) power-law exponent of degree distribution

δNminimum degree at least δ

DNmaximum degree at most D

β(1,2) power-law exponent of distribution of community sizes

sN\[δ]community sizes at least s

SN\[s−1] community sizes at most S

ξ(0,1) level of noise (fraction of non-community hyperedges)

LNsize of largest hyperedges

qd[0,1] fraction of hyperedges that are of size d;PL

d=1 qd= 1

wc,d [0,1]

fraction of community hyperedges of size

that have exactly

within-

community nodes; Pd

c=⌊d/2⌋+1 wc,d = 1

Table 1: Parameters of the h–ABCD model

The model generates a hypergraph on

nodes. The degree distribution follows power-law with exponent

, minimum

and maximum value equal to

and, respectively,

. Community sizes are between

and

, and also follow power-

law distribution, but this time with exponent

. The suggested range of values for parameters

and

are chosen

according to experimental values commonly observed in complex networks not only represented as graphs [

]

but also as hypergraphs [

]. Parameter

is responsible for the level of noise. If

ξ= 0

, then each hyperedge is a

community hyperedge meaning that majority of its nodes belong to one community. On the other extreme, if

ξ= 1

then communities do not play any roles and hyperedges are simply “sprinkled” across the entire hypergraph that we

will refer to as background hypergraph. Vector

(q1, . . . , qL)

determines the distribution of the number of hyperedges of

a given size.

Finally, parameters

wc,d

specify how many nodes from its own community a given community hyperedge should have.

We call a community hyperedge to be of a

(c, d)

type if it has size

and exactly

of its nodes belong to one of the

communities. Note, in particular, that we require that a community hyperedge must have more than a half of its nodes

from the community. Therefore, wc,d is deﬁned for d/2< c ≤d, where d∈[L].

The model is ﬂexible and may accept any family of parameters

wc,d

satisfying speciﬁc needs of the users, but here is a

list of three standard options implemented in the code (see Table 2 for example calculations):

•majority model: wc,d is uniform for all admissible values of c, that is, for any d/2< c ≤d,

wc,d =1

(d− ⌊d/2⌋)=1

⌈d/2⌉,

•linear model: wc,d is proportional to cfor all admissible values of c, that is, for any d/2< c ≤d,

wc,d =2c

(d+⌊d/2⌋+ 1)(d− ⌊d/2⌋)=2c

(d+⌊d/2⌋+ 1)⌈d/2⌉,

majority wc,d

1 2 3 4 5

1 1

2 1 1/2

3 1/2 1/2 1/3

4 1/2 1/3

5 1/3

linear wc,d

1 2 3 4 5

1 1

2 1 2/5

3 3/5 3/7 3/12

4 4/7 4/12

5 5/12

strict wc,d

12345

1 1

2 1 0

3 1 0 0

4 1 0

5 1

Table 2: Example values of wc,d for majority,linear, and strict weights for d∈[5].

•strict model: only “pure” hyperedges are allowed, that is

wd,d = 1 and wc,d = 0 for d/2< c < d.

In the above formulas and later in the paper, for a given x∈R,⌊x⌋is used to denote the ﬂoor of x(that is, the largest

integer not larger than x) and ⌈x⌉to denote its ceiling (that is, the smallest integer not smaller than x).

3.2 The Big Picture

We summarize the main steps followed to generate the hypergraph h–ABCD below. More details are provided right

after.

1. Sample degrees of nodes.

Sample community sizes ensuring that the desired properties hold, in particular, that the sum of community

sizes equals the number of nodes n.

3. Compute the number of hyperedges of size one and then generate them.

For each node, compute what fraction of its degree should be assigned to community hyperedges and what

fraction remains to be used for background hyperedges.

Assign nodes to communities ensuring that they ﬁt into them (for example, nodes of very high degree cannot be

assigned to small communities, as they would not be able to form simple hyperedges within such communities).

One of such admissible assignments is selected randomly.

6. Generate community hypergraphs.

7. Generate the background hypergraph.

Resolve potential problems with infeasible hyperedges (either being multisets or duplicates) by executing the

rewiring process.

3.3 Deﬁnition of the Model—Details

Simple Hypergraphs vs. Multi-hypergraphs

Two variants of the model are considered in this paper, and both of them are implemented and available at the associated

GitHub repository

. The ﬁrst variant (that is assumed to be used by default) produces simple hypergraphs whereas the

second one generates multi-hypergraphs. “Simple” in the context of hypergraphs means that hyperedges are sets of

nodes but multi-sets are not allowed. Indeed, since all edges in a graph are of size two, loops in the context of graphs

are multi-sets of size two in which one node is repeated twice. In particular, there are no multi-sets of size two which

correspond to loops in the graph terminology. Similarly, no hyperedges can be repeated so, in particular, there are no

parallel edges in the language of graphs (two identical hyperedges of size two in the language of hypergraphs).

Degree Distribution

In the context of hypergraphs, the degree of a node

v∈V

of a hypergraph

G= (V, E)

is deﬁned to be the number

of hyperedges this node belongs to, regardless of their sizes. (For multi-hypergraphs, the number of occurrences of a

node in a hyperedge as well as the number of repetitions of a hyperedge are taken into account.) The volume

vol(V)

is deﬁned to be the sum of degrees over all nodes in the hypergraph. Hence, if there are

hyperedges of size

||https://github.com/bkamins/ABCDHypergraphGenerator.jl

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

HYPERGRAPHARTIFICIALBENCHMARKFORCOMMUNITYDETECTION(H–ABCD)APREPRINTBogumiłKami´nski∗PawełPrałat†FrançoisThéberge‡June14,2023ABSTRACTTheArtificialBenchmarkforCommunityDetection(ABCD)graphisarecentlyintroducedrandomgraphmodelwithcommunitystructureandpower-lawdistributionforbothdegreesandcommunitysizes...

展开>> 收起<<

HYPERGRAPH ARTIFICIAL BENCHMARK FOR COMMUNITY DETECTION HABCD A P REPRINT.pdf

共23页,预览5页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

HYPERGRAPH ARTIFICIAL BENCHMARK FOR COMMUNITY DETECTION HABCD A P REPRINT

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: