communities. The configuration model was used in the original ABCD model and we will use its generalization again
for h–ABCD.
The Chung-Lu model for graphs (see [
20
] for details) gives an efficient and simple way to generate graphs with an
expected degree sequence, but without community structure. It can also be easily generalized for bipartite graphs, as
described in [
3
]; the bipartite representation is often used to model hypergraphs. In the same paper, a generalization of
the BTER (block two-level Erd˝
os-Rényi) model is proposed, also for bipartite graphs. In the standard BTER model,
given degree distribution is preserved as well as the degree-wise clustering coefficients. With bipartite graphs, there are
no 3-cycles, so a new metamorphosis coefficient is introduced, which is based on 4-cycles, and the model preserves this
coefficient in a degree-wise matter.
The stochastic block model (SBM), first introduced in [
33
], is one of the most important graph models in community
detection and clustering. One benefit of it is that, being a generative model we can formally study the probability of
inferring the ground truth. This is what distinguishes it from the configuration and Chung-Lu models. There are many
variants and various applications of the SBM. For more details, we direct the reader to [
1
], one of the many surveys
on this model. In particular, one can generalize the SBM to hypergraphs as it was done in, for example, [
30
,
43
,
14
],
while in [
49
], a generalization of SBM is proposed for bipartite networks. Most SBM-type models have ground-truth
communities embedded into the model but do not produce graphs with realistic degree distributions; a variant of SBM
is proposed in [
56
] to approximate power-law degree distribution but as far as we know, there is no such model for
hypergraphs.
In [
57
], an interesting model is introduced in which hyperedges can be generated via a sampling method, given the
number of communities, node memberships, and affinity parameters between the communities. The node degrees and
hyperedge sizes can either be given explicitly or sampled from the model. This approach is more complex than the one
taken by the family of ABCD models. The authors of [
57
] report that generating sparse hypergraphs with
105
nodes
takes roughly half an hour, which is orders of magnitude slower than with h–ABCD.
In [
22
], the probabilistic Hypergraph-MT model for hypergraphs is presented and is shown to be useful for inferring
missing hyperedges as well as detecting overlapping communities. While the model as presented does not explicitly
provide a benchmark, in the conclusion of the paper, the authors mention that their model can also be used to sample
synthetic data with hypergraph structure.
Other hypergraphs models include core-periphery structure model [
55
,
59
], preferential attachment model with power-
law degree distribution and high modularity [31], and entropy-based models [58].
Finally, the authors of [
32
] propose the HyGen model that seems to be similar to the one introduced by us in this paper.
Their algorithm is shown to be fast and scalable using MPI standard for its distributed generation. (MPI, message passing
interface is a standardized means of exchanging messages between multiple computers running a parallel program
across distributed memory.) The HyGen model uses the idea of building independently community and background
hypergraphs that was also considered in the original ABCD graph generator [
42
] and is used in our h–ABCD model.
The main difference between the two models is that HyGen assumes that hyperedges belonging to communities are
completely contained in them, that is, all members of a community hyperedge belong to one community. Our h–ABCD
algorithm allows for community hyperedges to be only partially contained within a community as long as majority of
their members belong to that community. This feature is a significant advantage of h–ABCD over HyGen as most
complex networks have non-strict hyperedges. See Section 3 for more details of community and background hypergraph
generation.
2.3 Distinctive Features of h–ABCD
The goal of this research project is to introduce (and efficiently implement) a scalable synthetic hypergraph benchmark
model with community structure and power-law degree distribution as well as community sizes. Since none of the
existing hypergraph models satisfy all of the desired properties, we moved back to graph models and tried to adjust one
of them to our needs.
h–ABCD model that is proposed in this paper produces networks that have power-law degree distributions and
distributions of community sizes. Alternatively, the user may easily inject the two distributions as parameters of the
model. The user may control the level of noise which covers all spectrum of possibilities of community strength.
The model returns the ground-truth communities that then may be used to benchmark and tune community detection
algorithms. The model, by design, is fast but it is also efficiently implemented in Julia language. (Julia is a high-
level, high-performance, dynamic programming language that recently gains a lot of interest in scientific computing
applications [
12
].) As a result, it is by orders of magnitude faster than other models that we are aware of. Finally,
because of its simplicity, its properties as well as various processes (such as spreading of information, anomalies
3