Model-based clustering in simple hypergraphs through a stochastic blockmodel Luca Brusa1and Catherine Matias2

2025-04-26 0 0 1.09MB 59 页 10玖币
侵权投诉
Model-based clustering in simple hypergraphs through a
stochastic blockmodel
Luca Brusa1and Catherine Matias2
1Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Via
Bicocca degli Arcimboldi 8, 20100 Milano, Italy. Email: luca.brusa@unimib.it
2Sorbonne Université, Université de Paris Cité, Centre National de la Recherche
Scientifique, Laboratoire de Probabilités, Statistique et Modélisation, 4 place Jussieu,
75252 PARIS Cedex 05, France. Email: catherine.matias@cnrs.fr
Abstract
We propose a model to address the overlooked problem of node clustering in simple hy-
pergraphs. Simple hypergraphs are suitable when a node may not appear multiple times in
the same hyperedge, such as in co-authorship datasets. Our model generalizes the stochas-
tic blockmodel for graphs and assumes the existence of latent node groups and hyperedges
are conditionally independent given these groups. We first establish the generic identifia-
bility of the model parameters. We then develop a variational approximation Expectation-
Maximization algorithm for parameter inference and node clustering, and derive a statistical
criterion for model selection.
To illustrate the performance of our Rpackage HyperSBM, we compare it with other node
clustering methods using synthetic data generated from the model, as well as from a line
clustering experiment and a co-authorship dataset.
Keywords: co-authorship network, high-order interactions, latent variable model, line
clustering, non-uniform hypergraph, variational expectation-maximization.
1 Introduction
Over the past two decades, a wide range of models has been developed to capture pairwise
interactions represented in graphs. However, modern applications in various fields have high-
lighted the necessity to consider high-order interactions, which involve groups of three or more
nodes. Simple examples include triadic and larger group interactions in social networks (whose
importance has been recognized early on, see Wolff, 1950), scientific co-authorship (Estrada and
Rodríguez-Velázquez, 2006), interactions among more than two species in ecological systems
1
arXiv:2210.05983v3 [stat.ME] 17 May 2024
(Muyinda et al., 2020; Singh and Baruah, 2021), or high-order correlations between neurons in
brain networks (Chelaru et al., 2021). To formalize these high-order interactions, hypergraphs
provide the most general framework. Similar to a graph, a hypergraph consists of a set of nodes
and a set of hyperedges, where each hyperedge is a subset of nodes involved in an interaction.
In this context, it is important to distinguish simple hypergraphs from multiset hypergraphs,
where hyperedges can contain repeated nodes. Multisets are a generalization of sets, allowing
elements to appear with varying multiplicities. Recent reviews on high-order interactions can be
found in the works of Battiston et al. (2020), Bick et al. (2023), and Torres et al. (2021).
Despite the increasing interest in high-order interactions, the statistical literature on this topic
remains limited. Some graph-based statistics, such as centrality or clustering coefficient, have
been extended to hypergraphs to aid in understanding the structure and extracting information
from the data (Estrada and Rodríguez-Velázquez, 2006). However, these statistics do not fulfill
the need for random hypergraph models.
Early analyses of hypergraphs have relied on their embedding into the space of bipartite graphs
(see, e.g., Battiston et al., 2020). Hypergraphs with self-loops and multiple hyperedges (weighted
hyperedges with integer-valued weights) are equivalent to bipartite graphs. However, bipartite
graph models were not specifically designed for hypergraphs and may introduce artifacts; we refer
to Section A in the Supplementary Material for more details.
Generalizing Erdős-Rényi’s model of random graphs leads to uniformly random hypergraphs.
This model involves drawing uniformly at random from the set of all m-uniform hypergraphs
(hypergraphs with hyperedges of fixed cardinality m) over a set of nnodes. However, similar
to Erdős-Rényi’s model for graphs, this hypergraph model is too simplistic and homogeneous to
be used for statistical analysis of real-world datasets. In the configuration model for random
graphs, the graphs are generated by drawing uniformly at random from the set of all possible
graphs over a set of nnodes, while satisfying a given prescribed degree sequence. In the context
of hypergraphs, configuration models were proposed in Ghoshal et al. (2009), focusing on tripar-
tite and 3-uniform hypergraphs. Later, Chodrow (2020) extended the configuration model to a
more general hypergraph setup. In these references, both the node degrees and the hyperedge
sizes are kept fixed (a consequence of the fact that they rely on bipartite representations of hy-
pergraphs). The configuration model is useful for sampling (hyper)graphs with the same degree
sequence (and hyperedge sizes) as an observed dataset through shuffling algorithms. Therefore,
it is often employed as a null model in statistical analyses. However, sampling exactly (rather
than approximately) from this model poses challenges, particularly in the case of hypergraphs.
For a comprehensive discussion on this issue, we refer readers to Section 4 in Chodrow (2020).
Another popular approach for extracting information from heterogeneous data is clustering.
In the context of graphs, stochastic blockmodels (SBMs) were introduced in the early eighties
2
(Frank and Harary, 1982; Holland et al., 1983) and have since evolved in various directions. These
models assume that nodes are grouped into clusters, and the probabilities of connections between
nodes are determined by their cluster memberships. Variants of SBMs have been developed to
handle weighted graphs and degree-corrected versions, among others. In the context of hyper-
graphs, Ghoshdastidar and Dukkipati (2017) studied a spectral clustering approach based on a
hypergraph Laplacian, and obtained its weak consistency under a Hypergraph SBM (HSBM) un-
der certain restrictions on the model parameters. More recently, Deng et al. (2024) established the
strong consistency of the basic spectral clustering under the degree-corrected HSBM (DCHSBM)
in the sparse regime where the maximum expected hyperdegree might be of order Ω(log n)and
nis the number of nodes. By introducing hypergraphons, Balasubramanian (2021) extended
the ideas of hypergraph SBMs to a nonparametric setting. In a parallel vein, Turnbull et al.
(2021) proposed a latent space model for hypergraphs, generalizing random geometric graphs to
hypergraphs, although it was not specifically designed to capture clustering. An approach linked
to SBMs is presented in Vazquez (2009), where nodes belong to latent groups and participate in
a hyperedge with a probability that depends on both their group and the specific hyperedge.
Modularity is a widely used criterion for clustering entities in the context of interaction
data. It aims to identify specific clusters, known as communities, characterized by high within-
group connection probabilities and low between-group connection probabilities (Ghoshdastidar
and Dukkipati, 2014). However, in the hypergraph context, the definition of modularity is not
unique. In particular, Kamiński et al. (2019) introduced a “strict” modularity criterion, where only
hyperedges with all their nodes belonging to the same group contribute to an increase in modular-
ity. Their criterion measures the deviation of the number of these homogeneous hyperedges from
a new null model called the configuration-like model for hypergraphs, where the average values
of the degrees are fixed. Building upon this, Chodrow et al. (2021) proposed a degree-corrected
hypergraph SBM and introduced two new modularity criteria. Similar to Kamiński et al. (2019),
one of these criteria utilizes an “all-or-nothing” affinity function that distinguishes whether a given
hyperedge is entirely contained within a single cluster or not. In this setup, they established a
connection between approximate maximum likelihood estimation and their modularity criterion.
This work is reminiscent of the work of Newman (2016) in the graph context. However, the
estimators proposed by Chodrow et al. (2021) do not guarantee maximum likelihood estimation,
as the parameter space is constrained by assuming a symmetric affinity function. We refer to
Poda and Matias (2024) for an empirical comparison of these modularity-based methods.
It is important to highlight that the developments presented in Kamiński et al. (2019) and
Chodrow et al. (2021) are specifically conducted in the context of multiset hypergraphs, where
hyperedges can contain repeated nodes with certain multiplicities. The use of multiset hyper-
graphs simplifies some of the challenges associated with computing modularity. However, to the
best of our knowledge, modularity approaches still lack instantiation in the case of simple hyper-
graphs where each node can only appear once in a hyperedge. More specifically, the null model
3
used in hypergraph modularity criteria relies on a model for multiset hypergraphs, similar to how
the null model used in classical graph modularity is based on graphs with self-loops. While it is
known in the case of graphs that this assumption is inadequate, as it induces a stronger deviation
than expected and affects sparse networks as well (Massen and Doye, 2005; Cafieri et al., 2010;
Squartini and Garlaschelli, 2011), the assumption of multisets has not yet been discussed in the
context of hypergraph modularity.
In the context of community detection, random walk approaches have also been utilized for
hypergraph clustering (Swan and Zhan, 2021). Additionally, low-rank tensor decompositions have
been explored (Ke et al., 2020). The misclassification rate for the community detection problem
in hypergraphs and its limits have been analyzed in various contexts (see, for instance, Ahn et al.,
2018; Chien et al., 2019; Cole and Zhu, 2020). It is worth mentioning that a recent approach
has been proposed to cluster hyperedges instead of nodes (Ng and Murphy, 2022). However, our
focus in this work is on clustering nodes.
The literature on high-order interactions often discusses simplicial complexes alongside hy-
pergraphs (Battiston et al., 2020). However, the unique characteristic of simplicial complexes,
where each subset of an occurring interaction should also occur, places them outside the scope of
this introduction, which is specifically focused on hypergraphs.
In this article, our focus is on model-based clustering for simple hypergraphs, specifically
studying stochastic hypergraph blockmodels. We formulate a general stochastic blockmodel for
simple hypergraphs, along with various submodels (Section 2.1). We provide the first result on the
generic identifiability of parameters in a hypergraph stochastic blockmodel (Section 2.2). Param-
eter inference and node clustering are performed using a variational Expectation-Maximization
(VEM) algorithm (Section 2.3) that approximates the maximum likelihood estimator. Model se-
lection for the number of groups is based on an integrated classification likelihood (ICL) criterion
(Section 2.4). To illustrate the performance of our method, we conduct experiments on syn-
thetic sparse hypergraphs, including a comparison with hypergraph spectral clustering (HSC)
and modularity approaches (Section 3). Notably, the line clustering experiment (Section 3.4)
highlights the significant differences between our approach and the one proposed by Chodrow
et al. (2021). We also analyze a co-authorship dataset, presenting conclusions that differ from
spectral clustering and bipartite stochastic blockmodels (Section 4). We discuss (Section 5) our
approach, its advantages, current limitations and possible extensions. An Rpackage, HyperSBM,
which implements our method in efficient C++ code, as well as all associated scripts, are available
online (see Section 6). This manuscript is accompanied by a Supplementary Material (SM) that
contains additional information and experiments, as well as the proofs of all theoretical results.
4
2 A stochastic blockmodel for hypergraphs
2.1 Model formulation
Let H= (V,E)represent a binary hypergraph, where V={1, . . . , n}is a set of nnodes and E
is the set of hyperedges. In this context, a hyperedge of size m2is defined as a collection
of mdistinct nodes from V. We do not allow for hyperedges to be multisets or self-loops. Let
M= max
e∈E |e|denote the largest possible size of hyperedges in E, with M2(for graphs, M= 2).
We define the sets of (unordered) node subsets, (ordered) node tuples, and hyperedges of size m
as
V(m)={i1, . . . , im}:i1, . . . , im∈ V are all distinct,
Vm=(i1, . . . , im) : i1, . . . , im∈ V are all distinct,
E(m)={i1, . . . , im} ∈ V(m):{i1, . . . , im} ∈ E,
respectively. Obviously E=SM
m=2 E(m)SM
m=2 V(m). For each node subset {i1, . . . , im} ∈ V(m),
we define the indicator variable:
Yi1,...,im={i1,...,im}∈E =
1if {i1, . . . , im}∈E,
0if {i1, . . . , im}/∈ E.
We represent a random hypergraph as Y= (Yi1,...,im)i1,...,im∈V(m),2mM.
Similar to the formulation of the stochastic blockmodel (SBM) for graphs, we assume that
the nodes in the hypergraph belong to Qunobserved groups. We use Z1, . . . , Znto denote n
independent and identically distributed latent variables, where Zifollows a prior distribution
πq=P(Zi=q)for each q= 1, . . . , Q. The values πqsatisfy πq0and PQ
q=1 πq= 1. To simplify
notation, we sometimes represent Zias a binary vector Zi= (Zi1, . . . , ZiQ)∈ {0,1}Q, where only
one element, Ziq, equals 1. We also define Z= (Z1, . . . , Zn). Every m-subset of nodes {i1, . . . , im}
in V(m)is associated to a latent configuration, namely a set {Zi1, . . . , Zim}={q1, . . . , qm}of
latent groups to which these nodes belong. The values of the latent groups within a configuration
may be repeated, so that each {q1, . . . , qm}is a multiset. Now, given the latent variables Z, all
indicator variables Yi1,...,imare assumed to be independent and to follow a Bernoulli distribution
whose parameter depends on the latent configuration:
Yi1,...,im|{Zi1=q1, . . . , Zim=qm} ∼ B(B(m,n)
q1,...,qm),for any {i1, . . . , im}∈V(m).
Here B(m,n)
q1,...,qm=B(m,n)
q1,...,qmrepresents the probability that munordered nodes, with latent configu-
ration {q1, . . . , qm}, are connected into a hyperedge. To simplify notation, we drop the superscript
(m, n). However, the model may account for 2 possible sparse settings. First, as the number of
nodes nincreases, it is natural to assume that the probability of a hyperedge may decrease;
5
摘要:

Model-basedclusteringinsimplehypergraphsthroughastochasticblockmodelLucaBrusa1andCatherineMatias21DepartmentofStatisticsandQuantitativeMethods,UniversityofMilano-Bicocca,ViaBicoccadegliArcimboldi8,20100Milano,Italy.Email:luca.brusa@unimib.it2SorbonneUniversité,UniversitédeParisCité,CentreNationaldel...

展开>> 收起<<
Model-based clustering in simple hypergraphs through a stochastic blockmodel Luca Brusa1and Catherine Matias2.pdf

共59页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:59 页 大小:1.09MB 格式:PDF 时间:2025-04-26

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 59
客服
关注