Model-based clustering in simple hypergraphs through a stochastic blockmodel Luca Brusa1and Catherine Matias2

2025-04-26 1 0 1.09MB 59 页 10玖币

侵权投诉

Model-based clustering in simple hypergraphs through a

stochastic blockmodel

Luca Brusa1and Catherine Matias2

1Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Via

Bicocca degli Arcimboldi 8, 20100 Milano, Italy. Email: luca.brusa@unimib.it

2Sorbonne Université, Université de Paris Cité, Centre National de la Recherche

Scientiﬁque, Laboratoire de Probabilités, Statistique et Modélisation, 4 place Jussieu,

75252 PARIS Cedex 05, France. Email: catherine.matias@cnrs.fr

Abstract

We propose a model to address the overlooked problem of node clustering in simple hy-

pergraphs. Simple hypergraphs are suitable when a node may not appear multiple times in

the same hyperedge, such as in co-authorship datasets. Our model generalizes the stochas-

tic blockmodel for graphs and assumes the existence of latent node groups and hyperedges

are conditionally independent given these groups. We ﬁrst establish the generic identiﬁa-

bility of the model parameters. We then develop a variational approximation Expectation-

Maximization algorithm for parameter inference and node clustering, and derive a statistical

criterion for model selection.

To illustrate the performance of our Rpackage HyperSBM, we compare it with other node

clustering methods using synthetic data generated from the model, as well as from a line

clustering experiment and a co-authorship dataset.

Keywords: co-authorship network, high-order interactions, latent variable model, line

clustering, non-uniform hypergraph, variational expectation-maximization.

1 Introduction

Over the past two decades, a wide range of models has been developed to capture pairwise

interactions represented in graphs. However, modern applications in various ﬁelds have high-

lighted the necessity to consider high-order interactions, which involve groups of three or more

nodes. Simple examples include triadic and larger group interactions in social networks (whose

importance has been recognized early on, see Wolﬀ, 1950), scientiﬁc co-authorship (Estrada and

Rodríguez-Velázquez, 2006), interactions among more than two species in ecological systems

arXiv:2210.05983v3 [stat.ME] 17 May 2024

(Muyinda et al., 2020; Singh and Baruah, 2021), or high-order correlations between neurons in

brain networks (Chelaru et al., 2021). To formalize these high-order interactions, hypergraphs

provide the most general framework. Similar to a graph, a hypergraph consists of a set of nodes

and a set of hyperedges, where each hyperedge is a subset of nodes involved in an interaction.

In this context, it is important to distinguish simple hypergraphs from multiset hypergraphs,

where hyperedges can contain repeated nodes. Multisets are a generalization of sets, allowing

elements to appear with varying multiplicities. Recent reviews on high-order interactions can be

found in the works of Battiston et al. (2020), Bick et al. (2023), and Torres et al. (2021).

Despite the increasing interest in high-order interactions, the statistical literature on this topic

remains limited. Some graph-based statistics, such as centrality or clustering coeﬃcient, have

been extended to hypergraphs to aid in understanding the structure and extracting information

from the data (Estrada and Rodríguez-Velázquez, 2006). However, these statistics do not fulﬁll

the need for random hypergraph models.

Early analyses of hypergraphs have relied on their embedding into the space of bipartite graphs

(see, e.g., Battiston et al., 2020). Hypergraphs with self-loops and multiple hyperedges (weighted

hyperedges with integer-valued weights) are equivalent to bipartite graphs. However, bipartite

graph models were not speciﬁcally designed for hypergraphs and may introduce artifacts; we refer

to Section A in the Supplementary Material for more details.

Generalizing Erdős-Rényi’s model of random graphs leads to uniformly random hypergraphs.

This model involves drawing uniformly at random from the set of all m-uniform hypergraphs

(hypergraphs with hyperedges of ﬁxed cardinality m) over a set of nnodes. However, similar

to Erdős-Rényi’s model for graphs, this hypergraph model is too simplistic and homogeneous to

be used for statistical analysis of real-world datasets. In the conﬁguration model for random

graphs, the graphs are generated by drawing uniformly at random from the set of all possible

graphs over a set of nnodes, while satisfying a given prescribed degree sequence. In the context

of hypergraphs, conﬁguration models were proposed in Ghoshal et al. (2009), focusing on tripar-

tite and 3-uniform hypergraphs. Later, Chodrow (2020) extended the conﬁguration model to a

more general hypergraph setup. In these references, both the node degrees and the hyperedge

sizes are kept ﬁxed (a consequence of the fact that they rely on bipartite representations of hy-

pergraphs). The conﬁguration model is useful for sampling (hyper)graphs with the same degree

sequence (and hyperedge sizes) as an observed dataset through shuﬄing algorithms. Therefore,

it is often employed as a null model in statistical analyses. However, sampling exactly (rather

than approximately) from this model poses challenges, particularly in the case of hypergraphs.

For a comprehensive discussion on this issue, we refer readers to Section 4 in Chodrow (2020).

Another popular approach for extracting information from heterogeneous data is clustering.

In the context of graphs, stochastic blockmodels (SBMs) were introduced in the early eighties

(Frank and Harary, 1982; Holland et al., 1983) and have since evolved in various directions. These

models assume that nodes are grouped into clusters, and the probabilities of connections between

nodes are determined by their cluster memberships. Variants of SBMs have been developed to

handle weighted graphs and degree-corrected versions, among others. In the context of hyper-

graphs, Ghoshdastidar and Dukkipati (2017) studied a spectral clustering approach based on a

hypergraph Laplacian, and obtained its weak consistency under a Hypergraph SBM (HSBM) un-

der certain restrictions on the model parameters. More recently, Deng et al. (2024) established the

strong consistency of the basic spectral clustering under the degree-corrected HSBM (DCHSBM)

in the sparse regime where the maximum expected hyperdegree might be of order Ω(log n)and

nis the number of nodes. By introducing hypergraphons, Balasubramanian (2021) extended

the ideas of hypergraph SBMs to a nonparametric setting. In a parallel vein, Turnbull et al.

(2021) proposed a latent space model for hypergraphs, generalizing random geometric graphs to

hypergraphs, although it was not speciﬁcally designed to capture clustering. An approach linked

to SBMs is presented in Vazquez (2009), where nodes belong to latent groups and participate in

a hyperedge with a probability that depends on both their group and the speciﬁc hyperedge.

Modularity is a widely used criterion for clustering entities in the context of interaction

data. It aims to identify speciﬁc clusters, known as communities, characterized by high within-

group connection probabilities and low between-group connection probabilities (Ghoshdastidar

and Dukkipati, 2014). However, in the hypergraph context, the deﬁnition of modularity is not

unique. In particular, Kamiński et al. (2019) introduced a “strict” modularity criterion, where only

hyperedges with all their nodes belonging to the same group contribute to an increase in modular-

ity. Their criterion measures the deviation of the number of these homogeneous hyperedges from

a new null model called the conﬁguration-like model for hypergraphs, where the average values

of the degrees are ﬁxed. Building upon this, Chodrow et al. (2021) proposed a degree-corrected

hypergraph SBM and introduced two new modularity criteria. Similar to Kamiński et al. (2019),

one of these criteria utilizes an “all-or-nothing” aﬃnity function that distinguishes whether a given

hyperedge is entirely contained within a single cluster or not. In this setup, they established a

connection between approximate maximum likelihood estimation and their modularity criterion.

This work is reminiscent of the work of Newman (2016) in the graph context. However, the

estimators proposed by Chodrow et al. (2021) do not guarantee maximum likelihood estimation,

as the parameter space is constrained by assuming a symmetric aﬃnity function. We refer to

Poda and Matias (2024) for an empirical comparison of these modularity-based methods.

It is important to highlight that the developments presented in Kamiński et al. (2019) and

Chodrow et al. (2021) are speciﬁcally conducted in the context of multiset hypergraphs, where

hyperedges can contain repeated nodes with certain multiplicities. The use of multiset hyper-

graphs simpliﬁes some of the challenges associated with computing modularity. However, to the

best of our knowledge, modularity approaches still lack instantiation in the case of simple hyper-

graphs where each node can only appear once in a hyperedge. More speciﬁcally, the null model

used in hypergraph modularity criteria relies on a model for multiset hypergraphs, similar to how

the null model used in classical graph modularity is based on graphs with self-loops. While it is

known in the case of graphs that this assumption is inadequate, as it induces a stronger deviation

than expected and aﬀects sparse networks as well (Massen and Doye, 2005; Caﬁeri et al., 2010;

Squartini and Garlaschelli, 2011), the assumption of multisets has not yet been discussed in the

context of hypergraph modularity.

In the context of community detection, random walk approaches have also been utilized for

hypergraph clustering (Swan and Zhan, 2021). Additionally, low-rank tensor decompositions have

been explored (Ke et al., 2020). The misclassiﬁcation rate for the community detection problem

in hypergraphs and its limits have been analyzed in various contexts (see, for instance, Ahn et al.,

2018; Chien et al., 2019; Cole and Zhu, 2020). It is worth mentioning that a recent approach

has been proposed to cluster hyperedges instead of nodes (Ng and Murphy, 2022). However, our

focus in this work is on clustering nodes.

The literature on high-order interactions often discusses simplicial complexes alongside hy-

pergraphs (Battiston et al., 2020). However, the unique characteristic of simplicial complexes,

where each subset of an occurring interaction should also occur, places them outside the scope of

this introduction, which is speciﬁcally focused on hypergraphs.

In this article, our focus is on model-based clustering for simple hypergraphs, speciﬁcally

studying stochastic hypergraph blockmodels. We formulate a general stochastic blockmodel for

simple hypergraphs, along with various submodels (Section 2.1). We provide the ﬁrst result on the

generic identiﬁability of parameters in a hypergraph stochastic blockmodel (Section 2.2). Param-

eter inference and node clustering are performed using a variational Expectation-Maximization

(VEM) algorithm (Section 2.3) that approximates the maximum likelihood estimator. Model se-

lection for the number of groups is based on an integrated classiﬁcation likelihood (ICL) criterion

(Section 2.4). To illustrate the performance of our method, we conduct experiments on syn-

thetic sparse hypergraphs, including a comparison with hypergraph spectral clustering (HSC)

and modularity approaches (Section 3). Notably, the line clustering experiment (Section 3.4)

highlights the signiﬁcant diﬀerences between our approach and the one proposed by Chodrow

et al. (2021). We also analyze a co-authorship dataset, presenting conclusions that diﬀer from

spectral clustering and bipartite stochastic blockmodels (Section 4). We discuss (Section 5) our

approach, its advantages, current limitations and possible extensions. An Rpackage, HyperSBM,

which implements our method in eﬃcient C++ code, as well as all associated scripts, are available

online (see Section 6). This manuscript is accompanied by a Supplementary Material (SM) that

contains additional information and experiments, as well as the proofs of all theoretical results.

2 A stochastic blockmodel for hypergraphs

2.1 Model formulation

Let H= (V,E)represent a binary hypergraph, where V={1, . . . , n}is a set of nnodes and E

is the set of hyperedges. In this context, a hyperedge of size m≥2is deﬁned as a collection

of mdistinct nodes from V. We do not allow for hyperedges to be multisets or self-loops. Let

M= max

e∈E |e|denote the largest possible size of hyperedges in E, with M≥2(for graphs, M= 2).

We deﬁne the sets of (unordered) node subsets, (ordered) node tuples, and hyperedges of size m

V(m)={i1, . . . , im}:i1, . . . , im∈ V are all distinct,

Vm=(i1, . . . , im) : i1, . . . , im∈ V are all distinct,

E(m)={i1, . . . , im} ∈ V(m):{i1, . . . , im} ∈ E,

respectively. Obviously E=SM

m=2 E(m)⊆SM

m=2 V(m). For each node subset {i1, . . . , im} ∈ V(m),

we deﬁne the indicator variable:

Yi1,...,im={i1,...,im}∈E =





1if {i1, . . . , im}∈E,

0if {i1, . . . , im}/∈ E.

We represent a random hypergraph as Y= (Yi1,...,im)i1,...,im∈V(m),2≤m≤M.

Similar to the formulation of the stochastic blockmodel (SBM) for graphs, we assume that

the nodes in the hypergraph belong to Qunobserved groups. We use Z1, . . . , Znto denote n

independent and identically distributed latent variables, where Zifollows a prior distribution

πq=P(Zi=q)for each q= 1, . . . , Q. The values πqsatisfy πq≥0and PQ

q=1 πq= 1. To simplify

notation, we sometimes represent Zias a binary vector Zi= (Zi1, . . . , ZiQ)∈ {0,1}Q, where only

one element, Ziq, equals 1. We also deﬁne Z= (Z1, . . . , Zn). Every m-subset of nodes {i1, . . . , im}

in V(m)is associated to a latent conﬁguration, namely a set {Zi1, . . . , Zim}={q1, . . . , qm}of

latent groups to which these nodes belong. The values of the latent groups within a conﬁguration

may be repeated, so that each {q1, . . . , qm}is a multiset. Now, given the latent variables Z, all

indicator variables Yi1,...,imare assumed to be independent and to follow a Bernoulli distribution

whose parameter depends on the latent conﬁguration:

Yi1,...,im|{Zi1=q1, . . . , Zim=qm} ∼ B(B(m,n)

q1,...,qm),for any {i1, . . . , im}∈V(m).

Here B(m,n)

q1,...,qm=B(m,n)

q1,...,qmrepresents the probability that munordered nodes, with latent conﬁgu-

ration {q1, . . . , qm}, are connected into a hyperedge. To simplify notation, we drop the superscript

(m, n). However, the model may account for 2 possible sparse settings. First, as the number of

nodes nincreases, it is natural to assume that the probability of a hyperedge may decrease;

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

Model-basedclusteringinsimplehypergraphsthroughastochasticblockmodelLucaBrusa1andCatherineMatias21DepartmentofStatisticsandQuantitativeMethods,UniversityofMilano-Bicocca,ViaBicoccadegliArcimboldi8,20100Milano,Italy.Email:luca.brusa@unimib.it2SorbonneUniversité,UniversitédeParisCité,CentreNationaldel...

展开>> 收起<<

Model-based clustering in simple hypergraphs through a stochastic blockmodel Luca Brusa1and Catherine Matias2.pdf

共59页,预览5页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

Model-based clustering in simple hypergraphs through a stochastic blockmodel Luca Brusa1and Catherine Matias2

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: