LazyFox Fast and parallelized overlapping community detection in large graphs Tim Garrels1 Athar Khodabakhsh1 Bernhard Y. Renard12 and Katharina

2025-05-03 0 0 409.76KB 17 页 10玖币
侵权投诉
LazyFox: Fast and parallelized overlapping
community detection in large graphs
Tim Garrels1, Athar Khodabakhsh1, Bernhard Y. Renard1,2, and Katharina
Baum1
1Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam,
Prof.-Dr.-Helmert Str. 2-3, 14482 Potsdam, Germany
2Department of Mathematics and Computer Science, Free University Berlin, Takustraße
9, 14195 Berlin, Germany
Corresponding author:
Katharina Baum
Email address: katharina.baum@hpi.de
ABSTRACT
The detection of communities in graph datasets provides insight about a graph’s underlying structure and
is an important tool for various domains such as social sciences, marketing, traffic forecast, and drug
discovery. While most existing algorithms provide fast approaches for community detection, their results
usually contain strictly separated communities. However, most datasets would semantically allow for or
even require overlapping communities that can only be determined at much higher computational cost.
We build on an efficient algorithm, FOX, that detects such overlapping communities. FOX measures the
closeness of a node to a community by approximating the count of triangles which that node forms with
that community. We propose LAZYFOX, a multi-threaded version of the FOX algorithm, which provides
even faster detection without an impact on community quality. This allows for the analyses of significantly
larger and more complex datasets. LAZYFOX enables overlapping community detection on complex
graph datasets with millions of nodes and billions of edges in days instead of weeks. As part of this
work, LAZYFOXs implementation was published and is available as a tool under an MIT licence at
https://github.com/TimGarrels/LazyFox.
INTRODUCTION & RELATED WORK
Graphs, also called networks, are present in many fields as they capture the interaction of entities as
edges between nodes representing those entities. Communities (or clusters, or modules) are considered
to be important network structures as they describe functionally similar groups within networks. The
identification of such groups is relevant in various domains such as biology (Barabasi and Oltvai, 2004;
Regan et al., 2002; Boccaletti et al., 2006; Guimer
`
a and Amaral, 2005; Ahn et al., 2010), medicine
(Barabasi et al., 2011; Gavin et al., 2006) and technical infrastructure (Regan and Barabasi, 2003;
Guimer
`
a et al., 2005). In social networks, communities can identify common interest groups or friend
circles (Mcauley and Leskovec, 2014). In drug research, new, targetable proteins can be discovered by
clustering proteins into functionality groups (Ma et al., 2019). Similar products of an online shop can be
grouped together to derive product recommendations (Basuchowdhuri et al., 2014).
While community itself is a term whose precise definition is highly context dependent (Kelley et al.,
2012), various algorithms have been developed to detect such structures (Fortunato, 2010). In this work,
we focus on communities of bond. These are communities where the membership of a community is
based on the relation to other members – their connectivity in the network – rather than an affinity to the
identity of that group as a whole. This naturally leads to more interconnections between group members
(Ren et al., 2007). Algorithms detecting such communities typically exploit this property by optimizing
communities so that nodes of one group are more densely connected with each other than with nodes
outside of that group.
arXiv:2210.03211v1 [cs.SI] 6 Oct 2022
Disjoint Community Detection
Disjoint communities capture structural node clusters within a network where each node belongs to
exactly one community. While algorithms based on label propagation (Raghavan et al., 2007), stochastic
blockmodels (Hofman and Wiggins, 2008; Wang and Wong, 1987) and game theory (Bu et al., 2017) for
disjoint community detection exist, the most common approach is to define a metric that measures desired
structural properties of a community (Lancichinetti et al., 2011; Yang and Leskovec, 2012; Newman,
2006; Prat-P
´
erez et al., 2012). By choosing communities that increase the metric over communities
that decrease it, the communities are optimized to display that structural property. This way a clear
community definition (groups of nodes that result in the highest value of the metric) is created and
detected communities are to some degree controllable in their structure.
Overlapping Community Detection
Most approaches identify communities in such a way that each node can belong to only one community.
While disjoint communities yield valuable information, most real world datasets contain functional groups
that are overlapping (Reid et al., 2011; Lancichinetti et al., 2009). People can belong to multiple social
circles, proteins can have various biological functions or be target of multiple drugs, and products rarely
fall into exactly one category. Therefore, traditional community detection algorithms have been adapted
and new algorithms have been proposed to detect overlapping communities (Fortunato, 2010). However,
overlapping community detection is a computationally more expensive problem than regular (disjoint)
community detection. Long runtimes of days or even weeks make community detection unfeasible in
large datasets, which is especially true for overlapping community detection (Lancichinetti and Fortunato,
2009; Danon et al., 2005). This limits the usability of most algorithms to smaller sized networks with
edge counts in the lower millions.
Algorithms which detect overlapping communities can use other approaches than disjoint community
detection algorithms such as clique expansion (Lee et al., 2010; Palla et al., 2005), matrix factorization
(Yang and Leskovec, 2013; Psorakis et al., 2011) and link clustering (Shi et al., 2013; Ahn et al., 2010;
Evans and Lambiotte, 2009). Some approaches also propose post-processing steps to create overlapping
communities from pre-existing disjoint communities (Chakraborty, 2015).
Moreover, approaches that have been proposed to detect disjoint communities can also be adjusted
to directly yield overlapping communities. Label propagation based algorithms can allow nodes to hold
multiple labels at the same time, thereby creating overlapping communities (Xie and Szymanski, 2012;
Gregory, 2009). Stochastic blockmodels can support mixed memberships (Airoldi et al., 2007; Gopalan
and Blei, 2013). And if an algorithm is based on the optimization of a metric measuring structural
properties, allowing nodes to join multiple communities during the optimization can also change the result
from disjoint to overlapping communities (Lancichinetti et al., 2009).
Enabling Large Scale Community Detection
Large scale datasets pose a challenge for many community detection algorithms as their runtime increases
drastically (Lancichinetti and Fortunato, 2009; Danon et al., 2005). Various approaches can be used to
improve runtime and enable the use of community detection algorithms on large datasets. Parallelization
can leverage modern multi-core CPUs, and algorithms with independent computation steps can profit from
parallelization directly (Liu and Chamberlain, 2018). Other algorithms cannot be parallelized without
logical changes, however, small alterations often do not influence the results substantially (Prat-P
´
erez
et al., 2014). Such parallelized approaches can also be distributed to a multi-machine computing cluster,
enabling the use of even more computational resources (Saltz et al., 2015). Finally, algorithms relying
on metric optimization can also be improved in runtime by replacing the metric with an estimator. This
reduces computational effort and speeds up the community detection.
Weighted Community Clustering Score
The Weighted Community Clustering (
WCC
) score (Prat-
P
´
erez et al., 2012) is a metric that has been successfully used to detect communities (Saltz et al., 2015;
Song et al., 2015; Prat-P
´
erez et al., 2014). It measures structural closeness by counting the number of
triangles a node forms with a community to determine the membership of that node to that community.
Thereby, it is closely related to the general concept of clustering coefficients. In Methods we explain
this metric in more detail. Community detection algorithms based on this metric optimize communities
to maximize the global
WCC
score by allowing nodes to individually join or leave communities. Such
2/17
algorithms can yield disjoint or overlapping communities depending on whether nodes are allowed to join
multiple communities.
Efficient WCC-based large scale overlapping community detection WCC
-based overlapping com-
munity detection approaches also suffer from computational complexity issues when dealing with large
scale datasets. Thus, parallelized (Prat-P
´
erez et al., 2014) and distributed (Saltz et al., 2015)
WCC
versions
have been proposed that allow multiple nodes at the same time to decide for changes to the community
partition, so those decisions can be computed on separate threads or machines. Furthermore, computation-
ally less expensive estimations of the
WCC
score have been suggested (Prat-P
´
erez et al., 2014; Lyu et al.,
2020) to further improve runtime. Thereby, Lyu and colleagues have found that approximating triangle
counts with their metric,
[
WCC
, in an algorithm they call FOX yields very similar results as calculating
the exact count. While their proposed estimation allows FOX to compute overlapping communities on
large networks with millions of edges, the implementation of FOX was not published making it hard to
reproduce the results or to use them for further applications.
LAZYFOX- Parallelized Large Scale Overlapping Community Detection
We here propose our tool
for overlapping community detection, LAZYFOX, that contains an implementation of the FOX algorithm.
In addition, we extend the work of Lyu and colleagues and combine it with ideas from (Prat-P
´
erez et al.,
2014) by introducing parallelism to leverage the advantage of modern multiprocessor systems to even
further accelerate the algorithm.
We show that LAZYFOX produces extremely similar results to FOX in a fraction of FOXs runtime,
making it more efficient to work with. This also enables the usage of the approach on large scale graphs
like the social media network Friendster with millions of nodes and billions of edges (Leskovec and
Krevl, 2014). We analyse the impact of hyperparameters of the FOX and LAZYFOX algorithm on real-
world examples. The C++ implementation of LAZYFOX (which also includes a FOX implementation) is
published as open-source under MIT licence.
In the following sections we introduce the datasets, the algorithmic details of FOX and LAZYFOX
and the methods used to evaluate them. We describe the impact of our LAZYFOX contribution on the
runtime and community quality of three datasets. We include the analysis of a fourth dataset to illustrate
the ability of LAZYFOX to handle large scale datasets in contrast to FOX. Finally, we summarize and
discuss our results.
METHODS
We introduce LAZYFOX, with its required input data and preprocessing, the employed metric
[
WCC
, and
the performed steps in the algorithm. We describe the parallelization that sets LAZYFOX apart from FOX
(Lyu et al., 2020) and allows LAZYFOX to scale across multiple CPU cores, and finally the performance
measures employed here.
Input Data
Our algorithm LAZYFOX takes the edge list of an undirected, unweighted graph
G(V,E)
as input,
V
denoting the nodes and
E
the edges of the graph. We assume this graph to be connected for the theoretical
discussions, however, LAZYFOX will still work on a graph with multiple connected components.
Preprocessing
LAZYFOX performs the following preprocessing steps on the input edge list of G:
Remove Multi-Edges
In graphs a pair of nodes can be connected via multiple edges. Since our research
on FOX and LAZYFOX works on “by-bond” communities, we regard multiple connections between two
entities as one. Therefore, we remove such duplicated edges, turning any input multi-graph into a simple
graph.
Dense Node Labels
The node identifiers (IDs) in edge lists are not necessarily consecutive due to
preprocessing (this is the case in most datasets introduced in the ’Datasets’ section). We apply a shift to
the node IDs to restore their consecutive nature. This allows both FOX and LAZYFOX to use the node IDs
directly as data-structure indices which speeds up computation.
3/17
Node Order
As LAZYFOX computes communities by gradually improving node memberships, there
needs to be a well defined order in which the nodes are being processed. We define this order by sorting
them by decreasing clustering coefficient (CC):
CC(i) = (ki
di·(di1)/2,if di>1,
0,else
ki
denotes the number of triangles containing node
i
(equal to the number of edges between two neighbors
of node
i
), while
di
denotes the degree of node
i
. A higher CC indicates that a node is central in its
neighborhood, and we process more central nodes first. Ties are resolved by node degrees in decreasing
order. This order ensures that nodes with less connectivity and thus less influence can adapt to changes in
memberships of more important, connected nodes resulting in a more coherent community structure at
the end of one iteration.
Initial Clustering
To initiate the clustering process LAZYFOX computes a greedy, non-overlapping
community decomposition using the above node order. A node that is not yet part of a community is
assigned to a new community. Then all its not yet assigned neighbors join this community.
This process allows for self-initialization on any given network. LAZYFOX derives the initial
community count and the initial clustering from the structure of the underlying network. On the other
hand, LAZYFOX is, just like FOX (Lyu et al., 2020), also able to improve upon an existing division into
communities by replacing the initial clustering step by inserting those existing divisions. Therefore,
known structural properties can be taken into account. This way LAZYFOX can be used to generate
overlapping community structure from partitions, i.e. from non-overlapping community structure. See
Lyu et al. (2020) for a discussion and examples.
[
WCC– A Metric to Optimize for
Lyu and colleagues introduced
[
WCC
as an advanced metric to assess the quality of a partition into
communities. This metric forms the core of both the FOX and the LAZYFOX algorithm as they use
[
WCC
to decide on the necessary local optimization steps. To provide a better understanding of
[
WCC
we will
first describe WCC.
WCC
The weighted clustering coefficient
WCC
as introduced by Prat-P
´
erez et al. (2012) is a score that
rates a community decomposition as the sum of its community ratings. We denote such a rating of a
community decomposition P={C1,...,Ck}as
WCC(P) =
k
i=1
WCC(Ci)
This rating of an individual community
Ci
can be again decomposed into how well the individual nodes of
that community fit into the community:
WCC(Ci) =
xCi
WCC(x,Ci)
Such a fit of a node
x
into its community
C
is assessed by the ratio of
t(x,C)
to
t(x,V)
: The number
of triangles that the node forms within its community to the number of triangles that it forms within
the whole graph. The assessment also uses
vt(x,C)
, the number of nodes in the community
C
that form
triangles with node
x
. Furthermore,
vt(x,V\C)
, the number of nodes outside of the community
C
that
form triangles with node x, influences this assessment:
WCC(x,C) = (t(x,C)
t(x,V)·vt(x,V)
|C\{x}|+vt(x,V\C),if t(x,V)>0,
0 else
Heuristic Counting of Triangles
The counting of triangles for the
WCC
computation is expensive.
To accelerate the evaluation of a new community decomposition, the exact count can be replaced by a
heuristic. Rather than counting the triangles a node forms with the nodes of a community
C
and all other
nodes, [
WCC (Lyu et al., 2020) approximates this by using the number of edges instead.
4/17
摘要:

LazyFox:FastandparallelizedoverlappingcommunitydetectioninlargegraphsTimGarrels1,AtharKhodabakhsh1,BernhardY.Renard1,2,andKatharinaBaum11HassoPlattnerInstitute,DigitalEngineeringFaculty,UniversityofPotsdam,Prof.-Dr.-HelmertStr.2-3,14482Potsdam,Germany2DepartmentofMathematicsandComputerScience,FreeUn...

展开>> 收起<<
LazyFox Fast and parallelized overlapping community detection in large graphs Tim Garrels1 Athar Khodabakhsh1 Bernhard Y. Renard12 and Katharina.pdf

共17页,预览4页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:17 页 大小:409.76KB 格式:PDF 时间:2025-05-03

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 17
客服
关注