LazyFox Fast and parallelized overlapping community detection in large graphs Tim Garrels1 Athar Khodabakhsh1 Bernhard Y. Renard12 and Katharina

2025-05-03 0 0 409.76KB 17 页 10玖币

侵权投诉

LazyFox: Fast and parallelized overlapping

community detection in large graphs

Tim Garrels1, Athar Khodabakhsh1, Bernhard Y. Renard1,2, and Katharina

Baum1

1Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam,

Prof.-Dr.-Helmert Str. 2-3, 14482 Potsdam, Germany

2Department of Mathematics and Computer Science, Free University Berlin, Takustraße

9, 14195 Berlin, Germany

Corresponding author:

Katharina Baum

Email address: katharina.baum@hpi.de

ABSTRACT

The detection of communities in graph datasets provides insight about a graph’s underlying structure and

is an important tool for various domains such as social sciences, marketing, trafﬁc forecast, and drug

discovery. While most existing algorithms provide fast approaches for community detection, their results

usually contain strictly separated communities. However, most datasets would semantically allow for or

even require overlapping communities that can only be determined at much higher computational cost.

We build on an efﬁcient algorithm, FOX, that detects such overlapping communities. FOX measures the

closeness of a node to a community by approximating the count of triangles which that node forms with

that community. We propose LAZYFOX, a multi-threaded version of the FOX algorithm, which provides

even faster detection without an impact on community quality. This allows for the analyses of signiﬁcantly

larger and more complex datasets. LAZYFOX enables overlapping community detection on complex

graph datasets with millions of nodes and billions of edges in days instead of weeks. As part of this

work, LAZYFOX’s implementation was published and is available as a tool under an MIT licence at

https://github.com/TimGarrels/LazyFox.

INTRODUCTION & RELATED WORK

Graphs, also called networks, are present in many ﬁelds as they capture the interaction of entities as

edges between nodes representing those entities. Communities (or clusters, or modules) are considered

to be important network structures as they describe functionally similar groups within networks. The

identiﬁcation of such groups is relevant in various domains such as biology (Barabasi and Oltvai, 2004;

Regan et al., 2002; Boccaletti et al., 2006; Guimer

a and Amaral, 2005; Ahn et al., 2010), medicine

(Barabasi et al., 2011; Gavin et al., 2006) and technical infrastructure (Regan and Barabasi, 2003;

Guimer

a et al., 2005). In social networks, communities can identify common interest groups or friend

circles (Mcauley and Leskovec, 2014). In drug research, new, targetable proteins can be discovered by

clustering proteins into functionality groups (Ma et al., 2019). Similar products of an online shop can be

grouped together to derive product recommendations (Basuchowdhuri et al., 2014).

While community itself is a term whose precise deﬁnition is highly context dependent (Kelley et al.,

2012), various algorithms have been developed to detect such structures (Fortunato, 2010). In this work,

we focus on communities of bond. These are communities where the membership of a community is

based on the relation to other members – their connectivity in the network – rather than an afﬁnity to the

identity of that group as a whole. This naturally leads to more interconnections between group members

(Ren et al., 2007). Algorithms detecting such communities typically exploit this property by optimizing

communities so that nodes of one group are more densely connected with each other than with nodes

outside of that group.

arXiv:2210.03211v1 [cs.SI] 6 Oct 2022

Disjoint Community Detection

Disjoint communities capture structural node clusters within a network where each node belongs to

exactly one community. While algorithms based on label propagation (Raghavan et al., 2007), stochastic

blockmodels (Hofman and Wiggins, 2008; Wang and Wong, 1987) and game theory (Bu et al., 2017) for

disjoint community detection exist, the most common approach is to deﬁne a metric that measures desired

structural properties of a community (Lancichinetti et al., 2011; Yang and Leskovec, 2012; Newman,

2006; Prat-P

erez et al., 2012). By choosing communities that increase the metric over communities

that decrease it, the communities are optimized to display that structural property. This way a clear

community deﬁnition (groups of nodes that result in the highest value of the metric) is created and

detected communities are to some degree controllable in their structure.

Overlapping Community Detection

Most approaches identify communities in such a way that each node can belong to only one community.

While disjoint communities yield valuable information, most real world datasets contain functional groups

that are overlapping (Reid et al., 2011; Lancichinetti et al., 2009). People can belong to multiple social

circles, proteins can have various biological functions or be target of multiple drugs, and products rarely

fall into exactly one category. Therefore, traditional community detection algorithms have been adapted

and new algorithms have been proposed to detect overlapping communities (Fortunato, 2010). However,

overlapping community detection is a computationally more expensive problem than regular (disjoint)

community detection. Long runtimes of days or even weeks make community detection unfeasible in

large datasets, which is especially true for overlapping community detection (Lancichinetti and Fortunato,

2009; Danon et al., 2005). This limits the usability of most algorithms to smaller sized networks with

edge counts in the lower millions.

Algorithms which detect overlapping communities can use other approaches than disjoint community

detection algorithms such as clique expansion (Lee et al., 2010; Palla et al., 2005), matrix factorization

(Yang and Leskovec, 2013; Psorakis et al., 2011) and link clustering (Shi et al., 2013; Ahn et al., 2010;

Evans and Lambiotte, 2009). Some approaches also propose post-processing steps to create overlapping

communities from pre-existing disjoint communities (Chakraborty, 2015).

Moreover, approaches that have been proposed to detect disjoint communities can also be adjusted

to directly yield overlapping communities. Label propagation based algorithms can allow nodes to hold

multiple labels at the same time, thereby creating overlapping communities (Xie and Szymanski, 2012;

Gregory, 2009). Stochastic blockmodels can support mixed memberships (Airoldi et al., 2007; Gopalan

and Blei, 2013). And if an algorithm is based on the optimization of a metric measuring structural

properties, allowing nodes to join multiple communities during the optimization can also change the result

from disjoint to overlapping communities (Lancichinetti et al., 2009).

Enabling Large Scale Community Detection

Large scale datasets pose a challenge for many community detection algorithms as their runtime increases

drastically (Lancichinetti and Fortunato, 2009; Danon et al., 2005). Various approaches can be used to

improve runtime and enable the use of community detection algorithms on large datasets. Parallelization

can leverage modern multi-core CPUs, and algorithms with independent computation steps can proﬁt from

parallelization directly (Liu and Chamberlain, 2018). Other algorithms cannot be parallelized without

logical changes, however, small alterations often do not inﬂuence the results substantially (Prat-P

erez

et al., 2014). Such parallelized approaches can also be distributed to a multi-machine computing cluster,

enabling the use of even more computational resources (Saltz et al., 2015). Finally, algorithms relying

on metric optimization can also be improved in runtime by replacing the metric with an estimator. This

reduces computational effort and speeds up the community detection.

Weighted Community Clustering Score

The Weighted Community Clustering (

WCC

) score (Prat-

erez et al., 2012) is a metric that has been successfully used to detect communities (Saltz et al., 2015;

Song et al., 2015; Prat-P

erez et al., 2014). It measures structural closeness by counting the number of

triangles a node forms with a community to determine the membership of that node to that community.

Thereby, it is closely related to the general concept of clustering coefﬁcients. In Methods we explain

this metric in more detail. Community detection algorithms based on this metric optimize communities

to maximize the global

WCC

score by allowing nodes to individually join or leave communities. Such

2/17

algorithms can yield disjoint or overlapping communities depending on whether nodes are allowed to join

multiple communities.

Efﬁcient WCC-based large scale overlapping community detection WCC

-based overlapping com-

munity detection approaches also suffer from computational complexity issues when dealing with large

scale datasets. Thus, parallelized (Prat-P

erez et al., 2014) and distributed (Saltz et al., 2015)

WCC

versions

have been proposed that allow multiple nodes at the same time to decide for changes to the community

partition, so those decisions can be computed on separate threads or machines. Furthermore, computation-

ally less expensive estimations of the

WCC

score have been suggested (Prat-P

erez et al., 2014; Lyu et al.,

2020) to further improve runtime. Thereby, Lyu and colleagues have found that approximating triangle

counts with their metric,

[

WCC

, in an algorithm they call FOX yields very similar results as calculating

the exact count. While their proposed estimation allows FOX to compute overlapping communities on

large networks with millions of edges, the implementation of FOX was not published making it hard to

reproduce the results or to use them for further applications.

LAZYFOX- Parallelized Large Scale Overlapping Community Detection

We here propose our tool

for overlapping community detection, LAZYFOX, that contains an implementation of the FOX algorithm.

In addition, we extend the work of Lyu and colleagues and combine it with ideas from (Prat-P

erez et al.,

2014) by introducing parallelism to leverage the advantage of modern multiprocessor systems to even

further accelerate the algorithm.

We show that LAZYFOX produces extremely similar results to FOX in a fraction of FOX’s runtime,

making it more efﬁcient to work with. This also enables the usage of the approach on large scale graphs

like the social media network Friendster with millions of nodes and billions of edges (Leskovec and

Krevl, 2014). We analyse the impact of hyperparameters of the FOX and LAZYFOX algorithm on real-

world examples. The C++ implementation of LAZYFOX (which also includes a FOX implementation) is

published as open-source under MIT licence.

In the following sections we introduce the datasets, the algorithmic details of FOX and LAZYFOX

and the methods used to evaluate them. We describe the impact of our LAZYFOX contribution on the

runtime and community quality of three datasets. We include the analysis of a fourth dataset to illustrate

the ability of LAZYFOX to handle large scale datasets in contrast to FOX. Finally, we summarize and

discuss our results.

METHODS

We introduce LAZYFOX, with its required input data and preprocessing, the employed metric

[

WCC

, and

the performed steps in the algorithm. We describe the parallelization that sets LAZYFOX apart from FOX

(Lyu et al., 2020) and allows LAZYFOX to scale across multiple CPU cores, and ﬁnally the performance

measures employed here.

Input Data

Our algorithm LAZYFOX takes the edge list of an undirected, unweighted graph

G(V,E)

as input,

denoting the nodes and

the edges of the graph. We assume this graph to be connected for the theoretical

discussions, however, LAZYFOX will still work on a graph with multiple connected components.

Preprocessing

LAZYFOX performs the following preprocessing steps on the input edge list of G:

Remove Multi-Edges

In graphs a pair of nodes can be connected via multiple edges. Since our research

on FOX and LAZYFOX works on “by-bond” communities, we regard multiple connections between two

entities as one. Therefore, we remove such duplicated edges, turning any input multi-graph into a simple

graph.

Dense Node Labels

The node identiﬁers (IDs) in edge lists are not necessarily consecutive due to

preprocessing (this is the case in most datasets introduced in the ’Datasets’ section). We apply a shift to

the node IDs to restore their consecutive nature. This allows both FOX and LAZYFOX to use the node IDs

directly as data-structure indices which speeds up computation.

3/17

Node Order

As LAZYFOX computes communities by gradually improving node memberships, there

needs to be a well deﬁned order in which the nodes are being processed. We deﬁne this order by sorting

them by decreasing clustering coefﬁcient (CC):

CC(i) = (ki

di·(di−1)/2,if di>1,

0,else

denotes the number of triangles containing node

(equal to the number of edges between two neighbors

of node

), while

denotes the degree of node

. A higher CC indicates that a node is central in its

neighborhood, and we process more central nodes ﬁrst. Ties are resolved by node degrees in decreasing

order. This order ensures that nodes with less connectivity and thus less inﬂuence can adapt to changes in

memberships of more important, connected nodes resulting in a more coherent community structure at

the end of one iteration.

Initial Clustering

To initiate the clustering process LAZYFOX computes a greedy, non-overlapping

community decomposition using the above node order. A node that is not yet part of a community is

assigned to a new community. Then all its not yet assigned neighbors join this community.

This process allows for self-initialization on any given network. LAZYFOX derives the initial

community count and the initial clustering from the structure of the underlying network. On the other

hand, LAZYFOX is, just like FOX (Lyu et al., 2020), also able to improve upon an existing division into

communities by replacing the initial clustering step by inserting those existing divisions. Therefore,

known structural properties can be taken into account. This way LAZYFOX can be used to generate

overlapping community structure from partitions, i.e. from non-overlapping community structure. See

Lyu et al. (2020) for a discussion and examples.

[

WCC– A Metric to Optimize for

Lyu and colleagues introduced

[

WCC

as an advanced metric to assess the quality of a partition into

communities. This metric forms the core of both the FOX and the LAZYFOX algorithm as they use

[

WCC

to decide on the necessary local optimization steps. To provide a better understanding of

[

WCC

we will

ﬁrst describe WCC.

WCC

The weighted clustering coefﬁcient

WCC

as introduced by Prat-P

erez et al. (2012) is a score that

rates a community decomposition as the sum of its community ratings. We denote such a rating of a

community decomposition P={C1,...,Ck}as

WCC(P) =

∑

i=1

WCC(Ci)

This rating of an individual community

can be again decomposed into how well the individual nodes of

that community ﬁt into the community:

WCC(Ci) = ∑

x∈Ci

WCC(x,Ci)

Such a ﬁt of a node

into its community

is assessed by the ratio of

t(x,C)

t(x,V)

: The number

of triangles that the node forms within its community to the number of triangles that it forms within

the whole graph. The assessment also uses

vt(x,C)

, the number of nodes in the community

that form

triangles with node

. Furthermore,

vt(x,V\C)

, the number of nodes outside of the community

that

form triangles with node x, inﬂuences this assessment:

WCC(x,C) = (t(x,C)

t(x,V)·vt(x,V)

|C\{x}|+vt(x,V\C),if t(x,V)>0,

0 else

Heuristic Counting of Triangles

The counting of triangles for the

WCC

computation is expensive.

To accelerate the evaluation of a new community decomposition, the exact count can be replaced by a

heuristic. Rather than counting the triangles a node forms with the nodes of a community

and all other

nodes, [

WCC (Lyu et al., 2020) approximates this by using the number of edges instead.

4/17

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

LazyFox:FastandparallelizedoverlappingcommunitydetectioninlargegraphsTimGarrels1,AtharKhodabakhsh1,BernhardY.Renard1,2,andKatharinaBaum11HassoPlattnerInstitute,DigitalEngineeringFaculty,UniversityofPotsdam,Prof.-Dr.-HelmertStr.2-3,14482Potsdam,Germany2DepartmentofMathematicsandComputerScience,FreeUn...

展开>> 收起<<

LazyFox Fast and parallelized overlapping community detection in large graphs Tim Garrels1 Athar Khodabakhsh1 Bernhard Y. Renard12 and Katharina.pdf

共17页,预览4页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

LazyFox Fast and parallelized overlapping community detection in large graphs Tim Garrels1 Athar Khodabakhsh1 Bernhard Y. Renard12 and Katharina

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: