Disjoint Community Detection
Disjoint communities capture structural node clusters within a network where each node belongs to
exactly one community. While algorithms based on label propagation (Raghavan et al., 2007), stochastic
blockmodels (Hofman and Wiggins, 2008; Wang and Wong, 1987) and game theory (Bu et al., 2017) for
disjoint community detection exist, the most common approach is to define a metric that measures desired
structural properties of a community (Lancichinetti et al., 2011; Yang and Leskovec, 2012; Newman,
2006; Prat-P
´
erez et al., 2012). By choosing communities that increase the metric over communities
that decrease it, the communities are optimized to display that structural property. This way a clear
community definition (groups of nodes that result in the highest value of the metric) is created and
detected communities are to some degree controllable in their structure.
Overlapping Community Detection
Most approaches identify communities in such a way that each node can belong to only one community.
While disjoint communities yield valuable information, most real world datasets contain functional groups
that are overlapping (Reid et al., 2011; Lancichinetti et al., 2009). People can belong to multiple social
circles, proteins can have various biological functions or be target of multiple drugs, and products rarely
fall into exactly one category. Therefore, traditional community detection algorithms have been adapted
and new algorithms have been proposed to detect overlapping communities (Fortunato, 2010). However,
overlapping community detection is a computationally more expensive problem than regular (disjoint)
community detection. Long runtimes of days or even weeks make community detection unfeasible in
large datasets, which is especially true for overlapping community detection (Lancichinetti and Fortunato,
2009; Danon et al., 2005). This limits the usability of most algorithms to smaller sized networks with
edge counts in the lower millions.
Algorithms which detect overlapping communities can use other approaches than disjoint community
detection algorithms such as clique expansion (Lee et al., 2010; Palla et al., 2005), matrix factorization
(Yang and Leskovec, 2013; Psorakis et al., 2011) and link clustering (Shi et al., 2013; Ahn et al., 2010;
Evans and Lambiotte, 2009). Some approaches also propose post-processing steps to create overlapping
communities from pre-existing disjoint communities (Chakraborty, 2015).
Moreover, approaches that have been proposed to detect disjoint communities can also be adjusted
to directly yield overlapping communities. Label propagation based algorithms can allow nodes to hold
multiple labels at the same time, thereby creating overlapping communities (Xie and Szymanski, 2012;
Gregory, 2009). Stochastic blockmodels can support mixed memberships (Airoldi et al., 2007; Gopalan
and Blei, 2013). And if an algorithm is based on the optimization of a metric measuring structural
properties, allowing nodes to join multiple communities during the optimization can also change the result
from disjoint to overlapping communities (Lancichinetti et al., 2009).
Enabling Large Scale Community Detection
Large scale datasets pose a challenge for many community detection algorithms as their runtime increases
drastically (Lancichinetti and Fortunato, 2009; Danon et al., 2005). Various approaches can be used to
improve runtime and enable the use of community detection algorithms on large datasets. Parallelization
can leverage modern multi-core CPUs, and algorithms with independent computation steps can profit from
parallelization directly (Liu and Chamberlain, 2018). Other algorithms cannot be parallelized without
logical changes, however, small alterations often do not influence the results substantially (Prat-P
´
erez
et al., 2014). Such parallelized approaches can also be distributed to a multi-machine computing cluster,
enabling the use of even more computational resources (Saltz et al., 2015). Finally, algorithms relying
on metric optimization can also be improved in runtime by replacing the metric with an estimator. This
reduces computational effort and speeds up the community detection.
Weighted Community Clustering Score
The Weighted Community Clustering (
WCC
) score (Prat-
P
´
erez et al., 2012) is a metric that has been successfully used to detect communities (Saltz et al., 2015;
Song et al., 2015; Prat-P
´
erez et al., 2014). It measures structural closeness by counting the number of
triangles a node forms with a community to determine the membership of that node to that community.
Thereby, it is closely related to the general concept of clustering coefficients. In Methods we explain
this metric in more detail. Community detection algorithms based on this metric optimize communities
to maximize the global
WCC
score by allowing nodes to individually join or leave communities. Such
2/17