Influence Maximization: Divide and Conquer
Siddharth Patwardhan,1 Filippo Radicchi,1 and Santo Fortunato2,1
1Center for Complex Networks and Systems Research,
Luddy School of Informatics, Computing, and Engineering,
Indiana University, Bloomington, Indiana 47408, USA
2Indiana University Network Science Institute (IUNI)
The problem of influence maximization, i.e., finding the set of nodes having maximal influence on
a network, is of great importance for several applications. In the past two decades, many heuris-
tic metrics to spot influencers have been proposed. Here, we introduce a framework to boost the
performance of any such metric. The framework consists in dividing the network into sectors of in-
fluence, and then selecting the most influential nodes within these sectors. We explore three different
methodologies to find sectors in a network: graph partitioning, graph hyperbolic embedding, and
community structure. The framework is validated with a systematic analysis of real and synthetic
networks. We show that the gain in performance generated by dividing a network into sectors before
selecting the influential spreaders increases as the modularity and heterogeneity of the network in-
crease. Also, we show that the division of the network into sectors can be efficiently performed in a
time that scales linearly with the network size, thus making the framework applicable to large-scale
influence maximization problems.
I. INTRODUCTION
The spread of news, ideas, rumours, opinions, and
awareness in social networks is generally analyzed in
terms of processes of information diffusion [1–4]. A
well-established feature of this type of process on real,
heterogeneous networks is that a small fraction of nodes
may have a disproportionately large influence over the
rest of the system [3, 5, 6]. Therefore, influence maximization
(IM), the problem of finding the optimal set of
nodes that have the most influence or the largest collective
reach on the network, is central for potentially many
applications [3, 7].
Kempe et al. were the first to formalize the IM prob-
lem [8]. They showed that the problem is NP-hard, and
that solutions to the IM problem can only be approx-
imated. Also, they proposed a greedy optimization al-
gorithm guaranteeing a solution that is within a factor
(1 − 1/e) ≈ 0.63 of the optimal solution for two main
classes of spreading models. Greedy optimization con-
sists in building the set of influential spreaders in a net-
work sequentially by adding one spreader at a time to the
set. At each stage of the algorithm, the best spreader is
chosen as the node, among those outside the current set
of optimal spreaders, that generates the largest increment
in the influence of the set of spreaders. Importantly, the
gain in influence that a candidate spreader could bring is
estimated by adding it to the current set of already se-
lected spreaders, and simulating numerically the spread-
ing process. This procedure, although computationally
expensive, allows for properly assessing the combined in-
fluence that multiple spreaders usually have in a network.
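The greedy procedure described above can be sketched as follows; this is a minimal illustration, not the authors' implementation, assuming the spreading dynamics is a simple independent-cascade simulation and the influence function is estimated by a Monte Carlo average (all function names are illustrative):

```python
import random

def simulate_cascade(adj, seeds, p, rng):
    """One realization of a cascade from the seed set; returns outbreak size."""
    infected = set(seeds)
    frontier = list(seeds)
    while frontier:
        new_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in infected and rng.random() < p:
                    infected.add(v)
                    new_frontier.append(v)
        frontier = new_frontier
    return len(infected)

def estimate_influence(adj, seeds, p, rng, n_runs=200):
    """Monte Carlo estimate of the average outbreak size of a seed set."""
    if not seeds:
        return 0.0
    return sum(simulate_cascade(adj, seeds, p, rng) for _ in range(n_runs)) / n_runs

def greedy_im(adj, k, p, seed=0):
    """Build the spreader set sequentially: at each stage, add the node
    producing the largest increment in the estimated influence of the set."""
    rng = random.Random(seed)
    spreaders = set()
    for _ in range(k):
        base = estimate_influence(adj, spreaders, p, rng)
        best, best_gain = None, float("-inf")
        for v in adj:
            if v in spreaders:
                continue
            gain = estimate_influence(adj, spreaders | {v}, p, rng) - base
            if gain > best_gain:
                best, best_gain = v, gain
        spreaders.add(best)
    return spreaders
```

The nested loop over candidate nodes and Monte Carlo runs is exactly the computational bottleneck discussed in the text: each marginal-gain evaluation requires re-simulating the spreading process.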
The original recipe by Kempe et al. can be applied to
relatively small networks only. Follow-up studies further
improved upon the complexity of the greedy algorithm
proposed by Kempe et al. allowing for the study of IM
problems in larger settings [3, 9–13]. Speedup is also pos-
sible by first dividing the network into sectors, and then
performing greedy optimization within each sector sepa-
rately [3, 10–13]. In these approaches, sectors are gener-
ally identified in terms of network communities. Finding
communities in networks is a task that can be performed
in a time that grows linearly with the network size [14].
However, since these algorithms still rely on the estima-
tion of the influence function via numerical simulations,
they can only be used to deal with IM problems on net-
works of moderate size.
As more efficient alternatives, several purely topologi-
cal metrics of node centrality were proposed to quantify
the influence of the nodes [5, 9, 15–17]. The assumption
behind this approach is that a topological centrality met-
ric is a good proxy for dynamical influence. As the com-
putation of a network centrality metric does not involve
simulating the actual spreading process, centrality-based
algorithms can be applied to study the IM problem in
large-scale networks. However, their performance in approximating
solutions to the IM problem is systematically
worse than that of the greedy algorithm [6].
A common drawback of centrality-based algorithms is
assuming that each seed acts as an independent spreader
in the network so that the influence of a set of spreaders
is given by the sum of the influence of each individual
seed. This is clearly a weak assumption. For example,
it is well known that even in the case of simple conta-
gion models like the independent cascade model, the best
strategy is not choosing highly influential nodes in the
same closely connected neighborhood [18], but choosing
sufficiently distant nodes [19]. Two main ways of allevi-
ating this issue are considered in the literature. A first
way consists in defining an adaptive version of the cen-
trality metric at hand, so that the effect of the already
selected spreaders is discounted from the estimation of
the influence of the nodes under observation. This trick
is able to greatly improve the performance of even basic
degree centrality, whose adaptive version excels in performance [6].
arXiv:2210.01203v2 [physics.soc-ph] 6 Oct 2022
A second way proposed by Chen et al. is first
partitioning the network into sectors, and then estimat-
ing nodes’ influence within their own sectors [20]. The
rationale behind this procedure is that sectors represent
relatively independent parts of a network, thus selecting
seeds from different sectors represents a straightforward
way of reducing the overlap between portions of the net-
work that multiple spreaders are able to influence. The
rationale is similar to the one used in greedy optimization
performed on network communities [3, 10–13]; however,
sectors in Chen et al. are obtained by clustering nodes
on the basis of their node2vec embedding [21]. One of
the advantages of using geometric embedding instead of
community structure is the possibility of having full con-
trol on the number of sectors used in the division of the
network. On the other hand, identifying sectors in a
high-dimensional space such as the one generated by node2vec
is computationally expensive. Furthermore, in the procedure by
Chen et al., the number of sectors is set equal to the
number of spreaders that should be identified, therefore
requiring the sectors to be found afresh whenever the size of the
seed set is varied. The result is an algorithm that does
not scale well with the system size.
In this paper, we generalize and combine the above
ideas into a scalable approach. We propose a pipeline
consisting in dividing the network into sectors and then
choosing influential spreaders based on the division of
the network into sectors. Scalability is obtained by imposing
the number of sectors to be independent of the
number of spreaders. We explore three different method-
ologies to divide the network into sectors, namely graph
partitioning, graph hyperbolic embedding, and commu-
nity structure. The first two methods allow us to identify
sectors in the graph in a time that grows linearly with
the network size. The use of centrality metrics like adaptive
degree centrality, which can also be computed in linear
time, allows us to produce solutions to the IM problem in
large networks. Hyperbolic embedding requires instead
a time that grows quadratically with the network size,
but allows for a flexible and straightforward way of iden-
tifying network sectors. The method can be used only in
sufficiently small networks.
We systematically validate our approach on a large
corpus of real-world networks, demonstrating its effec-
tiveness in approximating solutions to the IM problem.
Furthermore, we leverage the Lancichinetti-Fortunato-
Radicchi (LFR) network model [22] to show that the
method is particularly useful in solving IM problems on
modular and heterogeneous networks.
II. METHODS
A. Networks
1. Real networks
We take advantage of a corpus of 52 undirected and
unweighted real-world networks. Sizes of these networks
FIG. 1. The divide-and-conquer approach to influence
maximization. The network is first divided into sectors of
influence, here represented by different colors. Each influen-
tial spreader is chosen by first randomly picking a sector, and
then selecting a node within the sector that is not yet part of
the set of spreaders, according to some criterion, typically the
value of a centrality score. The operation is iterated until a
desired number of spreaders is selected. The size of nodes in
the figure is proportional to their degrees, here used to proxy
nodes’ influence. Seven influential spreaders, depicted as bold
circles, are selected from the four available sectors.
range from N = 500 to N = 26,498 nodes. The upper
bound on the maximum size of the networks analyzed is
due to the high complexity of the greedy optimization al-
gorithm, which we use as the baseline for estimating the
performance of the other algorithms. We consider net-
works from different domains. Specifically, our corpus of
networks includes social, technological, information, biological,
and transportation networks. Details about the
analyzed networks can be found in the Appendix.
2. LFR model
To systematically analyze the dependence of the pro-
posed algorithm’s performance on the modularity and the
heterogeneity of the network structure, we use the LFR
network model [22], commonly adopted as a benchmark for
community detection algorithms [23]. The LFR model
allows us to generate synthetic networks with power-law
distributions of degree and community size. Parameters
of the model are the power-law exponent of the degree
distribution τ1, the average degree ⟨k⟩, the maximum
degree k_max, the power-law exponent of the community size
distribution τ2, and the mixing parameter µ, which is the
average fraction of neighbors outside the community of a
node. Low values of µ indicate well-separated and pronounced
communities; the larger µ, the weaker the
community structure.
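The role of the exponent τ1 can be illustrated by drawing a power-law degree sequence with inverse-transform sampling. This is only a sketch of that one ingredient; the actual LFR model additionally constrains ⟨k⟩, k_max, the community sizes, and the mixing parameter µ:

```python
import random

def powerlaw_degrees(n, tau1, k_min, k_max, seed=0):
    """Sample n degrees from P(k) ~ k^(-tau1) on [k_min, k_max] using
    inverse-transform sampling of the continuous distribution (tau1 != 1)."""
    rng = random.Random(seed)
    a = 1.0 - tau1
    lo, hi = k_min ** a, k_max ** a
    degrees = []
    for _ in range(n):
        u = rng.random()
        degrees.append(int(round((lo + u * (hi - lo)) ** (1.0 / a))))
    return degrees
```

Lower values of τ1 produce heavier tails, i.e., more hub nodes, which is the heterogeneity dimension explored in the validation on synthetic networks.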
B. Independent cascade model
In this work, we focus our attention on the Independent
Cascade Model (ICM) which is one of the most stud-
ied spreading models in the context of influence maxi-
mization (IM) [8]. The ICM is a discrete-time conta-
gion model, similar in spirit to the Susceptible-Infected-
Recovered model [24]. In the initial configuration, all
nodes are in the susceptible state, except for the nodes
in the set of spreaders that are in the infected state. At a
given time step, each infected node first attempts to in-
fect its susceptible neighbors with probability p, and then
recovers. Recovered nodes no longer participate in the
dynamics. The dynamics proceeds by repeating the pre-
viously described iteration over the newly infected nodes.
The spreading process stops once there are no infected
nodes remaining in the network. The influence of the
set of spreaders is quantified as the size of the outbreak,
i.e., the number of nodes that are found in the recovered
state at the end of the dynamics. Clearly, this number
may differ from realization to realization of the model due
to the stochastic nature of the spreading events. The IM
problem consists in finding the set of spreaders leading
to the largest average value of the outbreak size [8]. The
optimization is constrained by the number of nodes that
can compose the set of spreaders. The typical setting in
practical applications consists in finding a small set of
spreaders in a very large network.
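A single realization of the ICM as just described can be sketched as follows, with the network given as a dict of neighbor lists and the outbreak size given by the final number of recovered nodes (a minimal sketch for illustration):

```python
import random

def icm_outbreak(adj, seeds, p, rng=None):
    """Run one realization of the independent cascade model.
    Returns the set of recovered nodes at the end of the dynamics."""
    rng = rng or random.Random()
    infected = set(seeds)   # nodes currently in the infected state
    recovered = set()
    while infected:
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                # each infected node attempts, exactly once, to infect
                # each of its still-susceptible neighbors with probability p
                if v not in infected and v not in recovered and v not in newly_infected:
                    if rng.random() < p:
                        newly_infected.add(v)
        recovered |= infected   # infected nodes recover after one time step
        infected = newly_infected
    return recovered
```

Averaging `len(icm_outbreak(...))` over many realizations gives the influence of a seed set, the quantity maximized in the IM problem.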
As a function of the spreading probability p, the ICM
displays a transition from a non-endemic regime, where
the size of the outbreak is small compared to the network
size, to an endemic regime, where the outbreak involves a
large portion of the nodes in the network. The IM prob-
lem is particularly challenging and interesting around the
point where such a change of regime occurs. We define
it as the pseudo-critical value p* of the ordinary bond-percolation
model on the network. Specifically, p* represents
the threshold between the non-endemic and endemic
regimes for the ICM started from one randomly
chosen seed; this fact follows from the exact mapping of
critical SIR-like spreading to bond percolation on networks [25].
We stress that each network is characterized
by a different p* value; the numerical estimation of a
network's p* is performed using the Newman-Ziff algorithm
[26, 27].
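The Newman-Ziff algorithm tracks cluster sizes with a union-find structure while bonds are occupied one at a time in random order. One common estimator of the pseudo-critical point, sketched below under that assumption, is the bond occupation fraction at which the largest cluster experiences its biggest one-bond jump, averaged over sweeps:

```python
import random

def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def pseudo_critical_p(nodes, edges, n_sweeps=20, seed=0):
    """Estimate the bond-percolation pseudo-critical point as the average
    occupation fraction at which the largest cluster grows the most."""
    rng = random.Random(seed)
    m = len(edges)
    jump_fractions = []
    for _ in range(n_sweeps):
        parent = {v: v for v in nodes}
        size = {v: 1 for v in nodes}
        largest, max_jump, jump_at = 1, 0, 0
        order = edges[:]
        rng.shuffle(order)
        for i, (u, v) in enumerate(order, start=1):
            ru, rv = find(parent, u), find(parent, v)
            if ru != rv:
                if size[ru] < size[rv]:
                    ru, rv = rv, ru
                parent[rv] = ru          # union by size
                size[ru] += size[rv]
                if size[ru] - largest > max_jump:
                    max_jump = size[ru] - largest
                    jump_at = i          # bonds occupied at the jump
                largest = max(largest, size[ru])
        jump_fractions.append(jump_at / m)
    return sum(jump_fractions) / n_sweeps
```

Because each sweep touches every edge once and union-find operations are nearly constant time, the cost per sweep grows essentially linearly with the number of edges.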
C. The divide-and-conquer algorithm
The input of our algorithm is an unweighted and undirected
network G = (V, E), with set of nodes V and set of
edges E. We denote the size of the network as N = |V|.
The algorithm also requires choosing the number k of
desired influential spreaders and the number S of sectors
used to divide the network. The divide-and-conquer
(DC) algorithm consists of two main components (see
Fig. 1). First, we divide the network into S sectors, or
vertex subsets, V_1, V_2, ..., V_S, such that
V = V_1 ∪ V_2 ∪ ... ∪ V_S and V_i ∩ V_j = ∅ for all i ≠ j.
Second, we form the set of
k influential spreaders by adding one node at a time to
the set. Starting from an empty set, at each of the k
iterations, we first select a random sector, and then pick
the most influential node in the sector that is not already
included in the set of spreaders. One can use any suit-
able methodology to divide the network and any suitable
centrality metric to select influential spreaders from the
sectors. Clearly, for S= 1 no actual division of the net-
work into sectors is performed. In this case, the selection
of influential spreaders is made relying on the centrality
metric scores only, thus according to the standard pro-
cedure used in the literature [6]. For S = N, seed nodes
are randomly selected.
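Given any division into sectors and any centrality score, the selection step can be sketched as follows; here plain degree centrality is used as the score, and the sector assignment is taken as input (a minimal sketch, not the full pipeline):

```python
import random

def dc_select(adj, sector_of, k, seed=0):
    """Divide-and-conquer seed selection: repeatedly pick a random sector,
    then take its highest-degree node not yet chosen as a spreader."""
    assert k <= len(sector_of), "cannot select more spreaders than nodes"
    rng = random.Random(seed)
    sectors = {}
    for v, s in sector_of.items():
        sectors.setdefault(s, []).append(v)
    # rank the nodes of each sector by degree, highest first
    for s in sectors:
        sectors[s].sort(key=lambda v: len(adj[v]), reverse=True)
    spreaders, chosen = [], set()
    while len(spreaders) < k:
        s = rng.choice(list(sectors))
        candidates = [v for v in sectors[s] if v not in chosen]
        if not candidates:
            del sectors[s]   # sector exhausted, drop it
            continue
        spreaders.append(candidates[0])
        chosen.add(candidates[0])
    return spreaders
```

Any other centrality (e.g., an adaptive variant) can be substituted for the degree in the sort key without changing the structure of the procedure.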
We note that the above procedure is conceptually iden-
tical to the one introduced by Chen et al. [20]. However,
there are a few important practical differences. First,
Chen et al. consider high-dimensional node2vec embed-
dings only [21]. node2vec requires a non-trivial calibra-
tion of several hyperparameters that is known to be es-
sential for task performance, but adds significant com-
putational burden to the procedure [28]. Also, the high-
dimensionality of the node2vec embedding space makes
the identification procedure of the sectors non-trivial. Fi-
nally, Chen et al. impose S = k, with one seed selected
per sector. This fact implies that increasing the seed set
from k to k + 1 requires redefining the sectors afresh, an
operation that requires a time that grows at least lin-
early with the network size N. Since in IM problems
one typically uses a number of spreaders proportional to
the size of the system [6], the resulting complexity of the
algorithm is at least quadratic.
1. Dividing the network
We consider three possible methods of dividing a net-
work into sectors: (i) graph partitioning, (ii) graph hy-
perbolic embedding, and (iii) community structure. Be-
low, we briefly summarize each of these methods.
Graph partitioning consists in splitting a graph
into an arbitrarily chosen number of sectors of roughly
equal size, such that the total number of edges lying be-
tween the corresponding subgraphs is minimized [29, 30].
To perform graph partitioning, we take advantage of
METIS [30], an algorithm that implements the
multilevel partitioning technique introduced in Refs. [31]
and [32]. The computational time of METIS grows as
S · N [30].
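METIS itself is a compiled library with its own bindings, so it is not reproduced here. Purely as an illustration of the partitioning objective (balanced parts, few cut edges), a classical spectral bisection using the Fiedler vector of the graph Laplacian can be sketched with numpy; this is a relaxation of the minimum-cut problem, not the multilevel scheme METIS uses:

```python
import numpy as np

def spectral_bisect(adj_matrix):
    """Split a connected graph into two balanced halves using the Fiedler
    vector, the eigenvector of the second-smallest Laplacian eigenvalue."""
    degrees = adj_matrix.sum(axis=1)
    laplacian = np.diag(degrees) - adj_matrix
    _, eigvecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]
    # split at the median entry to keep the two parts of roughly equal size
    return fiedler >= np.median(fiedler)
```

Applying the bisection recursively to each half yields 4, 8, ... sectors, mimicking the arbitrarily chosen number of roughly equal-size parts that graph partitioning provides.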
Graph hyperbolic embedding is another representation
that allows us to divide a network into sectors. Here,
sectors are given by groups of close-by nodes in vector