Influence Maximization: Divide and Conquer
Siddharth Patwardhan,1 Filippo Radicchi,1 and Santo Fortunato2,1
1Center for Complex Networks and Systems Research,
Luddy School of Informatics, Computing, and Engineering,
Indiana University, Bloomington, Indiana 47408, USA
2Indiana University Network Science Institute (IUNI)
The problem of influence maximization, i.e., finding the set of nodes having maximal influence on
a network, is of great importance for several applications. In the past two decades, many heuris-
tic metrics to spot influencers have been proposed. Here, we introduce a framework to boost the
performance of any such metric. The framework consists in dividing the network into sectors of in-
fluence, and then selecting the most influential nodes within these sectors. We explore three different
methodologies to find sectors in a network: graph partitioning, graph hyperbolic embedding, and
community structure. The framework is validated with a systematic analysis of real and synthetic
networks. We show that the gain in performance generated by dividing a network into sectors before
selecting the influential spreaders increases as the modularity and heterogeneity of the network in-
crease. Also, we show that the division of the network into sectors can be efficiently performed in a
time that scales linearly with the network size, thus making the framework applicable to large-scale
influence maximization problems.
I. INTRODUCTION
The spread of news, ideas, rumours, opinions, and
awareness in social networks is generally analyzed in
terms of processes of information diffusion [1–4]. A
well-established feature of this type of process on real,
heterogeneous networks is that a small fraction of nodes
may have a disproportionately large influence over the
rest of the system [3, 5, 6]. Therefore, influence maximization
(IM), the problem of finding the optimal set of
nodes that have the most influence or the largest collective
reach on the network, is central for potentially many
applications [3, 7].
Kempe et al. were the first to formalize the IM prob-
lem [8]. They showed that the problem is NP-hard, and
that solutions to the IM problem can only be approx-
imated. Also, they proposed a greedy optimization al-
gorithm guaranteeing a solution that is within a factor
(1 − 1/e) ≈ 0.63 of the optimal solution for two main
classes of spreading models. Greedy optimization con-
sists in building the set of influential spreaders in a net-
work sequentially by adding one spreader at a time to the
set. At each stage of the algorithm, the best spreader is
chosen as the node, among those outside the current set
of optimal spreaders, that generates the largest increment
in the influence of the set of spreaders. Importantly, the
gain in influence that a candidate spreader could bring is
estimated by adding it to the current set of already se-
lected spreaders, and simulating numerically the spread-
ing process. This procedure, although computationally
expensive, allows for properly assessing the combined in-
fluence that multiple spreaders usually have in a network.
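The greedy procedure described above can be sketched as follows; this is a minimal illustration, not the authors' implementation, assuming the spreading dynamics is a simple independent-cascade simulation and the influence function is estimated by a Monte Carlo average (all function names are illustrative):

```python
import random

def simulate_cascade(adj, seeds, p, rng):
    """One realization of a cascade from the seed set; returns outbreak size."""
    infected = set(seeds)
    frontier = list(seeds)
    while frontier:
        new_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in infected and rng.random() < p:
                    infected.add(v)
                    new_frontier.append(v)
        frontier = new_frontier
    return len(infected)

def estimate_influence(adj, seeds, p, rng, n_runs=200):
    """Monte Carlo estimate of the average outbreak size of a seed set."""
    if not seeds:
        return 0.0
    return sum(simulate_cascade(adj, seeds, p, rng) for _ in range(n_runs)) / n_runs

def greedy_im(adj, k, p, seed=0):
    """Build the spreader set sequentially: at each stage, add the node
    producing the largest increment in the estimated influence of the set."""
    rng = random.Random(seed)
    spreaders = set()
    for _ in range(k):
        base = estimate_influence(adj, spreaders, p, rng)
        best, best_gain = None, float("-inf")
        for v in adj:
            if v in spreaders:
                continue
            gain = estimate_influence(adj, spreaders | {v}, p, rng) - base
            if gain > best_gain:
                best, best_gain = v, gain
        spreaders.add(best)
    return spreaders
```

The nested loop over candidate nodes and Monte Carlo runs is exactly the computational bottleneck discussed in the text: each marginal-gain evaluation requires re-simulating the spreading process.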
The original recipe by Kempe et al. can be applied to
relatively small networks only. Follow-up studies further
improved upon the complexity of the greedy algorithm
proposed by Kempe et al. allowing for the study of IM
problems in larger settings [3, 9–13]. Speedup is also pos-
sible by first dividing the network into sectors, and then
performing greedy optimization within each sector sepa-
rately [3, 10–13]. In these approaches, sectors are gener-
ally identified in terms of network communities. Finding
communities in networks is a task that can be performed
in a time that grows linearly with the network size [14].
However, since these algorithms still rely on the estima-
tion of the influence function via numerical simulations,
they can only be used to deal with IM problems on net-
works of moderate size.
As more efficient alternatives, several purely topologi-
cal metrics of node centrality were proposed to quantify
the influence of the nodes [5, 9, 15–17]. The assumption
behind this approach is that a topological centrality met-
ric is a good proxy for dynamical influence. As the com-
putation of a network centrality metric does not involve
simulating the actual spreading process, centrality-based
algorithms can be applied to study the IM problem in
large-scale networks. However, their performance in approximating
solutions to the IM problem is systematically
worse than that of the greedy algorithm [6].
A common drawback of centrality-based algorithms is
assuming that each seed acts as an independent spreader
in the network so that the influence of a set of spreaders
is given by the sum of the influence of each individual
seed. This is clearly a weak assumption. For example,
it is well known that even in the case of simple conta-
gion models like the independent cascade model, the best
strategy is not choosing highly influential nodes in the
same closely connected neighborhood [18], but choosing
sufficiently distant nodes [19]. Two main ways of allevi-
ating this issue are considered in the literature. A first
way consists in defining an adaptive version of the cen-
trality metric at hand, so that the effect of the already
selected spreaders is discounted from the estimation of
the influence of the nodes under observation. This trick
is able to greatly improve the performance of even basic
degree centrality, whose adaptive version excels in performance [6].
arXiv:2210.01203v2 [physics.soc-ph] 6 Oct 2022
A second way proposed by Chen et al. is first
partitioning the network into sectors, and then estimat-
ing nodes’ influence within their own sectors [20]. The
rationale behind this procedure is that sectors represent
relatively independent parts of a network, thus selecting
seeds from different sectors represents a straightforward
way of reducing the overlap between portions of the net-
work that multiple spreaders are able to influence. The
rationale is similar to the one used in greedy optimization
performed on network communities [3, 10–13]; however,
sectors in Chen et al. are obtained by clustering nodes
on the basis of their node2vec embedding [21]. One of
the advantages of using geometric embedding instead of
community structure is the possibility of having full con-
trol on the number of sectors used in the division of the
network. On the other hand, identifying sectors in a
high-dimensional space such as the one generated by node2vec
is computationally expensive. Furthermore, in the procedure by
Chen et al., the number of sectors is set equal to the
number of spreaders that should be identified, therefore
requiring the sectors to be found afresh whenever the size of the
seed set is varied. The result is an algorithm that does
not scale well with the system size.
In this paper, we generalize and combine the above
ideas into a scalable approach. We propose a pipeline
consisting in dividing the network into sectors and then
choosing influential spreaders based on the division of
the network into sectors. Scalability is obtained by imposing
the number of sectors to be independent of the
number of spreaders. We explore three different method-
ologies to divide the network into sectors, namely graph
partitioning, graph hyperbolic embedding, and commu-
nity structure. The first two methods allow us to identify
sectors in the graph in a time that grows linearly with
the network size. The use of centrality metrics like adaptive
degree centrality, which can also be computed in linear
time, allows us to produce solutions to the IM problem in
large networks. Hyperbolic embedding requires instead
a time that grows quadratically with the network size,
but allows for a flexible and straightforward way of iden-
tifying network sectors. The method can be used only in
sufficiently small networks.
We systematically validate our approach on a large
corpus of real-world networks, demonstrating its effec-
tiveness in approximating solutions to the IM problem.
Furthermore, we leverage the Lancichinetti-Fortunato-
Radicchi (LFR) network model [22] to show that the
method is particularly useful in solving IM problems on
modular and heterogeneous networks.
II. METHODS
A. Networks
1. Real networks
We take advantage of a corpus of 52 undirected and
unweighted real-world networks. Sizes of these networks
FIG. 1. The divide-and-conquer approach to influence
maximization. The network is first divided into sectors of
influence, here represented by different colors. Each influen-
tial spreader is chosen by first randomly picking a sector, and
then selecting a node within the sector that is not yet part of
the set of spreaders, according to some criterion, typically the
value of a centrality score. The operation is iterated until a
desired number of spreaders is selected. The size of nodes in
the figure is proportional to their degrees, here used to proxy
nodes’ influence. Seven influential spreaders, depicted as bold
circles, are selected from the four available sectors.
range from N = 500 to N = 26,498 nodes. The upper
bound on the maximum size of the networks analyzed is
due to the high complexity of the greedy optimization al-
gorithm, which we use as the baseline for estimating the
performance of the other algorithms. We consider net-
works from different domains. Specifically, our corpus of
networks includes social, technological, information, biological,
and transportation networks. Details about the
analyzed networks can be found in the Appendix.
2. LFR model
To systematically analyze the dependence of the pro-
posed algorithm’s performance on the modularity and the
heterogeneity of the network structure, we use the LFR
network model [22], commonly adopted as a benchmark for
community detection algorithms [23]. The LFR model
allows us to generate synthetic networks with power-law
distributions of degree and community size. Parameters
of the model are the power-law exponent of the degree
distribution τ1, the average degree ⟨k⟩, the maximum
degree k_max, the power-law exponent of the community size
distribution τ2, and the mixing parameter µ, which is the
average fraction of neighbors outside the community of a
node. Low values of µ indicate well-separated and pronounced
communities; the larger µ, the weaker the
community structure.
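The role of the exponent τ1 can be illustrated by drawing a power-law degree sequence with inverse-transform sampling. This is only a sketch of that one ingredient; the actual LFR model additionally constrains ⟨k⟩, k_max, the community sizes, and the mixing parameter µ:

```python
import random

def powerlaw_degrees(n, tau1, k_min, k_max, seed=0):
    """Sample n degrees from P(k) ~ k^(-tau1) on [k_min, k_max] using
    inverse-transform sampling of the continuous distribution (tau1 != 1)."""
    rng = random.Random(seed)
    a = 1.0 - tau1
    lo, hi = k_min ** a, k_max ** a
    degrees = []
    for _ in range(n):
        u = rng.random()
        degrees.append(int(round((lo + u * (hi - lo)) ** (1.0 / a))))
    return degrees
```

Lower values of τ1 produce heavier tails, i.e., more hub nodes, which is the heterogeneity dimension explored in the validation on synthetic networks.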
B. Independent cascade model
In this work, we focus our attention on the Independent
Cascade Model (ICM) which is one of the most stud-
ied spreading models in the context of influence maxi-
mization (IM) [8]. The ICM is a discrete-time conta-
gion model, similar in spirit to the Susceptible-Infected-
Recovered model [24]. In the initial configuration, all
nodes are in the susceptible state, except for the nodes
in the set of spreaders that are in the infected state. At a
given time step, each infected node first attempts to in-
fect its susceptible neighbors with probability p, and then
recovers. Recovered nodes no longer participate in the
dynamics. The dynamics proceeds by repeating the pre-
viously described iteration over the newly infected nodes.
The spreading process stops once there are no infected
nodes remaining in the network. The influence of the
set of spreaders is quantified as the size of the outbreak,
i.e., the number of nodes that are found in the recovered
state at the end of the dynamics. Clearly, this number
may differ from realization to realization of the model due
to the stochastic nature of the spreading events. The IM
problem consists in finding the set of spreaders leading
to the largest average value of the outbreak size [8]. The
optimization is constrained by the number of nodes that
can compose the set of spreaders. The typical setting in
practical applications consists in finding a small set of
spreaders in a very large network.
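A single realization of the ICM as just described can be sketched as follows, with the network given as a dict of neighbor lists and the outbreak size given by the final number of recovered nodes (a minimal sketch for illustration):

```python
import random

def icm_outbreak(adj, seeds, p, rng=None):
    """Run one realization of the independent cascade model.
    Returns the set of recovered nodes at the end of the dynamics."""
    rng = rng or random.Random()
    infected = set(seeds)   # nodes currently in the infected state
    recovered = set()
    while infected:
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                # each infected node attempts, exactly once, to infect
                # each of its still-susceptible neighbors with probability p
                if v not in infected and v not in recovered and v not in newly_infected:
                    if rng.random() < p:
                        newly_infected.add(v)
        recovered |= infected   # infected nodes recover after one time step
        infected = newly_infected
    return recovered
```

Averaging `len(icm_outbreak(...))` over many realizations gives the influence of a seed set, the quantity maximized in the IM problem.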
As a function of the spreading probability p, the ICM
displays a transition from a non-endemic regime, where
the size of the outbreak is small compared to the network
size, to an endemic regime, where the outbreak involves a
large portion of the nodes in the network. The IM prob-
lem is particularly challenging and interesting around the
point where such a change of regime occurs. We define
it as the pseudo-critical value p* of the ordinary bond-percolation
model on the network. Specifically, p* represents
the threshold between the non-endemic and endemic
regimes for the ICM started from one randomly
chosen seed; this fact follows from the exact mapping of
critical SIR-like spreading to bond percolation on networks [25].
We stress that each network is characterized
by a different p* value; the numerical estimation of a
network's p* is performed using the Newman-Ziff algorithm
[26, 27].
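The Newman-Ziff algorithm tracks cluster sizes with a union-find structure while bonds are occupied one at a time in random order. One common estimator of the pseudo-critical point, sketched below under that assumption, is the bond occupation fraction at which the largest cluster experiences its biggest one-bond jump, averaged over sweeps:

```python
import random

def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def pseudo_critical_p(nodes, edges, n_sweeps=20, seed=0):
    """Estimate the bond-percolation pseudo-critical point as the average
    occupation fraction at which the largest cluster grows the most."""
    rng = random.Random(seed)
    m = len(edges)
    jump_fractions = []
    for _ in range(n_sweeps):
        parent = {v: v for v in nodes}
        size = {v: 1 for v in nodes}
        largest, max_jump, jump_at = 1, 0, 0
        order = edges[:]
        rng.shuffle(order)
        for i, (u, v) in enumerate(order, start=1):
            ru, rv = find(parent, u), find(parent, v)
            if ru != rv:
                if size[ru] < size[rv]:
                    ru, rv = rv, ru
                parent[rv] = ru          # union by size
                size[ru] += size[rv]
                if size[ru] - largest > max_jump:
                    max_jump = size[ru] - largest
                    jump_at = i          # bonds occupied at the jump
                largest = max(largest, size[ru])
        jump_fractions.append(jump_at / m)
    return sum(jump_fractions) / n_sweeps
```

Because each sweep touches every edge once and union-find operations are nearly constant time, the cost per sweep grows essentially linearly with the number of edges.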
C. The divide-and-conquer algorithm
The input of our algorithm is an unweighted and undirected
network G = (V, E), with set of nodes V and set of
edges E. We denote the size of the network as N = |V|.
The algorithm also requires choosing the number k of
desired influential spreaders and the number S of sectors
used to divide the network. The divide-and-conquer
(DC) algorithm consists of two main components (see
Fig. 1). First, we divide the network into S sectors, or
vertex subsets, V_1, V_2, ..., V_S, such that
V = V_1 ∪ V_2 ∪ ... ∪ V_S and V_i ∩ V_j = ∅ for all i ≠ j.
Second, we form the set of
k influential spreaders by adding one node at a time to
the set. Starting from an empty set, at each of the k
iterations, we first select a random sector, and then pick
the most influential node in the sector that is not already
included in the set of spreaders. One can use any suit-
able methodology to divide the network and any suitable
centrality metric to select influential spreaders from the
sectors. Clearly, for S= 1 no actual division of the net-
work into sectors is performed. In this case, the selection
of influential spreaders is made relying on the centrality
metric scores only, thus according to the standard pro-
cedure used in the literature [6]. For S = N, seed nodes
are randomly selected.
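Given any division into sectors and any centrality score, the selection step can be sketched as follows; here plain degree centrality is used as the score, and the sector assignment is taken as input (a minimal sketch, not the full pipeline):

```python
import random

def dc_select(adj, sector_of, k, seed=0):
    """Divide-and-conquer seed selection: repeatedly pick a random sector,
    then take its highest-degree node not yet chosen as a spreader."""
    assert k <= len(sector_of), "cannot select more spreaders than nodes"
    rng = random.Random(seed)
    sectors = {}
    for v, s in sector_of.items():
        sectors.setdefault(s, []).append(v)
    # rank the nodes of each sector by degree, highest first
    for s in sectors:
        sectors[s].sort(key=lambda v: len(adj[v]), reverse=True)
    spreaders, chosen = [], set()
    while len(spreaders) < k:
        s = rng.choice(list(sectors))
        candidates = [v for v in sectors[s] if v not in chosen]
        if not candidates:
            del sectors[s]   # sector exhausted, drop it
            continue
        spreaders.append(candidates[0])
        chosen.add(candidates[0])
    return spreaders
```

Any other centrality (e.g., an adaptive variant) can be substituted for the degree in the sort key without changing the structure of the procedure.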
We note that the above procedure is conceptually iden-
tical to the one introduced by Chen et al. [20]. However,
there are a few important practical differences. First,
Chen et al. consider high-dimensional node2vec embed-
dings only [21]. node2vec requires a non-trivial calibra-
tion of several hyperparameters that is known to be es-
sential for task performance, but adds significant com-
putational burden to the procedure [28]. Also, the high-
dimensionality of the node2vec embedding space makes
the identification procedure of the sectors non-trivial. Fi-
nally, Chen et al. impose S = k, with one seed selected
per sector. This fact implies that increasing the seed set
from k to k + 1 requires redefining the sectors afresh, an
operation that requires a time that grows at least lin-
early with the network size N. Since in IM problems
one typically uses a number of spreaders proportional to
the size of the system [6], the resulting complexity of the
algorithm is at least quadratic.
1. Dividing the network
We consider three possible methods of dividing a net-
work into sectors: (i) graph partitioning, (ii) graph hy-
perbolic embedding, and (iii) community structure. Be-
low, we briefly summarize each of these methods.
Graph partitioning consists in splitting a graph
into an arbitrarily chosen number of sectors of roughly
equal size, such that the total number of edges lying be-
tween the corresponding subgraphs is minimized [29, 30].
To perform graph partitioning, we take advantage of
METIS [30], an algorithm that implements the
multilevel partitioning technique introduced in Refs. [31]
and [32]. The computational time of METIS grows as
S · N [30].
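METIS itself is a compiled library with its own bindings, so it is not reproduced here. Purely as an illustration of the partitioning objective (balanced parts, few cut edges), a classical spectral bisection using the Fiedler vector of the graph Laplacian can be sketched with numpy; this is a relaxation of the minimum-cut problem, not the multilevel scheme METIS uses:

```python
import numpy as np

def spectral_bisect(adj_matrix):
    """Split a connected graph into two balanced halves using the Fiedler
    vector, the eigenvector of the second-smallest Laplacian eigenvalue."""
    degrees = adj_matrix.sum(axis=1)
    laplacian = np.diag(degrees) - adj_matrix
    _, eigvecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]
    # split at the median entry to keep the two parts of roughly equal size
    return fiedler >= np.median(fiedler)
```

Applying the bisection recursively to each half yields 4, 8, ... sectors, mimicking the arbitrarily chosen number of roughly equal-size parts that graph partitioning provides.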
Graph hyperbolic embedding is another representation
that allows us to divide a network into sectors. Here,
sectors are given by groups of close-by nodes in vector