Impact of network topology changes on information source localization
Piotr Machura, Robert Paluch

Abstract
Well-established methods of locating the source of information in a complex network are usually derived
with the assumption of complete and exact knowledge of network topology. We study the performance of
three such algorithms (LPTVA, GMLA and Pearson correlation algorithm) in scenarios that do not fulfill
this assumption by modifying the network prior to localization. This is done by adding superfluous new
links, hiding existing ones, or reattaching links in accordance with the network’s structural Hamiltonian. We
find that GMLA is highly resilient to the addition of superfluous edges, as its precision falls by more than
statistical uncertainty only when the number of links is approximately doubled. On the other hand, if the
edge set is underestimated or reattachment has taken place, the performance of GMLA drops significantly. In
such a scenario the Pearson algorithm is preferable, retaining most of its performance when other simulation
parameters favor localization (high density of observers, highly deterministic propagation). It is also generally
more accurate than LPTVA, as well as orders of magnitude faster. The aforementioned differences between
localization algorithms can be intuitively explained, although a need for further theoretical research is noted.
Keywords: complex network, information propagation, source localization, exponential random graphs
1 Introduction
The exchange of information is the main purpose of many networks found in our everyday life. The creation of
the Internet enabled its users to communicate on an unprecedented scale. The rapid rise in the usage of social
media [1] fundamentally changed the way we view and absorb the information presented to us by acquaintances
and news publishers alike. The emergence of clickbait [2] promotes sensation over honesty, encouraging the
usage of exaggeration and deceit in the name of profitable engagement [3]. Due to the instantaneous nature
of social media, misinformation propagates in a flash, reaching a wide audience of users. As a result we, the
recipients, are often unaware of the true origin of viral gossip we encounter, making it difficult to judge and
interpret properly.
The scientific community has taken it upon itself to solve this dilemma by establishing methods to identify
and combat the spread of misinformation [4, 5]. In this manner, the topic of source location is of crucial
importance. Knowing where a specific piece of information came from can aid in identifying its purpose and
reliability.
Among various types of localization methods, categorized by Jiang et al. [6], observer-based localization
methods are especially popular due to their relatively low requirements. They can accurately pinpoint the source
of information using data provided by only a few members of the network, together with the knowledge of its
structure. When deriving localization methods, it is usually assumed that this structure (network topology) is
known completely and exactly. We propose a set of scenarios, outlined in Section 3, in which this assumption
does not hold. That is, the network in which the information has spread is not the same as the network used
during localization. This may be due to errors made during network reconstruction, lack of knowledge regarding
the full structure, or natural evolution occurring between the time of propagation and the moment in which the
network is recorded.
Examining the simulation performance of three popular source localization algorithms, outlined in Section
2.2, we seek to find and describe the qualitative differences between them. Results of such simulations can
potentially aid in choosing the right algorithm for a real-world situation, in which the “correctness” of the
reconstructed network is uncertain.
2 Basics
We assume a simple agent-based model, in which N agents are located in the vertices of a complex network, i.e. an unweighted, undirected graph G = (V, E). Edges of such graphs represent a bi-directional connection through which information may propagate.
arXiv:2210.14737v1 [physics.soc-ph] 26 Oct 2022
Figure 1: Illustration of the simulation methodology. a) At an unknown time t* the source s* is spontaneously infected. b) Information propagates through the network. Infection times t_i of chosen nodes (observers o_i) are recorded. c) The network undergoes topological changes. Some existing edges are removed (dotted line) and some are added (double line), producing the modified network. The graph may become disconnected in the process. d) The modified network is cropped to the largest component before the source localization algorithm is run.
2.1 Information spread
To simulate the spread of information, we use the synchronous Susceptible-Infected (SI) model [7]. In the SI model, each node can be in one of two states: susceptible/healthy (uninformed) or infected (informed). In the beginning, all nodes are healthy. At some unknown point in time t*, a single randomly chosen node becomes spontaneously infected. We will refer to this node as the source of information, denoted s*. In every subsequent time step, infected nodes try to infect their susceptible neighbors, each attempt succeeding with probability β, called the infection rate. The propagation ends when there are no more healthy nodes in the network. It is assumed that the network is connected (i.e. there is a path between each pair of nodes) during the spread. After the propagation completes and the network topology is changed (see Section 3), this assumption no longer holds.
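The synchronous SI update described above can be sketched as follows (the adjacency-dictionary representation and the function name are our own illustrative choices, not taken from [7]):

```python
import random

def si_spread(adj, source, beta, rng):
    """Synchronous SI propagation on an adjacency dict {node: set(neighbors)}.

    Returns a dict mapping each node to its infection time step
    (the source is infected at time 0)."""
    t = {source: 0}          # infection times; infected nodes are the keys
    infected = {source}
    step = 0
    while len(t) < len(adj):
        step += 1
        newly = set()
        for u in infected:
            for v in adj[u]:
                # each infected node tries to infect each susceptible
                # neighbor independently with probability beta
                if v not in t and rng.random() < beta:
                    newly.add(v)
        for v in newly:
            t[v] = step
        infected |= newly
    return t
```

With β = 1 the spread is deterministic and each node's infection time equals its hop distance from the source.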
2.2 Locating the source
We consider observer -based methods of source localization. These algorithms are popular in the literature [8–12]
due to their realistic assumptions, making possible real-world implementation and usage. The observer nodes
provide some limited knowledge about the spread after it has concluded. They may represent e.g. social media
moderators capable of providing insight, or individual users reporting their encounters with concerning content.
In our simulations observers form a randomly chosen subset of graph vertices O ⊆ V, with the ratio r = |O|/|V| called the observer density. It is assumed that each observer o_i measures the time of its infection t_i. This is notably different from the assumptions made by Pinto et al. [8] and Xu, Teng, Zhou et al. [13], where both the observer infection time and the node by which the observer was infected are known. We find that this limited approach can still result in satisfactory algorithm performance, while significantly reducing the amount of information required to perform localization.
A source localization algorithm assigns a score to every suspect node s ∈ S ⊆ V (including observers), given the vector of observer infection times [t]_i = t_i, the infection rate β and the graph topology. It is therefore equivalent to a score function φ(s; t, β, G). Intuitively, φ(s) should be associated with the posterior probability of s being the true source s*, that is

P(s = s* | t, G) = f(φ(s; t, β, G)),   (1)

where f is any strictly increasing function. This allows for ranking suspects by their score, with the highest scoring suspect

ŝ = argmax_s φ(s; t, β, G)   (2)

being the most probable origin of the information. If it is true that ŝ = s*, then the localization is successful. See Section 2.2.4 for further details on evaluating localization algorithms.
2.2.1 Maximum likelihood algorithm
The maximum likelihood algorithm has been proposed by Pinto et al. in [8]. We will refer to it as LPTVA (Limited Pinto-Thiran-Vetterli Algorithm), with "limited" signifying the fact that only the infection times t_i are used. It is analytically derived for tree graphs and extended to general graphs in the way outlined below.

The unknown inception time t* is accounted for by choosing a reference observer o_1 and constructing a vector of infection delays t' with respect to t_1:

[t']_i = t_{i+1} − t_1.   (3)
If we assume that the graph in question is a tree T and let P(v_i, v_j) denote the path between vertices v_i and v_j, the expected value of the t' vector (given s is the source) can be written as

[µ_s]_i = µ (|P(s, o_{i+1})| − |P(s, o_1)|),   (4)

where µ is the mean time it takes for information to transmit through one edge. In the case of the SI model it is equal to the mean of a geometric distribution, µ = 1/β. The covariance matrix Λ of t' is given by

[Λ]_{i,j} = σ² |P(o_{i+1}, o_1) ∩ P(o_{j+1}, o_1)|,   (5)

where σ² is the variance of the time it takes for information to transmit through one edge. In the case of the SI model it is equal to the variance of a geometric distribution, σ² = (1 − β)/β².

Using the central limit theorem, we can postulate that t', as a sum of i.i.d. single-edge delays, follows a multivariate normal distribution with mean µ_s and covariance matrix Λ. Taking the logarithm of this distribution and stripping away s-independent normalization factors, we arrive at the score function

φ_LPTVA(s) = −(t' − µ_s)ᵀ Λ⁻¹ (t' − µ_s) − ln |Λ|,   (6)

which can be calculated for all nodes in the network.
If the graph in question is not a tree, the algorithm can be adjusted by calculating the score given by (6) with respect to the breadth-first search (BFS) tree T_BFS(s) rooted at s. Such a modification is equivalent to the assumption that the information propagates along the shortest paths connecting the source with the observers. This holds exactly when β = 1, and contributes to significantly reduced algorithm performance for non-tree graphs with β significantly lower than 1.
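One possible implementation of this score under the assumptions above (geometric single-edge delays, paths taken in the BFS tree rooted at the suspect) is sketched below. All helper names are ours, NumPy is our choice for the linear algebra, and edge cases such as a singular Λ when β = 1 are not handled:

```python
import numpy as np
from collections import deque

def bfs_tree(adj, root):
    """Parent pointers and depths of the BFS tree rooted at `root`."""
    parent, depth = {root: None}, {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v], depth[v] = u, depth[u] + 1
                q.append(v)
    return parent, depth

def tree_path_edges(parent, depth, a, b):
    """Edge set of the unique a--b path in a tree with parent pointers."""
    edges = set()
    while a != b:                 # climb the deeper endpoint until they meet
        if depth[a] < depth[b]:
            a, b = b, a
        edges.add(frozenset((a, parent[a])))
        a = parent[a]
    return edges

def lptva_score(adj, s, observers, times, beta):
    """LPTVA score (eq. 6) of suspect s, using the BFS tree rooted at s."""
    mu, sigma2 = 1.0 / beta, (1.0 - beta) / beta ** 2
    parent, depth = bfs_tree(adj, s)
    o1, rest = observers[0], observers[1:]
    d = np.array([times[o] - times[o1] for o in rest], dtype=float)   # t', eq. (3)
    mu_s = mu * np.array([depth[o] - depth[o1] for o in rest])        # eq. (4)
    paths = [tree_path_edges(parent, depth, o1, o) for o in rest]
    lam = sigma2 * np.array([[len(p & q) for q in paths]
                             for p in paths], dtype=float)            # eq. (5)
    r = d - mu_s
    return float(-r @ np.linalg.inv(lam) @ r - np.log(np.linalg.det(lam)))
```

On a path graph with observers at both ends and delays consistent with the middle node, the middle node indeed scores higher than the endpoints.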
2.2.2 Gradient maximum likelihood algorithm
The LPTVA, while well established and highly analytical, suffers from significant computation speed issues. The matrix inversion in equation (6) must be performed for each suspect, rendering the method effectively unusable for networks with a large number of nodes.

The gradient maximum likelihood algorithm (GMLA) [9] attempts to rectify this problem by prioritizing the earliest observers and utilizing a gradient-like selection of suspects. As a consequence, unlike in the case of LPTVA and the Pearson correlation algorithm, the set of suspects is generally only a small subset of all nodes in the network. All other nodes v ∈ (V \ S) are assigned the same score, effectively equal to −∞. While the score function used to rank suspects in GMLA is the same as (6), the choice of suspects is much more sophisticated.
Firstly, only the K_f earliest-infected observers are used in the calculations. The performance of the algorithm varies depending on the exact value of K_f, with the optimum usually estimated for each network model on a case-by-case basis. In our calculations we have used K_f = √N for all considered networks, which is an acceptable rule of thumb [9].
Secondly, the score function is not calculated for all N vertices. Instead, it is first calculated for the neighbors S(1) of the earliest observer o_1. Next, the score is calculated for the previously unchecked neighbors of the highest scoring ŝ(1) from S(1), creating another set of suspects S(2), from which the highest scoring ŝ(2) is chosen. This process is repeated until there is no node in S(n) (neighbors of ŝ(n−1) not checked in previous steps) with a score higher than ŝ(n−1). In such a case, the algorithm stops and returns the scores calculated for all members of

S = ⋃_{i=1}^{n} S(i).   (7)
Figure 2: Node selection in GMLA. Initially, the neighbors of the earliest informed observer o_1 are checked (S(1), colored orange). The highest scoring vertex ŝ(1) from this set is chosen, and its neighbors S(2) are checked (colored blue). Note that o_1 is included in S(2). Next, the neighbors S(3) of ŝ(2) (colored purple) are checked, among which the highest scoring vertex ŝ(3) can be found. Since the score of ŝ(3) is lower than that of ŝ(2), the algorithm terminates with ŝ = ŝ(2). The final suspect set consists of all colored nodes, S = S(3) ∪ S(2) ∪ S(1). White vertices remain unchecked and are placed at the bottom of the ranking, at position N + 1, where N is the size of the network.
Empirical studies suggest that the total number of suspects roughly follows |S| ∼ k̂ log N, where k̂ is the average degree in the network. Fig. 2 illustrates this gradient-like selection process.
GMLA ensures that nodes near high-scoring suspects are checked, while those far away are ignored, significantly reducing computation time. Using only the K_f first observers also proves to be highly beneficial, bringing the overall localization capability of GMLA in line with, or even exceeding, that of LPTVA [10].
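The gradient-like suspect selection can be sketched independently of the score function, which we pass in as a callable (function names are our own, not from [9]):

```python
def gmla_suspects(adj, o1, score):
    """Gradient-like suspect selection of GMLA: starting from the earliest
    observer o1, repeatedly score the unchecked neighbors of the current
    best suspect and move there, until no node scores higher than it."""
    scores = {}
    def check(nodes):
        for v in nodes:
            if v not in scores:
                scores[v] = score(v)
    check(adj[o1])                       # S(1): neighbors of o1
    best = max(scores, key=scores.get)
    while True:
        # S(n): neighbors of the current best not checked in previous steps
        check([v for v in adj[best] if v not in scores])
        candidate = max(scores, key=scores.get)
        if candidate == best:            # no unchecked neighbor improved
            return best, scores
        best = candidate
```

Nodes never reached by this walk (the "white vertices" of Fig. 2) simply stay out of the returned score dictionary, i.e. at the bottom of the ranking.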
2.2.3 Pearson correlation algorithm
Proposed by Xu, Teng, Zhou et al. [13], the correlation algorithm stands out as an extremely straightforward, yet highly performant method. If we assume that the information propagates along the shortest paths connecting the source to the observers (which is also an assumption required for extending LPTVA to general graphs), then it should be intuitively obvious that the greater the distance from an observer to the source, the longer it takes for this observer to become infected.
More precisely, in the case of the true source s*, the vector of infection times t should be correlated with the vector of distances between the source and the observers

[d_s]_i = d(s, o_i),   (8)

where the distance d(v_i, v_j) is defined as the length of the shortest path connecting vertices v_i and v_j. The score, calculated for all suspects, is therefore given by the Pearson correlation coefficient between t and the distances to the suspect d_s:

φ_Pearson(s) = ρ_Pearson(t, d_s),   (9)

which should be the highest for the true source s = s*.
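A direct sketch of this score, assuming hop distances computed by BFS (helper names are ours; ties and zero-variance time vectors are not handled):

```python
import math
from collections import deque

def bfs_distances(adj, root):
    """Shortest-path (hop) distances from `root` on an unweighted graph."""
    dist = {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def pearson_score(adj, s, observers, times):
    """Score of suspect s (eq. 9): Pearson correlation between observer
    infection times and their distances to s."""
    d = bfs_distances(adj, s)
    xs = [times[o] for o in observers]
    ys = [d[o] for o in observers]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

When infection times exactly equal the distances to a suspect, its score reaches the maximum value of 1.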
2.2.4 Evaluation measures
The algorithms outlined in Section 2.2 assign a score to every node from the suspect set S. Sorting suspects by score in descending order produces a ranking, with the highest scoring suspect in first place, satisfying (2). The main measure of algorithm performance chosen for our purposes is precision, defined as

Precision = TP / (TP + FP),   (10)

where TP is the number of true positives, or correctly located sources (1 if φ(s*) = φ(ŝ), 0 otherwise), and FP is the number of false positives (all nodes s ≠ s* such that φ(s) = φ(ŝ)). If there are no ties, precision is a binary measure, equal to 1 if the source is at the top of the ranking and 0 otherwise. If s* is tied with n − 1 other nodes at the top of the ranking, then precision is equal to 1/n.
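For a single localization run, this definition reduces to a few lines (a sketch with our own naming; exact float equality stands in for a score tie):

```python
def precision(scores, true_source):
    """Precision (eq. 10) of one run: 1 / (number of nodes tied at the top
    score) if the true source is among them, 0 otherwise."""
    top = max(scores.values())
    tied = [v for v, sc in scores.items() if sc == top]
    return 1.0 / len(tied) if true_source in tied else 0.0
```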
The second evaluation measure chosen for the purpose of this article is the credible set size at confidence level α, denoted CSS_α. Intuitively, it can be understood as the size of the smallest set of top-scoring nodes containing the source with probability α. It is also equivalent to the α-quantile of the rank assigned to the source by the localization algorithm.
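Read as the α-quantile of the source's rank over many runs, CSS_α can be estimated as follows (a sketch; the quantile convention used here, the smallest rank covering a fraction ≥ α of runs, is one of several possible choices):

```python
import math

def css(source_ranks, alpha):
    """Credible set size at confidence level alpha, estimated as the
    alpha-quantile of the ranks assigned to the true source across runs."""
    ranks = sorted(source_ranks)
    # smallest index k such that a fraction >= alpha of runs has rank <= ranks[k]
    idx = math.ceil(alpha * len(ranks)) - 1
    return ranks[max(idx, 0)]
```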
3 Changing network topology
The localization derivations outlined in Section 2.2 assume that the initial network G = (V, E), in which the information propagates, is exactly the same as the graph used by the localization algorithms, G′ = (V′, E′). Such an assumption, while essential for analytical derivation, is not easily satisfied in practice. We examine a number of scenarios for which V = V′ and E ≠ E′. That is, the set of edges of the initial network G is different from the set of edges of the modified network G′, visible to the localization algorithms.
The difference between edge sets may be due to insufficient or inaccurate information about the network's structure. A simple interpretation of such a scenario is described in Section 3.2. Another possible reason could be natural changes (topological fluctuations) taking place between the time of propagation and the moment of localization, resulting from the non-static nature of a real-world network. In such a case the graphs G and G′ should possess the same (or highly similar) macroscopic properties, but differ on an edge-by-edge basis. We model this with exponential random graphs [14], using the structural Hamiltonian approach described in Section 3.3.
The general simulation procedure, illustrated in Fig. 1, can be summarized as follows:
1. Create a connected network of the desired model.
2. Designate r·N random nodes as observers.
3. Propagate information according to the synchronous SI model, recording infection times t_i.
4. Modify the network by adding random edges, removing random edges, or performing Metropolis steps.
5. Locate the information source using all three localization algorithms. If the network became disconnected, only the largest component is searched.
3.1 Graph dissimilarity measure
To compare different types of topology changes, an appropriate dissimilarity measure is required. We use the Jaccard distance between E and E′, defined as

d_J(E, E′) = 1 − J(E, E′) = 1 − |E ∩ E′| / |E ∪ E′|,   (11)

as the free variable in most of our simulations. Other measures, namely the Hamming distance, the Sørensen-Dice coefficient and set overlap, were considered. The Jaccard distance has been selected because it is normalized, satisfies the triangle inequality, and has an intuitive interpretation in all studied scenarios.
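Treating edges as unordered pairs, the Jaccard distance of equation (11) is straightforward to compute (a sketch with illustrative naming):

```python
def jaccard_distance(E, E2):
    """Jaccard distance (eq. 11) between two edge sets; edges are converted
    to frozensets so (u, v) and (v, u) denote the same undirected edge."""
    E, E2 = {frozenset(e) for e in E}, {frozenset(e) for e in E2}
    return 1.0 - len(E & E2) / len(E | E2)
```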
3.2 Adding/hiding links
In this scenario the number of edges in G′ is over- or underestimated. In the first case a fraction s_e of superfluous edges is randomly added to the network, such that

|E′| = (1 + s_e)|E|,   E ⊆ E′.   (12)

The Jaccard distance is thus

d_J(s_e) = 1 − |E| / ((1 + s_e)|E|) = s_e / (1 + s_e),   (13)

with the inverse relationship

s_e(d_J) = d_J / (1 − d_J).   (14)

Similarly, if instead of adding superfluous links we randomly hide a fraction h_e of them, we arrive at

|E′| = (1 − h_e)|E|,   E′ ⊆ E,   (15)

with Jaccard distance

d_J(h_e) = 1 − (1 − h_e)|E| / |E| = h_e.   (16)
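Both perturbations can be sketched as follows (illustrative naming; rounding the fractions s_e and h_e to whole edge counts is our choice):

```python
import random
from itertools import combinations

def add_edges(nodes, E, s_e, rng):
    """Overestimate the edge set: add a fraction s_e of superfluous random
    edges (eq. 12), giving Jaccard distance s_e / (1 + s_e), per eq. (13)."""
    E2 = {frozenset(e) for e in E}
    absent = [frozenset(p) for p in combinations(nodes, 2)
              if frozenset(p) not in E2]
    E2 |= set(rng.sample(absent, round(s_e * len(E))))
    return E2

def hide_edges(E, h_e, rng):
    """Underestimate the edge set: hide a fraction h_e of existing edges
    (eq. 15), giving Jaccard distance h_e, per eq. (16)."""
    E2 = [frozenset(e) for e in E]
    return set(rng.sample(E2, len(E2) - round(h_e * len(E2))))
```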
It should be noted that removing edges may result in component isolation. While the initial network is
guaranteed to be connected, this is no longer the case for the modified graph. Networks studied by us (see
Section 4) undergo a reversed process of percolation [15], while initially remaining far from the percolation