Impact of network topology changes on information source localization
Piotr Machura, Robert Paluch

Abstract
Well-established methods of locating the source of information in a complex network are usually derived
with the assumption of complete and exact knowledge of network topology. We study the performance of
three such algorithms (LPTVA, GMLA and Pearson correlation algorithm) in scenarios that do not fulfill
this assumption by modifying the network prior to localization. This is done by adding superfluous new
links, hiding existing ones, or reattaching links in accordance with the network’s structural Hamiltonian. We
find that GMLA is highly resilient to the addition of superfluous edges, as its precision falls by more than
statistical uncertainty only when the number of links is approximately doubled. On the other hand, if the
edge set is underestimated or reattachment has taken place, the performance of GMLA drops significantly. In
such a scenario the Pearson algorithm is preferable, retaining most of its performance when other simulation
parameters favor localization (high density of observers, highly deterministic propagation). It is also generally
more accurate than LPTVA, as well as orders of magnitude faster. The aforementioned differences between
localization algorithms can be intuitively explained, although a need for further theoretical research is noted.
Keywords: complex network, information propagation, source localization, exponential random graphs
1 Introduction
The exchange of information is the main purpose of many networks found in our everyday life. The creation of
the Internet enabled its users to communicate on an unprecedented scale. The rapid rise in the usage of social
media [1] fundamentally changed the way we view and absorb the information presented to us by acquaintances
and news publishers alike. The emergence of clickbait [2] promotes sensation over honesty, encouraging the
usage of exaggeration and deceit in the name of profitable engagement [3]. Due to the instantaneous nature
of social media, misinformation propagates in a flash, reaching a wide audience of users. As a result we, the
recipients, are often unaware of the true origin of viral gossip we encounter, making it difficult to judge and
interpret properly.
The scientific community has taken it upon itself to solve this dilemma by establishing methods to identify
and combat the spread of misinformation [4, 5]. In this manner, the topic of source location is of crucial
importance. Knowing where a specific piece of information came from can aid in identifying its purpose and
reliability.
Among various types of localization methods, categorized by Jiang et al. [6], observer-based localization
methods are especially popular due to their relatively low requirements. They can accurately pinpoint the source
of information using data provided by only a few members of the network, together with the knowledge of its
structure. When deriving localization methods, it is usually assumed that this structure (network topology) is
known completely and exactly. We propose a set of scenarios, outlined in Section 3, in which this assumption
does not hold. That is, the network in which the information has spread is not the same as the network used
during localization. This may be due to errors made during network reconstruction, lack of knowledge regarding
the full structure, or natural evolution occurring between the time of propagation and the moment in which the
network is recorded.
Examining the simulation performance of three popular source localization algorithms, outlined in Section
2.2, we seek to find and describe the qualitative differences between them. Results of such simulations can
potentially aid in choosing the right algorithm for a real-world situation, in which the “correctness” of the
reconstructed network is uncertain.
2 Basics
We assume a simple agent-based model, in which N agents are located in the vertices of a complex network, i.e. an unweighted, undirected graph G = (V, E). Edges of such graphs represent a bi-directional connection through which information may propagate.
arXiv:2210.14737v1 [physics.soc-ph] 26 Oct 2022
Figure 1: Illustration of the simulation methodology. a) At an unknown time t* the source s* is spontaneously infected. b) Information propagates through the network. Infection times t_i of chosen nodes (observers o_i) are recorded. c) The network undergoes topological changes. Some existing edges are removed (dotted line) and some are added (double line), producing the modified network. The graph may become disconnected in the process. d) The modified network is cropped to the largest component before the source localization algorithm is run.
2.1 Information spread
To simulate the spread of information, we use the synchronous Susceptible-Infected (SI) model [7]. In the SI model, each node can be in one of two states: susceptible/healthy (uninformed) or infected (informed). In the beginning, all nodes are healthy. At some unknown point in time t*, a single randomly chosen node becomes spontaneously infected. We will refer to this node as the source of information, denoted s*. In every subsequent time step, infected nodes try to infect their susceptible neighbors, each attempt succeeding with probability β, called the infection rate. The propagation ends when there are no more healthy nodes in the network. It is assumed that the network is connected (i.e. there is a path between each pair of nodes) during the spread. After the propagation completes and the network topology is changed (see Section 3), this assumption no longer holds.
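The synchronous SI update described above can be sketched as follows (the adjacency-dictionary representation and the function name are our own illustrative choices, not taken from [7]):

```python
import random

def si_spread(adj, source, beta, rng):
    """Synchronous SI propagation on an adjacency dict {node: set(neighbors)}.

    Returns a dict mapping each node to its infection time step
    (the source is infected at time 0)."""
    t = {source: 0}          # infection times; infected nodes are the keys
    infected = {source}
    step = 0
    while len(t) < len(adj):
        step += 1
        newly = set()
        for u in infected:
            for v in adj[u]:
                # each infected node tries to infect each susceptible
                # neighbor independently with probability beta
                if v not in t and rng.random() < beta:
                    newly.add(v)
        for v in newly:
            t[v] = step
        infected |= newly
    return t
```

With β = 1 the spread is deterministic and each node's infection time equals its hop distance from the source.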
2.2 Locating the source
We consider observer -based methods of source localization. These algorithms are popular in the literature [8–12]
due to their realistic assumptions, making possible real-world implementation and usage. The observer nodes
provide some limited knowledge about the spread after it has concluded. They may represent e.g. social media
moderators capable of providing insight, or individual users reporting their encounters with concerning content.
In our simulations observers form a randomly chosen subset of graph vertices O ⊆ V, with the ratio r = |O|/|V| called the observer density. It is assumed that each observer o_i measures the time of its infection t_i. This is notably different from the assumptions made by Pinto et al. [8] and Xu, Teng, Zhou et al. [13], where both the observer infection time and the node by which the observer was infected are known. We find that this limited approach can still result in satisfactory algorithm performance, while significantly reducing the amount of information required to perform localization.
A source localization algorithm assigns a score to every suspect node s ∈ S ⊆ V (including observers), given the vector of observer infection times [t]_i = t_i, the infection rate β and the graph topology. It is therefore equivalent to a score function φ(s; t, β, G). Intuitively, φ(s) should be associated with the posterior probability of s being the true source s*, that is

P(s = s* | t, G) = f(φ(s; t, β, G)),   (1)

where f is any strictly increasing function. This allows for ranking suspects by their score, with the highest scoring suspect

ŝ = argmax_s φ(s; t, β, G)   (2)

being the most probable origin of the information. If it is true that ŝ = s*, then the localization is successful. See Section 2.2.4 for further details on evaluating localization algorithms.
2.2.1 Maximum likelihood algorithm
The maximum likelihood algorithm has been proposed by Pinto et al. in [8]. We will refer to it as LPTVA (Limited Pinto-Thiran-Vetterli Algorithm), with "limited" signifying the fact that only the infection times t_i are used. It is analytically derived for tree graphs and extended to general graphs in the way outlined below.

The unknown inception time t* is accounted for by choosing a reference observer o_1 and constructing a vector of infection delays t' with respect to t_1:

[t']_i = t_{i+1} − t_1.   (3)
If we assume that the graph in question is a tree T and let P(v_i, v_j) denote the path between vertices v_i and v_j, the expected value of the t' vector (given s is the source) can be written as

[µ_s]_i = µ (|P(s, o_{i+1})| − |P(s, o_1)|),   (4)

where µ is the mean time it takes for information to transmit through one edge. In the case of the SI model it is equal to the mean of a geometric distribution, µ = 1/β. The covariance matrix Λ of t' is given by

[Λ]_{i,j} = σ² |P(o_{i+1}, o_1) ∩ P(o_{j+1}, o_1)|,   (5)

where σ² is the variance of the time it takes for information to transmit through one edge. In the case of the SI model it is equal to the variance of a geometric distribution, σ² = (1 − β)/β².

Using the central limit theorem, we can postulate that t', as a sum of i.i.d. single-edge delays, follows a multivariate normal distribution with mean µ_s and covariance matrix Λ. Taking the logarithm of this distribution and stripping away s-independent normalization factors, we arrive at the score function

φ_LPTVA(s) = −(t' − µ_s)ᵀ Λ⁻¹ (t' − µ_s) − ln |Λ|,   (6)

which can be calculated for all nodes in the network.
If the graph in question is not a tree, the algorithm can be adjusted by calculating the score given by (6) with respect to the breadth-first search (BFS) tree T_BFS(s) rooted at s. Such a modification is equivalent to the assumption that the information propagates along the shortest paths connecting the source with the observers. This holds exactly when β = 1, and contributes to significantly reduced algorithm performance for non-tree graphs with β significantly lower than 1.
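One possible implementation of this score under the assumptions above (geometric single-edge delays, paths taken in the BFS tree rooted at the suspect) is sketched below. All helper names are ours, NumPy is our choice for the linear algebra, and edge cases such as a singular Λ when β = 1 are not handled:

```python
import numpy as np
from collections import deque

def bfs_tree(adj, root):
    """Parent pointers and depths of the BFS tree rooted at `root`."""
    parent, depth = {root: None}, {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v], depth[v] = u, depth[u] + 1
                q.append(v)
    return parent, depth

def tree_path_edges(parent, depth, a, b):
    """Edge set of the unique a--b path in a tree with parent pointers."""
    edges = set()
    while a != b:                 # climb the deeper endpoint until they meet
        if depth[a] < depth[b]:
            a, b = b, a
        edges.add(frozenset((a, parent[a])))
        a = parent[a]
    return edges

def lptva_score(adj, s, observers, times, beta):
    """LPTVA score (eq. 6) of suspect s, using the BFS tree rooted at s."""
    mu, sigma2 = 1.0 / beta, (1.0 - beta) / beta ** 2
    parent, depth = bfs_tree(adj, s)
    o1, rest = observers[0], observers[1:]
    d = np.array([times[o] - times[o1] for o in rest], dtype=float)   # t', eq. (3)
    mu_s = mu * np.array([depth[o] - depth[o1] for o in rest])        # eq. (4)
    paths = [tree_path_edges(parent, depth, o1, o) for o in rest]
    lam = sigma2 * np.array([[len(p & q) for q in paths]
                             for p in paths], dtype=float)            # eq. (5)
    r = d - mu_s
    return float(-r @ np.linalg.inv(lam) @ r - np.log(np.linalg.det(lam)))
```

On a path graph with observers at both ends and delays consistent with the middle node, the middle node indeed scores higher than the endpoints.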
2.2.2 Gradient maximum likelihood algorithm
The LPTVA, while well established and highly analytical, suffers from significant computation speed issues. The matrix inversion in equation (6) must be performed for each suspect, rendering the method effectively unusable for networks with a large number of nodes.

The gradient maximum likelihood algorithm (GMLA) [9] attempts to rectify this problem by prioritizing the earliest observers and utilizing a gradient-like selection of suspects. As a consequence, unlike in the case of LPTVA and the Pearson correlation algorithm, the set of suspects is generally only a small subset of all nodes in the network. All other nodes v ∈ (V \ S) are assigned the same score, effectively equal to −∞. While the score function used to rank suspects in GMLA is the same as (6), the choice of suspects is much more sophisticated.
Firstly, only the K_f earliest-infected observers are used in the calculations. The performance of the algorithm varies depending on the exact value of K_f, with the optimum usually estimated for each network model on a case-by-case basis. In our calculations we have used K_f = √N for all considered networks, which is an acceptable rule of thumb [9].
Secondly, the score function is not calculated for all N vertices. Instead, it is first calculated for the neighbors S(1) of the earliest observer o_1. Next, the score is calculated for the previously unchecked neighbors of the highest scoring ŝ(1) from S(1), creating another set of suspects S(2), from which the highest scoring ŝ(2) is chosen. This process is repeated until there is no node in S(n) (neighbors of ŝ(n−1) not checked in previous steps) with a score higher than ŝ(n−1). In such a case, the algorithm stops and returns the scores calculated for all members of

S = ⋃_{i=1}^{n} S(i).   (7)
Figure 2: Node selection in GMLA. Initially, the neighbors of the earliest informed observer o_1 are checked (S(1), colored orange). The highest scoring vertex ŝ(1) from this set is chosen, and its neighbors S(2) are checked (colored blue). Note that o_1 is included in S(2). Next, the neighbors S(3) of ŝ(2) (colored purple) are checked, among which the highest scoring vertex ŝ(3) can be found. Since the score of ŝ(3) is lower than that of ŝ(2), the algorithm terminates with ŝ = ŝ(2). The final suspect set consists of all colored nodes, S = S(3) ∪ S(2) ∪ S(1). White vertices remain unchecked and are placed at the bottom of the ranking, at position N + 1, where N is the size of the network.
Empirical studies suggest that the total number of suspects roughly follows |S| ∼ k̂ log N, where k̂ is the average degree in the network. Fig. 2 illustrates this gradient-like selection process.
GMLA ensures that nodes near high-scoring suspects are checked, while those far away are ignored, significantly reducing computation time. Using only the K_f first observers also proves to be highly beneficial, bringing the overall localization capability of GMLA in line with, or even exceeding, that of LPTVA [10].
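The gradient-like suspect selection can be sketched independently of the score function, which we pass in as a callable (function names are our own, not from [9]):

```python
def gmla_suspects(adj, o1, score):
    """Gradient-like suspect selection of GMLA: starting from the earliest
    observer o1, repeatedly score the unchecked neighbors of the current
    best suspect and move there, until no node scores higher than it."""
    scores = {}
    def check(nodes):
        for v in nodes:
            if v not in scores:
                scores[v] = score(v)
    check(adj[o1])                       # S(1): neighbors of o1
    best = max(scores, key=scores.get)
    while True:
        # S(n): neighbors of the current best not checked in previous steps
        check([v for v in adj[best] if v not in scores])
        candidate = max(scores, key=scores.get)
        if candidate == best:            # no unchecked neighbor improved
            return best, scores
        best = candidate
```

Nodes never reached by this walk (the "white vertices" of Fig. 2) simply stay out of the returned score dictionary, i.e. at the bottom of the ranking.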
2.2.3 Pearson correlation algorithm
Proposed by Xu, Teng, Zhou et al. [13], the correlation algorithm stands out as an extremely straightforward, yet highly performant method. If we assume that the information propagates along the shortest paths connecting the source to the observers (which is also an assumption required for extending LPTVA to general graphs), then it should be intuitively obvious that the greater the distance from an observer to the source, the longer it takes for this observer to become infected.
More precisely, in the case of the true source s*, the vector of infection times t should be correlated with the vector of distances between the source and the observers

[d_s]_i = d(s, o_i),   (8)

where the distance d(v_i, v_j) is defined as the length of the shortest path connecting vertices v_i and v_j. The score, calculated for all suspects, is therefore given by the Pearson correlation coefficient between t and the distances to the suspect d_s:

φ_Pearson(s) = ρ_Pearson(t, d_s),   (9)

which should be the highest for the true source s = s*.
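A direct sketch of this score, assuming hop distances computed by BFS (helper names are ours; ties and zero-variance time vectors are not handled):

```python
import math
from collections import deque

def bfs_distances(adj, root):
    """Shortest-path (hop) distances from `root` on an unweighted graph."""
    dist = {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def pearson_score(adj, s, observers, times):
    """Score of suspect s (eq. 9): Pearson correlation between observer
    infection times and their distances to s."""
    d = bfs_distances(adj, s)
    xs = [times[o] for o in observers]
    ys = [d[o] for o in observers]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

When infection times exactly equal the distances to a suspect, its score reaches the maximum value of 1.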
2.2.4 Evaluation measures
The algorithms outlined in Section 2.2 assign a score to every node from the suspect set S. Sorting suspects by score in descending order produces a ranking, with the highest scoring suspect in first place, satisfying (2). The main measure of algorithm performance chosen for our purposes is precision, defined as

Precision = TP / (TP + FP),   (10)

where TP is the number of true positives, or correctly located sources (1 if φ(s*) = φ(ŝ), 0 otherwise), and FP is the number of false positives (all nodes s ≠ s* such that φ(s) = φ(ŝ)). If there are no ties, precision is a binary measure, equal to 1 if the source is at the top of the ranking and 0 otherwise. If s* is tied with n − 1 other nodes at the top of the ranking, then precision is equal to 1/n.
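For a single localization run, this definition reduces to a few lines (a sketch with our own naming; exact float equality stands in for a score tie):

```python
def precision(scores, true_source):
    """Precision (eq. 10) of one run: 1 / (number of nodes tied at the top
    score) if the true source is among them, 0 otherwise."""
    top = max(scores.values())
    tied = [v for v, sc in scores.items() if sc == top]
    return 1.0 / len(tied) if true_source in tied else 0.0
```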
The second evaluation measure chosen for the purpose of this article is the credible set size at confidence level α, denoted CSS_α. Intuitively, it can be understood as the size of the smallest set of top-scoring nodes containing the source with probability α. It is also equivalent to the α-quantile of the rank assigned to the source by the localization algorithm.
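Read as the α-quantile of the source's rank over many runs, CSS_α can be estimated as follows (a sketch; the quantile convention used here, the smallest rank covering a fraction ≥ α of runs, is one of several possible choices):

```python
import math

def css(source_ranks, alpha):
    """Credible set size at confidence level alpha, estimated as the
    alpha-quantile of the ranks assigned to the true source across runs."""
    ranks = sorted(source_ranks)
    # smallest index k such that a fraction >= alpha of runs has rank <= ranks[k]
    idx = math.ceil(alpha * len(ranks)) - 1
    return ranks[max(idx, 0)]
```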
3 Changing network topology
The localization derivations outlined in Section 2.2 assume that the initial network G = (V, E), in which the information propagates, is exactly the same as the graph used by the localization algorithms, G′ = (V′, E′). Such an assumption, while essential for analytical derivation, is not easily satisfied in practice. We examine a number of scenarios for which V = V′ and E ≠ E′. That is, the set of edges of the initial network G is different from the set of edges of the modified network G′, visible to the localization algorithms.
The difference between edge sets may be due to insufficient or inaccurate information about the network's structure. A simple interpretation of such a scenario is described in Section 3.2. Another possible reason could be natural changes (topological fluctuations) taking place between the time of propagation and the moment of localization, resulting from the non-static nature of a real-world network. In such a case the graphs G and G′ should possess the same (or highly similar) macroscopic properties, but differ on an edge-by-edge basis. We model this with exponential random graphs [14], using the structural Hamiltonian approach described in Section 3.3.
The general simulation procedure, illustrated in Fig. 1, can be summarized as follows:
1. Create a connected network of the desired model.
2. Designate r·N random nodes as observers.
3. Propagate information according to the synchronous SI model, recording infection times t_i.
4. Modify the network by adding random edges, removing random edges, or performing Metropolis steps.
5. Locate the information source using all three localization algorithms. If the network became disconnected, only the largest component is searched.
3.1 Graph dissimilarity measure
To compare different types of topology changes, an appropriate dissimilarity measure is required. We use the Jaccard distance between E and E′, defined as

d_J(E, E′) = 1 − J(E, E′) = 1 − |E ∩ E′| / |E ∪ E′|,   (11)

as the free variable in most of our simulations. Other measures, namely the Hamming distance, the Sørensen-Dice coefficient and set overlap, were considered. The Jaccard distance has been selected because it is normalized, satisfies the triangle inequality, and has an intuitive interpretation in all studied scenarios.
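Treating edges as unordered pairs, the Jaccard distance of equation (11) is straightforward to compute (a sketch with illustrative naming):

```python
def jaccard_distance(E, E2):
    """Jaccard distance (eq. 11) between two edge sets; edges are converted
    to frozensets so (u, v) and (v, u) denote the same undirected edge."""
    E, E2 = {frozenset(e) for e in E}, {frozenset(e) for e in E2}
    return 1.0 - len(E & E2) / len(E | E2)
```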
3.2 Adding/hiding links
In this scenario the number of edges in G′ is over- or underestimated. In the first case a fraction s_e of superfluous edges is randomly added to the network, such that

|E′| = (1 + s_e)|E|,   E ⊆ E′.   (12)

The Jaccard distance is thus

d_J(s_e) = 1 − |E| / ((1 + s_e)|E|) = s_e / (1 + s_e),   (13)

with the inverse relationship

s_e(d_J) = d_J / (1 − d_J).   (14)

Similarly, if instead of adding superfluous links we randomly hide a fraction h_e of them, we arrive at

|E′| = (1 − h_e)|E|,   E′ ⊆ E,   (15)

with Jaccard distance

d_J(h_e) = 1 − (1 − h_e)|E| / |E| = h_e.   (16)
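Both perturbations can be sketched as follows (illustrative naming; rounding the fractions s_e and h_e to whole edge counts is our choice):

```python
import random
from itertools import combinations

def add_edges(nodes, E, s_e, rng):
    """Overestimate the edge set: add a fraction s_e of superfluous random
    edges (eq. 12), giving Jaccard distance s_e / (1 + s_e), per eq. (13)."""
    E2 = {frozenset(e) for e in E}
    absent = [frozenset(p) for p in combinations(nodes, 2)
              if frozenset(p) not in E2]
    E2 |= set(rng.sample(absent, round(s_e * len(E))))
    return E2

def hide_edges(E, h_e, rng):
    """Underestimate the edge set: hide a fraction h_e of existing edges
    (eq. 15), giving Jaccard distance h_e, per eq. (16)."""
    E2 = [frozenset(e) for e in E]
    return set(rng.sample(E2, len(E2) - round(h_e * len(E2))))
```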
It should be noted that removing edges may result in component isolation. While the initial network is
guaranteed to be connected, this is no longer the case for the modified graph. Networks studied by us (see
Section 4) undergo a reversed process of percolation [15], while initially remaining far from the percolation