Congruity of genomic and epidemiological data in modeling of local cholera outbreaks
Mateusz Wilinski1, Lauren Castro2, Jeffrey Keithley2,3, Carrie Manore1, Josefina
Campos4, Ethan Romero-Severson1, Daryl Domman5, Andrey Y. Lokhov1
1Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM USA
2Analytics, Intelligence and Technology Division,
Los Alamos National Laboratory, Los Alamos, NM USA
3Department of Computer Science, University of Iowa, Iowa City, IA, USA
4UO Centro Nacional de Genómica y Bioinformática,
ANLIS “Dr. Carlos G. Malbrán”, Buenos Aires, Argentina and
5Center for Global Health, Department of Internal Medicine,
University of New Mexico Health Sciences Center, Albuquerque, NM USA
Cholera continues to be a global health threat. Understanding how cholera spreads between lo-
cations is fundamental to the rational, evidence-based design of intervention and control efforts.
Traditionally, cholera transmission models have utilized cholera case count data. More recently,
whole genome sequence data has qualitatively described cholera transmission. Integrating these
data streams may provide much more accurate models of cholera spread; however, no systematic
analyses have been performed so far to compare traditional case-count models to the phylody-
namic models from genomic data for cholera transmission. Here, we use high-fidelity case count
and whole genome sequencing data from the 1991-1998 cholera epidemic in Argentina to directly
compare the epidemiological model parameters estimated from these two data sources. We find that
phylodynamic methods applied to cholera genomics data provide comparable estimates that are in
line with established methods. Our methodology represents a critical step in building a framework
for integrating case-count and genomic data sources for cholera epidemiology and other bacterial
pathogens.
INTRODUCTION
Cholera is a major public health threat with an esti-
mated 4 million cases a year and over 150,000 deaths an-
nually [1]. Cholera is an acute diarrheal disease caused by
toxigenic Vibrio cholerae, and is transmitted through the
fecal-oral route from contaminated food or water. Cur-
rently, the burden of disease is primarily in sub-Saharan
Africa and South Asia, in vulnerable populations with a
lack of access to clean drinking water and sanitation [2].
Of note, some of the largest cholera epidemics have oc-
curred since the 1990s. Case counts over 1 million were
documented for the 1991-1998 epidemic in Latin America
[3], over 800,000 cases in Haiti from 2010-2020 [4], and
over 2.5 million cases thus far in the ongoing epidemic in
Yemen that began in 2016 [5].
Understanding how cholera is transmitted within and
across populations is paramount to the rational design
and implementation of control efforts. One of the dif-
ficulties in traditional modeling of cholera outbreaks
with case-count data alone is that the inference results
strongly depend on the quality of the reporting proce-
dures, while detailed properties of statistical counting
noise are often unknown and cannot be easily estimated.
Furthermore, accurate case counting is hampered by the
large heterogeneity in disease presentation which ranges
from asymptomatic to severe cholera [2]. Genetic se-
quence data offers another data source on transmission
dynamics that could help develop a new generation of
[Footnote: ddomman@salud.unm.edu, lokhov@lanl.gov]
complex, but well constrained cholera models. This is
possible due to the fact that epidemiological processes
such as transmission and migration of infection leave a
trace in pathogen genetic sequence data by changing the
underlying infection genealogy of a sampled set of sequences [6–8]. For example, compared to a
stable population, in a population experiencing rapid growth, two randomly selected people
will, on average, share a common ancestor in the more distant past [9], and therefore be
separated by a larger number of mutations. Likewise, in otherwise isolated populations,
migration links pathogens through a network of common descent [10], leading to a distinctive
pattern of interdigitated sequences in a phylogeny. Therefore, pathogen genetic sequence data
generally has the potential to inform epidemiological parameters such as transmission,
migration, and mixing rates.
Previous work has used Vibrio cholerae sequence data
to describe the broad, qualitative flow of cholera at the
global scale that can capture broad trends but not ex-
plicit details of the transmission process [11–15]. How-
ever, the main challenge in the effort to integrate ge-
nomic data into cholera transmission models is the lack
of evidence that genomic data are informative of cholera
transmission processes at the local scale, i.e. that there is
sufficient genetic diversity in a single-source, local cholera
outbreak to estimate transmission parameters.
Case counts and the genetic sequence data can be
thought of as being two independent observers of the
transmission dynamics of cholera. In this paper, we utilize a rich data set that combines case
count data with a large collection of genomic sequencing data from Argentina [16] to model the
spread of cholera and specifically address the role of local migration as a driving factor in
cholera transmission. Our goal is to build a meta-population model with a minimal number of
assumptions which accounts for migration, and to understand whether the two observers agree in
their predictions. The two data sources are bridged and jointly constrained using yet another
source of independent data, such as high-fidelity estimates of migration flows.
arXiv:2210.01956v2 [q-bio.QM] 30 Mar 2023
Our modeling choices are aimed at simplicity and
driven by the overall goal of checking the consistency be-
tween these data sources. For instance, we deliberately
take an agnostic stance on the open questions related to
the role of the environment or details of bacterial dynam-
ics: while several previous studies explicitly included the
environmental compartment [17–21] leading to a larger
number of model parameters, we choose to model cholera
dynamics as an effective transmission process which in-
cludes a periodic functional dependence on the season-
ality, similarly to the approach of [22]. To account for
discreteness in observed cases, we propose a novel sam-
pling model which relates the continuous model with the
discrete observed case counts.
Most of the key model parameters will be directly inferred from case counts and migration data.
Further, a subset of the most important parameters, such as the transmission amplitudes and the
fraction of asymptomatic infections, is independently inferred from the genomic sequence data
and compared, within uncertainties, to the values inferred from the other data sources. In
particular, we do not make any a priori quantitative assumptions on the fraction of
asymptomatic infections, which in previous studies ranges from 1% to more than 90% of the
population [23–25]. Instead, we keep this important model parameter free, infer its values from
data under different settings, and discuss the sensitivity of this parameter to various
modeling assumptions. We also provide a series of sensitivity studies that examine the
stability of the inferred parameters with respect to all of our modeling assumptions.
In this paper, we present initial evidence that phylody-
namic methods can be used to study cholera outbreaks
at a regional level and that they produce parameter esti-
mates that are consistent with established methods. Our
approach provides a common methodology for an early
analysis of the model viability in the context of joint in-
ference from different data sources. Given the comple-
mentary view offered by independent data sources, we
anticipate that the analysis presented in this paper will
find widespread use in building joint hybrid epidemiological and genetic models which could
help verify the main modeling assumptions.
RESULTS
Integrated data from case counts, genomics,
and transportation data. Cholera was first reported
in Argentina in 1992, and subsequent cholera cases were
reported until 1998 [26–30]. Out of the total 4,281 cases
reported, over 3,500 Vibrio cholerae isolates were stored
at INEI-ANLIS “Dr. Carlos G. Malbrán”, the national reference laboratory for Argentina, and a
representative sub-sample of 532 of these isolates was previously whole-genome sequenced [16];
see Supplementary Materials, section A for more details. We sought to determine whether the
epidemiological and genomic data agree. First, we pre-processed the data set by removing cities
with insufficient genomic samples (fewer than 40 sequences). This left us with three target
cities: Tartagal, San Ramón de la Nueva Orán (both in Salta province) and San Salvador de Jujuy
(in Jujuy province), located in the northwest of Argentina (see Fig. 1). Initial reports in
1992 indicated that cholera was first introduced into Argentina via this region from Bolivia,
leading to a large outbreak from 1992-1993 [31].
In addition to the epidemiological and sequence data,
we used publicly available data on domestic travel to es-
timate the movement of population between these three
cities during the study period. Focusing on the two pri-
mary means of transportation, flights and buses, we were
able to estimate the typical number of people travelling
daily between the selected cities (for details see Materials
and Methods and Supplementary Materials, section B).
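As a rough illustration of how a daily traveller count could be assembled from timetable-style information, the sketch below sums seats offered per day over transport modes, weighted by an occupancy factor. The function name, the input layout, and every number are hypothetical; the procedure actually used is the one described in the Supplementary Materials, section B.

```python
def daily_travellers(services):
    """Estimate people travelling per day on one city pair.

    `services` is a list of (trips_per_day, seats_per_trip, occupancy) tuples,
    one per transport mode. This layout is an assumption made for illustration.
    """
    return sum(trips * seats * occupancy for trips, seats, occupancy in services)

# Illustrative numbers only; none of them come from the paper.
bus = (6, 45, 0.7)       # six buses a day, 45 seats, 70% full
flight = (1, 100, 0.6)   # one flight a day, 100 seats, 60% full
estimate = daily_travellers([bus, flight])
```

An estimate of this kind would then play the role of a flow between two cities in the meta-population model.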
Modeling assumptions. We modeled the cholera
transmission dynamics using a system of ordinary dif-
ferential equations (ODEs) where the population is split
into compartments representing individuals in different
states of infection. Typical cholera models are a sys-
tem of ODEs representing a modification of the clas-
sical Susceptible-Infected-Recovered (SIR) type model
[32] with varying degrees of complexity (see [17–22, 33]).
Here, we present a new, simple ODE cholera model that
significantly advances estimation of key epidemiological
parameters in two ways: (i) we focus on a minimalist
representation which allows us to reliably infer model parameters from a limited amount of data while introducing
the fewest assumptions; and (ii) we use a meta-population structure to leverage the spatial knowledge
on reported cases and travel patterns that can represent
the major spreading mechanism. We also do not con-
sider re-infection in our models of localized outbreaks, as
protective immunity against cholera has been estimated
to last at least 3 years [2]. Prior to formally introduc-
ing our dynamic model, we discuss the main modeling
assumptions behind our approach.
Many cholera models in the literature include an environmental compartment [17–21]. Such a
compartment is typically introduced to explicitly model the transmission of infection through a
water source, and additionally describes the evolution of bacteria in the water source with
temperature-dependent dynamics. From the fitting perspective, an environmental component may
help a multi-year epidemic outbreak (see the span of observed cases in the Supplementary
Materials, Fig. S1) survive the period of cool temperatures, when the number of cases drops
significantly, and re-occur when the temperature rises. During our initial model exploration,
we tested an extension of our model that included an environmental compartment, finding that
its inclusion did not improve the quality of the fit, while at the same time it introduced
additional parameters that needed to be inferred from data. For this reason, and in order to
keep the number of model parameters small, we do not explicitly include the environmental
compartment in our model. Instead, the seasonal component of cholera transmission, well
documented in [34–36], is included in a direct transmission parameter of our model. This direct
contact parameter is an effective parameter describing the spread of cholera which includes all
potential transmission channels, similarly to an approach used in [22]. Previously, seasonally
modulated transmission parameters were suggested in a fully theoretical framework in [37], but
they were not applied to empirical data.

[Figure 1. Left: map of northern Argentina and Bolivia with the three cities marked. Right: for
each city i = 1, 2, 3, compartments S_i, I_i, A_i, R_i with transitions S_i → I_i at rate
β(1−p), S_i → A_i at rate βp, and I_i → R_i, A_i → R_i at rate γ; the cities are linked by
flows f_12, f_13, f_23.]
FIG. 1: Meta-population model of the cholera transmission dynamics overlaid on a map of
northern Argentina. Left: The three cities considered in our study are marked as follows:
Tartagal (green circle), San Ramón de la Nueva Orán (blue circle; in what follows, referred to
as Oran), and San Salvador de Jujuy (orange circle; in what follows, referred to as Jujuy).
Right: The dynamics inside each city population is modelled using the
Susceptible-Infected-Asymptomatic-Recovered (SIAR) model with seasonality-modulated infection
rate β(t), recovery rate γ, and the parameter p representing the fraction of asymptomatic cases
under the infection process. The amplitude of the infection rate is β_s for cities with smaller
population (Tartagal and Oran), and β_l for a larger city (Jujuy). Black arrows represent the
migration flows of susceptible and asymptomatic individuals between cities, proportional to the
flow rates f_ij for migration between locations i and j. A more detailed description of the
model is provided in the Materials and Methods, as well as in the Supplementary Materials,
section C.
Our second key assumption is related to the presence
of an asymptomatic population, which does not display
any strong symptoms, but nevertheless contributes to the
infection spread via migration. This population is not di-
rectly observed, but contributes to the cholera dynamics
via a dedicated compartment A. The associated parameter p describes the fraction of the infected
population which falls into the A compartment upon infection, while the rest of the population
falls into the I compartment, which
describes the symptomatic population. In spite of the
general agreement that a significant number of individ-
uals infected by cholera display no apparent symptoms
and play a crucial role in spreading the disease between
different locations, the literature is not conclusive about
the proportion of asymptomatic carriers. For instance,
[23] suggests that between 1% and 25% of infected cases
are asymptomatic; [24] estimates p closer to 50%; and according to [25], the asymptomatic
population represents the majority of cases. Therefore, we treat p as one
of the key free parameters in our model which will need
to be inferred from data. We further assume that the
transmission dynamics between cities through migration
is mediated by the asymptomatic carriers only. This as-
sumption is incorporated into the meta-population model
through a migration term which is proportional to the
number of asymptomatic individuals in the population,
and is appropriately normalized so that the city popula-
tions do not change. Additionally, these migration terms
are informed by independently estimated travel rates, as
we discuss below. These migration terms link epidemic
trajectories in different cities and thus facilitate the identification of the parameter p
related to the fraction of asymptomatic cases upon infection.
Meta-population model. Here, we formally sum-
marize our dynamic meta-population model. The cholera
dynamics in each of the three cities is described using a
homogeneous SIR-like SIAR model, linked by a flow of
asymptomatic infected individuals between cities. More
precisely, we divide the population of each city into four
compartments: susceptible (S), infected symptomatic (I),
infected asymptomatic (A) and recovered (R). The mi-
gration mechanism allows for a mixing of different city
populations with two conditions: (i) symptomatic in-
fected individuals do not move between locations; (ii)
traveling on average does not change the population of
each city. A schematic representation of the structure of
the meta-population model and the details of the single-
city model are shown in Fig. 1. The exact system of
ordinary differential equations used to build the model is
described in detail in the Supplementary Materials, sec-
tion C.
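For orientation, one system of equations that is consistent with the verbal description above can be written as follows. This is a sketch under stated assumptions, not the system from the Supplementary Materials: it assumes that symptomatic and asymptomatic individuals are equally infectious, that transmission scales with the infected fraction of the city population N_i, and that the net inflow m_i of asymptomatic carriers into city i is compensated by an equal outflow of susceptible individuals so that N_i stays constant.

```latex
\begin{aligned}
\lambda_i(t) &= \beta_i(t)\,\frac{I_i + A_i}{N_i}, \qquad
m_i = \sum_{j \neq i} f_{ij}\left(\frac{A_j}{N_j} - \frac{A_i}{N_i}\right),\\
\frac{dS_i}{dt} &= -\lambda_i S_i - m_i,\\
\frac{dI_i}{dt} &= (1-p)\,\lambda_i S_i - \gamma I_i,\\
\frac{dA_i}{dt} &= p\,\lambda_i S_i - \gamma A_i + m_i,\\
\frac{dR_i}{dt} &= \gamma\,(I_i + A_i).
\end{aligned}
```

Under these assumptions the four right-hand sides sum to zero for each city, which is condition (ii), and no term moves individuals out of or into the I compartment by travel, which is condition (i).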
The model contains a total of 10 epidemiological,
travel, and demographic parameters. The epidemiologi-
cal parameters are the recovery rate γ, which represents
the inverse of expected days to recovery, the parameter
p, which represents the fraction of asymptomatic cases
emerging upon infection, and the infection rate β(t) mod-
ulated by a time-dependent function reflecting the sea-
sonal changes. Importantly, while the seasonality itself
is assumed to be the same for all three geographically
close cities, the amplitudes β_s and β_l, respectively representing the smaller population
cities Tartagal and San Ramón de la Nueva Orán, and the larger population San Salvador de
Jujuy, may be different. The transmission amplitudes β_s and β_l can a priori take different
values, for instance due to an expectation of better infrastructure in larger cities,
potentially leading to access to higher quality health care and resulting in lower infection rates.
Additionally, the model parameters include the initial
demographic structure of all four compartments in each
location. Finally, the model contains a set of migration
parameters f_ij describing the flow of people between the
cities. The procedure for estimating the migration pa-
rameters is given in the Supplementary Materials, sec-
tion B. A more detailed description of the fixed and free
parameters is included in the Materials and Methods, as
well as in the Supplementary Materials, section C.
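To make the roles of these parameters concrete, the following is a minimal simulation sketch of a three-city SIAR model. It is not the authors' implementation. The seasonal shape of β(t), the exact form of the migration term, the populations, the initial conditions, and the flow values are all assumptions made for illustration; only p, β_s, β_l (Table I) and the 7-day recovery period are taken from the text.

```python
import math

def beta_t(amplitude, t, period=365.0):
    """Assumed seasonal infection rate: non-negative with a one-year period."""
    return amplitude * 0.5 * (1.0 + math.cos(2.0 * math.pi * t / period))

def siar_rhs(t, state, params):
    """Time derivatives for every city; state[i] = (S, I, A, R) of city i."""
    p, gamma = params["p"], params["gamma"]
    totals = [sum(city) for city in state]
    derivs = []
    for i, (S, I, A, R) in enumerate(state):
        new_inf = beta_t(params["beta"][i], t) * S * (I + A) / totals[i]
        # Net inflow of asymptomatic carriers; symptomatic cases never travel.
        m = sum(params["flow"][i][j] * (state[j][2] / totals[j] - A / totals[i])
                for j in range(len(state)) if j != i)
        derivs.append((-new_inf - m,                      # S (offsets m)
                       (1.0 - p) * new_inf - gamma * I,   # I
                       p * new_inf - gamma * A + m,       # A
                       gamma * (I + A)))                  # R
    return derivs

def simulate(state, params, days, dt=0.05):
    """Forward-Euler integration; returns the state after `days` days."""
    t = 0.0
    for _ in range(int(round(days / dt))):
        derivs = siar_rhs(t, state, params)
        state = [tuple(x + dt * dx for x, dx in zip(city, d))
                 for city, d in zip(state, derivs)]
        t += dt
    return state

params = {
    "p": 0.319, "gamma": 1.0 / 7.0,           # Table I value; 7-day recovery
    "beta": [0.157, 0.157, 0.155],            # beta_s, beta_s, beta_l (Table I)
    "flow": [[0, 50, 20], [50, 0, 30], [20, 30, 0]],  # hypothetical people/day
}
# Hypothetical populations and seeding: (S, I, A, R) per city.
initial = [(59990.0, 10.0, 0.0, 0.0),
           (80000.0, 0.0, 0.0, 0.0),
           (200000.0, 0.0, 0.0, 0.0)]
final = simulate(initial, params, days=365)
```

Because the assumed seasonal shape differs from the one fitted in the paper, the trajectories produced here are not comparable to Fig. 2. The sketch only shows how the parameters enter the dynamics and that each city's total population is preserved.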
Estimation of model parameters from case
count data. Using a least-squares-based estimator minimizing
the error between the case count data and model
predictions (see Materials and Methods for more details),
we estimated the transmission rates, seasonality param-
eters, initial conditions, and asymptomatic fraction p.
One of the main challenges faced by the fitting of our con-
tinuous meta-population SIAR model was the fact that
the case counts are not reported continuously in the data
set, but instead appear as discrete peaks at certain sam-
pling days. To address this reporting delay challenge,
we proposed a sampling model which establishes a cor-
respondence between the continuous model and the case
counts sampled at specific dates by looking at a cumula-
tive number of cases between subsequent sampling dates
(see Materials and Methods and Supplementary Materi-
als, section D).
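The correspondence between a continuous model and counts reported on irregular sampling days can be sketched as below. This is an illustrative reading of the sampling model, not the authors' code: it assumes the model output has been discretised into new symptomatic cases per day, and that each reported count accumulates the days since the previous sampling day, capped at a 14-day horizon. The function names and the plain sum-of-squares objective are assumptions.

```python
def sampled_counts(daily_new_cases, sampling_days, horizon=14):
    """Model counterpart of the reported counts.

    daily_new_cases[d] is the model's number of new symptomatic cases on day d
    (an assumed discretisation of the continuous model). The count for each
    sampling day sums the days since the previous sampling day, capped at the
    last `horizon` days.
    """
    counts, previous = [], -1
    for day in sampling_days:
        first = max(previous + 1, day - horizon + 1, 0)
        counts.append(sum(daily_new_cases[first:day + 1]))
        previous = day
    return counts

def squared_error(observed, modeled):
    """Sum of squared differences between reported and modeled counts."""
    return sum((o - m) ** 2 for o, m in zip(observed, modeled))

# One new case per day for 40 days, reported on days 5, 10 and 30.
daily = [1.0] * 40
modeled = sampled_counts(daily, [5, 10, 30])
```

In this toy input the third sampling day comes 20 days after the second, so its count is capped at the 14-day horizon. A fitting routine would minimise `squared_error` over the model parameters.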
TABLE I: Key model parameters inferred from the case count data, together with their single
standard-deviation uncertainty averaged over several families of statistical counting noise.

parameter   lower bound   inferred value   upper bound
p           0.041         0.319            0.597
β_s         0.144         0.157            0.170
β_l         0.148         0.155            0.162
The results of the inference procedure are presented
in Table I for the key model parameters p, β_s, and β_l (see Supplementary Materials, section
E). In the absence of ground truth, we use the following approach to estimate the uncertainty
of our inference procedure. We construct a synthetic model with planted ground-truth parameters
equal to the parameters inferred from data.
Then, by generating counting noise on the same sam-
pling dates as the ones that appear in the real data, we
can construct synthetic data sets which have the same
properties as the original case count data, but with the
advantage that these data sets now come with a planted
ground truth. We run our estimator on many instances of
synthetic data with different noise realizations and com-
pare the inference results with the planted parameters
of the synthetic model. This procedure allows us to re-
liably estimate the uncertainty bounds of our inference
procedure. Previously, a similar procedure for estimating
the uncertainty in the absence of ground truth has been
used in other applications involving statistical inference
[38, 39]. To check robustness with respect to the (un-
known) counting noise, we consider the estimation error
for noise generated from several families of probability
distributions, and report the standard deviation results
averaged over these families. The details of this proce-
dure are given in the Supplementary Materials, section
E.
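The planted-parameter procedure can be sketched on a toy problem. The example below is not the authors' pipeline: it replaces the SIAR model with a one-parameter model (counts equal to a scale factor times a fixed curve), uses a closed-form least-squares fit, and uses two noise families, Gaussian and uniform with matched variance, chosen only for illustration.

```python
import math
import random

def fit_scale(observed, shape):
    """Least-squares estimate of a in observed ≈ a * shape (closed form)."""
    return sum(y * m for y, m in zip(observed, shape)) / sum(m * m for m in shape)

def gaussian_noise(mean, rng):
    """Gaussian counting noise with variance equal to the mean."""
    return mean + rng.gauss(0.0, math.sqrt(max(mean, 1e-9)))

def uniform_noise(mean, rng):
    """Uniform counting noise with the same variance as the Gaussian family."""
    half_width = math.sqrt(3.0 * max(mean, 1e-9))
    return mean + rng.uniform(-half_width, half_width)

def planted_uncertainty(observed, shape, n_rep=200, seed=0):
    """Fit, plant the fit as ground truth, refit noisy copies, report spread."""
    rng = random.Random(seed)
    planted = fit_scale(observed, shape)       # parameters inferred from data
    clean = [planted * m for m in shape]       # synthetic model, same sampling
    spreads = []
    for noise in (gaussian_noise, uniform_noise):
        refits = [fit_scale([noise(c, rng) for c in clean], shape)
                  for _ in range(n_rep)]
        # Root-mean-square deviation of the refits from the planted value.
        spreads.append(math.sqrt(sum((r - planted) ** 2 for r in refits) / n_rep))
    return planted, sum(spreads) / len(spreads)   # averaged over noise families

shape = [1.0, 4.0, 9.0, 4.0, 1.0]
observed = [2.0, 8.0, 18.0, 8.0, 2.0]
planted, sigma = planted_uncertainty(observed, shape)
```

The returned `sigma` plays the role of the single standard-deviation uncertainty reported in Table I, with the planted value as the center of the interval.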
A comparison of the true case count data and the sam-
ples obtained from the model with the inferred parame-
ters is shown in Fig. 2. Despite some discrepancies ob-
served for the highest case count peaks, our simple model
is able to adequately describe the variations present in
the data, including the seasonal character of the cholera
outbreak in Argentina. The obtained seasonality of the
transmission rate highly correlates with seasonal temper-
ature variations in the analysed cities (see Supplementary
Materials, section F), even though such an information
was not explicitly implemented in the model and was not
provided to the inference algorithm. The model suggests
that cholera from this initial outbreak dies out after 1997,
which is consistent with a hypothesis that the nature of
the further peaks may have been significantly influenced
by external factors such as Hurricane Mitch in 1998 or El
Niño in 1997-1998 [40] (see Materials and Methods).
Our results are robust in several respects. We successfully tested our fitting procedure with
synthetic data generated with different types of noise (see Supplementary Materials, section
E). We have also studied the scenario of a non-centered counting noise distribution
corresponding to a significant under-reporting of the case counts. In the Supplementary
Materials, section G, we show that this scenario is unrealistic as it leads to an unreasonably
large number of predicted infected individuals, of the order of the whole city population.
Moreover, we tested the sensitivity of our procedure with respect to the misspecification of
the inferred migration rates (see Supplementary Materials, section H) and different sampling
horizons (see Supplementary Materials, section I). Finally, we investigated how much the length
of the case count time series affected the estimated parameters (see Supplementary Materials,
section J). We conclude that despite many sources of potential discrepancies, estimates are
robust to reasonable levels of model misspecification.

[Figure 2. Panels (a) Tartagal, (b) San Ramón de la Nueva Orán (Oran), (c) San Salvador de
Jujuy (Jujuy): number of cases versus date, 1992 to 1998, showing observed cases, model cases,
and I(t). Panel (d): sampling procedure example for sampling dates t1 and t2, marking the cases
included in sampling.]
FIG. 2: Comparison of the case count data (blue bars) and the samples obtained from the model
with the inferred parameters (orange bars) for three cities: (a) Tartagal, (b) Oran, and (c)
Jujuy. The red dashed line represents the number of active infected symptomatic cases according
to the predictions of our continuous meta-population SIAR model. In the real data set, the case
counts are not reported every day, but instead appear as discrete peaks at certain sampling
days, which complicates the fitting of the continuous model. For this reason, we propose a
correspondence between the continuous model and the case counts sampled at specific dates.
Panel (d) explains the sampling model we use in our fitting procedure. We assume that the case
counts reported on a given date correspond to the cumulative number of cases predicted by the
continuous model over a window that ends at the current sampling date and starts at the
previous sampling date or 14 days before the current sampling date, whichever is later. This
cut-off horizon is twice the expected recovery period (fixed to 7 days, as explained in the
Materials and Methods). A more detailed explanation of the sampling procedure and the study of
the impact of the choice of the sampling horizon is presented in the Supplementary Materials,
sections D and I, respectively.
Estimation of model parameters from genetic
sequence data. We used phylodynamic analysis methods to estimate the transmission rates and the
asymptomatic fraction parameter p from the genetic sequence
data (see Materials and Methods and Supplementary Ma-
terials, section K for a description of the data and the
details on the inference procedure). The inferred time-
scaled phylogenetic tree fixed to its maximum likelihood
topology is shown in Fig. 3. The tree topology suggests
an intermixed outbreak with evidence of multiple trans-
missions occurring between the three cities, which pro-