1 OPENDATASplitStrains a tool to identify and separate mixed Mycobacterium tuberculosis infections from WGSdata

2025-04-30 0 0 4.06MB 16 页 10玖币
侵权投诉
1
OPEN
DATA
SplitStrains, a tool to identify and separate mixed
Mycobacterium tuberculosis infections from WGSdata
EinarGabbassov1,2,*, MiguelMoreno- Molina3, IñakiComas3, MaxwellLibbrecht1 and LeonidChindelevitch4,*
RESEARCH ARTICLE
Gabbassov etal., Microbial Genomics 2021;7:000607
DOI 10.1099/mgen.0.000607
Received 16 February 2021; Accepted 10 May 2021; Published 24 June 2021
Author aliations: 1School of Computing Science, Simon Fraser University, Burnaby, BC, Canada; 2Department of Mathematics, Simon Fraser University,
Burnaby, BC, Canada; 3Instituto de Biomedicina de Valencia, Valencia, Spain; 4MRC Centre for Global Infectious Disease Analysis, School of Public Health,
Imperial College, London, UK.
*Correspondence: Leonid Chindelevitch, l. chindelevitch@ imperial. ac. uk; Einar Gabbassov, egabbass@ sfu. ca
Keywords: hetero- resistance; public health microbiology; maximum likelihood; mixed infection; Mycobacterium tuberculosis; multiple- strain infection.
Abbreviations: AUC, area under curve; BAM, binary sequence alignment map; FPR, false positive rate; GFF, generic feature format; GMM, gaussian
mixture model; NCBI, National Center for Biotechnology Information; RMSE, root mean square error; ROC, receiver operating characteristic; SNP,
single- nucleotide polymorphism; TPR, true positive rate; VCF, variant call format; VNTR, variable- number tandem repeats; WGS, whole genome
sequencing.
Repository for the source code for SplitStrains: https:// github. com/ WGS- TB/ SplitStrains.
Data statement: All supporting data, code and protocols have been provided within the article or through supplementary data files. One supplementary
table is available with the online version of this article.
000607 © 2021 The Authors
This is an open- access article distributed under the terms of the Creative Commons Attribution License. This article was made open access via a Publish and Read agreement between
the Microbiology Society and the corresponding author’s institution.
Abstract
The occurrence of multiple strains of a bacterial pathogen such as M. tuberculosis or C. dicile within a single human host,
referred to as a mixed infection, has important implications for both healthcare and public health. However, methods for detect-
ing it, and especially determining the proportion and identities of the underlying strains, from WGS (whole- genome sequencing)
data, have been limited. In this paper we introduce SplitStrains, a novel method for addressing these challenges. Grounded
in a rigorous statistical model, SplitStrains not only demonstrates superior performance in proportion estimation to other
existing methods on both simulated as well as real M. tuberculosis data, but also successfully determines the identity of the
underlying strains. We conclude that SplitStrains is a powerful addition to the existing toolkit of analytical methods for
data coming from bacterial pathogens and holds the promise of enabling previously inaccessible conclusions to be drawn in
the realm of public health microbiology.
DATA SUMMARY
e authors conrm all supporting data, code and protocols
have been provided within the article or through supplemen-
tary data les.
Supplementary data les can be found at 10.6084/
m9.gshare.14562321.
INTRODUCTION
Bacterial infections by pathogens such as Mycobacterium
tuberculosis and Clostridium dicile oen occur as mixed
infections [1, 2], whereby a single patient is infected by
several dierent strains of the same organism. Eukaryotic
pathogens such as the main etiological agent of malaria,
Plasmodium falciparum, can also cause mixed infections [3].
e identication of such mixed infections can be impor-
tant for reasons including both patient- level decisions [4]
as well as public health measures [5]. In the latter setting, if
the tracing of the origins of the mixed infection is needed,
it may be additionally required to separate the mixed infec-
tion into its constituent strains. e separation may also be
informative when the mixed infection is hetero- resistant [6],
namely, when some, but not all, the strains are resistant to a
particular antimicrobial drug. Moreover, a failure to identify
the within- host pathogen diversity can lead to misdiagnosing
a relapse and reinfection [7]. However, so far, the problem of
identifying mixed infections and separating them into their
constituent strains has not received a sucient amount of
attention in the literature.
Although older techniques based on the detection of specic
regions, such as VNTR (variable- number tandem repeats)
[8], are oen able to detect such a mixed infection [9], this
is not always the case with next- generation sequencing. e
main challenge is that the presence of two alternative alleles
in a given genomic position may signal a sequencing error as
well as the presence of multiple strains. e key distinguishing
feature of a mixed infection is the consistency of the fraction
of the sample attributable to the sub- dominant strain across
OPEN
ACCESS
2
Gabbassov etal., Microbial Genomics 2021;7:000607
most of the variable positions. us, depending on the depth
of coverage, the similarity between the constituent strains
and the proportions in which they are mixed, the problem
of detecting and separating mixed strains may vary from
straightforward to nearly infeasible.
Several methods for this problem have appeared over the
past decade. Eyre et al. [2] propose a Mixed Infection
Estimator, a two- step approach for mixture proportion
estimation using a maximum likelihood analysis and mixed
strain identication using a custom database. Even though the
paper presents results for C. dicile, the mixture estimation
algorithm can be generalized to other pathogens such as M.
tuberculosis. is method computes a deviance statistic and
uses a threshold value for this statistic to detect mixed infec-
tions. As this algorithm was initially designed for C. dicile
and relies on a custom database of sequences to identify the
constituent strains, it could only be used for mixture propor-
tion estimation in our context. More recently, Sobkowiak et al.
[10] developed MixInfect, a method for mixture propor-
tion estimation using a Bayesian model- based clustering
technique. is method calculates the ratio of heterozygous
calls to total SNPs (single nucleotide polymorphisms) and
uses a threshold on this ratio to identify mixed samples. While
this algorithm can estimate mixture proportions it does not
provide any functionality for resolving the constituent strains.
e most recent method, QuantTB by Anyansi et al. [11],
relies on a specially constructed publicly available database
of 2166 M. tuberculosis assemblies from NCBI [12]. is
method provides mixture estimates of WGS samples as well
as the identication of strains whose sequence is similar to the
ones included in the database. To determine the constituent
strains, this method compares the sample to the sequences
in the reference database, scoring each of the assemblies. e
algorithm then determines how many constituent strains are
present in a sample. is approach does not generalize to
situations where the underlying strains lack close representa-
tives in the database, which makes its performance highly
dependent on the databases representation of the common
strains in the relevant local context.
In this paper, we address this problem with a tool called
SplitStrains, grounded in a rigorous statistical
framework. It is based on formulating, for a given set of
WGS reads, two alternative hypotheses, namely: the reads
belong to a single strain (null hypothesis) or to a mixture
of two strains (alternative hypothesis). We then use the EM
(Expectation- Maximization) algorithm [13] to estimate the
parameters of both hypotheses, and compare their likeli-
hoods to draw a conclusion. As a result, we simultaneously
obtain:
•
A call to decide whether the sample represents a single
(pure) or a mixed infection,
•
A likelihood ratio between the alternative and the null
hypothesis for the call, and,
• If mixed, the proportion of each constituent strain and a
Binary Sequence Alignment Map (BAM) le grouping the
reads belonging to each constituent strain.
Our results on both simulated and real M. tuberculosis data
show that SplitStrains is eective at identifying mixed
infections and continues to perform well even at a relatively
low depth of coverage (60×) and low genetic distance (20
SNPs) between strains. Moreover, SplitStrains outper-
forms previously published tools Mixed infection
estimator, MixInfect and QuantTB on simulated
data. Furthermore, our results show that SplitStrains
accurately separates the constituent strains provided that their
proportions are not too close to each other and they are not
too similar. SplitStrains is available on GitHub: https://
github. com/ WGS- TB/ SplitStrains.
METHODS
is part of the paper is organized as follows. First, we
briey describe the datasets used in our analysis. Second,
we explain the construction of the feature vector used in
our probabilistic model and show how to use it to classify
an isolate. ird, we dene the Naïve Bayes Classier for
the assignment of reads to strains. Lastly, we show how this
approach can be generalized to three or more strains.
We begin by describing the datasets used in our analysis. We
report the average number of SNPs relative to the reference
genome in the Results section. Here we additionally report
the average number of heterogeneous SNPs, dened by a
0/1 in the GT eld of the VCF le produced by aligning the
sample to the reference genome. We note that the number of
heterogeneous SNPs depends on the alignment and variant-
calling steps of the pipeline. erefore, for the in silico
datasets, this number may be lower than the total number
of SNPs added to the reference genome when generating
the sample. We report the per- sample statistics in Table S1
(available in the online version of this article).
Dataset A, in vitro. e 48 mixed M. tuberculosis samples
presented in [10] are articially generated in vitro by
combining the DNA from two clinical cultures of M.
Impact Statement
When multiple strains of a pathogenic organism are
present in a patient, it may be necessary to not only detect
this, but also to identify the individual strains. However,
this problem has not yet been solved for bacterial patho-
gens processed via whole- genome sequencing. In this
paper, we propose the SplitStrains algorithm for
detecting multiple strains in a sample, identifying their
proportions, and inferring their sequences, in the case of
Mycobacterium tuberculosis. We test it on both simulated
and real data, with encouraging results. We believe that
our work opens new horizons in public health microbi-
ology by allowing a more precise detection, identification
and quantification of multiple infecting strains within a
sample.
3
Gabbassov etal., Microbial Genomics 2021;7:000607
tuberculosis. e DNA is quantied through spectrophotom-
etry in liquid culture and combined to produce four sets of 12
samples with major strain proportions of 70, 90, 95 % (mixed)
and 100 % (pure). e average number of heterogeneous SNPs
is 327.
Dataset B, in silico. e 60 articial samples presented in
[14] are generated from the standard reference genome for
M. tuberculosis by substituting randomly chosen alleles
at each of 553 genes in an essential core genome MLST
scheme (ecgMLST), created by intersecting the set of core
genes in an existing scheme with the set of 615 essential
M. tuberculosis genes [15]. e full dataset contains three
pure genomes, with ten samples generated from each one by
varying the depth of coverage from 10 to 100 in increments
of ten, and three mixed genomes obtained by mixing an
additional three pure genomes in pairs, with ten samples
generated from each by varying the major strain proportion
from 50% to 95 % in 5 % increments. e average number
of heterogeneous SNPs is 2843.
Dataset C, in silico. For this dataset, generated specically
for this paper, the constituent strains are produced from the
H37Rv reference genome. e WGS data is produced by the
ART simulator [16] with the following settings:
(1) Prole: HiSeqX PCR free
(2) Read length: 150
(3)
Per base sequence quality scores: 20 to 30 on the Sanger/
Illumina 1.9 scale.
(4)
Quality shi: in order to match the quality of the real
data, we shied the quality scores down by nine to pro-
duce relatively uncertain sequences with high sequenc-
ing errors.
(5) Depth of coverage: 100 for single- strain and two- strain
samples and 150 for three- strain samples.
e dataset consists of eight two- strain mixtures, six three-
strain mixtures, and eight single strains. e two- strain
mixtures have major strain proportions varying from 50%
to 95 % in 5 % increments, with 55% and 60 % omitted. e
six three- strain samples have proportions 10:25:65, 15:30:55,
20:35:45, 25:40:35, 30:45:25, and 35:50:15. e average number
of heterogeneous SNPs is 136 for the two- strain samples and
583 for the three- strain samples. e simulated reads were
aligned back to the reference genome with BWA- MEM [17].
Dataset E, in silico. For this dataset, generated specically
for this paper, the constituent strains are produced from
the same H37Rv reference genome with N {10, 15, 20,
25} random base substitutions. is yields four subsets
that contain single and two- strain samples. e rst subset
contains eight single strain and eight two- strain samples
with proportions varying from 50% to 95 % in 5 % incre-
ments, with 55% and 60 % omitted. All the samples in the
rst subset have ten SNPs relative to H37Rv, and since these
SNPs are chosen independently at random, the two- strain
samples are 20 SNPs apart. e remaining sample subsets
have the same proportions, but more SNPs. e genetic
distances between mixed strains in each of the subsets are
thus 20, 30, 40 and 50 SNPs, respectively. e WGS data
is produced by the ART simulator with the same settings
used to generate Dataset C except for the depth of coverage,
which is set to 60 for all the samples.
Required input data
e SplitStrains pipeline uses the BWA- MEM tool,
which makes use of paired- end information to produce
the alignment. e current pipeline only keeps those pairs
for which both elements have been mapped, in order to
reduce the possibility of errors. Hence, SplitStrains
only requires a BAM le of the paired- end data, a reference
genome, and optionally, a generic feature format (GFF) le.
We say that a sample represented in the BAM le contains
a single strain if it is pure, a pair of strains called major
strain and minor strain if it is a mixture of two strains,
or multiple strains in the case of a mixture of more than
two strains.
Data pre-proccessing
All datasets are pre- processed with Trimmomatic [18]. e
Trimmomatic settings are:
ILLUMINACLIP=TruSeq3- PE-2,
SLIDINGWINDOW=4 with trimming threshold=16,
Leading=10,
Trailing=10,
Minlen=40.
Feature vector construction
e BAM le contains the alignment information for each
individual read of a sequenced organism. We rst convert the
BAM le into a pileup format using the pysam library [19]. is
pileup format summarises the alignment information for each
individual base of the reference genome. We then construct a
per- base feature vector dened as follows:
xi:= (p(i)
A,p(i)
C,p(i)
G,p(i)
T;d(i)), (1)
where
p(i)
b[0, 100]
for
b{A,C,G,T}
is the percentage of
base
b
at position i, so that ∑b {A,C,G,T}
p(i)
b= 100,
and d(i) is the
total depth (number of aligned reads) at this position. Note
that if
for more than one b, there could be a SNP at
position i. In practice, at most two of the
p(i)
b
s are non- zero
most of the time. In the absence of sequencing errors, we
expect exactly one of the
p(i)
b
s to equal 100 for every in the case
of a single strain. On the other hand, in the case of a mixed
sample that has a major strain at a proportion
p1
2
and a
minor strain at a proportion
1p
, we expect to see
p(i)
bi=p
and
p(i)
b
i
=1p
, where
bi̸=b
i
, for suciently many positions i.
Fig.1 shows an example with two adjacent positions, i and
j
. ere are
n=8
reads supporting position i, six of them
containing an
A
and the remaining two containing a
T
. e
depth is
d(i)=8
and the percentages of bases A and T at
are
p(i)
A= 75
and
p(i)
T= 25
, respectively. e adjacent position
4
Gabbassov etal., Microbial Genomics 2021;7:000607
Fig. 1. An example read alignment and the corresponding feature vector.
j
has the same depth and a
G
in one of the reads, with the
remaining reads containing a
C
. ey are summarized by the
feature vectors
xi= (75, 0, 0, 25; 8)
and
xj= (0, 87.5, 12.5, 0; 8)
,
respectively.
We empirically observed that sites with a relatively low depth of
coverage contained reads that could not be reliably aligned by
BWA- MEM (i.e. they had poor alignment quality scores). For
this reason, we chose to lter out any site
xi
with depth coverage
d(i)
below
k= 70%
of the mean depth of coverage. Additionally,
we use a GFF le based on the M. tuberculosis reference genome
(NC 000962.3), but with the mobile and PE/PPE genes removed.
is user- customizable GFF le ensures that SplitStrains
analyses only annotated gene regions excluding mobile and PE/
PPE genes, as the latter are known to be highly repetitive and
produce unreliable alignment and variant calling results [20].
Detecting mixed samples
We test two hypotheses: the null hypothesis
(H0)
, which states
that there is a single strain in the data, and an alternative
hypothesis
(H1)
, which states that there are two strains with
proportions
p
and
1p
, respectively. Our data
D
consists of
all the feature vectors
xi
described above.
e likelihood of the data D under
H0
is based on the fact
that, for every position i, the feature vector
xi
can only have a
non- zero percentage at one base
b
, that is
p(i)
b= 100
. at is,
under the null hypothesis, the maximum likelihood estimator
of the underlying base given a single strain would always be the
most frequently observed nucleotide. However, we also allow
sequencing errors to occur with probability
ϵ0
. erefore, using
the notation
di
for the number of reads that map to position i,
and
ki
for the number of those reads that end up with the most
frequent base, we have
P(D|H0)=
i
di
ki
ϵdiki
0(1 3ϵ0)ki, with ki:= di·max
bp(i)
b. (2)
where the last term in the product arises from the fact that
a sequencing error can occur in three dierent ways, so the
probability of getting the correct base is
13ϵ0
.
Under
H1
we assume that there are two strains with proportions
p
and
1p
, and sequencing errors which occur with prob-
ability
ϵ1
. We now use
n(i)
M
:= d
i
·max
bp(i)
b
and
n(i)
m:= di·max
b
̸=b
p
(i)
b
to denote the counts of the most and second most frequent
bases at position i, and
n(i)
e:= di(n(i)
M+n(i)
m)
to denote the
count of the remaining bases.
Repeating the above analysis gives us the following expression:
P(D|H1)=
i
d(i)
n(i)
M,n(i)
m,n(i)
e
(p(1 3ϵ1) + (1 p)ϵ1)n(i)
M((1 p
)
(1 3
ϵ
1)+p
ϵ
1)
n(i)
mϵn(i)
e
1, (3)
where
d(i)
n(i)
M
,n(i)
m,n(i)
e
:=
d(i)
n(i)
M
d(i)n(i)
M
n(i)
m
(4)
is the trinomial coecient, which generalizes the binomial
coecient.
Using the likelihood ratio test, we can now formulate the
following condition:
if 2(log(P(D|H0)) log(P(D|H1)))<c, then select H0, otherwise H1. (5)
e threshold value
c
is dened based on the signicance
level
α
using the
χ2
distribution with one degree of freedom.
Finally, in order to evaluate Equation (5) we estimate the
parameters
ϵ0
for
H0
and
p,ϵ1
for
H1
using the well- established
Truncated Newton constrained optimization algorithm [21].
In both cases we estimate a set of parameters that maximize
the likelihood of the data conditional on the hypothesis, i.e.
摘要:

1SplitStrains,atooltoidentifyandseparatemixedMycobacteriumtuberculosisinfectionsfromWGS dataEinar Gabbassov1,2,*,Miguel Moreno-Molina3,Iñaki Comas3,Maxwell Libbrecht1andLeonid Chindelevitch4,*RESEARCHARTICLEGabbassovet al.,MicrobialGenomics2021;7:000607DOI10.1099/mgen.0.000607Received16February2021;...

展开>> 收起<<
1 OPENDATASplitStrains a tool to identify and separate mixed Mycobacterium tuberculosis infections from WGSdata.pdf

共16页,预览4页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:16 页 大小:4.06MB 格式:PDF 时间:2025-04-30

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 16
客服
关注