1 OPENDATASplitStrains a tool to identify and separate mixed Mycobacterium tuberculosis infections from WGSdata

2025-04-30 0 0 4.06MB 16 页 10玖币

侵权投诉

OPEN

DATA

SplitStrains, a tool to identify and separate mixed

Mycobacterium tuberculosis infections from WGSdata

EinarGabbassov1,2,*, MiguelMoreno- Molina3, IñakiComas3, MaxwellLibbrecht1 and LeonidChindelevitch4,*

RESEARCH ARTICLE

Gabbassov etal., Microbial Genomics 2021;7:000607

DOI 10.1099/mgen.0.000607

Received 16 February 2021; Accepted 10 May 2021; Published 24 June 2021

Author aliations: 1School of Computing Science, Simon Fraser University, Burnaby, BC, Canada; 2Department of Mathematics, Simon Fraser University,

Burnaby, BC, Canada; 3Instituto de Biomedicina de Valencia, Valencia, Spain; 4MRC Centre for Global Infectious Disease Analysis, School of Public Health,

Imperial College, London, UK.

*Correspondence: Leonid Chindelevitch, l. chindelevitch@ imperial. ac. uk; Einar Gabbassov, egabbass@ sfu. ca

Keywords: hetero- resistance; public health microbiology; maximum likelihood; mixed infection; Mycobacterium tuberculosis; multiple- strain infection.

Abbreviations: AUC, area under curve; BAM, binary sequence alignment map; FPR, false positive rate; GFF, generic feature format; GMM, gaussian

mixture model; NCBI, National Center for Biotechnology Information; RMSE, root mean square error; ROC, receiver operating characteristic; SNP,

single- nucleotide polymorphism; TPR, true positive rate; VCF, variant call format; VNTR, variable- number tandem repeats; WGS, whole genome

sequencing.

Repository for the source code for SplitStrains: https:// github. com/ WGS- TB/ SplitStrains.

Data statement: All supporting data, code and protocols have been provided within the article or through supplementary data ﬁles. One supplementary

table is available with the online version of this article.

This is an open- access article distributed under the terms of the Creative Commons Attribution License. This article was made open access via a Publish and Read agreement between

the Microbiology Society and the corresponding author’s institution.

Abstract

The occurrence of multiple strains of a bacterial pathogen such as M. tuberculosis or C. dicile within a single human host,

referred to as a mixed infection, has important implications for both healthcare and public health. However, methods for detect-

ing it, and especially determining the proportion and identities of the underlying strains, from WGS (whole- genome sequencing)

data, have been limited. In this paper we introduce SplitStrains, a novel method for addressing these challenges. Grounded

in a rigorous statistical model, SplitStrains not only demonstrates superior performance in proportion estimation to other

existing methods on both simulated as well as real M. tuberculosis data, but also successfully determines the identity of the

underlying strains. We conclude that SplitStrains is a powerful addition to the existing toolkit of analytical methods for

data coming from bacterial pathogens and holds the promise of enabling previously inaccessible conclusions to be drawn in

the realm of public health microbiology.

DATA SUMMARY

e authors conrm all supporting data, code and protocols

have been provided within the article or through supplemen-

tary data les.

Supplementary data les can be found at 10.6084/

m9.gshare.14562321.

INTRODUCTION

Bacterial infections by pathogens such as Mycobacterium

tuberculosis and Clostridium dicile oen occur as mixed

infections [1, 2], whereby a single patient is infected by

several dierent strains of the same organism. Eukaryotic

pathogens such as the main etiological agent of malaria,

Plasmodium falciparum, can also cause mixed infections [3].

e identication of such mixed infections can be impor-

tant for reasons including both patient- level decisions [4]

as well as public health measures [5]. In the latter setting, if

the tracing of the origins of the mixed infection is needed,

it may be additionally required to separate the mixed infec-

tion into its constituent strains. e separation may also be

informative when the mixed infection is hetero- resistant [6],

namely, when some, but not all, the strains are resistant to a

particular antimicrobial drug. Moreover, a failure to identify

the within- host pathogen diversity can lead to misdiagnosing

a relapse and reinfection [7]. However, so far, the problem of

identifying mixed infections and separating them into their

constituent strains has not received a sucient amount of

attention in the literature.

Although older techniques based on the detection of specic

regions, such as VNTR (variable- number tandem repeats)

[8], are oen able to detect such a mixed infection [9], this

is not always the case with next- generation sequencing. e

main challenge is that the presence of two alternative alleles

in a given genomic position may signal a sequencing error as

well as the presence of multiple strains. e key distinguishing

feature of a mixed infection is the consistency of the fraction

of the sample attributable to the sub- dominant strain across

OPEN

ACCESS

Gabbassov etal., Microbial Genomics 2021;7:000607

most of the variable positions. us, depending on the depth

of coverage, the similarity between the constituent strains

and the proportions in which they are mixed, the problem

of detecting and separating mixed strains may vary from

straightforward to nearly infeasible.

Several methods for this problem have appeared over the

past decade. Eyre et al. [2] propose a Mixed Infection

Estimator, a two- step approach for mixture proportion

estimation using a maximum likelihood analysis and mixed

strain identication using a custom database. Even though the

paper presents results for C. dicile, the mixture estimation

algorithm can be generalized to other pathogens such as M.

tuberculosis. is method computes a deviance statistic and

uses a threshold value for this statistic to detect mixed infec-

tions. As this algorithm was initially designed for C. dicile

and relies on a custom database of sequences to identify the

constituent strains, it could only be used for mixture propor-

tion estimation in our context. More recently, Sobkowiak et al.

[10] developed MixInfect, a method for mixture propor-

tion estimation using a Bayesian model- based clustering

technique. is method calculates the ratio of heterozygous

calls to total SNPs (single nucleotide polymorphisms) and

uses a threshold on this ratio to identify mixed samples. While

this algorithm can estimate mixture proportions it does not

provide any functionality for resolving the constituent strains.

e most recent method, QuantTB by Anyansi et al. [11],

relies on a specially constructed publicly available database

of 2166 M. tuberculosis assemblies from NCBI [12]. is

method provides mixture estimates of WGS samples as well

as the identication of strains whose sequence is similar to the

ones included in the database. To determine the constituent

strains, this method compares the sample to the sequences

in the reference database, scoring each of the assemblies. e

algorithm then determines how many constituent strains are

present in a sample. is approach does not generalize to

situations where the underlying strains lack close representa-

tives in the database, which makes its performance highly

dependent on the database’s representation of the common

strains in the relevant local context.

In this paper, we address this problem with a tool called

SplitStrains, grounded in a rigorous statistical

framework. It is based on formulating, for a given set of

WGS reads, two alternative hypotheses, namely: the reads

belong to a single strain (null hypothesis) or to a mixture

of two strains (alternative hypothesis). We then use the EM

(Expectation- Maximization) algorithm [13] to estimate the

parameters of both hypotheses, and compare their likeli-

hoods to draw a conclusion. As a result, we simultaneously

obtain:

•

A call to decide whether the sample represents a single

(pure) or a mixed infection,

•

A likelihood ratio between the alternative and the null

hypothesis for the call, and,

• If mixed, the proportion of each constituent strain and a

Binary Sequence Alignment Map (BAM) le grouping the

reads belonging to each constituent strain.

Our results on both simulated and real M. tuberculosis data

show that SplitStrains is eective at identifying mixed

infections and continues to perform well even at a relatively

low depth of coverage (60×) and low genetic distance (20

SNPs) between strains. Moreover, SplitStrains outper-

forms previously published tools Mixed infection

estimator, MixInfect and QuantTB on simulated

data. Furthermore, our results show that SplitStrains

accurately separates the constituent strains provided that their

proportions are not too close to each other and they are not

too similar. SplitStrains is available on GitHub: https://

github. com/ WGS- TB/ SplitStrains.

METHODS

is part of the paper is organized as follows. First, we

briey describe the datasets used in our analysis. Second,

we explain the construction of the feature vector used in

our probabilistic model and show how to use it to classify

an isolate. ird, we dene the Naïve Bayes Classier for

the assignment of reads to strains. Lastly, we show how this

approach can be generalized to three or more strains.

We begin by describing the datasets used in our analysis. We

report the average number of SNPs relative to the reference

genome in the Results section. Here we additionally report

the average number of heterogeneous SNPs, dened by a

0/1 in the GT eld of the VCF le produced by aligning the

sample to the reference genome. We note that the number of

heterogeneous SNPs depends on the alignment and variant-

calling steps of the pipeline. erefore, for the in silico

datasets, this number may be lower than the total number

of SNPs added to the reference genome when generating

the sample. We report the per- sample statistics in Table S1

(available in the online version of this article).

Dataset A, in vitro. e 48 mixed M. tuberculosis samples

presented in [10] are articially generated in vitro by

combining the DNA from two clinical cultures of M.

Impact Statement

When multiple strains of a pathogenic organism are

present in a patient, it may be necessary to not only detect

this, but also to identify the individual strains. However,

this problem has not yet been solved for bacterial patho-

gens processed via whole- genome sequencing. In this

paper, we propose the SplitStrains algorithm for

detecting multiple strains in a sample, identifying their

proportions, and inferring their sequences, in the case of

Mycobacterium tuberculosis. We test it on both simulated

and real data, with encouraging results. We believe that

our work opens new horizons in public health microbi-

ology by allowing a more precise detection, identiﬁcation

and quantiﬁcation of multiple infecting strains within a

sample.

Gabbassov etal., Microbial Genomics 2021;7:000607

tuberculosis. e DNA is quantied through spectrophotom-

etry in liquid culture and combined to produce four sets of 12

samples with major strain proportions of 70, 90, 95 % (mixed)

and 100 % (pure). e average number of heterogeneous SNPs

is 327.

Dataset B, in silico. e 60 articial samples presented in

[14] are generated from the standard reference genome for

M. tuberculosis by substituting randomly chosen alleles

at each of 553 genes in an essential core genome MLST

scheme (ecgMLST), created by intersecting the set of core

genes in an existing scheme with the set of 615 essential

M. tuberculosis genes [15]. e full dataset contains three

pure genomes, with ten samples generated from each one by

varying the depth of coverage from 10 to 100 in increments

of ten, and three mixed genomes obtained by mixing an

additional three pure genomes in pairs, with ten samples

generated from each by varying the major strain proportion

from 50% to 95 % in 5 % increments. e average number

of heterogeneous SNPs is 2843.

Dataset C, in silico. For this dataset, generated specically

for this paper, the constituent strains are produced from the

H37Rv reference genome. e WGS data is produced by the

ART simulator [16] with the following settings:

(1) Prole: HiSeqX PCR free

(2) Read length: 150

(3)

Per base sequence quality scores: 20 to 30 on the Sanger/

Illumina 1.9 scale.

(4)

Quality shi: in order to match the quality of the real

data, we shied the quality scores down by nine to pro-

duce relatively uncertain sequences with high sequenc-

ing errors.

(5) Depth of coverage: 100 for single- strain and two- strain

samples and 150 for three- strain samples.

e dataset consists of eight two- strain mixtures, six three-

strain mixtures, and eight single strains. e two- strain

mixtures have major strain proportions varying from 50%

to 95 % in 5 % increments, with 55% and 60 % omitted. e

six three- strain samples have proportions 10:25:65, 15:30:55,

20:35:45, 25:40:35, 30:45:25, and 35:50:15. e average number

of heterogeneous SNPs is 136 for the two- strain samples and

583 for the three- strain samples. e simulated reads were

aligned back to the reference genome with BWA- MEM [17].

Dataset E, in silico. For this dataset, generated specically

for this paper, the constituent strains are produced from

the same H37Rv reference genome with N ∈{10, 15, 20,

25} random base substitutions. is yields four subsets

that contain single and two- strain samples. e rst subset

contains eight single strain and eight two- strain samples

with proportions varying from 50% to 95 % in 5 % incre-

ments, with 55% and 60 % omitted. All the samples in the

rst subset have ten SNPs relative to H37Rv, and since these

SNPs are chosen independently at random, the two- strain

samples are 20 SNPs apart. e remaining sample subsets

have the same proportions, but more SNPs. e genetic

distances between mixed strains in each of the subsets are

thus 20, 30, 40 and 50 SNPs, respectively. e WGS data

is produced by the ART simulator with the same settings

used to generate Dataset C except for the depth of coverage,

which is set to 60 for all the samples.

Required input data

e SplitStrains pipeline uses the BWA- MEM tool,

which makes use of paired- end information to produce

the alignment. e current pipeline only keeps those pairs

for which both elements have been mapped, in order to

reduce the possibility of errors. Hence, SplitStrains

only requires a BAM le of the paired- end data, a reference

genome, and optionally, a generic feature format (GFF) le.

We say that a sample represented in the BAM le contains

a single strain if it is pure, a pair of strains called major

strain and minor strain if it is a mixture of two strains,

or multiple strains in the case of a mixture of more than

two strains.

Data pre-proccessing

All datasets are pre- processed with Trimmomatic [18]. e

Trimmomatic settings are:

ILLUMINACLIP=TruSeq3- PE-2,

SLIDINGWINDOW=4 with trimming threshold=16,

Leading=10,

Trailing=10,

Minlen=40.

Feature vector construction

e BAM le contains the alignment information for each

individual read of a sequenced organism. We rst convert the

BAM le into a pileup format using the pysam library [19]. is

pileup format summarises the alignment information for each

individual base of the reference genome. We then construct a

per- base feature vector dened as follows:

xi:= (p(i)

A,p(i)

C,p(i)

G,p(i)

T;d(i)), (1)

where

p(i)

b∈[0, 100]

for

b∈{A,C,G,T}

is the percentage of

base

at position i, so that ∑b ∈ {A,C,G,T}

p(i)

b= 100,

and d(i) is the

total depth (number of aligned reads) at this position. Note

that if

p(i)

b>0

for more than one b, there could be a SNP at

position i. In practice, at most two of the

p(i)

’s are non- zero

most of the time. In the absence of sequencing errors, we

expect exactly one of the

p(i)

’s to equal 100 for every in the case

of a single strain. On the other hand, in the case of a mixed

sample that has a major strain at a proportion

p≥1

and a

minor strain at a proportion

1−p

, we expect to see

p(i)

bi=p

and

p(i)

′

=1−p

, where

bi̸=b′

, for suciently many positions i.

Fig.1 shows an example with two adjacent positions, i and

. ere are

n=8

reads supporting position i, six of them

containing an

and the remaining two containing a

. e

depth is

d(i)=8

and the percentages of bases A and T at

are

p(i)

A= 75

and

p(i)

T= 25

, respectively. e adjacent position

Gabbassov etal., Microbial Genomics 2021;7:000607

Fig. 1. An example read alignment and the corresponding feature vector.

has the same depth and a

in one of the reads, with the

remaining reads containing a

. ey are summarized by the

feature vectors

xi= (75, 0, 0, 25; 8)

and

xj= (0, 87.5, 12.5, 0; 8)

respectively.

We empirically observed that sites with a relatively low depth of

coverage contained reads that could not be reliably aligned by

BWA- MEM (i.e. they had poor alignment quality scores). For

this reason, we chose to lter out any site

with depth coverage

d(i)

below

k= 70%

of the mean depth of coverage. Additionally,

we use a GFF le based on the M. tuberculosis reference genome

(NC 000962.3), but with the mobile and PE/PPE genes removed.

is user- customizable GFF le ensures that SplitStrains

analyses only annotated gene regions excluding mobile and PE/

PPE genes, as the latter are known to be highly repetitive and

produce unreliable alignment and variant calling results [20].

Detecting mixed samples

We test two hypotheses: the null hypothesis

(H0)

, which states

that there is a single strain in the data, and an alternative

hypothesis

(H1)

, which states that there are two strains with

proportions

and

1−p

, respectively. Our data

consists of

all the feature vectors

described above.

e likelihood of the data D under

is based on the fact

that, for every position i, the feature vector

can only have a

non- zero percentage at one base

, that is

p(i)

b= 100

. at is,

under the null hypothesis, the maximum likelihood estimator

of the underlying base given a single strain would always be the

most frequently observed nucleotide. However, we also allow

sequencing errors to occur with probability

ϵ0

. erefore, using

the notation

for the number of reads that map to position i,

and

for the number of those reads that end up with the most

frequent base, we have

P(D|H0)=

∏









ϵdi−ki

0(1 −3ϵ0)ki, with ki:= di·max

bp(i)

b. (2)

where the last term in the product arises from the fact that

a sequencing error can occur in three dierent ways, so the

probability of getting the correct base is

1−3ϵ0

Under

we assume that there are two strains with proportions

and

1−p

, and sequencing errors which occur with prob-

ability

ϵ1

. We now use

n(i)

:= d

·max

bp(i)

and

n(i)

m:= di·max

′

̸=b

(i)

b′

to denote the counts of the most and second most frequent

bases at position i, and

n(i)

e:= di−(n(i)

M+n(i)

to denote the

count of the remaining bases.

Repeating the above analysis gives us the following expression:

P(D|H1)=∏





d(i)

n(i)

M,n(i)

m,n(i)



(p(1 −3ϵ1) + (1 −p)ϵ1)n(i)

M((1 −p

)

(1 −3

1)+p

n(i)

mϵn(i)

1, (3)

where





d(i)

n(i)

,n(i)

m,n(i)









d(i)

n(i)







d(i)−n(i)

n(i)



(4)

is the trinomial coecient, which generalizes the binomial

coecient.

Using the likelihood ratio test, we can now formulate the

following condition:

if −2(log(P(D|H0)) −log(P(D|H1)))<c, then select H0, otherwise H1. (5)

e threshold value

is dened based on the signicance

level

using the

χ2

distribution with one degree of freedom.

Finally, in order to evaluate Equation (5) we estimate the

parameters

ϵ0

for

and

p,ϵ1

for

using the well- established

Truncated Newton constrained optimization algorithm [21].

In both cases we estimate a set of parameters that maximize

the likelihood of the data conditional on the hypothesis, i.e.

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

1SplitStrains,atooltoidentifyandseparatemixedMycobacteriumtuberculosisinfectionsfromWGS dataEinar Gabbassov1,2,*,Miguel Moreno-Molina3,Iñaki Comas3,Maxwell Libbrecht1andLeonid Chindelevitch4,*RESEARCHARTICLEGabbassovet al.,MicrobialGenomics2021;7:000607DOI10.1099/mgen.0.000607Received16February2021;...

展开>> 收起<<

1 OPENDATASplitStrains a tool to identify and separate mixed Mycobacterium tuberculosis infections from WGSdata.pdf

共16页,预览4页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

1 OPENDATASplitStrains a tool to identify and separate mixed Mycobacterium tuberculosis infections from WGSdata

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: