
Gabbassov et al., Microbial Genomics 2021;7:000607
most of the variable positions. Thus, depending on the depth
of coverage, the similarity between the constituent strains
and the proportions in which they are mixed, the problem
of detecting and separating mixed strains may vary from
straightforward to nearly infeasible.
Several methods for this problem have appeared over the
past decade. Eyre et al. [2] propose a Mixed Infection
Estimator, a two-step approach for mixture proportion
estimation using a maximum likelihood analysis and mixed
strain identification using a custom database. Even though the
paper presents results for C. difficile, the mixture estimation
algorithm can be generalized to other pathogens such as M.
tuberculosis. This method computes a deviance statistic and
uses a threshold value for this statistic to detect mixed infections.
As this algorithm was initially designed for C. difficile
and relies on a custom database of sequences to identify the
constituent strains, it could only be used for mixture propor-
tion estimation in our context. More recently, Sobkowiak et al.
[10] developed MixInfect, a method for mixture proportion
estimation using a Bayesian model-based clustering
technique. This method calculates the ratio of heterozygous
calls to total SNPs (single nucleotide polymorphisms) and
uses a threshold on this ratio to identify mixed samples. While
this algorithm can estimate mixture proportions, it does not
provide any functionality for resolving the constituent strains.
The most recent method, QuantTB by Anyansi et al. [11],
relies on a specially constructed publicly available database
of 2166 M. tuberculosis assemblies from NCBI [12]. This
method provides mixture estimates of WGS samples as well
as the identification of strains whose sequence is similar to the
ones included in the database. To determine the constituent
strains, this method compares the sample to the sequences
in the reference database, scoring each of the assemblies. The
algorithm then determines how many constituent strains are
present in a sample. This approach does not generalize to
situations where the underlying strains lack close representa-
tives in the database, which makes its performance highly
dependent on the database’s representation of the common
strains in the relevant local context.
In this paper, we address this problem with a tool called
SplitStrains, grounded in a rigorous statistical
framework. It is based on formulating, for a given set of
WGS reads, two alternative hypotheses, namely: the reads
belong to a single strain (null hypothesis) or to a mixture
of two strains (alternative hypothesis). We then use the EM
(Expectation-Maximization) algorithm [13] to estimate the
parameters of both hypotheses, and compare their likeli-
hoods to draw a conclusion. As a result, we simultaneously
obtain:
• A call to decide whether the sample represents a single
(pure) or a mixed infection,
• A likelihood ratio between the alternative and the null
hypothesis for the call, and,
• If mixed, the proportion of each constituent strain and a
Binary Sequence Alignment Map (BAM) file grouping the
reads belonging to each constituent strain.
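The comparison of the two hypotheses can be illustrated with a deliberately simplified sketch. It is not the SplitStrains model: as an assumption made only for this illustration, each variable site is summarized by a pair (alternate-allele read count, depth), the null hypothesis is a single binomial rate, and the alternative is a two-component binomial mixture fitted by EM. All function names and the toy data are hypothetical.

```python
import math

EPS = 1e-9  # keeps probabilities away from 0 and 1 before taking logs


def log_binom(k, n, p):
    """Log-probability of k alternate reads out of n under rate p."""
    p = min(max(p, EPS), 1 - EPS)
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log(1 - p)


def fit_single(sites):
    """Null hypothesis (assumed form): one shared rate, fitted by maximum likelihood."""
    q = sum(k for k, _ in sites) / sum(n for _, n in sites)
    return q, sum(log_binom(k, n, q) for k, n in sites)


def fit_mixture(sites, iters=200):
    """Alternative hypothesis (assumed form): two binomial components, fitted by EM."""
    freqs = [k / n for k, n in sites]
    p1, p2, w = min(freqs), max(freqs), 0.5  # start from the extreme frequencies
    for _ in range(iters):
        # E-step: responsibility of component 1 for each site.
        resp = []
        for k, n in sites:
            a = math.log(w) + log_binom(k, n, p1)
            b = math.log(1 - w) + log_binom(k, n, p2)
            m = max(a, b)
            ea, eb = math.exp(a - m), math.exp(b - m)
            resp.append(ea / (ea + eb))
        # M-step: re-estimate the mixing weight and both rates.
        w = min(max(sum(resp) / len(sites), EPS), 1 - EPS)
        num1 = sum(r * k for r, (k, _) in zip(resp, sites))
        den1 = sum(r * n for r, (_, n) in zip(resp, sites))
        num2 = sum((1 - r) * k for r, (k, _) in zip(resp, sites))
        den2 = sum((1 - r) * n for r, (_, n) in zip(resp, sites))
        p1, p2 = num1 / max(den1, EPS), num2 / max(den2, EPS)
    # Log-likelihood of the fitted mixture (log-sum-exp per site).
    ll = 0.0
    for k, n in sites:
        a = math.log(w) + log_binom(k, n, p1)
        b = math.log(1 - w) + log_binom(k, n, p2)
        m = max(a, b)
        ll += m + math.log(math.exp(a - m) + math.exp(b - m))
    return (p1, p2, w), ll


def log_likelihood_ratio(sites):
    """Log-likelihood of the alternative minus that of the null."""
    _, ll_null = fit_single(sites)
    _, ll_alt = fit_mixture(sites)
    return ll_alt - ll_null


# Toy data at depth 60: a roughly 70/30 mixture and a pure sample.
mixed_sites = [(42, 60), (18, 60), (40, 60), (20, 60), (44, 60), (17, 60)]
pure_sites = [(59, 60), (60, 60), (58, 60), (60, 60)]
print(log_likelihood_ratio(mixed_sites), log_likelihood_ratio(pure_sites))
```

On the toy data the log-likelihood ratio is much larger for the mixed sample than for the pure one. Any cut-off used to turn the ratio into a call would be a further assumption and is not specified here.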
Our results on both simulated and real M. tuberculosis data
show that SplitStrains is effective at identifying mixed
infections and continues to perform well even at a relatively
low depth of coverage (60×) and low genetic distance (20
SNPs) between strains. Moreover, SplitStrains outperforms
the previously published tools Mixed Infection
Estimator, MixInfect and QuantTB on simulated
data. Furthermore, our results show that SplitStrains
accurately separates the constituent strains provided that their
proportions are not too close to each other and they are not
too similar. SplitStrains is available on GitHub:
https://github.com/WGS-TB/SplitStrains.
METHODS
This part of the paper is organized as follows. First, we
briefly describe the datasets used in our analysis. Second,
we explain the construction of the feature vector used in
our probabilistic model and show how to use it to classify
an isolate. Third, we define the Naïve Bayes Classifier for
the assignment of reads to strains. Lastly, we show how this
approach can be generalized to three or more strains.
We begin by describing the datasets used in our analysis. We
report the average number of SNPs relative to the reference
genome in the Results section. Here we additionally report
the average number of heterogeneous SNPs, defined by a
0/1 in the GT field of the VCF file produced by aligning the
sample to the reference genome. We note that the number of
heterogeneous SNPs depends on the alignment and variant-calling
steps of the pipeline. Therefore, for the in silico
datasets, this number may be lower than the total number
of SNPs added to the reference genome when generating
the sample. We report the per-sample statistics in Table S1
(available in the online version of this article).
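The definition of a heterogeneous SNP given above can be made concrete with a minimal sketch that counts single-base records whose GT field is 0/1. It assumes a single-sample VCF in the standard tab-separated layout; the function name and the example records are hypothetical and are not taken from the SplitStrains pipeline.

```python
def count_heterogeneous_snps(vcf_lines):
    """Count single-base substitutions whose first sample has GT equal to 0/1."""
    count = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns, so no genotype
        ref, alt = fields[3], fields[4]
        # Keep only SNPs: one reference base replaced by one alternate base.
        if len(ref) != 1 or len(alt) != 1 or alt == ".":
            continue
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        sample_values = fields[9].split(":")
        if sample_values[format_keys.index("GT")] == "0/1":
            count += 1
    return count


# Hypothetical records: two heterogeneous SNPs, one homozygous SNP, one insertion.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample",
    "chr\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:60",
    "chr\t200\t.\tC\tT\t50\tPASS\t.\tGT:DP\t1/1:58",
    "chr\t300\t.\tG\tGA\t50\tPASS\t.\tGT:DP\t0/1:61",
    "chr\t400\t.\tT\tC\t50\tPASS\t.\tDP:GT\t55:0/1",
]
print(count_heterogeneous_snps(example_vcf))
```

Because the count is taken from the called genotypes, it reflects whatever the aligner and variant caller produced, which is why it can fall below the number of SNPs introduced when a sample is simulated.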
Dataset A, in vitro. The 48 mixed M. tuberculosis samples
presented in [10] are artificially generated in vitro by
combining the DNA from two clinical cultures of M.
Impact Statement
When multiple strains of a pathogenic organism are
present in a patient, it may be necessary to not only detect
this, but also to identify the individual strains. However,
this problem has not yet been solved for bacterial pathogens
processed via whole-genome sequencing. In this
paper, we propose the SplitStrains algorithm for
detecting multiple strains in a sample, identifying their
proportions, and inferring their sequences, in the case of
Mycobacterium tuberculosis. We test it on both simulated
and real data, with encouraging results. We believe that
our work opens new horizons in public health microbi-
ology by allowing a more precise detection, identification
and quantification of multiple infecting strains within a
sample.