On the Salient Limitations of the Methods of Assembly Theory and their Classification of Molecular Biosignatures

Abicumaran Uthamacumaran1,2, Felipe S. Abrahão3,4, Narsis A. Kiani5,6, and Hector Zenil*6,7
1Department of Physics and Psychology (Alumni), Concordia University,
Canada.
2McGill University, McGill Genome Center, Majewski Lab, Canada.
3Centre for Logic, Epistemology and the History of Science, University of
Campinas (UNICAMP), Brazil.
4DEXL, National Laboratory for Scientific Computing, Brazil.
5Department of Oncology-Pathology, Center for Molecular Medicine,
Karolinska Institutet, Sweden.
6Algorithmic Dynamics Lab, Karolinska Institutet, Sweden.
7School of Biomedical Engineering and Imaging Sciences, King’s College
London, U.K.
*Corresponding author. Email: hector.zenil@cs.ox.ac.uk, hector.zenil@kcl.ac.uk
Abstract
We demonstrate that the assembly pathway method underlying assembly theory (AT) is an encoding scheme already widely used by popular statistical compression algorithms. We show that in all cases (synthetic or natural) AT performs similarly to other simple coding schemes and underperforms compared to indexes based upon algorithmic probability, which take into account not only statistical repetitions but also the likelihood of other computable patterns. Our results imply that the assembly index offers no substantial improvement over existing methods, including traditional statistical ones, and that the separation between living and non-living compounds achieved by these methods had already been reported before.
Keywords: assembly theory, assembly index, complexity, biosigna-
tures, statistical coding, algorithmic information, LZ compression
1 Introduction
The distinction between living and nonliving systems has long fascinated
both scientists and philosophers. The question has been at the core of the
areas of systems biology and complexity science since their inception, while
the seminal concept of complexity—an irreducible emergent property among
simpler components in a system—has long been believed to be central to the
distinction between living systems and inanimate matter [8, 9, 35, 37, 44].
The first to discuss this nexus of issues was Erwin Schrödinger, whose book “What is Life?” explored the physical basis of life and the cell. He was followed by Claude Shannon, whose concept of entropy, shaped not only by communication theory but also by his interest in characterising life and intelligence, placed information at the core of the question of life.
Shannon proposed that his digital theory of communication and information
be applied to understanding information processing in biological systems [39].
By solving not only the problem of a mathematical definition for random-
ness but also the apparent bias toward simplicity underlying formal theories,
the concepts of algorithmic information, algorithmic randomness, and algo-
rithmic probability from Algorithmic Information Theory (AIT) abstract the
issue away from statistics and human personal biases and choices to recast it
in terms of fundamental mathematical first principles. These foundations are
the underpinnings of coding methods, and they are ultimately what explain
and justify their application as a generalisation of Shannon’s information
theory. AIT has also been motivated by questions about randomness, com-
plexity, and structure in the real world, formulating concepts ranging from
algorithmic probability [42], that formalises the discussion related to how
likely a computable process or object is to be produced by chance under in-
formation constraints, to the concept of logical depth [12], that frames the
discussion related to process memory, causal structure and how life can be
characterised otherwise that in terms of randomness and simplicity.
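For reference, the two AIT concepts just mentioned admit compact standard definitions, stated here for a prefix-free universal Turing machine U (the textbook formulations, included only to fix notation):

```latex
% Algorithmic probability of a string x: the chance that a
% prefix-free universal Turing machine U, fed a random
% program p, halts with output x.
m(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-|p|}

% The Coding Theorem ties m to prefix algorithmic
% (Kolmogorov-Chaitin) complexity K, so that the most
% probable outputs are exactly the simplest ones:
K(x) \;=\; -\log_2 m(x) + O(1)
```

Logical depth, by contrast, measures the running time of the near-shortest programs producing x, which is why it is suited to questions of process memory and structure rather than randomness alone.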
A recently introduced approach termed “Assembly Theory” (AT), featur-
ing a computable index, has been claimed to be a novel and superior approach
to distinguishing living from non-living systems and gauging the complexity
of molecular biosignatures with an assembly index or molecular assembly in-
dex (MA). In proposing MA as a new complexity measure that quantifies the
minimal number of bond-forming steps needed to construct a molecule, the
central claim advanced in [34] is that molecules with high molecular assembly
index (MA) values “are very unlikely to form abiotically, and the probability
of abiotic formation goes down as MA increases”. In other words, according
to the authors, “high MA molecules cannot form in detectable abundance
through random and unconstrained processes, implying that the existence of
high MA molecules depends on additional constraints imposed on the pro-
cess” [34]. We will use the notation ‘AT’, ‘assembly index’, or ‘MA’ to refer
to the aforementioned theory and the index derived therefrom.
The underlying intuition is that such an assembly index (by virtue of
minimising the length of the path necessary for an extrinsic agent to assemble
the object) would afford “a way to rank the relative complexity of objects
made up of the same building units on the basis of the pathway, exploiting
the combinatorial nature of these combinations” [32].
In order to support their central claim, the authors of Assembly Theory
state that “MA tracks the specificity of a path through the combinatorially
vast chemical space” [32] and that, as presented in Marshall et al. [33], it
“leads to a measure of structural complexity that accounts for the structure
of the object and how it could have been constructed, which is, in all cases,
computable and unambiguous”.
1.1 What a ZIP file can tell about life
The authors propose that molecules with high MA detected in contexts or samples generated by random processes, in which there are minimal (or no) biases in the formation of the objects, occur less frequently than molecules in alternative configurations, where extrinsic agents or a set of biases (such as those brought into play by evolutionary processes) play a significant role.
However, we found that what the authors have called AT [34] is a for-
mulation that mirrors the working of previous coding algorithms—though
no proper references or attributions are offered—in particular, statistical
lossless compression algorithms, whose purpose is to find redundancies [6].
These algorithms were dictionary-based, like run-length encoding (RLE),
Huffman [28], and Lempel-Ziv (LZ)-based [61]. They were all launched early
in the development of the field of compression for the purpose of detecting
identical copies that could be reused.
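To make the parallel concrete, the following is a minimal sketch (our Python illustration, not the authors' code) of the exact-copy detection at the heart of these dictionary coders, written as an LZ78-style parse whose phrase count serves as a crude complexity index:

```python
def lz78_parse(s: str) -> list[str]:
    """Greedy LZ78-style parse: scan left to right, emitting the
    shortest prefix not yet seen; every new phrase extends a
    previously recorded phrase by a single symbol."""
    seen: set[str] = set()
    phrases: list[str] = []
    current = ""
    for symbol in s:
        current += symbol
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:  # trailing phrase that repeats an earlier one
        phrases.append(current)
    return phrases

# Repetitive data parses into few, increasingly long phrases:
print(lz78_parse("ababababab"))  # ['a', 'b', 'ab', 'aba', 'ba', 'b']
# A string with no repeated structure needs one phrase per symbol:
print(lz78_parse("abcdefgh"))    # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
```

The phrase count grows slowly for repetitive inputs and near-linearly for patternless ones, which is precisely the regularity-counting behaviour the assembly index inherits.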
Lossless compression, incorporating the basic ideas of LZ compression,
has been widely applied in the context of living systems, including in a land-
mark paper published in 2005, where it was shown that it was not only ca-
pable of characterising DNA as a biosignature, but also of reconstructing the
main branches of an evolutionary phylogenetic tree from the compressibility
ratio of mammalian mtDNA sequences [31]. The same LZ algorithms have
been used for plagiarism detection, as measures of language distance, and for
clustering and classification [31]. In genetics, it is widely known that similar
species have similar nucleotide GC content, and that therefore a simple Shan-
non Entropy approach on a uniform distribution of G and C nucleotides—
effectively simply counting the exact repetitions of polymers [47]—can yield
a phylogenetic tree. LZ compression has been used in this same context [48],
and is central to complexity applications to living organisms, which are based
upon exactly the same grounds and on the idea of repetitive modules.
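The clustering and classification applications mentioned above typically rest on a compression-based distance. Below is a minimal sketch of the normalized compression distance (NCD), computed with Python's zlib (a DEFLATE, i.e. LZ77-based, compressor); the sequences are toy stand-ins, not the mtDNA data of [31]:

```python
import zlib

def csize(data: bytes) -> int:
    """Compressed size in bytes under DEFLATE (LZ77 + Huffman)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    Near 0 when the compressor can reuse x's regularities to encode y."""
    cx, cy = csize(x), csize(y)
    return (csize(x + y) - min(cx, cy)) / max(cx, cy)

# Toy nucleotide sequences: two near-identical, one unrelated.
s1 = b"ATGCGTACGTTAGC" * 50
s2 = b"ATGCGTACGTTAGC" * 49 + b"ATGCGTACGTTAGT"
s3 = b"GGCCTTAAGGCCTT" * 50
print(ncd(s1, s2))  # small: shared structure compresses across the pair
print(ncd(s1, s3))  # larger: little cross-compression is possible
```

Feeding the resulting pairwise distance matrix to any standard hierarchical clustering routine is the tree-building recipe used in the compression phylogenetics literature.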
LZ77/LZ78 is at the core of AT, but its assembly index method is weaker than resource-bounded measures introduced before [19, 41, 56]. LZ-based schemes have been used in compression since 1977, and they are behind formats such as ZIP, gzip, and GIF, exploited both for compression itself and as approximations to algorithmic (Solomonoff-Kolmogorov-Chaitin) complexity, one of the central indexes of AIT; this works because compressibility is sufficient proof of non-randomness. Being one of the LZ compression schemes [6], the assembly index calculation method looks for the largest substring matches and counts them only once, since they can be reused to reproduce the original object. But it is weaker than other approximating measures because, by definition, it only takes into consideration identical copies rather than the full spectrum of causal operations to which an object may be subject.
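This copy-reuse restriction can be made explicit on strings. The sketch below is our own illustrative analogue, not the published algorithm (the actual assembly index is defined over molecular graphs and obtained by searching joining pathways): it builds a string left to right, lets any substring of the part already built be reused for free, and counts only the joining steps.

```python
def greedy_assembly_steps(s: str) -> int:
    """Greedy upper bound on an assembly-style pathway length:
    construct s left to right by joining blocks, where a block is
    a single new symbol or a copy of any substring of the part
    already built (identical copies cost one joining step each)."""
    built = ""
    steps = 0
    i = 0
    while i < len(s):
        # extend j while s[i:j] is still a copy of something already built
        j = i + 1
        while j <= len(s) and len(s[i:j]) <= len(built) and s[i:j] in built:
            j += 1
        # the last length that passed is j - 1; fall back to one symbol
        block = s[i:j - 1] if j - 1 > i else s[i]
        built += block
        steps += 1
        i += len(block)
    return steps

# Doubling behaviour on pure repetition: few steps for a long string.
print(greedy_assembly_steps("ab" * 8))    # 5
print(greedy_assembly_steps("abcdefgh"))  # 8: no copies to exploit
```

Only identical copies are rewarded here; any regularity that is not a verbatim copy (a reversed block, say, or an arithmetic pattern) is invisible to the step count, which is exactly the limitation discussed below.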
Our results demonstrate that the claim that AT may help not only to
distinguish life from non-life but also to identify non-terrestrial life, explain
evolution and natural selection, and unify physics and biology is a major
overstatement. (See also the Appendix for a detailed presentation of the
results). What AT amounts to is a re-purposing of some elementary al-
gorithms in computer science in a sub-optimal application to life detection
that has been suggested and undertaken before [12, 51], even generating the
same results when applied to separating organic from non-organic chemical
compounds [46]. By empirically demonstrating the higher predictive performance of AIT-based complexity measures, such as approximations to algorithmic complexity, in experimental applications to molecular classification, we extend the results reported before in [46], which, years before the introduction of Assembly Theory, had already demonstrated the capability of these measures to separate chemical compounds by their particular properties, including organic from inorganic compounds. Further research based on the same underlying ideas of perturbation/mutation analysis together with algorithmic information theory has also recently been used to detect and decode bio- and technosignatures [60].
2 MA and compression algorithms
By employing different types of data (on the same subset of molecules [32, 34]), as shown in Figures 4 and 5, we demonstrate that other measures applied to other (chemical and molecular) data reproduce the results AT's authors claimed were unique to their index. We have shown that the same indexes used and shown in these figures, previously reported to separate organic from non-organic compounds in [46], also separate what the authors thought was a unique type of spectral data. Using exactly the same data input utilised by the authors of AT in their original paper [34], we have shown that
their MA index, also known as the assembly index, displays exactly the same
behaviour as other complexity indexes. These results show that the assembly
index calculation method not only is a compression scheme (as proven in [6]),
but also performs like one for all intents and purposes, and does not seem
to afford any classificatory advantage either by virtue of its method or in
combination with any property of the input data (e.g. mass spectra).
Assembly Theory claims that MA can discriminate living from non-living molecules, testing this claim against a small, cherry-picked subset of samples spanning biological extracts, abiotic materials, and inorganic (dead) matter. We repeated the experiment using the binarised MS2 spectra peak matrices provided in the source data in [34]. Our reproduced findings are shown in Figures 1 and 2 (see also Appendix E for more detailed information).
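To indicate the kind of baseline involved in these comparisons, the sketch below (our illustration; the array shapes and synthetic matrices are hypothetical, not the source data of [34]) scores a binarised peak matrix by the compressed size of its bitstring, the simplest of the coding indexes compared here:

```python
import zlib
import numpy as np

def compression_index(peaks: np.ndarray) -> int:
    """Complexity score for a binarised spectral matrix: the size
    in bytes of its row-major bitstring after DEFLATE compression.
    Exact repetitions shrink the score, i.e. the same regularities
    the assembly index rewards."""
    bits = np.packbits(peaks.astype(np.uint8).ravel())
    return len(zlib.compress(bits.tobytes(), 9))

# Synthetic sanity check: a matrix tiled from a repeated block
# should score far below an equally sized patternless one.
rng = np.random.default_rng(0)
structured = np.tile(rng.integers(0, 2, (4, 32)), (16, 1))  # 64 x 32
patternless = rng.integers(0, 2, (64, 32))
print(compression_index(structured) < compression_index(patternless))  # True
```

Ranking molecules by such a score and measuring the class separation on the same inputs is then a like-for-like comparison against MA.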
Thus, the coding indexes systematically outperform the MA index as a discriminant of living vs. non-living systems. MA operates on the same basis as all popular statistical lossless compression algorithms: the principle of ‘counting exact repetitions’ in data, upon which AT fully relies. These are basic coding schemes introduced at the inception of information theory and computer science; they do not incorporate the many advances made in recent decades in coding, compression, and resource-bounded algorithmic complexity theory [59], and they cannot explain selection and evolution or unify physics and biology [40] beyond the connections already made [26].
As demonstrated here, the characterisation of molecules using mass spec-
trometry signatures is not a challenge for other equally computable and
statistically-driven indexes. Other indexes are equally capable of discriminating biosignature categories, whether represented by InChI strings, bond distance matrices, or mass spectra (MS2 peak matrices), thus disproving the claim that MA is the
only experimentally valid measure of molecular complexity.
3 Limitations of MA as a complexity measure
We have also shown that as soon as the MA index is confronted with more
complicated cases of non-linear modularity, it underperforms or misses obvi-
ous regularities. As shown in this article and detailed further in the Appendix, MA, and its generalisation in the hypothesis called AT,
is prone to false positives and fails both in theory and in practice to capture
the notion of high-level causality beyond non-trivial statistical repetitions—
that Shannon Entropy could not have already captured in the first place—
which is necessary for distinguishing a serendipitous extrinsic agent (e.g.
a chemical reaction resulting from biological processes) that constructs or