On the Salient Limitations of the Methods of Assembly Theory and their Classification of Molecular Biosignatures

Abicumaran Uthamacumaran1,2, Felipe S. Abrahão3,4, Narsis A. Kiani5,6, and Hector Zenil*6,7
1Department of Physics and Psychology (Alumni), Concordia University,
Canada.
2McGill University, McGill Genome Center, Majewski Lab, Canada.
3Centre for Logic, Epistemology and the History of Science, University of
Campinas (UNICAMP), Brazil.
4DEXL, National Laboratory for Scientific Computing, Brazil.
5Department of Oncology-Pathology, Center for Molecular Medicine,
Karolinska Institutet, Sweden.
6Algorithmic Dynamics Lab, Karolinska Institutet, Sweden.
7School of Biomedical Engineering and Imaging Sciences, King’s College
London, U.K.
*Corresponding author. Email: hector.zenil@cs.ox.ac.uk, hector.zenil@kcl.ac.uk
Abstract
We demonstrate that the assembly pathway method underlying assembly theory (AT) is an encoding scheme already widely used by popular statistical compression algorithms. We show that in all cases (synthetic or natural) AT performs similarly to other simple coding schemes and underperforms compared to indexes based upon algorithmic probability, which take into account not only statistical repetitions but also the likelihood of other computable patterns. Our results imply that the assembly index offers no substantial improvement over existing methods, including traditional statistical ones, and that the separation between living and non-living compounds achieved by these methods had already been reported before.
Keywords: assembly theory, assembly index, complexity, biosigna-
tures, statistical coding, algorithmic information, LZ compression
1 Introduction
The distinction between living and nonliving systems has long fascinated
both scientists and philosophers. The question has been at the core of the
areas of systems biology and complexity science since their inception, while
the seminal concept of complexity—an irreducible emergent property among
simpler components in a system—has long been believed to be central to the
distinction between living systems and inanimate matter [8, 9, 35, 37, 44].
The first to discuss this nexus of issues was Erwin Schrödinger, whose book “What is Life?” explored the physical basis of life and the cell. He was followed by Claude Shannon, whose concept of entropy, shaped not only by communication theory but also by his interest in characterising life and intelligence, placed information at the core of the question of life.
Shannon proposed that his digital theory of communication and information
be applied to understanding information processing in biological systems [39].
By solving not only the problem of a mathematical definition for random-
ness but also the apparent bias toward simplicity underlying formal theories,
the concepts of algorithmic information, algorithmic randomness, and algo-
rithmic probability from Algorithmic Information Theory (AIT) abstract the
issue away from statistics and human personal biases and choices to recast it
in terms of fundamental mathematical first principles. These foundations are
the underpinnings of coding methods, and they are ultimately what explain
and justify their application as a generalisation of Shannon’s information
theory. AIT has also been motivated by questions about randomness, com-
plexity, and structure in the real world, formulating concepts ranging from
algorithmic probability [42], that formalises the discussion related to how
likely a computable process or object is to be produced by chance under in-
formation constraints, to the concept of logical depth [12], that frames the
discussion related to process memory, causal structure and how life can be
characterised otherwise that in terms of randomness and simplicity.
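For reference, the two AIT concepts just mentioned admit compact standard definitions, stated here for a prefix-free universal Turing machine U (the textbook formulations, included only to fix notation):

```latex
% Algorithmic probability of a string x: the chance that a
% prefix-free universal Turing machine U, fed a random
% program p, halts with output x.
m(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-|p|}

% The Coding Theorem ties m to prefix algorithmic
% (Kolmogorov-Chaitin) complexity K, so that the most
% probable outputs are exactly the simplest ones:
K(x) \;=\; -\log_2 m(x) + O(1)
```

Logical depth, by contrast, measures the running time of the near-shortest programs producing x, which is why it is suited to questions of process memory and structure rather than randomness alone.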
A recently introduced approach termed “Assembly Theory” (AT), featur-
ing a computable index, has been claimed to be a novel and superior approach
to distinguishing living from non-living systems and gauging the complexity
of molecular biosignatures with an assembly index or molecular assembly in-
dex (MA). In proposing MA as a new complexity measure that quantifies the
minimal number of bond-forming steps needed to construct a molecule, the
central claim advanced in [34] is that molecules with high molecular assembly
index (MA) values “are very unlikely to form abiotically, and the probability
of abiotic formation goes down as MA increases”. In other words, according
to the authors, “high MA molecules cannot form in detectable abundance
through random and unconstrained processes, implying that the existence of
high MA molecules depends on additional constraints imposed on the pro-
cess” [34]. We will use the notation ‘AT’, ‘assembly index’, or ‘MA’ to refer
to the aforementioned theory and the index derived therefrom.
The underlying intuition is that such an assembly index (by virtue of
minimising the length of the path necessary for an extrinsic agent to assemble
the object) would afford “a way to rank the relative complexity of objects
made up of the same building units on the basis of the pathway, exploiting
the combinatorial nature of these combinations” [32].
In order to support their central claim, the authors of Assembly Theory
state that “MA tracks the specificity of a path through the combinatorially
vast chemical space” [32] and that, as presented in Marshall et al. [33], it
“leads to a measure of structural complexity that accounts for the structure
of the object and how it could have been constructed, which is, in all cases,
computable and unambiguous”.
1.1 What a ZIP file can tell about life
The authors propose that molecules with high MA detected in contexts or samples generated by random processes, in which there are minimal (or no) biases in the formation of the objects, occur less frequently than molecules in alternative configurations, where extrinsic agents or a set of biases (such as those brought into play by evolutionary processes) play a significant role.
However, we found that what the authors have called AT [34] is a for-
mulation that mirrors the working of previous coding algorithms—though
no proper references or attributions are offered—in particular, statistical
lossless compression algorithms, whose purpose is to find redundancies [6].
These algorithms were dictionary-based, like run-length encoding (RLE),
Huffman [28], and Lempel-Ziv (LZ)-based [61]. They were all launched early
in the development of the field of compression for the purpose of detecting
identical copies that could be reused.
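To make the parallel concrete, the following is a minimal sketch (our Python illustration, not the authors' code) of the exact-copy detection at the heart of these dictionary coders, written as an LZ78-style parse whose phrase count serves as a crude complexity index:

```python
def lz78_parse(s: str) -> list[str]:
    """Greedy LZ78-style parse: scan left to right, emitting the
    shortest prefix not yet seen; every new phrase extends a
    previously recorded phrase by a single symbol."""
    seen: set[str] = set()
    phrases: list[str] = []
    current = ""
    for symbol in s:
        current += symbol
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:  # trailing phrase that repeats an earlier one
        phrases.append(current)
    return phrases

# Repetitive data parses into few, increasingly long phrases:
print(lz78_parse("ababababab"))  # ['a', 'b', 'ab', 'aba', 'ba', 'b']
# A string with no repeated structure needs one phrase per symbol:
print(lz78_parse("abcdefgh"))    # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
```

The phrase count grows slowly for repetitive inputs and near-linearly for patternless ones, which is precisely the regularity-counting behaviour the assembly index inherits.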
Lossless compression, incorporating the basic ideas of LZ compression,
has been widely applied in the context of living systems, including in a land-
mark paper published in 2005, where it was shown that it was not only ca-
pable of characterising DNA as a biosignature, but also of reconstructing the
main branches of an evolutionary phylogenetic tree from the compressibility
ratio of mammalian mtDNA sequences [31]. The same LZ algorithms have
been used for plagiarism detection, as measures of language distance, and for
clustering and classification [31]. In genetics, it is widely known that similar
species have similar nucleotide GC content, and that therefore a simple Shan-
non Entropy approach on a uniform distribution of G and C nucleotides—
effectively simply counting the exact repetitions of polymers [47]—can yield
a phylogenetic tree. LZ compression has been used in this same context [48],
and is central to complexity applications to living organisms, which are based
upon exactly the same grounds and on the idea of repetitive modules.
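The clustering and classification applications mentioned above typically rest on a compression-based distance. Below is a minimal sketch of the normalized compression distance (NCD), computed with Python's zlib (a DEFLATE, i.e. LZ77-based, compressor); the sequences are toy stand-ins, not the mtDNA data of [31]:

```python
import zlib

def csize(data: bytes) -> int:
    """Compressed size in bytes under DEFLATE (LZ77 + Huffman)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    Near 0 when the compressor can reuse x's regularities to encode y."""
    cx, cy = csize(x), csize(y)
    return (csize(x + y) - min(cx, cy)) / max(cx, cy)

# Toy nucleotide sequences: two near-identical, one unrelated.
s1 = b"ATGCGTACGTTAGC" * 50
s2 = b"ATGCGTACGTTAGC" * 49 + b"ATGCGTACGTTAGT"
s3 = b"GGCCTTAAGGCCTT" * 50
print(ncd(s1, s2))  # small: shared structure compresses across the pair
print(ncd(s1, s3))  # larger: little cross-compression is possible
```

Feeding the resulting pairwise distance matrix to any standard hierarchical clustering routine is the tree-building recipe used in the compression phylogenetics literature.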
LZ77/LZ78 is at the core of AT, but its assembly index method is weaker than resource-bounded measures introduced before [19, 41, 56]. LZ-based schemes have been used in compression since 1977, and they are behind formats such as ZIP, gzip, and GIF, exploited both for compression itself and as approximations to algorithmic (Solomonoff-Kolmogorov-Chaitin) complexity, one of the central indexes of AIT; this works because compressibility is sufficient proof of non-randomness. Being one of the LZ compression schemes [6], the assembly index calculation method looks for the largest substring matches and counts them only once, since they can be reused to reproduce the original object. But it is weaker than other approximating measures because, by definition, it only takes into consideration identical copies rather than the full spectrum of causal operations to which an object may be subject.
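This copy-reuse restriction can be made explicit on strings. The sketch below is our own illustrative analogue, not the published algorithm (the actual assembly index is defined over molecular graphs and obtained by searching joining pathways): it builds a string left to right, lets any substring of the part already built be reused for free, and counts only the joining steps.

```python
def greedy_assembly_steps(s: str) -> int:
    """Greedy upper bound on an assembly-style pathway length:
    construct s left to right by joining blocks, where a block is
    a single new symbol or a copy of any substring of the part
    already built (identical copies cost one joining step each)."""
    built = ""
    steps = 0
    i = 0
    while i < len(s):
        # extend j while s[i:j] is still a copy of something already built
        j = i + 1
        while j <= len(s) and len(s[i:j]) <= len(built) and s[i:j] in built:
            j += 1
        # the last length that passed is j - 1; fall back to one symbol
        block = s[i:j - 1] if j - 1 > i else s[i]
        built += block
        steps += 1
        i += len(block)
    return steps

# Doubling behaviour on pure repetition: few steps for a long string.
print(greedy_assembly_steps("ab" * 8))    # 5
print(greedy_assembly_steps("abcdefgh"))  # 8: no copies to exploit
```

Only identical copies are rewarded here; any regularity that is not a verbatim copy (a reversed block, say, or an arithmetic pattern) is invisible to the step count, which is exactly the limitation discussed below.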
Our results demonstrate that the claim that AT may help not only to
distinguish life from non-life but also to identify non-terrestrial life, explain
evolution and natural selection, and unify physics and biology is a major
overstatement. (See also the Appendix for a detailed presentation of the
results). What AT amounts to is a re-purposing of some elementary al-
gorithms in computer science in a sub-optimal application to life detection
that has been suggested and undertaken before [12, 51], even generating the
same results when applied to separating organic from non-organic chemical
compounds [46]. By empirically demonstrating the higher predictive performance of AIT-based complexity measures, such as approximations to algorithmic complexity, in experimental applications to molecular classification, we extend the results reported before in [46], which, years before the introduction of Assembly Theory, had already demonstrated the capability of these measures to separate chemical compounds by their particular properties, including organic from inorganic compounds. Further research based on the same underlying ideas of perturbation/mutation analysis together with algorithmic information theory has also recently been used to detect and decode bio- and technosignatures [60].
2 MA and compression algorithms
By employing different types of data (on the same subset of molecules [32, 34]), as shown in Figures 4 and 5, we demonstrate that other measures applied to other (chemical and molecular) data reproduce the results AT's authors claimed were unique to their index. We have shown that the same indexes used and shown in these figures, previously reported to separate organic from non-organic compounds in [46], also separate what the authors thought was a unique type of spectral data. Using exactly the same data input utilised by the authors of AT in their original paper [34], we have shown that
their MA index, also known as the assembly index, displays exactly the same
behaviour as other complexity indexes. These results show that the assembly
index calculation method not only is a compression scheme (as proven in [6]),
but also performs like one for all intents and purposes, and does not seem
to afford any classificatory advantage either by virtue of its method or in
combination with any property of the input data (e.g. mass spectra).
Assembly Theory claims that MA can discriminate living from non-living molecules, testing this claim against a small, cherry-picked subset of samples spanning biological extracts, abiotic materials, and inorganic (dead) matter. We repeated the experiment using the binarised MS2 spectra peak matrices provided in the source data in [34]. Our reproduced findings are shown in Figures 1 and 2 (see also Appendix E for more detailed information).
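To indicate the kind of baseline involved in these comparisons, the sketch below (our illustration; the array shapes and synthetic matrices are hypothetical, not the source data of [34]) scores a binarised peak matrix by the compressed size of its bitstring, the simplest of the coding indexes compared here:

```python
import zlib
import numpy as np

def compression_index(peaks: np.ndarray) -> int:
    """Complexity score for a binarised spectral matrix: the size
    in bytes of its row-major bitstring after DEFLATE compression.
    Exact repetitions shrink the score, i.e. the same regularities
    the assembly index rewards."""
    bits = np.packbits(peaks.astype(np.uint8).ravel())
    return len(zlib.compress(bits.tobytes(), 9))

# Synthetic sanity check: a matrix tiled from a repeated block
# should score far below an equally sized patternless one.
rng = np.random.default_rng(0)
structured = np.tile(rng.integers(0, 2, (4, 32)), (16, 1))  # 64 x 32
patternless = rng.integers(0, 2, (64, 32))
print(compression_index(structured) < compression_index(patternless))  # True
```

Ranking molecules by such a score and measuring the class separation on the same inputs is then a like-for-like comparison against MA.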
Thus, the coding indexes systematically outperform the MA index as a discriminant of living vs. non-living systems. MA operates on the same basis as all popular statistical lossless compression algorithms: the principle of ‘counting exact repetitions’ in data, upon which AT fully relies. These are basic coding schemes introduced at the inception of information theory and computer science; they do not incorporate the many advances made in recent decades in coding, compression, and resource-bounded algorithmic complexity theory [59], and they cannot explain selection and evolution or unify physics and biology [40] beyond the connections already made [26].
As demonstrated here, the characterisation of molecules using mass spec-
trometry signatures is not a challenge for other equally computable and
statistically-driven indexes. Other indexes are equally capable of discriminating biosignature categories, whether represented by InChI strings, bond distance matrices, or mass spectra (MS2 peak matrices), thus disproving the claim that MA is the
only experimentally valid measure of molecular complexity.
3 Limitations of MA as a complexity measure
We have also shown that as soon as the MA index is confronted with more
complicated cases of non-linear modularity, it underperforms or misses obvi-
ous regularities. As shown in this article and detailed further in the Appendix, MA, and its generalisation in the hypothesis called AT,
is prone to false positives and fails both in theory and in practice to capture
the notion of high-level causality beyond non-trivial statistical repetitions—
that Shannon Entropy could not have already captured in the first place—
which is necessary for distinguishing a serendipitous extrinsic agent (e.g.
a chemical reaction resulting from biological processes) that constructs or