NLP-BASED CLASSIFICATION OF SOFTWARE TOOLS FOR
METAGENOMICS SEQUENCING DATA ANALYSIS INTO EDAM
SEMANTIC ANNOTATION
Kaoutar Daoud Hiri
Jožef Stefan International Postgraduate School
Ljubljana, SI 1000, Slovenia
BioSistemika
Ljubljana, SI 1000, Slovenia
kdhiri@biosistemika.com
Matjaž Hren
BioSistemika
Ljubljana, SI 1000, Slovenia
matjaz@scinote.net
Tomaž Curk
Faculty of Computer and Information Science
University of Ljubljana
Večna pot 113, 1000 Ljubljana, Slovenia
tomaz.curk@fri.uni-lj.si
ABSTRACT
Motivation:
The rapid growth of metagenomics sequencing data makes metagenomics increasingly
dependent on computational and statistical methods for fast and efficient analysis. Consequently,
novel analysis tools for big-data metagenomics are constantly emerging. One of the biggest
challenges for researchers occurs in the analysis planning stage: selecting the most suitable
metagenomics software tool to gain valuable insights from sequencing data. The building process of
data analysis pipelines is often laborious and time-consuming since it requires a deep and critical
understanding of how to apply a particular tool to complete a specified metagenomics task.
Results:
We have addressed this challenge by using machine learning methods to develop a
classification system of metagenomics software tools into 13 classes (11 semantic annotations of
EDAM and two virus-specific classes) based on the descriptions of the tools. We trained three
classifiers (Naive Bayes, Logistic Regression, and Random Forest) using 15 text feature extraction
techniques (TF-IDF, GloVe, BERT-based models, and others). The manually curated dataset
includes 224 software tools and contains text from the abstract and the methods section of the tools’
publications. The best classification performance, with an Area Under the Precision-Recall Curve
score of 0.85, is achieved using Logistic regression, BioBERT for text embedding, and text from
abstracts only. The proposed system provides accurate and unified identification of metagenomics
data analysis tools and tasks, which is a crucial step in the construction of metagenomics data
analysis pipelines.
Keywords
natural language processing, software tool classification, information retrieval, language models,
metagenomics, EDAM ontology
1 Introduction
Metagenomics aims to provide insight into the genetic material present in various environmental samples. Viral
metagenomics, for example, studies viral communities in water, soil, animals, and plants. The most common approach
in metagenomics is to use high-throughput sequencing (HTS) of DNA or RNA, which generates millions of short-read
nucleotide sequences. HTS data are used to detect and quantify genomes and transcriptomes in a biological sample. The
widespread adoption of HTS techniques in biological studies caused a rapid increase in the volume of metagenomics
data that needs to be analyzed as efficiently and rapidly as possible. These metagenomics big data make the field
increasingly dependent on computational and statistical methods that lead to discovering new knowledge from such
data. Consequently, new analysis tools for big-data metagenomics are constantly emerging [1], e.g. 2500 new tools
were produced in 2016. HTS data analysis tools are computer programs that assist users with computational analyses of
DNA and RNA sequences to understand their features and functionality using different analytical methods. Interest in
such analysis may be motivated by different research questions, ranging from pathogen monitoring and identification
to identifying all organisms in a sequenced biological sample. The standard approach to achieve this is to apply a
combination of trimming, assembly, alignment and mapping, annotation, and other complex pipelines of software
algorithms to HTS data.
HTS data analysis tools play an essential role in the pipeline construction process. Helping scientists select and use
the appropriate tools facilitates the development of analysis-specific efficient pipelines and updating of existing ones.
Individual institutions with various project constraints increasingly use metagenomics tools and gradually improve their
knowledge and tool use. Under these circumstances, selecting the most suitable metagenomics software tool to gain
valuable data insights can be complex and confusing for people involved in the pipeline-building process.
Before adding a tool to a pipeline, it is essential to know certain details about it. What are the required inputs? Which
input and output file formats are supported? Most importantly, which data analysis task does the tool perform? “Task”
refers to the function of the metagenomics tool or the analysis it performs. Having an overview of all the available
tools for a given task is also crucial. The results provided by search engines are too unstructured to allow for a swift
differentiation and comparison of similar tools. Furthermore, selecting a suitable tool for each data analysis step based
on official publications and websites is not straightforward. Therefore, several benchmark studies tried to address
“the best tool for the task” challenge, considering different perspectives, e.g. plant-associated metagenome analysis
tools [2–4], machine learning-based approaches for metagenome analysis [3,5], task-specific tools for mapping [2,6]
and assembly [4], and complete pipelines for virus classification [7–9] and taxonomic classification [10–12].
Other fields face a similar challenge with the abundance of software to classify. Machine learning approaches for
software classification have been widely used in the cybersecurity domain [13,14]. Examples include data protection by
developing misuse-based systems that detect malicious code and classify malware into different known families, e.g.
Worm, Trojan, Backdoor, Ransomware, and others. Another active area is anomaly-detection-based systems, which
cluster binaries that behave similarly to identify new categories.
There is a plethora of metagenomics tool functions available. Understanding the functions of a given tool and comparing
it with similar tools are complicated tasks. Different benchmark efforts for metagenomics tools are published regularly.
Still, they are often incomplete, covering only a specific research question, including a limited set of tools, focusing
extensively on technical metrics, or lacking transparency and continuity.
The Galaxy platform [15] provides a recommendation-based solution [16] to help users create workflows. The
recommendations are based on data from more than 18000 workflows and thousands of available tools for various
scientific analyses. The deep learning-based recommendation system uses the tool sequences, the workflow quality, and
the pattern analysis of tool usage to suggest highly relevant tools to the users for their specific data analysis. A set of
tool sequences is extracted from each workflow created by the platform users. This approach is not fully personalized,
as it only considers one metric, i.e., the similarity between tool sequences in workflows. The system will recommend
the same next-step set of tools to all the users with the same built sequence. Furthermore, it limits the system to the
workflow data available on the platform’s internal database, where a certain type of analysis can predominate at a
specific point in time. These constraints directly influence the quality of the recommendations, especially for minority
user profiles, who will receive low-quality or unsuitable tool recommendations more frequently.
Machine learning-based classification systems of research papers were developed to help users find the appropriate
paper. The search can be directed towards differentiating the topics [17, 18] or be focused on specific domains, e.g.
computer science [19,20] or bioinformatics [21].
Classification systems use different algorithms and combinations of paper sections. In some works [19,22] they rely on
established ontologies such as CSO - the computer science ontology [23], EDAM - the ontology of bio-scientific data
analysis and data management [24], and SWO - the software ontology [25].
We propose a machine learning-based system that uses curated and peer-reviewed abstract text descriptions to classify
metagenomics tools into classes representing their main task. The classification system helps users investigate tools more quickly, decide where a tool fits in the metagenomics pipeline construction process, and efficiently select tools from 13 different classes.
2 Methods
Our main goal was to be able to infer the main task of metagenomics tools from their description in natural text. We
explored different combinations of the classification algorithm, its set of hyperparameters, the textual description, and
the text embedding method to identify the best model for the task.
2.1 Data sources
The information contained in most scientific papers is typically divided into the title, abstract, introduction, methods,
results, and discussion sections. We manually gathered descriptions from the paper publications of 224 metagenomics
tools. We collected the abstract sections in the “abstracts only” dataset and the methods section in the “methods
only” dataset. We also prepared tool descriptions that include both the abstracts and methods sections in the “abstracts+methods” dataset (Supplementary Datasets S1, S2, and S3; see also Supplementary Section S1). All datasets
include the title of the paper as the first sentence in the description of each tool. Each record in the collected datasets
represents a single tool and contains the tool’s name, description, and task (class) as represented in Table 1.
Table 1: Excerpt of raw “abstracts only” dataset for five tools belonging to different categories.
Tool name   | Tool description                                  | Tool task (Class)
KrakenUniq  | KrakenUniq: confident and fast metagenomics cl..  | Classification
ViruDetect  | ViruDetect: An automated pipeline for efficie..   | Virus identification
ALLPATHS    | ALLPATHS: de novo assembly of whole-genome sho..  | Assembly
Bambino     | Bambino: a variant detector and alignment view..  | Visualisation
imGLAD      | imGLAD: accurate detection and quantification     | Abundance estimation
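For illustration, such records can be handled programmatically as follows (a minimal sketch assuming the curated dataset is exported as a CSV file named abstracts_only.csv with columns tool_name, description, and task; the file name, format, and column names are assumptions, not taken from the supplementary material):

import pandas as pd

# Hypothetical export of the curated "abstracts only" dataset.
df = pd.read_csv("abstracts_only.csv")   # columns: tool_name, description, task
print(df.shape)                          # expected: (224, 3)
print(df["task"].value_counts())         # class distribution over the 13 tasks (cf. Figure 1)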
2.2 Task ontology
The diverse and complex operations in bio-scientific data analysis lead us to rely on the well-established and comprehensive EDAM ontology [24] to categorize the tools from a functional perspective. The 11 classes comprise bioinformatics
operations and processes from the EDAM ontology: “(Sequence) alignment”, “(Taxonomic) classification”, “Mapping”,
“(Sequence) assembly”, “(Sequence) trimming”, “(Sequencing) quality control”, “(Sequence) annotation”, “(Sequence)
assembly validation”, “(RNA-seq quantification for) abundance estimation”, “SNP-Discovery”, “Visualization”.
We defined two additional classes: “Virus detection” and “Virus identification”. We assign to these two classes viral
analysis tools classified as machine learning tools in EDAM ontology, e.g. DeepVirFinder [26] and VirNet [27]. We
assign other viral analysis pipelines to the two classes even if the pipelines include several tools belonging to other
EDAM classes, such as K-mer counting, assembly, mapping, and others. Examples of such tools are Kodoja [28],
VirFind [29] and VirusFinder [30], which are all developed for virus detection and identification.
We assigned 224 tools into 13 tasks (classes). Some tools can be used for several tasks and thus belong to several
classes. However, we only assigned them to one of the 13 classes, i.e., to the main task for which they were designed,
see Supplementary Section S2. The obtained class distribution is shown in Figure 1.
2.3 Data pre-processing
Before a classifier can use the available data, the appropriate pre-processing steps are required. The steps involved
in extracting data from a tool description are summarized in Figure 2. To create features from the raw text, train the
classifiers and infer machine learning models, we performed the following steps: text cleaning and preparation, label
coding, and vector representation of text (Supplementary Datasets S4, S5, and S6). For text cleaning and preparation,
we use downcasing, lemmatization, removal of stop words, possessive pronouns, words composed of one or two letters,
words starting with digits, special characters, punctuation signs, numbers, and links. We represent the class variable as
a nominal discrete variable with 13 different values. We then generated text vector representations, which are discussed
in the following subsection.
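For illustration, the cleaning and preparation step described above could be sketched as follows (a minimal approximation assuming NLTK for stop-word removal and lemmatization; the authors' exact rules and libraries are not reproduced here):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_description(text: str) -> str:
    text = text.lower()                          # downcasing
    text = re.sub(r"https?://\S+", " ", text)    # remove links
    text = re.sub(r"[^a-z\s]", " ", text)        # drop digits, punctuation, and special characters
    tokens = [lemmatizer.lemmatize(t) for t in text.split()]
    # drop stop words (including possessive pronouns) and words of one or two letters
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) > 2]
    return " ".join(tokens)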
Figure 1: The class distribution shows the number of tools assigned to each task.
Figure 2: Process of extraction of information from text.
2.4 Vector representation of text
To train the different classifiers, we represented the text description of the tools as a vector of numbers using language
models, prediction-based and frequency-based techniques (Supplementary Datasets S7-S42).
2.4.1 Word embedding methods
We used and evaluated the 12 most commonly used approaches to extract features from the text. We describe them in
the following paragraphs.
TF-IDF
for a word in a document is calculated by multiplying the frequency of the term (term frequency) [31] of a
word in a document with the inverse document frequency of a word [32] in a set of documents. If the word is very
common and appears in many documents, this number will approach 0. Otherwise, the TF-IDF will approach 1.
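For illustration, a TF-IDF representation of the cleaned descriptions can be obtained as follows (a sketch using scikit-learn's TfidfVectorizer with default settings; the toy descriptions and vectorizer configuration are assumptions, not the study's setup):

from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "krakenuniq confident fast metagenomics classification",
    "allpaths novo assembly whole genome shotgun",
]                                                 # cleaned tool descriptions (toy example)
vectorizer = TfidfVectorizer()                    # defaults: smoothed idf, L2-normalised document vectors
X_tfidf = vectorizer.fit_transform(descriptions)  # sparse matrix: one row per tool, one column per term
print(X_tfidf.shape)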
GloVe
Embeddings [33], which stands for global vectors, capture the semantic context of words using both local
statistics (local word context) and global statistics (word co-occurrences) to generate a word vector. This regression
neural network, trained on five combinations of general domain corpora (English Wikipedia and Gigaword), combines
the advantages of global matrix factorization and local context window methods. It uses a gradient descent optimization
algorithm and a decreasing weighting function where distant word pairs are expected to have less information about
their relationship.
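As an illustrative sketch (using the publicly available glove-wiki-gigaword-100 vectors through gensim; the specific pre-trained GloVe variant used in the study may differ), a description can be embedded by averaging its word vectors:

import numpy as np
import gensim.downloader as api

# 100-dimensional Wikipedia+Gigaword GloVe vectors (placeholder choice).
glove = api.load("glove-wiki-gigaword-100")

def glove_embedding(text: str) -> np.ndarray:
    # Average the vectors of all in-vocabulary words (the averaging step of Section 2.4.2).
    vectors = [glove[word] for word in text.split() if word in glove]
    return np.mean(vectors, axis=0) if vectors else np.zeros(glove.vector_size)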
ELMO
[34], deep contextualized word representation, represents each token based on the complete input sentence.
The word representations combine the internal states of a pre-trained bidirectional language model (biLM) in a linear
function learned by the end task model.
BERT
[35], which stands for Bidirectional Encoder Representations from Transformers, improves the fine-tuning-based
strategies for applying pre-trained language representations to downstream tasks. It uses two unsupervised tasks during
pre-training: binarized Next Sentence Prediction (NSP) and Masked Language Model (MLM). Given a set of input
tokens, the Masked Language Model randomly masks 15% of the tokens. The goal is to predict the masked words based
on their bidirectional context. To understand the relationship between sentences, which is crucial for many downstream
tasks, BERT pre-trains on an NSP task whose examples can be trivially generated from any monolingual corpus. The final hidden state
corresponding to the [CLS] token (the first token of every sequence) is used as the aggregate sequence representation
for classification tasks. In this work, we refer to L as the number of layers (transformer blocks), H as the hidden size, and A as the number of self-attention heads, and we report results on BERTBASE: L=12, H=768, A=12.
In addition to using the [CLS] token to represent a text sequence, we investigated three additional pooling strategies for
BERTBASE, representing different choices of vectors from different layers:
BERTS2L: Summing the vector embeddings generated from the Second to the Last Layer.
BERTSL4: Summing the vector embeddings generated from the Last Four Layers.
BERTCL4: Concatenation of the vector embeddings generated from the Last Four Layers.
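These strategies can be sketched with the Hugging Face transformers library (bert-base-uncased as a stand-in for BERTBASE; BERTS2L is read here as the per-token vectors of the second-to-last layer, following the feature-based ablation of the original BERT paper; this is an illustration, not the authors' implementation):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def bert_token_vectors(text: str, strategy: str = "CLS") -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        layers = model(**inputs).hidden_states       # 13 tensors of shape (1, n_tokens, 768)
    if strategy == "CLS":
        return layers[-1][0, 0:1]                    # [CLS] vector of the last layer
    if strategy == "S2L":
        return layers[-2][0]                         # per-token vectors from the second-to-last layer
    if strategy == "SL4":
        return torch.stack(layers[-4:]).sum(0)[0]    # per-token sum of the last four layers
    if strategy == "CL4":
        return torch.cat(layers[-4:], dim=-1)[0]     # per-token concatenation of the last four layers
    raise ValueError(f"unknown strategy: {strategy}")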
BioBERT
[36] is a domain-specific language representation model based on the adaptation of BERT to the biomedical
domain. With the same architecture, weights, and Wordpiece vocabulary as BERT, BioBERT is pre-trained on corpora
from the biomedical domain (PubMed abstracts and PMC full-text articles). BioBERT achieved a new state-of-the-art
performance on three biomedical tasks: Biomedical named entity recognition (in terms of F1 score), biomedical relation
extraction (in terms of F1 score), and biomedical question answering (in terms of mean reciprocal rank).
XLNET
[37] is a generalized AutoRegressive pre-training method that combines the best of AutoEncoding and
Autoregressive language modeling while overcoming their limitations. Unlike BERT, XLNet does not rely on a data-corruption mechanism; consequently, the artificial symbols used during pre-training (such as [MASK]) do not introduce a discrepancy at the fine-tuning step.
XLNET also improves the pre-training design architecture by (1) increasing the performance of long text-related tasks
by including the segment recurrence mechanism and the relative encoding scheme of Transformer-XL in the training
step, and (2) reparameterizing the Transformer-XL network to apply its architecture to permutation-based language
modeling.
RoBERTA
[38] is an optimized method of pre-training BERT-based models that demonstrates the benefits of bigger
datasets, batches, and sequences to enhance model performance. The improved strategy also recommends training the
models for a longer period, dynamically modifying the masking pattern used on the training data, and removing the
next-sentence prediction objective.
ELECTRA
[39] proposes an alternative to BERT's masked language modeling pre-training, which learns bidirectional word representations by predicting masked tokens. ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) is instead pre-trained with a replaced token detection task: input tokens are replaced with plausible alternatives produced by a generator network, and a discriminator network then predicts which tokens are original and which are replacements.
ELECTRAMed
[40], based on ELECTRA, is a pre-trained domain-specific language model for the biomedical domain,
inheriting the general-domain ELECTRA architecture learning framework and computational benefits.
2.4.2 Short vs. long text
The complexity of the attention layer is quadratic to the length of the sequence [35], therefore longer sequences are
more expensive for BERT and BERT-based language models. The length of the text sequences cannot exceed 510
tokens, excluding special tokens ([CLS] and [SEP]). When analyzing the “abstracts only” dataset, we were not faced
with this limitation. To extend the analysis to longer texts, we explored libraries NLU [41], sentence transformers by
UKP lab [42] and transformers by Hugging Face [43], depending on the availability of the models. We applied the
long-text approach to all studied datasets, where we mapped input text into a fixed-length embedding based on the
pre-trained model used. We also compared the performances of the direct, short-text and long-text approaches on the
“abstracts only” dataset.
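As a small illustration of the 510-token constraint (bert-base-uncased is a placeholder tokenizer; the paper's long-text handling through the listed libraries is not reproduced here):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # placeholder BERT-style tokenizer

def exceeds_bert_limit(text: str, limit: int = 510) -> bool:
    # 510 content tokens plus [CLS] and [SEP] reach the 512-token input limit of BERT-base.
    return len(tokenizer.tokenize(text)) > limit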
As shown in the Supplementary Table S2, the resulting word or token embeddings have different sizes, ranging from
100 to 3072 elements, depending on the algorithm we used to generate the vectors. Except for TF-IDF, all embedding
methods were subjected to the following two steps to obtain the final sentence vector. First, for each row in the dataset,
we constructed an embedding matrix with n rows and m columns consisting of a list of words or tokens in the text and their corresponding numeric vector representations, as shown in Figure 3, where n is the number of words/tokens in the text description, and m is the number of elements in the generated word embedding vectors. Second, we calculated the average of the elements in each column of the resulting embedding matrix. Thus, we obtained sentence embeddings of the same size regardless of the length of the original text description.
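A minimal sketch of this averaging step (plain NumPy; the embedding matrix is whatever a given method produces for one tool description):

import numpy as np

def sentence_embedding(embedding_matrix: np.ndarray) -> np.ndarray:
    # embedding_matrix has shape (n, m): one row per word/token, m elements per vector.
    # Column-wise averaging yields a fixed-size sentence vector of length m,
    # independent of the number of tokens n in the description.
    return embedding_matrix.mean(axis=0)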
2.5 Learning algorithms
To find which learning algorithm performed best on our data, we investigated three machine learning classification
models with different parameter settings (Supplementary Data Table S44): Logistic Regression (LR), Random Forest
(RF), and Naive Bayes (NB). We assembled a pipeline for the TF-IDF vectorizer and the classifiers so that they can be trained and applied together.
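For the TF-IDF setting, such a pipeline might be sketched as follows (scikit-learn; the column names follow the assumptions of Section 2.1, and the hyperparameters shown are illustrative rather than those of Supplementary Table S44):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# df is the curated dataset loaded earlier (tool_name, description, task).
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipeline, df["description"], df["task"], cv=5)
print(scores.mean())                     # mean cross-validated accuracy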