NLP-BASED CLASSIFICATION OF SOFTWARE TOOLS FOR
METAGENOMICS SEQUENCING DATA ANALYSIS INTO EDAM
SEMANTIC ANNOTATION
Kaoutar Daoud Hiri
Jožef Stefan International Postgraduate School
Ljubljana, SI 1000, Slovenia
BioSistemika
Ljubljana, SI 1000, Slovenia
kdhiri@biosistemika.com
Matjaž Hren
BioSistemika
Ljubljana, SI 1000, Slovenia
matjaz@scinote.net
Tomaž Curk
Faculty of Computer and Information Science
University of Ljubljana
Večna pot 113, 1000 Ljubljana, Slovenia
tomaz.curk@fri.uni-lj.si
ABSTRACT
Motivation:
The rapid growth of metagenomics sequencing data makes metagenomics increasingly
dependent on computational and statistical methods for fast and efficient analysis. Consequently,
novel analysis tools for big-data metagenomics are constantly emerging. One of the biggest
challenges for researchers occurs in the analysis planning stage: selecting the most suitable
metagenomics software tool to gain valuable insights from sequencing data. The building process of
data analysis pipelines is often laborious and time-consuming since it requires a deep and critical
understanding of how to apply a particular tool to complete a specified metagenomics task.
Results:
We have addressed this challenge by using machine learning methods to develop a
classification system of metagenomics software tools into 13 classes (11 semantic annotations of
EDAM and two virus-specific classes) based on the descriptions of the tools. We trained three
classifiers (Naive Bayes, Logistic Regression, and Random Forest) using 15 text feature extraction
techniques (TF-IDF, GloVe, BERT-based models, and others). The manually curated dataset
includes 224 software tools and contains text from the abstract and the methods section of the tools’
publications. The best classification performance, with an Area Under the Precision-Recall Curve
score of 0.85, is achieved using Logistic regression, BioBERT for text embedding, and text from
abstracts only. The proposed system provides accurate and unified identification of metagenomics
data analysis tools and tasks, which is a crucial step in the construction of metagenomics data
analysis pipelines.
Keywords
natural language processing, software tool classification, information retrieval, language models,
metagenomics, EDAM ontology
1 Introduction
Metagenomics aims to provide insight into the genetic material present in various environmental samples. Viral
metagenomics, for example, studies viral communities in water, soil, animals, and plants. The most common approach
in metagenomics is to use high-throughput sequencing (HTS) of DNA or RNA, which generates millions of short-read
nucleotide sequences. HTS data are used to detect and quantify genomes and transcriptomes in a biological sample. The
widespread adoption of HTS techniques in biological studies caused a rapid increase in the volume of metagenomics
data that needs to be analyzed as efficiently and rapidly as possible. These metagenomics big data make the field
increasingly dependent on computational and statistical methods that lead to discovering new knowledge from such
data. Consequently, new analysis tools for big-data metagenomics are constantly emerging [1], e.g. 2500 new tools
were produced in 2016. HTS data analysis tools are computer programs that assist users with computational analyses of
DNA and RNA sequences to understand their features and functionality using different analytical methods. Interest in
such analysis may be motivated by different research questions, ranging from pathogen monitoring and identification
to identifying all organisms in a sequenced biological sample. The standard approach to achieve this is to apply a
combination of trimming, assembly, alignment and mapping, annotation, and other complex pipelines of software
algorithms to HTS data.
HTS data analysis tools play an essential role in the pipeline construction process. Helping scientists select and use
the appropriate tools facilitates the development of analysis-specific efficient pipelines and updating of existing ones.
Individual institutions with various project constraints increasingly use metagenomics tools and gradually improve their
knowledge and tool use. Under these circumstances, selecting the most suitable metagenomics software tool to gain
valuable data insights can be complex and confusing for people involved in the pipeline-building process.
Before adding a tool to a pipeline, it is essential to know certain details about it. What are the required inputs? Which
input and output file formats are supported? Most importantly, which data analysis task does the tool perform? “Task”
refers to the function of the metagenomics tool or the analysis it performs. Having an overview of all the available
tools for a given task is also crucial. The results provided by search engines are too unstructured to allow for a swift
differentiation and comparison of similar tools. Furthermore, selecting a suitable tool for each data analysis step based
on official publications and websites is not straightforward. Therefore, several benchmark studies tried to address
“the best tool for the task” challenge, considering different perspectives, e.g. plant-associated metagenome analysis
tools [2–4], machine learning-based approaches for metagenome analysis [3,5], task-specific tools for mapping [2,6]
and assembly [4], and complete pipelines for virus classification [7–9] and taxonomic classification [10–12].
Other fields face a similar challenge with the abundance of software to classify. Machine learning approaches for
software classification have been widely used in the cybersecurity domain [13,14]. Examples include data protection by
developing misuse-based systems that detect malicious code and classify malware into different known families, e.g.
Worm, Trojan, Backdoor, Ransomware, and others. Another active area is anomaly-detection-based systems, which
cluster binaries that behave similarly to identify new categories.
There is a plethora of metagenomics tool functions available. Understanding the functions of a given tool and comparing
it with similar tools are complicated tasks. Different benchmark efforts for metagenomics tools are published regularly.
Still, they are often incomplete, covering only a specific research question, including a limited set of tools, focusing
extensively on technical metrics, or lacking transparency and continuity.
The Galaxy platform [15] provides a recommendation-based solution [16] to help users create workflows. The
recommendations are based on data from more than 18000 workflows and thousands of available tools for various
scientific analyses. The deep learning-based recommendation system uses the tool sequences, the workflow quality, and
the pattern analysis of tool usage to suggest highly relevant tools to the users for their specific data analysis. A set of
tool sequences is extracted from each workflow created by the platform users. This approach is not fully personalized,
as it only considers one metric, i.e., the similarity between tool sequences in workflows. The system will recommend
the same next-step set of tools to all the users with the same built sequence. Furthermore, it limits the system to the
workflow data available on the platform’s internal database, where a certain type of analysis can predominate at a
specific point in time. These constraints directly influence the quality of the recommendations, especially for minority
user profiles, who will receive low-quality or unsuitable tool recommendations more frequently.
Machine learning-based classification systems of research papers were developed to help users find the appropriate
paper. The search can be directed towards differentiating the topics [17, 18] or be focused on specific domains, e.g.
computer science [19,20] or bioinformatics [21].
Classification systems use different algorithms and combinations of paper sections. In some works [19,22] they rely on
established ontologies such as CSO - the computer science ontology [23], EDAM - the ontology of bio-scientific data
analysis and data management [24], and SWO - the software ontology [25].
We propose a machine learning-based system that uses curated and peer-reviewed abstract text descriptions to classify
metagenomics tools into classes representing their main task. The classification system helps users investigate tools more quickly, decide where a tool fits in the metagenomics pipeline construction process, and efficiently select tools from 13 different classes.
2 Methods
Our main goal was to be able to infer the main task of metagenomics tools from their description in natural text. We
explored different combinations of the classification algorithm, its set of hyperparameters, the textual description, and
the text embedding method to identify the best model for the task.
2.1 Data sources
The information contained in most scientific papers is typically divided into the title, abstract, introduction, methods,
results, and discussion sections. We manually gathered descriptions from the paper publications of 224 metagenomics
tools. We collected the abstract sections in the “abstracts only” dataset and the methods section in the “methods
only” dataset. We also prepared tool descriptions that include both the abstracts and methods sections in the “abstracts+methods” dataset (Supplementary Datasets S1, S2, and S3; see also Supplementary Section S1). All datasets
include the title of the paper as the first sentence in the description of each tool. Each record in the collected datasets
represents a single tool and contains the tool’s name, description, and task (class) as represented in Table 1.
Table 1: Excerpt of raw “abstracts only” dataset for five tools belonging to different categories.
Tool name   | Tool description                                  | Tool task (Class)
KrakenUniq  | KrakenUniq: confident and fast metagenomics cl..  | Classification
ViruDetect  | ViruDetect: An automated pipeline for efficie..   | Virus identification
ALLPATHS    | ALLPATHS: de novo assembly of whole-genome sho..  | Assembly
Bambino     | Bambino: a variant detector and alignment view..  | Visualisation
imGLAD      | imGLAD: accurate detection and quantification     | Abundance estimation
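For illustration, such records can be handled programmatically as follows (a minimal sketch assuming the curated dataset is exported as a CSV file named abstracts_only.csv with columns tool_name, description, and task; the file name, format, and column names are assumptions, not taken from the supplementary material):

import pandas as pd

# Hypothetical export of the curated "abstracts only" dataset.
df = pd.read_csv("abstracts_only.csv")   # columns: tool_name, description, task
print(df.shape)                          # expected: (224, 3)
print(df["task"].value_counts())         # class distribution over the 13 tasks (cf. Figure 1)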
2.2 Task ontology
The diverse and complex operations in bio-scientific data analysis lead us to rely on the well-established and comprehensive EDAM ontology [24] to categorize the tools from a functional perspective. The 11 classes comprise bioinformatics
operations and processes from the EDAM ontology: “(Sequence) alignment”, “(Taxonomic) classification”, “Mapping”,
“(Sequence) assembly”, “(Sequence) trimming”, “(Sequencing) quality control”, “(Sequence) annotation”, “(Sequence)
assembly validation”, “(RNA-seq quantification for) abundance estimation”, “SNP-Discovery”, “Visualization”.
We defined two additional classes: “Virus detection” and “Virus identification”. We assign to these two classes viral
analysis tools classified as machine learning tools in EDAM ontology, e.g. DeepVirFinder [26] and VirNet [27]. We
assign other viral analysis pipelines to the two classes even if the pipelines include several tools belonging to other
EDAM classes, such as K-mer counting, assembly, mapping, and others. Examples of such tools are Kodoja [28],
VirFind [29] and VirusFinder [30], which are all developed for virus detection and identification.
We assigned 224 tools into 13 tasks (classes). Some tools can be used for several tasks and thus belong to several
classes. However, we only assigned them to one of the 13 classes, i.e., to the main task for which they were designed,
see Supplementary Section S2. The obtained class distribution is shown in Figure 1.
2.3 Data pre-processing
Before a classifier can use the available data, the appropriate pre-processing steps are required. The steps involved
in extracting data from a tool description are summarized in Figure 2. To create features from the raw text, train the
classifiers and infer machine learning models, we performed the following steps: text cleaning and preparation, label
coding, and vector representation of text (Supplementary Datasets S4, S5, and S6). For text cleaning and preparation,
we use downcasing, lemmatization, removal of stop words, possessive pronouns, words composed of one or two letters,
words starting with digits, special characters, punctuation signs, numbers, and links. We represent the class variable as
a nominal discrete variable with 13 different values. We then generated text vector representations, which are discussed
in the following subsection.
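For illustration, the cleaning and preparation step described above could be sketched as follows (a minimal approximation assuming NLTK for stop-word removal and lemmatization; the authors' exact rules and libraries are not reproduced here):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_description(text: str) -> str:
    text = text.lower()                          # downcasing
    text = re.sub(r"https?://\S+", " ", text)    # remove links
    text = re.sub(r"[^a-z\s]", " ", text)        # drop digits, punctuation, and special characters
    tokens = [lemmatizer.lemmatize(t) for t in text.split()]
    # drop stop words (including possessive pronouns) and words of one or two letters
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) > 2]
    return " ".join(tokens)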
Figure 1: The class distribution shows the number of tools assigned to each task.
Figure 2: Process of extraction of information from text.
2.4 Vector representation of text
To train the different classifiers, we represented the text description of the tools as a vector of numbers using language
models, prediction-based and frequency-based techniques (Supplementary Datasets S7-S42).
2.4.1 Word embedding methods
We used and evaluated the 12 most commonly used approaches to extract features from the text. We describe them in
the following paragraphs.
TF-IDF
for a word in a document is calculated by multiplying the frequency of the term (term frequency) [31] of a
word in a document with the inverse document frequency of a word [32] in a set of documents. If the word is very
common and appears in many documents, this number will approach 0. Otherwise, the TF-IDF will approach 1.
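For illustration, a TF-IDF representation of the cleaned descriptions can be obtained as follows (a sketch using scikit-learn's TfidfVectorizer with default settings; the toy descriptions and vectorizer configuration are assumptions, not the study's setup):

from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "krakenuniq confident fast metagenomics classification",
    "allpaths novo assembly whole genome shotgun",
]                                                 # cleaned tool descriptions (toy example)
vectorizer = TfidfVectorizer()                    # defaults: smoothed idf, L2-normalised document vectors
X_tfidf = vectorizer.fit_transform(descriptions)  # sparse matrix: one row per tool, one column per term
print(X_tfidf.shape)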
GloVe
Embeddings [33], which stands for global vectors, capture the semantic context of words using both local
statistics (local word context) and global statistics (word co-occurrences) to generate a word vector. This regression
neural network, trained on five combinations of general domain corpora (English Wikipedia and Gigaword), combines
the advantages of global matrix factorization and local context window methods. It uses a gradient descent optimization
algorithm and a decreasing weighting function where distant word pairs are expected to have less information about
their relationship.
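As an illustrative sketch (using the publicly available glove-wiki-gigaword-100 vectors through gensim; the specific pre-trained GloVe variant used in the study may differ), a description can be embedded by averaging its word vectors:

import numpy as np
import gensim.downloader as api

# 100-dimensional Wikipedia+Gigaword GloVe vectors (placeholder choice).
glove = api.load("glove-wiki-gigaword-100")

def glove_embedding(text: str) -> np.ndarray:
    # Average the vectors of all in-vocabulary words (the averaging step of Section 2.4.2).
    vectors = [glove[word] for word in text.split() if word in glove]
    return np.mean(vectors, axis=0) if vectors else np.zeros(glove.vector_size)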
ELMO
[34], deep contextualized word representation, represents each token based on the complete input sentence.
The word representations combine the internal states of a pre-trained bidirectional language model (biLM) in a linear
function learned by the end task model.
BERT
[35], which stands for Bidirectional Encoder Representations from Transformers, improves the fine-tuning-based
strategies for applying pre-trained language representations to downstream tasks. It uses two unsupervised tasks during
pre-training: binarized Next Sentence Prediction (NSP) and Masked Language Model (MLM). Given a set of input
tokens, the Masked Language Model randomly masks 15% of the tokens. The goal is to predict the masked words based
on their bidirectional context. To understand the relationship between sentences, which is crucial for many downstream
tasks, BERT pre-trains on an NSP task whose examples can be trivially generated from any monolingual corpus. The final hidden state
corresponding to the [CLS] token (the first token of every sequence) is used as the aggregate sequence representation
for classification tasks. In this work, we refer to L as the number of layers (transformer blocks), H as the hidden size, and A as the number of self-attention heads, and we report results on BERTBASE: L=12, H=768, A=12.
In addition to using the [CLS] token to represent a text sequence, we investigated three additional pooling strategies for
BERTBASE, representing different choices of vectors from different layers:
BERTS2L: Summing the vector embeddings generated from the Second to the Last Layer.
BERTSL4: Summing the vector embeddings generated from the Last Four Layers.
BERTCL4: Concatenation of the vector embeddings generated from the Last Four Layers.
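These strategies can be sketched with the Hugging Face transformers library (bert-base-uncased as a stand-in for BERTBASE; BERTS2L is read here as the per-token vectors of the second-to-last layer, following the feature-based ablation of the original BERT paper; this is an illustration, not the authors' implementation):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def bert_token_vectors(text: str, strategy: str = "CLS") -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        layers = model(**inputs).hidden_states       # 13 tensors of shape (1, n_tokens, 768)
    if strategy == "CLS":
        return layers[-1][0, 0:1]                    # [CLS] vector of the last layer
    if strategy == "S2L":
        return layers[-2][0]                         # per-token vectors from the second-to-last layer
    if strategy == "SL4":
        return torch.stack(layers[-4:]).sum(0)[0]    # per-token sum of the last four layers
    if strategy == "CL4":
        return torch.cat(layers[-4:], dim=-1)[0]     # per-token concatenation of the last four layers
    raise ValueError(f"unknown strategy: {strategy}")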
BioBERT
[36] is a domain-specific language representation model based on the adaptation of BERT to the biomedical
domain. With the same architecture, weights, and Wordpiece vocabulary as BERT, BioBERT is pre-trained on corpora
from the biomedical domain (PubMed abstracts and PMC full-text articles). BioBERT achieved a new state-of-the-art
performance on three biomedical tasks: Biomedical named entity recognition (in terms of F1 score), biomedical relation
extraction (in terms of F1 score), and biomedical question answering (in terms of mean reciprocal rank).
XLNET
[37] is a generalized AutoRegressive pre-training method that combines the best of AutoEncoding and
Autoregressive language modeling while overcoming their limitations. Unlike BERT, XLNet does not rely on a data-corruption mechanism; consequently, the artificial symbols used during pre-training (such as [MASK]) do not introduce a discrepancy at the fine-tuning step.
XLNET also improves the pre-training design architecture by (1) increasing the performance of long text-related tasks
by including the segment recurrence mechanism and the relative encoding scheme of Transformer-XL in the training
step, and (2) reparameterizing the Transformer-XL network to apply its architecture to permutation-based language
modeling.
RoBERTA
[38] is an optimized method of pre-training BERT-based models that demonstrates the benefits of bigger
datasets, batches, and sequences to enhance model performance. The improved strategy also recommends training the
models for a longer period, dynamically modifying the masking pattern used on the training data, and removing the
next-sentence prediction objective.
ELECTRA
[39] proposes an alternative to BERT's masked language modeling pre-training, which learns bidirectional word representations by predicting masked tokens. ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) is instead pre-trained with a replaced token detection task: input tokens are replaced with plausible alternatives produced by a generator network, and a discriminator network then predicts which tokens are original and which are replacements.
ELECTRAMed
[40], based on ELECTRA, is a pre-trained domain-specific language model for the biomedical domain,
inheriting the general-domain ELECTRA architecture learning framework and computational benefits.
2.4.2 Short vs. long text
The complexity of the attention layer is quadratic to the length of the sequence [35], therefore longer sequences are
more expensive for BERT and BERT-based language models. The length of the text sequences cannot exceed 510
tokens, excluding special tokens ([CLS] and [SEP]). When analyzing the “abstracts only” dataset, we were not faced
with this limitation. To extend the analysis to longer texts, we explored libraries NLU [41], sentence transformers by
UKP lab [42] and transformers by Hugging Face [43], depending on the availability of the models. We applied the
long-text approach to all studied datasets, where we mapped input text into a fixed-length embedding based on the
pre-trained model used. We also compared the performances of the direct, short-text and long-text approaches on the
“abstracts only” dataset.
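As a small illustration of the 510-token constraint (bert-base-uncased is a placeholder tokenizer; the paper's long-text handling through the listed libraries is not reproduced here):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # placeholder BERT-style tokenizer

def exceeds_bert_limit(text: str, limit: int = 510) -> bool:
    # 510 content tokens plus [CLS] and [SEP] reach the 512-token input limit of BERT-base.
    return len(tokenizer.tokenize(text)) > limit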
As shown in the Supplementary Table S2, the resulting word or token embeddings have different sizes, ranging from
100 to 3072 elements, depending on the algorithm we used to generate the vectors. Except for TF-IDF, all embedding
methods were subjected to the following two steps to obtain the final sentence vector. First, for each row in the dataset,
we constructed an embedding matrix with n rows and m columns consisting of a list of words or tokens in the text and their corresponding numeric vector representations, as shown in Figure 3, where n is the number of words/tokens in the text description, and m is the number of elements in the generated word embedding vectors. Second, we calculated the average of the elements in each column of the resulting embedding matrix. Thus, we obtained sentence embeddings of the same size regardless of the length of the original text description.
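A minimal sketch of this averaging step (plain NumPy; the embedding matrix is whatever a given method produces for one tool description):

import numpy as np

def sentence_embedding(embedding_matrix: np.ndarray) -> np.ndarray:
    # embedding_matrix has shape (n, m): one row per word/token, m elements per vector.
    # Column-wise averaging yields a fixed-size sentence vector of length m,
    # independent of the number of tokens n in the description.
    return embedding_matrix.mean(axis=0)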
2.5 Learning algorithms
To find which learning algorithm performed best on our data, we investigated three machine learning classification
models with different parameter settings (Supplementary Data Table S44): Logistic Regression (LR), Random Forest
(RF), and Naive Bayes (NB). We assembled a pipeline for the TF-IDF vectorizer and the classifiers so that they can be trained and applied together.
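For the TF-IDF setting, such a pipeline might be sketched as follows (scikit-learn; the column names follow the assumptions of Section 2.1, and the hyperparameters shown are illustrative rather than those of Supplementary Table S44):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# df is the curated dataset loaded earlier (tool_name, description, task).
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipeline, df["description"], df["task"], cv=5)
print(scores.mean())                     # mean cross-validated accuracy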