The widespread adoption of HTS techniques in biological studies has caused a rapid increase in the volume of metagenomics
data that needs to be analyzed as efficiently and rapidly as possible. Such metagenomics big data make the field
increasingly dependent on computational and statistical methods for discovering new knowledge in the data.
Consequently, new analysis tools for big-data metagenomics are constantly emerging [1]; for example, 2500 new tools
were released in 2016. HTS data analysis tools are computer programs that assist users with computational analyses of
DNA and RNA sequences to understand their features and functionality using different analytical methods. Interest in
such analysis may be motivated by different research questions, ranging from pathogen monitoring and identification
to identifying all organisms in a sequenced biological sample. The standard approach is to apply pipelines of software
algorithms to HTS data that combine trimming, assembly, alignment and mapping, annotation, and other complex
processing steps.
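To make the pipeline idea concrete, the sketch below chains such steps as shell commands run from Python. Every tool name, flag, and file name is a hypothetical placeholder, not a prescribed toolchain.

import subprocess

# Each step pairs a descriptive name with a hypothetical command line;
# real pipelines would substitute actual trimmers, assemblers, etc.
PIPELINE = [
    ("trimming",   ["trim_tool", "--in", "reads.fastq", "--out", "trimmed.fastq"]),
    ("assembly",   ["assembler", "--reads", "trimmed.fastq", "--out", "contigs.fasta"]),
    ("alignment",  ["aligner", "--ref", "reference.fasta", "--query", "contigs.fasta", "--out", "aligned.sam"]),
    ("annotation", ["annotator", "--in", "aligned.sam", "--out", "annotations.tsv"]),
]

def run_pipeline(steps):
    """Run each step in order, aborting on the first failure."""
    for name, cmd in steps:
        print(f"running step: {name}")
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure

run_pipeline(PIPELINE)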
HTS data analysis tools play an essential role in the pipeline construction process. Helping scientists select and use
the appropriate tools facilitates both the development of efficient, analysis-specific pipelines and the updating of
existing ones. Institutions with diverse project constraints increasingly adopt metagenomics tools and gradually build
up expertise in their use. Under these circumstances, selecting the most suitable metagenomics software tool to gain
valuable data insights can be complex and confusing for anyone involved in pipeline building.
Before adding a tool to a pipeline, it is essential to know certain details about it. What are the required inputs? Which
input and output file formats are supported? Most importantly, which data analysis task does the tool perform? “Task”
refers to the function of the metagenomics tool or the analysis it performs. Having an overview of all the available
tools for a given task is also crucial. The results provided by search engines are too unstructured to allow for a swift
differentiation and comparison of similar tools. Furthermore, selecting a suitable tool for each data analysis step based
on official publications and websites is not straightforward. Therefore, several benchmark studies have tried to address
the "best tool for the task" challenge from different perspectives, e.g. plant-associated metagenome analysis
tools [2–4], machine learning-based approaches for metagenome analysis [3,5], task-specific tools for mapping [2,6]
and assembly [4], and complete pipelines for virus classification [7–9] and taxonomic classification [10–12].
Other fields face a similar challenge in classifying abundant software. Machine learning approaches for software
classification have been widely used in the cybersecurity domain [13,14]. Examples include misuse-based systems that
protect data by detecting malicious code and classifying malware into known families, e.g. Worm, Trojan, Backdoor,
and Ransomware. Another active area is anomaly-detection-based systems, which cluster similarly behaving binaries to
identify new categories.
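As an illustration of this clustering idea (not any specific system from [13,14]), the sketch below groups binaries by toy behavioral feature vectors; the features and counts are invented for the example.

import numpy as np
from sklearn.cluster import KMeans

# Rows are binaries; columns are hypothetical behavior counts
# (file writes, registry edits, network connections).
behavior = np.array([
    [120,  3,  1],
    [115,  4,  0],
    [  2, 50, 48],
    [  1, 47, 52],
])

# Binaries that behave similarly land in the same cluster; a new,
# well-separated cluster can signal a previously unseen malware category.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(behavior)
print(kmeans.labels_)  # e.g. [0 0 1 1]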
A plethora of metagenomics tool functions is available, and understanding the function of a given tool and comparing
it with similar tools are complicated tasks. Benchmark efforts for metagenomics tools are published regularly, but they
are often incomplete: they cover only a specific research question, include a limited set of tools, focus extensively
on technical metrics, or lack transparency and continuity.
The Galaxy platform [15] provides a recommendation-based solution [16] to help users create workflows. The
recommendations are based on data from more than 18,000 workflows and thousands of tools available for various
scientific analyses. The deep learning-based recommendation system uses tool sequences, workflow quality, and patterns
of tool usage to suggest highly relevant tools for a user's specific data analysis; a set of tool sequences is extracted
from each workflow created by the platform's users. This approach is not fully personalized, as it considers only one
metric, the similarity between tool sequences in workflows: the system recommends the same set of next-step tools to
all users who have built the same sequence. Furthermore, it is limited to the workflow data in the platform's internal
database, where a certain type of analysis can predominate at a given time. These constraints directly influence
recommendation quality, especially for minority user profiles, who more frequently receive low-quality or unsuitable
tool recommendations.
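The limitation is easy to see in a minimal frequency-based sketch of sequence-to-next-tool recommendation. Galaxy's actual system is a deep learning model; the workflows and tool names below are toy placeholders.

from collections import Counter, defaultdict

def build_model(workflows):
    """Count which tool follows each observed tool-sequence prefix."""
    next_tool = defaultdict(Counter)
    for wf in workflows:
        for i in range(1, len(wf)):
            next_tool[tuple(wf[:i])][wf[i]] += 1
    return next_tool

def recommend(model, sequence, k=3):
    """Return the k most frequent next tools for a given sequence."""
    return [tool for tool, _ in model.get(tuple(sequence), Counter()).most_common(k)]

workflows = [
    ["trim", "assemble", "align", "annotate"],
    ["trim", "assemble", "bin"],
    ["trim", "align", "annotate"],
]
model = build_model(workflows)

# Every user who built ["trim", "assemble"] receives the identical list.
print(recommend(model, ["trim", "assemble"]))  # ['align', 'bin']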
Machine learning-based classification systems for research papers have been developed to help users find the appropriate
paper. The search can be directed towards differentiating topics [17,18] or focused on specific domains, e.g.
computer science [19,20] or bioinformatics [21].
Classification systems use different algorithms and different combinations of paper sections. Some works [19,22] rely
on established ontologies such as CSO, the Computer Science Ontology [23]; EDAM, the ontology of bio-scientific data
analysis and data management [24]; and SWO, the Software Ontology [25].
We propose a machine learning-based system that uses curated, peer-reviewed abstract texts to classify metagenomics
tools into classes representing their main task. The classification system helps users investigate tools more quickly,
decide where a tool fits in the metagenomics pipeline construction process, and efficiently select tools from 13
different classes.
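As a rough illustration of this kind of classifier, the sketch below trains a simple text model on tool abstracts. It is not the system evaluated in this paper; the abstracts, class labels, and model choice (TF-IDF features with logistic regression via scikit-learn) are placeholder assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled abstracts; in practice these would be curated,
# peer-reviewed tool descriptions covering all 13 task classes.
abstracts = [
    "We present a fast short-read aligner for large reference genomes.",
    "A de novo assembler producing contigs from metagenomic reads.",
    "A tool for taxonomic classification of sequencing reads.",
]
labels = ["alignment", "assembly", "taxonomic classification"]

# TF-IDF features feed a linear classifier over the task classes.
clf = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
clf.fit(abstracts, labels)

print(clf.predict(["Assembles contigs from raw metagenome reads."]))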