omics data, it is nearly impossible for clinicians to analyse multi-omics data. For this
reason, they tend to focus on the values of specific biomarkers. However, because a tumour is
heterogeneous and complex, multi-omics data analysis is vital to obtaining a complete picture
of it.
Modern machine learning algorithms, especially deep neural networks, have been shown to
work well with high-dimensional data. Deep learning has made massive progress in
tasks like object recognition, object detection and semantic segmentation in the visual domain.
It has also made strides in speech and natural language processing on tasks such as machine
translation, speech recognition and question answering. The algorithms developed for the tasks
mentioned above require processing high-dimensional inputs. In this work, we developed Self-
Supervised Learning (SSL) methods for multi-omics data to provide supervision to the model
from unlabelled data. We explored various SSL pretext tasks on top of the usual reconstruction
task with autoencoders. Among the SSL techniques we implemented are contrastive learning,
recovering data from corrupted versions of itself, and aligning representations across omics
modalities; a minimal sketch of the alignment objective follows below.
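To make the alignment objective concrete, the sketch below shows an InfoNCE-style contrastive loss between embeddings of two omics views of the same patients. It is an illustrative sketch, not our exact implementation; the modality names, function name, and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(z_rna, z_methyl, temperature=0.1):
    """InfoNCE-style loss (a sketch; names and temperature are assumptions).
    z_rna, z_methyl: (batch, dim) embeddings of the same patients produced
    by two modality-specific encoders. Matching rows are positive pairs."""
    z_rna = F.normalize(z_rna, dim=1)
    z_methyl = F.normalize(z_methyl, dim=1)
    # Pairwise cosine similarities between the two views, scaled by temperature.
    logits = z_rna @ z_methyl.t() / temperature
    targets = torch.arange(z_rna.size(0), device=z_rna.device)
    # Symmetrise: classify the correct partner in both directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```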
The low-dimensional representations that our model produces from high-dimensional
multi-omics data can be considered "computational biomarkers". A model trained on large
datasets becomes adept at producing such biomarkers and can then be used to produce good
representations for smaller datasets. Furthermore, as the model learns from tumours diagnosed
early, it produces better representations for such tumours. Therefore, even if the dataset at
hand contains no samples of tumours sequenced early, the fact that the model was pre-trained
on a large dataset containing many such samples makes it better at early diagnosis.
2. Literature Review
Self-supervised learning (SSL) has been extensively applied to representation learning in
various domains such as natural language processing,4–6 audio, and images.7–9 These methods
mainly exploit spatial, semantic and temporal structural relationships in the data, through
novel pretext tasks, data augmentation methods and model architectures.
Because tabular data lack the relationships mentioned above, such methods can be less
effective. For instance, augmentation methods used on images, such as scaling and rotation,
cannot be applied directly to tabular data. For these reasons, SSL techniques have not been
explored enough on tabular data.10 One corruption scheme that does carry over to tabular
data is sketched below.
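As a concrete example, the following sketch corrupts a batch of tabular samples by masking random features and resampling them from each column's empirical marginal, in the spirit of the corruption used by VIME (discussed below). The masking probability, function name, and array layout are illustrative assumptions.

```python
import numpy as np

def corrupt_tabular(x, p_mask=0.3, rng=None):
    """Mask random entries of a (batch, n_features) matrix and replace them
    with values resampled from the same column (its empirical marginal), so
    corrupted entries remain plausible feature values. Returns the corrupted
    batch and the binary mask. A sketch; p_mask is an assumption."""
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) < p_mask
    # Shuffle each column independently to draw from its marginal distribution.
    shuffled = np.stack([rng.permutation(x[:, j]) for j in range(x.shape[1])],
                        axis=1)
    x_tilde = np.where(mask, shuffled, x)
    return x_tilde, mask.astype(np.float32)
```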
An autoencoder is a deep network that consists of an encoder and a decoder.11 The
encoder is trained to map the input to a latent representation, while the decoder is trained
to reconstruct the input from this latent representation. A popular work on images is the
denoising autoencoder (DAE).12 It is built on the hypothesis that partially destroyed inputs
should yield a latent representation similar to that of the original inputs. In that work, the
authors investigated an autoencoder's robustness to partial corruption of its inputs: the input
is corrupted and fed to the autoencoder, whose job is to recover the original "clean" input.
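The sketch below shows a minimal denoising autoencoder for tabular inputs, assuming additive Gaussian corruption and illustrative layer sizes (the original DAE applies masking noise to images, but the principle of reconstructing the clean input from a corrupted one is the same).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoisingAutoencoder(nn.Module):
    """Encoder maps a (corrupted) input to a latent code; decoder
    reconstructs the clean input from that code. Layer sizes are
    illustrative assumptions, not the reference implementation."""
    def __init__(self, n_features, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, n_features))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def dae_loss(model, x_clean, noise_std=0.1):
    # Corrupt the input (Gaussian noise here), then reconstruct the clean one.
    x_noisy = x_clean + noise_std * torch.randn_like(x_clean)
    x_hat, _ = model(x_noisy)
    return F.mse_loss(x_hat, x_clean)
```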
A group of researchers developed VIME,10 a novel SSL framework for tabular data. They
proposed two pretext tasks, feature vector estimation and mask vector estimation. The former aims