Self-omics: A Self-supervised Learning Framework for Multi-omics Cancer Data
Sayed Hashim, Karthik Nandakumar and Mohammad Yaqub
Mohamed Bin Zayed University of Artificial Intelligence
Abu Dhabi, UAE
E-mail: sayed.hashim@mbzuai.ac.ae
We have gained access to vast amounts of multi-omics data thanks to Next Generation
Sequencing. However, analysing this data is challenging because of its high dimensionality
and because much of it is unannotated. Lack of annotated data is a significant problem in
machine learning, and Self-Supervised Learning (SSL) methods are typically used to deal
with limited labelled data. However, there is a lack of studies that use SSL methods to
exploit inter-omics relationships on unlabelled multi-omics data. In this work, we develop a
novel and efficient pre-training paradigm that consists of various SSL components, including
but not limited to contrastive alignment, data recovery from corrupted samples, and using
one type of omics data to recover other omic types. Our pre-training paradigm improves
performance on downstream tasks with limited labelled data. We show that our approach
outperforms the state-of-the-art method in cancer type classification on the TCGA pan-cancer
dataset in a semi-supervised setting. Moreover, we show that the encoders that are
pre-trained using our approach can be used as powerful feature extractors even without
fine-tuning. Our ablation study shows that the method is not overly dependent on any
pretext task component. The network architectures in our approach are designed to handle
missing omic types and multiple datasets for pre-training and downstream training. Our
pre-training paradigm can be extended to perform zero-shot classification of rare cancers.
Keywords: Self-supervised Learning; Contrastive Learning; Multi-omics; Cancer Type Clas-
sification
1. Introduction
According to the WHO, cancer accounted for around 10 million deaths in 2020, or about one in
six deaths [1]. Many cancers can be cured with early diagnosis and effective treatment [2]. Various
factors are responsible for late diagnoses, such as late detection of symptoms, lack of access
to oncologists, and the time and cost involved. Late diagnosis can also result from vague or unclear
symptoms and indistinguishable signs on scans and mammograms [3]. Nevertheless, diagnosing
cancer in its early stages, or even before it starts developing, could remarkably improve
survival and provide opportunities for more effective treatment. Fields of biology whose names
end in -omics, such as genomics, proteomics, transcriptomics or metabolomics, are called
omics sciences. With the advent of Next Generation Sequencing, we have gained access to
multiple types of omics data. Each type of omics data reveals different characteristics within
the tumour. However, due to the high dimensionality and the numerous different types of
omics data, it is nearly impossible for clinicians to analyse multi-omics data. For this
reason, they tend to focus on analysing the values of specific biomarkers. However, to get a
complete picture of a tumour, which is heterogeneous and complex, multi-omics data analysis
is vital.
©2022 The Authors. Open Access chapter published by World Scientific Publishing Company and
distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC)
4.0 License.
arXiv:2210.00825v1 [cs.LG] 3 Oct 2022
Modern machine learning algorithms, especially deep neural networks, have been shown to
work well with high-dimensional data. Deep learning has made massive progress in
tasks like object recognition, object detection and semantic segmentation in the visual domain.
It has also made strides in speech and natural language processing on tasks such as machine
translation, speech recognition and question answering. The algorithms developed for the tasks
mentioned above require processing high-dimensional inputs. In this work, we developed Self-
Supervised Learning (SSL) methods for multi-omics data to provide supervision to the model
from unlabelled data. We explored various SSL pretext tasks on top of the usual reconstruction
task with autoencoders. Some of the SSL techniques we implemented include contrastive
learning, recovering data from its corrupted versions and aligning representations from multi-
omics data.
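To make the idea of contrastive alignment concrete, the following is a minimal sketch and not the loss used in this work. It assumes an InfoNCE-style objective with cosine similarity, in which the embeddings of two omics types from the same patient form the positive pair and the other patients in the batch act as negatives. The function names, the temperature value and the toy embeddings are illustrative assumptions, and embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_alignment_loss(z_a, z_b, temperature=0.5):
    """InfoNCE-style loss between two batches of embeddings.

    z_a[i] and z_b[i] are assumed to come from the same patient
    (positive pair); every other row of z_b is treated as a negative.
    """
    n = len(z_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(z_a[i], z_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        # Cross-entropy with the matching patient as the target.
        total += log_sum_exp - logits[i]
    return total / n


# Toy embeddings for two patients and two hypothetical omics encoders.
z_methylation = [[1.0, 0.0], [0.0, 1.0]]
z_expression = [[1.0, 0.0], [0.0, 1.0]]
loss = contrastive_alignment_loss(z_methylation, z_expression)
```

The loss is small when the two encoders place the same patient close together and large when the pairing is broken, which is the behaviour an alignment objective of this kind is meant to reward.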
The low-dimensional representations that our model produces from high-dimensional
multi-omics data can be considered "computational biomarkers". A model that learns from
large datasets becomes proficient at producing such biomarkers and can be used to produce
good representations for smaller datasets. Furthermore, as the model learns from tumours diagnosed
early, it produces better representations for such tumours. Therefore, even if the dataset at
hand does not contain tumours that were sequenced early, pre-training the model on a large
dataset that contains many such tumours makes it
better at early diagnosis.
2. Literature Review
Self-supervised learning (SSL) has been extensively applied to representation learning
in various domains such as natural language processing [4–6], audio and images [7–9]. These methods
mainly use spatial, semantic and temporal structural relationships in the data. This is done
through developing novel pretext tasks, data augmentation methods and model architectures.
Because tabular data lacks the relationships mentioned above, such methods could
be less effective on it. For instance, augmentation methods used on images, such as scaling and
rotation, cannot be directly used on tabular data. For these reasons, SSL techniques remain
under-explored on tabular data [10].
An autoencoder is a deep network that consists of an encoder and a decoder [11]. While the
encoder is trained to map the input to a latent representation, the decoder is trained to
reconstruct the input from this latent representation. A popular approach for images is the
denoising autoencoder (DAE) [12]. It is built on the hypothesis that partially destroyed inputs
should result in a latent representation similar to that of the original inputs. In that work, the
authors investigated an autoencoder's robustness to partial destruction of inputs. The input is
corrupted and fed to the autoencoder, whose job is to recover the original "clean" input.
VIME [10] is an SSL framework for tabular data. It introduces two
pretext tasks called feature vector estimation and mask vector estimation. The former aims
to reconstruct an input sample from its masked version, while the latter involves predicting
the mask vector applied to the sample. In other words, the pretext task is to estimate which
features are masked and predict the values of the corrupted features. A work called SubTab
focuses on converting the representation learning problem from single-view to multi-view [13].
Here, the features are divided into subsets to produce the various views. The authors claim
that this is analogous to cropping images and bagging features in ensemble learning. They
demonstrate that the encoder learns more useful representations from a subset of the data
than from a corrupted version of it. They pre-trained the network on this pretext task and tested
its performance on some downstream tasks.
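The corruption and view-generation steps described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the reference implementation of VIME or SubTab: masked entries are replaced with values taken from the same column of other samples, and the feature subsets are contiguous and non-overlapping. The names `make_mask`, `corrupt` and `feature_subsets`, the masking probability and the toy data are all hypothetical.

```python
import random


def make_mask(n_rows, n_cols, p_mask, rng):
    """Draw a binary mask; 1 marks a feature that will be corrupted."""
    return [[1 if rng.random() < p_mask else 0 for _ in range(n_cols)]
            for _ in range(n_rows)]


def corrupt(rows, mask, rng):
    """Replace masked entries with values from the same column of other rows.

    Each column is shuffled independently, so a replacement value always
    comes from that feature's empirical distribution.
    """
    n_rows, n_cols = len(rows), len(rows[0])
    shuffled = []
    for j in range(n_cols):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        shuffled.append(column)
    return [[shuffled[j][i] if mask[i][j] else rows[i][j]
             for j in range(n_cols)] for i in range(n_rows)]


def feature_subsets(n_cols, n_subsets):
    """Split column indices into contiguous, non-overlapping subsets."""
    bounds = [k * n_cols // n_subsets for k in range(n_subsets + 1)]
    return [list(range(bounds[k], bounds[k + 1])) for k in range(n_subsets)]


rng = random.Random(0)
samples = [[0.1, 5.0, 30.0], [0.2, 6.0, 31.0], [0.3, 7.0, 32.0]]

# Denoising / feature vector estimation: input `corrupted`, target `samples`.
# Mask vector estimation: input `corrupted`, target `mask`.
mask = make_mask(len(samples), 3, 0.3, rng)
corrupted = corrupt(samples, mask, rng)

# Multi-view setup: each view keeps only one subset of the columns.
views = [[[row[j] for j in cols] for row in samples]
         for cols in feature_subsets(3, 2)]
```

An encoder trained on `corrupted` can then be asked to recover `samples`, to predict `mask`, or both, while the `views` would each be encoded separately in a multi-view setup.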
Self-supervised representation learning of multi-omics data is an under-studied area of
research. Most representation learning methods for such data focus on the integration of
multi-omics data, and many integration strategies have been proposed. We review the
integration methods here due to the lack of self-supervised methods. One group developed a
group-lasso-regularised deep learning method for cancer prognosis that integrates multi-omics data
using early fusion [14]. They apply various data preprocessing techniques, and the model
consists of a few fully connected layers. Another work integrates multi-omics data using standard
and disjointed deep autoencoders [15]. Various omics data such as DNA methylation, microRNA
expression, mRNA expression and reverse phase protein array data are concatenated before
being fed into the autoencoder. A work called OmiEmbed [16] performs intermediate multi-omics
data integration. It is a multi-task framework that is built on a variational autoencoder. The
pretext task here is the reconstruction of three types of omics data: gene expression,
microRNA expression and DNA methylation. The authors show the effectiveness of their method by
testing on various downstream tasks. They also developed a multi-task strategy that
concurrently trains multiple downstream modules such as survival analysis, cancer type classification
and phenotype prediction. Training the modules jointly has been shown to perform better than training
them separately. Late integration of multi-omics data was done in a work that
predicts breast cancer prognosis [17]. The authors perform feature selection and use a deep neural
network for the task. Gene expression, copy number alterations and clinical information are fed
into three separate networks. Their predictions are combined at the end with a score-fusing
technique called weighted linear aggregation.
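The difference between early and late integration can be illustrated with a small sketch. It assumes that early fusion means concatenating the per-omics feature vectors before the model, and that weighted linear aggregation means a weighted sum of the scores produced by the separate networks. The weights and scores below are made-up values for illustration only; the cited works do not state them in this text.

```python
def early_fusion(omics_vectors):
    """Early integration: concatenate per-omics feature vectors into one input."""
    return [value for vector in omics_vectors for value in vector]


def weighted_linear_aggregation(scores, weights):
    """Late integration: combine per-network prediction scores as a weighted sum."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * s for w, s in zip(weights, scores))


# Early fusion: one long vector goes into a single model.
fused_input = early_fusion([[0.2, 0.7], [1.5], [0.0, 0.3, 0.9]])

# Late fusion: three hypothetical networks each output a score for one patient.
network_scores = [0.9, 0.5, 0.1]       # illustrative values only
network_weights = [0.5, 0.25, 0.25]    # assumed weights, summing to 1
final_score = weighted_linear_aggregation(network_scores, network_weights)
```

Intermediate integration sits between the two: each omics type is first encoded separately, and the resulting representations are combined inside the model rather than at the input or at the output.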
There is a lack of studies on self-supervised representation learning of multi-omics data.
Studies focusing on adding more pretext tasks on top of the reconstruction task are rare. The
usual focus is on integrating the data and less on exploiting inter-omics relationships through
constraints and other SSL losses. Moreover, lack of annotated data can be tackled with SSL
approaches.
3. Method
3.1. Dataset
For our experiments, we used The Cancer Genome Atlas (TCGA) pan-cancer multi-omics
dataset [18]. Table 1 gives an overview of the dataset. It is one of the most popular multi-omics
datasets. It consists of omics data as well as phenotypic information of patients. We used three
types of omics data from the TCGA dataset: DNA methylation, miRNA stem-loop expression,