What cleaves? Is proteasomal cleavage prediction
reaching a ceiling?
Ingo Ziegler,1 Bolei Ma,1 Ercong Nie,1
Bernd Bischl,2,3 David Rügamer,2 Benjamin Schubert,4,5 Emilio Dorigatti2,4
1 Center for Information and Language Processing, LMU Munich,
2 Department of Statistics, LMU Munich,
3 Munich Center for Machine Learning,
4 Institute of Computational Biology, Helmholtz Zentrum München,
5 Department of Mathematics, TUM Munich
{ziegler.ingo, bolei.ma}@campus.lmu.de, nie@cis.lmu.de,
{bernd.bischl, david.ruegamer, emilio.dorigatti}@stat.uni-muenchen.de
benjamin.schubert@helmholtz-muenchen.de
Abstract
Epitope vaccines are a promising direction to enable precision treatment for cancer,
autoimmune diseases, and allergies. Effectively designing such vaccines requires
accurate prediction of proteasomal cleavage in order to ensure that the epitopes
in the vaccine are presented to T cells by the major histocompatibility complex
(MHC). While direct identification of proteasomal cleavage in vitro is cumbersome
and low throughput, it is possible to implicitly infer cleavage events from the
termini of MHC-presented epitopes, which can be detected in large amounts thanks
to recent advances in high-throughput MHC ligandomics. Inferring cleavage
events in such a way provides an inherently noisy signal which can be tackled
with new developments in the field of deep learning that supposedly make it
possible to learn predictors from noisy labels. Inspired by such innovations, we
sought to modernize proteasomal cleavage predictors by benchmarking a wide
range of recent methods, including LSTMs, transformers, CNNs, and denoising
methods, on a recently introduced cleavage dataset. We found that increasing
model scale and complexity appeared to deliver limited performance gains, as
several methods reached about 88.5% AUC on C-terminal and 79.5% AUC on
N-terminal cleavage prediction. This suggests that the noise and/or complexity
of proteasomal cleavage and the subsequent biological processes of the antigen
processing pathway are the major limiting factors for predictive performance rather
than the specific modeling approach used. While biological complexity can be
tackled by more data and better models, noise and randomness inherently limit the
maximum achievable predictive performance. All our datasets and experiments are
available at https://github.com/ziegler-ingo/cleavage_prediction.
1 Introduction
Proteasomal digestion of antigens is a major step of the antigen processing pathway: by cleaving proteins into smaller peptides, it determines what may subsequently be presented by the major histocompatibility complex (MHC) to T cells, potentially triggering an immune response [Blum et al., 2013]. Therefore, an important task in the computational design of epitope vaccines (EV) is the prediction of this cleavage process, so that this information can be used by existing computational approaches [Dorigatti and Schubert, 2020a,b] to improve the efficacy of the vaccine.
Due to the difficulty of collecting large quantities of data in vitro, proteasomal cleavage events are
usually inferred implicitly from MHC ligandomics data [Purcell et al., 2019] by matching eluted
ligands to their progenitor proteins to recover the sequence information surrounding the termini [Keşmir et al., 2002]. This procedure, however, gives no indication of which amino acid sequences cannot result in a cleavage event, since missed cleavage sites are not observed in MHC ligands.
Therefore, decoy negative samples are usually generated synthetically either by randomly shuffling
the amino acids in a short window around the cleavage site or by considering artificial negative sites
located around observed cleavage events [Calis et al., 2014]. Even though such negative samples are
not entirely reliable, the growing availability of this kind of data [Vita et al., 2018] spurred the continuous development and improvement of proteasomal cleavage predictors [Keşmir et al., 2002, Kuttler et al., 2000, Dönnes and Kohlbacher, 2005, Nielsen et al., 2005], which have recently been revised in light of new innovations in the deep learning field [Amengual-Rigo and Guallar, 2021a, Dorigatti et al., 2022, Weeder et al., 2021, Amengual-Rigo and Guallar, 2021b].
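To make the first decoy strategy concrete, the following minimal Python sketch shuffles the residues of a window around an observed cleavage site; the sequence and the helper are hypothetical, not part of any published pipeline.

import random

def shuffle_decoy(window: str, rng: random.Random) -> str:
    # Synthetic negative: permute the residues of a positive cleavage window.
    residues = list(window)
    rng.shuffle(residues)
    return "".join(residues)

rng = random.Random(42)
positive = "ALDKFGHSVT"                # hypothetical 10-residue cleavage window
decoy = shuffle_decoy(positive, rng)   # same composition, shuffled order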
As a consequence of these developments, we implemented and tested several binary classification methods on a proteasomal cleavage prediction task, carefully benchmarking a wide choice of architectures, embeddings, and training regimes.
2 Methods
In this benchmark study, we consider three main axes of variation: the initial embedding of amino acids, the neural architecture of the predictor, and the training regime via noise handling and data augmentation.
2.1 Embedding
The choice of embedding is crucial as it influences what intrinsic information a model can exploit
for classification [Ibtehaz and Kihara, 2021]; we thus consider various embeddings in our analysis,
while keeping the base architecture equal. Specifically, we analyze the performance of a randomly
initialized embedding layer that is optimized in conjunction with the loss function of the whole
model, and the dedicated Prot2Vec [Asgari and Mofrad, 2015] embeddings trained with the well-established Word2Vec [Mikolov et al., 2013a,b] algorithm. Analogous to natural language, we design
sequence embeddings by concatenating independently trained forward and backward amino acid
representations of each input [Heigold et al., 2016].
Trainable tokenizers learn to form a given number of complex intra-token splits. The vocabulary size thus becomes a tunable hyperparameter with a direct impact on the size and quality of the subsequently trained embedding representations. We extend our experiments with vocabulary sizes of 1,000 and 50,000 for the byte-level byte pair encoding [Sennrich et al., 2016, BBPE], as well as a 50,000-token version of the WordPiece [Schuster and Nakajima, 2012, WP] algorithm.
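As an illustration of how such a tokenizer can be trained, here is a minimal sketch using the Hugging Face tokenizers library; the toy corpus is a placeholder, whereas an actual run would train on the full set of peptide windows.

# pip install tokenizers
from tokenizers import ByteLevelBPETokenizer

sequences = ["ALDKFGHSVT", "MKTAYIAKQR", "GSHMLEDPAR"]  # toy corpus of windows

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(sequences, vocab_size=1000, min_frequency=2)

print(tokenizer.encode("ALDKFGHSVT").tokens)  # learned multi-residue splits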
2.2 Neural architectures
Recurrent: Bidirectional long short-term memory networks (BiLSTM) [Graves and Schmidhuber, 2005] are well suited for a wide range of text classification tasks; we therefore based nine of our twelve model architectures on BiLSTMs. Their fundamental structure follows the architecture proposed by Ozols et al. [2021], in which multiple sequential BiLSTMs are followed by a hidden and an output layer. For eight of our nine BiLSTM-related experiments, we use two stacked BiLSTMs and reduce the sequence dimension by taking the maximum of the per-residue outputs of the last layer. The hidden layer uses the Gaussian Error Linear Unit (GELU) [Hendrycks and Gimpel, 2016] activation function. We additionally include an adjusted five-layer BiLSTM version of a residual architecture with skip connections between LSTM blocks, which aims to combat the shallow-layer problem of deep LSTM architectures while also improving decoder quality with attention [Liu and Gong, 2019].
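A minimal PyTorch sketch of the two-layer BiLSTM predictor follows; dimensions and vocabulary size are illustrative, not the tuned hyperparameters.

import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    # Two stacked BiLSTMs, max-pooling over residues, GELU hidden layer,
    # and a single output logit for binary cleavage prediction.
    def __init__(self, vocab_size=25, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.hidden = nn.Linear(2 * hidden_dim, hidden_dim)
        self.act = nn.GELU()
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        x, _ = self.lstm(self.embedding(tokens))  # (batch, seq_len, 2*hidden)
        x, _ = x.max(dim=1)                       # max-pool over the sequence
        return self.out(self.act(self.hidden(x))).squeeze(-1)

model = BiLSTMClassifier()
logits = model(torch.randint(0, 25, (4, 10)))     # four 10-residue windows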
Transformers:
Besides RNNs, the attention mechanism introduced by Vaswani et al. enabled a
whole new architecture capable of processing sequences: the transformer [Vaswani et al., 2017]. We,
therefore, integrated ProtTrans’ T5-XL encoder-only model [Elnaggar et al., 2022] featuring 1.2
2
billion parameters, as well as ESM2 transformer [Lin et al., 2022] in its 150 million parameter version.
Additionally, we include a fine-tuning performance of ESM2 by adding a linear layer projection from
its vocabulary-sized per-residue Roberta Language Model Head [Liu et al., 2019a, Rives et al., 2021]
to our binary classification target.
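The shared downstream idea for the frozen language models can be sketched as below: pool precomputed per-residue embeddings over the window and project to a single logit. The embedding dimension (640, matching the 150M ESM2) is the only model-specific assumption; embedding extraction itself is omitted.

import torch
import torch.nn as nn

embed_dim = 640                 # per-residue embedding size, e.g. ESM2-150M
head = nn.Linear(embed_dim, 1)  # binary cleavage logit

per_residue = torch.randn(4, 10, embed_dim)  # placeholder precomputed embeddings
logits = head(per_residue.mean(dim=1)).squeeze(-1)  # mean-pool, then project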
Convolutional and Perceptron:
We take the DeepCleave [Li et al., 2019] attention-enhanced
convolutional neural network [LeCun et al., 1998, CNN] architecture into our benchmark analysis.
Furthermore, stacking fully connected layers without any convolutional or recurrent features, e.g., in
DeepCalpain [Liu et al., 2019b] or Terminitor [Yang et al., 2020], has also been successfully applied
to protein data. As baseline, we include a single hidden layer perceptron [Rumelhart et al., 1986]
with Rectified Linear Units [Agarap, 2018] as activation function into the analysis.
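For reference, the perceptron baseline amounts to the following sketch on flattened one-hot windows; the layer sizes are illustrative.

import torch.nn as nn

# 10-residue window x 20 amino acids, one hidden layer with ReLU
mlp = nn.Sequential(
    nn.Flatten(),            # (batch, 10, 20) -> (batch, 200)
    nn.Linear(10 * 20, 128),
    nn.ReLU(),
    nn.Linear(128, 1),       # binary cleavage logit
)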
2.3 Training
Dataset: We used the dataset introduced in [Dorigatti et al., 2022], which contains 229,163 and 222,181 N- and C-terminal cleavage sites, respectively. Each cleavage site is captured in a window comprising six amino acids to its left and four to its right, and is associated with six decoy negative samples obtained by considering the three residues preceding and following it, resulting in totals of 1,434,989 and 1,419,501 samples after deduplication for the N- and C-termini. As the decoy negatives are situated in close proximity to real cleavage sites, and given the probabilistic nature of proteasomal cleavage, some of the negative samples are likely to be actual, unmeasured cleavage sites and may degrade the performance of predictors trained on such data.
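The window and decoy construction can be sketched as follows; the helper and the example protein are hypothetical, shown only to make the offsets concrete, and the published preprocessing may differ in edge-case handling.

def windows_for_site(protein, site, left=6, right=4, offsets=(-3, -2, -1, 1, 2, 3)):
    # Positive window: six residues left, four right of the cleavage site.
    # Decoys: the same window shape centered on the six neighbouring positions.
    def window(pos):
        if pos - left < 0 or pos + right > len(protein):
            return None
        return protein[pos - left:pos + right]

    decoys = [w for off in offsets if (w := window(site + off)) is not None]
    return window(site), decoys

protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # hypothetical sequence
positive, negatives = windows_for_site(protein, site=10)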
Noisy labels: To reduce the impact of asymmetric label noise on the performance of our classifiers, we consider five recent deep-learning-specific denoising approaches: a noise adaptation layer, which attempts to learn the noise distribution in the data [Goldberger and Ben-Reuven, 2017]; co-teaching, where two models are trained simultaneously and each selects for the other which samples of a mini-batch to use for training [Han et al., 2018]; and co-teaching-plus [Yu et al., 2019], which updates co-teaching with the disagreement learning approach of decoupling [Malach and Shalev-Shwartz, 2017]. We additionally benchmark joint training with co-regularization (JoCoR) [Wei et al., 2020] and DivideMix [Li et al., 2020a]. DivideMix is a holistic approach originally developed for computer vision that integrates multiple frameworks, such as co-teaching and MixMatch [Berthelot et al., 2019], into one. As MixMatch builds upon MixUp [Zhang et al., 2018], which was developed for image data, we adapt it to sequential data by mixing the embedded sequence representations [Guo et al., 2019] instead of the pixel inputs during data loading.
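A minimal sketch of this embedding-level MixUp follows; the Beta parameter and the max(lam, 1 - lam) bias (borrowed from DivideMix) are illustrative choices rather than our tuned settings.

import torch

def embedding_mixup(emb_a, emb_b, y_a, y_b, alpha=0.75):
    # Convexly combine embedded sequences and their labels instead of pixels.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1.0 - lam)  # keep the first sample dominant, as in DivideMix
    return lam * emb_a + (1 - lam) * emb_b, lam * y_a + (1 - lam) * y_b

emb_a, emb_b = torch.randn(4, 10, 128), torch.randn(4, 10, 128)  # (batch, len, dim)
y_a, y_b = torch.rand(4).round(), torch.rand(4).round()          # binary labels
mixed_emb, mixed_y = embedding_mixup(emb_a, emb_b, y_a, y_b)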
Data augmentation: For all models, we apply data augmentation directly on the input sequences to combat overfitting and improve generalization by masking one random amino acid per sequence as unknown [Shen et al., 2021]. All predictors except the fine-tuned ESM2 use the Adam optimizer [Kingma and Ba, 2015], whereas ESM2 fine-tuning uses Adam with decoupled weight decay [Loshchilov and Hutter, 2017]. All models without denoising techniques use (binary) cross-entropy loss [Cox, 1958], while the denoising models compute their dedicated losses.
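The masking augmentation reduces to a few lines; the index of the unknown token is vocabulary-dependent, and the value below is a placeholder.

import torch

def mask_random_residue(tokens, unk_idx):
    # Replace one random position per sequence with the unknown token ('X').
    out = tokens.clone()
    rows = torch.arange(tokens.size(0))
    out[rows, torch.randint(0, tokens.size(1), (tokens.size(0),))] = unk_idx
    return out

batch = torch.randint(0, 20, (4, 10))               # four 10-residue windows
augmented = mask_random_residue(batch, unk_idx=20)  # 20 = placeholder 'X' index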
3 Experimental protocol
Evaluation: As previously mentioned, some negative samples may actually result in a proteasomal cleavage event in vivo, due to the way they are generated. For this reason, traditional binary classification metrics such as accuracy, precision, and recall are misleading, and model evaluation should instead be based on the area under the ROC curve (AUC), whose ranking-based nature makes it robust to class-conditional label noise [Menon et al., 2015]. We reserved a random 10% of each terminal's dataset as a test set for the final evaluation of the best hyperparameters.
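Computing the metric itself is a one-liner with scikit-learn; the labels and scores below are random placeholders rather than actual model outputs.

# pip install scikit-learn
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.random.randint(0, 2, size=1000)  # placeholder test labels
y_score = np.random.rand(1000)               # placeholder model scores
print(roc_auc_score(y_true, y_score))        # rank-based, threshold-free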
Hyperparameter optimization: Due to computational limitations, we split the hyperparameter search into three priority groups: group one used Ray Tune's [Moritz et al., 2018] implementation of the asynchronous hyperband algorithm [Li et al., 2020b] and evaluated each configuration with ten-fold cross-validation (CV), while for groups two and three we chose hyperparameters manually and evaluated each configuration with five-fold CV (group two) or a single run on a held-out validation set (group three). We then used the best hyperparameter combination to train each model for the final evaluation on the test set.
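For group one, the search can be sketched with the classic function-based Ray Tune API as below; the search space, metric name, and dummy trainable are placeholders (exact signatures vary across Ray versions, and a real trainable would run a CV fold and report its AUC).

# pip install "ray[tune]"
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_fn(config):
    # Dummy trainable: a real one would train a model and report CV AUC.
    for epoch in range(10):
        tune.report(val_auc=min(0.9, 0.5 + epoch * config["lr"] * 10))

tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-5, 1e-2),
            "hidden_dim": tune.choice([128, 256, 512])},
    num_samples=20,
    scheduler=ASHAScheduler(metric="val_auc", mode="max", grace_period=1),
)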