billion parameters, as well as the ESM2 transformer [Lin et al., 2022] in its 150-million-parameter version.
Additionally, we include the fine-tuning performance of ESM2, obtained by adding a linear projection from
its vocabulary-sized per-residue RoBERTa language model head [Liu et al., 2019a, Rives et al., 2021]
to our binary classification target.
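To make this setup concrete, the following is a minimal sketch in PyTorch using the fair-esm package; flattening the per-residue vocabulary logits over the window before the linear projection is our assumption about the head wiring, and the tokens are assumed to come from the alphabet's batch converter (which adds BOS/EOS).

```python
import torch
import torch.nn as nn
import esm  # fair-esm package

class ESM2CleavageClassifier(nn.Module):
    """ESM2 (150M) with a linear projection from the vocabulary-sized
    per-residue LM-head logits to one binary cleavage logit (a sketch;
    flattening over the window is our assumption)."""

    def __init__(self, window_len: int = 10):
        super().__init__()
        self.esm2, self.alphabet = esm.pretrained.esm2_t30_150M_UR50D()
        vocab_size = len(self.alphabet.all_toks)
        self.head = nn.Linear(window_len * vocab_size, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        lm_logits = self.esm2(tokens)["logits"]  # (B, L, vocab)
        lm_logits = lm_logits[:, 1:-1, :]        # drop BOS/EOS positions
        return self.head(lm_logits.flatten(1)).squeeze(-1)
```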
Convolutional and perceptron models:
We include the DeepCleave [Li et al., 2019] attention-enhanced
convolutional neural network [LeCun et al., 1998, CNN] architecture in our benchmark analysis.
Furthermore, stacking fully connected layers without any convolutional or recurrent features, e.g., in
DeepCalpain [Liu et al., 2019b] or Terminitor [Yang et al., 2020], has also been successfully applied
to protein data. As a baseline, we include a single-hidden-layer perceptron [Rumelhart et al., 1986]
with rectified linear units [Agarap, 2018] as the activation function.
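A minimal sketch of this baseline in PyTorch; the one-hot input encoding and the hidden width are illustrative assumptions, not values reported here.

```python
import torch.nn as nn

class PerceptronBaseline(nn.Sequential):
    """Single-hidden-layer perceptron with ReLU activations; the input is
    the flattened one-hot window (10 residues x 20 amino acids).
    hidden_dim is a hypothetical hyperparameter."""

    def __init__(self, window_len: int = 10, n_residue_types: int = 20,
                 hidden_dim: int = 128):
        super().__init__(
            nn.Flatten(),                                    # (B, 10, 20) -> (B, 200)
            nn.Linear(window_len * n_residue_types, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),                        # binary cleavage logit
        )
```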
2.3 Training
Dataset:
We used the dataset introduced in [Dorigatti et al., 2022], which contains 229 163 and 222 181
N- and C-terminal cleavage sites, respectively. Each cleavage site is captured in a window
comprising six amino acids to its left and four to its right, and is associated with six decoy negative
samples obtained by considering the three residues preceding and following it, resulting in a total of
1 434 989 and 1 419 501 samples after deduplication for the N- and C-terminal datasets, respectively. As the decoy negatives
are situated in close proximity to real cleavage sites and due to the probabilistic nature of proteasomal
cleavage, some of the negative samples are likely to be actual, unmeasured cleavage sites, and may
influence the performance of predictors trained using such data.
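The windowing and decoy scheme can be sketched as follows; the indexing convention (the site marks the residue immediately after the cut) and the boundary handling are our assumptions.

```python
def site_windows(protein: str, site: int, left: int = 6, right: int = 4):
    """Return the positive window around a cleavage site plus its six decoy
    negatives (shifts -3..-1 and +1..+3). `site` indexes the residue right
    after the cut; windows running past the protein ends are skipped."""
    def window(pos: int):
        if pos - left < 0 or pos + right > len(protein):
            return None
        return protein[pos - left: pos + right]

    positive = window(site)
    decoys = [w for s in (-3, -2, -1, 1, 2, 3)
              if (w := window(site + s)) is not None]
    return positive, decoys
```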
Noisy labels:
To reduce the impact of asymmetric label noise on the performance of our classifiers,
we consider five recent deep-learning-specific denoising approaches: a noise adaptation
layer, which attempts to learn the noise distribution in the data [Goldberger and Ben-Reuven, 2017];
co-teaching, in which two models are trained simultaneously, each selecting for the other
which samples from a mini-batch to use for training [Han et al., 2018]; and co-teaching-plus [Yu et al.,
2019], which extends co-teaching with the disagreement-based learning approach of decoupling [Malach
and Shalev-Shwartz, 2017]. We additionally consider a joint training method with co-regularization
(JoCoR) [Wei et al., 2020] and DivideMix [Li et al., 2020a] for benchmarking. DivideMix is a
holistic approach originally developed for computer vision and integrates multiple frameworks,
such as co-teaching and MixMatch [Berthelot et al., 2019], into one. As MixMatch builds upon
MixUp [Zhang et al., 2018], which was developed for image data, we adapt MixUp to sequential data by
mixing the embedded sequence representations [Guo et al., 2019] instead of the pixel inputs during
data loading.
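The embedding-level MixUp adjustment can be sketched as follows; the tensor shapes and the Beta concentration parameter are illustrative assumptions.

```python
import torch

def mixup_embeddings(emb: torch.Tensor, labels: torch.Tensor, alpha: float = 0.2):
    """MixUp on embedded sequence representations instead of pixel inputs
    [Guo et al., 2019]. emb: (batch, seq_len, dim); labels: (batch,) in {0, 1}.
    alpha is a hypothetical default for the Beta mixing distribution."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(emb.size(0))
    mixed_emb = lam * emb + (1.0 - lam) * emb[perm]
    mixed_labels = lam * labels.float() + (1.0 - lam) * labels[perm].float()
    return mixed_emb, mixed_labels
```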
Data augmentation:
For all models, we apply data augmentation directly to the input sequences, masking a random
amino acid per sequence as unknown [Shen et al., 2021], to combat overfitting and improve
generalization. All predictors except ESM2 fine-tuning use adaptive momentum [Kingma
and Ba, 2015] as their optimization technique, whereas ESM2 fine-tuning uses adaptive momentum
with decoupled weight decay [Loshchilov and Hutter, 2017]. All models without denoising techniques
use (binary) cross-entropy loss [Cox, 1958], while all denoising models calculate dedicated losses.
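A sketch of the masking augmentation and the optimization setup; the 'X' unknown symbol, the stand-in model, and the learning rates are assumptions for illustration.

```python
import random
import torch

def mask_random_residue(seq: str, unknown: str = "X") -> str:
    """Mask one randomly chosen amino acid per sequence as unknown
    [Shen et al., 2021]; 'X' as the unknown symbol is our assumption."""
    i = random.randrange(len(seq))
    return seq[:i] + unknown + seq[i + 1:]

model = torch.nn.Linear(200, 1)            # stand-in for any of the predictors
criterion = torch.nn.BCEWithLogitsLoss()   # (binary) cross-entropy on logits

# Adam for most predictors; AdamW (decoupled weight decay) for ESM2
# fine-tuning. The learning rates are hypothetical.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
```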
3 Experimental protocol
Evaluation:
As previously mentioned, some negative samples may actually result in a proteasomal
cleavage event in vivo due to the way they are generated. For this reason, traditional
binary classification metrics such as accuracy, precision, and recall are misleading, and model
evaluation should instead be based on the AUC [Menon et al., 2015]. We reserved a random 10% of
each terminal dataset as a test set used for the final evaluation of the best hyperparameters.
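A sketch of the split-and-score protocol with scikit-learn; the featurized windows and the predictions below are dummy stand-ins.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))       # dummy featurized windows
y = rng.integers(0, 2, size=1000)      # dummy cleavage labels

# Reserve a random 10% of the terminal dataset as the final test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0)

# ... fit a predictor on (X_train, y_train), then evaluate by AUC only:
scores = rng.random(size=len(y_test))  # placeholder predicted probabilities
print(roc_auc_score(y_test, scores))
```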
Hyperparameter optimization:
Due to computational limitations, we split up the hyperparameter
search into three priority groups: group one used Ray Tune’s [Moritz et al., 2018] implementation
of the asynchronous hyperband algorithm [Li et al., 2020b] and evaluated each configuration with
ten-fold cross-validation (CV), while for groups two and three we chose hyperparameters manually
and evaluated each configuration with five-fold CV (group two) or a single run on a held-out
validation set (group three). We then used the best hyperparameter combination to train each