We conduct experiments using three deep-learning models: an RNN, a Transformer, and a WFA (i.e., a
Weighted Finite Automaton, a form of RNN with linear activations). Our results suggest that in the
binary classification setting the deep-learning models are indeed able to learn the underlying class
distribution in a non-trivial manner. These results seem to be consistent with previous studies showing
that deep-learning models provide calibrated predictions for binary classification problems [Niculescu-
Mizil and Caruana, 2005]. Our main contributions are: 1) We present a novel evaluation framework for
sequence prediction models. By exploiting unlabeled data, we evaluate the model with respect to the
implicitly induced joint distribution. 2) Our evaluation distinguishes performance over components
of the distribution seen in the training data from performance over unseen components, explicitly
differentiating compression from generalization. 3) Our experiments on a sparse sequence classification task show
that deep learning architectures are able to induce good distributions in a non-trivial manner.
2 Related Work
In the recent literature, several works have addressed the problem of model calibration. Some of
these studies have shown that for some deep-learning architectures the predictions produced by the
models are not well calibrated [Guo et al., 2017] [Kumar and Sarawagi, 2019], especially for non
pre-trained transformers [Desai and Durrett, 2020]. In contrast, for the case of binary classification,
some previous work has suggested that they are indeed well calibrated [Niculescu-Mizil and Caruana,
2005]. To our knowledge, we are the first to study calibration by looking at the joint distribution
Pr(x, y) induced by the learned classifier instead of the quality of the conditional class predictions.
Calibration aside, several works have compared the classification performance of different deep-
learning architectures: CNNs and RNNs [Józefowicz et al., 2016] [Yin et al., 2017], Transformers
and RNNs [Karita et al., 2019] [Lakew et al., 2018], Transformers and CNNs [Kolesnikov et al.,
2021] [Pinto et al., 2021] [Bai et al., 2021], WFAs and CNNs [Quattoni and Carreras, 2020], and
WFAs and RNNs [Quattoni and Carreras, 2019].
Finally, an orthogonal but related problem is that of developing deep-learning models for density
estimation. These include Transformers [Fakoor et al., 2020], Autoregressive Networks [Uria
et al., 2016] [Oliva et al., 2018] and Flow Models [Durkan et al., 2019] [De Cao et al., 2020].
3 Evaluating Deep-Learning Models as Moment Predictors
Our goal is to evaluate models defined over sequences of discrete symbols. More precisely, we
consider an alphabet Σ and the set of all possible sequences Σ*. In general, we can think of
probabilistic binary sequence classifiers as functions Σ* → [0, 1]. In our setting we assume that
we have access to a large set of sequences U = {x^(1), . . . , x^(u)} sampled according to the underlying
distribution over Σ*. We can think of U as a large set of unlabeled sequences that represents the
domain of the sequence classification task. The target class Y that we wish to learn is a subset of
the sequences in U, that is, Y ⊂ U. We are particularly interested in cases where the target class is
rare, that is, cases in which |Y| is significantly smaller than |U|.
We will create a labeled training set T = {(x^(1), y^(1)), . . . , (x^(m), y^(m))} of size m by sampling
sequences from U and labeling them as positive instances (y = 1) if they appear in Y, and as
negative instances (y = 0) otherwise. Section 4 contains further details about how we create the task
data from an existing dataset, for various training sizes m. Given a training set, we train a sequence
classifier M that defines a distribution Pr_M(y | x).
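As a minimal sketch of this labeling protocol (assuming U is given as a list of symbol sequences and Y as a set of positive sequences; the function name, seed, and toy data are illustrative, not from the paper), the training set T can be built as follows:

```python
import random

def build_training_set(U, Y, m, seed=0):
    """Sample m sequences from U and label each one 1 if it
    belongs to the target class Y, else 0 (illustrative sketch)."""
    rng = random.Random(seed)
    sample = rng.sample(U, m)
    return [(x, 1 if x in Y else 0) for x in sample]

# Toy example over the alphabet {a, b}
U = ["ab", "ba", "aa", "bb", "aba", "bab"]
Y = {"aa", "aba"}  # a rare target class: |Y| much smaller than |U|
T = build_training_set(U, Y, m=4)
```

In the rare-class regime of interest, sampling uniformly from U yields mostly negative instances, which is why the paper varies the training size m.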
In this work, we consider three sequence classification models: an RNN, a Transformer, and a WFA.
As the recurrent neural network (RNN) we employ a multi-layer LSTM [Hochreiter and Schmidhuber,
1997] with a binary-classification feed-forward layer on top. For the Transformer we select the BERT
architecture [Devlin et al., 2019]; we do not use the pre-trained weights, and we expand the embeddings
with new randomly initialized ones to handle the protein dataset vocabulary. Both models output
the conditional probability Pr(l | x), where x is a sequence and l a label. We also evaluate a WFA
(Weighted Finite Automaton, which in essence is an RNN with linear activation functions [Rabusseau
et al., 2019]), employing the ensemble proposed in Quattoni and Carreras [2020].
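To make the WFA view concrete: a WFA scores a sequence by multiplying one transition matrix per symbol between initial and final weight vectors, which is exactly an RNN whose activation is the identity. A minimal numpy sketch (state dimension, random initialization, and the sigmoid squashing are illustrative assumptions, not the ensemble used in the paper):

```python
import numpy as np

class WFA:
    """Weighted finite automaton: score(x) = alpha^T A_{x_1} ... A_{x_n} omega.
    Equivalent to a linear-activation RNN with state h_t = h_{t-1} A_{x_t}."""

    def __init__(self, alphabet, n_states, seed=0):
        rng = np.random.default_rng(seed)
        # one transition matrix per symbol in the alphabet
        self.A = {s: rng.normal(size=(n_states, n_states)) / n_states
                  for s in alphabet}
        self.alpha = rng.normal(size=n_states)  # initial weights
        self.omega = rng.normal(size=n_states)  # final weights

    def score(self, x):
        h = self.alpha
        for symbol in x:          # linear "RNN" update, no nonlinearity
            h = h @ self.A[symbol]
        return float(h @ self.omega)

    def prob(self, x):
        # squash the real-valued score into [0, 1] to obtain a
        # probabilistic binary classifier (an assumed design choice)
        return 1.0 / (1.0 + np.exp(-self.score(x)))
```

For the empty sequence the score reduces to alpha^T omega, and each additional symbol is a single matrix multiplication, mirroring one recurrent step.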