
modeling in order to obtain accurate decoding
into acoustic units. Back to square one.
Here, we wish to estimate, as a preliminary
question, the cost of switching from a word-based
representation (with boundaries) to a phoneme
one, without boundaries. We use the Librispeech
corpus (Panayotov et al.,2015), for which we have
both the text transcription and a phoneme-based
transcription. We use phoneme transcriptions
as a proxy for ‘accurate’ acoustic units, leaving
for later the problem of erroneous transcripts when
the units are derived from speech. We also test the
possibility of replacing gold word boundaries by
automatically obtained ones using an unsupervised
word segmentation algorithm.
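As a concrete illustration of how automatically proposed boundaries can be scored against the gold segmentation, here is a minimal sketch of a boundary-F1 computation (a toy example; the helper names and the evaluation shown are illustrative, not necessarily the exact protocol used later in the paper):

```python
def boundary_positions(segmented):
    """Return the set of internal boundary offsets of a segmentation,
    counting symbols only (spaces excluded).
    e.g. "the dog barks" -> boundaries after offsets 3 and 6."""
    positions, offset = set(), 0
    words = segmented.split()
    for w in words[:-1]:
        offset += len(w)
        positions.add(offset)
    return positions

def boundary_f1(gold, hypo):
    """F1 over boundary positions of a gold vs. hypothesized segmentation."""
    g, h = boundary_positions(gold), boundary_positions(hypo)
    if not g or not h:
        return 0.0
    tp = len(g & h)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(h), tp / len(g)
    return 2 * precision * recall / (precision + recall)

# An undersegmentation error lowers recall but not precision:
score = boundary_f1("the dog barks", "thedog barks")  # ~0.667
```

The same machinery applies unchanged to phoneme strings; only the symbol inventory differs.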
When comparing LMs with widely different
kinds of input units, standard metrics like
perplexity cannot be used because these metrics
scale in complicated ways with the granularity
of the input units. Instead, we rely on three
psycholinguistically inspired black-box NLP
benchmarks which are independent of unit
granularity, and which we adapt to be
speech-compatible by phonemizing them and
filtering the vocabulary with the Librispeech
train set. The first two are based on assigning
pseudo-probabilities to input strings, which are
used as a proxy for an acceptability score. For
the lexical benchmark (pWUGGY), we compare the
acceptability of a word (like “brick”) to that
of a non-word (like “blick”). The words and
non-words are otherwise matched on unigram and
bigram probabilities. For the syntactic
benchmark (pBLIMP), we adapted and phonetically
transcribed the BLIMP dataset (Warstadt et al.,
2019), in which the acceptability of pairs of
grammatical and ungrammatical sentences is
assessed. The semantic test (pSIMI) is based on
the distance between word embeddings, which is
correlated with human-obtained distances.
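To make the pseudo-probability setup concrete, here is a minimal sketch of the pairwise accuracy computation on word/non-word pairs. A toy character-bigram model stands in for the trained LM, and the corpus and names are illustrative only:

```python
import math
from collections import Counter

def train_bigram(corpus, alpha=1.0):
    """Toy character-bigram LM with add-alpha smoothing; a stand-in for
    the pseudo-probabilities that a trained LSTM would assign."""
    chars = set("".join(corpus)) | {"<", ">"}  # < start, > end markers
    counts, context = Counter(), Counter()
    for w in corpus:
        seq = "<" + w + ">"
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
            context[a] += 1
    V = len(chars)
    def logprob(word):
        seq = "<" + word + ">"
        return sum(math.log((counts[(a, b)] + alpha) / (context[a] + alpha * V))
                   for a, b in zip(seq, seq[1:]))
    return logprob

def pair_accuracy(pairs, logprob):
    """Fraction of (real, fake) pairs where the real item scores higher."""
    return sum(logprob(real) > logprob(fake) for real, fake in pairs) / len(pairs)

lp = train_bigram(["brick", "black", "block", "trick", "track", "blink"])
acc = pair_accuracy([("brick", "bzick"), ("black", "blxck")], lp)  # 1.0 here
```

The pBLIMP benchmark uses the same comparison, but between pseudo-probabilities of whole grammatical and ungrammatical sentences rather than words and non-words.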
The structure of the paper is as follows:
after presenting the related work (Section 2) and
methods (datasets, models and metrics, Section 3),
we present the results of baseline character-based
LSTM models with access to word boundaries
(Section 4). We then present experiments in
which we change the units to phones and remove
the gold boundaries, or replace them with
automatically extracted ones (Section 5).
2 Related work
Units for LSTMs The importance of word
boundaries has been investigated by Hahn and
Baroni (2019). They compared word-based and
character-based LSTMs, where the word boundaries
(space characters) were removed, on a variety of
probe tasks. They found that the character-based
LSTMs passed a number of linguistic tests,
sometimes better than word-based models, which
are impaired by the presence of OOVs. Here, we
follow the same inspiration, but evaluate more
systematically models that are boundary-based
but do not suffer from OOVs (i.e., BPE and
fallback models), in order to give word models a
fairer comparison point and provide a
quantitative measure of the cost of not having
boundaries. We also expand the investigation to
phoneme representations that are a step closer
to speech.
Black-box linguistics Among the variety of
black-box linguistic tasks, psycholinguistically
inspired ones enable the direct comparison of
models and humans. Grammaticality judgments
for recurrent networks have been investigated
since Allen and Seidenberg (1999), who use
closely matched pairs of sentences to investigate
grammatical correctness. This approach has been
adopted recently to assess the abilities of RNNs,
and LSTMs in particular, to capture syntactic
structures. For instance, Linzen et al. (2016)
and Gulordava et al. (2018) use word probes in
minimally different pairs of English sentences
to study number agreement. To discriminate
grammatical sentences from ungrammatical ones,
they retrieve the probabilities of the possible
morphological forms of a target word given the
previous words in the sentence. In practice, in
the sentence “the boy is sleeping”, the network
has detected number agreement if
P(w=is) > P(w=are). This
methodology has also been adapted by Goldberg
(2019) to models trained with a masked language-
modeling objective. Those works find that in the
absence of many distractors or complex sentence
features, recent language models perform well at
the number-agreement problem in English.
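The agreement probe described above can be sketched as follows, with a toy word-bigram model standing in for the trained LSTM (the interface and the tiny corpus are hypothetical; only next-word probabilities at the target slot are needed):

```python
from collections import Counter

def train(sentences):
    """Toy word-bigram LM; returns a next-word probability function."""
    counts, context = Counter(), Counter()
    for s in sentences:
        toks = s.split()
        for a, b in zip(toks, toks[1:]):
            counts[(a, b)] += 1
            context[a] += 1
    def p_next(prefix, word):
        prev = prefix.split()[-1]
        if context[prev] == 0:
            return 0.0
        return counts[(prev, word)] / context[prev]
    return p_next

p_next = train([
    "the boy is sleeping",
    "the boys are sleeping",
    "the boy is running",
    "the boys are playing",
])

# The probe: the model "detects" agreement if it assigns higher
# probability to the correct morphological form at the target position.
correct = p_next("the boy", "is") > p_next("the boy", "are")
```

With a real LM, `p_next` would be replaced by the model's conditional distribution over the vocabulary at the target position; the comparison itself is unchanged.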
More closely related to our work, Ravfogel
et al. (2018) use word probes to examine
whether LSTMs understand Basque agreement.
Like German, Basque is a morpho-syntactically
rich language with relatively free word order,
thus providing a challenging setting for the LM.
In contrast to our work, the LM’s ability to
understand verb argument structure is tested on
number-agreement and on suffix recovery tasks,