ized frequencies of the inflected forms produced
by participants – and ratings – i.e., the average
rating assigned to a given past tense form on a
well-formedness scale. They then implemented
two computational models, one rule-based and one
analogy-based, and computed the correlation between
each model's probabilities for the past tense forms of
nonce verbs and the corresponding human measures.
They found that the rule-based model more
accurately accounts for nonce word inflection.
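To make this kind of comparison concrete, the sketch below correlates model probabilities with the two human measures. It is a minimal illustration, assuming hypothetical parallel lists of scores for the same candidate past tense forms; all values and variable names are invented, and A&H's actual analysis may differ in its details.

```python
# A minimal sketch of correlating model scores with human measures.
# All data below are invented for illustration only.
from scipy.stats import pearsonr, spearmanr

model_probs = [0.82, 0.05, 0.64, 0.11]       # P(form | nonce verb) under a model
human_prod_probs = [0.75, 0.10, 0.58, 0.20]  # normalized production frequencies
human_ratings = [6.1, 2.3, 5.4, 3.0]         # mean well-formedness ratings

# Correlate model probabilities with each human measure.
r_prod, _ = pearsonr(model_probs, human_prod_probs)
rho_rating, _ = spearmanr(model_probs, human_ratings)
print(f"Pearson r vs. production probabilities: {r_prod:.3f}")
print(f"Spearman rho vs. ratings: {rho_rating:.3f}")
```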
After several years of progress for neural networks,
including state-of-the-art results on morphological
inflection (Kann and Schütze, 2016; Cotterell et al.,
2016), this debate was revisited by Kirov and
Cotterell (2018, K&C), who examined modern neural
networks. They trained a bidirectional LSTM
(Hochreiter and Schmidhuber, 1997) with attention
(Bahdanau et al., 2015) on English past tense
inflection and, in experiments quantifying model
accuracy on a held-out set of real English verbs,
showed that it addresses many of the shortcomings
pointed out by Pinker and Prince (1988). They
concluded that the LSTM is, in fact, capable of
modeling English past tense inflection.
They also applied the model to the wug experiment
from A&H and found a positive correlation with
human production probabilities that was slightly
higher than that of A&H's rule-based model.
Corkery et al. (2019, C&al.) reproduced this
experiment and additionally compared against the
average human rating that each past tense form
received in A&H's dataset. They found that the
neural network from K&C produced probabilities that
were sensitive to random initialization – showing
high variance in the resulting correlations with
humans – and typically did not correlate better than
the rule-based model from A&H. They then designed
an experiment where inflected forms were sampled
from several differently initialized models, so that
the frequencies of each form could be aggregated in
a similar fashion to the adult production
probabilities – but the results still favored A&H's
rule-based model. They hypothesized that the model's
overconfidence in the most likely inflection (i.e.,
the regular inflection class) leads to
uncharacteristically low variance in its predictions
for unknown words.
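The aggregation step C&al. describe can be sketched as follows. This is an illustrative reconstruction, not their implementation: `production_probabilities` and the `make_toy_sampler` stand-ins are invented names, and real samplers would decode from LSTMs trained with different random seeds.

```python
# Sketch: pool sampled past tense forms from several independently
# initialized models, then normalize counts into production
# probabilities, mirroring how human responses are aggregated.
from collections import Counter
import random

def production_probabilities(verb, samplers, samples_per_model=100):
    counts = Counter()
    for sample in samplers:              # one sampler per trained model
        for _ in range(samples_per_model):
            counts[sample(verb)] += 1
    total = sum(counts.values())
    return {form: count / total for form, count in counts.items()}

# Toy stand-ins for per-seed model samplers (invented for illustration).
def make_toy_sampler(seed):
    rng = random.Random(seed)
    return lambda verb: rng.choices(
        [verb + "ed", verb[:-3] + "ung"],  # e.g. "spling" -> "splinged"/"splung"
        weights=[0.8, 0.2],
    )[0]

samplers = [make_toy_sampler(seed) for seed in range(5)]
print(production_probabilities("spling", samplers))
```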
German Noun Plural
McCurdy et al. (2020a,
M&al.) applied an LSTM to the task of German
noun plural inflection to investigate a hypothesis
from Marcus et al. (1995, M95), who attributed the
outputs of neural models to their susceptibility to
the most frequent pattern observed during training,
stressing that, as a result, neural approaches fail to
learn the patterns of infrequent inflection classes.
German nouns inflect for the singular/plural
distinction. There are five plural suffixes, none of
which constitutes a regular majority: /-(e)n/, /-e/,
/-er/, /-s/, and /-∅/. M95 had built a dataset of
monosyllabic German noun wugs and investigated human
behavior when inflecting the plural form,
distinguishing between phonologically familiar
environments (Rhymes) and unfamiliar ones
(Non-Rhymes).
The German plural system, they argued, was an
important test for neural networks since it presents
multiple productive inflection rules, all of which
are minority inflection classes by frequency. This
is in contrast to the dichotomy of the regular and
irregular English past tense. M&al. collected their
own human production probabilities and ratings
for these wugs, and then compared those to LSTM
productions. Humans were prompted with each
wug preceded by the neuter determiner, both to
control for the fact that neural inflection models of
German noun plurals are sensitive to grammatical
gender (Goebel and Indefrey, 2000) and because
humans do not have a majority preference for
monosyllabic, neuter nouns (Clahsen et al., 1992).
The /-s/ inflection class, which is highly
infrequent, appears in a wide range of phonological
contexts, which has led some researchers to suggest
that it is the default class for German noun plurals,
and thus the regular inflection, despite its
infrequent use. M&al. found that it was preferred by
humans more in Non-Rhyme contexts than in Rhymes,
but the LSTM model showed the opposite preference,
undermining the hypothesis that LSTMs model human
generalization behavior. /-s/ was additionally
predicted less accurately than the other inflection
classes on a held-out test set of real noun
inflections.
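A per-class breakdown of held-out accuracy, of the kind this finding rests on, can be sketched in a few lines. The code below is a generic illustration with an invented `predict_plural` interface and toy items, not M&al.'s evaluation code.

```python
# Sketch: break held-out accuracy down by gold inflection class.
from collections import defaultdict

def accuracy_by_class(test_items, predict_plural):
    correct, total = defaultdict(int), defaultdict(int)
    for noun, gold_plural, infl_class in test_items:
        total[infl_class] += 1
        correct[infl_class] += predict_plural(noun) == gold_plural
    return {c: correct[c] / total[c] for c in total}

# Toy items and a trivial "predictor" that always appends /-e/.
items = [("Hund", "Hunde", "-e"), ("Auto", "Autos", "-s"), ("Kind", "Kinder", "-er")]
print(accuracy_by_class(items, lambda noun: noun + "e"))
# -> {'-e': 1.0, '-s': 0.0, '-er': 0.0}
```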
They found that the most frequent inflection
class in the training data for monosyllabic, neuter
contexts, /-e/, was over-generalized by the LSTM
when compared to human productions. The most
frequent class overall, /-(e)n/ (though infrequent in
the neuter context), was applied by humans quite
frequently to nonce nouns, but rarely by the LSTM.
They additionally found that /-er/, which is as
infrequent as /-s/, could be accurately predicted on
the test set, and that the null inflection /-∅/,
which is generally frequent but extremely rare in the
monosyllabic, neuter setting, was never predicted for the