A Comprehensive Comparison of Neural Networks as Cognitive Models of
Inflection
Adam Wiemerslage and Shiran Dudy and Katharina Kann
University of Colorado Boulder
first.last@colorado.edu
Abstract
Neural networks have long been at the center
of a debate around the cognitive mechanism by
which humans process inflectional morphol-
ogy. This debate has gravitated into NLP by
way of the question: Are neural networks a fea-
sible account for human behavior in morpho-
logical inflection? We address that question
by measuring the correlation between human
judgments and neural network probabilities for
unknown word inflections. We test a larger
range of architectures than previously studied
on two important tasks for the cognitive pro-
cessing debate: English past tense, and Ger-
man number inflection. We find evidence that
the Transformer may be a better account of hu-
man behavior than LSTMs on these datasets,
and that LSTM features known to increase in-
flection accuracy do not always result in more
human-like behavior.
1 Introduction: The Past Tense Debate
Morphological inflection has historically been a
proving ground for studying models of language
acquisition. Rumelhart and McClelland (1985)
famously presented a neural network that they
claimed could learn English past tense inflection.
However, Pinker and Prince (1988) proposed a
dual-route theory for inflection, wherein regular
verbs are inflected based on rules and irregular
verbs are looked up in the lexicon. They high-
lighted several shortcomings of Rumelhart and Mc-
Clelland (1985) that they claimed any neural net-
work would suffer from.
This opened a line of work wherein cognitive
theories of inflection are analyzed by implement-
ing them as computational models and comparing
their behavior to that of humans. A famous study
in the area of morphology is the wug test (Berko,
1958), where human participants are prompted with
a novel-to-them nonce word and asked to produce
its plural form. Similarly, morphological inflection
models are generally evaluated on words they have
not seen during training. However, since models
are evaluated on real words, it is impossible to
meaningfully ask a native speaker, who already
knows those words' inflected forms, how likely
different plausible inflections of the words in a
model's test data are. Thus, in order to compare
the behavior of humans and models on words
unknown to both, prior work has created sets of
made-up nonce words (Marcus et al., 1995;
Albright and Hayes, 2003).

[Figure 1 here: nodes for R&M, M95, P&P, K&C, A&H,
Corkery et al., McCurdy et al., Dankers et al., and Ours.]
Figure 1: Summary of the past tense debate as it per-
tains to this work, color coded by evidence for (blue)
or against (red) neural networks as a cognitively plausi-
ble account for human behavior.
English Past Tense
English verbs inflect to ex-
press the past and present tense distinction. Most
verbs inflect for past tense by applying the /-d/,
/-ɪd/, or /-t/ suffix: allophones of the regular inflection
class. Some verbs, however, express the past tense
with a highly infrequent or completely unique in-
flection, forming the irregular inflection class. This
distinction between regular and irregular inflection
has motivated theories like the dual-route theory
described above.
Prasada and Pinker (1993) performed a wug test
for English past tense inflection in order to com-
pare the model from Rumelhart and McClelland
(1985) to humans with special attention to how
models behave with respect to regular vs. irregular
forms, finding that it could not account for human
generalizations. Albright and Hayes (2003, A&H)
gathered production probabilities – i.e., the normalized
frequencies of the inflected forms produced
by participants – and ratings – i.e., the average
rating assigned to a given past tense form on a
well-formedness scale. They then implemented
two computational models, one rule-based and one
analogy-based, and computed the correlation
between the probabilities of past tense forms for
nonce verbs under each model and according to hu-
mans. They found that the rule-based model more
accurately accounts for nonce word inflection.
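To make this methodology concrete, the following sketch (hypothetical data and variable names, not A&H's actual numbers; assumes SciPy is available) computes production probabilities for a single nonce verb and correlates them with model probabilities via Spearman's ρ, the statistic used throughout this line of work:

```python
from collections import Counter
from scipy.stats import spearmanr

def production_probabilities(responses):
    """Normalize the frequencies of the inflected forms that
    participants produced for a single nonce word."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {form: n / total for form, n in counts.items()}

# Hypothetical responses for a nonce verb; not A&H's actual data.
human_responses = ["rifed"] * 14 + ["rofe"] * 5 + ["rift"]
human_probs = production_probabilities(human_responses)

# Hypothetical model probabilities for the same candidate forms.
model_probs = {"rifed": 0.81, "rofe": 0.12, "rift": 0.02}

# Align both distributions over a shared candidate set and correlate.
candidates = sorted(human_probs)
rho, p = spearmanr([human_probs[c] for c in candidates],
                   [model_probs.get(c, 0.0) for c in candidates])
print(f"Spearman's rho = {rho:.2f}")
```

In the studies discussed here, the correlation is of course computed over many nonce words and candidate forms at once, often per inflection class, rather than for a single item.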
After several years of progress for neural net-
works, including state-of-the-art results on morpho-
logical inflection (Kann and Schütze, 2016;
Cotterell et al., 2016), this debate was revisited by
Kirov and Cotterell (2018, K&C), who examined
modern neural networks. They trained a bidirec-
tional LSTM (Hochreiter and Schmidhuber, 1997)
with attention (Bahdanau et al., 2015) on English
past tense inflection and, in experiments quantifying
model accuracy on a held-out set of real English
verbs, showed that it addresses many of the
shortcomings pointed out by Pinker and Prince
(1988). They concluded that the LSTM is, in fact,
capable of modeling English past tense inflection.
They also applied the model to the wug experiment
from A&H and found a positive correlation with
human production probabilities that was slightly
higher than the rule-based model from A&H.
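Concretely, the probability a sequence-to-sequence model assigns to a candidate form is the product of its per-character decoder probabilities. A minimal sketch with toy, hand-picked per-step distributions (not from any trained model) illustrates the scoring:

```python
import math

def sequence_log_probability(step_probs, target):
    """Score a candidate form as the sum of per-character
    log-probabilities from the decoder (teacher forcing):
    log p(target | source) = sum_t log p(char_t | source, char_<t)."""
    assert len(step_probs) == len(target)
    return sum(math.log(p[c]) for p, c in zip(step_probs, target))

# Toy per-step distributions for the target "cried" given the
# source "PST c r y" (hypothetical numbers).
steps = [{"c": 0.95, "k": 0.05},
         {"r": 0.99, "l": 0.01},
         {"i": 0.60, "y": 0.40},
         {"e": 0.90, "d": 0.10},
         {"d": 0.97, "t": 0.03}]
log_p = sequence_log_probability(steps, "cried")
print(math.exp(log_p))  # probability of the full inflected form
```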
Corkery et al. (2019, C&al.) reproduced this ex-
periment and additionally compared to the average
human rating that each past tense form received
in A&H’s dataset. They found that the neural net-
work from K&C produced probabilities that were
sensitive to random initialization – showing high
variance in the resulting correlations with humans –
and typically did not correlate better than the rule-
based model from A&H. They then designed an
experiment where inflected forms were sampled
from several different randomly initialized mod-
els, so that the frequencies of each form could be
aggregated in a similar fashion to the adult pro-
duction probabilities – but the results still favored
A&H. They hypothesized that the model’s overcon-
fidence in the most likely inflection (i.e. the regular
inflection class) leads to uncharacteristically low
variance on predictions for unknown words.
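A minimal sketch of that aggregation procedure, under the assumption that we can draw samples from each trained model's output distribution (the toy dictionaries and the nonce word "gleed" below are hypothetical stand-ins for real models and stimuli), might look like:

```python
import random
from collections import Counter

def sample_inflection(model, lemma):
    """Stand-in for sampling one output sequence from a model's
    decoder; here, a toy distribution over candidate forms."""
    forms, weights = zip(*model[lemma].items())
    return random.choices(forms, weights=weights)[0]

def aggregate_productions(models, lemma, n_samples=10):
    """Pool sampled inflections across independently seeded models
    and normalize, mimicking human production probabilities."""
    counts = Counter()
    for model in models:
        for _ in range(n_samples):
            counts[sample_inflection(model, lemma)] += 1
    total = sum(counts.values())
    return {form: n / total for form, n in counts.items()}

# Toy stand-ins for models trained with different random seeds.
seeds = [{"gleed": {"gleeded": 0.9, "gled": 0.1}},
         {"gleed": {"gleeded": 0.7, "gled": 0.3}}]
print(aggregate_productions(seeds, "gleed"))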
German Noun Plural
McCurdy et al. (2020a,
M&al.) applied an LSTM to the task of German
noun plural inflection to investigate a hypothesis
from Marcus et al. (1995, M95), who attributed the
outputs of neural models to their susceptibility to
the most frequent pattern observed during training,
stressing that, as a result, neural approaches fail to
learn patterns of infrequent groups.
German nouns inflect for the plural and singular
distinction. There are five suffixes, none of which
constitutes a regular majority: /-(e)n/, /-e/, /-er/,
/-s/, and /-∅/. M95 built a dataset of monosyl-
labic German noun wugs and investigated human
behavior when inflecting the plural form, distin-
guishing between phonologically familiar environ-
ments (rhymes), and unfamiliar ones (non-rhymes).
The German plural system, they argued, was an
important test for neural networks since it presents
multiple productive inflection rules, all of which
are minority inflection classes by frequency. This
is in contrast to the dichotomy of the regular and
irregular English past tense. M&al. collected their
own human production probabilities and ratings
for these wugs, and then compared those to LSTM
productions. Humans were prompted with each
wug with the neuter determiner to control for the
fact that neural inflection models of German noun
plurals are sensitive to grammatical gender (Goebel
and Indefrey, 2000), and because humans do not
have a majority preference for monosyllabic, neuter
nouns (Clahsen et al., 1992).
The /-s/ inflection class, while highly infrequent,
appears in a wide range of phonological contexts,
which has led some researchers to suggest that it
is the default class for German noun plurals, and
thus the regular inflection, despite its infrequent
use. M&al. found that humans preferred it for
non-rhymes more than for rhymes,
but the LSTM model showed the opposite pref-
erence, undermining the hypothesis that LSTMs
model human generalization behavior. /-s/ was ad-
ditionally predicted less accurately on a held-out
test set of real noun inflections when compared to
other inflection classes.
They found that the most frequent inflection
class in the training data for monosyllabic neuter
contexts, /-e/, was over-generalized by the LSTM
when compared to human productions. The most
frequent class overall, /-(e)n/ (but infrequent in the
neuter context), was applied by humans quite fre-
quently to nonce nouns, but rarely by the LSTM.
They additionally found that /-er/, which is as in-
frequent as /-s/, could be accurately predicted in
the test set, and the null inflection /-
/, which is
generally frequent, but extremely rare in the mono-
syllabic, neuter setting was never predicted for the
wugs. We refer to McCurdy et al. (2020a) for more
details on the inflection classes and their frequen-
cies, and additional discussion around their rele-
vance to inflection behavior.
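Analyses like these require assigning each predicted plural to one of the five inflection classes. A minimal string-based sketch (simplified: it ignores umlaut changes in the stem, which a real analysis must handle) could look like:

```python
def plural_class(singular: str, plural: str) -> str:
    """Assign a predicted German plural to one of the five suffix
    classes by comparing it to the singular form. Longer suffixes
    are checked first so that, e.g., -er is not mistaken for -e."""
    if plural == singular:
        return "zero"          # /-∅/: Fenster -> Fenster
    for suffix, label in (("en", "-(e)n"), ("n", "-(e)n"),
                          ("er", "-er"), ("e", "-e"), ("s", "-s")):
        if plural == singular + suffix:
            return label
    return "other"             # stem change or unanalyzable output

print(plural_class("Kind", "Kinder"))   # -er
print(plural_class("Auto", "Autos"))    # -s
```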
Ultimately, M&al. reported no correlation with
human production probabilities for any inflection
class. They concluded that modern neural networks
still simply generalize the most frequent patterns to
unfamiliar inputs.
Dankers et al. (2021) performed in-depth behav-
ioral and structural analyses of German noun plural
inflection by a unidirectional LSTM without atten-
tion. They argued that these modeling decisions
made a more plausible model of human cognition.
In a behavioral test they found that, like humans but
unlike the LSTM of M&al., their model did predict
/-s/ more for non-rhymes than for rhymes, but the result was not
statistically significant. They also found that /-s/
was applied with a high frequency and attributed
this to sensitivity to word length. For a visual of all
studies discussed in this section, see Figure 1.
Our Contribution
Most work on modern neural
networks discussed here analyzes the same bidirec-
tional LSTM with attention and draws a mixture of
conclusions based on differing experimental setups.
Dankers et al. (2021) changed the LSTM-based
architecture, and found somewhat different results
for German number inflection, though they did not
investigate correlations with human ratings or pro-
duction probabilities in the same way as previous
work. The limited variation of architectures in pre-
vious studies as well as inconsistent methods of
comparison with human behavior prevent us from
drawing definite conclusions about the adequacy
of neural networks as models of human inflection.
Here, we present results on a wider range of
LSTMs and a Transformer (Vaswani et al., 2017)
model for both English past tense and German num-
ber inflection. We ask which architecture is the
best account for human inflection behavior and,
following M&al., investigate the actual model pro-
ductions (and probabilities) for the German plural
classes in order to qualitatively compare to human
behavior. We additionally ask how architectural de-
cisions for the LSTM encoder-decoder affect this
correlation. Finally, we investigate the relation-
ship between inflection accuracy on the test set and
correlation with human wug ratings.
We find that the Transformer consistently correlates
best with human ratings, producing probabilities
that result in Spearman's ρ in the range of
0.47-0.71 for several inflection classes, frequently
higher than LSTMs. However, looking closely at
the Transformer's productions, it displays behavior
that deviates from humans similarly
to the LSTM in M&al., though to a lesser extent.
While attention greatly increases LSTM accuracy
on inflection, we also find that it does not always
lead to better correlations with human wug ratings,
and that the directionality of the encoder has more
complicated implications. Finally, we find that
there is no clear relationship between model accu-
racy and correlation with human ratings across all
experiments, demonstrating that neural networks
can solve the inflection task in its current setup
without learning human-like distributions. While
the Transformer experiment in this work demon-
strates stronger correlations with human behavior,
and some more human-like behaviors than before,
our findings continue to cast doubt on the cognitive
plausibility of neural networks for inflection.
2 Neural Morphological Inflection
2.1 Task Description
The experiments in this paper are centered around
a natural language processing (NLP) task called
morphological inflection, which consists of gener-
ating an inflected form for a given lemma and set of
morphological features indicating the target form.
It is typically cast as a character-level sequence-to-
sequence task, where the characters of the lemma
and the morphological features constitute the input,
while the characters of the target inflected form are
the output (Kann and Schütze, 2016):

PST c r y → c r i e d
Formally, let $S$ be the paradigm slots expressed
in a language and $l$ a lemma in the language. The
set of all inflected forms – or paradigm $\pi$ – of $l$ is
then defined as:

$$\pi(l) = \{\langle f_k[l], t_k\rangle\}_{k \in S} \quad (1)$$

$f_k[l]$ denotes the inflection of $l$ which expresses tag
$t_k$, and $l$ and $f_k[l]$ represent strings consisting of
letters from the language's alphabet $\Sigma$.

The task of morphological inflection can then
formally be described as predicting the form $f_i[l]$
from the paradigm of $l$ corresponding to tag $t_i$.
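As an illustration, a minimal sketch of how a single training example might be encoded for such a character-level model (hypothetical function and format; actual systems differ in the details):

```python
def encode_example(lemma: str, tag: str, target: str):
    """Build a (source, target) pair for a character-level
    sequence-to-sequence inflection model: the morphological
    tag is prepended to the lemma's characters as one token."""
    source = [tag] + list(lemma)
    return source, list(target)

src, tgt = encode_example("cry", "PST", "cried")
print(src)  # ['PST', 'c', 'r', 'y']
print(tgt)  # ['c', 'r', 'i', 'e', 'd']
```

The model is then trained to map the source sequence to the target sequence one character at a time.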
2.2 Models
Rumelhart and McClelland
The original
model of Rumelhart and McClelland (1985)