Discourse Context Predictability Effects in Hindi Word Order
Sidharth Ranjan
IIT Delhi
sidharth.ranjan03@gmail.com
Marten van Schijndel
Cornell University
mv443@cornell.edu
Sumeet Agarwal
IIT Delhi
sumeet@iitd.ac.in
Rajakrishnan Rajkumar
IISER Bhopal
rajak@iiserb.ac.in
Abstract
We test the hypothesis that discourse pre-
dictability influences Hindi syntactic choice.
While prior work has shown that a num-
ber of factors (e.g., information status, de-
pendency length, and syntactic surprisal) in-
fluence Hindi word order preferences, the
role of discourse predictability is underex-
plored in the literature. Inspired by prior
work on syntactic priming, we investigate how
the words and syntactic structures in a sen-
tence influence the word order of the follow-
ing sentences. Specifically, we extract sen-
tences from the Hindi-Urdu Treebank corpus
(HUTB), permute the preverbal constituents
of those sentences, and build a classifier to distinguish the sentences that actually occurred in the corpus from artificially generated distractors. The classifier uses a number of discourse-
based features and cognitive features to make
its predictions, including dependency length,
surprisal, and information status. We find that
information status and LSTM-based discourse
predictability influence word order choices, es-
pecially for non-canonical object-fronted or-
ders. We conclude by situating our results
within the broader syntactic priming literature.
1 Introduction
Grammars of natural languages have evolved over
time to factor in cognitive pressures related to pro-
duction (Hawkins,1994,2000) and comprehen-
sion (Hawkins,2004,2014), learnability (Chris-
tiansen and Chater,2008) and communicative effi-
ciency (Jaeger and Tily,2011;Gibson et al.,2019).
In this work, we test the hypothesis that maximiza-
tion of discourse predictability (quantified using
lexical repetition surprisal and adaptive LSTM sur-
prisal) is a significant predictor of Hindi syntactic
choice, when controlling for information status, de-
pendency length, and surprisal measures estimated
from n-gram, LSTM, and incremental constituency parsing models.
Our hypothesis is inspired by a solid body of evi-
dence from studies based on dependency treebanks
of typologically diverse languages which show that
grammars of languages tend to order words by
minimizing dependency length (Liu,2008;Futrell
et al.,2015) and maximizing their trigram pre-
dictability (Gildea and Jaeger,2015). Parallel to
this line of work on sentence-level word order, an-
other strand of work has focused on discourse-level
estimates of entropy starting from the Constant En-
tropy Rate hypothesis (CER; Genzel and Charniak,
2002). To overcome the major difficulty of esti-
mating sentence probabilities conditioned on the
previous discourse context, Qian and Jaeger (2012)
approximated discourse-level entropy using lexi-
cal cues from the previous context. In contrast,
we leverage modern computational psycholinguis-
tic neural techniques to obtain word and sentence-
level estimates of inter-sentential discourse pre-
dictability and study the impact of these measures
on Hindi word order choices. We conclude that
discourse-level priming influences Hindi word or-
der decisions and interpret our findings in the light
of the factors outlined by Reitter et al. (2011).
Hindi (Indo-Aryan language; Indo-European
language family) has a rich case-marking system
and flexible word order, though it mainly follows
SOV word order (Kachru,2006) as exemplified
below.
(1) a. amar ujala-ko    yah  sukravar-ko  daak-se    prapt    hua
       Amar Ujala-ACC   it   friday-on    post-INST  receive  be.PST.SG
       'Amar Ujala received it by post on Friday.'
    b. yah amar ujala-ko sukravar-ko daak-se prapt hua
    c. sukravar-ko yah amar ujala-ko daak-se prapt hua
To test ordering preferences, we generated
meaning-equivalent grammatical variants (Exam-
ples 1b and 1c above) of reference sentences (Ex-
ample 1a) from the Hindi-Urdu Treebank corpus
of written text (HUTB; Bhatt et al.,2009) by per-
muting their preverbal constituent ordering. Sub-
sequently, we used a logistic regression model to
distinguish the original reference sentences from
the plausible variants based on a set of cognitive
predictors. We test whether fine-tuning a neural
language model on preceding sentences improves
predictions of preverbal Hindi constituent order in
later sentences over other cognitive control mea-
sures. The motivation for our fine-tuning method is
that, during reading, encountering a syntactic struc-
ture eases the comprehension of subsequent sen-
tences with similar syntactic structures as attested
in a wide variety of languages (Arai et al.,2007;
Tooley and Traxler,2010) including Hindi (Husain
and Yadav,2020). Our cognitive control factors are
motivated by recent works which show that Hindi
optimizes processing efficiency by minimizing lex-
ical and syntactic surprisal (Ranjan et al.,2019)
and dependency length (Ranjan et al.,2022a) at
the sentence level.
Our results indicate that discourse predictabil-
ity is maximized by reference sentences compared
with alternative orderings, indicating that discourse
predictability influences Hindi word-order prefer-
ences. This finding corroborates previous find-
ings of adaptation/priming in comprehension (Fine
et al.,2013;Fine and Jaeger,2016) and produc-
tion (Gries,2005;Bock,1986). Generally, this
effect is influenced by lexical priming, but we also
find that certain object-fronted constructions prime
subsequent object-fronting, providing evidence
for self-priming of larger syntactic configurations.
With the introduction of neural model surprisal
scores, dependency length minimization effects re-
ported to influence Hindi word order choices in
previous work (Ranjan et al.,2022a) disappear ex-
cept in the case of direct object fronting, which we
interpret as evidence for the Information Locality
Hypothesis (Futrell et al.,2020). Finally, we dis-
cuss the implications of our findings for syntactic
priming in both comprehension and production.
Our main contribution is that we show the im-
pact of discourse predictability on word order
choices using modern computational methods and
naturally occurring data (as opposed to carefully
controlled stimuli in behavioural experiments).
Cross-linguistic evidence is imperative to validate
theories of language processing (Jaeger and Nor-
cliffe,2009), and in this work we extend existing
theories of how humans prioritize word order deci-
sions to Hindi.
2 Background
2.1 Surprisal Theory
Surprisal Theory (Hale,2001;Levy,2008) posits
that comprehenders construct probabilistic inter-
pretations of sentences based on previously encoun-
tered structures. Mathematically, the surprisal of
the kth word, $w_k$, is defined as the negative log probability of $w_k$ given the preceding context:

$$S_k = -\log P(w_k \mid w_{1 \dots k-1}) = \log \frac{P(w_1 \dots w_{k-1})}{P(w_1 \dots w_k)} \quad (1)$$

These probabilities can be computed either over word sequences or syntactic configurations and reflect the information load (or predictability) of $w_k$. High surprisal is correlated with longer read-
ing times (Levy,2008;Demberg and Keller,2008;
Staub,2015) as well as longer spontaneous spoken
word durations (Demberg et al.,2012;Dammalap-
ati et al.,2021). Lexical predictability estimated us-
ing n-gram language models is one of the strongest
determinants of word-order preferences in both En-
glish (Rajkumar et al.,2016) and Hindi (Ranjan
et al.,2022a,2019;Jain et al.,2018).
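As a concrete illustration of Equation 1, the following minimal Python sketch sums word-level surprisals into the kind of sentence-level surprisal measure used later in Section 3.1. The cond_prob function is a hypothetical stand-in for any conditional language model (n-gram, PCFG, or LSTM); it is not the SRILM or LSTM tooling actually used in this paper.

```python
import math
from typing import Callable, Sequence

def sentence_surprisal(words: Sequence[str],
                       cond_prob: Callable[[Sequence[str], str], float]) -> float:
    """Sum of word-level surprisals S_k = -log P(w_k | w_1..w_{k-1}) over a sentence.

    `cond_prob(context, word)` is a hypothetical stand-in for any language
    model (n-gram, PCFG, LSTM) returning P(word | context).
    """
    total = 0.0
    for k, word in enumerate(words):
        p = cond_prob(words[:k], word)   # P(w_k | preceding words)
        total += -math.log2(p)           # surprisal of w_k, in bits
    return total

# Toy usage with a uniform "model" over a 1000-word vocabulary:
uniform = lambda context, word: 1.0 / 1000
print(sentence_surprisal("amar ujala-ko yah sukravar-ko daak-se prapt hua".split(), uniform))
```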
2.2 Dependency Locality Theory
Dependency locality theory (Gibson,2000) has
been shown to be effective at predicting the com-
prehension difficulty of a sequence, with shorter de-
pendencies generally being easier to process than
longer ones (Temperley,2007;Futrell et al.,2015;
Liu et al.,2017, cf. Demberg and Keller,2008).
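To make the locality measure concrete, the sketch below computes the sentence-level dependency length used as a predictor in Section 3.1, summing head-dependent distances counted as intervening words, as in this paper. The flat head-index representation is an illustrative simplification, not the HUTB tree format.

```python
def total_dependency_length(heads: list[int]) -> int:
    """Sum of head-dependent distances over a sentence.

    `heads[i]` gives the 0-indexed position of the head of word i, or -1 for
    the root. Each distance is counted as the number of words intervening
    between a head and its dependent.
    """
    total = 0
    for dep, head in enumerate(heads):
        if head == -1:                     # the root has no incoming dependency
            continue
        total += abs(head - dep) - 1       # intervening words between head and dependent
    return total

# Toy example: a five-word clause whose final verb (index 4) heads the rest.
print(total_dependency_length([4, 4, 4, 4, -1]))   # 3 + 2 + 1 + 0 = 6
```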
3 Data and Models
Our dataset comprises 1996 reference sentences
containing well-defined subject and object con-
stituents from the HUTB corpus of dependency trees (Bhatt et al., 2009; https://verbs.colorado.edu/hindiurdu/). The HUTB corpus,
which belongs to the newswire domain and con-
tains written text in a natural discourse context,
is a human-annotated, multi-representational, and
multi-layered treebank. The dependency trees here assume Panini's grammatical model, in which each sentence is represented as a series of modifier-modified elements (Bharati et al., 2002; Sangal et al., 1995). Each tree in the HUTB corpus represents the words of a sentence as nodes, with head words (the modified) linked to dependent words (the modifiers) via labelled links denoting the grammatical relationship between word pairs.
For each reference sentence in the HUTB cor-
pus, we created artificial variants by permuting the
preverbal constituents whose heads were linked to
the root node in the dependency tree. Inspired by
grammar rules proposed in the NLG literature (Ra-
jkumar and White,2014), ungrammatical variants
were automatically filtered out by detecting depen-
dency relation sequences not attested in the origi-
nal HUTB corpus. After filtering, we had 72833
variant sentences for our classification task. Fig-
ure 1 in Appendix A displays the dependency tree
for Example sentence 1a and explains our variant
generation procedure in more detail.
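The following is a heavily simplified sketch of the variant-generation idea under stated assumptions: preverbal constituents attached to the root are permuted, and orders whose dependency-relation sequences are not attested in the corpus are filtered out. The data structures and relation labels are illustrative placeholders, not the actual HUTB annotation or the filtering rules of Rajkumar and White (2014).

```python
from itertools import permutations

def generate_variants(constituents, verb, attested_relation_seqs):
    """Permute preverbal constituents and keep only orders whose relation
    sequence is attested in the corpus (a proxy grammaticality filter).

    `constituents` is a list of (text, relation) pairs for constituents whose
    heads attach to the root; `verb` is the clause-final verbal complex;
    `attested_relation_seqs` is a set of relation-label tuples observed in
    the treebank. All of these are illustrative simplifications.
    """
    variants = []
    for order in permutations(constituents):
        rel_seq = tuple(rel for _, rel in order)
        if rel_seq not in attested_relation_seqs:
            continue                                   # drop unattested (likely ungrammatical) orders
        variants.append(" ".join(text for text, _ in order) + " " + verb)
    return variants

# Toy usage mirroring Example (1); the relation labels are made up.
consts = [("amar ujala-ko", "rel1"), ("yah", "rel2"), ("sukravar-ko", "rel3"), ("daak-se", "rel4")]
attested = {tuple(rel for _, rel in p) for p in permutations(consts)}   # permissive toy filter
print(len(generate_variants(consts, "prapt hua", attested)))            # 24 orders in this toy setup
```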
To determine whether the original word order
(i.e. the reference sentence) is preferred to the per-
muted word orders (i.e. the variant sentences), we
conducted a targeted human evaluation via a forced-choice rating task and collected sentence judg-
ments from 12 Hindi native speakers for 167 ran-
domly selected reference-variant pairs in our data
set. Participants were first shown the preceding
sentence, and then they were asked to select the
best continuation between either the reference or
the variant. We found that 89.92% of the reference
sentences which originally appeared in the HUTB
corpus were also preferred by native speakers com-
pared to the artificially generated grammatical vari-
ants expressing similar meaning (further details are provided in Appendix G). Therefore, in our
analyses we treat the HUTB reference sentences
as human-preferred gold orderings compared with
other possible automatically-generated constituent
orderings.
3.1 Models
We set up a binary classification task to separate
the original HUTB reference sentences from the
variants using the cognitive metrics described in
Section 2. To alleviate the data imbalance between
the two classes (1996 references vs 72833 variants),
we transformed our data set using the approach
described in Joachims (2002). This technique con-
verts a binary classification problem into a pair-
wise ranking task by training the classifier on the
difference of the feature vectors of each reference
and its corresponding variants (see Equations 2 and 3). Equation 2 displays the objective of a standard binary classifier, where the classifier must learn a feature weight (w) such that the dot product of w with the reference feature vector (φ(reference)) is greater than the dot product of w with the variant feature vector (φ(variant)). This objective can be rewritten as Equation 3 such that the dot product of w with the difference of the feature vectors is greater than zero.

$$w \cdot \phi(\text{reference}) > w \cdot \phi(\text{variant}) \quad (2)$$

$$w \cdot (\phi(\text{reference}) - \phi(\text{variant})) > 0 \quad (3)$$
Every variant sentence in our dataset was paired
with its corresponding reference sentence with or-
der balanced across these pairings (e.g., Example 1
would yield (1a,1b) and (1c,1a)). Thereafter, their
feature vectors were subtracted (e.g., 1a-1b and
1c-1a), and binary labels were assigned to each
transformed data point. Reference-Variant pairs
were coded as "1" and Variant-Reference pairs were coded as "0". The alternate pair ordering thus re-balanced our previously severely imbalanced classification task. Table 5 in Appendix D
illustrates the original and transformed values of
the independent variables.
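A minimal sketch of this pairwise transformation and the downstream classifier is given below, using NumPy and scikit-learn rather than the R GLM implementation reported later in this section; the three-dimensional feature vectors are illustrative values, not features extracted from HUTB.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pairwise_transform(ref_features, variant_features):
    """Joachims (2002)-style transformation for one reference sentence.

    Alternates the pair order so the transformed data are balanced:
    (reference - variant) is labelled 1, (variant - reference) is labelled 0.
    """
    X, y = [], []
    for i, var in enumerate(variant_features):
        if i % 2 == 0:
            X.append(ref_features - var); y.append(1)   # Reference-Variant pair
        else:
            X.append(var - ref_features); y.append(0)   # Variant-Reference pair
    return np.array(X), np.array(y)

# Toy usage: illustrative 3-dimensional feature vectors (e.g. dependency
# length, trigram surprisal, IS score).
ref = np.array([10.0, 55.2, 1.0])
variants = [np.array([12.0, 57.1, 0.0]), np.array([11.0, 58.3, -1.0])]
X, y = pairwise_transform(ref, variants)
clf = LogisticRegression(fit_intercept=False).fit(X, y)   # no intercept, mirroring Equation 3
print(clf.coef_)
```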
For each reference sentence, our objective was
to model the possible syntactic choices entertained
by the speaker. In each instance, the author chose
to generate the reference order over the variant,
implicitly demonstrating an order preference. If
the cognitive factors in our study influenced that
decision, a logistic regression model should be
able to use those factors to predict which syntactic
choice was ultimately chosen by the author. Us-
ing the transformed features dataset labelled with
1 (denoting a preference for the reference order)
and 0 (denoting a preference for the variant order),
we trained a logistic regression model to predict
each reference sentence (see Equation 4). We re-
port our classification results using 10-fold cross-
validation. The regression results are reported on
the entire transformed test data for the respective
experiments. All experiments were done with the
Generalized Linear Model (GLM) package in R.
choice ∼ δ dependency length + δ trigram surprisal + δ PCFG surprisal + δ IS score + δ lexical repetition surprisal + δ LSTM surprisal + δ adaptive LSTM surprisal    (4)

Here, choice is encoded by the binary dependent variable discussed above (1: reference preference; 0: variant preference). To obtain sentence-
level surprisal measures, we summed word-level
surprisal of all the words in each sentence. The
values for independent variables were calculated
as follows.
1. Dependency length: We computed a
sentence-level dependency length measure by
summing the head-dependent distances (mea-
sured as the number of intervening words) in
the HUTB reference and variant dependency
trees.
2. Trigram surprisal: For each word in a sen-
tence, we estimated its local predictability us-
ing a 3-gram language model (LM) trained
on the written section of the EMILLE Hindi
Corpus (Baker et al.,2002), which consists
of 1 million mixed genre sentences, using the
SRILM toolkit (Stolcke,2002) with Good-
Turing discounting.
3. PCFG surprisal: The syntactic predictability of each word in a sentence was estimated using the Berkeley latent-variable PCFG parser² (Petrov et al., 2006). A total of 12,000 phrase structure trees were created to train the parser by converting Bhatt et al.'s HUTB de-
pendency trees into constituency trees using
the approach described in Yadav et al. (2017).
Sentence level log-likelihood of each test sen-
tence was estimated by training a PCFG LM
on four folds of the phrase structure trees and
then testing on a fifth held-out fold.
² 5-fold cross-validated parser training and testing F1-score metrics were 90.82% and 84.95%, respectively.
4. Information status (IS) score: We automatically annotated whether each sentence exhibited given-new ordering. The subject and object constituents in a sentence were assigned a Given tag if their head was a pronoun or if any content word within them was mentioned in the preceding sentence. All other phrases were tagged as New. For each sentence, the IS score was computed as follows: a) Given-New order = +1, b) New-Given order = -1, c) Given-Given and New-New = 0. For an illustration of givenness coding, see Example 3 in Appendix A and the description in Appendix B.
5. Lexical repetition surprisal: For each word in a sentence, we accounted for lexical priming by interpolating a 3-gram language model with a unigram cache LM based on the history of words (H = 100) containing only the preceding sentence. We used the original implementation provided in the SRILM toolkit with a default interpolation weight parameter (µ = 0.05; see Equations 5 and 6) based on
the approach described by Kuhn and De Mori
(1990). The idea is to keep a count of recently
occurring words in the sentence history and
then boost their probability within the trigram
language model. Words that have occurred re-
cently in the text are likely to re-occur³ in sub-
sequent sentences (Kuhn and De Mori,1990;
Clarkson and Robinson,1997).
$$P(w_k \mid w_1, w_2, \dots, w_{k-1}) = \mu\, P_{\text{cache}}(w_k \mid w_1, w_2, \dots, w_{k-1}) + (1 - \mu)\, P_{\text{trigram}}(w_k \mid w_{k-2}, w_{k-1}) \quad (5)$$

$$P_{\text{cache}}(w_k \mid w_{k-H}, w_{k-H+1}, \dots, w_{k-1}) = \frac{\mathrm{count}_{\text{cache}}(w_k)}{H} \quad (6)$$

(A minimal code sketch of this interpolation scheme is given at the end of this section.)
6. LSTM surprisal: We estimated the pre-
dictability for each word according to the en-
tire sentence prefix using a long short-term
memory language model (LSTM; Hochre-
iter and Schmidhuber,1997) trained on the 1
million written sentences from the EMILLE
Hindi corpus (Baker et al.,2002). We used
the LSTM implementation provided in the
³ Out of the 13274 sentences present in HUTB, 71.20% of sentences contained at least one content word previously mentioned in the preceding sentence (Jain et al., 2018).
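As a concrete illustration of the cache interpolation in Equations 5 and 6 above, here is a minimal Python sketch; the trigram_prob stub is a hypothetical stand-in for the SRILM trigram model, not its actual implementation.

```python
from collections import Counter

def cache_interpolated_prob(word, history, trigram_prob, mu=0.05, H=100):
    """Interpolate a unigram cache LM with a trigram LM (Equations 5 and 6).

    `history` holds the preceding words (the previous sentence in this
    paper's setup); `trigram_prob(word, context)` stands in for a trained
    trigram model such as the SRILM model described above.
    """
    cache = history[-H:]                                 # keep at most the last H words
    p_cache = Counter(cache)[word] / H                   # Equation (6)
    context = tuple(history[-2:])                        # trigram context (w_{k-2}, w_{k-1})
    return mu * p_cache + (1 - mu) * trigram_prob(word, context)   # Equation (5)

# Toy usage with a uniform trigram stub over a 1000-word vocabulary:
stub = lambda word, context: 1.0 / 1000
print(cache_interpolated_prob("daak-se", ["amar", "ujala-ko", "yah", "daak-se"], stub))
```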