Discourse Context Predictability Effects in Hindi Word Order
Sidharth Ranjan
IIT Delhi
sidharth.ranjan03@gmail.com
Marten van Schijndel
Cornell University
mv443@cornell.edu
Sumeet Agarwal
IIT Delhi
sumeet@iitd.ac.in
Rajakrishnan Rajkumar
IISER Bhopal
rajak@iiserb.ac.in
Abstract
We test the hypothesis that discourse pre-
dictability influences Hindi syntactic choice.
While prior work has shown that a num-
ber of factors (e.g., information status, de-
pendency length, and syntactic surprisal) in-
fluence Hindi word order preferences, the
role of discourse predictability is underex-
plored in the literature. Inspired by prior
work on syntactic priming, we investigate how
the words and syntactic structures in a sen-
tence influence the word order of the follow-
ing sentences. Specifically, we extract sen-
tences from the Hindi-Urdu Treebank corpus
(HUTB), permute the preverbal constituents
of those sentences, and build a classifier to distinguish the sentences that actually occurred in the corpus from artificially generated distractors. The classifier uses a number of discourse-
based features and cognitive features to make
its predictions, including dependency length,
surprisal, and information status. We find that
information status and LSTM-based discourse
predictability influence word order choices, es-
pecially for non-canonical object-fronted or-
ders. We conclude by situating our results
within the broader syntactic priming literature.
1 Introduction
Grammars of natural languages have evolved over
time to factor in cognitive pressures related to pro-
duction (Hawkins,1994,2000) and comprehen-
sion (Hawkins,2004,2014), learnability (Chris-
tiansen and Chater,2008) and communicative effi-
ciency (Jaeger and Tily,2011;Gibson et al.,2019).
In this work, we test the hypothesis that maximiza-
tion of discourse predictability (quantified using
lexical repetition surprisal and adaptive LSTM sur-
prisal) is a significant predictor of Hindi syntactic
choice, when controlling for information status, de-
pendency length, and surprisal measures estimated
from n-gram, LSTM, and incremental constituency parsing models.
Our hypothesis is inspired by a solid body of evi-
dence from studies based on dependency treebanks
of typologically diverse languages which show that
grammars of languages tend to order words by
minimizing dependency length (Liu,2008;Futrell
et al.,2015) and maximizing their trigram pre-
dictability (Gildea and Jaeger,2015). Parallel to
this line of work on sentence-level word order, an-
other strand of work has focused on discourse-level
estimates of entropy starting from the Constant En-
tropy Rate hypothesis (CER; Genzel and Charniak,
2002). To overcome the major difficulty of esti-
mating sentence probabilities conditioned on the
previous discourse context, Qian and Jaeger (2012)
approximated discourse-level entropy using lexi-
cal cues from the previous context. In contrast,
we leverage modern computational psycholinguis-
tic neural techniques to obtain word and sentence-
level estimates of inter-sentential discourse pre-
dictability and study the impact of these measures
on Hindi word order choices. We conclude that
discourse-level priming influences Hindi word or-
der decisions and interpret our findings in the light
of the factors outlined by Reitter et al. (2011).
Hindi (Indo-Aryan language; Indo-European
language family) has a rich case-marking system
and flexible word order, though it mainly follows
SOV word order (Kachru,2006) as exemplified
below.
(1) a. amar ujala-ko    yah  sukravar-ko  daak-se    prapt    hua
       Amar Ujala-ACC   it   friday-on    post-INST  receive  be.PST.SG
       'Amar Ujala received it by post on Friday.'
    b. yah amar ujala-ko sukravar-ko daak-se prapt hua
    c. sukravar-ko yah amar ujala-ko daak-se prapt hua
To test ordering preferences, we generated
meaning-equivalent grammatical variants (Exam-
ples 1b and 1c above) of reference sentences (Ex-
ample 1a) from the Hindi-Urdu Treebank corpus
of written text (HUTB; Bhatt et al.,2009) by per-
muting their preverbal constituent ordering. Sub-
sequently, we used a logistic regression model to
distinguish the original reference sentences from
the plausible variants based on a set of cognitive
predictors. We test whether fine-tuning a neural
language model on preceding sentences improves
predictions of preverbal Hindi constituent order in
later sentences over other cognitive control mea-
sures. The motivation for our fine-tuning method is
that, during reading, encountering a syntactic struc-
ture eases the comprehension of subsequent sen-
tences with similar syntactic structures as attested
in a wide variety of languages (Arai et al.,2007;
Tooley and Traxler,2010) including Hindi (Husain
and Yadav,2020). Our cognitive control factors are
motivated by recent works which show that Hindi
optimizes processing efficiency by minimizing lex-
ical and syntactic surprisal (Ranjan et al.,2019)
and dependency length (Ranjan et al.,2022a) at
the sentence level.
Our results indicate that discourse predictabil-
ity is maximized by reference sentences compared
with alternative orderings, indicating that discourse
predictability influences Hindi word-order prefer-
ences. This finding corroborates previous find-
ings of adaptation/priming in comprehension (Fine
et al.,2013;Fine and Jaeger,2016) and produc-
tion (Gries,2005;Bock,1986). Generally, this
effect is influenced by lexical priming, but we also
find that certain object-fronted constructions prime
subsequent object-fronting, providing evidence
for self-priming of larger syntactic configurations.
With the introduction of neural model surprisal
scores, dependency length minimization effects re-
ported to influence Hindi word order choices in
previous work (Ranjan et al.,2022a) disappear ex-
cept in the case of direct object fronting, which we
interpret as evidence for the Information Locality
Hypothesis (Futrell et al.,2020). Finally, we dis-
cuss the implications of our findings for syntactic
priming in both comprehension and production.
Our main contribution is that we show the im-
pact of discourse predictability on word order
choices using modern computational methods and
naturally occurring data (as opposed to carefully
controlled stimuli in behavioural experiments).
Cross-linguistic evidence is imperative to validate
theories of language processing (Jaeger and Nor-
cliffe,2009), and in this work we extend existing
theories of how humans prioritize word order deci-
sions to Hindi.
2 Background
2.1 Surprisal Theory
Surprisal Theory (Hale,2001;Levy,2008) posits
that comprehenders construct probabilistic inter-
pretations of sentences based on previously encoun-
tered structures. Mathematically, the surprisal of
the kth word, $w_k$, is defined as the negative log probability of $w_k$ given the preceding context:

$$S_k = -\log P(w_k \mid w_{1 \dots k-1}) = \log \frac{P(w_1 \dots w_{k-1})}{P(w_1 \dots w_k)} \quad (1)$$

These probabilities can be computed either over word sequences or syntactic configurations and reflect the information load (or predictability) of $w_k$. High surprisal is correlated with longer read-
ing times (Levy,2008;Demberg and Keller,2008;
Staub,2015) as well as longer spontaneous spoken
word durations (Demberg et al.,2012;Dammalap-
ati et al.,2021). Lexical predictability estimated us-
ing n-gram language models is one of the strongest
determinants of word-order preferences in both En-
glish (Rajkumar et al.,2016) and Hindi (Ranjan
et al.,2022a,2019;Jain et al.,2018).
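As a concrete illustration of Equation 1, the following minimal Python sketch sums word-level surprisals into the kind of sentence-level surprisal measure used later in Section 3.1. The cond_prob function is a hypothetical stand-in for any conditional language model (n-gram, PCFG, or LSTM); it is not the SRILM or LSTM tooling actually used in this paper.

```python
import math
from typing import Callable, Sequence

def sentence_surprisal(words: Sequence[str],
                       cond_prob: Callable[[Sequence[str], str], float]) -> float:
    """Sum of word-level surprisals S_k = -log P(w_k | w_1..w_{k-1}) over a sentence.

    `cond_prob(context, word)` is a hypothetical stand-in for any language
    model (n-gram, PCFG, LSTM) returning P(word | context).
    """
    total = 0.0
    for k, word in enumerate(words):
        p = cond_prob(words[:k], word)   # P(w_k | preceding words)
        total += -math.log2(p)           # surprisal of w_k, in bits
    return total

# Toy usage with a uniform "model" over a 1000-word vocabulary:
uniform = lambda context, word: 1.0 / 1000
print(sentence_surprisal("amar ujala-ko yah sukravar-ko daak-se prapt hua".split(), uniform))
```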
2.2 Dependency Locality Theory
Dependency locality theory (Gibson,2000) has
been shown to be effective at predicting the com-
prehension difficulty of a sequence, with shorter de-
pendencies generally being easier to process than
longer ones (Temperley,2007;Futrell et al.,2015;
Liu et al.,2017, cf. Demberg and Keller,2008).
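To make the locality measure concrete, the sketch below computes the sentence-level dependency length used as a predictor in Section 3.1, summing head-dependent distances counted as intervening words, as in this paper. The flat head-index representation is an illustrative simplification, not the HUTB tree format.

```python
def total_dependency_length(heads: list[int]) -> int:
    """Sum of head-dependent distances over a sentence.

    `heads[i]` gives the 0-indexed position of the head of word i, or -1 for
    the root. Each distance is counted as the number of words intervening
    between a head and its dependent.
    """
    total = 0
    for dep, head in enumerate(heads):
        if head == -1:                     # the root has no incoming dependency
            continue
        total += abs(head - dep) - 1       # intervening words between head and dependent
    return total

# Toy example: a five-word clause whose final verb (index 4) heads the rest.
print(total_dependency_length([4, 4, 4, 4, -1]))   # 3 + 2 + 1 + 0 = 6
```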
3 Data and Models
Our dataset comprises 1996 reference sentences
containing well-defined subject and object con-
stituents from the HUTB corpus of dependency trees (Bhatt et al., 2009; https://verbs.colorado.edu/hindiurdu/). The HUTB corpus,
which belongs to the newswire domain and con-
tains written text in a natural discourse context,
is a human-annotated, multi-representational, and
multi-layered treebank. The dependency trees here assume Panini's grammatical model, in which each sentence is represented as a series of modifier-modified elements (Bharati et al., 2002; Sangal et al., 1995). Each tree in the HUTB corpus represents the words of a sentence as nodes, with head words (the modified) linked to dependent words (the modifiers) via labelled links denoting the grammatical relationship between word pairs.
For each reference sentence in the HUTB cor-
pus, we created artificial variants by permuting the
preverbal constituents whose heads were linked to
the root node in the dependency tree. Inspired by
grammar rules proposed in the NLG literature (Ra-
jkumar and White,2014), ungrammatical variants
were automatically filtered out by detecting depen-
dency relation sequences not attested in the origi-
nal HUTB corpus. After filtering, we had 72833
variant sentences for our classification task. Fig-
ure 1 in Appendix A displays the dependency tree
for Example sentence 1a and explains our variant
generation procedure in more detail.
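The following is a heavily simplified sketch of the variant-generation idea under stated assumptions: preverbal constituents attached to the root are permuted, and orders whose dependency-relation sequences are not attested in the corpus are filtered out. The data structures and relation labels are illustrative placeholders, not the actual HUTB annotation or the filtering rules of Rajkumar and White (2014).

```python
from itertools import permutations

def generate_variants(constituents, verb, attested_relation_seqs):
    """Permute preverbal constituents and keep only orders whose relation
    sequence is attested in the corpus (a proxy grammaticality filter).

    `constituents` is a list of (text, relation) pairs for constituents whose
    heads attach to the root; `verb` is the clause-final verbal complex;
    `attested_relation_seqs` is a set of relation-label tuples observed in
    the treebank. All of these are illustrative simplifications.
    """
    variants = []
    for order in permutations(constituents):
        rel_seq = tuple(rel for _, rel in order)
        if rel_seq not in attested_relation_seqs:
            continue                                   # drop unattested (likely ungrammatical) orders
        variants.append(" ".join(text for text, _ in order) + " " + verb)
    return variants

# Toy usage mirroring Example (1); the relation labels are made up.
consts = [("amar ujala-ko", "rel1"), ("yah", "rel2"), ("sukravar-ko", "rel3"), ("daak-se", "rel4")]
attested = {tuple(rel for _, rel in p) for p in permutations(consts)}   # permissive toy filter
print(len(generate_variants(consts, "prapt hua", attested)))            # 24 orders in this toy setup
```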
To determine whether the original word order
(i.e. the reference sentence) is preferred to the per-
muted word orders (i.e. the variant sentences), we
conducted a targeted human evaluation via a forced-choice rating task and collected sentence judg-
ments from 12 Hindi native speakers for 167 ran-
domly selected reference-variant pairs in our data
set. Participants were first shown the preceding
sentence, and then they were asked to select the
best continuation between either the reference or
the variant. We found that 89.92% of the reference
sentences which originally appeared in the HUTB
corpus were also preferred by native speakers com-
pared to the artificially generated grammatical vari-
ants expressing similar meaning (further details are provided in Appendix G). Therefore, in our
analyses we treat the HUTB reference sentences
as human-preferred gold orderings compared with
other possible automatically-generated constituent
orderings.
3.1 Models
We set up a binary classification task to separate
the original HUTB reference sentences from the
variants using the cognitive metrics described in
Section 2. To alleviate the data imbalance between
the two classes (1996 references vs 72833 variants),
we transformed our data set using the approach
described in Joachims (2002). This technique con-
verts a binary classification problem into a pair-
wise ranking task by training the classifier on the
difference of the feature vectors of each reference
and its corresponding variants (see Equations 2 and 3). Equation 2 displays the objective of a standard binary classifier, where the classifier must learn a feature weight (w) such that the dot product of w with the reference feature vector (φ(reference)) is greater than the dot product of w with the variant feature vector (φ(variant)). This objective can be rewritten as Equation 3 such that the dot product of w with the difference of the feature vectors is greater than zero.

$$w \cdot \phi(\text{reference}) > w \cdot \phi(\text{variant}) \quad (2)$$

$$w \cdot (\phi(\text{reference}) - \phi(\text{variant})) > 0 \quad (3)$$
Every variant sentence in our dataset was paired
with its corresponding reference sentence with or-
der balanced across these pairings (e.g., Example 1
would yield (1a,1b) and (1c,1a)). Thereafter, their
feature vectors were subtracted (e.g., 1a-1b and
1c-1a), and binary labels were assigned to each
transformed data point. Reference-Variant pairs
were coded as "1" and Variant-Reference pairs were coded as "0". The alternate pair ordering thus re-balanced our previously severely imbalanced classification task. Table 5 in Appendix D
illustrates the original and transformed values of
the independent variables.
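A minimal sketch of this pairwise transformation and the downstream classifier is given below, using NumPy and scikit-learn rather than the R GLM implementation reported later in this section; the three-dimensional feature vectors are illustrative values, not features extracted from HUTB.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pairwise_transform(ref_features, variant_features):
    """Joachims (2002)-style transformation for one reference sentence.

    Alternates the pair order so the transformed data are balanced:
    (reference - variant) is labelled 1, (variant - reference) is labelled 0.
    """
    X, y = [], []
    for i, var in enumerate(variant_features):
        if i % 2 == 0:
            X.append(ref_features - var); y.append(1)   # Reference-Variant pair
        else:
            X.append(var - ref_features); y.append(0)   # Variant-Reference pair
    return np.array(X), np.array(y)

# Toy usage: illustrative 3-dimensional feature vectors (e.g. dependency
# length, trigram surprisal, IS score).
ref = np.array([10.0, 55.2, 1.0])
variants = [np.array([12.0, 57.1, 0.0]), np.array([11.0, 58.3, -1.0])]
X, y = pairwise_transform(ref, variants)
clf = LogisticRegression(fit_intercept=False).fit(X, y)   # no intercept, mirroring Equation 3
print(clf.coef_)
```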
For each reference sentence, our objective was
to model the possible syntactic choices entertained
by the speaker. In each instance, the author chose
to generate the reference order over the variant,
implicitly demonstrating an order preference. If
the cognitive factors in our study influenced that
decision, a logistic regression model should be
able to use those factors to predict which syntactic
choice was ultimately chosen by the author. Us-
ing the transformed features dataset labelled with
1 (denoting a preference for the reference order)
and 0 (denoting a preference for the variant order),
we trained a logistic regression model to predict
each reference sentence (see Equation 4). We re-
port our classification results using 10-fold cross-
validation. The regression results are reported on
the entire transformed test data for the respective
experiments. All experiments were done with the
Generalized Linear Model (GLM) package in R.
choice ∼ δ dependency length + δ trigram surprisal + δ PCFG surprisal + δ IS score + δ lexical repetition surprisal + δ LSTM surprisal + δ adaptive LSTM surprisal    (4)

Here, choice is encoded by the binary dependent variable discussed above (1: reference preference; 0: variant preference). To obtain sentence-
level surprisal measures, we summed word-level
surprisal of all the words in each sentence. The
values for independent variables were calculated
as follows.
1. Dependency length: We computed a
sentence-level dependency length measure by
summing the head-dependent distances (mea-
sured as the number of intervening words) in
the HUTB reference and variant dependency
trees.
2. Trigram surprisal: For each word in a sen-
tence, we estimated its local predictability us-
ing a 3-gram language model (LM) trained
on the written section of the EMILLE Hindi
Corpus (Baker et al.,2002), which consists
of 1 million mixed genre sentences, using the
SRILM toolkit (Stolcke,2002) with Good-
Turing discounting.
3. PCFG surprisal: The syntactic predictability of each word in a sentence was estimated using the Berkeley latent-variable PCFG parser² (Petrov et al., 2006). A total of 12,000 phrase structure trees were created to train the parser by converting Bhatt et al.'s HUTB de-
pendency trees into constituency trees using
the approach described in Yadav et al. (2017).
Sentence level log-likelihood of each test sen-
tence was estimated by training a PCFG LM
on four folds of the phrase structure trees and
then testing on a fifth held-out fold.
² 5-fold cross-validated parser training and testing F1-score metrics were 90.82% and 84.95%, respectively.
4. Information status (IS) score: We automatically annotated whether each sentence exhibited given-new ordering. The subject and object constituents in a sentence were assigned a Given tag if their head was a pronoun or if any content word within them was mentioned in the preceding sentence. All other phrases were tagged as New. For each sentence, the IS score was computed as follows: a) Given-New order = +1, b) New-Given order = -1, c) Given-Given and New-New = 0. For an illustration of givenness coding, see Example 3 in Appendix A and the description in Appendix B.
5. Lexical repetition surprisal: For each word in a sentence, we accounted for lexical priming by interpolating a 3-gram language model with a unigram cache LM based on the history of words (H = 100) containing only the preceding sentence. We used the original implementation provided in the SRILM toolkit with a default interpolation weight parameter (µ = 0.05; see Equations 5 and 6) based on
the approach described by Kuhn and De Mori
(1990). The idea is to keep a count of recently
occurring words in the sentence history and
then boost their probability within the trigram
language model. Words that have occurred re-
cently in the text are likely to re-occur³ in sub-
sequent sentences (Kuhn and De Mori,1990;
Clarkson and Robinson,1997).
$$P(w_k \mid w_1, w_2, \dots, w_{k-1}) = \mu\, P_{\text{cache}}(w_k \mid w_1, w_2, \dots, w_{k-1}) + (1 - \mu)\, P_{\text{trigram}}(w_k \mid w_{k-2}, w_{k-1}) \quad (5)$$

$$P_{\text{cache}}(w_k \mid w_{k-H}, w_{k-H+1}, \dots, w_{k-1}) = \frac{\mathrm{count}_{\text{cache}}(w_k)}{H} \quad (6)$$

(A minimal code sketch of this interpolation scheme is given at the end of this section.)
6. LSTM surprisal: We estimated the pre-
dictability for each word according to the en-
tire sentence prefix using a long short-term
memory language model (LSTM; Hochre-
iter and Schmidhuber,1997) trained on the 1
million written sentences from the EMILLE
Hindi corpus (Baker et al.,2002). We used
the LSTM implementation provided in the
³ Out of the 13274 sentences present in HUTB, 71.20% of sentences contained at least one content word previously mentioned in the preceding sentence (Jain et al., 2018).
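As a concrete illustration of the cache interpolation in Equations 5 and 6 above, here is a minimal Python sketch; the trigram_prob stub is a hypothetical stand-in for the SRILM trigram model, not its actual implementation.

```python
from collections import Counter

def cache_interpolated_prob(word, history, trigram_prob, mu=0.05, H=100):
    """Interpolate a unigram cache LM with a trigram LM (Equations 5 and 6).

    `history` holds the preceding words (the previous sentence in this
    paper's setup); `trigram_prob(word, context)` stands in for a trained
    trigram model such as the SRILM model described above.
    """
    cache = history[-H:]                                 # keep at most the last H words
    p_cache = Counter(cache)[word] / H                   # Equation (6)
    context = tuple(history[-2:])                        # trigram context (w_{k-2}, w_{k-1})
    return mu * p_cache + (1 - mu) * trigram_prob(word, context)   # Equation (5)

# Toy usage with a uniform trigram stub over a 1000-word vocabulary:
stub = lambda word, context: 1.0 / 1000
print(cache_interpolated_prob("daak-se", ["amar", "ujala-ko", "yah", "daak-se"], stub))
```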