
is widely understood to be a kind of Turing machine [12, 13], RNNs are an obvious choice to conduct this study. On the other hand, the self-attention layers of the popular transformer models [14] have restricted capability and fail to recognize context-free languages, even when allowed infinite precision in the weights [15], and have issues generalizing
to unseen distributions. Hence, to better study the role of visual grounding in multilingual language learning, we extend a Long Short-Term Memory (LSTM) recurrent model with multimodal and multilingual inputs, trained on images and corresponding English and Spanish text. Our interest in images is largely due to the availability of visual datasets, rather than to a claim that vision, to the exclusion of other senses, is what matters for human-like language learning.
Congenitally blind people’s language experience is perceptually grounded, just not visually grounded.
We aim to use language models to study how perceptual information contributes to human language learning, and specifically how visual information facilitates multilingual learning: we examine whether visual representations allow the model to better capture the relationship between words from different languages that share the same semantic meaning.
Perceptually grounded multilingual language models have the potential to (1) be more human-like in how they process multiple languages (i.e., to serve as better models of multilingual speakers) and (2) scaffold the acquisition of multiple languages under conditions of plentiful perceptual data and limited linguistic data (as is generally the case for human second-language learners).
In what follows, we discuss related work on perceptually grounded and multilingual language modeling, describe our model, and present results (namely, overall perplexity in Spanish and English, and both within- and between-language judgements of semantic similarity). We find that the use of visual information lowers perplexity and improves correlation with human judgements of semantic similarity both between and within languages. However, the performance improvement is least significant for abstract words. Our results align with prior studies on images and monolingual data [16, 8]: visual grounding improves multilingual model performance on next-word prediction and semantic alignment across languages.
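To make the similarity evaluation concrete, the sketch below illustrates how a model's between-language semantic similarities can be compared against human judgements. The embedding lookup, the example word pairs, and the choice of cosine similarity with Spearman's rho are assumptions for illustration, not a description of the exact experimental setup.

# Illustrative sketch: correlate model word-pair similarities with human
# semantic-similarity ratings across languages. The embed() lookup is a stub
# (random vectors) so the snippet runs standalone; in practice it would return
# a word's vector from the trained model.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
_vectors = {}

def embed(word, lang):
    # Hypothetical lookup into the model's embedding table.
    return _vectors.setdefault((word, lang), rng.standard_normal(300))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Each item: (English word, Spanish word, human similarity rating); placeholder values.
pairs = [("dog", "perro", 9.5), ("house", "casa", 9.1), ("dog", "casa", 1.2)]

model_sims = [cosine(embed(en, "en"), embed(es, "es")) for en, es, _ in pairs]
human_sims = [rating for _, _, rating in pairs]

rho, p_value = spearmanr(model_sims, human_sims)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")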
2 Related Work
The importance of visual or perceptual features to language comprehension has been widely studied in neuroimaging and behavioral studies [2, 3]. These studies provide substantial evidence that language and perception can benefit each other. This can also be seen in large vision-language models, where pre-training on language benefits visual processing [17, 18]. It is also evident from studies showing that an infant's world model is efficiently built by jointly learning from different modalities [19]. Infants and children rapidly learn new words by inferring and analyzing information from their physical world [20]. Bayesian cognitive models have captured the rapid dynamics of children's language acquisition by pairing syntactic information from language experience with semantic knowledge from world experience, such that learning in the two modalities bootstraps off of each other [21]. Thus, integrating vision and language can help us better understand language acquisition in the human brain and can benefit artificial intelligence systems through more efficient learning and reasoning.
Multilingual language models succeed in many tasks, including language comprehension, generation, cross-lingual translation, and information retrieval. Studies have found that after fine-tuning on target language pairs, pre-trained models can successfully handle multiple multilingual tasks, including but not limited to next-word prediction, translation, language generation for code-switching sentences, and speech recognition [22, 23, 24]. However, achieving human-like performance on tasks involving integration and interactions among multiple languages is still challenging. [25] found that even though pre-trained models succeed in multiple multilingual tasks, they may not perform well in forming representations of code-switching patterns in language production, indicating a lack of sufficiently deep integration and interaction between models' representations across languages.
Furthermore, questions of how multilingual models integrate knowledge across languages, and whether that integration is human-like, remain a matter of study [26]. Studies have found that even when multilingual models form language-integrated representations [27] and align similar knowledge across languages under similar representations [28], the models still may not achieve satisfactory results on higher-level multilingual tasks such as translation alignment and language generation. In short, more studies are needed to better understand the mechanisms of multilingual language representation and how representations of different languages are involved in the language generation process.
3 Model Design
Using multilingual data, we evaluate gated recurrent models, specifically long short-term memory (LSTM) networks. In this study, we show that having visual context or semantic input helps the model better understand syntactic structure across languages. We first describe the vanilla LSTM architecture.