LIKE A BILINGUAL BABY:
THE ADVANTAGE OF VISUALLY GROUNDING A BILINGUAL
LANGUAGE MODEL
Khai-Nguyen Nguyen
Bucknell University
Lewisburg, PA
nkn002@bucknell.edu
Zixin Tang
The Pennsylvania State University
State College, PA
zqt5035@psu.edu
Ankur Mali
University of South Florida
Tampa, FL
ankurarjunmali@usf.edu
M. Alex Kelly
Carleton University
Ottawa, Canada
alex.kelly@carleton.ca
ABSTRACT
Unlike most neural language models, humans learn languages in a rich, multi-sensory, and, often,
multi-lingual environment. Conversely, language models are typically trained on only language
data, and often on only one language. We hypothesize that perceptual grounding scaffolds language
learning, including learning relationships between languages. To better understand multilingualism
and the role of visual input in language understanding, we train a recurrent language model on images
and corresponding text in English and Spanish from MS-COCO-ES. We find that visual grounding
improves the model’s understanding of semantic similarity within and across languages and improves
language generation. Our results provide evidence of the advantages of visually grounded language
models. We posit that language learning is better understood as integral to the totality of a learner’s
experiences, and thus there is a need for more naturalistic language data from multilingual speakers
and multilingual datasets with perceptual grounding.
Keywords: natural language processing; multilingual models; grounded cognition; neural language models; recurrent neural network; natural language understanding
1 Introduction
With the effects of globalization on business, education, and culture, multilingualism—speaking two or more languages—
is becoming more and more common. The prevalence of multilingualism and speech patterns peculiar to multilingualism
creates a need for computational language models that can handle, process, and generate the language of multilingual
communities.
It is well known that a person’s knowledge is inseparable from the physical or social context in which it is learned and used; as humans, we create a world model based on context [1]. Perceptual symbols theory states that language, reasoning, context, and cognition are grounded in perceptual features that provide visual clues to create world models [2]. Unlike state-of-the-art language models, humans learn languages in a rich perceptual environment. Perceptual data contributes to linguistic tasks and plays an important role in the acquisition of language in humans [3, 4]. Perceptual grounding facilitates first language acquisition (e.g., the illustrations in children’s picture books) and second language acquisition (e.g., studying abroad). By using perceptual information, we can quickly understand the meaning of a new word when learning a language by mapping it to its real-world subject. Language models that incorporate visual data have a stronger correspondence to human judgments of word similarity and human reaction times on semantic priming tasks, especially for concrete nouns and visually descriptive words [5, 6, 7, 8].
arXiv:2210.05487v2 [cs.CL] 13 Feb 2023

Recently it has been shown that recurrent neural network-based models are equivalent to universal computational models such as Turing Machines, even with finite precision [9, 10, 11] and even with bounded time [11]. As the brain is widely understood to be a kind of Turing machine [12, 13], RNNs are an obvious choice to conduct this study. On
the other hand, the self-attention layers of the popular transformer models [14] have restricted capability and fail to recognize context-free languages, even when allowed infinite precision in the weights [15], and have issues generalizing to unseen distributions. Hence, to better study the role of visual grounding in multilingual language learning, we extend a Long Short-Term Memory (LSTM) recurrent model with multimodal and multilingual inputs, trained on images and corresponding English and Spanish text. Our interest in images is largely due to the availability of visual datasets rather than a commitment to vision, to the exclusion of other senses, as important for human-like language learning. Congenitally blind people’s language experience is perceptually grounded, just not visually grounded.
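As a concrete illustration of this setup, the following sketch shows one way an LSTM language model can be conditioned on visual input, by concatenating an image feature vector to each word embedding. This is an illustrative toy, not the authors’ exact architecture: all layer sizes, variable names, and the random stand-in features are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: x is the input (word embedding, here concatenated
    with image features), h and c are the hidden and cell states."""
    z = W @ x + U @ h + b                      # all four gate pre-activations
    H = h.size
    i = 1 / (1 + np.exp(-z[:H]))               # input gate
    f = 1 / (1 + np.exp(-z[H:2 * H]))          # forget gate
    o = 1 / (1 + np.exp(-z[2 * H:3 * H]))      # output gate
    g = np.tanh(z[3 * H:])                     # candidate cell update
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

emb, hid, img = 16, 32, 8                      # toy sizes (illustrative)
W = rng.normal(0, 0.1, (4 * hid, emb + img))   # input weights see word + image
U = rng.normal(0, 0.1, (4 * hid, hid))
b = np.zeros(4 * hid)

image_vec = rng.normal(size=img)               # stand-in for CNN image features
h = np.zeros(hid)
c = np.zeros(hid)
for _ in range(5):                             # run over a toy 5-word caption
    word_vec = rng.normal(size=emb)            # stand-in for a word embedding
    x = np.concatenate([word_vec, image_vec])  # visual grounding: the image
    h, c = lstm_step(x, h, c, W, U, b)         # accompanies every word

print(h.shape)
```

Because the same image vector is appended at every time step, the visual context is available throughout caption generation in either language; the multilingual aspect enters only through the training text, not the architecture.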
We aim to understand how perceptual information supports human language learning by studying language models, specifically whether visual information facilitates multilingual learning, and to examine whether visual representations allow the model to better understand the relationship between words from different languages with the same semantic meaning. Perceptually grounded multilingual language models have the potential (1) to be more human-like in how they process multiple languages (i.e., better models of multilingual speakers) and (2) to scaffold the acquisition of multiple languages given conditions of plentiful perceptual data and limited linguistic data (as is generally the case for human second language learners).
In what follows, we discuss related work on perceptually grounded and multilingual language modeling, describe our model, and present results (namely, overall perplexity in Spanish and English, and judgements of semantic similarity both within and between languages). We find that the use of visual information lowers perplexity and improves correlation with human judgements of semantic similarity both between and within languages. However, the performance improvement is smallest for abstract words. Our results align with prior studies on images and monolingual data [16, 8]: visual grounding improves multilingual model performance on next-word prediction and semantic alignment across languages.
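For readers unfamiliar with the two evaluation measures, they can be sketched as follows. This is a minimal, dependency-free illustration with toy numbers of our own invention, not the paper’s data or code.

```python
import math

# Perplexity: the exponentiated average negative log-probability the model
# assigns to each next word in held-out text (lower is better).
def perplexity(word_probs):
    nll = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(nll)

# Spearman rank correlation between model and human similarity scores,
# implemented directly to keep the sketch self-contained (assumes no ties).
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(model_sims, human_sims):
    rm, rh = ranks(model_sims), ranks(human_sims)
    n = len(rm)
    mm, mh = sum(rm) / n, sum(rh) / n
    cov = sum((a - mm) * (b - mh) for a, b in zip(rm, rh))
    sm = math.sqrt(sum((a - mm) ** 2 for a in rm))
    sh = math.sqrt(sum((b - mh) ** 2 for b in rh))
    return cov / (sm * sh)

# Toy numbers: model probabilities for a 4-word caption, and model vs.
# human similarity ratings for 5 word pairs (e.g., "dog"/"perro").
print(round(perplexity([0.2, 0.1, 0.5, 0.25]), 2))
print(round(spearman([0.9, 0.1, 0.5, 0.7, 0.3],
                     [0.8, 0.2, 0.4, 0.9, 0.1]), 2))
```

A grounded model that ranks cross-lingual word pairs more like humans do will score a higher Spearman correlation, which is the sense in which visual input “improves correlation to human judgements” above.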
2 Related Work
The importance of visual or perceptual features to language comprehension is widely studied in neuro-imaging, and
behavioral studies [
2
,
3
]. These studies provide substantial evidence that language and perception can benefit each
other. This can also be seen in large vision-language models, where pre-training on language benefits visual processing
[
17
,
18
]. It is also evident from studies showing an infant’s world model is efficiently created by jointly learning
different modalities [
19
]. Children or infants rapidly learn new words by inferring and analyzing information from their
physical world [
20
]. Bayesian cognitive models have captured the rapid dynamics of children’s language acquisition
by pairing syntactic information from language experience with semantic knowledge from world experience, such
the learning in the two modalities bootstrap off of each other [
21
]. Thus, integrating vision and language can help us
better understand language acquisition in human brains and can benefit artificial intelligence systems through efficient
learning and reasoning.
Multilingual language models succeed in many tasks, including language comprehension, generation, cross-lingual translation, and information retrieval. Studies have found that after fine-tuning on target language pairs, pre-trained models can successfully handle multiple multilingual tasks, including but not limited to next-word prediction, translation, language generation for code-switching sentences, and speech recognition [22, 23, 24]. However, achieving human-like performance on tasks involving integration and interactions among multiple languages is still challenging. [25] found that even though pre-trained models succeed in multiple multilingual tasks, they may not perform well in forming representations of code-switching patterns in language production, indicating a lack of sufficiently deep integration and interaction between models’ representations across languages.
Furthermore, questions of how multilingual models integrate knowledge across languages, and whether that integration is human-like, remain a matter of study [26]. Studies have found that even when multilingual models form language-integrated representations [27] and align similar knowledge among languages under similar representations [28], the models still may not achieve satisfying results on higher-level multilingual tasks such as translation alignment and language generation. In short, more studies are needed to better understand the mechanisms of language representation in multilingualism and how representations of different languages are involved in the language generation process.
3 Model Design
Using multilingual data, we will evaluate gated recurrent models such as long-short-term memory (LSTM). In this
study, we show that having vision context or semantic input helps the model better understand the syntactic structure
across languages. We first define the vanilla LSTM architecture, which is defined as follows: