
is widely understood to be a kind of Turing machine [12, 13], RNNs are an obvious choice to conduct this study. On the other hand, the self-attention layers of the popular transformer models [14] have restricted capability and fail to recognize context-free languages, even when allowed infinite precision in the weights [15], and have issues generalizing
to unseen distributions. Hence, to better study the role of visual grounding in multilingual language learning, we extend a Long Short-Term Memory (LSTM) recurrent model with multimodal and multilingual inputs, trained on images and corresponding English and Spanish text. Our interest in images is largely due to the availability of visual datasets, rather than to a claim that vision, to the exclusion of other senses, is what matters for human-like language learning.
Congenitally blind people’s language experience is perceptually grounded, just not visually grounded.
We aim to use language models to study how perceptual information contributes to human language learning, and specifically how visual information facilitates multilingual learning: we examine whether visual representations allow the model to better capture the relationship between words from different languages that share the same semantic meaning.
Perceptually grounded multilingual language models have the potential to (1) be more human-like in how they process multiple languages (i.e., to serve as better models of multilingual speakers) and (2) scaffold the acquisition of multiple languages under conditions of plentiful perceptual data and limited linguistic data (as is generally the case for human second-language learners).
In what follows, we discuss related work on perceptually grounded and multilingual language modeling, describe our model, and present results (namely, overall perplexity in Spanish and English, and both within- and between-language judgements of semantic similarity). We find that the use of visual information lowers perplexity and improves correlation with human judgements of semantic similarity both between and within languages. However, the performance improvement is least significant for abstract words. Our results align with prior studies on images and monolingual data [16, 8]: visual grounding improves multilingual model performance on next-word prediction and semantic alignment across languages.
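To make the similarity evaluation concrete, the sketch below illustrates how a model's between-language semantic similarities can be compared against human judgements. The embedding lookup, the example word pairs, and the choice of cosine similarity with Spearman's rho are assumptions for illustration, not a description of the exact experimental setup.

# Illustrative sketch: correlate model word-pair similarities with human
# semantic-similarity ratings across languages. The embed() lookup is a stub
# (random vectors) so the snippet runs standalone; in practice it would return
# a word's vector from the trained model.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
_vectors = {}

def embed(word, lang):
    # Hypothetical lookup into the model's embedding table.
    return _vectors.setdefault((word, lang), rng.standard_normal(300))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Each item: (English word, Spanish word, human similarity rating); placeholder values.
pairs = [("dog", "perro", 9.5), ("house", "casa", 9.1), ("dog", "casa", 1.2)]

model_sims = [cosine(embed(en, "en"), embed(es, "es")) for en, es, _ in pairs]
human_sims = [rating for _, _, rating in pairs]

rho, p_value = spearmanr(model_sims, human_sims)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")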
2 Related Work
The importance of visual or perceptual features to language comprehension has been widely studied in neuroimaging and behavioral studies [2, 3]. These studies provide substantial evidence that language and perception can benefit each other. This can also be seen in large vision-language models, where pre-training on language benefits visual processing [17, 18]. It is also evident from studies showing that an infant's world model is efficiently built by jointly learning from different modalities [19]. Infants and children rapidly learn new words by inferring and analyzing information from their physical world [20]. Bayesian cognitive models have captured the rapid dynamics of children's language acquisition by pairing syntactic information from language experience with semantic knowledge from world experience, such that learning in the two modalities bootstraps off of each other [21]. Thus, integrating vision and language can help us better understand language acquisition in the human brain and can benefit artificial intelligence systems through more efficient learning and reasoning.
Multilingual language models succeed in many tasks, including language comprehension, generation, cross-lingual translation, and information retrieval. Studies have found that after fine-tuning on target language pairs, pre-trained models can successfully handle multiple multilingual tasks, including but not limited to next-word prediction, translation, language generation for code-switching sentences, and speech recognition [22, 23, 24]. However, achieving human-like performance on tasks involving integration and interactions among multiple languages is still challenging. [25] found that even though pre-trained models succeed in multiple multilingual tasks, they may not perform well in forming representations of code-switching patterns in language production, indicating a lack of sufficiently deep integration and interaction between models' representations across languages.
Furthermore, questions of how multilingual models integrate knowledge across languages, and whether that integration is human-like, remain a matter of study [26]. Studies have found that even when multilingual models form language-integrated representations [27] and align similar knowledge across languages under similar representations [28], the models still may not achieve satisfactory results on higher-level multilingual tasks such as translation alignment and language generation. In short, more studies are needed to better understand the mechanisms of multilingual language representation and how representations of different languages are involved in the language generation process.
3 Model Design
Using multilingual data, we evaluate gated recurrent models, specifically long short-term memory (LSTM) networks. In this study, we show that having visual context or semantic input helps the model better understand syntactic structure across languages. We first describe the vanilla LSTM architecture.