prior experiences [10]. For example, when we see a familiar object,
we spontaneously retrieve the knowledge of that object and the
entity relationships that object forms in our mind. As shown in
Fig. 1, cognitive neuroscience research on dual-coding theory [11],
[12] also considers concrete concepts to be encoded in the brain
both visually and linguistically, where language, as a valid prior
experience, contributes to shaping vision-derived representations.
Moreover, in brain-inspired computational modeling, large-scale
multimodal pretrained models [13], [14] formed by combining
image and text representations provide a better proxy for human-
like intelligence. Therefore, we argue that recorded brain activity should be decoded by combining not only the visual semantic features of the actually presented stimuli, but also a far richer set of linguistic semantic features associated with the target object.
Although several studies have addressed the idea of decoding
naturalistic visual experiences from brain activity using purely
linguistic features [15], [16], they merely use standard word vectors
of class names that are automatically extracted from large corpora
such as Common Crawl. However, such class-name word vectors are only weakly aligned with visual information [17]. As a result, the neural decoding accuracy remains far from practical requirements. Can we build a language representation that carries richer visual semantics and is therefore more consistent with visual cognition? Previous studies
using Wikipedia text descriptions to represent image classes have
shown some positive signs [17], [18]. For example, as shown in
Fig. 2, the page “Elephants” contains phrases “long trunk, tusks,
large ear flaps, massive legs” and “tough but sensitive skin” that
exactly match the visual attributes. Intuitively, Wikipedia articles
capture richer visual semantic information than class names. Here,
we argue that using natural language such as Wikipedia articles as class descriptions will yield better neural decoding performance than using class names alone.
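To make this contrast concrete, the following is a minimal sketch of how a class-name embedding and a richer description-level embedding could be obtained from a generic pretrained language model. The model choice ("bert-base-uncased") and the mean-pooling strategy are illustrative assumptions, not necessarily the text-feature pipeline used in this work.

```python
# Minimal sketch: embedding a bare class name vs. a Wikipedia-style description
# with a generic pretrained language model (illustrative choice, not BraVL's
# exact text-feature extractor).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

def embed_text(text: str) -> torch.Tensor:
    """Mean-pool token embeddings into a single description-level feature."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state     # (1, T, D)
    mask = inputs["attention_mask"].unsqueeze(-1)        # (1, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, D)

v_name = embed_text("elephant")
v_desc = embed_text("Elephants have a long trunk, tusks, large ear flaps, "
                    "massive legs, and tough but sensitive skin.")
print(v_name.shape, v_desc.shape)  # both (1, 768) for this encoder
```

The description embedding aggregates many visually grounded phrases, which is precisely the property exploited when pairing textual features with visual features and brain activity.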
Motivated by the above discussion, we propose a biologically plausible neural decoding method, called BraVL,
to infer novel image categories from human brain activity by
the joint learning of brain-visual-linguistic features. Our model
focuses on modeling the relationships between brain activity and
multimodal semantic knowledge, i.e., visual semantic knowledge
extracted from images and textual semantic knowledge obtained
from rich Wikipedia descriptions of classes. Specifically, we develop a multimodal auto-encoding variational Bayesian learning framework, in which we use the mixture-of-products-of-experts formulation [19] to infer a latent code that enables coherent joint generation of all three modalities. To learn a more consistent joint representation and to improve data efficiency when brain activity data are limited, we further introduce intra- and inter-modality Mutual Information (MI) regularization terms. In particular, our BraVL model can be trained under various semi-supervised learning scenarios to incorporate extra visual and textual features obtained from large-scale image categories beyond those of the training data.
Furthermore, we collected the corresponding textual descriptions
for two popular Image-fMRI datasets [6], [20] and one Image-EEG
dataset [21], thereby forming three new trimodal matching (brain-visual-linguistic) datasets. The experimental results yield three key observations. First, models that combine visual and textual features perform much better than those using either modality alone. Second, using natural language as class descriptions yields higher neural decoding performance than using class names. Third, both unimodal and bimodal extra data can markedly improve decoding accuracy.
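To make the modeling idea more concrete, the following is a minimal PyTorch sketch of the trimodal latent fusion: each modality (brain, visual, textual) is given a Gaussian encoder, the encoders of a modality subset are fused by a product-of-experts (PoE), and the posteriors of all non-empty subsets are combined as a mixture (MoPoE). Network widths, the latent dimensionality, the standard-normal prior expert, and the way subset posteriors are aggregated are illustrative assumptions rather than BraVL's exact configuration; the intra-/inter-modality MI regularizers and the decoders are omitted.

```python
# Sketch of mixture-of-products-of-experts (MoPoE) fusion over brain (B),
# visual (V) and textual (T) Gaussian encoders. Illustrative sizes only.
from itertools import combinations
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    def __init__(self, in_dim, z_dim=32, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

def poe(mus, logvars):
    """Product of Gaussian experts, including a standard-normal prior expert."""
    precisions = [torch.ones_like(mus[0])] + [torch.exp(-lv) for lv in logvars]
    means = [torch.zeros_like(mus[0])] + list(mus)
    joint_precision = sum(precisions)
    joint_mu = sum(m * p for m, p in zip(means, precisions)) / joint_precision
    joint_logvar = -torch.log(joint_precision)
    return joint_mu, joint_logvar

def mopoe(mus, logvars):
    """PoE posterior for every non-empty modality subset (to be mixed uniformly)."""
    idx = range(len(mus))
    subsets = [s for r in range(1, len(mus) + 1) for s in combinations(idx, r)]
    return [poe([mus[i] for i in s], [logvars[i] for i in s]) for s in subsets]

# Dummy brain / visual / textual features with illustrative dimensionalities.
enc_b, enc_v, enc_t = GaussianEncoder(512), GaussianEncoder(1024), GaussianEncoder(768)
xb, xv, xt = torch.randn(8, 512), torch.randn(8, 1024), torch.randn(8, 768)
stats = [enc(x) for enc, x in ((enc_b, xb), (enc_v, xv), (enc_t, xt))]
posteriors = mopoe([m for m, _ in stats], [lv for _, lv in stats])
print(len(posteriors))  # 7 subset posteriors for three modalities
```

In the full model, the modality decoders and the MI regularization terms would be trained jointly on top of such subset posteriors; the sketch covers only the fusion step.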
Contributions. In summary, our main contributions are as follows: 1) We combine visual and linguistic knowledge for
neural decoding of visual categories from human brain activity for
the first time. 2) We develop a new multimodal learning model
with specially designed intra- and inter-modality MI regularizers to
achieve more consistent brain-visual-linguistic joint representations
and improved data efficiency. 3) We contribute three trimodal
matching datasets, containing high-quality brain activity, visual
features and textual features. Our code and datasets have been released at https://github.com/ChangdeDu/BraVL to facilitate further research. 4) Our experimental results
show several interesting conclusions and cognitive insights about
the human visual system.
2 RELATED WORK
Neural decoding of visual categories. Estimating the semantic categories of viewed images from evoked brain activity has long been a sought-after goal. Previous works have mostly relied on
a classification-based approach, where a classifier is trained to
build the relationship between brain activity and the predefined
labels using fMRI [1], [22], [23], [24] or EEG [3], [25], [26], [27]
data. However, such methods are restricted to decoding a predefined set of categories. To allow novel category decoding,
several identification-based methods [6], [7], [28] were proposed
by characterizing the relationship between brain activity and visual
semantic knowledge, such as image features extracted from Gabor
wavelet filters [7] or a CNN [6], [28]. Although these methods
allow the identification of a large set of possible image categories, the decoding accuracy depends heavily on large amounts of paired stimulus-response data, which are difficult to collect.
Therefore, accurately decoding novel image categories remains
a challenge. Neurolinguistic studies have shown that distributed
word representations are also correlated with evoked brain activity
[15], [29], [30], [31]. Encouraged by these findings, we associate
brain activity with multimodal semantic knowledge, i.e., not only
visual features but also textual features. In particular, rather than learning a direct mapping to multimodal semantic knowledge, we focus on constructing a latent space that can describe any valid category, and then learn a mapping between brain activity and this latent space.
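As a minimal illustration of this latent-space identification scheme, the sketch below maps brain activity into a latent space with a linear regressor and identifies a novel category by nearest-neighbor matching against candidate latent codes. The ridge regressor, cosine similarity, and all dimensionalities are common illustrative choices, not necessarily the components used in the cited studies or in BraVL.

```python
# Sketch of identification-based decoding via a shared latent space.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
n_train, n_voxels, z_dim, n_novel = 200, 3000, 64, 50

# Paired training data: brain responses and latent codes of the seen stimuli.
X_train = rng.standard_normal((n_train, n_voxels))
Z_train = rng.standard_normal((n_train, z_dim))

# Latent codes of candidate novel categories (e.g., derived from image/text features).
Z_novel = rng.standard_normal((n_novel, z_dim))

# 1) Learn a brain -> latent mapping on seen categories.
decoder = Ridge(alpha=10.0).fit(X_train, Z_train)

# 2) Identify the category of a new brain response by its nearest candidate code.
x_test = rng.standard_normal((1, n_voxels))
z_pred = decoder.predict(x_test)
predicted_category = int(cosine_similarity(z_pred, Z_novel).argmax())
print(predicted_category)
```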
Zero-shot learning (ZSL). ZSL is a classification problem in which the label space is divided into two distinct sets: seen and novel classes [32], [33], [34]. To alleviate the seen-novel domain shift, training samples are typically accompanied by semantic knowledge such as attributes [32], [35] or word embeddings [36] that bridges the semantic gap between seen and novel classes. Semantic knowledge of these types reflects
human heuristics, and can therefore be extended and transferred
from seen classes to novel ones, specifying the semantic space in
ZSL. ZSL methods can be roughly divided into three categories, depending on the method used to inject semantic knowledge: 1) learning instance→semantic projections [36], [37], 2) learning semantic→instance projections [38], [39], and 3) learning the projections of instance and semantic spaces to a shared latent space [35], [40]. Our approach falls into the third category. Recently,
ZSL researchers have achieved success with deep generative models [35], [41], which synthesize data features as a data augmentation mechanism. In our work, we