prior experiences [10]. For example, when we see a familiar object,
we spontaneously retrieve the knowledge of that object and the
entity relationships that object forms in our mind. As shown in
Fig. 1, cognitive neuroscience research on dual-coding theory [11],
[12] also considers concrete concepts to be encoded in the brain
both visually and linguistically, where language, as a valid prior
experience, contributes to shaping vision-derived representations.
Moreover, in brain-inspired computational modeling, large-scale
multimodal pretrained models [13], [14] formed by combining
image and text representations provide a better proxy for human-
like intelligence. Therefore, we argue that recorded brain activity should be decoded by combining not only the visual semantic features of the actually presented stimuli, but also a far richer set of linguistic semantic features associated with the target object.
Although several studies have addressed the idea of decoding
naturalistic visual experiences from brain activity using purely
linguistic features [15], [16], they merely use standard word vectors
of class names that are automatically extracted from large corpora
such as Common Crawl. However, such class-name word vectors are only weakly aligned with visual information [17]. As a result, the neural decoding accuracy remains far from practical requirements. Can we build a language representation that carries richer visual semantics and is therefore more consistent with visual cognition? Previous studies
using Wikipedia text descriptions to represent image classes have
shown some positive signs [17], [18]. For example, as shown in
Fig. 2, the page “Elephants” contains phrases “long trunk, tusks,
large ear flaps, massive legs” and “tough but sensitive skin” that
exactly match the visual attributes. Intuitively, Wikipedia articles
capture richer visual semantic information than class names. Here,
we argue that using natural language such as Wikipedia articles as class descriptions will yield better neural decoding performance than using class names alone.
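To make this contrast concrete, the following is a minimal sketch of how a class-name embedding and a richer description-level embedding could be obtained from a generic pretrained language model. The model choice ("bert-base-uncased") and the mean-pooling strategy are illustrative assumptions, not necessarily the text-feature pipeline used in this work.

```python
# Minimal sketch: embedding a bare class name vs. a Wikipedia-style description
# with a generic pretrained language model (illustrative choice, not BraVL's
# exact text-feature extractor).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

def embed_text(text: str) -> torch.Tensor:
    """Mean-pool token embeddings into a single description-level feature."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state     # (1, T, D)
    mask = inputs["attention_mask"].unsqueeze(-1)        # (1, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, D)

v_name = embed_text("elephant")
v_desc = embed_text("Elephants have a long trunk, tusks, large ear flaps, "
                    "massive legs, and tough but sensitive skin.")
print(v_name.shape, v_desc.shape)  # both (1, 768) for this encoder
```

The description embedding aggregates many visually grounded phrases, which is precisely the property exploited when pairing textual features with visual features and brain activity.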
Motivated by the above discussion, we propose a biologically plausible neural decoding method, called BraVL,
to infer novel image categories from human brain activity by
the joint learning of brain-visual-linguistic features. Our model
focuses on modeling the relationships between brain activity and
multimodal semantic knowledge, i.e., visual semantic knowledge
extracted from images and textual semantic knowledge obtained
from rich Wikipedia descriptions of classes. Specifically, we develop a multimodal auto-encoding variational Bayesian learning framework, in which we use the mixture-of-products-of-experts formulation [19] to infer a latent code that enables coherent joint generation of all three modalities. To learn a more consistent joint representation and to improve data efficiency when brain activity data are limited, we further introduce intra- and inter-modality Mutual Information (MI) regularization terms. In particular, our BraVL model can be trained under various semi-supervised learning scenarios to incorporate extra visual and textual features obtained from large-scale image categories beyond those of the training data.
Furthermore, we collected the corresponding textual descriptions
for two popular Image-fMRI datasets [6], [20] and one Image-EEG
dataset [21], thereby forming three new trimodal matching (brain-visual-linguistic) datasets. The experimental results yield three key observations. First, models that combine visual and textual features perform much better than those using either modality alone. Second, using natural language as class descriptions yields higher neural decoding performance than using class names. Third, both unimodal and bimodal extra data can markedly improve decoding accuracy.
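To make the modeling idea more concrete, the following is a minimal PyTorch sketch of the trimodal latent fusion: each modality (brain, visual, textual) is given a Gaussian encoder, the encoders of a modality subset are fused by a product-of-experts (PoE), and the posteriors of all non-empty subsets are combined as a mixture (MoPoE). Network widths, the latent dimensionality, the standard-normal prior expert, and the way subset posteriors are aggregated are illustrative assumptions rather than BraVL's exact configuration; the intra-/inter-modality MI regularizers and the decoders are omitted.

```python
# Sketch of mixture-of-products-of-experts (MoPoE) fusion over brain (B),
# visual (V) and textual (T) Gaussian encoders. Illustrative sizes only.
from itertools import combinations
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    def __init__(self, in_dim, z_dim=32, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

def poe(mus, logvars):
    """Product of Gaussian experts, including a standard-normal prior expert."""
    precisions = [torch.ones_like(mus[0])] + [torch.exp(-lv) for lv in logvars]
    means = [torch.zeros_like(mus[0])] + list(mus)
    joint_precision = sum(precisions)
    joint_mu = sum(m * p for m, p in zip(means, precisions)) / joint_precision
    joint_logvar = -torch.log(joint_precision)
    return joint_mu, joint_logvar

def mopoe(mus, logvars):
    """PoE posterior for every non-empty modality subset (to be mixed uniformly)."""
    idx = range(len(mus))
    subsets = [s for r in range(1, len(mus) + 1) for s in combinations(idx, r)]
    return [poe([mus[i] for i in s], [logvars[i] for i in s]) for s in subsets]

# Dummy brain / visual / textual features with illustrative dimensionalities.
enc_b, enc_v, enc_t = GaussianEncoder(512), GaussianEncoder(1024), GaussianEncoder(768)
xb, xv, xt = torch.randn(8, 512), torch.randn(8, 1024), torch.randn(8, 768)
stats = [enc(x) for enc, x in ((enc_b, xb), (enc_v, xv), (enc_t, xt))]
posteriors = mopoe([m for m, _ in stats], [lv for _, lv in stats])
print(len(posteriors))  # 7 subset posteriors for three modalities
```

In the full model, the modality decoders and the MI regularization terms would be trained jointly on top of such subset posteriors; the sketch covers only the fusion step.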
Contributions. In summary, our main contributions are as follows: 1) We combine visual and linguistic knowledge for
neural decoding of visual categories from human brain activity for
the first time. 2) We develop a new multimodal learning model
with specially designed intra- and inter-modality MI regularizers to
achieve more consistent brain-visual-linguistic joint representations
and improved data efficiency. 3) We contribute three trimodal
matching datasets, containing high-quality brain activity, visual
features and textual features. Our code and datasets have been released at https://github.com/ChangdeDu/BraVL to facilitate further research. 4) Our experimental results
show several interesting conclusions and cognitive insights about
the human visual system.
2 RELATED WORK
Neural decoding of visual categories. Estimating the semantic categories of viewed images from evoked brain activity has long been a sought-after goal. Previous works have mostly relied on
a classification-based approach, where a classifier is trained to
build the relationship between brain activity and the predefined
labels using fMRI [1], [22], [23], [24] or EEG [3], [25], [26], [27]
data. However, such methods are restricted to decoding a predefined set of categories. To allow novel category decoding,
several identification-based methods [6], [7], [28] were proposed
by characterizing the relationship between brain activity and visual
semantic knowledge, such as image features extracted from Gabor
wavelet filters [7] or a CNN [6], [28]. Although these methods
allow the identification of a large set of possible image categories, the decoding accuracy depends heavily on large amounts of paired stimulus-response data, which are difficult to collect.
Therefore, accurately decoding novel image categories remains
a challenge. Neurolinguistic studies have shown that distributed
word representations are also correlated with evoked brain activity
[15], [29], [30], [31]. Encouraged by these findings, we associate
brain activity with multimodal semantic knowledge, i.e., not only
visual features but also textual features. In particular, rather than learning a direct mapping to multimodal semantic knowledge, we focus on constructing a latent space that can describe any valid category, and then learn a mapping between brain activity and this latent space.
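As a minimal illustration of this latent-space identification scheme, the sketch below maps brain activity into a latent space with a linear regressor and identifies a novel category by nearest-neighbor matching against candidate latent codes. The ridge regressor, cosine similarity, and all dimensionalities are common illustrative choices, not necessarily the components used in the cited studies or in BraVL.

```python
# Sketch of identification-based decoding via a shared latent space.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
n_train, n_voxels, z_dim, n_novel = 200, 3000, 64, 50

# Paired training data: brain responses and latent codes of the seen stimuli.
X_train = rng.standard_normal((n_train, n_voxels))
Z_train = rng.standard_normal((n_train, z_dim))

# Latent codes of candidate novel categories (e.g., derived from image/text features).
Z_novel = rng.standard_normal((n_novel, z_dim))

# 1) Learn a brain -> latent mapping on seen categories.
decoder = Ridge(alpha=10.0).fit(X_train, Z_train)

# 2) Identify the category of a new brain response by its nearest candidate code.
x_test = rng.standard_normal((1, n_voxels))
z_pred = decoder.predict(x_test)
predicted_category = int(cosine_similarity(z_pred, Z_novel).argmax())
print(predicted_category)
```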
Zero-shot learning (ZSL). ZSL is a classification problem in which the label space is divided into two distinct sets: seen and novel classes [32], [33], [34]. To alleviate the seen-novel domain shift, training samples are typically accompanied by semantic knowledge such as attributes [32], [35] or word embeddings [36] that bridges the semantic gap between seen and novel classes. Semantic knowledge of these types reflects
human heuristics, and can therefore be extended and transferred
from seen classes to novel ones, specifying the semantic space in
ZSL. ZSL methods can be roughly divided into three categories, depending on the method used to inject semantic knowledge: 1) learning instance→semantic projections [36], [37], 2) learning semantic→instance projections [38], [39], and 3) learning the projections of instance and semantic spaces to a shared latent space [35], [40]. Our approach falls into the third category. Recently,
ZSL researchers have achieved success with deep generative models [35], [41], which synthesize data features as a data augmentation mechanism. In our work, we