Decoding Visual Neural Representations by
Multimodal Learning of Brain-Visual-Linguistic
Features
Changde Du, Kaicheng Fu, Jinpeng Li, and Huiguang He, Senior Member, IEEE
Abstract—Decoding human visual neural representations is a challenging task with great scientific significance in revealing
vision-processing mechanisms and developing brain-like intelligent machines. Most existing methods are difficult to generalize to novel
categories that have no corresponding neural data for training. The two main reasons are 1) the under-exploitation of the multimodal
semantic knowledge underlying the neural data and 2) the small number of paired (stimuli-responses) training data. To overcome these
limitations, this paper presents a generic neural decoding method called BraVL that uses multimodal learning of brain-visual-linguistic
features. We focus on modeling the relationships between brain, visual and linguistic features via multimodal deep generative models.
Specifically, we leverage the mixture-of-products-of-experts formulation to infer a latent code that enables a coherent joint generation of all
three modalities. To learn a more consistent joint representation and improve the data efficiency in the case of limited brain activity data,
we exploit both intra- and inter-modality mutual information maximization regularization terms. In particular, our BraVL model can be
trained under various semi-supervised scenarios to incorporate the visual and textual features obtained from the extra categories. Finally,
we construct three trimodal matching datasets, and the extensive experiments lead to some interesting conclusions and cognitive insights:
1) decoding novel visual categories from human brain activity is practically possible with good accuracy; 2) decoding models using the
combination of visual and linguistic features perform much better than those using either of them alone; 3) visual perception may be
accompanied by linguistic influences to represent the semantics of visual stimuli. Code and data: https://github.com/ChangdeDu/BraVL.
Index Terms—Generic neural decoding, brain-visual-linguistic embedding, multimodal learning, mutual information maximization
1 INTRODUCTION
Human visual capabilities are superior to those of current artificial
systems. Many cognitive neuroscientists and artificial intelli-
gence researchers have been committed to reverse-engineering the
human mind, to decipher and simulate the mechanism of the brain
and to promote the development of brain-inspired computational
models [1], [2], [3]. Although there is an increasing interest in the
visual neural representation decoding task, inferring the visual categories of novel classes from human brain activity remains an open frontier. Zero-Shot Neural Decoding (ZSND) based on functional
Magnetic Resonance Imaging (fMRI) or electroencephalography
(EEG) data aims to tackle this problem [4], [5], [6]. In ZSND, we have access to brain activity for a set of seen classes, and the
objective is to leverage the visual [6], [7] or linguistic [4], [5]
semantic knowledge to learn a generic neural decoder that enables
generalization to novel classes at test time. These studies not only
help to reveal the cognitive mechanism of the human brain, but also
C. Du, K. Fu and H. He are with the Research Center for Brain-Inspired
Intelligence, State Key Laboratory of Multimodal Artificial Intelligence
Systems, Institute of Automation, Chinese Academy of Sciences, Beijing
100190, China. K. Fu and H. He are also with the School of Artificial
Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing
100049, China (e-mail: changde.du@ia.ac.cn, fukaicheng2019@ia.ac.cn,
huiguang.he@ia.ac.cn).
J. Li is with the Ningbo HwaMei Hospital, UCAS, Zhejiang 315010, China
(e-mail: lijinpeng@ucas.ac.cn).
This work was supported in part by the National Key R&D Program of China
2022ZD0116500; in part by the National Natural Science Foundation of China
under Grant 62206284, Grant 62020106015 and Grant 61976209; in part by
the Strategic Priority Research Program of Chinese Academy of Sciences under
Grant XDB32040000; and in part by the CAAI-Huawei MindSpore Open Fund.
(Corresponding author: Huiguang He)
Fig. 1. Dual coding of knowledge in the human brain. When we see a picture of an elephant, we spontaneously retrieve our knowledge of elephants. The concept of the elephant is then encoded in the brain both visually and linguistically, where language, as a valid prior experience, contributes to shaping vision-derived representations. (Figure panels: visual experience leading to a vision-derived representation; language experience, e.g., "long trunk, tusks, large ear flaps, massive legs, ...", leading to a language-derived representation.)
provide a technical basis for the development of Brain-Computer
Interfaces (BCIs).
Existing visual neural representation decoding methods mostly
resort to visual semantic knowledge, such as features extracted
from viewed images on the basis of Gabor wavelet filters [7] or
a Convolutional Neural Network (CNN) [6], [8], [9]. However,
the human ability to detect, discriminate, and recognize perceptual
visual stimuli is influenced by both visual features and people’s
prior experiences [10]. For example, when we see a familiar object,
we spontaneously retrieve our knowledge of that object and the relationships it forms with other entities in our mind. As shown in
Fig. 1, cognitive neuroscience research on dual-coding theory [11],
[12] also considers concrete concepts to be encoded in the brain
both visually and linguistically, where language, as a valid prior
experience, contributes to shaping vision-derived representations.
Moreover, in brain-inspired computational modeling, large-scale
multimodal pretrained models [13], [14] formed by combining
image and text representations provide a better proxy for human-
like intelligence. Therefore, we argue that the recorded brain
activity should be decoded using a combination of not only the
visual semantic features that were in fact presented as clues, but
also a far richer set of linguistic semantic features typically related
to the target object.
Although several studies have addressed the idea of decoding
naturalistic visual experiences from brain activity using purely
linguistic features [15], [16], they merely use standard word vectors
of class names that are automatically extracted from large corpora
such as Common Crawl. Actually, the word vectors of class names
are barely aligned with visual information [17]. As a result, the
neural decoding accuracy is still far from practical requirements. Is
it possible to build a language representation that is more consistent
with visual cognition, with richer visual semantics? Previous studies
using Wikipedia text descriptions to represent image classes have
shown some positive signs [17], [18]. For example, as shown in
Fig. 2, the page “Elephants” contains phrases “long trunk, tusks,
large ear flaps, massive legs” and “tough but sensitive skin” that
exactly match the visual attributes. Intuitively, Wikipedia articles
capture richer visual semantic information than class names. Here,
we argue that using natural languages such as Wikipedia articles
as class descriptions will yield better neural decoding performance
than using class names.
Motivated by the aforementioned discussions, we proposed
a biologically plausible neural decoding method, called BraVL,
to infer novel image categories from human brain activity by
the joint learning of brain-visual-linguistic features. Our model
focuses on modeling the relationships between brain activity and
multimodal semantic knowledge, i.e., visual semantic knowledge
extracted from images and textual semantic knowledge obtained
from rich Wikipedia descriptions of classes. Specifically, we de-
veloped a multimodal auto-encoding variational Bayesian learning
framework, in which we used the mixture-of-product-of-experts
formulation [19] to infer a latent code that enables coherent joint
generation of all three modalities. To learn a more consistent
joint representation and improve the data efficiency in the case
of limited brain activity data, we further introduced both the
intra- and inter-modality Mutual Information (MI) regularization
terms. In particular, our BraVL model can be trained under
various semi-supervised learning scenarios to incorporate extra visual and textual features obtained from large-scale image categories beyond those covered by the training data.
Furthermore, we collected the corresponding textual descriptions
for two popular Image-fMRI datasets [6], [20] and one Image-EEG
dataset [21], hence forming three new trimodal matching (brain-
visual-linguistic) datasets. The experimental results give us three
significant observations. First, models using the combination of
visual and textual features perform much better than those using
either of them alone. Second, using natural languages as class
descriptions yields higher neural decoding performance than using
class names. Third, either unimodal or bimodal extra data can
remarkably improve decoding accuracy.
Contributions.
In summary, our main contributions are listed
as follows: 1) We combine visual and linguistic knowledge for
neural decoding of visual categories from human brain activity for
the first time. 2) We develop a new multimodal learning model
with specially designed intra- and inter-modality MI regularizers to
achieve more consistent brain-visual-linguistic joint representations
and improved data efficiency. 3) We contribute three trimodal
matching datasets, containing high-quality brain activity, visual
features and textual features. Our code and datasets have been
released¹ to facilitate further research. 4) Our experimental results
show several interesting conclusions and cognitive insights about
the human visual system.
2 RELATED WORK
Neural decoding of visual categories.
Estimating the seman-
tic categories of viewed images from evoked brain activity has
long been a sought objective. Previous works mostly relied on
a classification-based approach, where a classifier is trained to
build the relationship between brain activity and the predefined
labels using fMRI [1], [22], [23], [24] or EEG [3], [25], [26], [27]
data. However, this kind of method is restricted to the decoding
of a specified set of categories. To allow novel category decoding,
several identification-based methods [6], [7], [28] were proposed
by characterizing the relationship between brain activity and visual
semantic knowledge, such as image features extracted from Gabor
wavelet filters [7] or a CNN [6], [28]. Although these methods
allow the identification of a large set of possible image categories,
the decoding accuracy depends heavily on large amounts of paired stimulus-response data, which are difficult to collect.
Therefore, accurately decoding novel image categories remains
a challenge. Neurolinguistic studies have shown that distributed
word representations are also correlated with evoked brain activity
[15], [29], [30], [31]. Encouraged by these findings, we associate
brain activity with multimodal semantic knowledge, i.e., not only
visual features but also textual features. In particular, rather than
learning a mapping directly to multimodal semantic knowledge,
we focus on creating a latent space that could describe any valid
categories, and then learn a mapping between brain activity and
this latent space.
Zero-shot learning (ZSL).
ZSL is a classification problem
where the label space is divided into two distinct sets: seen
and novel classes [32], [33], [34]. To alleviate the problem
of seen-novel domain shift, training samples typically consist
of semantic knowledge such as attributes [32], [35] or word
embeddings [36] that bridge the semantic gap between seen
and novel classes. Semantic knowledge of these types reflects
human heuristics, and can therefore be extended and transferred
from seen classes to novel ones, specifying the semantic space in
ZSL. ZSL methods can be roughly divided into three categories,
depending on the method used to inject semantic knowledge: 1) learning instance→semantic projections [36], [37], 2) learning semantic→instance projections [38], [39], and 3) learning the
projections of instance and semantic spaces to a shared latent space
[35], [40]. Our approach falls into the third category. Recently,
ZSL researchers have achieved success through the use of deep
generative models [35], [41], which are used for synthesizing data
features as a data augmentation mechanism in ZSL. In our work, we
1. https://github.com/ChangdeDu/BraVL
Fig. 2. Image stimuli, evoked brain activity and their corresponding textual data. We can only collect brain activity for a few categories, but we can easily collect images and/or text data for almost all categories. Therefore, for seen classes, we assume that the brain activity, visual images and corresponding textual descriptions are available for training, whereas for novel classes, only visual images and textual descriptions are available for training. The test data are brain activity from the novel classes. (Figure panels: training data from seen classes, e.g., "Zebra" and "Elephant", with image stimuli, evoked brain activity and Wikipedia articles; training data from novel classes, e.g., "Goldfish, Carassius auratus", with images and Wikipedia articles but no brain activity; test data from novel classes, i.e., brain activity fed to the novel-class neural decoder. Novel classes are defined as a set of known candidate categories with no overlap with the seen classes.)
use the ZSL paradigm to solve the novel class neural decoding task.
Although visual and linguistic semantic knowledge is observable for the novel classes, no brain activity data are available for them. Therefore, novel class neural decoding can be regarded as a zero-shot
classification problem.
Multimodal learning.
Multimodal learning is inspired by cog-
nitive science research, suggesting that human semantic knowledge
relies on perceptual and sensori-motor experience. Multimodal
semantic models using both linguistic representations and visual
perceptual information have been proven successful in a range of
Natural Language Processing (NLP) tasks, such as learning word
embeddings [13], [42]. Several studies have addressed the idea of
decoding linguistic nouns from brain activity using both linguistic
and visual perceptual information [43], [44], [45]. Anderson et al.
applied linguistic and visually-grounded computational models to
decode the neural representations of a set of concrete and abstract
nouns [43]. Davis et al. constructed multimodal models combining
linguistic and three kinds of visual features, and evaluated the
models on the task of decoding brain activity associated with the
meanings of nouns [45]. In contrast to the above studies that have
leveraged visual features to boost the neural decoding of linguistic
nouns, we introduce textual features to enhance the neural decoding
of visual categories.
Mutual information maximization.
For two random variables $X$ and $Y$ whose joint probability distribution is $p(x, y)$, the mutual information (MI) between them is defined as $I(X;Y) = \mathbb{E}_{p(x,y)}\big[\log \frac{p(x,y)}{p(x)p(y)}\big]$. Furthermore, as a Shannon entropy-based quantity, MI can also be written as $I(X;Y) = H(X) - H(X|Y)$, where $H(X)$ is the Shannon entropy and $H(X|Y)$ is the conditional entropy. As a pioneer, [46] first incorporated MI-related
optimization into deep learning. Since then, many works have
demonstrated the benefit of the MI-maximization in deep repre-
sentation learning [47], [48], [49]. Since directly optimizing MI in
high-dimensional spaces is nearly impossible, many approximation
methods with variational bounds have been proposed [50], [51],
[52]. In our work, we apply MI-maximization at both the intra-
and inter-modality levels in multimodal representation learning.
We prove that inter-modality MI-maximization is equivalent to
multimodal contrastive learning.
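To make this connection concrete, the following PyTorch sketch shows a symmetric InfoNCE objective between paired brain and visual latent features, whose minimization maximizes a lower bound on their mutual information. The function and variable names (info_nce, z_b, z_v, temperature) are illustrative assumptions and do not reflect the released BraVL code.

```python
# Minimal sketch (assumed names and shapes, not the released BraVL code): a
# symmetric InfoNCE loss between paired brain and visual latent features.
# Minimizing it maximizes a lower bound on the inter-modality MI I(z_b; z_v).
import torch
import torch.nn.functional as F

def info_nce(z_b: torch.Tensor, z_v: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z_b, z_v: (batch, dim) latent features of paired brain/visual samples."""
    z_b = F.normalize(z_b, dim=-1)
    z_v = F.normalize(z_v, dim=-1)
    logits = z_b @ z_v.t() / temperature                    # pairwise similarities
    targets = torch.arange(z_b.size(0), device=z_b.device)  # positives lie on the diagonal
    # Average the brain-to-visual and visual-to-brain directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))  # toy usage
```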
3 MULTIMODAL LEARNING OF BRAIN-VISUAL-
LINGUISTIC FEATURES
3.1 Problem definition
In real world applications, we can only collect brain activity
for a few visual categories, but we can easily collect images and/or
text data for almost all categories. If we can make full use of the
plentiful image and text data without corresponding brain activity,
we have opportunities to improve the generalization performance
of neural decoding models. Therefore, as shown in Fig. 2, we
assume that brain activity, visual images (from ImageNet) and
class-specific textual descriptions (from Wikipedia) are provided
for the seen classes, but only visual and textual information is provided for novel/unseen classes. Our goal is to learn a classifier
(i.e., neural decoder) that can classify novel class brain activity at
test time. Note that the novel classes are a set of known candidate
categories with no overlap with the seen classes (rather than the
infinite arbitrary categories).
Let $\mathcal{D}^{seen} = \{(\mathbf{x}_b, \mathbf{x}_v, \mathbf{x}_t, y) \mid \mathbf{x}_b \in X^s_b, \mathbf{x}_v \in X^s_v, \mathbf{x}_t \in X^s_t, y \in Y^s\}$ be the set of seen class data, where $X^s_b$ corresponds to the set of brain activity (fMRI) features, $X^s_v$ denotes the visual features, $X^s_t$ denotes the textual features and $Y^s$ denotes the set of seen class labels. Similarly, the data for novel/unseen classes are defined as $\mathcal{D}^{novel} = \{(\mathbf{x}^n_v, \mathbf{x}^n_t, y^n) \mid \mathbf{x}^n_v \in X^n_v, \mathbf{x}^n_t \in X^n_t, y^n \in Y^n\}$, where $X^n_v$, $X^n_t$ and $Y^n$ denote the visual features, textual features and class labels of the novel classes, respectively. The seen class labels $Y^s$ and the novel class labels $Y^n$ are disjoint, i.e., $Y^s \cap Y^n = \emptyset$. Note that the novel class brain activity data $X^n_b$ are unavailable during model training and will only be used at test time.
Let $b$, $v$ and $t$ denote the subscripts of the brain, visual and textual modality, respectively. For any given modality subscript $m$ ($m \in \{b, v, t\}$), the unimodal feature matrix is $X_m \in \mathbb{R}^{N_m \times d_m}$, where $X_m = X^s_m \cup X^n_m$, $N_m = N^s_m + N^n_m$ is the sample size and $d_m$ is the feature dimension of modality $m$.
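For illustration only, the following NumPy sketch shows one way the seen/novel splits could be organized in memory; the sample counts, feature dimensions and dictionary keys are made-up assumptions, not the released dataset format.

```python
# Illustrative organization of the trimodal split (all shapes are made up).
# Seen classes provide brain, visual and textual features; novel classes
# provide visual and textual features only, with disjoint label sets.
import numpy as np

N_s, N_n = 1200, 200                 # hypothetical seen / novel sample counts
d_b, d_v, d_t = 3000, 4096, 768      # hypothetical feature dimensions

D_seen = {
    "X_b": np.random.randn(N_s, d_b),        # fMRI features X_b^s
    "X_v": np.random.randn(N_s, d_v),        # visual features X_v^s
    "X_t": np.random.randn(N_s, d_t),        # textual features X_t^s
    "y":   np.random.randint(0, 150, N_s),   # seen class labels Y^s
}
D_novel = {
    "X_v": np.random.randn(N_n, d_v),
    "X_t": np.random.randn(N_n, d_t),
    "y":   np.random.randint(150, 200, N_n), # novel labels, disjoint from Y^s
}
# Novel-class brain activity X_b^n is held out and appears only at test time.
```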
3.2 Brain, image and text preprocessing
As shown in Fig. 3, we first preprocess the raw inputs into
feature representations with modality-specific feature extractors.
Stability selection of brain voxels.
Brain activity differs from
trial to trial, even for an identical visual stimulus. To improve
Fig. 3. Data preprocessing. We preprocess the raw inputs into feature representations with modality-specific feature extractors. (Figure panels: fMRI trials pass through voxel stability selection and PCA to yield the brain feature $\mathbf{x}_b$; images pass through a pretrained visual backbone, feature concatenation and PCA to yield the visual embedding $\mathbf{x}_v$; the class name and Wikipedia article, e.g., "Elephant", pass through a pretrained NLP model (ALBERT, GPT-Neo) and PCA to yield the text embedding $\mathbf{x}_t$.)
the stability of neural decoding, we used stability selection for
fMRI data, in which the voxels showing the highest consistency in
activation patterns across distinct trials for an identical visual
stimulus were selected for the analysis, following [31]. This
stability is quantified for each voxel as the mean Pearson correlation
coefficient across all pairwise combinations of the trials. In
particular, stable voxels were selected separately within each brain region, which prevents the selected voxels from concentrating in a single local region and ensures that a portion of high-quality voxels is retained in every region. This operation can effectively reduce the
dimension of fMRI data and suppress the interference caused by
noisy voxels without seriously affecting the discriminative ability
of brain features. For each selected brain voxel, its response vector
to the visual stimuli belonging to the seen classes is normalized
(across stimuli, zero-mean and unit-variance). Note that we used
only the training fMRI data belonging to the seen classes to
calculate the normalization parameters (i.e., the mean and variance) of each selected voxel, and the calculated mean and variance were then used to normalize both the training and test fMRI data. After stability selection and normalization, we
perform Principal Component Analysis (PCA) on the training fMRI
data belonging to the seen classes for dimensionality reduction.
The brain feature dimensions after keeping 99% of the variance
using PCA are shown in Section 4.1. Note that the test samples
are not included in the PCA fitting, and we use only the training
samples to estimate the PCA mapping weights. After PCA fitting,
the estimated mapping weights are directly applied to the test
samples to obtain the dimension-reduced test samples.
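As a hedged illustration of this pipeline, the NumPy/scikit-learn sketch below computes per-voxel stability as the mean pairwise Pearson correlation across trials, keeps the most stable voxels, normalizes them with seen-class statistics and applies PCA retaining 99% of the variance. The shapes, the number of retained voxels and the global (rather than per-region) selection are simplifying assumptions, not the authors' exact code.

```python
# Illustrative voxel stability selection + normalization + PCA (assumed shapes
# and cutoff; the paper selects stable voxels separately within each region).
import numpy as np
from itertools import combinations
from sklearn.decomposition import PCA

def voxel_stability(trials: np.ndarray) -> np.ndarray:
    """trials: (n_trials, n_stimuli, n_voxels) repeated responses to identical stimuli.
    Returns the mean pairwise Pearson correlation per voxel across trials."""
    n_trials, _, n_voxels = trials.shape
    stability = np.zeros(n_voxels)
    for v in range(n_voxels):
        corrs = [np.corrcoef(trials[i, :, v], trials[j, :, v])[0, 1]
                 for i, j in combinations(range(n_trials), 2)]
        stability[v] = np.mean(corrs)
    return stability

trials = np.random.randn(5, 150, 400)                 # toy data: 5 trials, 150 stimuli, 400 voxels
keep = np.argsort(voxel_stability(trials))[-100:]     # keep the 100 most stable voxels (assumed)
X_train = trials.mean(axis=0)[:, keep]                # trial-averaged responses of selected voxels

mu, sigma = X_train.mean(axis=0), X_train.std(axis=0) # estimated on seen-class training data only
X_train = (X_train - mu) / sigma
pca = PCA(n_components=0.99).fit(X_train)             # retain 99% of the variance
x_b = pca.transform(X_train)                          # dimension-reduced brain features
# At test time: reuse mu, sigma and the fitted pca to transform novel-class fMRI data.
```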
Feature extraction of visual images.
We use a powerful
VGG-style ConvNet, referred to as RepVGG [53], to extract
hierarchical visual features from the images. Specifically, we use
the Timm library² to extract the intermediate feature maps with
different strides in the RepVGG-b3g4 model, which had been
pretrained to achieve 80.21% top-1 accuracy on ImageNet [53].
Similar to the brain feature processing pipeline, the extracted visual
features of seen classes are flattened and normalized first, and then
dimensionality reduction is performed using PCA to keep 99% of
the variance.
2. https://github.com/rwightman/pytorch-image-models
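A sketch of how such hierarchical features could be extracted with the Timm library is given below; the "repvgg_b3g4" model name is Timm's, but the use of features_only=True, the 4x4 average pooling and the concatenation/PCA details are our assumptions rather than the paper's exact pipeline.

```python
# Illustrative hierarchical feature extraction with Timm (pooling size,
# concatenation and PCA details are assumptions, not the paper's exact setup).
import timm
import torch
import torch.nn.functional as F
from sklearn.decomposition import PCA

model = timm.create_model("repvgg_b3g4", pretrained=True, features_only=True)
model.eval()

images = torch.randn(16, 3, 224, 224)                 # toy batch of preprocessed stimuli
with torch.no_grad():
    feature_maps = model(images)                      # intermediate maps at different strides

# Pool each stage spatially, flatten and concatenate into one feature vector per image.
pooled = [F.adaptive_avg_pool2d(f, 4).flatten(1) for f in feature_maps]
flat = torch.cat(pooled, dim=1).numpy()
flat = (flat - flat.mean(0)) / (flat.std(0) + 1e-8)   # normalize across (seen-class) stimuli
x_v = PCA(n_components=0.99).fit_transform(flat)      # keep 99% of the variance
```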
Embedding of textual descriptions.
In early studies of language processing and understanding, sentence vectors were typically generated by averaging the vectors of the content words [54]. This method of obtaining a sentence vector by average pooling of word vectors has been successfully applied in many linguistic neural encoding and decoding studies [30], [55], and has achieved impressive decoding results. With the development of NLP methods, researchers started to feed individual sentences into Transformer-based models [56], such as BERT [57], and to derive
fixed-size sentence embeddings, which have been found to be very
effective for neural encoding [55]. To obtain sentence embeddings from BERT-like NLP models, the most commonly used approach is to average the output-layer token embeddings or to use the output of the first token (the [CLS] token). As shown
in a previous linguistic neural decoding study [58], these two
common practices yield similar qualitative results. Here, we use
ALBERT [59] and GPT-Neo [60] as text encoders, and we use the
mean of token embeddings as the sentence embedding.³
Due to the constraint on the input sequence length for ALBERT
and GPT-Neo, we cannot directly input the entire Wikipedia
article into the model. To encode articles that can be longer than
the maximal length, we instead split the article text into
partially overlapping sequences of 256 tokens with an overlap of
50 tokens. Concatenating multiple sentence embeddings will lead
to an undesirable ‘curse of dimensionality’ issue. Therefore, we
use the average-pooled representation of multiple sequences to
encode the entire article. This average-pooling strategy has also
been successfully used in a recent linguistic neural encoding study
[61]. Similarly, if a class has multiple corresponding articles in
Wikipedia, we average the representations obtained from each of
them. See Appendix for the degree of heterogeneity of text features
under average-pooling.
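The chunk-and-pool strategy can be sketched as follows with the Hugging Face Transformers library; the "albert-base-v2" checkpoint and the exact chunking arithmetic are illustrative assumptions, not necessarily the configuration used in the paper.

```python
# Illustrative article embedding via overlapping 256-token chunks with mean
# pooling (checkpoint and chunking details are assumed, not the paper's exact setup).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
model = AutoModel.from_pretrained("albert-base-v2")
model.eval()

def embed_article(text: str, chunk_len: int = 256, overlap: int = 50) -> torch.Tensor:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + chunk_len]
              for i in range(0, max(len(ids) - overlap, 1), chunk_len - overlap)]
    embeddings = []
    with torch.no_grad():
        for chunk in chunks:
            out = model(input_ids=torch.tensor([chunk]))
            embeddings.append(out.last_hidden_state.mean(dim=1))  # mean of token embeddings
    return torch.cat(embeddings).mean(dim=0)                      # average over chunks

x_t = embed_article("Elephants are the largest existing land animals. ...")
```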
3.3 High-level overview of the proposed BraVL model
Fig. 4A shows the overall architecture of the proposed BraVL
model. The model works in two collaborative parts—multi-
modality joint modeling and MI regularization:
Multi-modality joint modeling.
Based on the Mixture-of-
Products-of-Experts (MoPoE) formulation [19], we develop
a multimodal auto-encoding variational Bayesian model
that enables us to utilize the visual and textual features
jointly to enhance the brain activity representation learning
and downstream novel class neural decoding performance.
Specifically, we use three modality-specific encoding networks $E_b$, $E_v$ and $E_t$ to transform the unimodal features $\mathbf{x}_b$, $\mathbf{x}_v$ and $\mathbf{x}_t$ into the joint latent representation $\mathbf{z}$, which is then passed through three modality-specific decoding networks $D_b$, $D_v$ and $D_t$ for feature reconstruction, respectively.
Mutual information (MI) regularization.
MI is maximized simultaneously at two levels: the intra-modality level and the inter-modality level. The former is approximated by its variational lower bound [62], and the latter is achieved through introspective cross-modal contrastive learning. The MI at the intra-modality level is used as a consistency regularizer to force the joint latent representation $\mathbf{z}$ to have a strong relationship with the observations $\mathbf{x}_b$, $\mathbf{x}_v$ and $\mathbf{x}_t$, and hence learn useful joint representations. The MI at the inter-modality
3. https://github.com/huggingface/transformers