Critical Learning Periods for Multisensory Integration in Deep Networks
Michael Kleinman1*Alessandro Achille2Stefano Soatto2
1University of California, Los Angeles 2AWS AI Labs
michael.kleinman@ucla.edu {aachille,soattos}@amazon.com
Abstract
We show that the ability of a neural network to integrate information from diverse sources hinges critically on being exposed to properly correlated signals during the early phases of training. Interfering with the learning process during this initial stage can permanently impair the development of a skill, both in artificial and biological systems, where the phenomenon is known as a critical learning period. We show that critical periods arise from the complex and unstable early transient dynamics, which are decisive for the final performance of the trained system and its learned representations. This evidence challenges the view, engendered by analyses of wide and shallow networks, that the early learning dynamics of neural networks are simple, akin to those of a linear model. Indeed, we show that even deep linear networks exhibit critical learning periods for multi-source integration, while shallow networks do not. To better understand how internal representations change under disturbances or sensory deficits, we introduce a new measure of source sensitivity, which allows us to track the inhibition and integration of sources during training. Our analysis of inhibition suggests cross-source reconstruction as a natural auxiliary training objective, and indeed we show that architectures trained with cross-sensor reconstruction objectives are remarkably more resilient to critical periods. Our findings suggest that the recent success of self-supervised multi-modal training relative to previous supervised efforts may be due in part to more robust learning dynamics, and not solely to better architectures and/or more data.
1. Introduction
Learning generally benefits from exposure to diverse sources of information, including different sensory modalities, views, or features. Multiple sources can be more informative than the sum of their parts. For instance, both views of a random-dot stereogram are needed to extract the synergistic information, which is absent in each individual view [17]. More generally, multiple sources can help identify latent common factors of variation relevant to the task, and separate them from source-specific nuisance variability, as done in contrastive learning.

*Work conducted during an internship at AWS AI Labs.
Much work on information fusion in deep learning focuses on the design of the architecture, since different sources may require different architectural biases to be efficiently encoded. We instead focus on the learning dynamics, since effective fusion of different sources relies on complex phenomena beginning during the early epochs of training. In fact, even slight interference with the learning process during this critical period can permanently damage a network's ability to harvest synergistic information. Even in animals, which excel at multi-sensor fusion, a temporary deficit in one source during early development can permanently impair learning: congenital strabismus in humans can cause permanent loss of stereopsis if not corrected sufficiently early; similarly, visual/auditory misalignment can impair the ability of barn owls to localize prey [18]. In artificial networks, the challenge of integrating different sources has been noted in visual question answering (VQA), where the model often resorts to encoding less rich but more readily accessible textual information [2,6], ignoring the visual modality, and in audio-visual processing, where acoustic information is often washed out by visual information [32]. Such failures are commonly attributed to a mismatch in learning speed between sources, or to their "information asymmetry" for the task. It has also been suggested, based on limiting analyses of wide networks, that the initial dynamics of DNNs are very simple [16], seemingly in contrast with evidence from biology. In this paper, we instead argue that the early learning dynamics of information fusion in deep networks are both highly complex and brittle, to the point of exhibiting critical learning periods similar to those of biological systems.
In Sect. 2, we show that shallow networks do not exhibit critical periods when learning to fuse diverse sources of information, but deep networks do. Even though, unlike animals, artificial networks do not age, their learning success is still decided during the early phases of training. The existence of critical learning periods for information fusion is not an artifact of annealing the learning rate or of other details of the optimizer and the architecture. In fact, we show that critical periods for fusing information are present even in a simple deep linear network. This contradicts the idea that deep networks exhibit trivial early dynamics [16,23]. We provide an interpretation of critical periods in linear networks in terms of mutual inhibition/reinforcement between sources, manifest through sharp transitions in the learning dynamics, which in turn are related to the intrinsic structure of the underlying data distribution.

arXiv:2210.04643v2 [cs.LG] 14 Sep 2023

Figure 1. Decomposition of information between different modalities. Two modalities can have unique information, common information (denoted by the overlap in the Venn diagram), or synergistic information (denoted by the additional ellipse in the right panel). Task-relevant information (shown in red) can be distributed in a variety of ways across the different modalities. Task-relevant information can be mostly present in Modality A (left), shared between modalities (center-left), or can require unique (center-right) or synergistic information from both modalities (right).
In Sect. 3, we introduce a metric, the "Relative Source Variance" (RSV), to quantify the dependence of units in a representation on individual sources, allowing us to better understand inhibition and fusion between sources. Through it, in Sect. 4, we show that temporarily reducing the information in one source, or breaking the correlation between sources, can permanently change the overall amount of information in the learned representation. Moreover, even when downstream performance is not significantly affected, such temporary changes result in units that are highly polarized and process information from only one source or the other. Surprisingly, we found that the final representations of artificial networks exposed to a temporary deficit mirror single-unit animal representations under analogous deficits (Fig. 4, Fig. 6).
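The exact definition of the Relative Source Variance appears later in the paper; as a rough illustration only, the following hypothetical score (our own naming and normalization, not the paper's formula) compares, unit by unit, how much a representation varies when each source is varied alone:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_source_variance(f, xa, xb, n=256):
    """Per-unit sensitivity of a representation f(xa, xb) to source A vs. B.

    Illustrative sketch only: hold one source fixed at a reference sample,
    vary the other, and compare the per-unit variances. Values near 1 mean a
    unit varies only with source A; values near 0, only with source B.
    """
    ref_a, ref_b = xa[:1], xb[:1]
    var_a = f(xa[:n], np.repeat(ref_b, n, 0)).var(axis=0)  # vary A, fix B
    var_b = f(np.repeat(ref_a, n, 0), xb[:n]).var(axis=0)  # vary B, fix A
    return var_a / (var_a + var_b + 1e-12)

# Tiny example: a "representation" with one A-only unit, one B-only unit,
# and one mixed unit.
xa, xb = rng.normal(size=(256, 2)), rng.normal(size=(256, 2))
f = lambda a, b: np.stack([a[:, 0], b[:, 0], a[:, 1] + b[:, 1]], axis=1)
rsv = relative_source_variance(f, xa, xb)
print(np.round(rsv, 2))   # units near 1, 0, and ~0.5 respectively
```

A polarized distribution of such scores (units clustered near 0 and 1) corresponds to the polarization effect described above.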
We hypothesize that features inhibit each other because they compete to solve the task. If the competitive effect is reduced, for example through an auxiliary cross-source reconstruction task, the different sources can instead interact synergistically. This supports cross-modal reconstruction as a practical self-supervision criterion. In Sect. 4.4, we show that auxiliary cross-source reconstruction can indeed stabilize the learning dynamics and prevent critical periods. This suggests an alternative interpretation of recent achievements in multi-modal learning: they may owe as much to the improved stability of the early learning dynamics induced by auxiliary cross-modal reconstruction tasks as to the design of the architecture.
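The shape of such an auxiliary objective can be sketched as follows; this is a minimal toy version with hypothetical linear encoders and decoders (the paper's experiments use deep networks), where each source's encoding must reconstruct the other source:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of an auxiliary cross-source reconstruction objective: encode each
# source, then reconstruct the OTHER source from that encoding. This term is
# added to the main task loss during training.
def cross_reconstruction_loss(xa, xb, Ea, Eb, Dab, Dba):
    za, zb = xa @ Ea, xb @ Eb                      # per-source encodings
    loss_a_to_b = np.mean((za @ Dab - xb) ** 2)    # reconstruct B from A's code
    loss_b_to_a = np.mean((zb @ Dba - xa) ** 2)    # reconstruct A from B's code
    return loss_a_to_b + loss_b_to_a

d, k = 8, 4
xa = rng.normal(size=(32, d))
xb = xa + 0.1 * rng.normal(size=(32, d))           # two correlated sources
Ea, Eb = rng.normal(size=(d, k)), rng.normal(size=(d, k))
Dab, Dba = rng.normal(size=(k, d)), rng.normal(size=(k, d))
aux = cross_reconstruction_loss(xa, xb, Ea, Eb, Dab, Dba)
# total = task_loss + lam * aux  would be the full training objective
print(f"auxiliary cross-reconstruction loss: {aux:.2f}")
```

Minimizing this term rewards each encoder for keeping information predictive of the other source, which counteracts the competitive inhibition between sources.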
Empirically, we show the existence of critical learning periods for multi-source integration using state-of-the-art architectures (Sect. 4.3-4.4). To isolate the different factors that may contribute to low performance on multi-modal tasks (mismatched training dynamics, different informativeness), we focus on tasks where the sources of information are symmetric and homogeneous, in particular stereo and multi-view imagery. Even in this highly controlled setting, we observe the effect of critical periods in downstream performance and/or in unit polarization. Our analysis suggests that pre-training on one modality, for instance text, and then adding additional pre-trained backbones, for instance visual and acoustic, as advocated in recent trends with Foundation Models, yields representations that fail to encode synergistic information. Instead, training should be performed across modalities from the outset. Our work also suggests that asymptotic analysis is irrelevant for information fusion in deep networks, since their fate is sealed during the initial learning transient, and that conclusions drawn from wide and shallow networks do not transfer to the deep networks used in practice.
1.1. Related Work
Multi-sensor learning. There is a large literature on sensor fusion in early development [27], including homogeneous sensors that are spatially dislocated (e.g., two eyes) or time-separated (e.g., motion), and heterogeneous sources (e.g., optical and acoustic, or visual and tactile). Indeed, given normal learning, humans and other animals have the remarkable ability to integrate multi-sensory data, such as visual stimuli arriving at the two eyes together with corresponding haptic and auditory stimuli. Monkeys have been shown to be adept at combining and leveraging arbitrary sensory feedback [9].
Figure 2. (Left) $\Sigma^{yx}$, with the highlighted green column representing the sensor that was dropped. (Center) Total weight attributed to each feature (shown in different colors) during training of a deep linear network. Solid lines show the dynamics when training with all features; dashed lines show the behavior when training with the green feature disabled. Note that disabling the green feature prevents the gray feature from being learned during the initial transient. (Right) The same experiment with a shallow linear network; in this case the learning dynamics of the gray feature overlap perfectly in both cases.

In deep learning, multi-modal (or multi-view) learning typically falls into two broad categories: learning a joint representation (fusion of information) and learning an aligned representation (leveraging coordinated information in the multiple views) [5]. A fusion-based approach
is beneficial if there is synergistic information available in the different views, while an alignment-based approach is helpful if there is shared information common to the different views (Fig. 1). Such a division of information typically affects architectural and modeling choices: synergistic information requires the information from the different modalities to be fused or combined, whereas shared information often serves as a self-supervised signal that can align information from the different modalities, as in contrastive [8,29,30], correlation-based [3], and information-theoretic approaches [20,21].
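To make the alignment idea concrete, here is a minimal InfoNCE-style contrastive loss (an illustrative sketch, not taken from the paper or the cited works): embeddings of corresponding views should be more similar to each other than to the other samples in the batch.

```python
import numpy as np

# Minimal InfoNCE-style contrastive alignment loss: positives are the
# corresponding views (diagonal of the similarity matrix), negatives are all
# other pairs in the batch.
def info_nce(za, zb, temp=0.1):
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temp                        # pairwise view similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(np.diag(p)).mean()                # positives on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(16, 32))
aligned = info_nce(z, z + 0.01 * rng.normal(size=(16, 32)))  # matched views
shuffled = info_nce(z, rng.permutation(z))                   # broken pairing
print(f"aligned: {aligned:.3f}  shuffled: {shuffled:.3f}")
```

Correctly paired views incur a much lower loss than shuffled ones, which is exactly the shared-information signal that alignment-based methods exploit.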
Critical periods in animals and deep networks: Such architectural considerations often neglect the impact of multisensory learning dynamics, in which information can be learned at different speeds from each sensor [34]. Indeed, [33] showed that humans and animals are peculiarly sensitive to changes in the distribution of sensory information early in training, a phenomenon known as a critical period. Critical periods have since been described in many different species and sensory organs. For example, barn owls exposed early on to misaligned auditory and visual information cannot properly localize prey [22]. Somewhat surprisingly, similar critical periods for learning have also been observed in deep networks. [1] found that the early period of training is critical in determining the asymptotic behavior of the network. Likewise, the timing of regularization was found to be important for asymptotic performance [12], with regularization during the initial stages of training having the greatest effect.
Masked/de-noising autoencoders: Reconstructing an input from a noisy or partial observation has long been used as a form of supervision. Recently, in part due to the successful use of transformers in language [31] and vision [11], such pre-training strategies have been successfully applied to text [10] and vision [14] tasks. An extension of this approach has recently been applied to multi-modal data [4].
Models of learning dynamics: We consider two approaches for gaining analytic insight into the learning dynamics of deep networks. [25,26] assume that the input-output mapping is implemented by a deep linear network; we show that under this model critical periods may exist. [16,23] instead assume infinitely wide networks, resulting in a model that is linear in its parameters. In this latter case, no critical period is predicted, contradicting our empirical observations on finite networks.
2. A model for critical periods in sensor fusion

We want to establish the difference, in terms of learning dynamics, between learning to use two sources of information at the same time and learning to solve a task with each modality separately before merging the results. In particular, we consider the counterfactual question: if we disable sensor A during training, does this change how we learn to use sensor B? To start, consider the simple case of a linear regression model $y = Wx$ trained with the mean squared error loss
$$L = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2} \| y^{(i)} - W x^{(i)} \|^2,$$
where $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ is a training set of i.i.d. samples. In this simplified setting, we consider each component $x_k$ of $x$ as coming from a different sensor or source. To simplify further, we assume that the inputs have been whitened, so that the input correlation matrix $\Sigma^{x} = \frac{1}{N} \sum_i x^{(i)} x^{(i)T} = I$.
Figure 3. Example RSV distributions and relation to information diagrams. (Left) Representations that vary predominantly due to one modality. (Center-left, Center-right) All units in the representation vary nearly equally with both modalities. (Right) Units in the representation vary uniquely with each sensor, which is reflected by a polarized RSV distribution.

In this case, the learning dynamics of each source are independent of the others. In fact, the gradient of the weight $w_{jk}$ associating $x_k$ with $y_j$ is given by
$$-\nabla_{w_{jk}} L(W) = -\nabla_{w_{jk}} \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2} \| y^{(i)} - W x^{(i)} \|^2 = \Sigma^{yx}_{jk} - w_{jk}$$
and does not depend on any $w_{hl}$ with $(h, l) \neq (j, k)$. The answer to the counterfactual question is thus negative in this setting: adding or removing one source of information (or output) will not change how the model learns to extract information from the other sources. However, we now show that the addition of depth, even without introducing non-linearities, makes the situation radically different.
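The independence of the shallow dynamics can be checked numerically. The sketch below (a hypothetical toy setup, not the paper's experiment) runs gradient descent on the one-layer model with and without one source and confirms that the remaining weights follow identical trajectories:

```python
import numpy as np

rng = np.random.default_rng(0)
# With whitened inputs, the shallow-model gradient is Sigma_yx[j, k] - w_jk,
# so each weight evolves independently of all the others.
Sigma_yx = rng.normal(size=(3, 4))   # 3 outputs, 4 sources (toy values)

def train_shallow(active, steps=200, lr=0.1):
    W = np.zeros((3, 4))
    traj = []
    for _ in range(steps):
        grad = Sigma_yx - W          # -dL/dW under whitened inputs
        grad[:, ~active] = 0.0       # a disabled source receives no updates
        W = W + lr * grad
        traj.append(W.copy())
    return np.array(traj)

full = train_shallow(np.array([True, True, True, True]))
deficit = train_shallow(np.array([True, False, True, True]))  # drop source 1
# Trajectories of the remaining sources are unchanged by the deficit:
print(np.allclose(full[:, :, [0, 2, 3]], deficit[:, :, [0, 2, 3]]))  # True
```

This is the shallow baseline against which the deep linear network below should be compared.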
To this effect, consider a deep linear network with one hidden layer, $y = W_2 W_1 x$. This network has the same expressive power (and the same global optimum) as the previous model. However, depth introduces a mutual dependency between sensors (due to the shared layer) that can ultimately lead to critical periods in cross-sensor learning. To see this, we use an analytical expression for the learning dynamics of two-layer deep networks [25,26]. Let $\Sigma^{yx} = \frac{1}{N} \sum_{i=1}^{N} y^{(i)} x^{(i)T}$ be the cross-correlation matrix between the inputs $x$ and the target vector $y$,¹ and let $\Sigma^{yx} = U S V^T$ be its singular value decomposition (SVD). [26] shows that the total weight $W(t) = W_2(t) W_1(t)$ assigned to each source at time $t$ during training can be written as
$$W(t) = W_2(t) W_1(t) = U A(t) V^T \quad (1)$$
$$= \sum_{\alpha} a_\alpha(t) \, u_\alpha v_\alpha^T \quad (2)$$
where
$$a_\alpha(t) = \frac{s_\alpha \, e^{2 s_\alpha t / \tau}}{e^{2 s_\alpha t / \tau} - 1 + s_\alpha / a^0_\alpha}. \quad (3)$$
This leads to non-linear learning dynamics in which different features are learned at sharply distinct points in time [26]. Moreover, it entangles the learning dynamics of the different sources, since the eigenvectors $v_\alpha$ mix multiple sources.

¹Note that $W = \Sigma^{yx}$ is also the global minimum of the MSE loss $L = \frac{1}{N} \sum_i \frac{1}{2} \| y^{(i)} - W x^{(i)} \|^2$.
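The sharp, stage-like transitions implied by Eq. (3) can be checked numerically; the sketch below (with arbitrary illustrative choices of $\tau$ and the initial condition $a^0_\alpha$) evaluates $a_\alpha(t)$ and compares the observed half-rise time against the plateau-length estimate $\frac{\tau}{2 s_\alpha} \ln(s_\alpha / a^0_\alpha)$:

```python
import numpy as np

# Numerically evaluate the mode dynamics a_alpha(t) of Eq. (3); tau and the
# initial condition a0 are arbitrary values chosen for illustration.
def a_mode(t, s, tau=1.0, a0=1e-4):
    e = np.exp(2 * s * t / tau)
    return s * e / (e - 1 + s / a0)

t = np.linspace(0, 10, 2001)
for s in (4.0, 1.0):
    a = a_mode(t, s)                          # rises from a0 to s
    t_half = t[np.searchsorted(a, s / 2)]     # time of half saturation
    t_pred = np.log(s / 1e-4) / (2 * s)       # plateau-length estimate
    print(f"s={s}: half-rise at t={t_half:.2f} (predicted ~{t_pred:.2f})")
```

Modes with larger singular values $s_\alpha$ switch on earlier and more abruptly, so different features are learned at sharply distinct times, as described above.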
Disabling (or adding) a source of information corresponds to removing (or adding) a column of the matrix $\Sigma^{yx}$, which in turn affects its singular value decomposition and the corresponding learning dynamics. To see how this change can affect the learning dynamics, in Fig. 2 we compare the weights associated with each sensor during training on one particular task. Solid lines show the dynamics with all sensors active at the same time; dashed lines show the dynamics when one of the sensors is disabled. We see that disabling a sensor (green in the figure) can completely inhibit the learning of other task-relevant features (e.g., the gray feature) during the initial transient. This should be contrasted with the learning dynamics of a shallow one-layer network (Fig. 2, right), where all task-relevant features are learned at the same time and the removal of one source does not affect the others.
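A toy analogue of the Fig. 2 experiment (an illustrative sketch with hypothetical sizes and data, not the paper's exact setup) makes the coupling visible: in a two-layer linear network trained by gradient descent, disabling one source changes the trajectory of the total weights assigned to the other, untouched sources.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma_yx = rng.normal(size=(3, 4))   # toy target cross-correlation matrix

def train_deep(active, steps=4000, lr=0.01, hidden=8, seed=1):
    r = np.random.default_rng(seed)                 # same init for both runs
    W1 = 0.1 * r.standard_normal((hidden, 4))
    W2 = 0.1 * r.standard_normal((3, hidden))
    traj = []
    for _ in range(steps):
        E = Sigma_yx - W2 @ W1       # residual under whitened inputs; a
        E[:, ~active] = 0.0          # disabled source carries no error signal
        W1 += lr * (W2.T @ E)
        W2 += lr * (E @ W1.T)
        traj.append((W2 @ W1).copy())
    return np.array(traj)

full = train_deep(np.array([True, True, True, True]))
deficit = train_deep(np.array([True, False, True, True]))  # drop source 1
# The total weights of source 2 (never disabled) now follow a different
# transient: depth couples the sources through the shared hidden layer.
gap = np.abs(full[:, :, 2] - deficit[:, :, 2]).max()
print(f"max transient gap on an untouched source: {gap:.3f}")
```

Because the network is linear, both runs still converge to the optimal weights for their active sources, consistent with the observation below that the transient impairment is eventually discarded in the linear case.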
In deep linear networks, the suboptimal configuration learned during the initial transient is eventually discarded, and the network reverts to the globally optimal solution. In the following, we show that this is not the case for standard non-linear deep networks. While the initial non-trivial interaction between sources of information remains, non-linear networks are unable to unlearn the suboptimal configurations acquired at the beginning (owing to the highly non-convex loss landscape). This can result in permanent impairments when a source of information is removed during the initial transient of learning, mirroring the critical periods observed in animals.
3. Single Neuron Sensitivity Analysis
Before studying the empirical behavior of real networks on multi-sensor tasks, we should consider how to quantify the effect of a deficit on a downstream task. One way