
Figure 1. Decomposition of information between different modalities. Two modalities can have unique information, common information (denoted by the overlap in the Venn diagram), or synergistic information (denoted by the additional ellipse in the right panel). Task-relevant information (shown in red) can be distributed in a variety of ways across the modalities: it can be mostly present in Modality A (left), shared between modalities (center-left), or require unique (center-right) or synergistic information from both modalities (right).
istence of critical learning periods for information fusion is
not an artifact of annealing the learning rate or other details
of the optimizer and the architecture. In fact, we show that
critical periods for fusing information are present even in a
simple deep linear network. This contradicts the idea that
deep networks exhibit trivial early dynamics [16,23]. We
provide an interpretation of critical periods in linear networks in terms of mutual inhibition/reinforcement between sources, manifested through sharp transitions in the learning dynamics, which are in turn related to the intrinsic structure of the underlying data distribution.
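As a concrete illustration of this setting, the following minimal sketch (not the construction analyzed in the paper; the data model, dimensions, and learning rate are illustrative assumptions) trains a two-layer linear network on two correlated sources and tracks how much of the end-to-end linear map each source receives over training:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, lr, steps = 20, 10, 0.005, 2000

# Ground-truth linear task that depends on both sources.
w_star_a = rng.standard_normal((d, 1))
w_star_b = rng.standard_normal((d, 1))

def sample_batch(n=256, noise=0.5):
    # x_b is a noisy copy of x_a, so the two sources are correlated.
    x_a = rng.standard_normal((n, d))
    x_b = x_a + noise * rng.standard_normal((n, d))
    y = x_a @ w_star_a + x_b @ w_star_b
    return np.concatenate([x_a, x_b], axis=1), y

# Two-layer *linear* network with small initialization.
W1 = 0.01 * rng.standard_normal((2 * d, k))
W2 = 0.01 * rng.standard_normal((k, 1))

for t in range(steps):
    x, y = sample_batch()
    h = x @ W1
    err = h @ W2 - y                 # residual of the linear prediction
    g2 = h.T @ err / len(x)          # gradient of 0.5 * mean squared error w.r.t. W2
    g1 = x.T @ (err @ W2.T) / len(x) # gradient w.r.t. W1
    W1 -= lr * g1
    W2 -= lr * g2
    if t % 100 == 0:
        W = W1 @ W2                  # end-to-end linear map, shape (2d, 1)
        print(t, float(np.linalg.norm(W[:d])), float(np.linalg.norm(W[d:])))
```

With a small initialization, one typically observes an initial plateau in which both norms remain near zero, followed by a sharp transition during which the relative weighting of the two sources is determined.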
In Sect. 3, we introduce a metric called “Relative Source Variance” to quantify the dependence of units in a representation on individual sources, allowing us to better understand inhibition and fusion between sources. Through it, in
Sect. 4, we show that temporarily reducing the information
in one source, or breaking the correlation between sources,
can permanently change the overall amount of information
in the learned representation. Moreover, even when down-
stream performance is not significantly affected, such tem-
porary changes result in units that are highly polarized and
process only information from one source or the other. Surprisingly, we find that the final representations of artificial networks exposed to a temporary deficit mirror single-unit representations in animals exposed to analogous deficits (Fig. 4, Fig. 6).
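As an illustration of the kind of statistic involved, the sketch below computes an RSV-style score per unit; the exact definition in Sect. 3 may differ, and here we simply assume a normalized difference of per-source conditional variances. The `model` callable, the batch inputs, and the number of fixed samples are illustrative assumptions.

```python
import torch

def relative_source_variance(model, x_a, x_b, n_fixed=8):
    """Per-unit score in [-1, 1]: +1 -> unit driven only by source A,
    -1 -> only by source B, 0 -> equally dependent on both sources."""
    var_a, var_b = 0.0, 0.0
    with torch.no_grad():
        for i in range(n_fixed):
            # Vary source A while holding one sample of source B fixed.
            fixed_b = x_b[i : i + 1].expand_as(x_b)
            var_a = var_a + model(x_a, fixed_b).var(dim=0)
            # Vary source B while holding one sample of source A fixed.
            fixed_a = x_a[i : i + 1].expand_as(x_a)
            var_b = var_b + model(fixed_a, x_b).var(dim=0)
    var_a, var_b = var_a / n_fixed, var_b / n_fixed
    return (var_a - var_b) / (var_a + var_b + 1e-8)
```

Scores near ±1 correspond to polarized units that respond to a single source, while scores near 0 correspond to units that fuse both.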
We hypothesize that features inhibit each other because
they are competing to solve the task. But if the competitive
effect is reduced, such as through an auxiliary cross-source
reconstruction task, the different sources can interact syn-
ergistically. This supports cross-modal reconstruction as a
practical self-supervision criterion. In Sect. 4.4, we show
that indeed auxiliary cross-source reconstruction can stabi-
lize the learning dynamics and prevent critical periods. This suggests an alternative interpretation of recent successes in multi-modal learning: they may stem from the improved stability of the early learning dynamics afforded by auxiliary cross-modal reconstruction tasks, rather than from the design of the architecture.
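A hedged sketch of how such an auxiliary reconstruction term can be attached to a two-branch fusion model is given below; the encoders, decoders, task head, and the weight `alpha` are illustrative placeholders, not the architecture of Sect. 4.4.

```python
import torch
import torch.nn.functional as F

def training_step(enc_a, enc_b, head, dec_a2b, dec_b2a, x_a, x_b, y, alpha=0.1):
    # Encode each source separately, then fuse for the supervised task.
    z_a, z_b = enc_a(x_a), enc_b(x_b)
    task_loss = F.cross_entropy(head(torch.cat([z_a, z_b], dim=1)), y)
    # Auxiliary cross-source reconstruction: each representation must also
    # predict the *other* source, rewarding shared features over inhibition.
    recon_loss = F.mse_loss(dec_a2b(z_a), x_b) + F.mse_loss(dec_b2a(z_b), x_a)
    return task_loss + alpha * recon_loss
```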
Empirically, we show the existence of critical learning
periods for multi-source integration using state-of-the-art
architectures (Sect. 4.3-4.4). To isolate different factors that
may contribute to low performance on multi-modal tasks (mismatched training dynamics, different informativeness), we focus on tasks where the sources of information are symmetric and homogeneous, in particular stereo and multi-view imagery. Even in this highly controlled setting, we observe the effect of critical periods in downstream performance and/or in unit polarization. Our analysis suggests
that pre-training on one modality, for instance text, and then adding pre-trained backbones for other modalities, for instance visual and acoustic, as advocated in recent trends with Foundation Models, yields representations that fail to encode synergistic information. Instead, training should be performed across modalities from the outset. Our work also suggests that asymptotic analysis is irrelevant to fusion in deep networks, whose fate is sealed during the initial transient of learning, and that conclusions drawn from wide and shallow networks do not transfer to the deep networks used in practice.
1.1. Related Work
Multi-sensor learning. There is a large literature on
sensor fusion in early development [27], including homogeneous sensors that are spatially dislocated (e.g., two eyes) or time-separated (e.g., motion), and heterogeneous sources (e.g., optical and acoustic, or visual and tactile). Indeed, given normal learning, humans and other animals have the remarkable ability to integrate multi-sensory data, such as visual stimuli arriving at the two eyes, together with corresponding haptic and auditory stimuli. Monkeys have
been shown to be adept at combining and leveraging arbi-
trary sensory feedback information [9].
In deep learning, multi-modal (or multi-view) learning typically falls into two broad categories: learning a joint representation (fusion of information) and learning an aligned representation (leveraging coordinated informa-