
Figure 1. Decomposition of information between different modalities. Two modalities can have unique information, common information (denoted by the overlap in the Venn diagram), or synergistic information (denoted by the additional ellipse in the right panel). Task-relevant information (shown in red) can be distributed in a variety of ways across the modalities: it can be mostly present in Modality A (left), shared between modalities (center-left), or require unique (center-right) or synergistic information from both modalities (right).
istence of critical learning periods for information fusion is
not an artifact of annealing the learning rate or other details
of the optimizer and the architecture. In fact, we show that
critical periods for fusing information are present even in a
simple deep linear network. This contradicts the idea that
deep networks exhibit trivial early dynamics [16,23]. We
provide an interpretation of critical periods in linear networks in terms of mutual inhibition/reinforcement between sources, manifested through sharp transitions in the learning dynamics, which are in turn related to the intrinsic structure of the underlying data distribution.
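As a concrete illustration of this setting, the following minimal sketch (not the construction analyzed in the paper; the data model, dimensions, and learning rate are illustrative assumptions) trains a two-layer linear network on two correlated sources and tracks how much of the end-to-end linear map each source receives over training:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, lr, steps = 20, 10, 0.005, 2000

# Ground-truth linear task that depends on both sources.
w_star_a = rng.standard_normal((d, 1))
w_star_b = rng.standard_normal((d, 1))

def sample_batch(n=256, noise=0.5):
    # x_b is a noisy copy of x_a, so the two sources are correlated.
    x_a = rng.standard_normal((n, d))
    x_b = x_a + noise * rng.standard_normal((n, d))
    y = x_a @ w_star_a + x_b @ w_star_b
    return np.concatenate([x_a, x_b], axis=1), y

# Two-layer *linear* network with small initialization.
W1 = 0.01 * rng.standard_normal((2 * d, k))
W2 = 0.01 * rng.standard_normal((k, 1))

for t in range(steps):
    x, y = sample_batch()
    h = x @ W1
    err = h @ W2 - y                 # residual of the linear prediction
    g2 = h.T @ err / len(x)          # gradient of 0.5 * mean squared error w.r.t. W2
    g1 = x.T @ (err @ W2.T) / len(x) # gradient w.r.t. W1
    W1 -= lr * g1
    W2 -= lr * g2
    if t % 100 == 0:
        W = W1 @ W2                  # end-to-end linear map, shape (2d, 1)
        print(t, float(np.linalg.norm(W[:d])), float(np.linalg.norm(W[d:])))
```

With a small initialization, one typically observes an initial plateau in which both norms remain near zero, followed by a sharp transition during which the relative weighting of the two sources is determined.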
In Sect. 3, we introduce a metric called “Relative Source Variance” to quantify the dependence of units in a representation on individual sources, allowing us to better understand inhibition and fusion between sources. Through it, in
Sect. 4, we show that temporarily reducing the information
in one source, or breaking the correlation between sources,
can permanently change the overall amount of information
in the learned representation. Moreover, even when down-
stream performance is not significantly affected, such tem-
porary changes result in units that are highly polarized and
process only information from one source or the other. Surprisingly, we find that the final representations of artificial networks exposed to a temporary deficit mirror single-unit representations in animals exposed to analogous deficits (Fig. 4, Fig. 6).
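As an illustration of the kind of statistic involved, the sketch below computes an RSV-style score per unit; the exact definition in Sect. 3 may differ, and here we simply assume a normalized difference of per-source conditional variances. The `model` callable, the batch inputs, and the number of fixed samples are illustrative assumptions.

```python
import torch

def relative_source_variance(model, x_a, x_b, n_fixed=8):
    """Per-unit score in [-1, 1]: +1 -> unit driven only by source A,
    -1 -> only by source B, 0 -> equally dependent on both sources."""
    var_a, var_b = 0.0, 0.0
    with torch.no_grad():
        for i in range(n_fixed):
            # Vary source A while holding one sample of source B fixed.
            fixed_b = x_b[i : i + 1].expand_as(x_b)
            var_a = var_a + model(x_a, fixed_b).var(dim=0)
            # Vary source B while holding one sample of source A fixed.
            fixed_a = x_a[i : i + 1].expand_as(x_a)
            var_b = var_b + model(fixed_a, x_b).var(dim=0)
    var_a, var_b = var_a / n_fixed, var_b / n_fixed
    return (var_a - var_b) / (var_a + var_b + 1e-8)
```

Scores near ±1 correspond to polarized units that respond to a single source, while scores near 0 correspond to units that fuse both.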
We hypothesize that features inhibit each other because
they are competing to solve the task. But if the competitive
effect is reduced, such as through an auxiliary cross-source
reconstruction task, the different sources can interact syn-
ergistically. This supports cross-modal reconstruction as a
practical self-supervision criterion. In Sect. 4.4, we show
that indeed auxiliary cross-source reconstruction can stabi-
lize the learning dynamics and prevent critical periods. This suggests an alternative interpretation of recent successes in multi-modal learning: they may stem from the improved stability of the early learning dynamics afforded by auxiliary cross-modal reconstruction tasks, rather than from the design of the architecture.
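A hedged sketch of how such an auxiliary reconstruction term can be attached to a two-branch fusion model is given below; the encoders, decoders, task head, and the weight `alpha` are illustrative placeholders, not the architecture of Sect. 4.4.

```python
import torch
import torch.nn.functional as F

def training_step(enc_a, enc_b, head, dec_a2b, dec_b2a, x_a, x_b, y, alpha=0.1):
    # Encode each source separately, then fuse for the supervised task.
    z_a, z_b = enc_a(x_a), enc_b(x_b)
    task_loss = F.cross_entropy(head(torch.cat([z_a, z_b], dim=1)), y)
    # Auxiliary cross-source reconstruction: each representation must also
    # predict the *other* source, rewarding shared features over inhibition.
    recon_loss = F.mse_loss(dec_a2b(z_a), x_b) + F.mse_loss(dec_b2a(z_b), x_a)
    return task_loss + alpha * recon_loss
```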
Empirically, we show the existence of critical learning
periods for multi-source integration using state-of-the-art
architectures (Sect. 4.3-4.4). To isolate different factors that
may contribute to low performance on multi-modal tasks (mismatched training dynamics, different informativeness), we focus on tasks where the sources of information are symmetric and homogeneous, in particular stereo and multi-view imagery. Even in this highly controlled setting, we observe the effect of critical periods in downstream performance and/or in unit polarization. Our analysis suggests
that pre-training on one modality, for instance text, and then adding pre-trained backbones for other modalities, for instance visual and acoustic, as advocated in recent trends with Foundation Models, yields representations that fail to encode synergistic information. Instead, training should be performed across modalities from the outset. Our work also suggests that asymptotic analysis is irrelevant to fusion in deep networks, whose fate is sealed during the initial transient of learning, and that conclusions drawn from wide and shallow networks do not transfer to the deep networks used in practice.
1.1. Related Work
Multi-sensor learning. There is a large literature on
sensor fusion in early development [27], including homogeneous sensors that are spatially dislocated (e.g., two eyes) or time-separated (e.g., motion), and heterogeneous sources (e.g., optical and acoustic, or visual and tactile). Indeed, given normal learning, humans and other animals have the remarkable ability to integrate multi-sensory data, such as visual stimuli arriving at the two eyes, together with corresponding haptic and auditory stimuli. Monkeys have
been shown to be adept at combining and leveraging arbi-
trary sensory feedback information [9].
In deep learning, multi-modal (or multi-view) learning typically falls into two broad categories: learning a joint representation (fusion of information) and learning an aligned representation (leveraging coordinated informa-