
Corollary 2.6 provides two important insights: i) the normalization by the number of tasks that appears in (7) is justified, since the Frobenius norm of $\Gamma$ grows with the number of tasks; ii) if we can prove that the slope (defined in Equation (8)) is decreasing, then we are effectively proving that multitasking provides $\ell_2$ regularization, as we showed in the toy introductory example. This also holds for the case of Gaussian, i.i.d. task vectors, as shown in the following theorem.
Theorem 2.7. Let $C \in \mathbb{R}^{d \times T}$ be a random matrix with Gaussian, i.i.d. entries of variance $1/d$ and $d = \Omega(T^3)$. Let $C_t, C_{t+1}$ be the matrices formed by selecting the first $t$ and $t+1$ columns of $C$. Then, there is a noise level $\sigma_{\mathrm{thres}}$ such that, with probability $\geq 1 - \exp(-\Omega(\sqrt{d}))$, the SVD solutions (see (4)) of (2) (for $C_t, C_{t+1}$ respectively), under the noise corruption model, satisfy:
$$\widetilde{\mathrm{MSE}}_{t+1,\sigma^2} < \widetilde{\mathrm{MSE}}_{t,\sigma^2}, \qquad \forall\, \sigma \geq \sigma_{\mathrm{thres}}. \quad (11)$$
Remark 2.8. In words, this result shows that adding new tasks gives provably increased robustness to high noise corruption in the weights, when the task vectors are Gaussian.
Remark 2.9. Observe that the MSE under noise drops for every single new task added. The assumption $d = \Omega(T^3)$ can be relaxed to $d = \Omega(t^3)$, in which case we get increased robustness for the first $t$ added tasks. Nevertheless, for most applications $d = \Omega(T^3)$ is a realistic assumption: even for our smallest dataset, MNIST, $d = 784$, and we experiment with up to 10 tasks.
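As a concrete reference, a few lines of NumPy instantiate the random ensemble of Theorem 2.7. This sketch only constructs the objects in the statement (the variance-$1/d$ matrix $C$, its nested column submatrices, and a $d = \Omega(T^3)$ regime); the values of $d$, $T$, $t$ are illustrative, and the sketch does not reproduce the proof.

```python
import numpy as np

T, t = 10, 5
d = T ** 3                                   # the d = Omega(T^3) regime
rng = np.random.default_rng(0)

# Gaussian i.i.d. entries with variance 1/d, so each column has ~unit norm.
C = rng.normal(scale=np.sqrt(1.0 / d), size=(d, T))
C_t, C_t1 = C[:, :t], C[:, :t + 1]           # first t and t+1 task vectors
```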
3 Experimental Evaluation
We divide the experimental section into two parts. In the first part, we add noise to the final linear representation layer of various networks and verify that our theoretical analysis agrees with experimentally observed multitasking robustness on real datasets (MNIST, CIFAR10, Newsgroup20). In the second part, we show that multitasking leads to robustness to general weight corruptions in any layer of a complex transformer. Specifically, we show that multilingual language models are more robust to weight shifts (across all the layers) than monolingual models trained under the same setting. This is the first evidence of increased Cognitive Reserve in bilingual artificial neural networks.
Experiments with Linear Representation Layers.
We perform experiments on three datasets (MNIST, CIFAR10, Newsgroup20) and two modalities (vision and language). Each dataset normally involves a single classification task. We create multiple binary tasks by distinguishing between pairs of labels. For example, in CIFAR10, one task might be to distinguish between dogs and cats and another between airplanes and cars. To match our theory, we transform these into regression tasks by assigning each sample a value in $[0,1]$ for each task. For example, if task $i$ is to distinguish between dogs and cats, the value 0 corresponds to dog and the value 1 to cat.
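A minimal sketch of this task construction is given below; the label pairs and helper name are illustrative.

```python
import numpy as np

def make_binary_tasks(labels, pairs):
    """Turn a multi-class label vector into binary regression tasks.

    labels: (n,) integer class labels (e.g., CIFAR10 classes).
    pairs:  list of (class_a, class_b) tuples; one task per pair,
            with class_a mapped to 0.0 and class_b mapped to 1.0.
    Returns a list of (sample_indices, targets) pairs, one per task.
    """
    tasks = []
    for a, b in pairs:
        idx = np.where((labels == a) | (labels == b))[0]
        y = (labels[idx] == b).astype(np.float64)   # a -> 0.0, b -> 1.0
        tasks.append((idx, y))
    return tasks

# Example with CIFAR10-style classes (5 = dog, 3 = cat, 0 = airplane, 1 = car).
labels = np.random.randint(0, 10, size=1000)        # stand-in labels
tasks = make_binary_tasks(labels, [(5, 3), (0, 1)])
```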
The next step is learning the task vectors from training data. For MNIST, we can simply learn a linear layer $C$ with columns $\{c_1, \ldots, c_T\}$ such that $c_i^T x \approx y$ for each task. For more complex datasets like CIFAR or Newsgroup20, linear networks have lower performance, and hence it is less interesting to examine their robustness. Instead, we first use another network to extract representations $g_\theta(x)$ and then learn a linear layer acting on the encodings such that $c_i^T g_\theta(x) \approx y$. For CIFAR we used a pre-trained ResNet50 as the encoder, while for Newsgroup20 we used a pre-trained BERT [12]. We would like to point out that our theory remains valid in this case: it is equivalent to the linear layer $C$ receiving inputs from a learned representation rather than from the features directly. As the number of tasks increases, we reduce the number of training examples per task, so that the total training dataset size stays the same as the number of tasks grows.
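A minimal sketch of this fitting step, assuming the task vectors are obtained by ordinary least squares on the frozen encoder features (the full training details are in the Appendix):

```python
import numpy as np

def fit_task_vectors(features, targets):
    """Fit a linear layer C with columns c_1..c_T such that c_i^T g(x) ~ y_i.

    features: (n, d) matrix of frozen encoder outputs g_theta(x).
    targets:  (n, T) matrix; column i holds the [0, 1] labels of task i,
              with NaN where a sample does not belong to task i.
    Returns C of shape (d, T).
    """
    d, T = features.shape[1], targets.shape[1]
    C = np.zeros((d, T))
    for i in range(T):
        mask = ~np.isnan(targets[:, i])     # samples belonging to task i
        C[:, i], *_ = np.linalg.lstsq(features[mask], targets[mask, i],
                                      rcond=None)
    return C
```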
Figure 3 shows how the average MSE behaves as noise increases for different numbers of tasks. Note that even though all models start from roughly the same performance in the noiseless setting, the multitask models are consistently much more robust to corruption of their weights, across all datasets and modalities. This aligns with our theoretical analysis, which predicts that the robustness slope (defined in Equation (8)) decreases with the number of tasks. We compute robustness slopes for the learned task vectors on the real datasets and plot their decay in the Appendix, where we also include all the details of how these models were trained.
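For concreteness, the corruption protocol behind Figure 3 can be sketched as follows, under the assumption that the perturbation adds i.i.d. Gaussian noise to the entries of the learned layer and that the reported MSE is averaged over tasks and over independent noise draws:

```python
import numpy as np

def mse_under_weight_noise(C, features, targets, sigmas, n_draws=20, seed=0):
    """Average per-task MSE of a linear layer after corrupting its weights.

    For each noise level sigma, perturbs C entrywise with N(0, sigma^2)
    noise and measures the squared prediction error, averaged over tasks
    (NaN targets mark samples outside a task) and over n_draws noise draws.
    """
    rng = np.random.default_rng(seed)
    curve = []
    for sigma in sigmas:
        errs = []
        for _ in range(n_draws):
            C_noisy = C + sigma * rng.standard_normal(C.shape)
            preds = features @ C_noisy          # (n, T) predictions
            errs.append(np.nanmean((preds - targets) ** 2))
        curve.append(np.mean(errs))
    return np.array(curve)                      # one averaged MSE per sigma
```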
Experiments with Language Models.
Our objective is to compare robustness to neural weight
perturbations in monolingual and bilingual language models. We use the following perturbation