
Multitasking Models are Robust to Structural Failure:
A Neural Model for Bilingual Cognitive Reserve
Giannis Daras*
The University of Texas at Austin
giannisdaras@utexas.edu
Negin Raoof*
The University of Texas at Austin
neginraoof@gmail.com
Zoi Gkalitsiou
The University of Texas at Austin
zoi.gkalitsiou@austin.utexas.edu
Alexandros G. Dimakis
The University of Texas at Austin
dimakis@austin.utexas.edu
Abstract
We find a surprising connection between multitask learning and robustness to
neuron failures. Our experiments show that bilingual language models retain
higher performance under various neuron perturbations, such as random deletions, magnitude pruning, and weight noise, compared to equivalent monolingual ones. We
provide a theoretical justification of this robustness by mathematically analyzing
linear representation learning and showing that multitasking creates more robust
representations. Our analysis connects robustness to spectral properties of the
learned representation and proves that multitasking leads to higher robustness for
diverse task vectors. We open-source our code and models at the following URL:
https://github.com/giannisdaras/multilingual_robustness.
1 Introduction
Converging evidence from cognitive science research indicates that bilingualism increases brain robustness by reducing the rate of cognitive decline due to aging [1, 2] and delaying the onset of symptoms of dementia [3, 4]. It appears that individuals who speak more than one language on a regular basis are able to maintain typical cognitive functioning despite neural degeneration. This mismatch between cognitive functioning and brain pathology is called Cognitive Reserve [5], and its underlying mechanisms are poorly understood and remain an active topic of investigation.
Inspired by this research, we study whether artificial neural networks are more robust when trained
on multiple languages or multiple tasks. Our experiments demonstrate that training on multiple
tasks indeed increases structural robustness. We train monolingual and bilingual GPT-2 models
with the same architecture and dataset sizes. Initially, monolingual GPT-2 [6] models slightly outperform the bilingual ones, but when we introduce structural noise (by randomly deleting neurons or adding noise to the weights), bilingual models degrade more gracefully and eventually outperform the monolingual models in the high-noise regime. Beyond some amount of noise, bilingual models start outperforming the monolingual ones, demonstrating a cross-over in performance due to their increased robustness. We observe this phenomenon for numerous models across three different types of corruption: additive Gaussian noise on the weights, random weight pruning, and magnitude-based weight pruning [7].
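To make the three corruption types concrete, here is a minimal NumPy sketch of how they can be applied to a single weight matrix; the function names and the tie-handling in magnitude pruning are our own choices, not the paper's released implementation.

```python
import numpy as np

def add_gaussian_noise(w, sigma, rng):
    """Additive Gaussian noise: W_c = W + N, N_ij ~ N(0, sigma^2) i.i.d."""
    return w + rng.normal(0.0, sigma, size=w.shape)

def random_prune(w, p, rng):
    """Set each weight to zero independently with probability p."""
    mask = rng.random(w.shape) >= p
    return w * mask

def magnitude_prune(w, p):
    """Zero out (approximately) the fraction p of weights with smallest magnitude."""
    k = int(p * w.size)
    if k == 0:
        return w.copy()
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

# Example: corrupt a random weight matrix at increasing severity.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
for p in [0.1, 0.3, 0.5]:
    print(p, float(np.mean(random_prune(w, p, rng) == 0)))
```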
Our Contributions: We provide a theoretical justification of this phenomenon by mathematically analyzing linear multitask representation learning [8, 9].
*Equal contribution.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.11618v1 [cs.LG] 20 Oct 2022
Figure 1: Performance of monolingual and bilingual GPT-2 models with the same architecture and training dataset size. We show the performance as we randomly erase weights. The x-axis indicates the probability of erasing an attention weight parameter (setting it to zero). The y-axis indicates the average perplexity over 20 runs with 95% confidence intervals. The bilingual model initially shows slightly worse performance, but as more weights are deleted, the monolingual model declines faster and performs worse in the highly damaged regime; the cross-over point is marked in the plot. This indicates that the bilingual GPT-2 model is more robust to neuron weight erasures. We show similar results for several models and types of errors in our experimental section.
Figure 2: Let $c_1, c_2, c_3$ be the standard basis of $\mathbb{R}^3$. For two tasks, the best one-dimensional approximation to $c_1, c_2$ is $\hat{c}_1 = [1/2, 1/2, 0]^T$, but the best one-dimensional approximation to the three tasks $c_1, c_2, c_3$ is $\hat{c}_1' = [1/3, 1/3, 1/3]^T$. Multitasking creates $\ell_2$ regularization since $\lVert\hat{c}_1'\rVert_2 < \lVert\hat{c}_1\rVert_2$. It is important that the original task vectors $c_1, c_2, c_3$ are orthogonal, i.e. diverse, since this creates the regularization.
Our analysis shows that introducing more diverse tasks creates $\ell_2$ regularization in the linear task heads. Further, we formally connect the Euclidean norm of the learned representations to structural robustness under errors in the network weights. Our main theorem establishes that multitasking leads to higher robustness to additive noise for linear representations when the task vectors are selected as random, independent Gaussian vectors. Our results also establish that when the tasks overlap significantly, multitasking does not lead to higher robustness; hence task diversity is necessary.
We experimentally observe that multitasking increases structural robustness for numerous networks
and multiple problems including MNIST, CIFAR10, Newsgroup20, GPT models, and GPT models fine-tuned on GLUE tasks. We train networks under exactly comparable dataset and architecture
conditions and show that models become more robust to structural failures as they are trained with
more tasks. We experiment with three different types of structural failures and show robustness
increases for all of them. We also experimentally observe that the addition of diverse tasks seems to
regularize the model weights, as we predict in our theoretical analysis.
2 Theoretical Analysis
Building intuition. We start with a small numerical example to build intuition. Given a feature vector $x \in \mathbb{R}^d$, we compute a $k$-dimensional linear representation $Wx$ using a matrix $W \in \mathbb{R}^{k \times d}$. We choose $W$ such that we best approximate a set of ground-truth task vectors $\{c_1, c_2, \ldots, c_T\}$ that lie in $\mathbb{R}^d$. The learned approximation is $\hat{c}_i = W^T \gamma_i$. Essentially, we use linear combinations of the columns of $W^T$ to approximate the task vectors. For simplicity, we assume that the columns of $W^T$ are unit norm. We study the case where $k < T$, since otherwise there are infinitely many solutions.
Assume we work in $d = 3$ dimensions with $T = 3$ total tasks, $c_1 = [1,0,0]^T$, $c_2 = [0,1,0]^T$, $c_3 = [0,0,1]^T$. Set our learned representation dimension to be $k = 1$. When $T = 2$, using only the first two tasks $c_1, c_2$, an optimal solution is $W = \frac{1}{\sqrt{2}}[1,1,0]$. The corresponding linear heads are the scalars $\gamma_1 = \gamma_2 = \frac{1}{\sqrt{2}}$, and the approximate vectors are $\hat{c}_1 = W^T\gamma_1 = [0.5, 0.5, 0]^T = \hat{c}_2$. Therefore the best one-dimensional subspace to jointly approximate $c_1, c_2$ is the span of $W = \frac{1}{\sqrt{2}}[1,1,0]$. Now we introduce one more task and find the one-dimensional subspace that best approximates $c_1, c_2, c_3$. That becomes $W' = \frac{1}{\sqrt{3}}[1,1,1]$ with linear heads $\gamma_1' = \gamma_2' = \gamma_3' = \frac{1}{\sqrt{3}}$. The approximate vectors now are $\hat{c}_1' = (W')^T\gamma_1' = [1/3, 1/3, 1/3]^T = \hat{c}_2' = \hat{c}_3'$. Notice that $\lVert\hat{c}_i'\rVert_2^2 = 1/3$ for three tasks but $\lVert\hat{c}_i\rVert_2^2 = 1/2$ for two tasks. The point is that with more tasks, the vector that jointly approximates all task vectors becomes shorter. Equivalently, the $\ell_2$ norm of the linear task heads decreases from $\gamma_i = \frac{1}{\sqrt{2}}$ to $\gamma_i' = \frac{1}{\sqrt{3}}$ as the number of tasks increases from two to three, showing how multitasking creates regularization. A graphical representation of this example is given in Figure 2. It is important that the task vectors $c_i$ are orthogonal, increasing the effective dimensionality of the problem. The intuition is that diverse tasks increase the effective dimension, making the best approximation vector shorter.
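As a quick numerical check of the example above, the following NumPy sketch plugs in the two candidate representations, recovers the stated heads and approximation norms, and confirms that each attains the optimal rank-1 error (the optimum is not unique here because the singular values are tied); the helper name is ours.

```python
import numpy as np

# Standard-basis task vectors c_1, c_2, c_3 in R^3 (three orthogonal, diverse tasks).
C3 = np.eye(3)

def heads_and_approx(w, C):
    """For a fixed unit-norm representation row w, the optimal linear heads are
    gamma_i = w . c_i and the approximations are c_hat_i = gamma_i * w."""
    gamma = w @ C                 # shape (T,)
    c_hat = np.outer(w, gamma)    # d x T, column i approximates c_i
    return gamma, c_hat

for T, w in [(2, np.array([1, 1, 0]) / np.sqrt(2)),
             (3, np.array([1, 1, 1]) / np.sqrt(3))]:
    C = C3[:, :T]
    gamma, c_hat = heads_and_approx(w, C)
    err = np.sum((c_hat - C) ** 2)
    # Best achievable rank-1 error = sum of the discarded squared singular values.
    best = np.sum(np.linalg.svd(C, compute_uv=False)[1:] ** 2)
    print(f"T={T}: heads={gamma.round(3)}, ||c_hat||^2={np.sum(c_hat[:, 0]**2):.3f}, "
          f"error={err:.3f} (optimal rank-1 error={best:.3f})")
# The heads shrink from 1/sqrt(2) ~ 0.707 to 1/sqrt(3) ~ 0.577 and
# ||c_hat||^2 drops from 1/2 to 1/3 as a third orthogonal task is added.
```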
Our main theoretical result is that this phenomenon is quite general and makes multitasking lead
to structural robustness. We connect the norm of the approximated task vectors with robustness to
weight perturbations and show that for Gaussian, independent task vectors the average norm shrinks
as more tasks are added. This is intuitive since high dimensional Gaussian vectors are near-orthogonal.
Surprisingly, we empirically show that task vectors for numerous problems also exhibit this behavior.
Analysis. We consider a neural network $f_\theta : \mathbb{R}^d \to \mathbb{R}^k$ and a collection of tasks $\{\mathcal{T}_1, \ldots, \mathcal{T}_T\}$. We are trying to learn $\theta$ and $\gamma_i \in \mathbb{R}^k$ to solve the following optimization problem:
$$\operatorname*{argmin}_{\theta, \{\gamma_1, \ldots, \gamma_T\}} \sum_{i=1}^{T} \mathbb{E}_{(x,y) \in \mathcal{T}_i} L\left(\gamma_i^T f_\theta(x), y\right). \tag{1}$$
The neural network $f_\theta$ can be as simple as a single matrix $W : \mathbb{R}^d \to \mathbb{R}^k$. For linear networks, we consider the following dataset generation process: for task $\mathcal{T}_i$, we sample a Gaussian $x$ and generate its label $y$ by taking the inner product with a task vector $c_i$, i.e. $y = c_i^T x$. Given infinite samples and MSE loss, the optimization problem (1) is equivalent to the following problem.
Definition 2.1 (Optimization Problem). Let $k < T < d$. We define the Factorized Best Rank-$k$ Approximation of a matrix $C \in \mathbb{R}^{d \times T}$ as the optimization problem:
$$W^*, \Gamma^* = \operatorname*{argmin}_{W \in \mathbb{R}^{k \times d},\, \Gamma \in \mathbb{R}^{k \times T}} \left\lVert W^T\Gamma - C \right\rVert_F^2. \tag{2}$$
We are interested in the case where the dimensionality of the representation, $k$, is smaller than the number of tasks $T$; otherwise the best rank-$k$ approximation of $C$ is not unique.
The following proposition states that, in the considered setting, the problem of Definition 2.1 can be solved with the SVD.

Proposition 2.2. For any matrix $C \in \mathbb{R}^{d \times T}$ with distinct singular values, any solution of Definition 2.1 satisfies:
$$W^{*T}\Gamma^* = U\Sigma_k V^T, \tag{3}$$
where $U\Sigma V^T$ is the SVD of $C$ and $\Sigma_k$ is the same as $\Sigma$ except that the last $T-k$ diagonal entries are zeroed out.

The fact that the singular value decomposition computes the best rank-$k$ approximation to a matrix can be found in several textbooks, e.g. Golub and Van Loan [10] and Blum et al. [11].

This proposition establishes that $W^* = U^T$ and $\Gamma^* = \Sigma_k V^T$ is a valid solution of (2). Onwards, we will call this the SVD solution.
Definition 2.3. We define the SVD solution of (2) to be:
$$W_{\mathrm{SVD}} = U^T, \qquad \Gamma_{\mathrm{SVD}} = \Sigma_k V^T. \tag{4}$$
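A minimal NumPy sketch of the SVD solution in (4); here $W$ keeps only the top-$k$ left singular vectors, which is the part that matters for the rank-$k$ product, and the sanity checks use the Eckart–Young property. The dimensions below are arbitrary illustrative choices.

```python
import numpy as np

def svd_solution(C, k):
    """SVD solution of (2): W_SVD = U_k^T, Gamma_SVD = Sigma_k V_k^T,
    so that W_SVD^T Gamma_SVD is the best rank-k approximation of C."""
    U, S, Vt = np.linalg.svd(C, full_matrices=False)
    W = U[:, :k].T                    # k x d, orthonormal rows
    Gamma = S[:k, None] * Vt[:k, :]   # k x T, row i scaled by sigma_i
    return W, Gamma

rng = np.random.default_rng(0)
d, T, k = 20, 6, 3
C = rng.normal(scale=1 / np.sqrt(d), size=(d, T))
W, Gamma = svd_solution(C, k)
S = np.linalg.svd(C, compute_uv=False)
# Residual equals the sum of the discarded squared singular values (Eckart-Young).
print(np.allclose(np.linalg.norm(W.T @ Gamma - C) ** 2, np.sum(S[k:] ** 2)))
print(np.allclose(W @ W.T, np.eye(k)))   # W has orthonormal rows
```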
We note that if any multitask learning algorithm is used to obtain $W^*, \Gamma^*$, one can run Gram-Schmidt to make $W^*$ orthonormal and hence obtain the factorization we use. It is important that $W$ stays normalized and all scaling is pushed to $\Gamma$, since to measure robustness to weight shifts we are going to add noise to $W$ only, and a larger scaling of $W$ is equivalent to a lower effective noise.
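The Gram-Schmidt step can be realized with a QR factorization of $W^T$, which leaves the product $W^T\Gamma$ unchanged while pushing all scaling into $\Gamma$; a small sketch with our own helper name:

```python
import numpy as np

def push_scale_to_gamma(W, Gamma):
    """Re-factor W^T Gamma so that the new W has orthonormal rows and all
    scaling lives in Gamma (a QR-based stand-in for Gram-Schmidt)."""
    Q, R = np.linalg.qr(W.T)      # W^T = Q R, Q: d x k with orthonormal columns
    return Q.T, R @ Gamma         # (Q^T)^T (R Gamma) = Q R Gamma = W^T Gamma

# Example: an arbitrary factorization with badly scaled W.
rng = np.random.default_rng(1)
k, d, T = 3, 10, 5
W = 7.0 * rng.normal(size=(k, d))          # rows far from unit norm
Gamma = rng.normal(size=(k, T))
W_n, Gamma_n = push_scale_to_gamma(W, Gamma)
print(np.allclose(W_n.T @ Gamma_n, W.T @ Gamma))   # same approximation
print(np.allclose(W_n @ W_n.T, np.eye(k)))          # orthonormal rows
```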
We study how the performance is affected when the representation network $f_\theta$ is corrupted.
Definition 2.4. For any sample $x$, the Mean Squared Error (MSE) for task $i$ is defined to be the expected error between the model prediction under noise and the true value $y$. Namely,
$$\mathrm{MSE}_i = \mathbb{E}_{\theta_c}\left(\gamma_i^T f_{\theta_c}(x) - y\right)^2, \tag{5}$$
where $f_{\theta_c}$ is the model that emerges after corrupting $f_\theta$.

This measures how well the model approximates the ground truth under the presence of noise and under the constraint of a joint representation for multiple tasks.

The simplest corruption process to study is adding noise to the representation matrix, i.e.
$$W_c = W + N, \qquad N_{ij} \sim \mathcal{N}(0, \sigma^2) \ \text{i.i.d.} \tag{6}$$
Then, we denote the mean squared error for task $i$ by $\mathrm{MSE}_{i,\sigma^2}$ and the average mean squared error across the $T$ tasks by $\overline{\mathrm{MSE}}_{T,\sigma^2}$. We are now ready to introduce our results.
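Before stating them, here is a Monte Carlo sketch of Definition 2.4 under the additive noise model (6), assuming inputs $x \sim \mathcal{N}(0, I_d)$ (the paper's exact input scaling may differ, which only changes constants): the estimated average MSE grows linearly in $\sigma^2$, with a slope proportional to $\lVert\Gamma\rVert_F^2 / T$ up to that input-scaling constant.

```python
import numpy as np

def avg_mse_under_noise(C, k, sigma, n_samples=20000, rng=None):
    """Monte Carlo estimate of the average MSE over tasks (Definition 2.4)
    under the additive noise model (6), assuming x ~ N(0, I_d)."""
    rng = rng or np.random.default_rng(0)
    d, T = C.shape
    U, S, Vt = np.linalg.svd(C, full_matrices=False)
    W, Gamma = U[:, :k].T, S[:k, None] * Vt[:k, :]      # SVD solution (4)
    x = rng.normal(size=(n_samples, d))
    N = rng.normal(scale=sigma, size=(n_samples, k, d))  # fresh noise per sample
    Wc = W[None] + N                                     # corrupted representation
    preds = np.einsum('nkd,nd,kt->nt', Wc, x, Gamma)     # gamma_t^T W_c x
    targets = x @ C                                      # y = c_t^T x
    return float(np.mean((preds - targets) ** 2))

rng = np.random.default_rng(0)
d, T, k = 30, 8, 3
C = rng.normal(scale=1 / np.sqrt(d), size=(d, T))
for sigma in [0.0, 0.05, 0.1]:
    print(sigma, avg_mse_under_noise(C, k, sigma, rng=rng))
```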
Theorem 2.5 (Mean Squared Error for Additive Noise). Let $C \in \mathbb{R}^{d \times T}$ be a matrix with distinct singular values $\sigma_1 > \sigma_2 > \ldots > \sigma_T$. Let $W, \Gamma$ be the SVD solution of (2). Under the additive noise model defined in (6), we have that:
$$\underbrace{\overline{\mathrm{MSE}}_{T,\sigma^2}}_{\text{average MSE under noise}} = \underbrace{\overline{\mathrm{MSE}}_{T,0}}_{\text{average MSE without noise}} + \frac{\sum_{i=1}^{k}\sigma_i(C)^2}{T}\cdot\underbrace{\sigma^2}_{\text{noise variance}}. \tag{7}$$

As shown, the noisy MSE decomposes into the sum of the noiseless MSE plus the noise variance times a function that depends on the number of tasks:
$$R(T) = \frac{\sum_{i=1}^{k}\sigma_i(C)^2}{T}. \tag{8}$$
It is important to emphasize that as more tasks are added, the matrix $C$ changes, but the interlacing theorem allows us to connect the singular values of the smaller submatrices, as discussed in the Appendix. $R(T)$ is the robustness slope: if a model with $T$ tasks has a smaller slope, it will eventually outperform a model with, say, $T-1$ tasks and a larger slope, for sufficiently large noise. This is true even if the noiseless performance of the $(T-1)$-task model is better, indicating a cross-over in MSE. Therefore the key is understanding when the sum of the top $k$ squared singular values of $C$ scales sublinearly in $T$. This is not true for tasks that are aligned, but we can show it holds for independent Gaussian task vectors. We believe it holds for more general families of diverse task vectors, and our experiments verify that it also holds for numerous real task vectors learned from text and vision datasets.
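To illustrate the slope argument, the sketch below computes $R(T)$ from Equation (8) for i.i.d. Gaussian task vectors (entries of variance $1/d$, as in Theorem 2.7 below) and for a fully aligned task matrix; the specific dimensions are arbitrary choices of ours.

```python
import numpy as np

def robustness_slope(C, k):
    """R(T) = (sum of the top-k squared singular values of C) / T, Equation (8)."""
    S = np.linalg.svd(C, compute_uv=False)
    return np.sum(S[:k] ** 2) / C.shape[1]

rng = np.random.default_rng(0)
d, k, T_max = 1000, 5, 40
C = rng.normal(scale=1 / np.sqrt(d), size=(d, T_max))   # Gaussian task vectors
slopes = [robustness_slope(C[:, :t], k) for t in range(k + 1, T_max + 1)]
print(np.round(slopes[:5], 4), "...", np.round(slopes[-3:], 4))
# For near-orthogonal Gaussian task vectors the slope decreases in T,
# which is the regime where Theorem 2.5 predicts a robustness cross-over.
# For strongly aligned tasks (repeated columns) it does not decrease:
aligned = np.tile(rng.normal(scale=1 / np.sqrt(d), size=(d, 1)), (1, T_max))
print(robustness_slope(aligned[:, :10], k), robustness_slope(aligned, k))
```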
Connection with $\ell_2$ regularization. For the SVD solution (see Definition 2.3), the sum of the top-$k$ squared singular values is the squared Frobenius norm of $\Gamma$. Indeed, we have that $\lVert\Gamma_{\mathrm{SVD}}\rVert_F^2 = \lVert\Sigma_k V^T\rVert_F^2$. Since $\Sigma_k$ is a diagonal matrix, each row of $\Sigma_k V^T$ is a rescaling of the corresponding row of $V^T$. Rows of $V^T$ have norm $1$, hence the $i$-th row of $\Sigma_k V^T$ has norm $\sigma_i$. The squared Frobenius norm is just the sum of the squared norms of the rows. Hence, we get that
$$\lVert\Gamma_{\mathrm{SVD}}\rVert_F^2 = \sum_{i=1}^{k} \sigma_i(C)^2. \tag{9}$$
Using this simple observation, we can get the following alternative expression of Theorem 2.5.
Corollary 2.6. Let $C \in \mathbb{R}^{d \times T}$ be a matrix with distinct singular values. Let $W, \Gamma$ be the SVD solution of (2). Under the additive noise model defined in (6), we have that:
$$\overline{\mathrm{MSE}}_{T,\sigma^2} = \overline{\mathrm{MSE}}_{T,0} + \frac{\lVert\Gamma\rVert_F^2}{T}\,\sigma^2. \tag{10}$$
Corollary 2.6 provides two important insights: i) the normalization by the number of tasks that appears in (7) is justified, since the Frobenius norm of $\Gamma$ grows with the number of tasks; ii) if we can prove that the slope (defined in Equation (8)) is decreasing, then we are effectively proving that multitasking gives $\ell_2$ regularization, as we showed in the toy introductory example. This indeed holds for the case of Gaussian, i.i.d. task vectors, as shown in the following theorem.
Theorem 2.7. Let $C \in \mathbb{R}^{d \times T}$ be a random matrix with Gaussian, i.i.d. entries of variance $1/d$ and $d = \Omega(T^3)$. Let $C_t, C_{t+1}$ be the matrices formed by selecting the first $t$ and $t+1$ columns of $C$, respectively. Then, there is a noise level $\sigma_{\mathrm{thres}}$ such that with probability $\geq 1 - \exp\left(-\Omega\left(\sqrt{d}\right)\right)$, the SVD solutions (see (4)) of (2) (for $C_t, C_{t+1}$ respectively), under the noise corruption model, satisfy:
$$\overline{\mathrm{MSE}}_{t+1,\sigma^2} < \overline{\mathrm{MSE}}_{t,\sigma^2}, \qquad \forall \sigma \geq \sigma_{\mathrm{thres}}. \tag{11}$$
Remark 2.8. In words, this result shows that adding new tasks gives provably increased robustness to high-noise corruption in the weights, when the task vectors are Gaussian.

Remark 2.9. Observe that the MSE under noise drops for every single new task added. The assumption $d = \Omega(T^3)$ can be relaxed to $d = \Omega(t^3)$, in which case we get increased robustness for the first $t$ added tasks. Nevertheless, for most applications $d = \Omega(T^3)$ is a realistic assumption: even for our smallest dataset, MNIST, $d = 784$, and we experiment with up to 10 tasks.
3 Experimental Evaluation
We divide the experimental section into two parts. In the first part, we add noise to the final linear representation layer of various networks and verify that our theoretical analysis agrees with the experimentally observed multitasking robustness on real datasets (MNIST, CIFAR10, NewsGroup20). In the second part, we show that multitasking leads to robustness to general weight corruptions in any layer of a complex transformer. Specifically, we show that multilingual language models are more robust to weight shifts (across all layers) compared to monolingual models trained under the same setting. This is the first evidence of increased Cognitive Reserve in bilingual artificial neural networks.
Experiments with Linear Representation Layers. We perform experiments on three datasets (MNIST, CIFAR10, Newsgroup20) and two modalities (vision and language). The datasets normally involve one classification task each. We create multiple binary tasks by distinguishing between pairs of labels. For example, in CIFAR10, one task might be to distinguish between dogs and cats and another between airplanes and cars. We assign a value in $[0,1]$ to each sample for each task to transform them into regression tasks (to match our theory). For example, if task $i$ is to distinguish between dogs and cats, value $0$ corresponds to dog and value $1$ to cat.
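A small sketch of this task construction (the class indices follow the standard CIFAR10 ordering; the helper name and the random labels are ours):

```python
import numpy as np

def make_pair_task(labels, neg_class, pos_class):
    """Turn one pair of classes into a regression task with targets in [0, 1]:
    samples of neg_class get target 0, samples of pos_class get target 1."""
    mask = (labels == neg_class) | (labels == pos_class)
    targets = (labels[mask] == pos_class).astype(np.float32)
    return mask, targets

# Example with CIFAR10-style labels: task 1 = dog (5) vs cat (3),
# task 2 = airplane (0) vs automobile (1).
labels = np.random.default_rng(0).integers(0, 10, size=1000)
for neg, pos in [(5, 3), (0, 1)]:
    mask, y = make_pair_task(labels, neg, pos)
    print(f"classes {neg} vs {pos}: {mask.sum()} samples, mean target {y.mean():.2f}")
```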
The second issue is learning the task vectors from training data. For MNIST, we can simply learn a linear layer $C$ with columns $\{c_1, \ldots, c_T\}$ such that $c_i^T x \approx y$ for each task. For more complex datasets like CIFAR or Newsgroup20, linear networks have lower performance and hence it is less interesting to examine their robustness. Instead, we first use another network to extract representations $g_\theta(x)$ and then learn a linear layer acting on the encodings such that $c_i^T g_\theta(x) \approx y$. For CIFAR we used a pre-trained ResNet50 as the encoder, while for NewsGroup we used a pre-trained BERT [12]. We would like to point out that our theory is still valid for this case: it is equivalent to the linear layer $C$ receiving inputs from a learned representation as opposed to the features directly. As the number of tasks increases, we reduce the number of training examples per task. We do this to make sure that the total training dataset size stays the same as the number of tasks increases.
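A sketch of the per-task least-squares fit on frozen encoder features; the synthetic features below stand in for ResNet50/BERT outputs, and all names, sizes, and targets are illustrative rather than the paper's actual training setup.

```python
import numpy as np

def fit_task_vectors(features, masks_per_task, targets_per_task):
    """Least-squares fit of one task vector c_i per task so that
    c_i^T g(x) ~ y on that task's samples. `features` holds the frozen
    encoder outputs g(x) (or raw pixels for MNIST); returns C with columns c_i."""
    d = features.shape[1]
    C = np.zeros((d, len(targets_per_task)))
    for i, (mask, y) in enumerate(zip(masks_per_task, targets_per_task)):
        # Solve min_c ||features[mask] @ c - y||^2 for task i.
        C[:, i], *_ = np.linalg.lstsq(features[mask], y, rcond=None)
    return C

# Toy usage with random "encoder features".
rng = np.random.default_rng(0)
n, d, T = 2000, 64, 6
features = rng.normal(size=(n, d))
masks = [rng.random(n) < 2.0 / T for _ in range(T)]   # fewer samples per task as T grows
targets = [rng.random(m.sum()).astype(np.float32) for m in masks]
C = fit_task_vectors(features, masks, targets)
print(C.shape)   # (d, T): one column per task, ready for the SVD analysis above
```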
Figure 3 shows how the average MSE behaves as noise increases, for different numbers of tasks. Note that even though all models begin from roughly the same performance in the noiseless setting, the multitask models are much more robust to the corruption of their weights, consistently across all datasets and modalities. This is aligned with our theoretical analysis, which predicts that the robustness slope (defined in Equation (8)) decreases with the number of tasks. We calculate robustness slopes for learned task vectors on real datasets and plot their decay in the Appendix, where we further include all the details of how these models were trained.
Experiments with Language Models.
Our objective is to compare robustness to neural weight
perturbations in monolingual and bilingual language models. We use the following perturbation