
Multitasking Models are Robust to Structural Failure:
A Neural Model for Bilingual Cognitive Reserve
Giannis Daras*
The University of Texas at Austin
giannisdaras@utexas.edu
Negin Raoof*
The University of Texas at Austin
neginraoof@gmail.com
Zoi Gkalitsiou
The University of Texas at Austin
zoi.gkalitsiou@austin.utexas.edu
Alexandros G. Dimakis
The University of Texas at Austin
dimakis@austin.utexas.edu
Abstract
We find a surprising connection between multitask learning and robustness to
neuron failures. Our experiments show that bilingual language models retain
higher performance under various neuron perturbations, such as random deletions, magnitude pruning, and weight noise, compared to equivalent monolingual ones. We
provide a theoretical justification of this robustness by mathematically analyzing
linear representation learning and showing that multitasking creates more robust
representations. Our analysis connects robustness to spectral properties of the
learned representation and proves that multitasking leads to higher robustness for
diverse task vectors. We open-source our code and models at the following URL:
https://github.com/giannisdaras/multilingual_robustness.
1 Introduction
Converging evidence from cognitive science research indicates that bilingualism increases brain robustness by reducing the rate of cognitive decline due to aging [1, 2] and delaying the onset of symptoms of dementia [3, 4]. It appears that individuals who speak more than one language on a regular basis are able to maintain typical cognitive functioning despite neural degeneration. This mismatch between cognitive functioning and brain pathology is called Cognitive Reserve [5], and its underlying mechanisms are poorly understood and remain an active topic of investigation.
Inspired by this research, we study whether artificial neural networks are more robust when trained
on multiple languages or multiple tasks. Our experiments demonstrate that training on multiple
tasks indeed increases structural robustness. We train monolingual and bilingual GPT-2 models
with the same architecture and dataset sizes. Initially, monolingual GPT-2 [6] models slightly outperform the bilingual ones, but when we introduce structural noise (by randomly deleting neurons or adding noise to the weights), bilingual models degrade more gracefully and eventually outperform the monolingual models in the high-noise regime. Beyond some amount of noise, bilingual models start outperforming the monolingual ones, demonstrating a cross-over in performance due to their increased robustness. We observe this phenomenon for numerous models across three different types of corruption: additive Gaussian noise on the weights, random weight pruning, and magnitude-based weight pruning [7].
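To make the three corruption types concrete, here is a minimal NumPy sketch of how they can be applied to a single weight matrix; the function names and the tie-handling in magnitude pruning are our own choices, not the paper's released implementation.

```python
import numpy as np

def add_gaussian_noise(w, sigma, rng):
    """Additive Gaussian noise: W_c = W + N, N_ij ~ N(0, sigma^2) i.i.d."""
    return w + rng.normal(0.0, sigma, size=w.shape)

def random_prune(w, p, rng):
    """Set each weight to zero independently with probability p."""
    mask = rng.random(w.shape) >= p
    return w * mask

def magnitude_prune(w, p):
    """Zero out (approximately) the fraction p of weights with smallest magnitude."""
    k = int(p * w.size)
    if k == 0:
        return w.copy()
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

# Example: corrupt a random weight matrix at increasing severity.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
for p in [0.1, 0.3, 0.5]:
    print(p, float(np.mean(random_prune(w, p, rng) == 0)))
```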
Our Contributions: We provide a theoretical justification of this phenomenon by mathematically analyzing linear multitask representation learning [8, 9].
*Equal contribution.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.11618v1 [cs.LG] 20 Oct 2022
Figure 1: Performance of monolingual and bilingual GPT-2 models with the same architecture and training dataset size. We show the performance as we randomly erase weights. The x-axis indicates the probability of erasing an attention weight parameter (setting it to zero). The y-axis indicates the average perplexity over 20 runs with 95% confidence intervals. The bilingual model initially shows slightly worse performance, but as more weights are deleted, the monolingual model declines faster and performs worse in the highly damaged regime; the cross-over point is marked in the plot. This indicates that the bilingual GPT-2 model is more robust to neuron weight erasures. We show similar results for several models and types of errors in our experimental section.
Figure 2: Let $c_1, c_2, c_3$ be the standard basis of $\mathbb{R}^3$. For two tasks, the best one-dimensional approximation to $c_1, c_2$ is $\hat{c}_1 = [1/2, 1/2, 0]^T$, but the best one-dimensional approximation to the three tasks $c_1, c_2, c_3$ is $\hat{c}_1' = [1/3, 1/3, 1/3]^T$. Multitasking creates $\ell_2$ regularization since $\lVert\hat{c}_1'\rVert_2 < \lVert\hat{c}_1\rVert_2$. It is important that the original task vectors $c_1, c_2, c_3$ are orthogonal, i.e. diverse, since this creates the regularization.
Our analysis shows that introducing more diverse tasks creates $\ell_2$ regularization in the linear task heads. Further, we formally connect the Euclidean norm of the learned representations to structural robustness under errors in the network weights. Our main theorem establishes that multitasking leads to higher robustness to additive noise for linear representations when the task vectors are selected as random, independent Gaussian vectors. Our results also establish that when the tasks overlap significantly, multitasking does not lead to higher robustness; hence task diversity is necessary.
We experimentally observe that multitasking increases structural robustness for numerous networks
and multiple problems including MNIST, CIFAR10, Newsgroup20, GPT models, and GPT models fine-tuned on GLUE tasks. We train networks under exactly comparable dataset and architecture
conditions and show that models become more robust to structural failures as they are trained with
more tasks. We experiment with three different types of structural failures and show robustness
increases for all of them. We also experimentally observe that the addition of diverse tasks seems to
regularize the model weights, as we predict in our theoretical analysis.
2 Theoretical Analysis
Building intuition. We start with a small numerical example to build intuition. Given a feature vector $x \in \mathbb{R}^d$, we compute a $k$-dimensional linear representation $Wx$ using a matrix $W \in \mathbb{R}^{k \times d}$. We choose $W$ such that we best approximate a set of ground-truth task vectors $\{c_1, c_2, \ldots, c_T\}$ that lie in $\mathbb{R}^d$. The learned approximation is $\hat{c}_i = W^T \gamma_i$. Essentially, we use linear combinations of the columns of $W^T$ to approximate the task vectors. For simplicity, we assume that the columns of $W^T$ are unit norm. We study the case where $k < T$, since otherwise there are infinitely many solutions.
Assume we work in $d = 3$ dimensions with $T = 3$ total tasks, $c_1 = [1,0,0]^T$, $c_2 = [0,1,0]^T$, $c_3 = [0,0,1]^T$. Set our learned representation dimension to be $k = 1$. When $T = 2$, using only the first two tasks $c_1, c_2$, an optimal solution is $W = \frac{1}{\sqrt{2}}[1,1,0]$. The corresponding linear heads are the scalars $\gamma_1 = \gamma_2 = \frac{1}{\sqrt{2}}$, and the approximate vectors are $\hat{c}_1 = W^T\gamma_1 = [0.5, 0.5, 0]^T = \hat{c}_2$. Therefore the best one-dimensional subspace to jointly approximate $c_1, c_2$ is the span of $W = \frac{1}{\sqrt{2}}[1,1,0]$. Now we introduce one more task and find the one-dimensional subspace that best approximates $c_1, c_2, c_3$. That becomes $W' = \frac{1}{\sqrt{3}}[1,1,1]$ with linear heads $\gamma_1' = \gamma_2' = \gamma_3' = \frac{1}{\sqrt{3}}$. The approximate vectors now are $\hat{c}_1' = (W')^T\gamma_1' = [1/3, 1/3, 1/3]^T = \hat{c}_2' = \hat{c}_3'$. Notice that $\lVert\hat{c}_i'\rVert_2^2 = 1/3$ for three tasks but $\lVert\hat{c}_i\rVert_2^2 = 1/2$ for two tasks. The point is that with more tasks, the vector that jointly approximates all task vectors becomes shorter. Equivalently, the $\ell_2$ norm of the linear task heads decreases from $\gamma_i = \frac{1}{\sqrt{2}}$ to $\gamma_i' = \frac{1}{\sqrt{3}}$ as the number of tasks increases from two to three, showing how multitasking creates regularization. A graphical representation of this example is given in Figure 2. It is important that the task vectors $c_i$ are orthogonal, increasing the effective dimensionality of the problem. The intuition is that diverse tasks increase the effective dimension, making the best approximation vector shorter.
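As a quick numerical check of the example above, the following NumPy sketch plugs in the two candidate representations, recovers the stated heads and approximation norms, and confirms that each attains the optimal rank-1 error (the optimum is not unique here because the singular values are tied); the helper name is ours.

```python
import numpy as np

# Standard-basis task vectors c_1, c_2, c_3 in R^3 (three orthogonal, diverse tasks).
C3 = np.eye(3)

def heads_and_approx(w, C):
    """For a fixed unit-norm representation row w, the optimal linear heads are
    gamma_i = w . c_i and the approximations are c_hat_i = gamma_i * w."""
    gamma = w @ C                 # shape (T,)
    c_hat = np.outer(w, gamma)    # d x T, column i approximates c_i
    return gamma, c_hat

for T, w in [(2, np.array([1, 1, 0]) / np.sqrt(2)),
             (3, np.array([1, 1, 1]) / np.sqrt(3))]:
    C = C3[:, :T]
    gamma, c_hat = heads_and_approx(w, C)
    err = np.sum((c_hat - C) ** 2)
    # Best achievable rank-1 error = sum of the discarded squared singular values.
    best = np.sum(np.linalg.svd(C, compute_uv=False)[1:] ** 2)
    print(f"T={T}: heads={gamma.round(3)}, ||c_hat||^2={np.sum(c_hat[:, 0]**2):.3f}, "
          f"error={err:.3f} (optimal rank-1 error={best:.3f})")
# The heads shrink from 1/sqrt(2) ~ 0.707 to 1/sqrt(3) ~ 0.577 and
# ||c_hat||^2 drops from 1/2 to 1/3 as a third orthogonal task is added.
```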
Our main theoretical result is that this phenomenon is quite general and makes multitasking lead
to structural robustness. We connect the norm of the approximated task vectors with robustness to
weight perturbations and show that for Gaussian, independent task vectors the average norm shrinks
as more tasks are added. This is intuitive since high dimensional Gaussian vectors are near-orthogonal.
Surprisingly, we empirically show that task vectors for numerous problems also exhibit this behavior.
Analysis. We consider a neural network $f_\theta : \mathbb{R}^d \to \mathbb{R}^k$ and a collection of tasks $\{\mathcal{T}_1, \ldots, \mathcal{T}_T\}$. We are trying to learn $\theta$ and $\gamma_i \in \mathbb{R}^k$ to solve the following optimization problem:
$$\operatorname*{argmin}_{\theta, \{\gamma_1, \ldots, \gamma_T\}} \sum_{i=1}^{T} \mathbb{E}_{(x,y) \in \mathcal{T}_i} L\left(\gamma_i^T f_\theta(x), y\right). \tag{1}$$
The neural network $f_\theta$ can be as simple as a single matrix $W : \mathbb{R}^d \to \mathbb{R}^k$. For linear networks, we consider the following dataset generation process: for task $\mathcal{T}_i$, we sample a Gaussian $x$ and generate its label $y$ by taking the inner product with a task vector $c_i$, i.e. $y = c_i^T x$. Given infinite samples and MSE loss, the optimization problem (1) is equivalent to the following problem.
Definition 2.1 (Optimization Problem). Let $k < T < d$. We define the Factorized Best Rank-$k$ Approximation of a matrix $C \in \mathbb{R}^{d \times T}$ as the optimization problem:
$$W^*, \Gamma^* = \operatorname*{argmin}_{W \in \mathbb{R}^{k \times d},\, \Gamma \in \mathbb{R}^{k \times T}} \left\lVert W^T\Gamma - C \right\rVert_F^2. \tag{2}$$
We are interested in the case where the dimensionality of the representation, $k$, is smaller than the number of tasks $T$; otherwise the best rank-$k$ approximation of $C$ is not unique.
The following proposition states that, in the considered setting, the problem of Definition 2.1 can be solved with the SVD.

Proposition 2.2. For any matrix $C \in \mathbb{R}^{d \times T}$ with distinct singular values, any solution of Definition 2.1 satisfies:
$$W^{*T}\Gamma^* = U\Sigma_k V^T, \tag{3}$$
where $U\Sigma V^T$ is the SVD of $C$ and $\Sigma_k$ is the same as $\Sigma$ except that the last $T-k$ diagonal entries are zeroed out.

The fact that the singular value decomposition computes the best rank-$k$ approximation to a matrix can be found in several textbooks, e.g. Golub and Van Loan [10] and Blum et al. [11].

This proposition establishes that $W^* = U^T$ and $\Gamma^* = \Sigma_k V^T$ is a valid solution of (2). Onwards, we will call this the SVD solution.
Definition 2.3. We define the SVD solution of (2) to be:
$$W_{\mathrm{SVD}} = U^T, \qquad \Gamma_{\mathrm{SVD}} = \Sigma_k V^T. \tag{4}$$
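A minimal NumPy sketch of the SVD solution in (4); here $W$ keeps only the top-$k$ left singular vectors, which is the part that matters for the rank-$k$ product, and the sanity checks use the Eckart–Young property. The dimensions below are arbitrary illustrative choices.

```python
import numpy as np

def svd_solution(C, k):
    """SVD solution of (2): W_SVD = U_k^T, Gamma_SVD = Sigma_k V_k^T,
    so that W_SVD^T Gamma_SVD is the best rank-k approximation of C."""
    U, S, Vt = np.linalg.svd(C, full_matrices=False)
    W = U[:, :k].T                    # k x d, orthonormal rows
    Gamma = S[:k, None] * Vt[:k, :]   # k x T, row i scaled by sigma_i
    return W, Gamma

rng = np.random.default_rng(0)
d, T, k = 20, 6, 3
C = rng.normal(scale=1 / np.sqrt(d), size=(d, T))
W, Gamma = svd_solution(C, k)
S = np.linalg.svd(C, compute_uv=False)
# Residual equals the sum of the discarded squared singular values (Eckart-Young).
print(np.allclose(np.linalg.norm(W.T @ Gamma - C) ** 2, np.sum(S[k:] ** 2)))
print(np.allclose(W @ W.T, np.eye(k)))   # W has orthonormal rows
```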
We note that if any multitask learning algorithm is used to obtain $W^*, \Gamma^*$, one can run Gram-Schmidt to make $W^*$ orthonormal and hence obtain the factorization we use. It is important that $W$ stays normalized and all scaling is pushed to $\Gamma$, since to measure robustness to weight shifts we are going to add noise to $W$ only, and a larger scaling of $W$ is equivalent to a lower effective noise.
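The Gram-Schmidt step can be realized with a QR factorization of $W^T$, which leaves the product $W^T\Gamma$ unchanged while pushing all scaling into $\Gamma$; a small sketch with our own helper name:

```python
import numpy as np

def push_scale_to_gamma(W, Gamma):
    """Re-factor W^T Gamma so that the new W has orthonormal rows and all
    scaling lives in Gamma (a QR-based stand-in for Gram-Schmidt)."""
    Q, R = np.linalg.qr(W.T)      # W^T = Q R, Q: d x k with orthonormal columns
    return Q.T, R @ Gamma         # (Q^T)^T (R Gamma) = Q R Gamma = W^T Gamma

# Example: an arbitrary factorization with badly scaled W.
rng = np.random.default_rng(1)
k, d, T = 3, 10, 5
W = 7.0 * rng.normal(size=(k, d))          # rows far from unit norm
Gamma = rng.normal(size=(k, T))
W_n, Gamma_n = push_scale_to_gamma(W, Gamma)
print(np.allclose(W_n.T @ Gamma_n, W.T @ Gamma))   # same approximation
print(np.allclose(W_n @ W_n.T, np.eye(k)))          # orthonormal rows
```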
We study how the performance is affected when the representation network $f_\theta$ is corrupted.
Definition 2.4. For any sample $x$, the Mean Squared Error (MSE) for task $i$ is defined to be the expected error between the model prediction under noise and the true value $y$. Namely,
$$\mathrm{MSE}_i = \mathbb{E}_{\theta_c}\left(\gamma_i^T f_{\theta_c}(x) - y\right)^2, \tag{5}$$
where $f_{\theta_c}$ is the model that emerges after corrupting $f_\theta$.

This measures how well the model approximates the ground truth under the presence of noise and under the constraint of a joint representation for multiple tasks.

The simplest corruption process to study is adding noise to the representation matrix, i.e.
$$W_c = W + N, \qquad N_{ij} \sim \mathcal{N}(0, \sigma^2) \ \text{i.i.d.} \tag{6}$$
Then, we denote the mean squared error for task $i$ by $\mathrm{MSE}_{i,\sigma^2}$ and the average mean squared error across the $T$ tasks by $\overline{\mathrm{MSE}}_{T,\sigma^2}$. We are now ready to introduce our results.
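Before stating them, here is a Monte Carlo sketch of Definition 2.4 under the additive noise model (6), assuming inputs $x \sim \mathcal{N}(0, I_d)$ (the paper's exact input scaling may differ, which only changes constants): the estimated average MSE grows linearly in $\sigma^2$, with a slope proportional to $\lVert\Gamma\rVert_F^2 / T$ up to that input-scaling constant.

```python
import numpy as np

def avg_mse_under_noise(C, k, sigma, n_samples=20000, rng=None):
    """Monte Carlo estimate of the average MSE over tasks (Definition 2.4)
    under the additive noise model (6), assuming x ~ N(0, I_d)."""
    rng = rng or np.random.default_rng(0)
    d, T = C.shape
    U, S, Vt = np.linalg.svd(C, full_matrices=False)
    W, Gamma = U[:, :k].T, S[:k, None] * Vt[:k, :]      # SVD solution (4)
    x = rng.normal(size=(n_samples, d))
    N = rng.normal(scale=sigma, size=(n_samples, k, d))  # fresh noise per sample
    Wc = W[None] + N                                     # corrupted representation
    preds = np.einsum('nkd,nd,kt->nt', Wc, x, Gamma)     # gamma_t^T W_c x
    targets = x @ C                                      # y = c_t^T x
    return float(np.mean((preds - targets) ** 2))

rng = np.random.default_rng(0)
d, T, k = 30, 8, 3
C = rng.normal(scale=1 / np.sqrt(d), size=(d, T))
for sigma in [0.0, 0.05, 0.1]:
    print(sigma, avg_mse_under_noise(C, k, sigma, rng=rng))
```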
Theorem 2.5 (Mean Squared Error for Additive Noise). Let $C \in \mathbb{R}^{d \times T}$ be a matrix with distinct singular values $\sigma_1 > \sigma_2 > \ldots > \sigma_T$. Let $W, \Gamma$ be the SVD solution of (2). Under the additive noise model defined in (6), we have that:
$$\underbrace{\overline{\mathrm{MSE}}_{T,\sigma^2}}_{\text{average MSE under noise}} = \underbrace{\overline{\mathrm{MSE}}_{T,0}}_{\text{average MSE without noise}} + \frac{\sum_{i=1}^{k}\sigma_i(C)^2}{T}\cdot\underbrace{\sigma^2}_{\text{noise variance}}. \tag{7}$$

As shown, the noisy MSE decomposes into the sum of the noiseless MSE plus the noise variance times a function that depends on the number of tasks:
$$R(T) = \frac{\sum_{i=1}^{k}\sigma_i(C)^2}{T}. \tag{8}$$
It is important to emphasize that as more tasks are added, the matrix $C$ changes, but the interlacing theorem allows us to connect the singular values of the smaller submatrices, as discussed in the Appendix. $R(T)$ is the robustness slope: if a model with $T$ tasks has a smaller slope, it will eventually outperform a model with, say, $T-1$ tasks and a larger slope, for sufficiently large noise. This is true even if the noiseless performance of the $(T-1)$-task model is better, indicating a cross-over in MSE. Therefore the key is understanding when the sum of the top $k$ squared singular values of $C$ scales sublinearly in $T$. This is not true for tasks that are aligned, but we can show it holds for independent Gaussian task vectors. We believe it holds for more general families of diverse task vectors, and our experiments verify that it also holds for numerous real task vectors learned from text and vision datasets.
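To illustrate the slope argument, the sketch below computes $R(T)$ from Equation (8) for i.i.d. Gaussian task vectors (entries of variance $1/d$, as in Theorem 2.7 below) and for a fully aligned task matrix; the specific dimensions are arbitrary choices of ours.

```python
import numpy as np

def robustness_slope(C, k):
    """R(T) = (sum of the top-k squared singular values of C) / T, Equation (8)."""
    S = np.linalg.svd(C, compute_uv=False)
    return np.sum(S[:k] ** 2) / C.shape[1]

rng = np.random.default_rng(0)
d, k, T_max = 1000, 5, 40
C = rng.normal(scale=1 / np.sqrt(d), size=(d, T_max))   # Gaussian task vectors
slopes = [robustness_slope(C[:, :t], k) for t in range(k + 1, T_max + 1)]
print(np.round(slopes[:5], 4), "...", np.round(slopes[-3:], 4))
# For near-orthogonal Gaussian task vectors the slope decreases in T,
# which is the regime where Theorem 2.5 predicts a robustness cross-over.
# For strongly aligned tasks (repeated columns) it does not decrease:
aligned = np.tile(rng.normal(scale=1 / np.sqrt(d), size=(d, 1)), (1, T_max))
print(robustness_slope(aligned[:, :10], k), robustness_slope(aligned, k))
```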
Connection with $\ell_2$ regularization. For the SVD solution (see Definition 2.3), the sum of the top-$k$ squared singular values is the squared Frobenius norm of $\Gamma$. Indeed, we have that $\lVert\Gamma_{\mathrm{SVD}}\rVert_F^2 = \lVert\Sigma_k V^T\rVert_F^2$. Since $\Sigma_k$ is a diagonal matrix, each row of $\Sigma_k V^T$ is a rescaling of the corresponding row of $V^T$. Rows of $V^T$ have norm $1$, hence the $i$-th row of $\Sigma_k V^T$ has norm $\sigma_i$. The squared Frobenius norm is just the sum of the squared norms of the rows. Hence, we get that
$$\lVert\Gamma_{\mathrm{SVD}}\rVert_F^2 = \sum_{i=1}^{k} \sigma_i(C)^2. \tag{9}$$
Using this simple observation, we can get the following alternative expression of Theorem 2.5.
Corollary 2.6. Let $C \in \mathbb{R}^{d \times T}$ be a matrix with distinct singular values. Let $W, \Gamma$ be the SVD solution of (2). Under the additive noise model defined in (6), we have that:
$$\overline{\mathrm{MSE}}_{T,\sigma^2} = \overline{\mathrm{MSE}}_{T,0} + \frac{\lVert\Gamma\rVert_F^2}{T}\,\sigma^2. \tag{10}$$
Corollary 2.6 provides two important insights: i) the normalization by the number of tasks that appears in (7) is justified, since the Frobenius norm of $\Gamma$ grows with the number of tasks; ii) if we can prove that the slope (defined in Equation (8)) is decreasing, then we are effectively proving that multitasking gives $\ell_2$ regularization, as we showed in the toy introductory example. This indeed holds for the case of Gaussian, i.i.d. task vectors, as shown in the following theorem.
Theorem 2.7. Let $C \in \mathbb{R}^{d \times T}$ be a random matrix with Gaussian, i.i.d. entries of variance $1/d$ and $d = \Omega(T^3)$. Let $C_t, C_{t+1}$ be the matrices formed by selecting the first $t$ and $t+1$ columns of $C$, respectively. Then, there is a noise level $\sigma_{\mathrm{thres}}$ such that with probability $\geq 1 - \exp\left(-\Omega\left(\sqrt{d}\right)\right)$, the SVD solutions (see (4)) of (2) (for $C_t, C_{t+1}$ respectively), under the noise corruption model, satisfy:
$$\overline{\mathrm{MSE}}_{t+1,\sigma^2} < \overline{\mathrm{MSE}}_{t,\sigma^2}, \qquad \forall \sigma \geq \sigma_{\mathrm{thres}}. \tag{11}$$
Remark 2.8. In words, this result shows that adding new tasks gives provably increased robustness to high-noise corruption in the weights, when the task vectors are Gaussian.

Remark 2.9. Observe that the MSE under noise drops for every single new task added. The assumption $d = \Omega(T^3)$ can be relaxed to $d = \Omega(t^3)$, in which case we get increased robustness for the first $t$ added tasks. Nevertheless, for most applications $d = \Omega(T^3)$ is a realistic assumption: even for our smallest dataset, MNIST, $d = 784$, and we experiment with up to 10 tasks.
3 Experimental Evaluation
We divide the experimental section into two parts. In the first part, we add noise to the final linear representation layer of various networks and verify that our theoretical analysis agrees with the experimentally observed multitasking robustness on real datasets (MNIST, CIFAR10, NewsGroup20). In the second part, we show that multitasking leads to robustness to general weight corruptions in any layer of a complex transformer. Specifically, we show that multilingual language models are more robust to weight shifts (across all layers) compared to monolingual models trained under the same setting. This is the first evidence of increased Cognitive Reserve in bilingual artificial neural networks.
Experiments with Linear Representation Layers. We perform experiments on three datasets (MNIST, CIFAR10, Newsgroup20) and two modalities (vision and language). The datasets normally involve one classification task each. We create multiple binary tasks by distinguishing between pairs of labels. For example, in CIFAR10, one task might be to distinguish between dogs and cats and another between airplanes and cars. We assign a value in $[0,1]$ to each sample for each task to transform them into regression tasks (to match our theory). For example, if task $i$ is to distinguish between dogs and cats, value $0$ corresponds to dog and value $1$ to cat.
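A small sketch of this task construction (the class indices follow the standard CIFAR10 ordering; the helper name and the random labels are ours):

```python
import numpy as np

def make_pair_task(labels, neg_class, pos_class):
    """Turn one pair of classes into a regression task with targets in [0, 1]:
    samples of neg_class get target 0, samples of pos_class get target 1."""
    mask = (labels == neg_class) | (labels == pos_class)
    targets = (labels[mask] == pos_class).astype(np.float32)
    return mask, targets

# Example with CIFAR10-style labels: task 1 = dog (5) vs cat (3),
# task 2 = airplane (0) vs automobile (1).
labels = np.random.default_rng(0).integers(0, 10, size=1000)
for neg, pos in [(5, 3), (0, 1)]:
    mask, y = make_pair_task(labels, neg, pos)
    print(f"classes {neg} vs {pos}: {mask.sum()} samples, mean target {y.mean():.2f}")
```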
The second issue is learning the task vectors from training data. For MNIST, we can simply learn a linear layer $C$ with columns $\{c_1, \ldots, c_T\}$ such that $c_i^T x \approx y$ for each task. For more complex datasets like CIFAR or Newsgroup20, linear networks have lower performance and hence it is less interesting to examine their robustness. Instead, we first use another network to extract representations $g_\theta(x)$ and then learn a linear layer acting on the encodings such that $c_i^T g_\theta(x) \approx y$. For CIFAR we used a pre-trained ResNet50 as the encoder, while for NewsGroup we used a pre-trained BERT [12]. We would like to point out that our theory is still valid for this case: it is equivalent to the linear layer $C$ receiving inputs from a learned representation as opposed to the features directly. As the number of tasks increases, we reduce the number of training examples per task. We do this to make sure that the total training dataset size stays the same as the number of tasks increases.
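A sketch of the per-task least-squares fit on frozen encoder features; the synthetic features below stand in for ResNet50/BERT outputs, and all names, sizes, and targets are illustrative rather than the paper's actual training setup.

```python
import numpy as np

def fit_task_vectors(features, masks_per_task, targets_per_task):
    """Least-squares fit of one task vector c_i per task so that
    c_i^T g(x) ~ y on that task's samples. `features` holds the frozen
    encoder outputs g(x) (or raw pixels for MNIST); returns C with columns c_i."""
    d = features.shape[1]
    C = np.zeros((d, len(targets_per_task)))
    for i, (mask, y) in enumerate(zip(masks_per_task, targets_per_task)):
        # Solve min_c ||features[mask] @ c - y||^2 for task i.
        C[:, i], *_ = np.linalg.lstsq(features[mask], y, rcond=None)
    return C

# Toy usage with random "encoder features".
rng = np.random.default_rng(0)
n, d, T = 2000, 64, 6
features = rng.normal(size=(n, d))
masks = [rng.random(n) < 2.0 / T for _ in range(T)]   # fewer samples per task as T grows
targets = [rng.random(m.sum()).astype(np.float32) for m in masks]
C = fit_task_vectors(features, masks, targets)
print(C.shape)   # (d, T): one column per task, ready for the SVD analysis above
```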
Figure 3 shows how the average MSE behaves as noise increases, for different numbers of tasks. Note that even though all models begin from roughly the same performance in the noiseless setting, the multitask models are much more robust to the corruption of their weights, consistently across all datasets and modalities. This is aligned with our theoretical analysis, which predicts that the robustness slope (defined in Equation (8)) decreases with the number of tasks. We calculate robustness slopes for learned task vectors on real datasets and plot their decay in the Appendix, where we further include all the details of how these models were trained.
Experiments with Language Models.
Our objective is to compare robustness to neural weight
perturbations in monolingual and bilingual language models. We use the following perturbation