by M’hamdi et al. (2022); Ozler et al. (2020) of understanding the effect of incrementally fine-tuning models with multilingual data. They suggest that joint fine-tuning is the best way to mitigate the tendency of cross-lingual language models to erase previously acquired knowledge. In other words, their results indicate that joint fine-tuning should be used instead of incremental fine-tuning whenever possible.
Optimization-focused strategies such as Mirzadeh et al. (2020); Kirkpatrick et al. (2017) concentrate on the training regime and show that techniques such as dropout, large learning rates with decay, and shrinking the batch size yield training regimes that produce more stable models.
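To make this concrete, the following is a minimal PyTorch sketch of such a stability-oriented regime; the specific values (initial learning rate, decay factor, batch size) are illustrative placeholders rather than the settings used in the cited works.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

# Illustrative hyperparameters only; not the values from the cited works.
INITIAL_LR = 1e-3    # relatively large initial learning rate
DECAY_GAMMA = 0.9    # multiplicative learning-rate decay applied each epoch
BATCH_SIZE = 16      # relatively small batch size, per the batch-shrinking idea above

def stability_oriented_regime(model: nn.Module, train_set):
    """Build loader/optimizer/scheduler for a forgetting-aware training regime.

    Dropout is assumed to already be part of `model` (e.g. nn.Dropout(p=0.1)
    inside its layers), so it is not configured here.
    """
    loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=INITIAL_LR)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=DECAY_GAMMA)
    return loader, optimizer, scheduler
```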
Translation augmentation has also been shown to be an effective technique for improving performance. Wang et al. (2018); Fadaee et al. (2017); Liu et al. (2021a) and Xia et al. (2019) apply various translation augmentation strategies and report substantial performance gains. Encouraged by these gains, we adopt translation as our data augmentation strategy.
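As a rough illustration of what such augmentation can look like in practice, the sketch below mixes machine-translated copies into a set of labeled examples; `translate` is a hypothetical machine-translation interface and the mixing ratio is arbitrary, not the exact procedure of the works cited above.

```python
import random

def augment_with_translations(examples, translate, target_langs, ratio=0.5):
    """Add machine-translated copies of a subset of `examples`.

    `examples` is a list of dicts with "text" and "label" keys;
    `translate(text, lang)` is a hypothetical machine-translation interface.
    """
    augmented = list(examples)
    for ex in random.sample(examples, k=int(ratio * len(examples))):
        lang = random.choice(target_langs)
        augmented.append({"text": translate(ex["text"], lang), "label": ex["label"]})
    random.shuffle(augmented)
    return augmented
```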
In our analysis, we consider an additional constraint that affects our choice of data augmentation strategies: data that has already been used for training cannot be accessed again at a future time step. Privacy is an important consideration for continuously deployed models in corporate applications and similar scenarios, and privacy protocols often restrict access to each tranche of additional fine-tuning data to the current training time step. Under such constraints, joint fine-tuning or maintaining a cache as in Chaudhry et al. (2019a); Lopez-Paz and Ranzato (2017) is infeasible. We therefore use translation augmentation to improve cross-lingual generalization over a large number of fine-tuning steps without storing previous data.
In this paper we present a novel translation-augmented sequential fine-tuning approach that mixes translated data into each step of sequential fine-tuning and makes use of a special training regime. Our approach minimizes the effects of catastrophic forgetting and of interference between languages. The results show that, for incremental learning over dozens of training steps, the baseline approaches suffer catastrophic forgetting: it may take multiple steps to reach this point, but performance eventually collapses.
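For intuition, a simplified sketch of this per-step loop is shown below; `augment_with_translations` is the illustrative helper above, `train_one_step` stands in for one pass of a training regime with decayed learning rate and small batches, and neither reflects the exact implementation details of our system.

```python
def sequential_finetune(model, tranches, translate, target_langs, train_one_step):
    """Translation-augmented sequential fine-tuning without a replay cache.

    `tranches` yields one collection of new labeled examples per time step;
    once a step finishes, that data is never revisited or stored.
    """
    for step, tranche in enumerate(tranches):
        mixed = augment_with_translations(tranche, translate, target_langs)
        model = train_one_step(model, mixed, step)
        # `tranche` and `mixed` are dropped here; no replay buffer is kept.
    return model
```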
The main contribution of our work is combining data augmentation with adjustments to the training regime and evaluating this approach over a sequence of 50 incremental fine-tuning steps. The training regime ensures that incremental fine-tuning with translation augmentation remains robust without access to previous data. We show that our model performs well, surpassing the baseline across multiple evaluation metrics. To the best of our knowledge, this is the first work to provide a multi-stage cross-lingual analysis of incremental learning over a large number of fine-tuning steps with recurrence of languages.
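As an example of the kind of step-wise evaluation this entails, the sketch below computes a standard continual-learning forgetting score from per-language accuracies recorded after every fine-tuning step; it is illustrative and does not necessarily match the exact metric definitions used in our experiments.

```python
def average_forgetting(acc_history):
    """acc_history[t][lang]: accuracy on `lang` measured after fine-tuning step t.

    Returns the mean, over languages, of the gap between the best accuracy
    ever reached on a language and its accuracy after the final step.
    Assumes every language is evaluated at every step.
    """
    final = acc_history[-1]
    gaps = [max(step[lang] for step in acc_history) - final[lang] for lang in final]
    return sum(gaps) / len(gaps)
```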
2 Related Work
Our work fits into the area of incremental learning in cross-lingual settings. M’hamdi et al. (2022)
is the closest work to our research. The authors
compare several cross-lingual incremental learn-
ing methods and provide evaluation measures for
model quality after each sequential fine-tuning step.
They show that combining the data from all lan-
guages and fine-tuning the model jointly is more
beneficial than sequential fine-tuning on each lan-
guage individually. We use some of their evaluation
protocols but we have different constraints: we do
not keep the data from previous sequential fine-
tuning steps and we do not control the sequence
of languages. In addition, they consider only six hops of incremental fine-tuning, whereas we are interested in dozens of steps. Ozler et al. (2020)
do not perform a cross-lingual analysis, but study
a scenario closely related to our work. Their find-
ings fall in line with those of M’hamdi et al. (2022)
as they show that combining data from different
domains into one training set for fine-tuning per-
forms better than fine-tuning each domain sepa-
rately. However, this type of joint fine-tuning is ruled out in our scenario, where access to previous training data is unavailable, so we focus exclusively on sequential fine-tuning.
Mirzadeh et al. (2020) study the impact of various training regimes on forgetting mitigation. Their study focuses on learning rates, batch size, and regularization methods. This work, like ours, shows that applying learning rate decay plays a significant role in reducing catastrophic forgetting. However, it is important to point out that our type of decay differs from theirs. Mirzadeh et al. (2020) start with a high initial learning rate for the first task to obtain a wide and stable minimum. Then, for each