by M’hamdi et al. (2022); Ozler et al. (2020) of understanding the effect of incrementally fine-tuning models with multilingual data. They suggest that joint fine-tuning is the best way to mitigate the tendency of cross-lingual language models to erase previously acquired knowledge. In other words, their results indicate that joint fine-tuning should be used instead of incremental fine-tuning whenever possible.
Optimization-focused strategies such as Mirzadeh et al. (2020); Kirkpatrick et al. (2017) concentrate on the training regime and show that techniques such as dropout, large learning rates with decay, and shrinking the batch size yield training regimes that produce more stable models.
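To make this concrete, the following is a minimal PyTorch sketch of such a stability-oriented regime; the specific values (initial learning rate, decay factor, batch size) are illustrative placeholders rather than the settings used in the cited works.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

# Illustrative hyperparameters only; not the values from the cited works.
INITIAL_LR = 1e-3    # relatively large initial learning rate
DECAY_GAMMA = 0.9    # multiplicative learning-rate decay applied each epoch
BATCH_SIZE = 16      # relatively small batch size, per the batch-shrinking idea above

def stability_oriented_regime(model: nn.Module, train_set):
    """Build loader/optimizer/scheduler for a forgetting-aware training regime.

    Dropout is assumed to already be part of `model` (e.g. nn.Dropout(p=0.1)
    inside its layers), so it is not configured here.
    """
    loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=INITIAL_LR)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=DECAY_GAMMA)
    return loader, optimizer, scheduler
```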
Translation augmentation has also been shown to be an effective technique for improving performance. Wang et al. (2018); Fadaee et al. (2017); Liu et al. (2021a) and Xia et al. (2019) apply various translation augmentation strategies and report substantial performance gains. Encouraged by these gains, we adopt translation as our data augmentation strategy.
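As a rough illustration of what such augmentation can look like in practice, the sketch below mixes machine-translated copies into a set of labeled examples; `translate` is a hypothetical machine-translation interface and the mixing ratio is arbitrary, not the exact procedure of the works cited above.

```python
import random

def augment_with_translations(examples, translate, target_langs, ratio=0.5):
    """Add machine-translated copies of a subset of `examples`.

    `examples` is a list of dicts with "text" and "label" keys;
    `translate(text, lang)` is a hypothetical machine-translation interface.
    """
    augmented = list(examples)
    for ex in random.sample(examples, k=int(ratio * len(examples))):
        lang = random.choice(target_langs)
        augmented.append({"text": translate(ex["text"], lang), "label": ex["label"]})
    random.shuffle(augmented)
    return augmented
```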
In our analysis, we consider an additional constraint that affects our choice of data augmentation strategies: data that has already been used for training cannot be accessed again at a future time step. Privacy is an important consideration for continuously deployed models in corporate applications and similar scenarios, and privacy protocols often restrict access to each tranche of additional fine-tuning data to the current training time step. Under such constraints, joint fine-tuning or maintaining a cache as in Chaudhry et al. (2019a); Lopez-Paz and Ranzato (2017) is infeasible. We therefore use translation augmentation to improve cross-lingual generalization over a large number of fine-tuning steps without storing previous data.
In this paper we present a novel translation-augmented sequential fine-tuning approach that mixes translated data into each step of sequential fine-tuning and makes use of a special training regime. Our approach minimizes the effects of catastrophic forgetting and of interference between languages. The results show that, for incremental learning over dozens of training steps, the baseline approaches suffer catastrophic forgetting: it may take multiple steps to reach this point, but performance eventually collapses.
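For intuition, a simplified sketch of this per-step loop is shown below; `augment_with_translations` is the illustrative helper above, `train_one_step` stands in for one pass of a training regime with decayed learning rate and small batches, and neither reflects the exact implementation details of our system.

```python
def sequential_finetune(model, tranches, translate, target_langs, train_one_step):
    """Translation-augmented sequential fine-tuning without a replay cache.

    `tranches` yields one collection of new labeled examples per time step;
    once a step finishes, that data is never revisited or stored.
    """
    for step, tranche in enumerate(tranches):
        mixed = augment_with_translations(tranche, translate, target_langs)
        model = train_one_step(model, mixed, step)
        # `tranche` and `mixed` are dropped here; no replay buffer is kept.
    return model
```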
The main contribution of our work is combining data augmentation with adjustments to the training regime and evaluating this approach over a sequence of 50 incremental fine-tuning steps. The training regime ensures that incremental fine-tuning with translation augmentation remains robust without access to previous data. We show that our model performs well, surpassing the baseline across multiple evaluation metrics. To the best of our knowledge, this is the first work to provide a multi-stage cross-lingual analysis of incremental learning over a large number of fine-tuning steps with recurrence of languages.
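As an example of the kind of step-wise evaluation this entails, the sketch below computes a standard continual-learning forgetting score from per-language accuracies recorded after every fine-tuning step; it is illustrative and does not necessarily match the exact metric definitions used in our experiments.

```python
def average_forgetting(acc_history):
    """acc_history[t][lang]: accuracy on `lang` measured after fine-tuning step t.

    Returns the mean, over languages, of the gap between the best accuracy
    ever reached on a language and its accuracy after the final step.
    Assumes every language is evaluated at every step.
    """
    final = acc_history[-1]
    gaps = [max(step[lang] for step in acc_history) - final[lang] for lang in final]
    return sum(gaps) / len(gaps)
```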
2 Related Work
Our work fits into the area of incremental learning in cross-lingual settings. M’hamdi et al. (2022)
is the closest work to our research. The authors
compare several cross-lingual incremental learn-
ing methods and provide evaluation measures for
model quality after each sequential fine-tuning step.
They show that combining the data from all lan-
guages and fine-tuning the model jointly is more
beneficial than sequential fine-tuning on each lan-
guage individually. We use some of their evaluation
protocols but we have different constraints: we do
not keep the data from previous sequential fine-
tuning steps and we do not control the sequence
of languages. In addition, they consider only six hops of incremental fine-tuning, whereas we are interested in dozens of steps. Ozler et al. (2020)
do not perform a cross-lingual analysis, but study
a scenario closely related to our work. Their find-
ings fall in line with those of M’hamdi et al. (2022)
as they show that combining data from different
domains into one training set for fine-tuning per-
forms better than fine-tuning each domain sepa-
rately. However, this type of joint fine-tuning is ruled out in our scenario, where access to previous training data is unavailable, so we focus exclusively on sequential fine-tuning.
Mirzadeh et al. (2020) study the impact of various training regimes on forgetting mitigation. Their study focuses on learning rates, batch size, and regularization methods. This work, like ours, shows that applying learning rate decay plays a significant role in reducing catastrophic forgetting. However, it is important to point out that our type of decay differs from theirs. Mirzadeh et al. (2020) start with a high initial learning rate for the first task to obtain a wide and stable minimum. Then, for each