
prompt tuning is better than fine-tuning on cross-lingual transfer. Our contributions are summarized as follows: we show that prompt tuning can perform much better than fine-tuning for cross-lingual transfer; we also show that prompt tuning works better for cross-lingual transfer because it makes relatively small but robust changes to the originally learned representations.
2 Prompt-Tuning for Cross-Lingual Tasks
Multilingual Language Models.
In the past few years, many pre-trained multilingual language models have been released: mBERT, XLM (Conneau and Lample, 2019), XLM-R (Conneau et al., 2020), etc. XLM-R (Conneau et al., 2020) significantly outperforms multilingual BERT (mBERT; Devlin et al., 2019) on a variety of cross-lingual benchmarks in XTREME (Hu et al., 2020). In some previous work (Luo et al., 2021; Zhang et al., 2019), XLM-R is also used as the initialization for another round of pretraining on parallel data to obtain stronger cross-lingual ability. Previously, in cross-lingual evaluation, models were fine-tuned on the English training data but evaluated on all target languages. To the best of our knowledge, we are the first to explore prompt tuning on several hard multilingual NLP tasks, including structure prediction and question answering.
Figure 1: Two different approaches for cross-lingual evaluation with a large multilingual language model. Left: In fine-tuning, all model parameters are tuned on English task data; this is the standard setting in previous cross-lingual evaluation. Right: In prompt tuning, only a small fraction of the parameters is tuned. We use prefix prompts with separate prompts for each layer in our experiments.
Prompt Tuning.
Fine-tuning large pre-trained language models leads to strong performance on downstream tasks; however, it is memory-consuming, and a full copy of the parameters has to be saved for each task. In prompt tuning, only a small portion of the parameters (e.g., prompts or a task classifier) is tuned during learning. However, it usually does not perform as well as fine-tuning. Recently, Lester et al. (2021) find that prompt tuning can match fine-tuning when the model size is extremely large (around 10 billion parameters). Prefix-tuning (Li and Liang, 2021) obtains comparable performance on natural language generation tasks. Liu et al. (2022) show that prompt tuning can match fine-tuning on language understanding tasks, even on hard sequence tagging tasks.
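As a rough, back-of-the-envelope illustration of this storage argument (the dimensions and prompt length below are assumed example values, not figures from our experiments), compare the per-task parameter counts of full fine-tuning and of layer-wise prefix prompts for an XLM-R-large-sized model:

```python
# Illustrative per-task parameter counts (assumed values).
full_finetune = 560_000_000                  # every backbone parameter is task-specific

n_layers, hidden, prompt_len = 24, 1024, 16  # XLM-R large-like dimensions
# One key-prompt and one value-prompt vector per position, per layer.
prompt_tuning = n_layers * 2 * prompt_len * hidden

print(f"prompt parameters: {prompt_tuning:,}")                             # 786,432
print(f"fraction of the full model: {prompt_tuning / full_finetune:.2%}")  # ~0.14%
```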
We investigate prompt tuning for cross-lingual understanding with a pre-trained multilingual language model. The framework is shown in Figure 1. Our setting is similar to Li and Liang (2021) and Liu et al. (2022). The continuous prompts are added as prefix tokens and tuned during learning. In the implementation, the prompts are fed to the model as past keys and values in each transformer layer, and each transformer layer has its own separate prompts. These continuous prompts are optimized, while the multilingual language model parameters are kept frozen.
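The sketch below is a minimal PyTorch illustration of this design rather than our exact implementation: each layer gets its own learnable (key, value) prompt, which is expanded per batch and handed to the backbone as Hugging Face-style past_key_values. The module name PrefixPrompts and all dimensions (e.g., prompt_len=16, XLM-R-large-like sizes) are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PrefixPrompts(nn.Module):
    """Layer-wise prefix prompts: every transformer layer receives its own
    learnable key/value prompt vectors, while the backbone stays frozen."""

    def __init__(self, n_layers=24, n_heads=16, head_dim=64, prompt_len=16):
        super().__init__()
        # One key-prompt and one value-prompt matrix per layer.
        self.prompts = nn.Parameter(
            0.02 * torch.randn(n_layers, 2, prompt_len, n_heads, head_dim)
        )

    def forward(self, batch_size):
        # Reorder to (n_layers, 2, batch, n_heads, prompt_len, head_dim),
        # the layout expected for Hugging Face-style past_key_values.
        p = self.prompts.permute(0, 1, 3, 2, 4)
        p = p.unsqueeze(2).expand(-1, -1, batch_size, -1, -1, -1)
        # One (key, value) pair per transformer layer.
        return tuple((layer[0], layer[1]) for layer in p)

# Usage sketch, assuming `backbone` is a frozen multilingual encoder whose
# forward pass accepts past_key_values (variable names are hypothetical):
#   prefix = PrefixPrompts()
#   past = prefix(input_ids.size(0))
#   prompt_mask = torch.ones(input_ids.size(0), 16)
#   out = backbone(input_ids,
#                  attention_mask=torch.cat([prompt_mask, attention_mask], 1),
#                  past_key_values=past)
# Only prefix.prompts (plus a small task head) receive gradient updates.
```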
3 Experiments Setup
3.1 Datasets.
We perform experiments on the following datasets included in XTREME: cross-lingual natural language inference (XNLI; Conneau et al., 2018), the cross-lingual adversarial dataset for paraphrase identification (PAWS-X; Yang et al., 2019), part-of-speech tagging on Universal Dependencies (UD-POS; Nivre et al., 2018), and cross-lingual question answering on XQuAD (Artetxe et al., 2020) and TyDiQA-GoldP (Clark et al., 2020). Three categories of downstream tasks are covered: (1) sentence classification; (2) structure prediction; (3) question answering.
3.2 Training Details.
Our frozen models are built on top of the pre-trained XLM-R checkpoint of LARGE size, with about 560M parameters. Previous work (Hu et al., 2020) shows that it achieves stronger performance than mBERT.² All our experiments were run with Huggingface (Wolf et al., 2020). More details are in the appendix.

²Some preliminary results are obtained with mBERT.
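For reference, a minimal sketch of how the frozen backbone can be set up with Huggingface Transformers (the checkpoint name is the standard public one; the rest is an assumed setup rather than our exact training script):

```python
from transformers import AutoModel

# Load the pre-trained XLM-R checkpoint of LARGE size (~560M parameters).
backbone = AutoModel.from_pretrained("xlm-roberta-large")
print(f"{sum(p.numel() for p in backbone.parameters()):,} parameters")

# Freeze the backbone: only the prompts (and a small task head) are tuned.
for param in backbone.parameters():
    param.requires_grad = False
```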
Prompt Length.
Prompt length usually plays an important role in prompt tuning. In our experiments, we treat it as a hyper-parameter. Longer