dings to combine multilingual word, character and
sub-word embeddings for the NER task. Aguilar
and Solorio (2020) augment language models with
morphological clues and use them for transfer
learning from English to labeled code-switched data.
Samanta et al. (2019) use a translation API to create
synthetic code-switched text from English datasets
and use it for transfer learning from English to
code-switched text when no code-switched labels are
available. Qin et al. (2020) use synthetically
generated code-switched data to enhance zero-shot
cross-lingual transfer learning. Recently, Khanuja
et al. (2020) released the GLUECoS benchmark to
study the performance of multiple models for code-
switched tasks across two language pairs En-Es and
En-Hi. The benchmark contains 6 tasks and 11 datasets,
with 8 models evaluated for every task. Multilingual
transformers fine-tuned with a masked-language-model
objective on code-switched data can outperform
generic multilingual transformers. Results from
Khanuja et al. (2020) show that sentiment analy-
sis, question answering and NLI are significantly
harder than tasks like NER, POS and LID. In this
work, we focus on the sentiment analysis task in
the absence of labeled code-switched data using
multilingual transformer models, while taking into
account the distinction between resource-rich and
low-resource languages. Although our work seems
related to curriculum learning, it is distinct from
the existing work. Most of the work on curriculum
learning is in the supervised setting (Zhang et al.,
2019; Xu et al., 2020), whereas our work focuses on
the zero-shot setting, where no code-switched sample
is annotated. Note that this also differs from the
semi-supervised setting because of the distribution
shift between the labeled resource-rich language
data and the target unlabeled code-switched data.
3 Preliminaries
We consider a sentiment analysis problem in which
we have a labelled resource-rich language dataset
and unlabelled code-switched data. From here
onwards, we refer to the labelled resource-rich
language dataset as the source dataset and to the
unlabelled code-switched dataset as the target dataset.
Since code-switching often occurs in language
pairs that include English, we refer to English as
the resource-rich language. The source dataset, say $S$,
is in English and has the text-label pairs
$\{(x_{s_1}, y_{s_1}), (x_{s_2}, y_{s_2}), \ldots, (x_{s_m}, y_{s_m})\}$,
and the target dataset, say $T$, is in code-switched form
and has texts $\{x_{cs_1}, x_{cs_2}, \ldots, x_{cs_n}\}$,
where $m$ is significantly greater than $n$. The objective
is to learn a sentiment classifier that detects the
sentiment of code-switched data by leveraging the
labelled source dataset and the unlabelled target dataset.
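To make the setup concrete, the following is a minimal sketch of the notation in Python; the type aliases and the size check are illustrative assumptions, not part of the paper.

```python
# Illustrative sketch of the problem setup; names are assumptions, not from the paper.
from typing import List, Tuple

SourceDataset = List[Tuple[str, int]]  # S: English texts x_s with sentiment labels y_s
TargetDataset = List[str]              # T: code-switched texts x_cs without labels

def check_setup(source: SourceDataset, target: TargetDataset) -> None:
    # m (labelled source size) is assumed to be significantly larger than n (target size).
    m, n = len(source), len(target)
    assert m > n, "expected m >> n as described above"
```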
4 Methodology
Our methodology can be broken down into three
main steps: (1) Source dataset pretraining, which
uses the labelled source dataset $S$ in the
resource-rich language to train a text classifier.
This classifier is used to generate pseudo-labels for
the target dataset $T$. (2) Bucket creation, which
divides the unlabelled data $T$ into buckets based on the fraction
of words from the resource-rich language. Some buckets
contain samples dominated by the resource-rich language,
while others contain samples dominated by the
low-resource language. (3) Progressive training, where
we initially train using $S$ and the samples dominated
by the resource-rich language, and gradually include
the instances dominated by the low-resource language
during training. For the rest
of the paper, pretraining refers to step 1 and training
refers to the training in step 3. We also use
class-ratio-based instance selection to prevent the
model from becoming biased towards the majority label.
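As a rough illustration of how the three steps fit together, the sketch below wires them into a single loop. The helpers `pretrain_fn`, `train_fn`, and `is_resource_rich` are hypothetical placeholders for the components described in the following subsections; the uniform bucket split and the `model.predict` interface are assumptions rather than the exact procedure, and class-ratio-based instance selection is omitted here.

```python
def fraction_resource_rich(tokens, is_resource_rich):
    # Fraction of tokens identified as resource-rich (e.g., English) words.
    return sum(is_resource_rich(t) for t in tokens) / max(len(tokens), 1)

def progressive_pipeline(source, target, pretrain_fn, train_fn,
                         is_resource_rich, num_buckets=5):
    # Step 1: pretrain on the labelled source dataset S; this model (m_pt)
    # provides the pseudo-labels for the target sentences.
    model = pretrain_fn(source)

    # Step 2: bucket the unlabelled target data T by the fraction of
    # resource-rich language words, most resource-rich-dominated first.
    ordered = sorted(target,
                     key=lambda x: fraction_resource_rich(x.split(), is_resource_rich),
                     reverse=True)
    size = max(len(ordered) // num_buckets, 1)
    buckets = [ordered[i:i + size] for i in range(0, len(ordered), size)]

    # Step 3: progressively add buckets, from resource-rich dominated to
    # low-resource dominated, retraining on pseudo-labelled instances each time.
    selected = []
    for bucket in buckets:
        selected.extend((x, model.predict(x)) for x in bucket)
        model = train_fn(model, source, selected)
    return model
```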
4.1 Source Dataset Pretraining
Resource-rich languages have abundant resources,
which include labeled data. Intuitively, sentences in $T$
that are similar to positive-sentiment sentences in $S$
would also have positive sentiment (and likewise for
negative sentiment). Therefore, we can treat the
predictions made on $T$ by a multilingual model trained
on $S$ as their respective pseudo-labels. This assigns
noisy pseudo-labels to the unlabeled dataset $T$. The
source dataset pretraining step is a text classification
task. Let the model obtained after pretraining on
dataset $S$ be called $m_{pt}$.
This model is used to generate the initial pseudo-
labels and to select the instances to be used for
progressive training.
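As an illustration, the snippet below shows one way to produce pseudo-labels for the unlabelled sentences in $T$ with a fine-tuned multilingual classifier; the mBERT checkpoint, the binary label head, and the batching details are assumptions for the sketch, not necessarily the setup used in the experiments.

```python
# Sketch of pseudo-labelling T with a multilingual classifier (assumed mBERT backbone).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)
# ... fine-tune `model` on the labelled source dataset S (standard text
# classification) to obtain m_pt before running the step below ...

def pseudo_label(texts, batch_size=32):
    """Assign noisy pseudo-labels to the unlabelled target sentences."""
    model.eval()
    labels = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], padding=True,
                              truncation=True, return_tensors="pt")
            logits = model(**batch).logits
            labels.extend(logits.argmax(dim=-1).tolist())
    return labels
```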
4.2 Bucket Creation
Since progressive training aims to gradually
progress from training on resource-rich language
dominated samples to low-resource language dominated
samples, we divide the dataset $T$ into buckets based
on the fraction of words in the resource-rich language.
This creates buckets that have more
resource-rich language dominated instances and