Progressive Sentiment Analysis for Code-Switched Text Data
Sudhanshu Ranjan 1, Dheeraj Mekala 1, Jingbo Shang 1,2
1 University of California San Diego
2 Halıcıoğlu Data Science Institute, University of California San Diego
{sranjan, dmekala, jshang}@ucsd.edu
Abstract
Multilingual transformer language models have recently attracted much attention from researchers and are used in cross-lingual transfer learning for many NLP tasks such as text classification and named entity recognition. However, similar methods for transfer learning from monolingual text to code-switched text have not been extensively explored, mainly due to the following challenges: (1) a code-switched corpus, unlike a monolingual corpus, consists of more than one language and existing methods cannot be applied efficiently; (2) a code-switched corpus is usually made of a resource-rich and a low-resource language, and upon using multilingual pre-trained language models, the final model might be biased towards the resource-rich language. In this paper, we focus on code-switched sentiment analysis, where we have a labelled resource-rich language dataset and unlabelled code-switched data. We propose a framework that takes the distinction between resource-rich and low-resource languages into account. Instead of training on the entire code-switched corpus at once, we create buckets based on the fraction of words in the resource-rich language and progressively train from resource-rich language dominated samples to low-resource language dominated samples. Extensive experiments across multiple language pairs demonstrate that progressive training helps low-resource language dominated samples.
1 Introduction
Code-switching is the phenomenon where a speaker alternates between two or more languages in a conversation. The lack of annotated data and the diverse combinations of languages in which this phenomenon can be observed make it difficult to progress on NLP tasks for code-switched data. Moreover, the prevalence of the different languages varies, making annotation expensive and difficult.
Figure 1: An example of code-switched text where words in both languages together represent the sentiment. A code-switched text generally contains phrases from multiple languages in a single sentence. The text in blue consists of Hindi words written in the Latin script.
Intuitively, multilingual language models like mBERT (Devlin et al., 2019) can be used for code-switched text since a single model learns multilingual representations. Although the idea seems straightforward, there are multiple issues. Firstly, mBERT performs differently on different languages depending on their script, prevalence, and predominance: it performs well on medium- to high-resource languages but is outperformed by non-contextual subword embeddings in a low-resource setting (Heinzerling and Strube, 2019). Moreover, the performance is highly dependent on the script (Pires et al., 2019a). Secondly, pre-trained language models have only seen monolingual sentences during unsupervised pretraining, whereas code-switched text contains phrases from both languages in a single sentence, as shown in Figure 1, making it an entirely new scenario for these models. Thirdly, the languages differ in the amount of unsupervised corpus used during pretraining.
For example, mBERT is trained on the Wikipedia corpus: English has around 6.3 million articles, whereas Hindi and Tamil have only around 140K articles each. This may lead to under-representation of low-resource languages in the final model. Further, English has been extensively studied by the NLP community over the years, making supervised data and tools more easily accessible. Thus, the model can easily learn patterns present in the resource-rich language segments, motivating us to attempt transfer learning from English supervised datasets to code-switched datasets.
Figure 2: A visualization of the progressive training strategy. The source labelled dataset S in the resource-rich language should be easily available. Using S, a classifier is trained, say m_pt. The unlabelled code-switched dataset T is divided into buckets using the fraction of English words as the metric. The leftmost bucket B1 has samples dominated by the resource-rich language, and as we move towards the right, the samples in the buckets are dominated by the low-resource language. m_pt is used to generate pseudo-labels for unlabelled texts in bucket B1. We use texts from B1 along with their pseudo-labels and the dataset S to train a second text classifier m_1. Then, m_1 is used to get the pseudo-labels for texts in bucket B2. We keep repeating this until we obtain the final model, which is used for predictions.
The main idea behind our paper can be summarised as follows: when doing zero-shot transfer learning from a resource-rich language (LangA) to a code-switched language (say LangA-LangB, where LangB is a low-resource language compared to LangA), the model is more likely to be wrong on instances dominated by LangB. Thus, instead of self-training on the entire corpus at once, we propose to progressively move from LangA-dominated instances to LangB-dominated instances during transfer learning. As illustrated in Figure 2, a model trained on the annotated resource-rich language dataset is used to generate pseudo-labels for code-switched data. Progressive training uses the resource-rich language dataset and (unlabelled) resource-rich language dominated code-switched samples together to generate better quality pseudo-labels for (unlabelled) low-resource language dominated code-switched samples. Lastly, the annotated resource-rich language dataset and the pseudo-labelled code-switched data are used together for training, which increases the performance of the final model.
Our key contributions are summarized as follows:
• We propose a simple, novel training strategy that demonstrates superior performance. Since our hypothesis is based on the pretraining phase of multilingual language models, it can be combined with any transfer learning method.
• We conduct experiments across multiple language-pair datasets, showing the effectiveness of our proposed method.
• We create probing experiments that verify our hypothesis.
Reproducibility. Our code is publicly available on GitHub at https://github.com/s1998/progressiveTrainCodeSwitch.
2 Related Work
Multiple tasks like Language Identification, Named Entity Recognition, Part-of-Speech tagging, Sentiment Analysis, Question Answering, and NLI have been studied in the code-switched setting. For sentiment analysis, Vilares et al. (2015) showed that multilingual approaches can outperform pipelines of monolingual models on code-switched data. Lal et al. (2019) use a CNN-based network for the same task. Winata et al. (2019) use hierarchical meta-embeddings to combine multilingual word, character, and sub-word embeddings for the NER task. Aguilar and Solorio (2020) augment language models with morphological clues and use them for transfer learning from English to code-switched data with labels. Samanta et al. (2019) use a translation API to create synthetic code-switched text from English datasets and use it for transfer learning from English to code-switched text without labels in the code-switched case. Qin et al. (2020) use synthetically generated code-switched data to enhance zero-shot cross-lingual transfer learning. Recently, Khanuja et al. (2020) released the GLUECoS benchmark to study the performance of multiple models on code-switched tasks across two language pairs, En-Es and En-Hi. The benchmark contains 6 tasks and 11 datasets, with 8 models for every task. Multilingual transformers fine-tuned with the masked-language-model objective on code-switched data can outperform generic multilingual transformers. Results from Khanuja et al. (2020) show that sentiment analysis, question answering, and NLI are significantly harder than tasks like NER, POS tagging, and LID. In this work, we focus on the sentiment analysis task in the absence of labelled code-switched data using multilingual transformer models, while taking into account the distinction between resource-rich and low-resource languages. Although our work seems related to curriculum learning, it is distinct from the existing work: most work on curriculum learning is in the supervised setting (Zhang et al., 2019; Xu et al., 2020), whereas our work focuses on the zero-shot setting, where no code-switched sample is annotated. Note that this is also different from the semi-supervised setting because of the distribution shift between the labelled resource-rich language data and the target unlabelled code-switched data.
3 Preliminaries
Our problem is a sentiment analysis problem where we have a labelled resource-rich language dataset and unlabelled code-switched data. From here onwards, we refer to the labelled resource-rich language dataset as the source dataset and the unlabelled code-switched dataset as the target dataset. Since code-switching often occurs in language pairs that include English, we refer to English as the resource-rich language. The source dataset, say S, is in English and consists of text-label pairs {(x_{s_1}, y_{s_1}), (x_{s_2}, y_{s_2}), ..., (x_{s_m}, y_{s_m})}, and the target dataset, say T, is in code-switched form and consists of texts {x_{cs_1}, x_{cs_2}, ..., x_{cs_n}}, where m is significantly greater than n. The objective is to learn a sentiment classifier that detects the sentiment of code-switched data by leveraging the labelled source dataset and the unlabelled target dataset.
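As a minimal, concrete illustration of this setup (the labelled examples are taken from Figure 2; the code-switched strings abbreviate the figure's examples, whose Hindi words are not reproduced here), S and T might look like:

```python
# Illustrative toy versions of the source dataset S and target dataset T described above.
# Real datasets are far larger, with m (the size of S) significantly greater than n.
S = [
    ("i love paris", "positive"),
    ("how sad are you", "negative"),
    # ... up to (x_sm, y_sm)
]

T = [
    "... brilliant bowling by ...",  # mostly English, a few Hindi words
    "... plz ... frnd ...",          # mostly romanised Hindi, a few English words
    # ... up to x_csn (no sentiment labels available)
]
```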
4 Methodology
Our methodology can be broken down into three main steps: (1) Source dataset pretraining, which uses the resource-rich language labelled source dataset S to train a text classifier. This classifier is used to generate pseudo-labels for the target dataset T. (2) Bucket creation, which divides the unlabelled data T into buckets based on the fraction of words from the resource-rich language. Some buckets contain samples that are more resource-rich language dominated, while others contain samples dominated by the low-resource language. (3) Progressive training, where we initially train using S and the samples dominated by the resource-rich language, and gradually include the low-resource language dominated instances during training. For the rest of the paper, pretraining refers to step 1 and training refers to the training in step 3. We also use class-ratio-based instance selection to prevent the model from getting biased towards the majority label.
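To make the three steps concrete, here is a minimal sketch in Python. The helper functions train_classifier, make_buckets, and pseudo_label are placeholders we introduce for illustration; retraining on all previously pseudo-labelled buckets (rather than only the latest one) is our reading of Figure 2, and class-ratio-based instance selection is omitted.

```python
# Sketch of the three-step procedure described above (placeholder helpers, not the released code).
def progressive_training(source_dataset, target_texts, num_buckets):
    # Step 1: source dataset pretraining on the labelled resource-rich dataset S -> m_pt.
    model = train_classifier(labelled_data=source_dataset)

    # Step 2: bucket creation. B1 is dominated by the resource-rich language; the last
    # bucket is dominated by the low-resource language.
    buckets = make_buckets(target_texts, num_buckets)

    # Step 3: progressive training, moving from resource-rich to low-resource dominated samples.
    pseudo_labelled = []
    for bucket in buckets:
        # The current model assigns pseudo-labels to the next bucket.
        pseudo_labelled += pseudo_label(model, bucket)
        # Retrain on S together with the pseudo-labelled code-switched samples seen so far.
        model = train_classifier(labelled_data=source_dataset + pseudo_labelled)
    return model  # final model used for predictions on code-switched text
```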
4.1 Source Dataset Pretraining
Resource-rich languages have abundant resources, including labelled data. Intuitively, sentences in T that are similar to positive-sentiment sentences in S would also have a positive sentiment (and likewise for the negative sentiment). Therefore, we can treat the predictions made on T by a multilingual model trained on S as their respective pseudo-labels. This assigns noisy pseudo-labels to the unlabelled dataset T. The source dataset pretraining step is a text classification task. Let the model obtained after pretraining on dataset S be called m_pt. This model is used to generate the initial pseudo-labels and to select the instances to be used for progressive training.
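As a hedged sketch of how this step could be implemented (not the authors' released code), one could fine-tune mBERT as a sequence classifier on S and then use it to pseudo-label T; the snippet below uses the HuggingFace transformers library as one possible choice, with the fine-tuning loop and hyperparameters omitted.

```python
# One possible realization of m_pt and the pseudo-labelling step (illustrative only).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2  # binary sentiment
)  # fine-tune this on S with a standard cross-entropy objective to obtain m_pt

def pseudo_label(model, tokenizer, texts, device="cpu"):
    """Assign a pseudo-label and a confidence score to each unlabelled code-switched text."""
    model.eval()
    model.to(device)
    labels, confidences = [], []
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(text, truncation=True, return_tensors="pt").to(device)
            probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
            labels.append(int(probs.argmax()))
            confidences.append(float(probs.max()))
    return labels, confidences
```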
4.2 Bucket Creation
Since progressive training aims to gradually progress from training on resource-rich language dominated samples to low-resource language dominated samples, we divide the dataset T into buckets based on the fraction of words in the resource-rich language. This creates buckets that have more resource-rich language dominated instances and buckets that have more low-resource language dominated instances.
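A minimal sketch of this bucketing step, under the assumption of a simple dictionary-based word-level language check and equal-sized buckets (neither of which is confirmed by this excerpt):

```python
# Minimal sketch of bucket creation. The english_vocab lookup stands in for whatever
# word-level language identification is actually used, and the equal-sized split is an
# assumption; the paper's exact bucketing thresholds are not specified here.
def english_fraction(text, english_vocab):
    """Fraction of words in the text that belong to the resource-rich language (English)."""
    words = text.split()
    if not words:
        return 0.0
    return sum(w.lower() in english_vocab for w in words) / len(words)

def make_buckets(texts, num_buckets, english_vocab):
    """Split unlabelled code-switched texts into buckets B1..Bk of decreasing English fraction."""
    ranked = sorted(texts, key=lambda t: english_fraction(t, english_vocab), reverse=True)
    size = max(1, -(-len(ranked) // num_buckets))  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]
```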