Progressive Sentiment Analysis for Code-Switched Text Data
Sudhanshu Ranjan 1, Dheeraj Mekala 1, Jingbo Shang 1,2
1 University of California San Diego
2 Halıcıoğlu Data Science Institute, University of California San Diego
{sranjan, dmekala, jshang}@ucsd.edu
Abstract
Multilingual transformer language models have recently attracted much attention from researchers and are used in cross-lingual transfer learning for many NLP tasks such as text classification and named entity recognition. However, similar methods for transfer learning from monolingual text to code-switched text have not been extensively explored, mainly due to the following challenges: (1) a code-switched corpus, unlike a monolingual corpus, consists of more than one language and existing methods cannot be applied efficiently; (2) a code-switched corpus is usually made of a resource-rich and a low-resource language, and upon using multilingual pre-trained language models, the final model might be biased towards the resource-rich language. In this paper, we focus on code-switched sentiment analysis, where we have a labelled resource-rich language dataset and unlabelled code-switched data. We propose a framework that takes the distinction between resource-rich and low-resource languages into account. Instead of training on the entire code-switched corpus at once, we create buckets based on the fraction of words in the resource-rich language and progressively train from resource-rich language dominated samples to low-resource language dominated samples. Extensive experiments across multiple language pairs demonstrate that progressive training helps low-resource language dominated samples.
1 Introduction
Code-switching is the phenomenon where a speaker alternates between two or more languages in a conversation. The lack of annotated data and the diverse combinations of languages in which this phenomenon can be observed make it difficult to progress on NLP tasks for code-switched data. Moreover, the prevalence of the different languages varies, making annotation expensive and difficult.
Figure 1: An example of code-switched text where words in both languages together represent the sentiment. A code-switched text generally contains phrases from multiple languages in a single sentence. The text in blue consists of Hindi words written in the Latin script.
Intuitively, multilingual language models like mBERT (Devlin et al., 2019) can be used for code-switched text since a single model learns multilingual representations. Although the idea seems straightforward, there are multiple issues. Firstly, mBERT performs differently on different languages depending on their script, prevalence, and predominance: it performs well on medium- to high-resource languages but is outperformed by non-contextual subword embeddings in a low-resource setting (Heinzerling and Strube, 2019). Moreover, the performance is highly dependent on the script (Pires et al., 2019a). Secondly, pre-trained language models have only seen monolingual sentences during unsupervised pretraining, whereas code-switched text contains phrases from both languages in a single sentence, as shown in Figure 1, making it an entirely new scenario for these models. Thirdly, the languages differ in the amount of unsupervised corpus used during pretraining.
For example, mBERT is trained on the Wikipedia corpus: English has around 6.3 million articles, whereas Hindi and Tamil have only around 140K articles each. This may lead to under-representation of low-resource languages in the final model. Further, English has been extensively studied by the NLP community over the years, making supervised data and tools more easily accessible. Thus, the model can easily learn patterns present in the resource-rich language segments, motivating us to attempt transfer learning from English supervised datasets to code-switched datasets.
Figure 2: A visualization of the progressive training strategy. The source labelled dataset S in the resource-rich language should be easily available. Using S, a classifier is trained, say m_pt. The unlabelled code-switched dataset T is divided into buckets using the fraction of English words as the metric. The leftmost bucket B1 has samples dominated by the resource-rich language, and as we move towards the right, the samples in the buckets are dominated by the low-resource language. m_pt is used to generate pseudo-labels for unlabelled texts in bucket B1. We use texts from B1 along with their pseudo-labels and the dataset S to train a second text classifier m_1. Then, m_1 is used to get the pseudo-labels for texts in bucket B2. We keep repeating this until we obtain the final model, which is used for predictions.
The main idea behind our paper can be summarised as follows: when doing zero-shot transfer learning from a resource-rich language (LangA) to a code-switched language (say LangA-LangB, where LangB is a low-resource language compared to LangA), the model is more likely to be wrong on instances dominated by LangB. Thus, instead of self-training on the entire corpus at once, we propose to progressively move from LangA-dominated instances to LangB-dominated instances during transfer learning. As illustrated in Figure 2, a model trained on the annotated resource-rich language dataset is used to generate pseudo-labels for code-switched data. Progressive training uses the resource-rich language dataset and (unlabelled) resource-rich language dominated code-switched samples together to generate better quality pseudo-labels for (unlabelled) low-resource language dominated code-switched samples. Lastly, the annotated resource-rich language dataset and the pseudo-labelled code-switched data are used together for training, which increases the performance of the final model.
Our key contributions are summarized as follows:
• We propose a simple, novel training strategy that demonstrates superior performance. Since our hypothesis is based on the pretraining phase of multilingual language models, it can be combined with any transfer learning method.
• We conduct experiments across multiple language-pair datasets, showing the effectiveness of our proposed method.
• We create probing experiments that verify our hypothesis.
Reproducibility. Our code is publicly available on GitHub at https://github.com/s1998/progressiveTrainCodeSwitch.
2 Related Work
Multiple tasks like Language Identification, Named Entity Recognition, Part-of-Speech tagging, Sentiment Analysis, Question Answering, and NLI have been studied in the code-switched setting. For sentiment analysis, Vilares et al. (2015) showed that multilingual approaches can outperform pipelines of monolingual models on code-switched data. Lal et al. (2019) use a CNN-based network for the same task. Winata et al. (2019) use hierarchical meta-embeddings to combine multilingual word, character, and sub-word embeddings for the NER task. Aguilar and Solorio (2020) augment language models with morphological clues and use them for transfer learning from English to code-switched data with labels. Samanta et al. (2019) use a translation API to create synthetic code-switched text from English datasets and use it for transfer learning from English to code-switched text without labels in the code-switched case. Qin et al. (2020) use synthetically generated code-switched data to enhance zero-shot cross-lingual transfer learning. Recently, Khanuja et al. (2020) released the GLUECoS benchmark to study the performance of multiple models on code-switched tasks across two language pairs, En-Es and En-Hi. The benchmark contains 6 tasks and 11 datasets, with 8 models for every task. Multilingual transformers fine-tuned with the masked-language-model objective on code-switched data can outperform generic multilingual transformers. Results from Khanuja et al. (2020) show that sentiment analysis, question answering, and NLI are significantly harder than tasks like NER, POS tagging, and LID. In this work, we focus on the sentiment analysis task in the absence of labelled code-switched data using multilingual transformer models, while taking into account the distinction between resource-rich and low-resource languages. Although our work seems related to curriculum learning, it is distinct from the existing work: most work on curriculum learning is in the supervised setting (Zhang et al., 2019; Xu et al., 2020), whereas our work focuses on the zero-shot setting, where no code-switched sample is annotated. Note that this is also different from the semi-supervised setting because of the distribution shift between the labelled resource-rich language data and the target unlabelled code-switched data.
3 Preliminaries
Our problem is a sentiment analysis problem where we have a labelled resource-rich language dataset and unlabelled code-switched data. From here onwards, we refer to the labelled resource-rich language dataset as the source dataset and the unlabelled code-switched dataset as the target dataset. Since code-switching often occurs in language pairs that include English, we refer to English as the resource-rich language. The source dataset, say S, is in English and consists of text-label pairs {(x_{s_1}, y_{s_1}), (x_{s_2}, y_{s_2}), ..., (x_{s_m}, y_{s_m})}, and the target dataset, say T, is in code-switched form and consists of texts {x_{cs_1}, x_{cs_2}, ..., x_{cs_n}}, where m is significantly greater than n. The objective is to learn a sentiment classifier that detects the sentiment of code-switched data by leveraging the labelled source dataset and the unlabelled target dataset.
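As a minimal, concrete illustration of this setup (the labelled examples are taken from Figure 2; the code-switched strings abbreviate the figure's examples, whose Hindi words are not reproduced here), S and T might look like:

```python
# Illustrative toy versions of the source dataset S and target dataset T described above.
# Real datasets are far larger, with m (the size of S) significantly greater than n.
S = [
    ("i love paris", "positive"),
    ("how sad are you", "negative"),
    # ... up to (x_sm, y_sm)
]

T = [
    "... brilliant bowling by ...",  # mostly English, a few Hindi words
    "... plz ... frnd ...",          # mostly romanised Hindi, a few English words
    # ... up to x_csn (no sentiment labels available)
]
```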
4 Methodology
Our methodology can be broken down into three main steps: (1) Source dataset pretraining, which uses the resource-rich language labelled source dataset S to train a text classifier. This classifier is used to generate pseudo-labels for the target dataset T. (2) Bucket creation, which divides the unlabelled data T into buckets based on the fraction of words from the resource-rich language. Some buckets contain samples that are more resource-rich language dominated, while others contain samples dominated by the low-resource language. (3) Progressive training, where we initially train using S and the samples dominated by the resource-rich language, and gradually include the low-resource language dominated instances during training. For the rest of the paper, pretraining refers to step 1 and training refers to the training in step 3. We also use class-ratio-based instance selection to prevent the model from getting biased towards the majority label.
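To make the three steps concrete, here is a minimal sketch in Python. The helper functions train_classifier, make_buckets, and pseudo_label are placeholders we introduce for illustration; retraining on all previously pseudo-labelled buckets (rather than only the latest one) is our reading of Figure 2, and class-ratio-based instance selection is omitted.

```python
# Sketch of the three-step procedure described above (placeholder helpers, not the released code).
def progressive_training(source_dataset, target_texts, num_buckets):
    # Step 1: source dataset pretraining on the labelled resource-rich dataset S -> m_pt.
    model = train_classifier(labelled_data=source_dataset)

    # Step 2: bucket creation. B1 is dominated by the resource-rich language; the last
    # bucket is dominated by the low-resource language.
    buckets = make_buckets(target_texts, num_buckets)

    # Step 3: progressive training, moving from resource-rich to low-resource dominated samples.
    pseudo_labelled = []
    for bucket in buckets:
        # The current model assigns pseudo-labels to the next bucket.
        pseudo_labelled += pseudo_label(model, bucket)
        # Retrain on S together with the pseudo-labelled code-switched samples seen so far.
        model = train_classifier(labelled_data=source_dataset + pseudo_labelled)
    return model  # final model used for predictions on code-switched text
```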
4.1 Source Dataset Pretraining
Resource-rich languages have abundant resources, including labelled data. Intuitively, sentences in T that are similar to positive-sentiment sentences in S would also have a positive sentiment (and likewise for the negative sentiment). Therefore, we can treat the predictions made on T by a multilingual model trained on S as their respective pseudo-labels. This assigns noisy pseudo-labels to the unlabelled dataset T. The source dataset pretraining step is a text classification task. Let the model obtained after pretraining on dataset S be called m_pt. This model is used to generate the initial pseudo-labels and to select the instances to be used for progressive training.
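As a hedged sketch of how this step could be implemented (not the authors' released code), one could fine-tune mBERT as a sequence classifier on S and then use it to pseudo-label T; the snippet below uses the HuggingFace transformers library as one possible choice, with the fine-tuning loop and hyperparameters omitted.

```python
# One possible realization of m_pt and the pseudo-labelling step (illustrative only).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2  # binary sentiment
)  # fine-tune this on S with a standard cross-entropy objective to obtain m_pt

def pseudo_label(model, tokenizer, texts, device="cpu"):
    """Assign a pseudo-label and a confidence score to each unlabelled code-switched text."""
    model.eval()
    model.to(device)
    labels, confidences = [], []
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(text, truncation=True, return_tensors="pt").to(device)
            probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
            labels.append(int(probs.argmax()))
            confidences.append(float(probs.max()))
    return labels, confidences
```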
4.2 Bucket Creation
Since progressive training aims to gradually progress from training on resource-rich language dominated samples to low-resource language dominated samples, we divide the dataset T into buckets based on the fraction of words in the resource-rich language. This creates buckets that have more resource-rich language dominated instances and buckets that have more low-resource language dominated instances.
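A minimal sketch of this bucketing step, under the assumption of a simple dictionary-based word-level language check and equal-sized buckets (neither of which is confirmed by this excerpt):

```python
# Minimal sketch of bucket creation. The english_vocab lookup stands in for whatever
# word-level language identification is actually used, and the equal-sized split is an
# assumption; the paper's exact bucketing thresholds are not specified here.
def english_fraction(text, english_vocab):
    """Fraction of words in the text that belong to the resource-rich language (English)."""
    words = text.split()
    if not words:
        return 0.0
    return sum(w.lower() in english_vocab for w in words) / len(words)

def make_buckets(texts, num_buckets, english_vocab):
    """Split unlabelled code-switched texts into buckets B1..Bk of decreasing English fraction."""
    ranked = sorted(texts, key=lambda t: english_fraction(t, english_vocab), reverse=True)
    size = max(1, -(-len(ranked) // num_buckets))  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]
```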