BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset
Ajwad Akil*, Najrin Sultana*, Abhik Bhattacharjee, Rifat Shahriyar
Bangladesh University of Engineering and Technology (BUET)
ajwadakillabib@gmail.com, nazrinshukti@gmail.com,
abhik@ra.cse.buet.ac.bd, rifat@cse.buet.ac.bd
Abstract
In this work, we present BanglaParaphrase, a high-quality synthetic Bangla paraphrase dataset curated by a novel filtering pipeline. We aim to take a step towards alleviating the low-resource status of the Bangla language in the NLP domain through the introduction of BanglaParaphrase, which ensures quality by preserving both semantics and diversity, making it particularly useful for enhancing other Bangla datasets. We show a detailed comparative analysis between our dataset, and models trained on it, and other existing works to establish the viability of our synthetic paraphrase data generation pipeline. We are making the dataset and models publicly available at https://github.com/csebuetnlp/banglaparaphrase to further the state of Bangla NLP.
1 Introduction
Bangla, despite being the seventh most spoken language by total number of speakers [1] and the fifth most spoken language by native speakers [2], is still considered a low-resource language in terms of language processing. Joshi et al. (2020) classified Bangla into the group of languages that substantially lack efforts in labeled data collection and preparation. This lack is evident in the scarcity of high-quality datasets for various natural language tasks, including paraphrase generation.
Paraphrases can be roughly defined as pairs of texts that have similar meanings but may differ structurally. The task of generating paraphrases, given a sentence, is thus to produce sentences with different wordings and/or structures while preserving the original meaning. Paraphrasing can be a vital tool for language understanding tasks such as question answering (Pazzani and Engelman, 1983; Dong et al., 2017), style transfer (Krishna et al., 2020), semantic parsing (Cao et al., 2020), and data augmentation (Gao et al., 2020).

* These authors contributed equally to this work.
[1] https://w.wiki/Pss
[2] https://w.wiki/Psq
Paraphrase generation has been a challenging problem in the natural language processing domain, as it involves several contrasting elements, such as semantics and structure, that must be balanced to obtain a good paraphrase of a sentence. Syntactically, Bangla has a different structure than high-resource languages like English and French. The principal word order of Bangla is subject-object-verb (SOV), but it also allows free word ordering during sentence formation. Pronoun usage in Bangla has various forms, such as "very familiar", "familiar", and "polite" [3]. It is imperative to maintain the coherence of these forms throughout a sentence as well as across the paraphrases in a Bangla paraphrase dataset. Following that thread, we create a Bangla paraphrase dataset ensuring good quality in terms of semantics and diversity. Since generating datasets by manual intervention is time-consuming, we curate our BanglaParaphrase dataset through a pivoting (Zhao et al., 2008) approach, with additional filtering stages to ensure diversity and semantic fidelity. We further study the effects of dataset augmentation on a synthetic dataset using masked language modeling. Finally, we demonstrate the quality of our dataset by training baseline models and through comparative analysis with other Bangla paraphrase datasets and models. In summary:

- We present BanglaParaphrase, a synthetic Bangla paraphrase dataset ensuring both diversity and semantics.
- We introduce a novel filtering mechanism for dataset preparation and evaluation.

[3] https://en.wikipedia.org/wiki/Bengali_grammar
arXiv:2210.05109v1 [cs.CL] 11 Oct 2022
2 Related Work
Paraphrase generation datasets and models are
heavily dominated by high-resource languages
such as English. But for low-resource languages
such as Bangla, this domain is less explored. To
our knowledge, only (Kumar et al.,2022) described
the use of IndicBART (Dabre et al.,2021) to gen-
erate paraphrases using the sequence-to-sequence
approach for the Bangla language. One of the most
challenging barriers to paraphrasing research for
low-resource languages is the shortage of good-
quality datasets. Among recent work on low-
resource paraphrase datasets, (Kanerva et al.,2021)
introduced a comprehensive dataset for the Finnish
language. The OpusParcus dataset (Creutz,2018)
consists of paraphrases for six European languages.
For Indic languages such as Tamil, Hindi, Punjabi,
and Malayalam, Anand Kumar et al. (2016) intro-
duced a paraphrase detection dataset in a shared
task. Scherrer (2020) introduced a paraphrase
dataset for 73 languages, where there are only
about 1400 sentences in total for the Bangla lan-
guage, mainly consisting of simple sentences.
3 Paraphrase Dataset Generation and Curation
3.1 Synthetic Dataset Generation
We started by scraping high-quality representative sentences for the Bangla web domain from the RoarBangla website [4] and translated them from Bangla to English using the state-of-the-art translation model developed by Hasan et al. (2020), with 5 references. For each generated English sentence, 5 new Bangla translations were generated using beam search. Among these multiple generations, only those (original sentence, back-translated sentence) pairs were chosen as candidate data points where the LaBSE (Feng et al., 2022) similarity scores for both (original Bangla, back-translated Bangla) and (original Bangla, translated English) were greater than 0.7 [5]. After this process, we were left with more than 1.364M sentences with multiple references for each source.
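The candidate selection step above can be sketched as follows. This is a minimal illustration, assuming a `labse_sim` function that returns the LaBSE cosine similarity between two sentences; here it is stubbed with precomputed scores rather than a real encoder.

```python
# Sketch of the candidate-pair selection step: keep a (source, english,
# back-translation) triple only if both similarity checks pass.
THRESHOLD = 0.7  # LaBSE similarity cutoff used in the pipeline

def select_candidates(triples, labse_sim, threshold=THRESHOLD):
    """Return (source, back_translation) pairs where both the
    source/back-translation and source/english LaBSE similarities
    exceed the threshold."""
    pairs = []
    for src, en, bt in triples:
        if labse_sim(src, bt) > threshold and labse_sim(src, en) > threshold:
            pairs.append((src, bt))
    return pairs

# Illustrative stub: pretend similarities were precomputed.
scores = {("s1", "bt1"): 0.91, ("s1", "en1"): 0.88,
          ("s2", "bt2"): 0.55, ("s2", "en2"): 0.93}
sim = lambda a, b: scores[(a, b)]

kept = select_candidates([("s1", "en1", "bt1"), ("s2", "en2", "bt2")], sim)
# Only the first triple passes both similarity checks.
```

In the real pipeline the similarity function would come from a LaBSE encoder; the stub only demonstrates the two-sided thresholding logic.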
3.2 Novel Filtering Pipeline
As mentioned by Chen and Dolan (2011), paraphrases must ensure fluency, semantic similarity, and diversity. To that end, we use different metrics evaluating each of these aspects as filters, in a pipelined fashion.

[4] https://roar.media/bangla
[5] We chose 0.7 as the LaBSE semantic similarity threshold following Bhattacharjee et al. (2022a).
To ensure diversity, we chose PINC (Paraphrase In N-gram Changes) from among various diversity-measuring metrics (Chen and Dolan, 2011; Sun and Zhou, 2012), as it considers the lexical dissimilarity between the source and the candidates. We name this first filter the PINC Score Filter. To use this metric for filtering, we determined the optimum threshold value empirically by following a plot [6] of the data yield against the PINC score, indicating the amount of data having at least a certain PINC score. We chose the threshold value that maximizes the PINC score while retaining over 63.16% yield.
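The PINC score of Chen and Dolan (2011) averages, over n-gram orders 1..N, the fraction of candidate n-grams absent from the source, so a higher score means more lexical novelty. A minimal sketch, assuming whitespace tokenization for simplicity:

```python
# PINC (Paraphrase In N-gram Changes): mean over n = 1..N of the share
# of candidate n-grams that do not occur in the source sentence.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source, candidate, max_n=4):
    src, cand = source.split(), candidate.split()
    total, count = 0.0, 0
    for n in range(1, max_n + 1):
        c = ngrams(cand, n)
        if not c:            # candidate shorter than n tokens
            break
        s = ngrams(src, n)
        total += 1.0 - len(c & s) / len(c)
        count += 1
    return total / count if count else 0.0

# An identical pair scores 0; a fully disjoint pair scores 1.
```

A filter would then keep only pairs whose `pinc` value meets the chosen threshold (e.g. 0.76).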
Since contextualized token embeddings have been shown to be effective for paraphrase detection (Devlin et al., 2019), we use BERTScore (Zhang et al., 2019) to ensure semantic similarity between the source and candidates. After our PINC filter, we experimented with BERTScore, which uses the multilingual BERT model (Devlin et al., 2019) by default. We also experimented with BanglaBERT (Bhattacharjee et al., 2022a) embeddings and decided to use these for our semantic filter, since BanglaBERT is a monolingual model performing exceptionally well on Bangla NLU tasks. We selected the threshold similarly to the PINC filter by following the corresponding plot, and in all of our experiments, we used the F1 measure as the filtering metric. We name this second filter the BERTScore Filter. Through a human evaluation [7] of 300 randomly chosen samples, we deduced that pairs having BERTScore (with BanglaBERT embeddings) ≥ 0.92 were semantically sound and decided to use this as a starting point to figure out our desired threshold. We further validated our choice of parameters through model-generated paraphrases, with the models trained on filtered datasets using different parameters (detailed in Section 4.1).
Initially training on the resultant dataset from the previous two filters, we noticed that some of the predicted paraphrases were growing unnecessarily long by repeating parts during inference. As repeated N-grams within the corpus most likely were the culprit behind this, we attempted to ameliorate the issue by introducing our third filter, the N-gram Repetition Filter, where we tested the target side of our dataset for any N-gram repeats with values of N from 1 to 4. We found fewer than 200 sentences on the target side with a 2-gram repetition and decided to use N = 2 for this filter. Additionally, we removed sentences without terminating punctuation from the corpus to ensure a noise-free dataset before proceeding with the training. We term this last filter the Punctuation Filter. The filters, along with their significance and parameters, are summarised in Table 1.

[6] More details are presented in the Appendix.
[7] More details are presented in the ethical considerations section.

Filter Name       | Significance                                                  | Filtering Parameters
PINC              | Ensure diversity in generated paraphrases                     | 0.65, 0.76, 0.80
BERTScore         | Preserve semantic coherence with the source                   | lower 0.91-0.93, upper 0.98
N-gram repetition | Reduce n-gram repetition during inference                     | 2-4 grams
Punctuation       | Prevent generating non-terminating sentences during inference | N/A

Table 1: Filtering scheme
3.3 Evaluation Metrics
Following Niu et al. (2021), we used multiple metrics to evaluate several criteria of our generated paraphrases. For quality, we used sacreBLEU (Post, 2018) and ROUGE-L (Lin, 2004), with the multilingual ROUGE scoring implementation introduced by Hasan et al. (2021), which supports Bangla stemming and tokenization. For syntactic diversity, we used the PINC score, as we did for filtering. For measuring semantic correctness, we used the BERTScore F1 measure with BanglaBERT embeddings. Additionally, we used a modified version of a hybrid score named BERT-iBLEU (Niu et al., 2021), where we also used BanglaBERT embeddings for the BERTScore part. This hybrid score measures semantic similarity while penalizing syntactic similarity to ensure the diversity of the paraphrases. More details about the evaluation scores can be found in the Appendix.
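The intuition behind such a hybrid score can be sketched as a weighted harmonic mean of a semantic score and one minus self-BLEU. Note this is an illustration of the general idea, not the exact formula or weighting of Niu et al. (2021); the weight `beta` and the stubbed inputs are assumptions.

```python
# Hybrid score sketch: reward meaning preservation (bertscore_f1) while
# penalizing surface overlap with the source (self_bleu).
def hybrid_score(bertscore_f1, self_bleu, beta=4.0):
    """Weighted harmonic mean of bertscore_f1 and (1 - self_bleu)."""
    ibleu = 1.0 - self_bleu
    if bertscore_f1 == 0 or ibleu == 0:
        return 0.0
    return (beta + 1.0) / (beta / bertscore_f1 + 1.0 / ibleu)

# A diverse paraphrase (low self-BLEU) outscores a near-copy even if the
# near-copy has a slightly higher semantic score.
diverse = hybrid_score(0.93, 0.20)
copyish = hybrid_score(0.99, 0.95)
```

The harmonic mean drives the score towards the weaker of the two components, so a paraphrase must be both semantically faithful and lexically novel to score well.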
3.4 Diverse Dataset Generation by Masked Language Modeling
We investigated whether the dataset could be further augmented by replacing tokens of particular parts of speech with other synonymous tokens. To that end, we fine-tuned BanglaBERT (Bhattacharjee et al., 2022a) for POS tagging with a token classification head on the dataset of Sankaran et al. (2008), which contains 30 POS tags.

The idea of augmenting the dataset with masking follows the work of Mohiuddin et al. (2021). We first tagged the parts of speech on the source side of our synthetic dataset and then chose 7 Bangla parts of speech to maximize the diversification of syntactic content. We masked the corresponding tokens and filled them through MLM sequentially. We used both XLM-RoBERTa (Conneau et al., 2020) and BanglaBERT to perform MLM out of the box. Of these two, BanglaBERT performed mask-filling with less noise, so we selected the results of this model. To ensure consistency with our initial dataset, we also filtered these outputs with the pipeline outlined in Section 3.2, choosing a PINC score threshold of 0.7 [8] and (0.92 - 0.98) (lower and upper limits) for the BERTScore threshold, obtaining about 70K sentences. We used this dataset, together with our initially filtered one, for training models in a separate experiment. [9]
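The sequential mask-and-fill procedure can be sketched as follows. This is a toy illustration: `fill_mask` stands in for an MLM such as BanglaBERT, the target POS set is illustrative (the paper uses 7 Bangla tags), and the English tokens are placeholders.

```python
# POS-targeted augmentation sketch: tokens whose POS tag is in the target
# set are masked and filled one at a time, so later fills see earlier
# replacements in their context.
TARGET_POS = {"ADJ", "NOUN"}  # illustrative subset

def augment(tokens, tags, fill_mask, target_pos=TARGET_POS):
    """Sequentially mask each target-POS token and fill it from the
    surrounding (partially updated) context."""
    out = list(tokens)
    for i, tag in enumerate(tags):
        if tag in target_pos:
            masked = out[:i] + ["[MASK]"] + out[i + 1:]
            out[i] = fill_mask(masked, i)
    return out

# Toy fill function returning a fixed synonym per position; a real run
# would query a fill-mask model instead.
synonyms = {1: "large", 2: "home"}
fill = lambda toks, i: synonyms[i]

result = augment(["a", "big", "house"], ["DET", "ADJ", "NOUN"], fill)
```

The augmented sentences would then pass through the same PINC and BERTScore filters before being added to the training data.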
4 Experiments and Results
4.1 Experimental Setup
We first filtered the synthetic dataset with our 4-stage filtering mechanism and then fine-tuned the mT5-small model (Xue et al., 2021), keeping the default learning rate of 0.001 for 10 epochs. In each experiment, we changed the dataset while keeping the model fixed, as our objective was to find thresholds for the first two filters for which the metrics on both the validation and test sets of the individual dataset gave promising results. We conducted several experiments by varying the PINC score over (0.65, 0.76, 0.80) and the BERTScore lower limit over (0.91, 0.92, 0.93) with an upper limit of 0.98, following the respective plots.

The evaluation metrics for each experiment were tracked, and we examined how the thresholds affected the metrics on the test set of the dataset we were experimenting with. We finally chose an effective threshold of 0.76 for the PINC score and 0.92 - 0.98 (lower and upper limits) for BERTScore, such that it provides a good balance between good automated evaluation scores and data amount, and

[8] We lowered the threshold since this augmentation does not diversify the structure of the sentences.
[9] Further details of the whole experiment can be found in the Appendix.