BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset
Ajwad Akil*, Najrin Sultana*, Abhik Bhattacharjee, Rifat Shahriyar
Bangladesh University of Engineering and Technology (BUET)
ajwadakillabib@gmail.com, nazrinshukti@gmail.com,
abhik@ra.cse.buet.ac.bd, rifat@cse.buet.ac.bd
Abstract
In this work, we present BanglaParaphrase, a high-quality synthetic Bangla paraphrase dataset curated by a novel filtering pipeline. We aim to take a step towards alleviating the low-resource status of the Bangla language in the NLP domain through the introduction of BanglaParaphrase, which ensures quality by preserving both semantics and diversity, making it particularly useful for enhancing other Bangla datasets. We show a detailed comparative analysis between our dataset, and models trained on it, and other existing works to establish the viability of our synthetic paraphrase data generation pipeline. We are making the dataset and models publicly available at https://github.com/csebuetnlp/banglaparaphrase to further the state of Bangla NLP.
1 Introduction
Bangla, despite being the seventh most spoken language by total number of speakers [1] and the fifth most spoken language by native speakers [2], is still considered a low-resource language in terms of language processing. Joshi et al. (2020) classified Bangla into the group of languages that substantially lack efforts in labeled data collection and preparation. This lack is evident in the scarcity of high-quality datasets for various natural language tasks, including paraphrase generation.
Paraphrases can be roughly defined as pairs of texts that have similar meanings but may differ structurally. The task of generating paraphrases, given a sentence, is thus to produce sentences with different wordings and/or structures while preserving the original meaning. Paraphrasing can be a vital tool for language understanding tasks such as question answering (Pazzani and Engelman, 1983; Dong et al., 2017), style transfer (Krishna et al., 2020), semantic parsing (Cao et al., 2020), and data augmentation (Gao et al., 2020).

* These authors contributed equally to this work.
[1] https://w.wiki/Pss
[2] https://w.wiki/Psq
Paraphrase generation has been a challenging problem in the natural language processing domain, as it involves several contrasting elements, such as semantics and structure, that must be balanced to obtain a good paraphrase of a sentence. Syntactically, Bangla has a different structure than high-resource languages like English and French. The principal word order of Bangla is subject-object-verb (SOV), but it also allows free word ordering during sentence formation. Pronoun usage in Bangla has various forms, such as "very familiar", "familiar", and "polite" [3]. It is imperative to maintain the coherence of these forms throughout a sentence as well as across the paraphrases in a Bangla paraphrase dataset. Following that thread, we create a Bangla paraphrase dataset ensuring good quality in terms of semantics and diversity. Since generating datasets by manual intervention is time-consuming, we curate our BanglaParaphrase dataset through a pivoting (Zhao et al., 2008) approach, with additional filtering stages to ensure diversity and semantic fidelity. We further study the effects of dataset augmentation on a synthetic dataset using masked language modeling. Finally, we demonstrate the quality of our dataset by training baseline models and through comparative analysis with other Bangla paraphrase datasets and models. In summary:

- We present BanglaParaphrase, a synthetic Bangla paraphrase dataset ensuring both diversity and semantics.
- We introduce a novel filtering mechanism for dataset preparation and evaluation.

[3] https://en.wikipedia.org/wiki/Bengali_grammar
arXiv:2210.05109v1 [cs.CL] 11 Oct 2022
2 Related Work
Paraphrase generation datasets and models are
heavily dominated by high-resource languages
such as English. But for low-resource languages
such as Bangla, this domain is less explored. To
our knowledge, only (Kumar et al.,2022) described
the use of IndicBART (Dabre et al.,2021) to gen-
erate paraphrases using the sequence-to-sequence
approach for the Bangla language. One of the most
challenging barriers to paraphrasing research for
low-resource languages is the shortage of good-
quality datasets. Among recent work on low-
resource paraphrase datasets, (Kanerva et al.,2021)
introduced a comprehensive dataset for the Finnish
language. The OpusParcus dataset (Creutz,2018)
consists of paraphrases for six European languages.
For Indic languages such as Tamil, Hindi, Punjabi,
and Malayalam, Anand Kumar et al. (2016) intro-
duced a paraphrase detection dataset in a shared
task. Scherrer (2020) introduced a paraphrase
dataset for 73 languages, where there are only
about 1400 sentences in total for the Bangla lan-
guage, mainly consisting of simple sentences.
3 Paraphrase Dataset Generation and Curation
3.1 Synthetic Dataset Generation
We started by scraping high-quality representative sentences for the Bangla web domain from the RoarBangla website [4] and translated them from Bangla to English using the state-of-the-art translation model developed by Hasan et al. (2020), with 5 references. For each generated English sentence, 5 new Bangla translations were generated using beam search. Among these multiple generations, only those (original sentence, back-translated sentence) pairs were chosen as candidate data points where the LaBSE (Feng et al., 2022) similarity scores for both (original Bangla, back-translated Bangla) and (original Bangla, translated English) were greater than 0.7 [5]. After this process, we were left with more than 1.364M sentences with multiple references for each source.
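The candidate selection step above can be sketched as follows. This is a minimal illustration, assuming a `labse_sim` function that returns the LaBSE cosine similarity between two sentences; here it is stubbed with precomputed scores rather than a real encoder.

```python
# Sketch of the candidate-pair selection step: keep a (source, english,
# back-translation) triple only if both similarity checks pass.
THRESHOLD = 0.7  # LaBSE similarity cutoff used in the pipeline

def select_candidates(triples, labse_sim, threshold=THRESHOLD):
    """Return (source, back_translation) pairs where both the
    source/back-translation and source/english LaBSE similarities
    exceed the threshold."""
    pairs = []
    for src, en, bt in triples:
        if labse_sim(src, bt) > threshold and labse_sim(src, en) > threshold:
            pairs.append((src, bt))
    return pairs

# Illustrative stub: pretend similarities were precomputed.
scores = {("s1", "bt1"): 0.91, ("s1", "en1"): 0.88,
          ("s2", "bt2"): 0.55, ("s2", "en2"): 0.93}
sim = lambda a, b: scores[(a, b)]

kept = select_candidates([("s1", "en1", "bt1"), ("s2", "en2", "bt2")], sim)
# Only the first triple passes both similarity checks.
```

In the real pipeline the similarity function would come from a LaBSE encoder; the stub only demonstrates the two-sided thresholding logic.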
3.2 Novel Filtering Pipeline
As mentioned by Chen and Dolan (2011), paraphrases must ensure fluency, semantic similarity, and diversity. To that end, we use different metrics evaluating each of these aspects as filters, in a pipelined fashion.

[4] https://roar.media/bangla
[5] We chose 0.7 as the LaBSE semantic similarity threshold following Bhattacharjee et al. (2022a).
To ensure diversity, we chose PINC (Paraphrase In N-gram Changes) from among various diversity-measuring metrics (Chen and Dolan, 2011; Sun and Zhou, 2012), as it considers the lexical dissimilarity between the source and the candidates. We name this first filter the PINC Score Filter. To use this metric for filtering, we determined the optimum threshold value empirically by following a plot [6] of the data yield against the PINC score, indicating the amount of data having at least a certain PINC score. We chose the threshold value that maximizes the PINC score while retaining over 63.16% yield.
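The PINC score of Chen and Dolan (2011) averages, over n-gram orders 1..N, the fraction of candidate n-grams absent from the source, so a higher score means more lexical novelty. A minimal sketch, assuming whitespace tokenization for simplicity:

```python
# PINC (Paraphrase In N-gram Changes): mean over n = 1..N of the share
# of candidate n-grams that do not occur in the source sentence.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source, candidate, max_n=4):
    src, cand = source.split(), candidate.split()
    total, count = 0.0, 0
    for n in range(1, max_n + 1):
        c = ngrams(cand, n)
        if not c:            # candidate shorter than n tokens
            break
        s = ngrams(src, n)
        total += 1.0 - len(c & s) / len(c)
        count += 1
    return total / count if count else 0.0

# An identical pair scores 0; a fully disjoint pair scores 1.
```

A filter would then keep only pairs whose `pinc` value meets the chosen threshold (e.g. 0.76).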
Since contextualized token embeddings have been shown to be effective for paraphrase detection (Devlin et al., 2019), we use BERTScore (Zhang et al., 2019) to ensure semantic similarity between the source and candidates. After our PINC filter, we experimented with BERTScore, which uses the multilingual BERT model (Devlin et al., 2019) by default. We also experimented with BanglaBERT (Bhattacharjee et al., 2022a) embeddings and decided to use these for our semantic filter, since BanglaBERT is a monolingual model performing exceptionally well on Bangla NLU tasks. We selected the threshold similarly to the PINC filter by following the corresponding plot, and in all of our experiments, we used the F1 measure as the filtering metric. We name this second filter the BERTScore Filter. Through a human evaluation [7] of 300 randomly chosen samples, we deduced that pairs having BERTScore (with BanglaBERT embeddings) ≥ 0.92 were semantically sound and decided to use this as a starting point to figure out our desired threshold. We further validated our choice of parameters through model-generated paraphrases, with the models trained on filtered datasets using different parameters (detailed in Section 4.1).
Initially training on the resultant dataset from the previous two filters, we noticed that some of the predicted paraphrases were growing unnecessarily long by repeating parts during inference. As repeated N-grams within the corpus most likely were the culprit behind this, we attempted to ameliorate the issue by introducing our third filter, the N-gram Repetition Filter, where we tested the target side of our dataset for any N-gram repeats with values of N from 1 to 4. We found fewer than 200 sentences on the target side with a 2-gram repetition and decided to use N = 2 for this filter. Additionally, we removed sentences without terminating punctuation from the corpus to ensure a noise-free dataset before proceeding with the training. We term this last filter the Punctuation Filter. The filters, along with their significance and parameters, are summarised in Table 1.

[6] More details are presented in the Appendix.
[7] More details are presented in the ethical considerations section.

Filter Name       | Significance                                                  | Filtering Parameters
PINC              | Ensure diversity in generated paraphrases                     | 0.65, 0.76, 0.80
BERTScore         | Preserve semantic coherence with the source                   | lower 0.91-0.93, upper 0.98
N-gram repetition | Reduce n-gram repetition during inference                     | 2-4 grams
Punctuation       | Prevent generating non-terminating sentences during inference | N/A

Table 1: Filtering scheme
3.3 Evaluation Metrics
Following Niu et al. (2021), we used multiple metrics to evaluate several criteria of our generated paraphrases. For quality, we used sacreBLEU (Post, 2018) and ROUGE-L (Lin, 2004), with the multilingual ROUGE scoring implementation introduced by Hasan et al. (2021), which supports Bangla stemming and tokenization. For syntactic diversity, we used the PINC score, as we did for filtering. For measuring semantic correctness, we used the BERTScore F1 measure with BanglaBERT embeddings. Additionally, we used a modified version of a hybrid score named BERT-iBLEU (Niu et al., 2021), where we also used BanglaBERT embeddings for the BERTScore part. This hybrid score measures semantic similarity while penalizing syntactic similarity to ensure the diversity of the paraphrases. More details about the evaluation scores can be found in the Appendix.
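The intuition behind such a hybrid score can be sketched as a weighted harmonic mean of a semantic score and one minus self-BLEU. Note this is an illustration of the general idea, not the exact formula or weighting of Niu et al. (2021); the weight `beta` and the stubbed inputs are assumptions.

```python
# Hybrid score sketch: reward meaning preservation (bertscore_f1) while
# penalizing surface overlap with the source (self_bleu).
def hybrid_score(bertscore_f1, self_bleu, beta=4.0):
    """Weighted harmonic mean of bertscore_f1 and (1 - self_bleu)."""
    ibleu = 1.0 - self_bleu
    if bertscore_f1 == 0 or ibleu == 0:
        return 0.0
    return (beta + 1.0) / (beta / bertscore_f1 + 1.0 / ibleu)

# A diverse paraphrase (low self-BLEU) outscores a near-copy even if the
# near-copy has a slightly higher semantic score.
diverse = hybrid_score(0.93, 0.20)
copyish = hybrid_score(0.99, 0.95)
```

The harmonic mean drives the score towards the weaker of the two components, so a paraphrase must be both semantically faithful and lexically novel to score well.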
3.4 Diverse Dataset Generation by Masked Language Modeling
We investigated whether the dataset could be further augmented by replacing tokens of particular parts of speech with other synonymous tokens. To that end, we fine-tuned BanglaBERT (Bhattacharjee et al., 2022a) for POS tagging with a token classification head on the dataset of Sankaran et al. (2008), which contains 30 POS tags.

The idea of augmenting the dataset with masking follows the work of Mohiuddin et al. (2021). We first tagged the parts of speech on the source side of our synthetic dataset and then chose 7 Bangla parts of speech to maximize the diversification of syntactic content. We masked the corresponding tokens and filled them through MLM sequentially. We used both XLM-RoBERTa (Conneau et al., 2020) and BanglaBERT to perform MLM out of the box. Of these two, BanglaBERT performed mask-filling with less noise, so we selected the results of this model. To ensure consistency with our initial dataset, we also filtered these outputs with the pipeline outlined in Section 3.2, choosing a PINC score threshold of 0.7 [8] and (0.92 - 0.98) (lower and upper limits) for the BERTScore threshold, obtaining about 70K sentences. We used this dataset, together with our initially filtered one, for training models in a separate experiment. [9]
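The sequential mask-and-fill procedure can be sketched as follows. This is a toy illustration: `fill_mask` stands in for an MLM such as BanglaBERT, the target POS set is illustrative (the paper uses 7 Bangla tags), and the English tokens are placeholders.

```python
# POS-targeted augmentation sketch: tokens whose POS tag is in the target
# set are masked and filled one at a time, so later fills see earlier
# replacements in their context.
TARGET_POS = {"ADJ", "NOUN"}  # illustrative subset

def augment(tokens, tags, fill_mask, target_pos=TARGET_POS):
    """Sequentially mask each target-POS token and fill it from the
    surrounding (partially updated) context."""
    out = list(tokens)
    for i, tag in enumerate(tags):
        if tag in target_pos:
            masked = out[:i] + ["[MASK]"] + out[i + 1:]
            out[i] = fill_mask(masked, i)
    return out

# Toy fill function returning a fixed synonym per position; a real run
# would query a fill-mask model instead.
synonyms = {1: "large", 2: "home"}
fill = lambda toks, i: synonyms[i]

result = augment(["a", "big", "house"], ["DET", "ADJ", "NOUN"], fill)
```

The augmented sentences would then pass through the same PINC and BERTScore filters before being added to the training data.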
4 Experiments and Results
4.1 Experimental Setup
We first filtered the synthetic dataset with our 4-stage filtering mechanism and then fine-tuned the mT5-small model (Xue et al., 2021), keeping the default learning rate of 0.001 for 10 epochs. In each experiment, we changed the dataset while keeping the model fixed, as our objective was to find thresholds for the first two filters for which the metrics on both the validation and test sets of the individual dataset gave promising results. We conducted several experiments by varying the PINC score over (0.65, 0.76, 0.80) and the BERTScore lower limit over (0.91, 0.92, 0.93) with an upper limit of 0.98, following the respective plots.

The evaluation metrics for each experiment were tracked, and we examined how the thresholds affected the metrics on the test set of the dataset we were experimenting with. We finally chose an effective threshold of 0.76 for the PINC score and 0.92 - 0.98 (lower and upper limits) for BERTScore, such that it provides a good balance between good automated evaluation scores and data amount, and

[8] We lowered the threshold since this augmentation does not diversify the structure of the sentences.
[9] Further details of the whole experiment can be found in the Appendix.