
2 Related Work
Paraphrase generation datasets and models are heavily dominated by high-resource languages such as English, while for low-resource languages such as Bangla the field remains largely unexplored. To our knowledge, only Kumar et al. (2022) described the use of IndicBART (Dabre et al., 2021) to generate paraphrases for the Bangla language with a sequence-to-sequence approach. One of the most challenging barriers to paraphrasing research for low-resource languages is the shortage of good-quality datasets. Among recent work on low-resource paraphrase datasets, Kanerva et al. (2021) introduced a comprehensive dataset for Finnish. The OpusParcus dataset (Creutz, 2018) consists of paraphrases for six European languages. For Indic languages such as Tamil, Hindi, Punjabi, and Malayalam, Anand Kumar et al. (2016) introduced a paraphrase detection dataset in a shared task. Scherrer (2020) introduced a paraphrase dataset covering 73 languages, but it contains only about 1,400 sentences in total for Bangla, mainly consisting of simple sentences.
3 Paraphrase Dataset Generation and Curation
3.1 Synthetic Dataset Generation
We started by scraping high-quality representative sentences for the Bangla web domain from the RoarBangla website (https://roar.media/bangla) and translated them from Bangla to English using the state-of-the-art translation model of Hasan et al. (2020) with 5 references. For the generated English sentences, 5 new Bangla translations were generated using beam search. Among these multiple generations, only those (original sentence, back-translated sentence) pairs were chosen as candidate datapoints where the LaBSE (Feng et al., 2022) similarity scores of both the (original Bangla, back-translated Bangla) and the (original Bangla, translated English) pairs were greater than 0.7, a threshold we chose following Bhattacharjee et al. (2022a). After this process, there were more than 1.364M sentences with multiple references for each source.
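The core of this generation step is a bidirectional semantic-similarity gate on each candidate pair. The following is a minimal sketch of such a gate using the sentence-transformers release of LaBSE; the function name and batching details are our own illustrative assumptions, not the authors' released pipeline.

# Illustrative sketch of the LaBSE similarity gate described above.
from sentence_transformers import SentenceTransformer, util

labse = SentenceTransformer("sentence-transformers/LaBSE")
THRESHOLD = 0.7  # chosen following Bhattacharjee et al. (2022a)

def keep_pair(original_bn: str, back_translated_bn: str, english: str) -> bool:
    """Keep a candidate only if BOTH similarity scores exceed the threshold:
    (original Bangla, back-translated Bangla) and (original Bangla, English)."""
    emb = labse.encode([original_bn, back_translated_bn, english],
                       convert_to_tensor=True, normalize_embeddings=True)
    return (util.cos_sim(emb[0], emb[1]).item() > THRESHOLD
            and util.cos_sim(emb[0], emb[2]).item() > THRESHOLD)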
3.2 Novel Filtering Pipeline
As mentioned in Chen and Dolan (2011), paraphrases must ensure fluency, semantic similarity, and diversity. To that end, we make use of different metrics evaluating each of these aspects as filters, in a pipelined fashion.
To ensure diversity, we chose PINC (Paraphrase In N-gram Changes) from among various diversity-measuring metrics (Chen and Dolan, 2011; Sun and Zhou, 2012), as it considers the lexical dissimilarity between the source and the candidates. We name this first filter the PINC Score Filter. To use this metric for filtering, we determined the optimal threshold value empirically by following a plot of the data yield against the PINC score (more details are presented in the Appendix), indicating the amount of data having at least a certain PINC score. We chose the threshold value that maximizes the PINC score with over 63.16% yield.
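Concretely, PINC averages, over n-gram orders 1 through N, the fraction of candidate n-grams that do not appear in the source, so a higher score means greater lexical novelty. A minimal self-contained sketch of the metric (our own illustration; whitespace tokenization and max_n = 4 are assumptions):

def ngrams(tokens, n):
    """Set of n-grams (as tuples) of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source: str, candidate: str, max_n: int = 4) -> float:
    """PINC (Chen and Dolan, 2011): mean over n = 1..max_n of the
    fraction of candidate n-grams NOT found in the source."""
    src, cand = source.split(), candidate.split()
    scores = []
    for n in range(1, max_n + 1):
        cand_grams = ngrams(cand, n)
        if cand_grams:
            overlap = len(cand_grams & ngrams(src, n))
            scores.append(1.0 - overlap / len(cand_grams))
    return sum(scores) / len(scores) if scores else 0.0

Filtering then reduces to keeping pairs whose pinc(source, candidate) is at or above the empirically chosen threshold.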
Since contextualized token embeddings have been shown to be effective for paraphrase detection (Devlin et al., 2019), we use BERTScore (Zhang et al., 2019) to ensure semantic similarity between the source and candidates. After our PINC filter, we experimented with BERTScore, which uses the multilingual BERT model (Devlin et al., 2019) by default. We also experimented with BanglaBERT (Bhattacharjee et al., 2022a) embeddings and decided to use these as our semantic filter, since BanglaBERT is a monolingual model performing exceptionally well on Bangla NLU tasks. We selected the threshold as for the PINC filter, by following the corresponding plot, and in all of our experiments we used the F1 measure as the filtering metric. We name this second filter the BERTScore Filter. Through a human evaluation of 300 randomly chosen samples (more details are presented in the ethical considerations section), we deduced that pairs having BERTScore (with BanglaBERT embeddings) ≥ 0.92 were semantically sound, and used this as a starting point to determine our desired threshold. We further validated our choice of parameters through model-generated paraphrases, with models trained on datasets filtered using different parameters (detailed in Section 4.1).
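Swapping the BERTScore backbone is a small change in the bert-score package; the sketch below is our own illustration, where the Hugging Face checkpoint id and layer index are assumptions rather than the paper's published configuration.

from bert_score import score  # pip install bert-score

sources = ["..."]     # original Bangla sentences (placeholders)
candidates = ["..."]  # PINC-filtered back-translations

# model_type and num_layers here are illustrative assumptions.
P, R, F1 = score(
    candidates, sources,
    model_type="csebuetnlp/banglabert",  # assumed checkpoint id
    num_layers=12,
)
keep = [f1.item() >= 0.92 for f1 in F1]  # F1 threshold from human evaluation

Note that F1, rather than precision or recall, is the filtering metric throughout, matching the description above.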
When we initially trained on the dataset resulting from the previous two filters, we noticed that some of the predicted paraphrases grew unnecessarily long by repeating parts during inference. As repeated N-grams within the corpus were the most likely culprit, we attempted to ameliorate the issue by introducing our third filter, the N-gram Repetition Filter, where we tested the target side of our dataset for repeated N-grams.
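A minimal sketch of such a repetition check (our own illustration; the n-gram order n = 3 and the "occurs more than once" criterion are assumptions):

def has_repeated_ngram(sentence: str, n: int = 3) -> bool:
    """True if any n-gram occurs more than once in the sentence; pairs
    whose target side trips this check are candidates for removal."""
    tokens = sentence.split()
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False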