Investigating the detection of Tortured Phrases in Scientific Literature
Puthineath Lay1, Martin Lentschat1, and Cyril Labbé1
1Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France
puthineath.lay@cadt.edu.kh, martin.lentschat@univ-grenoble-alpes.fr
Abstract
With the help of online tools, unscrupulous authors can now generate a pseudo-scientific article and attempt to publish it. Some of these tools work by replacing or paraphrasing existing texts to produce new content, but they have a tendency to generate nonsensical expressions. A recent study introduced the concept of the "tortured phrase", an unexpected odd phrase that appears in place of a fixed expression, e.g. counterfeit consciousness instead of artificial intelligence. The present study investigates how tortured phrases that are not yet listed can be detected automatically. We conducted several experiments, including non-neural binary classification, neural binary classification, and cosine similarity comparison of the phrase tokens, yielding noticeable results.
1 Introduction
Scientific texts generated by computer programs can be meaningless, and fake generated papers are served and sold by various publishers, at an estimated rate of 4.29 documents per million (Cabanac and Labbé, 2021). But generated texts can also be meaningful: given a thousand articles as input, new books are now produced (e.g. Beta Writer, 2019). Despite the ability of text generators to produce counterfeit publications, meaningless generated papers can be easily spotted by both machines and humans (Cabanac et al., 2021). Texts produced by neural language models are more difficult to spot (Hutson et al., 2021). These neural language models can produce paraphrased texts that are closer to human-written texts (Brown et al., 2020), and machine-paraphrased texts are therefore harder to differentiate from human-written texts.
Online tools such as Spinbot and SpinnerChief are used to paraphrase texts. However, the capacity of paraphrasing software to assist a writer can be harmful to the scientific literature. Cabanac et al. (2021) screened recent publications (e.g. in the journal Microprocessors and Microsystems) and discovered over 500 meaningless phrases in those scientific papers. They called these "tortured phrases": unexpected odd phrases replacing a lexicalised expression, such as counterfeit consciousness instead of artificial intelligence (i.e., the expected phrase). The database of tortured phrases, and of the articles that contain them, has since been expanded to over 9000 publications in domains such as Computer Science, Biology, and Medicine.
In this paper, we investigate strategies to automatically detect new (i.e. unlisted) tortured phrases. Focusing solely on tortured phrase detection, and not on paraphrased text in general, we use recent machine learning techniques and state-of-the-art language models. Our methods were trained on a corpus of 141 known tortured phrases, taking their sentences as contexts, and aim at detecting never-seen-before tortured phrases. All code and corpora used are available online.
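The cosine similarity comparison mentioned above can be illustrated with a minimal sketch. The token vectors below are hand-made toy placeholders, not real embeddings: in practice they would come from a pretrained language model, and the pairing logic shown here is illustrative rather than the actual detection pipeline.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors standing in for contextual token embeddings
# (hypothetical values, for illustration only).
emb = {
    "artificial":    [0.9, 0.1, 0.3],
    "intelligence":  [0.8, 0.2, 0.4],
    "counterfeit":   [0.1, 0.9, 0.2],
}

# Tokens of the expected phrase tend to be close to each other,
# while a tortured substitute sits further away in embedding space.
expected = cosine_similarity(emb["artificial"], emb["intelligence"])
tortured = cosine_similarity(emb["counterfeit"], emb["intelligence"])
print(expected > tortured)  # a lower similarity can flag a candidate phrase
```

A real setup would compare phrase tokens in context and apply a tuned threshold; the point here is only the shape of the comparison.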
2 Related Work
Up to now, no dataset has been built for the automatic detection of tortured phrases. In Cabanac et al. (2021), authors and contributors collected a set of tortured phrases and their expected phrases, which we will use as a dataset. Wahle et al. (2022) used Spinbot and SpinnerChief to paraphrase original data from several sources, such as arXiv test sets, graduation theses, and Wikipedia articles. Their study aims at detecting whether a paragraph is machine-paraphrased or not. The authors tested classic machine learning approaches and neural language models based on the Transformer architecture (Vaswani et al., 2017), such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2019), Longformer (Beltagy et al., 2020), and others. They showed that such approaches can complement text-matching software,
arXiv:2210.13024v1 [cs.CL] 24 Oct 2022