Investigating the detection of Tortured Phrases in Scientific Literature
Puthineath Lay1, Martin Lentschat1, and Cyril Labbé1
1Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France
puthineath.lay@cadt.edu.kh, martin.lentschat@univ-grenoble-alpes.fr
Abstract
With the help of online tools, unscrupulous authors can today generate a pseudo-scientific article and attempt to publish it. Some of these tools work by replacing or paraphrasing existing texts to produce new content, but they tend to generate nonsensical expressions. A recent study introduced the concept of "tortured phrase", an unexpected odd phrase that appears instead of the fixed expression, e.g. counterfeit consciousness instead of artificial intelligence. The present study investigates how tortured phrases that are not yet listed can be detected automatically. We conducted several experiments, including non-neural binary classification, neural binary classification and cosine similarity comparison of the phrase tokens, yielding noticeable results.
1 Introduction
Scientific texts generated by computer programs can be meaningless, and fake generated papers are served and sold by various publishers, at an estimated rate of 4.29 documents per million reports (Cabanac and Labbé, 2021). But generated texts can also be meaningful: taking a thousand articles as input, new books are now produced (e.g. Beta Writer, 2019). Despite the ability of text generators to produce counterfeit publications, meaningless generated papers can be easily spotted by both machines and humans (Cabanac et al., 2021). Texts produced by neural language models are more difficult to spot (Hutson et al., 2021). These neural language models can produce paraphrased texts that are closer to human-written texts (Brown et al., 2020), and therefore machine-paraphrased texts are harder to differentiate from human-written texts.
Online tools such as Spinbot and SpinnerChief are used to paraphrase texts. However, the capacity of paraphrasing software to assist a writer can be harmful to the scientific literature. Cabanac et al. (2021) screened recent publications (e.g. in the journal Microprocessors and Microsystems) and discovered over 500 meaningless phrases in those scientific papers. They called them "tortured phrases": unexpected odd phrases replacing the lexicalised expression, such as counterfeit consciousness instead of artificial intelligence (i.e., the expected phrase). The database of tortured phrases and of the articles that contain them has since been expanded to over 9000 publications in different domains such as Computer Science, Biology or Medicine.
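To make the idea of a listed tortured phrase concrete, the following is a minimal sketch of screening a text against a list of known phrases. It is not the screening procedure of Cabanac et al. (2021): the dictionary layout and the function name `find_listed_tortured_phrases` are assumptions made for illustration, and the only entry is the example pair given above.

```python
import re

# Hypothetical lookup table: tortured phrase -> expected phrase.
# The single entry is the example from the text; a real list would be far longer.
KNOWN_TORTURED = {
    "counterfeit consciousness": "artificial intelligence",
}


def find_listed_tortured_phrases(text, known=KNOWN_TORTURED):
    """Return (tortured phrase, expected phrase, character offset) for each match."""
    hits = []
    lowered = text.lower()
    for tortured, expected in known.items():
        # Word boundaries avoid matching inside longer words.
        pattern = r"\b" + re.escape(tortured) + r"\b"
        for match in re.finditer(pattern, lowered):
            hits.append((tortured, expected, match.start()))
    # Report matches in order of appearance.
    return sorted(hits, key=lambda hit: hit[2])


sample = "Recent advances in counterfeit consciousness rely on large datasets."
hits = find_listed_tortured_phrases(sample)
print(hits)
```

A lookup of this kind can only flag phrases that are already in the list, which is why detecting unlisted tortured phrases calls for a different approach.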
In this paper, we investigate strategies to automatically detect new (i.e. unlisted) tortured phrases. Focusing solely on tortured phrase detection, and not on paraphrased text in general, we use recent machine learning techniques and state-of-the-art language models. Our methods were trained on a corpus composed of 141 known tortured phrases, taking their sentences as contexts, and aim at detecting never-seen-before tortured phrases. All code and corpus used are available online.
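As an illustration of what a cosine similarity comparison of phrase tokens could look like, the sketch below compares a candidate phrase with an expected phrase token by token. This is one possible reading only, since the text above does not specify the procedure. The three-dimensional vectors are toy values invented for the example, and the names `TOY_EMBEDDINGS` and `tokenwise_similarity` are hypothetical; in practice the vectors would come from a language model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy embeddings, invented for illustration (assumption, not model output).
TOY_EMBEDDINGS = {
    "artificial": [0.9, 0.1, 0.0],
    "counterfeit": [0.8, 0.3, 0.1],
    "intelligence": [0.1, 0.9, 0.2],
    "consciousness": [0.2, 0.8, 0.3],
}


def tokenwise_similarity(candidate, expected, embeddings=TOY_EMBEDDINGS):
    """Compare two phrases of equal length, one token pair at a time."""
    cand_tokens = candidate.lower().split()
    exp_tokens = expected.lower().split()
    if len(cand_tokens) != len(exp_tokens):
        raise ValueError("phrases must have the same number of tokens")
    return [
        cosine_similarity(embeddings[c], embeddings[e])
        for c, e in zip(cand_tokens, exp_tokens)
    ]


scores = tokenwise_similarity("counterfeit consciousness", "artificial intelligence")
print(scores)
```

With these toy vectors each token of the candidate is close to the corresponding token of the expected phrase, which matches the intuition that a tortured phrase arises from synonym substitution. Whether real embeddings behave this way is an empirical question.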
2 Related Works
Up to now, no dataset has been built for the automatic detection of tortured phrases. In Cabanac et al. (2021), authors and contributors collected a set of tortured phrases and their expected phrases, which we will use as our dataset. Wahle et al. (2022) used Spinbot and SpinnerChief to paraphrase original data from several sources such as arXiv test sets, graduation theses, and Wikipedia articles. Their study aims at detecting whether a paragraph is machine-paraphrased or not. The authors tested classic machine learning approaches and neural language models based on the Transformer architecture (Vaswani et al., 2017), such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2019), Longformer (Beltagy et al., 2020), and others. They showed that such approaches can complement text-matching software,
arXiv:2210.13024v1 [cs.CL] 24 Oct 2022