Investigating the detection of Tortured Phrases in Scientific Literature
Puthineath Lay1, Martin Lentschat1, and Cyril Labbé1
1Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France
puthineath.lay@cadt.edu.kh, martin.lentschat@univ-grenoble-alpes.fr
Abstract
With the help of online tools, unscrupulous authors can now generate a pseudo-scientific article and attempt to publish it. Some of these tools work by replacing or paraphrasing existing texts to produce new content, but they have a tendency to generate nonsensical expressions. A recent study introduced the concept of the "tortured phrase", an unexpected odd phrase that appears in place of a fixed expression, e.g. counterfeit consciousness instead of artificial intelligence. The present study investigates how tortured phrases that are not yet listed can be detected automatically. We conducted several experiments, including non-neural binary classification, neural binary classification, and cosine similarity comparison of the phrase tokens, yielding noticeable results.
1 Introduction
Scientific texts generated by computer programs can be meaningless, and fake generated papers are served and sold by various publishers, at an estimated rate of 4.29 documents per million (Cabanac and Labbé, 2021). But generated texts can also be meaningful: given a thousand articles as input, new books are now produced (e.g. Beta Writer, 2019). Despite the ability of text generators to produce counterfeit publications, meaningless generated papers can be easily spotted by both machines and humans (Cabanac et al., 2021). Texts produced by neural language models are more difficult to spot (Hutson et al., 2021). These neural language models can produce paraphrased texts that are closer to human-written texts (Brown et al., 2020), and machine-paraphrased texts are therefore harder to differentiate from human-written texts.
Online tools such as Spinbot and SpinnerChief are used to paraphrase texts. However, the capacity of paraphrasing software to assist a writer can be harmful to the scientific literature. Cabanac et al. (2021) screened recent publications (e.g. in the journal Microprocessors and Microsystems) and discovered over 500 meaningless phrases in those scientific papers. They called these "tortured phrases": unexpected odd phrases replacing a lexicalised expression, such as counterfeit consciousness instead of artificial intelligence (i.e., the expected phrase). The database of tortured phrases, and of the articles that contain them, has since been expanded to over 9000 publications in domains such as Computer Science, Biology, and Medicine.
In this paper, we investigate strategies to automatically detect new (i.e. unlisted) tortured phrases. Focusing solely on tortured phrase detection, and not on paraphrased text in general, we use recent machine learning techniques and state-of-the-art language models. Our methods were trained on a corpus of 141 known tortured phrases, taking their sentences as contexts, and aim at detecting never-seen-before tortured phrases. All code and corpora used are available online.
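The cosine similarity comparison mentioned above can be illustrated with a minimal sketch. The token vectors below are hand-made toy placeholders, not real embeddings: in practice they would come from a pretrained language model, and the pairing logic shown here is illustrative rather than the actual detection pipeline.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors standing in for contextual token embeddings
# (hypothetical values, for illustration only).
emb = {
    "artificial":    [0.9, 0.1, 0.3],
    "intelligence":  [0.8, 0.2, 0.4],
    "counterfeit":   [0.1, 0.9, 0.2],
}

# Tokens of the expected phrase tend to be close to each other,
# while a tortured substitute sits further away in embedding space.
expected = cosine_similarity(emb["artificial"], emb["intelligence"])
tortured = cosine_similarity(emb["counterfeit"], emb["intelligence"])
print(expected > tortured)  # a lower similarity can flag a candidate phrase
```

A real setup would compare phrase tokens in context and apply a tuned threshold; the point here is only the shape of the comparison.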
2 Related Work
Up to now, no dataset has been built for the automatic detection of tortured phrases. In Cabanac et al. (2021), authors and contributors collected a set of tortured phrases and their expected phrases, which we will use as a dataset. Wahle et al. (2022) used Spinbot and SpinnerChief to paraphrase original data from several sources, such as arXiv test sets, graduation theses, and Wikipedia articles. Their study aims at detecting whether a paragraph is machine-paraphrased or not. The authors tested classic machine learning approaches and neural language models based on the Transformer architecture (Vaswani et al., 2017), such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2019), Longformer (Beltagy et al., 2020), and others. They showed that such approaches can complement text-matching software,
arXiv:2210.13024v1 [cs.CL] 24 Oct 2022