
PatternRank: Leveraging Pretrained Language Models and Part of Speech for Unsupervised Keyphrase Extraction

Tim Schopf, Simon Klimek, and Florian Matthes

Department of Computer Science, Technical University of Munich, Boltzmannstrasse 3, Garching, Germany

{tim.schopf, simon.klimek, matthes}@tum.de

Keywords: Natural Language Processing, Keyphrase Extraction, Pretrained Language Models, Part of Speech.

Abstract: Keyphrase extraction is the process of automatically selecting a small set of most relevant phrases from a given text. Supervised keyphrase extraction approaches need large amounts of labeled training data and perform poorly outside the domain of the training data (Bennani-Smires et al., 2018). In this paper, we present PatternRank, which leverages pretrained language models and part-of-speech for unsupervised keyphrase extraction from single documents. Our experiments show PatternRank achieves higher precision, recall and F1-scores than previous state-of-the-art approaches. In addition, we present the KeyphraseVectorizers package (https://github.com/TimSchopf/KeyphraseVectorizers), which allows easy modification of part-of-speech patterns for candidate keyphrase selection, and hence adaptation of our approach to any domain.

1 INTRODUCTION

To quickly get an overview of the content of a text, we can use keyphrases that concisely reflect its semantic context. Keyphrases describe the most essential aspect of a text. Unlike simple keywords, keyphrases are not limited to single words but can consist of several compound words. Therefore, keyphrases provide more information about the content of a text than simple keywords. Supervised keyphrase extraction approaches usually achieve higher accuracy than unsupervised ones (Kim et al., 2012; Caragea et al., 2014; Meng et al., 2017). However, supervised approaches require manually labeled training data, which often causes subjectivity issues as well as a significant investment of time and money (Papagiannopoulou and Tsoumakas, 2019). In contrast, unsupervised keyphrase extraction approaches do not have these issues and are, moreover, mostly domain-independent.

Keyphrases and their vector representations are very versatile and can be used in a variety of different Natural Language Processing (NLP) downstream tasks (Braun et al., 2021; Schopf et al., 2022). For example, they can be used as features or input for document clustering and classification (Hulth and Megyesi, 2006; Schopf et al., 2021), they can support extractive summarization (Zhang et al., 2004), or they can be used for query expansion (Song et al., 2006). Keyphrase extraction is particularly relevant for the scholarly domain as it helps to recommend articles, highlight missing citations to authors, identify potential reviewers for submissions, analyze research trends over time, and can be used in many different search scenarios (Augenstein et al., 2017).

In this paper, we present PatternRank, an unsupervised approach for keyphrase extraction based on Pretrained Language Models (PLMs) and Part of Speech (PoS). Since keyphrase extraction is especially important for the scholarly domain, we evaluate PatternRank on a specific dataset from this area. Our approach does not rely on labeled data and therefore can be easily adapted to a variety of different domains. Moreover, PatternRank does not require the input document to be part of a larger corpus, allowing the keyphrase extraction to be applied to individual short texts such as publication abstracts. Figure 1 illustrates the general keyphrase extraction approach of PatternRank.

2 RELATED WORK

arXiv:2210.05245v2 [cs.CL] 12 Oct 2022

[Figure 1 diagram: Text Document, Candidate Selection (Part of Speech (PoS) Pattern), Candidate Keyphrases, Candidate Keyphrase Ranking (Pretrained Language Model (PLM)), Ranked Candidate Keyphrases, Output (Top-N Extracted Keyphrases)]

Figure 1: PatternRank approach for unsupervised keyphrase extraction. A single text document is used as input for an initial filtering step where candidate keyphrases are selected which match a defined PoS pattern. Subsequently, the candidate keyphrases are ranked by a PLM based on their semantic similarity to the input text document. Finally, the top-N keyphrases are extracted as a concise reflection of the input text document.

Most popular unsupervised keyphrase extraction approaches can be characterized as either statistics-based, graph-based, or embedding-based methods, while Tf-Idf is a common baseline used for evaluation (Papagiannopoulou and Tsoumakas, 2019).

YAKE uses a set of different statistical metrics, including word casing, word position, word frequency, and more, to extract keyphrases from text (Campos et al., 2020). TextRank uses PoS filters to extract noun phrase candidates that are added to a graph as nodes, while adding an edge between nodes if the words co-occur within a defined window (Mihalcea and Tarau, 2004). Finally, PageRank (Page et al., 1999) is applied to extract keyphrases. SingleRank expands the TextRank approach by adding weights to edges based on word co-occurrences (Wan and Xiao, 2008). RAKE generates a word co-occurrence graph and assigns scores based on word frequency, word degree, or the ratio of degree and frequency for keyphrase extraction (Rose et al., 2010). Furthermore, Knowledge Graphs can be used to incorporate semantics for keyphrase extraction (Shi et al., 2017). EmbedRank leverages Doc2Vec (Le and Mikolov, 2014) and Sent2Vec (Pagliardini et al., 2018) sentence embeddings to rank candidate keyphrases for extraction (Bennani-Smires et al., 2018). More recently, a PLM-based approach was introduced that uses BERT (Devlin et al., 2019) for self-labeling of keyphrases and subsequent use of the generated labels in an LSTM classifier (Sharma and Li, 2019).
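To make the degree-to-frequency scoring mentioned for RAKE more concrete, the following is a minimal sketch, not the reference implementation. It assumes the text has already been split into candidate phrases (lists of lowercase words); the function name `rake_scores` and the example phrases are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def rake_scores(phrases):
    """Score phrases by summing each word's degree/frequency ratio.

    `phrases` is a list of word lists. Candidate splitting (e.g. on stop
    words) is assumed to have happened beforehand and is not shown here.
    """
    freq = defaultdict(int)    # how often a word occurs overall
    degree = defaultdict(int)  # co-occurrences of a word, counting itself
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {word: degree[word] / freq[word] for word in freq}
    # A phrase score is the sum of the scores of its member words.
    return {" ".join(p): sum(word_score[w] for w in p) for p in phrases}


# Hypothetical, pre-split candidate phrases.
example_phrases = [
    ["keyphrase", "extraction"],
    ["unsupervised", "keyphrase", "extraction"],
    ["text"],
]
example_scores = rake_scores(example_phrases)
```

In this sketch, words that appear mostly inside longer phrases receive a higher ratio, so longer multi-word candidates tend to be ranked above isolated words.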

3 KEYPHRASE EXTRACTION APPROACH

Figure 1 illustrates the general keyphrase extraction process of our PatternRank approach. The input consists of a single text document, which is first word-tokenized. The word tokens are then tagged with PoS tags. Tokens whose tags match a previously defined PoS pattern are selected as candidate keyphrases. Then, the candidate keyphrases are fed into a PLM to rank them based on their similarity to the input text document. The PLM embeds the entire text document as well as all candidate keyphrases as semantic vector representations. Subsequently, the cosine similarities between the document representation and the candidate keyphrase representations are computed, and the candidate keyphrases are ranked in descending order based on the computed similarity scores. Finally, the top-N ranked keyphrases, which are most representative of the input document, are extracted.
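The ranking step described above can be sketched as follows. This is only an illustration under stated assumptions: the `embed` function is a toy bag-of-words stand-in for a PLM encoder, not the model used by PatternRank, and the names `embed`, `cosine_similarity`, and `rank_candidates` are hypothetical. Only the cosine-similarity ranking and top-N selection mirror the description.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a PLM encoder: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(document, candidates, top_n=3):
    """Rank candidates by similarity to the document and keep the top N."""
    doc_vec = embed(document)
    scored = [(c, cosine_similarity(doc_vec, embed(c))) for c in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # descending order
    return scored[:top_n]
```

In practice, one would swap `embed` for a dense sentence encoder; the ranking logic stays the same because it depends only on the similarity scores.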

3.1 Candidate Selection with Part of Speech

In previous work, simple noun phrases consisting of zero or more adjectives followed by one or more nouns were used for keyphrase extraction (Mihalcea and Tarau, 2004; Wan and Xiao, 2008; Bennani-Smires et al., 2018). However, we define a more complex PoS pattern to extract candidate keyphrases from the input text document. In our approach, the tags of the word tokens have to match the following PoS pattern in order for the tokens to be considered as candidate keyphrases:

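\[
\left(\left(\{.*\}\{\mathrm{HYPH}\}\{.*\}\right)\{\mathrm{NOUN}\}*\right) \;\big|\; \left(\left(\{\mathrm{VBG}\}|\{\mathrm{VBN}\}\right)?\,\{\mathrm{ADJ}\}*\,\{\mathrm{NOUN}\}+\right) \tag{1}
\]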

The PoS pattern quantifiers correspond to the regular expression syntax. Therefore, we can translate the PoS pattern as: arbitrary parts-of-speech separated by a hyphen, followed by zero or more nouns, OR zero or one verb (gerund or present or past participle), followed by zero or more adjectives, followed by one or more nouns.
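As an illustration of how such a pattern can be matched, the sketch below encodes each token's tag as `<TAG>` and runs an ordinary regular expression over the resulting tag string. It is not the KeyphraseVectorizers implementation. The input is assumed to be already tokenized and tagged, each token is assumed to carry a single label from a mixed tag set (NOUN, ADJ, VBG, VBN, HYPH, ...), and the hand-assigned tags, `TAG_PATTERN`, and `select_candidates` are hypothetical names for this example.

```python
import re

# Hypothetical, hand-tagged input: (token, tag) pairs. A real pipeline would
# obtain the tags from a PoS tagger; the labels here are assumptions.
TAGGED = [
    ("We", "PRON"), ("present", "VERB"), ("a", "DET"),
    ("pretrained", "VBN"), ("language", "NOUN"), ("model", "NOUN"),
    ("for", "ADP"), ("unsupervised", "ADJ"), ("keyphrase", "NOUN"),
    ("extraction", "NOUN"), (".", "PUNCT"),
]

# Pattern (1) rewritten over "<TAG>" strings: any tag, a hyphen, any tag,
# then zero or more nouns; OR an optional VBG/VBN, zero or more adjectives,
# and one or more nouns.
TAG_PATTERN = re.compile(
    r"(?:<[^>]+><HYPH><[^>]+>(?:<NOUN>)*)"
    r"|(?:(?:<VBG>|<VBN>)?(?:<ADJ>)*(?:<NOUN>)+)"
)


def select_candidates(tagged):
    """Return token sequences whose tag sequence matches TAG_PATTERN."""
    tag_string = ""
    starts, ends = {}, {}  # character offsets -> token index
    for index, (_, tag) in enumerate(tagged):
        starts[len(tag_string)] = index
        tag_string += f"<{tag}>"
        ends[len(tag_string)] = index
    candidates = []
    for match in TAG_PATTERN.finditer(tag_string):
        first, last = starts[match.start()], ends[match.end()]
        candidates.append(" ".join(tok for tok, _ in tagged[first:last + 1]))
    return candidates


example_candidates = select_candidates(TAGGED)
```

Because every match begins with `<` and ends with `>`, match boundaries always coincide with token boundaries, which is what allows the character offsets to be mapped back to token indices.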

3.2 Candidate Ranking with Pretrained Language Models

Earlier work used graphs (Mihalcea and Tarau, 2004; Wan and Xiao, 2008) or paragraph and sentence embeddings (Bennani-Smires et al., 2018) to rank candidate keyphrases. However, we leverage PLMs based on current transformer architectures, which have recently demonstrated promising results (Grootendorst, 2020), to rank the candidate keyphrases. Therefore, we follow the general EmbedRank (Bennani-Smires et al., 2018) approach for ranking, but use PLMs
