PatternRank: Leveraging Pretrained Language Models and Part of
Speech for Unsupervised Keyphrase Extraction
Tim Schopf, Simon Klimek, and Florian Matthes
Department of Computer Science, Technical University of Munich, Boltzmannstrasse 3, Garching, Germany
{tim.schopf, simon.klimek, matthes}@tum.de
Keywords: Natural Language Processing, Keyphrase Extraction, Pretrained Language Models, Part of Speech.
Abstract: Keyphrase extraction is the process of automatically selecting a small set of most relevant phrases from a
given text. Supervised keyphrase extraction approaches need large amounts of labeled training data and per-
form poorly outside the domain of the training data (Bennani-Smires et al., 2018). In this paper, we present
PatternRank, which leverages pretrained language models and part-of-speech for unsupervised keyphrase ex-
traction from single documents. Our experiments show PatternRank achieves higher precision, recall and
F1-scores than previous state-of-the-art approaches. In addition, we present the KeyphraseVectorizers
package [a], which allows easy modification of part-of-speech patterns for candidate keyphrase selection, and hence
adaptation of our approach to any domain.
[a] https://github.com/TimSchopf/KeyphraseVectorizers
1 INTRODUCTION
To quickly get an overview of the content of a text, we
can use keyphrases that concisely reflect its semantic
context. Keyphrases describe the most essential aspects
of a text. Unlike simple keywords, keyphrases are not
limited to single words but can consist of several
words. Keyphrases therefore provide more information
about the content of a text than simple keywords.
Supervised keyphrase extraction approaches usually
achieve higher accuracy
than unsupervised ones (Kim et al., 2012; Caragea
et al., 2014; Meng et al., 2017). However, super-
vised approaches require manually labeled training
data, which often causes subjectivity issues as well
as significant investment of time and money (Papa-
giannopoulou and Tsoumakas, 2019). In contrast,
unsupervised keyphrase extraction approaches do not
have these issues and are moreover mostly domain-
independent.
Keyphrases and their vector representations are
very versatile and can be used in a variety of dif-
ferent Natural Language Processing (NLP) down-
stream tasks (Braun et al., 2021; Schopf et al., 2022).
For example, they can be used as features or input
for document clustering and classification (Hulth and
Megyesi, 2006; Schopf et al., 2021), they can sup-
port extractive summarization (Zhang et al., 2004), or
they can be used for query expansion (Song et al.,
2006). Keyphrase extraction is particularly relevant
for the scholarly domain as it helps to recommend ar-
ticles, highlight missing citations to authors, identify
potential reviewers for submissions, analyze research
trends over time, and can be used in many different
search scenarios (Augenstein et al., 2017).
In this paper, we present PatternRank, an un-
supervised approach for keyphrase extraction based
on Pretrained Language Models (PLMs) and Part of
Speech (PoS). Since keyphrase extraction is espe-
cially important for the scholarly domain, we evalu-
ate PatternRank on a specific dataset from this area.
Our approach does not rely on labeled data and there-
fore can be easily adapted to a variety of different do-
mains. Moreover, PatternRank does not require the
input document to be part of a larger corpus, allowing
the keyphrase extraction to be applied to individual
short texts such as publication abstracts. Figure 1 il-
lustrates the general keyphrase extraction approach of
PatternRank.
2 RELATED WORK
Most popular unsupervised keyphrase extraction
approaches can be characterized as either
statistics-based, graph-based, or embedding-based
methods, while Tf-Idf is a common baseline used for
evaluation (Papagiannopoulou and Tsoumakas, 2019).
arXiv:2210.05245v2 [cs.CL] 12 Oct 2022
[Figure 1 diagram: Text Document → Candidate Selection (Part of Speech (PoS) Pattern) → Candidate Keyphrases → Candidate Keyphrase Ranking (Pretrained Language Model (PLM)) → Ranked Candidate Keyphrases → Top-N Keyphrases → Extracted Keyphrases]
Figure 1: PatternRank approach for unsupervised keyphrase extraction. A single text document is used as input for an
initial filtering step where candidate keyphrases are selected which match a defined PoS pattern. Subsequently, the candidate
keyphrases are ranked by a PLM based on their semantic similarity to the input text document. Finally, the top-N keyphrases
are extracted as a concise reflection of the input text document.
YAKE uses a set of different statistical metrics in-
cluding word casing, word position, word frequency,
and more to extract keyphrases from text (Campos
et al., 2020). TextRank uses PoS filters to extract noun
phrase candidates that are added to a graph as nodes,
while adding an edge between nodes if the words
co-occur within a defined window (Mihalcea and Ta-
rau, 2004). Finally, PageRank (Page et al., 1999) is
applied to extract keyphrases. SingleRank expands
the TextRank approach by adding weights to edges
based on word co-occurrences (Wan and Xiao, 2008).
RAKE generates a word co-occurrence graph and as-
signs scores based on word frequency, word degree,
or the ratio of degree and frequency for keyphrase
extraction (Rose et al., 2010). Furthermore, Knowl-
edge Graphs can be used to incorporate semantics
for keyphrase extraction (Shi et al., 2017). Em-
bedRank leverages Doc2Vec (Le and Mikolov, 2014)
and Sent2Vec (Pagliardini et al., 2018) sentence em-
beddings to rank candidate keyphrases for extraction
(Bennani-Smires et al., 2018). More recently, a PLM-
based approach was introduced that uses BERT (De-
vlin et al., 2019) for self-labeling of keyphrases and
subsequent use of the generated labels in an LSTM
classifier (Sharma and Li, 2019).
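The graph-based ranking idea behind TextRank can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not taken from the cited papers: the input is an already filtered word list, nodes are single words rather than phrases, edges are unweighted, and the function names, the window size, and the iteration count are chosen for illustration only. The damping factor of 0.85 is the value commonly used with PageRank.

```python
from collections import defaultdict


def cooccurrence_graph(words, window=2):
    """Link two distinct words if they occur within `window` consecutive tokens.

    window=2 therefore links only directly adjacent words.
    """
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    return neighbors


def pagerank(neighbors, damping=0.85, iterations=50):
    """Plain power iteration of PageRank on an undirected, unweighted graph."""
    nodes = list(neighbors)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Each node receives an equal share of the score of every neighbor.
        score = {
            n: (1 - damping) / len(nodes)
            + damping * sum(score[m] / len(neighbors[m]) for m in neighbors[n])
            for n in nodes
        }
    return score


# Toy input: "graph" co-occurs with every other word and should rank first.
words = "graph ranking graph nodes graph edges".split()
scores = pagerank(cooccurrence_graph(words))
for word, value in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {value:.3f}")
```

SingleRank-style edge weights could be added to this sketch by storing co-occurrence counts instead of plain neighbor sets and distributing each node's score in proportion to those counts.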
3 KEYPHRASE EXTRACTION
APPROACH
Figure 1 illustrates the general keyphrase extraction
process of our PatternRank approach. The input is a
single text document, which is first tokenized into
words. The word tokens are then tagged with PoS tags.
Tokens whose tags match a previously defined PoS
pattern are selected as candidate keyphrases. Then, the
candidate keyphrases are fed into a PLM to rank them
based on their similarity to the input text document.
The PLM embeds the entire text document as well as all
candidate keyphrases as semantic vector
representations. Subsequently, the cosine similarities
between the document representation and the candidate
keyphrase representations are computed, and the
candidate keyphrases are ranked in descending order of
their similarity scores. Finally, the top-N ranked
keyphrases, which are most representative of the input
document, are extracted.
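The ranking step described in this section can be sketched in a few lines. To keep the example self-contained, the PLM is replaced by a bag-of-words count vector; this stand-in, the function names (toy_embed, rank_candidates), and the example text are assumptions made for illustration and are not part of PatternRank. In practice, the embed argument would wrap a pretrained encoder that returns dense vectors, and the cosine similarity would be computed on those.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a PLM encoder: a sparse bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(document, candidates, top_n=5, embed=toy_embed):
    """Rank candidates by similarity to the document and keep the top N."""
    doc_vec = embed(document)
    scored = [(c, cosine_similarity(embed(c), doc_vec)) for c in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


document = (
    "keyphrase extraction selects relevant phrases "
    "and keyphrase extraction ranks phrases"
)
candidates = ["language models", "relevant phrases", "keyphrase extraction"]
for phrase, score in rank_candidates(document, candidates, top_n=2):
    print(f"{phrase}: {score:.3f}")
```

Passing the embedding function as a parameter keeps the ranking logic independent of the model, so the toy encoder can be swapped for a real one without touching the similarity or top-N code.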
3.1 Candidate Selection with Part of
Speech
In previous work, simple noun phrases consisting of
zero or more adjectives followed by one or more
nouns were used for keyphrase extraction (Mihalcea
and Tarau, 2004; Wan and Xiao, 2008; Bennani-
Smires et al., 2018). However, we define a more com-
plex PoS pattern to extract candidate keyphrases from
the input text document. In our approach, the tags
of the word tokens have to match the following PoS
pattern in order for the tokens to be considered as can-
didate keyphrases:
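$$\big(\{.*\}\{\mathrm{HYPH}\}\{.*\}\big)\{\mathrm{NOUN}\}* \;\big|\; \big(\{\mathrm{VBG}\}\,|\,\{\mathrm{VBN}\}\big)?\,\{\mathrm{ADJ}\}*\,\{\mathrm{NOUN}\}+ \tag{1}$$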
The PoS pattern quantifiers correspond to the reg-
ular expression syntax. Therefore, we can translate
the PoS pattern as arbitrary parts-of-speech sepa-
rated by a hyphen, followed by zero or more nouns
OR zero or one verb (gerund or present or past par-
ticiple), followed by zero or more adjectives, followed
by one or more nouns.
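One way to apply such a pattern is to write the tag sequence of a sentence as a string of "<TAG>" symbols and run an ordinary regular expression over it. The sketch below does this for the pattern translated above. It is not the KeyphraseVectorizers implementation; the function name select_candidates and the hand-assigned tags of the example sentence are assumptions for illustration, and a real pipeline would obtain the tags from a PoS tagger.

```python
import re

# The PoS pattern rewritten over a string of "<TAG>" symbols, one per token.
ANY = r"<[^<>]+>"  # exactly one token with an arbitrary tag
POS_PATTERN = re.compile(
    rf"(?:{ANY}<HYPH>{ANY}(?:<NOUN>)*)"           # X-Y compound, then zero or more nouns
    r"|"
    r"(?:(?:<VBG>|<VBN>)?(?:<ADJ>)*(?:<NOUN>)+)"  # optional participle, adjectives, nouns
)


def select_candidates(tagged_tokens):
    """Return every token span whose tag sequence matches POS_PATTERN."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    candidates = []
    for match in POS_PATTERN.finditer(tag_string):
        # Each token contributes exactly one "<", so counting them maps
        # character offsets in tag_string back to token indices.
        start = tag_string.count("<", 0, match.start())
        length = match.group().count("<")
        candidates.append([tok for tok, _ in tagged_tokens[start:start + length]])
    return candidates


# Hand-tagged example sentence (assumed tags, not tagger output).
sentence = [
    ("Pretrained", "VBN"), ("language", "NOUN"), ("models", "NOUN"),
    ("rank", "VERB"), ("hyphen", "NOUN"), ("-", "HYPH"),
    ("separated", "VBN"), ("candidate", "NOUN"), ("keyphrases", "NOUN"),
    (".", "PUNCT"),
]
for words in select_candidates(sentence):
    print(" ".join(words))
```

Because the matches are non-overlapping and found left to right, each token belongs to at most one candidate in this sketch; adapting the approach to another domain amounts to replacing POS_PATTERN.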
3.2 Candidate Ranking with Pretrained
Language Models
Earlier work used graphs (Mihalcea and Tarau, 2004;
Wan and Xiao, 2008) or paragraph and sentence
embeddings (Bennani-Smires et al., 2018) to rank
candidate keyphrases. In contrast, we rank the
candidate keyphrases with PLMs based on current
transformer architectures, which have recently
demonstrated promising results (Grootendorst, 2020).
Therefore,
we follow the general EmbedRank (Bennani-Smires
et al., 2018) approach for ranking, but use PLMs