[Figure 1: pipeline diagram — Text Document → Candidate Selection (PoS pattern) → Candidate Keyphrases → Ranking by PLM → Ranked Candidate Keyphrases → Top-N Extracted Keyphrases]
Figure 1: PatternRank approach for unsupervised keyphrase extraction. A single text document is used as input for an initial filtering step in which candidate keyphrases that match a defined PoS pattern are selected. Subsequently, the candidate keyphrases are ranked by a PLM based on their semantic similarity to the input text document. Finally, the top-N keyphrases are extracted as a concise reflection of the input text document.
while Tf-Idf is a common baseline used for evalua-
tion (Papagiannopoulou and Tsoumakas, 2019).
YAKE uses a set of different statistical metrics in-
cluding word casing, word position, word frequency,
and more to extract keyphrases from text (Campos
et al., 2020). TextRank uses PoS filters to extract noun
phrase candidates that are added to a graph as nodes,
while adding an edge between nodes if the words
co-occur within a defined window (Mihalcea and Ta-
rau, 2004). Finally, PageRank (Page et al., 1999) is
applied to extract keyphrases. SingleRank expands
the TextRank approach by adding weights to edges
based on word co-occurrences (Wan and Xiao, 2008).
RAKE generates a word co-occurrence graph and as-
signs scores based on word frequency, word degree,
or the ratio of degree and frequency for keyphrase
extraction (Rose et al., 2010). Furthermore, Knowl-
edge Graphs can be used to incorporate semantics
for keyphrase extraction (Shi et al., 2017). Em-
bedRank leverages Doc2Vec (Le and Mikolov, 2014)
and Sent2Vec (Pagliardini et al., 2018) sentence em-
beddings to rank candidate keyphrases for extraction
(Bennani-Smires et al., 2018). More recently, a PLM-
based approach was introduced that uses BERT (De-
vlin et al., 2019) for self-labeling of keyphrases and
subsequent use of the generated labels in an LSTM
classifier (Sharma and Li, 2019).
3 KEYPHRASE EXTRACTION APPROACH
Figure 1 illustrates the general keyphrase extraction process of our PatternRank approach. The input consists of a single text document, which is first word-tokenized. The word tokens are then tagged with PoS tags. Tokens whose tags match a previously defined PoS pattern are selected as candidate keyphrases. Then, the candidate keyphrases are fed into a PLM, which ranks them based on their similarity to the input text document. The PLM embeds the entire text document as well as all candidate keyphrases as semantic vector representations. Subsequently, the cosine similarities between the document representation and the candidate keyphrase representations are computed, and the candidate keyphrases are ranked in descending order of similarity score. Finally, the top-N ranked keyphrases, which are most representative of the input document, are extracted.
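The ranking step can be sketched as follows. The embedding model is left abstract here: in PatternRank the vectors would come from a PLM, whereas this minimal example passes precomputed toy vectors; the function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np

def rank_candidates(doc_vec, cand_vecs, candidates, top_n=5):
    """Rank candidate keyphrases by cosine similarity to the document vector.

    doc_vec:   1-D embedding of the whole input document
    cand_vecs: 2-D array with one embedding row per candidate keyphrase
    """
    # Normalize so that the dot product equals the cosine similarity.
    doc = doc_vec / np.linalg.norm(doc_vec)
    cands = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    sims = cands @ doc                 # one cosine similarity per candidate
    order = np.argsort(-sims)          # descending similarity
    return [(candidates[i], float(sims[i])) for i in order[:top_n]]
```

Candidates whose embeddings point in nearly the same direction as the document embedding receive the highest scores and end up among the top-N extracted keyphrases.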
3.1 Candidate Selection with Part of Speech
In previous work, simple noun phrases consisting of
zero or more adjectives followed by one or more
nouns were used for keyphrase extraction (Mihalcea
and Tarau, 2004; Wan and Xiao, 2008; Bennani-
Smires et al., 2018). However, we define a more com-
plex PoS pattern to extract candidate keyphrases from
the input text document. In our approach, the tags
of the word tokens have to match the following PoS
pattern in order for the tokens to be considered as can-
didate keyphrases:
{.*}{HYPH}{.*}{NOUN}* | {VBG}|{VBN}?{ADJ}*{NOUN}+    (1)
The PoS pattern quantifiers correspond to the reg-
ular expression syntax. Therefore, we can translate
the PoS pattern as arbitrary parts-of-speech sepa-
rated by a hyphen, followed by zero or more nouns
OR zero or one verb (gerund or present or past par-
ticiple), followed by zero or more adjectives, followed
by one or more nouns.
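As an illustrative sketch (not the authors' implementation), pattern (1) can be applied by encoding each token's tag as a single character and running an equivalent regular expression over the encoded string. The tag-to-code mapping and function names below are assumptions made for the example:

```python
import re

# Single-character code per PoS tag relevant to pattern (1); all other tags map to "O".
TAG_CODES = {"NOUN": "N", "ADJ": "A", "VBG": "V", "VBN": "V", "HYPH": "H"}

# Character-level equivalent of pattern (1):
#   any tag, hyphen, any tag, zero or more nouns
#   OR optional VBG/VBN, zero or more adjectives, one or more nouns
PATTERN = re.compile(r"(?:[A-Z]H[A-Z]N*)|(?:V?A*N+)")

def extract_candidates(tagged_tokens):
    """tagged_tokens: list of (word, tag) pairs produced by a PoS tagger."""
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged_tokens)
    matches = []
    for m in PATTERN.finditer(codes):
        matches.append(" ".join(w for w, _ in tagged_tokens[m.start():m.end()]))
    return matches
```

For instance, a token sequence tagged ADJ NOUN NOUN is returned as a single candidate keyphrase, while a surrounding verb or determiner breaks the match.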
3.2 Candidate Ranking with Pretrained Language Models
Earlier work used graphs (Mihalcea and Tarau, 2004;
Wan and Xiao, 2008) or paragraph and sentence embeddings (Bennani-Smires et al., 2018) to rank candidate keyphrases. However, we leverage PLMs based on current transformer architectures, which have recently demonstrated promising results (Grootendorst, 2020), to rank the candidate keyphrases. Therefore,
we follow the general EmbedRank (Bennani-Smires
et al., 2018) approach for ranking, but use PLMs