not require any human annotation, offers the flexibility to select the right trade-off between accuracy and inference latency, and enables the deployment of lightweight models running entirely on CPUs.
At its core, our approach is a weakly supervised setup (see Figure 2), where transformer models are fine-tuned (or even entirely pre-trained) using the weak labels generated by a fully unsupervised term annotator. The unsupervised annotator (UA) combines novel morphological and semantic signals to tag sequences of text corresponding to domain-specific terminology.
In addition to part-of-speech tagging to identify candidate terms, the UA exploits sub-word tokenization techniques, commonly used in language models to represent words that fall outside the common vocabulary, to indirectly measure the morphological complexity of a word based on its sub-tokens. To the best of our knowledge, this is the first work relying on sub-word tokenization units in the context of term extraction.
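As an illustration of this signal, the sketch below uses an off-the-shelf WordPiece tokenizer to approximate the morphological complexity of a word by the number of sub-word units it is split into; the tokenizer choice and example words are illustrative, not the annotator's exact implementation.

```python
from transformers import AutoTokenizer

# Illustrative choice; any pre-trained sub-word tokenizer works.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def morphological_complexity(word: str) -> int:
    """Proxy for morphological complexity: the number of sub-word
    units the tokenizer needs to represent the word. Frequent words
    are kept whole, while rare domain-specific terms are fragmented."""
    return len(tokenizer.tokenize(word))

for word in ["table", "emulsion", "thrombocytopenia"]:
    print(word, tokenizer.tokenize(word), morphological_complexity(word))
# Common words yield a single unit; rare technical terms are split
# into several '##'-prefixed pieces, signalling higher complexity.
```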
To prune the candidate set of terms, the annotator uses two semantic metrics as thresholds: the topic score and a novel specificity score, both computed using representations from sentence encoders.
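The sketch below shows one plausible way such a threshold can be realized, assuming the topic score is the cosine similarity between a candidate term and its source document under a sentence encoder; the encoder, candidates, and threshold value are illustrative assumptions, and the specificity score itself is not reproduced here.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative encoder; the exact encoder used by the UA may differ.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def topic_score(candidate: str, document: str) -> float:
    """One plausible reading of the topic score: cosine similarity
    between the candidate term and the document it comes from."""
    cand_emb, doc_emb = encoder.encode([candidate, document])
    return float(util.cos_sim(cand_emb, doc_emb))

document = ("We train deep neural networks with stochastic "
            "gradient descent on large text corpora.")
candidates = ["stochastic gradient descent", "large rooms", "text corpora"]
THRESHOLD = 0.3  # hypothetical value; tuned on held-out data in practice
pruned = [c for c in candidates if topic_score(c, document) >= THRESHOLD]
print(pruned)
```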
The unsupervised annotator, combined with the two-stage weakly supervised setup, makes our approach particularly attractive for practical industrial settings, because the computationally intensive techniques used by the unsupervised annotator are not paid for at inference time. Therefore, one can improve the annotation quality by using more expensive techniques (e.g., entity linking to external knowledge bases) without adding costs at inference time. The two main contributions of this paper are summarized as follows:
1. We extract a novel morphology signal from subword-unit tokenization, and we introduce a new metric called the specificity score. On top of those signals, we build an unsupervised term extractor that offers competitive results when no annotation is available.
2. We show that, by fine-tuning transformer models on the weak labels produced by the unsupervised term extractor, we decrease latency and improve prediction quality (see the sketch below).
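As a minimal sketch of this fine-tuning recipe, the snippet below trains a transformer token classifier on BIO-style weak labels; the base model, label scheme, and toy data are placeholders rather than the paper's actual configuration.

```python
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

LABELS = ["O", "B-TERM", "I-TERM"]  # assumed BIO scheme for term spans
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(LABELS))

# Toy weakly labelled sentence standing in for the annotator's output.
weak = Dataset.from_dict({
    "tokens": [["We", "apply", "byte", "pair", "encoding", "here", "."]],
    "tags": [[0, 0, 1, 2, 2, 0, 0]],
})

def encode(example):
    enc = tokenizer(example["tokens"], is_split_into_words=True,
                    truncation=True)
    labels, prev = [], None
    for wid in enc.word_ids():
        # Label only the first sub-token of each word; ignore the rest.
        labels.append(example["tags"][wid]
                      if wid is not None and wid != prev else -100)
        prev = wid
    enc["labels"] = labels
    return enc

train_set = weak.map(encode, remove_columns=weak.column_names)
Trainer(
    model=model,
    args=TrainingArguments(output_dir="term-tagger", num_train_epochs=1),
    train_dataset=train_set,
    data_collator=DataCollatorForTokenClassification(tokenizer),
).train()
```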
2 Related work
Automated Term Extraction (ATE) is a natural language processing task that has been the subject of many research studies (Buitelaar et al., 2005; Lossio-Ventura et al., 2016; Zhang et al., 2018; Ma et al., 2019; Šajatović et al., 2019). What we describe in this work is an effective term extraction approach that is fully unsupervised and also offers the flexibility and modularity to deploy and easily maintain systems in production.
ATE should not be confused with keyphrase extraction (Firoozeh et al., 2020; Mahata et al., 2018; Bennani-Smires et al., 2018) and keyphrase generation (Wu et al., 2022; Chen et al., 2020), whose goal is to extract, or generate, key phrases that best describe a given free-text document. Keyphrases can be seen as a set of tags associated with a document. In the context of keyphrase extraction, sentence embedders have been used in the literature, for example in EmbedRank (Bennani-Smires et al., 2018) and Key2Vec (Mahata et al., 2018). In our work, we also rely on sentence encoders, but we use them to generate training data for sequence tagging. Therefore, we do not rely on sentence encoders at runtime to extract terminology from text, enabling the creation of lower-latency systems.
To capture complex morphological structures, we use word segmentation techniques. Word segmentation algorithms such as Byte-Pair Encoding (Sennrich et al., 2016), WordPiece (Schuster and Nakajima, 2012), and unigram language modeling (Kudo, 2018) have been introduced to avoid the problem of out-of-vocabulary words and, more generally, to reduce the number of distinct symbols that sequence models for natural language processing have to process. To the best of our knowledge, we are the first to use subword-unit tokenization as a signal to extract technical terms from text.
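For illustration, the toy sketch below trains a small Byte-Pair Encoding vocabulary with the HuggingFace tokenizers library; frequent words survive as single symbols, while unseen technical words decompose into smaller pieces, which is the behaviour our morphology signal builds on. The corpus and vocabulary size are arbitrary.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a tiny BPE vocabulary on a toy in-domain corpus.
bpe = Tokenizer(models.BPE(unk_token="[UNK]"))
bpe.pre_tokenizer = pre_tokenizers.Whitespace()
corpus = ["deep models are trained on noisy data",
          "the training data must be clean"] * 50
bpe.train_from_iterator(corpus, trainers.BpeTrainer(
    vocab_size=200, special_tokens=["[UNK]"]))

# A frequent word is kept whole; an unseen technical word falls back
# to smaller pieces instead of an out-of-vocabulary symbol.
print(bpe.encode("training").tokens)
print(bpe.encode("hyperparameter").tokens)
```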
Our approach builds on the notion of specificity to find terminology. While there are multiple research works (Caraballo and Charniak, 1999; Ryu and Choi, 2006) highlighting the importance of specificity, to the best of our knowledge this is the first work using the notion of specificity to extract terminology from text.
3 The approach
Figure 2 depicts our weakly supervised setup. Starting from a raw text corpus and no labels, our training workflow produces an efficient sequence tagging model, based on the transformer architecture, which effectively implements the term extraction. At the core of the weak labels there is a fully unsupervised component, called the Unsupervised