Unsupervised Term Extraction for Highly Technical Domains
Francesco Fusco
IBM Research
ffu@zurich.ibm.com
Peter Staar
IBM Research
taa@zurich.ibm.com
Diego Antognini
IBM Research
Diego.Antognini@ibm.com
Abstract
Term extraction is an information extraction task at the root of knowledge discovery platforms. Developing term extractors that are able to generalize across very diverse and potentially highly technical domains is challenging, as annotations for domains requiring in-depth expertise are scarce and expensive to obtain. In this paper, we describe the term extraction subsystem of a commercial knowledge discovery platform that targets highly technical fields such as pharma, medical, and material science. To be able to generalize across domains, we introduce a fully unsupervised annotator (UA). It extracts terms by combining novel morphological signals from sub-word tokenization with term-to-topic and intra-term similarity metrics, computed using general-domain pre-trained sentence encoders. The annotator is used to implement a weakly supervised setup, where transformer models are fine-tuned (or pre-trained) over the training data generated by running the UA over large unlabeled corpora. Our experiments demonstrate that our setup can improve the predictive performance while decreasing the inference latency on both CPUs and GPUs. Our annotators provide a very competitive baseline for all the cases where annotations are not available.
1 Introduction
Automated Term Extraction (ATE) is the task of extracting terminology from domain-specific corpora. Term extraction is the most important information extraction task for knowledge discovery systems – whose aim is to create structured knowledge from unstructured text – because domain-specific terms are the linguistic representation of domain-specific concepts. To be of use in knowledge discovery systems (e.g., SAGA (Ilyas et al., 2022), DeepSearch (Dognin et al., 2020)), the term extraction has to identify individual mentions of terms to enable downstream components (i.e., the entity linker) to use not only the terms, but also their surrounding context. Unlike other applications of term extraction, such as text classification, where it is sufficient to extract representative terms for entire documents or even use generative approaches, term extraction in knowledge discovery systems has to be approached as a sequence tagging task.

Wikipedia text (from https://en.wikipedia.org/wiki/JPEG):
JPEG (/ˈdʒeɪpɛɡ/ JAY-peg)[2] is a commonly used method of lossy compression for digital images, particularly for those images produced by digital photography.

Output of our unsupervised term-extractor annotator on the same text:
[JPEG] START=0 END=4 Confidence=0.60
[JAY-peg] START=17 END=24 Confidence=0.90
[lossy compression] START=58 END=75 Confidence=0.73
[digital images] START=80 END=94 Confidence=0.93
[digital photography] START=138 END=157 Confidence=0.92

Figure 1: Our term extractor identifies the same mentions as Wikipedia without relying on annotated data.
The largest challenges for term extraction systems, when used for knowledge discovery, are generalization across domains and the lack of annotated data. In fact, commercial knowledge discovery platforms are typically required to process large corpora targeting very diverse and often highly technical domains. Organizing annotation campaigns for such vertical domains is a costly process, as it requires highly specialized domain experts. Additional challenges for such platforms are the computational requirements, which must be accounted for when developing technologies required to sift through very large and often proprietary corpora.
In this work, we describe an effective term extraction approach used in a commercial knowledge discovery platform¹ to extract Wikipedia-like concepts² from text (see Figure 1). Our approach does not require any human annotation, offers the flexibility to select the right trade-off between accuracy and inference latency, and enables the deployment of lightweight models running entirely on CPUs.

¹ https://ds4sd.github.io
² The linking from words to Wikilinks is done manually on Wikipedia; see https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Linking for more details.
At its core, our approach is a weakly supervised setup (see Figure 2), where transformer models are fine-tuned (or even entirely pre-trained) using the weak labels generated by a fully unsupervised term annotator. The unsupervised annotator (UA) combines novel morphological and semantic signals to tag sequences of text corresponding to domain-specific terminology. In addition to part-of-speech tagging to identify candidate terms, the UA exploits sub-word tokenization techniques – commonly used in language models to handle words outside of the common vocabulary – to indirectly measure the morphological complexity of a word based on its sub-tokens. To the best of our knowledge, this is the first work relying on sub-word tokenization units in the context of term extraction.
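
To make the morphological signal concrete, here is a minimal sketch, assuming a general-domain WordPiece tokenizer from the Hugging Face transformers library; using the raw sub-token count as a complexity proxy is an illustrative simplification, not our exact scoring.

from transformers import AutoTokenizer

# General-domain tokenizer; domain-specific words fall outside its
# vocabulary and are split into multiple sub-word units.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def morphological_complexity(word: str) -> int:
    # Number of sub-word units needed to encode `word`: common words
    # stay whole, rare technical words fragment into several pieces.
    return len(tokenizer.tokenize(word))

for word in ["image", "compression", "immunohistochemistry"]:
    print(word, morphological_complexity(word))

A rare, domain-specific word such as "immunohistochemistry" fragments into several pieces while everyday words stay whole, which is the signal the UA exploits.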
To prune the candidate set of terms, the annotator uses two semantic metrics as thresholds: the topic score and a novel specificity score, both computed using representations from sentence encoders. The unsupervised annotator, combined with the two-stage weakly supervised setup, makes our approach particularly attractive for practical industrial setups, because computationally intensive techniques used by the unsupervised annotator are not paid for at inference time. Therefore, one can improve the annotation quality by using more expensive techniques (e.g., entity linking to external knowledge bases) without adding costs at inference time.
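
To illustrate the two thresholds, the following is a hedged sketch using the sentence-transformers library. The definitions below are plausible stand-ins for our exact formulations: the topic score as term-to-context similarity, and the specificity score as the intra-term similarity among a multi-word term's constituents. The model name and example text are assumptions.

from itertools import combinations
from sentence_transformers import SentenceTransformer, util

# General-domain pre-trained sentence encoder (an assumed model choice).
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def topic_score(term: str, context: str) -> float:
    # Cosine similarity between the candidate term and its context.
    term_emb, ctx_emb = encoder.encode([term, context], convert_to_tensor=True)
    return util.cos_sim(term_emb, ctx_emb).item()

def specificity_score(term: str) -> float:
    # Mean pairwise similarity among the words of a multi-word term;
    # single-word terms pass trivially in this sketch.
    words = term.split()
    if len(words) < 2:
        return 1.0
    embs = encoder.encode(words, convert_to_tensor=True)
    sims = [util.cos_sim(embs[i], embs[j]).item()
            for i, j in combinations(range(len(words)), 2)]
    return sum(sims) / len(sims)

context = "JPEG is a commonly used method of lossy compression for digital images."
for term in ["lossy compression", "digital images", "used method"]:
    print(term, round(topic_score(term, context), 2),
          round(specificity_score(term), 2))

Candidates scoring below the two thresholds would be discarded.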
The two main contributions of this paper are summarized as follows:
1. We extract a novel morphology signal from subword-unit tokenization and introduce a new metric called the specificity score. Building on these signals, we construct an unsupervised term extractor that offers competitive results when no annotation is available.
2. We show that fine-tuning transformer models over the weak labels produced by the unsupervised term extractor decreases the latency and improves the prediction quality (see the sketch after this list).
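
As a concrete illustration of the second contribution, here is a hedged sketch of the fine-tuning stage using token classification from the Hugging Face transformers library. The model choice, the two-sentence corpus, and the hyper-parameters are toy assumptions standing in for the UA-labeled training data.

from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

label_names = ["O", "B-TERM", "I-TERM"]
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(label_names))

# Toy weakly labeled corpus standing in for the UA output: words plus
# BIO tags marking term mentions.
corpus = [
    (["JPEG", "uses", "lossy", "compression"],
     ["B-TERM", "O", "B-TERM", "I-TERM"]),
    (["digital", "photography", "is", "popular"],
     ["B-TERM", "I-TERM", "O", "O"]),
]

def encode(words, tags):
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    # Label only the first sub-token of each word; -100 positions are
    # ignored by the loss and padded by the collator.
    labels, prev = [], None
    for wid in enc.word_ids():
        labels.append(-100 if wid is None or wid == prev
                      else label_names.index(tags[wid]))
        prev = wid
    enc["labels"] = labels
    return enc

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="term-tagger",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=[encode(w, t) for w, t in corpus],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()

In the actual pipeline, the training set would be the weak labels obtained by running the UA over a large unlabeled corpus.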
2 Related work
Automated Term Extraction (ATE) is a natural language processing task that has been the subject of many research studies (Buitelaar et al., 2005; Lossio-Ventura et al., 2016; Zhang et al., 2018; Ma et al., 2019; Šajatović et al., 2019). What we describe in this work is an effective term extraction approach that is fully unsupervised and also offers the flexibility and modularity to deploy and easily maintain systems in production.
ATE should not be confused with keyphrase extraction (Firoozeh et al., 2020; Mahata et al., 2018; Bennani-Smires et al., 2018) and keyphrase generation (Wu et al., 2022; Chen et al., 2020), which have the goal of extracting, or generating, key phrases that best describe a given free-text document. Keyphrases can be seen as a set of tags associated with a document. In the context of keyphrase extraction, sentence embedders have been used in the literature, such as in EmbedRank (Bennani-Smires et al., 2018) and Key2Vec (Mahata et al., 2018). In our work, we also rely on sentence encoders, but we use them to generate training data for sequence tagging. Therefore, we do not rely on sentence encoders at runtime to extract terminology from text, enabling the creation of lower-latency systems.
To capture complex morphological structures we use word segmentation techniques. Word segmentation algorithms such as Byte-Pair Encoding (Sennrich et al., 2016), WordPiece (Schuster and Nakajima, 2012), and unigram language modeling (Kudo, 2018) have been introduced to avoid the problem of out-of-vocabulary words and, more generally, to reduce the number of distinct symbols that sequence models for natural language processing have to process. To the best of our knowledge, we are the first to use subword-unit tokenization as a signal to extract technical terms from text.
Our approach builds on the notion of specificity to find terminology. While there are multiple research works (Caraballo and Charniak, 1999; Ryu and Choi, 2006) highlighting the importance of specificity, to the best of our knowledge, this is the first work using the notion of specificity to extract terminology from text.
3 The approach
Figure 2 depicts our weakly supervised setup. Starting from a raw text corpus and no labels, our training workflow produces an efficient sequence tagging model, based on the transformer architecture, which effectively implements the term extraction. At the core of the weak labels there is a fully unsupervised component, called the Unsupervised Annotator (UA).
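
To make the weak-label generation concrete, below is a minimal sketch of turning the character-level spans emitted by the UA (as in Figure 1) into BIO tags for sequence tagging. The span list and the whitespace tokenization are illustrative assumptions, not the platform's actual formats.

text = "JPEG is a commonly used method of lossy compression for digital images"
terms = ["JPEG", "lossy compression", "digital images"]
# Character-level (start, end) spans, as the UA would emit them.
spans = [(text.index(t), text.index(t) + len(t)) for t in terms]

# Whitespace tokens paired with their character offsets.
tokens, pos = [], 0
for tok in text.split():
    start = text.index(tok, pos)
    tokens.append((tok, start))
    pos = start + len(tok)

# B-TERM opens a span, I-TERM continues one, O is outside any span.
tags = []
for tok, start in tokens:
    tag = "O"
    for s, e in spans:
        if start == s:
            tag = "B-TERM"
        elif s < start < e:
            tag = "I-TERM"
    tags.append(tag)

print(list(zip([t for t, _ in tokens], tags)))

These token-level tags form the training data over which the transformer model is fine-tuned.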