Unsupervised Term Extraction for Highly Technical Domains
Francesco Fusco
IBM Research
ffu@zurich.ibm.com
Peter Staar
IBM Research
taa@zurich.ibm.com
Diego Antognini
IBM Research
Diego.Antognini@ibm.com
Abstract
Term extraction is an information extraction task at the root of knowledge discovery platforms. Developing term extractors that are able to generalize across very diverse and potentially highly technical domains is challenging, as annotations for domains requiring in-depth expertise are scarce and expensive to obtain. In this paper, we describe the term extraction subsystem of a commercial knowledge discovery platform that targets highly technical fields such as pharma, medical, and material science. To be able to generalize across domains, we introduce a fully unsupervised annotator (UA). It extracts terms by combining novel morphological signals from sub-word tokenization with term-to-topic and intra-term similarity metrics, computed using general-domain pre-trained sentence encoders. The annotator is used to implement a weakly supervised setup, where transformer models are fine-tuned (or pre-trained) over the training data generated by running the UA over large unlabeled corpora. Our experiments demonstrate that our setup can improve the predictive performance while decreasing the inference latency on both CPUs and GPUs. Our annotators provide a very competitive baseline for all the cases where annotations are not available.
1 Introduction
Automated Term Extraction (ATE) is the task of extracting terminology from domain-specific corpora. Term extraction is the most important information extraction task for knowledge discovery systems – whose aim is to create structured knowledge from unstructured text – because domain-specific terms are the linguistic representation of domain-specific concepts. To be of use in knowledge discovery systems (e.g., SAGA (Ilyas et al., 2022), DeepSearch (Dognin et al., 2020)), the term extraction has to identify individual mentions of terms to enable downstream components (i.e., the entity linker) to use not only the terms, but also their surrounding context. Unlike other applications of term extraction, such as text classification, where it is sufficient to extract representative terms for entire documents or even use generative approaches, term extraction in knowledge discovery systems has to be approached as a sequence tagging task.

Wikipedia text (from https://en.wikipedia.org/wiki/JPEG):
JPEG (/ˈdʒeɪpɛɡ/ JAY-peg)[2] is a commonly used method of lossy compression for digital images, particularly for those images produced by digital photography.

Output of our unsupervised term-extractor annotator on the same text:
[JPEG] START=0 END=4 Confidence=0.60
[JAY-peg] START=17 END=24 Confidence=0.90
[lossy compression] START=58 END=75 Confidence=0.73
[digital images] START=80 END=94 Confidence=0.93
[digital photography] START=138 END=157 Confidence=0.92

Figure 1: Our term extractor identifies the same mentions as Wikipedia without relying on annotated data.
The largest challenges for term extraction systems, when used for knowledge discovery, are generalization across domains and the lack of annotated data. In fact, commercial knowledge discovery platforms are typically required to process large corpora targeting very diverse and often highly technical domains. Organizing annotation campaigns for such vertical domains is a costly process, as it requires highly specialized domain experts. Additional challenges for such platforms are the computational requirements, which must be accounted for when developing technologies required to sift through very large and often proprietary corpora.
In this work, we describe an effective term extraction approach used in a commercial knowledge discovery platform¹ to extract Wikipedia-like concepts² from text (see Figure 1). Our approach does not require any human annotation, offers the flexibility to select the right trade-off between accuracy and inference latency, and enables the deployment of lightweight models running entirely on CPUs.

¹ https://ds4sd.github.io
² The linking from words to Wikilinks is done manually on Wikipedia; see https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Linking for more details.
At its core, our approach is a weakly supervised setup (see Figure 2), where transformer models are fine-tuned (or even entirely pre-trained) using the weak labels generated by a fully unsupervised term annotator. The unsupervised annotator (UA) combines novel morphological and semantic signals to tag sequences of text corresponding to domain-specific terminology. In addition to part-of-speech tagging to identify candidate terms, the UA exploits sub-word tokenization techniques – commonly used in language models to handle words outside of the common vocabulary – to indirectly measure the morphological complexity of a word based on its sub-tokens. To the best of our knowledge, this is the first work relying on sub-word tokenization units in the context of term extraction.
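
To make the morphological signal concrete, here is a minimal sketch, assuming a general-domain WordPiece tokenizer from the Hugging Face transformers library; using the raw sub-token count as a complexity proxy is an illustrative simplification, not our exact scoring.

from transformers import AutoTokenizer

# General-domain tokenizer; domain-specific words fall outside its
# vocabulary and are split into multiple sub-word units.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def morphological_complexity(word: str) -> int:
    # Number of sub-word units needed to encode `word`: common words
    # stay whole, rare technical words fragment into several pieces.
    return len(tokenizer.tokenize(word))

for word in ["image", "compression", "immunohistochemistry"]:
    print(word, morphological_complexity(word))

A rare, domain-specific word such as "immunohistochemistry" fragments into several pieces while everyday words stay whole, which is the signal the UA exploits.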
To prune the candidate set of terms, the annotator uses two semantic metrics as thresholds: the topic score and a novel specificity score, both computed using representations from sentence encoders. The unsupervised annotator, combined with the two-stage weakly supervised setup, makes our approach particularly attractive for practical industrial setups, because computationally intensive techniques used by the unsupervised annotator are not paid for at inference time. Therefore, one can improve the annotation quality by using more expensive techniques (e.g., entity linking to external knowledge bases) without adding costs at inference time.
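
To illustrate the two thresholds, the following is a hedged sketch using the sentence-transformers library. The definitions below are plausible stand-ins for our exact formulations: the topic score as term-to-context similarity, and the specificity score as the intra-term similarity among a multi-word term's constituents. The model name and example text are assumptions.

from itertools import combinations
from sentence_transformers import SentenceTransformer, util

# General-domain pre-trained sentence encoder (an assumed model choice).
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def topic_score(term: str, context: str) -> float:
    # Cosine similarity between the candidate term and its context.
    term_emb, ctx_emb = encoder.encode([term, context], convert_to_tensor=True)
    return util.cos_sim(term_emb, ctx_emb).item()

def specificity_score(term: str) -> float:
    # Mean pairwise similarity among the words of a multi-word term;
    # single-word terms pass trivially in this sketch.
    words = term.split()
    if len(words) < 2:
        return 1.0
    embs = encoder.encode(words, convert_to_tensor=True)
    sims = [util.cos_sim(embs[i], embs[j]).item()
            for i, j in combinations(range(len(words)), 2)]
    return sum(sims) / len(sims)

context = "JPEG is a commonly used method of lossy compression for digital images."
for term in ["lossy compression", "digital images", "used method"]:
    print(term, round(topic_score(term, context), 2),
          round(specificity_score(term), 2))

Candidates scoring below the two thresholds would be discarded.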
The two main contributions of this paper are summarized as follows:
1. We extract a novel morphology signal from subword-unit tokenization and introduce a new metric called the specificity score. Building on these signals, we construct an unsupervised term extractor that offers competitive results when no annotation is available.
2. We show that fine-tuning transformer models over the weak labels produced by the unsupervised term extractor decreases the latency and improves the prediction quality (see the sketch after this list).
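
As a concrete illustration of the second contribution, here is a hedged sketch of the fine-tuning stage using token classification from the Hugging Face transformers library. The model choice, the two-sentence corpus, and the hyper-parameters are toy assumptions standing in for the UA-labeled training data.

from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

label_names = ["O", "B-TERM", "I-TERM"]
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(label_names))

# Toy weakly labeled corpus standing in for the UA output: words plus
# BIO tags marking term mentions.
corpus = [
    (["JPEG", "uses", "lossy", "compression"],
     ["B-TERM", "O", "B-TERM", "I-TERM"]),
    (["digital", "photography", "is", "popular"],
     ["B-TERM", "I-TERM", "O", "O"]),
]

def encode(words, tags):
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    # Label only the first sub-token of each word; -100 positions are
    # ignored by the loss and padded by the collator.
    labels, prev = [], None
    for wid in enc.word_ids():
        labels.append(-100 if wid is None or wid == prev
                      else label_names.index(tags[wid]))
        prev = wid
    enc["labels"] = labels
    return enc

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="term-tagger",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=[encode(w, t) for w, t in corpus],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()

In the actual pipeline, the training set would be the weak labels obtained by running the UA over a large unlabeled corpus.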
2 Related work
Automated Term Extraction (ATE) is a natural language processing task that has been the subject of many research studies (Buitelaar et al., 2005; Lossio-Ventura et al., 2016; Zhang et al., 2018; Ma et al., 2019; Šajatović et al., 2019). What we describe in this work is an effective term extraction approach that is fully unsupervised and also offers the flexibility and modularity to deploy and easily maintain systems in production.
ATE should not be confused with keyphrase extraction (Firoozeh et al., 2020; Mahata et al., 2018; Bennani-Smires et al., 2018) and keyphrase generation (Wu et al., 2022; Chen et al., 2020), which have the goal of extracting, or generating, key phrases that best describe a given free-text document. Keyphrases can be seen as a set of tags associated with a document. In the context of keyphrase extraction, sentence embedders have been used in the literature, such as in EmbedRank (Bennani-Smires et al., 2018) and Key2Vec (Mahata et al., 2018). In our work, we also rely on sentence encoders, but we use them to generate training data for sequence tagging. Therefore, we do not rely on sentence encoders at runtime to extract terminology from text, enabling the creation of lower-latency systems.
To capture complex morphological structures we use word segmentation techniques. Word segmentation algorithms such as Byte-Pair Encoding (Sennrich et al., 2016), WordPiece (Schuster and Nakajima, 2012), and unigram language modeling (Kudo, 2018) have been introduced to avoid the problem of out-of-vocabulary words and, more generally, to reduce the number of distinct symbols that sequence models for natural language processing have to process. To the best of our knowledge, we are the first to use subword-unit tokenization as a signal to extract technical terms from text.
Our approach builds on the notion of specificity to find terminology. While there are multiple research works (Caraballo and Charniak, 1999; Ryu and Choi, 2006) highlighting the importance of specificity, to the best of our knowledge, this is the first work using the notion of specificity to extract terminology from text.
3 The approach
Figure 2 depicts our weakly supervised setup. Starting from a raw text corpus and no labels, our training workflow produces an efficient sequence tagging model, based on the transformer architecture, which effectively implements the term extraction. At the core of the weak labels there is a fully unsupervised component, called the Unsupervised Annotator (UA).
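
To make the weak-label generation concrete, below is a minimal sketch of turning the character-level spans emitted by the UA (as in Figure 1) into BIO tags for sequence tagging. The span list and the whitespace tokenization are illustrative assumptions, not the platform's actual formats.

text = "JPEG is a commonly used method of lossy compression for digital images"
terms = ["JPEG", "lossy compression", "digital images"]
# Character-level (start, end) spans, as the UA would emit them.
spans = [(text.index(t), text.index(t) + len(t)) for t in terms]

# Whitespace tokens paired with their character offsets.
tokens, pos = [], 0
for tok in text.split():
    start = text.index(tok, pos)
    tokens.append((tok, start))
    pos = start + len(tok)

# B-TERM opens a span, I-TERM continues one, O is outside any span.
tags = []
for tok, start in tokens:
    tag = "O"
    for s, e in spans:
        if start == s:
            tag = "B-TERM"
        elif s < start < e:
            tag = "I-TERM"
    tags.append(tag)

print(list(zip([t for t, _ in tokens], tags)))

These token-level tags form the training data over which the transformer model is fine-tuned.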