BioLORD: Learning Ontological Representations from Definitions
for Biomedical Concepts and their Textual Descriptions
François Remy and Kris Demuynck and Thomas Demeester
The Internet and Data Science Lab (IDLab)
Ghent University (UGent) - imec, Belgium
francois.remy@ugent.be
Abstract
This work introduces BioLORD, a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts. State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, this approach sometimes results in non-semantic representations. BioLORD overcomes this issue by grounding concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph consisting of biomedical ontologies. Thanks to this grounding, our model produces more semantic concept representations that match more closely the hierarchical structure of ontologies. BioLORD establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).
1 Introduction
Natural language processing models are well positioned to support healthcare providers by automatically extracting and synthesizing relevant information from clinical notes. For this, we introduce BioLORD, a pre-training strategy for end-to-end biomedical information extraction, capable of producing meaningful representations for biomedical terms and clinical sentences simultaneously.

This is achieved through the continued pre-training of an existing sentence embedding model, using contrastive learning and pairs consisting of the names and definitions of a given biomedical concept (see Fig. 1). This design choice proved crucial for the effectiveness of BioLORD, as it enables the transfer of knowledge from the definitions to the representation of biomedical names, thereby overcoming limitations of existing works (see §2.3) through a more effective usage of the knowledge contained in biomedical ontologies (see §2.1).

Indeed, to improve coverage and diversity, we supplemented definitions with textual descriptions generated from the numerous concept-to-concept relationships contained in biomedical ontologies.
Our key contributions are: (1) a versatile training strategy using dictionaries and knowledge graphs to create highly semantic representations for the key phrases of a domain; (2) an associated BioLORD model trained on the biomedical domain; (3) an extensive evaluation (§4) demonstrating its ability to provide semantic representations usable in a broad range of information extraction scenarios, including a new state of the art for Biomedical Concept Representation and Clinical Sentence Similarity; and (4) an in-depth analysis of the strengths and weaknesses of our proposed approach (§5).
2 Related Work
Let us first consider how prior works attempted to address the biomedical domain's usage of a large, specialized, and often opaque vocabulary (e.g., PAPA syndrome¹ or cat scratch disease²).
2.1 Biomedical ontologies
To condense this lexical knowledge in digital form, medical practitioners developed semi-structured concept hierarchies called biomedical ontologies, merging a dictionary and a knowledge graph.
SnomedCT (Systematized Nomenclature of Medicine and Clinical Terms) is one such ontology covering around 700k medical concepts in total and a small set of important relationships between these concepts (Schulz and Klein, 2008).
UMLS (Unified Medical Language System) bridges several biomedical ontologies to cover more than 4 million concepts, each with on average 4 listed names (Bodenreider, 2004). UMLS also contains around 90 million labeled concept-to-concept relationships of 900 different types.
¹ A hereditary inflammatory disorder affecting the skin.
² A bacterial skin infection caused by Bartonella henselae.
Figure 1: BioLORD aims to bring the representations of biomedical concept names and their definitions closer to each other, to ground the name representations with knowledge from the definitions. This is illustrated for the Ranitidine and Aspirin concepts from UMLS. Knowledge from the ontology's relational knowledge graph is injected by extending the set of known definitions with automatically generated descriptions. Each such description pairs a more generic concept with one relationship (of the described concept) and its related concept, thereby setting the described concept apart from the more generic one. Contrastive learning is applied to attract the representations of compatible pairs and repel incompatible ones (obtained as in-batch negatives).
2.2 Contrastive Learning Strategies
On the machine learning side, efforts in the tasks of named entity recognition (NER) and normalization (NEL) are strongly influenced by the challenges posed by such a large and specialized vocabulary. In recent years, approaches using ontologies through string-based pattern matching, such as MetaMap (Aronson, 2001), have been consistently outperformed by newer works relying on contrastive learning with Transformers.
BioSyn (Sung et al., 2020) was the first model to introduce the idea of contrastive learning to produce embeddings of biomedical concepts. It takes existing NEL benchmarks and proposes to use their training sets in a contrastive manner. An encoder model initialized with BioBERT (Lee et al., 2020) is trained to produce embeddings for batches of concept names (grouped by pairs referring to the same concept). A contrastive loss is then applied to ensure that the embeddings of synonyms are significantly closer to each other than they are to the other names in the batch, which refer to other concepts. After pre-training, the model can be fine-tuned for the end task of NEL using cross-entropy training.
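To make this training signal concrete, the following is a minimal PyTorch sketch of an in-batch contrastive loss over synonym pairs; it is an illustrative InfoNCE-style formulation under our assumptions (including the temperature value), not necessarily BioSyn's exact published objective.

```python
# Minimal sketch (PyTorch) of in-batch contrastive learning over synonym
# pairs. Illustrative only: the temperature and the exact loss form are
# assumptions, not BioSyn's published training objective.
import torch
import torch.nn.functional as F

def synonym_contrastive_loss(anchor_emb, positive_emb, temperature=0.05):
    """anchor_emb, positive_emb: (batch, dim) embeddings of two names
    referring to the same concept, aligned row by row."""
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    # Cosine-similarity logits: row i should match column i; all other
    # rows in the batch serve as in-batch negatives for row i.
    logits = anchor @ positive.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```

Minimizing this loss pulls the two embeddings of each synonym pair together while pushing them away from the embeddings of the other concepts sampled in the same batch.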
SapBERT (Liu et al., 2021) was the first large-scale contrastive model to leverage UMLS. Just like BioSyn, it produces embeddings for biomedical concept names, without considering the context they are used in. But, unlike BioSyn, it is based on PubMedBERT (Gu et al., 2020) and uses the synonyms defined for concepts in UMLS to form the training pairs. This enables the model to contrast millions of entries, many more than BioSyn.
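As an illustration of this pair-construction step, here is a hedged sketch; the `umls_names` mapping and the per-concept cap are hypothetical simplifications, whereas the real pipeline is built from the UMLS release files.

```python
# Sketch of forming SapBERT-style positive pairs from UMLS synonym lists.
# `umls_names` maps a concept identifier (CUI) to its known names; the
# mapping, the example CUI, and the per-concept cap are illustrative.
import itertools
import random

def make_synonym_pairs(umls_names, max_pairs_per_concept=50):
    pairs = []
    for cui, names in umls_names.items():
        if len(names) < 2:
            continue  # a concept needs at least two names to form a pair
        candidates = list(itertools.combinations(names, 2))
        random.shuffle(candidates)
        for name_a, name_b in candidates[:max_pairs_per_concept]:
            pairs.append((cui, name_a, name_b))
    return pairs

# Example: two names of the same concept yield one positive pair.
pairs = make_synonym_pairs({"C0000001": ["aspirin", "acetylsalicylic acid"]})
```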
BIOCOM (Ujiie et al., 2021) and KRISSBERT (Zhang et al., 2021) independently extended this approach in a similar way, by noting the need for context-based disambiguation for some entities. For each UMLS concept, sentences mentioning the concept are collected from PubMed articles. These sentences are used as context during training.
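The context-collection step can be pictured with a small sketch; the regex-based name matching and naive sentence splitting below are simplifications for exposition, not the actual BIOCOM or KRISSBERT pipelines.

```python
# Sketch of collecting context sentences that mention a concept, in the
# spirit of BIOCOM/KRISSBERT. The regex matching and naive sentence
# splitting are illustrative simplifications of the real pipelines.
import re

def collect_context_sentences(concept_names, documents):
    """Return sentences from `documents` mentioning any of `concept_names`."""
    patterns = [re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
                for name in concept_names]
    contexts = []
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            if any(p.search(sentence) for p in patterns):
                contexts.append(sentence)
    return contexts
```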
Figure 2: Concept mapping sometimes requires considering the entire sentence, rather than mentions.
Figure 3: In SapBERT's latent space, none of the nearest neighbors of "apyrexial" (i.e., fever-free) happen to share the word's meaning. Instead, the alpha-privative was over-indexed by the model, among other biases.
2.3 Challenges with existing models
BIOCOM and KRISSBERT propose to disambiguate mentions of biomedical concepts using contextual information. Ambiguous notations requiring context to disambiguate can indeed be found in clinical notes. However, using these contextual models for inference is only possible after identifying text spans denoting such concepts in the input text. This requires introducing a mention detection model, which comes with its own challenges and errors. Worse, reducing mentions to text spans is not always possible, as concepts are sometimes alluded to in a diffuse way (see Fig. 2).
However, models which do not use in-context mentions usually learn representations of lower quality than in-context models. By pairing synonyms with a significant word or token overlap with each other, these models isolate concepts containing rare words or tokens early in the training, in a way that is rarely semantic (see Fig. 3). Indeed, the training loss of contrastive models only requires placing all mentions of a particular concept close to each other, but it does not provide strong guarantees about the relative location of different but similar concepts in the latent space.
While hierarchical relationships from medical ontologies have sometimes been used to produce more meaningful concept embeddings (Zhang et al., 2021), this is not sufficient to overcome the issues stated above, because relatedness cannot always be encoded hierarchically.
3 Pre-training methodology
To produce representations of biomedical concepts that overcome the limitations described above, we modified the way the positive pairs are constructed. Like the prior works cited in §2.2, we start by establishing a list of names for each UMLS concept. However, unlike previous works, we do not use these names directly to form positive pairs. Instead, we construct pairs formed with, on the one side, a randomly selected name for a given concept and, on the other side, a definition or description for that concept (see Fig. 1).
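To illustrate this construction, a minimal sketch follows; `names_by_cui` and `texts_by_cui` are hypothetical lookup tables standing in for the curated UMLS names and the pool of definitions and generated descriptions described in §3.1.

```python
# Sketch of BioLORD-style positive-pair construction: one side is a
# randomly drawn name for a concept, the other a definition or generated
# description of that concept. `names_by_cui` and `texts_by_cui` are
# hypothetical lookups, not the paper's actual data structures.
import random

def make_name_text_pairs(names_by_cui, texts_by_cui):
    pairs = []
    for cui, texts in texts_by_cui.items():
        names = names_by_cui.get(cui)
        if not names:
            continue  # skip concepts with no known surface name
        for text in texts:
            # Pair a random name with each definition/description, so the
            # name's representation is grounded in the text's semantics.
            pairs.append((random.choice(names), text))
    return pairs
```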
We hypothesize that a definition or description of a given concept provides a more robust semantic anchor for this concept than another of its names. As mentioned before, names in the medical domain can be quite opaque, and do not always offer useful insights into what exactly is being referred to. By inducing representational similarity between a concept name and its known definitions, we aim to distill their respective knowledge into the representations of the concept names themselves. This key idea influenced some design choices for our experimental setup, including the choice of the data curation process, model initialization, and training procedure (as described in this section).
3.1 Curating definitions and descriptions
Around 5% of the concepts found in UMLS are clarified by one or more definitions. These definitions aim to provide the most relevant pieces of information about a given concept to the practitioners reading them, and we can therefore include them directly in our training set (see Fig. 1).

This is however insufficient, since most concepts have no matching definition in UMLS. Additionally, definitions might not always cover all the relevant aspects of a given concept, and the particular aspects they cover vary from one concept to another. Consequently, pairing concept names and their definitions, alone, cannot be expected to produce satisfactory results for all UMLS concepts.
We therefore supplement the definitions already available in UMLS with automatically generated textual descriptions, based on the structured information contained in the ontology and its 90M concept-to-concept relationships.

These concept descriptions are constructed using the following template: "[more-generic-concept] which [has-relationship-with] [related-concept]" (e.g., "drug which may treat headache").