BioLORD: Learning Ontological Representations from Definitions
for Biomedical Concepts and their Textual Descriptions
François Remy and Kris Demuynck and Thomas Demeester
The Internet and Data Science Lab (IDLab)
Ghent University (UGent) - imec, Belgium
francois.remy@ugent.be
Abstract
This work introduces BioLORD, a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts. State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, this approach sometimes results in non-semantic representations. BioLORD overcomes this issue by grounding concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph consisting of biomedical ontologies. Thanks to this grounding, our model produces more semantic concept representations that match more closely the hierarchical structure of ontologies. BioLORD establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (MayoSRS).
1 Introduction
Natural language processing models are well positioned to support healthcare providers by automatically extracting and synthesizing relevant information from clinical notes. For this, we introduce BioLORD, a pre-training strategy for end-to-end biomedical information extraction, capable of producing meaningful representations for biomedical terms and clinical sentences simultaneously.

This is achieved through the continued pre-training of an existing sentence embedding model, using contrastive learning and pairs consisting of the names and definitions of a given biomedical concept (see Fig. 1). This design choice proved crucial for the effectiveness of BioLORD, as it enables the transfer of knowledge from the definitions to the representation of biomedical names, thereby overcoming limitations of existing works (see §2.3) through a more effective usage of the knowledge contained in biomedical ontologies (see §2.1).

Indeed, to improve coverage and diversity, we supplemented definitions with textual descriptions generated from the numerous concept-to-concept relationships contained in biomedical ontologies.
Our key contributions are: (1) a versatile training strategy using dictionaries and knowledge graphs to create highly semantic representations for the key phrases of a domain; (2) an associated BioLORD model trained on the biomedical domain; (3) an extensive evaluation (§4) demonstrating its ability to provide semantic representations usable in a broad range of information extraction scenarios, including a new state of the art for Biomedical Concept Representation and Clinical Sentence Similarity; and (4) an in-depth analysis of the strengths and weaknesses of our proposed approach (§5).
2 Related Work
Let us first consider how prior works attempted to address the biomedical domain's usage of a large, specialized, and often opaque vocabulary (e.g., PAPA syndrome¹ or cat scratch disease²).
2.1 Biomedical ontologies
To condense this lexical knowledge in digital form, medical practitioners developed semi-structured concept hierarchies called biomedical ontologies, merging a dictionary and a knowledge graph.
SnomedCT (Systematized Nomenclature of Medicine and Clinical Terms) is one such ontology covering around 700k medical concepts in total and a small set of important relationships between these concepts (Schulz and Klein, 2008).
UMLS (Unified Medical Language System) bridges several biomedical ontologies to cover more than 4 million concepts, each with on average 4 listed names (Bodenreider, 2004). UMLS also contains around 90 million labeled concept-to-concept relationships of 900 different types.
¹ A hereditary inflammatory disorder affecting the skin.
² A bacterial skin infection caused by Bartonella henselae.
Figure 1: BioLORD aims to bring the representations of biomedical concept names and their definitions closer to each other, to ground the name representations with knowledge from the definitions. This is illustrated for the Ranitidine and Aspirin concepts from UMLS. Knowledge from the ontology's relational knowledge graph is injected by extending the set of known definitions with automatically generated descriptions. Each such description pairs a more generic concept with one relationship (of the described concept) and its related concept, thereby setting the described concept apart from the more generic one. Contrastive learning is applied to attract the representations of compatible pairs and repel incompatible ones (obtained as in-batch negatives).
2.2 Contrastive Learning Strategies
On the machine learning side, efforts in the tasks of named entity recognition (NER) and normalization (NEL) are strongly influenced by the challenges posed by such a large and specialized vocabulary. In recent years, approaches using ontologies through string-based pattern matching, such as MetaMap (Aronson, 2001), have been consistently outperformed by newer works relying on contrastive learning with Transformers.
BioSyn (Sung et al., 2020) was the first model to introduce the idea of contrastive learning to produce embeddings of biomedical concepts. It takes existing NEL benchmarks and proposes to use their training sets in a contrastive manner. An encoder model initialized with BioBERT (Lee et al., 2020) is trained to produce embeddings for batches of concept names (grouped by pairs referring to the same concept). A contrastive loss is then applied to ensure that the embeddings of synonyms are significantly closer to each other than they are to the other names in the batch, which refer to other concepts. After pre-training, the model can be fine-tuned for the end task of NEL using cross-entropy training.
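To make this training signal concrete, the following is a minimal PyTorch sketch of an in-batch contrastive loss over synonym pairs; it is an illustrative InfoNCE-style formulation under our assumptions (including the temperature value), not necessarily BioSyn's exact published objective.

```python
# Minimal sketch (PyTorch) of in-batch contrastive learning over synonym
# pairs. Illustrative only: the temperature and the exact loss form are
# assumptions, not BioSyn's published training objective.
import torch
import torch.nn.functional as F

def synonym_contrastive_loss(anchor_emb, positive_emb, temperature=0.05):
    """anchor_emb, positive_emb: (batch, dim) embeddings of two names
    referring to the same concept, aligned row by row."""
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    # Cosine-similarity logits: row i should match column i; all other
    # rows in the batch serve as in-batch negatives for row i.
    logits = anchor @ positive.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```

Minimizing this loss pulls the two embeddings of each synonym pair together while pushing them away from the embeddings of the other concepts sampled in the same batch.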
SapBERT (Liu et al., 2021) was the first large-scale contrastive model to leverage UMLS. Just like BioSyn, it produces embeddings for biomedical concept names, without considering the context they are used in. But, unlike BioSyn, it is based on PubMedBERT (Gu et al., 2020) and uses the synonyms defined for concepts in UMLS to form the training pairs. This enables the model to contrast millions of entries, many more than BioSyn.
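As an illustration of this pair-construction step, here is a hedged sketch; the `umls_names` mapping and the per-concept cap are hypothetical simplifications, whereas the real pipeline is built from the UMLS release files.

```python
# Sketch of forming SapBERT-style positive pairs from UMLS synonym lists.
# `umls_names` maps a concept identifier (CUI) to its known names; the
# mapping, the example CUI, and the per-concept cap are illustrative.
import itertools
import random

def make_synonym_pairs(umls_names, max_pairs_per_concept=50):
    pairs = []
    for cui, names in umls_names.items():
        if len(names) < 2:
            continue  # a concept needs at least two names to form a pair
        candidates = list(itertools.combinations(names, 2))
        random.shuffle(candidates)
        for name_a, name_b in candidates[:max_pairs_per_concept]:
            pairs.append((cui, name_a, name_b))
    return pairs

# Example: two names of the same concept yield one positive pair.
pairs = make_synonym_pairs({"C0000001": ["aspirin", "acetylsalicylic acid"]})
```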
BIOCOM (Ujiie et al., 2021) and KRISSBERT (Zhang et al., 2021) independently extended this approach in a similar way, by noting the need for context-based disambiguation for some entities. For each UMLS concept, sentences mentioning the concept are collected from PubMed articles. These sentences are used as context during training.
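The context-collection step can be pictured with a small sketch; the regex-based name matching and naive sentence splitting below are simplifications for exposition, not the actual BIOCOM or KRISSBERT pipelines.

```python
# Sketch of collecting context sentences that mention a concept, in the
# spirit of BIOCOM/KRISSBERT. The regex matching and naive sentence
# splitting are illustrative simplifications of the real pipelines.
import re

def collect_context_sentences(concept_names, documents):
    """Return sentences from `documents` mentioning any of `concept_names`."""
    patterns = [re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
                for name in concept_names]
    contexts = []
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            if any(p.search(sentence) for p in patterns):
                contexts.append(sentence)
    return contexts
```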
Figure 2: Concept mapping sometimes requires considering the entire sentence, rather than mentions.
Figure 3: In SapBERT's latent space, none of the nearest neighbors of "apyrexial" (i.e., fever-free) happen to share the word's meaning. Instead, the alpha-privative was over-indexed by the model, among other biases.
2.3 Challenges with existing models
BIOCOM and KRISSBERT propose to disambiguate mentions of biomedical concepts using contextual information. Ambiguous notations requiring context to disambiguate can indeed be found in clinical notes. However, using these contextual models for inference is only possible after identifying text spans denoting such concepts in the input text. This requires introducing a mention detection model, which comes with its own challenges and errors. Worse, reducing mentions to text spans is not always possible, as concepts are sometimes alluded to in a diffuse way (see Fig. 2).
However, models which do not use in-context mentions usually learn representations of lower quality than in-context models. By pairing synonyms with a significant word or token overlap with each other, these models isolate concepts containing rare words or tokens early in the training, in a way that is rarely semantic (see Fig. 3). Indeed, the training loss of contrastive models only requires placing all mentions of a particular concept close to each other, but it does not provide strong guarantees about the relative location of different but similar concepts in the latent space.
While hierarchical relationships from medical ontologies have sometimes been used to produce more meaningful concept embeddings (Zhang et al., 2021), this is not sufficient to overcome the issues stated above, because relatedness cannot always be encoded hierarchically.
3 Pre-training methodology
To produce representations of biomedical concepts that overcome the limitations described above, we modified the way the positive pairs are constructed. Like the prior works cited in §2.2, we start by establishing a list of names for each UMLS concept. However, unlike previous works, we do not use these names directly to form positive pairs. Instead, we construct pairs formed with, on the one side, a randomly selected name for a given concept and, on the other side, a definition or description for that concept (see Fig. 1).
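To illustrate this construction, a minimal sketch follows; `names_by_cui` and `texts_by_cui` are hypothetical lookup tables standing in for the curated UMLS names and the pool of definitions and generated descriptions described in §3.1.

```python
# Sketch of BioLORD-style positive-pair construction: one side is a
# randomly drawn name for a concept, the other a definition or generated
# description of that concept. `names_by_cui` and `texts_by_cui` are
# hypothetical lookups, not the paper's actual data structures.
import random

def make_name_text_pairs(names_by_cui, texts_by_cui):
    pairs = []
    for cui, texts in texts_by_cui.items():
        names = names_by_cui.get(cui)
        if not names:
            continue  # skip concepts with no known surface name
        for text in texts:
            # Pair a random name with each definition/description, so the
            # name's representation is grounded in the text's semantics.
            pairs.append((random.choice(names), text))
    return pairs
```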
We hypothesize that a definition or description of a given concept provides a more robust semantic anchor for this concept than another of its names. As mentioned before, names in the medical domain can be quite opaque, and do not always offer useful insights into what exactly is being referred to. By inducing representational similarity between a concept name and its known definitions, we aim to distill their respective knowledge into the representations of the concept names themselves. This key idea influenced some design choices for our experimental setup, including the choice of the data curation process, model initialization, and training procedure (as described in this section).
3.1 Curating definitions and descriptions
Around 5% of the concepts found in UMLS are clarified by one or more definitions. These definitions aim to provide the most relevant pieces of information about a given concept to the practitioners reading them, and we can therefore include them directly in our training set (see Fig. 1).

This is however insufficient, since most concepts have no matching definition in UMLS. Additionally, definitions might not always cover all the relevant aspects of a given concept, and the particular aspects they cover vary from one concept to another. Consequently, pairing concept names and their definitions, alone, cannot be expected to produce satisfactory results for all UMLS concepts.
We therefore supplement the definitions already available in UMLS with automatically generated textual descriptions, based on the structured information contained in the ontology and its 90M concept-to-concept relationships.

These concept descriptions are constructed using the following template: "[more-generic-concept] which [has-relationship-with] [related-concept]" (e.g., "drug which may treat headache").