Adapters for Enhanced Modeling of Multilingual Knowledge and Text
Yifan Hou1, Wenxiang Jiao2, Meizhen Liu3, Carl Allen1, Zhaopeng Tu2, Mrinmaya Sachan1
1ETH Zürich, 2Tencent AI Lab, 3Shandong University
1{yifan.hou, carl.allen, mrinmaya.sachan}@inf.ethz.ch
2{joelwxjiao, zptu}@tencent.com, 3meizhen.liu@mail.sdu.edu.cn
Abstract
Large language models appear to learn facts from the large text corpora they are trained on. Such facts are encoded implicitly within their many parameters, making it difficult to verify or manipulate what knowledge has been learned. Language models have recently been extended to multilingual language models (MLLMs), enabling knowledge to be learned across hundreds of languages. Meanwhile, knowledge graphs contain facts in an explicit triple format, which require careful and costly curation and are only available in a few high-resource languages, restricting their research and application. To address these issues, we propose to enhance MLLMs with knowledge from multilingual knowledge graphs (MLKGs) so as to tackle language and knowledge graph tasks across many languages, including low-resource ones. Specifically, we introduce a lightweight adapter set to enhance MLLMs with cross-lingual entity alignment and facts from MLKGs for many languages. Experiments on common benchmarks show that such enhancement benefits both MLLMs and MLKGs, achieving: (1) comparable or improved performance for knowledge graph completion and entity alignment relative to baselines, especially for low-resource languages (for which knowledge graphs are unavailable); and (2) improved MLLM performance on language understanding tasks that require multilingual factual knowledge; all while maintaining performance on other general language tasks.1

1 Our code, models, and data (e.g., integration corpus and extended datasets) are available at https://github.com/yifan-h/Multilingual_Space.
1 Introduction
Knowledge graphs serve as a source of explicit factual information for various NLP tasks. However, language models (Devlin et al., 2019; Brown et al., 2020), which capture implicit knowledge from vast text corpora, are already being used in knowledge-intensive tasks. Recently, language models have been successfully extended to multilingual language models (MLLMs) that integrate information sourced across hundreds of languages (Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020). However, as with most neural networks, the information is encoded in a diffused and opaque manner that is difficult to interpret, verify or utilize (AlKhamissi et al., 2022).

Figure 1: Combining MLLMs and MLKGs benefits both: MLKGs suffer from incompleteness and are limited to few languages, which MLLMs can supplement. MLLMs lack entity alignment and firm facts, which MLKGs can provide.
Meanwhile, multilingual knowledge graphs (MLKGs) require careful curation of explicit facts and annotation of entities that occur across languages (cross-lingual entity alignment), making knowledge graphs expensive and time-consuming to extend to new languages and restricting knowledge graph research to a few high-resource languages. Further, open-source MLKGs such as WordNet (Bond and Foster, 2013) and Wikidata (Vrandečić and Krötzsch, 2014) suffer from incompleteness, as many true facts (or triples) and entity alignments are missing (Chen et al., 2017, 2020).
In this work, we propose to overcome the above limitations of each knowledge source by integrating MLKGs into MLLMs (as shown in Figure 1), to enable (i) the transfer of MLKG knowledge from high-resource languages to low-resource languages; and (ii) explicit knowledge of MLKGs to supplement MLLMs for knowledge-intensive language tasks, one of the key challenges in MLLMs (AlKhamissi et al., 2022).
While this idea seems intuitive, there is no easy way to incorporate the explicit knowledge of MLKGs into the parametrically stored information of MLLMs. Existing knowledge integration methods utilize language models and knowledge graphs in two ways: (1) training knowledge graph embeddings individually and combining the embeddings corresponding to linked entities in sentences with the language model representations (e.g., KnowBERT (Peters et al., 2019) and ERNIE (Zhang et al., 2019)); or (2) absorbing the knowledge in knowledge graphs into the language model's parameters via joint training (e.g., K-BERT (Liu et al., 2020) and K-Adapter (Wang et al., 2021)).

The first method requires embedding knowledge graph entities and accurately extracting entities in sentences across hundreds of languages, which is highly challenging. The second method typically suffers from the curse of multilinguality (Conneau et al., 2020; Doddapaneni et al., 2021; Jiao et al., 2022) and catastrophic forgetting (Kirkpatrick et al., 2016) due to limited model capacity. Most importantly, both methods integrate knowledge implicitly, such that it is difficult to access and extend to low-resource languages (AlKhamissi et al., 2022). Furthermore, both methods require large sets of aligned sentences and knowledge triples, which are costly to gather and accurately annotate across hundreds of languages.
To address the above issues, we first collect and clean multilingual data from Wikidata2 and Wikipedia3 for the enhancement, where rich factual knowledge and cross-lingual alignments are available. Then, we propose to enhance MLLMs with the MLKG information by using a set of adapters (Houlsby et al., 2019), which are lightweight, collectively adding only around 0.5% extra parameters to the MLLM. Each adapter integrates information from either MLKG Triples (i.e., facts) or cross-lingual Entity alignments, and is trained on either Phrase- or Sentence-level data. Each of the resulting four adapters (EP/TP/ES/TS) is trained individually to learn information supplemental to that already learned by the MLLM. Adapter outputs are combined by a fusion mechanism (Pfeiffer et al., 2021). The training objectives are similar to those used for MLKG embedding (Chen et al., 2017) rather than masked language modeling, which makes them more efficient on large corpora.

2 https://www.wikidata.org/wiki/Wikidata:Main_Page
3 https://en.wikipedia.org/wiki/Main_Page
We conduct experiments on various downstream tasks to demonstrate the effectiveness of our approach. For MLKG tasks, following the data collection methods of two existing benchmarks (Chen et al., 2020, 2017), we extended them from 2-5 languages to 22 languages, including two rare languages.4 Results show that our method obtains comparable performance to existing state-of-the-art baselines on the knowledge graph completion benchmark, and significantly better performance on the entity alignment benchmark. More importantly, we can perform these knowledge graph tasks in low-resource languages for which no knowledge graph exists, and achieve results comparable to those for high-resource languages. Improvements over baseline MLLMs are significant. The results demonstrate that our proposed method integrates the explicit knowledge from MLKGs into MLLMs such that it can be used across many languages. Our method also improves existing MLLMs noticeably on knowledge-intensive language tasks, such as cross-lingual relation classification, whilst maintaining performance on general language tasks such as named entity recognition (NER) and question answering (QA).

4 The extended datasets as well as the KI corpus are published with our code implementation.
2 Multilingual Knowledge Integration
In this paper, we fuse knowledge from an MLKG into an MLLM. Following previous works (Wang et al., 2021; Liu et al., 2021), we make use of an entity-tagged corpus of text (called a knowledge integration corpus) for knowledge integration. We formally introduce these concepts below.
MLLM. A multilingual LM can be thought of as an encoder that can represent text in any language $l$ in a set of languages $\mathcal{L}$. Let $\mathcal{V}$ denote the shared vocabulary over all languages, and let $t^l \in \mathcal{V}$ denote a token in language $l$. A sentence $s^l$ in language $l$ can be denoted as a sequence of tokens: $s^l = (t^l_1, t^l_2, \ldots)$. The output representations of the MLLM for $s^l$ can be denoted by a sequence of vectors: $\mathrm{LM}(s^l) = (\mathbf{h}_1, \mathbf{h}_2, \ldots)$. These vectors correspond to representations for each token in the sentence, one representation per input token. Various tokenization schemes such as WordPiece or BPE might be considered here. We use the average of the token representations as the representation of the sentence: $\overline{\mathrm{LM}}(s^l) = \mathrm{mean}(\mathbf{h}_1, \mathbf{h}_2, \ldots)$. Similarly, for a phrase $s^l_{ij}$ (starting from the $i$-th token and ending in the $j$-th token of the sentence), we can obtain its contextualized representation as $\overline{\mathrm{LM}}(s^l_{ij}) = \mathrm{mean}(\mathbf{h}_i, \mathbf{h}_{i+1}, \ldots, \mathbf{h}_j)$.
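To make the pooling concrete, the following is a minimal sketch (our own illustration, not the authors' released code) of how such mean-pooled sentence and phrase representations can be obtained from an off-the-shelf multilingual encoder; the choice of `bert-base-multilingual-cased` is an assumption, and the span indices refer to positions in the tokenized sequence.

```python
# A minimal sketch (not the authors' code): mean-pooled sentence and phrase
# representations from a generic multilingual encoder.
import torch
from transformers import AutoModel, AutoTokenizer

NAME = "bert-base-multilingual-cased"   # assumed encoder; the paper's MLLM may differ
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModel.from_pretrained(NAME)

def sentence_repr(sentence: str) -> torch.Tensor:
    """mean(h_1, h_2, ...): average of all token representations."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state      # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)             # (dim,)

def phrase_repr(sentence: str, i: int, j: int) -> torch.Tensor:
    """mean(h_i, ..., h_j): average over a token span (inclusive)."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state
    return hidden[0, i : j + 1].mean(dim=0)          # (dim,)
```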
MLKG. A multilingual knowledge graph is a graph with entities and knowledge triples in each language $l \in \mathcal{L}$. Let $\mathcal{E}$ denote the set of entities and $\mathcal{T}$ denote the set of knowledge triples. In an MLKG, each entity indexed $i$ might appear in several languages. Let $e^l_i$ denote the entity label of the $i$-th entity in language $l$. Furthermore, we denote a knowledge triple in the MLKG as $(e^l_i, r^{l''}_k, e^{l'}_j) \in \mathcal{T}$, where $r^{l''}_k$ is the $k$-th relation. Note that since entities (as well as relations) may appear in various languages under different labels, knowledge triples can be defined across languages.
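For illustration only, the per-language entity labels and cross-language triples described above can be represented with plain dictionaries; the identifiers and labels below simply mirror the running example of Figure 1 and are not a sample of the actual MLKG used in the paper.

```python
# Toy MLKG fragment (illustrative only): per-language entity labels, and a
# triple whose elements carry labels from different languages.
entity_labels = {
    "Q72": {"en": "Zurich", "it": "Zurigo", "zh": "苏黎世"},
    "Q39": {"en": "Switzerland", "it": "Svizzera", "zh": "瑞士"},
}
relation_labels = {
    "P131": {"en": "is located in", "it": "si trova a"},
}

# (e_i^l, r_k^{l''}, e_j^{l'}): the three language codes need not coincide.
triples = [
    (("Q72", "en"), ("P131", "it"), ("Q39", "it")),   # Zurich -- si trova a --> Svizzera
]
```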
Knowledge Integration Corpus. For knowledge integration, besides the MLKG, we make use of a corpus of text $\mathcal{C}$ (as shown in the right part of Figure 2). The corpus $\mathcal{C}$ comprises two kinds of texts. First, we have a set of texts $\mathcal{C}_1$ for cross-lingual entity alignment, which comprises sentences with mentions of entities in the MLKG. For example, in Figure 2, given the sentence "De Botton spent his early years in Zurich", we have the aligned entity Zurich and its cross-lingual labels. The second set of texts, $\mathcal{C}_2$, is for knowledge triples and comprises sentences aligned with knowledge triples in the MLKG. For example, in Figure 2, given the sentence "Zurich is the largest city in Switzerland", we have its aligned knowledge triple (Zurich, is located in, Switzerland).
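A hypothetical slice of such a corpus might look like the following; the field names are ours, and the released corpus format may differ.

```python
# Hypothetical examples of the two corpus parts (field names are ours).

# C1: a sentence with a tagged entity mention and its cross-lingual labels.
c1_sentence = "De Botton spent his early years in Zurich."
start = c1_sentence.index("Zurich")
c1_example = {
    "sentence": c1_sentence,
    "mention_span": (start, start + len("Zurich")),    # character offsets
    "entity_labels": {"en": "Zurich", "it": "Zurigo", "zh": "苏黎世"},
}

# C2: a sentence aligned with a knowledge triple.
c2_example = {
    "sentence": "Zurich is the largest city in Switzerland.",
    "triple": ("Zurich", "is located in", "Switzerland"),
}
```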
3 Adapters and Adapter Fusion

In this section, we first describe how we incorporate adapters into language models and how they can be used to enhance them with different sources of knowledge from knowledge graphs.
Adapter. Adapters have become a popular choice for parameter-efficient finetuning of language models on downstream tasks (Houlsby et al., 2019) due to their flexibility, effectiveness, low cost and scalability (Pfeiffer et al., 2021). Adapters are new modules that are added between layers of language models5, the parameters of which are updated only during finetuning while the language model parameters are frozen. An adapter is a bottleneck layer composed of two feed-forward layers with one non-linear activation function. For $\mathbf{h}^m$, the hidden representation of token $t^l_i$ at layer $m$, the adapter acts as

$$A(\mathbf{h}^m) = W_{up} \cdot \sigma(W_{down} \cdot \mathbf{h}^m + b_{down}) + b_{up}. \qquad (1)$$

Here, $W_{down}$ and $W_{up}$ are weight matrices, which map the hidden representations to the low-dimensional space and then map them back; $b_{down}$ and $b_{up}$ are bias parameters, and $\sigma$ is a nonlinear activation function.

5 Where to insert adapters is flexible, but a common choice is after the feedforward layer of a transformer layer.
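Equation (1) describes a standard bottleneck module. A minimal PyTorch sketch (ours; the hidden and bottleneck sizes and the choice of activation are assumptions) is:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Eq. (1): A(h) = W_up * sigma(W_down * h + b_down) + b_up."""

    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 48):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)   # W_down, b_down
        self.up = nn.Linear(bottleneck_dim, hidden_dim)     # W_up, b_up
        self.act = nn.GELU()                                # sigma (assumed choice)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(h)))
```

During knowledge integration, only parameters of this kind (and the fusion layer below) would receive gradients while the MLLM weights stay frozen; the small bottleneck dimension is what keeps the extra parameter count low.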
Adapter Fusion. We follow the architecture of Pfeiffer et al. (2021), but instead of using adapters for finetuning, we use them to enhance MLLMs with knowledge. Our approach is similar to Wang et al. (2021), but our adapters supplement and augment the existing implicit knowledge of MLLMs (encoding it into the explicit geometric properties of hidden representations), and our approach is more lightweight, with only c. 0.5% additional parameters (cf. >10% in Wang et al. (2021)).

As shown in Figure 2 (left), still considering the $m$-th layer, the output representations of the feedforward layer (denoted $\mathbf{h}^m$ as in Eq. 1) are input to the adapters. A fusion layer aggregates all adapter outputs $A_n(\mathbf{h}^m)$ (where $n \in \{1, \ldots, N\}$ indexes each adapter) and the un-adapted representations with a multiplicative attention mechanism:

$$A_{fusion}(\mathbf{h}^m) = \sum_{n=0}^{N} a^m_n \cdot V^m \cdot A_n(\mathbf{h}^m), \qquad a^m_n = \mathrm{softmax}\big(\mathbf{h}^m Q^m \odot A_n(\mathbf{h}^m) K^m\big).$$

Here, $A_0(\cdot)$ is the identity function; $Q^m$, $K^m$, $V^m$ are parameters in the multiplicative attention mechanism; and $\odot$ is the Hadamard product.
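Read literally, the fusion computes an attention distribution over the identity output and the $N$ adapter outputs. One possible PyTorch rendering (ours; the exact parameterization in the paper or in AdapterFusion may differ, and the Hadamard product in the score is summed here, i.e. used as a dot-product score) is:

```python
import torch
import torch.nn as nn

class AdapterFusionLayer(nn.Module):
    """Sketch of the fusion layer: attention over N adapter outputs plus the
    un-adapted representation (n = 0), following one reading of the equation."""

    def __init__(self, adapters: nn.ModuleList, hidden_dim: int = 768):
        super().__init__()
        self.adapters = adapters                              # e.g. [EP, TP, ES, TS]
        self.Q = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.K = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.V = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Candidates: identity (A_0) plus each adapter's output.
        cands = [h] + [adapter(h) for adapter in self.adapters]
        stacked = torch.stack(cands, dim=-2)                  # (..., N+1, dim)
        query = self.Q(h).unsqueeze(-2)                       # (..., 1, dim)
        keys = self.K(stacked)                                # (..., N+1, dim)
        scores = (query * keys).sum(dim=-1)                   # (..., N+1)
        attn = torch.softmax(scores, dim=-1).unsqueeze(-1)    # (..., N+1, 1)
        return (attn * self.V(stacked)).sum(dim=-2)           # (..., dim)
```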
The additional knowledge to be learned by the adapters comes from knowledge Triples and Entity alignments, each provided in both Phrase and Sentence format (hence $N = 2 \times 2 = 4$). As shown in Figure 2 (center), for a given entity in two languages $l$ and $l'$, Adapter-EP learns to align the two (multilingual) representations of $e^l_i$ and $e^{l'}_i$, e.g., Zurich is aligned with Zurigo.
Adapter-TP learns knowledge triples, e.g., predicting Switzerland given the entity and relation (Zurich, is located in, ·). Besides these non-contextualized settings, entities within context can also be considered (i.e., sentences from the knowledge integration corpus). Thus, Adapter-ES and Adapter-TS have similar objectives but use contextualized representations from input sentences.

Figure 2: The architecture of MLLMs with adapters and their roles. We enhance multilingual and factual knowledge at the phrase and sentence levels using different knowledge integration corpora.
4 Knowledgeable Adapters

Next, we design objectives with corresponding knowledge integration datasets to train a set of adapters. Similar to MLKG embedding (Chen et al., 2017), we aim to encode knowledge into the geometric properties of the adapted MLLM representations, i.e., the MLLM and adapters collectively act as an MLKG embedding model. Specifically, we use cosine distance within the contrastive learning loss of InfoNCE (van den Oord et al., 2018):

$$\mathcal{I}_{NCE}(\mathbf{x}, \mathbf{x}') = \log \frac{\cos(\mathbf{x}, \mathbf{x}')}{\sum_{\mathbf{x}'' \in X} \cos(\mathbf{x}, \mathbf{x}'')},$$

where $X$ is a batch that includes the positive sample $\mathbf{x}'$ and a number of negative samples.6

6 We use in-batch negative sampling, where entities (with labels in any language) in the batch are randomly selected.
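For reference, a common in-batch implementation of this kind of objective looks as follows; note that this sketch uses the standard exponentiated, temperature-scaled InfoNCE form rather than the raw cosine ratio written above, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE over cosine similarities.

    anchor, positive: (batch, dim). Row i of `positive` is the positive
    sample for row i of `anchor`; all other rows act as in-batch negatives.
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    sims = a @ p.t() / temperature          # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(sims, targets)   # -log softmax of the diagonal entries
```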
Adapter-EP. We use Wikidata (Vrandečić and Krötzsch, 2014) to enhance MLLMs with the knowledge of cross-lingual entity alignments. Inspired by the idea that languages are aligned implicitly in a universal space in MLLMs (Wu and Dredze, 2019; Wei et al., 2021), we train the aligned entities to have closer representations. Denoting the MLLM with this adapter as $\mathrm{LM}(\cdot)$, the objective used to train EP is:

$$\mathcal{L}_{EP} = \sum_{(e^l_i, e^{l'}_i) \in \mathcal{E}} \mathcal{I}_{NCE}\big(\overline{\mathrm{LM}}(e^l_i), \overline{\mathrm{LM}}(e^{l'}_i)\big),$$

where $\overline{\mathrm{LM}}(\cdot)$ means we take the mean of the token representations as the entity representation vector.
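Putting the pieces together, one EP training step might look like the sketch below, where `mllm_with_adapter` stands for a hypothetical MLLM with the EP adapter as its only trainable component (an HF-style interface is assumed) and `info_nce` is the loss sketched above.

```python
import torch

def masked_mean(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean over real (non-padding) token positions."""
    mask = attention_mask.unsqueeze(-1).type_as(hidden)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

def ep_step(mllm_with_adapter, tokenizer, entity_pairs, optimizer):
    """entity_pairs: list of (label in language l, label in language l') strings."""
    left = tokenizer([a for a, _ in entity_pairs], padding=True, return_tensors="pt")
    right = tokenizer([b for _, b in entity_pairs], padding=True, return_tensors="pt")
    h_l = masked_mean(mllm_with_adapter(**left).last_hidden_state, left["attention_mask"])
    h_r = masked_mean(mllm_with_adapter(**right).last_hidden_state, right["attention_mask"])
    loss = info_nce(h_l, h_r)       # other entities in the batch act as negatives
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                # only adapter parameters are trainable
    return loss.item()
```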
Adapter-TP. We train this adapter using the knowledge triples in Wikidata. Inspired by previous knowledge graph embedding algorithms (e.g., Bordes et al., 2013), for a given fact triple, we train the (adapted) object entity embedding to be close to the (adapted) joint embedding of the subject entity and relation. The objective used to train TP is quite different from existing masked language modeling-based ones:

$$\mathcal{L}_{TP} = \sum_{(e^l_i, r^{l''}_k, e^{l'}_j) \in \mathcal{T}} \mathcal{I}_{NCE}\big(\overline{\mathrm{LM}}([e^l_i; r^{l''}_k]), \overline{\mathrm{LM}}(e^{l'}_j)\big),$$

where $[\,;\,]$ denotes text concatenation. Note that we apply code-switching (Liu et al., 2021), and thus entities and relations can be in different languages. This is helpful for capturing knowledge triples for low-resource languages.
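A TP step can be sketched analogously: the subject and relation labels are concatenated as text, with languages sampled independently per element as a simple stand-in for code-switching (the actual procedure of Liu et al. (2021) may differ). The helpers `info_nce` and `masked_mean` are the ones from the earlier sketches, and the data layout follows the toy MLKG fragment above.

```python
import random

def tp_step(mllm_with_adapter, tokenizer, triples, labels, optimizer):
    """triples: list of (subject_id, relation_id, object_id);
    labels: dict id -> {language_code: surface form}."""
    heads, tails = [], []
    for s, r, o in triples:
        ls, lr, lo = (random.choice(list(labels[x])) for x in (s, r, o))
        heads.append(f"{labels[s][ls]} {labels[r][lr]}")    # [e_i ; r_k] as text
        tails.append(labels[o][lo])
    enc_h = tokenizer(heads, padding=True, return_tensors="pt")
    enc_t = tokenizer(tails, padding=True, return_tensors="pt")
    h = masked_mean(mllm_with_adapter(**enc_h).last_hidden_state, enc_h["attention_mask"])
    t = masked_mean(mllm_with_adapter(**enc_t).last_hidden_state, enc_t["attention_mask"])
    loss = info_nce(h, t)           # object embeddings of other triples act as negatives
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```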
Adapter-ES. Entity alignment can also be applied to contextualized embeddings produced by the MLLM when entities are input within natural language sentences. For this purpose, we use summaries taken from multilingual Wikipedia. Specifically, we first align each entity in Wikidata with its Wikipedia title, and extract the sentences in its summary that contain the entity label. As described earlier, we denote this corpus as $\mathcal{C}_1$. Thus, similar to Adapter-EP, we train ES by aligning the contextualized entity representations of cross-lingually aligned entities.