A Unified Encoder-Decoder Framework with Entity Memory
Zhihan Zhang1, Wenhao Yu1, Chenguang Zhu2, Meng Jiang1
1University of Notre Dame, Notre Dame, IN, USA
2Microsoft Cognitive Services Research, Redmond, WA, USA
1{zzhang23, wyu1, mjiang2}@nd.edu;2chezhu@microsoft.com
Abstract
Entities, as important carriers of real-world knowledge, play a key role in many NLP tasks. We focus on incorporating entity knowledge into an encoder-decoder framework for informative text generation. Existing approaches tried to index, retrieve, and read external documents as evidence, but they suffered from a large computational overhead. In this work, we propose an Encoder-Decoder framework with an entity Memory, namely EDMem. The entity knowledge is stored in the memory as latent representations, and the memory is pre-trained on Wikipedia along with encoder-decoder parameters. To precisely generate entity names, we design three decoding methods to constrain entity generation by linking entities in the memory. EDMem is a unified framework that can be used on various entity-intensive question answering and generation tasks. Extensive experimental results show that EDMem outperforms both memory-based auto-encoder models and non-memory encoder-decoder models.¹
1 Introduction
A large amount of real-world knowledge is related to entities, e.g., persons, nations, and events. Entity knowledge is the information describing facts and attributes related to entities. Many entity-intensive NLP tasks require models to obtain entity knowledge in order to generate informative outputs, such as answering factual questions (Kwiatkowski et al., 2019), explaining claims (Onoe et al., 2021), or making informative conversations (Dinan et al., 2019). Pre-trained encoder-decoder models can be directly applied to such entity-intensive tasks (Ye et al., 2020; Roberts et al., 2020), but their ability to store and use knowledge is still questionable (Lewis et al., 2021; Wang et al., 2021).
¹Code will be available at https://github.com/DM2-ND/EDMem
Figure 1: An overview of the EDMem framework. The encoder (Transformer layers) reads a question such as "Who wrote [Es] Evening Class [Ee]?"; both the encoder and the decoder access the Entity Memory from their lower layers; the decoder takes "[BOS] [Es] Maeve Binchy" and produces "[Es] Maeve Binchy [Ee]" through a Language Modeling head, with an Entity Linking head attached. H denotes the final hidden states of the encoder.
A popular approach to incorporating knowledge into the generation process is retrieving evidence documents from external sources (Lewis et al., 2020b; Izacard and Grave, 2021; Oguz et al., 2020; Yu et al., 2022c). However, such models suffer from significant computational overheads in indexing, retrieving, and reading a large number of extra documents (Lee et al., 2021; de Jong et al., 2022). Therefore, it is important to give encoder-decoder models access to entity knowledge without sacrificing too much efficiency.

Recently, it has been proposed to use an in-model memory to augment auto-encoder models with entity knowledge on entity linking tasks (Févry et al., 2020; Verga et al., 2021; Sun et al., 2021). The entity memory stores entity knowledge as dense vectors which can be directly incorporated into the hidden states of Transformer models (Vaswani et al., 2017), with no need to encode extra text. However, the auto-encoder framework in previous approaches can only select entities from a pre-defined entity vocabulary. Hence, these models are not able to produce an entity outside the vocabulary, nor to generate answers or text beyond a single entity.
In this paper, we propose a novel Encoder-Decoder framework with an entity Memory (EDMem), as shown in Figure 1. EDMem is a unified framework for various entity-intensive QA and generation tasks, in which we train an entity memory for efficient knowledge incorporation. First, EDMem is pre-trained on Wikipedia documents, where it learns entity embeddings in the memory along with an encoder-decoder model. EDMem learns to select relevant entities from the memory via an entity linking objective, and learns to generate answers using entity knowledge via a language modeling objective. Second, to precisely generate entity names, we design three decoding methods that utilize the entity linking ability of EDMem in its generation process when we fine-tune it on downstream tasks. These include (1) free-form generation: left-to-right generation with entity identifiers; (2) static entity linking: first select entities by entity linking, build prefix trees for the selected entities, and then perform constrained entity generation using the trees; (3) dynamic entity linking: select entities on-the-fly for constrained entity generation.
We conduct experiments on two popular testbeds of entity knowledge: open-domain QA and entity-intensive generation. With the incorporation of entity knowledge, EDMem outperforms non-memory encoder-decoder models on both tasks, and it retains the efficiency advantage of closed-book (i.e., non-retrieval) models. Compared to memory-based auto-encoders, EDMem achieves both higher overall accuracy (+9%) and better entity precision (+8%) on open-domain QA datasets, and it generates high-quality text from the memory-supported decoder on generation datasets where auto-encoders fail to do so. To summarize, EDMem is the first knowledge-augmented closed-book framework to perform both tasks in a unified manner.
2 Related Work
Closed-Book Models
Closed-book models are pre-trained models that store knowledge in their own parameters. For example, COMET (Bosselut et al., 2019) fine-tuned GPT2 (Radford et al., 2018) to construct knowledge graphs by generating commonsense triples. More recently, fine-tuned BART (Lewis et al., 2020a) and T5 (Raffel et al., 2020) models have proven to be competitive on open-domain QA (Ye et al., 2020; Roberts et al., 2020). Therefore, closed-book models are able to memorize some entity knowledge after being pre-trained on massive data. However, studies showed that closed-book models merely recalled similar inputs and answers from their pre-training corpus (Wang et al., 2021), and their performance lagged behind open-book models.
Open-Book Models
Open-book models first retrieve evidence documents from external corpora and then read these documents to predict an answer (Chen et al., 2017). REALM (Guu et al., 2020) proposed a self-supervised approach to pre-train a retriever-reader model. DPR (Karpukhin et al., 2020) devised a contrastive objective to train a dense bi-encoder retriever on open-domain QA. Subsequent approaches combined DPR with a generative objective to build large, powerful models for open-domain QA and generation tasks (Lewis et al., 2020b; Izacard and Grave, 2021; Sachan et al., 2021; Yu et al., 2022a). However, open-book models have to process the raw text of all retrieved documents, which leads to extremely long inference time. In addition, further overhead comes from loading the document index and retrieving evidence documents for each example.
Entity Memory
EaE (Févry et al., 2020) was the first to pre-train an entity memory with an auto-encoder framework to perform entity prediction on open-domain QA. FILM (Verga et al., 2021) followed EaE and added a fact memory containing representations of Wikidata triples. To better encode relational knowledge, OPQL (Sun et al., 2021) learned latent relational representations for arbitrary entity pairs. Recent work focused on learning a huge mention-level memory (~150M entries) with extensive pre-training (de Jong et al., 2022) or on leveraging the entity memory in domain-adaptive training (Kang et al., 2022). These models are all based on an auto-encoder framework; thus, they can predict entity IDs but fail to generate any non-entity answers or sentences. A contemporaneous preprint trained a memory with an encoder-decoder model (Chen et al., 2022); however, it used QA pairs as memory entries instead of entities, limiting its application to QA tasks. Besides, its memory is much heavier (60M entries) than ours (1M).
3 Proposed Framework
Suppose we have a pre-defined vocabulary of $N$ entities $\mathcal{E} = \{e_1, \ldots, e_N\}$. A mention is the actual span of tokens in context which refers to an entity. The set of all mentions in the corpus is denoted as $\mathcal{M}$. Thus, there is a global alias table $\mathcal{T}: \mathcal{E} \rightarrow 2^{\mathcal{M}}$, where each entity is mapped to all its mentions. The input of EDMem is a sequence of tokens $x$ of length $S$, and the target output is another sequence $y = [y_1, \cdots, y_T]$ of length $T$. Both sequences contain a pre-labeled set of mentions. Each mention refers to an entity in $\mathcal{E}$. We add two special tokens [Es] and [Ee] to represent the "entity start" and "entity end" boundaries of a mention, e.g., "[Es] Brett Hart [Ee] is the president of the [Es] United Airlines [Ee]". These special tokens come from either Wikipedia hyperlinks (in pre-training, §3.3) or an entity linking model (in fine-tuning, §3.4).
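To make the notation concrete, here is a minimal Python sketch of an alias table $\mathcal{T}$ and the [Es]/[Ee] mention markup. The entity names, aliases, and the naive matching heuristic are purely illustrative and are not taken from the actual 1M-entity vocabulary or labeling pipeline.

```python
# Illustrative only: a toy entity vocabulary, an alias table T: E -> 2^M,
# and the [Es] ... [Ee] mention markup described above.
entity_vocab = ["Brett_Hart", "United_Airlines"]

# Each entity maps to the set of surface forms (mentions) that may refer to it.
alias_table = {
    "Brett_Hart": {"Brett Hart"},
    "United_Airlines": {"United Airlines"},
}

def mark_mentions(text: str, alias_table: dict) -> str:
    """Wrap every known alias in [Es]/[Ee] boundary tokens (naive string matching;
    the paper labels mentions via Wikipedia hyperlinks or an entity linking model)."""
    for aliases in alias_table.values():
        for alias in sorted(aliases, key=len, reverse=True):
            text = text.replace(alias, f"[Es] {alias} [Ee]")
    return text

print(mark_mentions("Brett Hart is the president of the United Airlines.", alias_table))
# [Es] Brett Hart [Ee] is the president of the [Es] United Airlines [Ee].
```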
3.1 Architecture
An overview of EDMem is presented in Figure 1. The framework has a transformer encoder, a transformer decoder, an entity memory, and two prediction heads. Both the encoder and the decoder have two parts: $L_1$ lower layers and $L_2$ upper layers. Transformer layers in EDMem have the same architecture as BART (Lewis et al., 2020a). At the end of the lower layers, EDMem is allowed to use the hidden states as a query to access the entity memory. The knowledge representation obtained by each memory access is summed and normalized with the hidden states before further reasoning is performed in the upper layers. Two prediction heads use the final hidden states of the decoder for prediction: an LM head for token prediction and an entity linking head for entity prediction (details are in §3.3). In practice, we follow EaE (Févry et al., 2020) and set $L_1 = 4$ and $L_2 = 8$.
3.2 Entity Memory
The entity memory contains a large embedding table, which stores the embeddings of the entities in $\mathcal{E}$. Intuitively, an entity embedding contains the contextual information around all mentions of the entity in Wikipedia documents. During encoding and decoding, EDMem queries the entity memory whenever it encounters a mention. It recognizes mentions by identifying the [Es] token. EDMem takes the hidden state of the [Es] token as the query to retrieve relevant knowledge from the entity memory by attending to the entity embedding table (bias terms are omitted):
$$h^{ent}_s = W_{out}\Big(\sum_{i=1}^{N} \alpha_i \cdot \mathbf{e}_i\Big), \quad (1)$$
$$\text{where} \quad \alpha_i = \frac{\exp\big(\mathbf{e}_i^{\top} W_{in} h^{low}_s\big)}{\sum_{j=1}^{N} \exp\big(\mathbf{e}_j^{\top} W_{in} h^{low}_s\big)}. \quad (2)$$
Here $\mathbf{e}_i$ is the embedding of entity $e_i$, and $h^{low}_s$ denotes the hidden state of the [Es] token (from the lower encoder/decoder layers). $h^{ent}_s$ is the aggregated entity representation, which is summed and normalized with $h^{low}_s$ before being passed into the upper layers. $W_{in}$ and $W_{out}$ are linear projection layers for dimension matching. Following EaE, during inference we aggregate the entity representations of the top 100 entities (sorted by $\alpha_i$) instead of attending to all $N$ entities.
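The memory access in Equations (1)–(2) is a single attention step over the entity embedding table. Below is a minimal PyTorch sketch of that step; the module name, dimension arguments, and the top-k shortcut are our own illustration rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntityMemory(nn.Module):
    """Sketch of the memory access in Eq. (1)-(2): attend over the N entity
    embeddings with the hidden state of an [Es] token as the query."""

    def __init__(self, num_entities: int, d_entity: int, d_model: int, top_k: int = 100):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, d_entity)  # the entity memory
        self.w_in = nn.Linear(d_model, d_entity, bias=False)    # W_in: project the query
        self.w_out = nn.Linear(d_entity, d_model, bias=False)   # W_out: project back to d_model
        self.top_k = top_k

    def forward(self, h_low_s: torch.Tensor, use_top_k: bool = False):
        # h_low_s: (batch, d_model) hidden states of [Es] tokens from the lower layers
        query = self.w_in(h_low_s)                               # (batch, d_entity)
        scores = query @ self.entity_emb.weight.T                # (batch, N): logits of Eq. (2)
        if use_top_k:
            # Inference shortcut from the paper: attend only to the top-100 entities.
            top_scores, top_idx = scores.topk(self.top_k, dim=-1)
            alpha = F.softmax(top_scores, dim=-1)                # (batch, k)
            selected = self.entity_emb(top_idx)                  # (batch, k, d_entity)
            pooled = torch.einsum("bk,bkd->bd", alpha, selected)
        else:
            alpha = F.softmax(scores, dim=-1)                    # Eq. (2)
            pooled = alpha @ self.entity_emb.weight              # sum_i alpha_i * e_i
        h_ent_s = self.w_out(pooled)                             # Eq. (1)
        return h_ent_s, scores  # scores are reused by the entity linking loss

# In EDMem, h_ent_s is summed and layer-normalized with h_low_s before the upper layers.
```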
3.3 Pre-Training
3.3.1 Pre-Training Corpus
We pre-train EDMem on the whole Wikipedia corpus. All documents are split into 128-token passages. In addition, we set a 10-token sliding window between passages to avoid an entity being split across two adjacent chunks. This setting yields a total of 39M passages, of which we hold out 0.5% as the validation set during pre-training. We leverage Wikipedia hyperlinks as gold annotations of 249M mentions and their linked entities. Since hyperlinks do not cover all mentions in the text, we heuristically label missing mentions to create more training signals for the entity memory. We use the alias table $\mathcal{T}$ to label all mentions in a Wikipedia page if they match either (1) a linked entity on the same page, or (2) the title entity of the page. This leads to a total of 468M mentions in the pre-training corpus. We collect the 1M most frequently linked entities to form the entity vocabulary $\mathcal{E}$. More details can be found in Appendix A.
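As an illustration of the corpus preparation, the following sketch splits a tokenized document into 128-token passages, assuming the 10-token sliding window means a 10-token overlap between adjacent passages (an assumption on our part; the paper does not spell out the exact chunking code).

```python
def split_into_passages(tokens, passage_len=128, overlap=10):
    """Split a tokenized document into fixed-length passages with a small overlap,
    so that a mention near a boundary appears intact in at least one passage.
    (Sketch; we assume the 10-token sliding window means a 10-token overlap.)"""
    passages = []
    step = passage_len - overlap
    for start in range(0, max(len(tokens) - overlap, 1), step):
        passages.append(tokens[start:start + passage_len])
    return passages

doc = [f"tok{i}" for i in range(300)]
chunks = split_into_passages(doc)
print(len(chunks), [len(c) for c in chunks])  # 3 passages of length 128, 128, 64
```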
3.3.2 Pre-Training Objective
Our pre-training objective is a combination of language modeling and entity linking. For the language modeling objective, we randomly corrupt parts of the input sequence and train EDMem to reconstruct the original sequence. We adopt two kinds of sequence corruption: random token masking and salient span masking. In random token masking, each token has a probability of $P_{rtm}$ of being replaced by a [MASK] token. Salient span masking is adapted from (Guu et al., 2020), where each mention has a probability of $P_{ssm}$ that all tokens within the mention are replaced by [MASK]. Such explicit masking of whole mention names encourages EDMem to rely on the entity memory in predicting mentions, which facilitates the learning of entity embeddings. The LM head performs token prediction through a linear-softmax layer, and the LM loss is the negative log-likelihood of the target sequence: $\mathcal{L}_{LM} = -\sum_{j=1}^{T} \log P(y_j \mid x, y_{1:j-1})$.
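A small sketch of the two corruption schemes described above; the probability values and the example sentence are illustrative, since the exact $P_{rtm}$ and $P_{ssm}$ values are not stated here.

```python
import random

MASK = "[MASK]"

def random_token_masking(tokens, p_rtm=0.15):
    """Replace each token independently with [MASK] with probability P_rtm (value illustrative)."""
    return [MASK if random.random() < p_rtm else t for t in tokens]

def salient_span_masking(tokens, mention_spans, p_ssm=0.5):
    """With probability P_ssm per mention, mask every token inside the mention span
    (start/end are inclusive indices; probability value is illustrative)."""
    out = list(tokens)
    for start, end in mention_spans:
        if random.random() < p_ssm:
            for i in range(start, end + 1):
                out[i] = MASK
    return out

tokens = "[Es] Maeve Binchy [Ee] wrote the novel Evening Class".split()
# The mention "Maeve Binchy" occupies token positions 1-2 (inside the [Es]/[Ee] markers).
print(salient_span_masking(tokens, mention_spans=[(1, 2)], p_ssm=1.0))
# ['[Es]', '[MASK]', '[MASK]', '[Ee]', 'wrote', 'the', 'novel', 'Evening', 'Class']
```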
Figure 2: Three decoding methods in downstream tasks: (a) Free-Form Generation, (b) Static Entity Linking, (c) Dynamic Entity Linking.
EDMem provides direct supervision signals to the entity memory for entity representation learning. The entity linking loss is applied each time the model queries the entity memory. Besides the accesses in the middle of the encoder and decoder, EDMem also queries the memory in the entity linking head, as shown in Figure 1. The entity linking head predicts the corresponding entity using the hidden states of each mention, in the same way as Equation (2). We use a cross-entropy loss to maximize the attention weights of the labelled entities: $\mathcal{L}_{EL} = -\sum_{m} \log \alpha_i$, where $m$ is a mention in the input or output sequence that is linked to the $i$-th entity in $\mathcal{E}$. The final loss function is $\mathcal{L}_{LM} + \lambda_{EL}\,\mathcal{L}_{EL}$, where the coefficient $\lambda_{EL}$ is a hyper-parameter.
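Combining the two objectives, a training step reduces to a weighted sum of two cross-entropy terms. The sketch below is our own paraphrase of the loss; the tensor shapes and the default $\lambda_{EL}$ value are illustrative.

```python
import torch.nn.functional as F

def edmem_loss(lm_logits, target_ids, el_scores, gold_entity_ids, lambda_el=1.0):
    """Sketch of L = L_LM + lambda_EL * L_EL (padding handling omitted).
    lm_logits:       (batch, T, vocab)  token predictions from the LM head
    target_ids:      (batch, T)         gold output tokens
    el_scores:       (num_mentions, N)  memory attention logits at each mention
    gold_entity_ids: (num_mentions,)    index of the linked entity for each mention
    lambda_el:       the coefficient lambda_EL (value here is illustrative)."""
    lm_loss = F.cross_entropy(lm_logits.flatten(0, 1), target_ids.flatten())
    el_loss = F.cross_entropy(el_scores, gold_entity_ids)  # equals -log alpha_i for the gold entity
    return lm_loss + lambda_el * el_loss
```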
3.4 Fine-Tuning
EDMem is fine-tuned on downstream tasks via an LM objective and an entity linking objective. The LM objective is to maximize the probability of the task-specific output. The entity linking objective links mentions to entities in the memory, the same as in pre-training. Mention boundaries are pre-labeled using a state-of-the-art entity linking model (Li et al., 2020). In entity-intensive downstream tasks, the entity memory assists sequence generation by not only providing entity knowledge but also generating entity names. Thus, we design three decoding settings that let the entity linking objective assist sequence generation. A sketch of the different settings is given in Figure 2.
Free-Form Generation
In this setting, the model generates the output sequence entirely based on the probability given by the LM head. This includes the special tokens [Es] and [Ee], which indicate an access to the memory. There is no constraint on what tokens are generated between [Es] and [Ee], i.e., the subsequence [Es], $y_i, \cdots, y_j$, [Ee] may not be a valid entity name in the entity vocabulary. One advantage is that the model processes the entity knowledge in a latent manner, which does not explicitly affect the probability distribution of the language model. However, this may hurt the model's performance on tasks where exact entity names are strictly required, e.g., open-domain QA tasks where exact match is used for evaluation.
Static Entity Linking
Static entity linking explicitly restricts the model to generating entity names for QA. Here, the decoding process is divided into two steps: entity linking and constrained generation. First, given a question, the model selects one or multiple entities as references. As shown in Figure 2(b), the question with an appended [Es] token as a placeholder is passed into the decoder, and the entity linking head is trained to predict the entity ID of the gold answer². We then have the selected top-$k$ entities for each test question, and we restrict the generation space to these top-$k$ entities whenever the model is trying to generate an entity name. To achieve this, inspired by (Cao et al., 2021), we build a prefix tree over the $k$ entities for each test example. The prefix tree tells the model which tokens it is allowed to generate given a prefix (i.e., the previously generated tokens). When the model generates an [Es] token, we restrict the following generated tokens to be one of the $k$ entity names (i.e., one of the paths in the prefix tree). In this way, the model can either generate an entity answer (by generating [Es] and traversing the pre-built prefix tree) or generate a non-entity answer (if no [Es] token is generated). Readers can refer to (Cao et al., 2021) for more implementation details.
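A minimal sketch of the prefix-tree constraint over the top-k entity names follows. The plain-dict trie, whitespace tokenization, and the [Ee] end marker used here are simplifications for illustration; the actual constrained decoding follows Cao et al. (2021) and operates on subword tokens.

```python
END = "[Ee]"

def build_prefix_tree(entity_names):
    """Build a trie over the tokenized names of the top-k selected entities."""
    root = {}
    for name in entity_names:
        node = root
        for tok in name.split() + [END]:
            node = node.setdefault(tok, {})
    return root

def allowed_next_tokens(prefix_tree, generated_tokens):
    """Return the tokens the decoder may generate next, given the tokens
    produced since the last [Es] (an empty list allows any path's first token)."""
    node = prefix_tree
    for tok in generated_tokens:
        if tok not in node:
            return []  # prefix is not in the tree: no valid continuation
        node = node[tok]
    return list(node.keys())

trie = build_prefix_tree(["United States", "United Kingdom", "Ireland"])
print(allowed_next_tokens(trie, []))                    # ['United', 'Ireland']
print(allowed_next_tokens(trie, ["United"]))            # ['States', 'Kingdom']
print(allowed_next_tokens(trie, ["United", "States"]))  # ['[Ee]'] -> close the entity span
```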
Dynamic Entity Linking
Static entity linking is applicable only when the downstream task can be converted into an entity linking objective. Another way to generate entities is to predict the entities on-the-fly. After each time the model generates
²Training examples with non-entity answers are discarded.