Deep Span Representations for Named Entity Recognition

Enwei Zhu and Yiyang Liu and Jinpeng Li
Ningbo No.2 Hospital
Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences
{zhuenwei,liuyiyang,lijinpeng}@ucas.ac.cn
Abstract
Span-based models are one of the most straightforward methods for named entity recognition (NER). Existing span-based NER systems shallowly aggregate the token representations to span representations. However, this typically results in significant ineffectiveness for long entities, a coupling between the representations of overlapping spans, and ultimately a performance degradation. In this study, we propose DSpERT (Deep Span Encoder Representations from Transformers), which comprises a standard Transformer and a span Transformer. The latter uses low-layered span representations as queries, and aggregates the token representations as keys and values, layer by layer from bottom to top. Thus, DSpERT produces span representations of deep semantics. With weight initialization from pretrained language models, DSpERT achieves performance higher than or competitive with recent state-of-the-art systems on six NER benchmarks.¹ Experimental results verify the importance of the depth for span representations, and show that DSpERT performs particularly well on long-span entities and nested structures. Further, the deep span representations are well structured and easily separable in the feature space.
1 Introduction
As a fundamental information extraction task, named entity recognition (NER) requires predicting a set of entities from a piece of text. Thus, the model has to distinguish the entity spans (i.e., positive examples) from the non-entity spans (i.e., negative examples). In this view, it is natural to enumerate all possible spans and classify them into the entity categories (including an extra non-entity category). This is exactly the core idea of span-based approaches (Sohrab and Miwa, 2018; Eberts and Ulges, 2020; Yu et al., 2020).

* Corresponding author.
¹ Our code is available at https://github.com/syuoni/eznlp.
Analogously to how representation learning matters to image classification (Bengio et al., 2013; Chen et al., 2020), it should be crucial to construct good span representations for span-based NER. However, existing models typically build span representations by shallowly aggregating the top/last token representations, e.g., pooling over the sequence dimension (Sohrab and Miwa, 2018; Eberts and Ulges, 2020; Shen et al., 2021), or integrating the starting and ending tokens (Yu et al., 2020; Li et al., 2020d). In that case, the token representations have not fully interacted with each other before they are fed into the classifier, which impairs the capability of capturing the information of long spans. If the spans overlap, the resulting span representations are technically coupled because of the shared tokens. This makes the representations less distinguishable from those of overlapping spans in nested structures.
Inspired by (probably) the most sophisticated implementation of the attention mechanism — Transformer and BERT (Vaswani et al., 2017; Devlin et al., 2019), we propose DSpERT, which stands for Deep Span Encoder Representations from Transformers. It consists of a standard Transformer and a span Transformer; the latter uses low-layered span representations as queries, and token representations within the corresponding span as keys and values, and thus aggregates token representations layer by layer from bottom to top. Such multi-layered Transformer-style aggregation promisingly produces deep span representations of rich semantics, analogously to how BERT yields highly contextualized token representations.
With weight initialization from pretrained language models (PLMs), DSpERT performs comparably to recent state-of-the-art (SOTA) NER systems on six well-known benchmarks. Experimental results clearly verify the importance of the depth
for the span representations. In addition, DSpERT achieves particularly amplified performance improvements against its shallow counterparts² on long-span entities and nested structures.
Different from most related work, which focuses on the decoder designs (Yu et al., 2020; Li et al., 2020b; Shen et al., 2021; Li et al., 2022), we make an effort to optimize the span representations, but employ a simple and standard neural classifier for decoding. This exposes the pre-logit representations that directly determine the entity prediction results, and thus allows further representation analysis widely employed in a broader machine learning community (Van der Maaten and Hinton, 2008; Krizhevsky et al., 2012). This sheds light on neural NER systems towards higher robustness and interpretability (Ouchi et al., 2020).
2 Related Work
NER research had long been focused on recognizing flat entities. After the introduction of the linear-chain conditional random field (Collobert et al., 2011), neural sequence tagging models became the de facto standard solution for flat NER tasks (Huang et al., 2015; Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Zhang and Yang, 2018).
Recent studies pay much more attention to nested NER, which a plain sequence tagging model struggles with (Ju et al., 2018). This stimulates a number of novel NER system designs beyond the sequence tagging framework. Hypergraph-based methods extend sequence tagging by allowing multiple tags for each token and multiple tag transitions between adjacent tokens, which is compatible with nested structures (Lu and Roth, 2015; Katiyar and Cardie, 2018). Span-based models enumerate candidate spans and classify them into entity categories (Sohrab and Miwa, 2018; Eberts and Ulges, 2020; Yu et al., 2020). Li et al. (2020b) reformulate nested NER as a reading comprehension task. Shen et al. (2021, 2022) borrow methods from image object detection to solve nested NER. Yan et al. (2021) propose a generative approach, which encodes the ground-truth entity set as a sequence, and thus reformulates NER as a sequence-to-sequence task. Li et al. (2022) describe the entity set by word-word relations, and solve nested NER by word-word relation classification.

² In this paper, unless otherwise specified, we use "shallow" to refer to models that construct span representations by shallowly aggregating (typically top) token representations, although the token representations could be "deep".
The span-based models are probably the most straightforward among these approaches. However, existing span-based models typically build span representations by shallowly aggregating the top token representations from a standard text encoder. Here, the shallow aggregation could be pooling over the sequence dimension (Eberts and Ulges, 2020; Shen et al., 2021), integrating the starting and ending token representations (Yu et al., 2020; Li et al., 2020d), or a concatenation of these results (Sohrab and Miwa, 2018). Apparently, shallow aggregation may be too simple to capture the information embedded in long spans; and if the spans overlap, the resulting span representations are technically coupled because of the shared tokens. These issues ultimately lead to a performance degradation.

Our DSpERT addresses this issue by a multi-layered and bottom-to-top construction of span representations. Empirical results show that such deep span representations outperform the shallow counterparts both qualitatively and quantitatively.
3 Methods
Deep Token Representations. Given a $T$-length sequence passed into an $L$-layered, $d$-dimensional Transformer encoder (Vaswani et al., 2017), the initial token embeddings, together with the potential positional and segmentation embeddings (e.g., BERT; Devlin et al., 2019), are denoted as $\mathbf{H}^0 \in \mathbb{R}^{T \times d}$. Thus, the $l$-th ($l = 1, 2, \dots, L$) token representations are:

$$\mathbf{H}^l = \mathrm{TrBlock}(\mathbf{H}^{l-1}, \mathbf{H}^{l-1}, \mathbf{H}^{l-1}), \tag{1}$$
where $\mathrm{TrBlock}(\mathbf{Q}, \mathbf{K}, \mathbf{V})$ is a Transformer encoder block that takes $\mathbf{Q} \in \mathbb{R}^{T \times d}$, $\mathbf{K} \in \mathbb{R}^{T \times d}$ and $\mathbf{V} \in \mathbb{R}^{T \times d}$ as the query, key and value inputs, respectively. It consists of a multi-head attention module and a position-wise feed-forward network (FFN), both followed by a residual connection and a layer normalization. Passing the same matrix, i.e., $\mathbf{H}^{l-1}$, for the queries, keys and values exactly results in self-attention (Vaswani et al., 2017).
The resulting top representations $\mathbf{H}^L$, computed through $L$ Transformer blocks, are believed to embrace deep, rich and contextualized semantics that are useful for a wide range of tasks. Hence, in a typical neural NLP modeling paradigm, only the top representations $\mathbf{H}^L$ are used for loss calculation and decoding (Devlin et al., 2019; Eberts and Ulges, 2020; Yu et al., 2020).
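For concreteness, the following is a minimal PyTorch sketch of the TrBlock abstraction in Eq. (1), stacked into a deep token encoder. The module name, the dimensions (d = 768, 12 heads, 12 layers) and the post-norm layout are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only; `TrBlock` and its hyper-parameters are assumptions,
# not the released DSpERT code.
import torch
import torch.nn as nn

class TrBlock(nn.Module):
    """A Transformer encoder block that accepts separate query/key/value inputs."""
    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ff: int = 3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, q, k, v):
        # Multi-head attention, followed by a residual connection and layer norm.
        attn_out, _ = self.attn(q, k, v)
        h = self.norm1(q + attn_out)
        # Position-wise FFN, again with a residual connection and layer norm.
        return self.norm2(h + self.ffn(h))

# Eq. (1): pass the same matrix as query, key and value (self-attention),
# layer by layer from bottom to top.
blocks = nn.ModuleList([TrBlock() for _ in range(12)])
H = torch.randn(1, 32, 768)      # H^0: (batch, T, d) token (+ positional) embeddings
for block in blocks:
    H = block(H, H, H)           # H^l = TrBlock(H^{l-1}, H^{l-1}, H^{l-1})
```

Because the block exposes separate query, key and value inputs, the same structure can later be driven with a query that differs from the keys/values, which is exactly what the span Transformer below does.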
Deep Span Representations. Figure 1 presents the architecture of DSpERT, which consists of a standard Transformer encoder and a span Transformer encoder. In a span Transformer of size $k$ ($k = 2, 3, \dots, K$), the initial span representations $\mathbf{S}^{0,k} \in \mathbb{R}^{(T-k+1) \times d}$ are directly aggregated from the corresponding token embeddings:

$$\mathbf{s}^{0,k}_i = \mathrm{Aggregating}(\mathbf{H}^0_{[i:i+k]}), \tag{2}$$

where $\mathbf{s}^{0,k}_i \in \mathbb{R}^d$ is the $i$-th vector of $\mathbf{S}^{0,k}$, and $\mathbf{H}^0_{[i:i+k]} = [\mathbf{h}^0_i; \dots; \mathbf{h}^0_{i+k-1}] \in \mathbb{R}^{k \times d}$ is a slice of $\mathbf{H}^0$ from position $i$ to position $i+k-1$; $\mathrm{Aggregating}(\cdot)$ is a shallowly aggregating function, such as max-pooling. Check Appendix A for more details on the alternative aggregating functions used in this study. Technically, $\mathbf{s}^{0,k}_i$ covers the token embeddings in the span $(i, i+k)$.
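A minimal sketch of Eq. (2), assuming max-pooling as the aggregating function; the function name and the unbatched tensor shapes are illustrative and not taken from the authors' code.

```python
import torch

def initial_span_reps(H0: torch.Tensor, k: int) -> torch.Tensor:
    """H0: (T, d) token embeddings -> S^{0,k}: (T - k + 1, d) initial span reps of size k."""
    # `unfold` yields every contiguous window of length k along the sequence dimension.
    windows = H0.unfold(0, k, 1)          # (T - k + 1, d, k)
    return windows.max(dim=-1).values     # max-pool over the k token positions

H0 = torch.randn(7, 768)                  # T = 7, d = 768
S0_3 = initial_span_reps(H0, k=3)         # spans 0-2, 1-3, 2-4, 3-5, 4-6
```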
The computation of high-layered span representations imitates that of the standard Transformer. For each span Transformer block, the query is a low-layered span representation vector, and the keys and values are the aforementioned token representation vectors in the positions of that very span. Formally, the $l$-th layer span representations are:

$$\mathbf{s}^{l,k}_i = \mathrm{SpanTrBlock}(\mathbf{s}^{l-1,k}_i, \mathbf{H}^{l-1}_{[i:i+k]}, \mathbf{H}^{l-1}_{[i:i+k]}), \tag{3}$$

where $\mathrm{SpanTrBlock}(\mathbf{Q}, \mathbf{K}, \mathbf{V})$ shares exactly the same structure as the corresponding Transformer block, but receives different inputs. More specifically, for span $(i, i+k)$, the query is the span representation $\mathbf{s}^{l-1,k}_i$, and the keys and values are the token representations $\mathbf{H}^{l-1}_{[i:i+k]}$. Again, the resulting $\mathbf{s}^{l,k}_i$ technically covers the token representations in the span $(i, i+k)$ on layer $l-1$.
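The sketch below illustrates Eq. (3), reusing the hypothetical TrBlock module from the earlier sketch in place of SpanTrBlock (the two share the same structure); the shapes and names are assumptions for illustration only.

```python
import torch

def deep_span_reps(H_layers, S0_k, span_blocks, k):
    """
    H_layers:    [H^0, H^1, ..., H^{L-1}] token representations, each (T, d).
    S0_k:        (T - k + 1, d) initial span representations of size k (Eq. 2).
    span_blocks: L TrBlock-style modules with weights separate from the token encoder.
    """
    S = S0_k
    for block, H_prev in zip(span_blocks, H_layers):
        # Keys/values: the token representations inside each span, (num_spans, k, d).
        KV = H_prev.unfold(0, k, 1).permute(0, 2, 1)
        # Query: one span representation per span, (num_spans, 1, d).
        Q = S.unsqueeze(1)
        # s^{l,k}_i = SpanTrBlock(s^{l-1,k}_i, H^{l-1}_[i:i+k], H^{l-1}_[i:i+k])
        S = block(Q, KV, KV).squeeze(1)
    return S                              # S^{L,k}: deep span representations
```

Each query attends only to the token positions of its own span, so overlapping spans develop their representations along separate computation paths even though they read from shared keys and values.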
In our default configuration, the weights of the standard and span Transformers are independent, but initialized from the same PLM. Given that the two modules have exactly the same structure, the weights can optionally be shared between them. This reduces the number of model parameters, but empirically results in slightly lower performance (see Appendix F).

The top span representations $\mathbf{S}^{L,k}$ are built through $L$ Transformer blocks, which are capable of enriching the representations towards deep semantics. Thus, the representations of overlapping spans are decoupled, and promisingly distinguishable from each other, although they are originally built from $\mathbf{S}^{0,k}$ — those shallowly aggregated from token embeddings. This is conceptually analogous to how BERT uses 12 or more Transformer blocks to produce highly contextualized representations from the original static token embeddings.

The top span representations are then passed to an entity classifier. Note that we do not construct a unigram span Transformer, but directly borrow the token representations as the span representations of size 1. In other words,

$$\mathbf{S}^{L,1} \equiv \mathbf{H}^L. \tag{4}$$
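A small sketch of how the size-1 case in Eq. (4) and the deep span representations of sizes 2 to K can be gathered for the classifier; the dictionary layout is purely illustrative.

```python
import torch

def collect_span_reps(H_L: torch.Tensor, deep_spans_by_size: dict) -> dict:
    """
    H_L: (T, d) top-layer token representations.
    deep_spans_by_size: {k: (T - k + 1, d) tensor for k = 2..K} from Eq. (3).
    Returns {(i, j): s^{L, j-i}_i} for every enumerated span.
    """
    reps = {(i, i + 1): H_L[i] for i in range(H_L.size(0))}   # S^{L,1} = H^L
    for k, S_Lk in deep_spans_by_size.items():
        for i in range(S_Lk.size(0)):
            reps[(i, i + k)] = S_Lk[i]                        # span (i, i + k) of size k
    return reps
```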
Entity Classifier. Following Dozat and Manning (2017) and Yu et al. (2020), we introduce a dimension-reducing FFN before feeding the span representations into the decoder. According to the preceding notations, the representation of span $(i, j)$ is $\mathbf{s}^{L,j-i}_i$; thus,

$$\mathbf{z}_{ij} = \mathrm{FFN}(\mathbf{s}^{L,j-i}_i \oplus \mathbf{w}_{j-i}), \tag{5}$$

where $\mathbf{w}_{j-i} \in \mathbb{R}^{d_w}$ is the $(j-i)$-th width embedding from a dedicated learnable matrix, and $\oplus$ denotes the concatenation operation. $\mathbf{z}_{ij} \in \mathbb{R}^{d_z}$ is the dimension-reduced span representation, which is then fed into a softmax layer:

$$\hat{\mathbf{y}}_{ij} = \mathrm{softmax}(\mathbf{W} \mathbf{z}_{ij} + \mathbf{b}), \tag{6}$$

where $\mathbf{W} \in \mathbb{R}^{c \times d_z}$ and $\mathbf{b} \in \mathbb{R}^c$ are learnable parameters, and $\hat{\mathbf{y}}_{ij} \in \mathbb{R}^c$ is the vector of predicted probabilities over entity types. Note that Eq. (6) follows the form of a typical neural classification head, which receives a single vector $\mathbf{z}_{ij}$ and yields the predicted probabilities $\hat{\mathbf{y}}_{ij}$. Here, the pre-softmax vector $\mathbf{W} \mathbf{z}_{ij}$ is called the logits, and $\mathbf{z}_{ij}$ is called the pre-logit representation (Müller et al., 2019).

Given the one-hot encoded ground truth $\mathbf{y}_{ij} \in \mathbb{R}^c$, the model can be trained by optimizing the cross-entropy loss over all spans:

$$\mathcal{L} = -\sum_{0 \le i < j \le T} \mathbf{y}_{ij}^{\top} \log(\hat{\mathbf{y}}_{ij}). \tag{7}$$
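A minimal sketch of the classification head in Eqs. (5)-(7); the concrete dimensions (d_w, d_z), the maximum width and the ReLU non-linearity are illustrative assumptions. Note that the softmax of Eq. (6) is folded into PyTorch's cross_entropy, which expects raw logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntityClassifier(nn.Module):
    def __init__(self, d=768, d_w=25, d_z=150, num_types=8, max_width=10):
        super().__init__()
        self.width_emb = nn.Embedding(max_width + 1, d_w)              # w_{j-i}
        self.ffn = nn.Sequential(nn.Linear(d + d_w, d_z), nn.ReLU())   # dimension-reducing FFN (Eq. 5)
        self.out = nn.Linear(d_z, num_types)                           # W z_{ij} + b (Eq. 6)

    def forward(self, span_rep, width):
        z = self.ffn(torch.cat([span_rep, self.width_emb(width)], dim=-1))  # pre-logit z_{ij}
        return self.out(z)                                                   # logits

clf = EntityClassifier()
span_rep = torch.randn(5, 768)             # 5 candidate spans s^{L, j-i}_i
width = torch.tensor([1, 2, 3, 2, 1])      # span widths j - i
gold = torch.tensor([0, 0, 3, 0, 0])       # index 0 = non-entity category (assumed)
loss = F.cross_entropy(clf(span_rep, width), gold)   # Eq. (7) over the enumerated spans
```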
We additionally apply the boundary smoothing technique (Zhu and Li, 2022), which is a variant of label smoothing (Szegedy et al., 2016) for span-based NER and brings performance improvements.
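As a rough illustration of the idea, the sketch below builds soft targets in which each gold span keeps 1 − ε of its probability mass and spreads the rest uniformly over spans whose start or end deviates by at most one position; the exact allocation scheme follows Zhu and Li (2022) and may differ from this simplification.

```python
import torch

def boundary_smoothed_targets(gold_entities, T, num_types, eps=0.1):
    """gold_entities: list of (start, end, type_id) with end exclusive; index 0 = non-entity (assumed)."""
    spans = [(i, j) for i in range(T) for j in range(i + 1, T + 1)]
    index = {sp: n for n, sp in enumerate(spans)}
    targets = torch.zeros(len(spans), num_types)
    targets[:, 0] = 1.0                                      # default: non-entity
    for (i, j, t) in gold_entities:
        neighbors = [(i + di, j + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
                     if (di, dj) != (0, 0) and (i + di, j + dj) in index]
        targets[index[(i, j)]] = 0.0
        if not neighbors:                                    # degenerate case (e.g., T == 1): keep a hard label
            targets[index[(i, j)], t] = 1.0
            continue
        targets[index[(i, j)], t] = 1.0 - eps                # gold span keeps most of the mass
        for sp in neighbors:                                 # surrounding spans share eps
            targets[index[sp], 0] -= eps / len(neighbors)
            targets[index[sp], t] += eps / len(neighbors)
    # Clamp and renormalise so every row stays a valid distribution, even when gold spans are adjacent.
    targets = targets.clamp_min(0.0)
    return targets / targets.sum(dim=1, keepdim=True)

# The soft targets can be plugged into a cross-entropy with probability targets,
# e.g., torch.nn.functional.cross_entropy(logits, targets) in recent PyTorch versions.
```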
4 Experiments
4.1 Experimental Settings
Datasets. We perform experiments on four English nested NER datasets: ACE 2004³, ACE 2005⁴, GENIA (Kim et al., 2003) and KBP 2017 (Ji et al., 2017); and two English flat NER datasets: CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003) and OntoNotes 5⁵. More details on data processing and descriptive statistics are reported in Appendix B.

³ https://catalog.ldc.upenn.edu/LDC2005T09.
⁴ https://catalog.ldc.upenn.edu/LDC2006T06.
⁵ https://catalog.ldc.upenn.edu/LDC2013T19.

Figure 1: Architecture of DSpERT. It comprises: (Left) a standard $L$-layer Transformer encoder (e.g., BERT); and (Right) a span Transformer encoder, where the span representations are the query inputs, and token representations (from the Transformer encoder) are the key/value inputs. There are in total $K-1$ span Transformer encoders, where $K$ is the maximum span size; and each has $L$ layers. The figure specifically displays the case of span size 3; the span of positions 1–3 is highlighted, whereas the others are shown in dotted lines.

Implementation Details. To save space, our implementation details are all placed in Appendix C.
4.2 Main Results
Table 1 shows the evaluation results on the English nested NER benchmarks. For a fair and reliable comparison to previous SOTA NER systems⁶, we run DSpERT five times on each dataset, and report both the best score and the average score with the corresponding standard deviation.

⁶ We exclude previous systems relying on extra training data (e.g., Li et al., 2020c), external resources (e.g., Yamada et al., 2020), extremely large PLMs (e.g., Yuan et al., 2022), or neural architecture search (e.g., Wang et al., 2021).

With a base-sized PLM, DSpERT achieves on-par or better results compared with previous SOTA systems. More specifically, the best F1 scores are 88.31%, 87.42%, 81.90% and 87.65% on ACE 2004, ACE 2005, GENIA and KBP 2017, respectively. Except for ACE 2005, these scores correspond to 0.17%, 0.13% and 3.15% absolute improvements.
Table 2 presents the results on the English flat NER datasets. The best F1 scores are 93.70% and 91.76% on CoNLL 2003 and OntoNotes 5, respectively. These scores are slightly higher than those reported in previous literature.
Appendix D further lists the category-wise F1 scores; the results show that DSpERT can consistently outperform the biaffine model, a classic and strong baseline, across most entity categories. Appendix E provides additional experimental results on Chinese NER, suggesting that the effectiveness of DSpERT is generalizable across languages.

Overall, DSpERT shows strong and competitive performance on both the nested and flat NER tasks.