
for the span representations. In addition, DSpERT achieves particularly amplified performance improvements against its shallow counterparts² on long-span entities and nested structures.
Different from most related work, which focuses on decoder designs (Yu et al., 2020; Li et al., 2020b; Shen et al., 2021; Li et al., 2022), we make an effort to optimize the span representations, but employ a simple and standard neural classifier for decoding. This exposes the pre-logit representations that directly determine the entity predictions, and thus allows the kind of representation analysis widely employed in the broader machine learning community (Van der Maaten and Hinton, 2008; Krizhevsky et al., 2012). This sheds light on building neural NER systems with higher robustness and interpretability (Ouchi et al., 2020).
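As a concrete illustration of the representation analysis this enables, the sketch below (ours, not from the paper) projects hypothetical pre-logit span representations into two dimensions with t-SNE and colors them by gold entity type; the array names, shapes, and random data are placeholders.

```python
# A minimal sketch (illustrative only): t-SNE visualization of hypothetical
# pre-logit span representations, one d-dimensional vector per candidate span.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder data standing in for representations collected from a trained model.
span_reprs = np.random.randn(1000, 768)            # (num_spans, d)
span_labels = np.random.randint(0, 5, size=1000)   # gold entity-type ids

# Project to 2-D and color points by entity type.
points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(span_reprs)
plt.scatter(points[:, 0], points[:, 1], c=span_labels, s=5, cmap="tab10")
plt.title("t-SNE of pre-logit span representations")
plt.show()
```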
2 Related Work
For a long time, NER research focused on recognizing flat entities. After the introduction of the linear-chain conditional random field (Collobert et al., 2011), neural sequence tagging models became the de facto standard solution for flat NER tasks (Huang et al., 2015; Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Zhang and Yang, 2018).
Recent studies pay much more attention to nested NER, with which a plain sequence tagging model struggles (Ju et al., 2018). This has stimulated a number of novel NER system designs beyond the sequence tagging framework. Hypergraph-based methods extend sequence tagging by allowing multiple tags for each token and multiple tag transitions between adjacent tokens, which is compatible with nested structures (Lu and Roth, 2015; Katiyar and Cardie, 2018). Span-based models enumerate candidate spans and classify them into entity categories (Sohrab and Miwa, 2018; Eberts and Ulges, 2020; Yu et al., 2020). Li et al. (2020b) reformulate nested NER as a reading comprehension task. Shen et al. (2021, 2022) borrow methods from image object detection to solve nested NER. Yan et al. (2021) propose a generative approach, which encodes the ground-truth entity set as a sequence and thus reformulates NER as a sequence-to-sequence task.
Li et al. (2022) describe the entity set with word-word relations, and solve nested NER via word-word relation classification.

² In this paper, unless otherwise specified, we use "shallow" to refer to models that construct span representations by shallowly aggregating (typically top) token representations, although the token representations themselves could be "deep".
Span-based models are probably the most straightforward among these approaches. However, existing span-based models typically build span representations by shallowly aggregating the top token representations from a standard text encoder. Here, the shallow aggregation could be pooling over the sequence dimension (Eberts and Ulges, 2020; Shen et al., 2021), integrating the starting and ending token representations (Yu et al., 2020; Li et al., 2020d), or a concatenation of these results (Sohrab and Miwa, 2018). Such shallow aggregation may be too simple to capture the information embedded in long spans; moreover, when spans overlap, the resulting span representations are technically coupled because of the shared tokens. These issues ultimately lead to performance degradation.
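For concreteness, the following minimal sketch (ours, not tied to any particular cited system) shows two typical shallow aggregations over the top-layer token representations, and how overlapping spans reuse the same token vectors, which is the coupling discussed above.

```python
# A minimal sketch (assumptions, illustrative only): two common "shallow" span
# aggregations over top-layer token representations of shape (T, d).
import torch

def mean_pool_span(H_top: torch.Tensor, start: int, end: int) -> torch.Tensor:
    """Average the token vectors inside span [start, end) -- pooling-style aggregation."""
    return H_top[start:end].mean(dim=0)                       # (d,)

def start_end_concat(H_top: torch.Tensor, start: int, end: int) -> torch.Tensor:
    """Concatenate the starting and ending token vectors -- boundary-style aggregation."""
    return torch.cat([H_top[start], H_top[end - 1]], dim=-1)  # (2d,)

# Overlapping spans share token vectors, so their representations are coupled:
H_top = torch.randn(12, 768)           # top-layer outputs for a 12-token sentence
span_a = mean_pool_span(H_top, 2, 7)   # tokens 2..6
span_b = mean_pool_span(H_top, 4, 9)   # tokens 4..8 (shares tokens 4..6 with span_a)
```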
Our DSpERT addresses this issue with a multi-layered, bottom-to-top construction of span representations. Empirical results show that such deep span representations outperform their shallow counterparts both qualitatively and quantitatively.
3 Methods
Deep Token Representations. Given a $T$-length sequence passed into an $L$-layered, $d$-dimensional Transformer encoder (Vaswani et al., 2017), the initial token embeddings, together with the potential positional and segmentation embeddings (e.g., BERT; Devlin et al., 2019), are denoted as $\mathbf{H}^0 \in \mathbb{R}^{T \times d}$. Thus, the $l$-th ($l = 1, 2, \dots, L$) token representations are:

$$\mathbf{H}^l = \mathrm{TrBlock}(\mathbf{H}^{l-1}, \mathbf{H}^{l-1}, \mathbf{H}^{l-1}), \quad (1)$$

where $\mathrm{TrBlock}(\mathbf{Q}, \mathbf{K}, \mathbf{V})$ is a Transformer encoder block that takes $\mathbf{Q} \in \mathbb{R}^{T \times d}$, $\mathbf{K} \in \mathbb{R}^{T \times d}$, and $\mathbf{V} \in \mathbb{R}^{T \times d}$ as the query, key, and value inputs, respectively. It consists of a multi-head attention module and a position-wise feed-forward network (FFN), both followed by a residual connection and a layer normalization. Passing the same matrix, i.e., $\mathbf{H}^{l-1}$, for the queries, keys, and values exactly results in self-attention (Vaswani et al., 2017).
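A minimal sketch of Eq. (1) follows, under common assumptions (post-norm ordering, no dropout or attention masking); the class and variable names are ours. Each block applies multi-head attention and an FFN, each followed by a residual connection and layer normalization, and passing $\mathbf{H}^{l-1}$ as queries, keys, and values yields self-attention.

```python
# A minimal sketch of Eq. (1): an L-layered stack of Transformer encoder blocks,
# each applied with Q = K = V = H^{l-1} (i.e., self-attention).
import torch
import torch.nn as nn

class TrBlock(nn.Module):
    def __init__(self, d: int = 768, num_heads: int = 12, d_ff: int = 3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, Q, K, V):
        # Multi-head attention, followed by a residual connection and layer norm.
        attn_out, _ = self.attn(Q, K, V)
        x = self.norm1(Q + attn_out)
        # Position-wise FFN, again followed by a residual connection and layer norm.
        return self.norm2(x + self.ffn(x))

# H^l = TrBlock(H^{l-1}, H^{l-1}, H^{l-1}) for l = 1, ..., L.
L, T, d = 12, 32, 768
blocks = nn.ModuleList([TrBlock(d) for _ in range(L)])
H = [torch.randn(1, T, d)]                # H[0]: initial token (+ position) embeddings
for block in blocks:
    H.append(block(H[-1], H[-1], H[-1]))  # self-attention: same matrix for Q, K, V
```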
The resulting top representations $\mathbf{H}^L$, computed through $L$ Transformer blocks, are believed to embrace deep, rich, and contextualized semantics that are useful for a wide range of tasks. Hence, in a typical neural NLP modeling paradigm, only the