knowledge, which improves the model's performance. We also correct the entity proposals at every layer of the regressor network, a strategy we call progressive refinement. This strategy increases the precision of the model and facilitates gradient backpropagation.
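As a rough illustration (the function names and per-layer offset predictors below are hypothetical stand-ins, not the paper's implementation), progressive refinement can be sketched as repeatedly correcting span boundaries with offsets predicted at each regressor layer:

```python
import numpy as np

def progressive_refine(proposals, layers):
    """Refine entity-span proposals layer by layer.

    proposals: (N, 2) float array of (left, right) span boundaries.
    layers: list of callables, each predicting a boundary correction
            from the current proposals (stand-ins for regressor layers).
    """
    for layer in layers:
        offsets = layer(proposals)       # (N, 2) predicted corrections
        proposals = proposals + offsets  # correct the proposal at this layer
    return proposals
```

Because every layer outputs a correction relative to the previous estimate, gradients flow through short per-layer residual paths rather than one long regression chain.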
Our contributions can be summarized as follows:
• We design a proposer built on a feature pyramid to incorporate multi-scale features and initialize high-quality proposals with high-overlap spans and strongly correlated queries. Compared to previous works, which randomly initialize query vectors, the proposer network may greatly reduce training difficulty.
• We deploy an encoder-only framework in the regressor, which avoids the handcrafted construction of query vectors and the difficulty of learning appropriate query representations, thus notably easing convergence. An iterative refinement strategy is further utilized in the regressor to improve precision and promote gradient backpropagation.
• We introduce a novel spatially modulated attention mechanism that helps the model learn proper attention patterns. It dramatically improves the model's performance by integrating spatial prior knowledge to increase attention sparsity.
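The bullets above can be made concrete with a minimal sketch of spatial modulation (the names and the Gaussian form of the prior are illustrative assumptions, not the exact mechanism in this paper): attention logits are biased by a Gaussian prior centered on each query's proposed span, which concentrates attention mass near the span and sparsifies the pattern:

```python
import numpy as np

def spatially_modulated_attention(scores, centers, widths, positions):
    """Modulate attention scores with a Gaussian spatial prior.

    scores:    (Q, T) raw attention logits (query x token).
    centers:   (Q,) predicted span centers, one per query.
    widths:    (Q,) predicted span widths controlling prior spread.
    positions: (T,) token positions.
    """
    # Gaussian log-prior peaked at each query's span center; adding it
    # to the logits suppresses attention far from the proposed span.
    prior = -((positions[None, :] - centers[:, None]) ** 2) / (
        2.0 * (widths[:, None] ** 2) + 1e-6
    )
    logits = scores + prior
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)
```

With uniform raw scores, the modulated weights peak exactly at each query's predicted center, which is the intended inductive bias.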
2 Related Work
In this section, we review related work on named entity recognition (NER) and set prediction. We analyze different methodologies for the NER task and trace the development of set prediction algorithms.
2.1 Named Entity Recognition
Since traditional NER methods with sequence labeling [19, 20] have been well studied, many works [21] have been devoted to extending sequence tagging methods to nested NER. One of the most explored methods is the layered approach [22]. Other works deploy proprietary structures to handle nested entities, such as the hypergraph [10].
Although these methods have achieved advanced performance, they are still not flexible enough due to the need
for manually designed labeling schemes. In contrast, the end-to-end framework proposed in this paper avoids this
disadvantage and thus facilitates the implementation and migration of the method.
The Seq2Seq approach [23] unifies different forms of nested entity problems into a sequence generation problem. Even though this strategy avoids complicated annotation schemes, sensitivity to the decoding order and the beam search algorithm poses a barrier to boosting model performance. The end-to-end model in this paper incorporates the set prediction algorithm in order to overcome the difficulties confronting Seq2Seq models.
Span-based approaches [24, 25], which classify candidate spans to identify entities, have also drawn broad interest. The span-based method formally resembles the object detection task in computer vision. Based on the long-standing idea [26] of associating images with natural language, one work [6] proposes a two-stage identifier that fully exploits partially matched entity proposals and introduces a regression procedure. Despite the instructiveness of migrating proven computer vision methods to NER tasks, the error propagation caused by two-stage models and the proper way to perform boundary regression on language remain issues to address. In this paper, we combine the proposal and regression stages into an end-to-end framework and refine entity proposals based on probability distributions to fit the linguistic setting of the NER task.
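Since token positions are discrete, direct coordinate regression as in object detection does not transfer cleanly to text. As a hedged sketch (a soft-argmax stand-in, not necessarily this paper's exact formulation), refining a boundary from a probability distribution over token positions can look like:

```python
import numpy as np

def expected_boundary(logits):
    """Soft boundary estimate from a distribution over token indices.

    logits: (T,) scores for each token being the boundary.
    Returns the probability-weighted expected index, which gives a
    differentiable "regression" target suited to discrete text positions.
    """
    logits = logits - logits.max()          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float((probs * np.arange(len(logits))).sum())
```

The expectation stays differentiable with respect to the logits, so a boundary can be refined by gradient descent even though token indices themselves are discrete.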
2.2 Set Prediction
Several recent pieces of research [18, 27] have deployed set prediction networks in information extraction tasks and proved their effectiveness. These works can be seen as variants of DETR [28], which was proposed for object detection and uses a transformer decoder to update manually created query vectors for generating detection boxes and corresponding categories.
Models based on set prediction networks, especially DETR, have been extensively studied. Slow convergence due to the random initialization of the object queries is the fundamental obstacle of DETR. A two-stage model [29] with a feature pyramid [30], which generates high-quality queries and introduces multi-scale features, is proposed to settle the convergence problem. That work also discusses the necessity of cross-attention and suggests that an encoder-only network can achieve equally satisfactory results. Spatially modulated co-attention [31], which integrates spatial prior knowledge, is also introduced to ease the problem. This work increases the sparsity of attention through a priori knowledge for the purpose of accelerating training. The thought-provoking deformable attention [32] is presented,