END-TO-END ENTITY DETECTION WITH PROPOSER AND
REGRESSOR
Xueru Wen
College of Computer Science and Technology
Jilin University
Changchun
wenxr2119@mails.jlu.edu.cn
Changjiang Zhou
College of Computer Science and Technology
Jilin University
Changchun
Haotian Tang
College of Computer Science and Technology
Jilin University
Changchun
Luguang Liang
College of Computer Science and Technology
Jilin University
Changchun
Yu Jiang
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Jilin University
jiangyu2011@jlu.edu.cn
Hong Qi
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Jilin University
ABSTRACT
Named entity recognition is a traditional task in natural language processing. In particular, nested entity recognition has received extensive attention because nesting is widespread in real text. The latest research migrates the well-established set prediction paradigm from object detection to cope with entity nesting. However, these approaches are limited by manually created query vectors, which cannot adapt to the rich semantic information in the context. This paper presents an end-to-end entity detection approach with a proposer and a regressor to tackle these issues. First, the proposer utilizes a feature pyramid network to generate high-quality entity proposals. Then, the regressor refines the proposals to produce the final prediction. The model adopts an encoder-only architecture and thus gains the advantages of semantically rich queries, high-precision entity localization, and easy training. Moreover, we introduce novel spatially modulated attention and progressive refinement for further improvement. Extensive experiments demonstrate that our model achieves advanced performance on flat and nested NER, with new state-of-the-art F1 scores of 80.74 on the GENIA dataset and 72.38 on the WeiboNER dataset.
Keywords Named Entity Recognition, Set Prediction, Attention, Feature Pyramid
1 Introduction
Named entity recognition, which identifies text spans of specific entity categories, is a fundamental task in natural language processing. It plays a crucial role in many downstream tasks such as relation extraction [1], information retrieval [2], and entity linking [3]. Models based on sequence labeling [4, 5] have achieved great success in this task. Mature and efficient as these models are, they fail to handle nested entities, a non-negligible scenario in real-world language. Some recent studies [6] have noted the formal consistency of object detection with NER tasks. Figure 1 shows instances where entities overlap with each other just as detection boxes intersect with each other.
[Figure 1 contains two NER examples beside flat and nested detection boxes: "Cytomegalovirus modulates interleukin-6 gene expression." with flat DNA and PRO entities, and "HB24 is likely to have an important role in lymphocytes as well as in certain developing tissues." with nested DNA and CELL entities; the detection panels show flat and overlapping boxes labeled Cow and Zebra.]
Figure 1: Examples for object detection and named entity recognition under flat and nested circumstances. Examples
are obtained from GENIA [7] and COCO2017 [8].
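To make the analogy concrete, the following minimal Python sketch (ours, not from the paper; the exact labels are illustrative) represents entities as (start, end, type) triples, the textual analogue of detection boxes, and shows how nesting reduces to span containment:

```python
# Entities as (start, end, label) spans over tokens -- the textual analogue
# of detection boxes. Sentence after Figure 1; labels are illustrative.
tokens = ["Cytomegalovirus", "modulates", "interleukin-6", "gene",
          "expression", "."]

entities = {
    (0, 0, "PRO"),   # "Cytomegalovirus"
    (2, 2, "PRO"),   # "interleukin-6" -- nested inside the DNA span below
    (2, 3, "DNA"),   # "interleukin-6 gene"
}

def contains(outer, inner):
    """A span nests inside another when it lies entirely within it."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

print(contains((2, 3, "DNA"), (2, 2, "PRO")))  # True: a nested entity
```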
A few previous works have designed proprietary structures to deal with nested entities, such as the constituency graph [9] and the hypergraph [10]. Other works [11, 12] capture entities through layered models containing multiple recognition layers. Despite the success achieved by these approaches, they inevitably necessitate sophisticated transformations and costly decoding processes, introducing extra errors compared to an end-to-end manner.
Seq2Seq methods [13] can address various kinds of NER subtasks in a unified form. However, these methods have difficulty defining the order of the outputs due to the natural conflict between sets and sequences, which limits model performance. Span-based approaches [14, 15], which identify entities by enumerating all candidate spans in a sentence and classifying them, have also received much attention. Although enumeration can be theoretically exhaustive, its high computational complexity burdens these methods. Moreover, these methods mainly focus on learning span representations without the supervision of entity boundaries [16]. Further, enumerating all subsequences of a sentence generates many negative samples, which reduces the recall rate. Some recent work, including set prediction networks, has attempted to address these defects.
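As a rough illustration of this burden (our numbers, not the paper's), the count of candidate spans grows quadratically with sentence length:

```python
# A sentence of L tokens yields L*(L+1)/2 candidate spans; with only a few
# gold entities, almost every candidate is a negative sample.
L = 32                                  # illustrative sentence length
candidates = [(i, j) for i in range(L) for j in range(i, L)]
print(len(candidates))                  # 528 candidate spans
print(len(candidates) - 3)              # 525 negatives if 3 gold entities
```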
The latest works [17] treat information extraction as a reading comprehension task, extracting entities and relations through manually constructed queries. The set prediction network [18] has been introduced to entity and relation extraction. Because the technique accommodates the unordered character of the prediction target, these methods achieve great success. However, most of them still confront problems caused by query vectors: random initialization leaves the queries without sufficient semantic information and makes it difficult to learn proper attention patterns.
This paper presents an end-to-end entity detection network that predicts all entities in a single run and thus is no longer affected by prediction order. The proposed model transforms the NER task into a set prediction problem. First, we utilize a feature pyramid network to build the proposer, which generates high-quality entity proposals consisting of semantically rich query vectors, high-overlap spans, and category logits. High-quality proposals significantly alleviate the difficulty of training. Then, the encoder-only regressor, constructed from an iterative transformer, performs a regression procedure on the entity proposals. In contrast to span-based methods that discard partially matched proposals, the regressor adjusts these proposals to improve model performance. The prediction head computes probability distributions for each entity proposal to identify the entities. In the training phase, we dynamically assign prediction targets to each proposal.
Moreover, we introduce a novel spatially modulated attention in this paper. It guides the model to learn more reasonable attention patterns and enhances the sparsity of the attention map by making full use of spatial prior knowledge, which improves the model's performance. We also correct the entity proposals at every layer of the regressor network, a strategy we call progressive refinement. This strategy increases the precision of the model and facilitates gradient backpropagation.
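As a rough sketch of the idea (our simplification, not necessarily the paper's exact formulation): a spatial prior centered on each query's current span estimate is added to the raw attention logits, concentrating attention near the proposal and sparsifying the attention map:

```python
import torch

def spatially_modulated_attention(logits, centers, widths, seq_len):
    """Hypothetical sketch: add a log-Gaussian spatial prior, centred on each
    query's current span estimate, to raw attention logits.
    logits: (Q, L) dot-product scores; centers, widths: (Q,) per-query
    span centre and scale over token positions."""
    positions = torch.arange(seq_len, dtype=logits.dtype)        # (L,)
    prior = -((positions[None, :] - centers[:, None]) ** 2) / (
        2.0 * widths[:, None] ** 2)                              # (Q, L)
    return torch.softmax(logits + prior, dim=-1)                 # sparser map
```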
Our contributions can be summarized as follows:
• We design a proposer built on the feature pyramid to incorporate multi-scale features and initialize high-quality proposals with high-overlap spans and strongly correlated queries. Compared to previous works that randomly initialize query vectors, the proposer network greatly reduces training difficulty.
• We deploy an encoder-only framework in the regressor, which avoids the handcrafted construction of query vectors and the hardship of learning appropriate query representations, and thus notably eases convergence. An iterative refinement strategy is further utilized in the regressor to improve precision and promote gradient backpropagation.
• We introduce a novel spatially modulated attention mechanism that helps the model learn proper attention patterns. It dramatically improves the model's performance by integrating spatial prior knowledge to increase attention sparsity.
2 Related Work
In this section, we review related work on named entity recognition and set prediction, analyzing the different methodologies for the NER task and the trend of set prediction algorithm development.
2.1 Named Entity Recognition
Since traditional NER methods with sequence labeling [19, 20] have been well studied, many works [21] have been devoted to extending sequence tagging methods to nested NER. One of the most explored directions is the layered approach [22]. Other works deploy proprietary structures to handle nested entities, such as the hypergraph [10]. Although these methods have achieved advanced performance, they are still not flexible enough due to the need for manually designed labeling schemes. In contrast, the end-to-end framework proposed in this paper avoids this disadvantage and thus facilitates the implementation and migration of the method.
Seq2Seq approaches [23] unify different forms of nested entity problems into sequence generation problems. Even though this strategy avoids complicated annotation schemes, the sensitivity to decoding order and the beam search algorithm pose barriers to boosting model performance. The end-to-end model in this paper incorporates a set prediction algorithm to overcome the difficulties confronting the Seq2Seq model.
Span-based approaches [24, 25], which classify candidate spans to identify entities, also draw broad interest. The span-based method formally resembles the object detection task in computer vision. Based on the long-standing idea [26] of associating images with natural language, one work [6] proposes a two-stage identifier that fully exploits partially matched entity proposals and introduces a regression procedure. Despite the instructiveness of migrating proven methods from computer vision to NER tasks, the error propagation of two-stage models and the proper way to perform boundary regression in language remain issues to address. In this paper, we combine the proposal and regression stages into an end-to-end framework and refine entity proposals based on probability distributions to fit the linguistic nature of the NER task.
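A minimal sketch of what distribution-based refinement can look like (our reading, with illustrative names; the paper's prediction head is described in Section 3): each boundary is re-estimated as an expectation over a categorical distribution on token positions rather than a real-valued offset, which respects the discrete nature of text:

```python
import torch

def refine_boundaries(boundary_logits):
    """boundary_logits: (num_proposals, L) scores for one boundary (e.g. the
    span start) over the L token positions. Returns the expected position,
    a soft, differentiable update of each proposal's boundary."""
    probs = torch.softmax(boundary_logits, dim=-1)
    positions = torch.arange(probs.size(-1), dtype=probs.dtype)
    return probs @ positions          # (num_proposals,) expected positions
```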
2.2 Set Prediction
Several recent pieces of research [18, 27] have deployed the set prediction network in information extraction tasks and proved its effectiveness. These works can be seen as variants of DETR [28], which was proposed for object detection and uses a transformer decoder to update manually created query vectors for the generation of detection boxes and corresponding categories.

Models based on set prediction networks, especially DETR, have been extensively studied. Slow convergence due to the random initialization of the object queries is the fundamental obstacle of DETR. A two-stage model [29] with a feature pyramid [30], which generates high-quality queries and introduces multi-scale features, has been proposed to settle the convergence problem; that work also discusses the necessity of cross-attention and suggests that an encoder-only network can achieve equally satisfactory results. Spatially modulated co-attention [31], which integrates spatial prior knowledge, has also been introduced to ease the problem; it increases the sparsity of attention through prior knowledge in order to accelerate training. The thought-provoking deformable attention [32] shows that models can learn the spatial structure of attention, and it also improves model performance by iteratively refining the detection boxes.
[Figure 2 shows the overall architecture: character, BERT, word, and POS embeddings feed a BiGRU; a pyramid proposer generates and fuses proposals; an iterative transformer refines them; and prediction heads output triples such as (2,3,DNA), (4,4,CELL), (4,5,CELL), or None.]
Figure 2: Architecture of our proposed model for end-to-end entity detection.
A considerable amount of work has been proposed to solve the convergence problem of set prediction. Previous work [27] employing set prediction requires delicate selection of the number of queries to reach a promising speed of model convergence. Inspired by the earlier analysis [29], we propose an end-to-end framework to settle the problem.
3 Method
In this section, we detail our method. The general framework of our model is shown in Figure 2 and is constructed from the following parts:

Sentence Encoder: We utilize hybrid embeddings to encode the sentence. The generated embeddings are then fused by a BiGRU [33] to produce the final multi-granularity representation of the sentence.

Proposer: We build a feature pyramid through a stack of BiGRU and CNN [34] layers to constitute the proposer network. The proposer exploits multi-scale features to initialize the entity proposals.

Regressor: We design a regressor that refines the proposals progressively to locate and classify spans more accurately. The regressor is built by stacking update layers constructed with the spatially modulated attention mechanism.

Prediction Head: The prediction head outputs a span-location probability distribution based on the refined proposals. This distribution is combined with the probabilities generated from the category logits to compute a joint probability distribution, from which we obtain the final predictions. A structural sketch of the whole pipeline follows.
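The skeleton below (module names and interfaces are ours, purely illustrative, not the authors' code) summarizes how the four parts compose:

```python
from torch import nn

class EntityDetector(nn.Module):
    """Illustrative skeleton of the pipeline: encoder -> proposer ->
    iterative regressor -> prediction head. Submodules are passed in;
    their signatures here are assumptions."""

    def __init__(self, encoder, proposer, regressor_layers, head):
        super().__init__()
        self.encoder = encoder
        self.proposer = proposer
        self.layers = nn.ModuleList(regressor_layers)
        self.head = head

    def forward(self, sentence):
        H = self.encoder(sentence)                      # Sec. 3.1
        queries, spans, cls_logits = self.proposer(H)   # Sec. 3.2
        for layer in self.layers:                       # progressive
            queries, spans = layer(queries, spans, H)   # refinement
        return self.head(queries, spans, cls_logits)    # joint distribution
```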
3.1 Sentence Encoder
The goal of this component is to transform the original sentence into dense hidden representations. Given the input sentence $S = [w_1, w_2, \ldots, w_L]$, we represent the $i$-th token $w_i$ with the concatenation of multi-granularity embeddings as follows:

$h_i^0 = [\,h_i^{char};\ h_i^{bert};\ h_i^{word};\ h_i^{pos}\,] \quad (1)$
The character-level embedding $h^{char}$ is generated by fusing each character's embedding $h^c$ through a recurrent neural network and average-pooling the hidden states as follows:

$h_i^{char} = \mathrm{Pool}\big(\mathrm{BiGRU}([h_1^c, h_2^c, \ldots, h_D^c])\big) \quad (2)$
where $D$ is the number of characters constituting the token. The character-level embedding helps the model cope better with out-of-vocabulary words.
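A minimal PyTorch sketch of Eq. (2) (vocabulary size and dimensions are illustrative assumptions):

```python
import torch
from torch import nn

class CharEmbedding(nn.Module):
    """Eq. (2): a BiGRU over the character embeddings of one token,
    followed by average pooling. Hyperparameters are illustrative."""

    def __init__(self, num_chars=128, char_dim=50):
        super().__init__()
        self.embed = nn.Embedding(num_chars, char_dim)
        self.bigru = nn.GRU(char_dim, char_dim, bidirectional=True,
                            batch_first=True)

    def forward(self, char_ids):                 # (num_tokens, D) char ids
        states, _ = self.bigru(self.embed(char_ids))  # (num_tokens, D, 2*dim)
        return states.mean(dim=1)                     # Pool over characters
```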
$h^{bert}$ stands for the representation generated by the pre-trained language model BERT [35]. We follow [36] to obtain the contextualized representation by encoding the sentence together with its surrounding tokens. BERT separates the tokens into subtokens by WordPiece partitioning [37]. The representations of the subtokens are average-pooled to create the contextualized embedding as follows:

$h_i^{bert} = \mathrm{Pool}\big([h_1^b, h_2^b, \ldots, h_O^b]\big) \quad (3)$

where $O$ is the number of subtokens forming the token. The pre-trained model aids the generation of more contextually relevant text representations.

[Figure 3 depicts the data flow of the feature pyramid, with forward and backward blocks at each layer built from LayerNorm, BiGRU, Conv1D, Linear, and Pool operations.]
Figure 3: Data flow of feature pyramid and detailed structure of blocks.
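A small sketch of Eq. (3) (the subtoken-to-token alignment format is our assumption): given one BERT state per subtoken and a map from each subtoken to its source token, average-pool within every token:

```python
import torch

def pool_subtokens(subtoken_states, token_index):
    """subtoken_states: (num_subtokens, dim) BERT outputs; token_index:
    (num_subtokens,) source-token index of each subtoken.
    Returns (num_tokens, dim) average-pooled token representations."""
    num_tokens = int(token_index.max()) + 1
    sums = torch.zeros(num_tokens, subtoken_states.size(-1))
    sums.index_add_(0, token_index, subtoken_states)
    counts = torch.bincount(token_index, minlength=num_tokens)
    return sums / counts.clamp(min=1).unsqueeze(-1)
```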
For the word-level embedding $h_i^{word}$, we exploit pre-trained word vectors, including GloVe [38]. To introduce the semantic information of part-of-speech, we embed each token's POS tag as $h_i^{pos}$.
The multi-granularity embeddings are then fed into a BiGRU network to produce the hybrid embedding for the final representation $H$ of the sentence as follows:

$H^0 = [h_1^0, h_2^0, \ldots, h_L^0]$
$H = W\,\mathrm{BiGRU}(H^0) + b = [h_1, h_2, \ldots, h_L] \quad (4)$
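Putting Eqs. (1) and (4) together, a minimal PyTorch sketch of the fusion step (the embedding dimensions are illustrative assumptions):

```python
import torch
from torch import nn

class HybridEncoder(nn.Module):
    """Eqs. (1) and (4): concatenate the four per-token embeddings, fuse
    them with a BiGRU, and apply an affine map (the W, b of Eq. (4))."""

    def __init__(self, dims=(100, 768, 300, 25), hidden=200):
        super().__init__()
        self.bigru = nn.GRU(sum(dims), hidden, bidirectional=True,
                            batch_first=True)
        self.affine = nn.Linear(2 * hidden, hidden)

    def forward(self, h_char, h_bert, h_word, h_pos):  # each (B, L, dim_k)
        H0 = torch.cat([h_char, h_bert, h_word, h_pos], dim=-1)  # Eq. (1)
        states, _ = self.bigru(H0)
        return self.affine(states)                               # Eq. (4)
```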
3.2 Proposer
We tailor the pyramid network [22] to build our proposer, which integrates multi-scale features and reasonably initializes the proposals. Figure 3 illustrates the structure of the feature pyramid. The pyramid is constructed in a bottom-up and then top-down manner; this bidirectional construction allows better message passing between layers. We selectively merge the features at different layers to yield the initial proposals, in a manner similar to the attention mechanism.
3.2.1 Forward Block
The feature pyramid is first built from the bottom up. It consists of $L$ layers, each with two main components: a BiGRU and a CNN with kernel size $k$. At layer $l$, the BiGRU models the interconnections of spans of the same size, and the CNN aggregates $k$ neighboring hidden states, which are then passed to the higher layer. Each feature vector at layer $l$ thus represents a span of $v_l$ original tokens, where $v_l$ can be calculated as:

$v_l = 1 - l + \sum_{i=1}^{l} k_i \quad (5)$
One may note that the pyramid structure provides an inherent inductive bias: the higher the layer, the shorter the input sequence, so feature vectors at higher levels represent long entities and those at lower levels represent short entities. Moreover, since the input scales of the layers differ, we apply Layer Normalization [39] before feeding the hidden states into the BiGRU.
As described above, the forward block can be formalized as follows:

$H_l^f = W\,\mathrm{BiGRU}\big(\mathrm{LayerNorm}(H_l^0)\big) + b$
$H_{l+1}^0 = \sigma\big(\mathrm{Conv1D}(H_l^f)\big) \quad (6)$
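A minimal sketch of Eqs. (5) and (6) (the activation choice and dimensions are our assumptions):

```python
import torch
from torch import nn

class ForwardBlock(nn.Module):
    """One forward block of the pyramid, Eq. (6): LayerNorm -> BiGRU ->
    affine gives the layer features H_l^f; a Conv1D of kernel size k
    (no padding) shortens the sequence by k-1 for the next layer."""

    def __init__(self, dim=200, k=2):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.bigru = nn.GRU(dim, dim // 2, bidirectional=True,
                            batch_first=True)
        self.affine = nn.Linear(dim, dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=k)

    def forward(self, H0):                        # (B, len_l, dim)
        states, _ = self.bigru(self.norm(H0))
        Hf = self.affine(states)                  # H_l^f of Eq. (6)
        Hnext = torch.relu(self.conv(Hf.transpose(1, 2))).transpose(1, 2)
        return Hf, Hnext                          # features, next-layer input

# With kernel size k_i = k at every layer, Eq. (5) gives
# v_l = 1 - l + l*k = 1 + l*(k - 1); e.g. k = 2 yields v_l = l + 1.
```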