knowledge, which improves the model's performance. We also correct the entity proposals at every layer of the regressor network, a strategy we call progressive refinement. This strategy increases the precision of the model and facilitates gradient backpropagation.
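As a rough illustration (the function names and per-layer offset predictors below are hypothetical stand-ins, not the paper's implementation), progressive refinement can be sketched as repeatedly correcting span boundaries with offsets predicted at each regressor layer:

```python
import numpy as np

def progressive_refine(proposals, layers):
    """Refine entity-span proposals layer by layer.

    proposals: (N, 2) float array of (left, right) span boundaries.
    layers: list of callables, each predicting a boundary correction
            from the current proposals (stand-ins for regressor layers).
    """
    for layer in layers:
        offsets = layer(proposals)       # (N, 2) predicted corrections
        proposals = proposals + offsets  # correct the proposal at this layer
    return proposals
```

Because every layer outputs a correction relative to the previous estimate, gradients flow through short per-layer residual paths rather than one long regression chain.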
Our contributions can be summarized as follows:
• We design a proposer built on a feature pyramid to incorporate multi-scale features and initialize high-quality proposals with high-overlap spans and strongly correlated queries. Compared to previous works, which randomly initialize query vectors, the proposer network may greatly reduce training difficulty.
• We deploy an encoder-only framework in the regressor, which avoids the handcrafted construction of query vectors and the difficulty of learning appropriate query representations, thus notably easing convergence. An iterative refinement strategy is further utilized in the regressor to improve precision and promote gradient backpropagation.
• We introduce a novel spatially modulated attention mechanism that helps the model learn proper attention patterns. It dramatically improves the model's performance by integrating spatial prior knowledge to increase attention sparsity.
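The bullets above can be made concrete with a minimal sketch of spatial modulation (the names and the Gaussian form of the prior are illustrative assumptions, not the exact mechanism in this paper): attention logits are biased by a Gaussian prior centered on each query's proposed span, which concentrates attention mass near the span and sparsifies the pattern:

```python
import numpy as np

def spatially_modulated_attention(scores, centers, widths, positions):
    """Modulate attention scores with a Gaussian spatial prior.

    scores:    (Q, T) raw attention logits (query x token).
    centers:   (Q,) predicted span centers, one per query.
    widths:    (Q,) predicted span widths controlling prior spread.
    positions: (T,) token positions.
    """
    # Gaussian log-prior peaked at each query's span center; adding it
    # to the logits suppresses attention far from the proposed span.
    prior = -((positions[None, :] - centers[:, None]) ** 2) / (
        2.0 * (widths[:, None] ** 2) + 1e-6
    )
    logits = scores + prior
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)
```

With uniform raw scores, the modulated weights peak exactly at each query's predicted center, which is the intended inductive bias.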
2 Related Work
In this section, we review related work on named entity recognition (NER) and set prediction. We analyze different methodologies for the NER task and trace the development of set prediction algorithms.
2.1 Named Entity Recognition
Since traditional NER methods with sequence labeling [19, 20] have been well studied, many works [21] have been devoted to extending sequence tagging methods to nested NER. One of the most explored methods is the layered approach [22]. Other works deploy proprietary structures to handle nested entities, such as the hypergraph [10].
Although these methods have achieved advanced performance, they are still not flexible enough due to the need
for manually designed labeling schemes. In contrast, the end-to-end framework proposed in this paper avoids this
disadvantage and thus facilitates the implementation and migration of the method.
The Seq2Seq approach [23] unifies different forms of nested entity problems into a sequence generation problem. Even though this strategy avoids complicated annotation schemes, sensitivity to the decoding order and the beam search algorithm poses a barrier to boosting model performance. The end-to-end model in this paper incorporates the set prediction algorithm in order to overcome the difficulties confronting Seq2Seq models.
Span-based approaches [24, 25], which classify candidate spans to identify entities, have also drawn broad interest. The span-based method formally resembles the object detection task in computer vision. Based on the long-standing idea [26] of associating images with natural language, one work [6] proposes a two-stage identifier that fully exploits partially matched entity proposals and introduces a regression procedure. Despite the instructiveness of migrating proven computer vision methods to NER tasks, the error propagation caused by two-stage models and the proper way to perform boundary regression on language remain issues to address. In this paper, we combine the proposal and regression stages into an end-to-end framework and refine entity proposals based on probability distributions to fit the linguistic setting of the NER task.
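Since token positions are discrete, direct coordinate regression as in object detection does not transfer cleanly to text. As a hedged sketch (a soft-argmax stand-in, not necessarily this paper's exact formulation), refining a boundary from a probability distribution over token positions can look like:

```python
import numpy as np

def expected_boundary(logits):
    """Soft boundary estimate from a distribution over token indices.

    logits: (T,) scores for each token being the boundary.
    Returns the probability-weighted expected index, which gives a
    differentiable "regression" target suited to discrete text positions.
    """
    logits = logits - logits.max()          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float((probs * np.arange(len(logits))).sum())
```

The expectation stays differentiable with respect to the logits, so a boundary can be refined by gradient descent even though token indices themselves are discrete.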
2.2 Set Prediction
Several recent pieces of research [18, 27] have deployed set prediction networks in information extraction tasks and proved their effectiveness. These works can be seen as variants of DETR [28], which was proposed for object detection and uses a transformer decoder to update manually created query vectors for generating detection boxes and corresponding categories.
Models based on set prediction networks, especially DETR, have been extensively studied. Slow convergence due to the random initialization of the object queries is the fundamental obstacle of DETR. A two-stage model [29] with a feature pyramid [30], which generates high-quality queries and introduces multi-scale features, is proposed to settle the convergence problem. That work also discusses the necessity of cross-attention and suggests that an encoder-only network can achieve equally satisfactory results. Spatially modulated co-attention [31], which integrates spatial prior knowledge, is also introduced to ease the problem. This work increases the sparsity of attention through a priori knowledge for the purpose of accelerating training. The thought-provoking deformable attention [32] is presented,