
for the span representations. In addition, DSpERT achieves particularly amplified performance improvements against its shallow counterparts² on long-span entities and nested structures.
Different from most related work, which focuses on decoder designs (Yu et al., 2020; Li et al., 2020b; Shen et al., 2021; Li et al., 2022), we make an effort to optimize the span representations, but employ a simple and standard neural classifier for decoding. This exposes the pre-logit representations that directly determine the entity predictions, and thus allows the kind of representation analysis widely employed in the broader machine learning community (Van der Maaten and Hinton, 2008; Krizhevsky et al., 2012). This sheds light on building neural NER systems with higher robustness and interpretability (Ouchi et al., 2020).
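As a concrete illustration of the representation analysis this enables, the sketch below (ours, not from the paper) projects hypothetical pre-logit span representations into two dimensions with t-SNE and colors them by gold entity type; the array names, shapes, and random data are placeholders.

```python
# A minimal sketch (illustrative only): t-SNE visualization of hypothetical
# pre-logit span representations, one d-dimensional vector per candidate span.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder data standing in for representations collected from a trained model.
span_reprs = np.random.randn(1000, 768)            # (num_spans, d)
span_labels = np.random.randint(0, 5, size=1000)   # gold entity-type ids

# Project to 2-D and color points by entity type.
points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(span_reprs)
plt.scatter(points[:, 0], points[:, 1], c=span_labels, s=5, cmap="tab10")
plt.title("t-SNE of pre-logit span representations")
plt.show()
```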
2 Related Work
For a long time, NER research focused on recognizing flat entities. After the introduction of the linear-chain conditional random field (Collobert et al., 2011), neural sequence tagging models became the de facto standard solution for flat NER tasks (Huang et al., 2015; Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Zhang and Yang, 2018).
Recent studies pay much more attention to nested NER, with which a plain sequence tagging model struggles (Ju et al., 2018). This has stimulated a number of novel NER system designs beyond the sequence tagging framework. Hypergraph-based methods extend sequence tagging by allowing multiple tags for each token and multiple tag transitions between adjacent tokens, which is compatible with nested structures (Lu and Roth, 2015; Katiyar and Cardie, 2018). Span-based models enumerate candidate spans and classify them into entity categories (Sohrab and Miwa, 2018; Eberts and Ulges, 2020; Yu et al., 2020). Li et al. (2020b) reformulate nested NER as a reading comprehension task. Shen et al. (2021, 2022) borrow methods from image object detection to solve nested NER. Yan et al. (2021) propose a generative approach, which encodes the ground-truth entity set as a sequence and thus reformulates NER as a sequence-to-sequence task.
Li et al. (2022) describe the entity set with word-word relations, and solve nested NER via word-word relation classification.

² In this paper, unless otherwise specified, we use "shallow" to refer to models that construct span representations by shallowly aggregating (typically top) token representations, although the token representations themselves could be "deep".
Span-based models are probably the most straightforward among these approaches. However, existing span-based models typically build span representations by shallowly aggregating the top token representations from a standard text encoder. Here, the shallow aggregation could be pooling over the sequence dimension (Eberts and Ulges, 2020; Shen et al., 2021), integrating the starting and ending token representations (Yu et al., 2020; Li et al., 2020d), or a concatenation of these results (Sohrab and Miwa, 2018). Such shallow aggregation may be too simple to capture the information embedded in long spans; moreover, when spans overlap, the resulting span representations are technically coupled because of the shared tokens. These issues ultimately lead to performance degradation.
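For concreteness, the following minimal sketch (ours, not tied to any particular cited system) shows two typical shallow aggregations over the top-layer token representations, and how overlapping spans reuse the same token vectors, which is the coupling discussed above.

```python
# A minimal sketch (assumptions, illustrative only): two common "shallow" span
# aggregations over top-layer token representations of shape (T, d).
import torch

def mean_pool_span(H_top: torch.Tensor, start: int, end: int) -> torch.Tensor:
    """Average the token vectors inside span [start, end) -- pooling-style aggregation."""
    return H_top[start:end].mean(dim=0)                       # (d,)

def start_end_concat(H_top: torch.Tensor, start: int, end: int) -> torch.Tensor:
    """Concatenate the starting and ending token vectors -- boundary-style aggregation."""
    return torch.cat([H_top[start], H_top[end - 1]], dim=-1)  # (2d,)

# Overlapping spans share token vectors, so their representations are coupled:
H_top = torch.randn(12, 768)           # top-layer outputs for a 12-token sentence
span_a = mean_pool_span(H_top, 2, 7)   # tokens 2..6
span_b = mean_pool_span(H_top, 4, 9)   # tokens 4..8 (shares tokens 4..6 with span_a)
```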
Our DSpERT addresses this issue with a multi-layered, bottom-to-top construction of span representations. Empirical results show that such deep span representations outperform their shallow counterparts both qualitatively and quantitatively.
3 Methods
Deep Token Representations. Given a $T$-length sequence passed into an $L$-layered, $d$-dimensional Transformer encoder (Vaswani et al., 2017), the initial token embeddings, together with the potential positional and segmentation embeddings (e.g., BERT; Devlin et al., 2019), are denoted as $\mathbf{H}^0 \in \mathbb{R}^{T \times d}$. Thus, the $l$-th ($l = 1, 2, \dots, L$) token representations are:

$$\mathbf{H}^l = \mathrm{TrBlock}(\mathbf{H}^{l-1}, \mathbf{H}^{l-1}, \mathbf{H}^{l-1}), \quad (1)$$

where $\mathrm{TrBlock}(\mathbf{Q}, \mathbf{K}, \mathbf{V})$ is a Transformer encoder block that takes $\mathbf{Q} \in \mathbb{R}^{T \times d}$, $\mathbf{K} \in \mathbb{R}^{T \times d}$, and $\mathbf{V} \in \mathbb{R}^{T \times d}$ as the query, key, and value inputs, respectively. It consists of a multi-head attention module and a position-wise feed-forward network (FFN), both followed by a residual connection and a layer normalization. Passing the same matrix, i.e., $\mathbf{H}^{l-1}$, for the queries, keys, and values exactly results in self-attention (Vaswani et al., 2017).
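A minimal sketch of Eq. (1) follows, under common assumptions (post-norm ordering, no dropout or attention masking); the class and variable names are ours. Each block applies multi-head attention and an FFN, each followed by a residual connection and layer normalization, and passing $\mathbf{H}^{l-1}$ as queries, keys, and values yields self-attention.

```python
# A minimal sketch of Eq. (1): an L-layered stack of Transformer encoder blocks,
# each applied with Q = K = V = H^{l-1} (i.e., self-attention).
import torch
import torch.nn as nn

class TrBlock(nn.Module):
    def __init__(self, d: int = 768, num_heads: int = 12, d_ff: int = 3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, Q, K, V):
        # Multi-head attention, followed by a residual connection and layer norm.
        attn_out, _ = self.attn(Q, K, V)
        x = self.norm1(Q + attn_out)
        # Position-wise FFN, again followed by a residual connection and layer norm.
        return self.norm2(x + self.ffn(x))

# H^l = TrBlock(H^{l-1}, H^{l-1}, H^{l-1}) for l = 1, ..., L.
L, T, d = 12, 32, 768
blocks = nn.ModuleList([TrBlock(d) for _ in range(L)])
H = [torch.randn(1, T, d)]                # H[0]: initial token (+ position) embeddings
for block in blocks:
    H.append(block(H[-1], H[-1], H[-1]))  # self-attention: same matrix for Q, K, V
```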
The resulting top representations $\mathbf{H}^L$, computed through $L$ Transformer blocks, are believed to embrace deep, rich, and contextualized semantics that are useful for a wide range of tasks. Hence, in a typical neural NLP modeling paradigm, only the