Improving Chinese Named Entity Recognition by Search Engine
Augmentation
Qinghua Mao and Kui Meng
Shanghai Jiao Tong University
{mmmm2018,mengkui}@sjtu.edu.cn
Jiatong Li
The University of Melbourne
jiatongl3@student.unimelb.edu.au
Abstract
Compared with English, Chinese suffers from more grammatical ambiguities, such as fuzzy word boundaries and polysemous words. As a result, contextual information alone is not sufficient to support Chinese named entity recognition (NER), especially for rare and emerging named entities. Semantic augmentation using external knowledge is a potential way to alleviate this problem, but how to obtain and leverage external knowledge for the NER task remains a challenge. In this paper, we propose a neural approach that performs semantic augmentation for Chinese NER using external knowledge from a search engine. In particular, a multi-channel semantic fusion model is adopted to generate augmented input representations by aggregating external related texts retrieved from the search engine. Experiments on four NER datasets covering both formal and social media language contexts show the superiority of our model and further prove the effectiveness of our approach.
1 Introduction
Unlike English, Chinese requires word segmentation and suffers from more polysemous words and grammatical ambiguities. When contextual information is limited, external knowledge can be leveraged to support entity disambiguation, which is critical for improving Chinese NER, especially for rare and emerging named entities.
Apart from lexical information (Gui et al., 2019), other external sources of information have been leveraged to perform semantic augmentation for NER, such as external syntactic features (Li et al., 2020a), character radical features (Xu et al., 2019), and domain-specific knowledge (Zafarian and Asghari, 2019). However, extracting this information takes extra effort, and most of it is domain-specific. A search engine is a straightforward way to retrieve open-domain external knowledge, which can serve as evidence for recognizing ambiguous named entities. A motivating example is shown in Figure 1.
Figure 1: A motivating example of recognizing new entities using external related texts retrieved from the search engine. The original input is "失去懂王的日子让我索然无味" ("Life without the king who knows everything makes me dull"), in which "懂王" ("the king who knows everything", "Dong Wang") is an unconventional named entity referring to Donald J. Trump. The retrieved texts explain that "懂王" is an internet slang term for the former U.S. president.
In this paper, we propose to improve Chinese NER by semantic augmentation through a search engine. Inspired by Fusion-in-Decoder (Izacard and Grave, 2021), we propose a multi-channel semantic fusion NER model that leverages external knowledge to augment the contextual information of the original input. Given external related texts retrieved from the search engine, our model first adopts a multi-channel BERT encoder to encode each text independently. An attention fusion layer then incorporates the external knowledge into the original input representation. Finally, the fused semantic representation is fed into a CRF layer for decoding.

We also implement an external related texts generation module to optimize the retrieval results from the search engine. TextRank (Mihalcea and Tarau, 2004) and BM25 (Robertson and Zaragoza, 2009) are utilized to generate external related texts that are semantically relevant to the original input sentence.
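The paper does not give implementation details for the ranking step; the following is a minimal, dependency-free sketch of BM25 scoring over retrieved snippets, assuming character-level tokenization for Chinese. The `bm25_scores` helper and the example snippets are hypothetical illustrations, not the authors' code.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each candidate document against the query with Okapi BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # document frequency of each query term across the candidates
    df = {t: sum(1 for d in docs_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Rank retrieved snippets against the input (split into characters,
# a common choice for Chinese text without word segmentation)
query = list("懂王特朗普")
snippets = [list("懂王是网络流行语指特朗普"), list("今天天气不错")]
scores = bm25_scores(query, snippets)
ranked = sorted(range(len(snippets)), key=lambda i: -scores[i])
```

In this sketch the top-ranked snippets would be kept as the K external related texts; a TextRank keyword pass over the input could likewise be used to shorten the query before retrieval.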
The experimental results on the generic and social media domains show the superiority of our approach, demonstrating that search engine augmentation can effectively improve Chinese NER, especially in the social media domain.

arXiv:2210.12662v1 [cs.CL] 23 Oct 2022

Figure 2: The multi-channel semantic fusion NER model. A multi-channel BERT encodes each text independently. The fusion layer generates fused semantic representations based on the attention mechanism, which are fed into the CRF layer for named entity prediction.
2 Model
The proposed approach can be described in two
steps. Given an input sentence, external related
texts are retrieved from a search engine. The origi-
nal input sentence, along with external related texts,
is fed into the multi-channel semantic fusion NER
model to generate fused representations which ag-
gregates external knowledge obtained from the
search engine.
2.1 Multi-channel Semantic Fusion
We view the NER task as a sequence labeling problem. Our multi-channel semantic fusion NER model is shown in Figure 2, with BERT-CRF (Souza et al., 2019) serving as the backbone structure.
Given the original input x and K external related texts X̃ = {x̃_1, x̃_2, ..., x̃_K}, the multi-channel BERT encoder is utilized to encode each text independently, from which the original input embedding H_x and the external embedding H_external are obtained:

    [H_x, H_external] = BERT([x, X̃])    (1)

where H_external = {h_x̃1, h_x̃2, ..., h_x̃K}.
Processing texts independently with a multi-channel encoder means that the computation time grows only linearly with the number of texts, which makes the model more extensible. Meanwhile, the contextual information of each channel remains independent, which facilitates the subsequent semantic fusion.
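As a back-of-the-envelope illustration (not from the paper): self-attention cost grows roughly quadratically with sequence length, so encoding K texts of length L in separate channels costs about K·L² score computations, versus (K·L)² if all texts were concatenated into one sequence.

```python
# Toy cost model: self-attention over a sequence of length L computes ~L**2
# pairwise scores. The constants below (K external texts of L tokens each)
# are illustrative, not taken from the paper.
def independent_cost(K, L):
    """Each of the K texts is encoded in its own channel."""
    return K * L ** 2

def joint_cost(K, L):
    """All K texts concatenated into a single sequence."""
    return (K * L) ** 2

K, L = 4, 128
ind, joint = independent_cost(K, L), joint_cost(K, L)
# joint encoding is K times more expensive than per-channel encoding
```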
We feed the input embedding and the external embedding into the attention fusion layer to generate the fused semantic representation, which is finally fed into the CRF layer for decoding. In particular, for the input embedding H_x, we compute attention scores over the external embedding H_external to generate the context embedding H_context, which fuses external knowledge according to its semantic relevance to the original input. The fused semantic representation H_fusion is acquired by calculating the weighted sum of the input embedding and the context embedding, whose weights are set to a fusion factor p and 1 − p respectively. A token-level illustration is shown in Figure 3.

    H_context = Attention(H_x, H_external)    (2)

    H_fusion = p × H_x + (1 − p) × H_context    (3)

Figure 3: A token-level illustration of semantic fusion, in which the input "懂王将归来" attends over the external text "懂王特朗普" to produce the context embedding, mixed with the input embedding by the fusion factor p.
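Equations (2) and (3) can be sketched in plain Python as follows. This is a toy, dependency-free illustration that assumes dot-product attention and a flattened list of external token embeddings; the paper does not specify the exact attention variant, so `fuse` is a hypothetical helper.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fuse(H_x, H_ext, p=0.5):
    """Eq. (2)-(3): per-token attention over external embeddings,
    then a weighted sum controlled by the fusion factor p."""
    H_fusion = []
    for h in H_x:                                     # one query per input token
        scores = softmax([dot(h, e) for e in H_ext])  # attention scores
        h_ctx = [sum(w * e[i] for w, e in zip(scores, H_ext))
                 for i in range(len(h))]              # context embedding, Eq. (2)
        H_fusion.append([p * a + (1 - p) * b
                         for a, b in zip(h, h_ctx)])  # fused representation, Eq. (3)
    return H_fusion

H_x = [[1.0, 0.0], [0.0, 1.0]]     # toy input embeddings (2 tokens, dim 2)
H_ext = [[1.0, 0.0], [0.5, 0.5]]   # toy external token embeddings
H_fusion = fuse(H_x, H_ext, p=0.7)
```

Note that setting p = 1 recovers the original input embedding unchanged, matching the design intent that the input is given priority and external knowledge is only supplementary.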
Three points are taken into account when designing the fusion layer. First, sequence dependency should be preserved, which is very important for NER. Second, the relation between the original input and the external contexts should be respected in the semantic fusion, i.e., the former is given priority and external knowledge serves only as supplementary evidence. Third, not all external related texts are necessary for semantic augmentation; we should focus on the parts that help to accurately identify the named entity.