HUE: Pretrained Model and Dataset for
Understanding Hanja Documents of Ancient Korea
Haneul Yoo1, Jiho Jin1, Juhee Son1, JinYeong Bak2, Kyunghyun Cho3, Alice Oh1
1KAIST, South Korea, 2Sungkyunkwan University, South Korea, 3New York University, USA
{haneul.yoo, jinjh0123, sjh5665}@kaist.ac.kr,
jy.bak@skku.edu, kyunghyun.cho@nyu.edu, alice.oh@kaist.edu
Abstract
Historical records in Korea before the 20th
century were primarily written in Hanja, an
extinct language based on Chinese characters
and not understood by modern Korean or Chinese
speakers. Historians with expertise in
this time period have been analyzing the documents,
but the process is difficult and time-consuming;
language models could significantly speed it up. Toward
building and evaluating language models for
Hanja, we release the Hanja Understanding
Evaluation dataset consisting of chronologi-
cal attribution, topic classification, named en-
tity recognition, and summary retrieval tasks.
We also present BERT-based models further
pretrained on the two major corpora from
the 14th to the 19th centuries: the Annals of the
Joseon Dynasty and the Diaries of the Royal
Secretariats.¹ We compare the models with several
baselines on all tasks and show that
significant improvements are gained by training
on the two corpora. Additionally, we run zero-
shot experiments on the Daily Records of the
Royal Court and Important Officials (DRRI).
The DRRI dataset has not been studied much
by the historians, and not at all by the NLP
community.
1 Introduction
Large-scale historical records in Korea were mostly
produced during the Joseon dynasty (1392-1897),
and the Institute for the Translation of Korean Clas-
sics (ITKC) keeps a comprehensive database of Ko-
rean classics at a scale of approximately 9 billion
characters. This digital archive is a great resource
for Korean historians, but the documents remain in
the original Hanja language². Hanja is an extinct
language, and as Table 1 illustrates with a simple
sentence, Hanja is lexically and syntactically different
from modern Korean, as well as from simplified and
traditional Chinese. Understanding the documents
in the digital archive is thus difficult and would
benefit greatly from a Hanja language model, which
can also be used to accelerate expert translation
(Vale de Gato, 2015). There are two corpora we
can use to train the language model: the Annals of the
Joseon Dynasty (AJD), first introduced to the NLP
community by Bak and Oh (2015), and the Diaries
of the Royal Secretariats (DRS) (Kang et al., 2021).
¹ All code, models, and datasets are available at
https://github.com/haneul-yoo/HUE.git
² Hanja is a set of characters (script) used in ancient Korean,
while Hanmun is a writing style (language) of the same era.
However, we refer to Hanja as a language, following the
conventions of previous work.
Table 1: Example sentence in AJD, shown in Hanja, modern Korean,
simplified Chinese, traditional Chinese, and English.
English: The King attended the Royal Lecture.
[Hanja, Korean, and Chinese renderings not recoverable from this copy]
In this paper, we provide the HUE (Hanja
Understanding Evaluation) dataset consisting
of chronological attribution, topic classification,
named entity recognition and summary retrieval, a
suite of tasks to help build and evaluate the Hanja
language model. In addition to AJD and DRS, we
also work with the Daily Records of the Royal
Court and Important Officials (DRRI). Unlike AJD
and DRS that have been analyzed by historians and
contain their annotations, DRRI lacks such system-
atic analysis, and we use it for zero-shot learning
and introduce it to the NLP community.
arXiv:2210.05112v1 [cs.CL] 11 Oct 2022
We also provide pretrained language models
(PLMs) for Hanja, trained on AJD and DRS and
fine-tuned for each task in HUE. Our models pretrained
on the corpora from that era outperform the existing
language models built for ancient Chinese, confirming
the need for specially trained Hanja language
models. We also run additional experiments based
on the analyses of entity- and word-level changes
in AJD, controlling the input conditions by masking
named entities and giving the time period as input.
Finally, we demonstrate the effectiveness of our
Hanja language model for analyzing unseen documents,
running zero-shot experiments for chronological
attribution and named entity recognition
tasks on DRRI.
Our main contributions are as follows:
• We release the HUE dataset and Hanja PLMs
to help historians understand and analyze
a large volume of historical documents
written in Hanja. To the best of our knowledge,
this is the first work proposing Hanja
language models and releasing an NLP benchmark
dataset for ancient Hanja documents.
• We demonstrate that providing key information
such as named entities and document age
as input improves the performance of the Hanja
language model on the HUE tasks.
• We run zero-shot experiments on several HUE
tasks using DRRI, which has not been discussed
in the NLP community, and demonstrate
the performance of our Hanja language
models on unseen historical documents.
2 Background
2.1 Hanja
Hanja, the writing system based on ancient Chinese
characters, was the main writing system in Korea
before the 20th century, while Hangul, the unique
Korean alphabet, has been the main writing sys-
tem in Korea from the last century. Formal records
from the Joseon dynasty (1392-1897) are written
in Hanja, while spoken language and some written
documents were in Hangul, developed in the 15th
century. This co-existence of the written and collo-
quial languages has led Hanja to evolve to have the
basic syntax of classical Chinese, mixing with the
lexical, semantic, and syntactic characteristics of
colloquial Korean.
Hanja is significantly different from both modern
Korean and modern Chinese. Modern Korean
uses a different alphabet and structure. Traditional
Chinese shares some characters with Hanja,
but its lexicon has evolved greatly to reflect the
temporal, geographical, and cultural differences between
the Joseon dynasty and modern-day China.
Simplified Chinese, the current written language in
China, has diverged further because of the simplification
of the characters. These differences between
Hanja and other related languages would lead to
suboptimal performance when adopting the cur-
rent Chinese language models to NLU tasks for the
Korean historical Hanja documents.
2.2 Dataset
We describe the three corpora of records written in
Hanja during the Joseon dynasty, whose contents
and additional information such as topic and named
entities are provided by historians at ITKC³. Table 2 shows
the list of the Hanja corpora used.
Annals of the Joseon Dynasty (AJD), also
called the Veritable Records of the Joseon Dynasty,
is a corpus of 27 sets of chronological records;
each set covers one ruler’s reign. AJD was
translated into Korean between 1968 and 1993 and includes
relevant tags such as the named entities and
dates of the documents⁴. We use AJD for both
training our Hanja language models and building
the HUE dataset of NLP tasks.
Diaries of the Royal Secretariat (DRS) is a corpus
of detailed records of daily events and official
schedules of the court from the first King Taejo to
the last (27th) Sunjong. Many of the earlier records
were lost, and we use the extant records starting
from the 16th King Injo. DRS is known to hold
the largest amount of authentic historic records and
state secrets of the Joseon Dynasty⁵. We use DRS
to continue pretraining the language models.
Daily Records of the Royal Court and Important
Officials (DRRI) is a corpus of journals
written from the 21st King Yeongjo to the last
Emperor Sunjong, and presumably originated from the
diaries of the crown prince who was later enthroned
as the 22nd King Jeongjo. DRRI has
official daily records from both the central and the
local governments, so it encompasses all events in
the country and reports to the king, with summaries.
DRRI is known to include details and events of
the late Joseon Dynasty not recorded in the AJD
or DRS⁵, thus making it a good corpus for zero-shot
experiments. We use DRRI for the supervised
summary retrieval task and the zero-shot experiments
for chronological attribution and named entity
recognition.
³ https://db.itkc.or.kr/
⁴ http://esillok.history.go.kr/
⁵ http://english.cha.go.kr/
Dataset  Size    Training data  Downstream tasks  Zero-shot  King
AJD      230K    ✓              CA, TC, NER       -          Taejo (1st) - Sunjong (27th)
DRS      1,380K  ✓              -                 -          Injo (16th) - Sunjong (27th)
DRRI     426K    -              SR                CA, NER    Yeongjo (21st) - Sunjong (27th)
Table 2: Source corpora chosen for building HUE dataset and PLMs
3 HUE Dataset
The HUE (Hanja Understanding Evaluation)
dataset is built to help history scholars understand
Korean historical records written in Hanja.
HUE consists of chronological attribution, topic
classification, named entity recognition, and summary
retrieval, tasks that can provide
helpful information for studying the documents.
We expect that language models based on HUE
will ultimately help historians interpret unseen
historical documents and help the public grasp the basic
concepts of those documents. We describe each task in
detail below.
3.1 Task Description
Chronological Attribution (CA) is a classification
task predicting the ruling king when the document
was written. Given a Hanja document
from AJD, a classifier outputs one of the 27 kings
of the Joseon dynasty. Chronological attribution
of a newly discovered document is the first step in
anthology toward interpreting and translating it. Korean
historians mostly divide the history of the Joseon Dynasty
by the reigning king, so we treat
chronological attribution as a classification task.
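The label space of this task can be sketched as a 27-way index over the rulers in reign order. The king names below use common romanizations and come from general knowledge of the Joseon succession, not from the HUE release; the helper `predict_king` and the raw score list it consumes are hypothetical stand-ins for whatever classifier head is used.

```python
# Reign order of the 27 Joseon rulers (common romanizations; an assumption
# of this sketch, not a label list taken from the HUE release).
KINGS = [
    "Taejo", "Jeongjong", "Taejong", "Sejong", "Munjong", "Danjong", "Sejo",
    "Yejong", "Seongjong", "Yeonsangun", "Jungjong", "Injong", "Myeongjong",
    "Seonjo", "Gwanghaegun", "Injo", "Hyojong", "Hyeonjong", "Sukjong",
    "Gyeongjong", "Yeongjo", "Jeongjo", "Sunjo", "Heonjong", "Cheoljong",
    "Gojong", "Sunjong",
]

# Class index for each king, e.g. for building training targets.
LABEL_OF = {name: index for index, name in enumerate(KINGS)}


def predict_king(scores):
    """Return the king whose class score is highest (argmax over 27 classes)."""
    if len(scores) != len(KINGS):
        raise ValueError("expected one score per king")
    best = max(range(len(scores)), key=scores.__getitem__)
    return KINGS[best]
```

Treating each reign as one class keeps the output aligned with how the corpus itself is organized, at the cost of ignoring that reigns differ greatly in length.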
Topic Classification (TC) is a multi-class and
multi-label classification task to find the topics of
the given document. For TC, we use Hanja documents
from AJD. We suggest two levels of topics,
namely major and minor categories. The major
categories consist of 4 broad topics: politics, economy,
society, and culture. The minor categories comprise
106 sub-topics such as diplomacy, agriculture,
and science.
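Because a document may carry several topics at once, a common way to represent the targets is a multi-hot vector with one independent decision per class. The following is a minimal sketch for the four major categories; the function names and the 0.5 threshold are assumptions for illustration, not settings reported by the authors.

```python
# The four major categories named in the text.
MAJOR = ["politics", "economy", "society", "culture"]


def to_multi_hot(topics, classes=MAJOR):
    """Encode a collection of topic names as a 0/1 vector over `classes`."""
    unknown = set(topics) - set(classes)
    if unknown:
        raise ValueError(f"unknown topics: {sorted(unknown)}")
    return [1 if c in topics else 0 for c in classes]


def decode(probabilities, classes=MAJOR, threshold=0.5):
    """Keep every class whose independent probability reaches the threshold."""
    return [c for c, p in zip(classes, probabilities) if p >= threshold]
```

The same encoding extends to the 106 minor categories by passing a longer `classes` list.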
Named Entity Recognition (NER) is a sequence
tagging task that identifies two types of
named entities, person and location, in Hanja
documents from AJD. We divide the train, validation,
and test sets such that there are no overlapping
entities across the sets.
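The text does not say how the entity-disjoint split is produced. One way to guarantee the property is to group documents that share any entity (via union-find) and assign whole groups to a split; the sketch below does that, with the function name, the 80/10/10 ratios, and the greedy assignment all being assumptions of this example.

```python
def entity_disjoint_split(docs, ratios=(0.8, 0.1, 0.1)):
    """Split documents so that no entity appears in more than one split.

    docs: list of sets of entity strings, one set per document.
    Returns three lists of document indices (train, validation, test).
    """
    parent = list(range(len(docs)))

    def find(i):
        # Find the group representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_doc = {}  # entity -> first document index that contains it
    for i, entities in enumerate(docs):
        for entity in entities:
            if entity in first_doc:
                parent[find(i)] = find(first_doc[entity])
            else:
                first_doc[entity] = i

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)

    splits = ([], [], [])
    total = len(docs)
    # Place the largest groups first, each into the split furthest below its quota.
    for group in sorted(groups.values(), key=len, reverse=True):
        k = max(range(3), key=lambda s: ratios[s] * total - len(splits[s]))
        splits[k].extend(group)
    return splits
```

On real data, a few very frequent entities can merge most documents into one group, so the achieved ratios may drift far from the targets; a practical pipeline would need a policy for such entities.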
Summary Retrieval (SR) is a task to find, among
the summary candidates, the summary that best
matches the given content. For this task, we
use DRRI, in which each document is a pair of a summary
(gang) and detailed content (mok). Among the
426K articles in the DRRI dataset, 265K contain
both gang and mok. We also exclude those whose
gang is longer than the mok, since in these cases the gang
is not a summary of the mok. The final dataset contains 213K
pairs of content and the corresponding summaries.
We describe more details of the preprocessing in
the Appendix.
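The two filtering rules above, and the shape of the retrieval step, can be sketched as follows. Length is measured in characters here, which is an assumption; `char_overlap` is a deliberately simple stand-in scorer for illustration, not the model used in the paper.

```python
def keep_pair(gang, mok):
    """Keep an article only if it has both parts and gang is not longer than mok."""
    return bool(gang) and bool(mok) and len(gang) <= len(mok)


def char_overlap(a, b):
    """Jaccard overlap of character sets; a toy stand-in for a learned scorer."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def retrieve(mok, candidates, score=char_overlap):
    """Return the index of the candidate summary with the highest score."""
    return max(range(len(candidates)), key=lambda i: score(mok, candidates[i]))
```

Passing a different `score` function (for example, similarity between encoder embeddings) changes the ranking model without touching the retrieval loop.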
4 Hanja Pretrained Model
As far as we know, there have been no pretrained
language models for the Hanja language. One can
use related LMs: the pretrained models for ancient
Chinese as well as multilingual BERT (Devlin et al.,
2019), which includes traditional Chinese in its
training corpus. AnchiBERT (Tian et al., 2021) is
pretrained on ancient Chinese, using Chinese anthologies
written from around 1000 BC to 200 BC. Because there
is some vocabulary overlap between the Hanja documents
and traditional Chinese corpora, we can
adopt multilingual BERT and AnchiBERT to learn
the representations of the Hanja texts.
We propose pretrained language models suitable
for Hanja documents by continuing to pretrain
those two models on both AJD and DRS. Table 3
shows the ratio of unknown tokens in the test set
of AJD for each model. It implies that the existing
AnchiBERT and multilingual BERT can also be
exploited as language models for Hanja documents
written in the Joseon dynasty, but that the second phase
of pretraining on the corpora of that era remarkably
decreases the ratio of unknown tokens.
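The unknown-token ratio can be illustrated with a small sketch. It assumes character-level tokenization, in which each non-whitespace character is one token and any character missing from the vocabulary counts as unknown; the function name `unk_ratio` is hypothetical, and the exact tokenizers and vocabularies behind Table 3 are not specified in this section.

```python
def unk_ratio(texts, vocab):
    """Fraction of non-whitespace characters in `texts` that are not in `vocab`."""
    total = 0
    unknown = 0
    for text in texts:
        for ch in text:
            if ch.isspace():
                continue  # whitespace is not counted as a token
            total += 1
            if ch not in vocab:
                unknown += 1
    return unknown / total if total else 0.0
```

Comparing this ratio for two vocabularies on the same test texts gives a quick, model-free signal of how well each vocabulary covers the corpus.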