
tuned for each task in HUE. Our models pretrained
on corpora from that era outperform the existing
language models built for ancient Chinese, confirming
the need for specially trained Hanja language
models. Based on analyses of entity- and word-level
changes in AJD, we also run additional experiments
that control the input conditions by masking
named entities and providing the time period as input.
Finally, we demonstrate the effectiveness of our
Hanja language models for analyzing unseen documents
by running zero-shot experiments for chronological
attribution and named entity recognition
tasks on DRRI.
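The two input controls mentioned above (masking named entities and supplying the time period) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the exact procedure used in our experiments: the function name `build_input`, the `[MASK]` placeholder, character-level entity spans, and the bracketed period prefix are all hypothetical.

```python
from typing import List, Optional, Tuple

MASK_TOKEN = "[MASK]"  # assumed placeholder; the real mask token depends on the tokenizer


def build_input(
    text: str,
    entity_spans: List[Tuple[int, int]],
    period: Optional[str] = None,
    mask_entities: bool = True,
) -> str:
    """Build a model input under the two hypothetical input conditions.

    entity_spans are (start, end) character offsets with end exclusive.
    period is a free-form label for the time period (e.g. a ruler's name).
    """
    chars = list(text)
    if mask_entities:
        # Replace every character inside an entity span with the mask token.
        for start, end in entity_spans:
            for i in range(start, end):
                chars[i] = MASK_TOKEN
    body = "".join(chars)
    # Optionally prepend the time period as a bracketed prefix (assumed format).
    if period is not None:
        return f"[{period}] {body}"
    return body


example = build_input("世宗命集賢殿", [(0, 2)], period="Sejong")
```

Here `example` masks the two-character entity at the start of the sentence and prefixes the period label; passing `mask_entities=False` or `period=None` switches either condition off, so the conditions can be compared independently.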
Our main contributions are as follows:
• We release the HUE dataset and Hanja PLMs
to help historians understand and analyze
a large volume of historical documents
written in Hanja. To the best of our knowledge,
this is the first work to propose Hanja
language models and release an NLP benchmark
dataset for ancient Hanja documents.
• We demonstrate that providing key information
such as named entities and document age
as input improves the performance of Hanja
language models on the HUE tasks.
• We run zero-shot experiments on several HUE
tasks from DRRI which have not been discussed
in the NLP community, and demonstrate
the performance of our Hanja language
models on unseen historical documents.
2 Background
2.1 Hanja
Hanja, the writing system based on ancient Chinese
characters, was the main writing system in Korea
before the 20th century, while Hangul, the unique
Korean alphabet, has been the main writing system
in Korea since the last century. Formal records
from the Joseon dynasty (1392-1897) are written
in Hanja, while the spoken language and some written
documents were in Hangul, developed in the 15th
century. This co-existence of the written and colloquial
languages has led Hanja to evolve so that it keeps the
basic syntax of classical Chinese while mixing in the
lexical, semantic, and syntactic characteristics of
colloquial Korean.
Hanja is significantly different from both modern
Korean and modern Chinese. Modern Korean
uses a different alphabet and structure. Traditional
Chinese shares some characters with Hanja,
but the lexicon has evolved greatly to reflect the
temporal, geographical, and cultural differences between
the Joseon dynasty and modern-day China.
Simplified Chinese, the current written language in
China, has diverged even further because of the
simplification of the characters. These differences between
Hanja and other related languages would lead to
suboptimal performance when applying current
Chinese language models to NLU tasks on
Korean historical Hanja documents.
2.2 Dataset
We describe the three corpora of records written in
Hanja during the Joseon dynasty, whose contents
and additional information such as topics and named
entities are provided by historians at ITKC³. Table 2
shows the list of the Hanja corpora used.
Annals of the Joseon Dynasty (AJD), also
called the Veritable Records of the Joseon Dynasty,
is a corpus of 27 sets of chronological records, where
each set covers one ruler’s reign. AJD was
translated into Korean between 1968 and 1993 and
includes relevant tags such as the named entities and
dates of the documents⁴. We use AJD both for
training our Hanja language models and for building
the HUE dataset of NLP tasks.
Diaries of the Royal Secretariat (DRS) is a corpus
of detailed records of daily events and official
schedules of the court from the first King Taejo to
the last (27th) Sunjong. Many of the earlier records
were lost, and we use the extant records starting
from the 16th King Injo. DRS is known to hold
the largest amount of authentic historic records and
state secrets of the Joseon Dynasty⁵. We use DRS
to continue pretraining the language models.
Daily Records of the Royal Court and Important
Officials (DRRI) is a corpus of journals
written from the 21st King Yeongjo to the last Emperor
Sunjong, presumably originating from the
diaries of the crown prince who was later enthroned
as the 22nd King Jeongjo. DRRI has
official daily records from both the central and the
local governments, so it encompasses all events in
the country and reports to the king with summaries.
DRRI is known to include details and events of
3 https://db.itkc.or.kr/
4 http://esillok.history.go.kr/
5 http://english.cha.go.kr/