

HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea

Haneul Yoo1, Jiho Jin1, Juhee Son1, JinYeong Bak2, Kyunghyun Cho3, Alice Oh1

1KAIST, South Korea, 2Sungkyunkwan University, South Korea, 3New York University, USA

{haneul.yoo, jinjh0123, sjh5665}@kaist.ac.kr,

jy.bak@skku.edu,kyunghyun.cho@nyu.edu,alice.oh@kaist.edu

Abstract

Historical records in Korea before the 20th century were primarily written in Hanja, an extinct language based on Chinese characters and not understood by modern Korean or Chinese speakers. Historians with expertise in this time period have been analyzing the documents, but that process is very difficult and time-consuming, and language models would significantly speed up the process. Toward building and evaluating language models for Hanja, we release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks. We also present BERT-based models obtained by continued pretraining on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and Diaries of the Royal Secretariats.¹ We compare the models with several baselines on all tasks and show there are significant improvements gained by training on the two corpora. Additionally, we run zero-shot experiments on the Daily Records of the Royal Court and Important Officials (DRRI). The DRRI dataset has not been studied much by historians, and not at all by the NLP community.

1 Introduction

Large-scale historical records in Korea were mostly produced during the Joseon dynasty (1392-1897), and the Institute for the Translation of Korean Classics (ITKC) keeps a comprehensive database of Korean classics at a scale of approximately 9 billion characters. This digital archive is a great resource for Korean historians, but the documents remain in the original Hanja language.² Hanja is an extinct language, and as Table 1 illustrates with a simple sentence, Hanja is lexically and syntactically different from modern Korean, as well as simplified and traditional Chinese. Understanding the documents in the digital archive is thus difficult and would benefit greatly from a Hanja language model, which can also be used to accelerate expert translation (Vale de Gato, 2015). There are two corpora we can use to train the language model: the Annals of the Joseon Dynasty (AJD), first introduced to the NLP community by Bak and Oh (2015), and the Diaries of the Royal Secretariats (DRS) (Kang et al., 2021).

Language             Sentence
Hanja                上御經筵。
Modern Korean        임금  경연 갔 .
Simplified Chinese   国王参加了皇家讲座。
Traditional Chinese  國王參加了皇家講座。
English              The King attended the Royal Lecture.

Table 1: Example sentence in AJD

1 All code, models, and datasets are available at https://github.com/haneul-yoo/HUE.git

2 Hanja is a set of characters (script) used in ancient Korean, while Hanmun is a writing style (language) in the same era. However, we will refer to Hanja as a language following the conventions of previous works.

In this paper, we provide the HUE (Hanja Understanding Evaluation) dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval, a suite of tasks to help build and evaluate the Hanja language model. In addition to AJD and DRS, we also work with the Daily Records of the Royal Court and Important Officials (DRRI). Unlike AJD and DRS, which have been analyzed by historians and contain their annotations, DRRI lacks such systematic analysis, and we use it for zero-shot learning and introduce it to the NLP community.

We also provide pretrained language models (PLMs) for Hanja trained on AJD and DRS, fine-tuned for each task in HUE. Our models pretrained on the corpora from that era outperform the existing language models built for ancient Chinese, confirming the need for specially trained Hanja language models. We also run additional experiments based on the analyses of entity- and word-level changes on AJD, controlling input conditions by masking named entities and giving the time period as input. Finally, we demonstrate the effectiveness of our Hanja language model for analyzing unseen documents, running zero-shot experiments for chronological attribution and named entity recognition tasks on DRRI.

arXiv:2210.05112v1 [cs.CL] 11 Oct 2022

Our main contributions are as follows:

• We release the HUE dataset and Hanja PLMs to help historians understand and analyze a large volume of historical documents written in Hanja. To the best of our knowledge, this is the first work proposing Hanja language models and releasing an NLP benchmark dataset for ancient Hanja documents.

• We demonstrate that providing key information such as named entities and document age as input improves the performance of the Hanja language model on the HUE tasks.

• We run zero-shot experiments on several HUE tasks from DRRI, which has not been discussed in the NLP community, and demonstrate the performance of our Hanja language models on unseen historical documents.

2 Background

2.1 Hanja

Hanja, the writing system based on ancient Chinese characters, was the main writing system in Korea before the 20th century, while Hangul, the unique Korean alphabet, has been the main writing system in Korea since the last century. Formal records from the Joseon dynasty (1392-1897) are written in Hanja, while spoken language and some written documents were in Hangul, developed in the 15th century. This co-existence of the written and colloquial languages has led Hanja to evolve to have the basic syntax of classical Chinese, mixed with the lexical, semantic, and syntactic characteristics of colloquial Korean.

Hanja is significantly different from both modern Korean and modern Chinese. Modern Korean uses a different alphabet and structure. Traditional Chinese shares some characters with Hanja, but its lexicon has evolved greatly to reflect the temporal, geographical, and cultural differences between the Joseon dynasty and modern-day China. Simplified Chinese, the current written language in China, has diverged more because of the simplification of the characters. These differences between Hanja and other related languages would lead to suboptimal performance when adopting current Chinese language models for NLU tasks on Korean historical Hanja documents.

2.2 Dataset

We describe the three corpora of records written in Hanja during the Joseon dynasty, whose contents and additional information such as topics and named entities are provided by historians at ITKC.³ Table 2 shows the list of the Hanja corpora used.

Annals of the Joseon Dynasty (AJD), also called Veritable Records of the Joseon Dynasty, is a corpus of 27 sets of chronological records, and each set covers one ruler's reign. AJD was translated into Korean from 1968 to 1993 and includes relevant tags such as the named entities and dates of the documents.⁴ We use AJD for both training our Hanja language models and building the HUE dataset of NLP tasks.

Diaries of the Royal Secretariat (DRS) is a corpus of detailed records of daily events and official schedules of the court from the first King Taejo to the last (27th) Sunjong. Many of the earlier records were lost, and we use the extant records starting from the 16th King Injo. DRS is known to hold the largest amount of authentic historic records and state secrets of the Joseon Dynasty.⁵ We use DRS to continue pretraining the language models.

Daily Records of the Royal Court and Important Officials (DRRI) is a corpus of journals written from the 21st King Yeongjo to the last Emperor Sunjong and presumably initiated from the diaries of the crown prince who was later enthroned as the 22nd King Jeongjo. DRRI has official daily records from both the central and the local governments, so it encompasses all events in the country and reports to the king with summaries. DRRI is known to include details and events of the late Joseon Dynasty not recorded in the AJD or DRS, thus making it a good corpus for zero-shot experiments. We use DRRI for the supervised summary retrieval task and the zero-shot experiments for chronological attribution and named entity recognition.

Dataset  Size    Training data  Downstream tasks  Zero-shot  King
AJD      230K    ✓              CA, TC, NER       -          Taejo (1st) - Sunjong (27th)
DRS      1,380K  ✓              -                 -          Injo (16th) - Sunjong (27th)
DRRI     426K    -              SR                CA, NER    Yeongjo (21st) - Sunjong (27th)

Table 2: Source corpora chosen for building HUE dataset and PLMs

3 https://db.itkc.or.kr/

4 http://esillok.history.go.kr/

5 http://english.cha.go.kr/

3 HUE Dataset

The HUE (Hanja Understanding Evaluation) dataset is built to assist history scholars in understanding Korean historical records written in Hanja. HUE consists of chronological attribution, topic classification, named entity recognition, and summary retrieval, which are tasks that can provide helpful information for studying the documents. We expect that language models based on HUE will ultimately help historians interpret unseen historical documents and help the public grasp the basic concepts of those documents. We describe each task in detail below.

3.1 Task Description

Chronological Attribution (CA) is a classification task predicting the ruling king when the document was written. Given a Hanja document from AJD, a classifier outputs one of the 27 kings of the Joseon dynasty. Chronological attribution of an undiscovered document is the first step in anthology to interpret and translate it. Korean historians mostly divide the history of the Joseon Dynasty based on the reigning king, so we treat chronological attribution as a classification task.

Topic Classification (TC) is a multi-class and multi-label classification task to find the topics of the given document. For TC, we use Hanja documents from AJD. We suggest two levels of topics, namely major and minor categories. The major categories consist of 4 broad topics: politics, economy, society, and culture. The minor categories comprise 106 sub-topics such as diplomacy, agriculture, and science.
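Because a document can carry several topics at once, a multi-label target is commonly encoded as a 0/1 vector over the label inventory. The following is a minimal sketch of that encoding, not the authors' implementation; the function name `multi_hot` and the list `MAJOR_TOPICS` are illustrative assumptions, and only the four major category names come from the text.

```python
from typing import Iterable, List

# The four major categories named in the text; the ordering is an assumption.
MAJOR_TOPICS = ["politics", "economy", "society", "culture"]


def multi_hot(labels: Iterable[str], inventory: List[str]) -> List[int]:
    """Encode a set of topic labels as a 0/1 vector over the inventory."""
    index = {name: i for i, name in enumerate(inventory)}
    vector = [0] * len(inventory)
    for label in labels:
        # An unknown label raises KeyError, which surfaces annotation errors.
        vector[index[label]] = 1
    return vector


# A hypothetical document tagged with two major topics.
target = multi_hot(["politics", "culture"], MAJOR_TOPICS)
```

The same function would apply to the 106 minor categories by passing a longer inventory; a classifier would then typically be trained with an independent sigmoid output per position.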

Named Entity Recognition (NER) is a sequence tagging task, identifying two types of named entities, person and location, in Hanja documents from AJD. We divide the train, validation, and test sets such that there are no overlapping entities across the sets.
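The text does not say how the entity-disjoint split was produced. One possible approach, sketched here purely as an assumption, is to group documents that share any entity into connected components (via union-find) and then assign whole components to a split. The function name `entity_disjoint_split`, the 80/10/10 ratios, and the input layout (document id mapped to its set of entity strings) are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def entity_disjoint_split(
    doc_entities: Dict[str, Set[str]],
    ratios: Sequence[float] = (0.8, 0.1, 0.1),  # assumed, not from the paper
    seed: int = 0,
) -> List[List[str]]:
    """Split document ids so that no entity occurs in more than one split."""
    parent = {doc: doc for doc in doc_entities}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge every document with the first document seen for each entity.
    first_doc: Dict[str, str] = {}
    for doc, entities in doc_entities.items():
        for entity in entities:
            if entity in first_doc:
                parent[find(doc)] = find(first_doc[entity])
            else:
                first_doc[entity] = doc

    # Collect connected components of documents.
    groups = defaultdict(list)
    for doc in doc_entities:
        groups[find(doc)].append(doc)
    components = list(groups.values())
    random.Random(seed).shuffle(components)

    # Greedily place each component in the split furthest below its target.
    targets = [r * len(doc_entities) for r in ratios]
    splits: List[List[str]] = [[] for _ in ratios]
    for component in components:
        i = max(range(len(ratios)), key=lambda k: targets[k] - len(splits[k]))
        splits[i].extend(component)
    return splits
```

One caveat of this sketch: if a few very frequent entities link most documents together, the components become too large to balance, and a split defined over entity mentions rather than whole documents may be needed instead.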

Summary Retrieval (SR) is a task to find the summary that best matches the content among the summary candidates. For this task, we use DRRI, in which each document is a pair of a summary (gang) and detailed content (mok). Among 426K articles in the DRRI dataset, 265K contain both gang and mok. We also exclude those with gang longer than mok, in which gang is not a summary of mok. The final dataset contains 213K pairs of content and the corresponding summaries. We describe more details of the preprocessing in the Appendix.
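The two filtering rules above can be sketched as follows. This is an illustrative reading of the description, not the released preprocessing code: the function name `filter_gang_mok` is hypothetical, and measuring length in characters is an assumption, since the text does not state the unit.

```python
from typing import Iterable, List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]  # (gang, mok); either may be missing


def filter_gang_mok(articles: Iterable[Pair]) -> List[Tuple[str, str]]:
    """Keep articles that have both parts and whose gang is not longer than mok."""
    kept = []
    for gang, mok in articles:
        if not gang or not mok:
            continue  # rule 1: both the summary and the content must exist
        if len(gang) > len(mok):
            continue  # rule 2: a summary longer than its content is dropped
        kept.append((gang, mok))
    return kept


# Placeholder strings stand in for real DRRI articles.
sample = [
    ("short", "a much longer body of text"),  # kept
    ("summary only", None),                   # dropped: no mok
    ("longer than body", "tiny"),             # dropped: gang longer than mok
]
pairs = filter_gang_mok(sample)
```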

4 Hanja Pretrained Model

As far as we know, there have been no pretrained language models for the Hanja language. One can use related LMs: the pretrained models for ancient Chinese, as well as multilingual BERT (Devlin et al., 2019), which includes traditional Chinese in its training corpus. AnchiBERT (Tian et al., 2021) is pretrained on ancient Chinese with Chinese anthologies written around 1000 BC to 200 BC. Because there is some vocabulary overlap between the Hanja documents and traditional Chinese corpora, we can adopt multilingual BERT and AnchiBERT to learn the representations of the Hanja texts.

We propose pretrained language models suitable for Hanja documents by continuing to pretrain those two models on both AJD and DRS. Table 3 shows the ratio of unknown tokens in the test set of AJD for each model. It implies that the existing AnchiBERT and multilingual BERT can also be exploited as language models for Hanja documents written in the Joseon dynasty, but the second phase of pretraining on the corpora of that era remarkably decreases the ratio of unknown tokens.
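As a rough illustration of what an unknown-token ratio measures, the sketch below counts how many tokens in a set of texts fall outside a vocabulary. It assumes character-level tokenization with whitespace ignored, which approximates how BERT-style tokenizers treat CJK text; the function name `unk_ratio` and the toy vocabulary are hypothetical, and the exact procedure behind Table 3 may differ.

```python
from typing import Iterable, Set


def unk_ratio(texts: Iterable[str], vocab: Set[str]) -> float:
    """Fraction of non-whitespace characters that are missing from vocab."""
    total = 0
    unknown = 0
    for text in texts:
        for ch in text:
            if ch.isspace():
                continue
            total += 1
            if ch not in vocab:
                unknown += 1
    return unknown / total if total else 0.0


# Toy vocabulary covering three of the five characters in the Table 1 sentence.
toy_vocab = set("上御經")
ratio = unk_ratio(["上御經筵。"], toy_vocab)
```

With a real model, the vocabulary would come from the tokenizer's vocabulary file, and the ratio would be computed over the AJD test set rather than a single sentence.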
