DARE: A large-scale handwritten date recognition system

Christian M. Dahl*, Torben S. D. Johansen*, Emil N. Sørensen**,

Christian E. Westermann*, and Simon F. Wittrock*

*Department of Economics, University of Southern Denmark

**School of Economics, University of Bristol

October 4, 2022

Abstract

Handwritten text recognition for historical documents is an important task, but it remains difficult due to a lack of sufficient training data in combination with a large variability of writing styles and degradation of historical documents. While recurrent neural network architectures are commonly used for handwritten text recognition, they are often computationally expensive to train, and the benefit of recurrence differs drastically by task. For these reasons, it is important to consider non-recurrent architectures. In the context of handwritten date recognition, we propose an architecture based on the EfficientNetV2 class of models that is fast to train, robust to parameter choices, and accurately transcribes handwritten dates from a number of sources. For training, we introduce a database containing almost 10 million tokens, originating from more than 2.2 million handwritten dates which are segmented from different historical documents. As dates are some of the most common information on historical documents, and with historical archives containing millions of such documents, the efficient and automatic transcription of dates has the potential to lead to significant cost savings over manual transcription. We show that training on handwritten text with high variability in writing styles results in robust models for general handwritten text recognition and that transfer learning from the DARE system increases transcription accuracy substantially, allowing one to obtain high accuracy even when using a relatively small training sample.

∗Acknowledgements: Torben gratefully acknowledges financial support from the Independent Research Fund Denmark, grant 8106-00003B. Emil gratefully acknowledges financial support from the European Research Council (Starting Grant Reference 851725). The DARE database is available at https://www.kaggle.com/datasets/sdusimonwittrock/dare-database. Our code is available upon request.

arXiv:2210.00503v1 [cs.CV] 2 Oct 2022

Keywords

Handwritten date recognition; EfficientNetV2; Multi-head classification; Transfer learning; Date database.

1 Introduction

Handwritten text recognition (HTR) is an important step towards the transcription and preservation of historical archives, but it remains a challenging task. One obstacle is that deep learning models often require large-scale annotated datasets to perform well, which is a particular concern for HTR on historical documents because of the high variation in writing styles and because historical documents often suffer from bleed-through, fading of text, and general degradation. One step towards solving this is to introduce more and larger real-world datasets that can increase the robustness of HTR models. Even though dates are some of the most important and frequently represented information on historical documents, little research has been invested into the specific recognition of handwritten dates from historical documents. To solve the challenge of handwritten date transcription from historical documents, we introduce the DARE database and system, which together facilitate the transcription of dates even from difficult-to-read historical documents.¹ The DARE database is the largest available database of handwritten dates and consists of segmented dates from a number of different historical documents, meaning that the database contains many types of documents from different time periods, written by a large number of authors and suffering from varying amounts of degradation. This results in significant variation between the handwritten dates (see, for example, Figure 1), which enables us to train robust models for transcription, which in turn perform well for transfer learning. In total, the DARE database is derived from 3,145,922 images, originating from six different data sources. While a large number of the segmented images are empty, we still obtain more than 2.2 million manually transcribed dates, totalling almost 10 million individual tokens.²

¹ While general-purpose out-of-the-box HTR such as Transkribus has improved significantly in recent years, transcribing dates from historical documents proves challenging due to issues including document degradation, other handwriting overlapping dates in segmented images, and variation in handwriting styles.

We use this large database of handwritten dates to train neural networks to transcribe dates: the DARE system. These networks achieve date transcription accuracies of between 92% and 99%, with the exception of one particularly difficult dataset.³ To put this performance into context, we compare the performance on one of these datasets to previous work and demonstrate significant improvements over manual labelling. While it is reassuring that the DARE system achieves high levels of transcription accuracy on the test sets, our ultimate objective is to create a system that improves automated transcription of dates from new data sources. To illustrate the usefulness of the DARE system for such tasks, we first use the DARE system for transfer learning, showing that it significantly improves transcription performance over networks trained from scratch as well as networks transfer learned on other data sources. Using the DARE system, we are able to achieve high levels of transcription accuracy even when using only a small training sample.

² We still include the empty images when training our models, as we want models that are robust in the sense of being able to indicate when a date is not present on an image.

³ This is primarily due to errors in the manual labels and the segmentation. However, we still believe that this dataset is helpful for improving the overall performance of our neural networks.

Second, we use Danish census data from 1916 to further illustrate the usefulness of the DARE system for transfer learning and for zero-shot transcription. A subset (less than 5%) of the 3.7 million entries recorded in the 1916 census, for which we have image files, is labelled manually. However, the labels come with no link back to the source images, so the labels cannot directly be used as training data. We show that by transcribing images in a zero-shot fashion using the DARE system and subsequently linking the images to the labelled data, we can gradually create a larger training set to which we can eventually apply transfer learning and obtain a more accurate model. This procedure is repeated until a sufficient degree of accuracy is obtained.
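To make the structure of this iterative procedure concrete, the following is a minimal Python sketch, not the implementation used for the census. It rests on several assumptions that are not taken from the paper: images and labels are represented by hypothetical string identifiers, a model is any function mapping an image identifier to a date string, linking is done by exact and unique date matches only, and the loop stops once a chosen share of labels is linked (a stand-in for the accuracy criterion mentioned above). The names `link_by_unique_match`, `iterative_linking`, `fine_tune`, and `target_share` are all illustrative.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Assumption: a model is any callable mapping an image id to a date string.
Model = Callable[[str], str]


def link_by_unique_match(predictions: Dict[str, str],
                         labels: Dict[str, str]) -> Dict[str, str]:
    """Link image ids to label ids where the date match is unambiguous.

    A link is made only if exactly one image is predicted to show a date
    and exactly one label carries that same date (illustrative rule).
    """
    images_by_date = defaultdict(list)
    for image_id, date in predictions.items():
        images_by_date[date].append(image_id)

    labels_by_date = defaultdict(list)
    for label_id, date in labels.items():
        labels_by_date[date].append(label_id)

    links = {}
    for date, image_ids in images_by_date.items():
        label_ids = labels_by_date.get(date, [])
        if len(image_ids) == 1 and len(label_ids) == 1:
            links[image_ids[0]] = label_ids[0]
    return links


def iterative_linking(images: List[str],
                      labels: Dict[str, str],
                      model: Model,
                      fine_tune: Callable[[Model, Dict[str, str]], Model],
                      max_rounds: int = 5,
                      target_share: float = 0.95) -> Tuple[Dict[str, str], Model]:
    """Alternate between transcribing, linking, and fine-tuning.

    `fine_tune` is a hypothetical hook that takes the current model and a
    mapping from image id to labelled date, and returns an updated model.
    """
    linked: Dict[str, str] = {}  # image id -> label id
    if not labels:
        return linked, model

    for _ in range(max_rounds):
        used_labels = set(linked.values())
        remaining_images = [i for i in images if i not in linked]
        remaining_labels = {k: v for k, v in labels.items()
                            if k not in used_labels}

        # Transcribe the still-unlinked images with the current model.
        predictions = {i: model(i) for i in remaining_images}
        new_links = link_by_unique_match(predictions, remaining_labels)
        if not new_links:
            break  # no progress, so further rounds would not help
        linked.update(new_links)

        if len(linked) / len(labels) >= target_share:
            break

        # Use all linked pairs as training data for the next round.
        training_pairs = {img: labels[lab] for img, lab in linked.items()}
        model = fine_tune(model, training_pairs)

    return linked, model
```

In this sketch the first round corresponds to zero-shot transcription, and each later round uses a model fine-tuned on the pairs linked so far. Requiring a unique match on both sides is a deliberately conservative choice that trades coverage for fewer false links; a real pipeline would likely also use other fields of the record when linking.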

This procedure allows us to match nearly all of the labelled entries to the source images, resulting in a large training sample that makes it possible to transcribe the census with a date recognition accuracy of around 95%. The resulting database, comprising more than 3.7 million rows, contains unique information on name, birth date, residence, civil status, income, and wealth for every single individual living in Denmark (excluding Copenhagen residents) in 1916.

The rest of the paper proceeds as follows: We start by surveying some of the most relevant literature within HTR. Next, in Section 3 we describe the database, network architecture, and technical details of the training pipeline. Section 4 shows the within-distribution performance of the DARE system by studying the performance of the system on the test sets of the DARE database. Section 5 describes the pipeline for transfer learning and presents the associated results. Section 6 shows the process of linking images to transcriptions. Section 7 concludes.
