DARE: A large-scale handwritten date recognition
system
Christian M. Dahl*, Torben S. D. Johansen*, Emil N. Sørensen**,
Christian E. Westermann*, and Simon F. Wittrock*
*Department of Economics, University of Southern Denmark
**School of Economics, University of Bristol
October 4, 2022
Abstract
Handwritten text recognition for historical documents is an important task
but it remains difficult due to a lack of sufficient training data in combination
with a large variability of writing styles and degradation of historical
documents. While recurrent neural network architectures are commonly used
for handwritten text recognition, they are often computationally expensive to
train and the benefit of recurrence drastically differs by task. For these reasons,
it is important to consider non-recurrent architectures. In the context
of handwritten date recognition, we propose an architecture based on the
EfficientNetV2 class of models that is fast to train, robust to parameter choices,
and accurately transcribes handwritten dates from a number of sources. For
training, we introduce a database containing almost 10 million tokens, originating
from more than 2.2 million handwritten dates which are segmented
from different historical documents. As dates are some of the most common
information on historical documents, and with historical archives containing
millions of such documents, the efficient and automatic transcription of dates
has the potential to lead to significant cost savings over manual transcription.
We show that training on handwritten text with high variability in writing
styles results in robust models for general handwritten text recognition and
that transfer learning from the DARE system increases transcription accuracy
substantially, allowing one to obtain high accuracy even when using a relatively
small training sample.

Acknowledgements: Torben gratefully acknowledges financial support from the Independent
Research Fund Denmark, grant 8106-00003B. Emil gratefully acknowledges financial support from the
European Research Council (Starting Grant Reference 851725). The DARE database is available
at https://www.kaggle.com/datasets/sdusimonwittrock/dare-database. Our code is
available upon request.

arXiv:2210.00503v1 [cs.CV] 2 Oct 2022
Keywords
Handwritten date recognition; EfficientNetV2; Multi-head classification; Transfer
learning; Date database.
1 Introduction
Handwritten text recognition (HTR) is an important step towards the transcription
and preservation of historical archives but it remains a challenging task. One obstacle
is that deep learning models often require large-scale annotated datasets to perform
well, which is a particular concern for HTR on historical documents because writing
styles vary widely and the documents often suffer from bleed-through, fading of text,
and general degradation. One step towards solving this is
to introduce more and larger real-world datasets that can increase the robustness
of HTR models. Even though dates are some of the most important and frequently
represented information on historical documents, little research has been invested
into the specific recognition of handwritten dates from historical documents. To
solve the challenge of handwritten date transcription from historical documents, we
introduce the DARE database and system, which together facilitate the transcription
of dates even from difficult-to-read historical documents.¹ The DARE database is
the largest available database of handwritten dates and consists of segmented dates
from a number of different historical documents, meaning that the database contains
many types of documents from different time periods written by a large number of
authors and suffering from varying amounts of degradation. This results in significant
variation between the handwritten dates (see, for example, Figure 1), which enables
us to train robust models for transcription, which in turn perform well for transfer
learning. In total, the DARE database is derived from 3,145,922 images, originating
from six different data sources. While a large number of the segmented images are
empty, we still obtain more than 2.2 million manually transcribed dates, totalling
almost 10 million individual tokens.²

¹ While general-purpose out-of-the-box HTR such as Transkribus has improved significantly in
recent years, transcribing dates from historical documents proves challenging due to issues including
document degradation, other handwriting overlapping dates in segmented images, and variation in
handwriting styles.

We use this large database of handwritten dates to train neural networks to
transcribe dates: the DARE system. These networks achieve date transcription
accuracies of between 92% and 99%, with the exception of one particularly difficult
dataset.³ To put this performance into context, we compare the performance on one
of these datasets to previous work and demonstrate significant improvements over
manual labelling. While it is reassuring that the DARE system achieves high
transcription accuracy on the test sets, our ultimate objective is to create a system
that improves automated transcription of dates from new data sources. To illustrate
the usefulness of the DARE system for such tasks, we first use the DARE system
for transfer learning, showing that it significantly improves transcription performance
over networks trained from scratch as well as networks transfer learned on other data
sources. Using the DARE system, we are able to achieve high levels of transcription
accuracy even when using only a small training sample.

² We still include the empty images when training our models, as we want models that are robust
in the sense of being able to indicate when a date is not present on an image.
³ This is primarily due to errors in the manual labels and the segmentation. However, we still
believe that this dataset is helpful to improve the overall performance of our neural networks.

Second, we use Danish census data from 1916 to further illustrate the usefulness
of the DARE system for transfer learning and for zero-shot transcription. A subset
(less than 5%) of the 3.7 million entries recorded in the 1916 census, for which we
have image files, is labelled manually. However, the labels come with no link back
to the source images, so the labels cannot directly be used as training data. We
show that by transcribing images in a zero-shot fashion using the DARE system and
subsequently linking the images to the labelled data, we can gradually create a larger
training set to which we can eventually apply transfer learning and obtain a more
accurate model. This procedure is repeated until a sufficient degree of accuracy is
obtained.
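
The iterative procedure described above (transcribe, link, transfer learn, repeat) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: the toy data, the helper names (`toy_transcribe`, `toy_fine_tune`, `link_unique`, `bootstrap_links`), the rule of linking on an exactly matching and unambiguous date only, and the stopping rule of halting when a round produces no new links. The paper instead repeats the procedure until a sufficient degree of accuracy is obtained.

```python
from collections import Counter
from typing import Callable, Dict, List

# Toy stand-ins (all hypothetical): the true date on each image and a
# difficulty level saying how strong the model must be to read it.
TRUTH: Dict[str, str] = {
    "img1": "1880-01-02",
    "img2": "1875-05-17",
    "img3": "1890-11-30",
    "img4": "1862-03-08",
}
DIFFICULTY: Dict[str, int] = {"img1": 0, "img2": 0, "img3": 1, "img4": 2}


def toy_transcribe(model_level: int, image_id: str) -> str:
    """Return the true date if the toy model is strong enough, else a misread."""
    if model_level >= DIFFICULTY[image_id]:
        return TRUTH[image_id]
    return "unreadable"


def toy_fine_tune(model_level: int, linked: Dict[str, str]) -> int:
    """Pretend that transfer learning on the linked pairs improves the model."""
    return model_level + 1 if linked else model_level


def link_unique(predictions: Dict[str, str], labels: List[str]) -> Dict[str, str]:
    """Link an image to a label only when the match is unambiguous on both sides."""
    label_counts = Counter(labels)
    prediction_counts = Counter(predictions.values())
    return {
        image_id: date
        for image_id, date in predictions.items()
        if label_counts[date] == 1 and prediction_counts[date] == 1
    }


def bootstrap_links(
    image_ids: List[str],
    labels: List[str],
    transcribe: Callable[[int, str], str],
    fine_tune: Callable[[int, Dict[str, str]], int],
    max_rounds: int = 10,
) -> Dict[str, str]:
    """Alternate between transcribing, linking, and fine-tuning."""
    linked: Dict[str, str] = {}
    model_level = 0  # level 0 plays the role of the zero-shot model
    for _ in range(max_rounds):
        remaining = [i for i in image_ids if i not in linked]
        # Labels that have not yet been assigned to an image.
        open_labels = list((Counter(labels) - Counter(linked.values())).elements())
        predictions = {i: transcribe(model_level, i) for i in remaining}
        new_links = link_unique(predictions, open_labels)
        if not new_links:
            break  # assumed stopping rule: no further progress
        linked.update(new_links)
        model_level = fine_tune(model_level, linked)
    return linked


linked = bootstrap_links(list(TRUTH), list(TRUTH.values()), toy_transcribe, toy_fine_tune)
print(f"{len(linked)} of {len(TRUTH)} images linked")
```

In this toy setting each round links the images that the current model reads correctly, and the enlarged set of linked pairs then stands in for the training sample used for transfer learning. A real pipeline would presumably link on more fields than the date alone and would need to handle transcription errors that happen to match a wrong label.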
This procedure allows us to match nearly all of the labelled entries to the source
images, resulting in a large training sample that makes it possible to transcribe the
census with a date recognition accuracy of around 95%. The resulting database,
comprising more than 3.7 million rows, contains unique information on name, birth
date, residence, civil status, income, and wealth for every single individual living in
Denmark (excluding Copenhagen residents) in 1916.
The rest of the paper proceeds as follows: We start by surveying some of the most
relevant literature within HTR. Next, in Section 3 we describe the database, network
architecture, and technical details of the training pipeline. Section 4 evaluates the
within-distribution performance of the DARE system on the test sets of the DARE
database. Section 5 describes the pipeline
for transfer learning and presents the associated results. Section 6 shows the process
of linking images to transcriptions. Section 7 concludes.