A two-stage approach for table extraction in invoices
Thomas Saout
Univ Angers, LERIA,
F-49000 Angers, France
thomas.saout@etud.univ-angers.fr
Frédéric Lardeux
Univ Angers, LERIA,
F-49000 Angers, France
frederic.lardeux@univ-angers.fr
Frédéric Saubion
Univ Angers, LERIA,
F-49000 Angers, France
frederic.saubion@univ-angers.fr
October 11, 2022
Abstract
The automated analysis of administrative documents is an important field in document recognition
that has been studied for decades. Among the huge volumes of documents handled by companies and
public services, invoices are key documents. Most invoices present their data in tables, which
must be clearly identified in order to extract the relevant information. In this paper, we propose
an approach that combines an image-processing-based estimation of the shape of the tables with a
graph-based representation of the document, which is used to identify complex tables precisely.
We evaluate the approach experimentally on a real case application.
Keywords: Data extraction, invoice, graph-based representation
1 Introduction
The automated analysis of administrative documents is an important field in document recognition that
has been studied for decades [1]. Among the huge volumes of documents available in companies
and public services, invoices are key documents. Their automated processing is a complex task [2]
and has led to commercial systems developed by companies such as ITESOFT or
ABBYY.
Invoices generally require complex administrative procedures, which involve different departments
(e.g., accounting, logistics, supply chain). The invoices have to be processed through
specific workflows [3]. From the document point of view, the full processing of invoices includes
their digitization using Optical Character Recognition (OCR) [4] and their processing to achieve
information extraction, which aims at finding identifiers and their types, amounts, and dates [5,6]. This
global process requires handling some specific characteristics of the considered documents [2]:
• handling the variability of layouts,
• training and quickly adapting to new contexts,
• minimizing the end-user task.
Context
Many solutions have been proposed to manage information from scanned invoices, and most of
them are based on machine learning techniques (e.g., classification). Research on this problem
is still active [5,7]. The first problem was certainly to identify invoices [8] and hence, models
have been proposed to ease their processing [9]. Once invoices have been correctly scanned and
identified, a crucial question remains: “how to extract relevant information from these invoices?”.
Labeling techniques can be applied using rules [10]. This named entity recognition (NER)
task has recently been addressed using neural networks [11,12]. Since invoices contain text sequences
arXiv:2210.04716v1 [cs.IR] 10 Oct 2022
that differ markedly from natural-language corpora, specific information extraction methods have
been proposed to take into account the specific structures in these documents. For instance, in [2],
a star graph is used to consider the neighborhood of a token (a token is an elementary, semantically
coherent element of the document). The specific structures of invoices lead us to consider the geographical
organization of the document, and graph-based models are thus relevant [13]. A closer examination of
invoices shows that most of them include tables as a main structural feature. Hence, table
detection within invoices appears as an important processing task [13]. Table processing is indeed an old
challenge [14,15] and, as noted in [16], it includes different tasks: detection, extraction, interpretation
and understanding. Here, we focus on detection (detecting the presence of a tabular structure in a
document) and extraction (providing the data in a detected table in a more readable format). These
challenges are still active [17]. Recent work [18] proposes an approach to detect the general frame of
a table and to extract its content. Targeting wide classes of documents, many recent works use
neural networks [19,20,21,22] to recognize table structures in documents by means of large training
sets. When focusing on more specific tables, their characteristics, such as headers, can also be
exploited to help with these tasks [23]. Rule-based systems, which were the seminal table extraction
techniques, may also be relevant [24].
Aim and Contribution
The general purpose is to automatically process invoices in order to obtain the important data
contained in these documents, particularly the information contained in tables. This work is
motivated by a real case application in a full document management system¹.
We focus on different types of information, such as locations, tables, dates and actors, i.e., the
organizations or people identified in the invoice. This list of fields was completed after analyzing
several invoice models and according to the current requests of the companies surveyed. From a
general point of view, some of the important information contained in an invoice (the list is not
exhaustive) can be sketched as follows:
• actors: the individual, company or companies involved in the invoice (customer or supplier).
• addresses: all the addresses contained in the document and, if available, their types (billing
address, delivery address or sender, for instance).
• dates: the set of dates specific to the invoice process, such as edition date, payment date...
• tables: tables often presenting the invoiced items, quantities, prices...
In this paper, we propose an integrated approach to extract information that is formatted into
tables in invoices. The purpose is to extract all of the data, and not only specific labels such as “total
price” or “product description”, in order to be able to process these data automatically, for instance
by loading them into a dedicated database that could be used for invoice information retrieval. In our
collected data set, the tables are not necessarily drawn using lines that precisely delimit information,
and they may include missing information.
Given an invoice, we consider that a table can be detected at two levels: either by visually
detecting some characteristic shapes (vertical and/or horizontal lines) or by detecting some structural
organization of the tokens of information, aggregated according to rows and columns that may intersect.
We refer to this second level as a semantic level since it uses an intrinsic model that tells us what a
table should be, even if its graphical frame is not necessarily fully present. Our contribution is twofold.
On the one hand, we propose a complete formalization of the document based on ordering relations
that help us to process a graph-based model of the structure of the document more precisely. On the
other hand, we combine visual analysis of the document with this more semantic structure to design
an efficient table extraction tool.
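To make the semantic level more concrete, the following is a minimal sketch of how token boxes could be aggregated into rows and columns by overlap along one axis. It is only an illustration under simplifying assumptions and is not the formalization based on ordering relations that the paper develops in later sections; the function names `overlaps` and `group_by_axis` and the `(x0, y0, x1, y1)` box layout are hypothetical.

```python
def overlaps(a0, a1, b0, b1):
    """True if the intervals [a0, a1] and [b0, b1] share more than a point."""
    return min(a1, b1) > max(a0, b0)

def group_by_axis(boxes, axis):
    """Group box indices whose extents overlap along one axis.

    boxes: list of (x0, y0, x1, y1) tuples (assumed layout).
    axis:  "y" groups boxes into rows, "x" groups them into columns.
    """
    lo, hi = (1, 3) if axis == "y" else (0, 2)
    groups = []
    # Sweep the boxes in increasing order of their lower coordinate.
    for i in sorted(range(len(boxes)), key=lambda i: boxes[i][lo]):
        box = boxes[i]
        if groups and overlaps(groups[-1]["lo"], groups[-1]["hi"], box[lo], box[hi]):
            groups[-1]["members"].append(i)
            groups[-1]["hi"] = max(groups[-1]["hi"], box[hi])
        else:
            groups.append({"lo": box[lo], "hi": box[hi], "members": [i]})
    return [g["members"] for g in groups]

# A 2 x 2 grid of token boxes with no ruling lines drawn between them.
boxes = [(0, 0, 10, 5), (20, 0, 30, 5), (0, 10, 10, 15), (20, 10, 30, 15)]
rows = group_by_axis(boxes, "y")
cols = group_by_axis(boxes, "x")
```

In this toy layout each token belongs to exactly one row and one column, so the intersections of rows and columns give the cells of a candidate table even though no graphical frame is present.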
Organization
Section 2 introduces the domain of data recognition and our modeling of invoices. Section 3
presents the ordering relations used in our graph-based approach. Section 4 introduces the notion of
pattern. Section 5 presents our new approach to extract tables and Section 6 shows its efficiency with
experiments.
¹ This work is conducted in association with the KaliConseil company, which is developing its own
invoice processing system.
2 Data recognition
2.1 Optical Character recognition
As mentioned in the introduction, the first task is to recognize the text in a scanned document by
means of an optical character recognition (OCR) system. Here, we use Flexicapture [25]². A full
description of OCR techniques is beyond the scope of this paper; nevertheless, we point out some
limitations of this kind of tool.
• Punctuation recognition can be complex for some documents. The tool can be configured to
automatically change the format of dates; for example, 01/01/1992 will be transformed into
01.01.1992.
• Certain digits are difficult to recognize and are easily confused with letters. This problem
may depend on the font used in the document. For instance, “0oOo” can be confusing for an
OCR system.
• The quality of the digitization is an important parameter. There is indeed a strong correlation
between the quality of the scanned document and the errors made by the OCR system.
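The first two issues above can be partly mitigated by post-processing the OCR output. The sketch below is a hypothetical normalization step, not part of Flexicapture or of the pipeline described in the paper: it maps the date separators to one form and replaces commonly confused letters by digits in fields that are known to be numeric. The confusion table and the function names are assumptions chosen for illustration.

```python
import re

# Assumed confusion table: letters that an OCR engine may output instead of digits.
CONFUSABLE = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})

# Dates written as dd/mm/yyyy, dd.mm.yyyy or dd-mm-yyyy.
DATE_RE = re.compile(r"\b(\d{2})[./-](\d{2})[./-](\d{4})\b")

def normalize_date(text: str) -> str:
    """Rewrite every recognized date with '/' as the separator."""
    return DATE_RE.sub(r"\1/\2/\3", text)

def repair_numeric(field: str) -> str:
    """Replace confusable letters by digits.

    Only meant for fields already known to be numeric (quantities, amounts);
    applying it to free text would corrupt ordinary words.
    """
    return field.translate(CONFUSABLE)

print(normalize_date("Invoice date: 01.01.1992"))
print(repair_numeric("1O0o"))
```

Restricting `repair_numeric` to numeric fields is a deliberate design choice: the same character can be a legitimate letter elsewhere in the document.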
This step allows us to generate a searchable PDF that includes the images obtained during
digitization and a non-visible text layer. This text layer includes the characters and their geographical
position in the document. Note that searchable PDF documents are documents whose text elements can
be selected.
2.2 Tokenization
A token corresponds to a useful semantic unit according to a given target language. The tokenization
process is often considered as the pre-processing step of any natural language processing (NLP) system [26].
In our work, we use word-level tokenization. Once the OCR processing has been performed, an API
called PDFBox is used to extract the text layer from the PDF document. We also use this API to tokenize
the extracted text, separating words with a maximal distance of two character sizes in the smallest
font.
Once the text has been extracted and the positions of the characters have been retrieved, we
generate word level tokens by grouping the characters. The tokens include the characters and the
coordinates of a box embedding all the characters.
Figure 1: Example of obtained tokens
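The grouping of positioned characters into word-level tokens can be sketched as follows. This is an illustrative sketch only: it does not call PDFBox, and the `Box` class, the `group_chars` function and the value passed as `max_gap` are assumptions. In the paper the separation threshold is tied to the character size of the smallest font; here it is simply passed as a parameter.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """A piece of text with its bounding box, from (x0, y0) to (x1, y1)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

def group_chars(chars, max_gap):
    """Merge characters, given in reading order, into word-level tokens.

    A character joins the current token when it lies on the same line
    (vertical overlap) and the horizontal gap to the token is at most max_gap;
    otherwise it starts a new token.
    """
    tokens = []
    for c in chars:
        if c.text.isspace():
            continue  # whitespace only contributes a gap between tokens
        last = tokens[-1] if tokens else None
        same_line = last is not None and min(last.y1, c.y1) > max(last.y0, c.y0)
        if same_line and last.x0 <= c.x0 and c.x0 - last.x1 <= max_gap:
            # Extend the token text and grow its box to embed the new character.
            last.text += c.text
            last.x1 = max(last.x1, c.x1)
            last.y0 = min(last.y0, c.y0)
            last.y1 = max(last.y1, c.y1)
        else:
            tokens.append(Box(c.text, c.x0, c.y0, c.x1, c.y1))
    return tokens

chars = [Box("A", 0, 0, 5, 10), Box("B", 5.5, 0, 10, 10),
         Box("C", 20, 0, 25, 10), Box("D", 0, 12, 5, 22)]
tokens = group_chars(chars, max_gap=2.0)
```

Each resulting token carries both its characters and the coordinates of the box embedding them, which is the information later stages need to reason about the layout.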
2.3 Entity Recognition
First presented in [27], named entity recognition (NER) consists in labeling a text so that each
string is associated with a person, a location, an organization, a temporality, an amount or a percentage...
Later, NER evolved to consider more or fewer labels [28,29,30,31]. In our context, we use regular
expressions (regex) to perform simple rule-based NER, which is sufficient. Our NER is restricted to a
few labels due to our restricted context, but more open domains require more labels [30]. In our
case, the recognition of some tags, such as headers, can easily be improved by using a keyword library.
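A rule-based labeler of this kind might look like the sketch below. The label names, the regular expressions and the header keyword set are all hypothetical: the paper does not list its actual rules, so this only illustrates the general technique of regex-based NER combined with a keyword library.

```python
import re

# Assumed patterns, tried in order; each one must match the whole token.
PATTERNS = [
    ("DATE", re.compile(r"^\d{2}[./-]\d{2}[./-]\d{4}$")),
    ("PERCENT", re.compile(r"^\d+(?:[.,]\d+)?\s?%$")),
    ("AMOUNT", re.compile(r"^\d{1,3}(?:[ .]?\d{3})*[.,]\d{2}$")),
]

# Assumed keyword library for table headers.
HEADER_KEYWORDS = {"description", "quantity", "qty", "unit price", "total"}

def label(token_text: str) -> str:
    """Return the label of a token, or "OTHER" when no rule applies."""
    text = token_text.strip()
    if text.lower() in HEADER_KEYWORDS:
        return "HEADER"
    for name, pattern in PATTERNS:
        if pattern.match(text):
            return name
    return "OTHER"

for sample in ["01/01/1992", "20%", "1 250,00", "Quantity", "Angers"]:
    print(sample, "->", label(sample))
```

Extending the keyword set is enough to recognize more header variants, which is one way to read the remark about improving header recognition with a keyword library.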
² The choice of Flexicapture is motivated by our practical industrial tool.