
that are mostly different from natural language corpora, specific information extraction methods have
been proposed to take into account the specific structures in these documents. For instance, in [2],
a star graph is used to capture the neighborhood of a token (a token is an elementary, semantically
coherent element of the document). The specific structure of invoices makes it necessary to consider the
spatial organization of the document, and graph-based models are thus relevant [13]. A closer look at
invoices shows that most of them include tables as a main structural feature. Hence, table
detection within invoices appears to be an important processing task [13]. Table processing is indeed an old
challenge [14,15] and, as noted in [16], it includes different tasks: detection, extraction, interpretation
and understanding. Here, we focus on detection (detecting the presence of a tabular structure in a
document) and extraction (providing the data of a detected table in a more readable format). These
challenges are still active [17]. Recent work [18] proposes an approach to detect the general frame of
a table and to extract its content. Targeting wide classes of documents, many recent works use
neural networks [19,20,21,22] to recognize table structures in documents by means of large training
sets. For more specific tables, characteristic features such as headers can also be exploited to help
with these tasks [23]. Rule-based systems, which were the seminal table extraction techniques, may also be
relevant [24].
Aim and Contribution
The general purpose is to automatically process invoices in order to extract the important data
contained in these documents, particularly the information contained in tables. This work is motivated by a
real-world application within a full document management system¹.
We focus on different types of information, such as locations, tables, dates and actors, i.e., the
organizations or people identified in the invoice. This list of fields was established after analyzing
several invoice models and according to the current needs of the companies surveyed. From a
general point of view, the important information contained in an invoice includes, but is not limited to,
the following:
• actors: the individuals or companies involved in the invoice (customer or supplier).
• addresses: all the addresses contained in the document and, if available, their types (billing
address, delivery address or sender address, for instance).
• dates: the set of dates specific to the invoice process, such as the issue date, the payment date, etc.
• tables: the tables, which often present the invoiced items, quantities, prices, etc.
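As an illustration only, the fields listed above could be gathered in a simple record. The following Python sketch is not taken from the system described in this paper: all class and attribute names (`Actor`, `Address`, `InvoiceDate`, `Table`, `InvoiceRecord`) are hypothetical, and the representation of a missing cell by `None` is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Actor:
    name: str
    role: Optional[str] = None  # e.g. "customer" or "supplier", if known


@dataclass
class Address:
    text: str
    kind: Optional[str] = None  # e.g. "billing", "delivery", "sender"


@dataclass
class InvoiceDate:
    value: str                  # raw date string as printed on the invoice
    kind: Optional[str] = None  # e.g. "payment", if the type is known


@dataclass
class Table:
    header: List[str]
    rows: List[List[Optional[str]]]  # None marks a missing cell (assumption)


@dataclass
class InvoiceRecord:
    # Hypothetical container for everything extracted from one invoice.
    actors: List[Actor] = field(default_factory=list)
    addresses: List[Address] = field(default_factory=list)
    dates: List[InvoiceDate] = field(default_factory=list)
    tables: List[Table] = field(default_factory=list)
```

Keeping each field as a list reflects the fact that an invoice may contain several actors, addresses, dates and tables, and the optional `kind` and `role` attributes reflect that these types are only recorded "if available".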
In this paper, we propose an integrated approach to extract information that is formatted as
tables in invoices. The purpose is to extract all the data, and not only specific labels such as “total
price” or “product description”, in order to be able to process these data automatically, for instance
by loading them into a dedicated database that could be used for invoice information retrieval. In our
collected data set, the tables are not necessarily drawn with lines that precisely delimit the information,
and some information may be missing.
Given an invoice, we consider that a table can be detected at two levels: either visually, by detecting
some characteristic shapes (vertical and/or horizontal lines), or structurally, by detecting an
organization of the tokens of information, aggregated into rows and columns that may intersect.
We refer to this second level as a semantic level since it uses an intrinsic model that tells us what a
table should be, even if its graphical frame is not necessarily fully present. Our contribution is twofold.
On the one hand, we propose a complete formalization of the document based on ordering relations
that help us to process more precisely a graph-based model of the structure of the document. On the
other hand, we combine visual analysis of the document with this more semantic structure to design
an efficient table extraction tool.
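To make the second, structural level more concrete, the sketch below groups tokens into rows and columns from their bounding boxes alone, without relying on any drawn line. It is only a naive illustration under simplifying assumptions, not the method developed in this paper: the `Token` type, the interval-overlap grouping and the rule that a table needs at least two rows and two columns are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


def group(tokens: List[Token], axis: str) -> List[List[Token]]:
    """Group tokens whose extents overlap (transitively) on one axis.

    axis="y" groups by vertical extent and yields candidate rows;
    axis="x" groups by horizontal extent and yields candidate columns.
    """
    if axis == "y":
        lo, hi = (lambda t: t.y0), (lambda t: t.y1)
    else:
        lo, hi = (lambda t: t.x0), (lambda t: t.x1)

    groups: List[List[Token]] = []
    span_hi = 0.0  # upper bound of the extent covered by the current group
    for tok in sorted(tokens, key=lo):
        if groups and lo(tok) < span_hi:
            # The token starts before the current group ends: same group.
            groups[-1].append(tok)
            span_hi = max(span_hi, hi(tok))
        else:
            groups.append([tok])
            span_hi = hi(tok)
    return groups


def looks_like_table(tokens: List[Token], min_rows: int = 2, min_cols: int = 2) -> bool:
    """Naive test: enough distinct rows and columns of tokens (assumed rule)."""
    return len(group(tokens, "y")) >= min_rows and len(group(tokens, "x")) >= min_cols
```

With four tokens laid out as a 2 x 2 grid, `group(tokens, "y")` returns two rows and `group(tokens, "x")` returns two columns, so `looks_like_table` accepts the layout even though no line is drawn. Such a purely geometric test would need to be combined with other cues in practice, since it ignores token content and tolerates no skew.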
Organization
Section 2 introduces the domain of data recognition and our modeling of invoices. Section 3
presents the ordering relations used in our graph-based approach. Section 4 introduces the notion of
pattern. Section 5 presents our new approach to extract tables and Section 6 shows its efficiency through
experiments.
¹ This work is conducted in association with the KaliConseil company, which is developing its own
invoice processing system.