

A two-stage approach for table extraction in invoices

Thomas Saout

Univ Angers, LERIA,

F-49000 Angers, France

thomas.saout@etud.univ-angers.fr

Frédéric Lardeux

Univ Angers, LERIA,

F-49000 Angers, France

frederic.lardeux@univ-angers.fr

Frédéric Saubion

Univ Angers, LERIA,

F-49000 Angers, France

frederic.saubion@univ-angers.fr

October 11, 2022

Abstract

The automated analysis of administrative documents is an important field in document recognition that has been studied for decades. Invoices are key documents among the huge amounts of documents available in companies and public services. Most of the time, invoices contain data presented in tables, which must be clearly identified in order to extract the relevant information. In this paper, we propose an approach that combines an image-processing-based estimation of the shape of the tables with a graph-based representation of the document, which is used to identify complex tables precisely. We propose an experimental evaluation using a real case application.

Keywords: data extraction, invoice, graph-based representation

1 Introduction

The automated analysis of administrative documents is an important field in document recognition that has been studied for decades [1]. Among the huge amounts of documents available in companies and public services, invoices are key documents. Their automated processing is a complex task [2] and has led to commercial systems developed by companies such as ITESOFT or ABBYY.

Invoices generally require complex administrative procedures, which involve different departments (e.g., accounting, logistics, supply chain). The invoices have to be processed through specific workflows [3]. From the document point of view, the full processing of invoices includes their digitization using Optical Character Recognition (OCR) [4] and their processing to achieve information extraction, which aims at finding identifiers and their types, amounts, and dates [5,6]. This global process requires handling some specific characteristics of the considered documents [2]:

• handling the variability of layouts,

• training and quickly adapting to new contexts,

• minimizing the end-user task.

Context

Many solutions have been proposed to manage information from scanned invoices, and most of them are based on machine learning techniques (e.g., classification). Research on this problem is still active [5,7]. The first problem was certainly to identify invoices [8], and models have hence been proposed to ease their processing [9]. Once invoices have been correctly scanned and identified, a crucial remaining question is "how to extract relevant information from these invoices?". Labeling techniques can be applied using rules [10]. This named entity recognition (NER) task has recently been handled using neural networks [11,12]. Since invoices contain text sequences that are mostly different from natural language corpora, specific information extraction methods have been proposed to take into account the specific structures in these documents. For instance, in [2], a star graph is used to consider the neighborhood of a token (a token is an elementary, semantically coherent element of the document). The specific structures of invoices lead us to consider the geographical organization of the document, and graph-based models are thus relevant [13]. Examining invoices more closely shows that most of them include tables as a main structural feature. Hence, table detection within invoices appears as an important processing task [13]. Table processing is indeed an old challenge [14,15] and, as noted in [16], it includes different tasks: detection, extraction, interpretation and understanding. Here, we focus on detection (detecting the presence of a tabular structure in a document) and extraction (providing the data of a detected table in a more readable format). These challenges are still active [17]. Recent work [18] proposes an approach to detect the general frame of a table and to extract its content. Targeting wide classes of documents, many recent works use neural networks [19,20,21,22] to recognize table structures in documents by means of large training sets. For more specific tables, characteristics such as headers can also be exploited to help these tasks [23]. Rule-based systems, which were the seminal table extraction techniques, may also be relevant [24].

arXiv:2210.04716v1 [cs.IR] 10 Oct 2022

Aim and Contribution

The general purpose is to automatically process invoices in order to obtain the important data contained in these documents, particularly the information contained in tables. This work is motivated by a real case application in a full document management system¹.

We focus on different types of information, such as locations, tables, dates, and actors, which are the organizations or people identified in the invoice. These fields were established after analyzing several invoice models and according to the current requests of the companies surveyed. From a general point of view, some important information contained in an invoice (the list is not exhaustive) can be sketched as follows:

• actors: the individual, company or companies involved in the invoice (customer or supplier).

• addresses: all the addresses contained in the document and, if available, their types (for instance billing, delivery or sender address).

• dates: the set of dates specific to the invoice process, such as the edition date, the payment date, etc.

• tables: tables often present the invoiced items, quantities, prices, etc.

In this paper, we propose an integrated approach to extract information that is formatted into tables in invoices. The purpose is to extract the whole data, and not only specific labels such as "total price" or "product description", in order to be able to process these data automatically, for instance by loading them into a dedicated database that could be used for invoice information retrieval. In our collected data set, the tables are not necessarily drawn using lines that precisely delimit the information, and they may include missing information.

Given an invoice, we consider that a table can be detected at two levels: either by visually detecting some characteristic shapes (vertical and/or horizontal lines) or by detecting some structural organization of the tokens of information, aggregated into rows and columns that may intersect. We refer to this second level as a semantic level since it uses an intrinsic model that tells us what a table should be, even if its graphical frame is not necessarily fully present. Our contribution is twofold. On the one hand, we propose a complete formalization of the document based on ordering relations, which helps us to process a graph-based model of the structure of the document more precisely. On the other hand, we combine visual analysis of the document with this more semantic structure to design an efficient table extraction tool.
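To make the second, structural level more concrete, the following minimal sketch groups token bounding boxes into candidate rows whenever their vertical extents overlap. It is only an illustration under simplifying assumptions and not the ordering-relation formalization developed later in the paper: the box format `(x0, y0, x1, y1)` and the helper names `vertical_overlap` and `group_rows` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def vertical_overlap(a: Box, b: Box) -> bool:
    """Return True if the vertical extents of the two boxes overlap."""
    return min(a[3], b[3]) - max(a[1], b[1]) > 0


def group_rows(boxes: List[Box]) -> List[List[Box]]:
    """Greedily aggregate boxes into rows (illustrative heuristic only)."""
    rows: List[List[Box]] = []
    # Visit boxes from top to bottom, then left to right.
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        for row in rows:
            # Compare with the most recently added box of the row.
            if vertical_overlap(row[-1], box):
                row.append(box)
                break
        else:
            rows.append([box])
    # Order each row from left to right.
    return [sorted(row, key=lambda b: b[0]) for row in rows]
```

The same idea applied to horizontal extents would yield candidate columns. Intersecting rows and columns then suggest a tabular structure even when no ruling lines are drawn.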

Organization

Section 2 introduces the domain of data recognition and our modeling of invoices. Section 3 presents the ordering relations used in our graph-based approach. Section 4 introduces the notion of pattern. Section 5 presents our new approach to extract tables, and Section 6 shows its efficiency with experiments.

¹ This work is conducted in association with the KaliConseil company, which is developing its own specific invoice processing system.

2 Data recognition

2.1 Optical Character recognition

As mentioned in the introduction, the first task is to recognize the text in a scanned document by means of an optical character recognition (OCR) system. Here, we use Flexicapture [25]². A full description of OCR techniques is beyond the scope of this paper; nevertheless, we point out some biases of this kind of tool.

• Punctuation recognition can be complex for some documents. The tool can be configured to change the format of dates automatically; for example, 01/01/1992 will be transformed into 01.01.1992.

• Certain digits are difficult to recognize and can easily be confused with letters. This problem may depend on the font used in the document. For instance, "0oOo" can be confusing for an OCR system.

• The quality of the digitization is an important parameter. There is indeed a strong correlation between the quality of the scanned document and the errors made by the OCR system.

This step allows us to generate a searchable PDF that includes the images obtained during the digitization and a non-visible text layer. This text layer includes the characters and their geographical position in the document. Note that searchable PDF documents are documents whose text elements can be selected.

2.2 Tokenization

A token corresponds to a useful semantic unit according to a given target language. Tokenization is often considered as the pre-processing step of any natural language processing (NLP) system [26]. In our work, we use word-level tokenization. Once the OCR processing has been performed, an API called PDFBox is used to extract the text layer from the PDF document. We also use this API to tokenize the extracted text, separating words using a maximal distance of two character sizes in the smallest font.

Once the text has been extracted and the positions of the characters have been retrieved, we generate word-level tokens by grouping the characters. Each token includes its characters and the coordinates of a bounding box enclosing all of them.

Figure 1: Example of obtained tokens
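The character grouping described in this section can be illustrated with a short sketch. It is written in Python rather than with the PDFBox API used by the authors, and it rests on assumptions: the `Char` and `Token` structures, the function names, and the way the gap threshold `max_gap` is passed in are all hypothetical. The characters are assumed to belong to a single text line.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    """One character of the text layer with its box (hypothetical structure)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Token:
    """A word-level token: its text and the box enclosing all its characters."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def merge(chars: List[Char]) -> Token:
    """Build a token whose bounding box encloses every given character."""
    return Token(
        text="".join(c.text for c in chars),
        x0=min(c.x0 for c in chars),
        y0=min(c.y0 for c in chars),
        x1=max(c.x1 for c in chars),
        y1=max(c.y1 for c in chars),
    )


def group_chars(chars: List[Char], max_gap: float) -> List[Token]:
    """Group the characters of one line into tokens, splitting on large gaps."""
    tokens: List[Token] = []
    current: List[Char] = []
    for ch in sorted(chars, key=lambda c: c.x0):
        # Start a new token when the horizontal gap exceeds the threshold.
        if current and ch.x0 - current[-1].x1 > max_gap:
            tokens.append(merge(current))
            current = []
        current.append(ch)
    if current:
        tokens.append(merge(current))
    return tokens
```

In this sketch, `max_gap` plays the role of the maximal separation distance mentioned above; how it is derived from the smallest font size is left to the caller.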

2.3 Entity Recognition

First presented in [27], named entity recognition (NER) consists in labeling a text so that each string is associated with a person, a location, an organization, a temporal expression, an amount, a percentage, etc. NER has since evolved to consider more or fewer labels [28,29,30,31]. In our context, we use regular expressions to perform simple rule-based NER, which is sufficient. Our NER is restricted to a few labels due to our restricted context, but more labels are necessary when facing more open domains [30]. In our case, some tags, such as header recognition, can easily be improved by using a keyword library.
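A rule-based labeler of this kind might look like the sketch below. The paper does not list its actual patterns, labels, or keywords, so everything here is an assumption chosen for illustration: the label names, the three regular expressions, the `HEADER_KEYWORDS` set, and the `label_token` function.

```python
import re

# Hypothetical patterns; the rules actually used by the authors are not given.
PATTERNS = [
    ("DATE", re.compile(r"\d{2}[./-]\d{2}[./-]\d{4}")),
    ("PERCENT", re.compile(r"\d+(?:[.,]\d+)?\s?%")),
    ("AMOUNT", re.compile(r"\d+[.,]\d{2}\s?(?:€|EUR)?")),
]

# Hypothetical keyword library used to recognize table headers.
HEADER_KEYWORDS = {"quantity", "description", "price", "total"}


def label_token(text: str) -> str:
    """Return the label of the first rule matching the whole token."""
    stripped = text.strip()
    if stripped.lower() in HEADER_KEYWORDS:
        return "HEADER"
    for label, pattern in PATTERNS:
        # fullmatch avoids labeling a token from a partial match.
        if pattern.fullmatch(stripped):
            return label
    return "OTHER"
```

Extending the keyword set is the kind of inexpensive improvement to header recognition that the text alludes to. New labels can be added by appending a pattern to the list.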

² The choice of Flexicapture is motivated by our practical industrial tool.
