
PP-StructureV2: A Stronger Document Analysis System
Chenxia Li, Ruoyu Guo, Jun Zhou, Mengtao An,
Yuning Du, Lingfeng Zhu, Yi Liu, Xiaoguang Hu, Dianhai Yu
Baidu Inc.
{lichenxia, zhulingfeng}@baidu.com
Abstract
A large amount of document data exists in unstructured forms such as raw images without any text information. Designing a practical document image analysis system is a meaningful but challenging task. In previous work, we proposed an intelligent document analysis system, PP-Structure. To further upgrade the functions and performance of PP-Structure, we propose PP-StructureV2 in this work, which contains two subsystems: Layout Information Extraction and Key Information Extraction. Firstly, we integrate an Image Direction Correction module and a Layout Restoration module to enhance the functionality of the system. Secondly, 8 practical strategies are utilized in PP-StructureV2 for better performance. For the Layout Analysis model, we introduce the ultra-lightweight detector PP-PicoDet and the knowledge distillation algorithm FGD for model lightweighting, which increased the inference speed by 11 times with comparable mAP. For the Table Recognition model, we utilize PP-LCNet, CSP-PAN, and SLAHead to optimize the backbone module, feature fusion module, and decoding module, respectively, which improved the table structure accuracy by 6% with comparable inference speed. For the Key Information Extraction model, we introduce VI-LayoutXLM, a visual-feature-independent LayoutXLM architecture, together with the TB-YX sorting algorithm and the U-DML knowledge distillation algorithm, which brought 2.8% and 9.1% improvements, respectively, on the Hmean of the Semantic Entity Recognition and Relation Extraction tasks. All of the above-mentioned models and code are open-sourced in the GitHub repository PaddleOCR.¹
1 Introduction
Document intelligence has been a booming research topic and a practical industrial demand in recent years. It mainly refers to the process of understanding, classifying, extracting, and summarizing information, through artificial intelligence technology, from the text and rich typography contained in web pages, digital documents, or scanned documents. Due to the diversity of layouts and formats, low-quality scanned document images, and the complexity of template structures, document intelligence is a very challenging task that has received extensive attention in related fields. Layout Analysis, Table Recognition, and Key Information Extraction are three representative tasks in intelligent document analysis.
¹ https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.6/ppstructure
Document Layout Analysis can be regarded, in essence, as an object detection task on document images: the basic units of a document, such as titles, paragraphs, tables, and illustrations, are the objects to be detected and recognized. Layout-parser (Shen et al. 2021) is a unified toolkit for deep learning based document image analysis. VSR (Zhang et al. 2021) is proposed for layout analysis and achieves state-of-the-art performance on the PubLayNet dataset (Zhong, Tang, and Yepes 2019). In PP-Structure, we use PP-YOLOv2 (Huang et al. 2021) to complete the layout analysis task, which runs in real time on GPU devices. However, the models proposed so far are not CPU-friendly and are thus not conducive to deployment on CPUs or mobile devices.
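As a concrete illustration of this object-detection formulation, the sketch below runs the layout model shipped with PP-Structure through the PaddleOCR Python interface and prints one category and bounding box per detected region. It is a minimal sketch based on the documented PPStructure entry point; the constructor flags, the file name, and the result keys shown here are assumptions that may differ between PaddleOCR versions.

import cv2
from paddleocr import PPStructure  # PaddleOCR's structured-document pipeline

# Disable table recognition and OCR so that only layout analysis runs
# (flag names follow the PaddleOCR 2.6 documentation; treat them as assumptions).
layout_engine = PPStructure(table=False, ocr=False, show_log=False)

image = cv2.imread("document_page.jpg")  # hypothetical input image
for region in layout_engine(image):
    # Each region is expected to carry a category ("text", "title", "table",
    # "figure", ...) and a bounding box in pixel coordinates.
    print(region["type"], region["bbox"])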
Table Recognition aims to convert table images into editable files such as Excel spreadsheets. The diversity of tables in document images, such as various rowspans and colspans and different text types, makes table recognition a hard task in document understanding. Many table recognition methods exist, ranging from traditional algorithms based on heuristic rules to recently developed methods based on deep learning. Among them, end-to-end methods have received extensive attention due to the simplicity of their pipeline: they represent the table in HTML format and adopt a Seq2Seq model (Sutskever, Vinyals, and Le 2014) to predict the table structure, such as TableRec-RARE (Du et al. 2021b) in PP-Structure, powered by PaddlePaddle (Ma et al. 2019). In TableMaster (Ye et al. 2021), a Transformer is used as the decoder, which achieves high accuracy but incurs a huge computational cost.
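To make this HTML-based formulation concrete, the following sketch merges a predicted sequence of structure tokens with the text recognized in each cell to form the final table markup. The token vocabulary and the assemble_html helper are illustrative assumptions and do not reproduce the exact token set used by TableRec-RARE or TableMaster.

def assemble_html(structure_tokens, cell_texts):
    """Insert recognized cell texts into a predicted HTML structure sequence."""
    html, cells = ["<table>"], iter(cell_texts)
    for token in structure_tokens:
        html.append(token)
        if token == "<td>":
            # Each opening cell tag consumes the next recognized cell string.
            html.append(next(cells, ""))
    html.append("</table>")
    return "".join(html)

# A 2x2 table predicted by a hypothetical structure decoder.
tokens = ["<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>",
          "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>"]
texts = ["Model", "Accuracy", "TableRec-RARE", "baseline"]
print(assemble_html(tokens, texts))
# -> <table><tr><td>Model</td><td>Accuracy</td></tr>...</table>

In practice, spanning cells are handled by additional structure tokens for rowspan and colspan attributes, which is what makes the sequence-to-sequence formulation flexible enough for complex tables.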
Key Information Extraction (KIE) refers to extracting the specific information that users pay attention to from documents. Semantic Entity Recognition (SER) and Relation Extraction (RE) are its two main subtasks. LayoutLM (Xu et al. 2020a) was first proposed to jointly model the interactions between text and layout information across scanned document images, which benefits the downstream KIE process. LayoutLMv2 (Xu et al. 2020b) integrates image information in the pre-training stage, taking advantage of the Transformer architecture to learn the cross-modality interaction between visual and textual information. LayoutXLM (Xu et al. 2021) is a multilingual extension of the LayoutLMv2 (Xu et al. 2020b) model. XY-LayoutLM (Gu et al. 2022) proposes the Augmented XY-Cut algorithm to sort the text lines in human reading order, based on the observation that reading order is vital for KIE. However, these multi-modal ap-