PP-StructureV2: A Stronger Document Analysis System
Chenxia Li, Ruoyu Guo, Jun Zhou, Mengtao An,
Yuning Du, Lingfeng Zhu, Yi Liu, Xiaoguang Hu, Dianhai Yu
Baidu Inc.
{lichenxia, zhulingfeng}@baidu.com
Abstract
A large amount of document data exists in unstructured form,
such as raw images without any text information. Designing a
practical document image analysis system is a meaningful but
challenging task. In previous work, we proposed an intelligent
document analysis system, PP-Structure. To further upgrade the
functionality and performance of PP-Structure, we propose
PP-StructureV2 in this work, which contains two subsystems:
Layout Information Extraction and Key Information Extraction.
Firstly, we integrate an Image Direction Correction module and
a Layout Restoration module to enhance the functionality of the
system. Secondly, 8 practical strategies are utilized in
PP-StructureV2 for better performance. For the Layout Analysis
model, we introduce the ultra-lightweight detector PP-PicoDet
and the knowledge distillation algorithm FGD for model
lightweighting, which increases the inference speed by 11 times
with comparable mAP. For the Table Recognition model, we
utilize PP-LCNet, CSP-PAN and SLAHead to optimize the backbone
module, feature fusion module and decoding module, respectively,
which improves the table structure accuracy by 6% with
comparable inference speed. For the Key Information Extraction
model, we introduce VI-LayoutXLM, which is a visual-feature
independent LayoutXLM architecture, the TB-YX sorting algorithm
and the U-DML knowledge distillation algorithm, which bring
2.8% and 9.1% improvement respectively on the Hmean of the
Semantic Entity Recognition and Relation Extraction tasks. All
the above-mentioned models and code are open-sourced in the
GitHub repository PaddleOCR¹.
1 Introduction
Document intelligence has become a booming research topic and a
practical industrial demand in recent years. It mainly refers to
using artificial intelligence technology to understand, classify,
extract and summarize the text and rich typography contained in
web pages, digital documents or scanned documents. Due to the
diversity of layouts and formats, low-quality scanned document
images, and the complexity of template structures, document
intelligence is a very challenging task and has received
extensive attention in related fields. Layout Analysis, Table
Recognition, and Key Information Extraction are three
representative tasks in intelligent document analysis.
¹ https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.6/ppstructure
Document Layout Analysis can essentially be regarded as an
object detection task on document images. Basic units such as
titles, paragraphs, tables, and illustrations in the document are
the objects to be detected and recognized. Layout-parser (Shen
et al. 2021) is a unified toolkit for deep learning based
document image analysis. VSR (Zhang et al. 2021) is proposed for
layout analysis and achieves state-of-the-art results on the
PubLayNet dataset (Zhong, Tang, and Yepes 2019). In PP-Structure,
we use PP-YOLOv2 (Huang et al. 2021) to complete the layout
analysis task, which runs in real time on GPU devices. However,
currently proposed models are not CPU-friendly and are thus
difficult to deploy on CPUs or mobile devices.
Table Recognition is used to convert table images into editable
Excel files. The diversity of tables in document images, such as
various rowspans and colspans and different text types, makes
table recognition a hard task in document understanding. There
are many table recognition methods, such as traditional
algorithms based on heuristic rules and recently developed
methods based on deep learning. Among them, end-to-end methods
have received extensive attention due to the simplicity of their
pipeline: they represent the table in HTML format and adopt
Seq2Seq (Sutskever, Vinyals, and Le 2014) to predict the table
structure, as in TableRec-RARE (Du et al. 2021b) in PP-Structure,
powered by PaddlePaddle (Ma et al. 2019). In TableMaster (Ye
et al. 2021), a Transformer is used as the decoder, which
achieves high accuracy but brings a huge computation cost.
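As a rough illustration of the HTML representation mentioned above, the sketch below fills recognized cell texts into a predicted sequence of table-structure tokens. The token vocabulary and the helper name `build_html` are assumptions made for this example; they are not taken from TableRec-RARE or TableMaster.

```python
from typing import List


def build_html(structure_tokens: List[str], cell_texts: List[str]) -> str:
    """Fill cell texts into a table-structure token sequence.

    Hypothetical helper: each cell-opening token consumes the next cell text.
    """
    html_parts = []
    cell_index = 0
    for token in structure_tokens:
        html_parts.append(token)
        # A cell opens with "<td>", or with "<td ...>" when it carries span attributes.
        if token == "<td>" or token.startswith("<td "):
            if cell_index < len(cell_texts):
                html_parts.append(cell_texts[cell_index])
            cell_index += 1
    return "<table>" + "".join(html_parts) + "</table>"


# Illustrative input: a header cell spanning two columns, then one row with two cells.
tokens = ["<tr>", '<td colspan="2">', "</td>", "</tr>",
          "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>"]
texts = ["Header", "a", "b"]
html = build_html(tokens, texts)
print(html)
```

Keeping the structure tokens separate from the cell texts mirrors the split between structure prediction and text recognition; a real system would also need to match each cell to its OCR result by position, which this sketch leaves out.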
Key Information Extraction (KIE) refers to extracting the
specific information that users care about. Semantic Entity
Recognition (SER) and Relation Extraction (RE) are the two main
subtasks of KIE. LayoutLM (Xu et al. 2020a) was first proposed to
jointly model interactions between text and layout information
across scanned document images, which benefits the downstream
KIE process. LayoutLMv2 (Xu et al. 2020b) integrates image
information in the pre-training stage, taking advantage of the
Transformer architecture to learn the cross-modality interaction
between visual and textual information. LayoutXLM (Xu et al.
2021) is a multilingual extension of the LayoutLMv2 (Xu et al.
2020b) model. XY-LayoutLM (Gu et al. 2022) proposes the Augmented
XY-CUT algorithm to sort text lines in human reading order,
based on the observation that reading order is vital for KIE.
However, these multi-modal approaches do not pay much attention
to inference time.
arXiv:2210.05391v2 [cs.CV] 13 Oct 2022
Figure 1: Framework of the proposed PP-StructureV2. It contains two subsystems: layout information extraction and key information extraction.
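To make the idea of sorting text lines into reading order concrete, here is a minimal sketch of a top-to-bottom, left-to-right ordering of text boxes. It is a deliberately simple heuristic written for illustration: it is neither the Augmented XY-CUT algorithm of XY-LayoutLM nor the TB-YX algorithm of PP-StructureV2, and the box format, the function name `sort_reading_order` and the `y_tolerance` parameter are assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def sort_reading_order(boxes: List[Box], y_tolerance: float = 10.0) -> List[Box]:
    """Order boxes top-to-bottom, then left-to-right within each line."""
    # First pass: sort by top edge, breaking ties by left edge.
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    lines: List[List[Box]] = []
    for box in ordered:
        # A box joins the current line if its top edge is within the
        # tolerance of the top edge of the first box on that line.
        if lines and abs(box[1] - lines[-1][0][1]) <= y_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    # Second pass: inside each line, sort left to right.
    return [box for line in lines for box in sorted(line, key=lambda b: b[0])]


boxes = [(200, 12, 300, 30), (10, 10, 100, 30), (10, 50, 100, 70), (120, 8, 190, 30)]
ordered_boxes = sort_reading_order(boxes)
print(ordered_boxes)
```

The tolerance absorbs small vertical jitter between boxes on the same visual line; multi-column layouts would need a more careful method such as the ones cited above.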
PP-Structure is our first attempt at an intelligent document
analysis system. It supports basic functions such as Layout
Analysis and Table Recognition, but lacks consideration of
efficiency, and there is still much room for performance
improvement. In this work, we propose PP-StructureV2, a more
robust and comprehensive document analysis system. Figure 1
shows the PP-StructureV2 framework. Firstly, the direction of
the input document image is corrected by the Image Direction
Correction module. In the Layout Information Extraction
subsystem, shown in the upper branch, the corrected image is
first divided into different areas such as text, table and image
by the layout analysis module, and these areas are then
recognized separately. For example, the table area is sent to
the table recognition module for structure recognition, and the
text area is sent to the OCR engine for text recognition.
Finally, the layout recovery module restores the image to an
editable Word file consistent with the original image layout. In
the Key Information Extraction subsystem, shown in the lower
branch, the OCR engine is used to extract the text content, and
then the Semantic Entity Recognition module and the Relation
Extraction module are used to obtain the entities in the image
and their relationships, respectively, so as to extract the
required key information.
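The routing step of the upper branch can be sketched as follows: each region found by layout analysis is dispatched to a recognizer chosen by its label. All names here (`Region`, `run_layout_branch`, the placeholder recognizers) are hypothetical and only illustrate the control flow; they are not the PaddleOCR API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    label: str    # e.g. "text", "table", "figure"
    content: str  # stand-in for the cropped image of the region


def run_layout_branch(regions: List[Region],
                      recognizers: Dict[str, Callable[[str], str]]) -> List[dict]:
    """Send each detected region to the recognizer registered for its label."""
    results = []
    for region in regions:
        recognizer = recognizers.get(region.label)
        # Regions without a registered recognizer (e.g. figures) pass through unchanged.
        output = recognizer(region.content) if recognizer else region.content
        results.append({"label": region.label, "output": output})
    return results


# Placeholder recognizers standing in for the OCR engine and the table recognition module.
recognizers = {
    "text": lambda crop: f"OCR({crop})",
    "table": lambda crop: f"TABLE_HTML({crop})",
}
regions = [Region("text", "crop0"), Region("table", "crop1"), Region("figure", "crop2")]
results = run_layout_branch(regions, recognizers)
print(results)
```

Registering recognizers in a dictionary keeps the dispatch independent of the individual models, so a module can be swapped without touching the routing logic.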
The contributions of this paper are summarized as follows:
• We upgrade the intelligent document analysis system
  PP-Structure and propose PP-StructureV2 with better
  performance.
• We introduce two new modules in PP-StructureV2, Image
  Direction Correction and Layout Recovery, which support
  processing rotated images and restoring images to editable
  Word files based on the analysis results.
• We optimize the Layout Analysis, Table Recognition and Key
  Information Extraction models, significantly surpassing the
  previous version in terms of speed or accuracy.
The rest of the paper is organized as follows. In Section 2,
we present the details of the newly proposed improvement
strategies. Experimental results are discussed in Section 3,
and conclusions are drawn in Section 4.
2 Improvement Strategies
2.1 Image Direction Correction Module
Since the training set is generally dominated by 0-degree
images, information extraction on rotated images is often
degraded. In PP-StructureV2, the direction of the input image is
first corrected by the PULC text image direction model (Cui
2022) provided by PaddleClas². Some demo images from the dataset
are shown in Figure 2. Unlike the text line direction
classifier, the text image direction classifier performs
direction classification on the entire image. The text image
direction classification model achieves 99% accuracy on the
validation set at 463 FPS on a CPU device.
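Once a direction has been predicted for the whole image, the correction itself is a plain rotation. The sketch below assumes that the classifier outputs one of four angles (0, 90, 180 or 270 degrees) and that the angle states how far the input was rotated clockwise from upright; both the class set and this convention are assumptions for illustration, as are the function names. A tiny nested list stands in for the image.

```python
from typing import List

Image = List[List[int]]  # a tiny grayscale image as rows of pixel values


def rotate_90_ccw(image: Image) -> Image:
    """Rotate the image by 90 degrees counter-clockwise."""
    # Transpose the rows and columns, then reverse the row order.
    return [list(row) for row in zip(*image)][::-1]


def correct_direction(image: Image, predicted_angle: int) -> Image:
    """Undo a clockwise rotation of 0, 90, 180 or 270 degrees (assumed convention)."""
    if predicted_angle not in (0, 90, 180, 270):
        raise ValueError("predicted_angle must be 0, 90, 180 or 270")
    # Each counter-clockwise quarter turn cancels 90 degrees of clockwise rotation.
    for _ in range(predicted_angle // 90):
        image = rotate_90_ccw(image)
    return image


upright = [[1, 2, 3],
           [4, 5, 6]]
rotated_cw_90 = [[4, 1],
                 [5, 2],
                 [6, 3]]  # the upright image turned 90 degrees clockwise
restored = correct_direction(rotated_cw_90, 90)
print(restored)
```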
2.2 Layout Analysis
Layout Analysis refers to dividing document images into
predefined areas such as text, title, table, and figure. In
PP-Structure, we adopted the object detection algorithm
PP-YOLOv2(Huang et al. 2021) as the layout detector. In
² https://github.com/PaddlePaddle/PaddleClas