
PP-StructureV2: A Stronger Document Analysis System
Chenxia Li, Ruoyu Guo, Jun Zhou, Mengtao An,
Yuning Du, Lingfeng Zhu, Yi Liu, Xiaoguang Hu, Dianhai Yu
Baidu Inc.
{lichenxia, zhulingfeng}@baidu.com
Abstract
A large amount of document data exists in unstructured forms such as raw images without any text information. Designing a practical document image analysis system is a meaningful but challenging task. In previous work, we proposed an intelligent document analysis system, PP-Structure. To further upgrade the functions and performance of PP-Structure, we propose PP-StructureV2 in this work, which contains two subsystems: Layout Information Extraction and Key Information Extraction. Firstly, we integrate an Image Direction Correction module and a Layout Restoration module to enhance the functionality of the system. Secondly, 8 practical strategies are utilized in PP-StructureV2 for better performance. For the Layout Analysis model, we introduce the ultra-lightweight detector PP-PicoDet and the knowledge distillation algorithm FGD for model lightweighting, which increased the inference speed by 11 times with comparable mAP. For the Table Recognition model, we utilize PP-LCNet, CSP-PAN, and SLAHead to optimize the backbone module, feature fusion module, and decoding module, respectively, which improved the table structure accuracy by 6% with comparable inference speed. For the Key Information Extraction model, we introduce VI-LayoutXLM, a visual-feature-independent LayoutXLM architecture, together with the TB-YX sorting algorithm and the U-DML knowledge distillation algorithm, which brought 2.8% and 9.1% improvements, respectively, on the Hmean of the Semantic Entity Recognition and Relation Extraction tasks. All of the above-mentioned models and code are open-sourced in the GitHub repository PaddleOCR.¹
1 Introduction
Document intelligence has been a booming research topic and a practical industrial demand in recent years. It mainly refers to the process of understanding, classifying, extracting, and summarizing information, through artificial intelligence technology, from the text and rich typography contained in web pages, digital documents, or scanned documents. Due to the diversity of layouts and formats, low-quality scanned document images, and the complexity of template structures, document intelligence is a very challenging task that has received extensive attention in related fields. Layout Analysis, Table Recognition, and Key Information Extraction are three representative tasks in intelligent document analysis.
¹ https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.6/ppstructure
Document Layout Analysis can be regarded, in essence, as an object detection task on document images: the basic units of a document, such as titles, paragraphs, tables, and illustrations, are the objects to be detected and recognized. Layout-parser (Shen et al. 2021) is a unified toolkit for deep learning based document image analysis. VSR (Zhang et al. 2021) is proposed for layout analysis and achieves state-of-the-art performance on the PubLayNet dataset (Zhong, Tang, and Yepes 2019). In PP-Structure, we use PP-YOLOv2 (Huang et al. 2021) to complete the layout analysis task, which runs in real time on GPU devices. However, the models proposed so far are not CPU-friendly and are thus not conducive to deployment on CPUs or mobile devices.
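As a concrete illustration of this object-detection formulation, the sketch below runs the layout model shipped with PP-Structure through the PaddleOCR Python interface and prints one category and bounding box per detected region. It is a minimal sketch based on the documented PPStructure entry point; the constructor flags, the file name, and the result keys shown here are assumptions that may differ between PaddleOCR versions.

import cv2
from paddleocr import PPStructure  # PaddleOCR's structured-document pipeline

# Disable table recognition and OCR so that only layout analysis runs
# (flag names follow the PaddleOCR 2.6 documentation; treat them as assumptions).
layout_engine = PPStructure(table=False, ocr=False, show_log=False)

image = cv2.imread("document_page.jpg")  # hypothetical input image
for region in layout_engine(image):
    # Each region is expected to carry a category ("text", "title", "table",
    # "figure", ...) and a bounding box in pixel coordinates.
    print(region["type"], region["bbox"])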
Table Recognition aims to convert table images into editable files such as Excel spreadsheets. The diversity of tables in document images, such as various rowspans and colspans and different text types, makes table recognition a hard task in document understanding. Many table recognition methods exist, ranging from traditional algorithms based on heuristic rules to recently developed methods based on deep learning. Among them, end-to-end methods have received extensive attention due to the simplicity of their pipeline: they represent the table in HTML format and adopt a Seq2Seq model (Sutskever, Vinyals, and Le 2014) to predict the table structure, such as TableRec-RARE (Du et al. 2021b) in PP-Structure, powered by PaddlePaddle (Ma et al. 2019). In TableMaster (Ye et al. 2021), a Transformer is used as the decoder, which achieves high accuracy but incurs a huge computational cost.
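To make this HTML-based formulation concrete, the following sketch merges a predicted sequence of structure tokens with the text recognized in each cell to form the final table markup. The token vocabulary and the assemble_html helper are illustrative assumptions and do not reproduce the exact token set used by TableRec-RARE or TableMaster.

def assemble_html(structure_tokens, cell_texts):
    """Insert recognized cell texts into a predicted HTML structure sequence."""
    html, cells = ["<table>"], iter(cell_texts)
    for token in structure_tokens:
        html.append(token)
        if token == "<td>":
            # Each opening cell tag consumes the next recognized cell string.
            html.append(next(cells, ""))
    html.append("</table>")
    return "".join(html)

# A 2x2 table predicted by a hypothetical structure decoder.
tokens = ["<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>",
          "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>"]
texts = ["Model", "Accuracy", "TableRec-RARE", "baseline"]
print(assemble_html(tokens, texts))
# -> <table><tr><td>Model</td><td>Accuracy</td></tr>...</table>

In practice, spanning cells are handled by additional structure tokens for rowspan and colspan attributes, which is what makes the sequence-to-sequence formulation flexible enough for complex tables.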
Key Information Extraction (KIE) refers to extracting the specific information that users pay attention to from documents. Semantic Entity Recognition (SER) and Relation Extraction (RE) are its two main subtasks. LayoutLM (Xu et al. 2020a) was first proposed to jointly model the interactions between text and layout information across scanned document images, which benefits the downstream KIE process. LayoutLMv2 (Xu et al. 2020b) integrates image information in the pre-training stage, taking advantage of the Transformer architecture to learn the cross-modality interaction between visual and textual information. LayoutXLM (Xu et al. 2021) is a multilingual extension of the LayoutLMv2 (Xu et al. 2020b) model. XY-LayoutLM (Gu et al. 2022) proposes the Augmented XY-Cut algorithm to sort the text lines in human reading order, based on the observation that reading order is vital for KIE. However, these multi-modal ap-