ERNIE-Layout: Layout Knowledge Enhanced Pre-training
for Visually-rich Document Understanding
Qiming Peng1, Yinxu Pan1, Wenjin Wang2†, Bin Luo1‡, Zhenyu Zhang1,
Zhengjie Huang1, Teng Hu1, Weichong Yin1, Yongfeng Chen1, Yin Zhang2,
Shikun Feng1, Yu Sun1, Hao Tian1, Hua Wu1, Haifeng Wang1
1Baidu Inc., Beijing, China
2Zhejiang University, Hangzhou, China
{pengqiming, panyinxu, luobin06, zhangzhenyu07}@baidu.com
{wangwenjin, zhangyin98}@zju.edu.cn {yinweichong, fengshikun, sunyu02}@baidu.com
Abstract

Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets. The code and models are publicly available at PaddleNLP¹.
1 Introduction
Visually-rich Document Understanding (VrDU) is an important research field aiming to handle various types of scanned or digital-born business documents (e.g., forms, invoices), which has attracted great attention from industry and academia due to its wide range of applications. Distinct from conventional natural language understanding (NLU) tasks that use only plain text, VrDU models have the opportunity to access the most primitive data features. Herein, the diversity and complexity of document formats pose new challenges to the task: an ideal model needs to make full use of the textual, layout, and even visual information to fully understand visually-rich documents like humans.

* Equal contribution.
† Work done during internship at Baidu Inc.
‡ Corresponding author: Bin Luo.
¹ https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout
The preliminary works for VrDU (Yang et al., 2016, 2017; Katti et al., 2018; Sarkhel and Nandi, 2019; Cheng et al., 2020) usually adopt uni-modal or shallow multi-modal fusion approaches, which are task-specific and require massive data annotations. Recently, pre-trained language models have swept the field: LayoutLM (Xu et al., 2020), LayoutLMv2 (Xu et al., 2021), and other advanced document pre-training approaches (Li et al., 2021a; Appalaraju et al., 2021; Gu et al., 2022) have been proposed successively and achieved great success on various VrDU tasks. Unlike popular uni-modal or vision-language frameworks (Devlin et al., 2019; Liu et al., 2019; Lu et al., 2019; Yu et al., 2021), the uniqueness of document understanding models lies in how they exploit layout knowledge.
However, existing document pre-training solutions typically fall into the trap of simply taking 2D coordinates as an extension of 1D positions to endow the model with layout awareness. Considering the characteristics of VrDU, we believe that layout-centered knowledge should be systematically mined and utilized from two aspects: (1) On the one hand, layout implicitly reflects the proper reading order of documents, while previous methods tend to perform serialization by reusing the results of Optical Character Recognition (OCR), which roughly arranges tokens in a top-to-bottom, left-to-right manner (Wang et al., 2021c; Gu et al., 2022). Inevitably, this is inconsistent with human reading habits for documents with complex layouts (e.g., tables, forms, multi-column templates) and leads to sub-optimal performances on downstream tasks. (2) On the other hand, layout is actually a third modality besides language and vision, while current models tend to take layout as a special position feature, such as the layout embedding in the input layer (Xu et al., 2020) or the bias term in the attention layer (Xu et al., 2021). The lack of cross-modal interaction between layout and text/image might restrict the model from learning the role of layout in semantic expression.
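To make the criticized baseline concrete, the sketch below shows the rough raster-scan serialization that OCR-based pipelines typically produce: tokens are sorted top-to-bottom, then left-to-right. The `Token` structure and the row-quantization heuristic are illustrative assumptions, not code from any released system.

```python
# A minimal sketch of rough raster-scan serialization: OCR tokens are
# ordered top-to-bottom, then left-to-right. This breaks down for tables,
# forms, and multi-column layouts, where it interleaves unrelated regions.
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    x0: float  # left edge of the bounding box
    y0: float  # top edge of the bounding box

def raster_scan_order(tokens: list[Token], row_height: float = 10.0) -> list[Token]:
    # Quantize y so tokens on roughly the same line share a row index,
    # then sort rows top-to-bottom and tokens left-to-right within a row.
    return sorted(tokens, key=lambda t: (round(t.y0 / row_height), t.x0))
```

On a two-column page, for instance, this ordering jumps between the columns line by line, which is exactly the failure mode that a layout-aware serialization is meant to avoid.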
To address these issues, we propose a systematic layout knowledge enhanced pre-training approach, ERNIE-Layout², to improve the performance of document understanding tasks. First of all, we employ an off-the-shelf layout-based document parser in the serialization stage to generate an appropriate reading order for each input document, so that the input sequences received by the model are more in line with human reading habits than those produced by the rough raster-scan order. Then, each textual/visual token is equipped with its position embedding and layout embedding, and sent to the stacked multi-modal transformer layers. To enhance cross-modal interaction, we present a spatial-aware disentangled attention mechanism, inspired by the disentangled attention of DeBERTa (He et al., 2021), in which the attention weights between tokens are computed using disentangled matrices based on their hidden states and relative positions. In the end, layout not only acts as the 2D position attribute of input tokens, but also contributes a spatial perspective to the calculation of semantic similarity.

² It is named after the knowledge enhanced pre-training model, ERNIE (Sun et al., 2019), as a layout-enhanced version.
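As a rough illustration of this mechanism, the sketch below decomposes the attention score into a content-to-content term plus disentangled content-to-position terms for the 1D reading order and the 2D x/y layout distances. It is a minimal sketch under assumed shapes: the relative-position embeddings are passed in as dense tensors, the symmetric position-to-content terms, value projection, and the exact scaling of the released model are all omitted.

```python
# A simplified sketch of spatial-aware disentangled attention, following
# the DeBERTa-style decomposition described above. Bucketing of relative
# positions, multiple heads, and the exact scaling are simplifications.
import torch
import torch.nn.functional as F

def spatial_disentangled_attention(
    h,                       # (seq, d) hidden states
    Wq, Wk,                  # (d, d) content projections
    rel_1d, rel_x, rel_y,    # (seq, seq, d) relative-position embeddings
    Wk_r,                    # (d, d) position projection (shared for brevity)
):
    d = h.size(-1)
    Qc, Kc = h @ Wq, h @ Wk
    # Content-to-content term, as in vanilla self-attention.
    scores = Qc @ Kc.T
    # Content-to-position terms: the 1D reading-order distance and the
    # 2D x/y layout distances each contribute a disentangled bias.
    for rel in (rel_1d, rel_x, rel_y):
        Kr = rel @ Wk_r                          # (seq, seq, d)
        scores = scores + torch.einsum("id,ijd->ij", Qc, Kr)
    probs = F.softmax(scores / d ** 0.5, dim=-1)
    return probs @ h                             # value projection omitted
```

The point of the decomposition is that spatial proximity on the page can raise or lower an attention weight independently of token content, so layout directly participates in the similarity computation rather than being a fixed additive bias.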
With satisfactory serialization results, we propose a new pre-training task, reading order prediction, to predict the next token at each position, which facilitates consistency within the same arranged text segment and discrimination between different segments. Furthermore, during pre-training we also adopt the classic masked visual-language modeling and text-image alignment tasks (Xu et al., 2021), and present a fine-grained multi-modal task, replaced regions prediction, to learn the correlation among language, vision, and layout.
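The following sketch illustrates one plausible form of the reading order prediction objective: each token classifies which position holds its successor in the corrected reading order, scored with cross-entropy over a pairwise score matrix. Deriving the scores from dot products of final hidden states is an assumption made here for brevity, standing in for the model's attention matrix.

```python
# A minimal sketch of the reading order prediction (ROP) objective:
# each token predicts which token comes next in the corrected reading
# order; the (seq, seq) score matrix is compared against gold successor
# indices with cross-entropy. Shapes and the scoring head are assumptions.
import torch
import torch.nn.functional as F

def reading_order_loss(hidden: torch.Tensor, next_index: torch.Tensor) -> torch.Tensor:
    # hidden: (seq, d) final-layer token states
    # next_index: (seq,) gold position of each token's successor
    d = hidden.size(-1)
    logits = hidden @ hidden.T / d ** 0.5   # (seq, seq) pairwise scores
    return F.cross_entropy(logits, next_index)

# Replaced regions prediction (RRP) is analogous but binary: some image
# patches are swapped with patches from another document, and a sigmoid
# head predicts per patch whether it was replaced (binary cross-entropy).
```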
We conduct broad experiments on three representative VrDU downstream tasks with six publicly available datasets to evaluate the performance of the pre-trained model: the key information extraction task with the FUNSD (Jain and Wigington, 2019), CORD (Park et al., 2019), SROIE (Huang et al., 2019), and Kleister-NDA (Graliński et al., 2021) datasets, the document question answering task with the DocVQA (Mathew et al., 2021) dataset, and the document image classification task with the RVL-CDIP (Harley et al., 2015) dataset. The results show that ERNIE-Layout significantly outperforms strong baselines on almost all tasks, proving the effectiveness of our two-part layout knowledge enhancement philosophy.
The contributions are summarized as follows:

• ERNIE-Layout proposes to rearrange the order of input tokens in serialization and adopts a reading order prediction task in pre-training. To the best of our knowledge, ERNIE-Layout is the first attempt to consider the proper reading order in document pre-training.

• ERNIE-Layout incorporates the spatial-aware disentangled attention mechanism into the multi-modal transformer, and designs a replaced regions prediction pre-training task, to facilitate fine-grained interaction across the textual, visual, and layout modalities.

• ERNIE-Layout refreshes the state-of-the-art on various VrDU tasks, and extensive experiments demonstrate the effectiveness of exploiting layout-centered knowledge.
2 Related Work
Layout-aware Pre-trained Model. Humans understand visually-rich documents through many perspectives, such as language, vision, and layout. Based on the powerful modeling ability of the Transformer (Vaswani et al., 2017), LayoutLM (Xu et al., 2020) first embeds the 2D coordinates as layout embeddings for each token and extends the famous masked language modeling pre-training task (Devlin et al., 2019) to masked visual-language modeling, which opens the era of layout-aware pre-trained models. Afterwards, LayoutLMv2 (Xu et al., 2021) concatenates document image patches with textual tokens, and two pre-training tasks, text-image matching and text-image alignment, are proposed to realize cross-modal interaction. StructuralLM (Li et al., 2021a) leverages segment-level, instead of word-level, layout features to make the model aware of which words come from the same cell. DocFormer (Appalaraju et al., 2021) shares the learned spatial embeddings across modalities, making it easy for the model to correlate text to visual tokens and vice versa. TILT (Powalski et al., 2021) proposes an encoder-decoder model to generate results that are not explicitly included in the input sequence, overcoming the limitations of sequence labeling.
[Figure 1: The architecture and pre-training objectives of ERNIE-Layout. The serialization module is introduced to correct the raster-scan order, and the visual encoder extracts the corresponding image features. With the spatial-aware disentangled attention mechanism, ERNIE-Layout is pre-trained with four tasks: reading order prediction, replaced region prediction, masked visual-language modeling, and text-image alignment.]
However, these methods ignore the potential value of layout in depth and directly rely on raster-scan serialization, which is contrary to human reading habits. To solve this problem, LayoutReader (Wang et al., 2021c) designs a sequence-to-sequence framework to generate an appropriate reading order for each document. Unfortunately, it is designed specifically for reading order detection and cannot directly empower various document understanding tasks. Besides, the above methods tend to regard layout as a subsidiary feature of text, following the idea of LayoutLM, but the same text with different layouts may express different semantics. Therefore, we believe that layout should be regarded as a third modality, independent of language and vision.
Knowledge-enhanced Representation. Following the BERT (Devlin et al., 2019) architecture, many efforts have been devoted to pre-trained language models for learning informative representations. Some studies show that extra knowledge, such as facts in WikiData and WordNet, can further benefit pre-trained models (Zhang et al., 2019; Liu et al., 2020; He et al., 2020; Wang et al., 2021b), but the embeddings of words in the text and entities in the knowledge graphs are not in the same vector space, so a cumbersome adaptation module is required (He et al., 2020; Wang et al., 2021a). Another research line excavates the latent human cognitive laws of the text itself: ERNIE (Sun et al., 2019) creatively proposes entity-level masking in pre-training to incorporate human knowledge into language models. Similarly, SpanBERT (Joshi et al., 2020) modifies the masking scheme and training objectives to better represent and predict text spans. BERT-wwm (Cui et al., 2021) introduces a whole word masking strategy for Chinese language models. Outside the field of plain text, ERNIE-ViL (Yu et al., 2021) incorporates structured knowledge obtained from scene graphs to learn joint vision-language representations. Inspired by the above work, we leverage the implicit knowledge related to layout, e.g., reading order, for the understanding of visually-rich documents.
3 Methodology
Figure 1 shows an overview of ERNIE-Layout. Given a document, ERNIE-Layout rearranges the token sequence with layout knowledge and extracts visual features with the visual encoder. The textual and layout embeddings are combined into textual features through a linear projection, and similar operations are executed for the visual embeddings. Next, these features are concatenated and fed into the stacked multi-modal transformer layers, which are equipped with the proposed spatial-aware disentangled attention mechanism. For pre-training, ERNIE-Layout adopts four tasks, including the newly proposed reading order prediction and replaced regions prediction tasks, and the traditional masked visual-language modeling and text-image alignment tasks.
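To ground this description, here is a schematic sketch of how the multi-modal input could be assembled: each textual and visual token combines content, 1D position, and 2D layout embeddings before a shared linear projection and concatenation. All module names, dimensions, and the integer-quantized coordinates are illustrative assumptions, not taken from the released implementation.

```python
# A schematic sketch of the input assembly described above: textual and
# visual streams each sum content, 1D position, and 2D layout embeddings,
# pass through a linear projection, and are concatenated for the shared
# multi-modal transformer. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ErnieLayoutInput(nn.Module):
    def __init__(self, vocab=30522, d=768, max_pos=512, max_coord=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)       # textual content
        self.pos = nn.Embedding(max_pos, d)     # 1D position
        self.x = nn.Embedding(max_coord, d)     # 2D layout, x coordinates
        self.y = nn.Embedding(max_coord, d)     # 2D layout, y coordinates
        self.proj = nn.Linear(d, d)             # linear fusion of features

    def forward(self, ids, boxes, visual_feats, visual_boxes):
        # boxes: (seq, 4) integer-quantized (x0, y0, x1, y1) per token;
        # visual_feats: (patches, d) features from the visual encoder.
        def layout(b):
            return (self.x(b[:, 0]) + self.x(b[:, 2])
                    + self.y(b[:, 1]) + self.y(b[:, 3]))
        text = self.proj(self.tok(ids)
                         + self.pos(torch.arange(ids.size(0)))
                         + layout(boxes))
        vis = self.proj(visual_feats
                        + self.pos(torch.arange(visual_feats.size(0)))
                        + layout(visual_boxes))
        # Concatenate textual and visual streams for the transformer layers.
        return torch.cat([text, vis], dim=0)
```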