
out as a special position feature, such as the layout
embedding in the input layer (Xu et al., 2020) or the
bias term in the attention layer (Xu et al., 2021). The
lack of cross-modal interaction between layout and
text/image may prevent the model from learning
the role of layout in semantic expression.
To achieve these goals, we propose a systematic
layout knowledge enhanced pre-training approach,
ERNIE-Layout², to improve the performance of
document understanding tasks. First of all, we em-
ploy an off-the-shelf layout-based document parser
in the serialization stage to generate an appropriate
reading order for each input document, so that the
input sequences received by the model are more in
line with human reading habits than those produced by
the rough raster-scanning order. Then, each textual/visual to-
ken is equipped with its position embedding and
layout embedding, and sent to the stacked multi-
modal transformer layers. To enhance cross-modal
interaction, we present a spatial-aware disentangled
attention mechanism, inspired by the disentangled
attention of DeBERTa (He et al., 2021), in which
the attention weights between tokens are computed
using disentangled matrices based on their hidden
states and relative positions. As a result, layout not
only acts as the 2D position attribute of input to-
kens, but also contributes a spatial perspective to
the calculation of semantic similarity.
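For illustration, the sketch below shows one way such spatial-aware
disentangled attention could be realized in PyTorch, combining
content-to-content scores with content-to-position and
position-to-content terms for the 1D reading order and the 2D layout
coordinates. The module name, hidden size, bucketing scheme, and
single-head simplification are our own assumptions rather than the
exact ERNIE-Layout implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialDisentangledAttention(nn.Module):
    """Single-head sketch: attention logits from disentangled content and
    relative-position terms over the 1D order and the 2D layout (x, y)."""

    def __init__(self, hidden=64, num_buckets=32):
        super().__init__()
        self.q = nn.Linear(hidden, hidden)
        self.k = nn.Linear(hidden, hidden)
        self.v = nn.Linear(hidden, hidden)
        # Separate relative-position tables for sequence order and layout axes.
        self.rel_1d = nn.Embedding(2 * num_buckets, hidden)
        self.rel_x = nn.Embedding(2 * num_buckets, hidden)
        self.rel_y = nn.Embedding(2 * num_buckets, hidden)
        self.num_buckets = num_buckets
        self.scale = hidden ** -0.5

    def _bucket(self, rel):
        # Clip signed relative distances into a fixed index range (simplified).
        return torch.clamp(rel + self.num_buckets, 0, 2 * self.num_buckets - 1)

    def forward(self, h, pos_1d, pos_x, pos_y):
        # h: (seq, hidden); pos_*: (seq,) integer positions (long tensors).
        q, k, v = self.q(h), self.k(h), self.v(h)

        def rel_terms(pos, table):
            rel = self._bucket(pos[:, None] - pos[None, :])   # (seq, seq)
            r = table(rel)                                     # (seq, seq, hidden)
            c2p = torch.einsum('ih,ijh->ij', q, r)             # content-to-position
            p2c = torch.einsum('jh,ijh->ij', k, r)             # position-to-content
            return c2p + p2c

        scores = q @ k.t()                                     # content-to-content
        scores = scores + rel_terms(pos_1d, self.rel_1d)       # sequential distance
        scores = scores + rel_terms(pos_x, self.rel_x)         # horizontal distance
        scores = scores + rel_terms(pos_y, self.rel_y)         # vertical distance
        attn = F.softmax(scores * self.scale, dim=-1)
        return attn @ v

In the actual model the terms would be computed per attention head and
combined across a stack of multi-modal layers; the sketch keeps a single
head for readability.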
With satisfactory serialization results, we pro-
pose the pre-training task, reading order prediction,
to predict the next token for each position, which
encourages consistency within each arranged text
segment and discrimination between different
segments. Furthermore, during pre-training,
we also adopt the classic masked visual-language
modeling and text-image alignment tasks (Xu et al.,
2021), and present a fine-grained multi-modal task,
replaced regions prediction, to learn the correlation
among language, vision and layout.
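As an illustration of the reading order prediction objective described
above, the following sketch scores, for every token, all tokens in the
sequence as candidate successors and trains against the parser-provided
next-token indices with cross entropy. The pointer-style head, function
names, and toy shapes are our own assumptions, not the paper's exact
formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def reading_order_loss(hidden, next_index, proj_q, proj_k):
    # hidden: (seq, dim) token representations from the multi-modal encoder.
    # next_index: (seq,) index of each token's successor in the reading order.
    q = proj_q(hidden)                 # (seq, dim)
    k = proj_k(hidden)                 # (seq, dim)
    logits = q @ k.t()                 # (seq, seq): score of j being the next token of i
    return F.cross_entropy(logits, next_index)

# Toy usage with random features and a simple left-to-right reading order.
dim, seq = 32, 5
proj_q, proj_k = nn.Linear(dim, dim), nn.Linear(dim, dim)
hidden = torch.randn(seq, dim)
next_index = torch.tensor([1, 2, 3, 4, 0])   # last token points to index 0 as a placeholder
loss = reading_order_loss(hidden, next_index, proj_q, proj_k)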
We conduct extensive experiments on three repre-
sentative VrDU downstream tasks with six publicly
available datasets to evaluate the performance of
the pre-trained model, i.e., the key information ex-
traction task with the FUNSD (Jain and Wigington,
2019), CORD (Park et al., 2019), SROIE (Huang
et al., 2019), Kleister-NDA (Graliński et al., 2021)
datasets, the document question answering task
with the DocVQA (Mathew et al., 2021) dataset,
and the document image classification task with the
RVL-CDIP (Harley et al., 2015) dataset. The results
show that ERNIE-Layout significantly outperforms
strong baselines on almost all tasks, proving
the effectiveness of our two-part layout knowledge
enhancement philosophy.
² It is named after the knowledge enhanced pre-training
model, ERNIE (Sun et al., 2019), as a layout enhanced version.
The contributions are summarized as follows:
• ERNIE-Layout proposes to rearrange the or-
der of input tokens in serialization and adopt
a reading order prediction task in pre-training.
To the best of our knowledge, ERNIE-Layout
is the first attempt to consider the proper read-
ing order in document pre-training.
• ERNIE-Layout incorporates the spatial-aware
disentangled attention mechanism in the multi-
modal transformer, and designs a replaced re-
gions prediction pre-training task, to facilitate
the fine-grained interaction across textual, vi-
sual, and layout modalities.
• ERNIE-Layout refreshes the state-of-the-art
on various VrDU tasks, and extensive exper-
iments demonstrate the effectiveness of ex-
ploiting layout-centered knowledge.
2 Related Work
Layout-aware Pre-trained Model. Humans un-
derstand visually rich documents from many
perspectives, such as language, vision, and layout.
Based on the powerful modeling ability of Trans-
former (Vaswani et al., 2017), LayoutLM (Xu et al.,
2020) initially embeds the 2D coordinates as layout
embeddings for each token and extends the famous
masked language modeling pre-training task (De-
vlin et al., 2019) to masked visual-language mod-
eling, which marks the beginning of layout-aware
pre-trained models. Afterwards, LayoutLMv2 (Xu
et al., 2021) concatenates document image patches
with textual tokens, and two pre-training tasks, text-
image matching and text-image alignment, are pro-
posed to realize the cross-modal interaction. Struc-
turalLM (Li et al., 2021a) leverages segment-level,
instead of word-level, layout features to make the
model aware of which words come from the same
cell. DocFormer (Appalaraju et al., 2021) shares
the learned spatial embeddings across modalities,
making it easy for the model to correlate text to
visual tokens and vice versa. TILT (Powalski et al.,
2021) proposes an encoder-decoder model to gen-
erate results that are not explicitly included in the
input sequence to solve the limitations of sequence