
out as a special position feature, such as the layout
embedding in the input layer (Xu et al., 2020) or the
bias term in the attention layer (Xu et al., 2021). The
lack of cross-modal interaction between layout and
text/image may prevent the model from learning
the role of layout in semantic expression.
To achieve these goals, we propose a systematic
layout knowledge enhanced pre-training approach,
ERNIE-Layout², to improve the performance of
document understanding tasks. First of all, we em-
ploy an off-the-shelf layout-based document parser
in the serialization stage to generate an appropriate
reading order for each input document, so that the
input sequences received by the model are more in
line with human reading habits than those produced by
the rough raster-scanning order. Then, each textual/visual to-
ken is equipped with its position embedding and
layout embedding, and sent to the stacked multi-
modal transformer layers. To enhance cross-modal
interaction, we present a spatial-aware disentangled
attention mechanism, inspired by the disentangled
attention of DeBERTa (He et al., 2021), in which
the attention weights between tokens are computed
using disentangled matrices based on their hidden
states and relative positions. As a result, layout not
only acts as the 2D position attribute of input to-
kens, but also contributes a spatial perspective to
the calculation of semantic similarity.
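For illustration, the sketch below shows one way such spatial-aware
disentangled attention could be realized in PyTorch, combining
content-to-content scores with content-to-position and
position-to-content terms for the 1D reading order and the 2D layout
coordinates. The module name, hidden size, bucketing scheme, and
single-head simplification are our own assumptions rather than the
exact ERNIE-Layout implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialDisentangledAttention(nn.Module):
    """Single-head sketch: attention logits from disentangled content and
    relative-position terms over the 1D order and the 2D layout (x, y)."""

    def __init__(self, hidden=64, num_buckets=32):
        super().__init__()
        self.q = nn.Linear(hidden, hidden)
        self.k = nn.Linear(hidden, hidden)
        self.v = nn.Linear(hidden, hidden)
        # Separate relative-position tables for sequence order and layout axes.
        self.rel_1d = nn.Embedding(2 * num_buckets, hidden)
        self.rel_x = nn.Embedding(2 * num_buckets, hidden)
        self.rel_y = nn.Embedding(2 * num_buckets, hidden)
        self.num_buckets = num_buckets
        self.scale = hidden ** -0.5

    def _bucket(self, rel):
        # Clip signed relative distances into a fixed index range (simplified).
        return torch.clamp(rel + self.num_buckets, 0, 2 * self.num_buckets - 1)

    def forward(self, h, pos_1d, pos_x, pos_y):
        # h: (seq, hidden); pos_*: (seq,) integer positions (long tensors).
        q, k, v = self.q(h), self.k(h), self.v(h)

        def rel_terms(pos, table):
            rel = self._bucket(pos[:, None] - pos[None, :])   # (seq, seq)
            r = table(rel)                                     # (seq, seq, hidden)
            c2p = torch.einsum('ih,ijh->ij', q, r)             # content-to-position
            p2c = torch.einsum('jh,ijh->ij', k, r)             # position-to-content
            return c2p + p2c

        scores = q @ k.t()                                     # content-to-content
        scores = scores + rel_terms(pos_1d, self.rel_1d)       # sequential distance
        scores = scores + rel_terms(pos_x, self.rel_x)         # horizontal distance
        scores = scores + rel_terms(pos_y, self.rel_y)         # vertical distance
        attn = F.softmax(scores * self.scale, dim=-1)
        return attn @ v

In the actual model the terms would be computed per attention head and
combined across a stack of multi-modal layers; the sketch keeps a single
head for readability.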
With satisfactory serialization results, we pro-
pose the pre-training task, reading order prediction,
to predict the next token for each position, which
encourages consistency within each arranged text
segment and discrimination between different
segments. Furthermore, during pre-training,
we also adopt the classic masked visual-language
modeling and text-image alignment tasks (Xu et al.,
2021), and present a fine-grained multi-modal task,
replaced regions prediction, to learn the correlation
among language, vision and layout.
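As an illustration of the reading order prediction objective described
above, the following sketch scores, for every token, all tokens in the
sequence as candidate successors and trains against the parser-provided
next-token indices with cross entropy. The pointer-style head, function
names, and toy shapes are our own assumptions, not the paper's exact
formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def reading_order_loss(hidden, next_index, proj_q, proj_k):
    # hidden: (seq, dim) token representations from the multi-modal encoder.
    # next_index: (seq,) index of each token's successor in the reading order.
    q = proj_q(hidden)                 # (seq, dim)
    k = proj_k(hidden)                 # (seq, dim)
    logits = q @ k.t()                 # (seq, seq): score of j being the next token of i
    return F.cross_entropy(logits, next_index)

# Toy usage with random features and a simple left-to-right reading order.
dim, seq = 32, 5
proj_q, proj_k = nn.Linear(dim, dim), nn.Linear(dim, dim)
hidden = torch.randn(seq, dim)
next_index = torch.tensor([1, 2, 3, 4, 0])   # last token points to index 0 as a placeholder
loss = reading_order_loss(hidden, next_index, proj_q, proj_k)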
We conduct extensive experiments on three repre-
sentative VrDU downstream tasks with six publicly
available datasets to evaluate the performance of
the pre-trained model, i.e., the key information ex-
traction task with the FUNSD (Jain and Wigington,
2019), CORD (Park et al., 2019), SROIE (Huang
et al., 2019), Kleister-NDA (Graliński et al., 2021)
datasets, the document question answering task
with the DocVQA (Mathew et al., 2021) dataset,
and the document image classification task with the
RVL-CDIP (Harley et al., 2015) dataset. The results
show that ERNIE-Layout significantly outperforms
strong baselines on almost all tasks, proving
the effectiveness of our two-part layout knowledge
enhancement philosophy.
² It is named after the knowledge enhanced pre-training
model, ERNIE (Sun et al., 2019), as a layout enhanced version.
The contributions are summarized as follows:
• ERNIE-Layout proposes to rearrange the or-
der of input tokens in serialization and adopt
a reading order prediction task in pre-training.
To the best of our knowledge, ERNIE-Layout
is the first attempt to consider the proper read-
ing order in document pre-training.
• ERNIE-Layout incorporates the spatial-aware
disentangled attention mechanism in the multi-
modal transformer, and designs a replaced re-
gions prediction pre-training task, to facilitate
the fine-grained interaction across textual, vi-
sual, and layout modalities.
• ERNIE-Layout refreshes the state-of-the-art
on various VrDU tasks, and extensive exper-
iments demonstrate the effectiveness of ex-
ploiting layout-centered knowledge.
2 Related Work
Layout-aware Pre-trained Model. Humans un-
derstand visually rich documents from many
perspectives, such as language, vision, and layout.
Based on the powerful modeling ability of Trans-
former (Vaswani et al., 2017), LayoutLM (Xu et al.,
2020) initially embeds the 2D coordinates as layout
embeddings for each token and extends the famous
masked language modeling pre-training task (De-
vlin et al., 2019) to masked visual-language mod-
eling, which marks the beginning of layout-aware
pre-trained models. Afterwards, LayoutLMv2 (Xu
et al., 2021) concatenates document image patches
with textual tokens, and two pre-training tasks, text-
image matching and text-image alignment, are pro-
posed to realize the cross-modal interaction. Struc-
turalLM (Li et al., 2021a) leverages segment-level,
instead of word-level, layout features to make the
model aware of which words come from the same
cell. DocFormer (Appalaraju et al., 2021) shares
the learned spatial embeddings across modalities,
making it easy for the model to correlate text to
visual tokens and vice versa. TILT (Powalski et al.,
2021) proposes an encoder-decoder model to gen-
erate results that are not explicitly included in the
input sequence to solve the limitations of sequence