
XDoc: Unified Pre-training for Cross-Format Document Understanding
Jingye Chen∗, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei
Microsoft Corporation
{v-jingyechen,tengchaolv,lecu,chazhang,fuwei}@microsoft.com
∗ Work done during internship at Microsoft Research Asia.
Abstract
The recent surge of pre-training techniques has driven the rapid development of document understanding. The pre-training and fine-tuning framework has been effectively used to tackle texts in various formats, including plain texts, document texts, and web texts. Despite achieving promising performance, existing pre-trained models usually target one specific document format at a time, making it difficult to combine knowledge from multiple document formats. To address this, we propose XDoc, a unified pre-trained model that deals with different document formats in a single model. For parameter efficiency, we share backbone parameters across different formats, such as the word embedding layer and the Transformer layers. Meanwhile, we introduce adaptive layers with lightweight parameters to enhance the distinction across different formats. Experimental results demonstrate that with only 36.7% of the parameters, XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models, which is cost-effective for real-world deployment. The code and pre-trained models will be publicly available at https://aka.ms/xdoc.
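To make the parameter-sharing idea above concrete, the following is a minimal PyTorch-style sketch (not the released XDoc implementation): one word-embedding layer and one Transformer encoder are shared by all document formats, while each format contributes only a small adaptive embedding of its own (e.g., 2D layout coordinates for document texts). All class names, sizes, and the specific adapter design here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    """Word embedding, 1D position embedding, and Transformer layers shared by all formats."""
    def __init__(self, vocab_size=30522, hidden=768, num_layers=12, num_heads=12):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(512, hidden)  # shared 1D positions
        layer = nn.TransformerEncoderLayer(hidden, num_heads, 4 * hidden, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, input_ids, extra_emb=None):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.word_emb(input_ids) + self.pos_emb(positions)
        if extra_emb is not None:  # format-specific signal from a lightweight adapter
            x = x + extra_emb
        return self.encoder(x)

class Layout2DAdapter(nn.Module):
    """Lightweight format-specific module, e.g. 2D bounding-box embeddings for document texts."""
    def __init__(self, hidden=768, max_coord=1024):
        super().__init__()
        self.x_emb = nn.Embedding(max_coord, hidden)
        self.y_emb = nn.Embedding(max_coord, hidden)

    def forward(self, boxes):  # boxes: (batch, seq_len, 4) holding x0, y0, x1, y1
        return (self.x_emb(boxes[..., 0]) + self.y_emb(boxes[..., 1]) +
                self.x_emb(boxes[..., 2]) + self.y_emb(boxes[..., 3]))

backbone = SharedBackbone()      # the bulk of the parameters, shared across formats
doc_adapter = Layout2DAdapter()  # only a small fraction of extra parameters per format
input_ids = torch.randint(0, 30522, (2, 16))
boxes = torch.randint(0, 1024, (2, 16, 4))
hidden_states = backbone(input_ids, extra_emb=doc_adapter(boxes))
print(hidden_states.shape)       # torch.Size([2, 16, 768])
```

Because the shared encoder holds the vast majority of the parameters, supporting an additional format in such a design costs only a small adapter rather than a full separate pre-trained model.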
1 Introduction
Figure 1: Pre-trained models for different document formats. Most of the structures are similar (word embedding, 1D position embedding, and Transformer layers), while only a small proportion of the structures (2D position and XPath embeddings) are different.

Document understanding has undoubtedly been an important research topic, as documents play an essential role in message delivery in our daily lives (Cui et al., 2021). Over the past several years, the rapid progress of deep learning has driven the development of document understanding for texts in various formats, ranging from plain texts (Devlin et al., 2018; Liu et al., 2019; Dong et al., 2019) and document texts (Xu et al., 2020, 2021a; Huang et al., 2022) to web texts (Chen et al., 2021; Li et al., 2022a; Wang et al., 2022b). Recently, pre-training techniques have become the de facto standard for document understanding, where the model is first pre-trained in a self-supervised manner (e.g., using masked language modeling as the pretext task (Devlin et al., 2018)) on a large-scale corpus and then fine-tuned on a series of downstream tasks such as question answering (Rajpurkar et al., 2016; Mathew et al., 2021), key information extraction (Jaume et al., 2019; Xu et al., 2022), and many others. Albeit achieving impressive performance on specific tasks, existing pre-trained models are far from flexible, as they can only tackle texts in a single format (e.g., LayoutLM (Xu et al., 2020) is designed for document texts and is not suitable for web texts). This makes it difficult to combine knowledge from multiple document formats. Meanwhile, the number of pre-trained models will keep increasing if more formats (e.g., Word and PowerPoint) are further studied in academia.
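To illustrate the pretext task mentioned above, the snippet below sketches BERT-style masked language modeling: a fraction of input tokens is selected, most of the selected positions are replaced with a [MASK] token, and the model is trained to recover the original tokens. This is a generic illustration, not the exact masking recipe or hyperparameters used by XDoc; the token ids, vocabulary size, and 15% masking probability are assumptions.

```python
import torch

def mask_tokens(input_ids, mask_token_id=103, vocab_size=30522, mlm_prob=0.15):
    """BERT-style masking: select ~15% of positions as prediction targets,
    replace 80% of them with [MASK], 10% with a random token, 10% left unchanged."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100  # non-target positions are ignored by the loss

    replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replaced] = mask_token_id

    randomized = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replaced
    input_ids[randomized] = torch.randint(vocab_size, (int(randomized.sum()),))
    return input_ids, labels

# Example: corrupt a toy batch; a model would then predict `labels` at the masked positions,
# e.g. loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
ids = torch.randint(0, 30522, (2, 16))
corrupted, labels = mask_tokens(ids)
```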
Among different pre-trained models for document understanding, it is observed that many pre-