XDoc: Unified Pre-training for Cross-Format Document Understanding
Jingye Chen, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei
Microsoft Corporation
{v-jingyechen,tengchaolv,lecu,chazhang,fuwei}@microsoft.com
Work done during internship at Microsoft Research Asia.
Abstract
The surge of pre-training has witnessed the rapid development of document understanding recently. The pre-training and fine-tuning framework has been effectively used to tackle texts in various formats, including plain texts, document texts, and web texts. Despite achieving promising performance, existing pre-trained models usually target one specific document format at a time, making it difficult to combine knowledge from multiple document formats. To address this, we propose XDoc, a unified pre-trained model that deals with different document formats in a single model. For parameter efficiency, we share backbone parameters across the different formats, such as the word embedding layer and the Transformer layers. Meanwhile, we introduce adaptive layers with lightweight parameters to enhance the distinction across different formats. Experimental results demonstrate that with only 36.7% of the parameters, XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models, which is cost-effective for real-world deployment. The code and pre-trained models will be publicly available at https://aka.ms/xdoc.
1 Introduction
Document understanding has undoubtedly been an important research topic, as documents play an essential role in message delivery in our daily lives (Cui et al., 2021). During the past several years, the flourishing of deep learning has witnessed the rapid development of document understanding for texts in various formats, ranging from plain texts (Devlin et al., 2018; Liu et al., 2019; Dong et al., 2019) and document texts (Xu et al., 2020, 2021a; Huang et al., 2022) to web texts (Chen et al., 2021; Li et al., 2022a; Wang et al., 2022b). Recently, pre-training techniques have been the de facto standard for document understanding, where the model is first pre-trained in a self-supervised manner (e.g., using masked language modeling as the pretext task (Devlin et al., 2018)) on a large-scale corpus, and then fine-tuned on a series of downstream tasks such as question answering (Rajpurkar et al., 2016; Mathew et al., 2021), key information extraction (Jaume et al., 2019; Xu et al., 2022), and many others. Albeit achieving impressive performance on specific tasks, existing pre-trained models are far from flexible, as each can only tackle texts in a single format (e.g., LayoutLM (Xu et al., 2020) is designed for document texts and is not suitable for web texts). This makes it difficult to combine knowledge from multiple document formats. Meanwhile, the number of distinct pre-trained models will keep growing if more formats (e.g., Word and PowerPoint) are further studied in academia.

Figure 1: Pre-trained models for different document formats. Most of the structures are similar (word embedding, 1D position embedding, and Transformer layers), while only a small proportion of the structures (the 2D position and XPath embeddings) are different.

Figure 2: Illustrations of three document formats. For each format, the corresponding meta-information is shown in the dashed boxes. Note that the text content and 1D position are common attributes across the three formats, while the 2D position and XPath strings (marked in red) are specific to document texts and web texts, respectively.

Among the different pre-trained models for document understanding, it is observed that many of them share a similar architecture, such as a word embedding layer, a 1D position embedding layer, and Transformer layers (see Figure 1). In contrast, there are also format-specific parts that serve as prior knowledge for a particular format (e.g., two-dimensional coordinates for document texts and XPaths for web texts). Intuitively, we find that the parameters of these format-specific parts are far fewer than those of the shared backbone. For instance, LayoutLM-BASE (Xu et al., 2020), which is based on RoBERTa (Liu et al., 2019), consists of 131M parameters, while its 2D position embedding layer contains only 3M parameters (2.3%). Similarly, MarkupLM-BASE (Li et al., 2022a), also based on RoBERTa, has 138M parameters, while its XPath embedding layer contains only 11M parameters (8.0%). Therefore, it is indispensable to design a unified pre-trained model for various text formats that shares backbone parameters to make the model more compact.
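As a quick sanity check on the ratios quoted above, the snippet below reproduces them from the reported parameter counts (a worked example only; the counts themselves come from the cited papers).

```python
# Parameter counts in millions, as quoted above.
layoutlm_total, layoutlm_2d_emb = 131, 3      # LayoutLM-BASE vs. its 2D position embedding
markuplm_total, markuplm_xpath_emb = 138, 11  # MarkupLM-BASE vs. its XPath embedding

print(f"2D position embedding share: {layoutlm_2d_emb / layoutlm_total:.1%}")    # ~2.3%
print(f"XPath embedding share:       {markuplm_xpath_emb / markuplm_total:.1%}")  # ~8.0%
```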
To this end, we propose XDoc, a unified architecture with multiple input heads designed for various categories of documents. For the sake of parameter efficiency, we share the backbone network architecture across the different formats, including the word embedding layer, the 1D position embedding layer, and the dense Transformer layers. Considering that the format-specific parts take up only a small proportion of XDoc, we introduce adaptive layers to make the representation learning for the different formats more robust. We collect large-scale training samples for the different document formats and leverage masked language modeling to pre-train XDoc. Specifically, we use three widely used document formats in our experiments: plain, document, and web texts (see Figure 2 for more details). To verify the model accuracy, we select the GLUE benchmark (Wang et al., 2019) and SQuAD (Rajpurkar et al., 2016, 2018) to evaluate plain text understanding, FUNSD (Jaume et al., 2019) and DocVQA (Mathew et al., 2021) to evaluate document understanding, and WebSRC (Chen et al., 2021) to evaluate web text understanding. Experimental results demonstrate that XDoc achieves comparable or even better performance on these tasks while maintaining parameter efficiency.
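Since XDoc is pre-trained with masked language modeling alone, the pretext task can be sketched in a few lines of PyTorch. The BERT-style 15% masking rate and the 80/10/10 replacement scheme below are assumptions for illustration; the paper does not restate these details here.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, special_ids, mlm_prob=0.15):
    """Pick ~15% of the non-special tokens as MLM targets; replace 80% of them
    with [MASK], 10% with a random token, and leave 10% unchanged (assumed ratios)."""
    labels = input_ids.clone()
    prob = torch.full(labels.shape, mlm_prob)
    special = torch.zeros_like(labels, dtype=torch.bool)
    for sid in special_ids:                      # never mask [CLS], [SEP], padding, ...
        special |= labels == sid
    prob.masked_fill_(special, 0.0)
    targets = torch.bernoulli(prob).bool()
    labels[~targets] = -100                      # ignored by the cross-entropy loss

    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & targets
    input_ids[masked] = mask_token_id            # 80% of targets -> [MASK]
    random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & targets & ~masked
    input_ids[random] = torch.randint(vocab_size, labels.shape)[random]  # 10% -> random token
    return input_ids, labels
```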
The contributions of this paper are summarized as follows:
• We propose XDoc, a unified pre-trained model that tackles texts in various formats in pursuit of parameter efficiency.
• Pre-trained with only the masked language modeling task, XDoc achieves comparable or even better accuracy on various downstream tasks.
• The code and pre-trained models will be publicly available at https://aka.ms/xdoc.
2 XDoc
In this section, we first introduce the architecture of XDoc and the details of the embedding used for each document format, and then describe the objectives for pre-training the XDoc model.
2.1 Model Architecture
Figure 3: XDoc tackles multiple formats in one model while sharing most parameters, including the 1D position embedding, the word embedding, and the dense Transformer layers. An optional embedding layer and an adaptive layer are utilized for format-specific prior knowledge, such as the 2D position for document texts and XPaths for web texts (there is no additional prior for plain texts). We demonstrate the dataflow for document texts and use dashed lines for the other formats.

As demonstrated in Figure 3, XDoc is capable of tackling texts in various formats (plain, document, and web texts) in one model. For any input sequence, XDoc learns to embed it using a shared backbone together with additional embedding layers when other prior knowledge is available. In detail, for any input text $T$, XDoc first tokenizes it into subwords $s = s_{1:L}$ using WordPiece, where $L$ denotes the maximum length. Subsequently, each subword $s_i$ with index $i$ is fed to a word embedding layer, whose output we denote as $WordEmb(s_i)$; it is then added to a learnable 1D position embedding $1DEmb(i)$. Since the word embedding and 1D position embedding layers are indispensable for Transformer-based models, we share their parameters across the different formats. Based on this, we detail the overall embedding for each document format below.
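Before turning to the per-format embeddings, the shared part described above can be sketched as a small PyTorch module. The class name and the default sizes (a RoBERTa-like vocabulary of 50,265 subwords, 512 positions, 768 hidden units) are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
from torch import nn

class SharedBackboneEmbedding(nn.Module):
    """Word embedding + learnable 1D position embedding, shared by all formats."""
    def __init__(self, vocab_size=50265, max_len=512, hidden=768):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)  # WordEmb(s_i)
        self.pos_emb = nn.Embedding(max_len, hidden)      # 1DEmb(i)

    def forward(self, token_ids):
        # token_ids: (batch, L) subword indices produced by the tokenizer
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.word_emb(token_ids) + self.pos_emb(positions)
```

For plain texts, this sum is already the overall embedding of Eq. (1) below; the other formats add format-specific terms on top of it.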
Overall embedding for plain texts. As there is no additional prior knowledge for plain texts, we simply add up the word embedding and the 1D position embedding to construct the input to the Transformer layers, following (Devlin et al., 2018; Liu et al., 2019). For each word $s_i^P$, where $i$ is the index and "P" denotes "Plain", the overall embedding $Emb(s_i^P)$ can be calculated as follows:

$Emb(s_i^P) = WordEmb(s_i^P) + 1DEmb(i)$    (1)
Overall embedding for document texts. Different from plain texts, visually rich document texts are usually organized with 2D layouts, where the coordinates of each text box play a crucial role in understanding. Hence, the 2D position must be taken into account during pre-training. Concretely, for a given subword $s_i^D$ ("D" is the abbreviation of "Document"), we denote the 2D position as $box_i^D = (l_i, r_i, t_i, b_i, w_i, h_i)$, where $l, r, t, b, w, h$ denote the left, right, top, and bottom coordinates, and the width and height of the text box, respectively. For example, as illustrated in Figure 2(b), $l, r, t, b, w, h$ of the text "PERSONAL" are set to 240, 275, 80, 100, 35, and 20, respectively. Considering that most parameters are shared across the different formats, we introduce an adaptive layer to enhance the distinction of format-specific prior information. The adaptive layer is simply implemented as a lightweight Linear-ReLU-Linear sequence, and we discuss its effectiveness in Section 3.4. Following (Xu et al., 2020, 2021a), we add up all the embeddings to construct the overall embedding $Emb(s_i^D)$ as follows:

$Emb(s_i^D) = WordEmb(s_i^D) + 1DEmb(i) + DocAdaptive[2DEmb(box_i^D)]$    (2)

$2DEmb(box_i^D) = LeftEmb(l_i) + RightEmb(r_i) + TopEmb(t_i) + BottomEmb(b_i) + WidthEmb(w_i) + HeightEmb(h_i)$    (3)

where $LeftEmb$ denotes the embedding layer for the left coordinate (the other embedding layers follow the same naming convention). Please note that the adaptive layer is not shared across formats, and $DocAdaptive$ is specifically used for document texts.
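A minimal sketch of Eq. (2) and (3) in PyTorch is given below: six coordinate embedding tables are summed and passed through the Linear-ReLU-Linear adaptive layer before being added to the shared word and 1D position embeddings. The class name, the coordinate vocabulary of 1024, and the reuse of the SharedBackboneEmbedding sketch from above are assumptions for illustration, not the released implementation.

```python
import torch
from torch import nn

class DocTextEmbedding(nn.Module):
    """Overall embedding for document texts: shared backbone + DocAdaptive(2DEmb)."""
    def __init__(self, shared, max_coord=1024, hidden=768):
        super().__init__()
        self.shared = shared  # a SharedBackboneEmbedding instance (WordEmb + 1DEmb)
        # One table per box attribute (l, r, t, b, w, h), i.e. LeftEmb, ..., HeightEmb
        self.box_embs = nn.ModuleList(nn.Embedding(max_coord, hidden) for _ in range(6))
        # Lightweight Linear-ReLU-Linear adaptive layer, specific to document texts
        self.doc_adaptive = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

    def forward(self, token_ids, boxes):
        # boxes: (batch, L, 6) integer coordinates per subword, e.g.
        # (240, 275, 80, 100, 35, 20) for the word "PERSONAL" in Figure 2(b)
        box_emb = sum(emb(boxes[..., k]) for k, emb in enumerate(self.box_embs))  # Eq. (3)
        return self.shared(token_ids) + self.doc_adaptive(box_emb)                # Eq. (2)
```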
Overall embedding for web texts. Since the 2-D layout of each website is not fixed and it highly