
XDoc: Unified Pre-training for Cross-Format Document Understanding
Jingye Chen∗, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei
Microsoft Corporation
{v-jingyechen,tengchaolv,lecu,chazhang,fuwei}@microsoft.com
∗ Work done during internship at Microsoft Research Asia.
Abstract
The recent surge of pre-training techniques has driven the rapid development of document understanding. The pre-training and fine-tuning framework has been effectively used to tackle texts in various formats, including plain texts, document texts, and web texts. Despite achieving promising performance, existing pre-trained models usually target one specific document format at a time, making it difficult to combine knowledge from multiple document formats. To address this, we propose XDoc, a unified pre-trained model that deals with different document formats in a single model. For parameter efficiency, we share backbone parameters across different formats, such as the word embedding layer and the Transformer layers. Meanwhile, we introduce adaptive layers with lightweight parameters to enhance the distinction across different formats. Experimental results demonstrate that with only 36.7% of the parameters, XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models, which is cost-effective for real-world deployment. The code and pre-trained models will be publicly available at https://aka.ms/xdoc.
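To make the parameter-sharing idea above concrete, the following is a minimal PyTorch-style sketch (not the released XDoc implementation): one word-embedding layer and one Transformer encoder are shared by all document formats, while each format contributes only a small adaptive embedding of its own (e.g., 2D layout coordinates for document texts). All class names, sizes, and the specific adapter design here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    """Word embedding, 1D position embedding, and Transformer layers shared by all formats."""
    def __init__(self, vocab_size=30522, hidden=768, num_layers=12, num_heads=12):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(512, hidden)  # shared 1D positions
        layer = nn.TransformerEncoderLayer(hidden, num_heads, 4 * hidden, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, input_ids, extra_emb=None):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.word_emb(input_ids) + self.pos_emb(positions)
        if extra_emb is not None:  # format-specific signal from a lightweight adapter
            x = x + extra_emb
        return self.encoder(x)

class Layout2DAdapter(nn.Module):
    """Lightweight format-specific module, e.g. 2D bounding-box embeddings for document texts."""
    def __init__(self, hidden=768, max_coord=1024):
        super().__init__()
        self.x_emb = nn.Embedding(max_coord, hidden)
        self.y_emb = nn.Embedding(max_coord, hidden)

    def forward(self, boxes):  # boxes: (batch, seq_len, 4) holding x0, y0, x1, y1
        return (self.x_emb(boxes[..., 0]) + self.y_emb(boxes[..., 1]) +
                self.x_emb(boxes[..., 2]) + self.y_emb(boxes[..., 3]))

backbone = SharedBackbone()      # the bulk of the parameters, shared across formats
doc_adapter = Layout2DAdapter()  # only a small fraction of extra parameters per format
input_ids = torch.randint(0, 30522, (2, 16))
boxes = torch.randint(0, 1024, (2, 16, 4))
hidden_states = backbone(input_ids, extra_emb=doc_adapter(boxes))
print(hidden_states.shape)       # torch.Size([2, 16, 768])
```

Because the shared encoder holds the vast majority of the parameters, supporting an additional format in such a design costs only a small adapter rather than a full separate pre-trained model.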
1 Introduction
Figure 1: Pre-trained models for different document formats. Most of the structures are similar (word embedding, 1D position embedding, and Transformer layers), while only a small proportion of the structures (2D position and XPath embeddings) are different.

Document understanding has undoubtedly been an important research topic, as documents play an essential role in message delivery in our daily lives (Cui et al., 2021). Over the past several years, the rapid progress of deep learning has driven the development of document understanding for texts in various formats, ranging from plain texts (Devlin et al., 2018; Liu et al., 2019; Dong et al., 2019) and document texts (Xu et al., 2020, 2021a; Huang et al., 2022) to web texts (Chen et al., 2021; Li et al., 2022a; Wang et al., 2022b). Recently, pre-training techniques have become the de facto standard for document understanding, where the model is first pre-trained in a self-supervised manner (e.g., using masked language modeling as the pretext task (Devlin et al., 2018)) on a large-scale corpus and then fine-tuned on a series of downstream tasks such as question answering (Rajpurkar et al., 2016; Mathew et al., 2021), key information extraction (Jaume et al., 2019; Xu et al., 2022), and many others. Albeit achieving impressive performance on specific tasks, existing pre-trained models are far from flexible, as they can only tackle texts in a single format (e.g., LayoutLM (Xu et al., 2020) is designed for document texts and is not suitable for web texts). This makes it difficult to combine knowledge from multiple document formats. Meanwhile, the number of pre-trained models will keep increasing if more formats (e.g., Word and PowerPoint) are further studied in academia.
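To illustrate the pretext task mentioned above, the snippet below sketches BERT-style masked language modeling: a fraction of input tokens is selected, most of the selected positions are replaced with a [MASK] token, and the model is trained to recover the original tokens. This is a generic illustration, not the exact masking recipe or hyperparameters used by XDoc; the token ids, vocabulary size, and 15% masking probability are assumptions.

```python
import torch

def mask_tokens(input_ids, mask_token_id=103, vocab_size=30522, mlm_prob=0.15):
    """BERT-style masking: select ~15% of positions as prediction targets,
    replace 80% of them with [MASK], 10% with a random token, 10% left unchanged."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100  # non-target positions are ignored by the loss

    replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replaced] = mask_token_id

    randomized = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replaced
    input_ids[randomized] = torch.randint(vocab_size, (int(randomized.sum()),))
    return input_ids, labels

# Example: corrupt a toy batch; a model would then predict `labels` at the masked positions,
# e.g. loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
ids = torch.randint(0, 30522, (2, 16))
corrupted, labels = mask_tokens(ids)
```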
Among different pre-trained models for document understanding, it is observed that many pre-