Contrastive Training Improves Zero-Shot Classification of
Semi-structured Documents
Muhammad Khalifa†, Yogarshi Vyas‡, Shuai Wang‡,
Graham Horwood‡, Sunil Mallya§*, Miguel Ballesteros‡
†University of Michigan, ‡AWS AI Labs, §Flip.ai
khalifam@umich.edu,
{yogarshi,wshui,graham.horwood,ballemig}@amazon.com
Abstract
We investigate semi-structured document classification in a zero-shot setting. Classification of semi-structured documents is more challenging than that of standard unstructured documents, as positional, layout, and style information play a vital role in interpreting such documents. The standard classification setting, where categories are fixed during both training and testing, falls short in dynamic environments where new document categories could potentially emerge. We focus exclusively on the zero-shot setting where inference is done on new, unseen classes. To address this task, we propose a matching-based approach that relies on a pairwise contrastive objective for both pretraining and fine-tuning. Our results show a significant boost in Macro F1 from the proposed pretraining step in both supervised and unsupervised zero-shot settings.
1 Introduction
Textual information assumes many forms ranging from unstructured (e.g., text messages) to semi-structured (e.g., forms, invoices, letters), all the way to fully structured (e.g., databases or spreadsheets). Our focus in this work is the classification of semi-structured documents. A semi-structured document consists of information that is organized using a regular visual layout and includes tables, forms, multi-columns, and (nested) bulleted lists, and that is either understandable only in the context of its visual layout or that requires substantially more work to understand without the visual layout. Automatic processing of semi-structured documents comes with a unique set of challenges including a non-linear text flow (Wang et al., 2021), layout inconsistencies, and low-accuracy optical character recognition. Prior work has shown that integrating the two-dimensional layout information of such documents is critical in models for analyzing such documents (Xu et al., 2020, 2021; Huang et al., 2022; Appalaraju et al., 2021). Due to these challenges, methods for unstructured document classification, such as static word vectors (Socher et al., 2013) and standard pretrained language models (Devlin et al., 2019; Reimers and Gurevych, 2019; Liu et al., 2019), perform poorly with semi-structured inputs as they model text in a one-dimensional space and ignore information about document layout and style (Xu et al., 2020).

* Work done while at AWS AI Labs.
Past work on semi-structured document classification (Harley et al., 2015; Iwana et al., 2016; Tensmeyer and Martinez, 2017; Xu et al., 2020, 2021) has focused exclusively on the full-shot setting, where the target classes are fixed and identical across training and inference, neglecting the zero-shot setting (Xian et al., 2018), which requires generalization to unseen classes during inference.
Our work addresses zero-shot classification of semi-structured documents in English using the matching framework, which has been used for many tasks on unstructured text (Dauphin et al., 2014; Nam et al., 2016; Pappas and Henderson, 2019; Vyas and Ballesteros, 2021; Ma et al., 2022). Under this framework, a matching (similarity) metric between documents and their assigned classes is maximized in a joint embedding space. We extend this matching framework with two enhancements. First, we use a pairwise contrastive objective (Rethmeier and Augenstein, 2020; Radford et al., 2021; Gunel et al., 2021) that increases the similarity between documents and their ground-truth labels, and decreases it for incorrect pairs of documents and labels. We augment the textual representations of documents with layout features representing the positions of tokens on the page to capture the two-dimensional nature of the documents. Second, we propose an unsupervised contrastive pretraining procedure to warm up the representations of documents and classes. In summary, (i) we study the zero-shot classification of semi-structured documents, which, to the best of our knowledge, has not been explored before; (ii) we use a pairwise contrastive objective to both pretrain and fine-tune a matching model for the task. This technique uses a layout-aware document encoder and a regular text encoder to maximize the similarity between documents and their ground-truth labels; (iii) using this contrastive objective, we propose an unsupervised pretraining step with pseudo-labels (Rethmeier and Augenstein, 2020) to initialize document and label encoders. The proposed pretraining step improves F1 scores by 9 and 19 points in supervised and unsupervised zero-shot settings respectively, compared to a setup without this pretraining.
2 Approach
This section describes our proposed architecture (§ 2.1), pretrained model (§ 2.2), as well as the contrastive objective used for pretraining (§ 2.3) and fine-tuning (§ 2.4).
2.1 Model
Our goal is to learn a matching function between documents and labels such that the similarity between a document and its gold label is maximized compared to other labels, which can be seen as an instance of metric learning (Xing et al., 2002; Kulis et al., 2012; Sohn, 2016). This requires encoding documents and class names¹ into a joint document-label space (Ba et al., 2015; Zhou et al., 2019; Chen et al., 2020; Hou et al., 2020). In this work, documents and class names are of a different nature: documents are semi-structured (§ 1), while class names are one- or two-word fragments of text. We use two encoders to account for this difference: a document encoder Φdoc suitable for semi-structured documents, and a label (class) encoder Φlabel suitable for the natural language representations of the class labels. Φlabel is simply a vanilla pretrained BERT-Base model (Devlin et al., 2019). Φdoc, as in prior work (Xu et al., 2020; Lockard et al., 2020), is a pretrained language model that encodes the text and the layout of the document using the coordinates of each token. The next section explains this model, LayoutBERT, in detail. We choose this model for its simplicity, but our proposed approach can be combined with more sophisticated document encoders that incorporate layout and visual information in different ways (Huang et al., 2022; Xu et al., 2021; Appalaraju et al., 2021).

¹ We use class names as the natural language representation of a class, but more descriptive representations can be used if available (e.g., dictionary definitions) (Logeswaran et al., 2019).

Figure 1: The unsupervised contrastive pretraining procedure. A random block of tokens from a document is used as the pseudo-label for that document. Dot products between documents and their labels are maximized and all other pairwise dot products are minimized. [Figure shows a batch of documents and their pseudo-labels passed through the document and label encoders; the resulting similarity matrix has entries Mij = Φlabel(li) · Φdoc(dj).]
2.2 LayoutBERT
LayoutBERT is a 6-layer Transformer based on BERT-Base (Devlin et al., 2019) and is pretrained using masked language modeling on a large collection of semi-structured documents (§ 3). Unlike prior work, LayoutBERT has a simpler architecture that decreases model footprint while maintaining accuracy. Specifically, there are three main architectural differences between LayoutBERT and LayoutLM, which is the most comparable architecture in the literature (Xu et al., 2020): (a) LayoutLM uses 12 transformer layers while LayoutBERT uses only 6 layers; (b) LayoutLM uses four positions per token, namely the upper-left and bottom-right coordinates, while LayoutBERT uses only two positions, viz. the centroid of the token bounding box; (c) unlike LayoutLM, LayoutBERT does not use an image encoder to obtain CNN-based visual features.²
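Difference (b) can be illustrated with the following sketch of a layout-aware input embedding, in which each token embedding is combined with embeddings of its centroid coordinates. The vocabulary size, hidden size, and the [0, 1000] coordinate bucketing follow common LayoutLM-style conventions and are assumptions rather than the exact LayoutBERT configuration.

```python
import torch
import torch.nn as nn

class CentroidLayoutEmbedding(nn.Module):
    """Token embeddings combined with embeddings of each token's centroid (x, y).

    Sizes and bucketing are illustrative, not the exact LayoutBERT configuration.
    """

    def __init__(self, vocab_size: int = 30522, hidden: int = 768, num_buckets: int = 1001):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.x_pos = nn.Embedding(num_buckets, hidden)  # centroid x, bucketed to [0, 1000]
        self.y_pos = nn.Embedding(num_buckets, hidden)  # centroid y, bucketed to [0, 1000]

    def forward(self, token_ids: torch.Tensor, cx: torch.Tensor, cy: torch.Tensor) -> torch.Tensor:
        # token_ids, cx, cy: (batch, seq_len) integer tensors; output: (batch, seq_len, hidden)
        return self.tok(token_ids) + self.x_pos(cx) + self.y_pos(cy)
```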
2.3 Contrastive Layout Pretraining
Φlabel and Φdoc are models that have been pretrained independently. To encourage these models to produce similar representations for documents and their labels, we continue pretraining Φlabel and Φdoc via an unsupervised procedure based on a pairwise contrastive objective. The unsupervised objective can learn from large amounts of unlabeled semi-structured documents. This also allows us to directly use the pretrained encoders in an unsupervised zero-shot setting (§ 3.3.1).
² The results in Xu et al. (2020) show that image features are not always useful. To keep things simple, we do not include the CNN component in our model.
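The following minimal sketch illustrates this pretraining step as depicted in Figure 1, reusing the pairwise_contrastive_loss sketch from § 1: a random block of tokens serves as a document's pseudo-label, and documents are contrasted against the pseudo-labels of the whole batch. The block length, batch format, and encoder interfaces are illustrative assumptions.

```python
import random
import torch

def make_pseudo_label(tokens, block_len: int = 10):
    """Sample a random contiguous block of tokens as the document's pseudo-label
    (the block length here is an illustrative choice)."""
    start = random.randint(0, max(0, len(tokens) - block_len))
    return tokens[start:start + block_len]

def unsupervised_pretraining_loss(batch_docs, phi_doc, phi_label):
    """One contrastive pretraining step as in Figure 1: each document is paired with
    its own pseudo-label; every other document/pseudo-label pair in the batch is a
    negative. batch_docs is assumed to be a list of dicts with a "tokens" field;
    phi_doc and phi_label stand in for the two encoders and return one embedding
    vector per input.
    """
    pseudo_labels = [make_pseudo_label(doc["tokens"]) for doc in batch_docs]
    doc_emb = torch.stack([phi_doc(doc) for doc in batch_docs])
    label_emb = torch.stack([phi_label(" ".join(p)) for p in pseudo_labels])
    return pairwise_contrastive_loss(doc_emb, label_emb)  # from the sketch in Section 1
```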