Contrastive Training Improves Zero-Shot Classification of
Semi-structured Documents
Muhammad Khalifa†, Yogarshi Vyas‡, Shuai Wang‡,
Graham Horwood‡, Sunil Mallya§*, Miguel Ballesteros‡
†University of Michigan, ‡AWS AI Labs, §Flip.ai
khalifam@umich.edu,
{yogarshi,wshui,graham.horwood,ballemig}@amazon.com
Abstract
We investigate semi-structured document classification in a zero-shot setting. Classification of semi-structured documents is more challenging than that of standard unstructured documents, as positional, layout, and style information play a vital role in interpreting such documents. The standard classification setting, where categories are fixed during both training and testing, falls short in dynamic environments where new document categories could potentially emerge. We focus exclusively on the zero-shot setting where inference is done on new, unseen classes. To address this task, we propose a matching-based approach that relies on a pairwise contrastive objective for both pretraining and fine-tuning. Our results show a significant boost in Macro F1 from the proposed pretraining step in both supervised and unsupervised zero-shot settings.
1 Introduction
Textual information assumes many forms ranging from unstructured (e.g., text messages) to semi-structured (e.g., forms, invoices, letters), all the way to fully structured (e.g., databases or spreadsheets). Our focus in this work is the classification of semi-structured documents. A semi-structured document consists of information that is organized using a regular visual layout and includes tables, forms, multi-columns, and (nested) bulleted lists, and that is either understandable only in the context of its visual layout or that requires substantially more work to understand without the visual layout. Automatic processing of semi-structured documents comes with a unique set of challenges including a non-linear text flow (Wang et al., 2021), layout inconsistencies, and low-accuracy optical character recognition. Prior work has shown that integrating the two-dimensional layout information of such documents is critical in models for analyzing such documents (Xu et al., 2020, 2021; Huang et al., 2022; Appalaraju et al., 2021). Due to these challenges, methods for unstructured document classification, such as static word vectors (Socher et al., 2013) and standard pretrained language models (Devlin et al., 2019; Reimers and Gurevych, 2019; Liu et al., 2019), perform poorly with semi-structured inputs as they model text in a one-dimensional space and ignore information about document layout and style (Xu et al., 2020).

* Work done while at AWS AI Labs.
Past work on semi-structured document classification (Harley et al., 2015; Iwana et al., 2016; Tensmeyer and Martinez, 2017; Xu et al., 2020, 2021) has focused exclusively on the full-shot setting, where the target classes are fixed and identical across training and inference, neglecting the zero-shot setting (Xian et al., 2018), which requires generalization to unseen classes during inference.
Our work addresses zero-shot classification of semi-structured documents in English using the matching framework, which has been used for many tasks on unstructured text (Dauphin et al., 2014; Nam et al., 2016; Pappas and Henderson, 2019; Vyas and Ballesteros, 2021; Ma et al., 2022). Under this framework, a matching (similarity) metric between documents and their assigned classes is maximized in a joint embedding space. We extend this matching framework with two enhancements. First, we use a pairwise contrastive objective (Rethmeier and Augenstein, 2020; Radford et al., 2021; Gunel et al., 2021) that increases the similarity between documents and their ground-truth labels, and decreases it for incorrect pairs of documents and labels. We augment the textual representations of documents with layout features representing the positions of tokens on the page to capture the two-dimensional nature of the documents. Second, we propose an unsupervised contrastive pretraining procedure to warm up the representations of documents and classes. In summary, (i) we study the zero-shot classification of semi-structured documents, which, to the best of our knowledge, has not been explored before; (ii) we use a pairwise contrastive objective to both pretrain and fine-tune a matching model for the task. This technique uses a layout-aware document encoder and a regular text encoder to maximize the similarity between documents and their ground-truth labels; (iii) using this contrastive objective, we propose an unsupervised pretraining step with pseudo-labels (Rethmeier and Augenstein, 2020) to initialize document and label encoders. The proposed pretraining step improves F1 scores by 9 and 19 points in supervised and unsupervised zero-shot settings respectively, compared to a setup without this pretraining.
2 Approach
This section describes our proposed architecture (§ 2.1), pretrained model (§ 2.2), as well as the contrastive objective used for pretraining (§ 2.3) and fine-tuning (§ 2.4).
2.1 Model
Our goal is to learn a matching function between documents and labels such that the similarity between a document and its gold label is maximized compared to other labels, which can be seen as an instance of metric learning (Xing et al., 2002; Kulis et al., 2012; Sohn, 2016). This requires encoding documents and class names¹ into a joint document-label space (Ba et al., 2015; Zhou et al., 2019; Chen et al., 2020; Hou et al., 2020). In this work, documents and class names are of a different nature: documents are semi-structured (§ 1), while class names are one- or two-word fragments of text. We use two encoders to account for this difference: a document encoder Φdoc suitable for semi-structured documents, and a label (class) encoder Φlabel suitable for the natural language representations of the class labels. Φlabel is simply a vanilla pretrained BERT-Base model (Devlin et al., 2019). Φdoc, as in prior work (Xu et al., 2020; Lockard et al., 2020), is a pretrained language model that encodes the text and the layout of the document using the coordinates of each token. The next section explains this model, LayoutBERT, in detail. We choose this model for its simplicity, but our proposed approach can be combined with more sophisticated document encoders that incorporate layout and visual information in different ways (Huang et al., 2022; Xu et al., 2021; Appalaraju et al., 2021).

¹ We use class names as the natural language representation of a class, but more descriptive representations can be used if available (e.g., dictionary definitions) (Logeswaran et al., 2019).

Figure 1: The unsupervised contrastive pretraining procedure. A random block of tokens from a document is used as the pseudo-label for that document. Dot products between documents and their labels are maximized and all other pairwise dot products are minimized. [Figure shows a batch of documents and their pseudo-labels passed through the document and label encoders; the resulting similarity matrix has entries Mij = Φlabel(li) · Φdoc(dj).]
2.2 LayoutBERT
LayoutBERT is a 6-layer Transformer based on BERT-Base (Devlin et al., 2019) and is pretrained using masked language modeling on a large collection of semi-structured documents (§ 3). Unlike prior work, LayoutBERT has a simpler architecture that decreases model footprint while maintaining accuracy. Specifically, there are three main architectural differences between LayoutBERT and LayoutLM, which is the most comparable architecture in the literature (Xu et al., 2020): (a) LayoutLM uses 12 transformer layers while LayoutBERT uses only 6 layers; (b) LayoutLM uses four positions per token, namely the upper-left and bottom-right coordinates, while LayoutBERT uses only two positions, viz. the centroid of the token bounding box; (c) unlike LayoutLM, LayoutBERT does not use an image encoder to obtain CNN-based visual features.²
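Difference (b) can be illustrated with the following sketch of a layout-aware input embedding, in which each token embedding is combined with embeddings of its centroid coordinates. The vocabulary size, hidden size, and the [0, 1000] coordinate bucketing follow common LayoutLM-style conventions and are assumptions rather than the exact LayoutBERT configuration.

```python
import torch
import torch.nn as nn

class CentroidLayoutEmbedding(nn.Module):
    """Token embeddings combined with embeddings of each token's centroid (x, y).

    Sizes and bucketing are illustrative, not the exact LayoutBERT configuration.
    """

    def __init__(self, vocab_size: int = 30522, hidden: int = 768, num_buckets: int = 1001):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.x_pos = nn.Embedding(num_buckets, hidden)  # centroid x, bucketed to [0, 1000]
        self.y_pos = nn.Embedding(num_buckets, hidden)  # centroid y, bucketed to [0, 1000]

    def forward(self, token_ids: torch.Tensor, cx: torch.Tensor, cy: torch.Tensor) -> torch.Tensor:
        # token_ids, cx, cy: (batch, seq_len) integer tensors; output: (batch, seq_len, hidden)
        return self.tok(token_ids) + self.x_pos(cx) + self.y_pos(cy)
```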
2.3 Contrastive Layout Pretraining
Φlabel and Φdoc are models that have been pretrained independently. To encourage these models to produce similar representations for documents and their labels, we continue pretraining Φlabel and Φdoc via an unsupervised procedure based on a pairwise contrastive objective. The unsupervised objective can learn from large amounts of unlabeled semi-structured documents. This also allows us to directly use the pretrained encoders in an unsupervised zero-shot setting (§ 3.3.1).
² The results in Xu et al. (2020) show that image features are not always useful. To keep things simple, we do not include the CNN component in our model.
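The following minimal sketch illustrates this pretraining step as depicted in Figure 1, reusing the pairwise_contrastive_loss sketch from § 1: a random block of tokens serves as a document's pseudo-label, and documents are contrasted against the pseudo-labels of the whole batch. The block length, batch format, and encoder interfaces are illustrative assumptions.

```python
import random
import torch

def make_pseudo_label(tokens, block_len: int = 10):
    """Sample a random contiguous block of tokens as the document's pseudo-label
    (the block length here is an illustrative choice)."""
    start = random.randint(0, max(0, len(tokens) - block_len))
    return tokens[start:start + block_len]

def unsupervised_pretraining_loss(batch_docs, phi_doc, phi_label):
    """One contrastive pretraining step as in Figure 1: each document is paired with
    its own pseudo-label; every other document/pseudo-label pair in the batch is a
    negative. batch_docs is assumed to be a list of dicts with a "tokens" field;
    phi_doc and phi_label stand in for the two encoders and return one embedding
    vector per input.
    """
    pseudo_labels = [make_pseudo_label(doc["tokens"]) for doc in batch_docs]
    doc_emb = torch.stack([phi_doc(doc) for doc in batch_docs])
    label_emb = torch.stack([phi_label(" ".join(p)) for p in pseudo_labels])
    return pairwise_contrastive_loss(doc_emb, label_emb)  # from the sketch in Section 1
```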