Pix2Struct: Screenshot Parsing as Pretraining for
Visual Language Understanding
Kenton Lee *1  Mandar Joshi *1  Iulia Turc 2  Hexiang Hu 1  Fangyu Liu 3  Julian Eisenschlos 1
Urvashi Khandelwal 1  Peter Shaw 1  Ming-Wei Chang 1  Kristina Toutanova 1
Abstract
Visually-situated language is ubiquitous—
sources range from textbooks with diagrams to
web pages with images and tables, to mobile
apps with buttons and forms. Perhaps due to
this diversity, previous work has typically relied
on domain-specific recipes with limited sharing
of the underlying data, model architectures,
and objectives. We present Pix2Struct,
a pretrained image-to-text model for purely
visual language understanding, which can be
finetuned on tasks containing visually-situated
language. Pix2Struct is pretrained by
learning to parse masked screenshots of web
pages into simplified HTML. The web, with its
richness of visual elements cleanly reflected in
the HTML structure, provides a large source
of pretraining data well suited to the diversity
of downstream tasks. Intuitively, this objective
subsumes common pretraining signals such as
OCR, language modeling, and image captioning.
In addition to the novel pretraining strategy,
we introduce a variable-resolution input rep-
resentation and a more flexible integration of
language and vision inputs, where language
prompts such as questions are rendered directly
on top of the input image. For the first time, we
show that a single pretrained model can achieve
state-of-the-art results in six out of nine tasks
across four domains: documents, illustrations,
user interfaces, and natural images.
*Equal contribution 1Google Research 2succinctly.ai 3University of Cambridge. Correspondence to: Kenton Lee <kentonl@google.com>, Mandar Joshi <mandarj@google.com>.
Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
1. Introduction
Research on the interaction between language and vision
has traditionally focused on tasks where images and text
can be separated into distinct channels, e.g. visual question
answering or image captioning. However, visually-situated
language is a far more pervasive way in which these modal-
ities interact and blend together. For example, documents,
tables, infographics, and user interfaces (UIs) are intended
to be consumed holistically, without clear boundaries be-
tween textual and visual elements (Figure 1). Comprehen-
sive understanding of this information requires a deep set
of skills, including the ability to recognize text, understand
language, and incorporate diverse visual context.
Previous work on understanding visually-situated language
is scattered. The focus is typically on complex task-specific
combinations of available inputs and tools. For example,
document-understanding models (Huang et al., 2022) rely
on external OCR systems, UI-understanding models rely
on platform-specific metadata (e.g. Android view hierar-
chy) (Bai et al., 2021), and diagram-understanding models
rely on diagram parses (Kembhavi et al., 2016). Domain-
specific engineering can be effective for high-resource set-
tings such as documents, where there is an abundance of
tools and data available. However, these pipelined models
lack sharing of the underlying data, model architectures,
and objectives across domains, limiting their general appli-
cability. Moreover, relying on external systems like OCR
increases engineering complexity, limits adaptability, and
can increase overall computational cost. Recent work on
OCR-free, end-to-end document understanding from im-
ages (Kim et al., 2022; Davis et al., 2022) has attempted
to remove such task-specific engineering and reliance on
external components during inference by learning to de-
code OCR outputs during pretraining—a significant step
towards more general-purpose models. However, the focus
on text at the surface level limits the depth of knowledge
transferred from unsupervised data.
Figure 1: Examples of visually-situated language understanding tasks, including diagram QA (AI2D), app captioning
(Screen2Words), and document QA (DocVQA). We also include an example of our proposed pretraining task (screenshot
parsing) on the left. Pix2Struct encodes the pixels from the input image (above) and decodes the output text (below).
We present Pix2Struct (pretrained checkpoints and code: https://github.com/google-research/pix2struct), a pretrained model that combines the simplicity of purely pixel-level inputs with the
generality and scalability provided by self-supervised pre-
training from diverse and abundant web data. Specifically,
we propose a screenshot parsing objective that requires
predicting an HTML-based parse from a masked screen-
shot of a web page. HTML provides clean signals about
text, images, and layouts, while the masked inputs encour-
age joint reasoning about their co-occurrence. With the di-
versity and complexity of textual and visual elements found
on the web, Pix2Struct learns rich representations of
the underlying structure of web pages, which we show can
effectively transfer to a variety of downstream visual lan-
guage understanding tasks.
A key ingredient which enables this transfer is process-
ing inputs visually and holistically as they are intended
for human readers. We introduce variable-resolution in-
puts for vision transformers (ViT) that prevent distortion
of the original aspect ratio, which can vary greatly across
documents, figures, and UIs. During finetuning, we render
other inputs (e.g., questions in VQA and bounding boxes
in UI tasks) onto the image input for the task. In effect, we
consume all our inputs through a single modality, simplify-
ing the modality combination problem in previous work.
We train two variants with 282M and 1.3B pa-
rameters, which we refer to as Pix2Struct-Base
and Pix2Struct-Large respectively, on 80M screen-
shots of web pages collected from the URLs in the C4 corpus (Raffel et al., 2020). (We do not use the released text in C4; the web page content and screenshots were crawled directly from the URLs.) Experiments on four domains
and nine tasks show that our finetuned models strongly outperform Donut (ranging from 9 to 53 points), the strongest existing baseline without pipelines. Compared with models with domain-specific pipelines, we lag behind the state
of the art in high-resource domains such as documents and
natural images but observe significant improvements (rang-
ing from 1 to 44 points) in low-resource domains such as
illustrations and UIs. We hope these results encourage the
community to continue developing such general-purpose
methods and further enable new applications in this cur-
rently fragmented intersection of language and vision.
To summarize, our major contributions are as follows:
- We introduce the area of general-purpose visually-situated language understanding, which consists of diverse tasks that share common challenges.
- We propose a screenshot parsing pretraining objective based on the HTML source of web pages. Our objective is shown to be more effective than prior attempts to enable the elegant pixel-to-text design for general-purpose visually-situated language understanding.
- We introduce variable-resolution input representations to ViT and new finetuning strategies that seamlessly integrate language and vision inputs by directly rendering any text prompts on top of the input image.
2. Method
2.1. Background
Prior attempts at pixel-only modeling of visually situated
language have largely focused on documents and natural
images. For documents, Donut (Kim et al., 2022) and
Dessurt (Davis et al., 2022) combine pretraining objectives
based on surface-level features from synthetic images or
predicted OCR outputs. For natural images, recent work—
GIT2 (Wang et al., 2022a) and PaLI (Chen et al., 2022c)—
focuses on collecting and training on large-scale image captioning data that transfers well to datasets with natural images (e.g. TextCaps).

Figure 2: Comparison of our variable-resolution inputs and the typical fixed-resolution input. We illustrate the preprocessing for a target sequence length of 36 patches for both inputs.
We aim to provide a single pretrained model that can be
finetuned on a wider variety of tasks and domains. The in-
put to our model is an image in the form of raw pixels only,
and the output is text in the form of token sequences, sim-
ilar to Donut. The goal is a visual analog of models like
T5 (Raffel et al., 2020), where the generality of simple in-
puts and outputs is combined with the power of pretraining
on large unsupervised sources of data. During finetuning,
the complexity of adapting to diverse downstream tasks re-
sides only in data preprocessing.
Even without visual context, pixel-only language model-
ing for text has only recently been attempted (Rust et al.,
2022)—perhaps because it requires solving multiple hard
sub-problems. First, the ability to read with high fidelity
while also building rich high-level representations poses
a difficult optimization problem. Second, encoding text-
heavy inputs (e.g. long documents) involves processing
high-resolution images with variable aspect ratios. State-
of-the-art document understanding models (Huang et al.,
2022) therefore rely on the combination of (possibly noisy)
OCR outputs with low resolution images.
We show the components of Pix2Struct that address
these challenges. Section 2.2 discusses modifications to the
transformer inputs to handle variable aspect ratios and reso-
lutions. Section 2.3 details our proposed screenshot parsing
objective and Section 2.4 describes curriculum learning for
more robust transfer learning. Finally, Section 2.5 shows
how Pix2Struct consumes textual and visual inputs for
downstream tasks (e.g. questions and images) in the same
space by rendering text inputs onto images.
2.2. Architecture
Pix2Struct is an image-encoder-text-decoder based
on ViT (Dosovitskiy et al., 2021). While the bulk
of the model is fairly standard, we propose one small
but impactful change to the input representation to
make Pix2Struct more robust to various forms of
visually-situated language. Before extracting fixed-size
patches, the standard ViT scales the input images to a pre-
defined resolution, which creates two undesirable effects:
(1) rescaling the image distorts the true aspect ratio, which
can be highly variable for documents, mobile UIs, and figures; and (2) transferring these models to downstream tasks
with higher resolution is non-trivial (Touvron et al., 2019;
Wang et al., 2021b), since the model only observes one spe-
cific resolution during pretraining.
We instead propose to always scale our input image up or
down such that we extract the maximal number of fixed-
size patches that fit within the given sequence length (Fig-
ure 2). In order for the model to handle variable resolutions
unambiguously, we use 2-dimensional absolute positional
embeddings for the input patches. Together these changes
to the standard ViT inputs provide two major advantages in
terms of robustness to: (1) extreme aspect ratios, which are common in the domains that we experiment with, and (2)
on-the-fly changes to the sequence length and resolution.
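A minimal sketch of this input pipeline, assuming 16x16 patches and using NumPy and PIL (the function and variable names are ours, not the released code): the image is rescaled while preserving its aspect ratio so that the resulting patch grid fits a sequence-length budget, and each flattened patch is paired with its (row, column) position.

# Sketch of aspect-preserving rescaling for variable-resolution ViT inputs.
# Assumes a 16x16 patch size and a PIL image; helper names are illustrative only.
import math
import numpy as np
from PIL import Image

def extract_variable_resolution_patches(image: Image.Image,
                                        max_patches: int = 2048,
                                        patch_size: int = 16):
    w, h = image.size
    # Choose a scale s so that (s*h/patch) * (s*w/patch) <= max_patches,
    # i.e. the largest number of fixed-size patches that fits the budget.
    scale = math.sqrt(max_patches * (patch_size / h) * (patch_size / w))
    rows = max(1, math.floor(scale * h / patch_size))
    cols = max(1, math.floor(scale * w / patch_size))
    resized = image.convert("RGB").resize((cols * patch_size, rows * patch_size))

    pixels = np.asarray(resized, dtype=np.float32) / 255.0          # (H, W, 3)
    patches = pixels.reshape(rows, patch_size, cols, patch_size, 3)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(rows * cols, -1)

    # 2-dimensional absolute positions (row, col) accompany each flattened patch
    # so the model can handle arbitrary resolutions and aspect ratios unambiguously.
    row_ids, col_ids = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    positions = np.stack([row_ids.reshape(-1), col_ids.reshape(-1)], axis=-1)
    return patches, positions

In practice the flattened patches and their (row, column) indices would be fed to the ViT encoder, whose 2-dimensional absolute positional embeddings are looked up from those indices.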
2.3. Pretraining
The goal of pretraining is for Pix2Struct to represent
the underlying structure of the input image. To that end, we
create self-supervised pairs of input images and target text
from web pages. For each page in the pretraining corpus,
we start by collecting its HTML source and a screenshot
using a viewport of 1024 x 1024.
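As a rough sketch of this collection step (Playwright is our tooling choice here; the paper does not specify its crawling stack), a URL can be rendered at a 1024 x 1024 viewport to obtain both the screenshot and the paired HTML source:

# Sketch: grab a screenshot and HTML source for one URL with a 1024x1024 viewport.
# Error handling and crawling politeness (robots.txt, rate limits) are omitted.
from playwright.sync_api import sync_playwright

def collect_page(url: str, out_prefix: str):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1024, "height": 1024})
        page.goto(url, wait_until="load")
        page.screenshot(path=f"{out_prefix}.png")   # input image
        html = page.content()                       # paired HTML source
        browser.close()
    with open(f"{out_prefix}.html", "w", encoding="utf-8") as f:
        f.write(html)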
Screenshot parsing inputs & outputs The screenshot
and HTML are modified to ensure rich and dense learning
signal during pretraining. These modifications provide a
reasonable trade-off between preserving the semantics of
the page and requiring a practical decoder sequence length.
We condense the HTML DOM tree by (1) only keeping
nodes with visible elements or descendants with visible el-
ements and (2) if a node does not contain visible elements
and it only has a single child, replacing the singleton child
with any grandchildren to remove chained nesting. In each
node, we only use the text, along with filenames and alt-text
of images. Much more information could be retained (e.g.
element tags, style, titles and URLs) in future work. The
decoder sequence length is further reduced by finding the
largest linearized subtree that fits within a predefined se-
quence length. A bounding box indicating the region cov-
ered by the chosen subtree is also drawn on the screenshot.
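The condensing rules can be expressed as a small recursive pass over the DOM; the Node structure and its visible flag below are simplifications (in practice visibility comes from the rendered page), and the serialization only mimics the bracketed style of Figure 3:

# Sketch of the two condensing rules: (1) keep only nodes that are visible or have
# visible descendants, and (2) collapse chains of invisible single-child nodes.
# Only text and image filename/alt-text are kept per node, as described above.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    text: str = ""            # visible text, or "img_src=... img_alt=..." for images
    visible: bool = False
    children: List["Node"] = field(default_factory=list)

def condense(node: Node) -> Optional[Node]:
    kept = [c for c in (condense(c) for c in node.children) if c is not None]
    if not node.visible and not kept:
        return None                      # rule (1): drop invisible subtrees
    if not node.visible and len(kept) == 1:
        return kept[0]                   # rule (2): collapse chained nesting
    return Node(node.text, node.visible, kept)

def linearize(node: Node) -> str:
    # Bracketed serialization in the style of Figure 3.
    inner = node.text + "".join(linearize(c) for c in node.children)
    return f"<{inner}>"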
For better context modeling, we introduce a BART-
like (Lewis et al., 2020) learning signal by masking 50%
of the text and decoding the entire subtree. The masked re-
gions are randomly sampled spans of text from the chosen
subtree where we render masks (Figure 3).

<<<Python>
<img_src=py_logo.png img_alt=Python>>
<<C++>
<img_src=cpp_logo.png img_alt=C++>>
<<Java>
<img_src=java_logo.png img_alt=Java>>
<Submit>>

Figure 3: Toy illustration of input-output pairs (right) sampled from the original web page (left).
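As a sketch of this masking step (assuming the renderer exposes bounding boxes for text spans; the sampling and drawing details are ours, not the authors' code), text regions are sampled until roughly 50% of the characters are covered and filled rectangles are drawn over them, while the decoding target remains the full parse:

# Sketch of rendering masks over ~50% of the text regions in a screenshot.
import random
from PIL import Image, ImageDraw

def mask_screenshot(screenshot: Image.Image, text_boxes, mask_ratio: float = 0.5):
    """text_boxes: list of (x0, y0, x1, y1, num_chars) for rendered text spans."""
    masked = screenshot.copy()
    draw = ImageDraw.Draw(masked)
    budget = mask_ratio * sum(b[4] for b in text_boxes)
    covered = 0
    for x0, y0, x1, y1, n in random.sample(text_boxes, len(text_boxes)):
        if covered >= budget:
            break
        draw.rectangle([x0, y0, x1, y1], fill=(127, 127, 127))  # grey mask patch
        covered += n
    return masked  # the target is still the full parse of the chosen subtree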
Comparison to existing pretraining strategies Our
proposed screenshot parsing seamlessly integrates signals
reminiscent of several well-known pretraining strategies:
- Recovering the unmasked parts of the parse is similar to OCR, a prerequisite skill for understanding language. OCR pretraining was proposed in Donut, which uses synthetic renderings or OCR outputs. In Figure 3, predicting <C++> exemplifies this learning signal.
- Recovering the masked parts of the parse is much like masked language modeling (Devlin et al., 2019). A major difference is that the visual context often provides additional powerful cues. In Figure 3, predicting <Python> exemplifies this signal.
- Recovering the alt-text from images is a common pretraining strategy for image captioning (Sharma et al., 2018; Wang et al., 2022a; Chen et al., 2022c). A major difference is that the model is permitted to use the web page as additional context. In Figure 3, predicting img_alt=C++ exemplifies this learning signal.
Appendix F contains more details including examples of
screenshots paired with their gold and predicted parses.
2.4. Warming up with a reading curriculum
While we can directly pretrain Pix2Struct on the
screenshot parsing task, we find that doing this naively
can result in instability and slow learning. However, if we
first expose the model to a short “warmup” stage of sim-
ply learning to read, we find a strong curriculum learning
effect where (1) pretraining is more stable and converges
faster, and (2) we observe better finetuning performance,
as discussed in Section 5. We create images of text snip-
pets with random colors and fonts. The model is simply
trained to decode the original text (see Appendix E for ex-
amples). This type of curriculum learning was also used
in Dessurt (Davis et al., 2022) and can also be viewed as a
simplified version of Donut’s pretraining.
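A minimal sketch of generating one warmup example with PIL, assuming a local list of TrueType fonts (the font paths, sizes, and colors here are placeholders rather than the settings used in the paper):

# Sketch of a "learn to read" warmup example: render a snippet with a random
# font and color, and use the raw text as the decoding target.
import random
from PIL import Image, ImageDraw, ImageFont

FONTS = ["/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"]  # placeholder list

def render_warmup_example(text: str, width: int = 640, height: int = 160):
    image = Image.new("RGB", (width, height), color="white")
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype(random.choice(FONTS), size=random.randint(16, 32))
    color = tuple(random.randint(0, 180) for _ in range(3))  # keep text legible
    draw.text((10, 10), text, font=font, fill=color)
    return image, text  # (input pixels, decoding target)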
2.5. Finetuning
Finetuning Pix2Struct is straightforward and largely a
matter of preprocessing the downstream data to unambigu-
ously reflect the task in the image inputs and text outputs,
analogous to the way T5 (Raffel et al., 2020) is used for
text-based tasks. In this section, we cover the preprocess-
ing strategies for the tasks described in Table 4. Examples
of this preprocessing are shown in Figure 1.
Captioning is the most straightforward, since the input im-
age and the output text can be directly used (as in TextCaps,
Screen2Words). In the case where the focus of the caption
is a specific bounding box (as in Widget Captioning), we
draw the target bounding box on the image itself.
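For instance, a sketch of this preprocessing step (the box coordinates and styling are illustrative):

# Sketch: mark the widget of interest directly on the screenshot so the model
# sees the target region through the visual channel alone.
from PIL import Image, ImageDraw

def draw_target_box(screenshot: Image.Image, box):  # box = (x0, y0, x1, y1)
    image = screenshot.copy()
    ImageDraw.Draw(image).rectangle(box, outline="red", width=4)
    return image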
For visual question answering (as in OCR-VQA, ChartQA,
DocVQA, InfographicsVQA), while multimodal models
typically reserve a specialized text channel for the question,
we opt to instead directly render the question as a header
at the top of the original image. Pix2Struct reads both
the question and the image jointly via the visual modality.
This strategy is analogous to the common practice of simply concatenating all inputs during finetuning of pretrained text models, first proposed in GPT (Radford et al., 2018), which has since become the default in NLP. Intu-
itively, this strategy is effective because Pix2Struct has
been pretrained to be sensitive to long-range interactions
between various parts of the input image. In the case of
multiple choice answers (as in AI2D), we also render the
choices in the header as part of the question.
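A sketch of this header-rendering step with PIL (the header height, font, and line-wrapping width are our own choices, not the paper's):

# Sketch: render the question (and any multiple-choice options) as a header
# above the original image, so question and image are consumed as pixels only.
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_question_header(image: Image.Image, question: str,
                           font_path: str = "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"):
    font = ImageFont.truetype(font_path, size=20)
    lines = textwrap.wrap(question, width=80)
    header_h = 10 + 26 * len(lines)
    combined = Image.new("RGB", (image.width, image.height + header_h), "white")
    draw = ImageDraw.Draw(combined)
    for i, line in enumerate(lines):
        draw.text((10, 5 + 26 * i), line, font=font, fill="black")
    combined.paste(image, (0, header_h))
    return combined

The resulting image is then processed like any other Pix2Struct input, with the answer text as the decoding target.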
The most complex scenario is RefExp, where the task is
choosing between UI components that a natural language
expression could be referring to. For each candidate, we
create a training instance where the input image contains
the bounding box and referring expression, and the decoding target is “true” or “false”.