Pix2Struct: Screenshot Parsing as Pretraining for
Visual Language Understanding
Kenton Lee *1  Mandar Joshi *1  Iulia Turc 2  Hexiang Hu 1  Fangyu Liu 3  Julian Eisenschlos 1
Urvashi Khandelwal 1  Peter Shaw 1  Ming-Wei Chang 1  Kristina Toutanova 1
Abstract
Visually-situated language is ubiquitous—
sources range from textbooks with diagrams to
web pages with images and tables, to mobile
apps with buttons and forms. Perhaps due to
this diversity, previous work has typically relied
on domain-specific recipes with limited sharing
of the underlying data, model architectures,
and objectives. We present Pix2Struct,
a pretrained image-to-text model for purely
visual language understanding, which can be
finetuned on tasks containing visually-situated
language. Pix2Struct is pretrained by
learning to parse masked screenshots of web
pages into simplified HTML. The web, with its
richness of visual elements cleanly reflected in
the HTML structure, provides a large source
of pretraining data well suited to the diversity
of downstream tasks. Intuitively, this objective
subsumes common pretraining signals such as
OCR, language modeling, and image captioning.
In addition to the novel pretraining strategy,
we introduce a variable-resolution input rep-
resentation and a more flexible integration of
language and vision inputs, where language
prompts such as questions are rendered directly
on top of the input image. For the first time, we
show that a single pretrained model can achieve
state-of-the-art results in six out of nine tasks
across four domains: documents, illustrations,
user interfaces, and natural images.
*Equal contribution 1Google Research 2succinctly.ai 3University of Cambridge. Correspondence to: Kenton Lee <kentonl@google.com>, Mandar Joshi <mandarj@google.com>.
Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
1. Introduction
Research on the interaction between language and vision
has traditionally focused on tasks where images and text
can be separated into distinct channels, e.g. visual question
answering or image captioning. However, visually-situated
language is a far more pervasive way in which these modal-
ities interact and blend together. For example, documents,
tables, infographics, and user interfaces (UIs) are intended
to be consumed holistically, without clear boundaries be-
tween textual and visual elements (Figure 1). Comprehen-
sive understanding of this information requires a deep set
of skills, including the ability to recognize text, understand
language, and incorporate diverse visual context.
Previous work on understanding visually-situated language
is scattered. The focus is typically on complex task-specific
combinations of available inputs and tools. For example,
document-understanding models (Huang et al., 2022) rely
on external OCR systems, UI-understanding models rely
on platform-specific metadata (e.g. Android view hierar-
chy) (Bai et al., 2021), and diagram-understanding models
rely on diagram parses (Kembhavi et al., 2016). Domain-
specific engineering can be effective for high-resource set-
tings such as documents, where there is an abundance of
tools and data available. However, these pipelined models
lack sharing of the underlying data, model architectures,
and objectives across domains, limiting their general appli-
cability. Moreover, relying on external systems like OCR
increases engineering complexity, limits adaptability, and
can increase overall computational cost. Recent work on
OCR-free, end-to-end document understanding from im-
ages (Kim et al., 2022; Davis et al., 2022) has attempted
to remove such task-specific engineering and reliance on
external components during inference by learning to de-
code OCR outputs during pretraining—a significant step
towards more general-purpose models. However, the focus
on text at the surface level limits the depth of knowledge
transferred from unsupervised data.
Figure 1: Examples of visually-situated language understanding tasks, including diagram QA (AI2D), app captioning
(Screen2Words), and document QA (DocVQA). We also include an example of our proposed pretraining task (screenshot
parsing) on the left. Pix2Struct encodes the pixels from the input image (above) and decodes the output text (below).
We present Pix2Struct (pretrained checkpoints and code: https://github.com/google-research/pix2struct), a pretrained model that combines the simplicity of purely pixel-level inputs with the
generality and scalability provided by self-supervised pre-
training from diverse and abundant web data. Specifically,
we propose a screenshot parsing objective that requires
predicting an HTML-based parse from a masked screen-
shot of a web page. HTML provides clean signals about
text, images, and layouts, while the masked inputs encour-
age joint reasoning about their co-occurrence. With the di-
versity and complexity of textual and visual elements found
on the web, Pix2Struct learns rich representations of
the underlying structure of web pages, which we show can
effectively transfer to a variety of downstream visual lan-
guage understanding tasks.
A key ingredient which enables this transfer is process-
ing inputs visually and holistically as they are intended
for human readers. We introduce variable-resolution in-
puts for vision transformers (ViT) that prevent distortion
of the original aspect ratio, which can vary greatly across
documents, figures, and UIs. During finetuning, we render
other inputs (e.g., questions in VQA and bounding boxes
in UI tasks) onto the image input for the task. In effect, we
consume all our inputs through a single modality, simplify-
ing the modality combination problem in previous work.
We train two variants with 282M and 1.3B pa-
rameters, which we refer to as Pix2Struct-Base
and Pix2Struct-Large respectively, on 80M screen-
shots of web pages collected from the URLs in the C4 corpus (Raffel et al., 2020). (We do not use the released text in C4; the web page content and screenshots were crawled directly from the URLs.) Experiments on four domains
and nine tasks show that our finetuned models strongly outperform Donut (ranging from 9 to 53 points), the strongest existing baseline without pipelines. Compared with models with domain-specific pipelines, we lag behind the state
of the art in high-resource domains such as documents and
natural images but observe significant improvements (rang-
ing from 1 to 44 points) in low-resource domains such as
illustrations and UIs. We hope these results encourage the
community to continue developing such general-purpose
methods and further enable new applications in this cur-
rently fragmented intersection of language and vision.
To summarize, our major contributions are as follows:
- We introduce the area of general-purpose visually-situated language understanding, which consists of diverse tasks that share common challenges.
- We propose a screenshot parsing pretraining objective based on the HTML source of web pages. Our objective is shown to be more effective than prior attempts to enable the elegant pixel-to-text design for general-purpose visually-situated language understanding.
- We introduce variable-resolution input representations to ViT and new finetuning strategies that seamlessly integrate language and vision inputs by directly rendering any text prompts on top of the input image.
2. Method
2.1. Background
Prior attempts at pixel-only modeling of visually situated
language have largely focused on documents and natural
images. For documents, Donut (Kim et al., 2022) and
Dessurt (Davis et al., 2022) combine pretraining objectives
based on surface-level features from synthetic images or
predicted OCR outputs. For natural images, recent work—
GIT2 (Wang et al., 2022a) and PaLI (Chen et al., 2022c)—
focuses on collecting and training on large-scale image captioning data that transfers well to datasets with natural images (e.g. TextCaps).

Figure 2: Comparison of our variable-resolution inputs and the typical fixed-resolution input. We illustrate the preprocessing for a target sequence length of 36 patches for both inputs.
We aim to provide a single pretrained model that can be
finetuned on a wider variety of tasks and domains. The in-
put to our model is an image in the form of raw pixels only,
and the output is text in the form of token sequences, sim-
ilar to Donut. The goal is a visual analog of models like
T5 (Raffel et al., 2020), where the generality of simple in-
puts and outputs is combined with the power of pretraining
on large unsupervised sources of data. During finetuning,
the complexity of adapting to diverse downstream tasks re-
sides only in data preprocessing.
Even without visual context, pixel-only language model-
ing for text has only recently been attempted (Rust et al.,
2022)—perhaps because it requires solving multiple hard
sub-problems. First, the ability to read with high fidelity
while also building rich high-level representations poses
a difficult optimization problem. Second, encoding text-
heavy inputs (e.g. long documents) involves processing
high-resolution images with variable aspect ratios. State-
of-the-art document understanding models (Huang et al.,
2022) therefore rely on the combination of (possibly noisy)
OCR outputs with low resolution images.
We show the components of Pix2Struct that address
these challenges. Section 2.2 discusses modifications to the
transformer inputs to handle variable aspect ratios and reso-
lutions. Section 2.3 details our proposed screenshot parsing
objective and Section 2.4 describes curriculum learning for
more robust transfer learning. Finally, Section 2.5 shows
how Pix2Struct consumes textual and visual inputs for
downstream tasks (e.g. questions and images) in the same
space by rendering text inputs onto images.
2.2. Architecture
Pix2Struct is an image-encoder-text-decoder based
on ViT (Dosovitskiy et al., 2021). While the bulk
of the model is fairly standard, we propose one small
but impactful change to the input representation to
make Pix2Struct more robust to various forms of
visually-situated language. Before extracting fixed-size
patches, the standard ViT scales the input images to a pre-
defined resolution, which creates two undesirable effects:
(1) rescaling the image distorts the true aspect ratio, which
can be highly variable for documents, mobile UIs, and figures; and (2) transferring these models to downstream tasks
with higher resolution is non-trivial (Touvron et al., 2019;
Wang et al., 2021b), since the model only observes one spe-
cific resolution during pretraining.
We instead propose to always scale our input image up or
down such that we extract the maximal number of fixed-
size patches that fit within the given sequence length (Fig-
ure 2). In order for the model to handle variable resolutions
unambiguously, we use 2-dimensional absolute positional
embeddings for the input patches. Together these changes
to the standard ViT inputs provide two major advantages in
terms of robustness to: (1) extreme aspect ratios, which are common in the domains that we experiment with, and (2)
on-the-fly changes to the sequence length and resolution.
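A minimal sketch of this input pipeline, assuming 16x16 patches and using NumPy and PIL (the function and variable names are ours, not the released code): the image is rescaled while preserving its aspect ratio so that the resulting patch grid fits a sequence-length budget, and each flattened patch is paired with its (row, column) position.

# Sketch of aspect-preserving rescaling for variable-resolution ViT inputs.
# Assumes a 16x16 patch size and a PIL image; helper names are illustrative only.
import math
import numpy as np
from PIL import Image

def extract_variable_resolution_patches(image: Image.Image,
                                        max_patches: int = 2048,
                                        patch_size: int = 16):
    w, h = image.size
    # Choose a scale s so that (s*h/patch) * (s*w/patch) <= max_patches,
    # i.e. the largest number of fixed-size patches that fits the budget.
    scale = math.sqrt(max_patches * (patch_size / h) * (patch_size / w))
    rows = max(1, math.floor(scale * h / patch_size))
    cols = max(1, math.floor(scale * w / patch_size))
    resized = image.convert("RGB").resize((cols * patch_size, rows * patch_size))

    pixels = np.asarray(resized, dtype=np.float32) / 255.0          # (H, W, 3)
    patches = pixels.reshape(rows, patch_size, cols, patch_size, 3)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(rows * cols, -1)

    # 2-dimensional absolute positions (row, col) accompany each flattened patch
    # so the model can handle arbitrary resolutions and aspect ratios unambiguously.
    row_ids, col_ids = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    positions = np.stack([row_ids.reshape(-1), col_ids.reshape(-1)], axis=-1)
    return patches, positions

In practice the flattened patches and their (row, column) indices would be fed to the ViT encoder, whose 2-dimensional absolute positional embeddings are looked up from those indices.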
2.3. Pretraining
The goal of pretraining is for Pix2Struct to represent
the underlying structure of the input image. To that end, we
create self-supervised pairs of input images and target text
from web pages. For each page in the pretraining corpus,
we start by collecting its HTML source and a screenshot
using a viewport of 1024 x 1024.
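As a rough sketch of this collection step (Playwright is our tooling choice here; the paper does not specify its crawling stack), a URL can be rendered at a 1024 x 1024 viewport to obtain both the screenshot and the paired HTML source:

# Sketch: grab a screenshot and HTML source for one URL with a 1024x1024 viewport.
# Error handling and crawling politeness (robots.txt, rate limits) are omitted.
from playwright.sync_api import sync_playwright

def collect_page(url: str, out_prefix: str):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1024, "height": 1024})
        page.goto(url, wait_until="load")
        page.screenshot(path=f"{out_prefix}.png")   # input image
        html = page.content()                       # paired HTML source
        browser.close()
    with open(f"{out_prefix}.html", "w", encoding="utf-8") as f:
        f.write(html)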
Screenshot parsing inputs & outputs The screenshot
and HTML are modified to ensure rich and dense learning
signal during pretraining. These modifications provide a
reasonable trade-off between preserving the semantics of
the page and requiring a practical decoder sequence length.
We condense the HTML DOM tree by (1) only keeping
nodes with visible elements or descendants with visible el-
ements and (2) if a node does not contain visible elements
and it only has a single child, replacing the singleton child
with any grandchildren to remove chained nesting. In each
node, we only use the text, along with filenames and alt-text
of images. Much more information could be retained (e.g.
element tags, style, titles and URLs) in future work. The
decoder sequence length is further reduced by finding the
largest linearized subtree that fits within a predefined se-
quence length. A bounding box indicating the region cov-
ered by the chosen subtree is also drawn on the screenshot.
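The condensing rules can be expressed as a small recursive pass over the DOM; the Node structure and its visible flag below are simplifications (in practice visibility comes from the rendered page), and the serialization only mimics the bracketed style of Figure 3:

# Sketch of the two condensing rules: (1) keep only nodes that are visible or have
# visible descendants, and (2) collapse chains of invisible single-child nodes.
# Only text and image filename/alt-text are kept per node, as described above.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    text: str = ""            # visible text, or "img_src=... img_alt=..." for images
    visible: bool = False
    children: List["Node"] = field(default_factory=list)

def condense(node: Node) -> Optional[Node]:
    kept = [c for c in (condense(c) for c in node.children) if c is not None]
    if not node.visible and not kept:
        return None                      # rule (1): drop invisible subtrees
    if not node.visible and len(kept) == 1:
        return kept[0]                   # rule (2): collapse chained nesting
    return Node(node.text, node.visible, kept)

def linearize(node: Node) -> str:
    # Bracketed serialization in the style of Figure 3.
    inner = node.text + "".join(linearize(c) for c in node.children)
    return f"<{inner}>"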
For better context modeling, we introduce a BART-
like (Lewis et al., 2020) learning signal by masking 50%
of the text and decoding the entire subtree. The masked re-
gions are randomly sampled spans of text from the chosen
subtree where we render masks (Figure 3).

<<<Python>
<img_src=py_logo.png img_alt=Python>>
<<C++>
<img_src=cpp_logo.png img_alt=C++>>
<<Java>
<img_src=java_logo.png img_alt=Java>>
<Submit>>

Figure 3: Toy illustration of input-output pairs (right) sampled from the original web page (left).
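As a sketch of this masking step (assuming the renderer exposes bounding boxes for text spans; the sampling and drawing details are ours, not the authors' code), text regions are sampled until roughly 50% of the characters are covered and filled rectangles are drawn over them, while the decoding target remains the full parse:

# Sketch of rendering masks over ~50% of the text regions in a screenshot.
import random
from PIL import Image, ImageDraw

def mask_screenshot(screenshot: Image.Image, text_boxes, mask_ratio: float = 0.5):
    """text_boxes: list of (x0, y0, x1, y1, num_chars) for rendered text spans."""
    masked = screenshot.copy()
    draw = ImageDraw.Draw(masked)
    budget = mask_ratio * sum(b[4] for b in text_boxes)
    covered = 0
    for x0, y0, x1, y1, n in random.sample(text_boxes, len(text_boxes)):
        if covered >= budget:
            break
        draw.rectangle([x0, y0, x1, y1], fill=(127, 127, 127))  # grey mask patch
        covered += n
    return masked  # the target is still the full parse of the chosen subtree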
Comparison to existing pretraining strategies Our
proposed screenshot parsing seamlessly integrates signals
reminiscent of several well-known pretraining strategies:
- Recovering the unmasked parts of the parse is similar to OCR, a prerequisite skill for understanding language. OCR pretraining was proposed in Donut, which uses synthetic renderings or OCR outputs. In Figure 3, predicting <C++> exemplifies this learning signal.
- Recovering the masked parts of the parse is much like masked language modeling (Devlin et al., 2019). A major difference is that the visual context often provides additional powerful cues. In Figure 3, predicting <Python> exemplifies this signal.
- Recovering the alt-text from images is a common pretraining strategy for image captioning (Sharma et al., 2018; Wang et al., 2022a; Chen et al., 2022c). A major difference is that the model is permitted to use the web page as additional context. In Figure 3, predicting img_alt=C++ exemplifies this learning signal.
Appendix F contains more details including examples of
screenshots paired with their gold and predicted parses.
2.4. Warming up with a reading curriculum
While we can directly pretrain Pix2Struct on the
screenshot parsing task, we find that doing this naively
can result in instability and slow learning. However, if we
first expose the model to a short “warmup” stage of sim-
ply learning to read, we find a strong curriculum learning
effect where (1) pretraining is more stable and converges
faster, and (2) we observe better finetuning performance,
as discussed in Section 5. We create images of text snip-
pets with random colors and fonts. The model is simply
trained to decode the original text (see Appendix E for ex-
amples). This type of curriculum learning was also used
in Dessurt (Davis et al., 2022) and can also be viewed as a
simplified version of Donut’s pretraining.
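A minimal sketch of generating one warmup example with PIL, assuming a local list of TrueType fonts (the font paths, sizes, and colors here are placeholders rather than the settings used in the paper):

# Sketch of a "learn to read" warmup example: render a snippet with a random
# font and color, and use the raw text as the decoding target.
import random
from PIL import Image, ImageDraw, ImageFont

FONTS = ["/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"]  # placeholder list

def render_warmup_example(text: str, width: int = 640, height: int = 160):
    image = Image.new("RGB", (width, height), color="white")
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype(random.choice(FONTS), size=random.randint(16, 32))
    color = tuple(random.randint(0, 180) for _ in range(3))  # keep text legible
    draw.text((10, 10), text, font=font, fill=color)
    return image, text  # (input pixels, decoding target)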
2.5. Finetuning
Finetuning Pix2Struct is straightforward and largely a
matter of preprocessing the downstream data to unambigu-
ously reflect the task in the image inputs and text outputs,
analogous to the way T5 (Raffel et al., 2020) is used for
text-based tasks. In this section, we cover the preprocess-
ing strategies for the tasks described in Table 4. Examples
of this preprocessing are shown in Figure 1.
Captioning is the most straightforward, since the input im-
age and the output text can be directly used (as in TextCaps,
Screen2Words). In the case where the focus of the caption
is a specific bounding box (as in Widget Captioning), we
draw the target bounding box on the image itself.
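For instance, a sketch of this preprocessing step (the box coordinates and styling are illustrative):

# Sketch: mark the widget of interest directly on the screenshot so the model
# sees the target region through the visual channel alone.
from PIL import Image, ImageDraw

def draw_target_box(screenshot: Image.Image, box):  # box = (x0, y0, x1, y1)
    image = screenshot.copy()
    ImageDraw.Draw(image).rectangle(box, outline="red", width=4)
    return image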
For visual question answering (as in OCR-VQA, ChartQA,
DocVQA, InfographicsVQA), while multimodal models
typically reserve a specialized text channel for the question,
we opt to instead directly render the question as a header
at the top of the original image. Pix2Struct reads both
the question and the image jointly via the visual modality.
This strategy is analogous to the common practice of simply concatenating all inputs during finetuning of pretrained text models, first proposed in GPT (Radford et al., 2018), which has since become the default in NLP. Intu-
itively, this strategy is effective because Pix2Struct has
been pretrained to be sensitive to long-range interactions
between various parts of the input image. In the case of
multiple choice answers (as in AI2D), we also render the
choices in the header as part of the question.
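A sketch of this header-rendering step with PIL (the header height, font, and line-wrapping width are our own choices, not the paper's):

# Sketch: render the question (and any multiple-choice options) as a header
# above the original image, so question and image are consumed as pixels only.
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_question_header(image: Image.Image, question: str,
                           font_path: str = "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"):
    font = ImageFont.truetype(font_path, size=20)
    lines = textwrap.wrap(question, width=80)
    header_h = 10 + 26 * len(lines)
    combined = Image.new("RGB", (image.width, image.height + header_h), "white")
    draw = ImageDraw.Draw(combined)
    for i, line in enumerate(lines):
        draw.text((10, 5 + 26 * i), line, font=font, fill="black")
    combined.paste(image, (0, header_h))
    return combined

The resulting image is then processed like any other Pix2Struct input, with the answer text as the decoding target.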
The most complex scenario is RefExp, where the task is
choosing between UI components that a natural language
expression could be referring to. For each candidate, we
create a training instance where the input image contains
the bounding box and referring expression, and the decoding target is “true” or “false”.