
Figure 3: Toy illustration of input-output pairs (right) sampled from the original web page (left). The target parse in the example is <<<Python> <img_src=py_logo.png img_alt=Python>> <<C++> <img_src=cpp_logo.png img_alt=C++>> <<Java> <img_src=java_logo.png img_alt=Java>> <Submit>>.
with any grandchildren to remove chained nesting. In each node, we only use the text, along with filenames and alt-text of images. Much more information could be retained (e.g., element tags, style, titles and URLs) in future work. The decoder sequence length is further reduced by finding the largest linearized subtree that fits within a predefined sequence length. A bounding box indicating the region covered by the chosen subtree is also drawn on the screenshot.
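As a rough sketch of how this subtree selection could work (the Node structure, linearization format, tokenizer interface, and max_len below are illustrative assumptions, not the paper's actual implementation):

```python
# Minimal sketch: pick the largest linearized subtree that fits within a
# predefined decoder sequence length. All names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    text: str = ""                      # visible text of this DOM node
    img_alt: str = ""                   # alt-text if the node is an image
    children: List["Node"] = field(default_factory=list)
    bbox: tuple = (0, 0, 0, 0)          # region the subtree covers on the screenshot


def linearize(node: Node) -> str:
    """Linearize a subtree into the bracketed format shown in Figure 3."""
    parts = [node.text or node.img_alt]
    parts += [f"<{linearize(c)}>" for c in node.children]
    return " ".join(p for p in parts if p)


def largest_fitting_subtree(root: Node, tokenizer, max_len: int) -> Optional[Node]:
    """Return the subtree whose linearization is longest while still fitting."""
    best, best_len = None, -1
    stack = [root]
    while stack:
        node = stack.pop()
        n_tokens = len(tokenizer(linearize(node)))
        if best_len < n_tokens <= max_len:
            best, best_len = node, n_tokens
        stack.extend(node.children)
    return best
```

The bounding box of the chosen subtree would then be drawn onto the screenshot before encoding.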
For better context modeling, we introduce a BART-like (Lewis et al., 2020) learning signal by masking 50% of the text and decoding the entire subtree. The masked regions are randomly sampled spans of text from the chosen subtree where we render masks (Figure 3).
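A minimal sketch of the span sampling, assuming the subtree's visible text is available as a list of words; the 50% ratio comes from the description above, while the span-length distribution and helper name are illustrative assumptions:

```python
import random


def mask_spans(words, mask_ratio=0.5, mean_span_len=3, seed=None):
    """Sample word spans covering roughly `mask_ratio` of the text.

    Returns indices of masked words; on the rendering side these regions are
    covered with a visual mask, while the decoding target remains the full,
    unmasked parse. The uniform span-length distribution is an assumption.
    """
    rng = random.Random(seed)
    n_to_mask = int(len(words) * mask_ratio)
    masked = set()
    while len(masked) < n_to_mask:
        span_len = min(1 + rng.randrange(2 * mean_span_len - 1),
                       n_to_mask - len(masked))
        start = rng.randrange(len(words))
        masked.update(range(start, min(start + span_len, len(words))))
    return sorted(masked)
```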
Comparison to existing pretraining strategies. Our proposed screenshot parsing seamlessly integrates signals reminiscent of several well-known pretraining strategies, made concrete in the example following the list:
• Recovering the unmasked parts of the parse is similar to OCR, a prerequisite skill for understanding language. OCR pretraining was proposed in Donut (Kim et al., 2022), which uses synthetic renderings or OCR outputs. In Figure 3, predicting <C++> exemplifies this learning signal.
• Recovering the masked parts of the parse is much like masked language modeling (Devlin et al., 2019). A major difference is that the visual context often provides additional powerful cues. In Figure 3, predicting <Python> exemplifies this signal.
• Recovering the alt-text from images is a common pretraining strategy for image captioning (Sharma et al., 2018; Wang et al., 2022a; Chen et al., 2022c). A major difference is that the model is permitted to use the web page as additional context. In Figure 3, predicting img_alt=C++ exemplifies this learning signal.
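To make the three signals concrete, the snippet below spells out the Figure 3 example, assuming the "Python" label is the span hidden by the rendered mask (the exact mask placement is an illustrative assumption):

```python
# Toy input/target pair from the Figure 3 example, assuming the "Python"
# label is hidden behind a rendered mask on the screenshot.
decoding_target = (
    "<<<Python> <img_src=py_logo.png img_alt=Python>>"
    " <<C++> <img_src=cpp_logo.png img_alt=C++>>"
    " <<Java> <img_src=java_logo.png img_alt=Java>>"
    " <Submit>>"
)
# <C++>, <Java>, <Submit>    -> visible on screen: OCR-like signal
# <Python>                   -> hidden by the mask: masked language modeling
# img_alt=C++, img_alt=Java  -> never rendered: captioning-like signal
```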
Appendix F contains more details including examples of
screenshots paired with their gold and predicted parses.
2.4. Warming up with a reading curriculum
While we can directly pretrain Pix2Struct on the screenshot parsing task, we find that doing this naively can result in instability and slow learning. However, if we first expose the model to a short “warmup” stage of simply learning to read, we find a strong curriculum learning effect where (1) pretraining is more stable and converges faster, and (2) we observe better finetuning performance, as discussed in Section 5. We create images of text snippets with random colors and fonts. The model is simply trained to decode the original text (see Appendix E for examples). This type of curriculum learning was also used in Dessurt (Davis et al., 2022) and can also be viewed as a simplified version of Donut’s pretraining.
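A minimal sketch of how such warmup examples could be generated, assuming Pillow is available; the font list, image size, and color scheme are illustrative assumptions:

```python
import random
from PIL import Image, ImageDraw, ImageFont

FONT_PATHS = ["DejaVuSans.ttf", "DejaVuSerif.ttf"]  # illustrative assumption


def render_reading_example(text: str, size=(384, 128), seed=None):
    """Render a text snippet with a random color and font.

    The model is then trained to decode `text` from the rendered image.
    """
    rng = random.Random(seed)
    bg = tuple(rng.randrange(256) for _ in range(3))
    fg = tuple(255 - c for c in bg)  # keep text legible against the background
    font = ImageFont.truetype(rng.choice(FONT_PATHS), size=rng.randrange(16, 33))
    image = Image.new("RGB", size, color=bg)
    ImageDraw.Draw(image).text((8, 8), text, fill=fg, font=font)
    return image, text  # (input image, decoding target)
```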
2.5. Finetuning
Finetuning Pix2Struct is straightforward and largely a matter of preprocessing the downstream data to unambiguously reflect the task in the image inputs and text outputs, analogous to the way T5 (Raffel et al., 2020) is used for text-based tasks. In this section, we cover the preprocessing strategies for the tasks described in Table 4. Examples of this preprocessing are shown in Figure 1.
Captioning is the most straightforward, since the input image and the output text can be directly used (as in TextCaps, Screen2Words). In the case where the focus of the caption is a specific bounding box (as in Widget Captioning), we draw the target bounding box on the image itself.
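A minimal sketch of this preprocessing step, assuming Pillow and a pixel-coordinate box; the outline color and width are illustrative choices:

```python
from PIL import Image, ImageDraw


def draw_target_box(image: Image.Image, bbox) -> Image.Image:
    """Draw the target widget's bounding box directly on the input image.

    `bbox` is (left, top, right, bottom) in pixels; the caption for that
    widget is then used as the decoding target unchanged.
    """
    image = image.copy()
    ImageDraw.Draw(image).rectangle(bbox, outline=(255, 0, 0), width=3)
    return image
```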
For visual question answering (as in OCR-VQA, ChartQA, DocVQA, InfographicsVQA), while multimodal models typically reserve a specialized text channel for the question, we opt instead to directly render the question as a header at the top of the original image. Pix2Struct reads both the question and the image jointly via the visual modality. This strategy is analogous to the common practice of simply concatenating all inputs during finetuning of pretrained text models, first proposed in GPT (Radford et al., 2018), which has since become the default approach in NLP. Intuitively, this strategy is effective because Pix2Struct has been pretrained to be sensitive to long-range interactions between various parts of the input image. In the case of multiple-choice answers (as in AI2D), we also render the choices in the header as part of the question.
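A minimal sketch of this header rendering, assuming Pillow; the fixed header height, default font, and choice separator are illustrative assumptions (the exact layout is not specified above):

```python
from PIL import Image, ImageDraw, ImageFont


def render_header(image: Image.Image, question: str, choices=None) -> Image.Image:
    """Concatenate a rendered question header on top of the original image."""
    text = question if not choices else question + " " + " | ".join(choices)
    font = ImageFont.load_default()
    header_h = 40  # illustrative fixed header height; text wrapping is omitted
    header = Image.new("RGB", (image.width, header_h), color="white")
    ImageDraw.Draw(header).text((4, 4), text, fill="black", font=font)
    combined = Image.new("RGB", (image.width, image.height + header_h), color="white")
    combined.paste(header, (0, 0))
    combined.paste(image, (0, header_h))
    return combined
```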
The most complex scenario is RefExp, where the task is choosing between UI components that a natural language expression could be referring to. For each candidate, we create a training instance where the input image contains the bounding box and referring expression, and the decod-