
Task                       Dataset                                         Size  Input    Model Architecture  Model Output  Task Output
Autonomous Web Navigation  MiniWoB Demos (Shi et al., 2017)                12K   Page     Enc-Dec, Dec        Text          Dictionary
Semantic Classification    Annotated Shopping Webpages (Gur et al., 2021)  28K   Snippet  All                 Text          Category
Description Generation     CommonCrawl (new)                               85K   Snippet  Enc-Dec, Dec        Text          Text
Table 1: Task, dataset, and model summary. All models receive raw HTML. Autonomous Web Navigation
receives the entire HTML page, while the other tasks receive HTML snippets extracted given a salient element.
4 CANONICAL TASKS FOR HTML UNDERSTANDING
We devise three canonical tasks to study HTML understanding capabilities of LLM-based web
agents. These tasks require correctly interpreting both structure and content to varying degrees
to make predictions, with autonomous navigation being the most challenging capability of the three.
Autonomous Web Navigation. This task evaluates how well a model navigates multi-page web-
sites as a sequential decision-making problem (Shi et al., 2017; Liu et al., 2018). At the beginning
of an episode, the agent is given a natural language instruction, e.g. Enter the username “lyda”
and the password “N22t” into the text fields and press login. The agent applies actions to a se-
quence of HTML pages, where each action is of the form function(selector, text). The
function is one of click or type, selector is an integer pointer that uniquely identifies an
element, and text is the string to enter when the function is type. An episode terminates when
either the page reaches a terminal state (e.g., the ‘sign in’ button is clicked) or the maximum number
of steps is reached.
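
As a minimal sketch of the action format described above, the following Python snippet parses a string of the form function(selector, text) into a structured record. The Action class, the parse_action helper, and the exact string syntax (for example, double quotes around the text argument) are illustrative assumptions, not the serialization used by the benchmark or by the models in this paper.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed textual form: click(5) or type(3, "lyda").
# The quoting convention is hypothetical and chosen only for this sketch.
_ACTION_RE = re.compile(
    r'^\s*(click|type)\(\s*(\d+)\s*(?:,\s*"(.*)"\s*)?\)\s*$'
)


@dataclass(frozen=True)
class Action:
    function: str         # "click" or "type"
    selector: int         # integer pointer identifying one element on the page
    text: Optional[str]   # string to enter; present only for "type"


def parse_action(raw: str) -> Action:
    """Parse an action string into an Action, raising ValueError if malformed."""
    match = _ACTION_RE.match(raw)
    if match is None:
        raise ValueError(f"not a valid action: {raw!r}")
    function, selector, text = match.group(1), int(match.group(2)), match.group(3)
    if function == "type" and text is None:
        raise ValueError("a type action needs a text argument")
    if function == "click" and text is not None:
        raise ValueError("a click action takes no text argument")
    return Action(function, selector, text)


if __name__ == "__main__":
    print(parse_action('type(3, "lyda")'))
    print(parse_action("click(5)"))
```

Keeping the selector as a plain integer mirrors the description above: the same two functions and one pointer cover every page an agent visits during an episode.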
Semantic Classification. Many HTML understanding applications require a model that can classify
HTML elements into standardized categories. For example, in automated form-filling (Diaz et al.,
2013; Gur et al., 2021), it is useful to identify a ‘submit button’ across many websites (e.g., shopping,
flight booking, utility application) with various button representations (e.g., position, color, or text).
Thus, we formulate Semantic Classification as classifying elements into role categories. Take the
example HTML in Figure 1a which includes two input elements and a submit button. Let’s
pick the first input as an element of interest to be classified by the system, also called a salient
element. The system should classify this element as username, since it appears on a login page and
it has a label with Email Address which is typically associated with the username in form-filling
applications. To solve this, the system can aggregate information from multiple sources in the page
– the label that says Enter Email Address, the input attributes (type=“email” and id=“uName”),
or even the ordering of other elements in the page such as ‘password’ and ‘sign in’.
Description Generation. Motivated by applications in accessibility-minded web browser control
(Jorgensen & Binsted, 2005), we formulate description generation as an extractive problem
where the goal is to locate the textual description of an element in the HTML and generate it as
output. For instance, the description of the salient element in Figure 1a is Enter Email Address;
when rendered, this label will appear above the ‘email’ input field. HTML provides a large
amount of flexibility, and so in general a descriptive text that appears alongside a specific element
when rendered can be very far from that element when looking at the HTML plaintext. Thus, this
task evaluates a model’s ability to understand the structure of HTML as it would appear to a user,
despite not having access to the rendered web page directly.
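
To make the extractive target concrete, the sketch below recovers the description of an element with a simple rule: find the label whose for attribute matches the element's id. This is only a hand-written heuristic for illustration, not the approach evaluated in this paper, and it fails whenever a page does not link labels to inputs this way. The LabelFinder class, the find_description helper, and the PAGE markup are all hypothetical; PAGE reuses only the id and label text mentioned above and places the label away from the input it describes.

```python
from html.parser import HTMLParser
from typing import List, Optional


class LabelFinder(HTMLParser):
    """Collects the text of the <label> whose `for` attribute equals target_id."""

    def __init__(self, target_id: str) -> None:
        super().__init__()
        self.target_id = target_id
        self._inside_label = False
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "label" and dict(attrs).get("for") == self.target_id:
            self._inside_label = True

    def handle_endtag(self, tag):
        if tag == "label":
            self._inside_label = False

    def handle_data(self, data):
        if self._inside_label:
            self._chunks.append(data)

    @property
    def description(self) -> Optional[str]:
        # Collapse runs of whitespace; return None when no label matched.
        text = " ".join("".join(self._chunks).split())
        return text or None


def find_description(html: str, element_id: str) -> Optional[str]:
    finder = LabelFinder(element_id)
    finder.feed(html)
    finder.close()
    return finder.description


# Hypothetical markup: the labels sit in a different container from the inputs.
PAGE = """
<form>
  <div class="labels">
    <label for="uName">Enter Email Address</label>
    <label for="pass">Enter Password</label>
  </div>
  <div class="fields">
    <input type="email" id="uName">
    <input type="password" id="pass">
    <button type="submit">Sign in</button>
  </div>
</form>
"""

if __name__ == "__main__":
    print(find_description(PAGE, "uName"))
```

The rule works here only because the markup declares the association explicitly; when no such attribute exists, the description has to be inferred from the document structure, which is the ability this task is meant to probe.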
5 DATASETS
Each of our canonical tasks requires a separate dataset, with the description generation task using a
newly contributed, auto-labelled dataset based on CommonCrawl.
Autonomous Web Navigation. We use the 12K demonstrations included in the publicly available
MiniWoB benchmark (Shi et al., 2017), which encompass 62 website applications ranging from
email forwarding to social media interactions. Each demonstration is a sequence of (instruction,
HTML, action) tuples. Every element in a MiniWoB demonstration is accompanied by a reference
number that is unique within its page. This number can be used as an element selector, making
the action space unified across all tasks and time steps. For instance, the action in Figure 1a would be