UNDERSTANDING HTML WITH LARGE LANGUAGE
MODELS
Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang
Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, Aleksandra Faust
Google Research
{izzeddin,ofirnachum,yingjiemiao,msafdari,austinvhuang,
chowdhery,sharannarang,nfiedel,sandrafaust}@google.com
ABSTRACT
Large language models (LLMs) have shown exceptional performance on a va-
riety of natural language tasks. Yet, their capabilities for HTML understanding
– i.e., parsing the raw HTML of a webpage, with applications to automation of
web-based tasks, crawling, and browser-assisted retrieval – have not been fully
explored. We contribute HTML understanding models (fine-tuned LLMs) and an
in-depth analysis of their capabilities under three tasks: (i) Semantic Classifica-
tion of HTML elements, (ii) Description Generation for HTML inputs, and (iii)
Autonomous Web Navigation of HTML pages. While previous work has devel-
oped dedicated architectures and training procedures for HTML understanding,
we show that LLMs pretrained on standard natural language corpora transfer re-
markably well to HTML understanding tasks. For instance, fine-tuned LLMs are
12% more accurate at semantic classification compared to models trained exclu-
sively on the task dataset. Moreover, when fine-tuned on data from the MiniWoB
benchmark, LLMs successfully complete 50% more tasks using 192× less data
compared to the previous best supervised model. Out of the LLMs we evalu-
ate, we show evidence that T5-based models are ideal due to their bidirectional
encoder-decoder architecture. To promote further research on LLMs for HTML
understanding, we create and open-source a large-scale HTML dataset distilled
and auto-labeled from CommonCrawl.1
1 INTRODUCTION
Web crawling (Olston et al., 2010), form-filling (Diaz et al., 2013; Gur et al., 2021), and information-
retrieving web agents (Nogueira & Cho, 2016) are important for both automating and assisting
users in web-based tasks. These and similar applications rely on models that can search for specific
content or controls on a web page as well as navigate a website autonomously. Since a web page in
its raw form is represented as an HTML-based text sequence, the success of models for web-based
tasks relies on their ability to understand HTML semantics, structure, and embedded interactions.
The predominant approach to web automation and HTML understanding is to train specialized mod-
els, i.e., gathering application-specific datasets and designing neural network (NN) architectures to
leverage inductive biases of HTML's structure; see, e.g., Liu et al. (2018); Toyama et al. (2021);
Gur et al. (2021); Humphreys et al. (2022). However, both dataset collection and neural architecture
design are expensive, time-consuming, and require highly-specialized, domain-specific knowledge.
Meanwhile, in the natural language processing (NLP) literature, large language models (LLMs) have
emerged as a solution to the difficulties of dataset collection and specialized NN design (Kaplan
et al., 2020; Bommasani et al., 2021). A popular paradigm in NLP is to take an off-the-shelf LLM
– pretrained on a large text corpus via an unsupervised and task-agnostic learning objective – and
either fine-tune or prompt the LLM on a small task-specific dataset. This paradigm has shown
exceptional performance on a variety of NLP tasks (Xue et al., 2020; Brown et al., 2020; Austin
et al., 2021). Whether LLMs can be applied to HTML understanding – especially given the much
larger context and sequence lengths – remains an under-explored question.
1See visualizations of the results at https://sites.google.com/view/llm4html/home.
<html>
<body>
<form class="login-form">
<div>
<label class="form-label" for=”uName”>
Enter Email Address
</label>
<label class="form-label" for=”pass”>
Enter Password:
</label>
</div>
<div>
<input type="email" id="uName”>
<input type="password" id="pass">
<span class="hidden">
Please enter your password.
</span>
</div>
<button type="submit">Sign In</button>
</form>
</body>
</html>
(a)
<div><label class="form-label" for=”uName”>Email Address</label><label
class="form-label" for=”pass”>Enter Password: </label></div><div><input
type="email" id="uName” target><input type="password" id="pass"><span
class="hidden">Please enter your password.</span></div>
(b)
Figure 1: a) HTML example page with a highlighted salient element, an element of interest (dashed box).
All canonical tasks evaluate a distinct interaction with this element, either by classifying it as one of a set of
categories, generating a text description of its purpose, or applying an action as part of a sequential navigation
of a multi-page website. b) LLM architectures overview. Dashed boxes denote sub-modules that are specific to
either encoder-only or encoder-decoder models. For encoder-only models, we add an extra classification layer.
Decoder-only models (not in the diagram) are similar to encoder-decoder models; the main difference is that
the HTML snippet is fed to the decoder and processed from left to right.
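To make the architectural difference concrete, the sketch below (our own illustration, not the paper's released code) shows the extra classification layer mentioned in the caption being placed on top of an off-the-shelf encoder-only model; the checkpoint name and the number of semantic categories are placeholders.

import torch
from transformers import AutoModel, AutoTokenizer

NUM_CATEGORIES = 32  # hypothetical size of the semantic-category set
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder encoder-only checkpoint
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(encoder.config.hidden_size, NUM_CATEGORIES)  # the extra classification layer

snippet = '<input type="email" id="uName" target>'
tokens = tokenizer(snippet, return_tensors="pt", truncation=True)
hidden = encoder(**tokens).last_hidden_state[:, 0]  # first-token ([CLS]-style) representation
logits = classifier(hidden)                         # unnormalized scores over the categories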
In this paper, we investigate whether LLMs can be applied to HTML understanding to produce
better-performing, more sample-efficient HTML understanding models without the need for
custom NN architecture design. To that end, we present a suite of three benchmarking tasks for
HTML understanding that capture the essence of these applications and require understanding both
structure and content. First, we devise Semantic Classification as a task that requires a model to
classify a given HTML element into one of a set of categories, such as address, email, password
etc., with application to automated form-filling. Second, we present Description Generation, a
label-extraction task where a model is given an HTML snippet and is asked to produce a natural
language description. For instance, for an email field, the description might be “Please enter your
email address.” Note that in the majority of web pages, this connection between input elements and
description content is only implicit in the raw HTML code and inferring such links is a prerequisite
for higher-level navigation objectives. The third task is Autonomous Web Navigation (Shi et al.,
2017). A model is presented with an HTML page paired with a natural language command and
must apply appropriate actions on a sequence of HTML pages to satisfy the command. See Figure
1a for a simplified example of these tasks.
With these benchmark tasks in hand, we evaluate the transfer capabilities of a variety of pretrained
LLMs (Table 1), varying in architecture (encoder-only, encoder-decoder, or decoder-only), model
size (from 24.6M to 62B parameters), and training data corpora (with and without HTML in the
pretraining corpus). While prior work universally pre-parses the HTML as input to the
model (Gur et al., 2021; Liu et al., 2018; Nakano et al., 2021), ours – to the best of our knowledge – is
the first work that uses raw, unprocessed HTML. Our results show that LLMs demonstrate a remark-
able level of HTML understanding across all tasks, with up to 192× greater sample efficiency than
models trained from scratch, and achieving a new SoTA for supervised learning on the MiniWoB
benchmark suite (Shi et al., 2017). The encoder-decoder architectures with bi-directional attention
show the best performance across the board even when their pretraining does not include HTML. In
addition, we show that the performance scales sub-linearly with the model size.
The broader objective of this research is to advance the integration of LLMs with autonomous web
agents. It has only been in the last year that researchers have begun to utilize LLMs outside of
NLP and integrate them as core capabilities in autonomy (Lu et al., 2021; Ahn et al., 2022). In
this context, LLMs are reasoning engines for sequential decision making agents interacting with
environments.
The present work is the first in the research literature to embed an LLM and train it as an agent for
autonomous web navigation. This requires new implementations to adapt LLM training for behavior
cloning in addition to designing interfaces for integrating text generation into a perception-compute-
action cycle operating in a stateful web environment. Our implementation allows us to answer new
questions regarding trade-offs among various model characteristics.
We believe these contributions expand the scope of language models and connect their unique capa-
bilities with autonomous agents for the web. We provide a new perspective on machine learning for
HTML understanding and web automation, showing that pretrained LLMs can achieve significant
performance on such tasks, reducing the need for specialized architectures and training protocols.
To encourage further research in this direction, we open-source2 model weights for agents used in
the WoB environment and our dataset for description generation.
2 RELATED WORK
HTML Understanding Autonomous web navigation has been a popular application for neural net-
work models, and a variety of works propose simulated websites for training web-based agents, with
application to task fulfillment (Yao et al., 2022; Gur et al., 2021; Burns et al., 2022; Mazumder &
Riva, 2020; Shi et al., 2017; Liu et al., 2018) as well as information retrieval or question-answering
(Adolphs et al., 2021; Nogueira & Cho, 2016). Simulated websites provide an easy way to evaluate
models online, and for this reason we use the existing MiniWoB benchmark (Shi et al., 2017) for our
web navigation setting. However, it is still important to have a mechanism for evaluating models on
a wide variety of real-world websites. This was the key motivation for generating our own dataset
for the description generation task, which is distilled and auto-labeled from CommonCrawl and is a
key contribution of our paper.
Alongside these benchmarks, many works have developed models for web navigation and related
subtasks (Pasupat et al., 2018; Bommasani et al., 2021; He et al., 2021; Gur et al., 2021; Humphreys
et al., 2022; Liu et al., 2018; Jia et al., 2019). These works often rely on specialized neural network
architectures that leverage inductive biases of HTML structure, or on preprocessing of HTML to
make it easier to input to a model (Li et al., 2021a;b). In contrast, our work takes a minimalist
approach, providing HTML in text form with minimal processing and using widely-adopted trans-
former networks.
LLMs and HTML Works that explore the intersection of LLMs and HTML generally fall into two
categories. The first category uses LLMs to assist web navigation (Nakano et al., 2021; Yao et al.,
2022), and typically relies on custom preprocessing to map the context and structure of a web page
to natural language, thus severely restricting what HTML pages the model can parse. The second
category pretrains LLMs on large corpora of HTML text (Aghajanyan et al., 2021). However,
these works typically restrict the model evaluation to standard NLP tasks, e.g., summarization and
question/answering as opposed to tasks more relevant to HTML understanding and web automation.
Our work can be thought of as the reverse: We keep the pretraining of LLMs unchanged and focus
on the mechanisms for transferring the pretrained LLMs to HTML-relevant tasks.
3 BRIEF BACKGROUND ON HTML AS SEMI-STRUCTURED TEXT DATA
HTML is a markup language used to organize web page structure and content. Consider the
example HTML page in Figure 1a. This web page includes two adjacent input elements, one for
e-mail and another for password, with their corresponding labels on a separate branch of the page.
These inputs and labels are one of many possible elements that serve as HTML building blocks.
Each element has a set of attributes – key-value pairs – that describe the element’s content, such
as style and human-readable text. When rendered in a browser, these attributes will be responsible
for how the element is shown and where it is positioned. In the example in Figure 1a, the first
input has three attributes, tag="input", type="email", and id="uName", that identify
the element as an email input with an identifier (“uName”) that can be accessed programmatically.
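As a concrete illustration of this programmatic access (a minimal sketch of our own, using only the Python standard library and an abbreviated version of the page in Figure 1a):

from html.parser import HTMLParser

class AttributeCollector(HTMLParser):
    """Records the attributes of every <input> element encountered."""
    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            self.inputs.append(dict(attrs))

page = '<input type="email" id="uName"><input type="password" id="pass">'
collector = AttributeCollector()
collector.feed(page)
print(collector.inputs)
# [{'type': 'email', 'id': 'uName'}, {'type': 'password', 'id': 'pass'}]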
2https://console.cloud.google.com/storage/browser/gresearch/webllm
Task | Dataset | Size | Input | Model Architecture | Model Output | Task Output
Autonomous Web Navigation | MiniWoB Demos (Shi et al., 2017) | 12K | Page | Enc-Dec, Dec | Text | Dictionary
Semantic Classification | Annotated Shopping Webpages (Gur et al., 2021) | 28K | Snippet | All | Text | Category
Description Generation | CommonCrawl (new) | 85K | Snippet | Enc-Dec, Dec | Text | Text
Table 1: Task, dataset, and model summary. All models receive raw HTML. Autonomous Web Navigation
receives the entire HTML page, while the other tasks receive HTML snippets extracted given the salient element.
4 CANONICAL TASKS FOR HTML UNDERSTANDING
We devise three canonical tasks to study HTML understanding capabilities of LLM-based web
agents. These tasks require correctly interpreting both structure and content to varying degrees
to make predictions, with autonomous navigation being the most challenging capability of the three.
Autonomous Web Navigation. This task evaluates how well a model navigates multi-page web-
sites as a sequential decision-making problem (Shi et al., 2017; Liu et al., 2018). At the beginning
of an episode, the agent is given a natural language instruction, e.g., Enter the username “lyda”
and the password “N22t” into the text fields and press login. The agent applies actions to a se-
quence of HTML pages, where each action is of the form function(selector, text). The
function is one of click or type, selector is an integer pointer that uniquely identifies an ele-
ment, and text is the text to input when the type function is used. An episode terminates when
either the page reaches a terminal state (e.g., the ‘sign in’ button is clicked) or the maximum number
of steps is reached.
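A minimal sketch of this action interface (our own illustration; the selector values below are hypothetical, not taken from an actual MiniWoB episode):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Action:
    function: str               # "click" or "type"
    selector: int               # integer reference that uniquely identifies an element
    text: Optional[str] = None  # only used when function == "type"

# The login instruction above could map to a sequence of actions such as:
episode = [
    Action("type", selector=5, text="lyda"),   # username field (selector is hypothetical)
    Action("type", selector=7, text="N22t"),   # password field
    Action("click", selector=9),               # e.g., the "Sign In" button
]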
Semantic Classification. Many HTML understanding applications require a model that can classify
HTML elements into standardized categories. For example, in automated form-filling (Diaz et al.,
2013; Gur et al., 2021), it is useful to identify a ‘submit button’ across many websites (e.g., shopping,
flight booking, utility application) with various button representations (e.g., position, color, or text).
Thus, we formulate Semantic Classification as classifying elements into role categories. Take the
example HTML in Figure 1a, which includes two input elements and a submit button. Let’s
pick the first input as an element of interest to be classified by the system, also called a salient
element. The system should classify this element as username, since it appears on a login page and
it has a label with Email Address, which is typically associated with the username in form-filling
applications. To solve this, the system can aggregate information from multiple sources in the page
– the label that says Enter Email Address, the input attributes (type=“email” and id=“uName”),
or even the ordering of other elements in the page such as ‘password’ and ‘sign in’.
Description Generation. Motivated by applications in accessibility-minded web browser con-
trol (Jorgensen & Binsted, 2005), we formulate description generation as an extractive problem
where the goal is to locate the textual description of an element in the HTML and generate it as
output. For instance, the description of the salient element in Figure 1a is Enter Email Address;
when rendered, this label will appear above the ‘email’ input field. HTML provides a large
amount of flexibility, and so in general descriptive text that appears alongside a specific element
when rendered can be very far from that element when looking at the HTML plaintext. Thus, this
task evaluates a model’s ability to understand the structure of HTML as it would appear to a user,
despite not having access to the rendered web page directly.
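One simple case of this implicit link is a <label> whose for attribute matches the input's id; the sketch below (our own heuristic, not the paper's auto-labeling pipeline) recovers the description for the example above using BeautifulSoup:

from bs4 import BeautifulSoup

html = ('<label class="form-label" for="uName">Enter Email Address</label>'
        '<input type="email" id="uName">')
soup = BeautifulSoup(html, "html.parser")
salient = soup.find("input", {"type": "email"})     # the salient element
label = soup.find("label", {"for": salient["id"]})  # label explicitly tied to it
print(label.get_text(strip=True))                   # "Enter Email Address"

In general, of course, the descriptive text may sit far from the element and carry no explicit for/id pairing, which is exactly why the task requires structural understanding rather than a fixed heuristic.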
5 DATASETS
Each of our canonical tasks requires a separate dataset, with the description generation task using a
newly contributed, auto-labeled dataset based on CommonCrawl.
Autonomous Web Navigation. We use the 12K demonstrations included in the publicly available
MiniWoB benchmark (Shi et al., 2017), which encompass 62 website applications ranging from
email forwarding to social media interactions. Each demonstration is a sequence of (instruction,
HTML, action) tuples. Every element in a MiniWoB demonstration is accompanied by a reference
number unique within its respective page. This number can be used as an element selector, making
the action space unified across all tasks and time steps. For instance, the action in Figure 1a would be