

UNDERSTANDING HTML WITH LARGE LANGUAGE MODELS

Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, Aleksandra Faust

Google Research

{izzeddin,ofirnachum,yingjiemiao,msafdari,austinvhuang,chowdhery,sharannarang,nfiedel,sandrafaust}@google.com

ABSTRACT

Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks. Yet, their capabilities for HTML understanding – i.e., parsing the raw HTML of a webpage, with applications to automation of web-based tasks, crawling, and browser-assisted retrieval – have not been fully explored. We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks: (i) Semantic Classification of HTML elements, (ii) Description Generation for HTML inputs, and (iii) Autonomous Web Navigation of HTML pages. While previous work has developed dedicated architectures and training procedures for HTML understanding, we show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks. For instance, fine-tuned LLMs are 12% more accurate at semantic classification compared to models trained exclusively on the task dataset. Moreover, when fine-tuned on data from the MiniWoB benchmark, LLMs successfully complete 50% more tasks using 192x less data compared to the previous best supervised model. Out of the LLMs we evaluate, we show evidence that T5-based models are ideal due to their bidirectional encoder-decoder architecture. To promote further research on LLMs for HTML understanding, we create and open-source a large-scale HTML dataset distilled and auto-labeled from CommonCrawl.¹

1 INTRODUCTION

Web crawling (Olston et al., 2010), form-filling (Diaz et al., 2013; Gur et al., 2021), or information retrieving web agents (Nogueira & Cho, 2016) are important for both automating and assisting users in web-based tasks. These and similar applications rely on models that can search for specific content or controls on a web page as well as navigate a website autonomously. Since a web page in its raw form is represented as an HTML-based text sequence, the success of models for web-based tasks relies on their ability to understand HTML semantics, structure, and embedded interactions.

The predominant approach to web automation and HTML understanding is to train specialized models, i.e., gathering application-specific datasets and designing neural network (NN) architectures to leverage inductive biases of the HTML’s structure; see, e.g., Liu et al. (2018); Toyama et al. (2021); Gur et al. (2021); Humphreys et al. (2022). However, both dataset collection and neural architecture design are expensive, time-consuming, and require highly-specialized, domain-specific knowledge.

Meanwhile, in the natural language processing (NLP) literature, large language models (LLMs) have emerged as a solution to the difficulties of dataset collection and specialized NN design (Kaplan et al., 2020; Bommasani et al., 2021). A popular paradigm in NLP is to take an off-the-shelf LLM – pretrained on a large text corpus via an unsupervised and task-agnostic learning objective – and either fine-tune or prompt the LLM on a small task-specific dataset. This paradigm has shown exceptional performance on a variety of NLP tasks (Xue et al., 2020; Brown et al., 2020; Austin et al., 2021). Whether LLMs can be applied to HTML understanding – especially given the much larger context and sequence lengths – remains an under-explored question.

¹ See visualizations of the results at https://sites.google.com/view/llm4html/home.

arXiv:2210.03945v2 [cs.LG] 19 May 2023

[Figure 1a: example HTML page drawn as a tree (html > body > form). One div holds the labels "Enter Email Address" and "Enter Password:"; a second div ends with a span reading "Please enter your password."]

[Figure 1b: LLM architecture diagram. The example input snippet shown in the diagram is:]

<div><label class="form-label" for="uName">Email Address</label><label class="form-label" for="pass">Enter Password: </label></div><div><input type="email" id="uName" target><input type="password" id="pass"><span class="hidden">Please enter your password.</span></div>

Figure 1: a) HTML example page with a highlighted salient element, an element of interest (dashed box). All canonical tasks evaluate a distinct interaction with this element, either by classifying it as one of a set of categories, generating a text description of its purpose, or applying an action as part of a sequential navigation of a multi-page website. b) LLM architectures overview. Dashed boxes denote sub-modules that are specific to either encoder-only or encoder-decoder models. For encoder-only models, we add an extra classification layer. Decoder-only models (not in the diagram) are similar to encoder-decoder models; the main difference is that the HTML snippet is fed to the decoder and processed from left to right.

In this paper, we investigate whether LLMs can be applied to HTML understanding to produce better-performing, more sample-efficient HTML understanding models without the need for custom NN architecture design. To that end, we present a suite of three benchmarking tasks for HTML understanding that capture the essence of these applications and require understanding both structure and content. First, we devise Semantic Classification as a task that requires a model to classify a given HTML element into one of a set of categories, such as address, email, or password, with application to automated form-filling. Second, we present Description Generation, a label-extraction task where a model is given an HTML snippet and is asked to produce a natural language description. For instance, for an email field, the description might be “Please enter your email address.” Note that in the majority of web pages, this connection between input elements and description content is only implicit in the raw HTML code, and inferring such links is a prerequisite for higher-level navigation objectives. The third task is Autonomous Web Navigation (Shi et al., 2017). A model is presented with an HTML page paired with a natural language command and must apply appropriate actions on a sequence of HTML pages to satisfy the command. See Figure 1a for a simplified example of these tasks.

With these benchmark tasks in hand, we evaluate the transfer capabilities of a variety of pretrained LLMs (Table 1), varying in architecture (encoder-only, encoder-decoder, or decoder-only), model size (from 24.6M to 62B parameters), and training data corpora (both including and excluding pretraining NLP and HTML corpus). While prior work universally pre-parses the HTML as input to the model (Gur et al., 2021; Liu et al., 2018; Nakano et al., 2021), ours – to the best of our knowledge – is the first work that uses raw, unprocessed HTML. Our results show that LLMs demonstrate a remarkable level of HTML understanding across all tasks, with up to 192× more sample-efficiency than models trained from scratch, and achieving a new SoTA for supervised learning on the MiniWoB benchmark suite (Shi et al., 2017). The encoder-decoder architectures with bi-directional attention show the best performance across the board even when their pretraining does not include HTML. In addition, we show that the performance scales sub-linearly with the model size.

The broader objective of this research is to advance the integration of LLMs with autonomous web agents. It has only been in the last year that researchers have begun to utilize LLMs outside of NLP and integrate them as core capabilities in autonomy (Lu et al., 2021; Ahn et al., 2022). In this context, LLMs are reasoning engines for sequential decision making agents interacting with environments.

The present work is the first in the research literature to embed an LLM and train it as an agent for autonomous web navigation. This requires new implementations to adapt LLM training for behavior cloning in addition to designing interfaces for integrating text generation into a perception-compute-action cycle operating in a stateful web environment. Our implementation allows us to answer new questions regarding trade-offs among various model characteristics.

We believe these contributions expand the scope of language models and connect their unique capabilities with autonomous agents for the web. We provide a new perspective on machine learning for HTML understanding and web automation, showing that pretrained LLMs can achieve significant performance on such tasks, reducing the need for specialized architectures and training protocols. To encourage further research in this direction, we open-sourced² model weights for agents used in the WoB environment and our dataset for description generation.

2 RELATED WORK

HTML Understanding. Autonomous web navigation has been a popular application for neural network models, and a variety of works propose simulated websites for training web-based agents, with application to task fulfillment (Yao et al., 2022; Gur et al., 2021; Burns et al., 2022; Mazumder & Riva, 2020; Shi et al., 2017; Liu et al., 2018) as well as information retrieval or question-answering (Adolphs et al., 2021; Nogueira & Cho, 2016). Simulated websites provide an easy way to evaluate models online, and for this reason we use the existing MiniWoB benchmark (Shi et al., 2017) for our web navigation setting. However, it is still important to have a mechanism for evaluating models on a wide variety of real-world websites. This was the key motivation for generating our own dataset for the description generation task, which is distilled and auto-labeled from CommonCrawl and is a key contribution of our paper.

Alongside these benchmarks, many works have developed models for web navigation and related subtasks (Pasupat et al., 2018; Bommasani et al., 2021; He et al., 2021; Gur et al., 2021; Humphreys et al., 2022; Liu et al., 2018; Jia et al., 2019). These works often rely on specialized neural network architectures that leverage inductive biases of HTML structure, or on preprocessing of HTML to make it easier to input to a model (Li et al., 2021a;b). In contrast, our work takes a minimalist approach, providing HTML in text form with minimal processing and using widely-adopted transformer networks.

LLMs and HTML. Works that explore the intersection of LLMs and HTML generally fall into two categories. The first category uses LLMs to assist web navigation (Nakano et al., 2021; Yao et al., 2022), and typically relies on custom preprocessing to map the context and structure of a web page to natural language, thus severely restricting what HTML pages the model can parse. The second category pretrains LLMs on large corpora of HTML text (Aghajanyan et al., 2021). However, these works typically restrict the model evaluation to standard NLP tasks, e.g., summarization and question answering, as opposed to tasks more relevant to HTML understanding and web automation. Our work can be thought of as the reverse: we keep the pretraining of LLMs unchanged and focus on the mechanisms for transferring the pretrained LLMs to HTML-relevant tasks.

3 BRIEF BACKGROUND ON HTML AS SEMI-STRUCTURED TEXT DATA

HTML is a markup language, used to organize web page structure and content. Consider the example HTML page in Figure 1a. This web page includes two adjacent input elements, one for e-mail and another for password, with their corresponding labels on a separate branch of the page. These inputs and labels are among the many possible elements that serve as HTML building blocks. Each element has a set of attributes – key-value pairs – that describe the element’s content, such as style and human-readable text. When rendered in a browser, these attributes will be responsible for how the element is shown and where it is positioned. In the example in Figure 1a, the first input has three attributes, tag="input", type="email", and id="uName", that identify the element as an email input with an identifier (“uName”) that can be accessed programmatically.
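To make the key-value view of attributes concrete, the following minimal sketch reads the attributes of the input elements with Python's standard html.parser module. The SNIPPET string is an assumption, adapted from the example in Figure 1, and the class and function names are hypothetical; this illustrates the data format only and is not the preprocessing used in our experiments.

```python
from html.parser import HTMLParser

# Assumed snippet, adapted from the two inputs in the Figure 1 example.
SNIPPET = (
    '<div><input type="email" id="uName">'
    '<input type="password" id="pass"></div>'
)


class AttributeCollector(HTMLParser):
    """Records (tag, attributes) for every start tag encountered."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (key, value) tuples; store it as a dict.
        self.elements.append((tag, dict(attrs)))


def collect_elements(html):
    """Return the list of (tag, attribute dict) pairs found in html."""
    parser = AttributeCollector()
    parser.feed(html)
    parser.close()
    return parser.elements


elements = collect_elements(SNIPPET)
# Attributes of the first input element.
first_input = next(attrs for tag, attrs in elements if tag == "input")
print(first_input)  # {'type': 'email', 'id': 'uName'}
```

Note that the tag name is reported separately from the attribute dictionary here, whereas the text above lists it alongside the attributes.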

² https://console.cloud.google.com/storage/browser/gresearch/webllm

Task                       Dataset                                         Size  Model Input  Model Architecture  Model Output  Task Output
Autonomous Web Navigation  MiniWoB Demos (Shi et al., 2017)                12K   Page         Enc-Dec, Dec        Text          Dictionary
Semantic Classification    Annotated Shopping Webpages (Gur et al., 2021)  28K   Snippet      All                 Text          Category
Description Generation     CommonCrawl (new)                               85K   Snippet      Enc-Dec, Dec        Text          Text

Table 1: Task, dataset, and model summary. All models receive raw HTML. Autonomous Web Navigation receives the entire HTML, while the other tasks receive HTML snippets extracted given a salient element.

4 CANONICAL TASKS FOR HTML UNDERSTANDING

We devise three canonical tasks to study HTML understanding capabilities of LLM-based web agents. These tasks require correctly interpreting both structure and content to varying degrees to make predictions, with autonomous navigation being the most challenging capability of the three.

Autonomous Web Navigation. This task evaluates how well a model navigates multi-page websites as a sequential decision-making problem (Shi et al., 2017; Liu et al., 2018). At the beginning of an episode, the agent is given a natural language instruction, e.g., Enter the username “lyda” and the password “N22t” into the text fields and press login. The agent applies actions to a sequence of HTML pages, where each action is of the form function(selector, text). The function is one of click or type, selector is an integer pointer that uniquely identifies an element, and text is a text to input if the type functionality is activated. An episode terminates when either the page reaches a terminal state (e.g., the ‘sign in’ button is clicked) or the maximum number of steps is reached.
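The action format and the two termination conditions can be sketched as follows. This is a minimal illustration under stated assumptions: the Action class, the run_episode helper, the selector values 1 to 3, and the terminal test are all hypothetical and do not reproduce the MiniWoB interface.

```python
from dataclasses import dataclass
from typing import Optional

VALID_FUNCTIONS = ("click", "type")


@dataclass(frozen=True)
class Action:
    """One action of the form function(selector, text)."""

    function: str               # "click" or "type"
    selector: int               # integer pointer to a unique element
    text: Optional[str] = None  # only used by "type" actions

    def __post_init__(self):
        if self.function not in VALID_FUNCTIONS:
            raise ValueError(f"unknown function: {self.function!r}")
        if self.function == "type" and self.text is None:
            raise ValueError("type actions need a text to input")


def run_episode(actions, is_terminal, max_steps):
    """Apply actions until a terminal action or the step limit; return steps taken."""
    steps = 0
    for action in actions:
        if steps >= max_steps:
            break
        steps += 1
        if is_terminal(action):
            break
    return steps


# Hypothetical selectors: 1 = username field, 2 = password field, 3 = login button.
login_actions = [
    Action("type", 1, "lyda"),
    Action("type", 2, "N22t"),
    Action("click", 3),
]


def pressed_login(action):
    return action.function == "click" and action.selector == 3


steps_full = run_episode(login_actions, pressed_login, max_steps=10)
steps_capped = run_episode(login_actions, pressed_login, max_steps=2)
print(steps_full, steps_capped)  # 3 2
```

In the first call the episode ends because the terminal action is applied; in the second it ends because the step limit is reached before the login button is clicked.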

Semantic Classification. Many HTML understanding applications require a model that can classify HTML elements into standardized categories. For example, in automated form-filling (Diaz et al., 2013; Gur et al., 2021), it is useful to identify a ‘submit button’ across many websites (e.g., shopping, flight booking, utility application) with various button representations (e.g., position, color, or text). Thus, we formulate Semantic Classification as classifying elements into role categories. Take the example HTML in Figure 1a which includes two input elements and a submit button. Let’s pick the first input as an element of interest to be classified by the system, also called a salient element. The system should classify this element as username, since it appears on a login page and it has a label with Email Address which is typically associated with the username in form-filling applications. To solve this, the system can aggregate information from multiple sources in the page – the label that says Enter Email Address, the input attributes (type=“email” and id=“uName”), or even the ordering of other elements in the page such as ‘password’ and ‘sign in’.

Description Generation. Motivated by applications in accessibility-minded web browser control (Jorgensen & Binsted, 2005), we formulate description generation as an extractive problem where the goal is to locate the textual description of an element in the HTML and generate it as output. For instance, the description of the salient element in Figure 1a is Enter Email Address; when rendered, this label will appear above the ‘email’ input field. HTML provides a large amount of flexibility, and so in general a descriptive text that appears alongside a specific element when rendered can be very far from that element when looking at the HTML plaintext. Thus, this task evaluates a model’s ability to understand the structure of HTML as it would appear to a user, despite not having access to the rendered web page directly.
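As a point of contrast, the sketch below shows the easy case in which the link between a label and its input is explicit, through a for attribute that matches the input's id. It is a rule-based illustration only, not the approach studied in this paper: the PAGE string is an assumption adapted from Figure 1, the names LabelFinder and describe are hypothetical, and the rule returns nothing once the link is merely implicit, which is the case the task is meant to probe.

```python
from html.parser import HTMLParser

# Assumed page, adapted from the Figure 1 example: labels and inputs
# live in separate div branches and are linked by for/id.
PAGE = (
    '<div><label class="form-label" for="uName">Enter Email Address</label>'
    '<label class="form-label" for="pass">Enter Password: </label></div>'
    '<div><input type="email" id="uName">'
    '<input type="password" id="pass"></div>'
)


class LabelFinder(HTMLParser):
    """Maps each label's 'for' attribute to the label's text."""

    def __init__(self):
        super().__init__()
        self.labels = {}
        self._current_for = None  # 'for' value of the label being read
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "label":
            self._current_for = dict(attrs).get("for")
            self._buffer = []

    def handle_data(self, data):
        # Only collect text while inside a label that has a 'for' attribute.
        if self._current_for is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "label" and self._current_for is not None:
            self.labels[self._current_for] = "".join(self._buffer).strip()
            self._current_for = None


def describe(html, element_id):
    """Return the text of the label linked to element_id, or None."""
    finder = LabelFinder()
    finder.feed(html)
    finder.close()
    return finder.labels.get(element_id)


email_description = describe(PAGE, "uName")
print(email_description)  # Enter Email Address
```

A page that places the description in an unlinked element, or far from the input in the plaintext, defeats this rule, which is why the task is framed as a test of structural understanding.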

5 DATASETS

Each of our canonical tasks requires a separate dataset, with the description generation task using a newly contributed, auto-labelled dataset based on CommonCrawl.

Autonomous Web Navigation. We use the 12K demonstrations included in the publicly available MiniWoB benchmark (Shi et al., 2017), which encompass 62 website applications ranging from email forwarding to social media interactions. Each demonstration is a sequence of (instruction, HTML, action) tuples. Every element in a MiniWoB demonstration is accompanied by a reference number unique within its respective pages. This number can be used as an element selector, making the action space unified across all tasks and time steps. For instance, the action in Figure 1a would be
