Autoregressive Structured Prediction with Language Models

Tianyu Liu^ζ  Yuchen Eleanor Jiang^ζ  Nicholas Monath^γ  Ryan Cotterell^ζ  Mrinmaya Sachan^ζ
^ζ ETH Zürich   ^γ Google Research
{tianyu.liu,yuchen.jiang}@inf.ethz.ch
nmonath@google.com
{ryan.cotterell,mrinmaya.sachan}@inf.ethz.ch
Abstract

In recent years, NLP has moved towards the application of language models to a more diverse set of tasks. However, applying language models to structured prediction, e.g., predicting parse trees, taggings, and coreference chains, is not straightforward. Prior work on language model-based structured prediction typically flattens the target structure into a string to easily fit it into the language modeling framework. Such flattening limits the accessibility of structural information and can lead to inferior performance compared to approaches that overtly model the structure. In this work, we propose to construct a conditional language model over sequences of structure-building actions, rather than over strings, in a way that makes it easier for the model to pick up on intra-structure dependencies. Our method sets the new state of the art on named entity recognition, end-to-end relation extraction, and coreference resolution.

https://github.com/lyutyuh/ASP
1 Introduction

Many common NLP tasks, e.g., named entity recognition, relation extraction, and coreference resolution, are naturally taxonomized as structured prediction, the supervised machine-learning task of predicting a structure from a large set.¹ To generalize well to held-out data in a structured prediction problem, the received wisdom has been that it is necessary to correctly model complex dependencies between different pieces of the structure. However, a recent trend in structured prediction for language has been to forgo explicitly modeling such dependencies (Ma and Hovy, 2016; Lee et al., 2017; He et al., 2017, inter alia) and, instead, to apply an expressive black-box model, e.g., a neural network, with the hope that the model picks up on the dependencies without explicit instruction.

¹ Typically, large means exponential in the size of the input.
Framing structured prediction as conditional language modeling is an increasingly common black-box technique for building structured predictors that has led to empirical success (Vinyals et al., 2015; Raffel et al., 2020; Athiwaratkun et al., 2020; De Cao et al., 2021; Paolini et al., 2021, inter alia). The idea behind the framework is to encode the target structure as a string, flattening out the structure. Then, one uses a conditional language model to predict the flattened string encoding the structure. For instance, Vinyals et al. (2015) flatten parse trees into strings and predict the strings encoding the flattened trees from the sentence with a machine translation architecture. The hope is that the autoregressive nature of the language model allows it to learn to model the intra-structure dependencies and the necessary hard constraints that ensure the model even produces well-formed structures. Additionally, many modelers make use of pre-trained language models (Lewis et al., 2020; Raffel et al., 2020) to further improve the language models.
However, despite their empirical success, simply hoping that a black-box approach correctly models intricate intra-structure dependencies is often insufficient for highly structured tasks (Paolini et al., 2021, §1). Indeed, the act of flattening a structured object into a string makes properly modeling the intra-structure dependencies harder for many tasks, e.g., those that involve nested spans or long-distance dependencies. For instance, in coreference resolution, a coreference link between two mentions can stretch across thousands of words, and a coreference chain can also contain over a hundred mentions (Pradhan et al., 2012). Flattening such a large amount of structured information into a string makes the task more difficult to model.
In this paper, we propose a simple framework that augments a conditional language model with explicit modeling of structure. Instead of modeling strings that encode a flattened representation of the target structure, we model a constrained set of actions that build the target structure step by step; see Fig. 1 for an example of our proposed framework.
[Figure 1 here; the panels show the input "US President Joe Biden took office in 2021. Previously, he was the senator of Delaware." together with the target action sequences for ERE and COREF.]

Figure 1: Illustration of the target outputs of our framework on coreference resolution (COREF) and end-to-end relation extraction (ERE). The lower part illustrates the decoding process of our model. The actions $y_i$ are color-coded as ], [*, and copy. The structure random variables $z_i$ are presented along with coreference links or relation links. We present words in the copy cells merely as an illustration.
Training a conditional language model to predict structure-building actions exposes the structure in a way that allows the model to pick up on the intra-structure dependencies more easily while still allowing the modeler to leverage pre-trained language models. We conduct experiments on three structured prediction tasks: named entity recognition, end-to-end relation extraction, and coreference resolution. On each task, we achieve state-of-the-art results without relying on data augmentation or task-specific feature engineering.
2 Autoregressive Structured Prediction

In this section, we describe our proposed approach, which we refer to as autoregressive structured prediction (ASP). Unlike previous approaches for structured prediction based on conditional language modeling, we represent structures as sequences of actions, which build pieces of the target structure step by step. For instance, in the task of coreference resolution, the actions build spans, i.e., contiguous sequences of tokens, as well as the relations between the spans. We give an example in Fig. 1.
2.1 Representing Structures with Actions

While our approach to structured prediction, ASP, is quite general, our paper narrowly focuses on modeling structures that are expressible as a set of dependent spans, and we couch the technical exposition in terms of modeling spans and relationships among spans. Our goal is to predict an action sequence $\mathbf{y} = y_1, \ldots, y_N$, where each action $y_n$ is chosen from an action space $\mathcal{Y}_n$. In this work, we take $\mathcal{Y}_n$ to be factored, i.e., $\mathcal{Y}_n \overset{\mathrm{def}}{=} \mathcal{A} \times \mathcal{B}_n \times \mathcal{Z}_n$, where $\mathcal{A}$ is a set of structure-building actions, $\mathcal{B}_n$ is the set of bracket-pairing actions, and $\mathcal{Z}_n$ is a set of span-labeling actions. Thus, each $y_n$ may be expressed as a triple, i.e., $y_n = \langle a_n, b_n, z_n \rangle$. We discuss each set in a separate paragraph below.
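For concreteness, the factored action $y_n = \langle a_n, b_n, z_n \rangle$ can be rendered as a small Python structure. This is purely an illustrative sketch; the field names are ours and are not taken from the released implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Action:
    """One factored action y_n = <a_n, b_n, z_n> (hypothetical field names)."""
    a: str                                         # structure-building action: "]", "[*", or "copy"
    b: Optional[int] = None                        # bracket-pairing choice from B_n (index of a "[*")
    z: Optional[Tuple[Optional[int], str]] = None  # span-labeling choice from Z_n (antecedent, label)
```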
Structure-Building Actions. We first define a set of structure-building actions $\mathcal{A} = \{\, ]\,,\ [^{*},\ \texttt{copy} \,\}$ that allow us to encode the span structure of a text, e.g., [* Delaware ] in Fig. 1 encodes that Delaware is a span of interest. More technically, the action ] refers to a right bracket that marks the right-most part of a span. The action [* refers to a left bracket that marks the left-most part of a span. The superscript * on [ is inspired by the Kleene star and indicates that it is a placeholder for 0 or more consecutive left brackets.² Finally, copy refers to copying a word from the input document. To see how these actions come together to form a span, consider the subsequence in Fig. 1, [* Delaware ], which is generated from a sequence of structure-building actions [*, copy, and ].
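Reusing the Action sketch above, that subsequence would correspond to the following (illustrative) action list; the step indices are ours and serve only to show how the pieces fit together.

```python
delaware_span = [
    Action(a="[*"),      # open the span
    Action(a="copy"),    # copy the word "Delaware" from the input document
    Action(a="]", b=0),  # close the span, pairing with the "[*" emitted at step 0
]
```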
Bracket-Pairing Actions. Next, we develop the set of actions that allow the model to match left and right brackets; we term these bracket-pairing actions. The set of bracket-pairing actions consists of all previously constructed left brackets, i.e.,

$$\mathcal{B}_n = \{\, m \mid m < n,\ a_m = [^{*} \,\} \tag{1}$$

Thus, in general, $|\mathcal{B}_n|$ is $O(n)$. However, it is often the case that domain-specific knowledge can be used to prune $\mathcal{B}_n$. For instance, coreference mentions and named entities rarely cross sentence boundaries, which yields a linguistically motivated pruning strategy (Liu et al., 2022). Thus, in some cases, the cardinality of $\mathcal{B}_n$ can be significantly smaller. When we decode action sequences $\mathbf{y}$ into a structure, unpaired [* and ] can be removed, ensuring that the output of the model will not contain unpaired brackets.

² In our preliminary experiments, we observe unsatisfactory performance when the model has to generate consecutive left brackets. We leverage [* as an engineering workaround. We hypothesize that this phenomenon is due to the inability of transformers to recognize Dyck languages (Hahn, 2020; Hao et al., 2022).
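A minimal sketch of how $\mathcal{B}_n$ in Eq. (1) could be enumerated from the action history, again reusing the Action sketch above. The optional sentence-based pruning is our rendering of the strategy mentioned in the text and not necessarily the authors' exact rule.

```python
from typing import List, Optional

def bracket_pairing_candidates(history: List[Action],
                               sentence_id: Optional[List[int]] = None) -> List[int]:
    """Enumerate B_n for the current step n = len(history), as in Eq. (1):
    indices of previously emitted "[*" actions, optionally restricted to the
    current sentence (mentions rarely cross sentence boundaries)."""
    n = len(history)
    candidates = [m for m in range(n) if history[m].a == "[*"]
    if sentence_id is not None and n > 0:
        # keep only left brackets emitted in the same sentence as the latest action
        candidates = [m for m in candidates if sentence_id[m] == sentence_id[n - 1]]
    return candidates
```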
Span-Labeling Actions. Finally, we add additional symbols $z_n$ associated with each $y_n$ that encode a labeling of a single span or a relationship between two or more spans; see §2.3 for an example. We denote the set of all $z_n$ as

$$\mathcal{Z}_n = \{\, m \mid m < n,\ a_m = \,] \,\} \times \mathcal{L} \tag{2}$$

where $\{\, m \mid m < n,\ a_m = \,] \,\}$ is the set of previous spans, which allows the model to capture intra-span relationships, and $\mathcal{L}$ denotes the set of possible labelings of the current span and the relationship between the adjoined spans. In general, designing $\mathcal{Z}_n$ requires some task-specific knowledge in order to specify the label space. However, we contend it requires less effort than designing a flattened string output where different levels of structures may be intertwined (Paolini et al., 2021).
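Analogously, a sketch of how $\mathcal{Z}_n$ in Eq. (2) could be enumerated. Including a null antecedent (so that a span can be labeled without linking to a previous span) is our illustrative choice, not a detail taken from the paper.

```python
from itertools import product
from typing import List, Optional, Tuple

def span_labeling_candidates(history: List[Action],
                             labels: List[str]) -> List[Tuple[Optional[int], str]]:
    """Enumerate Z_n for step n = len(history), as in Eq. (2): indices of
    previously closed spans (their "]" actions) crossed with the label set L."""
    n = len(history)
    previous_spans = [m for m in range(n) if history[m].a == "]"]
    return list(product([None] + previous_spans, labels))
```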
2.2 Model Parameterization

Let $D = \mathbf{w}_1, \ldots, \mathbf{w}_K$ be an input document of $K$ sentences, where $\mathbf{w}_k$ denotes the $k^{\text{th}}$ sentence in $D$. We first convert the structure to be built on top of $D$ into an action sequence, which we denote as $\mathbf{y}$ where $y_n \in \mathcal{Y}_n$. Now, we model the sequence of actions $\mathbf{y}$ as a conditional language model

$$p_\theta(\mathbf{y} \mid D) = \prod_{n=1}^{N} p_\theta(y_n \mid \mathbf{y}_{<n}, D) \tag{3}$$

The log-likelihood of the model is then given by $\log p_\theta(\mathbf{y} \mid D) = \sum_{n=1}^{N} \log p_\theta(y_n \mid \mathbf{y}_{<n}, D)$. We model the local conditional probabilities $p_\theta(y_n \mid \mathbf{y}_{<n}, D)$ as a softmax over a dynamic set $\mathcal{Y}_n$ that changes as a function of the history $\mathbf{y}_{<n}$, i.e.,

$$p_\theta(y_n \mid \mathbf{y}_{<n}, D) = \frac{\exp s_\theta(y_n)}{\sum_{y'_n \in \mathcal{Y}_n} \exp s_\theta(y'_n)} \tag{4}$$

where $s_\theta$ is a parameterized score function; we discuss several specific instantiations of $s_\theta$ in §2.3. Finally, we note that the use of a dynamic vocabulary stands in contrast to most conditional language models, where the vocabulary is held constant across time steps, e.g., Sutskever et al.'s (2014) approach to machine translation.
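The following sketch shows the dynamic-vocabulary softmax of Eq. (4). The dot-product scorer stands in for $s_\theta$ and is an assumption made purely for illustration, since the actual instantiations of $s_\theta$ are task-specific (§2.3).

```python
import torch

def step_distribution(hidden: torch.Tensor,
                      candidate_reprs: torch.Tensor) -> torch.Tensor:
    """Eq. (4) as a sketch: normalize scores over the dynamic candidate set Y_n.

    hidden:          decoder state at step n, shape (d,)
    candidate_reprs: one representation per element of Y_n, shape (|Y_n|, d)
    """
    scores = candidate_reprs @ hidden     # stand-in for s_theta(y'_n), one score per candidate
    return torch.softmax(scores, dim=-1)  # normalize over Y_n only
```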
Greedy Decoding. We determine the approximate best sequence $\mathbf{y}^{*}$ using a greedy decoding strategy. At decoding step $n$, we compute

$$y^{*}_n = \operatorname*{argmax}_{y'_n} \; p_\theta(y'_n \mid \mathbf{y}_{<n}, D) \tag{5}$$

The chosen $y^{*}_n = \langle a^{*}_n, b^{*}_n, z^{*}_n \rangle$ will then be verbalized as a token as follows: if $a^{*}_n = \texttt{copy}$, then we copy the next token from the input that is not present in the output. Otherwise, if $a^{*}_n = [^{*}$ or $a^{*}_n = \,]$, we insert [* or ] into the output sequence, respectively. The verbalized token is then fed into the conditional language model at the next step. The decoding process terminates when the model copies a distinguished EOS symbol from the input. The end of the procedure yields an approximate argmax $\mathbf{y}^{*}$.
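The greedy loop of Eq. (5) can be sketched at a high level as follows. The model interface used here (candidates, score, is_eos) is hypothetical and does not correspond to the released ASP API; it only serves to make the control flow explicit.

```python
def greedy_decode(model, document, max_steps=2048):
    """High-level sketch of greedy decoding (Eq. (5)) with a dynamic action set."""
    history = []
    for _ in range(max_steps):
        cands = model.candidates(history, document)     # dynamic set Y_n at this step
        scores = model.score(history, document, cands)  # stand-in for s_theta per candidate
        best = max(zip(cands, scores), key=lambda cs: cs[1])[0]
        history.append(best)                            # verbalize and feed back at the next step
        if model.is_eos(best):                          # copied the distinguished EOS symbol
            break
    return history
```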
Computational Complexity. Eq. (4) can be computed quite efficiently using our framework, as the cardinality of $\mathcal{A}$ is $O(1)$, and the sizes of $\mathcal{B}_n$ and $\mathcal{Z}_n$ are both $O(n)$. A tighter analysis says the cardinalities of $\mathcal{B}_n$ and $\mathcal{Z}_n$ are roughly linear in the number of spans predicted. In practice, we have $n \ll |V|$, where $|V|$ is the size of the vocabulary, which is the step-wise complexity of Paolini et al. (2021). A quantitative analysis of the number of mentions in coreference can be found in App. B.
Generality. Despite our exposition's focus on tasks that involve assigning labels to spans or span pairs, our method is quite general. Indeed, almost any structured prediction task can be encoded by a series of structure-building actions. For tasks that involve labeling tuples of spans, e.g., semantic role labeling, which makes use of three-tuples that consist of the subject, predicate, and object, Eq. (2) can be easily extended with a new space of categorical variables $\{\, m \mid m < n,\ a_m = \,] \,\}$ to model the extra item.
2.3 Task-specific Parameterizations

We now demonstrate how to apply ASP to three language structured prediction tasks: named entity recognition, coreference resolution, and end-to-end relation extraction.

Named Entity Recognition. Named entity recognition is the task of labeling all mention spans $E = \{e_n\}_{n=1}^{|E|}$ in a document $D$ that refer to named entities.