Autoregressive Structured Prediction with Language Models

Tianyu Liu^ζ  Yuchen Eleanor Jiang^ζ  Nicholas Monath^γ  Ryan Cotterell^ζ  Mrinmaya Sachan^ζ
^ζ ETH Zürich   ^γ Google Research
{tianyu.liu,yuchen.jiang}@inf.ethz.ch
nmonath@google.com
{ryan.cotterell,mrinmaya.sachan}@inf.ethz.ch
Abstract

In recent years, NLP has moved towards the application of language models to a more diverse set of tasks. However, applying language models to structured prediction, e.g., predicting parse trees, taggings, and coreference chains, is not straightforward. Prior work on language model-based structured prediction typically flattens the target structure into a string to easily fit it into the language modeling framework. Such flattening limits the accessibility of structural information and can lead to inferior performance compared to approaches that overtly model the structure. In this work, we propose to construct a conditional language model over sequences of structure-building actions, rather than over strings, in a way that makes it easier for the model to pick up on intra-structure dependencies. Our method sets the new state of the art on named entity recognition, end-to-end relation extraction, and coreference resolution.

https://github.com/lyutyuh/ASP
1 Introduction

Many common NLP tasks, e.g., named entity recognition, relation extraction, and coreference resolution, are naturally taxonomized as structured prediction, the supervised machine-learning task of predicting a structure from a large set.¹ To generalize well to held-out data in a structured prediction problem, the received wisdom has been that it is necessary to correctly model complex dependencies between different pieces of the structure. However, a recent trend in structured prediction for language has been to forgo explicitly modeling such dependencies (Ma and Hovy, 2016; Lee et al., 2017; He et al., 2017, inter alia) and, instead, to apply an expressive black-box model, e.g., a neural network, with the hope that the model picks up on the dependencies without explicit instruction.

¹ Typically, large means exponential in the size of the input.
Framing structured prediction as conditional language modeling is an increasingly common black-box technique for building structured predictors that has led to empirical success (Vinyals et al., 2015; Raffel et al., 2020; Athiwaratkun et al., 2020; De Cao et al., 2021; Paolini et al., 2021, inter alia). The idea behind the framework is to encode the target structure as a string, flattening out the structure. Then, one uses a conditional language model to predict the flattened string encoding the structure. For instance, Vinyals et al. (2015) flatten parse trees into strings and predict the strings encoding the flattened trees from the sentence with a machine translation architecture. The hope is that the autoregressive nature of the language model allows it to learn to model the intra-structure dependencies and the necessary hard constraints that ensure the model even produces well-formed structures. Additionally, many modelers make use of pre-trained language models (Lewis et al., 2020; Raffel et al., 2020) to further improve the language models.
However, despite their empirical success, simply hoping that a black-box approach correctly models intricate intra-structure dependencies is often insufficient for highly structured tasks (Paolini et al., 2021, §1). Indeed, the act of flattening a structured object into a string makes properly modeling the intra-structure dependencies harder for many tasks, e.g., those that involve nested spans or long-distance dependencies. For instance, in coreference resolution, a coreference link between two mentions can stretch across thousands of words, and a coreference chain can also contain over a hundred mentions (Pradhan et al., 2012). Flattening such a large amount of structured information into a string makes the task more difficult to model.
In this paper, we propose a simple framework that augments a conditional language model with explicit modeling of structure. Instead of modeling strings that encode a flattened representation of the target structure, we model a constrained set of actions that build the target structure step by step; see Fig. 1 for an example of our proposed framework.
[Figure 1 here; the panels show the input "US President Joe Biden took office in 2021. Previously, he was the senator of Delaware." together with the target action sequences for ERE and COREF.]

Figure 1: Illustration of the target outputs of our framework on coreference resolution (COREF) and end-to-end relation extraction (ERE). The lower part illustrates the decoding process of our model. The actions $y_i$ are color-coded as ], [*, and copy. The structure random variables $z_i$ are presented along with coreference links or relation links. We present words in the copy cells merely as an illustration.
Training a conditional language model to predict structure-building actions exposes the structure in a way that allows the model to pick up on the intra-structure dependencies more easily while still allowing the modeler to leverage pre-trained language models. We conduct experiments on three structured prediction tasks: named entity recognition, end-to-end relation extraction, and coreference resolution. On each task, we achieve state-of-the-art results without relying on data augmentation or task-specific feature engineering.
2 Autoregressive Structured Prediction

In this section, we describe our proposed approach, which we refer to as autoregressive structured prediction (ASP). Unlike previous approaches for structured prediction based on conditional language modeling, we represent structures as sequences of actions, which build pieces of the target structure step by step. For instance, in the task of coreference resolution, the actions build spans, i.e., contiguous sequences of tokens, as well as the relations between the spans. We give an example in Fig. 1.
2.1 Representing Structures with Actions

While our approach to structured prediction, ASP, is quite general, our paper narrowly focuses on modeling structures that are expressible as a set of dependent spans, and we couch the technical exposition in terms of modeling spans and relationships among spans. Our goal is to predict an action sequence $\mathbf{y} = y_1, \ldots, y_N$, where each action $y_n$ is chosen from an action space $\mathcal{Y}_n$. In this work, we take $\mathcal{Y}_n$ to be factored, i.e., $\mathcal{Y}_n \overset{\mathrm{def}}{=} \mathcal{A} \times \mathcal{B}_n \times \mathcal{Z}_n$, where $\mathcal{A}$ is a set of structure-building actions, $\mathcal{B}_n$ is the set of bracket-pairing actions, and $\mathcal{Z}_n$ is a set of span-labeling actions. Thus, each $y_n$ may be expressed as a triple, i.e., $y_n = \langle a_n, b_n, z_n \rangle$. We discuss each set in a separate paragraph below.
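For concreteness, the factored action $y_n = \langle a_n, b_n, z_n \rangle$ can be rendered as a small Python structure. This is purely an illustrative sketch; the field names are ours and are not taken from the released implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Action:
    """One factored action y_n = <a_n, b_n, z_n> (hypothetical field names)."""
    a: str                                         # structure-building action: "]", "[*", or "copy"
    b: Optional[int] = None                        # bracket-pairing choice from B_n (index of a "[*")
    z: Optional[Tuple[Optional[int], str]] = None  # span-labeling choice from Z_n (antecedent, label)
```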
Structure-Building Actions. We first define a set of structure-building actions $\mathcal{A} = \{\, ]\,,\ [^{*},\ \texttt{copy} \,\}$ that allow us to encode the span structure of a text, e.g., [* Delaware ] in Fig. 1 encodes that Delaware is a span of interest. More technically, the action ] refers to a right bracket that marks the right-most part of a span. The action [* refers to a left bracket that marks the left-most part of a span. The superscript * on [ is inspired by the Kleene star and indicates that it is a placeholder for 0 or more consecutive left brackets.² Finally, copy refers to copying a word from the input document. To see how these actions come together to form a span, consider the subsequence in Fig. 1, [* Delaware ], which is generated from a sequence of structure-building actions [*, copy, and ].
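Reusing the Action sketch above, that subsequence would correspond to the following (illustrative) action list; the step indices are ours and serve only to show how the pieces fit together.

```python
delaware_span = [
    Action(a="[*"),      # open the span
    Action(a="copy"),    # copy the word "Delaware" from the input document
    Action(a="]", b=0),  # close the span, pairing with the "[*" emitted at step 0
]
```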
Bracket-Pairing Actions. Next, we develop the set of actions that allow the model to match left and right brackets; we term these bracket-pairing actions. The set of bracket-pairing actions consists of all previously constructed left brackets, i.e.,

$$\mathcal{B}_n = \{\, m \mid m < n,\ a_m = [^{*} \,\} \tag{1}$$

Thus, in general, $|\mathcal{B}_n|$ is $O(n)$. However, it is often the case that domain-specific knowledge can be used to prune $\mathcal{B}_n$. For instance, coreference mentions and named entities rarely cross sentence boundaries, which yields a linguistically motivated pruning strategy (Liu et al., 2022). Thus, in some cases, the cardinality of $\mathcal{B}_n$ can be significantly smaller. When we decode action sequences $\mathbf{y}$ into a structure, unpaired [* and ] can be removed, ensuring that the output of the model will not contain unpaired brackets.

² In our preliminary experiments, we observe unsatisfactory performance when the model has to generate consecutive left brackets. We leverage [* as an engineering workaround. We hypothesize that this phenomenon is due to the inability of transformers to recognize Dyck languages (Hahn, 2020; Hao et al., 2022).
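A minimal sketch of how $\mathcal{B}_n$ in Eq. (1) could be enumerated from the action history, again reusing the Action sketch above. The optional sentence-based pruning is our rendering of the strategy mentioned in the text and not necessarily the authors' exact rule.

```python
from typing import List, Optional

def bracket_pairing_candidates(history: List[Action],
                               sentence_id: Optional[List[int]] = None) -> List[int]:
    """Enumerate B_n for the current step n = len(history), as in Eq. (1):
    indices of previously emitted "[*" actions, optionally restricted to the
    current sentence (mentions rarely cross sentence boundaries)."""
    n = len(history)
    candidates = [m for m in range(n) if history[m].a == "[*"]
    if sentence_id is not None and n > 0:
        # keep only left brackets emitted in the same sentence as the latest action
        candidates = [m for m in candidates if sentence_id[m] == sentence_id[n - 1]]
    return candidates
```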
Span-Labeling Actions. Finally, we add additional symbols $z_n$ associated with each $y_n$ that encode a labeling of a single span or a relationship between two or more spans; see §2.3 for an example. We denote the set of all $z_n$ as

$$\mathcal{Z}_n = \{\, m \mid m < n,\ a_m = \,] \,\} \times \mathcal{L} \tag{2}$$

where $\{\, m \mid m < n,\ a_m = \,] \,\}$ is the set of previous spans, which allows the model to capture intra-span relationships, and $\mathcal{L}$ denotes the set of possible labelings of the current span and the relationship between the adjoined spans. In general, designing $\mathcal{Z}_n$ requires some task-specific knowledge in order to specify the label space. However, we contend it requires less effort than designing a flattened string output where different levels of structures may be intertwined (Paolini et al., 2021).
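Analogously, a sketch of how $\mathcal{Z}_n$ in Eq. (2) could be enumerated. Including a null antecedent (so that a span can be labeled without linking to a previous span) is our illustrative choice, not a detail taken from the paper.

```python
from itertools import product
from typing import List, Optional, Tuple

def span_labeling_candidates(history: List[Action],
                             labels: List[str]) -> List[Tuple[Optional[int], str]]:
    """Enumerate Z_n for step n = len(history), as in Eq. (2): indices of
    previously closed spans (their "]" actions) crossed with the label set L."""
    n = len(history)
    previous_spans = [m for m in range(n) if history[m].a == "]"]
    return list(product([None] + previous_spans, labels))
```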
2.2 Model Parameterization

Let $D = \mathbf{w}_1, \ldots, \mathbf{w}_K$ be an input document of $K$ sentences, where $\mathbf{w}_k$ denotes the $k^{\text{th}}$ sentence in $D$. We first convert the structure to be built on top of $D$ into an action sequence, which we denote as $\mathbf{y}$ where $y_n \in \mathcal{Y}_n$. Now, we model the sequence of actions $\mathbf{y}$ as a conditional language model

$$p_\theta(\mathbf{y} \mid D) = \prod_{n=1}^{N} p_\theta(y_n \mid \mathbf{y}_{<n}, D) \tag{3}$$

The log-likelihood of the model is then given by $\log p_\theta(\mathbf{y} \mid D) = \sum_{n=1}^{N} \log p_\theta(y_n \mid \mathbf{y}_{<n}, D)$. We model the local conditional probabilities $p_\theta(y_n \mid \mathbf{y}_{<n}, D)$ as a softmax over a dynamic set $\mathcal{Y}_n$ that changes as a function of the history $\mathbf{y}_{<n}$, i.e.,

$$p_\theta(y_n \mid \mathbf{y}_{<n}, D) = \frac{\exp s_\theta(y_n)}{\sum_{y'_n \in \mathcal{Y}_n} \exp s_\theta(y'_n)} \tag{4}$$

where $s_\theta$ is a parameterized score function; we discuss several specific instantiations of $s_\theta$ in §2.3. Finally, we note that the use of a dynamic vocabulary stands in contrast to most conditional language models, where the vocabulary is held constant across time steps, e.g., Sutskever et al.'s (2014) approach to machine translation.
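The following sketch shows the dynamic-vocabulary softmax of Eq. (4). The dot-product scorer stands in for $s_\theta$ and is an assumption made purely for illustration, since the actual instantiations of $s_\theta$ are task-specific (§2.3).

```python
import torch

def step_distribution(hidden: torch.Tensor,
                      candidate_reprs: torch.Tensor) -> torch.Tensor:
    """Eq. (4) as a sketch: normalize scores over the dynamic candidate set Y_n.

    hidden:          decoder state at step n, shape (d,)
    candidate_reprs: one representation per element of Y_n, shape (|Y_n|, d)
    """
    scores = candidate_reprs @ hidden     # stand-in for s_theta(y'_n), one score per candidate
    return torch.softmax(scores, dim=-1)  # normalize over Y_n only
```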
Greedy Decoding. We determine the approximate best sequence $\mathbf{y}^{*}$ using a greedy decoding strategy. At decoding step $n$, we compute

$$y^{*}_n = \operatorname*{argmax}_{y'_n} \; p_\theta(y'_n \mid \mathbf{y}_{<n}, D) \tag{5}$$

The chosen $y^{*}_n = \langle a^{*}_n, b^{*}_n, z^{*}_n \rangle$ will then be verbalized as a token as follows: if $a^{*}_n = \texttt{copy}$, then we copy the next token from the input that is not present in the output. Otherwise, if $a^{*}_n = [^{*}$ or $a^{*}_n = \,]$, we insert [* or ] into the output sequence, respectively. The verbalized token is then fed into the conditional language model at the next step. The decoding process terminates when the model copies a distinguished EOS symbol from the input. The end of the procedure yields an approximate argmax $\mathbf{y}^{*}$.
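The greedy loop of Eq. (5) can be sketched at a high level as follows. The model interface used here (candidates, score, is_eos) is hypothetical and does not correspond to the released ASP API; it only serves to make the control flow explicit.

```python
def greedy_decode(model, document, max_steps=2048):
    """High-level sketch of greedy decoding (Eq. (5)) with a dynamic action set."""
    history = []
    for _ in range(max_steps):
        cands = model.candidates(history, document)     # dynamic set Y_n at this step
        scores = model.score(history, document, cands)  # stand-in for s_theta per candidate
        best = max(zip(cands, scores), key=lambda cs: cs[1])[0]
        history.append(best)                            # verbalize and feed back at the next step
        if model.is_eos(best):                          # copied the distinguished EOS symbol
            break
    return history
```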
Computational Complexity. Eq. (4) can be computed quite efficiently using our framework, as the cardinality of $\mathcal{A}$ is $O(1)$, and the sizes of $\mathcal{B}_n$ and $\mathcal{Z}_n$ are both $O(n)$. A tighter analysis says the cardinalities of $\mathcal{B}_n$ and $\mathcal{Z}_n$ are roughly linear in the number of spans predicted. In practice, we have $n \ll |V|$, where $|V|$ is the size of the vocabulary, which is the step-wise complexity of Paolini et al. (2021). A quantitative analysis of the number of mentions in coreference can be found in App. B.
Generality. Despite our exposition's focus on tasks that involve assigning labels to spans or span pairs, our method is quite general. Indeed, almost any structured prediction task can be encoded by a series of structure-building actions. For tasks that involve labeling tuples of spans, e.g., semantic role labeling, which makes use of three-tuples that consist of the subject, predicate, and object, Eq. (2) can be easily extended with a new space of categorical variables $\{\, m \mid m < n,\ a_m = \,] \,\}$ to model the extra item.
2.3 Task-specific Parameterizations

We now demonstrate how to apply ASP to three language structured prediction tasks: named entity recognition, coreference resolution, and end-to-end relation extraction.

Named Entity Recognition. Named entity recognition is the task of labeling all mention spans $E = \{e_n\}_{n=1}^{|E|}$ in a document $D$ that refer to named entities.