Text Editing as Imitation Game
Ning Shi, Bin Tang, Bo Yuan, Longtao Huang, Yewen Pu, Jie Fu, Zhouhan Lin
Alberta Machine Intelligence Institute, Dept. of Computing Science, University of Alberta
Alibaba Group; Shanghai Jiao Tong University; Autodesk Research; Beijing Academy of Artificial Intelligence
ning.shi@ualberta.ca, {tangbin.tang, qiufu.yb, kaiyang.hlt}@alibaba-inc.com
yewen.pu@autodesk.com, fujie@baai.ac.cn, lin.zhouhan@gmail.com
Work was done at Alibaba Group. Zhouhan Lin is the corresponding author.
Abstract
Text editing, such as grammatical error correction, arises naturally from imperfect textual data. Recent works frame text editing as a multi-round sequence tagging task, where operations – such as insertion and substitution – are represented as a sequence of tags. While achieving good results, this encoding is limited in flexibility as all actions are bound to token-level tags. In this work, we reformulate text editing as an imitation game using behavioral cloning. Specifically, we convert conventional sequence-to-sequence data into state-to-action demonstrations, where the action space can be as flexible as needed. Instead of generating the actions one at a time, we introduce a dual decoders structure to parallelize the decoding while retaining the dependencies between action tokens, coupled with trajectory augmentation to alleviate the distribution shift that imitation learning often suffers from. In experiments on a suite of Arithmetic Equation benchmarks, our model consistently outperforms the autoregressive baselines in terms of performance, efficiency, and robustness. We hope our findings will shed light on future studies in reinforcement learning applying sequence-level action generation to natural language processing.
1 Introduction
Text editing (Malmi et al., 2022) is an important domain of processing tasks that edit text in a localized fashion, applying to text simplification (Agrawal et al., 2021), grammatical error correction (Li et al., 2022), and punctuation restoration (Shi et al., 2021), to name a few. The neural sequence-to-sequence (seq2seq) framework (Sutskever et al., 2014) has established itself as the primary approach to text editing tasks by framing the problem as machine translation (Wu et al., 2016).
Figure 1: Three approaches – sequence tagging (left), end-to-end (middle), sequence generation (right) – to turn an invalid arithmetic expression "1 1 2" into a valid one, "1 + 1 = 2". In end-to-end, the entire string "1 1 2" is encoded into a latent state, from which the string "1 + 1 = 2" is generated directly. In sequence tagging, a localized action (such as "INSERT_+", meaning insert a "+" symbol after this token) is applied/tagged to each token; these token-level actions are then executed, modifying the input string. In contrast, sequence generation outputs an entire action sequence, generating the location (rather than tagging it), and the action sequence is executed, modifying the input string. Both token-level actions and sequence-level actions can be applied multiple times to polish the text further (up to a fixed point).
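To make the two encodings concrete, the following Python sketch executes the token-level tags and the sequence-level actions from Figure 1 on "1 1 2". It is not the paper's implementation; the helper functions and the interpretation of POS_k as "insert before index k" are illustrative assumptions inferred from the figures.

```python
# Minimal sketch contrasting token-level tags and sequence-level actions on the
# Figure 1 example. Tag and action names mirror the figures; everything else is assumed.

def apply_token_tags(tokens, tags):
    """Sequence tagging: one tag per input token."""
    out = []
    for token, tag in zip(tokens, tags):
        if tag == "KEEP":
            out.append(token)
        elif tag.startswith("INSERT_"):            # e.g. INSERT_+ keeps the token, then inserts "+"
            out.extend([token, tag.split("_", 1)[1]])
    return out

def apply_sequence_action(tokens, action):
    """Sequence generation: one free-form triple such as [INSERT, POS_1, +]."""
    op, pos, symbol = action
    index = int(pos.split("_")[1])                 # POS_1 -> insert before index 1, i.e. after the first token
    if op == "INSERT":
        return tokens[:index] + [symbol] + tokens[index:]
    return tokens

tokens = ["1", "1", "2"]
print(apply_token_tags(tokens, ["INSERT_+", "INSERT_=", "KEEP"]))    # ['1', '+', '1', '=', '2']
step1 = apply_sequence_action(tokens, ["INSERT", "POS_1", "+"])      # first round: '1 + 1 2'
print(apply_sequence_action(step1, ["INSERT", "POS_3", "="]))        # second round: ['1', '+', '1', '=', '2']
```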
Applying seq2seq modeling has the advantage of simplicity: the system can be built simply by giving input-output pairs consisting of the pathological sequences to be edited and the desired output sequences, without much manual processing effort (Junczys-Dowmunt et al., 2018).
However, even with a copy mechanism (See et al., 2017; Zhao et al., 2019; Panthaplackel et al., 2021), an end-to-end model can struggle to carry out localized, specific fixes while keeping the rest of the sequence intact. Thus, sequence tagging is often found more appropriate when outputs highly overlap with inputs (Dong et al., 2019; Mallinson et al., 2020; Stahlberg and Kumar, 2020). In such cases, a neural model predicts a tag sequence – representing localized fixes such as insertion and substitution – and a programmatic interpreter carries these edit operations through. Here, each tag represents a token-level action and determines the operation on its attached token (Kohita et al., 2020). A model can avoid modifying the overlap by assigning a no-op (e.g., KEEP), but the action space is limited to token-level modifications, such as deletion or insertion after a token (Awasthi et al., 2019; Malmi et al., 2019).
In contrast, alternative approaches (Gupta et al., 2019) train the agent to explicitly generate free-form edit actions and iteratively reconstruct the text while interacting with an environment capable of altering the text based on these actions. This sequence-level action generation (Branavan et al., 2009; Guu et al., 2017; Elgohary et al., 2021) allows higher flexibility in action design, not limited to token-level actions, and is more advantageous given the narrowed problem space and dynamic context in the edit (Shi et al., 2020).
The mechanisms of sequence tagging and sequence generation, against end-to-end, are exemplified in Figure 1. Both methods allow multiple rounds of sequence refinement (Ge et al., 2018; Liu et al., 2021) and imitation learning (IL) (Pomerleau, 1991), in which an agent learns from the demonstrations of an expert policy and later imitates the memorized behavior to act independently (Schaal, 1996). On the one hand, IL in sequence tagging functions as standard supervised learning in nature and has thus attracted significant interest and been widely used recently (Agrawal et al., 2021; Yao et al., 2021; Agrawal and Carpuat, 2022), achieving good results in the token-level action generation setting (Gu et al., 2019; Reid and Zhong, 2021). On the other hand, IL in sequence-level action generation is less well defined, even though its principle has been followed in text editing (Shi et al., 2020) and many other domains (Chen et al., 2021). A major obstacle is that training relies on state-action demonstrations, where the encodings of states and actions can be very different (Gu et al., 2018). For instance, the mismatch in the length dimension between the state and the action makes it tricky to implement autoregressive modeling, which benefits from a single, uniform representation.
To tackle the issues above, we reformulate text editing as an imitation game controlled by a Markov Decision Process (MDP). To begin with, we define the input sequence as the initial state, the required operations as action sequences, and the target output sequence as the goal state. A learning agent needs to imitate an expert policy, respond to seen states with actions, and interact with the environment until the editing eventually succeeds. To convert existing input-output data into state-action pairs, we utilize trajectory generation (TG), which leverages dynamic programming (DP) to efficiently search for the minimum operations under a predefined edit metric. We backtrace the explored editing paths and automatically express the operations as action sequences. Regarding the length misalignment, we first take advantage of the flexibility at the sequence level to fix actions to a uniform length. Second, we employ a linear layer after the encoder to transform the length dimension of the context matrix into the action length. On this basis, we introduce a dual decoders (D2) structure that not only parallelizes the decoding but also retains the interdependencies among action tokens. Taking a further step, we propose trajectory augmentation (TA) as a solution to the distribution shift problem that most IL suffers from (Ross et al., 2011). Through a suite of three Arithmetic Equation (AE) benchmarks (Shi et al., 2020), namely Arithmetic Operators Restoration (AOR), Arithmetic Equation Simplification (AES), and Arithmetic Equation Correction (AEC), we confirm the superiority of our learning paradigm. In particular, D2 consistently exceeds standard autoregressive models in terms of performance, efficiency, and robustness.
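To make the length transformation described above concrete, here is a minimal PyTorch sketch that maps an encoder context of source length m to a fixed action length n with a linear layer over the length dimension. It assumes a fixed source length, and the module name, shapes, and sizes are illustrative; it does not reproduce the paper's D2 architecture.

```python
import torch
import torch.nn as nn

class LengthTransform(nn.Module):
    """Sketch: project the encoder context matrix from source length m to a fixed
    action length n, so a non-autoregressive decoder can emit all n action tokens
    in parallel. Sizes are illustrative."""

    def __init__(self, src_len: int, action_len: int):
        super().__init__()
        self.proj = nn.Linear(src_len, action_len)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, m, d) -> (batch, d, m) -> project m to n -> (batch, n, d)
        return self.proj(context.transpose(1, 2)).transpose(1, 2)

context = torch.randn(8, 12, 256)              # batch 8, source length 12, hidden size 256
aligned = LengthTransform(12, 3)(context)      # action length 3, e.g. [INSERT, POS_1, +]
print(aligned.shape)                           # torch.Size([8, 3, 256])
```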
In theory, our methods also apply to other imitation learning scenarios where a reward function exists to further guide the agent. In this work, we primarily focus on a proof of concept of our learning paradigm, landing at supervised behavior cloning (BC) in the context of text editing. To this end, our contributions¹ are as follows:
1. We frame text editing as an imitation game formally defined as an MDP, allowing the highest degree of flexibility to design actions at the sequence level.
2. We involve TG to translate input-output data into state-action demonstrations for IL.
3. We introduce D2, a novel non-autoregressive decoder, boosting the learning in terms of accuracy, efficiency, and robustness.
4. We propose a corresponding TA technique to mitigate the distribution shift IL often suffers from.
2 Imitation Game
We aim to cast text editing into an imitation game by defining the task as recurrent sequence generation, as presented in Figure 2(a). In this section, we describe the major components of our proposal, including (1) the problem definition, (2) the data translation, (3) the model structure, and (4) a solution to the distribution shift.
¹ Code and data are publicly available at GitHub.
Figure 2: (a) shows the imitation game of AOR. Considering input text $x$ as initial state $s_1$, the agent interacts with the environment to edit "1 1 2" into "1 + 1 = 2" via action $a_1$ to insert "+" at the first position and $a_2$ to insert "=" at the third position. After $a_3$, the agent stops editing and calls the environment to return $s_3$ as the output text $y$. Using the same example, (b) explains how to achieve the shifted state $s'_2$ by skipping action $a^*_1$ and doing $a'_2$. Here we update $a^*_2$ to $a'_2$ accordingly due to the previous skipping. The new state $s'_2$ was not in the expert demonstrations.
2.1 Behavior cloning
We tear a text editing task $\mathcal{X} \mapsto \mathcal{Y}$ into recurrent subtasks of sequence generation $\mathcal{S} \mapsto \mathcal{A}$, defined by an MDP tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{E}, \mathcal{R})$.
State $\mathcal{S}$ is a set of text sequences $s = (s_1, \ldots, s_m)$, whose tokens belong to the state vocabulary $\mathcal{V}_{\mathcal{S}}$. We think of a source sequence $x \in \mathcal{X}$ as the initial state $s_1$, its target sequence $y \in \mathcal{Y}$ as the goal state $s_T$, and every edited sequence in between as an intermediate state $s_t$. The path $x \mapsto y$ can be represented as a set of sequential states $s_1, \ldots, s_T$.
Action $\mathcal{A}$ is a set of action sequences $a = (a_1, \ldots, a_n)$, whose tokens belong to the action vocabulary $\mathcal{V}_{\mathcal{A}}$. In Figure 3, "INSERT", "POS_3", and "=" are three action tokens belonging to $\mathcal{V}_{\mathcal{A}}$. In contrast to token-level actions in sequence tagging, sequence-level ones set the editing free through varying edit metrics $E$ (e.g., Levenshtein distance), as long as $\mathcal{X} \overset{\mathcal{A},\,E}{\longmapsto} \mathcal{Y}$. The edit metric serves as an expert policy $\pi^*$ to demonstrate the path to the goal state. A better expert usually means better demonstrations and imitation results. Hence, depending on the task, a suitable $E$ is essential.
Transition matrix $\mathcal{P}$ models the probability $p$ that an action $a_t$ leads a state $s_t$ to the state $s_{t+1}$. We know $\forall s, a:\ p(s_{t+1} \mid s_t, a_t) = 1$ due to the nature of text editing, so we can omit $\mathcal{P}$.
Environment $\mathcal{E}$ responds to an action and updates the game state accordingly by $s_{t+1} = \mathcal{E}(s_t, a_t)$, with process control. For example, the environment can refuse to execute actions that fail verification and terminate the game once a maximum number of iterations has been consumed.
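As a rough illustration only, the sketch below implements such an environment for the [OP, POS, TOKEN] action format of Figure 2: it executes insertions, refuses actions that fail a simple range check, and terminates once DONE is received or a maximum number of turns is reached. The class name, verification rule, and return convention are assumptions, not the paper's implementation.

```python
class EditingEnvironment:
    """Sketch of the environment: apply an action to the current state,
    refuse actions that fail verification, and stop the game after a
    maximum number of turns. Action format follows Figure 2."""

    def __init__(self, max_turns: int = 10):
        self.max_turns = max_turns
        self.turns = 0

    def step(self, state: list, action: list) -> tuple:
        """Return the next state and a flag telling whether the game is over."""
        self.turns += 1
        op, pos, token = action
        if op == "DONE" or self.turns >= self.max_turns:
            return state, True                           # current state becomes the output text
        if op == "INSERT":
            index = int(pos.split("_")[1])
            if 0 <= index <= len(state):                 # verification: refuse out-of-range edits
                state = state[:index] + [token] + state[index:]
        return state, False
```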
Reward function $\mathcal{R}$ calculates a reward for each action. It is a major factor contributing to the success of reinforcement learning. In the scope of this paper, we focus on BC, the simplest form of IL, so we can also omit $\mathcal{R}$ and leave it for future work.
Algorithm 1 Trajectory Generation (TG)
Input: Initial state x, goal state y, environment ℰ, and edit metric E.
Output: Trajectories τ.
1: τ ← ∅
2: s ← x
3: ops ← DP(x, y, E)
4: for op ∈ ops do
5:     a ← Action(op)    ▷ Translate operation to action
6:     τ ← τ ∪ [(s, a)]
7:     s ← ℰ(s, a)
8: end for
9: τ ← τ ∪ [(s, a_T)]    ▷ Append goal state and output action
10: return τ
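The sketch below instantiates Algorithm 1 for the special case of an insertion-only edit metric, where the source is a subsequence of the target (as in AOR) and the DP reduces to a left-to-right scan. It reuses the EditingEnvironment sketch above; the operation-to-action translation is an illustrative assumption rather than the paper's general TG code.

```python
def trajectory_generation(x, y, env):
    """Sketch of Algorithm 1 (TG) for an insertion-only edit metric: emit an
    [INSERT, POS_i, token] action for every target token missing from the
    source, apply it through the environment, and close with DONE."""
    trajectory, state, i = [], list(x), 0        # i indexes the current state
    for token in y:
        if i < len(state) and state[i] == token:
            i += 1                               # tokens already match: no operation needed
            continue
        action = ["INSERT", f"POS_{i}", token]
        trajectory.append((list(state), action))
        state, _ = env.step(state, action)       # s <- E(s, a), as in line 7 of Algorithm 1
        i += 1
    trajectory.append((list(state), ["DONE", "DONE", "DONE"]))
    return trajectory

for s, a in trajectory_generation(["1", "1", "2"], ["1", "+", "1", "=", "2"], EditingEnvironment()):
    print(" ".join(s), "->", a)
# 1 1 2     -> ['INSERT', 'POS_1', '+']
# 1 + 1 2   -> ['INSERT', 'POS_3', '=']
# 1 + 1 = 2 -> ['DONE', 'DONE', 'DONE']
```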
The formulation turns out to be a simplified $\mathcal{M}_{BC} = (\mathcal{S}, \mathcal{A}, \mathcal{E})$. Interacting with the environment $\mathcal{E}$, we hope a trained agent is able to follow its learned policy $\pi: \mathcal{S} \mapsto \mathcal{A}$ and iteratively edit the initial state $s_1 = x$ into the goal state $s_T = y$.
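Putting the pieces together, a test-time rollout under this simplified $\mathcal{M}_{BC}$ might be sketched as follows, reusing the EditingEnvironment sketch above. Here `agent_policy` is a placeholder for whatever model implements the learned policy $\pi$; the loop illustrates the iterative editing described above, not the paper's exact inference code.

```python
def rollout(x, agent_policy, env):
    """Sketch of the imitation game at test time: starting from input text x,
    repeatedly ask the learned policy for an action sequence and let the
    environment apply it, until DONE is emitted or the game is terminated."""
    state, done = list(x), False
    while not done:
        action = agent_policy(state)       # e.g. ["INSERT", "POS_1", "+"]
        state, done = env.step(state, action)
    return state                           # final state is returned as the output text y
```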
2.2 Trajectory generation
A data set to learn $\mathcal{X} \mapsto \mathcal{Y}$ consists of input-output pairs. It is necessary to convert it into state-action ones so that an agent can mimic the expert policy $\pi^*: \mathcal{S} \mapsto \mathcal{A}$ via supervised learning. A detailed TG is described in Algorithm 1.
Treating a pre-defined edit metric $E$ as the expert policy $\pi^*$, we can leverage DP to efficiently find the minimum operations required to convert $x$ into $y$ in a left-to-right manner and backtrace this path to get the specific operations.
Operations are later expressed as a set of sequential actions $a^*_1, \ldots, a^*_T$. Here we utilize a special symbol DONE to mark the last action $a^*_T$, where $\forall a \in a^*_T,\ a = \mathrm{DONE}$. Once an agent performs $a^*_T$, the current state is returned by the environment as the final output.
Given $s_1 = x$, we attain the next state $s_2 = \mathcal{E}(s_1, a_1)$.