
such as deletion or insertion after a token (Awasthi et al., 2019; Malmi et al., 2019). In contrast, alternative approaches (Gupta et al., 2019) train the agent to explicitly generate free-form edit actions and to iteratively reconstruct the text by interacting with an environment that alters the text according to these actions. Such sequence-level action generation (Branavan et al., 2009; Guu et al., 2017; Elgohary et al., 2021) permits far more flexible action designs that are not limited to token-level operations, and is advantageous given the narrowed problem space and the dynamic context during editing (Shi et al., 2020).
Figure 1 contrasts the sequence-tagging and sequence-generation mechanisms with end-to-end generation. Both methods support multiple rounds of sequence refinement (Ge et al., 2018; Liu et al., 2021) and imitation learning (IL) (Pomerleau, 1991), in which an agent learns from the demonstrations of an expert policy and later imitates the memorized behavior to act independently (Schaal, 1996). On the one hand, IL for sequence tagging is in essence standard supervised learning and has therefore attracted significant interest and been widely adopted (Agrawal et al., 2021; Yao et al., 2021; Agrawal and Carpuat, 2022), achieving good results in the token-level action generation setting (Gu et al., 2019; Reid and Zhong, 2021). On the other hand, IL for sequence-level action generation is less well defined, even though its principle has been followed in text editing (Shi et al., 2020) and many other tasks (Chen et al., 2021). A major obstacle is that training operates on state-action demonstrations in which the encodings of states and actions can differ greatly (Gu et al., 2018). For instance, the mismatch in the length dimension between states and actions makes auto-regressive modeling, which benefits from a single, uniform representation, tricky to implement.
To tackle the issues above, we reformulate
text editing as an imitation game controlled by a
Markov Decision Process (MDP). To begin with,
we define the input sequence as the initial state, the
required operations as action sequences, and the
output target sequence as the goal state. A learning agent needs to imitate an expert policy, respond to observed states with actions, and interact with the environment until the editing eventually succeeds.
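For intuition, the following minimal Python sketch pictures one such episode; the environment class and the fixed-length keep/delete/insert action encoding are illustrative assumptions for exposition rather than the exact design used in our experiments.

from dataclasses import dataclass, field

# Illustrative fixed-length action format (an assumption, not the exact
# design): one edit symbol per source token, where "K" keeps the token,
# "D" deletes it, and ("I", tok) keeps it and inserts `tok` after it.
def apply_edit(tokens, action):
    out = []
    for tok, op in zip(tokens, action):
        if op == "K":
            out.append(tok)
        elif isinstance(op, tuple) and op[0] == "I":
            out.extend([tok, op[1]])
        # "D": drop the token
    return out

@dataclass
class EditEnv:
    """Minimal MDP view of text editing: the state is the current token
    sequence, an action rewrites it, and the episode ends once the state
    matches the goal sequence."""
    state: list
    goal: list
    history: list = field(default_factory=list)

    def step(self, action):
        self.state = apply_edit(self.state, action)
        self.history.append(action)
        return self.state, self.state == self.goal

# e.g. restoring a missing operator, in the spirit of the AOR benchmark
env = EditEnv(state="1 2 = 3".split(), goal="1 + 2 = 3".split())
_, done = env.step([("I", "+"), "K", "K", "K"])
assert done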
To convert existing input-output data into state-action pairs, we utilize trajectory generation (TG), which leverages dynamic programming (DP) to efficiently search for the minimum edit operations under a predefined edit metric; we then backtrace the explored editing paths and automatically express the operations as action sequences.
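As a rough sketch of this translation (assuming a token-level Levenshtein metric; the actual edit metric and action encoding are configurable), the DP table can be backtraced into a minimal edit script:

def trajectory(src, tgt):
    """Backtrace a Levenshtein DP table to express a minimal edit script
    from `src` to `tgt` as (op, position, token) actions. Illustrative
    only; the operation names are not the exact action encoding."""
    n, m = len(src), len(tgt)
    # dp[i][j] = minimal number of operations turning src[:i] into tgt[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete src[i-1]
                           dp[i][j - 1] + 1,         # insert tgt[j-1]
                           dp[i - 1][j - 1] + cost)  # keep / substitute
    # backtrace from the bottom-right corner to collect the operations
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (src[i - 1] != tgt[j - 1]):
            if src[i - 1] != tgt[j - 1]:
                ops.append(("SUB", i - 1, tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("DEL", i - 1, None))
            i -= 1
        else:
            ops.append(("INS", i, tgt[j - 1]))
            j -= 1
    return ops[::-1]

# e.g. correcting "1 * 2 = 3" into "1 + 2 = 3" needs one substitution
print(trajectory("1 * 2 = 3".split(), "1 + 2 = 3".split()))
# [('SUB', 1, '+')]

Any minimum-cost path is acceptable; the backtrace above prefers diagonal moves, yielding keep/substitute operations where possible.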
Regarding the length misalignment, we first exploit the flexibility of sequence-level actions to fix all actions to the same length. Second, we employ a linear layer after the encoder to transform the length dimension of the context matrix into the action length. On this basis, we introduce a dual decoders (D2) structure that not only parallelizes decoding but also retains the ability to capture interdependencies among action tokens.
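To illustrate the length transformation alone, the following PyTorch sketch assumes a fixed source length and omits the rest of the D2 decoder; the module and dimension names are ours for exposition.

import torch
import torch.nn as nn

class LengthProjector(nn.Module):
    """Map an encoder context of shape (batch, src_len, hidden) to
    (batch, act_len, hidden) with a linear layer over the length
    dimension, so a fixed-length action sequence can be decoded in
    parallel. Sketch only; the full D2 structure adds more on top."""

    def __init__(self, src_len: int, act_len: int):
        super().__init__()
        self.proj = nn.Linear(src_len, act_len)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        x = context.transpose(1, 2)   # (batch, hidden, src_len)
        x = self.proj(x)              # (batch, hidden, act_len)
        return x.transpose(1, 2)      # (batch, act_len, hidden)

# e.g. project a length-9 source context onto 4 action positions
ctx = torch.randn(2, 9, 256)
print(LengthProjector(9, 4)(ctx).shape)  # torch.Size([2, 4, 256])

Because the action length is fixed in advance, all action positions can then be decoded in parallel from the projected context.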
Taking a further step, we propose trajectory augmentation (TA) as a remedy for the distribution shift problem from which most IL methods suffer (Ross et al., 2011).
Through a suite of three Arithmetic Equation (AE) benchmarks (Shi et al., 2020), namely Arithmetic Operators Restoration (AOR), Arithmetic Equation Simplification (AES), and Arithmetic Equation Correction (AEC), we confirm the superiority of our learning paradigm. In particular, D2 consistently outperforms standard autoregressive models in terms of performance, efficiency, and robustness.
In theory, our methods also apply to other imitation learning scenarios in which a reward function is available to further improve the agent. In this work, we primarily focus on a proof of concept of our learning paradigm, realized as supervised behavior cloning (BC) in the context of text editing. To this end, our contributions¹ are as follows:
1. We frame text editing as an imitation game formally defined as an MDP, allowing the highest degree of flexibility to design actions at the sequence level.
2. We employ TG to translate input-output data into state-action demonstrations for IL.
3. We introduce D2, a novel non-autoregressive decoder, boosting the learning in terms of accuracy, efficiency, and robustness.
4. We propose a corresponding TA technique to mitigate the distribution shift from which IL often suffers.
2 Imitation Game
We aim to cast text editing as an imitation game by defining the task as recurrent sequence generation, as presented in Figure 2(a). In this section, we describe the major components of our proposal: (1) the problem definition, (2) the data translation, (3) the model structure, and (4) a solution to distribution shift.
¹Code and data are publicly available on GitHub.