2007) or data-driven (Gür et al., 2018; Kreyssig et al., 2018; Tseng et al., 2021). There have also been efforts to jointly optimize an end-to-end dialog system (DS) and user simulator (US), but most are based on traditional architectures using LSTM-based seq2seq networks (Liu and Lane, 2017b; Papangelis et al., 2019; Tseng et al., 2021).
Inspired by the recent progress of finetuning pretrained LMs such as GPT-2 to develop end-to-end trainable DSs, in this paper we are first interested in building a GPT-2 based end-to-end trainable US for online RL of a DS, which has not been explored before. Further, how to develop jointly optimized GPT-2 based DS and US in an RL framework is unclear and requires new model architectures. To this end, we aim to develop a Jointly Reinforced User simulator and task-oriented Dialog system (JRUD), leveraging pretrained LMs such as GPT-2 as the backbone.
To be clear, GPT-2 (Radford et al., 2019) in this paper refers to the particular class of causal LMs, which compute conditional probabilities for next-token generation via a self-attention based Transformer network (Vaswani et al., 2017).
The basic idea in finetuning pretrained GPT-2 to build a dialog agent is to exploit the generation ability of the finetuned causal LM. Given a conditional model of the form $p(output|input)$, where $input$ and $output$ are token sequences, the GPT-2 LM can be finetuned on training samples $(input, output)$ (often referred to as training sequences (Hosseini-Asl et al., 2020)); after finetuning, the model can be used for generation, i.e., generating $output$ after receiving $input$.
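As a purely illustrative example, the sketch below finetunes GPT-2 on concatenated $(input, output)$ sequences with the Hugging Face transformers library; the dialog-style markers, the helper `build_training_sequence`, and the toy example are our own assumptions, not the exact training-sequence format of any of the systems discussed.

```python
# Minimal sketch: finetune GPT-2 as a conditional model p(output | input)
# by training a causal LM on concatenated (input, output) sequences.
# The markers and data below are illustrative assumptions only.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def build_training_sequence(inp: str, out: str) -> str:
    # One training sequence = input tokens followed by output tokens,
    # terminated by EOS so that generation knows where to stop.
    return inp + " " + out + tokenizer.eos_token

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
examples = [("[user] i need a cheap hotel", "[response] there are several cheap hotels .")]

model.train()
for inp, out in examples:
    enc = tokenizer(build_training_sequence(inp, out), return_tensors="pt")
    # Standard causal-LM objective: predict every next token of the sequence.
    loss = model(**enc, labels=enc["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# After finetuning, generation conditions on the input and produces the output.
model.eval()
prompt = tokenizer("[user] i need a cheap hotel", return_tensors="pt")
generated = model.generate(**prompt, max_new_tokens=40,
                           pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(generated[0]))
```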
A limitation of previous GPT-2 based DS methods, e.g., SimpleTOD (Hosseini-Asl et al., 2020), SOLOIST (Li et al., 2020), AuGPT (Kulhánek et al., 2021) and UBAR (Yang et al., 2021), is that the whole dialog history is used as the input at each turn. This significantly increases memory and computation costs in both training and generation. Moreover, using the whole history may burden the model with redundant information and hurt training efficiency. To address this limitation and to facilitate the development of JRUD, we propose Simplified Generative Architectures (SGA) for the DS and the US respectively, both based on GPT-2 but using a shortened history.
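To make the contrast with whole-history models concrete, the following sketch compares the two ways of assembling the per-turn input; the field names and ordering are illustrative assumptions rather than the exact SGA format.

```python
# Illustrative sketch of whole-history input (as in SimpleTOD/UBAR-style
# models) versus a shortened-history input in the spirit of SGA.
# Field names and ordering are assumptions for illustration only.

def whole_history_input(turns, t):
    # Concatenate everything observed up to turn t: all user utterances,
    # belief states, database results and system responses.
    pieces = []
    for turn in turns[:t]:
        pieces += [turn["user"], turn["belief"], turn["db"], turn["response"]]
    pieces.append(turns[t]["user"])
    return " ".join(pieces)

def shortened_history_input(turns, t):
    # SGA-style context: only the previous turn's belief state and system
    # response, plus the current user utterance.
    if t == 0:
        return turns[0]["user"]
    prev = turns[t - 1]
    return " ".join([prev["belief"], prev["response"], turns[t]["user"]])
```

Under the shortened scheme the context length stays roughly constant per turn instead of growing with the dialog, which is where the memory and computation savings come from.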
The main contributions of this work can be summarised as follows:
• Our DS with the proposed SGA, called SGA-DS, when trained only with supervised learning, achieves state-of-the-art performance on MultiWOZ2.1 (Eric et al., 2020) and is more compute-efficient in both training and generation.
• To the best of our knowledge, our US with the proposed SGA, called SGA-US, represents the first GPT-2 based end-to-end trainable US, which can be trained via SL or RL.
• Based on the proposed DS and US, we develop an RL framework, called SGA-JRUD, for building a jointly reinforced user simulator and dialog system, which can interact with each other and be trained via online RL to significantly improve the performance of the TOD system, as shown in extensive experiments on MultiWOZ2.1.
2 Related Work
End-to-end TOD systems
The methodology for building TOD systems has been gradually advancing from separate training of individual modules (Mrkšić et al., 2017; Wen et al., 2017a) to the end-to-end (E2E) trainable approach (Wen et al., 2017b; Liu and Lane, 2017a; Lei et al., 2018). Recent studies have exploited large-scale pretrained language models such as GPT-2 for building end-to-end TOD systems, e.g., SimpleTOD (Hosseini-Asl et al., 2020), SOLOIST (Li et al., 2020), AuGPT (Kulhánek et al., 2021) and UBAR (Yang et al., 2021).
While existing GPT-2 based TOD systems achieve improved performance, these models mostly employ the whole dialog history as input during training and generation, which brings inefficiencies in computation, memory and learning. It is shown in Sec. 4.7 that the history before the previous turn is in fact weakly attended to in next-token generation. In contrast, the simplified architecture proposed in our SGA-DS only uses the belief state and system response of the previous turn for generating the response in the current turn.
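One simple way to probe this, sketched below under the assumption that the position where the previous turn starts in the input is known, is to measure the attention mass that the last input position assigns to tokens from earlier turns, using the attention weights returned by GPT-2.

```python
# Sketch: how much attention does next-token generation put on tokens from
# history before the previous turn? Assumes the caller knows which input
# positions belong to that earlier history (early_span_end is hypothetical).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def early_history_attention_share(text, early_span_end):
    # early_span_end: index of the first token belonging to the previous turn.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_attentions=True)
    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    att = torch.stack(out.attentions).mean(dim=(0, 2))  # average layers and heads
    last_token_att = att[0, -1]                          # attention from the last position
    return last_token_att[:early_span_end].sum().item()
```

The returned value is the layer- and head-averaged share of attention going to history before the previous turn; Sec. 4.7 reports the actual analysis on the trained models.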
RL in TOD systems and user simulators
Reinforcement learning, which aims to train an agent to maximize long-term cumulative rewards from interactions between the agent and its environment, can be divided into two classes, offline and online (Sutton and Barto, 2018). Both classes have