Jointly Reinforced User Simulator and Task-oriented Dialog System with
Simplified Generative Architecture
Hong Liu and Zhijian Ou
Speech Processing and Machine Intelligence (SPMI) Lab, Tsinghua University, Beijing, China
Yi Huang and Junlan Feng
China Mobile Research Institute, Beijing, China
Abstract
Recently, there has been progress in supervised fine-tuning of pretrained GPT-2 to build end-to-end task-oriented dialog (TOD) systems. However, online reinforcement learning of a GPT-2 based dialog system (DS), together with an end-to-end user simulator (US), has not been explored before. Moreover, a drawback of existing GPT-2 based TOD systems is that they mostly employ the whole dialog history as input, which brings inefficiencies in memory and computation. In this paper, we first propose Simplified Generative Architectures (SGA) for DS and US respectively, both based on GPT-2 but using shortened history. Then, we successfully develop the Jointly Reinforced US and DS, called SGA-JRUD. Our DS with the proposed SGA, when only supervised trained, achieves state-of-the-art performance on MultiWOZ2.1 and is more compute-efficient in both training and generation. Extensive experiments on MultiWOZ2.1 further show the superiority of SGA-JRUD in both offline and online evaluations.
1 Introduction
Task-oriented dialog (TOD) systems, which are mainly designed to assist users to accomplish their goals, often consist of several modules including dialog state tracking (DST), database querying (DB), dialog policy (DP) and natural language generation (NLG). The information flow in a task-oriented dialog is illustrated in Figure 1. Recent studies recast these modules all as conditional generation of tokens and integrate them into a single language model (LM), usually with a pretrained LM such as GPT-2 (Radford et al., 2019) as the backbone. Fine-tuning GPT-2 over annotated dialog datasets such as MultiWOZ (Budzianowski et al., 2018) via supervised learning (SL) has shown state-of-the-art results (Hosseini-Asl et al., 2020; Li et al., 2020; Kulhánek et al., 2021; Yang et al., 2021), thanks to the powerful generation ability of GPT-2.

Figure 1: The information flow in a task-oriented dialog. Square brackets denote special tokens in GPT-2.
However, it has long been recognized that supervised learning over annotated dialog datasets alone may not be sufficient to learn a task-oriented dialog agent (Young et al., 2013). Conversations often do not have a single correct response; multiple responses can be appropriate for the same dialog context (Zhang et al., 2020). Supervised trained agents can thus become biased by the annotations. Reinforcement learning (RL) aims at goal-directed learning from interaction between the decision-making agent and its environment (Sutton and Barto, 2018) and is a natural choice for learning task-oriented dialog policies, where the user is modeled as the interactive environment. Offline RL optimizes the policy from the fixed annotated dataset without online environment interaction (Zhou et al., 2017; Jeon and Lee, 2022) but only partially exploits the power of RL. Online RL requires interaction with real humans or user simulators during training. However, building a good user simulator is as challenging as designing a dialog agent, either rule based (Schatzmann et al.,
2007) or data driven (Gür et al., 2018; Kreyssig et al., 2018; Tseng et al., 2021). There have also been efforts to jointly optimize an end-to-end dialog system (DS) and user simulator (US), but most are based on traditional architectures using LSTM seq2seq networks (Liu and Lane, 2017b; Papangelis et al., 2019; Tseng et al., 2021).
Inspired by the recent progress in fine-tuning pretrained LMs such as GPT-2 to develop end-to-end trainable DS, in this paper we are first interested in building a GPT-2 based end-to-end trainable US for online RL of DS, which has not been explored before. Further, how to develop jointly optimized GPT-2 based DS and US in the RL framework is unclear and requires new design of model architectures. Regarding this, we aim to develop a Jointly Reinforced User simulator and task-oriented Dialog system (JRUD), leveraging the recent progress of using pretrained LMs such as GPT-2 as the backbone.
To be clear, GPT-2 (Radford et al., 2019) in this paper refers to the particular class of causal LM, which computes conditional probabilities for next-token generation via a self-attention based Transformer neural network (Vaswani et al., 2017). The basic idea in finetuning pretrained GPT-2 to build the dialog agent is to utilize the generation ability empowered by the finetuned causal LM. Given a particular form of conditional model, $p(\text{output} \mid \text{input})$, where $\text{input}$ and $\text{output}$ are token sequences, the GPT-2 LM can be finetuned over training samples $(\text{input}, \text{output})$ (often referred to as training sequences (Hosseini-Asl et al., 2020)), and after finetuning, the model can be used for generation, i.e., generating output after receiving input.
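To make this setup concrete, below is a minimal sketch of such conditional fine-tuning, assuming the Hugging Face transformers library; this is not the authors' released code. The bracketed special tokens, the helper name training_step and the hyperparameters are illustrative assumptions. The key point is that input and output are concatenated into one training sequence and the language-model loss is masked on the input positions, so that only the conditional likelihood of the output is optimized.

```python
# Illustrative sketch of conditional fine-tuning of GPT-2 on (input, output)
# training sequences; not the paper's actual code.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(input_text, output_text):
    # Build one training sequence: input tokens followed by output tokens.
    input_ids = tokenizer.encode(input_text)
    output_ids = tokenizer.encode(" " + output_text) + [tokenizer.eos_token_id]
    ids = torch.tensor([input_ids + output_ids])
    # Mask the input positions (-100 is ignored by the LM loss), so the loss
    # is the negative conditional log-likelihood of the output given the input.
    labels = ids.clone()
    labels[:, :len(input_ids)] = -100
    loss = model(ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Hypothetical (input, output) pair with made-up special tokens:
loss = training_step("[user] i need a cheap hotel [belief]",
                     "[hotel] pricerange cheap")
```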
A limitation of previous GPT-2 based DS methods, e.g., SimpleTOD (Hosseini-Asl et al., 2020), SOLOIST (Li et al., 2020), AuGPT (Kulhánek et al., 2021) and UBAR (Yang et al., 2021), is that the whole history is used as the input at each turn. This significantly increases the memory and computation cost in both training and generation. Moreover, using the whole history may burden the model with redundant information and hurt the training efficiency. To address the aforementioned limitation and to facilitate the development of JRUD, we propose Simplified Generative Architectures (SGA) for DS and US respectively, both based on GPT-2 but using shortened history.
The main contributions of this work can be summarised as follows:

• Our DS with the proposed SGA, called SGA-DS, when only supervised trained, achieves state-of-the-art performance on MultiWOZ2.1 (Eric et al., 2020) and is more compute-efficient in both training and generation.

• To the best of our knowledge, our US with the proposed SGA, called SGA-US, represents the first GPT-2 based end-to-end trainable US, which can be trained via SL or RL.

• Based on the proposed DS and US, we successfully develop an RL framework, called SGA-JRUD, for building a jointly reinforced user simulator and dialog system, which interact and are trained via online RL to significantly improve the performance of the TOD system, as shown in extensive experiments on MultiWOZ2.1.
2 Related Work
End-to-end TOD systems
The methodology for building TOD systems is gradually advancing from separate training of individual modules (Mrkšić et al., 2017; Wen et al., 2017a) to the end-to-end (E2E) trainable approach (Wen et al., 2017b; Liu and Lane, 2017a; Lei et al., 2018). Recent studies have exploited large-scale pre-trained language models such as GPT-2 for building end-to-end TOD systems, e.g., SimpleTOD (Hosseini-Asl et al., 2020), SOLOIST (Li et al., 2020), AuGPT (Kulhánek et al., 2021) and UBAR (Yang et al., 2021). While existing GPT-2 based TOD systems achieve improved performance, these models mostly employ the whole dialog history as input during training and generation, which brings inefficiencies in computation, memory and learning. It is shown in Sec. 4.7 that earlier history beyond the previous turn is in fact weakly attended to in next-token generation. In contrast, the simplified architecture proposed in our SGA-DS only uses the belief state and system response of the previous turn for generating the response in the current turn.
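As a rough illustration of this difference in conditioning, the sketch below contrasts a whole-history input with an SGA-style shortened-history input. The bracketed tokens and function names are assumptions for illustration and do not reproduce the exact serialization used by any of the cited systems.

```python
# Illustrative context construction (token layout is assumed, not the papers' exact format).

def whole_history_context(past_turns, user_utt):
    """SimpleTOD/UBAR-style input: all previous user utterances and responses."""
    history = " ".join(f"[user] {t['user']} [resp] {t['resp']}" for t in past_turns)
    return f"{history} [user] {user_utt}"

def sga_ds_context(prev_belief, prev_resp, user_utt):
    """SGA-DS-style input: only b_{t-1}, r_{t-1} and the current user utterance u_t."""
    return f"[belief] {prev_belief} [resp] {prev_resp} [user] {user_utt}"
```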
RL in TOD systems and user simulators
Reinforcement learning, which aims to train an agent towards maximizing long-term cumulative rewards from interactions between the agent and its environment, can be divided into two classes, offline and online (Sutton and Barto, 2018). Both classes have been applied in TOD systems. Offline RL only optimizes the dialog agent over fixed collected data and thus avoids building user simulators (Zhou et al., 2017; Zhao et al., 2019; Jeon and Lee, 2022). Online RL, instead, needs to design a user simulator (US) and let the dialog agent interact with the user simulator (acting as the environment) to generate new dialogs, over which the dialog agent can be further optimized. A variety of user simulators have been studied, either rule based or data driven. A typical example of a rule based US is the agenda-based user simulator (ABUS) (Schatzmann et al., 2007). In the data driven approach, USs are trained from data using different architectures, e.g., GRU seq2seq (Gür et al., 2018) and LSTM seq2seq (Kreyssig et al., 2018). In this paper, motivated by the recent success of GPT-2 based DS, we propose a new GPT-2 based US and further design its simplified generative architecture.
Joint training of DS and US
There have been some studies to jointly optimize end-to-end DS and US, but most are based on traditional architectures using LSTM seq2seq networks (Liu and Lane, 2017b; Papangelis et al., 2019; Tseng et al., 2021). Earlier studies use a template-based NLG module for both DS and US (Liu and Lane, 2017b) and work in a single domain such as DSTC2 (Liu and Lane, 2017b; Papangelis et al., 2019). Progress has been made towards neural network based generation and multi-domain settings (Tseng et al., 2021). Different RL algorithms have been attempted, such as policy gradient and actor-critic in (Liu and Lane, 2017b) and Q-learning in (Papangelis et al., 2019). This paper represents a further advance in designing GPT-2 based DS and US with new simplified architectures.
3 Method
In the following, we first introduce the background, then the simplified generative architectures (SGA) proposed for the dialog system (DS) and user simulator (US), and finally describe the jointly reinforced method.
3.1 Background
Notations
According to the information flow in a task-oriented dialog as illustrated in Figure 1, we let $g_t$ denote the user goal state, $ua_t$ the user act, $u_t$ the user utterance, $b_t$ the belief state, $db_t$ the database result, $a_t$ the system act, and $r_t$ the delexicalized response, respectively, at turn $t = 1, \cdots, T$, for a dialog of $T$ turns. In this work, all these variables are tokenized into token sequences, following recent studies (Zhang et al., 2020; Yang et al., 2021). $\oplus$ denotes the concatenation of sequences, as in $u_t \oplus r_t$. $|u_t|$ denotes the length of $u_t$ in tokens. $\{u, r\}_t$ is a shorthand for $u_t, r_t$, and $\{u, r\}_{1:t}$ represents $\{u, r\}_1, \cdots, \{u, r\}_t$.

Figure 2: The proposed Simplified Generative Architectures (SGAs) for DS and US, shown in (a) and (b) respectively, as compared to SimpleTOD-DS (c) and UBAR-DS (d). Yellow boxes represent the conditioning input of the model during generation, and green boxes the target output. The figure also reveals differences between our SGA models and the other two models. During supervised training, our SGA models are trained by maximizing the conditional likelihood of the output given the input, while the other two models in fact maximize the joint likelihood over both input and output. Further, our SGA models fit naturally into the RL framework for DS and US respectively, while the other two models do not (see Sec. 3.3 for details).
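For concreteness, one turn under this notation might look as follows; all values and the bracketed token format are invented for illustration and are not taken from the MultiWOZ annotations.

```python
# A hypothetical dialog turn in the above notation (values are invented).
turn_t = {
    "g_t":  "[hotel] [inform] pricerange cheap area north [request] phone",  # user goal state
    "ua_t": "[hotel] [inform] pricerange cheap area north",                  # user act
    "u_t":  "i am looking for a cheap hotel in the north",                   # user utterance
    "b_t":  "[hotel] pricerange cheap area north",                           # belief state
    "db_t": "[db_2]",                                                        # database result
    "a_t":  "[hotel] [request] parking",                                     # system act
    "r_t":  "do you need free parking at the hotel ?",                       # delexicalized response
}
# Each value is further split by the GPT-2 tokenizer into a token sequence,
# and {u, r}_t is shorthand for the pair (u_t, r_t).
```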
Dialog system (DS)
The main task for DS is, for each dialog turn $t$, to generate (or say, predict)$^1$ $b_t$, $a_t$ and $r_t$, given $u_t$ and the dialog history $u_1, r_1, \cdots, u_{t-1}, r_{t-1}$. A recent progress in building DS is that all variables are represented by token sequences, and the workflow of a dialog system (belief state tracking, action and response generation) can be unified into a single sequence generation problem, which can be accomplished by a causal language model (Hosseini-Asl et al., 2020; Yang et al., 2021). Particularly, pretrained

$^1$ Note that the database result $db_t$ is deterministically obtained by querying the database using the predicted $b_t$. We omit $db_t$ in the discussion for simplicity.
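The unified sequence-generation workflow described above (belief state tracking, database query, then act and response generation) can be sketched for one turn as follows. This is a schematic sketch only: the prompt layout, bracketed tokens and the helpers generate_until and query_database are assumptions for illustration, and the context shown follows the shortened-history form of SGA-DS rather than any system's exact specification.

```python
# Schematic single-turn DS workflow as causal-LM generation (illustrative only).
# `model` / `tokenizer` are assumed to be a fine-tuned GPT2LMHeadModel and its
# GPT2Tokenizer from the transformers library, with the bracketed special tokens
# added to the tokenizer's vocabulary.

def generate_until(model, tokenizer, prompt, stop_token):
    # Greedy decoding until an (assumed) special stop token.
    ids = tokenizer.encode(prompt, return_tensors="pt")
    out = model.generate(ids, max_new_tokens=80,
                         eos_token_id=tokenizer.convert_tokens_to_ids(stop_token))
    return tokenizer.decode(out[0, ids.shape[1]:]).strip()

def ds_turn(model, tokenizer, prev_belief, prev_resp, user_utt, query_database):
    # Shortened-history context: b_{t-1}, r_{t-1} and the current u_t.
    context = f"[belief] {prev_belief} [resp] {prev_resp} [user] {user_utt}"
    # 1) Generate the belief state b_t.
    belief = generate_until(model, tokenizer, context + " [belief]", "[db]")
    # 2) Query the database deterministically with the predicted b_t.
    db_result = query_database(belief)
    # 3) Generate the system act a_t and the delexicalized response r_t.
    prompt = f"{context} [belief] {belief} [db] {db_result} [act]"
    act_and_resp = generate_until(model, tokenizer, prompt, "[eos_r]")
    return belief, db_result, act_and_resp
```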