2007) or data-driven (Gür et al., 2018; Kreyssig et al., 2018; Tseng et al., 2021). There have also been efforts to jointly optimize an end-to-end dialog system (DS) and user simulator (US), but most are based on traditional architectures using LSTM-based seq2seq networks (Liu and Lane, 2017b; Papangelis et al., 2019; Tseng et al., 2021).
Inspired by the recent progress of finetuning pretrained LMs such as GPT-2 to develop end-to-end trainable DSs, in this paper we are first interested in building a GPT-2 based end-to-end trainable US for online RL of a DS, which has not been explored before. Further, how to develop jointly optimized GPT-2 based DS and US in an RL framework is unclear and requires new model architectures. To this end, we aim to develop a Jointly Reinforced User simulator and task-oriented Dialog system (JRUD), leveraging pretrained LMs such as GPT-2 as the backbone.
To be clear, GPT-2 (Radford et al., 2019) in this paper refers to the particular class of causal LMs, which compute conditional probabilities for next-token generation via a self-attention based Transformer network (Vaswani et al., 2017).
The basic idea in finetuning pretrained GPT-2 to build a dialog agent is to exploit the generation ability of the finetuned causal LM. Given a conditional model of the form $p(output|input)$, where $input$ and $output$ are token sequences, the GPT-2 LM can be finetuned on training samples $(input, output)$ (often referred to as training sequences (Hosseini-Asl et al., 2020)); after finetuning, the model can be used for generation, i.e., generating $output$ after receiving $input$.
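As a purely illustrative example, the sketch below finetunes GPT-2 on concatenated $(input, output)$ sequences with the Hugging Face transformers library; the dialog-style markers, the helper `build_training_sequence`, and the toy example are our own assumptions, not the exact training-sequence format of any of the systems discussed.

```python
# Minimal sketch: finetune GPT-2 as a conditional model p(output | input)
# by training a causal LM on concatenated (input, output) sequences.
# The markers and data below are illustrative assumptions only.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def build_training_sequence(inp: str, out: str) -> str:
    # One training sequence = input tokens followed by output tokens,
    # terminated by EOS so that generation knows where to stop.
    return inp + " " + out + tokenizer.eos_token

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
examples = [("[user] i need a cheap hotel", "[response] there are several cheap hotels .")]

model.train()
for inp, out in examples:
    enc = tokenizer(build_training_sequence(inp, out), return_tensors="pt")
    # Standard causal-LM objective: predict every next token of the sequence.
    loss = model(**enc, labels=enc["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# After finetuning, generation conditions on the input and produces the output.
model.eval()
prompt = tokenizer("[user] i need a cheap hotel", return_tensors="pt")
generated = model.generate(**prompt, max_new_tokens=40,
                           pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(generated[0]))
```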
A limitation of previous GPT-2 based DS methods, e.g., SimpleTOD (Hosseini-Asl et al., 2020), SOLOIST (Li et al., 2020), AuGPT (Kulhánek et al., 2021) and UBAR (Yang et al., 2021), is that the whole dialog history is used as the input at each turn. This significantly increases memory and computation costs in both training and generation. Moreover, using the whole history may burden the model with redundant information and hurt training efficiency. To address this limitation and to facilitate the development of JRUD, we propose Simplified Generative Architectures (SGA) for the DS and the US respectively, both based on GPT-2 but using a shortened history.
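To make the contrast with whole-history models concrete, the following sketch compares the two ways of assembling the per-turn input; the field names and ordering are illustrative assumptions rather than the exact SGA format.

```python
# Illustrative sketch of whole-history input (as in SimpleTOD/UBAR-style
# models) versus a shortened-history input in the spirit of SGA.
# Field names and ordering are assumptions for illustration only.

def whole_history_input(turns, t):
    # Concatenate everything observed up to turn t: all user utterances,
    # belief states, database results and system responses.
    pieces = []
    for turn in turns[:t]:
        pieces += [turn["user"], turn["belief"], turn["db"], turn["response"]]
    pieces.append(turns[t]["user"])
    return " ".join(pieces)

def shortened_history_input(turns, t):
    # SGA-style context: only the previous turn's belief state and system
    # response, plus the current user utterance.
    if t == 0:
        return turns[0]["user"]
    prev = turns[t - 1]
    return " ".join([prev["belief"], prev["response"], turns[t]["user"]])
```

Under the shortened scheme the context length stays roughly constant per turn instead of growing with the dialog, which is where the memory and computation savings come from.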
The main contributions of this work can be summarised as follows:
• Our DS with the proposed SGA, called SGA-DS, when trained only with supervised learning, achieves state-of-the-art performance on MultiWOZ2.1 (Eric et al., 2020) and is more compute-efficient in both training and generation.
• To the best of our knowledge, our US with the proposed SGA, called SGA-US, represents the first GPT-2 based end-to-end trainable US, which can be trained via SL or RL.
• Based on the proposed DS and US, we develop an RL framework, called SGA-JRUD, for building a jointly reinforced user simulator and dialog system, which can interact with each other and be trained via online RL to significantly improve the performance of the TOD system, as shown in extensive experiments on MultiWOZ2.1.
2 Related Work
End-to-end TOD systems
The methodology for building TOD systems has been gradually advancing from separate training of individual modules (Mrkšić et al., 2017; Wen et al., 2017a) to the end-to-end (E2E) trainable approach (Wen et al., 2017b; Liu and Lane, 2017a; Lei et al., 2018). Recent studies have exploited large-scale pretrained language models such as GPT-2 for building end-to-end TOD systems, e.g., SimpleTOD (Hosseini-Asl et al., 2020), SOLOIST (Li et al., 2020), AuGPT (Kulhánek et al., 2021) and UBAR (Yang et al., 2021).
While existing GPT-2 based TOD systems achieve improved performance, these models mostly employ the whole dialog history as input during training and generation, which brings inefficiencies in computation, memory and learning. It is shown in Sec. 4.7 that the history before the previous turn is in fact weakly attended to in next-token generation. In contrast, the simplified architecture proposed in our SGA-DS only uses the belief state and system response of the previous turn for generating the response in the current turn.
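One simple way to probe this, sketched below under the assumption that the position where the previous turn starts in the input is known, is to measure the attention mass that the last input position assigns to tokens from earlier turns, using the attention weights returned by GPT-2.

```python
# Sketch: how much attention does next-token generation put on tokens from
# history before the previous turn? Assumes the caller knows which input
# positions belong to that earlier history (early_span_end is hypothetical).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def early_history_attention_share(text, early_span_end):
    # early_span_end: index of the first token belonging to the previous turn.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_attentions=True)
    # out.attentions: one (batch, heads, seq, seq) tensor per layer.
    att = torch.stack(out.attentions).mean(dim=(0, 2))  # average layers and heads
    last_token_att = att[0, -1]                          # attention from the last position
    return last_token_att[:early_span_end].sum().item()
```

The returned value is the layer- and head-averaged share of attention going to history before the previous turn; Sec. 4.7 reports the actual analysis on the trained models.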
RL in TOD systems and user simulators
Reinforcement learning, which aims to train an agent to maximize long-term cumulative rewards from interactions between the agent and its environment, can be divided into two classes, offline and online (Sutton and Barto, 2018). Both classes have