
Team Flow at DRC2022: Pipeline System for Travel Destination
Recommendation Task in Spoken Dialogue
Ryu Hirai1∗, Atsumoto Ohashi1∗, Ao Guo1, Hideki Shiroma1, Xulin Zhou1,
Yukihiko Tone1, Shinya Iizuka2 and Ryuichiro Higashinaka1
Abstract— The Dialogue Robot Competition 2022 (DRC2022) was held to improve the interactive capabilities of dialogue systems, e.g., their ability to adapt to different customers. As one of the participating teams, we built a dialogue system with a pipeline structure consisting of four modules. The natural language understanding (NLU) and natural language generation (NLG) modules were GPT-2-based models, and the dialogue state tracking (DST) and policy modules were designed on the basis of hand-crafted rules. After the preliminary round of the competition, we found that the limited performance of our system was mainly caused by the low variation in the training examples for the NLU module and by failed recommendations resulting from the rule-based policy.
I. INTRODUCTION
With the popularization of human-machine dialogue, dialogue systems are expected to achieve objectives in various situations, e.g., responding appropriately to different customers in a customer service task. To improve the interactive capabilities of dialogue systems, the Dialogue Robot Competition 2022 (DRC2022) [1] was held as a follow-up to the previous competition [2]. Each team was required to develop a dialogue system embedded in a humanoid robot to handle the “travel destination recommendation task.” In this task, the robot plays the role of a counter salesperson whose goal is to satisfy the customer by helping him/her choose one of two tourist attractions.
This paper reports the work of the team “Flow” in
DRC2022. Our dialogue system was built with a pipeline
composed of four modules: natural language understanding
(NLU), dialogue state tracking (DST), policy, and natural
language generation (NLG). Configuring the system as a pipeline (1) makes it easy to tune the functionality of each module and (2) allows us, in the future, to introduce a method such as [3] that integrates all modules and optimizes the dialogue performance of the entire system. We built the NLU and NLG modules by fine-tuning GPT-2 [4], a popular large-scale language model, on data we collected through crowdsourcing for the travel destination recommendation task. We further designed the DST and policy modules with hand-crafted rules.
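To make the NLU design concrete, the following minimal sketch (in Python, using the Hugging Face Transformers library) shows one way to treat dialogue-act prediction as text generation with a fine-tuned GPT-2; the checkpoint name, prompt format, and function name are illustrative assumptions, not our actual implementation.

# Minimal sketch: dialogue-act prediction as text generation with GPT-2.
# The "gpt2" checkpoint and the prompt format are placeholders for illustration.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def predict_dialogue_act(utterance: str) -> str:
    prompt = f"Utterance: {utterance}\nDialogue act:"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=32,
                             pad_token_id=tokenizer.eos_token_id)
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return decoded[len(prompt):].strip()

In practice, the model would first be fine-tuned on utterance-DA pairs so that it learns to emit the DA string following the prompt.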
After the preliminary round of the competition, we exam-
ined the evaluation results and dialogue histories. We found
two main reasons for the limited performance: (1) the NLU module was not able to identify the customer's dialogue acts properly due to the low variation in the training examples for GPT-2, and (2)
1 Graduate School of Informatics, Nagoya University, Japan
hirai.ryu.k6@s.mail.nagoya-u.ac.jp
2 School of Informatics, Nagoya University, Japan
∗ Equal contribution.
Fig. 1. Diagram of the pipeline structure of our spoken dialogue system. At each turn, the customer's speech recognition result obtained by automatic speech recognition (ASR) is processed by the NLU, DST, policy, and NLG modules to generate the system's response text, which is finally converted to speech by text-to-speech (TTS) to respond to the customer.
the rules of the policy resulted in a recommendation strategy that ignored the customer's preferences.
II. IMPLEMENTATION
Fig. 1 shows the pipeline structure of the spoken dialogue
system our team implemented. At each turn, the customer’s
speech input to the robot is converted into text by the
automatic speech recognition (ASR) module, and the utterance text is input to the NLU module. The NLU predicts the customer's dialogue act (DA), which is a semantic representation of the customer's utterance. The DST module then updates the dialogue state on the basis of the customer's DA. The dialogue state consists of information such as the history of DAs, the customer profile, and the belief state, which is a set of the customer's preferences regarding travel. The policy module decides the next action to be taken by the system, in the form of a system DA, on the basis of the dialogue state and
information on tourist attractions from the database. The
NLG module then converts the system’s DA into a system
utterance. Finally, the text-to-speech (TTS) module responds
to the customer by converting the text response to speech.
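As a rough illustration of this turn-level flow, the sketch below chains the four modules between the ASR output and the TTS input; all class and method names are hypothetical placeholders rather than the actual interfaces of our system.

# Illustrative sketch of one dialogue turn through the pipeline
# (hypothetical interfaces; not the actual implementation).
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    da_history: list = field(default_factory=list)      # past customer/system DAs
    customer_profile: dict = field(default_factory=dict)
    belief_state: dict = field(default_factory=dict)     # customer's travel preferences

def one_turn(asr_text, state, nlu, dst, policy, nlg, db):
    customer_da = nlu.predict_da(asr_text)      # GPT-2-based NLU
    state = dst.update(state, customer_da)      # rule-based DST
    system_da = policy.next_action(state, db)   # rule-based policy with attraction database
    return nlg.generate(system_da)              # GPT-2-based NLG; the text then goes to TTS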
ASR and TTS were implemented using the Google Speech Recognition system and the Amazon Polly API, respectively, both of which were provided by the competition organizers. The
robot’s expression control and motion control were based
on the expression and motion rules defined for each system
dialogue act.
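Because the robot's expressions and motions are driven by rules keyed on the system DA, this control can be pictured as a simple lookup table; the DA names and motion labels in the sketch below are invented for illustration only.

# Hypothetical mapping from system dialogue acts to expression/motion commands.
EXPRESSION_MOTION_RULES = {
    "greet":     {"expression": "smile",   "motion": "bow"},
    "recommend": {"expression": "smile",   "motion": "gesture_toward_display"},
    "confirm":   {"expression": "neutral", "motion": "nod"},
}

def control_robot(system_da):
    # Fall back to a neutral pose for DAs without a dedicated rule.
    return EXPRESSION_MOTION_RULES.get(system_da, {"expression": "neutral", "motion": "idle"})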
In the following sections, we first describe the DA ontology for the travel destination recommendation task and then the NLU, DST, policy, and NLG modules. Next, we describe the robot's facial expression control and motion control. Finally, we report and discuss the evaluation results.