DialogConv: A Lightweight Fully Convolutional Network for Multi-view
Response Selection
Yongkang Liu1, Shi Feng1, Wei Gao2, Daling Wang1 and Yifei Zhang1
1Northeastern University, China
2Singapore Management University, Singapore
misonsky@163.com, fengshi@cse.neu.edu.cn
weigao@smu.edu.sg, {wangdaling,zhangyifei}@cse.neu.edu.cn
Abstract
Current end-to-end retrieval-based dialogue systems are mainly based on Recurrent Neural Networks or Transformers with attention mechanisms. Although promising results have been achieved, these models often suffer from slow inference or a huge number of parameters. In this paper, we propose a novel lightweight fully convolutional architecture, called DialogConv, for response selection. DialogConv is built exclusively on top of convolution to extract matching features of context and response. Dialogues are modeled in 3D views, where DialogConv performs convolution operations on the embedding view, word view and utterance view to capture richer semantic information from multiple contextual views. On four benchmark datasets, compared with state-of-the-art baselines, DialogConv is on average about 8.5× smaller in size, and 79.39× and 10.64× faster on CPU and GPU devices, respectively. At the same time, DialogConv achieves competitive response selection effectiveness.
1 Introduction
An important challenge in building intelligent dialogue systems is the response selection problem, which aims to select an appropriate response from a set of candidates given a dialogue context. Such retrieval-based dialogue systems have attracted great attention from academia and industry due to the informative and fluent responses they produce (Tao et al., 2021).
The existing retrieval-based dialogue systems can be divided into three patterns according to the way of input handling (Zhang and Zhao, 2021): (i) Separate Pattern (Wu et al., 2017; Zhang et al., 2018; Zhou et al., 2018; Gu et al., 2019); (ii) Concatenated Pattern (Tan et al., 2015; Zhou et al., 2016); (iii) PrLM (Pretrained Language Model) Pattern (Cui et al., 2020; Gu et al., 2020; Liu et al., 2021). Separate Pattern (i.e., Figure 1(a)) encodes utterances individually, while Concatenated Pattern (i.e., Figure 1(b)) concatenates all utterances into a continuous word sequence. Methods based on these two patterns usually have Recurrent Neural Networks (RNNs) (Hochreiter and Schmidhuber, 1997; Cho et al., 2014) and attention mechanisms (Bahdanau et al., 2015) as the backbone. Although promising results have been achieved, these methods are generally slow in training and inference due to their recurrent nature.

Figure 1: Flat modeling. (a) is the separate pattern, (b) is the concatenated pattern and (c) is the PrLM pattern. Grey bars in (c) are embedded representations of special symbols.
The PrLM Pattern (i.e., Figure 1(c)) uses special symbols to connect all utterances into a continuous sequence, similar to Concatenated Pattern. While the PrLM Pattern has obtained state-of-the-art performance in response selection (Cui et al., 2020; Gu et al., 2020; Liu et al., 2021), these methods, which have the Transformer (Vaswani et al., 2017) as the de facto standard architecture, suffer from a large number of parameters and heavy computational cost. Very large models not only increase training costs, but also prevent researchers from iterating quickly. At the same time, slow inference hinders the development and deployment of dialogue systems in real-world scenarios.
Furthermore, these three patterns treat dialogue contexts as flat structures (Li et al., 2021). Methods based on such flat structures usually capture the sequential features of text by considering each word as a unit. However, previous work (Lu et al., 2019) revealed that given a multi-turn dialogue (e.g., Figure 2(a)), the context of the dialogue can exhibit a composition of 3D stereo structures as we view utterances in each dimension (shown in Figure 2(b) and (c)). As shown in Figure 2(b), the embedding view can represent the features of each individual word, the word view can represent the features from the whole conversation and a single utterance, and the utterance view can capture the dependencies between different localities composed of adjacent utterances. Existing methods (Gu et al., 2019; Zhou et al., 2016; Gu et al., 2020) only extract features based on flat structures and cannot simultaneously capture complex features from such stereoscopic views.

Figure 2: Stereo view modeling. (a) An example of a multi-turn dialogue; (b) features from different views; (c) a schematic diagram of the stereo view; (d) convolution on different views ((1) is convolution in the embedding view; (2) is convolution in the word view; and (3) is convolution in the utterance view).
In this paper, we propose a lightweight fully convolutional network model, called DialogConv, without any RNN or attention module, for multi-view response selection (here "fully" means DialogConv is built exclusively on CNNs). Different from previous studies (Zhou et al., 2016; Gu et al., 2019, 2020; Li et al., 2021) which model the dialogue in a flat view, DialogConv models the dialogue context and response together in the 3D space of the stereo views, i.e., the embedding view, word view, and utterance view (as shown in Figure 2(d)). In the embedding view, word-level features are refined through convolution operations on the plane formed by the word sequence dimension and the utterance dimension. In the word view, global conversation features are captured by concatenating all words into a continuous sequence, and the features of each utterance are refined by performing convolution on each utterance. In the utterance view, the dependency features between different local contexts are distilled by performing convolution across different utterances. In general, DialogConv can simultaneously extract features with different granularities from the stereo structure.
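To make the three views concrete, here is a minimal PyTorch sketch (our own illustration, not code from the paper) of how a single embedded dialogue, stacked into a 3D tensor, exposes the embedding, word and utterance views as simple reshapes; the shapes and the pooling choice are illustrative assumptions:

```python
# Illustrative sketch only: a dialogue embedded into a (t, l, d) tensor,
# i.e., t utterances (context + response), l words each, d-dim embeddings.
import torch

t, l, d = 5, 20, 128
G = torch.randn(t, l, d)  # stacked dialogue context and response

# Embedding view: one (t, l) plane per embedding dimension.
embedding_view = G.permute(2, 0, 1)      # (d, t, l)

# Word view: all words concatenated into one continuous sequence.
word_view = G.reshape(t * l, d)          # (t*l, d)

# Utterance view: each utterance as a unit (mean over words is one choice).
utterance_view = G.mean(dim=1)           # (t, d)

print(embedding_view.shape, word_view.shape, utterance_view.shape)
```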
DialogConv is completely based on CNNs, which use much fewer parameters and computing resources. DialogConv has an average of 12.4 million parameters, which is on average about 8.5× smaller than other models. The inference speed of DialogConv is on average about 79.39× faster on a CPU device and 10.64× faster on a GPU device than existing models. Moreover, DialogConv achieves competitive results on four benchmarks and performs even better when pretrained with contrastive learning. In summary, we make the following contributions:
• We propose an efficient convolutional response selection model, DialogConv, which, to the best of our knowledge, is the first response selection model built entirely on multiple convolutional layers without any RNN or attention module.
• We model dialogue from stereo views, where 2D and 1D convolution operations are performed on the embedding, word and utterance views, so that DialogConv can capture features from the stereo views simultaneously.
• Extensive experiments on four benchmark datasets show that DialogConv, with fewer parameters, can achieve competitive performance with faster speed and less computing resources. The code is available on GitHub: https://github.com/misonsky/DialogConv.
2 Related Work
2.1 Retrieval-based Dialogue System
Most existing retrieval-based dialogue systems (Wu et al., 2017; Gu et al., 2019; Liu et al., 2021) focus on matching between the dialogue context and the response. These methods attempt to mine deep semantic features through sequence modeling, e.g., using attention-based pairwise matching mechanisms to capture interaction features between the dialogue context and a candidate response. However, previous research (Sankar et al., 2019; Li et al., 2021) shows that these methods fail to fully exploit the conversation history. In addition, methods based on recurrent neural networks suffer from slow inference due to the nature of recurrent structures. Although Transformer-based methods (Vaswani et al., 2017) get rid of this weakness of recurrent structures, they are usually plagued by a large number of parameters (Wu et al., 2019), making the training and inference of Transformer-based models computationally expensive. In this paper, we propose a multi-view approach that models the dialogue context with a fully convolutional structure, yielding a lightweight model that is smaller and faster than most existing methods.
2.2 Convolutional Neural Networks (CNN)
For the past few years, CNNs have been the go-
to model in computer vision. The main reason
is that CNN enjoys the advantage of parameter
sharing and is better at modeling local structures.
A large number of excellent architectures based
on CNN have been proposed (Krizhevsky et al.,
2012;He et al.,2016;Dai et al.,2021). For text
processing, convolutional structures are good at
capturing local dependencies of text and are faster
than RNNs (Hochreiter and Schmidhuber,1997).
Therefore, some studies (Wu et al.,2016;Lu et al.,
2019;Yuan et al.,2019) employ convolutional
structures to aggregate the matching features be-
tween dialogue contexts and responses. However,
these works usually require combining attention
mechanisms or the skeleton structure of RNN with
CNNs. Furthermore, these studies treat dialogue
context as a flat structure. In this paper, we propose
a novel fully convolutional architecture to extract
matching features from stereo views, which can
simultaneously extract the features with different
granularities from different views.
3 Methodology
3.1 Problem Formulation
In this paper, an instance in the dialogue dataset can be represented as $(C, y)$, where $C = (u_1, u_2, \ldots, u_{t-1}, r)$ represents the set of dialogue context utterances $(u_1, u_2, \ldots, u_{t-1})$ and the response $r$, $u_i$ is the $i$-th utterance, and $y \in \{0, 1\}$ is the class label of $C$. As the core of a retrieval-based dialogue system, the purpose of response selection is to build a discriminator $g(C)$ on $(C, y)$ to measure the matching between the dialogue context and the response.
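For concreteness, a hypothetical instance (invented for illustration, loosely echoing the example in Figure 2(a), and not drawn from any benchmark dataset) could be written as:

```python
# A hypothetical (C, y) instance: the context utterances u_1 ... u_{t-1}
# plus a candidate response r, labeled y = 1 if r is appropriate.
context = (
    "who used the computer before ?",                   # u_1
    "lucy .",                                           # u_2
)
response = "you started using the computer at 4:00 ."   # r
C = context + (response,)   # C = (u_1, ..., u_{t-1}, r)
y = 1                       # class label of C
```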
3.2 Fully Convolutional Matching
We propose a fully convolutional encoder for multi-view response selection. The multiple views include the embedding view, word view, and utterance view. In the embedding view, convolution operations are performed on the plane formed by the word sequence dimension and the utterance dimension, and word-level features can be extracted through nonlinear transformations between different embeddings. In the word view, global dialogue context features are captured by convolution over a contiguous sequence connecting all words, and the features of each utterance are obtained by performing convolution on each utterance. In the utterance view, DialogConv is responsible for capturing the dependency features between different local contexts composed of adjacent utterances. Figure 3 shows an overview of our proposed DialogConv, which consists of six layers: (i) embedding layer; (ii) local matching layer; (iii) context matching layer; (iv) discourse matching layer; (v) aggregation layer; (vi) prediction layer.
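As a purely structural sketch, the six-layer pipeline can be pictured as below; only the layer names come from the paper, and every placeholder body is our assumption, since the layer internals are specified in the following subsections:

```python
# Structural skeleton only: layer names from the paper's overview; each body
# is a runnable placeholder (nn.Identity), not the real layer design.
import torch.nn as nn

class DialogConvSkeleton(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Identity()           # (i)   embedding layer
        self.local_matching = nn.Identity()      # (ii)  local matching layer
        self.context_matching = nn.Identity()    # (iii) context matching layer
        self.discourse_matching = nn.Identity()  # (iv)  discourse matching layer
        self.aggregation = nn.Identity()         # (v)   aggregation layer
        self.prediction = nn.Identity()          # (vi)  prediction layer

    def forward(self, g):
        # g: the stacked dialogue tensor G (see Symbol Definition below)
        g = self.embedding(g)
        g = self.local_matching(g)
        g = self.context_matching(g)
        g = self.discourse_matching(g)
        g = self.aggregation(g)
        return self.prediction(g)                # matching score g(C)
```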
Symbol Definition: The embedding layer uses a pretrained word embedding model to map each word in $C$ to a vector space. We stack $C$ chronologically into a 3D tensor $G \in \mathbb{R}^{t \times \ell \times d}$, where $d$ represents the dimension of the word embedding, $\ell$ represents the length of an utterance, and $t$ is the number of utterances including the response. $G$ is the input to DialogConv. We use $\mathrm{Conv2D}^{v}_{k \times s}$ and $\mathrm{Conv1D}^{v}_{w}$ to denote the convolution operations, where $\mathrm{Conv2D}^{v}_{k \times s}$ denotes a two-dimensional convolution with a kernel size of $k \times s$, $\mathrm{Conv1D}^{v}_{w}$ represents a one-dimensional convolution with a kernel size of $w$, and $v$ represents a specific view. We will describe the details of the remaining layers in the following subsections.
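Read in code, this notation could be instantiated with standard PyTorch convolutions applied to the appropriately reshaped view of $G$; the following sketch reflects our own assumptions about the tensor layout (the channel axes in particular), not the released implementation:

```python
# A sketch under our own layout assumptions: Conv2D^v_{k x s} and Conv1D^v_w
# as standard PyTorch convolutions on reshaped views of G in R^{t x l x d}.
import torch
import torch.nn as nn

t, l, d = 5, 20, 128
G = torch.randn(1, t, l, d)  # batch of one stacked dialogue tensor

# Conv2D^{embedding}_{3x3}: embedding dimensions as channels, convolving
# over the (t, l) plane of utterances and word positions.
conv2d_emb = nn.Conv2d(in_channels=d, out_channels=d, kernel_size=(3, 3), padding=1)
out_emb = conv2d_emb(G.permute(0, 3, 1, 2))  # (1, d, t, l)

# Conv1D^{word}_{3}: all words concatenated into one sequence of length t*l,
# convolved along it with embedding dimensions as channels.
conv1d_word = nn.Conv1d(in_channels=d, out_channels=d, kernel_size=3, padding=1)
out_word = conv1d_word(G.reshape(1, t * l, d).transpose(1, 2))  # (1, d, t*l)

print(out_emb.shape, out_word.shape)
```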
3.2.1 Local Matching Layer
The local matching layer is responsible for extracting the features of each utterance. The local matching stage contains features from the embedding and word views. First, we employ $1 \times 1$ convolutions in the embedding view and the word view, respectively. The process can be formally described as:

$$G_1 = \mathrm{Conv2D}^{\mathrm{embedding}}_{1 \times 1}(\sigma(G)) \quad (1)$$

$$G_2 = \mathrm{Conv2D}^{\mathrm{word}}_{1 \times 1}(G_1) + G \quad (2)$$

where $\sigma(\cdot)$ stands for the GELU activation function (Hendrycks and Gimpel, 2016).
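A minimal PyTorch sketch of Eqs. (1) and (2) follows. Which axis serves as the channel axis in each view is our assumption (embedding dimensions as channels over the $(t, \ell)$ plane for the embedding view; utterances as channels over the $(\ell, d)$ plane for the word view), so this illustrates the two $1 \times 1$ convolutions and the residual connection rather than the paper's exact implementation:

```python
# A hedged sketch of Eqs. (1) and (2): 1x1 convolutions in the embedding
# view and the word view, with a residual connection back to G. The choice
# of channel axis in each view is our assumption, not confirmed by the text.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalMatchingLayer(nn.Module):
    def __init__(self, t: int, d: int):
        super().__init__()
        # Embedding view: embedding dims as channels over the (t, l) plane.
        self.conv_emb = nn.Conv2d(d, d, kernel_size=1)
        # Word view: utterances as channels over the (l, d) plane.
        self.conv_word = nn.Conv2d(t, t, kernel_size=1)

    def forward(self, G: torch.Tensor) -> torch.Tensor:
        # G: (batch, t, l, d)
        x = F.gelu(G).permute(0, 3, 1, 2)           # sigma(G), channels = d
        G1 = self.conv_emb(x).permute(0, 2, 3, 1)   # Eq. (1), back to (batch, t, l, d)
        G2 = self.conv_word(G1) + G                 # Eq. (2), residual connection
        return G2

layer = LocalMatchingLayer(t=5, d=128)
G = torch.randn(2, 5, 20, 128)                      # (batch, t, l, d)
print(layer(G).shape)                               # torch.Size([2, 5, 20, 128])
```

Because the kernels are $1 \times 1$, each convolution mixes information only across its channel axis, which is consistent with the text's description of refining word-level features through transformations between embeddings.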