DialogConv: A Lightweight Fully Convolutional Network for Multi-view
Response Selection
Yongkang Liu1, Shi Feng1, Wei Gao2, Daling Wang1 and Yifei Zhang1
1Northeastern University, China
2Singapore Management University, Singapore
misonsky@163.com, fengshi@cse.neu.edu.cn
weigao@smu.edu.sg, {wangdaling,zhangyifei}@cse.neu.edu.cn
Abstract
Current end-to-end retrieval-based dialogue systems are mainly based on Recurrent Neural Networks or Transformers with attention mechanisms. Although promising results have been achieved, these models often suffer from slow inference or a huge number of parameters. In this paper, we propose a novel lightweight fully convolutional architecture, called DialogConv, for response selection. DialogConv is built exclusively on top of convolution to extract matching features of context and response. Dialogues are modeled in 3D views, where DialogConv performs convolution operations on the embedding view, word view and utterance view to capture richer semantic information from multiple contextual views. On four benchmark datasets, compared with state-of-the-art baselines, DialogConv is on average about 8.5× smaller in size, and 79.39× and 10.64× faster on CPU and GPU devices, respectively. At the same time, DialogConv achieves competitive response selection effectiveness.
1 Introduction
An important challenge in building intelligent dialogue systems is the response selection problem, which aims to select an appropriate response from a set of candidates given a dialogue context. Such retrieval-based dialogue systems have attracted great attention from academia and industry due to the informative and fluent responses they produce (Tao et al., 2021).
The existing retrieval-based dialogue systems can be divided into three patterns according to the way of input handling (Zhang and Zhao, 2021): (i) Separate Pattern (Wu et al., 2017; Zhang et al., 2018; Zhou et al., 2018; Gu et al., 2019); (ii) Concatenated Pattern (Tan et al., 2015; Zhou et al., 2016); (iii) PrLM (Pretrained Language Model) Pattern (Cui et al., 2020; Gu et al., 2020; Liu et al., 2021). Separate Pattern (i.e., Figure 1(a)) encodes utterances individually, while Concatenated Pattern (i.e., Figure 1(b)) concatenates all utterances into a continuous word sequence. Methods based on these two patterns usually have Recurrent Neural Networks (RNNs) (Hochreiter and Schmidhuber, 1997; Cho et al., 2014) and attention mechanisms (Bahdanau et al., 2015) as the backbone. Although promising results have been achieved, these methods are generally slow in training and inference due to their recurrent nature.

Figure 1: Flat modeling. (a) is the separate pattern, (b) is the concatenated pattern and (c) is the PrLM pattern. Grey bars in (c) are embedded representations of special symbols.
The PrLM Pattern (i.e., Figure 1(c)) uses special symbols to connect all utterances into a continuous sequence, similar to Concatenated Pattern. While the PrLM Pattern has obtained state-of-the-art performance in response selection (Cui et al., 2020; Gu et al., 2020; Liu et al., 2021), these methods, which have the Transformer (Vaswani et al., 2017) as the de facto standard architecture, suffer from a large number of parameters and heavy computational cost. Very large models not only increase training costs, but also prevent researchers from iterating quickly. At the same time, slow inference hinders the development and deployment of dialogue systems in real-world scenarios.
Furthermore, these three patterns treat dialogue contexts as flat structures (Li et al., 2021). Methods based on such flat structures usually capture the sequential features of text by considering each word as a unit. However, previous work (Lu et al., 2019) revealed that given a multi-turn dialogue (e.g., Figure 2(a)), the context of the dialogue can exhibit a composition of 3D stereo structures as we view utterances in each dimension (shown in Figure 2(b) and (c)). As shown in Figure 2(b), the embedding view can represent the features of each individual word, the word view can represent the features from the whole conversation and a single utterance, and the utterance view can capture the dependencies between different localities composed of adjacent utterances. Existing methods (Gu et al., 2019; Zhou et al., 2016; Gu et al., 2020) only extract features based on flat structures and cannot simultaneously capture complex features from such stereoscopic views.

Figure 2: Stereo view modeling. (a) An example of a multi-turn dialogue; (b) features from different views; (c) a schematic diagram of the stereo view; (d) convolution on different views ((1) is convolution in the embedding view; (2) is convolution in the word view; and (3) is convolution in the utterance view).
In this paper, we propose a lightweight fully convolutional network model, called DialogConv, without any RNN or attention module, for multi-view response selection (here "fully" means DialogConv is built exclusively on CNNs). Different from previous studies (Zhou et al., 2016; Gu et al., 2019, 2020; Li et al., 2021) which model the dialogue in a flat view, DialogConv models the dialogue context and response together in the 3D space of the stereo views, i.e., the embedding view, word view, and utterance view (as shown in Figure 2(d)). In the embedding view, word-level features are refined through convolution operations on the plane formed by the word sequence dimension and the utterance dimension. In the word view, global conversation features are captured by concatenating all words into a continuous sequence, and the features of each utterance are refined by performing convolution on each utterance. In the utterance view, the dependency features between different local contexts are distilled by performing convolution across different utterances. In general, DialogConv can simultaneously extract features with different granularities from the stereo structure.
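To make the three views concrete, here is a minimal PyTorch sketch (our own illustration, not code from the paper) of how a single embedded dialogue, stacked into a 3D tensor, exposes the embedding, word and utterance views as simple reshapes; the shapes and the pooling choice are illustrative assumptions:

```python
# Illustrative sketch only: a dialogue embedded into a (t, l, d) tensor,
# i.e., t utterances (context + response), l words each, d-dim embeddings.
import torch

t, l, d = 5, 20, 128
G = torch.randn(t, l, d)  # stacked dialogue context and response

# Embedding view: one (t, l) plane per embedding dimension.
embedding_view = G.permute(2, 0, 1)      # (d, t, l)

# Word view: all words concatenated into one continuous sequence.
word_view = G.reshape(t * l, d)          # (t*l, d)

# Utterance view: each utterance as a unit (mean over words is one choice).
utterance_view = G.mean(dim=1)           # (t, d)

print(embedding_view.shape, word_view.shape, utterance_view.shape)
```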
DialogConv is completely based on CNNs, which use much fewer parameters and computing resources. DialogConv has an average of 12.4 million parameters, which is on average about 8.5× smaller than other models. The inference speed of DialogConv is on average about 79.39× faster on a CPU device and 10.64× faster on a GPU device than existing models. Moreover, DialogConv achieves competitive results on four benchmarks and performs even better when pretrained with contrastive learning. In summary, we make the following contributions:
• We propose an efficient convolutional response selection model, DialogConv, which, to the best of our knowledge, is the first response selection model built entirely on multiple convolutional layers without any RNN or attention module.
• We model dialogue from stereo views, where 2D and 1D convolution operations are performed on the embedding, word and utterance views, so that DialogConv can capture features from the stereo views simultaneously.
• Extensive experiments on four benchmark datasets show that DialogConv, with fewer parameters, can achieve competitive performance with faster speed and less computing resources. The code is available on GitHub: https://github.com/misonsky/DialogConv.
2 Related Work
2.1 Retrieval-based Dialogue System
Most existing retrieval-based dialogue systems (Wu et al., 2017; Gu et al., 2019; Liu et al., 2021) focus on matching between the dialogue context and the response. These methods attempt to mine deep semantic features through sequence modeling, e.g., using attention-based pairwise matching mechanisms to capture interaction features between the dialogue context and a candidate response. However, previous research (Sankar et al., 2019; Li et al., 2021) shows that these methods fail to fully exploit the conversation history. In addition, methods based on recurrent neural networks suffer from slow inference due to the nature of recurrent structures. Although Transformer-based methods (Vaswani et al., 2017) get rid of this weakness of recurrent structures, they are usually plagued by a large number of parameters (Wu et al., 2019), making the training and inference of Transformer-based models computationally expensive. In this paper, we propose a multi-view approach that models the dialogue context with a fully convolutional structure, yielding a lightweight model that is smaller and faster than most existing methods.
2.2 Convolutional Neural Networks (CNN)
For the past few years, CNNs have been the go-
to model in computer vision. The main reason
is that CNN enjoys the advantage of parameter
sharing and is better at modeling local structures.
A large number of excellent architectures based
on CNN have been proposed (Krizhevsky et al.,
2012;He et al.,2016;Dai et al.,2021). For text
processing, convolutional structures are good at
capturing local dependencies of text and are faster
than RNNs (Hochreiter and Schmidhuber,1997).
Therefore, some studies (Wu et al.,2016;Lu et al.,
2019;Yuan et al.,2019) employ convolutional
structures to aggregate the matching features be-
tween dialogue contexts and responses. However,
these works usually require combining attention
mechanisms or the skeleton structure of RNN with
CNNs. Furthermore, these studies treat dialogue
context as a flat structure. In this paper, we propose
a novel fully convolutional architecture to extract
matching features from stereo views, which can
simultaneously extract the features with different
granularities from different views.
3 Methodology
3.1 Problem Formulation
In this paper, an instance in the dialogue dataset can be represented as $(C, y)$, where $C = (u_1, u_2, \ldots, u_{t-1}, r)$ represents the set of dialogue context utterances $(u_1, u_2, \ldots, u_{t-1})$ and the response $r$, $u_i$ is the $i$-th utterance, and $y \in \{0, 1\}$ is the class label of $C$. As the core of a retrieval-based dialogue system, the purpose of response selection is to build a discriminator $g(C)$ on $(C, y)$ to measure the matching between the dialogue context and the response.
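For concreteness, a hypothetical instance (invented for illustration, loosely echoing the example in Figure 2(a), and not drawn from any benchmark dataset) could be written as:

```python
# A hypothetical (C, y) instance: the context utterances u_1 ... u_{t-1}
# plus a candidate response r, labeled y = 1 if r is appropriate.
context = (
    "who used the computer before ?",                   # u_1
    "lucy .",                                           # u_2
)
response = "you started using the computer at 4:00 ."   # r
C = context + (response,)   # C = (u_1, ..., u_{t-1}, r)
y = 1                       # class label of C
```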
3.2 Fully Convolutional Matching
We propose a fully convolutional encoder for multi-view response selection. The multiple views include the embedding view, word view, and utterance view. In the embedding view, convolution operations are performed on the plane formed by the word sequence dimension and the utterance dimension, and word-level features can be extracted through nonlinear transformations between different embeddings. In the word view, global dialogue context features are captured by convolution over a contiguous sequence connecting all words, and the features of each utterance are obtained by performing convolution on each utterance. In the utterance view, DialogConv is responsible for capturing the dependency features between different local contexts composed of adjacent utterances. Figure 3 shows an overview of our proposed DialogConv, which consists of six layers: (i) embedding layer; (ii) local matching layer; (iii) context matching layer; (iv) discourse matching layer; (v) aggregation layer; (vi) prediction layer.
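As a purely structural sketch, the six-layer pipeline can be pictured as below; only the layer names come from the paper, and every placeholder body is our assumption, since the layer internals are specified in the following subsections:

```python
# Structural skeleton only: layer names from the paper's overview; each body
# is a runnable placeholder (nn.Identity), not the real layer design.
import torch.nn as nn

class DialogConvSkeleton(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Identity()           # (i)   embedding layer
        self.local_matching = nn.Identity()      # (ii)  local matching layer
        self.context_matching = nn.Identity()    # (iii) context matching layer
        self.discourse_matching = nn.Identity()  # (iv)  discourse matching layer
        self.aggregation = nn.Identity()         # (v)   aggregation layer
        self.prediction = nn.Identity()          # (vi)  prediction layer

    def forward(self, g):
        # g: the stacked dialogue tensor G (see Symbol Definition below)
        g = self.embedding(g)
        g = self.local_matching(g)
        g = self.context_matching(g)
        g = self.discourse_matching(g)
        g = self.aggregation(g)
        return self.prediction(g)                # matching score g(C)
```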
Symbol Definition: The embedding layer uses a pretrained word embedding model to map each word in $C$ to a vector space. We stack $C$ chronologically into a 3D tensor $G \in \mathbb{R}^{t \times \ell \times d}$, where $d$ represents the dimension of the word embedding, $\ell$ represents the length of an utterance, and $t$ is the number of utterances including the response. $G$ is the input to DialogConv. We use $\mathrm{Conv2D}^{v}_{k \times s}$ and $\mathrm{Conv1D}^{v}_{w}$ to denote the convolution operations, where $\mathrm{Conv2D}^{v}_{k \times s}$ denotes a two-dimensional convolution with a kernel size of $k \times s$, $\mathrm{Conv1D}^{v}_{w}$ represents a one-dimensional convolution with a kernel size of $w$, and $v$ represents a specific view. We will describe the details of the remaining layers in the following subsections.
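Read in code, this notation could be instantiated with standard PyTorch convolutions applied to the appropriately reshaped view of $G$; the following sketch reflects our own assumptions about the tensor layout (the channel axes in particular), not the released implementation:

```python
# A sketch under our own layout assumptions: Conv2D^v_{k x s} and Conv1D^v_w
# as standard PyTorch convolutions on reshaped views of G in R^{t x l x d}.
import torch
import torch.nn as nn

t, l, d = 5, 20, 128
G = torch.randn(1, t, l, d)  # batch of one stacked dialogue tensor

# Conv2D^{embedding}_{3x3}: embedding dimensions as channels, convolving
# over the (t, l) plane of utterances and word positions.
conv2d_emb = nn.Conv2d(in_channels=d, out_channels=d, kernel_size=(3, 3), padding=1)
out_emb = conv2d_emb(G.permute(0, 3, 1, 2))  # (1, d, t, l)

# Conv1D^{word}_{3}: all words concatenated into one sequence of length t*l,
# convolved along it with embedding dimensions as channels.
conv1d_word = nn.Conv1d(in_channels=d, out_channels=d, kernel_size=3, padding=1)
out_word = conv1d_word(G.reshape(1, t * l, d).transpose(1, 2))  # (1, d, t*l)

print(out_emb.shape, out_word.shape)
```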
3.2.1 Local Matching Layer
The local matching layer is responsible for extracting the features of each utterance. The local matching stage contains features from the embedding and word views. First, we employ $1 \times 1$ convolutions in the embedding view and the word view, respectively. The process can be formally described as:

$$G_1 = \mathrm{Conv2D}^{\mathrm{embedding}}_{1 \times 1}(\sigma(G)) \quad (1)$$

$$G_2 = \mathrm{Conv2D}^{\mathrm{word}}_{1 \times 1}(G_1) + G \quad (2)$$

where $\sigma(\cdot)$ stands for the GELU activation function (Hendrycks and Gimpel, 2016).
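A minimal PyTorch sketch of Eqs. (1) and (2) follows. Which axis serves as the channel axis in each view is our assumption (embedding dimensions as channels over the $(t, \ell)$ plane for the embedding view; utterances as channels over the $(\ell, d)$ plane for the word view), so this illustrates the two $1 \times 1$ convolutions and the residual connection rather than the paper's exact implementation:

```python
# A hedged sketch of Eqs. (1) and (2): 1x1 convolutions in the embedding
# view and the word view, with a residual connection back to G. The choice
# of channel axis in each view is our assumption, not confirmed by the text.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalMatchingLayer(nn.Module):
    def __init__(self, t: int, d: int):
        super().__init__()
        # Embedding view: embedding dims as channels over the (t, l) plane.
        self.conv_emb = nn.Conv2d(d, d, kernel_size=1)
        # Word view: utterances as channels over the (l, d) plane.
        self.conv_word = nn.Conv2d(t, t, kernel_size=1)

    def forward(self, G: torch.Tensor) -> torch.Tensor:
        # G: (batch, t, l, d)
        x = F.gelu(G).permute(0, 3, 1, 2)           # sigma(G), channels = d
        G1 = self.conv_emb(x).permute(0, 2, 3, 1)   # Eq. (1), back to (batch, t, l, d)
        G2 = self.conv_word(G1) + G                 # Eq. (2), residual connection
        return G2

layer = LocalMatchingLayer(t=5, d=128)
G = torch.randn(2, 5, 20, 128)                      # (batch, t, l, d)
print(layer(G).shape)                               # torch.Size([2, 5, 20, 128])
```

Because the kernels are $1 \times 1$, each convolution mixes information only across its channel axis, which is consistent with the text's description of refining word-level features through transformations between embeddings.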