

CDCONV: A Benchmark for Contradiction Detection in Chinese Conversations

Chujie Zheng1∗, Jinfeng Zhou1,2∗, Yinhe Zheng3, Libiao Peng3, Zhen Guo4, Wenquan Wu4, Zheng-Yu Niu4, Hua Wu4, Minlie Huang1,3†

1The CoAI Group, Institute for Artificial Intelligence, State Key Lab of Intelligent Technology and Systems, Beijing National Research Center for Information Science and Technology, DCST, Tsinghua University, Beijing 100084, China

2College of Intelligence and Computing, Tianjin University, Tianjin, China

3Lingxin AI, Beijing 100084, China 4Baidu Inc., China

chujiezhengchn@gmail.com jfzhou.mail@gmail.com aihuang@tsinghua.edu.cn

{guozhenguozhen, wuwenquan01, niuzhengyu, wu_hua}@baidu.com

Abstract

Dialogue contradiction is a critical issue in open-domain dialogue systems. The contextualization nature of conversations makes dialogue contradiction detection rather challenging. In this work, we propose a benchmark for Contradiction Detection in Chinese Conversations, namely CDCONV. It contains 12K multi-turn conversations annotated with three typical contradiction categories: Intra-sentence Contradiction, Role Confusion, and History Contradiction. To efficiently construct the CDCONV conversations, we devise a series of methods for automatic conversation generation, which simulate common user behaviors that trigger chatbots to make contradictions. We conduct careful manual quality screening of the constructed conversations and show that state-of-the-art Chinese chatbots can be easily goaded into making contradictions. Experiments on CDCONV show that properly modeling contextual information is critical for dialogue contradiction detection, but there are still unresolved challenges that require future research.1

1 Introduction

Large-scale pre-training for dialogue generation (Zhang et al., 2020; Freitas et al., 2020) has advanced the development of engaging and human-like dialogue systems. Unfortunately, state-of-the-art open-domain chatbots, such as BlenderBot (Roller et al., 2021), EVA (Zhou et al., 2021; Gu et al., 2022) and PLATO (Bao et al., 2021b), still often behave inconsistently with their role or identity and produce utterances that are self-contradictory or contradict the dialogue history (Shuster et al., 2022; Gu et al., 2022; Xu et al., 2022a). Such inconsistency or contradiction phenomena violate Grice’s cooperative principle (Grice, 1975) and greatly impair the users’ long-term trust (Huang et al., 2020; Lee et al., 2022).

∗Equal contribution.

†Corresponding author.

1Our data and code are available at https://www.github.com/thu-coai/CDConv and https://github.com/PaddlePaddle/Knover/tree/dygraph/projects/cdconv

Non-contradiction

𝑢1: 你喜欢吃面条吗？(Do you like noodles?)

𝑏1: 我喜欢吃米饭! (I love to eat rice!)

𝑢2: 你不喜欢面条吗? (Don’t you like noodles?)

𝑏2: 不喜欢! (I dislike!)

Contradiction

𝑢1: 那你喜欢什么小动物呢? (What animals do you like?)

𝑏1: 狗,猫 (Dogs, cats)

𝑢2: 你不喜欢什么小动物? (What animals do you dislike?)

𝑏2: 不喜欢猫,其他的都喜欢 (I dislike cats. I like all the other animals)

Figure 1: Dialogue contradiction detection requires the full contextual information (including u1 and u2) rather than only the bot’s utterances (i.e., b1 and b2).

Dialogue contradiction detection has been shown to be an effective means to improve the consistency of chatbots (Welleck et al., 2019; Nie et al., 2021), but it remains a challenging task. Specifically, the contextualization nature of conversations indicates the necessity of considering and modeling contextual information. For instance, in the “Contradiction” example in Figure 1, b2 does not explicitly contradict b1. However, given u1, the actual meaning of b1 should be “I like dogs, cats”, so b1 and b2 are contradictory. In contrast, in the “Non-contradiction” example, while b1 and b2 seem inconsistent (“love” vs. “dislike”), b2 actually means “I dislike noodles” considering the dialogue context. Hence, b2 is compatible with b1 and does not make a contradiction.
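To make the role of context concrete, the following minimal sketch builds three candidate inputs for a contradiction classifier from the "Contradiction" example of Figure 1: the last reply alone, the two bot replies only, and the full conversation. The function name `build_input`, the `[SEP]` separator, and the three level names are illustrative assumptions, not the paper's implementation.

```python
from typing import Dict

# Hypothetical separator; real models define their own special tokens.
SEP = " [SEP] "


def build_input(conv: Dict[str, str], level: str) -> str:
    """Concatenate utterances according to how much context is kept.

    level="reply" -> only b2
    level="bot"   -> b1 and b2 (the bot's utterances only)
    level="full"  -> u1, b1, u2, b2 (the whole conversation)
    """
    keys = {
        "reply": ["b2"],
        "bot": ["b1", "b2"],
        "full": ["u1", "b1", "u2", "b2"],
    }
    if level not in keys:
        raise ValueError(f"unknown context level: {level}")
    return SEP.join(conv[k] for k in keys[level])


# The "Contradiction" example from Figure 1.
example = {
    "u1": "那你喜欢什么小动物呢?",
    "b1": "狗,猫",
    "u2": "你不喜欢什么小动物?",
    "b2": "不喜欢猫,其他的都喜欢",
}

for level in ("reply", "bot", "full"):
    print(level, "->", build_input(example, level))
```

With the "bot" level, the string "狗,猫" carries no stance of its own; only the "full" level exposes that b1 answers a question about what the bot likes, which is what makes b2 contradictory.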

Despite the above challenge, existing datasets for contradiction detection (Dziri et al., 2019; Welleck et al., 2019) usually only consider the textual entailment relationship between two isolated sentences (Dagan et al., 2005), which is largely insufficient for dialogue contradiction detection due to the neglect of contextual information. A recent work (Nie et al., 2021) crowd-sourced a dataset named DECODE that contains conversations where the last utterances contradict the dialogue histories. However, DECODE lacks a wide coverage of typical contradiction categories, and most of its contradiction cases are written by humans, which leaves a gap with the real scenario in which users trigger chatbots to make contradictions.

arXiv:2210.08511v1 [cs.CL] 16 Oct 2022

Dataset                          Lang  Task Input              Task Type  Contradiction Categories
MNLI (2018)                      En    Sentence Pair           -          -
CMNLI (2020), OCNLI (2020)       Zh    Sentence Pair           -          -
DNLI (2019), InferConvAI (2019)  En    Sentence Pair           -          -
KvPI (2020)                      Zh    Conversation & Profile  Extrinsic  Profile
DIALFACT (2022)                  En    Conversation            Extrinsic  Fact
CI-ToD (2021)                    En    Conversation & KB       Int & Ext  Query, History & KB
DECODE (2021)                    En    Conversation            Intrinsic  History
CDCONV (Ours)                    Zh    Conversation            Intrinsic  Intra-sentence, Role, History

Table 1: Comparison of CDCONV with related benchmarks / datasets for (dialogue) contradiction detection. The Extrinsic type targets the contradiction between a conversation and external information (e.g., profiles or facts), while Intrinsic targets the contradiction inside a conversation. See §2 for detailed discussion.

In this work, we propose a benchmark for Contradiction Detection in Chinese Conversations, namely CDCONV. It contains 12K multi-turn conversations with human-annotated contradiction labels (§3). Different from previous work (e.g., Nie et al. 2021) that only considered the contradiction to dialogue history (i.e., History Contradiction), CDCONV covers another two typical categories: Intra-sentence Contradiction and Role Confusion, in which a reply contradicts itself or confuses the speaker’s role, respectively.
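One way to picture the annotation scheme is as a small record type: two user turns, two bot turns, and one label for the final reply. The sketch below is an assumption made for illustration; the names `Category` and `Conversation` and the integer values are hypothetical and are not claimed to match the released file format.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """Label for the bot's last reply (integer values are arbitrary here)."""
    NON_CONTRADICTION = 0
    INTRA_SENTENCE = 1   # the reply contradicts itself
    ROLE_CONFUSION = 2   # the reply confuses the speaker's role
    HISTORY = 3          # the reply contradicts the dialogue history


@dataclass(frozen=True)
class Conversation:
    u1: str  # first user utterance
    b1: str  # first bot reply
    u2: str  # second user utterance
    b2: str  # second bot reply, the one being judged
    label: Category

    def is_contradiction(self) -> bool:
        """Collapse the categories into a binary decision."""
        return self.label is not Category.NON_CONTRADICTION


# The Intra-sentence Contradiction example shown in Figure 2.
conv = Conversation(
    u1="你都有孩子了吗?",
    b1="没有啊,我是说你来找我玩!",
    u2="你有孩子吗?",
    b2="没有,我只有一个女儿",
    label=Category.INTRA_SENTENCE,
)
print(conv.label.name, conv.is_contradiction())
```

Keeping the fine-grained category and deriving the binary decision from it means a single record can serve both a two-way and a four-way classification setup.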

Since the cases of non-contradiction and contradiction in natural human-bot conversations are extremely unbalanced (§3, Nie et al. 2021), we automatically construct the CDCONV conversations combined with elaborate manual inspection (§4.1). Specifically, we first devise a series of automatic methods to generate conversations (§4.2), which simulate the common user behaviors that trigger chatbots to make contradictions. We then conduct careful human screening and annotation for the constructed conversations to ensure the data quality (§4.3). We validate the effectiveness of the trigger methods and show that state-of-the-art Chinese open-domain chatbots (EVA and PLATO) can be easily goaded into making contradictions (§4.4).
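The construction procedure described above can be sketched as a three-step loop: take an existing exchange (u1, b1), derive a follow-up user turn u2 with a trigger method, and let a chatbot produce b2, which is then handed to human annotators. Everything named below is a placeholder: `short_utterance_trigger` merely returns the filler "额" seen in the Short Utterance example of Figure 2, and `echo_bot` stands in for a real chatbot such as EVA or PLATO, whose interfaces are not described in this section.

```python
from typing import Callable, Dict, List, Tuple

Trigger = Callable[[str, str], str]   # (u1, b1) -> u2
Chatbot = Callable[[List[str]], str]  # dialogue history -> reply


def short_utterance_trigger(u1: str, b1: str) -> str:
    """Toy trigger: answer with a short filler, as a user might."""
    return "额"


def echo_bot(history: List[str]) -> str:
    """Stand-in chatbot; a real system would generate b2 from the history."""
    return "(bot reply to: " + history[-1] + ")"


def construct(
    seeds: List[Tuple[str, str]], trigger: Trigger, bot: Chatbot
) -> List[Dict[str, object]]:
    """Build candidate conversations that still need human annotation."""
    candidates: List[Dict[str, object]] = []
    for u1, b1 in seeds:
        u2 = trigger(u1, b1)    # simulate a user behavior
        b2 = bot([u1, b1, u2])  # let the chatbot respond
        # The label stays None until a human annotator assigns it.
        candidates.append(
            {"u1": u1, "b1": b1, "u2": u2, "b2": b2, "label": None}
        )
    return candidates


seeds = [("那你今年上几年级了呢?", "开学初三,你呢?")]
out = construct(seeds, short_utterance_trigger, echo_bot)
print(out[0])
```

Passing the trigger and the chatbot as plain callables keeps the loop independent of any particular trigger method or model, so each combination can be evaluated separately.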

Finally, we evaluate popular Chinese pre-trained models on CDCONV (§5). Results show that properly modeling contextual information is critical for dialogue contradiction detection. However, there is still much room for future research in dialogue modeling, integrating commonsense and world knowledge, and reasoning.

Our contributions are summarized as follows:

• We propose CDCONV, a benchmark for contradiction detection in Chinese conversations. It contains 12K conversations annotated with three typical contradiction categories: Intra-sentence Contradiction, Role Confusion, and History Contradiction.

• We present a series of methods by simulating common user behaviors to automatically trigger chatbots to make contradictions. We demonstrate the effectiveness of these trigger methods through detailed human annotation.

• We evaluate popular Chinese pre-trained models on CDCONV. Results show the importance of properly modeling contextual information in dialogue contradiction detection, while this task is still far from solved and requires further study.

2 Related Work

Table 1 summarizes the comparison of CDCONV with related benchmarks / datasets for (dialogue) contradiction detection.

Contradiction Detection for Sentence Pair

Early work on contradiction detection usually adopted the natural language inference (NLI) framework (Dagan et al., 2005), such as the English MNLI (Williams et al., 2018) dataset and the Chinese CMNLI (Xu et al., 2020) and OCNLI (Hu et al., 2020) datasets. The task input consists of two isolated sentences, which are labeled as one of the textual entailment relationships: “entailment”, “neutral” and “contradiction”. To extend the NLI framework to the dialogue domain, Welleck et al. (2019) constructed the DNLI dataset where the dialogue utterances and the persona descriptions from PersonaChat (Zhang et al., 2018) are used to form sentence pairs. Dziri et al. (2019) similarly synthesized the InferConvAI dataset through automatic manipulation with dialogue utterances. However, the NLI framework does not consider the contextualization nature of conversations, making it deficient for dialogue contradiction detection.

Intra-sentence Contradiction / Paraphrasing

𝑢1: 你都有孩子了吗? (Do you have children?)

𝑏1: 没有啊,我是说你来找我玩! (No, I mean you can come and see me!)

𝑢2: 你有孩子吗? (Do you have children?)

𝑏2: 没有,我只有一个女儿 (No, I only have a daughter.)

Role Confusion / Short Utterance

𝑢1: 那你今年上几年级了呢? (What grade are you this year?)

𝑏1: 开学初三,你呢? (The third year of junior high, and you?)

𝑢2: 额 (Emm)

𝑏2: 我比你大呢,我开学高三 (Older than you, I’m in the third year of senior high)

History Contradiction (Persona Attributes) / Inquiring History (Bot)

𝑢1: 那你们是异地恋吗? (So, are you in a long-distance relationship?)

𝑏1: 不是,我也在北京 (No, I am also in Beijing)

𝑢2: 你在哪里? (Where are you?)

𝑏2: 我在上海 (I am in Shanghai)

History Contradiction (Persona Opinions) / Perturbation (Antonym)

𝑢1: 哦哦,那你喜欢什么颜色? (Oh, what color do you like?)

𝑏1: 黑,白,灰 (Black, white, gray)

𝑢2: 哦哦,那你讨厌什么颜色? (Oh, what color do you hate?)

𝑏2: 蓝色,灰色 (Blue, gray)

History Contradiction (Persona Experiences) / Perturbation (Synonym)

𝑢1: 你吃早饭了吗? (Have you had breakfast yet?)

𝑏1: 吃了呀,都快到公司了 (Yes, I’m about to arrive at the company)

𝑢2: 你吃早餐了吗? (Have you had breakfast yet?)

𝑏2: 还没,估计到公司都九点了 (Not yet. I’ll be at the company at 9 o’clock)

History Contradiction (Persona-unrelated) / Inquiring History (User-M)

𝑢1: 我喜欢菊花,它在秋天开放太美了 (I like chrysanthemum. It blooms in autumn so beautifully.)

𝑏1: 菊花也美,秋天是我的幸运季节 (Chrysanthemum is beautiful. Autumn is my lucky season)

𝑢2: 你知道菊花在什么季节开放吗? (Do you know which season chrysanthemum blooms in?)

𝑏2: 不知道,你要告诉我么? (I don’t know, would you tell me?)

Figure 2: Data examples of contradiction categories (§3) / trigger methods (§4.2). Red texts denote the parts that make contradiction. Blue texts illustrate the trigger methods (i.e., how u2 are constructed). Perturbation (Negative) and Inquiring History are separately illustrated in Figure 1 and Figure 5 respectively.

Contradiction Detection for Conversation

The contradictions in dialogue systems can be split into two major types: Extrinsic and Intrinsic (Dziri et al., 2021; Ji et al., 2022). The Extrinsic type refers to the contradiction between a conversation and external information. For instance, the KvPI dataset (Song et al., 2020) focuses on the contradiction to structured attribute profiles. The DIALFACT benchmark (Gupta et al., 2022) aims at detecting contradictory statements to world facts and improving factual correctness. The CI-ToD dataset (Qin et al., 2021) involves the inconsistency with knowledge bases in task-oriented dialogue. One potential limitation of Extrinsic dialogue contradiction detection is that it may rely on static and manually curated external information (e.g., profiles), which could be insufficient in open-domain dialogue.

Our work focuses on the Intrinsic type, which refers to the contradiction inside a conversation and is more widespread and fundamental in open-domain dialogue. The DECODE dataset (Nie et al., 2021) is a relevant work to ours, whose contradiction cases are mostly collected by manually writing subsequent utterances to contradict the given dialogue histories. Besides the language difference, CDCONV is distinguished from DECODE in two aspects: (1) Apart from History Contradiction, CDCONV additionally covers two contradiction categories: Intra-sentence Contradiction and Role Confusion, which are also typical and common in human-bot conversations (§3). (2) Instead of being human-written, the contradiction cases in CDCONV are constructed by simulating the user behaviors that trigger chatbots to make contradictions (§4.2), which are closer to the real scenario
