CDCONV: A Benchmark for Contradiction Detection in
Chinese Conversations
Chujie Zheng1, Jinfeng Zhou1,2, Yinhe Zheng3, Libiao Peng3, Zhen Guo4,
Wenquan Wu4, Zheng-Yu Niu4, Hua Wu4, Minlie Huang1,3
1 The CoAI Group, Institute for Artificial Intelligence, State Key Lab of Intelligent Technology and Systems,
  Beijing National Research Center for Information Science and Technology, DCST, Tsinghua University, Beijing 100084, China
2 College of Intelligence and Computing, Tianjin University, Tianjin, China
3 Lingxin AI, Beijing 100084, China
4 Baidu Inc., China
chujiezhengchn@gmail.com jfzhou.mail@gmail.com aihuang@tsinghua.edu.cn
{guozhenguozhen, wuwenquan01, niuzhengyu, wu_hua}@baidu.com
Abstract
Dialogue contradiction is a critical issue in open-domain dialogue systems. The contextualization nature of conversations makes dialogue contradiction detection rather challenging. In this work, we propose a benchmark for Contradiction Detection in Chinese Conversations, namely CDCONV. It contains 12K multi-turn conversations annotated with three typical contradiction categories: Intra-sentence Contradiction, Role Confusion, and History Contradiction. To efficiently construct the CDCONV conversations, we devise a series of methods for automatic conversation generation, which simulate common user behaviors that trigger chatbots to make contradictions. We conduct careful manual quality screening of the constructed conversations and show that state-of-the-art Chinese chatbots can be easily goaded into making contradictions. Experiments on CDCONV show that properly modeling contextual information is critical for dialogue contradiction detection, but there are still unresolved challenges that require future research.¹
1 Introduction
Large-scale pre-training for dialogue generation (Zhang et al., 2020; Freitas et al., 2020) has advanced the development of engaging and human-like dialogue systems. Unfortunately, state-of-the-art open-domain chatbots, such as BlenderBot (Roller et al., 2021), EVA (Zhou et al., 2021; Gu et al., 2022) and PLATO (Bao et al., 2021b), still often behave inconsistently with their role or identity and produce utterances that are self-contradictory
Equal contribution.
Corresponding author.
¹ Our data and code are available at https://www.github.com/thu-coai/CDConv and https://github.com/PaddlePaddle/Knover/tree/dygraph/projects/cdconv
Non-contradiction
𝑢1: 喜欢吃面条吗? (Do you like noodles?)
𝑏1: 我喜欢吃米饭! (I love to eat rice!)
𝑢2: […]喜欢面条吗? (Don’t you like noodles?)
𝑏2: 不喜欢! (I dislike!)

Contradiction
𝑢1: 那你喜欢什么小动物呢? (What animals do you like?)
𝑏1: […] (Dogs, cats)
𝑢2: […]喜欢什么小动物? (What animals do you dislike?)
𝑏2: 不喜欢[…],其他的都喜欢 (I dislike cats. I like all the other animals)
{"model": "plato", "method": "反义-否定词", "u1": "那你喜欢看什么类型的电影?", "b1": "科幻,动作,恐怖,喜剧片都可以", "u2": "你不喜欢什么类型的电影?", "b2": "恐怖,血腥,血腥暴力的,还有恶心的", "label": 3, "persona": 2}
{"model": "plato", "method": "反义-否定词", "u1": "那你喜欢看什么类型的电影?", "b1": "科幻,动作,恐怖,喜剧片都可以", "u2": "你不喜欢什么类型的电影?", "b2": "不喜欢韩国电影", "label": 0}
{"model": "plato", "method": "反义-否定词", "u1": "那你喜欢看什么类型的电影?", "b1": "科幻,动作,恐怖,喜剧片都可以", "u2": "你不喜欢什么类型的电影?", "b2": "我不喜欢剧情片", "label": 0}
{"model": "plato", "method": "反义-否定词", "u1": "你难道不喜欢她吗?", "b1": "不喜欢,我有喜欢的人", "u2": "你喜欢她吗?", "b2": "不喜欢我还这么纠结啊?", "label": 3, "persona": 2}
{"model": "plato", "method": "反义-否定词", "u1": "你难道不喜欢她吗?", "b1": "不喜欢,我有喜欢的人", "u2": "你喜欢她吗?", "b2": "我也不知道,我感觉我不喜欢他了", "label": 1}
{"model": "plato", "method": "反义-否定词", "u1": "你难道不喜欢她吗?", "b1": "不喜欢,我有喜欢的人", "u2": "你喜欢她吗?", "b2": "不喜欢,不讨厌", "label": 0}
{"model": "plato", "method": "反义-否定词", "u1": "你难道不喜欢她吗?", "b1": "不喜欢,我有喜欢的人", "u2": "你喜欢她吗?", "b2": "不喜欢,只是有好感", "label": 0}
{"model": "plato", "method": "反义-否定词", "u1": "你喜欢干什么平时?", "b1": "玩游戏,逛街,看电影", "u2": "你不喜欢做什么?", "b2": "不喜欢逛街,太累了", "label": 3, "persona": 2}
{"model": "eva", "method": "反义-否定词", "u1": "那你下班之后最喜欢干的事是什么?", "b1": "吃饭睡觉打豆豆", "u2": "下班后你最不喜欢做什么?", "b2": "不喜欢睡觉,也不喜欢吃东西", "label": 3, "persona": 2}
Figure 1: Dialogue contradiction detection requires the full contextual information (including u1 and u2) rather than only the bot’s utterances (i.e., b1 and b2).
or contradict the dialogue history (Shuster et al., 2022; Gu et al., 2022; Xu et al., 2022a). Such inconsistency or contradiction phenomena violate Grice’s cooperative principle (Grice, 1975) and greatly impair the users’ long-term trust (Huang et al., 2020; Lee et al., 2022).
Dialogue contradiction detection has been shown to be an effective means of improving the consistency of chatbots (Welleck et al., 2019; Nie et al., 2021), but it remains a challenging task. Specifically, the contextualization nature of conversations makes it necessary to consider and model contextual information. For instance, in the “Contradiction” example in Figure 1, b2 does not explicitly contradict b1. However, given u1, the actual meaning of b1 is “I like dogs, cats”, so b1 and b2 are contradictory. In contrast, in the “Non-contradiction” example, while b1 and b2 seem inconsistent (“love” vs. “dislike”), b2 actually means “I dislike noodles” given the dialogue context. Hence, b2 is compatible with b1 and does not make a contradiction.
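The difference between the two views of a conversation can be sketched in a few lines of Python. This is only an illustration, not the authors' preprocessing: the record is copied from the JSON lines reproduced with Figure 1, while the `build_input` helper and the `[SEP]` separator string are assumptions introduced here.

```python
import json

# One record copied from the JSON lines reproduced with Figure 1.
RECORD = (
    '{"model": "plato", "method": "反义-否定词", '
    '"u1": "那你喜欢看什么类型的电影?", "b1": "科幻,动作,恐怖,喜剧片都可以", '
    '"u2": "你不喜欢什么类型的电影?", "b2": "恐怖,血腥,血腥暴力的,还有恶心的", '
    '"label": 3, "persona": 2}'
)

SEP = " [SEP] "  # assumed separator; a real tokenizer would supply its own


def build_input(conv: dict, use_context: bool) -> str:
    """Join the turns a detector would see into a single string.

    use_context=False keeps only the bot utterances (b1, b2);
    use_context=True keeps the whole conversation (u1, b1, u2, b2).
    """
    turns = ["u1", "b1", "u2", "b2"] if use_context else ["b1", "b2"]
    return SEP.join(conv[t] for t in turns)


conv = json.loads(RECORD)
bot_only = build_input(conv, use_context=False)
full = build_input(conv, use_context=True)
print(bot_only)
print(full)
```

In the bot-only string both utterances are just lists of film genres, so nothing signals that b1 answers "what do you like" and b2 answers "what do you dislike". Only the full string carries that information.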
Despite the above challenge, existing datasets for contradiction detection (Dziri et al., 2019; Welleck
arXiv:2210.08511v1 [cs.CL] 16 Oct 2022
Dataset                           Lang  Task Input              Task Type  Contradiction Categories
MNLI (2018)                       En    Sentence Pair           -          -
CMNLI (2020), OCNLI (2020)        Zh    Sentence Pair           -          -
DNLI (2019), InferConvAI (2019)   En    Sentence Pair           -          -
KvPI (2020)                       Zh    Conversation & Profile  Extrinsic  Profile
DIALFACT (2022)                   En    Conversation            Extrinsic  Fact
CI-ToD (2021)                     En    Conversation & KB       Int & Ext  Query, History & KB
DECODE (2021)                     En    Conversation            Intrinsic  History
CDCONV (Ours)                     Zh    Conversation            Intrinsic  Intra-sentence, Role, History
Table 1: Comparison of CDCONV with related benchmarks / datasets for (dialogue) contradiction detection. The Extrinsic type targets the contradiction between a conversation and external information (e.g., profiles or facts), while Intrinsic targets the contradiction inside a conversation. See §2 for detailed discussion.
et al., 2019) usually only consider the textual entailment relationship between two isolated sentences (Dagan et al., 2005), which is largely insufficient for dialogue contradiction detection because it neglects contextual information. A recent work (Nie et al., 2021) crowd-sourced a dataset named DECODE that contains conversations where the last utterances contradict the dialogue histories. However, DECODE lacks wide coverage of typical contradiction categories, and most of its contradiction cases are written by humans, which leaves a gap with the real scenario where users trigger chatbots to make contradictions.
In this work, we propose a benchmark for Contradiction Detection in Chinese Conversations, namely CDCONV. It contains 12K multi-turn conversations with human-annotated contradiction labels (§3). Different from previous work (e.g., Nie et al. 2021) that only considered the contradiction to dialogue history (i.e., History Contradiction), CDCONV covers another two typical categories: Intra-sentence Contradiction, where a reply contradicts itself, and Role Confusion, where a reply confuses the speaker’s role.
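As a rough illustration of how these categories might be represented in code, the sketch below maps the integer `label` and `persona` fields seen in the JSON records of this excerpt onto named categories. The mapping is inferred from those records alone and should be treated as an assumption: label 0 appears on non-contradictory replies, label 1 on the Intra-sentence example, and label 3 on History Contradiction examples; label 2 does not occur in the excerpt and is only assumed to be Role Confusion. The names `Contradiction`, `PERSONA` and `describe` are hypothetical.

```python
from enum import IntEnum


class Contradiction(IntEnum):
    NON_CONTRADICTION = 0
    INTRA_SENTENCE = 1
    ROLE_CONFUSION = 2  # assumed: no record with label 2 appears in the excerpt
    HISTORY = 3


# Persona sub-types of History Contradiction; only the two values that
# occur in the excerpt are listed, and their meaning is inferred.
PERSONA = {1: "attributes", 2: "opinions"}


def describe(record: dict) -> str:
    """Return a readable category name for one annotated record."""
    name = Contradiction(record["label"]).name
    sub = PERSONA.get(record.get("persona"))
    return f"{name} ({sub})" if sub else name


print(describe({"label": 3, "persona": 2}))
print(describe({"label": 1}))
print(describe({"label": 0}))
```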
Since the cases of non-contradiction and contradiction in natural human-bot conversations are extremely unbalanced (§3, Nie et al. 2021), we automatically construct the CDCONV conversations combined with elaborate manual inspection (§4.1). Specifically, we first devise a series of automatic methods to generate conversations (§4.2), which simulate the common user behaviors that trigger chatbots to make contradictions. We then conduct careful human screening and annotation for the constructed conversations to ensure the data quality (§4.3). We validate the effectiveness of the trigger methods and show that state-of-the-art Chinese open-domain chatbots (EVA and PLATO) can be easily goaded into making contradictions (§4.4).
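One simple way to check how effective a trigger method is would be to compute, per chatbot and method, the share of annotated conversations that ended in a contradiction. The sketch below does this on four abbreviated records taken from this excerpt; it is not the authors' evaluation code. It assumes that label 0 means non-contradiction, and the `contradiction_rate` helper is hypothetical. The method strings are kept verbatim from the records (roughly "antonym, negation word", "synonym, back-translation" and "inquiring, bot"). With so few records the printed rates are purely illustrative and say nothing about the figures reported in §4.4.

```python
import json
from collections import defaultdict

# Abbreviated copies of records that appear in this excerpt
# (only the fields used below are kept).
SAMPLE = """
{"model": "plato", "method": "反义-否定词", "label": 3, "persona": 2}
{"model": "plato", "method": "反义-否定词", "label": 0}
{"model": "eva", "method": "同义-回译", "label": 1}
{"model": "eva", "method": "设问-bot", "label": 3, "persona": 1}
""".strip().splitlines()


def contradiction_rate(records):
    """Fraction of records with a contradiction, per (model, method)."""
    total, hits = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r["model"], r["method"])
        total[key] += 1
        hits[key] += int(r["label"] != 0)  # assumed: 0 = non-contradiction
    return {key: hits[key] / total[key] for key in total}


records = [json.loads(line) for line in SAMPLE]
rates = contradiction_rate(records)
for (model, method), rate in sorted(rates.items()):
    print(model, method, f"{rate:.2f}")
```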
Finally, we evaluate popular Chinese pre-trained models on CDCONV (§5). Results show that properly modeling contextual information is critical for dialogue contradiction detection. However, there is still much room for future research in dialogue modeling, integrating commonsense and world knowledge, and reasoning.
Our contributions are summarized as follows:
• We propose CDCONV, a benchmark for contradiction detection in Chinese conversations. It contains 12K conversations annotated with three typical contradiction categories: Intra-sentence Contradiction, Role Confusion, and History Contradiction.
• We present a series of methods by simulating common user behaviors to automatically trigger chatbots to make contradictions. We demonstrate the effectiveness of these trigger methods through detailed human annotation.
• We evaluate popular Chinese pre-trained models on CDCONV. Results show the importance of properly modeling contextual information in dialogue contradiction detection, while this task is still far from solved and requires further study.
2 Related Work
Table 1 summarizes the comparison of CDCONV with related benchmarks / datasets for (dialogue) contradiction detection.
Contradiction Detection for Sentence Pair
Early work on contradiction detection usually adopted the natural language inference (NLI) framework (Dagan et al., 2005), such as the English MNLI (Williams et al., 2018) dataset and the Chinese
Intra-sentence Contradiction / Paraphrasing
𝑢1: 你都有孩子了吗? (Do you have children?)
𝑏1: 没有啊,我是说你来找我玩! (No, I mean you can come and see me!)
𝑢2: 你有孩子吗? (Do you have children?)
𝑏2: 没有,我只有一个女儿 (No, I only have a daughter.)

Role Confusion / Short Utterance
𝑢1: 那你今年上几年级了呢? (What grade are you this year?)
𝑏1: 开学初三,你呢? (The third year of junior high, and you?)
𝑢2: […] (Emm)
𝑏2: 我比你大呢,我开学高三 (Older than you, I’m in the third year of senior high)

History Contradiction (Persona Attributes) / Inquiring History (Bot)
𝑢1: 那你们是异地恋吗? (So, are you in a long-distance relationship?)
𝑏1: 不是,我也在北京 (No, I am also in Beijing)
𝑢2: 在哪里? (Where are you?)
𝑏2: 我在上海 (I am in Shanghai)

History Contradiction (Persona Opinions) / Perturbation (Antonym)
𝑢1: 哦哦,那你喜欢什么颜色? (Oh, what color do you like?)
𝑏1: […] (Black, white, gray)
𝑢2: 哦哦,那你讨厌什么颜色? (Oh, what color do you hate?)
𝑏2: 蓝色,灰色 (Blue, gray)

History Contradiction (Persona Experiences) / Perturbation (Synonym)
𝑢1: 你吃早饭了吗? (Have you had breakfast yet?)
𝑏1: 吃了呀,都快到公司了 (Yes, I’m about to arrive at the company)
𝑢2: 你吃早餐了吗? (Have you had breakfast yet?)
𝑏2: 还没,估计到公司都九点了 (Not yet. I’ll be at the company at 9 o’clock)

History Contradiction (Persona-unrelated) / Inquiring History (User-M)
𝑢1: 我喜欢菊花,它在秋天开放太美了 (I like chrysanthemum. It blooms in autumn so beautifully.)
𝑏1: 菊花也美,秋天是我的幸运季节 (Chrysanthemum is beautiful. Autumn is my lucky season)
𝑢2: 你知道菊花在什么季节开放吗? (Do you know which season chrysanthemum blooms in?)
𝑏2: 不知道,你要告诉我么? (I don’t know, would you tell me?)
{"file": "5nd", "model": "eva", "method": "同义-回译", "u1": "你都有孩子了吗?", "b1": "没有啊,我是说你来找我玩!", "u2": "你有孩子吗?", "b2": "没有,我只有一个女儿!", "label": 1}
{"file": "5nd", "model": "eva", "method": "同义-回译", "u1": "你好,在忙么?", "b1": "还好,你呢?", "u2": "你好,你忙吗?", "b2": "不忙,只是很忙!", "label": 1}
{"file": "2nd4", "model": "eva", "method": "设问-bot", "u1": "我也是,你现在是工作还是在上学?", "b1": "工作了,在上海,你呢?", "u2": "工作地点在哪里?", "b2": "上海啊,还没定下来呢。", "label": 1}
{"file": "1st", "model": "eva", "method": "设问-bot", "u1": "那你们是异地恋吗?", "b1": "不是,我在北京,他在北京。", "u2": "他在哪里?", "b2": "他在上海,我在上海。", "label": 3, "persona": 1}
Figure 2: Data examples of contradiction categories (§3) / trigger methods (§4.2). Red texts denote the parts that make contradiction. Blue texts illustrate the trigger methods (i.e., how u2 are constructed). Perturbation (Negative) and Inquiring History are illustrated in Figure 1 and Figure 5, respectively.
CMNLI (Xu et al., 2020) and OCNLI (Hu et al., 2020) datasets. The task input consists of two isolated sentences, which are labeled as one of the textual entailment relationships: “entailment”, “neutral” and “contradiction”. To extend the NLI framework to the dialogue domain, Welleck et al. (2019) constructed the DNLI dataset where the dialogue utterances and the persona descriptions from PersonaChat (Zhang et al., 2018) are used to form sentence pairs. Dziri et al. (2019) similarly synthesized the InferConvAI dataset through automatic manipulation with dialogue utterances. However, the NLI framework does not consider the contextualization nature of conversations, making it deficient for dialogue contradiction detection.
Contradiction Detection for Conversation
The contradictions in dialogue systems can be split into two major types: Extrinsic and Intrinsic (Dziri et al., 2021; Ji et al., 2022). The Extrinsic type refers to the contradiction between a conversation and external information. For instance, the KvPI dataset (Song et al., 2020) focuses on the contradiction to structured attribute profiles. The DIALFACT benchmark (Gupta et al., 2022) aims at detecting contradictory statements to world facts and improving factual correctness. The CI-ToD dataset (Qin et al., 2021) involves the inconsistency with knowledge bases in task-oriented dialogue. One potential limitation of Extrinsic dialogue contradiction detection is that it may rely on static and manually curated external information (e.g., profiles), which could be insufficient in open-domain dialogue.
Our work focuses on the Intrinsic type, which refers to the contradiction inside a conversation and is more widespread and fundamental in open-domain dialogue. The DECODE dataset (Nie et al., 2021) is a relevant work to ours, whose contradiction cases are mostly collected by manually writing subsequent utterances to contradict the given dialogue histories. Besides the language difference, CDCONV is distinguished from DECODE in two aspects: (1) Apart from History Contradiction, CDCONV additionally covers two contradiction categories: Intra-sentence Contradiction and Role Confusion, which are also typical and common in human-bot conversations (§3). (2) Instead of being human-written, the contradiction cases in CDCONV are constructed by simulating the user behaviors that trigger chatbots to make contradictions (§4.2), which are closer to the real scenario