ExPUNations: Augmenting Puns with Keywords and Explanations

Jiao Sun1†, Anjali Narayan-Chen2†, Shereen Oraby2, Alessandra Cervone2,
Tagyoung Chung2, Jing Huang2, Yang Liu2, Nanyun Peng2,3
1University of Southern California
2Amazon Alexa AI
3University of California, Los Angeles
jiaosun@usc.edu
{naraanja,orabys,cervon,tagyoung,jhuangz,yangliud}@amazon.com
violetpeng@cs.ucla.edu
Abstract
The tasks of humor understanding and generation are challenging and subjective even for humans, requiring commonsense and real-world knowledge to master. Puns, in particular, add the challenge of fusing that knowledge with the ability to interpret lexical-semantic ambiguity. In this paper, we present the ExPUNations (ExPUN) dataset, in which we augment an existing dataset of puns with detailed crowdsourced annotations of keywords denoting the most distinctive words that make the text funny, pun explanations describing why the text is funny, and fine-grained funniness ratings. This is the first humor dataset with such extensive and fine-grained annotations specifically for puns. Based on these annotations, we propose two tasks: explanation generation to aid with pun classification and keyword-conditioned pun generation, to challenge the current state-of-the-art natural language understanding and generation models' ability to understand and generate humor. We showcase that the annotated keywords we collect are helpful for generating better novel humorous texts in human evaluation, and that our natural language explanations can be leveraged to improve both the accuracy and robustness of humor classifiers.
1 Introduction
Humor serves multiple purposes and provides numerous benefits, such as relieving anxiety, avoiding painful feelings and facilitating learning (Buxman, 2008). As a specific example of humor, the creative uses of puns, wordplay and ambiguity are important ways to come up with jokes (Chiaro, 2006). Pun understanding and generation are particularly challenging tasks because they require extensive commonsense and world knowledge to compose and understand, even for humans. Despite growing interest in the area, there are limited amounts of data available in the domain of humor understanding and generation.

† Work done during Jiao's internship at Amazon.
† Both authors equally contributed to the paper.

Text: When artists dream in color it's a pigment of their imagination.
KWD: artists, dream, color, pigment, imagination
NLEx: Pigments are non-soluble materials often used in painting, and pigment sounds like figment, which is something that is not real but someone believes it is.

Text: The man found something to catch fish, which was a net gain.
KWD: catch fish, net gain
NLEx: This is a play on words. A "net gain" means an increase in revenue but here "net" refers to how a net is used to catch fish.

Table 1: Two examples of annotated Keywords (KWD) and Natural Language Explanations (NLEx) for puns in our dataset. The highlighted texts are annotated keywords that contribute to making the text funny.

Existing humor datasets are usually only annotated with binary labels indicating whether each sentence is a joke, pun, or punchline (Hasan et al., 2019; Weller and Seppi, 2019; Castro et al., 2018; Mittal et al., 2021). This is insufficient to benchmark models' ability to understand and generate novel humorous text, since hardly anything meaningful can be learned from such a sparse supervision signal and coarse-grained annotation.
To facilitate research on humor understanding and generation, we present the ExPUNations (ExPUN) dataset, in which we augment an existing dataset of puns from SemEval 2017 Task 7 (Miller et al., 2017) with detailed crowdsourced annotations of fine-grained funniness ratings on a Likert scale of one to five, along with keywords denoting the most distinctive words that make the text funny and natural language explanations describing why the text is funny (Table 1). In addition, we collect annotations indicating whether a person understands the sentence, thinks it is a pun, and finds
the joke offensive or inappropriate. Since these tasks are all highly subjective, we collect multiple annotations per sample, and present a detailed agreement analysis. We believe our annotations can be used in many other applications beyond pun understanding and generation, such as toxicity detection.

Text: Be True to your teeth, or they will be false to you.
Understandable: [1, 1, 1, 1, 0]
Offensive/Inappropriate: [0, 1, 0, 0, 0]
Is a joke?: [1, 0, 1, 0, 0]
Funniness (1-5): [2, 0, 1, 0, 0]
NLEx1: Talking about being true as in being real or they will be fake/false teeth.
NLEx2: False teeth are something people who lose their teeth may have, and being true to your teeth may be a way of saying take care of them otherwise you'll lose them.
KWD1: ["true", "teeth", "false"]
KWD2: ["be true", "teeth", "false to you"]

Text: Drinking too much of a certain potent potable may require a leave of absinthe.
Understandable: [1, 1, 1, 1, 1]
Offensive/Inappropriate: [0, 0, 0, 0, 0]
Is a joke?: [1, 1, 1, 1, 1]
Funniness (1-5): [3, 4, 2, 1, 2]
NLEx1: It's a pun that replaces the word absence with absinthe, which is notoriously strong alcohol.
NLEx2: This is a play on words. Absinthe here represents the liquor by the same name but is meant to replace the similar-sounding "absence". Too much absinthe will make you ill.
KWD1: ["drinking", "leave of absinthe"]
KWD2: ["drinking too much", "leave of absinthe"]

Table 2: Two examples with annotation fields that we collect. We use underline to mark the commonsense knowledge that people need in order to understand the joke.
The contributions of our work are threefold:
- We contribute extensive high-quality annotations for an existing humor dataset along multiple dimensions.[1]
- Based on the annotations, we propose two tasks, explanation generation for pun classification and keyword-conditioned pun generation, to advance research on humor understanding and generation.
- We benchmark state-of-the-art NLP models on explanation generation for pun classification and keyword-conditioned pun generation. Our experiments demonstrate the benefits of utilizing natural language keywords and explanations for humor understanding and generation while highlighting several potential areas of improvement for the existing models.
2 ExPUN Dataset
In this section, we describe our data annotation procedure, including details of the annotation fields and our assessment of the annotation quality.

[1] Resources will be available at: https://github.com/amazon-research/expunations
2.1 Data Preparation
The original SemEval 2017 Task 7 dataset (Miller et al., 2017)[2] contains puns that are either homographic (exploiting polysemy) or heterographic (exploiting phonological similarity to another word). The dataset also contains examples of non-pun text. We sample 1,999 text samples from SemEval 2017 Task 7 as the basis for our humor annotation.[3]

[2] https://alt.qcri.org/semeval2017/task7/. The data is released under the CC BY-NC 4.0 license (https://creativecommons.org/licenses/by-nc/4.0/legalcode).
[3] We sample 834 heterographic puns, 1,074 homographic puns and 91 non-puns.

2.2 Dataset Annotation

The annotated fields (AF) are collected in the following order:

AF1 [understandability]: whether the annotator understands the text or not, regardless of whether they perceive it as funny.
AF2 [offensiveness]: whether the annotator finds the text offensive or inappropriate.
AF3 [joke]: whether the annotator thinks the text is intended to be a joke.
AF4 [funniness]: rate the funniness on a Likert scale of 1-5, where 1 means not funny at all and 5 means very funny.
AF5 [explanation]: explain in concise natural language why this joke is funny. More specifically, if external or commonsense knowledge is required to understand the joke and/or its humor, the annotator should include the relevant knowledge in the explanation. If the joke is a pun or play on words, they must provide an explanation of how the play on words works.
AF6 [joke keywords]: pick out (as few as possible) keyword phrases from the joke that are related to the punchline/the reason the joke is funny. We emphasize that phrases should be sparse and mainly limited to content words, can be multiple words long, and the keywords should be copied verbatim from the joke.
If an annotator rates the instance as not understandable, they will skip the rest of the annotation for that instance (AF2-AF6). In addition, if an annotator rates an example as not a joke, they can skip the rest of the annotation (AF4-AF6). Table 2 shows two examples in our dataset. The first example has two annotators who think the text is a joke, and therefore it has two explanations. In the second instance, all annotators unanimously agree it is a joke. Here, we sample two explanations from the original five. For both instances, we use underline to highlight the external commonsense knowledge in the explanation. If the joke is a play on words, the explanation also shows how the play on words works (e.g., the second joke). We show the full annotation guidelines, including calibrating examples, in Appendix A.
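To make the annotation schema and skip logic concrete, the sketch below shows one way an ExPUN sample could be represented as a structured record. The class and field names are illustrative assumptions, not the released data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PunAnnotation:
    """One annotator's judgments for a single text (field names are hypothetical)."""
    understandable: bool                      # AF1
    offensive: Optional[bool] = None          # AF2; blank if text not understandable
    is_joke: Optional[bool] = None            # AF3; blank if text not understandable
    funniness: Optional[int] = None           # AF4; 1-5 Likert, blank if not a joke
    explanation: Optional[str] = None         # AF5; blank if not a joke
    keywords: List[str] = field(default_factory=list)  # AF6; verbatim spans from the text

@dataclass
class ExPunSample:
    """A pun (or non-pun) text with its five crowdsourced annotations."""
    text: str
    annotations: List[PunAnnotation]

# Illustrative record based on the second example in Table 1.
sample = ExPunSample(
    text="The man found something to catch fish, which was a net gain.",
    annotations=[
        PunAnnotation(
            understandable=True, offensive=False, is_joke=True, funniness=3,
            explanation=('This is a play on words. A "net gain" means an increase in '
                         'revenue but here "net" refers to how a net is used to catch fish.'),
            keywords=["catch fish", "net gain"],
        ),
    ],
)
```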
We crowdsourced 5 annotations per sample using a professional team of 10 dedicated full-time annotators within our organization. Before starting the task, we held a kick-off meeting with the team to explain the annotation guidelines in detail. We then conducted 3 pilot rounds for calibration and iteratively met with annotators, including more details and examples to address annotator questions.[4] Finally, we conducted 7 rounds of annotation, each with between 100-300 puns per round grouped into minibatches of 50 examples. Each sample in a minibatch was annotated by consistent subteams of 5 annotators. After receiving a completed batch of annotations, we manually examined their quality and provided feedback on any quality issues, redoing batches as necessary.
[4] See Appendix A.2 for more details on pilot round feedback.

2.3 Dataset Statistics and Quality Control

We report overall dataset statistics in Table 3. For AF1-AF3, we count the number of samples labeled positive by majority vote. For AF4, we compute the average of all funniness scores, excluding blank annotations, and find that while annotators recognized most samples as jokes, they did not find them to be particularly funny. For AF5 and AF6, we compute lexical statistics of our explanations and keyword annotations and provide deeper analysis of these key annotation fields in Section 2.4.

                                  total    AF1      AF2    AF3
# samples                         1,999    1,795    65     1,449

AF4: Avg. funniness               1.68

AF5: Explanations
  total # explanations            6,650
  avg. # explanations/sample      3.33
  avg. # tokens/expl.             31.67
  avg. # sentences/expl.          2.01

AF6: Keyword phrases
  avg. # tokens/keyword phrase    1.33
  avg. # keyword phrases/sample   2.09

Table 3: Overall stats for annotation fields in ExPUN.
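As a rough illustration of how the AF1-AF4 entries of Table 3 can be derived, here is a minimal sketch that counts majority-vote positives for a binary field and averages funniness while skipping blank (skipped) ratings. The list-of-lists input layout is an assumption for illustration, not the released data format.

```python
from statistics import mean
from typing import List, Optional

# Each inner list holds one sample's five annotator ratings; None marks a skipped annotation.
Ratings = List[List[Optional[int]]]

def majority_positive(ratings: Ratings) -> int:
    """Number of samples labeled positive by a majority of the non-blank votes (AF1-AF3)."""
    count = 0
    for votes in ratings:
        valid = [v for v in votes if v is not None]
        if valid and sum(valid) > len(valid) / 2:
            count += 1
    return count

def average_funniness(funniness: Ratings) -> float:
    """Average of all non-blank funniness scores across the dataset (AF4)."""
    scores = [v for votes in funniness for v in votes if v is not None]
    return mean(scores) if scores else 0.0

# Example: the "Is a joke?" votes from Table 2; only the second sample has a positive majority.
print(majority_positive([[1, 0, 1, 0, 0], [1, 1, 1, 1, 1]]))  # -> 1
```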
We report inter-annotator agreement for all annotation fields in Table 4.[5] For fields AF1-AF4, we compute agreement using (1) the average of Cohen's kappa scores of each annotator against the majority vote, and (2) the average Spearman correlation between each pair of annotators. We find that annotators show moderate agreement when deciding if the given text is a joke (AF3), but lower agreement on the task of understanding the text (AF1) as well as the much more subjective task of rating how funny a joke is (AF4). We also find weak average Spearman correlation between each pair of annotations for the subjective categories of offensiveness (AF2),[6] whether the text is a joke (AF3) and joke funniness (AF4).

For the free text fields in AF5 and AF6, we compute averaged BLEU-4 (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005) scores in a pairwise fashion. We treat each annotator's explanation (for AF5) or list of keyword phrases joined into a string (for AF6) as candidate text, with the remaining annotators' annotations as a set of references. We find high similarity between joke keyword annotations, suggesting that annotators identify similar spans of keyword phrases, and a lower degree of similarity between pun explanations.
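A minimal sketch of these agreement computations is shown below, assuming the ratings are stored as parallel per-annotator lists and using scikit-learn, SciPy, and NLTK. The exact preprocessing (tie-breaking in the majority vote, blank handling, tokenization) is an assumption rather than the paper's released code.

```python
from itertools import combinations
from statistics import mean, mode

from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def kappa_vs_majority(ratings_by_annotator):
    """Average Cohen's kappa of each annotator against the per-sample majority vote."""
    majority = [mode(sample_ratings) for sample_ratings in zip(*ratings_by_annotator)]
    return mean(cohen_kappa_score(r, majority) for r in ratings_by_annotator)

def pairwise_spearman(ratings_by_annotator):
    """Average Spearman correlation over all pairs of annotators."""
    return mean(spearmanr(a, b).correlation
                for a, b in combinations(ratings_by_annotator, 2))

def pairwise_bleu4(texts_by_annotator):
    """Each annotator's text is the candidate; the remaining annotators' texts are references."""
    smooth = SmoothingFunction().method1
    scores = []
    for i, cand in enumerate(texts_by_annotator):
        refs = [t.split() for j, t in enumerate(texts_by_annotator) if j != i]
        scores.append(sentence_bleu(refs, cand.split(), smoothing_function=smooth))
    return mean(scores)
```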
Annotation Field          κ      ρ      BLEU   MET.
AF1: Understand (0/1)     0.40   0.16   -      -
AF2: Offensive (0/1)      0.16   0.34   -      -
AF3: Joke (0/1)           0.58   0.32   -      -
AF4: Funny (1-5)          0.41   0.30   -      -
AF5: Explain (Text)       -      -      0.18   0.30
AF6: Keywords (Text)      -      -      0.58   0.74

Table 4: Agreement stats for annotated fields in the ExPUN dataset. We report averaged Cohen's κ and Spearman's ρ for numeric ratings (AF1-AF4), and averaged BLEU-4 and METEOR for text fields (AF5-AF6).

[5] When computing agreement, we exclude the first 100 annotated samples, as these were used as a calibrating pilot.
[6] See Appendix A.3 for more details.

2.4 Dataset Analysis

Explanations. As seen in Figures 1a and 1b, on average, samples are annotated with multiple explanations, and the explanations are lengthy, spanning multiple sentences, and lexically diverse (14,748 token vocabulary size, with 210,580 tokens overall). Figure 3 in Appendix B shows the distribution of the top 50 most frequent content words in our explanations. The frequent use of usually and often indicates the explanation of commonsense knowledge, e.g., thunder and lightning are usually present in a weather storm or "pain" means physical discomfort often felt by a hospital patient. The most frequent words, means and word, indicate that annotators frequently provide word sense information as part of their explanations, while sounds frequently appears in explanations of heterographic puns. Each of these most frequent words comprises less than 2.8% of all tokens in the explanations, illustrating the rich diversity of our corpus.[7]
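The vocabulary and frequency figures above come from straightforward token counting; the sketch below illustrates the idea, using NLTK's tokenizer and English stopword list as stand-ins for whatever preprocessing was actually applied.

```python
from collections import Counter

from nltk.corpus import stopwords       # requires nltk.download("stopwords")
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")

def explanation_stats(explanations, top_k=50):
    """Token count, vocabulary size, and most frequent content words over all explanations."""
    stop = set(stopwords.words("english"))
    tokens = [tok.lower() for text in explanations for tok in word_tokenize(text)]
    content = [t for t in tokens if t.isalpha() and t not in stop]
    return {
        "n_tokens": len(tokens),
        "vocab_size": len(set(tokens)),
        "top_content_words": Counter(content).most_common(top_k),
    }
```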
Keywords. As seen in Figures 1c and 1d, on average, keyword phrases in ExPUN, which are derived from the original puns, are short and sparse (5,497 token vocabulary size, with 27,820 tokens overall). This follows from our guidelines to annotate keywords concisely, focusing mainly on content words that are essential to understanding the joke. Table 5 shows two examples of pun keyword annotations in our dataset that showcase different annotation styles among annotators. For instance, one annotator may tend to select wordy keyword phrases that introduce unnecessary tokens, while another may omit salient keywords that other annotators mention. Aggregating these annotations among annotators to construct a single ground truth set of keyword phrases is therefore challenging because of differing annotation styles. The problem of merging keywords is further complicated because the keywords from different annotators are often not aligned well, as different annotators may annotate varying numbers of keyword phrases and different spans. Taking these considerations into account, we propose a keyword aggregation algorithm to address these issues and construct a single set of aggregated keywords per sample.

[7] We show an analysis of highly-frequent explanation templates, as well as unique and highly-informative templates, in Appendix B.

Figure 1: Distributions of (a) number of tokens and (b) number of sentences in explanations (AF5), (c) tokens in keyword phrases (AF6), and (d) keyword phrases per sample. Horizontal lines are used to show the min, mean, and max values for each distribution.
Keywords Aggregation. Algorithm 1 in Appendix C describes our keyword aggregation method. The algorithm aims to generate a comprehensive list of concise keywords for each sample. First, we compute a reliability score for each annotation, defined as the number of keyword phrases divided by the average number of tokens per keyword phrase. The higher the score, the more comprehensive and concise the keywords from an annotator should be. We choose the annotator with the highest score to be the anchor. We note, however, that keyword annotations are not always error-free; e.g., in the first example of Table 5, w4 has an incorrect word (fancy chairs instead of royal chairs). Therefore, for each keyword phrase, we compute the fuzzy matching score between the anchor's annotation and the rest of the annotators' annotations. For each annotator, we keep the keyword phrase that has the highest fuzzy matching score with the anchor annotator's, with a minimum threshold score of 60.[8] This process produces a filtered keyword list where each of the remaining keyword phrases looks similar to the anchor's. Then, we compute the average fuzzy matching score between the anchor's keyword phrase and each element in the filtered keyword list. We then choose the annotator with the second-highest reliability score to be the anchor…

[8] This is empirically determined.
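The anchor-based heuristic described above can be sketched roughly as follows. This is a simplified, hypothetical rendering, assuming rapidfuzz's fuzz.ratio (0-100 scale) as the fuzzy matching score since the excerpt does not name an implementation, and using the reliability definition reconstructed above; the full Algorithm 1 additionally handles the multi-round anchor selection.

```python
from statistics import mean
from rapidfuzz import fuzz  # assumed fuzzy-matching backend

def reliability(keywords):
    """More phrases (comprehensive) and fewer tokens per phrase (concise) -> higher score."""
    avg_tokens = mean(len(kw.split()) for kw in keywords)
    return len(keywords) / avg_tokens

def aggregate_keywords(annotations, threshold=60):
    """annotations: list of keyword-phrase lists, one per annotator (empty lists are dropped)."""
    annotations = [a for a in annotations if a]
    if not annotations:
        return []
    # Pick the most reliable annotator as the anchor.
    anchor = max(annotations, key=reliability)
    others = [a for a in annotations if a is not anchor]

    aggregated = []
    for phrase in anchor:
        # For each other annotator, keep their best match to the anchor phrase if above threshold.
        matches = []
        for ann in others:
            best = max(ann, key=lambda kw: fuzz.ratio(phrase, kw))
            if fuzz.ratio(phrase, best) >= threshold:
                matches.append(best)
        # Average similarity of the filtered list signals how well annotators agree on this span.
        support = mean(fuzz.ratio(phrase, m) for m in matches) if matches else 0.0
        aggregated.append((phrase, support))
    return aggregated
```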