ExPUNations: Augmenting Puns with Keywords and Explanations

Jiao Sun1†, Anjali Narayan-Chen2†, Shereen Oraby2, Alessandra Cervone2,
Tagyoung Chung2, Jing Huang2, Yang Liu2, Nanyun Peng2,3
1University of Southern California
2Amazon Alexa AI
3University of California, Los Angeles
jiaosun@usc.edu
{naraanja,orabys,cervon,tagyoung,jhuangz,yangliud}@amazon.com
violetpeng@cs.ucla.edu
Abstract
The tasks of humor understanding and generation are challenging and subjective even for humans, requiring commonsense and real-world knowledge to master. Puns, in particular, add the challenge of fusing that knowledge with the ability to interpret lexical-semantic ambiguity. In this paper, we present the ExPUNations (ExPUN) dataset, in which we augment an existing dataset of puns with detailed crowdsourced annotations of keywords denoting the most distinctive words that make the text funny, pun explanations describing why the text is funny, and fine-grained funniness ratings. This is the first humor dataset with such extensive and fine-grained annotations specifically for puns. Based on these annotations, we propose two tasks: explanation generation to aid with pun classification and keyword-conditioned pun generation, to challenge the current state-of-the-art natural language understanding and generation models' ability to understand and generate humor. We showcase that the annotated keywords we collect are helpful for generating better novel humorous texts in human evaluation, and that our natural language explanations can be leveraged to improve both the accuracy and robustness of humor classifiers.
1 Introduction
Humor serves multiple purposes and provides numerous benefits, such as relieving anxiety, avoiding painful feelings and facilitating learning (Buxman, 2008). As a specific example of humor, the creative uses of puns, wordplay and ambiguity are important ways to come up with jokes (Chiaro, 2006). Pun understanding and generation are particularly challenging tasks because they require extensive commonsense and world knowledge to compose and understand, even for humans. Despite growing interest in the area, there are limited amounts of data available in the domain of humor understanding and generation.

† Work done during Jiao's internship at Amazon.
† Both authors equally contributed to the paper.

Text: When artists dream in color it's a pigment of their imagination.
KWD: artists, dream, color, pigment, imagination
NLEx: Pigments are non-soluble materials often used in painting, and pigment sounds like figment, which is something that is not real but someone believes it is.

Text: The man found something to catch fish, which was a net gain.
KWD: catch fish, net gain
NLEx: This is a play on words. A "net gain" means an increase in revenue but here "net" refers to how a net is used to catch fish.

Table 1: Two examples of annotated Keywords (KWD) and Natural Language Explanations (NLEx) for puns in our dataset. The highlighted texts are annotated keywords that contribute to making the text funny.

Existing humor datasets are usually only annotated with binary labels indicating whether each sentence is a joke, pun, or punchline (Hasan et al., 2019; Weller and Seppi, 2019; Castro et al., 2018; Mittal et al., 2021). This is insufficient to benchmark models' ability to understand and generate novel humorous text, since hardly anything meaningful can be learned from such a sparse supervision signal and coarse-grained annotation.
To facilitate research on humor understanding and generation, we present the ExPUNations (ExPUN) dataset, in which we augment an existing dataset of puns from SemEval 2017 Task 7 (Miller et al., 2017) with detailed crowdsourced annotations of fine-grained funniness ratings on a Likert scale of one to five, along with keywords denoting the most distinctive words that make the text funny and natural language explanations describing why the text is funny (Table 1). In addition, we collect annotations indicating whether a person understands the sentence, thinks it is a pun, and finds
the joke offensive or inappropriate. Since these tasks are all highly subjective, we collect multiple annotations per sample, and present a detailed agreement analysis. We believe our annotations can be used in many other applications beyond pun understanding and generation, such as toxicity detection.

Text: Be True to your teeth, or they will be false to you.
Understandable: [1, 1, 1, 1, 0]
Offensive/Inappropriate: [0, 1, 0, 0, 0]
Is a joke?: [1, 0, 1, 0, 0]
Funniness (1-5): [2, 0, 1, 0, 0]
NLEx1: Talking about being true as in being real or they will be fake/false teeth.
NLEx2: False teeth are something people who lose their teeth may have, and being true to your teeth may be a way of saying take care of them otherwise you'll lose them.
KWD1: ["true", "teeth", "false"]
KWD2: ["be true", "teeth", "false to you"]

Text: Drinking too much of a certain potent potable may require a leave of absinthe.
Understandable: [1, 1, 1, 1, 1]
Offensive/Inappropriate: [0, 0, 0, 0, 0]
Is a joke?: [1, 1, 1, 1, 1]
Funniness (1-5): [3, 4, 2, 1, 2]
NLEx1: It's a pun that replaces the word absence with absinthe, which is notoriously strong alcohol.
NLEx2: This is a play on words. Absinthe here represents the liquor by the same name but is meant to replace the similar-sounding "absence". Too much absinthe will make you ill.
KWD1: ["drinking", "leave of absinthe"]
KWD2: ["drinking too much", "leave of absinthe"]

Table 2: Two examples with annotation fields that we collect. We use underline to mark the commonsense knowledge that people need in order to understand the joke.
The contributions of our work are threefold:
- We contribute extensive high-quality annotations for an existing humor dataset along multiple dimensions.[1]
- Based on the annotations, we propose two tasks, explanation generation for pun classification and keyword-conditioned pun generation, to advance research on humor understanding and generation.
- We benchmark state-of-the-art NLP models on explanation generation for pun classification and keyword-conditioned pun generation. Our experiments demonstrate the benefits of utilizing natural language keywords and explanations for humor understanding and generation while highlighting several potential areas of improvement for the existing models.
2 ExPUN Dataset
In this section, we describe our data annotation procedure, including details of the annotation fields and our assessment of the annotation quality.

[1] Resources will be available at: https://github.com/amazon-research/expunations
2.1 Data Preparation
The original SemEval 2017 Task 7 dataset (Miller et al., 2017)[2] contains puns that are either homographic (exploiting polysemy) or heterographic (exploiting phonological similarity to another word). The dataset also contains examples of non-pun text. We sample 1,999 text samples from SemEval 2017 Task 7 as the basis for our humor annotation.[3]

[2] https://alt.qcri.org/semeval2017/task7/. The data is released under the CC BY-NC 4.0 license (https://creativecommons.org/licenses/by-nc/4.0/legalcode).
[3] We sample 834 heterographic puns, 1,074 homographic puns and 91 non-puns.

2.2 Dataset Annotation

The annotated fields (AF) are collected in the following order:

AF1 [understandability]: whether the annotator understands the text or not, regardless of whether they perceive it as funny.
AF2 [offensiveness]: whether the annotator finds the text offensive or inappropriate.
AF3 [joke]: whether the annotator thinks the text is intended to be a joke.
AF4 [funniness]: rate the funniness on a Likert scale of 1-5, where 1 means not funny at all and 5 means very funny.
AF5 [explanation]: explain in concise natural language why this joke is funny. More specifically, if external or commonsense knowledge is required to understand the joke and/or its humor, the annotator should include the relevant knowledge in the explanation. If the joke is a pun or play on words, they must provide an explanation of how the play on words works.
AF6 [joke keywords]: pick out (as few as possible) keyword phrases from the joke that are related to the punchline/the reason the joke is funny. We emphasize that phrases should be sparse and mainly limited to content words, can be multiple words long, and the keywords should be copied verbatim from the joke.
If an annotator rates the instance as not understandable, they will skip the rest of the annotation for that instance (AF2-AF6). In addition, if an annotator rates an example as not a joke, they can skip the rest of the annotation (AF4-AF6). Table 2 shows two examples in our dataset. The first example has two annotators who think the text is a joke, and therefore it has two explanations. In the second instance, all annotators unanimously agree it is a joke. Here, we sample two explanations from the original five. For both instances, we use underline to highlight the external commonsense knowledge in the explanation. If the joke is a play on words, the explanation also shows how the play on words works (e.g., the second joke). We show the full annotation guidelines, including calibrating examples, in Appendix A.
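To make the annotation schema and skip logic concrete, the sketch below shows one way an ExPUN sample could be represented as a structured record. The class and field names are illustrative assumptions, not the released data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PunAnnotation:
    """One annotator's judgments for a single text (field names are hypothetical)."""
    understandable: bool                      # AF1
    offensive: Optional[bool] = None          # AF2; blank if text not understandable
    is_joke: Optional[bool] = None            # AF3; blank if text not understandable
    funniness: Optional[int] = None           # AF4; 1-5 Likert, blank if not a joke
    explanation: Optional[str] = None         # AF5; blank if not a joke
    keywords: List[str] = field(default_factory=list)  # AF6; verbatim spans from the text

@dataclass
class ExPunSample:
    """A pun (or non-pun) text with its five crowdsourced annotations."""
    text: str
    annotations: List[PunAnnotation]

# Illustrative record based on the second example in Table 1.
sample = ExPunSample(
    text="The man found something to catch fish, which was a net gain.",
    annotations=[
        PunAnnotation(
            understandable=True, offensive=False, is_joke=True, funniness=3,
            explanation=('This is a play on words. A "net gain" means an increase in '
                         'revenue but here "net" refers to how a net is used to catch fish.'),
            keywords=["catch fish", "net gain"],
        ),
    ],
)
```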
We crowdsourced 5 annotations per sample using a professional team of 10 dedicated full-time annotators within our organization. Before starting the task, we held a kick-off meeting with the team to explain the annotation guidelines in detail. We then conducted 3 pilot rounds for calibration and iteratively met with annotators, including more details and examples to address annotator questions.[4] Finally, we conducted 7 rounds of annotation, each with between 100-300 puns per round grouped into minibatches of 50 examples. Each sample in a minibatch was annotated by consistent subteams of 5 annotators. After receiving a completed batch of annotations, we manually examined their quality and provided feedback on any quality issues, redoing batches as necessary.
[4] See Appendix A.2 for more details on pilot round feedback.

2.3 Dataset Statistics and Quality Control

We report overall dataset statistics in Table 3. For AF1-AF3, we count the number of samples labeled positive by majority vote. For AF4, we compute the average of all funniness scores, excluding blank annotations, and find that while annotators recognized most samples as jokes, they did not find them to be particularly funny. For AF5 and AF6, we compute lexical statistics of our explanations and keyword annotations and provide deeper analysis of these key annotation fields in Section 2.4.

                                  total    AF1      AF2    AF3
# samples                         1,999    1,795    65     1,449

AF4: Avg. funniness               1.68

AF5: Explanations
  total # explanations            6,650
  avg. # explanations/sample      3.33
  avg. # tokens/expl.             31.67
  avg. # sentences/expl.          2.01

AF6: Keyword phrases
  avg. # tokens/keyword phrase    1.33
  avg. # keyword phrases/sample   2.09

Table 3: Overall stats for annotation fields in ExPUN.
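As a rough illustration of how the AF1-AF4 entries of Table 3 can be derived, here is a minimal sketch that counts majority-vote positives for a binary field and averages funniness while skipping blank (skipped) ratings. The list-of-lists input layout is an assumption for illustration, not the released data format.

```python
from statistics import mean
from typing import List, Optional

# Each inner list holds one sample's five annotator ratings; None marks a skipped annotation.
Ratings = List[List[Optional[int]]]

def majority_positive(ratings: Ratings) -> int:
    """Number of samples labeled positive by a majority of the non-blank votes (AF1-AF3)."""
    count = 0
    for votes in ratings:
        valid = [v for v in votes if v is not None]
        if valid and sum(valid) > len(valid) / 2:
            count += 1
    return count

def average_funniness(funniness: Ratings) -> float:
    """Average of all non-blank funniness scores across the dataset (AF4)."""
    scores = [v for votes in funniness for v in votes if v is not None]
    return mean(scores) if scores else 0.0

# Example: the "Is a joke?" votes from Table 2; only the second sample has a positive majority.
print(majority_positive([[1, 0, 1, 0, 0], [1, 1, 1, 1, 1]]))  # -> 1
```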
We report inter-annotator agreement for all annotation fields in Table 4.[5] For fields AF1-AF4, we compute agreement using (1) the average of Cohen's kappa scores of each annotator against the majority vote, and (2) the average Spearman correlation between each pair of annotators. We find that annotators show moderate agreement when deciding if the given text is a joke (AF3), but lower agreement on the task of understanding the text (AF1) as well as the much more subjective task of rating how funny a joke is (AF4). We also find weak average Spearman correlation between each pair of annotations for the subjective categories of offensiveness (AF2),[6] whether the text is a joke (AF3) and joke funniness (AF4).

For the free text fields in AF5 and AF6, we compute averaged BLEU-4 (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005) scores in a pairwise fashion. We treat each annotator's explanation (for AF5) or list of keyword phrases joined into a string (for AF6) as candidate text, with the remaining annotators' annotations as a set of references. We find high similarity between joke keyword annotations, suggesting that annotators identify similar spans of keyword phrases, and a lower degree of similarity between pun explanations.
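A minimal sketch of these agreement computations is shown below, assuming the ratings are stored as parallel per-annotator lists and using scikit-learn, SciPy, and NLTK. The exact preprocessing (tie-breaking in the majority vote, blank handling, tokenization) is an assumption rather than the paper's released code.

```python
from itertools import combinations
from statistics import mean, mode

from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def kappa_vs_majority(ratings_by_annotator):
    """Average Cohen's kappa of each annotator against the per-sample majority vote."""
    majority = [mode(sample_ratings) for sample_ratings in zip(*ratings_by_annotator)]
    return mean(cohen_kappa_score(r, majority) for r in ratings_by_annotator)

def pairwise_spearman(ratings_by_annotator):
    """Average Spearman correlation over all pairs of annotators."""
    return mean(spearmanr(a, b).correlation
                for a, b in combinations(ratings_by_annotator, 2))

def pairwise_bleu4(texts_by_annotator):
    """Each annotator's text is the candidate; the remaining annotators' texts are references."""
    smooth = SmoothingFunction().method1
    scores = []
    for i, cand in enumerate(texts_by_annotator):
        refs = [t.split() for j, t in enumerate(texts_by_annotator) if j != i]
        scores.append(sentence_bleu(refs, cand.split(), smoothing_function=smooth))
    return mean(scores)
```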
Annotation Field          κ      ρ      BLEU   MET.
AF1: Understand (0/1)     0.40   0.16   -      -
AF2: Offensive (0/1)      0.16   0.34   -      -
AF3: Joke (0/1)           0.58   0.32   -      -
AF4: Funny (1-5)          0.41   0.30   -      -
AF5: Explain (Text)       -      -      0.18   0.30
AF6: Keywords (Text)      -      -      0.58   0.74

Table 4: Agreement stats for annotated fields in the ExPUN dataset. We report averaged Cohen's κ and Spearman's ρ for numeric ratings (AF1-AF4), and averaged BLEU-4 and METEOR for text fields (AF5-AF6).

[5] When computing agreement, we exclude the first 100 annotated samples, as these were used as a calibrating pilot.
[6] See Appendix A.3 for more details.

2.4 Dataset Analysis

Explanations. As seen in Figures 1a and 1b, on average, samples are annotated with multiple explanations, and the explanations are lengthy, spanning multiple sentences, and lexically diverse (14,748 token vocabulary size, with 210,580 tokens overall). Figure 3 in Appendix B shows the distribution of the top 50 most frequent content words in our explanations. The frequent use of usually and often indicates the explanation of commonsense knowledge, e.g., thunder and lightning are usually present in a weather storm or "pain" means physical discomfort often felt by a hospital patient. The most frequent words, means and word, indicate that annotators frequently provide word sense information as part of their explanations, while sounds frequently appears in explanations of heterographic puns. Each of these most frequent words comprises less than 2.8% of all tokens in the explanations, illustrating the rich diversity of our corpus.[7]
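The vocabulary and frequency figures above come from straightforward token counting; the sketch below illustrates the idea, using NLTK's tokenizer and English stopword list as stand-ins for whatever preprocessing was actually applied.

```python
from collections import Counter

from nltk.corpus import stopwords       # requires nltk.download("stopwords")
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")

def explanation_stats(explanations, top_k=50):
    """Token count, vocabulary size, and most frequent content words over all explanations."""
    stop = set(stopwords.words("english"))
    tokens = [tok.lower() for text in explanations for tok in word_tokenize(text)]
    content = [t for t in tokens if t.isalpha() and t not in stop]
    return {
        "n_tokens": len(tokens),
        "vocab_size": len(set(tokens)),
        "top_content_words": Counter(content).most_common(top_k),
    }
```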
Keywords. As seen in Figures 1c and 1d, on average, keyword phrases in ExPUN, which are derived from the original puns, are short and sparse (5,497 token vocabulary size, with 27,820 tokens overall). This follows from our guidelines to annotate keywords concisely, focusing mainly on content words that are essential to understanding the joke. Table 5 shows two examples of pun keyword annotations in our dataset that showcase different annotation styles among annotators. For instance, one annotator may tend to select wordy keyword phrases that introduce unnecessary tokens, while another may omit salient keywords that other annotators mention. Aggregating these annotations among annotators to construct a single ground truth set of keyword phrases is therefore challenging because of differing annotation styles. The problem of merging keywords is further complicated because the keywords from different annotators are often not aligned well, as different annotators may annotate varying numbers of keyword phrases and different spans. Taking these considerations into account, we propose a keyword aggregation algorithm to address these issues and construct a single set of aggregated keywords per sample.

[7] We show an analysis of highly-frequent explanation templates, as well as unique and highly-informative templates, in Appendix B.

Figure 1: Distributions of (a) number of tokens and (b) number of sentences in explanations (AF5), (c) tokens in keyword phrases (AF6), and (d) keyword phrases per sample. Horizontal lines are used to show the min, mean, and max values for each distribution.
Keywords Aggregation. Algorithm 1 in Appendix C describes our keyword aggregation method. The algorithm aims to generate a comprehensive list of concise keywords for each sample. First, we compute a reliability score for each annotation, defined as the number of keyword phrases divided by the average number of tokens per keyword phrase. The higher the score, the more comprehensive and concise the keywords from an annotator should be. We choose the annotator with the highest score to be the anchor. We note, however, that keyword annotations are not always error-free; e.g., in the first example of Table 5, w4 has an incorrect word (fancy chairs instead of royal chairs). Therefore, for each keyword phrase, we compute the fuzzy matching score between the anchor's annotation and the rest of the annotators' annotations. For each annotator, we keep the keyword phrase that has the highest fuzzy matching score with the anchor annotator's, with a minimum threshold score of 60.[8] This process produces a filtered keyword list where each of the remaining keyword phrases looks similar to the anchor's. Then, we compute the average fuzzy matching score between the anchor's keyword phrase and each element in the filtered keyword list. We then choose the annotator with the second-highest reliability score to be the anchor…

[8] This is empirically determined.
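The anchor-based heuristic described above can be sketched roughly as follows. This is a simplified, hypothetical rendering, assuming rapidfuzz's fuzz.ratio (0-100 scale) as the fuzzy matching score since the excerpt does not name an implementation, and using the reliability definition reconstructed above; the full Algorithm 1 additionally handles the multi-round anchor selection.

```python
from statistics import mean
from rapidfuzz import fuzz  # assumed fuzzy-matching backend

def reliability(keywords):
    """More phrases (comprehensive) and fewer tokens per phrase (concise) -> higher score."""
    avg_tokens = mean(len(kw.split()) for kw in keywords)
    return len(keywords) / avg_tokens

def aggregate_keywords(annotations, threshold=60):
    """annotations: list of keyword-phrase lists, one per annotator (empty lists are dropped)."""
    annotations = [a for a in annotations if a]
    if not annotations:
        return []
    # Pick the most reliable annotator as the anchor.
    anchor = max(annotations, key=reliability)
    others = [a for a in annotations if a is not anchor]

    aggregated = []
    for phrase in anchor:
        # For each other annotator, keep their best match to the anchor phrase if above threshold.
        matches = []
        for ann in others:
            best = max(ann, key=lambda kw: fuzz.ratio(phrase, kw))
            if fuzz.ratio(phrase, best) >= threshold:
                matches.append(best)
        # Average similarity of the filtered list signals how well annotators agree on this span.
        support = mean(fuzz.ratio(phrase, m) for m in matches) if matches else 0.0
        aggregated.append((phrase, support))
    return aggregated
```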