2 Related Work
LLMs have demonstrated remarkable performance on tasks for which they were not explicitly trained
(Brown et al., 2020). Building on the hypothesis that these abilities arise due to implicit multitask
learning (Radford et al., 2019), the recent works of Sanh et al. (2022) and Wei et al. (2022) explicitly
train LLMs in a supervised multitask fashion, leading to models that are better zero-shot learners with
fewer parameters. Besides rapidly saturating language understanding benchmarks (Kiela et al., 2021),
these advancements make LLMs beneficial foundations for agents performing a plethora of tasks
(Adolphs et al., 2022; Reed et al., 2022). The trend towards using these models as agents increases the urgency of aligning them with human values (Kenton et al., 2021). However, larger
models trained with next-word prediction are generally more toxic and unhelpful (Gehman et al.,
2020; Bender et al., 2021; Lin et al., 2022). Recent work mitigates this with methods like prompting
and finetuning on human-annotated outputs (Askell et al., 2021; Ouyang et al., 2022; Thoppilan et al.,
2022). The resulting models are more aligned on desiderata such as informativeness when evaluated by dedicated benchmarks and human raters. We argue, however, that something is still missing from
these benchmarks. What is helpful and informative, as Kasirzadeh and Gabriel (2022) also point out,
depends on the context in which a conversation is held. Consequently, any application that requires
communicating with humans will rely on pragmatic communication skills—something that is not
explicitly captured by the benchmarks used to evaluate the alignment of LLMs.
There is a large body of work that investigates the interplay between pragmatics and computational
modeling (Cianflone et al., 2018; Schuster et al., 2020; Louis et al., 2020; Kim et al., 2021; Li et al.,
2021; Jeretic et al., 2020; Parrish et al., 2021; Hosseini et al., 2023). Cianflone et al. (2018) introduce
the task of predicting adverbial presupposition triggers, which are words like ‘again’ that trigger
the unspoken presupposition that an event has happened before. Schuster et al. (2020) study the
ability of computational models to do scalar inferences, finding that models use linguistic features
to make pragmatic inferences. Kim et al. (2021) find that a substantial portion of questions in question-answering datasets are unanswerable due to false presuppositions (e.g., “which linguist invented the lightbulb?”). Hosseini et al. (2023) present a dataset for selecting entities with indirect
answers, and find that language models adapted for this task achieve reasonable accuracy, but leave room for improvement. The difference between this body of work and ours is that we look at the emergence of pragmatic understanding from large-scale language modeling. Jeretic et al. (2020) and
Parrish et al. (2021) are early works investigating the emergence of pragmatic understanding in
pretrained language models, but they only look at scalar implicatures and presuppositions. Zheng
et al. (2021) are the first to evaluate pretrained language models on conversational implicatures. This
is important pioneering work highlighting the difficulty of implicature for language models, but their
evaluations require task-specific training and the models they evaluate are relatively small. In contrast,
our evaluation protocol is applicable out-of-the-box and is much more comprehensive, evaluating
models up to 176 billion parameters and using in-context prompting. Additionally, Zheng et al.
(2021) benchmark models on synthetic data, whereas this work evaluates performance on naturally occurring
implicatures (George and Mamidi, 2020). We believe this to be a better representation of the true
distribution of implicatures in natural dialogue.
The standard set of benchmarks LLMs are evaluated on covers many tasks, but even though implicature is one of the most important aspects of language pragmatics (Levinson, 1983), it is only
evaluated as part of BIG-bench (Srivastava et al., 2022). Unfortunately, the methodology used by the
BIG-bench implicature task contributors has limitations, which call into question the validity of their
claims. Firstly, the task contributors discard the subset of the data they deem ambiguous, which in our view defeats the point of the benchmark. Implicatures are a type of non-literal, ambiguous language whose intended meaning humans often interpret with ease; comparing how humans and models resolve this ambiguity is precisely what we are interested in. As a result, we expect performance on the
BIG-bench task to overestimate the ability of LLMs to resolve naturally occurring implicatures. We
keep this challenging subset of the data and instead use human evaluation to deal with examples that
are too ambiguous to understand. Secondly, the difference in performance between their average
and best rater is 18%, whereas for our evaluations this difference is 6%. This suggests their human evaluation is of lower quality, but this is impossible to verify because no details are available on how the annotation was done. Finally, BIG-bench evaluates only base LLMs and no models trained with SotA fine-tuning methods.
In summary, we use a more challenging dataset (and, in turn, at least six times more evaluations per model), provide higher-quality human annotations, and evaluate four different categories of LLMs to investigate which aspects of LLMs contribute to their performance on implicature understanding.