2 Related Work
LLMs have demonstrated remarkable performance on tasks for which they were not explicitly trained
(Brown et al., 2020). Building on the hypothesis that these abilities arise due to implicit multitask
learning (Radford et al., 2019), the recent works of Sanh et al. (2022) and Wei et al. (2022) explicitly
train LLMs in a supervised multitask fashion, leading to models that are better zero-shot learners with
fewer parameters. Besides rapidly saturating language understanding benchmarks (Kiela et al., 2021),
these advancements make LLMs beneficial foundations for agents performing a plethora of tasks
(Adolphs et al., 2022; Reed et al., 2022). The trend towards using these models as agents increases the urgency of aligning them with human values (Kenton et al., 2021). However, larger
models trained with next-word prediction are generally more toxic and unhelpful (Gehman et al.,
2020; Bender et al., 2021; Lin et al., 2022). Recent work mitigates this with methods like prompting
and finetuning on human-annotated outputs (Askell et al., 2021; Ouyang et al., 2022; Thoppilan et al.,
2022). The resulting models are more aligned on desiderata such as informativeness when evaluated by dedicated benchmarks and human raters. We argue, however, that something is still missing from
these benchmarks. What is helpful and informative, as Kasirzadeh and Gabriel (2022) also point out,
depends on the context in which a conversation is held. Consequently, any application that requires
communicating with humans will rely on pragmatic communication skills—something that is not
explicitly captured by the benchmarks used to evaluate the alignment of LLMs.
There is a large body of work that investigates the interplay between pragmatics and computational
modeling (Cianflone et al., 2018; Schuster et al., 2020; Louis et al., 2020; Kim et al., 2021; Li et al.,
2021; Jeretic et al., 2020; Parrish et al., 2021; Hosseini et al., 2023). Cianflone et al. (2018) introduce
the task of predicting adverbial presupposition triggers, which are words like ‘again’ that trigger
the unspoken presupposition that an event has happened before. Schuster et al. (2020) study the
ability of computational models to do scalar inferences, finding that models use linguistic features
to make pragmatic inferences. Kim et al. (2021) find that a substantial portion of questions in question-answering datasets are unanswerable due to false presuppositions (e.g., “which linguist invented the lightbulb?”). Hosseini et al. (2023) present a dataset for selecting entities with indirect
answers, and find that language models adapted for this task achieve reasonable accuracy, but leave room for improvement. The difference between this body of work and ours is that we look at the emergence of pragmatic understanding from large-scale language modeling. Jeretic et al. (2020) and
Parrish et al. (2021) are early works investigating the emergence of pragmatic understanding in
pretrained language models, but they only look at scalar implicatures and presuppositions. Zheng
et al. (2021) are the first to evaluate pretrained language models on conversational implicatures. This
is important pioneering work highlighting the difficulty of implicature for language models, but their
evaluations require task-specific training and the models they evaluate are relatively small. In contrast,
our evaluation protocol is applicable out-of-the-box and is much more comprehensive, evaluating
models up to 176 billion parameters and using in-context prompting. Additionally, Zheng et al.
(2021) benchmark models on synthetic data, whereas this work evaluates performance on naturally occurring
implicatures (George and Mamidi, 2020). We believe this to be a better representation of the true
distribution of implicatures in natural dialogue.
The standard set of benchmarks LLMs are evaluated on covers many tasks, but even though implicature is one of the most important aspects of language pragmatics (Levinson, 1983), it is only
evaluated as part of BIG-bench (Srivastava et al., 2022). Unfortunately, the methodology used by the
BIG-bench implicature task contributors has limitations, which call into question the validity of their
claims. Firstly, the task contributors discard the subset of the data they deem ambiguous, which in our view defeats the point of the benchmark. Implicatures are a type of non-literal, ambiguous language whose intended meaning humans often interpret with ease; comparing how humans and models resolve this ambiguity is precisely what we are interested in. As a result, we expect performance on the
BIG-bench task to overestimate the ability of LLMs to resolve naturally occurring implicatures. We
keep this challenging subset of the data and instead use human evaluation to deal with examples that
are too ambiguous to understand. Secondly, the difference in performance between their average
and best rater is 18%, whereas for our evaluations this difference is 6%. This suggests their human evaluation is of lower quality, but this is impossible to verify because no details are available on how the annotation was done. Finally, BIG-bench evaluates only base LLMs and no models trained with SotA fine-tuning methods.
In summary, we use a more challenging dataset (and, in turn, at least six times more evaluations per model), provide higher-quality human annotations, and evaluate four different categories of LLMs to investigate which aspects of LLMs contribute to their performance on implicature understanding.