The Goldilocks of Pragmatic Understanding:
Fine-Tuning Strategy Matters for Implicature
Resolution by LLMs
Laura Ruis∗
University College London
Akbir Khan
University College London
Stella Biderman
EleutherAI, Booz Allen Hamilton
Sara Hooker
Cohere for AI
Tim Rocktäschel
University College London
Edward Grefenstette
University College London
Abstract
Despite widespread use of LLMs as conversational agents, evaluations of performance fail to capture a crucial aspect of communication: interpreting language in
context—incorporating its pragmatics. Humans interpret language using beliefs
and prior knowledge about the world. For example, we intuitively understand
the response “I wore gloves” to the question “Did you leave fingerprints?” as
meaning “No”. To investigate whether LLMs have the ability to make this type
of inference, known as an implicature, we design a simple task and evaluate four
categories of widely used state-of-the-art models. We find that, despite only evaluating on utterances that require a binary inference (yes or no), models in three of
these categories perform close to random. However, LLMs instruction-tuned at
the example level perform significantly better. These results suggest that certain
fine-tuning strategies are far better at inducing pragmatic understanding in models.
We present our findings as the starting point for further research into evaluating
how LLMs interpret language in context and to drive the development of more
pragmatic and useful models of human discourse.
1 Introduction
User: “Have you seen my phone?”
GPT-3: “Yes, I have seen your phone.”
GPT-3’s response² is a perfectly fine answer to the question, but a human might answer differently. They might respond “it’s in your bag”, bypassing the obvious follow-up question (“where is it?”).
Giving such a helpful and efficient answer is an example of pragmatic language use that goes
beyond the mere production of semantically plausible and consistent utterances. Meaning is not only
determined by a combination of words, but also context, beliefs, and social institutions (Wittgenstein,
1953; Grice, 1975; Huang, 2017). Consider another exchange where Esther asks her friend Juan “Can
you come to my party on Friday?” and Juan responds “I have to work”. We resolve Juan’s response
as him declining the invitation by using the contextual commonsense knowledge that having to work
on a Friday night precludes attendance. Both these exchanges contain an implicature—utterances that
convey something other than their literal meaning.³ Implicatures illustrate how context contributes to meaning, distinguishing writing and speaking from communicating (Green, 1996). We cannot fully understand utterances without understanding their implications. Indeed, the term “communication” presupposes the speaker’s implications are understood by the addressee. Being able to resolve completely novel implicatures and, more broadly, engage in pragmatic understanding constitutes an essential and ubiquitous aspect of our everyday use of language.

∗Correspondence to laura.ruis.21@ucl.ac.uk
² Appendix D contains details on how this completion was obtained from text-davinci-002.
³ In Appendix E we present an introduction to implicature.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).
arXiv:2210.14986v2 [cs.CL] 3 Dec 2023

Figure 1: A schematic depiction of the protocol we propose to evaluate whether language models can resolve implicatures. Each example in the test set gets wrapped in templates and transformed into an incoherent example by swapping “yes” and “no”. The model is said to resolve the implicature if it assigns a higher likelihood to the coherent text than the incoherent text.
Large language models (LLMs) have demonstrated remarkable ability on a variety of downstream
tasks such as planning (Huang et al., 2022), commonsense reasoning (Kojima et al., 2022), information
retrieval (Lewis et al., 2020; Kim et al., 2022) and code completion (Austin et al., 2021; Biderman and
Raff, 2022), to name a few. When fine-tuned with human feedback, LLMs obtain higher ratings on
desiderata like helpfulness (Ouyang et al., 2022; Bai et al., 2022), and are proposed as conversational
agents (Thoppilan et al., 2022). Despite the widespread use of LLMs as conversational agents, there
has been limited evaluation of their ability to navigate contextual commonsense knowledge.
This raises an important question: to what extent can large language models resolve conversational
implicature? To answer this question we use a public dataset of conversational implicatures and
propose an evaluation protocol on top of it (Figure 1). We evaluate a range of state-of-the-art models
that can be categorised into four groups: large-scale pre-trained models, like OPT (Zhang et al.,
2022), LLMs fine-tuned on conversational data, like BlenderBot (Ng et al., 2019), LLMs fine-tuned
on common NLP benchmarks with natural instructions for each benchmark, like Flan-T5 (Chung
et al., 2022), and LLMs fine-tuned on tasks with natural instructions for each example, e.g. versions
of OpenAI’s InstructGPT-3 series⁴. Our results show that implicature resolution is a challenging task
for LLMs. All pre-trained models obtain close to random zero-shot accuracy (around 60%), whereas
humans obtain 86%. However, our results suggest that instruction-tuning at the example level is
important for pragmatic understanding. Models fine-tuned with this method perform much better
than others, and analysis of different model sizes shows that they have the best scaling properties. We
further push performance for these models with chain-of-thought prompting, and find that one model
in the group (GPT-4) reaches human-level performance. In summary, we conclude that pragmatic
understanding has not yet arisen from large-scale pre-training on its own, but scaling analysis shows
that it might at much larger scale. Fine-tuning on conversational data or benchmark-level instructions does not produce models with pragmatic understanding. However, fine-tuning on instructions at the example level is a fruitful path towards more useful models of human discourse.
The main contributions of this work are: i) we motivate implicature understanding as a crucial
aspect of communication that is currently mostly missing from evaluations of LLMs, ii) we design an
implicature resolution task and propose a comprehensive evaluation protocol on which we evaluate
both humans and LLMs, finding that it poses a significant challenge for SotA LLMs, and iii) we
provide a thorough analysis of the results and identify one fine-tuning strategy (instruction-tuning at
the example level) as a promising method that produces models with more pragmatic understanding.
⁴ The precise method is unpublished and differs from the original InstructGPT (Ouyang et al., 2022).
2 Related Work
LLMs have demonstrated remarkable performance on tasks for which they were not explicitly trained
(Brown et al., 2020). Building on the hypothesis that these abilities arise due to implicit multitask
learning (Radford et al., 2019), the recent works of Sanh et al. (2022) and Wei et al. (2022) explicitly
train LLMs in a supervised multitask fashion, leading to models that are better zero-shot learners with
fewer parameters. Besides rapidly saturating language understanding benchmarks (Kiela et al., 2021),
these advancements make LLMs beneficial foundations for agents performing a plethora of tasks
(Adolphs et al., 2022; Reed et al., 2022). The trend towards using these models as agents brings along
with it increased urgency for alignment with human values (Kenton et al., 2021). However, larger
models trained with next-word prediction are generally more toxic and unhelpful (Gehman et al.,
2020; Bender et al., 2021; Lin et al., 2022). Recent work mitigates this with methods like prompting
and finetuning on human-annotated outputs (Askell et al., 2021; Ouyang et al., 2022; Thoppilan et al.,
2022). The produced models are more aligned on desiderata such as informativeness when evaluated
by dedicated benchmarks and humans. We argue, however, that there is still something missing in
these benchmarks. What is helpful and informative, as Kasirzadeh and Gabriel (2022) also point out,
depends on the context in which a conversation is held. Consequently, any application that requires
communicating with humans will rely on pragmatic communication skills—something that is not
explicitly captured by the benchmarks used to evaluate the alignment of LLMs.
There is a large body of work that investigates the interplay between pragmatics and computational
modeling (Cianflone et al., 2018; Schuster et al., 2020; Louis et al., 2020; Kim et al., 2021; Li et al.,
2021; Jeretic et al., 2020; Parrish et al., 2021; Hosseini et al., 2023). Cianflone et al. (2018) introduce
the task of predicting adverbial presupposition triggers, which are words like ‘again’ that trigger
the unspoken presupposition that an event has happened before. Schuster et al. (2020) study the
ability of computational models to do scalar inferences, finding that models use linguistic features
to make pragmatic inferences. Kim et al. (2021) find that question-answering datasets contain a substantial number of questions that are unanswerable due to false presuppositions (e.g. “which linguist
invented the lightbulb”). Hosseini et al. (2023) present a dataset for selecting entities with indirect
answers, and find that language models adapted for this task get reasonable accuracy, but that there
is room for improvement. The difference between this body of work and ours is that we look at the
emergence of pragmatic understanding from large-scale language modeling. Jeretic et al. (2020);
Parrish et al. (2021) are early works investigating the emergence of pragmatic understanding in
pretrained language models, but they only look at scalar implicatures and presuppositions. Zheng
et al. (2021) are the first to evaluate pretrained language models on conversational implicatures. This
is important pioneering work highlighting the difficulty of implicature for language models, but their
evaluations require task-specific training and the models they evaluate are relatively small. In contrast,
our evaluation protocol is applicable out-of-the-box and is much more comprehensive, evaluating
models up to 176 billion parameters and using in-context prompting. Additionally, Zheng et al.
(2021) benchmark synthetic data whereas this work evaluates performance on naturally occurring
implicatures (George and Mamidi, 2020). We believe this to be a better representation of the true
distribution of implicatures in natural dialogue.
The standard set of benchmarks LLMs are evaluated on covers many tasks, but even though implicature is one of the most important aspects of language pragmatics (Levinson, 1983), it is only
evaluated as part of BIG-bench (Srivastava et al., 2022). Unfortunately, the methodology used by the
BIG-bench implicature task contributors has limitations, which call into question the validity of their
claims. Firstly, the task contributors discard a subset of the data that they deem ambiguous.
In our view this defeats the point of the benchmark. Implicatures are a type of non-literal, ambiguous
language the intended meaning of which humans often easily interpret; comparing the way humans
and models do this is precisely what we are interested in. In turn, we expect performance on the
BIG-bench task to overestimate the ability of LLMs to resolve naturally occurring implicatures. We
keep this challenging subset of the data and instead use human evaluation to deal with examples that
are too ambiguous to understand. Secondly, the difference in performance between their average
and best rater is 18%, whereas for our evaluations this difference is 6%. This indicates their human
evaluation is of low quality, but it is impossible to verify because there are no details available on how
the annotation is done. Finally, BIG-bench uses only base LLMs and no SotA fine-tuning methods.
In summary, we use a more challenging dataset and, in turn, run at least six times more evaluations per model; we provide higher-quality human annotations; and we evaluate four different categories of LLMs to investigate which aspects of LLMs contribute to their performance on implicature understanding.
3 Evaluation Protocol
Here we outline the evaluation protocol we use to answer the research question “To what extent can
LLMs resolve conversational implicature?”. We focus on binary implicatures that imply “yes” or “no”
(see Figure 1). We say a model resolves an implicature correctly if it assigns a higher likelihood to a coherent utterance than to a similar but incoherent one, as detailed below.
Zero-shot evaluation. Consider the example from the introduction packed into a single utterance:
Esther asked “Can you come to my party on Friday?” and Juan responded “I have
to work”, which means no.
We can transform this example to be pragmatically incoherent (in the sense that it will become
pragmatically inconsistent with expected use) by replacing the word “no” with “yes”:
Esther asked “Can you come to my party on Friday?” and Juan responded “I have
to work”, which means yes.
To resolve the implicature, the model should assign higher likelihood to the first of the two sentences
above, namely the most coherent one. Importantly, both sentences have exactly the same words
except for the binary implicature “yes” or “no”, making the assigned likelihood scores directly
comparable. Formally, let the coherent prompt be $x$ and the augmented, incoherent prompt be $\hat{x}$. A model outputs a likelihood $p$ parameterized by weights $\theta$. We say a model correctly resolves an example $x$ when it assigns $p_\theta(x) > p_\theta(\hat{x})$. This is equivalent to evaluating whether the model
assigns a higher likelihood to the correct continuation of the two options. Note that this is a more
lenient evaluation protocol than is sometimes used for language models, where models are evaluated on their ability to generate the correct continuation, in this case “no”. The greedy decoding approach (evaluating whether “yes” or “no” is generated) is also captured by our approach, but we additionally label an example correct if “no” does not receive the highest likelihood overall but is still assigned a higher likelihood than “yes”.
We did not opt for greedy decoding because “no” is not the only coherent continuation here, and
marginalising over all possible correct continuations is intractable. The more lenient evaluation does
capture implicature resolution, because the choice of “no” versus “yes” is only determined by the
resolution of the implicature. We guide the models to output “yes” or “no” explicitly in three of
the six prompt templates with instructions, such that we can estimate the effect of this guidance
on performance. For two model classes (i.e. GPT-3.5-turbo and GPT-4) we do not have access to
likelihoods, and for these models we take the greedy decoding approach, guiding the model to output
“yes” or “no” explicitly in all prompts (see Table 6 in Appendix F).
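The snippet below gives a minimal sketch of this likelihood comparison, using GPT-2 from the transformers library purely as a stand-in model; it is not the evaluation code used for our results (that code is in the repository linked in Section 4). Because the two texts differ only in the final “yes” or “no”, comparing summed token log-likelihoods suffices and length normalisation is immaterial.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def log_likelihood(text):
    # Summed log-probability of all tokens after the first, under the model.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL over the predicted tokens
    return -loss.item() * (ids.shape[1] - 1)

coherent = ('Esther asked "Can you come to my party on Friday?" and '
            'Juan responded "I have to work", which means no.')
incoherent = coherent[: -len("no.")] + "yes."  # swap the final "no" for "yes"

# The implicature counts as resolved if p_theta(x) > p_theta(x_hat).
print("resolved:", log_likelihood(coherent) > log_likelihood(incoherent))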
We use a dataset of conversational implicatures curated by George and Mamidi (2020), which is published under a CC BY 4.0 license. It contains
implicatures that, like in Figure 1, are presented in utterance-response-implicature tuples. Of these,
718 are binary implicatures that we can convert into an incoherent sentence. We randomly sample
600 examples for the test set and keep the remaining 118 as a development set to improve implicature
resolution after pre-training through in-context prompting or fine-tuning.
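To illustrate the transformation sketched in Figure 1, the snippet below wraps an utterance-response-implicature tuple in one of the templates and builds the coherent/incoherent pair by swapping “yes” and “no”. The dictionary keys are illustrative and need not match the column names of the released dataset.

TEMPLATE = ('Esther asked "{utterance}" and Juan responded "{response}", '
            'which means {implicature}.')

def make_pair(example):
    # Returns (coherent, incoherent) texts for a binary implicature example.
    label = example["implicature"].strip().lower()  # "yes" or "no"
    flipped = "no" if label == "yes" else "yes"
    fill = lambda ans: TEMPLATE.format(utterance=example["utterance"],
                                       response=example["response"],
                                       implicature=ans)
    return fill(label), fill(flipped)

coherent, incoherent = make_pair({"utterance": "Can you come to my party on Friday?",
                                  "response": "I have to work",
                                  "implicature": "no"})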
Few-shot in-context evaluation. We add $k$ examples of the task to the prompt, e.g. with $k = 2$:
Esther asked “Have you found him yet?” and Juan responded “They’re still
looking”, which means no.
Esther asked “Are you having fun?” and Juan responded “Is the pope Catholic?”,
which means yes.
Finish the following sentence:
Esther asked “Can you come to my party on Friday?” and Juan responded “I have
to work”, which means no.
We evaluate the models’ $k$-shot capabilities for $k \in \{1, 5, 10, 15, 30\}$ by randomly sampling $k$ examples from the development set for each test example. We opt for a random sampling approach to
control for two sources of randomness. Firstly, examples have different levels of informativeness.
Secondly, recent work found that the order in which examples are presented matters (Lu et al., 2022).
Ideally, to marginalise over these random factors, we would evaluate each test example with all permutations of $k$ examples from the development set. This requires $\frac{118!}{(118-k)!}$ evaluations for each test example, which is intractable. Instead, we estimate performance per test example by randomly
sampling from the development set. In this way we control for some of the variance in performance,
but avoid extra evaluations. We ensure each model sees the same few-shot examples per test example.
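The snippet below sketches this sampling scheme. Seeding the sampler with the test example’s index is one simple way to fix the few-shot examples across models and is used here purely for illustration; it is not necessarily the exact mechanism in our code. As a sense of scale, exhaustive enumeration for $k = 5$ would require $118!/113! \approx 2.1 \times 10^{10}$ evaluations per test example.

import random

def build_kshot_prompt(test_text, dev_texts, k, test_idx):
    # Seeding with the test example's index keeps the k-shot context identical
    # across all models evaluated on that example.
    rng = random.Random(test_idx)
    shots = rng.sample(dev_texts, k)  # k distinct development examples, in random order
    return "\n\n".join(shots + ["Finish the following sentence:", test_text])

dev_texts = [
    'Esther asked "Have you found him yet?" and Juan responded '
    '"They\'re still looking", which means no.',
    'Esther asked "Are you having fun?" and Juan responded '
    '"Is the pope Catholic?", which means yes.',
]
# The candidate ending ("no." or "yes.") is appended to this prompt before
# scoring, exactly as in the zero-shot protocol.
prompt = build_kshot_prompt(
    'Esther asked "Can you come to my party on Friday?" and Juan responded '
    '"I have to work", which means',
    dev_texts, k=2, test_idx=0)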
Controlling for prompt sensitivity. It has been shown that language models are sensitive to prompt
wording (Efrat and Levy, 2020; Tan et al., 2021; Reynolds and McDonell, 2021a; Webson and Pavlick,
2021). To control for this factor of randomness we manually curate six different template prompts and
measure performance across these. One of the templates has been presented above, namely “Esther
asked <utterance> and Juan responded <response>, which means <implicature>”. Another template
is: “Question: <utterance>, response: <response>, meaning: <implicature>”. The former we call
natural prompts and the latter structured prompts. Each group has three templates that only differ
slightly in wording. This grouping allows us to look at the variance due to slight changes in wording
as well as performance difference due to a completely different way of presenting the example. The
full list of prompts can be found in Appendix F.
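For concreteness, one representative of each group is written out below as a Python format string; the within-group variant is illustrative rather than a verbatim copy of Appendix F.

# One representative template per group; the full set of six is in Appendix F.
NATURAL = 'Esther asked "{utterance}" and Juan responded "{response}", which means {implicature}.'
# A within-group variant differs only slightly in wording (illustrative, not verbatim).
NATURAL_VARIANT = 'Esther asked "{utterance}" and Juan replied "{response}", which means {implicature}.'
# Structured templates strip the narrative frame down to labelled fields.
STRUCTURED = 'Question: {utterance}, response: {response}, meaning: {implicature}'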
4 Experiments
The set of large language model classes we evaluate can be grouped into four distinct categories:
1. Base models: large-scale pre-trained models; RoBERTa (Liu et al., 2019), BERT (Devlin et al., 2018), GPT-2 (Radford et al., 2019), EleutherAI (Wang and Komatsuzaki, 2021; Black et al., 2022), BLOOM (BigScience, 2022), OPT (Zhang et al., 2022), Cohere’s base models, and GPT-3 (Brown et al., 2020).
2. Dialogue FT: LLMs fine-tuned on dialogue; BlenderBot (Ng et al., 2019).
3. Benchmark IT: LLMs fine-tuned on tasks with natural instructions for each benchmark, or “benchmark-level instruction-tuned models”; T0 (Sanh et al., 2022) and Flan-T5 (Chung et al., 2022).
4. Example IT: LLMs fine-tuned on tasks with natural instructions for each example, or “example-level instruction-tuned models”; a subset of OpenAI’s API models and Cohere’s API models.
For Benchmark IT models, annotators write a single instruction for an entire dataset. The models
are then fine-tuned on each example from the dataset with the same instruction. We distinguish this
from example-level IT; for that type of fine-tuning each example in a dataset gets a new instruction,
resulting in a more diverse dataset. Each group contains model classes for which we evaluate a range
of sizes. A detailed categorization of the models and their attributes can be found in Appendix G.⁶ We make use of the OpenAI and Cohere APIs as well as the pretrained models in the transformers library (Wolf et al., 2020) and EleutherAI’s framework to evaluate them (Gao et al., 2021). All code used for this paper can be found on GitHub⁷ and the dataset is made publicly available on HuggingFace⁸.
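For reference, the dataset can be loaded directly from the HuggingFace Hub with the datasets library; the split and column names are best checked by inspecting the loaded object rather than assumed.

from datasets import load_dataset

ds = load_dataset("UCL-DARK/ludwig")  # the dataset released with this paper
print(ds)  # inspect the available splits and column names before building prompts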
Below, we present zero-shot and few-shot results, discussing patterns of performance of the models
in the four different groups. We further look at the results for different model sizes of each model
class and the variance over the prompt templates. We contrast the models’ performance with human
performance. To this end, each test example gets annotated by five humans. We split the test set into four parts and assign each annotator a subset, leaving us with twenty annotators in total. The average
human performance is 86.2%, and the best performance is 92%. Some of the errors humans make
uncover examples that have multiple interpretations, and others uncover annotation errors. The nature
of the task of implicature resolution means we do not expect models to perform better than human
best performance. Details on the human experiment can be found in Appendix H (also containing
an analysis of human errors), and detailed results per model and prompt template in Appendix K.10.
We also test for spurious correlations present in the benchmark (like lexical cues the model can rely
on), and find no indication of them (Appendix K.8).
Insight 1: Models instruction-tuned at the example level outperform all others. Table 1 shows the
best 0-, 1-, and 5-shot accuracy each model class achieved on the implicature task. The best overall
⁶ Note that there are several important aspects unknown for models behind APIs, like OpenAI’s model sizes.
⁷ https://github.com/LauraRuis/do-pigs-fly
⁸ https://huggingface.co/datasets/UCL-DARK/ludwig