Help me write a poem: Instruction Tuning as a Vehicle for Collaborative Poetry Writing
Tuhin Chakrabarty¹, Vishakh Padmakumar², He He²,³
¹ Department of Computer Science, Columbia University
² Center for Data Science, New York University
³ Department of Computer Science, New York University
tuhin.chakr@cs.columbia.edu, vishakh@nyu.edu, hhe@nyu.edu
arXiv:2210.13669v1 [cs.CL] 25 Oct 2022
Abstract

Recent work in training large language models (LLMs) to follow natural language instructions has opened up exciting opportunities for natural language interface design. Building on the prior success of LLMs in the realm of computer-assisted creativity, we aim to study if LLMs can improve the quality of user-generated content through collaboration. We present CoPoet, a collaborative poetry writing system. In contrast to auto-completing a user's text, CoPoet is controlled by user instructions that specify the attributes of the desired text, such as Write a sentence about 'love' or Write a sentence ending in 'fly'. The core component of our system is a language model fine-tuned on a diverse collection of instructions for poetry writing. Our model is not only competitive with publicly available LLMs trained on instructions (InstructGPT), but is also capable of satisfying unseen compositional instructions. A study with 15 qualified crowdworkers shows that users successfully write poems with CoPoet on diverse topics ranging from Monarchy to Climate change. Further, the collaboratively written poems are preferred by third-party evaluators over those written without the system.¹
1 Introduction

Advances in large language models (LLMs) have enabled remarkable progress towards generating coherent text in a wide variety of domains. This has spurred increasing interest in computer-assisted creativity (See et al., 2019; Elkins and Chun, 2020; Ramesh et al., 2022; Branwen, 2020), such as building co-creative assistants for writing stories, poems, and argumentative essays (Lee et al., 2022; Swanson et al., 2021; Uthus et al., 2019; Donahue et al., 2020; Padmakumar and He, 2022; Du et al., 2022).
Both authors contributed equally.
¹ Our code, preprocessed data, models, and the interaction logs from our user study are available at https://github.com/vishakhpk/creative-instructions
[Figure 1 depicts an interaction trace: the user issues instructions such as "Write a simile about 'cake'", "Suggest a verse to follow the sentence 'The cake is like a cloud of joy'", and "Write a poetic sentence that contains the word 'chocolate' and ends in 'volcano'"; CoPoet returns multiple candidate verses (e.g., "A flowing chocolate volcano."), which the user weaves into the evolving poem draft.]
Figure 1: A collaborative poem entitled 'Decadence', written with CoPoet assistance. Green text was written directly by the human, who interacts with CoPoet using instructions. CoPoet offers multiple suggestions, which the user can accept or reject. The user wrote a four-line poem before indicating completion of the task.
The adoption of these technologies hinges on their ability to provide appropriate suggestions while being easy to interact with. However, there has been limited research on the effectiveness of such collaboration, e.g., whether the assistant understands user intents and whether collaboration improves the final outcome.
In this paper, we aim to understand the collaboration capabilities of LLMs through a case study of collaborative poetry writing. Writing a poem is often a challenging task because it is both open-ended and highly constrained. Unlike stories or argumentative essays, writing a poem requires creative content that satisfies various long- and short-range form constraints such as rhyme, meter, and sound, which poses a significant challenge for end-to-end poem generation systems (Ghazvininejad et al., 2016; Tian and Peng, 2022; Van de Cruys, 2020; Ormazabal et al., 2022). While LLMs sometimes struggle with long-range coherence, they are good at providing variations of text that satisfy local constraints. This makes them great partners to humans in poem writing, where humans focus on the long-range writing plan and the machine implements the ideas locally.
Effective collaboration in co-creative writing is challenging as it requires the model to understand user intention. For example, as shown in Figure 1, a user may have a rough plan around two related concepts such as chocolate and volcano, and want the model to suggest a verse that contains chocolate and ends with volcano; or they may be looking for a verse that rhymes with a specific word (rainbow) to satisfy the constraints. An auto-completion interface is not able to anticipate such user needs and provide targeted suggestions. To enable richer interaction, we rely on instructional prompts (Wang et al., 2022; Sanh et al., 2021; Mishra et al., 2022; Mishra and Nouri, 2022) that act as a natural language interface between the user and the assistant.
Specifically, we present CoPoet, a collaborative poem writing system with a natural language interface. During a writing session, the user can iteratively request suggestions through natural language instructions such as Write a simile about 'cake', and edit their draft based on the suggestions (Figure 1). To build CoPoet, we finetune a pretrained sequence-to-sequence model on a parallel corpus of instruction-output pairs. We obtain the outputs from publicly available datasets of creative text and synthesize the corresponding instructions by rules, covering both lexical and rhyming constraints as well as requests for rhetorical devices.
To understand how well the model follows instructions, we test it on instructions with varying levels of difficulty, from those seen during training to unseen compositional instructions that contain multiple constraints. Both automatic and human evaluation show that our finetuned model satisfies the constraints 86% of the time, 10% better than a much larger 175B version of InstructGPT (Brown et al., 2020). On unseen compositional instructions, our best model satisfies them 77.6% of the time, outperforming InstructGPT by a margin of 28%.
To understand its collaboration capabilities, we
run a user study on Amazon Mechanical Turk
Subject: Write a poetic sentence about 'sun' → O crimson sun, your warming draft's pulsation.
End: Write a poetic sentence ending in 'glory' → Am I exalted here unto that glory.
Rhyme: Write a poetic sentence that ends in a word which rhymes with 'replace' → Diminishing for me, with delicate grace.
Next Sentence: Write a next sentence in a poem given the previous sentence 'The only thing I know for sure' → Is that love is hard and can be obscure.
Metaphor: Write a metaphor about 'brain' → My brain is a tangled mess of circuits.
Simile: Write a simile about 'making someone feel desired' → I want to make you feel like a flower near a hummingbird.
Onomatopoeia: Write a poetic sentence about 'bottles' showcasing onomatopoeia → The stamping of feet and the ring of bottles.
Subject + End: Write a poetic sentence about 'tears' and ending in 'wives' → Awash in the tears of soldier's wives.
Table 1: Natural language instructions for poem writing paired with example outputs. Each instruction consists of a template and an argument.
(AMT) where CoPoet assists expert crowd workers (recruited through a qualification test) in writing poems (Section 4). We observe that the recruited users are able to write coherent and creative poems on diverse topics ranging from Glass Ceiling to Climate Change. About 70% of model-suggested text is retained in the final poem, and users give CoPoet a rating of 4.3 out of 5 on both suggestion quality and overall helpfulness. Further, a separate group of annotators on AMT prefers the collaboratively written poems more often than those written without CoPoet assistance. In particular, we find model assistance improves the rhyming and vocabulary diversity of the poems.
2 Data

To train a model to follow instructions, we need <instruction, poem_line> pairs where the text satisfies the instruction. The key challenge to building such a model is the lack of parallel data, so we collect our own dataset of creative writing instructions from publicly available poem corpora and relevant subreddits from Reddit (Table 7).

Based on initial feedback from professional poets, we decided to include 3 major types of instructions: 1) Continuation-based instructions that suggest content when writers are blocked on how to proceed; 2) Instructions on lexical constraints to enable greater control of poetic form such as rhyme, sound, and meter. These instructions force language models to obey specific choices, such as generating a line that contains a specific topic, start word, or end word, or a sentence with a particular rhyme; 3) Instructions on rhetorical devices, which are mostly used for introducing embellishments and imagery in a poem, such as metaphors, similes, and onomatopoeia.
Table 1 shows the primary instructions used to train our models. These instructions are crafted by the authors of the paper, who convert every poem line to an <instruction, poem_line> pair using rules.

Each instruction consists of a template (unique to the instruction type) and one or more arguments, as can be seen in Table 1. Given a poem line in the corpus, we reverse-engineer the instruction by picking a template and extracting the arguments from the poem line. For continuation instructions, we use the previous context as the argument. For instructions on lexical constraints, we extract noun phrases and start/end words as arguments, using NLTK for tokenization. To construct instructions on rhymes, we use the CMU dictionary to find rhyming words.² We describe more details in Appendix A on how we create instructions for each particular type.
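The reverse-engineering step above can be sketched as follows. This is a deliberately simplified stand-in: the paper's pipeline uses NLTK noun phrases and the CMU dictionary, whereas here the subject is just a randomly chosen content word, and `make_instruction_pairs` is a hypothetical helper name.

```python
import random
import re


def make_instruction_pairs(line, seed=0):
    """Reverse-engineer <instruction, poem_line> pairs from one poem line.

    Simplified sketch: the subject is approximated by a random content
    word rather than an NLTK-extracted noun phrase.
    """
    rng = random.Random(seed)
    words = re.findall(r"[A-Za-z']+", line)
    if not words:
        return []
    # Prefer longer words as a crude proxy for content words.
    subject = rng.choice([w for w in words if len(w) > 3] or words)
    return [
        (f"Write a poetic sentence about '{subject.lower()}'", line),
        (f"Write a poetic sentence ending in '{words[-1].lower()}'", line),
        (f"Write a poetic sentence that starts with the word '{words[0]}'", line),
    ]
```

Applied to the Table 1 line "Awash in the tears of soldier's wives.", this yields an End instruction with argument 'wives' and a Start instruction with argument 'Awash', mirroring how each line can seed several instruction types.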
To allow models to adapt to linguistic variations of the instruction templates, we also include paraphrases of the instruction templates: e.g., instead of "Write" we also use "Generate", or instead of "Write a sentence about" we use "Write a sentence that contains the word" or "Write a sentence that includes the word". In total, our dataset consists of 873,574 <instruction, poem_line> pairs, which we randomly split into 808,180 train and 65,394 held-out validation examples.³ We evaluate performance on three test sets of hand-crafted instructions of varying difficulty (Section 3.2).
² https://pypi.org/project/pronouncing/
³ Our dataset is publicly available at https://github.com/vishakhpk/creative-instructions.
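The template-paraphrasing step can be illustrated by enumerating the variants of the Subject template. The verb and phrase lists below contain only the examples named in the text; the paper's full paraphrase inventory is larger, and the function name is hypothetical.

```python
import itertools

# Paraphrase options named in the text (a subset of the full inventory).
VERBS = ["Write", "Generate"]
SUBJECT_PHRASES = [
    "a sentence about",
    "a sentence that contains the word",
    "a sentence that includes the word",
]


def paraphrase_subject_instructions(argument):
    """Enumerate paraphrased Subject-template instructions for one argument."""
    return [
        f"{verb} {phrase} '{argument}'"
        for verb, phrase in itertools.product(VERBS, SUBJECT_PHRASES)
    ]
```

For the argument 'love', this produces six surface forms of the same underlying instruction, so the model sees each constraint phrased several ways during training.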
3 How Well Do LLMs Follow Instructions?

In this section, we first describe our models and baselines, followed by the evaluation results using both automatic metrics (Section 3.3) and human evaluation (Section 3.4).
3.1 Experiment Setup

Model Details
We finetune the pretrained T5 (Raffel et al., 2020) and T0 (Sanh et al., 2021) models from HuggingFace (Wolf et al., 2019) on the collected data (Section 2) to produce the output given the instruction, using cross-entropy loss. We report results on finetuned T5-3B, T5-11B and T0-3B models, henceforth referred to as T5-3B-poem, T5-11B-poem, and T0-3B-poem. We select the hyperparameters by the validation loss: for T5-11B-poem, we use the Adam optimizer with a learning rate of 1e-4; for T5-3B-poem and T0-3B-poem, we use the Adafactor optimizer with a learning rate of 1e-3. Each model is trained for 3 epochs with early stopping based on validation loss. We finetune all models on an A100 GPU and use DeepSpeed (Rasley et al., 2020) integration for the 11B model. During finetuning, we restrict the maximum sequence length of both the source and the target to 64 tokens (via truncation).⁴ At inference time, we generate output sequences using top-k sampling with k = 5 and a temperature of 0.7, per recommendations from earlier work in open-ended creative text generation (Fan et al., 2018; Holtzman et al., 2020; Padmakumar and He, 2022).
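The inference setup described above could be wired up with the HuggingFace `transformers` API roughly as follows. This is a sketch, not the authors' code: `model_name` is a placeholder (no public checkpoint identifier is given here), and the `transformers` import is deferred so the decoding-parameter helper stays dependency-free.

```python
def decoding_config():
    # Decoding settings from the text: top-k sampling with k = 5,
    # temperature 0.7, and a 64-token length cap.
    return dict(do_sample=True, top_k=5, temperature=0.7, max_new_tokens=64)


def suggest(model_name, instruction, num_suggestions=2):
    """Generate candidate verses for an instruction from a finetuned
    seq2seq checkpoint (requires the `transformers` package)."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer  # deferred

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    # Source side is truncated to 64 tokens, matching the training setup.
    inputs = tokenizer(instruction, truncation=True, max_length=64, return_tensors="pt")
    outputs = model.generate(
        **inputs, num_return_sequences=num_suggestions, **decoding_config()
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```

Sampling several return sequences per call matches the interface in Figure 1, where the user is shown multiple suggestions to accept or reject.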
Baselines
We compare our finetuned models with two other models: (i) the T0pp model (Sanh et al., 2021), trained on instruction-based prompts from 49 datasets;⁵ and (ii) the 175B davinci variant of InstructGPT (Ouyang et al., 2022), which is trained on human-written instructions on diverse tasks in a human-in-the-loop fashion. Given an instruction, we generate text directly (i.e., zero-shot) from T0pp using top-k sampling (Fan et al., 2018).

For InstructGPT, we evaluate both zero-shot and few-shot settings. For zero-shot, the prompt consists of only the instruction. For few-shot, the prompt consists of 26 <instruction, poem_line> pairs from our training data (selected to cover all the instruction templates), followed by the test instruction.⁶ We use the OpenAI API with a temperature of 0.7, no frequency penalty, and a maximum sequence length of 64 to match our setting.

⁴ The length limit is chosen to avoid memory explosion. It has minimal impact on model performance since most verses are shorter.
⁵ These include question-answering, summarization, structure-to-text generation, and sentiment and topic classification tasks, but no explicit creative writing tasks.
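The few-shot prompt construction could look like the sketch below. The separator format is an assumption (the exact prompt lives in the authors' code repository), and `build_few_shot_prompt` is a hypothetical helper name.

```python
def build_few_shot_prompt(demonstrations, test_instruction):
    """Assemble a few-shot prompt: demonstration <instruction, poem_line>
    pairs followed by the test instruction. The paper uses 26
    demonstrations covering all instruction templates; the blank-line
    separator here is an assumption."""
    blocks = [f"{instruction}\n{poem_line}" for instruction, poem_line in demonstrations]
    blocks.append(test_instruction)
    return "\n\n".join(blocks)
```

The resulting string ends with the bare test instruction, so the model's continuation is read off as its answer; the zero-shot setting is the same call with an empty demonstration list.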
3.2 Test Sets

While our training instructions cover many templates and topics, user instructions may deviate from the training distribution during interaction. To evaluate the generalization capabilities of the models, we identify three settings with increasing difficulty, based on whether the instruction templates or arguments are seen during training.
Known Instruction Templates with Known Arguments (KIKA)
The simplest setting requires the model to generalize to novel combinations of the templates and arguments. Specifically, we create instructions where both the templates and the arguments are seen in the training set, although each specific combination is unseen (i.e., the training and test sets have no overlapping instructions).
Known Instruction Templates with Unknown Arguments (KIUA)
To handle novel concepts from users, the model must generalize to unseen arguments, which may include new entities or phrases. For example, it might be easy for a model to write a poetic sentence about a known argument such as beauty, but difficult to write about an unknown argument like beauty without virtue. For this set, we include instructions where the instruction templates are seen during training but the corresponding arguments are unseen.
Unknown Compositional Instruction Templates
One of the main benefits of natural language instructions is that they can be easily composed in new ways to cover various user intentions. This is particularly useful in creative writing because it enables users to request text from the model with multiple constraints. Therefore, we also test whether the model understands compositional instructions using two templates, as seen in Table 2. Our model is exposed to a single compositional template during training: Subject + End. For this test set, we create a variety of unseen compositions.
In total, we create 242 test examples (82 KIKA, 82 KIUA, 78 compositional) by selecting instructions according to the above criteria, followed by manual verification.

Start + End: Write a poetic sentence that starts with the word 'Maybe' and ending in 'void' → Maybe one day, you will find me in the void.
Subject + Rhyme: Write a poetic sentence that contains the word 'breaks' and ending in a word which rhymes with 'bound' → She cracks and breaks and hits the ground.
Next Sentence + End: Write a next sentence in a poetry given the previous sentence 'Every once a while I lower the blinds' and ending in 'play' → Waiting for someone to call me out to play.
Metaphor + End: Write a metaphor that includes the word 'film' and ending in 'thought' → A film is a petrified fountain of thought.
Table 2: Examples of compositional natural language instructions for creative tasks paired with their respective outputs from our test sets.

⁶ The exact prompt can be found in our code repository.
3.3 Automatic Evaluation

We evaluate how well the models satisfy the constraints specified in the instructions on each of the test sets (Section 3.2). We report the success rate of satisfying the instructions, where the success condition for each instruction type is listed in Table 3.⁷
Rhyme: the last word of the model generation rhymes with the desired subject according to the CMU Pronouncing Dictionary.
Haiku: the model generation contains 15–19 syllables and contains the desired subject.
Simile / Metaphor: the model generation contains the desired subject as well as a comparator.
Start / End: the first/last word of the model generation matches the desired subject.
Subject: the model generation contains the desired subject in the instruction.
Table 3: Success conditions for different instruction templates.
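The success conditions in Table 3 are mechanical string checks, and several can be sketched directly. In this sketch the rhyme lookup is injected as a function argument, e.g. `pronouncing.rhymes` from the `pronouncing` package (footnote 2), so the checkers themselves stay dependency-free; the helper names are hypothetical.

```python
import re


def _words(text):
    """Lowercased word tokens, ignoring punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def check_subject(generation, subject):
    # Subject: generation contains the desired subject.
    return subject.lower() in generation.lower()


def check_start(generation, word):
    # Start: first word of the generation matches the desired subject.
    ws = _words(generation)
    return bool(ws) and ws[0] == word.lower()


def check_end(generation, word):
    # End: last word of the generation matches the desired subject.
    ws = _words(generation)
    return bool(ws) and ws[-1] == word.lower()


def check_rhyme(generation, word, rhymes):
    # Rhyme: last word appears among the rhymes of `word`; `rhymes` is a
    # lookup such as pronouncing.rhymes (backed by the CMU dictionary).
    ws = _words(generation)
    return bool(ws) and ws[-1] in set(rhymes(word))
```

Compositional instructions are scored by conjoining checks, e.g. Subject + End passes only when `check_subject` and `check_end` both hold.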
Finetuned Models Have Strong In-Domain Performance but Drop on Out-of-Domain Data
Figure 2 shows the average success rate and standard deviations of each model on the three test

⁷ Prior work on instruction tuning reports metrics such as BLEU score for generation tasks (Sanh et al., 2021; Wei et al., 2021); these are unsuitable for our poetry writing instructions, so we define custom success conditions.