
sional poets, we decided to include 3 major types of instructions: 1) Continuation-based instructions that suggest content when writers are blocked or unsure how to proceed; 2) Instructions on lexical constraints that enable greater control of poetic form such as rhyme, sound, and meter. These instructions force language models to obey specific choices, such as generating a line that contains a specific topic, start word, or end word, or a sentence with a particular rhyme; 3) Instructions on rhetorical devices, which are mostly used for introducing embellishments and imagery in a poem, such as metaphors, similes, and onomatopoeia.
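For concreteness, the snippet below gives one paraphrased example of each instruction type; the exact templates used for training are the ones listed in Table 1.

# Illustrative (paraphrased) examples of the three instruction types; the
# exact training templates are those in Table 1 of the paper.
EXAMPLE_INSTRUCTIONS = {
    "continuation": "Write the next line of the poem given the previous lines",
    "lexical constraint": "Write a sentence that ends with the word night",
    "rhetorical device": "Write a metaphorical sentence about love",
}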
Table 1 shows the primary instructions used to train our models. These instructions are crafted by the authors of the paper, who convert every poem line to an <instruction, poem_line> pair using rules.
Each instruction consists of a template (unique
to the instruction type) and one or more arguments,
as can be seen in Table 1. Given a poem line in
the corpus, we reverse-engineer the instruction by
picking a template and extracting the arguments
from the poem line. For continuation instructions,
we use the previous context as the argument. For
instructions on lexical constraints, we extract noun
phrases and start/end words as arguments using
NLTK for tokenization. To construct instructions
on rhymes, we use the CMU dictionary to find rhyming words.² We provide more details on how we create instructions for each particular type in Appendix A.
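As an illustration of this reverse-engineering step, the sketch below builds lexical-constraint and rhyme instructions from a single poem line. It assumes NLTK for tokenization and POS tagging and the pronouncing package (footnote 2) as the interface to the CMU dictionary; the instruction wordings are paraphrases rather than the exact Table 1 templates, and the code is not the authors' released pipeline.

# Illustrative sketch of reverse-engineering an <instruction, poem_line> pair.
# Requires the NLTK 'punkt' tokenizer and POS-tagger data, plus `pronouncing`.
import random
import nltk
import pronouncing

def lexical_instruction(poem_line: str):
    tokens = nltk.word_tokenize(poem_line)
    words = [t for t in tokens if t.isalpha()]
    if not words:
        return None
    # Candidate arguments: start word, end word, and a noun as a simple topic.
    nouns = [w for w, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]
    candidates = [
        f"Write a sentence that starts with the word {words[0]}",
        f"Write a sentence that ends with the word {words[-1]}",
    ]
    if nouns:
        candidates.append(f"Write a sentence about {random.choice(nouns)}")
    return random.choice(candidates), poem_line

def rhyme_instruction(poem_line: str):
    end_word = nltk.word_tokenize(poem_line)[-1].lower()
    rhymes = pronouncing.rhymes(end_word)  # CMU-dictionary rhymes of the end word
    if not rhymes:
        return None
    return (f"Write a sentence that ends in a word which rhymes with "
            f"{random.choice(rhymes)}", poem_line)

print(lexical_instruction("The stars keep vigil through the silent night"))
print(rhyme_instruction("The stars keep vigil through the silent night"))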
To allow models to adapt to linguistic variations of the instruction templates, we also include paraphrases of the instruction templates, e.g., instead of “Write” we also use “Generate”, or instead of “Write a sentence about” we use “Write a sentence that contains the word” or “Write a sentence that includes the word”. In total, our dataset consists of 873,574 <instruction, poem_line> pairs, which we randomly split into 808,180 train and 65,394 held-out validation examples.³ We evaluate performance on three test sets of hand-crafted instructions of varying difficulty (Section 3.2).
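A minimal sketch of these two steps (paraphrasing the templates and splitting the pairs) is given below; the paraphrase inventory shown is only the pairs mentioned above, and the data-loading placeholder is hypothetical.

# Minimal sketch of template paraphrasing and the random train/validation split.
import random

PARAPHRASES = {
    "Write a sentence about": ["Write a sentence that contains the word",
                               "Write a sentence that includes the word"],
    "Write": ["Generate"],
}

def paraphrase(instruction: str) -> str:
    # Check longer template prefixes first so the more specific one wins.
    for prefix in sorted(PARAPHRASES, key=len, reverse=True):
        if instruction.startswith(prefix):
            variant = random.choice([prefix] + PARAPHRASES[prefix])
            return variant + instruction[len(prefix):]
    return instruction

pairs = [...]  # placeholder for the 873,574 <instruction, poem_line> pairs
random.shuffle(pairs)
train, valid = pairs[:808_180], pairs[808_180:]  # 808,180 train / 65,394 validation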
² https://pypi.org/project/pronouncing/
³ Our dataset is publicly available at https://github.com/vishakhpk/creative-instructions.
3 How Well Do LLMs Follow Instructions?
In this section, we first describe our models and
baselines, followed by the evaluation results using
both automatic metrics (Section 3.3) and human
evaluation (Section 3.4).
3.1 Experiment Setup
Model Details
We finetune the pretrained T5 (Raffel et al., 2020) and T0 (Sanh et al., 2021) models from HuggingFace (Wolf et al., 2019) on the collected data (Section 2) to produce the output given the instruction, using cross-entropy loss. We report results on finetuned T5-3B, T5-11B, and T0-3B models, which are henceforth referred to as T5-3B-poem, T5-11B-poem, and T0-3B-poem. We select hyperparameters based on the validation loss: for T5-11B-poem, we use the Adam optimizer with a learning rate of 1e-4; for T5-3B-poem and T0-3B-poem, we use the Adafactor optimizer with a learning rate of 1e-3. Each model is trained for 3 epochs with early stopping based on validation loss. We finetune all models on an A100 GPU and use DeepSpeed (Rasley et al., 2020) integration for the 11B model. During finetuning, we restrict the maximum sequence length of both the source and the target to 64 tokens (via truncation).⁴
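A rough sketch of this finetuning setup using the HuggingFace Trainer API is shown below. The script, toy data, and output path are our own illustrative assumptions, while the hyperparameters (64-token truncation, 3 epochs, early stopping on validation loss, Adafactor with learning rate 1e-3 for the 3B models) follow the description above.

# Hypothetical finetuning sketch, not the authors' released training code.
# The 11B model instead uses Adam with lr 1e-4 plus DeepSpeed integration.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, EarlyStoppingCallback,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

MODEL_NAME = "t5-3b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def preprocess(example):
    # Source = instruction, target = poem line; both truncated to 64 tokens.
    inputs = tokenizer(example["instruction"], max_length=64, truncation=True)
    labels = tokenizer(example["poem_line"], max_length=64, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

# Toy stand-in for the 808,180-train / 65,394-validation splits.
toy = Dataset.from_dict({
    "instruction": ["Write a sentence that ends with the word night"],
    "poem_line": ["The stars keep vigil through the silent night"],
})
train_ds = toy.map(preprocess, remove_columns=toy.column_names)
valid_ds = train_ds

args = Seq2SeqTrainingArguments(
    output_dir="t5-3b-poem",
    num_train_epochs=3,
    learning_rate=1e-3,
    optim="adafactor",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],
)
trainer.train()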
At inference time, we generate output sequences using top-k sampling with k = 5 and a temperature of 0.7, per recommendations from earlier work in open-ended creative text generation (Fan et al., 2018; Holtzman et al., 2020; Padmakumar and He, 2022).
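Concretely, this decoding setup corresponds to a generate call along the following lines; the checkpoint path and the instruction string are illustrative placeholders.

# Decoding sketch: top-k sampling with k=5 and temperature 0.7.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-3b-poem")   # illustrative path
model = AutoModelForSeq2SeqLM.from_pretrained("t5-3b-poem")

instruction = "Write a sentence that rhymes with the word bright"
inputs = tokenizer(instruction, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, do_sample=True, top_k=5,
                             temperature=0.7, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))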
Baselines
We compare our finetuned models with two other models: (i) the T0pp model (Sanh et al., 2021), trained on instruction-based prompts from 49 datasets;⁵ and (ii) the 175B davinci variant of InstructGPT (Ouyang et al., 2022), which is trained on human-written instructions on diverse tasks in a human-in-the-loop fashion. Given an instruction, we generate text directly (i.e., zero-shot) from T0pp using top-k sampling (Fan et al., 2018).
For InstructGPT, we evaluate in both zero-shot and few-shot settings. For zero-shot, the prompt consists of only the instruction. For few-shot, the prompt consists of 26 <instruction,
⁴ The length limit is chosen to avoid memory explosion. It has minimal impact on model performance since most verses are shorter.
⁵ These include question-answering, summarization, structure-to-text generation, and sentiment and topic classification tasks, but no explicit creative writing tasks.