
She [Venus] is often described as looking at herself on the mirror, although this is physically impossible since viewers can see her [Venus] face reflected in their direction. This phenomenon [Venus gazing at herself on the mirror] is known as the Venus effect.
...
Nudes were extremely rare in seventeenth-century Spanish art, which was policed actively by members of the Spanish Inquisition. Despite this [the fact that nudes were extremely rare in seventeenth-century Spanish art, which was policed actively by members of the Spanish Inquisition], nudes by foreign artists were keenly collected by the court circle, and this painting [The Rokeby Venus] was hung in the houses of Spanish courtiers until 1813, when it was brought to England to hang in Rokeby Park, Yorkshire.
...
The painting [The Rokeby Venus] is believed to have been executed during one of Velázquez's [the artist] visits to Rome, and Prater has observed that in Rome the artist [Velázquez] "did indeed lead a life of considerable personal liberty..."

Figure 3: Example of output from the decontextualization prompt, applied to the Wikipedia page https://en.wikipedia.org/wiki/Rokeby_Venus
we did not consider many alternative prompts. Decontextualization was performed autoregressively, rewriting each sentence using the previous k decontextualized sentences as context.
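As a rough illustration, the following sketch implements this autoregressive loop; the prompt wording, the default window size k = 5, and the generate callable (standing in for a call to the prompted teacher model) are placeholders rather than the exact prompt described above.

    from typing import Callable, List

    def decontextualize(sentences: List[str],
                        generate: Callable[[str], str],
                        k: int = 5) -> List[str]:
        """Rewrite each sentence so it stands alone, conditioning on the
        previous k already-decontextualized sentences."""
        rewritten: List[str] = []
        for sentence in sentences:
            context = " ".join(rewritten[-k:])  # previous k rewritten sentences
            prompt = (
                "Add bracketed markup so the sentence can be understood "
                "without the surrounding document.\n"
                f"Context: {context}\n"
                f"Sentence: {sentence}\n"
                "Decontextualized:"
            )
            rewritten.append(generate(prompt).strip())
        return rewritten

Because each rewrite conditions only on the most recent rewritten sentences, information that has dropped out of the window can no longer be recovered, which is the failure mode discussed next.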
The capabilities and limitations of this approach are highlighted in Figure 3, which shows some typical outputs. The markup resolves the pronominal references she and her and the nominal references this painting and this phenomenon. Perhaps most impressively, the elliptical expression despite this is decontextualized with the markup [the fact that nudes were extremely rare...]. However, by the end of the document, we have lost track of the first name of the artist, so that the artist is decontextualized as only [Velázquez], rather than with the full name. Future work may address this issue by exploring more sophisticated strategies than simple autoregressive decontextualization.
Chain-of-thought question answering. In chain-of-thought prompting, the language model is asked to first generate a rationale before producing an answer (Wei et al., 2022). For open-book question answering, we take the rationale to be a sentence that is extracted from the passage and which contains the answer, as shown in Figure 5. We construct question-specific few-shot prompts by concatenating several exemplars in which a question, passage, rationale, and answer are shown, before providing the question and passage for the instance to be predicted. The exemplars are drawn from the training set, selecting questions with the highest BM25 similarity to the target question (Robertson et al., 2009). Exemplars are added until we reach a limit of 1024 sentencepiece tokens in the prompt (Kudo and Richardson, 2018); for the QuoRef dataset, this amounts to two or three exemplars in most cases.
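The prompt-construction step can be sketched as follows. The exemplar template, the count_tokens callable (standing in for the SentencePiece tokenizer), and the use of the rank_bm25 package for BM25 scoring are illustrative assumptions, not the exact implementation.

    from typing import Callable, Dict, List
    from rank_bm25 import BM25Okapi

    def build_cot_prompt(question: str,
                         passage: str,
                         train_set: List[Dict[str, str]],  # question/passage/rationale/answer
                         count_tokens: Callable[[str], int],
                         budget: int = 1024) -> str:
        # Rank training examples by BM25 similarity of their questions
        # to the target question.
        bm25 = BM25Okapi([ex["question"].lower().split() for ex in train_set])
        scores = bm25.get_scores(question.lower().split())
        order = sorted(range(len(train_set)), key=lambda i: -scores[i])

        target = f"Question: {question}\nPassage: {passage}\nRationale:"
        exemplars = ""
        for i in order:
            ex = train_set[i]
            block = (f"Question: {ex['question']}\nPassage: {ex['passage']}\n"
                     f"Rationale: {ex['rationale']}\nAnswer: {ex['answer']}\n\n")
            # Stop once adding another exemplar would exceed the token budget.
            if count_tokens(exemplars + block + target) > budget:
                break
            exemplars += block
        return exemplars + target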
To generate the rationales in the exemplars, we enumerate all sentences in the passage that contain an exact match to the answer and select the one with the highest BM25 similarity to the exemplar's question. Each sentence is considered in both its original surface form and with decontextualizing markup. If no sentence contains an exact match to the answer, then the question is not included as an exemplar. However, prompts are constructed for all training set examples, even when no rationale can be extracted using this heuristic.
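A minimal sketch of this rationale-selection heuristic is given below, assuming the passage has already been split into candidate sentences (with both the surface and decontextualized forms included in the list); the sentence splitting and whitespace tokenization are simplifications.

    from typing import List, Optional
    from rank_bm25 import BM25Okapi

    def select_rationale(question: str,
                         passage_sentences: List[str],
                         answer: str) -> Optional[str]:
        # Candidate rationales are sentences containing the answer verbatim.
        candidates = [s for s in passage_sentences if answer in s]
        if not candidates:
            return None  # no rationale: the example is not used as an exemplar
        # Pick the candidate most similar to the question under BM25.
        bm25 = BM25Okapi([s.lower().split() for s in candidates])
        scores = bm25.get_scores(question.lower().split())
        best = max(range(len(candidates)), key=lambda i: scores[i])
        return candidates[best]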
Rationale validation. Finally, to validate the ra-
tionales that were generated in the chain-of-thought
stage, we perform a final validation stage in which
the teacher model must answer questions based
only on the generated rationales. As in the previ-
ous stage, we include each training set example and
construct in-prompt exemplars by BM25 similar-
ity to other questions in the training set. Because
this stage does not include full passages, we can
fit many more exemplars while remaining under
the budget of 1024 tokens, on the order of 20 per
prompt. The resulting “faithful answers” are then
used to filter the fine-tuning data that is exposed to
the student model.
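As an illustrative sketch, the filtering step might look like the following; the string-match agreement criterion and the answer_from_rationale callable (the prompted teacher model restricted to the rationale alone) are assumptions made here for concreteness.

    from typing import Callable, Dict, List

    def filter_by_faithful_answer(
            examples: List[Dict[str, str]],  # question, rationale, answer
            answer_from_rationale: Callable[[str, str], str]) -> List[Dict[str, str]]:
        kept = []
        for ex in examples:
            # The teacher sees only the rationale, not the full passage.
            faithful = answer_from_rationale(ex["question"], ex["rationale"])
            # Keep the example only if the rationale alone supports the answer.
            if faithful.strip().lower() == ex["answer"].strip().lower():
                kept.append(ex)
        return kept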
3 Training the Student Model
The prompt chain described in Section 2 produces
markup-and-mask rationales and uses them to an-