of global and local schemas. The global schema was built using filtered key phrases extracted from contexts belonging to the same class of data, e.g. a class of similar products in the Amazon data (McAuley and Yang, 2016) or similar dialogues in the Ubuntu dataset (Lowe et al., 2015), whereas the local schema was built from a single given context; they defined the missing information as the difference between the global and the local schema. Extracting comparable schemas across different contexts was possible due to the repetitive nature of the datasets considered: descriptions of products of the same type, such as laptops, allow the prediction of potentially missing properties that need clarification, in contrast to fact-checking claims, which are less repetitive.
The standard sequence-to-sequence architecture (Sutskever et al., 2014) is typically used in question generation approaches (Du et al., 2017; Zhou et al., 2017). Although answer-aware approaches allow generating multiple questions conditioned on the same passage (Sun et al., 2018), providing the answer at inference time is not possible in fact-checking, since one would typically ask questions about what is missing from the claim. Other work includes question generation for question answering (Duan et al., 2017), question generation for educational purposes (Heilman and Smith, 2010), and poll question generation from social media posts (Lu et al., 2021). Furthermore, Hosking and Riedel (2019) evaluated rewards in question generation, showed that they did not correlate with human judgments, and explained why rewards did not help when training with reinforcement learning.
Commonly used evaluation metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) fall short of correlating with human judgments when evaluating the quality of automatically generated questions (Liu et al., 2016; Sultan et al., 2020; Nema and Khapra, 2018). Majumder et al. (2021) carried out a human evaluation based on fluency, relevance, whether the question dealt with missing information, and usefulness. In addition, Cheng et al. (2021) proposed assessing the quality of automatically generated questions based on whether they were well-formed, concise, answerable, and answer-matching. Similarly, we conduct a human evaluation of the generated questions adapted to fact-checking.
3 Varifocal Question Generation
In this section, we describe Varifocal, an approach
that generates multiple questions per claim based
on its different aspects, which correspond to textual
spans that we call focal points.
Varifocal consists of three components: (1) a focal point extractor, (2) a question generator that generates a question for each focal point, and (3) a re-ranker that ranks the generated questions, removing duplicates and promoting questions that are more likely to match the gold-standard ones.
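To make the flow between these components concrete, the following is a minimal structural sketch of the pipeline; the function names and the trivial placeholder bodies are illustrative assumptions rather than the authors' implementation, and the subsections below describe the actual components.

```python
from typing import Dict, List

def extract_focal_points(claim: str, metadata: Dict[str, str]) -> List[str]:
    # Placeholder: Varifocal uses subtrees of the claim's parse tree plus metadata (Section 3.1).
    return claim.split() + list(metadata.values())

def generate_question(claim: str, focal_point: str) -> str:
    # Placeholder: Varifocal conditions a seq2seq model on the claim and the focal point (Section 3.2).
    return f"What is known about {focal_point}?"

def rerank(questions: List[str]) -> List[str]:
    # Placeholder: the re-ranker removes duplicates and promotes likely questions.
    return list(dict.fromkeys(questions))

def varifocal(claim: str, metadata: Dict[str, str]) -> List[str]:
    focal_points = extract_focal_points(claim, metadata)
    questions = [generate_question(claim, fp) for fp in focal_points]
    return rerank(questions)
```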
3.1 Focal Point Extraction
We consider two types of focal points: contiguous spans from the claim and metadata elements. For the former, we consider all the subtrees of the claim's syntactic parse tree, thus obtaining more coherent phrases than randomly selected n-grams. In addition, the metadata, which includes (1) the source of the claim or the name of the speaker, and (2) the date when the claim was made, can be useful in question generation for fact-checking. As shown in Figure 1, having access to the date of the claim helped the model generate a precise question, i.e. Where was Miss Universe Guyana arrested in 2017? As the metadata is not part of the claim, we incorporate it using a template. For instance, we combined the claim and metadata of the example shown in Figure 1 as follows: state-news.com reported on 11/15/17 that Miss Universe Guyana 2017 was arrested at London Heathrow airport with 2 kilograms of cocaine.
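As an illustration, the contiguous-span focal points can be approximated with the subtrees produced by an off-the-shelf dependency parser; the sketch below uses spaCy, and both the choice of parser and the exact template wording are assumptions rather than the authors' exact setup.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed parser; any syntactic parser could be substituted

def extract_focal_points(claim: str, source: str, date: str) -> list:
    """Candidate focal points: subtree spans of the claim's parse plus metadata elements."""
    doc = nlp(claim)
    spans = set()
    for token in doc:
        # The contiguous span covered by this token's syntactic subtree.
        subtree = doc[token.left_edge.i : token.right_edge.i + 1]
        spans.add(subtree.text)
    spans.update({source, date})  # metadata elements are also treated as focal points
    return sorted(spans)

def claim_with_metadata(claim: str, source: str, date: str) -> str:
    """Combine the claim and its metadata with a simple template (wording assumed)."""
    return f"{source} reported on {date} that {claim}"
```

For the example above, claim_with_metadata("Miss Universe Guyana 2017 was arrested at London Heathrow airport with 2 kilograms of cocaine", "state-news.com", "11/15/17") reproduces the templated input shown in Figure 1.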
3.2 Question Generation
This component takes a claim and its focal points
as input and generates a set of questions. Given a claim $c$, the set of all focal points is denoted as $F$, where each focal point $f_i \in F$ is a span in the claim $c$ and its metadata, such that $f_i = [w_s, \ldots, w_e]$, where $s$ and $e$ mark the start and the end of the span, respectively. Then, for each focal point $f_i$, the model autoregressively generates a question $\hat{q}_i$ of $n$ words, as follows:

$$p(\hat{q}_i \mid c, f_i) = \prod_{k=1}^{n} p\big(\hat{q}_i[k] \mid \hat{q}_i[0:k-1], [\tilde{c}; \tilde{f}_i]\big) \quad (1)$$

where $[\tilde{c}; \tilde{f}_i]$ is the transformer-based encoding of $c$ concatenated with $f_i$. The question generation component in Varifocal is similar to the answer-aware sequence-to-sequence model (Sun et al., 2018).
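For illustration, the conditioning in Equation 1 can be realised with a pretrained encoder-decoder from the transformers library; the choice of BART, the separator convention, and the decoding settings below are assumptions made for this sketch, and the model would still need fine-tuning on claim-question pairs before producing useful questions.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def generate_question(claim_with_metadata: str, focal_point: str) -> str:
    """Generate one question conditioned on the claim (with metadata) and a single focal point."""
    # Encoder input: the claim concatenated with the focal point,
    # i.e. the jointly encoded pair that Eq. (1) conditions on.
    inputs = tokenizer(claim_with_metadata + " </s> " + focal_point,
                       return_tensors="pt", truncation=True)
    # Beam-search decoding realises the autoregressive factorisation of Eq. (1).
    output_ids = model.generate(**inputs, max_length=64, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```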