sequence MT model, the approximate best translation under Prism-src can be found by running
beam search with the MT model conditioned on
the source text.
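As an illustration, the following is a minimal sketch of this direct optimization procedure. `PrismModel` and its `beam_search` method are hypothetical stand-ins for the sequence-to-sequence model underlying Prism-src and a standard beam-search decoder over it; they are not part of the released Prism toolkit.

```python
# Minimal sketch of direct optimization for Prism-src. `PrismModel` and
# `beam_search` are hypothetical stand-ins, not the released Prism API.

def approx_best_translation(prism_model, source_text, beam_size=5):
    """Approximately maximize Prism-src by decoding with the metric's own
    MT model, conditioned only on the source text."""
    # Beam search returns hypotheses sorted by log-likelihood under the
    # model; the top hypothesis is the approximate argmax of Prism-src.
    hypotheses = prism_model.beam_search(source_text, beam_size=beam_size)
    return hypotheses[0]
```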
4.2 Greedy Optimization for Extractive
Summarization
Summarization models are generally categorized
as being either extractive or abstractive. Extractive
systems create a summary by selecting k salient document sentences, whereas abstractive systems
typically autoregressively generate a summary with
a sequence-to-sequence model.
The best possible extractive summary according to a reference-free metric can be found by enumerating all possible summaries of k sentences, scoring each with the metric, and selecting the one with the highest score. Since the number of k-sentence summaries can be large, this exhaustive search is computationally expensive. However, an approximate inference procedure can be used instead.
Rather than enumerate all possible extractive summaries, the approximate inference algorithm constructs a summary by greedily selecting, at each step, the sentence that most increases the metric’s score (Lin and Bilmes, 2011). This is repeated until a target summary length of k sentences is reached, resulting in an approximation of the best possible summary under the reference-free metric.
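The sketch below illustrates this greedy procedure; `metric_score` is a hypothetical stand-in for a reference-free metric such as QuestEval applied to a candidate set of summary sentences.

```python
# Minimal sketch of greedy extractive optimization (Lin and Bilmes, 2011).
# `metric_score(document, summary_sentences)` is a hypothetical stand-in
# for a reference-free metric such as QuestEval.

def greedy_extractive_summary(document, document_sentences, metric_score, k):
    """Greedily select k sentences, each time adding the sentence that
    most increases the reference-free metric's score."""
    summary, remaining = [], list(document_sentences)
    for _ in range(min(k, len(remaining))):
        # Score each candidate summary formed by adding one more sentence.
        best = max(remaining, key=lambda s: metric_score(document, summary + [s]))
        summary.append(best)
        remaining.remove(best)
    return summary
```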
A near-identical procedure is commonly used for creating sentence-level labels for training extractive summarization models, except a reference-based evaluation metric, such as ROUGE, is typically used for scoring the sentences instead of a reference-free metric (Nallapati et al., 2017). The key difference is that the output summary from the reference-based procedure is used to train a model which later predicts k salient sentences during inference, whereas the reference-free procedure can be used directly during inference (i.e., without training) to pick the approximately best summary under the reference-free metric.
4.3 Reranking
Exact inference for any reference-free metric can
be performed by enumerating all possible outputs,
calculating the score of each one, and selecting
the output with the highest score. However, this is almost certainly computationally intractable for any practical application of text generation due to the size of the output space.
To that end, we propose to use reranking (Shen et al., 2004; Och et al., 2004) as an approximate inference procedure in which a pre-trained model for the task at hand is used to restrict the search space to a small set of high-quality candidate outputs. These outputs are then scored and reranked using the reference-free metric to identify an approximately best output under the metric.
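A minimal sketch of this reranking procedure is shown below; both `generate_candidates` (e.g., beam search with a pretrained model) and `metric_score` (e.g., COMET-QE) are hypothetical stand-ins.

```python
# Minimal sketch of reranking. `generate_candidates` and `metric_score`
# are hypothetical stand-ins for a pretrained model's beam search and a
# reference-free metric (e.g., COMET-QE), respectively.

def rerank(source, generate_candidates, metric_score, k=10):
    """Restrict the search space to k candidates from a pretrained model,
    then return the candidate scored highest by the metric."""
    candidates = generate_candidates(source, k)  # k high-quality outputs
    return max(candidates, key=lambda c: metric_score(source, c))
```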
In practice, we identify a set of k high-quality outputs using standard beam search with pre-trained sequence-to-sequence summarization and MT models and a beam size of k. The top-k partial outputs, sorted by their log-likelihood under the pre-trained models, are kept at each step of beam search.
The final outputs are then reranked by a reference-
free metric. For summarization, we use BART
(Lewis et al., 2020) trained on the CNN/DailyMail
dataset. For MT, we use Facebook’s submission
to the WMT’19 translation shared task (Ng et al.,
2019). The model is available for en→de, de→en, en→ru, and ru→en.
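As a concrete illustration of the summarization setup, the sketch below generates k candidates with the Hugging Face implementation of this BART model and reranks them; `metric_score` is a hypothetical stand-in for the reference-free metric.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

def metric_score(source, candidate):
    """Hypothetical stand-in for a reference-free metric such as QuestEval;
    replace with the actual metric. Returns a placeholder score here."""
    return 0.0

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

document = "Text of the article to summarize."
k = 10  # beam size and number of candidates to rerank

inputs = tokenizer(document, return_tensors="pt", truncation=True)
outputs = model.generate(
    **inputs,
    num_beams=k,
    num_return_sequences=k,  # keep all k final beam hypotheses
    early_stopping=True,
)
candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Rerank the k candidates by the reference-free metric.
best_summary = max(candidates, key=lambda c: metric_score(document, c))
```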
5 Analysis
5.1 Approximate Inference Effectiveness
Although inference methods for the reference-free
metrics can be defined, it is possible that they fail
to find high-scoring outputs due to the complexity
of the search problem. However, in this analysis, we show that the simple approximate inference procedures defined in §4 are effective at optimizing the metrics’ scores.
We compared the outputs obtained by the inference algorithms to those from systems included
in the WMT’19, SummEval, and REALSumm
datasets. Fig. 2 evaluates using the direct optimization procedure (§4.1) to select the best Prism-src output, Fig. 3 shows the results of using reranking (§4.3) to pick the best outputs according to COMET-QE, and Fig. 4 contains the results of using the greedy extractive procedure (§4.2) to optimize QuestEval. The figures also include the systems’ scores under the reference-based metrics, BLEURT for MT and ROUGE for summarization.
Other combinations of reference-based metrics and
inference algorithms can be found in Appendix B.
In all MT language pairs and both summarization datasets, the inference algorithms produce the highest-scoring outputs under the reference-free metrics, often by a large margin. For example, reranking translations according to their COMET-QE scores on de→en results in a relative 38% im-