Are Current Decoding Strategies Capable of Facing
the Challenges of Visual Dialogue?
Amit Kumar Chaudhary
CIMeC, University of Trento
amitkumar.chaudhar@unitn.it
Alex J. Lucassen
CIMeC, University of Trento
alex.lucassen@unitn.it
Ioanna Tsani
CIMeC, University of Trento
ioanna.tsani@unitn.it
Alberto Testoni
DISI, University of Trento
alberto.testoni@unitn.it
Abstract
Decoding strategies play a crucial role in natural language generation systems. They are usually designed and evaluated in open-ended text-only tasks, and it is not clear how different strategies handle the numerous challenges that goal-oriented multimodal systems face (such as grounding and informativeness). To answer this question, we compare a wide variety of different decoding strategies and hyper-parameter configurations in a Visual Dialogue referential game. Although none of them successfully balance lexical richness, accuracy in the task, and visual grounding, our in-depth analysis allows us to highlight the strengths and weaknesses of each decoding strategy. We believe our findings and suggestions may serve as a starting point for designing more effective decoding algorithms that handle the challenges of Visual Dialogue tasks.
1 Introduction
The last few years have witnessed remarkable progress in developing efficient generative language models. The choice of the decoding strategy plays a crucial role in the quality of the output (see Zarrieß et al. (2021) for an exhaustive overview). It should be noted that decoding strategies are usually designed for and evaluated in text-only settings. The most-used decoding strategies can be grouped into two main classes. On the one hand, decoding strategies that aim to generate text that maximizes likelihood (like greedy and beam search) are shown to generate generic, repetitive, and degenerate output. Zhang et al. (2021) refer to this phenomenon as the likelihood trap and provide evidence that these strategies lead to sub-optimal sequences. On the other hand, stochastic strategies like pure sampling, top-k sampling, and nucleus sampling (Holtzman et al., 2020) increase the variability of generated texts by taking random samples from the model. However, this comes at the cost of generating words that are not semantically appropriate for the context in which they appear. Recently, Meister et al. (2022) used an information-theoretic framework to propose a new decoding algorithm (typical decoding), which samples tokens with an information content close to their conditional entropy. Typical decoding shows promising results in human evaluation experiments but, given its recent release, it is not clear yet how general this approach is.
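To make the criterion behind typical decoding concrete, the snippet below sketches a filtering step over next-token logits that keeps the tokens whose surprisal is closest to the conditional entropy, up to a cumulative mass τ. This is a minimal PyTorch sketch following the published description, not the implementation evaluated in this paper; the tensor shapes and variable names are our own.

```python
import torch

def typical_filter(logits: torch.Tensor, tau: float = 0.95) -> torch.Tensor:
    """Keep tokens whose information content is closest to the conditional
    entropy of the next-token distribution, until their cumulative probability
    reaches tau; all other tokens are masked out (set to -inf)."""
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1, keepdim=True)   # H(p)
    deviation = ((-log_probs) - entropy).abs()                 # |surprisal - entropy|
    _, sorted_idx = torch.sort(deviation, dim=-1)              # most "typical" tokens first
    cum_probs = probs.gather(-1, sorted_idx).cumsum(dim=-1)
    keep = torch.zeros_like(cum_probs, dtype=torch.bool)
    keep[..., 0] = True                                        # always keep at least one token
    keep[..., 1:] = cum_probs[..., :-1] < tau                  # add tokens until mass tau is covered
    mask = torch.zeros_like(keep).scatter(-1, sorted_idx, keep)
    return logits.masked_fill(~mask, float("-inf"))

# sampling step: draw the next token from the filtered distribution
logits = torch.randn(1, 1000)                                  # toy vocabulary of 1000 tokens
next_token = torch.multinomial(torch.softmax(typical_filter(logits, tau=0.95), dim=-1), 1)
```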
Multimodal vision & language systems have recently received a lot of attention from the research community, but a thorough analysis of different decoding strategies in these systems has not been carried out. Thus, the question arises of whether the above-mentioned decoding strategies can handle the challenges of multimodal systems, i.e., generate text that takes into account not only lexical variability but also grounding in the visual modality. Moreover, in goal-oriented tasks, the informativeness of the generated text plays a crucial role as well. To address these research questions, in this paper we take a referential visual dialogue task, GuessWhat?! (De Vries et al., 2017), where two players (a Questioner and an Oracle) interact so that the Questioner identifies the secret object assigned to the Oracle among the ones appearing in an image (see Figure 1 for an example). Apart from well-known issues, such as repetitions in the output, this task poses specific challenges for evaluating decoding techniques compared to previous work. On the one hand, the generated output has to be coherent with the visual input upon which the conversation takes place. As highlighted by Rohrbach et al. (2018) and Testoni and Bernardi (2021b), multimodal generative models often generate hallucinated entities, i.e., tokens that refer to entities that do not appear in the image upon which the conversation takes place. On the other hand, the questions must be informative, i.e., they must help the Questioner to incrementally identify the target object.
Figure 1: Example of a GuessWhat game from De Vries et al. (2017).

We show that the choice of the decoding strategy and its hyper-parameter configuration heavily affects the quality of the generated output. Our results highlight the specific strengths and weaknesses of decoding strategies that aim at generating sequences with the highest probability vs. strategies that randomly sample words. We find that none of the decoding strategies currently available is able to balance task accuracy and linguistic quality of the output. However, we also show which strategies perform better at important challenges, such as incremental dialogue history, human evaluation, hallucination rate, and lexical diversity. We believe our work may serve as a starting point for designing decoding strategies that take into account all the challenges involved in Visual Dialogue tasks.
2 Task & Dataset
GuessWhat?! (De Vries et al., 2017) is a simple object identification game in English where two participants see a real-world image from MSCOCO (Lin et al., 2014) containing multiple objects. One player (the Oracle) is secretly assigned one object in the image (the target) and the other player (the Questioner) has to guess it by asking a series of binary yes-no questions to the Oracle. The task is considered to be successful if the Questioner identifies the target. The dataset for this task was collected from human players via Amazon Mechanical Turk. The authors collected 150K dialogues with an average of 5.3 binary questions per dialogue. Figure 1 shows an example of a GuessWhat game from the dataset.
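As a rough illustration of the interaction protocol, the sketch below runs one game as a loop between the two players; the Questioner, Oracle, and Guesser interfaces and the question budget are hypothetical placeholders, not the actual dataset-collection or model setup.

```python
MAX_QUESTIONS = 8  # assumption: a fixed question budget; human games average 5.3 questions

def play_game(questioner, oracle, guesser, image, candidate_objects, target):
    """Run one GuessWhat?! game and return whether the Questioner wins."""
    dialogue = []
    for _ in range(MAX_QUESTIONS):
        question = questioner.ask(image, dialogue)          # e.g. "Is it a person?"
        answer = oracle.answer(image, target, question)     # "yes" / "no" / "n/a"
        dialogue.append((question, answer))
    guess = guesser.select(image, dialogue, candidate_objects)
    return guess == target                                  # success if the target is identified
```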
3 Model and Decoding Strategies
We use the model and pre-trained checkpoints of the Questioner agent made available by Testoni and Bernardi (2021c) for the GuessWhat?! task. This model is based on the GDSE architecture (Shekhar et al., 2019). It uses a ResNet-152 network (He et al., 2016) to encode the images and an LSTM network to encode the dialogue history. A multimodal shared representation is generated and then used to train both the question generator (which generates a follow-up question given the dialogue history) and the Guesser module (which selects the target object among a list of candidates at the end of the dialogue) in a joint multi-task learning fashion. Testoni and Bernardi (2021c) added an internal Oracle module to the GDSE architecture, which guides a cognitively-inspired beam search re-ranking strategy (Confirm-it) at inference time: this strategy promotes the generation of questions that aim at confirming the model's intermediate conjectures about the target. In our work, at inference time the Questioner agent always interacts with the baseline Oracle agent proposed in De Vries et al. (2017).
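For orientation, the following is a highly simplified sketch of a GDSE-style Questioner: precomputed image features and an LSTM encoding of the dialogue history are fused into a shared representation that feeds both the question-generation decoder and the Guesser. Layer sizes, the pooled-feature assumption, and the fixed candidate list are illustrative simplifications of ours, not the configuration of Shekhar et al. (2019) or Testoni and Bernardi (2021c).

```python
import torch
import torch.nn as nn

class GDSESketch(nn.Module):
    """Simplified sketch of a GDSE-style Questioner: image and dialogue are fused
    into a shared representation used by both the question generator and the Guesser."""

    def __init__(self, vocab_size, img_dim=2048, emb_dim=512, hid_dim=1024, num_candidates=20):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.dialogue_encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.fusion = nn.Linear(img_dim + hid_dim, hid_dim)   # shared multimodal representation
        self.qgen = nn.LSTM(emb_dim + hid_dim, hid_dim, batch_first=True)
        self.qgen_out = nn.Linear(hid_dim, vocab_size)        # next-token logits for question generation
        self.guesser = nn.Linear(hid_dim, num_candidates)     # scores over candidate objects (simplified)

    def forward(self, img_feats, history_tokens, question_tokens):
        # img_feats: (B, img_dim) pooled ResNet-152 features, assumed precomputed
        _, (h, _) = self.dialogue_encoder(self.word_emb(history_tokens))
        shared = torch.tanh(self.fusion(torch.cat([img_feats, h[-1]], dim=-1)))
        # question generator: condition every decoding step on the shared representation
        q_emb = self.word_emb(question_tokens)
        cond = shared.unsqueeze(1).expand(-1, q_emb.size(1), -1)
        dec_out, _ = self.qgen(torch.cat([q_emb, cond], dim=-1))
        return self.qgen_out(dec_out), self.guesser(shared)
```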
We analyse the effect of a large number of decoding strategies as well as hyper-parameter configurations for each strategy: as highlighted by Zhang et al. (2021), it is crucial to evaluate different hyper-parameter configurations when comparing multiple decoding strategies. Among the ones that maximize the likelihood of the sequence, we consider plain beam search (with a beam size of 3) and greedy search. We also consider Confirm-it, the cognitively-inspired beam search re-ranking strategy proposed in Testoni and Bernardi (2021c) for promoting the generation of questions that aim at confirming the model's intermediate conjectures about the target. This strategy re-ranks the set of candidate questions from beam search and selects the one that helps the most in confirming the model's hypothesis about the target. As for stochastic strategies, we analyse pure sampling, top-k sampling (with different k values), and nucleus sampling (with different p values), a strategy proposed in Holtzman et al. (2020) which selects the highest-probability tokens whose cumulative probability mass exceeds a given threshold p. We also consider typical decoding (with different τ values), a recently proposed strategy (Meister et al., 2022) based on an information-theoretic framework. We refer to the respective papers for further details.
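For reference, the snippet below sketches how top-k and nucleus sampling restrict the candidate set before drawing the next token. It is a generic illustration of these filters over next-token logits, not the codebase used in our experiments; beam search, Confirm-it re-ranking, and typical decoding (sketched earlier) are not reproduced here.

```python
import torch

def top_k_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k highest-scoring tokens; mask the rest with -inf."""
    kth_best = torch.topk(logits, k, dim=-1).values[..., -1, None]
    return logits.masked_fill(logits < kth_best, float("-inf"))

def nucleus_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability mass exceeds p (Holtzman et al., 2020); mask the rest."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    keep = torch.zeros_like(cum_probs, dtype=torch.bool)
    keep[..., 0] = True                      # always keep the most likely token
    keep[..., 1:] = cum_probs[..., :-1] < p  # add tokens until mass p is exceeded
    mask = torch.zeros_like(keep).scatter(-1, sorted_idx, keep)
    return logits.masked_fill(~mask, float("-inf"))

# pure sampling corresponds to no filtering at all; greedy search to taking the argmax
logits = torch.randn(1, 1000)                # toy vocabulary of 1000 tokens
filtered = nucleus_filter(logits, p=0.9)
next_token = torch.multinomial(torch.softmax(filtered, dim=-1), 1)
```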