
Extending Compositional Attention Networks for Social Reasoning in Videos
Christina Sartzetaki1, Georgios Paraskevopoulos1,2, Alexandros Potamianos1
1School of ECE, National Technical University of Athens, Greece
2Institute for Speech and Language Processing, Athens, Greece
christina.sartzetaki@gmail.com, {geopar,potam}@central.ntua.gr
Abstract
We propose a novel deep architecture for the task of reasoning about social interactions in videos. We leverage the multi-step reasoning capabilities of Compositional Attention Networks (MAC) [1], and propose a multimodal extension (MAC-X). MAC-X is based on a recurrent cell that performs iterative mid-level fusion of input modalities (visual, auditory, text) over multiple reasoning steps, by use of a temporal attention mechanism. We then combine MAC-X with LSTMs for temporal input processing in an end-to-end architecture. Our ablation studies show that the proposed MAC-X architecture can effectively leverage multimodal input cues using mid-level fusion mechanisms. We apply MAC-X to the task of Social Video Question Answering in the Social-IQ dataset and obtain a 2.5% absolute improvement in terms of binary accuracy over the current state-of-the-art.
Index Terms: Video Question Answering, Social Reasoning,
Compositional Attention Networks, MAC
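As a concrete illustration of the iterative mid-level fusion described in the abstract, the following is a minimal PyTorch-style sketch of one fusion step: a temporal attention mechanism, guided by the current reasoning (control) state, summarizes each LSTM-encoded modality, and the summaries are fused into a single knowledge vector. All module, layer, and variable names here are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MidLevelFusion(nn.Module):
        """One step of temporal-attention fusion over three modalities
        (hypothetical module; a sketch of the idea, not the paper's code)."""
        def __init__(self, d):
            super().__init__()
            self.score = nn.Linear(d, 1)     # temporal attention scorer
            self.fuse = nn.Linear(3 * d, d)  # fuses the per-modality summaries

        def attend(self, control, sequence):
            # Temporal attention over one (B, T, d) modality sequence,
            # guided by the (B, d) control state of this reasoning step.
            scores = self.score(sequence * control.unsqueeze(1))  # (B, T, 1)
            return (F.softmax(scores, dim=1) * sequence).sum(1)   # (B, d)

        def forward(self, control, visual, audio, text):
            # Each modality is a (B, T, d) sequence of LSTM hidden states.
            summaries = [self.attend(control, m) for m in (visual, audio, text)]
            return self.fuse(torch.cat(summaries, dim=-1))        # (B, d)

Since the abstract states that fusion is performed iteratively over multiple reasoning steps, a step like this would run inside the recurrent cell at every iteration rather than once before it.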
1. Introduction
Humans are social creatures; our survival and well-being depend on effective communication with others. This is achieved through perceiving and understanding information from multiple sensory modalities, as well as reasoning and arriving at conclusions in order to respond accordingly. Artificial intelligence systems need to be able to process interactions between the different sensory modalities to gain an in-depth understanding of their environment, and for that reason multimodal machine learning has developed into a vibrant multi-disciplinary field of increasing importance and extraordinary potential [2], with a wide range of benchmark tasks.
In Visual Question Answering (VQA), a task sometimes described as a visual Turing test [3, 4], an AI agent is required to answer a natural language question based on an input image, with answers in either multiple-choice or open-ended format. The VQA task was introduced in [5] and it inspired the creation of several datasets focusing on different aspects of the task [6, 7, 8, 9]. The VQA task can also be formulated with video content (Video QA) [10, 11, 12], where the input has a temporal dimension and may include audio and a dialogue transcript. Video QA is a more complex multimodal task that may require action recognition, conversation and storyline understanding, as well as the use of speech characteristics such as prosody, timbre, and pitch. Social-IQ [13] is an unconstrained benchmark that introduces the task of Social Video Question Answering. It consists of human-centered videos in the wild along with social and theory-of-mind-related questions, and answering them can demand sophisticated combinations of language understanding, cultural knowledge, and logical and causal reasoning, on top of non-social layers of comprehension about physical events [14].
Figure 1: Example from the Social-IQ dataset: The man looks lovingly at the little leopard while exclaiming “So sweet!”

A direction that has proven successful in the VQA literature is combining modules of memory and attention. In [15], the Dynamic Memory Network (DMN) [16] proposed for Text QA is extended for application in VQA, while in [17] it is enhanced with new mechanisms for Video QA. Notably, [18] proposes a bottom-up and top-down attention mechanism for salient image regions, and in [19] images and questions are processed through self- and cross-attention. Lastly, in [20] the commonly used RNNs are replaced with positional self-attention. Another approach in recent research is neurosymbolic models, which attempt to get the best of both worlds from deep neural networks and older symbolic-AI techniques. In [21], strong supervision is used to translate questions to functional programs followed by a question-specific neural network, as opposed to [22], where this translation requires no explicit supervision. Moving towards a more neural approach, the method proposed in [23] predicts a probabilistic graph for the image and performs sequential reasoning over the abstract latent space of that graph.
The Memory Attention Composition (MAC) Network [1] was proposed in an attempt to capture the “logic of thought” in addition to constructing neural representations from the data. The MAC Network exploits the core ideas of attention that underlie neural models, but also provides an architecture suited for soft symbolic reasoning. In [24], the authors introduce a dual-process neural architecture for Video QA where MAC is employed as “System 2”, taking as input a temporal attention space-time representation from “System 1”.
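To make the reasoning cell concrete, here is a simplified sketch of one MAC cell with its control, read, and write units, following the high-level description in [1]; the layer shapes and interaction terms are assumptions for illustration, not the reference implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MACCell(nn.Module):
        """Simplified MAC reasoning cell (control / read / write units)."""
        def __init__(self, d):
            super().__init__()
            self.ctrl_attn = nn.Linear(d, 1)      # attention over question words
            self.read_proj = nn.Linear(2 * d, d)  # memory-knowledge interaction
            self.read_attn = nn.Linear(d, 1)      # attention over knowledge items
            self.write = nn.Linear(2 * d, d)      # integrates retrieved info

        def forward(self, control, memory, words, knowledge):
            # Control unit: attend over the (B, L, d) question words to
            # decide what this reasoning step should focus on.
            c_scores = self.ctrl_attn(words * control.unsqueeze(1))
            control = (F.softmax(c_scores, dim=1) * words).sum(1)
            # Read unit: attend over the (B, N, d) knowledge base, guided
            # by the current memory and the new control state.
            interact = self.read_proj(
                torch.cat([knowledge, memory.unsqueeze(1) * knowledge], dim=-1))
            r_scores = self.read_attn(interact * control.unsqueeze(1))
            retrieved = (F.softmax(r_scores, dim=1) * knowledge).sum(1)
            # Write unit: merge the retrieved information into the memory.
            memory = self.write(torch.cat([retrieved, memory], dim=-1))
            return control, memory

Stacking several such cells and reading the answer off the final memory state yields the multi-step reasoning behavior that MAC-X extends to multiple input modalities.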
For the task of Social Video Question Answering, the methods previously explored on Social-IQ typically make use of attention and fusion mechanisms, and can be summarized as follows. First, Tensor Memory Fusion Network (TMFN) [13] is a baseline created by performing architecture and hyperparameter