
Extending Compositional Attention Networks for Social Reasoning in Videos
Christina Sartzetaki1, Georgios Paraskevopoulos1,2, Alexandros Potamianos1
1School of ECE, National Technical University of Athens, Greece
2Institute for Speech and Language Processing, Athens, Greece
christina.sartzetaki@gmail.com, {geopar,potam}@central.ntua.gr
Abstract
We propose a novel deep architecture for the task of reasoning about social interactions in videos. We leverage the multi-step reasoning capabilities of Compositional Attention Networks (MAC) [1], and propose a multimodal extension (MAC-X). MAC-X is based on a recurrent cell that performs iterative mid-level fusion of input modalities (visual, auditory, text) over multiple reasoning steps, by use of a temporal attention mechanism. We then combine MAC-X with LSTMs for temporal input processing in an end-to-end architecture. Our ablation studies show that the proposed MAC-X architecture can effectively leverage multimodal input cues using mid-level fusion mechanisms. We apply MAC-X to the task of Social Video Question Answering in the Social-IQ dataset and obtain a 2.5% absolute improvement in terms of binary accuracy over the current state-of-the-art.
Index Terms: Video Question Answering, Social Reasoning,
Compositional Attention Networks, MAC
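As a concrete illustration of the iterative mid-level fusion described in the abstract, the following is a minimal PyTorch-style sketch of one fusion step: a temporal attention mechanism, guided by the current reasoning (control) state, summarizes each LSTM-encoded modality, and the summaries are fused into a single knowledge vector. All module, layer, and variable names here are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MidLevelFusion(nn.Module):
        """One step of temporal-attention fusion over three modalities
        (hypothetical module; a sketch of the idea, not the paper's code)."""
        def __init__(self, d):
            super().__init__()
            self.score = nn.Linear(d, 1)     # temporal attention scorer
            self.fuse = nn.Linear(3 * d, d)  # fuses the per-modality summaries

        def attend(self, control, sequence):
            # Temporal attention over one (B, T, d) modality sequence,
            # guided by the (B, d) control state of this reasoning step.
            scores = self.score(sequence * control.unsqueeze(1))  # (B, T, 1)
            return (F.softmax(scores, dim=1) * sequence).sum(1)   # (B, d)

        def forward(self, control, visual, audio, text):
            # Each modality is a (B, T, d) sequence of LSTM hidden states.
            summaries = [self.attend(control, m) for m in (visual, audio, text)]
            return self.fuse(torch.cat(summaries, dim=-1))        # (B, d)

Since the abstract states that fusion is performed iteratively over multiple reasoning steps, a step like this would run inside the recurrent cell at every iteration rather than once before it.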
1. Introduction
Humans are social creatures; our survival and well-being depend on effective communication with others. This is achieved through perceiving and understanding information from multiple sensory modalities, as well as reasoning and arriving at conclusions in order to respond accordingly. Artificial intelligence systems need to be able to process interactions between the different sensory modalities to gain an in-depth understanding of their environment, and for that reason multimodal machine learning has developed into a vibrant multi-disciplinary field of increasing importance and extraordinary potential [2], with a wide range of benchmark tasks.
In Visual Question Answering (VQA), a task sometimes described as a visual Turing test [3, 4], an AI agent is required to answer a natural language question based on an input image, with answers in either multiple-choice or open-ended format. The VQA task was introduced in [5] and it inspired the creation of several datasets focusing on different aspects of the task [6, 7, 8, 9]. The VQA task can also be formulated with video content (Video QA) [10, 11, 12], where the input has a temporal dimension and may include audio and a dialogue transcript. Video QA is a more complex multimodal task that may require action recognition, conversation and storyline understanding, as well as the use of speech characteristics such as prosody, timbre, and pitch. Social-IQ [13] is an unconstrained benchmark that introduces the task of Social Video Question Answering. It consists of human-centered videos in the wild along with social and theory-of-mind-related questions, and answering them can demand sophisticated combinations of language understanding, cultural knowledge, and logical and causal reasoning, on top of non-social layers of comprehension about physical events [14].
Figure 1: Example from the Social-IQ dataset: The man looks lovingly at the little leopard while exclaiming “So sweet!”

A direction that has proven successful in the VQA literature is combining modules of memory and attention. In [15], the Dynamic Memory Network (DMN) [16] proposed for Text QA is extended for application in VQA, while in [17] it is enhanced with new mechanisms for Video QA. Notably, [18] proposes a bottom-up and top-down attention mechanism for salient image regions, and in [19] images and questions are processed through self- and cross-attention. Lastly, in [20] the commonly used RNNs are replaced with positional self-attention. Another approach in recent research is neurosymbolic models, which attempt to get the best of both worlds from deep neural networks and older symbolic-AI techniques. In [21], strong supervision is used to translate questions to functional programs followed by a question-specific neural network, as opposed to [22], where this translation requires no explicit supervision. Moving towards a more neural approach, the method proposed in [23] predicts a probabilistic graph for the image and performs sequential reasoning over the abstract latent space of that graph.
The Memory Attention Composition (MAC) Network [1] was proposed in an attempt to capture the “logic of thought” in addition to constructing neural representations from the data. The MAC Network exploits the core ideas of attention that underlie neural models, but also provides an architecture suited for soft symbolic reasoning. In [24], the authors introduce a dual-process neural architecture for Video QA where MAC is employed as “System 2”, taking as input a temporal attention space-time representation from “System 1”.
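To make the reasoning cell concrete, here is a simplified sketch of one MAC cell with its control, read, and write units, following the high-level description in [1]; the layer shapes and interaction terms are assumptions for illustration, not the reference implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MACCell(nn.Module):
        """Simplified MAC reasoning cell (control / read / write units)."""
        def __init__(self, d):
            super().__init__()
            self.ctrl_attn = nn.Linear(d, 1)      # attention over question words
            self.read_proj = nn.Linear(2 * d, d)  # memory-knowledge interaction
            self.read_attn = nn.Linear(d, 1)      # attention over knowledge items
            self.write = nn.Linear(2 * d, d)      # integrates retrieved info

        def forward(self, control, memory, words, knowledge):
            # Control unit: attend over the (B, L, d) question words to
            # decide what this reasoning step should focus on.
            c_scores = self.ctrl_attn(words * control.unsqueeze(1))
            control = (F.softmax(c_scores, dim=1) * words).sum(1)
            # Read unit: attend over the (B, N, d) knowledge base, guided
            # by the current memory and the new control state.
            interact = self.read_proj(
                torch.cat([knowledge, memory.unsqueeze(1) * knowledge], dim=-1))
            r_scores = self.read_attn(interact * control.unsqueeze(1))
            retrieved = (F.softmax(r_scores, dim=1) * knowledge).sum(1)
            # Write unit: merge the retrieved information into the memory.
            memory = self.write(torch.cat([retrieved, memory], dim=-1))
            return control, memory

Stacking several such cells and reading the answer off the final memory state yields the multi-step reasoning behavior that MAC-X extends to multiple input modalities.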
For the task of Social Video Question Answering, the methods previously explored on Social-IQ typically make use of attention and fusion mechanisms, and can be summarized as follows. First, Tensor Memory Fusion Network (TMFN) [13] is a baseline created by performing architecture and hyperparameter