Predictive Event Segmentation and Representation with Neural
Networks: A Self-Supervised Model Assessed by Psychological
Experiments
Hamit Basgol1, Inci Ayhan2, Emre Ugur3
Department of Computer Science1 at the University of Tübingen, Germany
Department of Psychology2, Department of Computer Engineering3 at Boğaziçi University, Turkey
People segment complex, ever-changing and continuous experience into basic, stable and dis-
crete spatio-temporal experience units, called events. Event segmentation literature investi-
gates the mechanisms that allow people to extract these units from the continuous experience.
Aiming to shed light on event segmentation ability, event segmentation theory points out that
people predict ongoing activities and observe prediction error signals in order to find event
boundaries that keep events apart. In this study, we investigated the mechanism giving rise to
this ability with a computational model and accompanying psychological experiments. Inspired
by the principles of event segmentation theory and predictive processing, we introduced a
semi-mechanistic model of event segmentation, learning, and representation. This model con-
sists of feed-forward neural networks that predict the sensory signal in the next time-step in
order to represent different events, and a cognitive model that regulates these neural networks
on the basis of their prediction errors. In order to verify the ability of our model in segmenting
experience into spatio-temporal units, learning them during passive observation, and represent-
ing them in its internal representational space, we prepared a video that depicts natural human
behaviors represented by point-light displays. We compared event segmentation behaviors of
human participants and our model with this video in two hierarchical event segmentation levels.
By using the point-biserial correlation technique, we demonstrated that the event segmentation
decisions of our model correlated with the responses of participants. Moreover, by approximating
the internal representation space of participants with a similarity-based technique, we showed that
our model formed an internal representation space similar to those of the participants. Our results
suggest that our model, which tracks the prediction error signals, can produce human-like event
segmentation decisions and event representations. Finally, we discussed our contribution to the
literature of event cognition and our understanding of how event segmentation is implemented
in the brain.
Keywords: event segmentation, point-light displays, predictive processing, self-supervision
1 Introduction
Humans segment the continuous information stream into
event units to show robust, adaptive, and intelligent behav-
ior, which is called event segmentation (Richmond & Za-
cks, 2017; Zacks, 2020; Zacks, Speer, Swallow, Braver, &
Reynolds, 2007; Zacks & Swallow, 2007). In recent
years, a growing number of computational models have been
proposed to capture how humans segment events in order to
1Hamit Basgol is a PhD Student in the Department of Com-
puter Science at the University of Tübingen, Tübingen, Germany
(e-mail: hamitbasgol@gmail.com). The study was conducted while
Hamit Basgol was a master's student in the Cognitive Science Department
at Bogazici University.
2Inci Ayhan is with Department of Psychology in Bogazici Uni-
versity, Istanbul, Turkey (e-mail: inci.ayhan@boun.edu.tr)
3Emre Ugur is with the Department of Computer Engineering in Bogazici University, Istanbul, Turkey (e-mail: emre.ugur@boun.edu.tr)
arXiv:2210.05710v1 [q-bio.NC] 4 Oct 2022
utilize their continuous experience (Franklin, Norman, Ran-
ganath, Zacks, & Gershman, 2020; Gumbsch, Kneissler, &
Butz, 2016; Gumbsch, Otte, & Butz, 2017; Metcalf & Leake,
2017; Reynolds, Zacks, & Braver, 2007). Despite their es-
sential contribution to the understanding of the event seg-
mentation ability, these models have demonstrated certain
limitations. Namely, they were not capable of segmenting
events in varying lengths (Metcalf & Leake, 2017; Reynolds
et al., 2007), they used datasets that involved abrupt tran-
sitions between naturalistic action sequences (Metcalf &
Leake, 2017; Reynolds et al., 2007), and they included robotic
models that did not aim to capture human event segmentation
decisions (Gumbsch et al., 2016, 2017). In fact, to the
best of our knowledge, there has been only one study, where
authors compared the performance of their model to the hu-
man event segmentation decisions (Franklin et al., 2020).
In this study, we aim to address these limitations with a
novel computational model, which we built upon three main
elements: (1) the event segmentation theory (Zacks et al.,
2007), (2) the predictive processing of Clark (2013); Wiese
and Metzinger (2017), and (3) the robotic model of Gumbsch
et al. (2017), the contributions of which will be highlighted
throughout the paper. In this study, we showed that a self-
supervised and semi-mechanistic model monitoring predic-
tion error signals could produce multimodal event segments
in varying lengths and store the knowledge of events in acti-
vations of neural networks. Moreover, we compared the seg-
mentation and representation results of our model with those
of humans to reveal their similarities and differences. We be-
lieve that our model presents a fruitful approach to modeling
event segmentation and integrating event knowledge into a
wide range of perceptual and cognitive processes.
The introduction of the paper is organized as follows:
Firstly, we introduce the event segmentation theory and the
importance of the prediction error signals for the event seg-
mentation. Secondly, we review the computational models
of event segmentation and identify their limitations. Finally,
we explain the methodology of our current study, its results
and conclusions.
1.1 Event segmentation theory and prediction error
Early studies of event segmentation were conducted by
Newtson (1973) using a unitization paradigm, where partici-
pants were asked to watch a movie and segment it into meaningful units. The results of Newtson's study demonstrated
substantial agreement across participants on the segmentation
locations, which happened to be persistent in time.
Subsequent research verified these findings and opened up
the possibilities of investigating the role of events in human
cognition (Zacks, 2020; Zacks & Swallow, 2007). The lo-
cations at which participants segment a continuous informa-
tion stream (e.g., a movie) are termed as event boundaries,
which are the positions in time that show perceptual changes
in spatial locations, movements, relative distances between
agents, or goals (Cutting, 2014; Cutting, Brunick, & Can-
dan, 2012; Hard, Recchia, & Tversky, 2011; Hard, Tversky,
& Lang, 2006; Hu, Meitz, & Papenmeier, 2014; Kurby &
Zacks, 2008; Newtson, Engquist, & Bois, 1977; Zacks, 2020;
Zacks et al., 2007; Zacks, Speer, Swallow, & Maley, 2010).
Events are known to be hierarchically structured (Zacks,
2020). People can detect the smallest (fine-grained) and the
largest (coarse-grained) events (Hard et al., 2011, 2006; Newtson,
1973; Zacks, 2020; Zacks et al., 2001; Zacks & Swallow,
2007) when they are instructed to do so. Research with
functional magnetic resonance neuroimaging (fMRI) sug-
gests that hierarchical segmentation is an automatic process
(Speer, Zacks, & Reynolds, 2007; Zacks et al., 2001) such
that while observing a movie or reading a story, the brain
selectively responds to the fine- and coarse-grained event
boundaries. Hard et al. (2011), for example, demonstrated
that changes at event boundaries are more numerous than at
other parts of an activity; moreover, they particularly peak at
coarse-grained boundaries. The strong relationship between
both types of change, namely the sensory (fine-grained) and
the conceptual (coarse-grained) change, suggests that events
are segmented based on the perceptual cycle formed by the
bottom-up processing of sensory features and the top-down
processing of conceptual knowledge (Neisser, 1976; Zacks,
2020; Zacks et al., 2007).
A computational model or a theory of event segmentation
should explain at least two basic properties of event segmen-
tation. The first one is how locations of event boundaries are
detected and the second one is how event segmentation oper-
ation is conducted in different hierarchies. Event segmentation theory (EST) proposes an account for both of these prop-
erties. According to the EST, people constantly make per-
ceptual predictions by event models in the working memory
(Reynolds et al., 2007; Zacks et al., 2007). The event bound-
ary is formed when the current event model cannot capture
the current situation, in other words, when the corresponding
prediction error signal follows a transient increase. In such
situations, the system triggers another event model to predict
the following sensory input. Thus, a strategy based on
monitoring the prediction error signals might correspond to
the basic mechanism behind the event segmentation ability.
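This boundary-detection strategy can be sketched as follows; a minimal illustration of thresholding a prediction-error trace, not the actual model used in this study (the error values and the threshold are invented):

```python
def detect_boundaries(errors, threshold):
    """Return indices where the prediction error first exceeds `threshold`
    after having been below it (a transient increase)."""
    boundaries = []
    below = True
    for t, e in enumerate(errors):
        if below and e > threshold:
            boundaries.append(t)
            below = False
        elif e <= threshold:
            below = True
    return boundaries

# Toy error trace: two error spikes mark two event boundaries.
errors = [0.1, 0.1, 0.2, 0.9, 0.3, 0.1, 0.1, 0.8, 0.2, 0.1]
print(detect_boundaries(errors, threshold=0.5))  # -> [3, 7]
```

A transient rise above the threshold marks a boundary; the flag resets once the error falls back, so a sustained plateau is counted only once.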
Indeed, the EST and the role of prediction error signals in the
event segmentation were supported by many studies (Eisen-
berg, Zacks, & Flores, 2018; Franklin et al., 2020; Gumbsch
et al., 2016, 2017; Hard et al., 2011; Reynolds et al., 2007;
Stawarczyk, Bezdek, & Zacks, 2021; Zacks, Kurby, Eisen-
berg, & Haroutunian, 2011), despite exceptions (O’Reilly,
2013; Shin & DuBrow, 2021). Along with its focus on the
prediction error signals for event boundary detection, EST
also suggests that people might make predictions by events
in multiple timescales simultaneously and sensitivity differences between events to incoming prediction error signals
might determine their lengths or positions in the hierarchy.
For example, an event model might be sensitive to minor pre-
diction errors compared to another (Zacks & Swallow, 2007),
and this sensitivity dierence might make the former shorter
than the latter.
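This sensitivity idea can be illustrated by applying two thresholds, values invented for illustration, to the same prediction-error trace: the more sensitive (lower) threshold yields more, and therefore shorter, events:

```python
def boundaries(errors, threshold):
    """Indices where the prediction error exceeds a model's tolerance."""
    return [t for t, e in enumerate(errors) if e > threshold]

errors = [0.1, 0.4, 0.1, 0.9, 0.2, 0.5, 0.1, 0.95, 0.3]
fine = boundaries(errors, threshold=0.35)    # sensitive model: fine-grained
coarse = boundaries(errors, threshold=0.8)   # tolerant model: coarse-grained
print(fine)    # -> [1, 3, 5, 7]
print(coarse)  # -> [3, 7]
```

Every coarse-grained boundary is also a fine-grained boundary here, which mirrors the hierarchical nesting of events.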
Due to this mechanism, computational models in the literature have been mostly inspired by the EST (Franklin
et al., 2020; Gumbsch et al., 2016, 2017; Metcalf & Leake,
2017; Reynolds et al., 2007). All these models, on the other
hand, come with certain limitations, which will be the
topic of the next sub-section.
1.2 Computational models of event segmentation
Several important computational models have been pro-
posed in the literature with different limitations (Franklin et
al., 2020; Gumbsch et al., 2016, 2017; Reynolds et al., 2007).
For example, Reynolds et al. (2007) utilized a set of sequence
models for the segmentation of human behaviors. Despite
the success of the model in detecting event boundaries, hier-
archical segmentation of events in varying granularities was
not addressed. At the same time, behavioral sequences that
were used for training the model involved abrupt and unnat-
ural transitions. Metcalf and Leake (2017) enhanced this
model with a reinforcement learning agent. Although these
two models suggest that monitoring prediction error signals
is an effective strategy for event segmentation, they did
not address hierarchical segmentation of events.
Gumbsch et al. (2016, 2017) developed a robotic model
that chunks sensory-motor information flow into parts. The
model represents events by linear models, which encode dif-
ferent sensory dimensions and predict sensory signal in the
next time-step. The linear models are regulated by a cog-
nitive model at a higher level based on the prediction er-
rors of the lower-level linear models. From this perspective, whereas the cognitive model resembles the mechanism
proposed by the EST, the linear models correspond to the
working memory representations. As has been addressed in
Gumbsch, Butz, and Martius (2019), however, since linear
models encoding sensory dimensions are disconnected from
one another, Gumbsch et al.'s model is not capable of discovering multi-modal associations between sensory modalities
in a particular event structure. Besides this limitation, these
models are robotic models that assume the involvement of
an active agent. However, the event segmentation literature
is based largely on the unitization paradigm in which partici-
pants observe events passively and press a button to separate
them from one another (Hard & Tversky, 2003; Hard et al.,
2006; Newtson, 1973; Newtson & Engquist, 1976; Newtson
et al., 1977; Zacks, 2020; Zacks et al., 2001, 2007).
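A toy sketch of such dimension-wise linear prediction (our illustration, not Gumbsch et al.'s implementation): each sensory dimension gets its own independent linear predictor of the next value, which is precisely why such models cannot capture associations between modalities.

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = a * x + b for one sensory dimension."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

signal = [2.0, 4.0, 6.0, 8.0]             # one sensory dimension over time
a, b = fit_linear([0, 1, 2], signal[:3])  # fit on the observed prefix
prediction = a * 3 + b                    # predict the value at t = 3
error = abs(prediction - signal[3])       # prediction error for that step
print(prediction, error)  # -> 8.0 0.0
```

A cognitive model sitting above such predictors would compare `error` against its tolerance, as in the EST sketch above, and switch linear models when the error spikes.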
Lastly, Franklin et al. (2020) developed an inclusive
model of event cognition, which considers various domains
such as event memorization, segmentation, retrieval, and in-
ference. To the best of our knowledge, this is the only
study that used naturalistic videos and considered the human-
level event segmentation performance for the model valida-
tion, even though the reported correlation between the performances of the model and the ground-truth data is open to improvement.
Overall, the computational models of event segmentation
suggest that the EST presents a plausible mechanism for the
event segmentation task. To address the missing points
in the literature and to test the mechanism suggested by the
EST for more than one granularity level, we developed a
novel computational model for event segmentation. In addition to event segmentation capability, our model could form
event representations.
1.3 Event representations
Representations are mental objects with semantic proper-
ties (Pitt, 2020). To express the strength of the relationships
between represented entities, a representational space can be
formed by taking the pairwise distances between all representations
(Shepard, 1980, 1987; Shepard & Arabie, 1979). This
correspondence between similarity and distance makes similarity
a valuable metric for revealing how a system organizes knowledge,
since representations form the basis of categorization and
generalization. One aim of artificial intelligence is to learn valuable and representative information
from the data (Bengio, Courville, & Vincent, 2014). Multi-
layer perceptrons (i.e., deep neural networks) can learn dis-
tributed and semantically meaningful representations (Ben-
gio et al., 2014; Urban & Gates, 2021). The similarity be-
tween representations (i.e., semantic relationships between
represented entities) of a deep learning model can be found
by the Euclidean distance or cosine similarity. For exam-
ple, the semantic relationship between words and sentences
(Mikolov, Sutskever, Chen, Corrado, & Dean, 2013; Rogers
& McClelland, 2005), objects (Deselaers & Ferrari, 2011),
scenes (Eslami et al., 2018), and episodes (Rothfuss, Fer-
reira, Aksoy, Zhou, & Asfour, 2018) can be captured with
the help of representations learned by a deep learning system.
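As an illustrative sketch, pairwise cosine similarity between hypothetical learned event representations (all vectors below are invented, not outputs of any trained model):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two representation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical 3-d activations for three events:
walk = [1.0, 0.9, 0.1]
run = [0.9, 1.0, 0.2]
sit = [0.1, 0.2, 1.0]

# The two locomotion events lie closer together than either does to sitting.
print(cosine_similarity(walk, run) > cosine_similarity(walk, sit))  # -> True
```

Computing this for every pair of representations yields the similarity matrix that defines a representational space.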
Since representations give researchers insight into how humans organize knowledge, generalize between instances, and
make analogical transfers (Blough, 2001; Nosofsky, 1992;
Shepard, 1980, 1987; Tversky, 1977), they have a fundamen-
tal place in cognitive science. As could be expected, researchers have exploited human similarity judgments to recover
human mental representations (Shepard, 1980, 1987; Shep-
ard & Arabie, 1979). The role of representations and similar-
ity judgments in artificial intelligence and cognitive science
suggests that they might provide a basis for comparing people
and machines. In fact, recent research provides excellent ex-
amples of this comparison (Hebart, Zheng, Pereira, & Baker,
2020; Peterson, Abbott, & Griffiths, 2018).
The event representation literature is very rich and encompasses
a diverse set of studies (Blom, Feuerriegel, Johnson, Bode,
& Hogendoorn, 2020; Day & Bartels, 2008; Fivush, Kuebli,
& Clubb, 1992; Kominsky, Baker, Keil, & Strickland, 2021;
Schütz-Bosbach & Prinz, 2007; Sheldon & El-Asmar, 2018;
Wang, Cherkassky, & Just, 2017). In the context of computa-
tional modeling, recent studies use (Shen, Fu, Deng, & Ino,
2020) and learn (Dias & Dimiccoli, 2018) event representa-
tions. In contrast, despite the interest received by event representations, event similarity judgments remain a relatively
unexplored area, subsumed under action similarity judgments (Tarhan, de Freitas,
Alvarez, & Konkle, 2020; Tarhan & Konkle, 2018). In our
work, utilizing this possibility, we compare the event repre-
sentations of our computational model and participants by
exploiting event similarity judgments.
1.4 Our contribution
In this study, inspired by the EST (Zacks et al., 2007),
predictive processing (Clark, 2013; Wiese & Metzinger,
2017), and Gumbsch's robotic model (Gumbsch et al.,
2016, 2017), we developed a novel computational model for
event segmentation. Our model consists of multi-layer perceptrons (i.e., event models) that are managed by a cognitive
mechanism which, consequently, determines the event boundaries. As our contribution to the literature, (1) our model
is capable of learning to represent and predict multi-modal
event segments with sensory associations in passive observa-
tion unlike the models developed by (Gumbsch et al., 2016,
2017) which segment unimodal events based on actions in
a simulation environment. (2) With the help of a parameter that changes the sensitivities of event models to prediction error
signals, our model can also segment events in varying gran-
ularities, which was not addressed by Reynolds et al. (2007)
and Metcalf and Leake (2017). (3) Moreover, segmentation
and representation capabilities of our model were tested by
ground-truth data received from psychological experiments.
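The point-biserial correlation used for this comparison relates a binary variable (the model's boundary decisions) to a continuous one (participants' segmentation rates); a sketch with invented data, implemented by hand so no external libraries are needed:

```python
import math

def point_biserial(binary, continuous):
    """Point-biserial correlation between a 0/1 variable and a continuous one."""
    n = len(binary)
    g1 = [y for b, y in zip(binary, continuous) if b == 1]
    g0 = [y for b, y in zip(binary, continuous) if b == 0]
    m1, m0 = sum(g1) / len(g1), sum(g0) / len(g0)
    mean = sum(continuous) / n
    sd = math.sqrt(sum((y - mean) ** 2 for y in continuous) / n)  # population SD
    return (m1 - m0) / sd * math.sqrt(len(g1) * len(g0) / n ** 2)

# Hypothetical data: model boundary decisions per time step vs. the
# proportion of participants pressing the segmentation button there.
model_boundaries = [0, 0, 1, 0, 0, 1, 0, 0]
human_rates = [0.05, 0.10, 0.80, 0.15, 0.05, 0.70, 0.10, 0.05]
r = point_biserial(model_boundaries, human_rates)
print(round(r, 3))
```

A value near 1 means the model's boundaries coincide with the moments where human segmentation responses peak.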
A multi-layer perceptron is a plain deep neural network
which consists of an input layer, an intermediate (hidden)
layer or layers, and an output layer. The network learns
the relationship between inputs and outputs by updating
weights in each iteration. Thanks to the hidden units, multi-
layer perceptrons can classify complex patterns (Lippmann,
1989) by approximating non-linear functions (Hornik, Stinchcombe, & White, 1989). Moreover, the knowledge devel-
oped throughout the training is stored in weights and what is
learned by the model can be explored by analyzing the rep-
resentations of the network (Fleming & Storrs, 2019; Hebart
EVENT SEGMENTATION 5
et al., 2020; Peterson et al., 2018). The use of deep neu-
ral networks in cognitive science and artificial intelligence
has a long history and had an important role in the emer-
gence of the connectionist framework (Rumelhart, Hinton, &
Williams, 1986). The effect of the connectionist framework, in
other words deep learning models, on cognitive science still
persists and leads to revolutionary results in a wide variety
of domains such as perception (Fleming & Storrs, 2019; He,
Zhang, Ren, & Sun, 2016; Krizhevsky, Sutskever, & Hin-
ton, 2012; Russakovsky et al., 2015; Spoerer, McClure, &
Kriegeskorte, 2017), linguistics (Floridi & Chiriatti, 2020;
Radford, Jozefowicz, & Sutskever, 2017; Wu et al., 2016),
developmental psychology (Orhan, Gupta, & Lake, 2020),
and cognitive neuroscience (Khaligh-Razavi & Kriegeskorte,
2014; Tripp, 2017; Yamins & DiCarlo, 2016).
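Concretely, a multi-layer perceptron of the kind just described can serve as a one-step predictor of a sensory stream; a minimal pure-Python sketch with illustrative sizes, data, and learning rate (not the architecture used in this study):

```python
import math
import random

random.seed(0)
H = 8                                            # hidden units
w1 = [random.uniform(-0.5, 0.5) for _ in range(H)]
b1 = [0.0] * H
w2 = [random.uniform(-0.5, 0.5) for _ in range(H)]
b2 = 0.0

def forward(x):
    """One hidden tanh layer mapping x_t to a prediction of x_{t+1}."""
    h = [math.tanh(w1[j] * x + b1[j]) for j in range(H)]
    return sum(w2[j] * h[j] for j in range(H)) + b2, h

# A deterministic 1-d stream (logistic map), so x_{t+1} is a function of x_t.
signal = [0.2]
for _ in range(59):
    signal.append(3.7 * signal[-1] * (1.0 - signal[-1]))
pairs = list(zip(signal[:-1], signal[1:]))       # (input, self-supervised target)

def mse():
    return sum((forward(x)[0] - y) ** 2 for x, y in pairs) / len(pairs)

before = mse()
lr = 0.02
for _ in range(300):                             # plain SGD on the squared error
    for x, y in pairs:
        pred, h = forward(x)
        d = 2.0 * (pred - y)
        for j in range(H):
            grad_h = d * w2[j] * (1.0 - h[j] ** 2)
            w2[j] -= lr * d * h[j]
            w1[j] -= lr * grad_h * x
            b1[j] -= lr * grad_h
        b2 -= lr * d
after = mse()
print(round(before, 4), round(after, 4))
```

Training reduces the mean squared prediction error on this toy stream; it is exactly this error signal that a higher-level cognitive model could monitor for event boundaries.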
We used multi-layer perceptrons, members of the deep
neural network family, to represent events. A deep neural network
model can be trained in several ways: supervised, unsupervised, and self-supervised. In supervised learning,
models receive outputs (e.g., categories for object identifi-
cation) of inputs (e.g., images) from huge labelled datasets
(Krizhevsky et al., 2012; Russakovsky et al., 2015). Despite
its success in a range of domains such as object identifica-
tion (Krizhevsky et al., 2012; Russakovsky et al., 2015), su-
pervised learning is criticized for being inconsistent with how
humans actually learn. Humans learn new concepts and abilities
with little supervision, without the requirement of hand-crafted
labels (Vinyals, Blundell, Lillicrap, Kavukcuoglu, &
Wierstra, 2016). The dependency of supervised learning on
labels has led researchers to investigate other learning possibilities
such as unsupervised and self-supervised learning that
do not require explicit labels. In unsupervised learning,
the network learns how to represent data efficiently without
relying on labels, by capturing the high-order statistics
of the dataset (Fleming & Storrs, 2019). In self-supervised
learning, on the other hand, labels are substituted
by information in the input data: rather
than mapping inputs to hand-crafted labels, the network
learns to predict selected parts of the data (e.g., predicting
the next sequence of a video or a certain part of an image)
and generates representations in this way (Liu et al., 2021).
In the context of event segmentation, using a supervised
model receiving event boundary locations from a supervi-
sor or instructor would not be natural because infants mostly
learn new abilities without supervision. Therefore, our
model is self-supervised, as it learns to make predictions within a
segment and detects event boundaries from the data without
the need for human-crafted labels (Liu et al., 2021).
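The contrast can be made concrete with toy data: supervised learning needs hand-crafted labels, whereas self-supervised learning mines its targets from the stream itself.

```python
stream = [3, 1, 4, 1, 5, 9]  # any sensory sequence

# Supervised: labels must be supplied by a human annotator.
supervised = list(zip(stream, ["a", "b", "a", "b", "c", "c"]))

# Self-supervised: the target of each input is simply the next observation.
self_supervised = list(zip(stream[:-1], stream[1:]))
print(self_supervised)  # -> [(3, 1), (1, 4), (4, 1), (1, 5), (5, 9)]
```

No annotation effort is needed for the second dataset; the stream labels itself.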
Despite the contribution of deep neural networks to the
progress in a wide range of areas from linguistics to neu-
roscience (Fleming & Storrs, 2019; Floridi & Chiriatti,
2020; He et al., 2016; Khaligh-Razavi & Kriegeskorte, 2014;
Krizhevsky et al., 2012; Orhan et al., 2020; Radford et al.,
2017; Russakovsky et al., 2015; Spoerer et al., 2017; Tripp,
2017; Wu et al., 2016; Yamins & DiCarlo, 2016), they have
an important limitation, namely explainability. Deep neu-
ral networks are black-box models whose internal functioning is not
explicit, which is a crucial shortcoming especially in domains
where understanding the decisions of networks is critical
(e.g., medical decision making) or in scientific practice
(e.g., why the network decides this way and what this decision
says about the problem in question). Even though it uses
multi-layer perceptrons for representing the event segments,
our model is not an end-to-end black box model (Rudin &
Radin, 2019); rather, it involves an easily understandable
and trackable white-box model that regulates multi-layer
perceptrons. From this perspective, our model is a semi-
mechanistic model, incorporating the capabilities of white-
box and black-box models, giving researchers a chance to
benefit from the power of neural networks without fully sac-
rificing the explainability.
Our proposed model is both self-supervised and semi-
mechanistic. By tracking the prediction error signals, (1)
the model produces multimodal event segments in varying
hierarchies via passive observation with the help of multi-
layer perceptrons, unlike the models developed by Gumbsch
et al. (2016, 2017). (2) With the help of a parameter that changes
the sensitivities of event models to prediction error signals,
our model can produce event segments in varying granular-
ities, which was not addressed by (Metcalf & Leake, 2017;
Reynolds et al., 2007). (3) Moreover, not only did we study
the activations in the layers to gain insight into the model's
functioning, but we also compared the performance of our
model to that of the human observers in order to assess its
capabilities. We received a higher point-biserial correlation
score than the existing score in the literature (Franklin et al., 2020).
