Predictive Event Segmentation and Representation with Neural
Networks: A Self-Supervised Model Assessed by Psychological
Experiments
Hamit Basgol1, Inci Ayhan2, Emre Ugur3
Department of Computer Science1 at the University of Tübingen, Germany
Department of Psychology2, Department of Computer Engineering3 at Boğaziçi University, Turkey
People segment complex, ever-changing and continuous experience into basic, stable and dis-
crete spatio-temporal experience units, called events. Event segmentation literature investi-
gates the mechanisms that allow people to extract these units from the continuous experience.
Aiming to shed light on event segmentation ability, event segmentation theory points out that
people predict ongoing activities and observe prediction error signals in order to find event
boundaries that keep events apart. In this study, we investigated the mechanism giving rise to
this ability with a computational model and accompanying psychological experiments. Inspired
by the principles of event segmentation theory and predictive processing, we introduced a
semi-mechanistic model of event segmentation, learning, and representation. This model con-
sists of feed-forward neural networks that predict the sensory signal in the next time-step in
order to represent different events, and a cognitive model that regulates these neural networks
on the basis of their prediction errors. In order to verify the ability of our model in segmenting
experience into spatio-temporal units, learning them during passive observation, and represent-
ing them in its internal representational space, we prepared a video that depicts natural human
behaviors represented by point-light displays. We compared event segmentation behaviors of
human participants and our model with this video in two hierarchical event segmentation levels.
By using the point-biserial correlation technique, we demonstrated that the event segmentation
decisions of our model correlated with the responses of participants. Moreover, by approximating
the internal representation space of participants with a similarity-based technique, we showed that
our model formed an internal representation space similar to those of the participants. Our results
suggest that our model, which tracks the prediction error signals, can produce human-like event
segmentation decisions and event representations. Finally, we discussed our contribution to the
literature of event cognition and our understanding of how event segmentation is implemented
in the brain.
Keywords: event segmentation, point-light displays, predictive processing, self-supervision
1 Introduction
Humans segment the continuous information stream into
event units to show robust, adaptive, and intelligent behav-
ior, which is called event segmentation (Richmond & Za-
cks, 2017; Zacks, 2020; Zacks, Speer, Swallow, Braver, &
Reynolds, 2007; Zacks & Swallow, 2007). In recent
years, a growing number of computational models have been
proposed to capture how humans segment events in order to
1Hamit Basgol is a PhD Student in the Department of Com-
puter Science at the University of Tübingen, Tübingen, Germany
(e-mail: hamitbasgol@gmail.com). The study was conducted while
Hamit Basgol was a master's student in the Cognitive Science Department
at Bogazici University.
2Inci Ayhan is with Department of Psychology in Bogazici Uni-
versity, Istanbul, Turkey (e-mail: inci.ayhan@boun.edu.tr)
3Emre Ugur is with the Department of Computer Engineering in Bogazici University, Istanbul, Turkey (e-mail: emre.ugur@boun.edu.tr)
arXiv:2210.05710v1 [q-bio.NC] 4 Oct 2022
utilize their continuous experience (Franklin, Norman, Ran-
ganath, Zacks, & Gershman, 2020; Gumbsch, Kneissler, &
Butz, 2016; Gumbsch, Otte, & Butz, 2017; Metcalf & Leake,
2017; Reynolds, Zacks, & Braver, 2007). Despite their es-
sential contribution to the understanding of the event seg-
mentation ability, these models have demonstrated certain
limitations. Namely, they were not capable of segmenting
events in varying lengths (Metcalf & Leake, 2017; Reynolds
et al., 2007), they used datasets that involved abrupt tran-
sitions between naturalistic action sequences (Metcalf &
Leake, 2017; Reynolds et al., 2007), and they included robotic
models that did not aim to capture human event segmentation
decisions (Gumbsch et al., 2016, 2017). In fact, to the
best of our knowledge, there has been only one study, where
authors compared the performance of their model to the hu-
man event segmentation decisions (Franklin et al., 2020).
In this study, we aim to address these limitations with a
novel computational model, which we built upon three main
elements: (1) the event segmentation theory (Zacks et al.,
2007), (2) the predictive processing of Clark (2013); Wiese
and Metzinger (2017), and (3) the robotic model of Gumbsch
et al. (2017), the contributions of which will be highlighted
throughout the paper. In this study, we showed that a self-
supervised and semi-mechanistic model monitoring predic-
tion error signals could produce multimodal event segments
in varying lengths and store the knowledge of events in acti-
vations of neural networks. Moreover, we compared the seg-
mentation and representation results of our model with those
of humans to reveal their similarities and differences. We be-
lieve that our model presents a fruitful approach to modeling
event segmentation and integrating event knowledge into a
wide range of perceptual and cognitive processes.
The introduction of the paper is organized as follows:
Firstly, we introduce the event segmentation theory and the
importance of the prediction error signals for the event seg-
mentation. Secondly, we review the computational models
of event segmentation and identify their limitations. Finally,
we explain the methodology of our current study, its results
and conclusions.
1.1 Event segmentation theory and prediction error
Early studies of event segmentation were conducted by
Newtson (1973) using a unitization paradigm, where partici-
pants were asked to watch a movie and segment it into meaningful units. The results of Newtson's study demonstrated
substantial agreement across participants on the segmentation
locations, which happened to be persistent in time.
Subsequent research verified these findings and opened up
the possibilities of investigating the role of events in human
cognition (Zacks, 2020; Zacks & Swallow, 2007). The lo-
cations at which participants segment a continuous informa-
tion stream (e.g., a movie) are termed as event boundaries,
which are the positions in time that show perceptual changes
in spatial locations, movements, relative distances between
agents, or goals (Cutting, 2014; Cutting, Brunick, & Can-
dan, 2012; Hard, Recchia, & Tversky, 2011; Hard, Tversky,
& Lang, 2006; Hu, Meitz, & Papenmeier, 2014; Kurby &
Zacks, 2008; Newtson, Engquist, & Bois, 1977; Zacks, 2020;
Zacks et al., 2007; Zacks, Speer, Swallow, & Maley, 2010).
Events are known to be hierarchically structured (Zacks,
2020). People can detect the smallest (fine-grained) and the
largest (coarse-grained) events (Hard et al., 2011, 2006; Newtson,
1973; Zacks, 2020; Zacks et al., 2001; Zacks & Swallow,
2007) when they are instructed to do so. Research with
functional magnetic resonance neuroimaging (fMRI) sug-
gests that hierarchical segmentation is an automatic process
(Speer, Zacks, & Reynolds, 2007; Zacks et al., 2001) such
that while observing a movie or reading a story, the brain
selectively responds to the fine- and coarse-grained event
boundaries. Hard et al. (2011), for example, demonstrated
that changes at event boundaries are more numerous than at
other parts of an activity; moreover, they particularly peak at
coarse-grained boundaries. The strong relationship between
both types of change, namely the sensory (fine-grained) and
the conceptual (coarse-grained) change, suggests that events
are segmented based on the perceptual cycle formed by the
bottom-up processing of sensory features and the top-down
processing of conceptual knowledge (Neisser, 1976; Zacks,
2020; Zacks et al., 2007).
A computational model or a theory of event segmentation
should explain at least two basic properties of event segmen-
tation. The first one is how locations of event boundaries are
detected and the second one is how event segmentation oper-
ation is conducted in different hierarchies. Event segmentation theory (EST) proposes an account for both of these prop-
erties. According to the EST, people constantly make per-
ceptual predictions by event models in the working memory
(Reynolds et al., 2007; Zacks et al., 2007). The event bound-
ary is formed when the current event model cannot capture
the current situation, in other words, when the corresponding
prediction error signal follows a transient increase. In such
situations, the system triggers another event model to predict
the following sensory input. Thus, a strategy based on
monitoring the prediction error signals might correspond to
the basic mechanism behind the event segmentation ability.
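This boundary-detection strategy can be sketched as follows; a minimal illustration of thresholding a prediction-error trace, not the actual model used in this study (the error values and the threshold are invented):

```python
def detect_boundaries(errors, threshold):
    """Return indices where the prediction error first exceeds `threshold`
    after having been below it (a transient increase)."""
    boundaries = []
    below = True
    for t, e in enumerate(errors):
        if below and e > threshold:
            boundaries.append(t)
            below = False
        elif e <= threshold:
            below = True
    return boundaries

# Toy error trace: two error spikes mark two event boundaries.
errors = [0.1, 0.1, 0.2, 0.9, 0.3, 0.1, 0.1, 0.8, 0.2, 0.1]
print(detect_boundaries(errors, threshold=0.5))  # -> [3, 7]
```

A transient rise above the threshold marks a boundary; the flag resets once the error falls back, so a sustained plateau is counted only once.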
Indeed, the EST and the role of prediction error signals in the
event segmentation were supported by many studies (Eisen-
berg, Zacks, & Flores, 2018; Franklin et al., 2020; Gumbsch
et al., 2016, 2017; Hard et al., 2011; Reynolds et al., 2007;
Stawarczyk, Bezdek, & Zacks, 2021; Zacks, Kurby, Eisen-
berg, & Haroutunian, 2011), despite exceptions (O’Reilly,
2013; Shin & DuBrow, 2021). Along with its focus on the
prediction error signals for event boundary detection, EST
also suggests that people might make predictions by events
in multiple timescales simultaneously and sensitivity differences between events to incoming prediction error signals
might determine their lengths or positions in the hierarchy.
For example, an event model might be sensitive to minor pre-
diction errors compared to another (Zacks & Swallow, 2007),
and this sensitivity dierence might make the former shorter
than the latter.
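This sensitivity idea can be illustrated by applying two thresholds, values invented for illustration, to the same prediction-error trace: the more sensitive (lower) threshold yields more, and therefore shorter, events:

```python
def boundaries(errors, threshold):
    """Indices where the prediction error exceeds a model's tolerance."""
    return [t for t, e in enumerate(errors) if e > threshold]

errors = [0.1, 0.4, 0.1, 0.9, 0.2, 0.5, 0.1, 0.95, 0.3]
fine = boundaries(errors, threshold=0.35)    # sensitive model: fine-grained
coarse = boundaries(errors, threshold=0.8)   # tolerant model: coarse-grained
print(fine)    # -> [1, 3, 5, 7]
print(coarse)  # -> [3, 7]
```

Every coarse-grained boundary is also a fine-grained boundary here, which mirrors the hierarchical nesting of events.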
Due to this mechanism, computational models in the literature have been mostly inspired by the EST (Franklin
et al., 2020; Gumbsch et al., 2016, 2017; Metcalf & Leake,
2017; Reynolds et al., 2007). All these models, on the other
hand, come with certain limitations, which will be the
topic of the next sub-section.
1.2 Computational models of event segmentation
Several important computational models have been pro-
posed in the literature with different limitations (Franklin et
al., 2020; Gumbsch et al., 2016, 2017; Reynolds et al., 2007).
For example, Reynolds et al. (2007) utilized a set of sequence
models for the segmentation of human behaviors. Despite
the success of the model in detecting event boundaries, hier-
archical segmentation of events in varying granularities was
not addressed. At the same time, behavioral sequences that
were used for training the model involved abrupt and unnat-
ural transitions. Metcalf and Leake (2017) enhanced this
model with a reinforcement learning agent. Although these
two models suggest that monitoring prediction error signals
is an effective strategy for event segmentation, they did
not address hierarchical segmentation of events.
Gumbsch et al. (2016, 2017) developed a robotic model
that chunks sensory-motor information flow into parts. The
model represents events by linear models, which encode dif-
ferent sensory dimensions and predict sensory signal in the
next time-step. The linear models are regulated by a cog-
nitive model at a higher level based on the prediction er-
rors of the lower-level linear models. From this perspective, whereas the cognitive model resembles the mechanism
proposed by the EST, the linear models correspond to the
working memory representations. As has been addressed in
Gumbsch, Butz, and Martius (2019), however, since linear
models encoding sensory dimensions are disconnected from
one another, Gumbsch et al.'s model is not capable of discovering multi-modal associations between sensory modalities
in a particular event structure. Besides this limitation, these
models are robotic models that assume the involvement of
an active agent. However, the event segmentation literature
is based largely on the unitization paradigm in which partici-
pants observe events passively and press a button to separate
them from one another (Hard & Tversky, 2003; Hard et al.,
2006; Newtson, 1973; Newtson & Engquist, 1976; Newtson
et al., 1977; Zacks, 2020; Zacks et al., 2001, 2007).
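A toy sketch of such dimension-wise linear prediction (our illustration, not Gumbsch et al.'s implementation): each sensory dimension gets its own independent linear predictor of the next value, which is precisely why such models cannot capture associations between modalities.

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = a * x + b for one sensory dimension."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

signal = [2.0, 4.0, 6.0, 8.0]             # one sensory dimension over time
a, b = fit_linear([0, 1, 2], signal[:3])  # fit on the observed prefix
prediction = a * 3 + b                    # predict the value at t = 3
error = abs(prediction - signal[3])       # prediction error for that step
print(prediction, error)  # -> 8.0 0.0
```

A cognitive model sitting above such predictors would compare `error` against its tolerance, as in the EST sketch above, and switch linear models when the error spikes.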
Lastly, Franklin et al. (2020) developed an inclusive
model of event cognition, which considers various domains
such as event memorization, segmentation, retrieval, and in-
ference. To the best of our knowledge, this is the only
study that used naturalistic videos and considered the human-
level event segmentation performance for the model valida-
tion, even though the reported correlation between the performances of the model and the ground-truth data is open to improvement.
Overall, the computational models of event segmentation
suggest that the EST presents a plausible mechanism for the
event segmentation task. To address the missing points
in the literature and to test the mechanism suggested by the
EST for more than one granularity level, we developed a
novel computational model for event segmentation. In addition to event segmentation capability, our model could form
event representations.
1.3 Event representations
Representations are mental objects with semantic proper-
ties (Pitt, 2020). To express the strength of the relationships
between represented entities, a representational space can be
formed by taking the pairwise distances between all representations
(Shepard, 1980, 1987; Shepard & Arabie, 1979). This
correspondence between similarity and distance makes similarity
a valuable metric for revealing how a system organizes knowledge,
since representations form the basis of categorization and
generalization. One aim of artificial intelligence is to learn valuable and representative information
from the data (Bengio, Courville, & Vincent, 2014). Multi-
layer perceptrons (i.e., deep neural networks) can learn dis-
tributed and semantically meaningful representations (Ben-
gio et al., 2014; Urban & Gates, 2021). The similarity be-
tween representations (i.e., semantic relationships between
represented entities) of a deep learning model can be found
by the Euclidean distance or cosine similarity. For exam-
ple, the semantic relationship between words and sentences
(Mikolov, Sutskever, Chen, Corrado, & Dean, 2013; Rogers
& McClelland, 2005), objects (Deselaers & Ferrari, 2011),
scenes (Eslami et al., 2018), and episodes (Rothfuss, Fer-
reira, Aksoy, Zhou, & Asfour, 2018) can be captured with
the help of representations learned by a deep learning system.
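As an illustrative sketch, pairwise cosine similarity between hypothetical learned event representations (all vectors below are invented, not outputs of any trained model):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two representation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical 3-d activations for three events:
walk = [1.0, 0.9, 0.1]
run = [0.9, 1.0, 0.2]
sit = [0.1, 0.2, 1.0]

# The two locomotion events lie closer together than either does to sitting.
print(cosine_similarity(walk, run) > cosine_similarity(walk, sit))  # -> True
```

Computing this for every pair of representations yields the similarity matrix that defines a representational space.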
Since representations give researchers insight into how humans organize knowledge, generalize between instances, and
make analogical transfers (Blough, 2001; Nosofsky, 1992;
Shepard, 1980, 1987; Tversky, 1977), they have a fundamen-
tal place in cognitive science. As could be expected, researchers have exploited human similarity judgments to recover
human mental representations (Shepard, 1980, 1987; Shep-
ard & Arabie, 1979). The role of representations and similar-
ity judgments in artificial intelligence and cognitive science
suggests that they might provide a basis for comparing people
and machines. In fact, recent research provides excellent ex-
amples of this comparison (Hebart, Zheng, Pereira, & Baker,
2020; Peterson, Abbott, & Griffiths, 2018).
The event representation literature is very rich and encompasses
a diverse set of studies (Blom, Feuerriegel, Johnson, Bode,
& Hogendoorn, 2020; Day & Bartels, 2008; Fivush, Kuebli,
& Clubb, 1992; Kominsky, Baker, Keil, & Strickland, 2021;
Schütz-Bosbach & Prinz, 2007; Sheldon & El-Asmar, 2018;
Wang, Cherkassky, & Just, 2017). In the context of computa-
tional modeling, recent studies use (Shen, Fu, Deng, & Ino,
2020) and learn (Dias & Dimiccoli, 2018) event representa-
tions. In contrast, despite the interest received by event representations, event similarity judgments remain a relatively
unexplored area, subsumed under action similarity judgments (Tarhan, de Freitas,
Alvarez, & Konkle, 2020; Tarhan & Konkle, 2018). In our
work, utilizing this possibility, we compare the event repre-
sentations of our computational model and participants by
exploiting event similarity judgments.
1.4 Our contribution
In this study, inspired by the EST (Zacks et al., 2007),
predictive processing (Clark, 2013; Wiese & Metzinger,
2017), and Gumbsch's robotic model (Gumbsch et al.,
2016, 2017), we developed a novel computational model for
event segmentation. Our model consists of multi-layer perceptrons (i.e., event models) that are managed by a cognitive
mechanism which, consequently, determines the event boundaries. As our contribution to the literature, (1) our model
is capable of learning to represent and predict multi-modal
event segments with sensory associations in passive observa-
tion unlike the models developed by (Gumbsch et al., 2016,
2017) which segment unimodal events based on actions in
a simulation environment. (2) With the help of a parameter that changes the sensitivities of event models to prediction error
signals, our model can also segment events in varying gran-
ularities, which was not addressed by Reynolds et al. (2007)
and Metcalf and Leake (2017). (3) Moreover, segmentation
and representation capabilities of our model were tested by
ground-truth data received from psychological experiments.
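The point-biserial correlation used for this comparison relates a binary variable (the model's boundary decisions) to a continuous one (participants' segmentation rates); a sketch with invented data, implemented by hand so no external libraries are needed:

```python
import math

def point_biserial(binary, continuous):
    """Point-biserial correlation between a 0/1 variable and a continuous one."""
    n = len(binary)
    g1 = [y for b, y in zip(binary, continuous) if b == 1]
    g0 = [y for b, y in zip(binary, continuous) if b == 0]
    m1, m0 = sum(g1) / len(g1), sum(g0) / len(g0)
    mean = sum(continuous) / n
    sd = math.sqrt(sum((y - mean) ** 2 for y in continuous) / n)  # population SD
    return (m1 - m0) / sd * math.sqrt(len(g1) * len(g0) / n ** 2)

# Hypothetical data: model boundary decisions per time step vs. the
# proportion of participants pressing the segmentation button there.
model_boundaries = [0, 0, 1, 0, 0, 1, 0, 0]
human_rates = [0.05, 0.10, 0.80, 0.15, 0.05, 0.70, 0.10, 0.05]
r = point_biserial(model_boundaries, human_rates)
print(round(r, 3))
```

A value near 1 means the model's boundaries coincide with the moments where human segmentation responses peak.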
A multi-layer perceptron is a plain deep neural network
which consists of an input layer, an intermediate (hidden)
layer or layers, and an output layer. The network learns
the relationship between inputs and outputs by updating
weights in each iteration. Thanks to the hidden units, multi-
layer perceptrons can classify complex patterns (Lippmann,
1989) by approximating non-linear functions (Hornik, Stinchcombe, & White, 1989). Moreover, the knowledge devel-
oped throughout the training is stored in weights and what is
learned by the model can be explored by analyzing the rep-
resentations of the network (Fleming & Storrs, 2019; Hebart
EVENT SEGMENTATION 5
et al., 2020; Peterson et al., 2018). The use of deep neu-
ral networks in cognitive science and artificial intelligence
has a long history and had an important role in the emer-
gence of the connectionist framework (Rumelhart, Hinton, &
Williams, 1986). The effect of the connectionist framework, in
other words deep learning models, on cognitive science still
persists and leads to revolutionary results in a wide variety
of domains such as perception (Fleming & Storrs, 2019; He,
Zhang, Ren, & Sun, 2016; Krizhevsky, Sutskever, & Hin-
ton, 2012; Russakovsky et al., 2015; Spoerer, McClure, &
Kriegeskorte, 2017), linguistics (Floridi & Chiriatti, 2020;
Radford, Jozefowicz, & Sutskever, 2017; Wu et al., 2016),
developmental psychology (Orhan, Gupta, & Lake, 2020),
and cognitive neuroscience (Khaligh-Razavi & Kriegeskorte,
2014; Tripp, 2017; Yamins & DiCarlo, 2016).
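Concretely, a multi-layer perceptron of the kind just described can serve as a one-step predictor of a sensory stream; a minimal pure-Python sketch with illustrative sizes, data, and learning rate (not the architecture used in this study):

```python
import math
import random

random.seed(0)
H = 8                                            # hidden units
w1 = [random.uniform(-0.5, 0.5) for _ in range(H)]
b1 = [0.0] * H
w2 = [random.uniform(-0.5, 0.5) for _ in range(H)]
b2 = 0.0

def forward(x):
    """One hidden tanh layer mapping x_t to a prediction of x_{t+1}."""
    h = [math.tanh(w1[j] * x + b1[j]) for j in range(H)]
    return sum(w2[j] * h[j] for j in range(H)) + b2, h

# A deterministic 1-d stream (logistic map), so x_{t+1} is a function of x_t.
signal = [0.2]
for _ in range(59):
    signal.append(3.7 * signal[-1] * (1.0 - signal[-1]))
pairs = list(zip(signal[:-1], signal[1:]))       # (input, self-supervised target)

def mse():
    return sum((forward(x)[0] - y) ** 2 for x, y in pairs) / len(pairs)

before = mse()
lr = 0.02
for _ in range(300):                             # plain SGD on the squared error
    for x, y in pairs:
        pred, h = forward(x)
        d = 2.0 * (pred - y)
        for j in range(H):
            grad_h = d * w2[j] * (1.0 - h[j] ** 2)
            w2[j] -= lr * d * h[j]
            w1[j] -= lr * grad_h * x
            b1[j] -= lr * grad_h
        b2 -= lr * d
after = mse()
print(round(before, 4), round(after, 4))
```

Training reduces the mean squared prediction error on this toy stream; it is exactly this error signal that a higher-level cognitive model could monitor for event boundaries.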
We used multi-layer perceptrons, members of the deep
neural network family, to represent events. A deep neural network
model can be trained in several ways: supervised, unsupervised, and self-supervised. In supervised learning,
models receive outputs (e.g., categories for object identifi-
cation) of inputs (e.g., images) from huge labelled datasets
(Krizhevsky et al., 2012; Russakovsky et al., 2015). Despite
its success in a range of domains such as object identifica-
tion (Krizhevsky et al., 2012; Russakovsky et al., 2015), su-
pervised learning is criticized for being inconsistent with how
humans actually learn. Humans learn new concepts and abilities
with little supervision, without the requirement of hand-crafted
labels (Vinyals, Blundell, Lillicrap, Kavukcuoglu, &
Wierstra, 2016). The dependency of supervised learning on
labels has led researchers to investigate other learning possibilities
such as unsupervised and self-supervised learning that
do not require explicit labels. In unsupervised learning,
the network learns how to represent data efficiently without
relying on labels, by capturing the high-order statistics
of the dataset (Fleming & Storrs, 2019). In self-supervised
learning, on the other hand, labels are substituted
by information in the input data: rather
than mapping inputs to hand-crafted labels, the network
learns to predict selected parts of the data (e.g., predicting
the next sequence of a video or a certain part of an image)
and generates representations in this way (Liu et al., 2021).
In the context of event segmentation, using a supervised
model receiving event boundary locations from a supervi-
sor or instructor would not be natural because infants mostly
learn new abilities without supervision. Therefore, our
model is self-supervised, as it learns to make predictions within a
segment and detects event boundaries from the data without
the need for human-crafted labels (Liu et al., 2021).
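The contrast can be made concrete with toy data: supervised learning needs hand-crafted labels, whereas self-supervised learning mines its targets from the stream itself.

```python
stream = [3, 1, 4, 1, 5, 9]  # any sensory sequence

# Supervised: labels must be supplied by a human annotator.
supervised = list(zip(stream, ["a", "b", "a", "b", "c", "c"]))

# Self-supervised: the target of each input is simply the next observation.
self_supervised = list(zip(stream[:-1], stream[1:]))
print(self_supervised)  # -> [(3, 1), (1, 4), (4, 1), (1, 5), (5, 9)]
```

No annotation effort is needed for the second dataset; the stream labels itself.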
Despite the contribution of deep neural networks to the
progress in a wide range of areas from linguistics to neu-
roscience (Fleming & Storrs, 2019; Floridi & Chiriatti,
2020; He et al., 2016; Khaligh-Razavi & Kriegeskorte, 2014;
Krizhevsky et al., 2012; Orhan et al., 2020; Radford et al.,
2017; Russakovsky et al., 2015; Spoerer et al., 2017; Tripp,
2017; Wu et al., 2016; Yamins & DiCarlo, 2016), they have
an important limitation, namely explainability. Deep neu-
ral networks are black-box models whose internal functioning is not
explicit, which is a crucial shortcoming especially in domains
where understanding the decisions of networks is critical
(e.g., medical decision making) or in scientific practice
(e.g., why the network decides this way and what this decision
says about the problem in question). Even though it uses
multi-layer perceptrons for representing the event segments,
our model is not an end-to-end black box model (Rudin &
Radin, 2019); rather, it involves an easily understandable
and trackable white-box model that regulates multi-layer
perceptrons. From this perspective, our model is a semi-
mechanistic model, incorporating the capabilities of white-
box and black-box models, giving researchers a chance to
benefit from the power of neural networks without fully sac-
rificing the explainability.
Our proposed model is both self-supervised and semi-
mechanistic. By tracking the prediction error signals, (1)
the model produces multimodal event segments in varying
hierarchies via passive observation with the help of multi-
layer perceptrons, unlike the models developed by Gumbsch
et al. (2016, 2017). (2) With the help of a parameter that changes
the sensitivities of event models to prediction error signals,
our model can produce event segments in varying granular-
ities, which was not addressed by (Metcalf & Leake, 2017;
Reynolds et al., 2007). (3) Moreover, not only did we study
the activations in the layers to gain insight into the model's
functioning, but we also compared the performance of our
model to that of the human observers in order to assess its
capabilities. We received a higher point-biserial correlation
score than the existing score in the literature (Franklin et al., 2020).
