Topological Continual Learning
with Wasserstein Distance and Barycenter
Tananun Songdechakraiwut1, Xiaoshuang Yin2, and Barry D. Van Veen1
1University of Wisconsin–Madison
2Google
Abstract
Continual learning in neural networks suffers from a phenomenon called
catastrophic forgetting, in which a network quickly forgets what was learned in
a previous task. The human brain, however, is able to continually learn new
tasks and accumulate knowledge throughout life. Neuroscience findings suggest
that continual learning success in the human brain is potentially associated with
its modular structure and memory consolidation mechanisms. In this paper
we propose a novel topological regularization that penalizes cycle structure
in a neural network during training using principled theory from persistent
homology and optimal transport. The penalty encourages the network to learn
modular structure during training. The penalization is based on the closed-form
expressions of the Wasserstein distance and barycenter for the topological
features of a 1-skeleton representation for the network. Our topological continual
learning method combines the proposed regularization with a tiny episodic
memory to mitigate forgetting. We demonstrate that our method is effective in
both shallow and deep network architectures for multiple image classification
datasets.
1 Introduction
Neural networks can be trained to achieve impressive performance on a variety of
learning tasks. However, when an already trained network is further trained on a
new task, a phenomenon called catastrophic forgetting [McCloskey and Cohen, 1989]
occurs, in which previously learned tasks are quickly forgotten with additional training.
The human brain, however, is able to continually learn new tasks and accumulate
knowledge throughout life without significant loss of previously learned skills. The
biological mechanisms behind this trait are not yet fully understood. Neuroscience
findings suggest that the principle of modularity [Hart and Giszter, 2010] may play
an important role. Modular structures [Simon, 1962] are aggregates of modules
(components) that perform specific functions without perturbing one another. Human
brains have been characterized by modular structures during learning [Finc et al.,
2020]. Such structures are hypothesized to reduce the interdependence of components,
enhance robustness, and facilitate learning [Bassett et al., 2011]. Another important
learning mechanism is the hippocampal and neocortical interaction responsible for
memory consolidation [McGaugh, 2000]. In particular, the hippocampal system
encodes recent experiences that later are replayed multiple times before consolidation
as episodic memory in the neocortex [Klinzing et al., 2019]. The interplay between
the modularity principle and memory consolidation, among other mechanisms, is
potentially associated with continual learning success.
Persistent homology [Barannikov, 1994, Edelsbrunner et al., 2000, Wasserman,
2018] has emerged as a tool for understanding, characterizing and quantifying the
topology of brain networks. For example, it has been used to evaluate biomarkers
of the neural basis of consciousness [Songdechakraiwut et al., 2022] and the impact
of twin genetics [Songdechakraiwut et al., 2021] by interpreting brain networks as
1-skeletons [Munkres, 1996] of a simplicial complex. The topology of a 1-skeleton
is completely characterized by connected components and cycles [Songdechakraiwut
et al., 2022]. Brain networks naturally organize into modules or connected components
[Bullmore and Sporns, 2009, Honey et al., 2007], while cycle structure is ubiquitous
and is often interpreted in terms of information propagation, redundancy and feedback
loops [Kwon and Cho, 2007, Ozbudak et al., 2005, Venkatesh et al., 2004, Weiner
et al., 2002].
In this paper we show that persistent homology can be used to improve performance
of neural networks in continual learning tasks. In particular, we interpret a network
as a 1-skeleton and propose a novel topological regularization of the 1-skeleton's cycle
structure to avoid catastrophic forgetting of previously learned tasks. Regularizing the
cycle structure allows the network to explicitly learn its complement, i.e., the modular
structure, through gradient optimization. Our approach is made computationally
efficient by use of the closed-form expressions for the Wasserstein barycenter and
the gradient of the Wasserstein distance between network cycle structures presented in
[Songdechakraiwut et al., 2021, 2022]. Figure 1 illustrates that our proposed approach
employs topological regularization with a tiny episodic memory to mitigate forgetting
and facilitate learning. We evaluate our approach using image classification across
multiple data sets and show that it generally improves classification performance
compared to competing approaches in the challenging case of both shallow and deep
networks of limited width when faced with many learning tasks.

Figure 1: Schematic illustrating our topological continual learning approach. A tiny
episodic memory replays past examples from previously learned tasks. The cycle
topology of a subset of layers (shaded) is regularized based on the Wasserstein distance
between the network and the barycenter of previously learned networks to improve the
generalization of the learning and memory consolidation processes.
The paper is organized as follows. Efficient computation of persistent homology for
neural networks is given in Section 2, and Section 3 presents our topology-regularized
continual learning strategy. In Section 4, we compare our methods to multiple
baseline algorithms using four image classification datasets. Section 5 provides a brief
discussion of the relationship between the proposed approach and previously reported
methods for continual learning.
2 Efficient Computation of Topology for Network Graphs
2.1 Graph Filtration
Represent a neural network as an undirected weighted graph $G = (V, W)$ with a set
of nodes $V$ and a set of edge weights $W = \{w_{i,j}\}$. The number of nodes and weights
are denoted by $|V|$ and $|W|$, respectively. Create a binary graph $G_\epsilon$ with the identical
node set $V$ by thresholding the edge weights so that an edge between nodes $i$ and $j$
exists if $w_{i,j} > \epsilon$. The binary graph is a simplicial complex consisting of only nodes
and edges known as a 1-skeleton [Munkres, 1996]. As $\epsilon$ increases, more and more
edges are removed from the network $G$, resulting in a nested sequence of 1-skeletons:

$$G_{\epsilon_0} \supseteq G_{\epsilon_1} \supseteq \cdots \supseteq G_{\epsilon_k}, \qquad (1)$$

where $\epsilon_0 \le \epsilon_1 \le \cdots \le \epsilon_k$ are called filtration values. This sequence of 1-skeletons is
called a graph filtration [Lee et al., 2012]. Figure 2 illustrates the graph filtration of a
toy neural network.

Figure 2: Illustration of graph filtration. A toy neural network $G$ showing its
maximum spanning tree (MST) as edges denoted by dark lines and a subnetwork
with non-MST edges shown as dashed lines. As the filtration value increases, the
number of connected components $\beta_0$ monotonically increases while the number of
cycles $\beta_1$ monotonically decreases. Connected components are born at the MST edge
weights $e_2, e_4, e_5, e_6$ while cycles die at the non-MST edge weights $e_1, e_3$.
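To make the filtration concrete, the sketch below (not the authors' implementation; it assumes the networkx library and a toy weighted graph of our own choosing) thresholds the edge weights at a few filtration values and counts the Betti numbers of each resulting 1-skeleton, using $\beta_0$ = number of connected components and $\beta_1 = |E| - |V| + \beta_0$ for a graph:

```python
# Minimal sketch of the graph filtration in Section 2.1 (not the authors' code).
# For each filtration value eps, keep only edges with w_ij > eps and count the
# Betti numbers of the resulting 1-skeleton:
#   beta0 = number of connected components, beta1 = |E| - |V| + beta0.
import networkx as nx

def betti_numbers(graph):
    """Betti numbers (beta0, beta1) of a 1-skeleton, i.e., an ordinary graph."""
    beta0 = nx.number_connected_components(graph)
    beta1 = graph.number_of_edges() - graph.number_of_nodes() + beta0
    return beta0, beta1

def graph_filtration(G, filtration_values):
    """Betti numbers of the thresholded graph G_eps at each filtration value."""
    results = []
    for eps in filtration_values:
        G_eps = nx.Graph()
        G_eps.add_nodes_from(G.nodes)  # identical node set V
        G_eps.add_edges_from((i, j) for i, j, w in G.edges(data="weight") if w > eps)
        results.append(betti_numbers(G_eps))
    return results

# Toy weighted network (weights chosen for illustration only).
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 0.9), (1, 2, 0.7), (0, 2, 0.3), (2, 3, 0.5)])
print(graph_filtration(G, [0.0, 0.4, 0.8]))
# [(1, 1), (1, 0), (3, 0)]: beta0 increases and beta1 decreases as eps grows.
```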
2.2 Birth and Death Decomposition
The only non-trivial topological features in a 1-skeleton are connected components
(0-dimensional topological features) and cycles (1-dimensional topological features).
There are no higher-dimensional topological features in the 1-skeleton, in contrast to
clique complexes [Otter et al., 2017] and Rips complexes [Ghrist, 2008, Zomorodian,
2010]. Persistent homology keeps track of the birth and death of topological features
over filtration values $\epsilon$. If a topological feature is born at a filtration value $b_l$ and
persists up to a filtration value $d_l$, then this feature is represented as a two-dimensional
persistence point $(b_l, d_l)$ in a plane. The set of all points $\{(b_l, d_l)\}_l$ is called a persistence
diagram [Edelsbrunner and Harer, 2008] or, equivalently, a persistence barcode [Ghrist,
2008]. The use of the 1-skeleton simplifies the persistence barcodes to one-dimensional
descriptors [Songdechakraiwut et al., 2021, 2022]. Specifically, the representation of
the connected components can be simplified to a collection of birth values $B(G) = \{b_l\}$
and that of cycles to a collection of death values $D(G) = \{d_l\}$. In addition, neural
networks of the same architecture have a birth set $B$ and a death set $D$ of the same
cardinality, namely $|V| - 1$ and $|W| - (|V| - 1)$, respectively. This result completely resolves
the problem of point mismatch in persistence barcodes for same-architecture neural
networks. The example network of Fig. 2 has $W = \{e_i\}_{i=1}^{6}$, $B(G) = \{e_2, e_4, e_5, e_6\}$
and $D(G) = \{e_1, e_3\}$. $B(G)$ and $D(G)$ can be identified very efficiently by computing
the maximum spanning tree (MST) of the network [Lee et al., 2012] in $O(n \log n)$
operations, where $n$ is the number of edges in the network. Supplementary description
of the decomposition is provided in Appendix A.
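A possible implementation of this decomposition, again a sketch assuming networkx and illustrative edge weights rather than the authors' code, computes the maximum spanning tree and splits the edge weights into the birth set (MST edges) and the death set (non-MST edges):

```python
# Minimal sketch of the birth/death decomposition in Section 2.2 (not the
# authors' code). Birth set B(G): weights of the maximum spanning tree (MST)
# edges; death set D(G): the remaining non-MST edge weights, so that
# |B(G)| = |V| - 1 and |D(G)| = |W| - (|V| - 1).
import networkx as nx

def birth_death_decomposition(G):
    """Return (B(G), D(G)) as sorted lists of edge weights; assumes distinct weights."""
    mst = set(frozenset((i, j)) for i, j, _ in
              nx.maximum_spanning_edges(G, algorithm="kruskal", data=True))
    births, deaths = [], []
    for i, j, w in G.edges(data="weight"):
        (births if frozenset((i, j)) in mst else deaths).append(w)
    return sorted(births), sorted(deaths)

# Toy graph with 5 nodes and 6 edges (illustrative weights, not those of Fig. 2):
# 4 MST edges give the births, the 2 remaining edges close cycles and give deaths.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 0.6), (1, 2, 0.5), (2, 3, 0.4),
                           (3, 4, 0.3), (0, 2, 0.2), (2, 4, 0.1)])
B, D = birth_death_decomposition(G)
print(B)  # [0.3, 0.4, 0.5, 0.6]
print(D)  # [0.1, 0.2]
```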
2.3 Closed Form Wasserstein Distance and Gradient
The birth set of connected components and the death set of cycles are theoretically
shown to completely characterize the topology of a 1-skeleton network representation
as described in Section 2.2. The Wasserstein distance between 1-skeleton network
representations has a closed-form expression [Songdechakraiwut et al., 2022]. Here
we only consider the Wasserstein distance for cycle structure, which depends solely
on the death sets. Let $G, H$ be two given networks based on the same architecture.
Their (squared) 2-Wasserstein distance for cycles is defined as the optimal matching
cost between $D(G)$ and $D(H)$:

$$W^2_{\text{cycle}}(G, H) = \min_{\phi} \sum_{d_l \in D(G)} \big(d_l - \phi(d_l)\big)^2, \qquad (2)$$

where $\phi$ is a bijection from $D(G)$ to $D(H)$. The Wasserstein distance given in
(2) has a closed-form expression that allows for very efficient computation as follows
[Songdechakraiwut et al., 2022]:

$$W^2_{\text{cycle}}(G, H) = \sum_{d_l \in D(G)} \big(d_l - \phi^*(d_l)\big)^2, \qquad (3)$$

where the optimal matching $\phi^*$ maps the $l$-th smallest death value in $D(G)$ to the $l$-th smallest death
value in $D(H)$ for all $l$.
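Because the optimal bijection simply matches order statistics, the closed form reduces to sorting both death sets and summing squared differences. The following is a minimal NumPy sketch (our own, not the authors' code; the function name is hypothetical):

```python
# Minimal sketch of the closed-form squared 2-Wasserstein distance for cycles
# in Eq. (3) (not the authors' code): sort both death sets and sum squared
# differences between matched order statistics.
import numpy as np

def wasserstein_cycle_sq(deaths_G, deaths_H):
    dG = np.sort(np.asarray(deaths_G, dtype=float))
    dH = np.sort(np.asarray(deaths_H, dtype=float))
    assert dG.shape == dH.shape, "same-architecture networks have equal-size death sets"
    return float(np.sum((dG - dH) ** 2))

print(wasserstein_cycle_sq([0.1, 0.2], [0.15, 0.05]))  # (0.1-0.05)^2 + (0.2-0.15)^2 ≈ 0.005
```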
In addition, the gradient of the Wasserstein distance for cycles, $\nabla_G W^2_{\text{cycle}}(G, H)$,
with respect to edge weights $w_{i,j} \in W$ is given as a gradient matrix whose $(i, j)$-th entry
is [Songdechakraiwut et al., 2022]

$$\frac{\partial W^2_{\text{cycle}}(G, H)}{\partial w_{i,j}} =
\begin{cases}
0 & \text{if } w_{i,j} \in B(G); \\
2\big(w_{i,j} - \phi^*(w_{i,j})\big) & \text{if } w_{i,j} \in D(G).
\end{cases} \qquad (4)$$
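The gradient in (4) can likewise be assembled directly from the matched, sorted death sets. The sketch below is an illustrative NumPy version (function and argument names are our own; it assumes distinct death values and a dense adjacency layout rather than whatever representation the authors use):

```python
# Minimal sketch of the gradient in Eq. (4) (not the authors' code). For each
# edge weight w_ij: the partial derivative is 0 when w_ij is a birth value
# (MST edge) and 2 * (w_ij - phi*(w_ij)) when w_ij is a death value, where
# phi* pairs the l-th smallest death value of G with the l-th smallest of H.
# Assumes distinct death values and returns a dense |V| x |V| matrix.
import numpy as np

def wasserstein_cycle_grad(edges, deaths_G, deaths_H, num_nodes):
    """edges: iterable of (i, j, w_ij); deaths_G/H: death sets of equal size."""
    order = np.argsort(deaths_G)                     # indices of G's deaths, ascending
    target = np.sort(np.asarray(deaths_H, float))    # H's deaths, ascending
    phi_star = {deaths_G[idx]: target[rank] for rank, idx in enumerate(order)}
    grad = np.zeros((num_nodes, num_nodes))
    for i, j, w in edges:
        if w in phi_star:                            # death value: non-MST edge of G
            grad[i, j] = grad[j, i] = 2.0 * (w - phi_star[w])
        # birth values (MST edges) contribute zero gradient
    return grad

# Reusing the toy death set from the decomposition sketch above, D(G) = [0.1, 0.2]:
edges = [(0, 2, 0.2), (2, 4, 0.1)]                   # only the non-MST edges, for brevity
print(wasserstein_cycle_grad(edges, [0.1, 0.2], [0.15, 0.05], num_nodes=5))
```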