DEEPREINFORCEMENT LEARNING FOR STABILIZATION OFLARGE -SCALE PROBABILISTIC BOOLEAN NETWORKS A P REPRINT

2025-05-06 1 0 1.19MB 20 页 10玖币

侵权投诉

DEEP REINFORCEMENT LEARNING FOR STABILIZATION

OF LARGE-SCALE PROBABILISTIC BOOLEAN NETWORKS

A PREPRINT

Sotiris Moschoyiannis, Evangelos Chatzaroulas, Vytenis Sliogeris, Yuhu Wu ∗

October 26, 2022

ABSTRACT

The ability to direct a Probabilistic Boolean Network (PBN) to a desired state is important to ap-

plications such as targeted therapeutics in cancer biology. Reinforcement Learning (RL) has been

proposed as a framework that solves a discrete-time optimal control problem cast as a Markov Deci-

sion Process. We focus on an integrative framework powered by a model-free deep RL method that

can address different ﬂavours of the control problem (e.g., with or without control inputs; attractor

state or a subset of the state space as the target domain). The method is agnostic to the distribution

of probabilities for the next state, hence it does not use the probability transition matrix. The time

complexity is only linear on the time steps, or interactions between the agent (deep RL) and the envi-

ronment (PBN), during training. Indeed, we explore the scalability of the deep RL approach to (set)

stabilization of large-scale PBNs and demonstrate successful control on large networks, including a

metastatic melanoma PBN with 200 nodes.

1 Introduction

Recent efforts to produce tools to effectively control the dynamics of complex networked systems draw from control

theory, numerical methods and, more recently, network science and machine learning. A dynamical system is control-

lable if by intervening on the state of individual nodes the system as a whole can be driven from any initial state to a

desirable state, within ﬁnite time [1]. This notion of control ﬁnds application in engineered systems but also biological

networks, and is often referred to as stabilization.

Probabilistic Boolean Networks (PBNs) were introduced in [2] for modelling Gene Regulatory Networks (GRNs) as

complex dynamical systems. Nodes represent genes in one of two possible states of activity; 0 (not expressed), 1

(expressed). Edges indicate that genes act on each other, by means of rules represented by Boolean functions. The

state of the network at each time step comprises the states of the individual nodes. PBNs extend Kauffman’s Boolean

Networks (BNs) [3] by associating more than one function with each node, one of which executes at each time step

with a certain probability. This stochasticity in the network model accounts for uncertainty in gene interaction, which

is inherent not only in data collection but also in cell function. Both PBNs and BNs have been extensively used to

model well-known regulatory networks, see seminal work in [4]. The dynamics of PBNs adhere to Markov chain

theory [5] and dictate that the network, from any initial state, will eventually settle down to one of a limited set of

steady states, the so-called attractors (ﬁxed points or cyclic), that it cannot leave without intervention [6].

Genes in biological systems experience sudden emergence of ordered collective behaviour, which corresponds to

steady state behaviour in PBNs and can be modelled via attractor theory [7], e.g., the emergence of heterogeneous small

cell lung cancer phenotypes [8]. Therapeutic strategies may involve switching between attractors, e.g., proliferation,

apoptosis in cancerous cells [9], or directing the network to a (set of) steady states that are more desirable than others,

e.g., exhibit lower levels of resistance to antibiotics [10].

∗Sotiris Moschoyiannis, Evangelos Chatzaroulas, and Vytenis Sliogeris are with the School of Computer Science & Electronic

Engineering, University of Surrey, GU2 7XH, UK (e-mail: s.moschoyiannis@surrey.ac.uk). Sotiris Moschoyiannis and Vytenis

Sliogeris have been partly funded by UKRI Innovate UK, grant 77032. Yuhu Wu is with the School of Control Science and

Engineering, Dalian University of Technology, 116024, Dalian, China (e-mail: wuyuhu@dlut.edu.cn).

arXiv:2210.12229v2 [cs.LG] 25 Oct 2022

Deep Reinforcement Learning for Stabilization

of Large-scale Probabilistic Boolean Networks A PREPRINT

While steady states are understood in a small set of experimental GRNs in the literature, they are not generally available

and it is computationally intractable to compute them in larger networks. However, attractors with larger basins of

attraction tend to be more stable. Therefore, the long-term dynamical behavior can be described by a steady-state

distribution (SSD) [11,12] which effectively captures the time spent at each network state. Hence, the control problem

in large-scale PBNs takes the form of set stabilization of the network to a pre-assigned subset of network states. It

transpires that the ability to direct the network to a speciﬁc attractor, or a subset of network states, by means of

intervention on individual nodes (genes), is central to GRNs and targeted therapeutics.

In this context, intervention at a certain time takes the form of effecting a perturbation on the state of a node, which

in GRNs translates to knocking-out a gene (switch to 0) or activating it (to 1). A lot of changes in cancer are of a

regulatory epigenetic nature [13] and thus cells are expected to be re-conﬁgured to achieve the same outcome, e.g.,

inputs to gene regulatory elements by one signalling pathway can be substituted by another signalling pathway [14].

Hence, cancer biology suggests that perturbations may be transient, e.g., see single-step perturbations in [15]. More

concretely, this means that the network dynamics may change the state of a perturbed node in subsequent time steps.

Existing work in control systems engineering typically restricts perturbations to a subset of nodes, typically the control

nodes of a network [1, 16] or nodes that translate biologically, e.g., pirin and WNT5A driven induction of an invasive

phenotype in melanoma cells [11,17, 18]. The general case where perturbations are considered on the full set of nodes

is less studied, with the exception of [19], even though it is relevant in contexts where control nodes are not available,

or computationally intractable to obtain. Further, motivated by the biological properties found in various target states,

different approaches perturb individual nodes’ states in a PBN in order to either drive it to some attractor within a

ﬁnite number of steps (horizon), or change the network’s long-run behaviour by affecting its steady-state distribution

(by increasing the mass probability of the target states).

Seminal work by Cheng et al. on an algebraic state space representation (ASSR) of Boolean networks, based on semi-

tensor product (STP) [20], stimulated an avalanche of research on such networks, including controllability [21–24].

However, ASSR linearises logical functions by enumerating their state spaces, hence it is model-based, and such

methods require estimating 2N×2Nprobabilities, which quickly becomes intractable for a large number of nodes N.

Attempts to overcome this barrier include work on a polynomial time algorithm to identify pinned nodes [25, 26],

which led to subsequent developments in pinning control to rely on local neighbourhoods rather than global state

information [27], with perturbations taking the form of deleting edges, to generate an acyclic subgraph of the network.

Control of larger BNs (not PBNs which are stochastic, rather than deterministic) has been attempted in [28, 29].

However, no guarantees are provided that the original network and its acyclic version, which is eventually controlled,

have the same dynamics. In addition, there are concerns over how such changes in the network topology translate in

biology since cycles are inherent in most biological pathways, e.g., see [30].

The pinning control strategy has been applied to PBNs by Lin et al. [31] but the approach is only demonstrated on a

real PBN with N=9 nodes, which is hardly large-scale. Moreover, the target domain is an attractor (cyclic attractor

comprising 2 states) and not a subset of the state space, validated by a favourable shift in the steady state distribution

(SSD) of the PBN, which is amenable to larger networks. Further, the complexity of this most recent approach is

O(N2+N2d), hence exponential on the largest in-degree dof the pinned nodes in the acyclic version of the original

PBN.

It transpires that the main challenge in dealing with the Boolean paradigm, as the computational counterpart of gene

regulatory networks, has to do with the sheer scale of the network state space, i.e., the state transition graph or

probability transition matrix, which provides a model of the system’s dynamics, grows exponentially on the number

of nodes. The primary objective is to develop optimal control methods that can systematically derive the series of

perturbations required to direct the PBN from its current state to a target state or subset of states, thus rendering

the system stabilizable at state(s) where the cell exhibits desired biological properties. In this paper we demonstrate

stabilization of a Melanoma PBN within a space of 2200 states (Section 4).

Reinforcement Learning (RL) [32], by inception, addresses sequential decision-making by means of maximising a

cumulative reward signal (indicator of how effective a decision was at the previous step), where the choice of action

at each step inﬂuences not only the immediate reward but also subsequent states, and by virtue of that, also future

rewards. Most importantly, RL provides a model-free framework and solves a discrete-time optimal control problem

cast as an MDP.

In the context of stabilization of PBNs model-free means that the distribution of probabilities of the next state, and

that from each state, is not known. The RL agent learns to optimise its reward through continuous interaction with

the environment - the PBN, here - from an initial state towards a terminal condition (e.g., reaching the target domain

or exceeding the horizon), by following some policy on selecting actions at each state along the way (which of the

m≤Nnodes’ state to ﬂip). Such an episodic framework can feature in Q-Learning, which combines learning from

Deep Reinforcement Learning for Stabilization

of Large-scale Probabilistic Boolean Networks A PREPRINT

experience (like Monte Carlo methods) with bootstrapping (like Dynamic Programming). This means there is no need

to wait until the end of the episode to update the estimates of the action-values at each state (Q function). Estimates of

the current step are based on estimates of the next time step, until the policy converges to the optimal policy. In this

way, the PBN dynamics are learned, by means of learning the reward function.

Previous work on control that utilises Q-Learning includes the work of Karlsen, et al. [33] on rule-based RL, in the

form of an eXtended Classiﬁer System (XCS) which was also applied to the yeast cell cycle BN (N=11) in [34]. The

stabilization of PBNs with Q-Learning is studied in [35] but that work also only address a small apoptosis PBN (N=9).

It transpires that Q-Learning RL struggles to converge to an optimal solution in complex and high-dimensional MDPs.

Naturally, the interest shifts towards combining Q-Learning (and its model-free promise) with deep learning for scal-

ability. Papagiannis & Moschoyiannis in 2019 [36] ﬁrst proposed a control method based on Deep Q-Learning with

experience replay, namely DDQN with Prioritized Experience Replay (PER). This was applied to control of BNs

in [36] and then to PBNs (synthetic N=20 and a real Melanoma N=7) in [19]. Subsequently, this deep RL method

was applied to solve the output tracking problem in a reduced version of the T-cell receptor kinetics model (PBCN with

N=28) in [37]. Batch-mode RL has been used in [38] to control the Melanoma PBN (N=28) also studied here. Nev-

ertheless, the advantages of combining Q-Learning with neural network function approximation to provide efﬁciently

scalable RL control methods applicable to large-scale PBNs remain largely unexplored.

In this article, we take this stream of research a step further by addressing large-scale PBNs through the application of

model-free Deep Reinforcement Learning. In comparison to previous work [19, 37], we present an integrated control

framework for set stabilization of large PBNs, based on model-free deep RL (DDQN with PER) which (i) can address

different ﬂavours of the control problem, with regard to the control inputs as well as the target domain, (ii) can validate

successful control in large PBNs, where computing attractors is not feasible, and (iii) has time complexity which is

linearly dependent on the number of time steps and not the number of nodes in the PBN, hence a clear advancement

on polynomial [26] and exponential (on largest in-degree) [28] complexity.

As such, the main contributions of this paper are as follows:

1. We show that a model-free Deep RL control method for directing a PBN to a target state is scalable, with

time complexity only linearly dependent on the number of time steps during training

2. We show the method to be versatile in that it can address:

(a) control input nodes (when known) but also consider the full set of nodes (when not known)

(b) the target domain for control to be a speciﬁc attractor (stabilization) but also a pre-assigned subset of the

network state space (set stabilization)

3. We demonstrate the approach in successfully determining a control policy for PBNs and PBCNs, including

stabilization of a Melanoma PBN with 200 nodes.

The rest of this paper is structured as follows: Section 2 sets out key concepts behind PBNs, formulates the control

problem, and outlines Deep Reinforcement Learning, focusing on DDQN with PER. The method for deriving series

of perturbations (control policies) for stabilization of PBNs is developed in Section 3. The main results of applying

the control method to large PBNs are presented in Section 4, including comparison and discussion. Finally, Section 5

presents some concluding remarks and possible extensions.

2 Preliminaries

2.1 Probabilistic Boolean Networks (PBNs)

PBNs are a class of discrete dynamical systems characterised by interactions over a set of Nnodes, each taking a

Boolean value xiin D={0,1}. Hence, xi(t), i ∈[1, N], denotes the state of the i-th node at time instance t, and

represents the expression level of the i-th gene in the GRN being modelled. The update rule of xi(t)is determined

by the Boolean function fi([xj(t)]j∈Ni), and the value of fiis assigned to next state of node xi, the set Ni⊂[1, N ]

contains the subscript indices of in-neighbours, and fi:DNi→ D is the logical function chosen for node iat time

step t. The state of a PBN at time t, is denoted by Xt= [x1(t), x2(t), ..., xN(t)]>. Then the evolution (dynamics) of

the BN is represented by the following vector form:

Xt+1 =







x1(t+ 1) = f1([xj(t)]j∈N1)

x2(t+ 1) = f2([xj(t)]j∈N2)

xN(t+ 1) = fN([xj(t)]j∈NN)







∈ DN(1)

Deep Reinforcement Learning for Stabilization

of Large-scale Probabilistic Boolean Networks A PREPRINT

Each logical function fihas lipossibilities and is chosen from the ﬁnite set of Boolean functions Fi=

{f1

i, f2

i, . . . , fli

i}(hence, |Fi|=li) that the node is associated with.

Each function fk

i∈Fi,k∈[1, li]is chosen with probability P r[fi=fk

i] = pk

iwith Pli

k=1 pk

i= 1. In this article,

we assume that the assignment of logical functions for each node iis independent. Hence, the probability of Boolean

function selections, over Nnodes, is given by the product pµ1

1·pµ2

2· · ··pµN

N, where µi∈[1, li]. Different fk

iselections

lead to different PBN realizations, which occur under different probabilities, resulting in stochastic state evolution of

the network. Consequently, the possible realizations of a PBN are deﬁned as R=QN

i=1 li.

Thus, the probability PXt,Xt+1of transitioning from state Xt→ Xt+1at the next time step is: P[x1(t+ 1),x2(t+

1), . . . , xN(t+1)|x1(t), x2(t), . . . , xN(t)] = PXt,Xt+1where P[x1, x2, . . . , xn|y1, y2, . . . , yn]denotes the joint prob-

ability of x1, x2, . . . , xnconditioned to y1, y2, . . . , yn. We can now construct the transition probability matrix Pcom-

prised of 2N×2Nentries where entry Pm,n indicates the probability of transitioning from current state mto possible

next state n.

2.2 Control problem formulation

In the context of PBNs, and consequently GRNs, control takes the form of discovering policies, or series of interven-

tions (perturbations) to the state of a node (gene), aiming to drive the network from its current state to a desirable state,

where the network exhibits desirable biological properties.

Deﬁnition 1. Consider a PBN at state Xt. Then, deﬁne intervention I(Xt, it),0≤i≤N,as the process of ﬂipping

the binary value xi(t)associated with node i, at time t.

it= 0 denotes no intervention at time t.

Since an intervention strategy in gene therapies should be the least intrusive to the GRN, only a single I(Xt, it)is

allowed at each time step t. This means an intervention is followed by a natural network evolution step according to

the PBN internal transition dynamics2. Hence, we resist operating in a more aggressive intervention mode although it

is favourable from a computational viewpoint.

Deﬁnition 2. Consider I(Xt, it),and intervention horizon H∈Z+.Deﬁne S=

{I(X1, i1),I(X2, i2), . . . I(Xh≤H, ih)}, where 0≤it≤N, and h≤H∈Z+. Again, it= 0 denotes no

intervention, at time t.

The objective is to obtain the sequence of interventions Sthat directs the network from the current state - sampled

from a uniform distribution of all states p(X0)- to a desirable state.

When control input nodes are not known, Sis formed by perturbations effected on any node, hence there are N+

1actions for the RL agent at each state. The objective is to determine the sequence Swithin a ﬁnite number of

interventions, the horizon H, assuming that the MDP is ergodic. The experiments reported in the next section show

this not to be a restricting assumption.

When control input nodes are known, or can be computed, Sis formed by perturbations effected only on the control

nodes. The objective here is to determine the sequence Srequired to increase the steady-state probability mass of

desirable network states. In the study of control of the melanoma PBN [11, 38] these are the states where the gene

WNT5A, which is central to the induction of an invasive phenotype in melanoma cells, is OFF. Perturbations are only

allowed on the pirin gene, as in [11,38]. Since the target domain does not assume knowledge of attractors, this allows

addressing larger networks (cf. we demonstrate control of the Melanoma PBN N=200).

2.3 Deep Reinforcement Learning

The central task of Reinforcement Learning [32] is to solve sequential decision problems by optimising a cumulative

future reward. This can be achieved by learning estimates for the optimal value of each action which is typically

deﬁned as the sum of future rewards when taking that action and following the optimal policy afterwards.

The strategy that determines which action to take is called a policy. Hence, an optimal policy results from selecting

the actions that maximize the future cumulative reward.

2We stress that in this article the probability distribution of successor states, from each state, is unknown when learning the

control policy.

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

DEEPREINFORCEMENTLEARNINGFORSTABILIZATIONOFLARGE-SCALEPROBABILISTICBOOLEANNETWORKSAPREPRINTSotirisMoschoyiannis,EvangelosChatzaroulas,VytenisSliogeris,YuhuWuOctober26,2022ABSTRACTTheabilitytodirectaProbabilisticBooleanNetwork(PBN)toadesiredstateisimportanttoap-plicationssuchastargetedtherapeuticsin...

展开>> 收起<<

DEEPREINFORCEMENT LEARNING FOR STABILIZATION OFLARGE -SCALE PROBABILISTIC BOOLEAN NETWORKS A P REPRINT.pdf

共20页,预览4页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

DEEPREINFORCEMENT LEARNING FOR STABILIZATION OFLARGE -SCALE PROBABILISTIC BOOLEAN NETWORKS A P REPRINT

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: