
While steady states are understood in a small set of experimental GRNs in the literature, they are not generally known, and computing them in larger networks is computationally intractable. However, attractors with larger basins of attraction tend to be more stable. The long-term dynamical behavior can therefore be described by a steady-state distribution (SSD) [11, 12], which effectively captures the fraction of time spent at each network state. Hence, the control problem in large-scale PBNs takes the form of set stabilization of the network to a pre-assigned subset of network states. It transpires that the ability to direct the network to a specific attractor, or to a subset of network states, by means of intervention on individual nodes (genes), is central to GRNs and targeted therapeutics.
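For reference (the notation here is introduced only for illustration and is not taken from later sections), if the PBN is viewed as a Markov chain over its 2^N states with transition matrix P, and the chain is ergodic (as is the case, e.g., when random gene perturbation is present), the SSD π is the unique distribution satisfying

$$
\pi P = \pi, \qquad \sum_{x} \pi(x) = 1, \qquad \pi(x) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}\{x_t = x\},
$$

so that π(x) is precisely the long-run fraction of time the network spends in state x.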
In this context, intervention at a certain time takes the form of a perturbation to the state of a node, which in GRNs translates to knocking out a gene (switching it to 0) or activating it (to 1). Many changes in cancer are of a regulatory, epigenetic nature [13], and thus cells are expected to be re-configured to achieve the same outcome; e.g., inputs to gene regulatory elements by one signalling pathway can be substituted by another signalling pathway [14]. Hence, cancer biology suggests that perturbations may be transient, e.g., see single-step perturbations in [15]. More concretely, this means that the network dynamics may change the state of a perturbed node in subsequent time steps.
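As a minimal illustration of this transient-perturbation semantics, the sketch below uses a hypothetical 3-node PBN (not one of the networks studied in this paper): a node is flipped at one time step, and the ordinary PBN dynamics are then free to overwrite it at subsequent steps.

```python
import random

# Toy PBN: for each node, a list of (probability, Boolean function) pairs.
# Each function maps the full state (a tuple of 0/1 values) to the node's
# next value. This 3-node example is hypothetical and only illustrates the
# mechanics of a transient single-node perturbation.
PBN = [
    [(0.7, lambda s: s[1] & s[2]), (0.3, lambda s: s[1] | s[2])],  # node 0
    [(1.0, lambda s: 1 - s[0])],                                   # node 1
    [(0.6, lambda s: s[0] | s[1]), (0.4, lambda s: s[0])],         # node 2
]

def step(state):
    """One synchronous update: each node samples one of its functions."""
    next_state = []
    for funcs in PBN:
        r, acc = random.random(), 0.0
        for p, f in funcs:
            acc += p
            if r <= acc:
                next_state.append(f(state))
                break
        else:  # guard against floating-point round-off in the probabilities
            next_state.append(funcs[-1][1](state))
    return tuple(next_state)

def flip(state, node):
    """Transient perturbation: toggle one node (knock-out or activation)."""
    s = list(state)
    s[node] = 1 - s[node]
    return tuple(s)

state = (0, 0, 1)
state = flip(state, 0)   # intervene on node 0 at this time step
state = step(state)      # the dynamics may well overwrite node 0 again here
print(state)
```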
Existing work in control systems engineering typically restricts perturbations to a subset of nodes, usually the control nodes of a network [1, 16] or nodes with a clear biological interpretation, e.g., pirin and WNT5A-driven induction of an invasive phenotype in melanoma cells [11, 17, 18]. The general case where perturbations are considered on the full set of nodes is less studied, with the exception of [19], even though it is relevant in contexts where control nodes are not available or are computationally intractable to obtain. Further, motivated by the biological properties found in various target states, different approaches perturb individual nodes' states in a PBN in order to either drive it to some attractor within a finite number of steps (the horizon), or change the network's long-run behaviour by affecting its steady-state distribution (by increasing the probability mass of the target states).
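In symbols (illustrative notation only: T denotes the target set of states, H the horizon, and π_u the SSD induced by a control policy u), these two objectives can be read as

$$
\max_{u}\; \Pr_{u}\!\big(\exists\, t \le H : x_t \in T\big)
\qquad \text{and} \qquad
\max_{u}\; \sum_{x \in T} \pi_{u}(x),
$$

i.e., finite-horizon stabilization to T versus shifting the long-run probability mass towards T.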
Seminal work by Cheng et al. on an algebraic state space representation (ASSR) of Boolean networks, based on the semi-tensor product (STP) [20], stimulated an avalanche of research on such networks, including controllability [21–24]. However, ASSR linearises logical functions by enumerating their state spaces, hence it is model-based, and such methods require estimating 2^N × 2^N probabilities, which quickly becomes intractable for a large number of nodes N.
Attempts to overcome this barrier include work on a polynomial-time algorithm to identify pinned nodes [25, 26], which led to subsequent developments in pinning control that rely on local neighbourhoods rather than global state information [27], with perturbations taking the form of edge deletions that generate an acyclic subgraph of the network. Control of larger BNs (not PBNs, which are stochastic rather than deterministic) has been attempted in [28, 29]. However, no guarantees are provided that the original network and its acyclic version, which is eventually controlled, have the same dynamics. In addition, there are concerns over how such changes in the network topology translate to biology, since cycles are inherent in most biological pathways, e.g., see [30].
The pinning control strategy has been applied to PBNs by Lin et al. [31], but the approach is only demonstrated on a real PBN with N = 9 nodes, which is hardly large-scale. Moreover, the target domain is an attractor (a cyclic attractor comprising 2 states) rather than a subset of the state space, and control is validated by a favourable shift in the steady-state distribution (SSD) of the PBN, a criterion that is amenable to larger networks. Further, the complexity of this most recent approach is O(N^2 + N·2^d), hence exponential in the largest in-degree d of the pinned nodes in the acyclic version of the original PBN.
It transpires that the main challenge in dealing with the Boolean paradigm, as the computational counterpart of gene regulatory networks, is the sheer scale of the network state space: the state transition graph, or probability transition matrix, which provides a model of the system's dynamics, grows exponentially with the number of nodes. The primary objective is to develop optimal control methods that can systematically derive the series of perturbations required to direct the PBN from its current state to a target state or subset of states, thus rendering the system stabilizable at state(s) where the cell exhibits desired biological properties. In this paper we demonstrate stabilization of a Melanoma PBN within a space of 2^200 states (Section 4).
Reinforcement Learning (RL) [32], by inception, addresses sequential decision-making by maximising a cumulative reward signal (an indicator of how effective the decision at the previous step was), where the choice of action at each step influences not only the immediate reward but also subsequent states and, by virtue of that, future rewards. Most importantly, RL provides a model-free framework and solves a discrete-time optimal control problem cast as a Markov decision process (MDP).
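A minimal sketch of this casting is given below, reusing the toy 3-node PBN from the earlier snippet (all names such as env_step, TARGET and HORIZON are hypothetical, and tabular Q-learning is used purely for illustration; it cannot scale to the large networks targeted here): states are Boolean vectors, an action flips at most one node, the reward signals membership of the target set, and an episode terminates when the target is reached or the horizon is exceeded.

```python
import random
from collections import defaultdict

# --- Toy 3-node PBN environment (hypothetical, for illustration only) ------
PBN = [
    [(0.7, lambda s: s[1] & s[2]), (0.3, lambda s: s[1] | s[2])],  # node 0
    [(1.0, lambda s: 1 - s[0])],                                   # node 1
    [(0.6, lambda s: s[0] | s[1]), (0.4, lambda s: s[0])],         # node 2
]
TARGET = {(1, 1, 1)}        # pre-assigned target subset of states (made up here)
HORIZON = 20                # maximum episode length
ACTIONS = [None, 0, 1, 2]   # do nothing, or flip one of the N = 3 nodes

def pbn_step(state):
    """One synchronous PBN update; each node samples one of its functions."""
    nxt = []
    for funcs in PBN:
        r, acc = random.random(), 0.0
        for p, f in funcs:
            acc += p
            if r <= acc:
                nxt.append(f(state))
                break
        else:  # guard against floating-point round-off in the probabilities
            nxt.append(funcs[-1][1](state))
    return tuple(nxt)

def env_step(state, action):
    """Apply a (transient) flip, let the PBN evolve, and return the reward."""
    if action is not None:
        s = list(state)
        s[action] = 1 - s[action]
        state = tuple(s)
    state = pbn_step(state)
    reached = state in TARGET
    return state, (1.0 if reached else 0.0), reached

# --- Tabular Q-learning over episodes ---------------------------------------
Q = defaultdict(float)              # Q[(state, action)] -> value estimate
alpha, gamma, epsilon = 0.1, 0.95, 0.1

for episode in range(5000):
    state = tuple(random.randint(0, 1) for _ in range(3))
    for t in range(HORIZON):
        if random.random() < epsilon:                        # explore
            action = random.choice(ACTIONS)
        else:                                                # exploit
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward, done = env_step(state, action)
        best_next = 0.0 if done else max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = nxt
        if done:                                             # target reached
            break
```

For large N the lookup table Q becomes infeasible to store or populate, which motivates the function-approximation (deep RL) route pursued in this paper.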
In the context of stabilization of PBNs, model-free means that the transition probabilities to the next state, from each state, are not known. The RL agent learns to optimise its reward through continuous interaction with the environment (the PBN, here), from an initial state towards a terminal condition (e.g., reaching the target domain or exceeding the horizon), by following some policy for selecting an action at each state along the way (which of the m ≤ N nodes' state to flip). Such an episodic framework can feature in Q-Learning, which combines learning from