2 Background & Related Work
2.1 Reinforcement Learning (RL)
RL is usually formalized as a Markov Decision Process (MDP), which is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P$ the transition function defining the probability of arriving at a given state $s_{t+1}$ after taking action $a_t$ from state $s_t$, $r$ the reward function defining the expected reward received after taking action $a_t$ from state $s_t$, and $\gamma \in (0,1)$ the discount factor of the reward. At each time step $t$ of an episode, the agent observes the current state $s_t \in \mathcal{S}$, takes an action $a_t \in \mathcal{A}$, and transitions to another state $s_{t+1} \in \mathcal{S}$ while receiving a reward $r_t$. The goal of RL is to train a policy $\pi : \mathcal{S} \times \mathcal{A} \to [0,1]$ that maximizes the cumulative discounted return $\sum_{t=0}^{T} \gamma^{t} r_t$ received over the course of an episode with $T$ timesteps.
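For concreteness, a minimal sketch of how this return could be computed for a logged episode; the reward values and discount factor below are illustrative, not taken from the paper:

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted return sum_{t=0}^{T} gamma^t * r_t of one episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Illustrative episode: no reward until the final timestep.
episode_rewards = [0.0, 0.0, 1.0]
print(discounted_return(episode_rewards))  # 0.99**2 * 1.0, roughly 0.98
```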
2.2 Q-Learning and Deep Q-learning
Q-Learning (Watkins and Dayan 1989) is one of the main
RL algorithms and the most common method in healthcare
applications (Yu, Liu, and Nemati 2020). It aims to estimate
the value of taking an action $a$ from a state $s$, known as the Q-value $Q(s, a)$. At each timestep $t$, upon taking action $a_t$ from state $s_t$ and transitioning to state $s_{t+1}$ with reward $r_t$, the agent updates the Q-value for $(s_t, a_t)$ as follows:
\[
Q(s_t, a_t) = Q(s_t, a_t) + \eta \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right) \tag{1}
\]
where $\eta \in (0,1)$ is the learning rate and $r_t + \gamma \max_{a} Q(s_{t+1}, a)$ is the target of the update. When the number of states is intractable, it becomes impractical to store the Q-values of all state-action pairs in a table. We can, however, use a function approximator to estimate the Q-values. The Deep Q-Network (DQN) (Mnih 2015) algorithm combines Q-Learning with deep neural networks to handle complex RL problems. Despite offering many advantages, such as the ability to learn from data gathered under any behavior policy and to generalize from a limited sample to potentially many states, DQN comes with challenges, such as the potential to substantially overestimate certain Q-values.
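As a minimal sketch of the tabular update in Eq. (1); the table sizes, hyperparameters, and the transition used below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Illustrative sizes and hyperparameters (not from the paper).
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))   # tabular Q-values Q(s, a)
eta, gamma = 0.1, 0.99                # learning rate and discount factor

def q_learning_update(s_t, a_t, r_t, s_next):
    """Move Q(s_t, a_t) towards the target r_t + gamma * max_a Q(s_next, a), as in Eq. (1)."""
    target = r_t + gamma * Q[s_next].max()
    Q[s_t, a_t] += eta * (target - Q[s_t, a_t])

q_learning_update(s_t=0, a_t=2, r_t=1.0, s_next=1)
print(Q[0, 2])  # 0.1 * (1.0 + 0.99 * 0.0 - 0.0) = 0.1
```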
Overestimation occurs when the estimated mean of a random variable is higher than its true mean. Because DQN updates its Q-values towards the target $r_t + \gamma \max_{a} Q(s_{t+1}, a)$, which includes the highest Q-value of the next state $s_{t+1}$, and because this is usually a noisy estimate, it can lead to an overestimation.
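A small synthetic illustration of this bias, assuming zero-mean Gaussian noise on the Q-value estimates (the numbers are not from the paper):

```python
import numpy as np

# Synthetic illustration: the true Q-values of all 4 actions are zero,
# but each estimate carries zero-mean Gaussian noise.
rng = np.random.default_rng(0)
noisy_q = rng.normal(loc=0.0, scale=1.0, size=(10_000, 4))

# Taking the max over noisy estimates yields a positive value on average,
# so a target built from max_a Q(s_{t+1}, a) is biased upwards.
print(noisy_q.max(axis=1).mean())  # roughly 1.0, although the true maximum is 0
```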
2.3 Double Deep Q-Network (DDQN)
DDQN (van Hasselt, Guez, and Silver 2015) was introduced
as a solution to the overestimation problem in Q-learning.
While DQN uses a single network to represent the value
function, DDQN uses two different networks, parametrized
by different parameter vectors, $\theta$ and $\theta'$. At any point in time, one of the networks, chosen at random, is updated, and its target is computed using the Q-value estimated by the other network. Thus, for network $Q_{\theta}$, the target of the update is:
\[
r_t + \gamma\, Q_{\theta'}\!\left(s_{t+1}, \arg\max_{a} Q_{\theta}(s_{t+1}, a)\right) \tag{2}
\]
While this is beneficial, DDQN may still suffer from over-
estimation (van Hasselt, Guez, and Silver 2015), especially
in offline RL.
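A minimal sketch of how the target in Eq. (2) could be computed from the two networks' Q-value estimates; the function name and example values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def ddqn_target(r_t, q_online_next, q_target_next, gamma=0.99):
    """Eq. (2): select the action with network Q_theta, evaluate it with Q_theta'."""
    a_star = int(np.argmax(q_online_next))      # arg max_a Q_theta(s_{t+1}, a)
    return r_t + gamma * q_target_next[a_star]  # Q_theta'(s_{t+1}, a_star)

# Illustrative Q-value estimates of the next state under the two networks.
q_online_next = np.array([0.2, 0.8, 0.5])
q_target_next = np.array([0.1, 0.6, 0.7])
print(ddqn_target(1.0, q_online_next, q_target_next))  # 1.0 + 0.99 * 0.6 = 1.594
```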
2.4 Offline Reinforcement Learning
Traditional RL methods are based on an online learning
paradigm, in which an agent actively interacts with an en-
vironment. This requirement is an important barrier to RL implementation in many fields, including healthcare (Levine et al. 2020),
where acting in an environment is inefficient and unethical,
as it would mean putting patients at risk. Consequently, re-
cent years have witnessed significant growth in offline (or
batch) RL, where learning utilizes a fixed dataset of transitions $\mathcal{D} = \{(s_t^i, a_t^i, r_t^i, s_{t+1}^i)\}_{i=1}^{N}$. Since the RL model's understanding of the environment is limited to the dataset, this can lead to the overestimation of Q-values of state-action pairs that are under-represented in the dataset, or out-of-distribution (OOD). In the healthcare setting, this
may translate to unsafe recommendations, putting patients
at risk.
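As a rough sketch, such a fixed dataset might be represented as follows (the `Transition` container and its values are purely illustrative):

```python
from dataclasses import dataclass

@dataclass
class Transition:
    """One logged transition (s_t, a_t, r_t, s_{t+1}) from the fixed dataset D."""
    state: list
    action: int
    reward: float
    next_state: list

# The agent never interacts with the environment during learning;
# it only sees transitions like these, collected ahead of time.
dataset = [
    Transition(state=[0.1, 0.5], action=2, reward=0.0, next_state=[0.2, 0.4]),
    Transition(state=[0.2, 0.4], action=1, reward=1.0, next_state=[0.3, 0.3]),
]
print(len(dataset))
```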
2.5 Conservative Q-Learning (CQL)
Conservative Q-Learning (CQL) was proposed to address
overestimation in offline RL (Kumar et al. 2020). It learns
a conservative estimate of the Q-function by adding a regularizer $\mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \mathcal{A}}[Q(s_t, a_t)]$ to the Q-learning error, in order to minimize the overestimated values of unseen actions. In addition, the term $-\mathbb{E}_{s_t, a_t \sim \mathcal{D}}[Q(s_t, a_t)]$ is added to maximize the Q-values of the actions in the dataset. In summary, CQL minimizes the estimated Q-values for all actions while simultaneously maximizing the estimated Q-values for the actions in the dataset, thus preventing overestimation of the Q-values of OOD or under-represented state-action pairs.
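A minimal sketch of how these two terms could be computed for a mini-batch, assuming a uniform distribution over actions for the first term; the function and values below are illustrative, not the CQL implementation of Kumar et al. (2020):

```python
import numpy as np

def cql_regularizer(q_batch, actions_batch):
    """Push down Q-values of all actions, push up Q-values of the logged actions.

    q_batch:       shape (batch, n_actions), Q(s_t, .) estimates for each state.
    actions_batch: shape (batch,), the actions actually taken in the dataset.
    """
    push_down = q_batch.mean()  # approximates E_{s_t ~ D, a_t ~ A}[Q(s_t, a_t)]
    push_up = q_batch[np.arange(len(actions_batch)), actions_batch].mean()  # E_{s_t, a_t ~ D}[Q]
    return push_down - push_up  # added, with some weight, to the usual Q-learning error

# Illustrative batch: 2 states, 3 actions; logged actions are 0 and 2.
q = np.array([[1.0, 4.0, 2.0],
              [0.5, 0.5, 3.0]])
a = np.array([0, 2])
print(cql_regularizer(q, a))  # 11/6 - 2.0, roughly -0.17
```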
2.6 Related work
Algorithms for ventilation optimization Current ap-
proaches for ventilation optimization in hospitals commonly
rely on proportional-integral-derivative (PID) control (Bennett 1993), which is known to be sub-optimal (Suo et al. 2021). The use of more sophisticated machine learning methods has been suggested in recent years (Akbulut et al. 2014; Venkata, Koenig, and Pidaparti 2021; Suo et al. 2021).
Recently, RL was proposed for this task using a simple tabular approach (Peine et al. 2021). Even this simple approach was expected to outperform clinical standards, providing strong evidence for the use of RL in this setting. Nonetheless, to the best of our knowledge, no Deep RL approach has been proposed for the optimization of ventilation settings. Furthermore, many core RL challenges,
such as sparse reward and value overestimation, have not yet
been addressed.
Intermediate rewards in healthcare RL has been sug-
gested in various fields of healthcare, such as sepsis treat-
ment (Raghu et al. 2017; Peng et al. 2019), heparin dosage
(Lin et al. 2018), mechanical weaning (Prasad et al. 2017;
Yu, Ren, and Dong 2020) and sedation (Eghbali, Alhanai,
and Ghassemi 2021). In RL, the use of a dense reward signal
can help credit assignment (Mataric 1994), leading to faster