Towards Safe Mechanical Ventilation Treatment
Using Deep Offline Reinforcement Learning
Flemming Kondrup*1, Thomas Jiralerspong*1, Elaine Lau*1, Nathan de Lara1,
Jacob Shkrob1, My Duc Tran1, Doina Precup1,2, Sumana Basu1,2
1McGill University
2Mila
Correspondence to: flemming.kondrup@mail.mcgill.ca
*These authors contributed equally.
Abstract
Mechanical ventilation is a key form of life support for patients with pulmonary impairment. Healthcare workers are required to continuously adjust ventilator settings for each patient, a challenging and time-consuming task. Hence, it would be beneficial to develop an automated decision support tool to optimize ventilation treatment. We present DeepVent, a Conservative Q-Learning (CQL) based offline Deep Reinforcement Learning (DRL) agent that learns to predict the optimal ventilator parameters for a patient to promote 90-day survival. We design a clinically relevant intermediate reward that encourages continuous improvement of the patient vitals as well as addresses the challenge of sparse reward in RL. We find that DeepVent recommends ventilation parameters within safe ranges, as outlined in recent clinical trials. The CQL algorithm offers additional safety by mitigating the overestimation of the value estimates of out-of-distribution states/actions. We evaluate our agent using Fitted Q Evaluation (FQE) and demonstrate that it outperforms physicians from the MIMIC-III dataset.
1 Introduction
The COVID-19 pandemic has put enormous pressure on the healthcare system, particularly on intensive care units (ICUs). In cases of severe pulmonary impairment, mechanical ventilation assists breathing in patients and acts as the key form of life support. However, the optimal ventilator settings are individual-specific and often unknown (Zein et al. 2016), leading to ventilator-induced lung injury (VILI), diaphragm dysfunction, pneumonia and oxygen toxicity (Pham, Brochard, and Slutsky 2017). To prevent these complications and offer optimal care, it is necessary to personalize mechanical ventilation.
Various efforts have proposed the use of machine learning (ML) to personalize ventilation treatments. These include the use of deep supervised learning (Akbulut et al. 2014; Venkata, Koenig, and Pidaparti 2021), which permits high-level feature extraction yet ignores the sequential nature of ventilation. Furthermore, supervised learning methods can only hope to imitate the physician's policy, which may lead
to suboptimal treatment. Meanwhile, reinforcement learning (RL) interacts with the environment and gets immediate feedback from the patient in the form of rewards, and hence can improve upon the physician's policy. Tabular RL has recently shown strong potential in mechanical ventilation (Peine et al. 2021), but, to the best of our knowledge, no previous works have attempted to combine deep learning and RL to improve mechanical ventilation.
We propose DeepVent, a Deep RL model to optimize mechanical ventilation settings, and hypothesize it will lead to improved care. We consider both performance and patient safety with the aim of bridging the gap between research and real-life implementation. Here are our main contributions:
• We introduce DeepVent, a Deep RL model based on the Conservative Q-Learning algorithm (Kumar et al. 2020), and show using Fitted Q Evaluation (FQE) that it achieves higher performance when compared to physicians as recorded in the MIMIC-III dataset (Johnson et al. 2016), behavior cloning and Double Deep Q-Learning (DDQN) (van Hasselt, Guez, and Silver 2015), a common RL algorithm in health applications.
• We compare DeepVent's decisions to those of physicians and of the DDQN agent. We show that DeepVent makes recommendations within safe ranges, as supported by recent clinical studies and trials. In contrast, DDQN makes recommendations in ranges unsupported by clinical guidelines. We hypothesize that this may be due to DDQN's overestimation of out-of-distribution states/actions and demonstrate the potential of Conservative Q-Learning to address this. This is essential in healthcare, where risk in decision making must be avoided.
• We introduce a clinically relevant intermediate reward applicable to many fields of healthcare. RL models can benefit highly from an intermediate reward, as it can permit faster convergence and improved performance (Mataric 1994), and thus better outcomes for patients. Most previous efforts implementing RL in healthcare either did not address this or proposed a reward requiring important domain knowledge (see Section 2.6). Our intermediate reward is based on the Apache II mortality prediction score (Knaus et al. 1985), commonly used by physicians in ICUs, and leads to improved performance (a minimal sketch of its general shape is given below).
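For illustration only, the following minimal sketch shows the general shape of such an intermediate reward; the function names and scaling are hypothetical and the exact formulation used by DeepVent is not reproduced here.

```python
def intermediate_reward(apache_t: float, apache_t1: float) -> float:
    """Reward the agent when the patient's Apache II severity score decreases
    between consecutive time steps (positive when the patient improves)."""
    return apache_t - apache_t1

def terminal_reward(survived_90_days: bool) -> float:
    """Sparse outcome reward assigned at the end of an ICU episode."""
    return 1.0 if survived_90_days else -1.0
```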
2 Background & Related Work
2.1 Reinforcement Learning (RL)
RL is usually formalized as a Markov Decision Process (MDP), which is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P$ the transition function defining the probability of arriving at a given state $s_{t+1}$ after taking action $a_t$ from state $s_t$, $r$ the reward function defining the expected reward received after taking action $a_t$ from state $s_t$, and $\gamma \in (0,1)$ the discount factor of the reward. At each time step $t$ of an episode, the agent observes the current state $s_t \in \mathcal{S}$, takes an action $a_t \in \mathcal{A}$, and transitions to another state $s_{t+1} \in \mathcal{S}$ while receiving a reward $r_t$. The goal of RL is to train a policy $\pi : \mathcal{S} \times \mathcal{A} \to [0,1]$ that maximizes the cumulative discounted return $\sum_{t=0}^{T} \gamma^t r_t$ received over the course of an episode with $T$ timesteps.
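As a concrete illustration, the cumulative discounted return defined above can be computed as follows; this is a minimal sketch with an illustrative helper name and example values, not code from DeepVent.

```python
def discounted_return(rewards: list[float], gamma: float) -> float:
    """Sum_{t=0}^{T} gamma^t * r_t for one episode of rewards r_0, ..., r_T."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Example: a three-step episode with gamma = 0.99
print(discounted_return([0.5, -0.2, 1.0], gamma=0.99))  # 0.5 - 0.198 + 0.9801
```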
2.2 Q-Learning and Deep Q-learning
Q-Learning (Watkins and Dayan 1989) is one of the main RL algorithms and the most common method in healthcare applications (Yu, Liu, and Nemati 2020). It aims to estimate the value of taking an action $a$ from a state $s$, known as the Q-value $Q(s, a)$. At each timestep $t$, upon taking action $a_t$ from state $s_t$ and transitioning to state $s_{t+1}$ with reward $r_t$, the agent updates the Q-value for $(s_t, a_t)$ as follows:

$$Q(s_t, a_t) = Q(s_t, a_t) + \eta \left( r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right) \quad (1)$$

where $\eta \in (0,1)$ is the learning rate and $r_t + \gamma \max_a Q(s_{t+1}, a)$ is the target of the update. When the number of states is intractable, it becomes impractical to store the Q-values for all state-action pairs in a table. We can, however, use a function approximator to estimate the Q-values. The Deep Q-Network (DQN) algorithm (Mnih et al. 2015) combines Q-Learning with deep neural networks to handle complex RL problems. Despite offering many advantages, such as the ability to learn from data gathered through any way of behaving and to generalize to many states from a limited sample, DQN comes with challenges, such as the potential to substantially overestimate certain Q-values. Overestimation occurs when the estimated mean of a random variable is higher than its true mean. Because DQN updates its Q-values towards the target $r_t + \gamma \max_a Q(s_{t+1}, a)$, which includes the highest Q-value of the next state $s_{t+1}$, and because this is usually a noisy estimate, it can lead to overestimation.
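The tabular update in Equation (1) can be written in a few lines. The sketch below uses illustrative table sizes and hyperparameters, not those of the paper.

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))  # tabular Q-value estimates

def q_learning_update(s_t, a_t, r_t, s_t1, eta=0.1, gamma=0.99):
    """Move Q(s_t, a_t) towards the target r_t + gamma * max_a Q(s_t1, a)."""
    target = r_t + gamma * Q[s_t1].max()          # max over next-state Q-values
    Q[s_t, a_t] += eta * (target - Q[s_t, a_t])   # Equation (1)
```

The max over (noisy) next-state estimates in the target is the same term that drives the overestimation discussed above.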
2.3 Double Deep Q-Network (DDQN)
DDQN (van Hasselt, Guez, and Silver 2015) was introduced as a solution to the overestimation problem in Q-Learning. While DQN uses a single network to represent the value function, DDQN uses two different networks, parametrized by different parameter vectors $\theta$ and $\theta'$. At any point in time, one of the networks, chosen at random, is updated, and its target is computed using the Q-value estimated by the other network. Thus, for network $Q_\theta$, the target of the update is:

$$r_t + \gamma\, Q_{\theta'}\!\left(s_{t+1}, \arg\max_a Q_\theta(s_{t+1}, a)\right) \quad (2)$$

While this is beneficial, DDQN may still suffer from overestimation (van Hasselt, Guez, and Silver 2015), especially in offline RL.
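A minimal sketch contrasting the two targets is given below, assuming `q_theta_next` and `q_theta_prime_next` hold the two networks' Q-value vectors over actions for $s_{t+1}$; the names are illustrative, not from the paper.

```python
import numpy as np

def dqn_target(r_t, q_next, gamma=0.99):
    """Single-network target: r_t + gamma * max_a Q(s_{t+1}, a)."""
    return r_t + gamma * q_next.max()

def ddqn_target(r_t, q_theta_next, q_theta_prime_next, gamma=0.99):
    """Equation (2): action selected by Q_theta, evaluated by Q_theta'."""
    a_star = int(np.argmax(q_theta_next))
    return r_t + gamma * q_theta_prime_next[a_star]
```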
2.4 Offline Reinforcement Learning
Traditional RL methods are based on an online learning paradigm, in which an agent actively interacts with an environment. This is an important barrier to RL implementation in many fields, including healthcare (Levine et al. 2020), where acting in an environment is inefficient and unethical, as it would mean putting patients at risk. Consequently, recent years have witnessed significant growth in offline (or batch) RL, where learning utilizes a fixed dataset of transitions $\mathcal{D} = \{(s^i_t, a^i_t, r^i_t, s^i_{t+1})\}_{i=1}^{N}$. Since the RL model's understanding of the environment is limited to the dataset, this can lead to the overestimation of Q-values of state-action pairs which are under-represented in the dataset, or out-of-distribution (OOD). In the healthcare setting, this may translate to unsafe recommendations, putting patients at risk.
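A minimal sketch of what such a fixed dataset looks like in code is shown below; the field names are illustrative and do not reflect the paper's preprocessing.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    state: list[float]       # patient vitals and demographics at time t
    action: int              # discretized ventilator setting chosen at time t
    reward: float            # intermediate or terminal reward
    next_state: list[float]  # patient vitals at time t+1
    done: bool               # end of ICU episode

# Filled once from retrospective records (e.g. MIMIC-III); never grown by
# interacting with patients.
dataset: list[Transition] = []
```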
2.5 Conservative Q-Learning (CQL)
Conservative Q-Learning (CQL) was proposed to address overestimation in offline RL (Kumar et al. 2020). It learns a conservative estimate of the Q-function by adding a regularizer $\mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \mathcal{A}}[Q(s_t, a_t)]$ to the Q-learning error, in order to minimize the overestimated values of unseen actions. In addition, the term $\mathbb{E}_{s_t, a_t \sim \mathcal{D}}[Q(s_t, a_t)]$ is subtracted in order to maximize the Q-values of the actions in the dataset. In summary, CQL minimizes the estimated Q-values for all actions while simultaneously maximizing the estimated Q-values for the actions in the dataset, thus preventing overestimation for OOD or under-represented state-action pairs.
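For concreteness, the sketch below shows one common instantiation of this regularizer (the log-sum-exp variant of CQL from Kumar et al. 2020) added to a standard temporal-difference loss; variable names and the weight alpha are illustrative, not the paper's exact configuration.

```python
import torch

def cql_loss(q_values: torch.Tensor,    # [batch, n_actions], Q(s_t, .) for dataset states
             actions: torch.Tensor,     # [batch], long tensor of dataset actions a_t
             td_loss: torch.Tensor,     # standard Q-learning (TD) error
             alpha: float = 1.0) -> torch.Tensor:
    # Push down a soft maximum of Q over all actions (penalizes OOD actions)...
    logsumexp_q = torch.logsumexp(q_values, dim=1)
    # ...while pushing up the Q-values of the actions actually taken in the data.
    data_q = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    conservative_term = (logsumexp_q - data_q).mean()
    return td_loss + alpha * conservative_term
```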
2.6 Related work
Algorithms for ventilation optimization  Current approaches for ventilation optimization in hospitals commonly rely on proportional-integral-derivative (PID) control (Bennett 1993), which is known to be sub-optimal (Suo et al. 2021). The use of more sophisticated machine learning methods has been suggested in recent years (Akbulut et al. 2014; Venkata, Koenig, and Pidaparti 2021; Suo et al. 2021). Recently, RL was proposed using a simple tabular approach (Peine et al. 2021), which was already expected to outperform clinical standards, providing strong evidence for the use of RL in this setting. Nonetheless, to the best of our knowledge, no Deep RL approach has been proposed for ventilation settings optimization. Furthermore, many core RL challenges, such as sparse reward and value overestimation, have not yet been addressed.
Intermediate rewards in healthcare  RL has been suggested in various fields of healthcare, such as sepsis treatment (Raghu et al. 2017; Peng et al. 2019), heparin dosage (Lin et al. 2018), mechanical weaning (Prasad et al. 2017; Yu, Ren, and Dong 2020) and sedation (Eghbali, Alhanai, and Ghassemi 2021). In RL, the use of a dense reward signal can help credit assignment (Mataric 1994), leading to faster