2 Background & Related Work
2.1 Reinforcement Learning (RL)
RL is usually formalized as a Markov Decision Process (MDP), which is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P$ the transition function defining the probability of arriving at a given state $s_{t+1}$ after taking action $a_t$ from state $s_t$, $r$ the reward function defining the expected reward received after taking action $a_t$ from state $s_t$, and $\gamma \in (0,1)$ the discount factor of the reward. At each time step $t$ of an episode, the agent observes the current state $s_t \in \mathcal{S}$, takes an action $a_t \in \mathcal{A}$, and transitions to another state $s_{t+1} \in \mathcal{S}$ while receiving a reward $r_t$. The goal of RL is to train a policy $\pi : \mathcal{S} \times \mathcal{A} \to [0,1]$ that maximizes the cumulative discounted return $\sum_{t=0}^{T} \gamma^{t} r_t$ received over the course of an episode with $T$ timesteps.
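For concreteness, a minimal sketch of how this return could be computed for a logged episode; the reward values and discount factor below are illustrative, not taken from the paper:

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted return sum_{t=0}^{T} gamma^t * r_t of one episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Illustrative episode: no reward until the final timestep.
episode_rewards = [0.0, 0.0, 1.0]
print(discounted_return(episode_rewards))  # 0.99**2 * 1.0, roughly 0.98
```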
2.2 Q-Learning and Deep Q-learning
Q-Learning (Watkins and Dayan 1989) is one of the main
RL algorithms and the most common method in healthcare
applications (Yu, Liu, and Nemati 2020). It aims to estimate
the value of taking an action $a$ from a state $s$, known as the Q-value $Q(s, a)$. At each timestep $t$, upon taking action $a_t$ from state $s_t$ and transitioning to state $s_{t+1}$ with reward $r_t$, the agent updates the Q-value for $(s_t, a_t)$ as follows:
\[
Q(s_t, a_t) = Q(s_t, a_t) + \eta \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right) \tag{1}
\]
where $\eta \in (0,1)$ is the learning rate and $r_t + \gamma \max_{a} Q(s_{t+1}, a)$ is the target of the update. When the number of states is intractable, it becomes impractical to store the Q-values of all state-action pairs in a table. We can, however, use a function approximator to estimate the Q-values. The Deep Q-Network (DQN) (Mnih 2015) algorithm combines Q-Learning with deep neural networks to handle complex RL problems. Despite offering many advantages, such as the ability to learn from data gathered under any behavior policy and to generalize from a limited sample to potentially many states, DQN comes with challenges, such as the potential to substantially overestimate certain Q-values.
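As a minimal sketch of the tabular update in Eq. (1); the table sizes, hyperparameters, and the transition used below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Illustrative sizes and hyperparameters (not from the paper).
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))   # tabular Q-values Q(s, a)
eta, gamma = 0.1, 0.99                # learning rate and discount factor

def q_learning_update(s_t, a_t, r_t, s_next):
    """Move Q(s_t, a_t) towards the target r_t + gamma * max_a Q(s_next, a), as in Eq. (1)."""
    target = r_t + gamma * Q[s_next].max()
    Q[s_t, a_t] += eta * (target - Q[s_t, a_t])

q_learning_update(s_t=0, a_t=2, r_t=1.0, s_next=1)
print(Q[0, 2])  # 0.1 * (1.0 + 0.99 * 0.0 - 0.0) = 0.1
```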
Overestimation occurs when the estimated mean of a random variable is higher than its true mean. Because DQN updates its Q-values towards the target $r_t + \gamma \max_{a} Q(s_{t+1}, a)$, which includes the highest Q-value of the next state $s_{t+1}$, and because this is usually a noisy estimate, it can lead to an overestimation.
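A small synthetic illustration of this bias, assuming zero-mean Gaussian noise on the Q-value estimates (the numbers are not from the paper):

```python
import numpy as np

# Synthetic illustration: the true Q-values of all 4 actions are zero,
# but each estimate carries zero-mean Gaussian noise.
rng = np.random.default_rng(0)
noisy_q = rng.normal(loc=0.0, scale=1.0, size=(10_000, 4))

# Taking the max over noisy estimates yields a positive value on average,
# so a target built from max_a Q(s_{t+1}, a) is biased upwards.
print(noisy_q.max(axis=1).mean())  # roughly 1.0, although the true maximum is 0
```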
2.3 Double Deep Q-Network (DDQN)
DDQN (van Hasselt, Guez, and Silver 2015) was introduced
as a solution to the overestimation problem in Q-learning.
While DQN uses a single network to represent the value
function, DDQN uses two different networks, parametrized
by different parameter vectors, $\theta$ and $\theta'$. At any point in time, one of the networks, chosen at random, is updated, and its target is computed using the Q-value estimated by the other network. Thus, for network $Q_{\theta}$, the target of the update is:
\[
r_t + \gamma\, Q_{\theta'}\!\left(s_{t+1}, \arg\max_{a} Q_{\theta}(s_{t+1}, a)\right) \tag{2}
\]
While this is beneficial, DDQN may still suffer from over-
estimation (van Hasselt, Guez, and Silver 2015), especially
in offline RL.
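A minimal sketch of how the target in Eq. (2) could be computed from the two networks' Q-value estimates; the function name and example values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def ddqn_target(r_t, q_online_next, q_target_next, gamma=0.99):
    """Eq. (2): select the action with network Q_theta, evaluate it with Q_theta'."""
    a_star = int(np.argmax(q_online_next))      # arg max_a Q_theta(s_{t+1}, a)
    return r_t + gamma * q_target_next[a_star]  # Q_theta'(s_{t+1}, a_star)

# Illustrative Q-value estimates of the next state under the two networks.
q_online_next = np.array([0.2, 0.8, 0.5])
q_target_next = np.array([0.1, 0.6, 0.7])
print(ddqn_target(1.0, q_online_next, q_target_next))  # 1.0 + 0.99 * 0.6 = 1.594
```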
2.4 Offline Reinforcement Learning
Traditional RL methods are based on an online learning
paradigm, in which an agent actively interacts with an en-
vironment. This requirement is an important barrier to RL implementation in many fields, including healthcare (Levine et al. 2020),
where acting in an environment is inefficient and unethical,
as it would mean putting patients at risk. Consequently, re-
cent years have witnessed significant growth in offline (or
batch) RL, where learning utilizes a fixed dataset of transitions $\mathcal{D} = \{(s_t^i, a_t^i, r_t^i, s_{t+1}^i)\}_{i=1}^{N}$. Since the RL model's understanding of the environment is limited to the dataset, this can lead to the overestimation of Q-values of state-action pairs that are under-represented in the dataset, or out-of-distribution (OOD). In the healthcare setting, this
may translate to unsafe recommendations, putting patients
at risk.
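As a rough sketch, such a fixed dataset might be represented as follows (the `Transition` container and its values are purely illustrative):

```python
from dataclasses import dataclass

@dataclass
class Transition:
    """One logged transition (s_t, a_t, r_t, s_{t+1}) from the fixed dataset D."""
    state: list
    action: int
    reward: float
    next_state: list

# The agent never interacts with the environment during learning;
# it only sees transitions like these, collected ahead of time.
dataset = [
    Transition(state=[0.1, 0.5], action=2, reward=0.0, next_state=[0.2, 0.4]),
    Transition(state=[0.2, 0.4], action=1, reward=1.0, next_state=[0.3, 0.3]),
]
print(len(dataset))
```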
2.5 Conservative Q-Learning (CQL)
Conservative Q-Learning (CQL) was proposed to address
overestimation in offline RL (Kumar et al. 2020). It learns
a conservative estimate of the Q-function by adding a regularizer $\mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \mathcal{A}}[Q(s_t, a_t)]$ to the Q-learning error, in order to minimize the overestimated values of unseen actions. In addition, the term $-\mathbb{E}_{s_t, a_t \sim \mathcal{D}}[Q(s_t, a_t)]$ is added to maximize the Q-values of the actions in the dataset. In summary, CQL minimizes the estimated Q-values for all actions while simultaneously maximizing the estimated Q-values for the actions in the dataset, thus preventing overestimation of the Q-values of OOD or under-represented state-action pairs.
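A minimal sketch of how these two terms could be computed for a mini-batch, assuming a uniform distribution over actions for the first term; the function and values below are illustrative, not the CQL implementation of Kumar et al. (2020):

```python
import numpy as np

def cql_regularizer(q_batch, actions_batch):
    """Push down Q-values of all actions, push up Q-values of the logged actions.

    q_batch:       shape (batch, n_actions), Q(s_t, .) estimates for each state.
    actions_batch: shape (batch,), the actions actually taken in the dataset.
    """
    push_down = q_batch.mean()  # approximates E_{s_t ~ D, a_t ~ A}[Q(s_t, a_t)]
    push_up = q_batch[np.arange(len(actions_batch)), actions_batch].mean()  # E_{s_t, a_t ~ D}[Q]
    return push_down - push_up  # added, with some weight, to the usual Q-learning error

# Illustrative batch: 2 states, 3 actions; logged actions are 0 and 2.
q = np.array([[1.0, 4.0, 2.0],
              [0.5, 0.5, 3.0]])
a = np.array([0, 2])
print(cql_regularizer(q, a))  # 11/6 - 2.0, roughly -0.17
```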
2.6 Related work
Algorithms for ventilation optimization Current ap-
proaches for ventilation optimization in hospitals commonly
rely on proportional-integral-derivative (PID) control (Bennett 1993), which is known to be sub-optimal (Suo et al. 2021). The use of more sophisticated machine learning methods has been suggested in recent years (Akbulut et al. 2014; Venkata, Koenig, and Pidaparti 2021; Suo et al. 2021).
Recently, RL was proposed for this task using a simple tabular approach (Peine et al. 2021). Even this simple approach was expected to outperform clinical standards, providing strong evidence for the use of RL in this setting. Nonetheless, to the best of our knowledge, no Deep RL approach has been proposed for the optimization of ventilation settings. Furthermore, many core RL challenges,
such as sparse reward and value overestimation, have not yet
been addressed.
Intermediate rewards in healthcare RL has been sug-
gested in various fields of healthcare, such as sepsis treat-
ment (Raghu et al. 2017; Peng et al. 2019), heparin dosage
(Lin et al. 2018), mechanical weaning (Prasad et al. 2017;
Yu, Ren, and Dong 2020) and sedation (Eghbali, Alhanai,
and Ghassemi 2021). In RL, the use of a dense reward signal
can help credit assignment (Mataric 1994), leading to faster