
Offline reinforcement learning (RL). Built on the Markov decision process (MDP), offline RL learns an optimal policy from historical data without any online interaction (Prudencio et al., 2022). It is thus highly relevant to precision-medicine-type applications. However, many RL algorithms
rely on a crucial coverage assumption, which requires the offline data distribution to provide good coverage of the state-action distributions induced by all candidate policies. This assumption may be too restrictive and may not hold in observational studies. To address this challenge, the pessimism principle has been adopted, which discourages recommending actions that are under-explored given the state. The solutions in this family can be roughly classified into two categories: model-based algorithms (see, e.g., Kidambi et al., 2020; Yu et al., 2020; Uehara and Sun, 2021; Yin et al., 2021), and model-free algorithms (see, e.g., Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Buckman et al., 2020; Kumar et al., 2020; Rezaeifar et al., 2021; Jin et al., 2021; Xie et al., 2021; Zanette et al., 2021; Bai et al., 2022; Fu et al., 2022). The main idea of the model-based solutions is to penalize the estimated reward or transition function at state-action pairs that are rarely seen in the offline data, whereas the model-free solutions learn a conservative Q-function that lower-bounds the oracle Q-function. Nevertheless, most of these solutions either require a well-specified parametric model or rely on a key hyperparameter to quantify the degree of pessimism. It is noteworthy that the performance of these solutions can be highly sensitive to the choice of this hyperparameter; see Section 2.2 for an illustration. In addition, many algorithms are developed in the context of long- or infinite-horizon Markov decision processes, and their generalizations to medical applications with non-Markovian, finite-horizon systems remain unknown. Finally, we note that there is concurrent work by Jeunen and Goethals (2021) that adopts a Bayesian framework for offline contextual bandits. However, their method requires linear function approximation and cannot handle complex nonlinear systems or more general sequential decision making.
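To make the pessimism principle concrete, the sketch below shows a count-based variant in a tabular setting: the Q-estimate of each action is shrunk by an uncertainty term that grows when the state-action pair is rarely observed, and the penalty scale plays the role of the degree-of-pessimism hyperparameter discussed above. The function name, the count-based penalty, and the scale `beta` are illustrative assumptions rather than any specific published algorithm.

```python
# A minimal sketch of count-based pessimism in a tabular, finite-horizon setting.
# The penalty form and the scale `beta` (degree of pessimism) are illustrative
# assumptions, not a specific published algorithm.
import numpy as np

def pessimistic_greedy_action(q_hat, counts, state, beta=1.0):
    """Select an action by maximizing a lower bound on the estimated Q-values.

    q_hat  : (n_states, n_actions) Q-estimates fitted on the offline data
    counts : (n_states, n_actions) visitation counts in the offline data
    beta   : hyperparameter controlling the degree of pessimism
    """
    # Uncertainty shrinks as a state-action pair is observed more often.
    uncertainty = beta / np.sqrt(np.maximum(counts[state], 1))
    # Penalized (conservative) Q-values discourage actions that are
    # rarely seen at this state.
    q_lower = q_hat[state] - uncertainty
    return int(np.argmax(q_lower))
```

Methods of this type differ mainly in how the uncertainty term is constructed and how `beta` is chosen; the sensitivity to that choice is precisely what motivates the tuning-free approach developed in this article.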
Thompson sampling. Thompson sampling (TS) is a popu-
lar Bayesian approach proposed by Thompson (1933) that
randomly draws each arm according to its posterior probability of being optimal, so as to balance the exploration-exploitation trade-off in online contextual bandit problems. It has demonstrated competitive performance in empirical applications. For instance, Chapelle and Li (2011) showed that TS outperforms the upper confidence bound (UCB) algorithm in both synthetic and real-data applications of advertisement and news article recommendation. The success
of TS can be attributed to the Bayesian framework it adopts.
In particular, the prior distribution serves as a regularizer to
prevent overfitting, which implicitly discourages exploita-
tion. In addition, actions are selected randomly at each time
step according to the posterior distribution, which explic-
itly encourages exploration and is useful in settings with
delayed feedback (Chapelle and Li, 2011).
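As an illustration of the mechanism, the sketch below implements TS for a Bernoulli bandit with independent Beta(1, 1) priors: at each step a mean reward is sampled from each arm's posterior and the arm with the largest sample is pulled, so each arm is selected with its posterior probability of being optimal. The environment callback `pull_arm` and the horizon are assumptions for illustration.

```python
# A minimal sketch of Thompson sampling for a Bernoulli bandit with
# independent Beta(1, 1) priors; `pull_arm` is a user-supplied callback
# returning a 0/1 reward and is assumed here for illustration.
import numpy as np

def thompson_sampling(pull_arm, n_arms, horizon, rng=None):
    rng = rng or np.random.default_rng()
    alpha = np.ones(n_arms)  # Beta posterior parameter: successes + 1
    beta = np.ones(n_arms)   # Beta posterior parameter: failures + 1
    for _ in range(horizon):
        # Sample a plausible mean reward for every arm from its posterior.
        sampled_means = rng.beta(alpha, beta)
        # Pull the arm whose sampled mean is largest.
        arm = int(np.argmax(sampled_means))
        reward = pull_arm(arm)
        # Conjugate Beta-Bernoulli update of the chosen arm's posterior.
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return alpha, beta
```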
Bayesian machine learning. Bayesian machine learning
(BML) is a paradigm for constructing machine learning
models based on Bayes' theorem, and has been successfully deployed in a wide range of applications (see, e.g.,
Seeger, 2006, for a review). Popular BML methods include
Bayesian linear basis model (Smith, 1973), variational au-
toencoder (Kingma and Welling, 2013), Bayesian random
forests (Quadrianto and Ghahramani, 2014), Bayesian neu-
ral network (Blundell et al., 2015), among many others.
An appealing feature of BML is that uncertainty quantification is straightforward through posterior sampling. In contrast, frequentist methods for uncertainty quantification based on asymptotic theory can be highly challenging to apply to complex machine learning models, whereas those based on the bootstrap can be computationally intensive for large datasets.
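As a simple example of posterior-sampling-based uncertainty quantification, the sketch below uses a conjugate Bayesian linear basis model: the Gaussian posterior over the weights is available in closed form, and pointwise credible intervals for the regression function follow directly from posterior draws. The prior precision `alpha`, noise precision `beta_noise`, and function name are illustrative assumptions.

```python
# A minimal sketch of uncertainty quantification by posterior sampling in a
# Bayesian linear basis model: Gaussian prior N(0, I/alpha) on the weights
# and Gaussian noise with precision `beta_noise` (both assumed hyperparameters).
import numpy as np

def posterior_credible_band(Phi, y, Phi_new, alpha=1.0, beta_noise=25.0,
                            n_draws=1000, level=0.95, rng=None):
    """Pointwise credible intervals for f(x) = phi(x)^T w at new design points.

    Phi     : (n, d) basis expansion of the training inputs
    y       : (n,)   training responses
    Phi_new : (m, d) basis expansion of the new inputs
    """
    rng = rng or np.random.default_rng()
    d = Phi.shape[1]
    # Conjugate Gaussian posterior over the weights: w | data ~ N(mean, cov).
    cov = np.linalg.inv(alpha * np.eye(d) + beta_noise * Phi.T @ Phi)
    mean = beta_noise * cov @ Phi.T @ y
    # Posterior sampling: draw weight vectors and push them through the basis.
    draws = rng.multivariate_normal(mean, cov, size=n_draws) @ Phi_new.T
    lo, hi = np.quantile(draws, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return lo, hi
```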
1.2 Our Proposal and Contributions
In this article, we propose a novel pessimism-based Bayesian learning approach for estimating optimal dynamic treatment regimes (DTRs) offline. We integrate the pessimism principle
and Thompson sampling with the Bayesian machine learn-
ing framework. In particular, we derive an explicit and uni-
form uncertainty quantification of the Q-function estimator
given the data, which in turn offers an alternative way of
constructing confidence intervals without having to specify
a parametric model or tune the degree of pessimism, as re-
quired by nearly all existing pessimism-based offline RL
and DTR algorithms. Compared to the RL and DTR algo-
rithms without pessimism, our method yields a better deci-
sion rule when the coverage condition is seriously violated,
and a comparable result when the coverage condition approximately holds. Compared to the RL and DTR algorithms that adopt pessimism, our method achieves more consistent and competitive performance. Theoretically, we show that the
regret of the proposed method depends only on the estima-
tion error of the optimal action’s Q-estimator, and we pro-
vide the explicit form of its upper bound in the special case of a parametric model. The resulting bound is much tighter than the regret of the standard Q-learning algorithm, which depends on the uniform estimation error of the Q-estimator across all actions. Methodologically, our approach is fairly
general, and works with a range of different BML models,
from the simple Bayesian linear basis model to the more complex Bayesian neural network. Scientifically, our proposal offers a viable solution to a critical problem in precision medicine, helping patients achieve the best individualized treatment strategy. Finally, computationally,
our algorithm is efficient and scalable to large datasets, as it
adopts a variational inference approach to approximate the
posterior distribution, and does not require computationally intensive posterior sampling methods such as Markov chain Monte Carlo (Geman and Geman, 1984).
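To convey the flavor of how these pieces fit together, the sketch below shows one way posterior draws of the Q-function at a given state (e.g., obtained from a variational approximation) could be turned into a pessimistic decision rule: act greedily with respect to a lower credible bound rather than the posterior mean. This is a conceptual illustration under simplifying assumptions, not the exact algorithm developed in this article.

```python
# A conceptual sketch: posterior draws of Q(state, .) -> pessimistic action.
# The draw matrix and the credible level are illustrative assumptions; this is
# not the paper's exact algorithm.
import numpy as np

def pessimistic_action(q_posterior_draws, level=0.05):
    """q_posterior_draws : (n_draws, n_actions) posterior samples of the
    Q-values at a fixed state, e.g., produced by variational inference."""
    # A posterior lower credible bound replaces a hand-tuned pessimism penalty:
    # poorly covered actions have wide posteriors and hence small lower bounds.
    q_lower = np.quantile(q_posterior_draws, level, axis=0)
    return int(np.argmax(q_lower))
```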