Optimizing Pessimism in Dynamic Treatment Regimes: A Bayesian Learning Approach
Yunzhe Zhou Zhengling Qi Chengchun Shi Lexin Li
UC Berkeley George Washington University LSE UC Berkeley
Abstract
In this article, we propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting. When the coverage condition does not hold, which is common for offline data, the existing solutions would produce sub-optimal policies. The pessimism principle addresses this issue by discouraging recommendation of actions that are less explored conditioning on the state. However, nearly all pessimism-based methods rely on a key hyper-parameter that quantifies the degree of pessimism, and the performance of the methods can be highly sensitive to the choice of this parameter. We propose to integrate the pessimism principle with Thompson sampling and Bayesian machine learning for optimizing the degree of pessimism. We derive a credible set whose boundary uniformly lower bounds the optimal Q-function, and thus we do not require additional tuning of the degree of pessimism. We develop a general Bayesian learning method that works with a range of models, from Bayesian linear basis model to Bayesian neural network model. We develop the computational algorithm based on variational inference, which is highly efficient and scalable. We establish the theoretical guarantees of the proposed method, and show empirically that it outperforms the existing state-of-the-art solutions through both simulations and a real data example.
1 INTRODUCTION
Due to heterogeneity in patients' responses to treatment, a one-size-fits-all strategy may no longer be optimal
(Jiang et al., 2017). Precision medicine aims to identify the most effective treatment strategy based on individual patient information. For example, for many complex diseases, such as cancer, mental disorders and diabetes, patients are usually treated at multiple stages over time based on their evolving treatment and clinical covariates (Sinyor et al., 2010; Maahs et al., 2012). Dynamic treatment regimes (DTRs) provide a useful framework for leveraging data to learn the optimal treatment strategy by incorporating heterogeneity across patients and time (Murphy, 2003). Formally, a DTR is a sequence of decision rules, where each rule takes the patient's past information as input, and outputs the treatment assignment. An optimal DTR is one that maximizes the patient's expected clinical outcomes. DTRs generally follow an online learning paradigm, where the process involves repeatedly collecting the patient's response to the assigned treatment. In medical studies, however, it is often impractical to constantly collect such interactive information. This prompts us to study the problem of learning optimal DTRs in an offline setting, where the data have already been pre-collected. In this article, we propose a novel Bayesian learning approach using a pessimistic-type Thompson sampling for finding DTRs.
1.1 Related Work
Statistical methods for DTRs. There is a vast literature on statistical methods for finding optimal DTRs, which, broadly speaking, includes Q-learning, A-learning and value search methods. See Tsiatis et al. (2019); Kosorok and Laber (2019) for an overview. See also Robins (2004); Qian and Murphy (2011); Zhang et al. (2013); Chakraborty and Murphy (2014); Zhao et al. (2015); Chen et al. (2016); Shi et al. (2018a,b); Qi et al. (2020); Chen et al. (2020); Zhang (2020); Cai et al. (2021); Qiu et al. (2021); Zhou et al. (2021); Qi et al. (2022), and the references therein. However, most existing methods rely on a positivity assumption in the offline data, which essentially requires that the probability of each treatment assignment at each stage be uniformly bounded away from zero. In observational data, such an assumption could easily fail, as certain treatments are prohibited in some scenarios. Therefore, applying these methods may produce sub-optimal DTRs.
Offline reinforcement learning (RL). Built on the Markov decision process (MDP), offline RL learns an optimal policy from historical data without any online interaction (Prudencio et al., 2022). It is thus highly relevant for precision medicine type applications. However, many RL algorithms rely on a crucial coverage assumption, which requires the offline data distribution to provide a good coverage over the state-action distribution induced by all candidate policies. This assumption may be too restrictive and may not hold in observational studies. To address this challenge, the pessimism principle has been adopted, which discourages recommending actions that are less explored conditioning on the state. The solutions in this family can be roughly classified into two categories: model-based algorithms (see, e.g., Kidambi et al., 2020; Yu et al., 2020; Uehara and Sun, 2021; Yin et al., 2021), and model-free algorithms (see, e.g., Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Buckman et al., 2020; Kumar et al., 2020; Rezaeifar et al., 2021; Jin et al., 2021; Xie et al., 2021; Zanette et al., 2021; Bai et al., 2022; Fu et al., 2022). The main idea of the model-based solutions is to penalize the reward or transition function at state-action pairs that are rarely seen in the offline data, whereas the main idea of the model-free ones is to learn a conservative Q-function that lower bounds the oracle Q-function. Nevertheless, most of these solutions either require a well-specified parametric model, or rely on a key hyperparameter to quantify the degree of pessimism. It is noteworthy that the performance of those solutions can be highly sensitive to the choice of the hyperparameter; see Section 2.2 for more illustration. In addition, many algorithms are developed in the context of long- or infinite-horizon Markov decision processes. Their generalizations to medical applications with non-Markovian and finite-horizon systems remain unknown. Finally, we note that there is concurrent work by Jeunen and Goethals (2021) that adopts a Bayesian framework for offline contextual bandits. However, their method requires linear function approximations, and cannot handle complex nonlinear systems, nor more general sequential decision making.
Thompson sampling. Thompson sampling (TS) is a popular Bayesian approach proposed by Thompson (1933) that randomly draws each arm according to its probability of being optimal, so as to balance the exploration-exploitation trade-off in online contextual bandit problems. It has demonstrated competitive performance in empirical applications. For instance, Chapelle and Li (2011) showed that TS outperforms the upper confidence bound (UCB) algorithm in both synthetic and real data applications of advertisement and news article recommendation. The success of TS can be attributed to the Bayesian framework it adopts. In particular, the prior distribution serves as a regularizer to prevent overfitting, which implicitly discourages exploitation. In addition, actions are selected randomly at each time step according to the posterior distribution, which explicitly encourages exploration and is useful in settings with delayed feedback (Chapelle and Li, 2011).
Bayesian machine learning. Bayesian machine learning (BML) is a paradigm for constructing machine learning models based on Bayes' theorem, and has been successfully deployed in a wide range of applications (see, e.g., Seeger, 2006, for a review). Popular BML methods include the Bayesian linear basis model (Smith, 1973), the variational autoencoder (Kingma and Welling, 2013), Bayesian random forests (Quadrianto and Ghahramani, 2014), and the Bayesian neural network (Blundell et al., 2015), among many others. An appealing feature of BML is that, through posterior sampling, uncertainty quantification is straightforward. In contrast, frequentist methods for uncertainty quantification that are based on asymptotic theories can be highly challenging with complex machine learning models, whereas those based on the bootstrap can be computationally intensive with large datasets.
1.2 Our Proposal and Contributions
In this article, we propose a novel pessimism-based Bayesian learning approach for offline optimal dynamic treatment regimes. We integrate the pessimism principle and Thompson sampling with the Bayesian machine learning framework. In particular, we derive an explicit and uniform uncertainty quantification of the Q-function estimator given the data, which in turn offers an alternative way of constructing confidence intervals without having to specify a parametric model or tune the degree of pessimism, as required by nearly all existing pessimism-based offline RL and DTR algorithms. Compared to the RL and DTR algorithms without pessimism, our method yields a better decision rule when the coverage condition is seriously violated, and a comparable result when the coverage approximately holds. Compared to the RL and DTR algorithms adopting pessimism, our method achieves a more consistent and competitive performance. Theoretically, we show that the regret of the proposed method depends only on the estimation error of the optimal action's Q-estimator, and we provide the explicit form of its upper bound in a special case of a parametric model. The resulting bound is much narrower than the regret of the standard Q-learning algorithm, which depends on the uniform estimation error of the Q-estimator at each action. Methodologically, our approach is fairly general, and works with a range of different BML models, from the simple Bayesian linear basis model to the more complex Bayesian neural network model. Scientifically, our proposal offers a viable solution to a critical problem in precision medicine and can assist patients in achieving the best individualized treatment strategy. Finally, computationally, our algorithm is efficient and scalable to large datasets, as it adopts a variational inference approach to approximate the posterior distribution, and does not require computationally intensive posterior sampling methods such as Markov chain Monte Carlo (Geman and Geman, 1984).
Our method shares a similar spirit as TS, in that we also adopt a Bayesian framework for uncertainty quantification and the exploration-exploitation trade-off. We also remark that, although the concepts of pessimism, TS and BML are not completely new, how to integrate them properly is highly nontrivial, and is the main contribution of this article. First of all, in the online setting, TS randomizes over actions to address the exploration-exploitation dilemma. However, randomization contradicts the pessimistic principle in the offline setting. To address this issue, we borrow the idea from the Bayesian UCB method (Kaufmann et al., 2012) for online bandits and generalize it to offline sequential decision making. Second, although posterior sampling allows one to conveniently quantify the pointwise uncertainty of the estimated Q-function at a given individual state-action pair, it remains challenging to lower bound the Q-function uniformly for any state-action pair. Developing a uniform credible set is crucial for implementing the pessimism principle. Our proposal provides an effective solution with a uniform uncertainty quantification.
2 PRELIMINARIES
2.1 Bayesian Machine Learning
Let $p(o \mid w)$ denote a machine learning model indexed by $w$ that parameterizes the probability mass or density function of some random variable $O$, and let $\mathcal{D}_n = \{o_i\}_{i=1}^{n}$ denote a set of i.i.d. random samples. BML treats $w$ as a random quantity, and learns the entire posterior distribution $p(w \mid \mathcal{D}_n)$ of $w$ given the data $\mathcal{D}_n$ based on the Bayes rule, by combining the likelihood function $p(\mathcal{D}_n \mid w)$ and a prior distribution $p(w)$ that reflects prior knowledge about $w$. Once the posterior distribution of $w$ is learned, a commonly used point estimator for $w$ is the posterior mean, denoted by $\widehat{w} = E(w \mid \mathcal{D}_n)$. One can then make the prediction by using $\widehat{w}$ and the likelihood function. Alternatively, one can also make the prediction by using the posterior mean of the model output. We next consider two specific examples.
Bayesian Linear Basis Model (BLBM). BLBM is an extension of the classical Bayesian linear model (Lindley and Smith, 1972), and models the distribution of a response $Y$ given $X = x$ as $y_i = w^\top \phi(x_i) + \epsilon_i = \sum_{j=1}^{K} w_j \phi_j(x_i) + \epsilon_i$, where $\phi(x) = \{\phi_1(x), \cdots, \phi_K(x)\}^\top$ is a set of $K$ basis functions, $w = (w_1, \ldots, w_K)^\top$ is the weight vector, and the error $\epsilon_i$ follows a Gaussian distribution. Since the posterior distribution can be explicitly derived, BLBM is easy to implement in practice. However, it might suffer from potential model misspecification in high-dimensional complex problems.
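For concreteness, the closed-form posterior is easy to state: with a Gaussian prior on $w$ and Gaussian noise, the posterior of $w$ is again Gaussian. The sketch below is a minimal illustration assuming an isotropic prior $w \sim N(0, \tau^2 I)$ and a known noise variance $\sigma^2$; the prior, the noise level, and the quadratic basis in the usage example are illustrative assumptions, not the specification used in the paper.

```python
import numpy as np

def blbm_posterior(x, y, phi, sigma2=1.0, tau2=1.0):
    """Closed-form Gaussian posterior for a Bayesian linear basis model.

    Assumed prior: w ~ N(0, tau2 * I); likelihood: y_i ~ N(w^T phi(x_i), sigma2).
    Returns the posterior mean and covariance of the weight vector w.
    """
    Phi = np.stack([phi(xi) for xi in x])                # n x K design matrix
    K = Phi.shape[1]
    post_cov = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(K) / tau2)
    post_mean = post_cov @ Phi.T @ y / sigma2
    return post_mean, post_cov

# Illustrative usage with a quadratic basis on a scalar covariate
phi = lambda x: np.array([1.0, x, x ** 2])
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 1.0 + 2.0 * x - 0.5 * x ** 2 + 0.1 * rng.standard_normal(50)
m_post, S_post = blbm_posterior(x, y, phi)               # posterior mean estimate of w is m_post
```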
Bayesian Neural Network (BNN). BNN learns the posterior distribution of the weight parameter $w$ in a neural network. However, exact Bayesian inference is generally intractable due to the extremely complex model structure. Blundell et al. (2015) proposed to approximate the exact posterior distribution $p(w \mid \mathcal{D}_n)$ by a variational distribution $q(w \mid \theta)$ whose functional form is pre-specified, and then estimate $\theta$ by minimizing the Kullback-Leibler (KL) divergence $\mathrm{KL}[q(w \mid \theta) \,\|\, p(w \mid \mathcal{D}_n)]$. In practice, $q(w \mid \theta)$ can be set to a multivariate Gaussian distribution, and the parameters are updated based on Monte Carlo gradients. Blundell et al. (2015) developed an efficient computational algorithm, and showed that BNN achieves superior performance in numerous tasks.
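As a concrete illustration of this variational scheme, below is a minimal Bayes-by-Backprop-style sketch in PyTorch: a single Bayesian linear layer with a mean-field Gaussian $q(w \mid \theta)$, a standard Gaussian prior, one Monte Carlo weight sample per step, and an assumed Gaussian likelihood. It only illustrates the KL-regularized objective and is not the exact architecture, prior, or training recipe of Blundell et al. (2015).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalLinear(nn.Module):
    """A single Bayesian layer with a mean-field Gaussian variational posterior q(w | theta)."""
    def __init__(self, d_in, d_out, prior_std=1.0):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(d_out, d_in))
        self.rho = nn.Parameter(torch.full((d_out, d_in), -3.0))  # std = softplus(rho)
        self.prior = torch.distributions.Normal(0.0, prior_std)

    def forward(self, x):
        std = F.softplus(self.rho)
        w = self.mu + std * torch.randn_like(std)                 # reparameterized weight sample
        q = torch.distributions.Normal(self.mu, std)
        # single-sample Monte Carlo estimate of KL[q(w|theta) || p(w)]
        self.kl = (q.log_prob(w) - self.prior.log_prob(w)).sum()
        return x @ w.t()

# One gradient step on the negative ELBO = squared-error loss (Gaussian likelihood) + KL term
layer = VariationalLinear(d_in=3, d_out=1)
opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
x, y = torch.randn(32, 3), torch.randn(32, 1)
loss = 0.5 * ((layer(x) - y) ** 2).sum() + layer.kl
opt.zero_grad(); loss.backward(); opt.step()
```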
2.2 The Pessimism Principle
In the offline setting, when the coverage condition is not met, the classical DTR and RL methods may yield sub-optimal policies. This is because some states and actions are less covered in the data, whose corresponding Q-values are difficult to learn, resulting in large variances and ultimately sub-optimal decisions. To address this issue, most existing offline RL methods adopt a pessimistic strategy, and derive policies that avoid uncertain regions that are less covered in the data. In particular, model-free offline RL methods learn a conservative Q-estimator that lower bounds the Q-function during the search for the optimal policy. We next briefly review a state-of-the-art solution of this type, the pessimistic value iteration method (PEVI) of Jin et al. (2021) based on linear models.
Consider a contextual bandit setting, where the offline data $\mathcal{D}_n$ consists of $n$ i.i.d. realizations $\{s_i, a_i, r_i\}_{i=1}^{n}$ of the state, action and reward tuple $\{S, A, R\}$, where $s_i$ collects the baseline covariates of the $i$th instance, $a_i$ is the action received, and $r_i$ is the corresponding reward. We assume $R$ is uniformly bounded and a larger value of $R$ indicates a better outcome. Denote the space of the covariates and actions by $\mathcal{S}$ and $\mathcal{A}$, respectively. In addition to estimating the conditional mean of the reward given the state-action pair, i.e., $Q(S, A) = E(R \mid S, A)$, Jin et al. (2021) proposed to also learn a $\xi$-uncertainty quantifier $\Gamma$, such that the event
$$\Omega = \big\{\, |\widehat{Q}(s, a) - Q(s, a)| \le \Gamma(s, a) \ \text{for all } (s, a) \,\big\} \qquad (1)$$
holds with probability at least $1 - \xi$ for any $\xi > 0$, where $\widehat{Q}$ is an estimator of $Q$. Instead of computing the greedy policy with respect to $\widehat{Q}$ as in the standard methods, they proposed to choose the greedy policy that maximizes the lower bound $\widehat{Q} - \Gamma$, and showed that the regret of the resulting policy is upper bounded by $E[\Gamma(S, \pi^*(S))]$, where $\pi^*$ is the true optimal policy. Note that this bound is much narrower than $E[\max_a \Gamma(S, a)]$, i.e., the regret bound without taking pessimism into account. They further showed that the resulting policy is minimax optimal in linear finite-horizon MDPs without the coverage assumption.
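In code, the difference between the standard greedy rule and this pessimistic rule is a one-line change: maximize $\widehat{Q} - \Gamma$ instead of $\widehat{Q}$. The toy numbers below are purely illustrative.

```python
def standard_greedy(q_hat, s, actions):
    # greedy with respect to the point estimate, ignoring uncertainty
    return max(actions, key=lambda a: q_hat(s, a))

def pessimistic_greedy(q_hat, gamma, s, actions):
    # greedy with respect to the lower bound Q_hat - Gamma
    return max(actions, key=lambda a: q_hat(s, a) - gamma(s, a))

# Toy illustration: action 1 has a higher point estimate but is poorly covered,
# so its uncertainty quantifier Gamma is large and pessimism avoids it.
q_hat = lambda s, a: {0: 1.0, 1: 1.2}[a]
gamma = lambda s, a: {0: 0.1, 1: 0.8}[a]
print(standard_greedy(q_hat, None, [0, 1]))            # -> 1
print(pessimistic_greedy(q_hat, gamma, None, [0, 1]))  # -> 0
```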
Figure 1: A toy example comparing the PEVI method of Jin et al. (2021) under different values of $c$ and our proposed method PBL.

Despite its nice theoretical properties, it is challenging to implement PEVI in practice due to the construction of a proper $\Gamma$ that meets the requirement in (1).
Jin et al. (2021) only developed a construction of $\Gamma$ under a linear MDP model, and it cannot be easily generalized to more complex machine learning models. Even in the linear model case, their construction relies on a hyperparameter $c$, and the resulting policy can be highly sensitive to the choice of $c$. In fact, this is common for many pessimism-based RL methods, which often involve some hyperparameter to quantify the degree of pessimism, and their performance relies heavily on the tightness of this uncertainty quantifier. We consider the following toy example to elaborate.
A toy example. Suppose we model $Q$ via a linear function: $f(s, a, w) = w^\top \phi(s, a)$, where $w \in \mathbb{R}^p$ is the coefficient of the linear basis function $\phi$ and is estimated by a ridge regression following Jin et al. (2021). They set
$$\Gamma(s, a) = c \big[\phi(s, a)^\top \Lambda^{-1} \phi(s, a)\big]^{1/2} \sqrt{\log(2dn/\xi)}, \qquad (2)$$
for some constant $c > 0$, where $\Lambda = \sum_{i=1}^{n} \phi(s_i, a_i) \phi(s_i, a_i)^\top + \lambda I$, $\lambda$ is the ridge parameter, and $I$ is the identity matrix. The choice of $c$ in (2) is crucial for the performance, as a small $c$ would fail to meet the requirement in (1) when the data coverage is inadequate, and a large $c$ would over-penalize the Q-function when the coverage is sufficient. Figure 1 compares the regret of our method and PEVI, where there are two treatments $\{1, 2\}$ and a two-dimensional state $S = (S_1, S_2)$. The reward $R$ is generated from a Gaussian distribution with mean $(0.8 + 0.2A)(S_1 + 2S_2)$ and variance $\sigma^2$, and the behavior is generated according to an $\epsilon$-greedy policy that combines a uniformly random policy with a pretrained optimal policy. In this example, $\epsilon$ characterizes the level of the coverage, and we consider two levels: $\epsilon = 0.95$, where sub-optimal actions are less explored, and $\epsilon = 0.5$, where the coverage holds. We vary the noise level $\sigma$, and compare our proposed method and PEVI under varying choices of $c \in \{0, 1, 2, 5, 10\}$. It is seen that PEVI is highly sensitive to $c$ under different values of $\epsilon$ and $\sigma$. By contrast, our proposed method takes a significance level as the input, which is fixed to 0.9 or 0.95 to ensure (1) holds with a large probability, and it achieves a much more stable performance.
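The following is a minimal sketch of the above construction: ridge estimation of $w$, the uncertainty quantifier of (2), and action selection that maximizes $\widehat{Q} - \Gamma$. The basis function, the uniformly random behavior actions, and the values of $c$, $\lambda$ and $\xi$ are illustrative assumptions; only the reward mean $(0.8 + 0.2A)(S_1 + 2S_2)$ mirrors the toy example above.

```python
import numpy as np

def pevi_bonus(Lambda, phi_sa, c=1.0, xi=0.05, n=1, d=1):
    # Gamma(s, a) from (2): c * [phi^T Lambda^{-1} phi]^{1/2} * sqrt(log(2 d n / xi))
    width = np.sqrt(phi_sa @ np.linalg.solve(Lambda, phi_sa))
    return c * width * np.sqrt(np.log(2 * d * n / xi))

def pessimistic_action(Phi, r, phi_fn, s, actions, c=1.0, lam=1.0, xi=0.05):
    n, d = Phi.shape
    Lambda = Phi.T @ Phi + lam * np.eye(d)
    w_hat = np.linalg.solve(Lambda, Phi.T @ r)                    # ridge estimate of w
    scores = [phi_fn(s, a) @ w_hat - pevi_bonus(Lambda, phi_fn(s, a), c, xi, n, d)
              for a in actions]
    return actions[int(np.argmax(scores))]

# Hypothetical offline data loosely mimicking the toy example (behavior actions uniform here)
phi_fn = lambda s, a: np.array([s[0], s[1], 0.0, 0.0]) if a == 1 else np.array([0.0, 0.0, s[0], s[1]])
rng = np.random.default_rng(1)
S = rng.normal(size=(200, 2)); A = rng.integers(1, 3, size=200)
R = (0.8 + 0.2 * A) * (S[:, 0] + 2.0 * S[:, 1]) + 0.5 * rng.standard_normal(200)
Phi = np.stack([phi_fn(s, a) for s, a in zip(S, A)])
print(pessimistic_action(Phi, R, phi_fn, s=np.array([0.5, 0.5]), actions=[1, 2], c=1.0))
```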
3 BAYESIAN LEARNING WITH PESSIMISM
3.1 Basic Idea: Offline Contextual Bandit
As discussed earlier, the success of the pessimism-based methods relies crucially on the uniform uncertainty quantification of the Q-function estimation. Existing solutions require a hyperparameter to properly quantify the degree of pessimism, whereas the choice of such a parameter can be difficult. To address this challenge, and to make the pessimism approach more generally applicable in the offline setting, we propose a data-driven procedure and derive the uniform uncertainty quantification, without requiring specific models or tuning the degree of pessimism when searching for the optimal decision rules. We first illustrate our idea through a single-stage contextual bandit problem in this section, and discuss the dynamic setting of dynamic treatment regimes in the next section.
Suppose we observe the data $\mathcal{D}_n = \{s_i, a_i, r_i\}_{i=1}^{n}$. Motivated by Thompson sampling, we propose to model the conditional reward distribution given the state-action pair by $p(r \mid s, a, w)$, and estimate the model parameter $w \in \mathbb{R}^p$ under a Bayesian framework. Specifically, we first apply BML to obtain the posterior distribution $p(w \mid \mathcal{D}_n)$, and construct a credible set $\mathcal{W}$ given the posterior, such that $P(w \in \mathcal{W} \mid \mathcal{D}_n) \ge 1 - \alpha$, where $1 - \alpha \in (0, 1)$ is the user-specified coverage rate, which usually takes the fixed value of 0.9 or 0.95.
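As an illustration of this step, a $1 - \alpha$ credible set can be read off directly from posterior draws. The sketch below assumes a Gaussian (e.g., BLBM) posterior and takes $\mathcal{W}$ to be the highest-posterior-density ellipsoid, represented by the retained posterior samples; this is one plausible construction, and how the credible set is actually built and turned into a uniform lower bound on the Q-function is developed in the remainder of this section.

```python
import numpy as np
from scipy.stats import chi2

def credible_set_draws(post_mean, post_cov, alpha=0.1, n_draws=2000, seed=0):
    """Posterior draws of w retained inside a 1 - alpha highest-posterior-density
    ellipsoid of a Gaussian posterior (one plausible representation of the set W)."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(post_mean, post_cov, size=n_draws)
    diff = draws - post_mean
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(post_cov), diff)  # Mahalanobis distances
    return draws[d2 <= chi2.ppf(1 - alpha, df=len(post_mean))]

# e.g., with the (hypothetical) BLBM posterior (m_post, S_post) sketched in Section 2.1:
# W_draws = credible_set_draws(m_post, S_post, alpha=0.05)
```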
Next, instead of choosing an action that maximizes the conditional mean function $f(s, a, w) = \int r \, p(r \mid s, a, w) \, dr$,