
Offline reinforcement learning (RL). Built on the Markov decision process (MDP), offline RL learns an optimal policy from historical data without any online interaction (Prudencio et al., 2022). It is thus highly relevant to precision-medicine-type applications. However, many RL algorithms
rely on a crucial coverage assumption, which requires the offline data distribution to provide good coverage of the state-action distributions induced by all candidate policies. This assumption may be too restrictive and may not hold in observational studies. To address this challenge, the pessimism principle has been adopted, which discourages recommending actions that are under-explored given the state. The solutions in this family can be roughly classified into two categories: model-based algorithms (see, e.g., Kidambi et al., 2020; Yu et al., 2020; Uehara and Sun, 2021; Yin et al., 2021), and model-free algorithms (see, e.g., Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Buckman et al., 2020; Kumar et al., 2020; Rezaeifar et al., 2021; Jin et al., 2021; Xie et al., 2021; Zanette et al., 2021; Bai et al., 2022; Fu et al., 2022). The main idea of the model-based solutions is to penalize the estimated reward or transition function at state-action pairs that are rarely seen in the offline data, whereas the model-free solutions learn a conservative Q-function that lower-bounds the oracle Q-function. Nevertheless, most of these solutions either require a well-specified parametric model or rely on a key hyperparameter to quantify the degree of pessimism. It is noteworthy that the performance of these solutions can be highly sensitive to the choice of this hyperparameter; see Section 2.2 for an illustration. In addition, many algorithms are developed in the context of long- or infinite-horizon Markov decision processes, and their generalizations to medical applications with non-Markovian, finite-horizon systems remain unknown. Finally, we note that there is concurrent work by Jeunen and Goethals (2021) that adopts a Bayesian framework for offline contextual bandits. However, their method requires linear function approximation and cannot handle complex nonlinear systems or more general sequential decision making.
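To make the pessimism principle concrete, the sketch below shows a count-based variant in a tabular setting: the Q-estimate of each action is shrunk by an uncertainty term that grows when the state-action pair is rarely observed, and the penalty scale plays the role of the degree-of-pessimism hyperparameter discussed above. The function name, the count-based penalty, and the scale `beta` are illustrative assumptions rather than any specific published algorithm.

```python
# A minimal sketch of count-based pessimism in a tabular, finite-horizon setting.
# The penalty form and the scale `beta` (degree of pessimism) are illustrative
# assumptions, not a specific published algorithm.
import numpy as np

def pessimistic_greedy_action(q_hat, counts, state, beta=1.0):
    """Select an action by maximizing a lower bound on the estimated Q-values.

    q_hat  : (n_states, n_actions) Q-estimates fitted on the offline data
    counts : (n_states, n_actions) visitation counts in the offline data
    beta   : hyperparameter controlling the degree of pessimism
    """
    # Uncertainty shrinks as a state-action pair is observed more often.
    uncertainty = beta / np.sqrt(np.maximum(counts[state], 1))
    # Penalized (conservative) Q-values discourage actions that are
    # rarely seen at this state.
    q_lower = q_hat[state] - uncertainty
    return int(np.argmax(q_lower))
```

Methods of this type differ mainly in how the uncertainty term is constructed and how `beta` is chosen; the sensitivity to that choice is precisely what motivates the tuning-free approach developed in this article.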
Thompson sampling. Thompson sampling (TS) is a popu-
lar Bayesian approach proposed by Thompson (1933) that
randomly draws each arm according to its posterior probability of being optimal, so as to balance the exploration-exploitation trade-off in online contextual bandit problems. It has demonstrated competitive performance in empirical applications. For instance, Chapelle and Li (2011) showed that TS outperforms the upper confidence bound (UCB) algorithm in both synthetic and real-data applications of advertisement and news article recommendation. The success
of TS can be attributed to the Bayesian framework it adopts.
In particular, the prior distribution serves as a regularizer to
prevent overfitting, which implicitly discourages exploita-
tion. In addition, actions are selected randomly at each time
step according to the posterior distribution, which explic-
itly encourages exploration and is useful in settings with
delayed feedback (Chapelle and Li, 2011).
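As an illustration of the mechanism, the sketch below implements TS for a Bernoulli bandit with independent Beta(1, 1) priors: at each step a mean reward is sampled from each arm's posterior and the arm with the largest sample is pulled, so each arm is selected with its posterior probability of being optimal. The environment callback `pull_arm` and the horizon are assumptions for illustration.

```python
# A minimal sketch of Thompson sampling for a Bernoulli bandit with
# independent Beta(1, 1) priors; `pull_arm` is a user-supplied callback
# returning a 0/1 reward and is assumed here for illustration.
import numpy as np

def thompson_sampling(pull_arm, n_arms, horizon, rng=None):
    rng = rng or np.random.default_rng()
    alpha = np.ones(n_arms)  # Beta posterior parameter: successes + 1
    beta = np.ones(n_arms)   # Beta posterior parameter: failures + 1
    for _ in range(horizon):
        # Sample a plausible mean reward for every arm from its posterior.
        sampled_means = rng.beta(alpha, beta)
        # Pull the arm whose sampled mean is largest.
        arm = int(np.argmax(sampled_means))
        reward = pull_arm(arm)
        # Conjugate Beta-Bernoulli update of the chosen arm's posterior.
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return alpha, beta
```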
Bayesian machine learning. Bayesian machine learning
(BML) is a paradigm for constructing machine learning
models based on Bayes' theorem, and has been successfully deployed in a wide range of applications (see, e.g.,
Seeger, 2006, for a review). Popular BML methods include
Bayesian linear basis model (Smith, 1973), variational au-
toencoder (Kingma and Welling, 2013), Bayesian random
forests (Quadrianto and Ghahramani, 2014), Bayesian neu-
ral network (Blundell et al., 2015), among many others.
An appealing feature of BML is that uncertainty quantification is straightforward through posterior sampling. In contrast, frequentist methods for uncertainty quantification based on asymptotic theory can be highly challenging to apply to complex machine learning models, whereas those based on the bootstrap can be computationally intensive for large datasets.
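As a simple example of posterior-sampling-based uncertainty quantification, the sketch below uses a conjugate Bayesian linear basis model: the Gaussian posterior over the weights is available in closed form, and pointwise credible intervals for the regression function follow directly from posterior draws. The prior precision `alpha`, noise precision `beta_noise`, and function name are illustrative assumptions.

```python
# A minimal sketch of uncertainty quantification by posterior sampling in a
# Bayesian linear basis model: Gaussian prior N(0, I/alpha) on the weights
# and Gaussian noise with precision `beta_noise` (both assumed hyperparameters).
import numpy as np

def posterior_credible_band(Phi, y, Phi_new, alpha=1.0, beta_noise=25.0,
                            n_draws=1000, level=0.95, rng=None):
    """Pointwise credible intervals for f(x) = phi(x)^T w at new design points.

    Phi     : (n, d) basis expansion of the training inputs
    y       : (n,)   training responses
    Phi_new : (m, d) basis expansion of the new inputs
    """
    rng = rng or np.random.default_rng()
    d = Phi.shape[1]
    # Conjugate Gaussian posterior over the weights: w | data ~ N(mean, cov).
    cov = np.linalg.inv(alpha * np.eye(d) + beta_noise * Phi.T @ Phi)
    mean = beta_noise * cov @ Phi.T @ y
    # Posterior sampling: draw weight vectors and push them through the basis.
    draws = rng.multivariate_normal(mean, cov, size=n_draws) @ Phi_new.T
    lo, hi = np.quantile(draws, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return lo, hi
```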
1.2 Our Proposal and Contributions
In this article, we propose a novel pessimism-based Bayesian learning approach for estimating optimal dynamic treatment regimes (DTRs) offline. We integrate the pessimism principle
and Thompson sampling with the Bayesian machine learn-
ing framework. In particular, we derive an explicit and uni-
form uncertainty quantification of the Q-function estimator
given the data, which in turn offers an alternative way of
constructing confidence intervals without having to specify
a parametric model or tune the degree of pessimism, as re-
quired by nearly all existing pessimism-based offline RL
and DTR algorithms. Compared to the RL and DTR algo-
rithms without pessimism, our method yields a better deci-
sion rule when the coverage condition is seriously violated,
and a comparable result when the coverage condition approximately holds. Compared to the RL and DTR algorithms that adopt pessimism, our method achieves more consistent and competitive performance. Theoretically, we show that the
regret of the proposed method depends only on the estima-
tion error of the optimal action’s Q-estimator, and we pro-
vide the explicit form of its upper bound in the special case of a parametric model. The resulting bound is much tighter than the regret of the standard Q-learning algorithm, which depends on the uniform estimation error of the Q-estimator across all actions. Methodologically, our approach is fairly
general, and works with a range of different BML models,
from the simple Bayesian linear basis model to the more complex Bayesian neural network. Scientifically, our proposal offers a viable solution to a critical problem in precision medicine, helping patients achieve the best individualized treatment strategy. Finally, computationally,
our algorithm is efficient and scalable to large datasets, as it
adopts a variational inference approach to approximate the
posterior distribution, and does not require computationally intensive posterior sampling methods such as Markov chain Monte Carlo (Geman and Geman, 1984).
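To convey the flavor of how these pieces fit together, the sketch below shows one way posterior draws of the Q-function at a given state (e.g., obtained from a variational approximation) could be turned into a pessimistic decision rule: act greedily with respect to a lower credible bound rather than the posterior mean. This is a conceptual illustration under simplifying assumptions, not the exact algorithm developed in this article.

```python
# A conceptual sketch: posterior draws of Q(state, .) -> pessimistic action.
# The draw matrix and the credible level are illustrative assumptions; this is
# not the paper's exact algorithm.
import numpy as np

def pessimistic_action(q_posterior_draws, level=0.05):
    """q_posterior_draws : (n_draws, n_actions) posterior samples of the
    Q-values at a fixed state, e.g., produced by variational inference."""
    # A posterior lower credible bound replaces a hand-tuned pessimism penalty:
    # poorly covered actions have wide posteriors and hence small lower bounds.
    q_lower = np.quantile(q_posterior_draws, level, axis=0)
    return int(np.argmax(q_lower))
```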