2.2.1. Value-based Reinforcement Learning
In the value iteration approach, the value function is updated following the Bellman optimality
equation (Bellman, 1952):
V_{k+1}(s) = \max_a \mathbb{E}\left[ r_{t+1} + \gamma V_k(S_{t+1}) \mid S_t = s, A_t = a \right]   (6)
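A minimal sketch of this backup is given below for a small tabular MDP with a known transition
model. The data layout P[s][a] = [(prob, next_state, reward), ...], the state/action counts, and the
stopping tolerance are illustrative assumptions, not part of the original text.

import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-6):
    """Repeatedly apply the Bellman optimality backup of Eq. (6) until convergence."""
    V = np.zeros(n_states)
    while True:
        V_new = np.empty_like(V)
        for s in range(n_states):
            # V_{k+1}(s) = max_a E[r + gamma * V_k(s')], expectation over the transition model
            V_new[s] = max(
                sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                for a in range(n_actions)
            )
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new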
Two classic approaches have been used to estimate V^\pi(s): the Monte-Carlo-based approach (MC) and
the Temporal-Difference-based approach (TD). In MC, starting from the current state s_t, the agent
interacts with the environment until a termination condition is reached. Then, the cumulative
reward G_t can be calculated based on given rules. The aim is to drive V_t(s) close to G_t, which
leads to the following update rule:
V_t(s) \leftarrow V_t(s) + \alpha \left( G_t - V_t(s) \right)   (7)
where α is the learning rate. Since the reward obtained by MC is only estimated at the end of the
episode, the cumulative reward can have a large variance. In contrast, TD simulates only one step of
the episode and uses the following update rule:
V_t(s_t) \leftarrow V_t(s_t) + \alpha \left( R_t + \gamma V_t(s_{t+1}) - V_t(s_t) \right)   (8)
which yields smaller variance but can be less accurate due to a lack of an overview of the whole
episode.
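The contrast between Eqs. (7) and (8) can be made concrete with the small sketch below, which
updates a tabular value function V (a dict mapping states to floats). The episode representation as
a list of (state, reward) pairs is an assumption for illustration.

def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """Eq. (7): episode is a list of (state, reward) pairs collected until termination."""
    G = 0.0
    # Walk backwards so G accumulates the discounted return observed from each state.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """Eq. (8): single-step bootstrap, applied without waiting for the episode to end."""
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])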
Typical TD-based strategies are Q-learning (Watkins and Dayan, 1992) and the State-Action-
Reward-State-Action (Sarsa) algorithm (Sutton, 1996), which replace V^\pi(s) with Q(s, a) in
Eq. (8). The update rule of Q-learning is:
Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a_{t+1}} Q^\pi(s_{t+1}, a_{t+1}) - Q^\pi(s_t, a_t) \right)   (9)
The update rule of Sarsa is:
Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha \left( r_t + \gamma Q^\pi(s_{t+1}, a_{t+1}) - Q^\pi(s_t, a_t) \right)   (10)
Both Q-learning and Sarsa involve two policies: a behavior policy that interacts with the
environment and samples actions with some randomness, and a target policy that is improved with
the help of the sampled data so as to obtain the optimal policy. Furthermore, according to how data
are used when updating value functions, RL methods can be divided into on-policy and off-policy
methods. On-policy methods update the policy that is used to make decisions, while off-policy
methods update a policy different from the one used to generate the data (Sutton et al., 1998).
Sarsa is an on-policy strategy (i.e., the target policy is the same as the behavior policy), while
Q-learning is an off-policy method (i.e., the target policy assumes the action with the largest
estimated value is selected when updating the value function).
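The sketch below contrasts the two update rules of Eqs. (9) and (10) and makes the on-policy versus
off-policy distinction explicit. The ε-greedy behavior policy and the Q-table layout (a NumPy array
of shape [n_states, n_actions]) are illustrative assumptions.

import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1):
    # Behavior policy: explore with probability epsilon, otherwise act greedily.
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[state]))

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy (Eq. 9): the target bootstraps from the greedy (max) action,
    # regardless of which action the behavior policy actually takes next.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy (Eq. 10): the target uses the action a_next actually chosen
    # by the same ε-greedy behavior policy.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])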
In some applications, a large number of states and actions can hardly be captured by tabular
Q-learning. Thus, deep models are used to approximate the value function. Mnih et al. (2015)
propose the Deep Q-Network (DQN) to find the optimal policy. Given a Q-function Q and a target
Q-function \hat{Q} initialized by \hat{Q} = Q, an experience replay buffer is used to store the
transition (s_t, a_t, r_t, s_{t+1}) at each time step, where a_t is obtained from Q. When enough
samples have been collected, a mini-batch of samples is drawn at random and the target value is
computed with \hat{Q}:
y = r_i + \gamma \max_a \hat{Q}(s_{i+1}, a)   (11)
Then, the parameters of Q are updated by driving Q(s_i, a_i) close to y with the gradient descent
method. The target network \hat{Q} is reset every C steps by \hat{Q} = Q.
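A compact training-step sketch of this procedure is given below: transitions are stored in a replay
buffer, the mini-batch target follows Eq. (11) using the frozen network \hat{Q}, and \hat{Q} is hard-reset
to Q every C steps. The small fully connected network (4-dimensional states, 2 actions), the PyTorch
implementation, the MSE loss, and the hyper-parameters are illustrative assumptions; Mnih et al.
(2015) used a convolutional network on raw pixels.

import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn

# Online network Q and target network Q_hat, initialized identically (Q_hat = Q).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=100_000)   # experience replay buffer of (s, a, r, s_next, done)
gamma, C = 0.99, 1_000

def train_step(step, batch_size=32):
    if len(buffer) < batch_size:
        return
    batch = random.sample(buffer, batch_size)             # uniform random mini-batch
    s, a, r, s_next, done = zip(*batch)
    s = torch.tensor(np.array(s), dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    s_next = torch.tensor(np.array(s_next), dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)
    # Eq. (11): y_i = r_i + gamma * max_a Q_hat(s_{i+1}, a); no bootstrap at terminal states.
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values * (1.0 - done)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s_i, a_i)
    loss = nn.functional.mse_loss(q_sa, y)                # drive Q(s_i, a_i) toward y
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % C == 0:                                     # hard reset Q_hat = Q every C steps
        target_net.load_state_dict(q_net.state_dict())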
Further DQN-based methods, such as Double-DQN (Van Hasselt et al., 2016) and Dueling-DQN
(Wang et al., 2016), have been developed for more robust and faster policy learning. In detail, to
reduce the overestimation caused by DQN (i.e., the estimated value being larger than the true value),
Double-DQN separates the selection and the evaluation of actions using two different value functions,
Q_A(s, a) and Q_B(s, a). The update rules can be represented as:
Q_A(s_t, a_t) \leftarrow Q_A(s_t, a_t) + \alpha \left( r_t + \gamma Q_B(s_{t+1}, \arg\max_a Q_A(s_{t+1}, a)) - Q_A(s_t, a_t) \right)
Q_B(s_t, a_t) \leftarrow Q_B(s_t, a_t) + \alpha \left( r_t + \gamma Q_A(s_{t+1}, \arg\max_a Q_B(s_{t+1}, a)) - Q_B(s_t, a_t) \right)   (12)
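The sketch below shows Eq. (12) in its tabular double Q-learning form: one table selects the action
and the other evaluates it, which reduces the maximization bias of standard Q-learning. The table
layout and the random choice of which table to update are illustrative assumptions.

import numpy as np

def double_q_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one of the two updates of Eq. (12), chosen at random."""
    if np.random.rand() < 0.5:
        a_star = int(np.argmax(QA[s_next]))          # select the action with QA ...
        target = r + gamma * QB[s_next, a_star]      # ... but evaluate it with QB
        QA[s, a] += alpha * (target - QA[s, a])
    else:
        b_star = int(np.argmax(QB[s_next]))          # select the action with QB ...
        target = r + gamma * QA[s_next, b_star]      # ... but evaluate it with QA
        QB[s, a] += alpha * (target - QB[s, a])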