3. Policy-based reinforcement learning
Policy Function Approximation
policy function
- The policy function $\pi(a \mid s)$ is a probability density function (PDF).
- It takes the state $s$ as input.
- It outputs the probabilities of all the actions, e.g., $\pi(\text{left} \mid s) = 0.2$, $\pi(\text{right} \mid s) = 0.1$, $\pi(\text{up} \mid s) = 0.7$.
- The agent performs an action randomly drawn from this distribution.
policy network
Policy network: use a neural net $\pi(a \mid s; \boldsymbol{\theta})$ to approximate $\pi(a \mid s)$.
- Use the policy network $\pi(a \mid s; \boldsymbol{\theta})$ to approximate the policy function $\pi(a \mid s)$.
- $\boldsymbol{\theta}$: trainable parameters of the neural net.
Properties:
- $\pi(a \mid s; \boldsymbol{\theta}) \ge 0$ for every action.
- $\sum_{a \in \mathcal{A}} \pi(a \mid s; \boldsymbol{\theta}) = 1$; here, $\mathcal{A}$ denotes the set of all actions.
- That is why we use softmax activation in the output layer.
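To make the above concrete, here is a small sketch of such a policy network in PyTorch; the layer sizes, state dimension, and number of actions are placeholders chosen for illustration, not values prescribed by the notes.

```python
import torch
import torch.nn as nn

# Policy network sketch: state in, action probabilities out.
class PolicyNet(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden)
        self.fc2 = nn.Linear(hidden, num_actions)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.fc1(s))
        # softmax makes the outputs non-negative and sum to 1,
        # so they form a valid distribution pi(. | s; theta)
        return torch.softmax(self.fc2(x), dim=-1)

# the agent draws an action at random from pi(. | s; theta)
policy = PolicyNet(state_dim=4, num_actions=3)
s = torch.randn(1, 4)                         # placeholder state
probs = policy(s)                             # probabilities of all actions
a = torch.multinomial(probs, num_samples=1)   # randomly drawn action
```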
state-value function
- Definition: discounted return: $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \cdots$.
- Definition: action-value function: $Q_\pi(s_t, a_t) = \mathbb{E}\big[U_t \mid S_t = s_t, A_t = a_t\big]$.
- Definition: state-value function: $V_\pi(s_t) = \mathbb{E}_A\big[Q_\pi(s_t, A)\big] = \sum_a \pi(a \mid s_t)\, Q_\pi(s_t, a)$.
policy-based RL
- Definition: approximate state-value function: $V(s; \boldsymbol{\theta}) = \sum_a \pi(a \mid s; \boldsymbol{\theta})\, Q_\pi(s, a)$.
- Policy-based learning: learn $\boldsymbol{\theta}$ that maximizes $J(\boldsymbol{\theta}) = \mathbb{E}_S\big[V(S; \boldsymbol{\theta})\big]$.
How to improve $\boldsymbol{\theta}$? Policy gradient ascent:
- Observe state $s_t$.
- Update the policy by: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \beta \cdot \frac{\partial V(s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}$.
policy gradient
Policy gradient: the derivative of $V(s; \boldsymbol{\theta})$ w.r.t. $\boldsymbol{\theta}$:
$\frac{\partial V(s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \sum_a \frac{\partial \pi(a\mid s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s,a)$ (Form 1) $= \mathbb{E}_{A\sim\pi(\cdot\mid s;\boldsymbol{\theta})}\!\left[\frac{\partial \log \pi(A\mid s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s,A)\right]$ (Form 2).
The derivation here is over-simplified: it treats $Q_\pi(s,a)$ as if it did not depend on $\boldsymbol{\theta}$.
- If the actions are discrete, use Form 1: compute $f(a,\boldsymbol{\theta}) = \frac{\partial \pi(a\mid s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s,a)$ for every action $a \in \mathcal{A}$, so: $\frac{\partial V(s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \sum_{a\in\mathcal{A}} f(a,\boldsymbol{\theta})$. This approach does not work for continuous actions.
- If the actions are continuous, e.g., $\mathcal{A} = [0, 1]$, use Form 2:
  - Randomly sample an action $\hat{a}$ according to the PDF $\pi(\cdot \mid s; \boldsymbol{\theta})$.
  - Calculate $\mathbf{g}(\hat{a},\boldsymbol{\theta}) = \frac{\partial \log \pi(\hat{a}\mid s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s,\hat{a})$.
  - Use $\mathbf{g}(\hat{a},\boldsymbol{\theta})$ as an (unbiased) approximation to the policy gradient $\frac{\partial V(s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}$.
  This Monte Carlo approach also works for discrete actions.
algorithm
1. Observe the state $s_t$.
2. Randomly sample action $a_t$ according to $\pi(\cdot \mid s_t; \boldsymbol{\theta}_t)$.
3. Compute $q_t \approx Q_\pi(s_t, a_t)$ (some estimate).
4. Differentiate the policy network: $\mathbf{d}_{\theta,t} = \frac{\partial \log \pi(a_t \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\Big|_{\boldsymbol{\theta}=\boldsymbol{\theta}_t}$.
5. Approximate the policy gradient: $\mathbf{g}(a_t, \boldsymbol{\theta}_t) = q_t \cdot \mathbf{d}_{\theta,t}$.
6. Update the policy network: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta \cdot \mathbf{g}(a_t, \boldsymbol{\theta}_t)$.
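As an illustration of steps 4 to 6, here is a sketch of a single update in PyTorch, assuming `policy` is the PolicyNet sketched earlier, `a_t` is the index of the sampled action, and `q_t` is whatever estimate of $Q_\pi(s_t, a_t)$ is available (see the two options below). The learning rate $\beta$ lives inside the optimizer.

```python
import torch

def policy_gradient_step(policy, optimizer, s_t, a_t, q_t):
    probs = policy(s_t)                  # pi(. | s_t; theta), shape (1, |A|)
    log_prob = torch.log(probs[0, a_t])  # log pi(a_t | s_t; theta)
    loss = -q_t * log_prob               # ascent on q_t * log pi == descent on its negative
    optimizer.zero_grad()
    loss.backward()                      # g(a_t, theta) = q_t * d log pi / d theta
    optimizer.step()                     # theta <- theta + beta * g(a_t, theta)
```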
Question: how to compute $q_t \approx Q_\pi(s_t, a_t)$?
option 1: REINFORCE
- Play the game to the end and generate the trajectory: $s_1, a_1, r_1,\ s_2, a_2, r_2,\ \ldots,\ s_T, a_T, r_T$.
- Compute the discounted return $u_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ for all $t$.
- Since $Q_\pi(s_t, a_t) = \mathbb{E}[U_t]$, we can use the observed $u_t$ to approximate $Q_\pi(s_t, a_t)$, i.e., set $q_t = u_t$.
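A short helper for the return computation above; `rewards` is the list of observed rewards of one finished episode.

```python
def discounted_returns(rewards, gamma=0.99):
    """u_t = sum_{k=t}^{T} gamma^(k-t) * r_k, computed backwards in one pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # u_t = r_t + gamma * u_{t+1}
        returns[t] = running
    return returns
```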
Option 2: approximate $Q_\pi(s_t, a_t)$ using a neural network; this is the actor-critic method.
4. Actor-Critic Methods
Definition: state-value function: $V_\pi(s) = \sum_a \pi(a \mid s)\, Q_\pi(s, a)$.
Policy network (actor):
- Use a neural net $\pi(a \mid s; \boldsymbol{\theta})$ to approximate $\pi(a \mid s)$.
Value network (critic):
- Use a neural net $q(s, a; \mathbf{w})$ to approximate $Q_\pi(s, a)$.
So:
$V_\pi(s) \approx V(s; \boldsymbol{\theta}, \mathbf{w}) = \sum_a \pi(a \mid s; \boldsymbol{\theta})\, q(s, a; \mathbf{w})$.
policy network(actor)
The structure is the same as the policy network in the Super Mario example: the input is the state $s$, and the softmax output layer gives the probability of every action.
value network(critic)
- Inputs: state $s$ and action $a$.
- Output: approximate action-value $q(s, a; \mathbf{w})$ (a scalar).
train the networks
Training: update the parameters $\boldsymbol{\theta}$ and $\mathbf{w}$.
- Update the policy network $\pi(a \mid s; \boldsymbol{\theta})$ to increase the state-value $V(s; \boldsymbol{\theta}, \mathbf{w})$. The actor gradually performs better. Supervision is purely from the value network (critic).
- Update the value network $q(s, a; \mathbf{w})$ to better estimate the return. The critic's judgement becomes more accurate. Supervision is purely from the rewards.
algorithm
1. Observe the state $s_t$.
2. Randomly sample action $a_t$ according to $\pi(\cdot \mid s_t; \boldsymbol{\theta}_t)$.
3. Perform $a_t$ and observe the new state $s_{t+1}$ and reward $r_t$.
4. Update $\mathbf{w}$ (in the value network) using temporal difference (TD).
5. Update $\boldsymbol{\theta}$ (in the policy network) using the policy gradient.
update value network q using TD
- Compute $q(s_t, a_t; \mathbf{w}_t)$ and $q(s_{t+1}, a_{t+1}; \mathbf{w}_t)$.
- TD target: $y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1}; \mathbf{w}_t)$.
- Loss: $L(\mathbf{w}) = \frac{1}{2}\big[q(s_t, a_t; \mathbf{w}) - y_t\big]^2$.
- Gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \frac{\partial L(\mathbf{w})}{\partial \mathbf{w}}\Big|_{\mathbf{w}=\mathbf{w}_t}$.
update policy network π using policy gradient
- Let $\mathbf{g}(a, \boldsymbol{\theta}) = \frac{\partial \log \pi(a \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot q(s_t, a; \mathbf{w})$.
- Then $\frac{\partial V(s_t; \boldsymbol{\theta}, \mathbf{w})}{\partial \boldsymbol{\theta}} = \mathbb{E}_{A \sim \pi(\cdot \mid s_t; \boldsymbol{\theta})}\big[\mathbf{g}(A, \boldsymbol{\theta})\big]$.
- Monte Carlo approximation: sample $a \sim \pi(\cdot \mid s_t; \boldsymbol{\theta}_t)$ and use $\mathbf{g}(a, \boldsymbol{\theta}_t)$ as an unbiased estimate of the policy gradient. So:
  $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta \cdot \mathbf{g}(a, \boldsymbol{\theta}_t)$.
Actor-critic algorithm
1. Observe state $s_t$ and randomly sample $a_t \sim \pi(\cdot \mid s_t; \boldsymbol{\theta}_t)$.
2. Perform $a_t$; then the environment gives the new state $s_{t+1}$ and reward $r_t$.
3. Randomly sample $\tilde{a}_{t+1} \sim \pi(\cdot \mid s_{t+1}; \boldsymbol{\theta}_t)$ (it is not actually performed).
4. Evaluate the value network: $q_t = q(s_t, a_t; \mathbf{w}_t)$ and $q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1}; \mathbf{w}_t)$.
5. Compute the TD error: $\delta_t = q_t - (r_t + \gamma \cdot q_{t+1})$.
6. Differentiate the value network: $\mathbf{d}_{w,t} = \frac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}\Big|_{\mathbf{w}=\mathbf{w}_t}$.
7. Update the value network: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \delta_t \cdot \mathbf{d}_{w,t}$.
8. Differentiate the policy network: $\mathbf{d}_{\theta,t} = \frac{\partial \log \pi(a_t \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\Big|_{\boldsymbol{\theta}=\boldsymbol{\theta}_t}$.
9. Update the policy network: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta \cdot q_t \cdot \mathbf{d}_{\theta,t}$.

(Alternative step 9): $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta \cdot \delta_t \cdot \mathbf{d}_{\theta,t}$. This method is called policy gradient with baseline.
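The sketch below walks through one iteration of steps 1 to 9 in PyTorch. It reuses the PolicyNet from the earlier sketch and adds a value network; the one-hot action encoding, layer sizes, and function names are choices of this sketch, not part of the lecture. Here `a_t` is a long tensor holding the index of the performed action.

```python
import torch
import torch.nn as nn

# Value network (critic): q(s, a; w), one scalar per (state, action) pair.
class ValueNet(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.num_actions = num_actions
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        a_onehot = nn.functional.one_hot(a, self.num_actions).float()
        return self.net(torch.cat([s, a_onehot], dim=-1)).squeeze(-1)

def actor_critic_step(policy, critic, policy_opt, critic_opt,
                      s_t, a_t, r_t, s_next, gamma=0.99):
    # step 3: sample a_tilde ~ pi(. | s_{t+1}; theta); it is not performed
    with torch.no_grad():
        a_tilde = torch.multinomial(policy(s_next), 1).squeeze(-1)
        q_next = critic(s_next, a_tilde)              # q_{t+1}
    q_t = critic(s_t, a_t)                            # step 4
    td_error = q_t - (r_t + gamma * q_next)           # step 5: delta_t
    # steps 6-7: critic update (gradient of delta^2 / 2 w.r.t. w)
    critic_loss = 0.5 * td_error.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    # steps 8-9: actor update; replacing q_t with td_error here gives
    # the "policy gradient with baseline" variant
    log_prob = torch.log(policy(s_t)[0, a_t])
    actor_loss = (-q_t.detach() * log_prob).mean()
    policy_opt.zero_grad()
    actor_loss.backward()
    policy_opt.step()
```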
5. SARSA
derive TD Target
- Definition: discounted return: $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots$.
- So: $U_t = R_t + \gamma \cdot U_{t+1}$.
- Assume $R_t$ depends on $(S_t, A_t, S_{t+1})$.
- Then $Q_\pi(s_t, a_t) = \mathbb{E}[U_t \mid s_t, a_t] = \mathbb{E}[R_t \mid s_t, a_t] + \gamma \cdot \mathbb{E}[Q_\pi(S_{t+1}, A_{t+1}) \mid s_t, a_t]$.
- Identity: $Q_\pi(s_t, a_t) = \mathbb{E}\big[R_t + \gamma \cdot Q_\pi(S_{t+1}, A_{t+1})\big]$, for all $\pi$.
- TD learning: encourage $Q_\pi(s_t, a_t)$ to approach $r_t + \gamma \cdot Q_\pi(s_{t+1}, a_{t+1})$, an estimate of the expectation above.
tabular version
- We want to learn $Q_\pi(s, a)$.
- Suppose the numbers of states and actions are finite.
- Draw a table (one entry per $(s, a)$ pair) and learn the table.
algorithm
1. Observe a transition $(s_t, a_t, r_t, s_{t+1})$.
2. Sample $a_{t+1} \sim \pi(\cdot \mid s_{t+1})$, where $\pi$ is the policy function.
3. TD target: $y_t = r_t + \gamma \cdot Q_\pi(s_{t+1}, a_{t+1})$.
4. TD error: $\delta_t = Q_\pi(s_t, a_t) - y_t$.
5. Update: $Q_\pi(s_t, a_t) \leftarrow Q_\pi(s_t, a_t) - \alpha \cdot \delta_t$.
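A short sketch of this tabular update; the table shape (10 states, 4 actions) is a placeholder, and `pi` stands for any function that samples an action from the current policy.

```python
import numpy as np

Q = np.zeros((10, 4))   # placeholder Q-table: 10 states, 4 actions

def sarsa_update(Q, s_t, a_t, r_t, s_next, pi, alpha=0.1, gamma=0.9):
    a_next = pi(s_next)                    # sample a_{t+1} ~ pi(. | s_{t+1})
    y_t = r_t + gamma * Q[s_next, a_next]  # TD target
    delta = Q[s_t, a_t] - y_t              # TD error
    Q[s_t, a_t] -= alpha * delta           # update the table entry
    return a_next
```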
SARSA's name
- We use the tuple $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$ for updating $Q_\pi(s_t, a_t)$.
- State-Action-Reward-State-Action (SARSA).
SARSA: neural network (value network) version
- Approximate $Q_\pi(s, a)$ by the value network $q(s, a; \mathbf{w})$.
- $q(s, a; \mathbf{w})$ is used as the critic that evaluates the actor.
- We want to learn the parameter $\mathbf{w}$.
- TD target: $y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1}; \mathbf{w})$.
- TD error: $\delta_t = q(s_t, a_t; \mathbf{w}) - y_t$.
- Loss: $\delta_t^2 / 2$.
- Gradient: $\frac{\partial\, \delta_t^2/2}{\partial \mathbf{w}} = \delta_t \cdot \frac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}$.
- Gradient descent: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}$.
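A compact sketch of this update, assuming `critic` is the ValueNet from the actor-critic sketch and `a_next` is the action actually sampled from the policy at $s_{t+1}$. Note the contrast with the Q-learning/DQN update in the next section, whose TD target takes a max over actions instead of using the sampled $a_{t+1}$.

```python
import torch

def sarsa_nn_update(critic, optimizer, s_t, a_t, r_t, s_next, a_next, gamma=0.99):
    with torch.no_grad():
        y_t = r_t + gamma * critic(s_next, a_next)   # TD target (sampled a_{t+1})
    delta = critic(s_t, a_t) - y_t                   # TD error
    loss = 0.5 * delta.pow(2).mean()                 # loss = delta^2 / 2
    optimizer.zero_grad()
    loss.backward()                                  # gradient = delta * dq/dw
    optimizer.step()                                 # w <- w - alpha * gradient
```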
summary
- Goal: learn the action-value function $Q_\pi(s, a)$.
- Tabular version: finite states and actions; learn the table of $Q_\pi(s, a)$ values directly.
- Value network version: learn the parameter $\mathbf{w}$ of the value network $q(s, a; \mathbf{w})$.
6. Q-learning
Q-learning is also a TD algorithm.
SARSA vs Q-learning
SARSA
- SARSA is for training the action-value function $Q_\pi(s, a)$.
- TD target: $y_t = r_t + \gamma \cdot Q_\pi(s_{t+1}, a_{t+1})$.
- We used SARSA for updating the value network (critic).
Q-learning
- Q-learning is for training the optimal action-value function $Q^\star(s, a)$.
- TD target: $y_t = r_t + \gamma \cdot \max_a Q^\star(s_{t+1}, a)$.
- We used Q-learning for training DQN.
derive TD Target
- We have proved that, for all $\pi$: $Q_\pi(s_t, a_t) = \mathbb{E}\big[R_t + \gamma \cdot Q_\pi(S_{t+1}, A_{t+1})\big]$.
- If $\pi$ is the optimal policy $\pi^\star$, then $Q_{\pi^\star}(s_t, a_t) = \mathbb{E}\big[R_t + \gamma \cdot Q_{\pi^\star}(S_{t+1}, A_{t+1})\big]$.
- $Q_{\pi^\star}$ and $Q^\star$ both denote the optimal action-value function.
- Identity: $Q^\star(s_t, a_t) = \mathbb{E}\big[R_t + \gamma \cdot Q^\star(S_{t+1}, A_{t+1})\big]$.
- The action $A_{t+1}$ is computed by $A_{t+1} = \arg\max_a Q^\star(S_{t+1}, a)$, so $Q^\star(S_{t+1}, A_{t+1}) = \max_a Q^\star(S_{t+1}, a)$.
- So: $Q^\star(s_t, a_t) = \mathbb{E}\big[R_t + \gamma \cdot \max_a Q^\star(S_{t+1}, a)\big]$.
- Monte Carlo approximation: the TD target is $y_t = r_t + \gamma \cdot \max_a Q^\star(s_{t+1}, a)$.
tabular version
1. Observe a transition $(s_t, a_t, r_t, s_{t+1})$.
2. TD target: $y_t = r_t + \gamma \cdot \max_a Q^\star(s_{t+1}, a)$.
3. TD error: $\delta_t = Q^\star(s_t, a_t) - y_t$.
4. Update: $Q^\star(s_t, a_t) \leftarrow Q^\star(s_t, a_t) - \alpha \cdot \delta_t$.
DQN version
- Approximate $Q^\star(s, a)$ by DQN, $Q(s, a; \mathbf{w})$.
- DQN controls the agent by: $a_t = \arg\max_a Q(s_t, a; \mathbf{w})$.
- We seek to learn the parameter $\mathbf{w}$.
Algorithm:
1. Observe a transition $(s_t, a_t, r_t, s_{t+1})$.
2. TD target: $y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a; \mathbf{w})$.
3. TD error: $\delta_t = Q(s_t, a_t; \mathbf{w}) - y_t$.
4. Update: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial Q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}$.
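The sketch below implements this update in PyTorch. The network outputs one Q-value per action, so the max in the TD target is a max over the output vector; the layer sizes and function names are placeholders of this sketch, and `a_t` is a long tensor of action indices.

```python
import torch
import torch.nn as nn

# DQN sketch: state in, one Q-value per action out.
class DQN(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

def q_learning_update(dqn, optimizer, s_t, a_t, r_t, s_next, gamma=0.99):
    with torch.no_grad():
        y_t = r_t + gamma * dqn(s_next).max(dim=-1).values     # TD target
    q_sa = dqn(s_t).gather(-1, a_t.unsqueeze(-1)).squeeze(-1)  # Q(s_t, a_t; w)
    delta = q_sa - y_t                                         # TD error
    loss = 0.5 * delta.pow(2).mean()
    optimizer.zero_grad()
    loss.backward()                                            # delta * dQ/dw
    optimizer.step()                                           # w <- w - alpha * gradient
```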
7. Multi-step TD
One-step SARSA TD target: $y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1}; \mathbf{w})$. One-step Q-learning TD target: $y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a; \mathbf{w})$. Both use only a single reward, $r_t$.
multi-step return
- Discounted return: $U_t = R_t + \gamma \cdot U_{t+1}$.
- Applying the recursion m times: $U_t = \sum_{i=0}^{m-1} \gamma^i R_{t+i} + \gamma^m U_{t+m}$ (the m-step return).
- m-step TD target for SARSA: $y_t = \sum_{i=0}^{m-1} \gamma^i r_{t+i} + \gamma^m \cdot q(s_{t+m}, a_{t+m}; \mathbf{w})$.
- m-step TD target for Q-learning: $y_t = \sum_{i=0}^{m-1} \gamma^i r_{t+i} + \gamma^m \cdot \max_a Q(s_{t+m}, a; \mathbf{w})$.
- If m is suitably tuned, the m-step target works better than the one-step target.
Hessel et al. "Rainbow: Combining Improvements in Deep Reinforcement Learning." In AAAI, 2018.
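A small helper for the m-step target above; `rewards_m` holds the m observed rewards $r_t, \ldots, r_{t+m-1}$, and `bootstrap_value` is the value plugged in at step $t+m$: $q(s_{t+m}, a_{t+m}; \mathbf{w})$ for SARSA, or $\max_a Q(s_{t+m}, a; \mathbf{w})$ for Q-learning.

```python
def m_step_td_target(rewards_m, bootstrap_value, gamma=0.99):
    y = 0.0
    for i, r in enumerate(rewards_m):
        y += (gamma ** i) * r                         # sum_{i=0}^{m-1} gamma^i * r_{t+i}
    y += (gamma ** len(rewards_m)) * bootstrap_value  # + gamma^m * bootstrap value
    return y
```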
8. Experience Replay
Shortcoming 1: waste of experience. Shortcoming 2: correlated updates.
- Previously, we used each transition $(s_t, a_t, r_t, s_{t+1})$ sequentially to update $\mathbf{w}$ and then discarded it, which wastes experience.
- Consecutive states $s_t$ and $s_{t+1}$ are strongly correlated (which is bad for the updates).
experience replay
- A transition: $(s_t, a_t, r_t, s_{t+1})$.
- Store the recent n transitions in a replay buffer.
- Remove old transitions so that the buffer holds at most n transitions.
- The buffer capacity n is a tuning hyper-parameter.
Fedus et al. "Revisiting Fundamentals of Experience Replay." In ICML, 2020.
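A minimal replay buffer along the lines described above; the deque-based implementation and uniform sampling of a single transition are choices of this sketch.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int):
        # a deque with maxlen drops the oldest transition automatically
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self):
        return random.choice(self.buffer)   # uniform sampling of one transition

    def __len__(self):
        return len(self.buffer)
```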
TD with Experience replay
- Find $\mathbf{w}$ by minimizing $L(\mathbf{w}) = \frac{1}{T} \sum_{t=1}^{T} \frac{\delta_t^2}{2}$.
- Stochastic gradient descent (SGD):
  - Randomly sample a transition $(s_i, a_i, r_i, s_{i+1})$ from the buffer.
  - Compute the TD error $\delta_i$.
  - Stochastic gradient: $\mathbf{g}_i = \frac{\partial\, \delta_i^2/2}{\partial \mathbf{w}} = \delta_i \cdot \frac{\partial Q(s_i, a_i; \mathbf{w})}{\partial \mathbf{w}}$.
  - SGD: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \mathbf{g}_i$.
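Putting the pieces together, a sketch of one training step with experience replay, reusing the ReplayBuffer and the q_learning_update function from the earlier sketches; `transition` is the newly observed tuple $(s_t, a_t, r_t, s_{t+1})$ stored as tensors in the format those sketches expect.

```python
buffer = ReplayBuffer(capacity=10_000)   # the capacity n is a hyper-parameter

def replay_training_step(dqn, optimizer, transition, gamma=0.99):
    buffer.store(*transition)                # keep the new experience
    s_i, a_i, r_i, s_next = buffer.sample()  # random transition from the buffer
    q_learning_update(dqn, optimizer, s_i, a_i, r_i, s_next, gamma)  # one SGD step
```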
benefits of experience replay
- It makes the updates uncorrelated.
- It reuses the collected experience many times.
prioritized experience replay
basic idea
- Not all transitions are equally important.
- Which kind of transition is more important?
- How do we know which transition is important?
- If a transition has a high TD error $|\delta_t|$, it will be given high priority.
importance sampling
- Use non-uniform (importance) sampling instead of uniform sampling.
- Option 1: sampling probability $p_t \propto |\delta_t| + \epsilon$.
- Option 2: sampling probability $p_t \propto \frac{1}{\mathrm{rank}(t)}$, where $\mathrm{rank}(t)$ is the rank of $|\delta_t|$ in descending order.
- In sum, transitions with big $|\delta_t|$ shall be given high priority.
scaling learning rate
- SGD: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \mathbf{g}$, where $\alpha$ is the learning rate.
- If uniform sampling is used, $\alpha$ is the same for all transitions.
- If importance sampling is used, $\alpha$ shall be adjusted according to the importance.
- Scale the learning rate by $(n \cdot p_t)^{-\beta}$, where $\beta \in (0, 1)$.
- In the beginning, set $\beta$ small; increase $\beta$ to 1 over time.
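A sketch of option 1 together with the learning-rate scaling, using NumPy; the function name and the small constant added to $|\delta_t|$ are choices of this sketch.

```python
import numpy as np

def sample_prioritized(td_errors, base_lr, beta, eps=1e-6, rng=np.random):
    priorities = np.abs(td_errors) + eps        # priority ~ |delta_t| + eps
    probs = priorities / priorities.sum()       # sampling probability p_t
    n = len(td_errors)
    idx = rng.choice(n, p=probs)                # non-uniform (importance) sampling
    lr = base_lr * (n * probs[idx]) ** (-beta)  # learning rate scaled by (n * p_t)^(-beta)
    return idx, lr
```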
update TD Error
- Associate each transition $(s_t, a_t, r_t, s_{t+1})$ in the buffer with a TD error $\delta_t$; whenever a transition is used for an update, recompute and store its $\delta_t$.
- If a transition is newly collected, its $\delta_t$ is unknown; simply set its $\delta_t$ to the maximum, so that it gets the highest priority.
The content above is compiled from Wang Shusen's lecture notes.