3. Policy-based reinforcement learning
Policy Function Approximation
policy function
the policy function $\pi(a \mid s)$ is a probability density function (PDF).
it takes the state $s$ as input.
it outputs the probabilities of all the actions, i.e., $\pi(a \mid s)$ for every action $a$.
the agent performs an action $a$ randomly drawn from this distribution.
policy network
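policy network: use a neural net $\pi(a \mid s; \theta)$ to approximate the policy function $\pi(a \mid s)$.
$\theta$: trainable parameters of the neural net.
here, $\sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) = 1$.
that is why we use the softmax activation in the output layer.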
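A minimal sketch of such a policy, assuming a single linear layer followed by softmax instead of a real neural net. The weight layout (`theta` with one row per action), the feature vector `state`, and all function names are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(state, theta):
    """Return pi(. | state; theta) for one linear layer followed by softmax.

    theta holds one weight row per action (an assumed layout).
    """
    logits = [sum(w * x for w, x in zip(row, state)) for row in theta]
    return softmax(logits)


def sample_action(probs, rng=random):
    """Draw an action index at random according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


theta = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]  # 3 actions, 2 state features
state = [1.0, 2.0]
probs = policy(state, theta)   # positive entries that sum to 1
action = sample_action(probs)  # random draw, not the argmax
print(probs, action)
```

The softmax output is positive and sums to one, so it can be used directly as the sampling distribution over actions.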
state-value function
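definition: discounted return. $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots$
definition: action-value function. $Q_\pi(s_t, a_t) = \mathbb{E}\left[U_t \mid S_t = s_t, A_t = a_t\right]$
definition: state-value function. $V_\pi(s_t) = \mathbb{E}_{A}\left[Q_\pi(s_t, A)\right] = \sum_{a} \pi(a \mid s_t) \cdot Q_\pi(s_t, a)$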
policy-based RL
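definition: approximate state-value function. $V(s; \theta) = \sum_{a} \pi(a \mid s; \theta) \cdot Q_\pi(s, a)$
policy-based learning: learn the $\theta$ that maximizes $J(\theta) = \mathbb{E}_{S}\left[V(S; \theta)\right]$.
how to improve $\theta$? policy gradient ascent:
observe state $s$.
update policy by: $\theta \leftarrow \theta + \beta \cdot \frac{\partial V(s; \theta)}{\partial \theta}$, where $\beta$ is the learning rate.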
policy gradient
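policy gradient: the derivative of the approximate state-value function $V(s; \theta)$ w.r.t. $\theta$.
form 1: $\frac{\partial V(s; \theta)}{\partial \theta} = \sum_{a} \frac{\partial \pi(a \mid s; \theta)}{\partial \theta} \cdot Q_\pi(s, a)$
form 2: $\frac{\partial V(s; \theta)}{\partial \theta} = \mathbb{E}_{A \sim \pi(\cdot \mid s; \theta)}\left[\frac{\partial \log \pi(A \mid s; \theta)}{\partial \theta} \cdot Q_\pi(s, A)\right]$
the derivation here is over-simplified: it treats $Q_\pi$ as if it did not depend on $\theta$.
if the actions are discrete, use form 1.
if the action set $\mathcal{A}$ is finite, compute $f(a, \theta) = \frac{\partial \pi(a \mid s; \theta)}{\partial \theta} \cdot Q_\pi(s, a)$ for every $a \in \mathcal{A}$, so: $\frac{\partial V(s; \theta)}{\partial \theta} = \sum_{a \in \mathcal{A}} f(a, \theta)$.
this approach does not work for continuous actions.
if the actions are continuous, e.g., $\mathcal{A} = [0, 1]$, use form 2.
randomly sample an action $\hat{a}$ according to the PDF $\pi(\cdot \mid s; \theta)$.
calculate $g(\hat{a}, \theta) = \frac{\partial \log \pi(\hat{a} \mid s; \theta)}{\partial \theta} \cdot Q_\pi(s, \hat{a})$.
use $g(\hat{a}, \theta)$ as an approximation to the policy gradient $\frac{\partial V(s; \theta)}{\partial \theta}$.
this approach also works for discrete actions.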
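The two forms of the policy gradient are the same quantity written in two ways. A short derivation for discrete actions, under the same simplifying assumption that $Q_\pi$ is treated as independent of $\theta$, uses the chain rule $\frac{\partial \log \pi}{\partial \theta} = \frac{1}{\pi} \cdot \frac{\partial \pi}{\partial \theta}$:

```latex
\begin{aligned}
\frac{\partial V(s;\theta)}{\partial \theta}
  &= \sum_{a} \frac{\partial \pi(a \mid s;\theta)}{\partial \theta} \cdot Q_\pi(s,a)
     && \text{(form 1)} \\
  &= \sum_{a} \pi(a \mid s;\theta) \cdot
     \frac{\partial \log \pi(a \mid s;\theta)}{\partial \theta} \cdot Q_\pi(s,a) \\
  &= \mathbb{E}_{A \sim \pi(\cdot \mid s;\theta)}
     \left[ \frac{\partial \log \pi(A \mid s;\theta)}{\partial \theta} \cdot Q_\pi(s,A) \right]
     && \text{(form 2)}
\end{aligned}
```

Because form 2 is an expectation over actions, a single sampled action gives an unbiased estimate of it, which is what makes the sampling approach usable for continuous actions.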
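observe the state $s_t$.
randomly sample action $a_t$ according to $\pi(\cdot \mid s_t; \theta_t)$.
compute $q_t \approx Q_\pi(s_t, a_t)$ (some estimate).
differentiate policy network: $d_{\theta,t} = \frac{\partial \log \pi(a_t \mid s_t; \theta)}{\partial \theta} \Big|_{\theta = \theta_t}$.
approximate policy gradient: $g(a_t, \theta_t) = q_t \cdot d_{\theta,t}$.
update policy network: $\theta_{t+1} = \theta_t + \beta \cdot g(a_t, \theta_t)$, where $\beta$ is the learning rate.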
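The update steps above can be sketched for the simplest possible policy: a tabular softmax policy whose parameters are the logits themselves. This is an illustrative assumption, not the neural network of the notes; `theta`, `policy_gradient_step`, and the value passed as `q` are hypothetical. For softmax logits, the derivative of $\log \pi(a \mid s)$ with respect to the logit of action $b$ is $\mathbb{1}\{b = a\} - \pi(b \mid s)$.

```python
import math


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(theta, s, a, q, beta):
    """One stochastic policy-gradient ascent step for a tabular softmax policy.

    theta[s] holds the logits of state s; q is an estimate of Q_pi(s, a);
    beta is the learning rate. Returns the updated action probabilities.
    """
    probs = softmax(theta[s])
    for b in range(len(probs)):
        # d log pi(a | s) / d theta[s][b] = 1{b == a} - pi(b | s)
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += beta * q * grad_log
    return softmax(theta[s])


theta = [[0.0, 0.0, 0.0]]  # one state, three actions, uniform policy
before = softmax(theta[0])
after = policy_gradient_step(theta, s=0, a=2, q=1.0, beta=0.5)
print(before[2], after[2])  # the probability of action 2 goes up
```

With a positive `q` the sampled action becomes more likely; with a negative `q` it becomes less likely.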
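question: how to compute $q_t$?
option 1: REINFORCE
play the game to the end and generate the trajectory: $s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_T, a_T, r_T$.
compute the discounted return $u_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ for all $t$.
since $Q_\pi(s_t, a_t) = \mathbb{E}\left[U_t \mid S_t = s_t, A_t = a_t\right]$, we can use $u_t$ to approximate $Q_\pi(s_t, a_t)$, i.e., $q_t = u_t$.
option 2: approximate $Q_\pi$ using a neural network. this leads to the actor-critic method.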
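For option 1, the discounted returns of a finished episode can be computed in one backward pass, using the recursion "return at step t = reward at step t + gamma times return at step t+1". A minimal sketch; the function name and the list-of-rewards input are assumptions made for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return [u_1, ..., u_T] where u_t = sum_{k >= t} gamma^(k - t) * r_k."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Each returned value can then play the role of the action-value estimate in the policy gradient update for the corresponding step.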
4. Actor-Critic Methods
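definition: state-value function. $V_\pi(s) = \sum_{a} \pi(a \mid s) \cdot Q_\pi(s, a) \approx \sum_{a} \pi(a \mid s; \theta) \cdot q(s, a; \mathbf{w})$
policy network (actor):
use a neural net $\pi(a \mid s; \theta)$ to approximate $\pi(a \mid s)$.
value network (critic):
use a neural net $q(s, a; \mathbf{w})$ to approximate $Q_\pi(s, a)$.
policy network (actor)
the structure is the same as in the Super Mario example.
value network (critic)
inputs: state $s$ and action $a$.
output: approximate action-value (a scalar), $q(s, a; \mathbf{w})$.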
train the networks
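training updates the parameters $\theta$ and $\mathbf{w}$ of the approximate state-value $V(s; \theta, \mathbf{w}) = \sum_{a} \pi(a \mid s; \theta) \cdot q(s, a; \mathbf{w})$.
update the policy network $\pi(a \mid s; \theta)$ to increase the state-value $V(s; \theta, \mathbf{w})$. the actor gradually performs better. supervision is purely from the value network (critic).
update the value network $q(s, a; \mathbf{w})$ to better estimate the return. the critic's judgement becomes more accurate. supervision is purely from the rewards.
observe the state $s_t$.
randomly sample action $a_t$ according to $\pi(\cdot \mid s_t; \theta_t)$.
perform $a_t$ and observe new state $s_{t+1}$ and reward $r_t$.
update $\mathbf{w}$ (in value network) using temporal difference (TD).
update $\theta$ (in policy network) using policy gradient.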
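update value network q using TD
compute $q(s_t, a_t; \mathbf{w}_t)$ and $q(s_{t+1}, a_{t+1}; \mathbf{w}_t)$.
TD target: $y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1}; \mathbf{w}_t)$.
loss: $L(\mathbf{w}) = \frac{1}{2} \left[q(s_t, a_t; \mathbf{w}) - y_t\right]^2$.
gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} \Big|_{\mathbf{w} = \mathbf{w}_t}$
update policy network using policy gradient
let $g(a, \theta) = \frac{\partial \log \pi(a \mid s; \theta)}{\partial \theta} \cdot q(s, a; \mathbf{w})$, so that $\frac{\partial V(s; \theta, \mathbf{w})}{\partial \theta} = \mathbb{E}_{A}\left[g(A, \theta)\right]$.
Monte Carlo approximation: sample $a \sim \pi(\cdot \mid s; \theta_t)$ and use $g(a, \theta_t)$ as an unbiased estimate of the gradient. so: $\theta_{t+1} = \theta_t + \beta \cdot g(a, \theta_t)$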
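Actor-critic algorithm
1. observe state $s_t$ and randomly sample $a_t \sim \pi(\cdot \mid s_t; \theta_t)$.
2. perform $a_t$; then the environment gives new state $s_{t+1}$ and reward $r_t$.
3. randomly sample $\tilde{a}_{t+1} \sim \pi(\cdot \mid s_{t+1}; \theta_t)$ (the agent does not perform $\tilde{a}_{t+1}$).
4. evaluate value network: $q_t = q(s_t, a_t; \mathbf{w}_t)$ and $q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1}; \mathbf{w}_t)$.
5. compute TD error: $\delta_t = q_t - (r_t + \gamma \cdot q_{t+1})$.
6. differentiate value network: $d_{w,t} = \frac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}} \Big|_{\mathbf{w} = \mathbf{w}_t}$.
7. update value network: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \delta_t \cdot d_{w,t}$.
8. differentiate policy network: $d_{\theta,t} = \frac{\partial \log \pi(a_t \mid s_t; \theta)}{\partial \theta} \Big|_{\theta = \theta_t}$.
9. update policy network: $\theta_{t+1} = \theta_t + \beta \cdot q_t \cdot d_{\theta,t}$.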
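alternative step 9: replace $q_t$ in the policy update with a baseline-corrected quantity derived from the TD error $\delta_t$. this method is called policy gradient with baseline.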
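One iteration of the algorithm can be sketched with tables instead of neural networks: the actor is a tabular softmax policy and the critic is a table of action-values, so the derivative of the critic with respect to the entry being updated is simply 1. This is a simplifying assumption for illustration only; `theta`, `q`, `actor_critic_step`, and the integer state and action ids are hypothetical. The sketch uses the plain policy update that weights the gradient by the critic's value, not the baseline variant.

```python
import math
import random


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, q, s, a, r, s_next, alpha, beta, gamma, rng=random):
    """Update the critic table q (by TD) and the actor logits theta (by policy gradient).

    Returns the TD error of this transition.
    """
    # Sample the next action from the policy; it is used only for the TD target.
    probs_next = softmax(theta[s_next])
    a_next = rng.choices(range(len(probs_next)), weights=probs_next, k=1)[0]

    # Evaluate the critic and compute the TD error.
    q_t = q[s][a]
    q_next = q[s_next][a_next]
    delta = q_t - (r + gamma * q_next)

    # Critic update: gradient descent on the squared TD error.
    q[s][a] = q_t - alpha * delta

    # Actor update: gradient ascent with q_t as the weight.
    probs = softmax(theta[s])
    for b in range(len(probs)):
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += beta * q_t * grad_log
    return delta


theta = [[0.0, 0.0], [0.0, 0.0]]  # 2 states, 2 actions
q = [[0.0, 0.0], [0.0, 0.0]]
rng = random.Random(0)
d1 = actor_critic_step(theta, q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, beta=1.0, gamma=0.9, rng=rng)
d2 = actor_critic_step(theta, q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, beta=1.0, gamma=0.9, rng=rng)
print(d1, d2, q[0][1], theta[0])
```

In the first call the critic still outputs zero, so only the critic moves; in the second call the critic's positive value starts pushing the actor toward the rewarded action.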
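5. Sarsa
derive TD Target
definition: discounted return. $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots = R_t + \gamma \cdot U_{t+1}$
assume $R_t$ depends on $(S_t, A_t, S_{t+1})$.
identity: $Q_\pi(s_t, a_t) = \mathbb{E}\left[R_t + \gamma \cdot Q_\pi(S_{t+1}, A_{t+1})\right]$ for all $\pi$.
TD learning: encourage $Q_\pi(s_t, a_t)$ to approach the TD target $y_t = r_t + \gamma \cdot Q_\pi(s_{t+1}, a_{t+1})$.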
tabular version
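we want to learn $Q_\pi(s, a)$.
suppose the numbers of states and actions are finite.
draw a table (one row per state, one column per action) and learn its entries.
observe a transition $(s_t, a_t, r_t, s_{t+1})$.
sample $a_{t+1} \sim \pi(\cdot \mid s_{t+1})$, where $\pi$ is the policy function.
TD target: $y_t = r_t + \gamma \cdot Q_\pi(s_{t+1}, a_{t+1})$.
TD error: $\delta_t = Q_\pi(s_t, a_t) - y_t$.
update: $Q_\pi(s_t, a_t) \leftarrow Q_\pi(s_t, a_t) - \alpha \cdot \delta_t$.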
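A minimal sketch of one tabular Sarsa update, assuming the table is a list of lists indexed by integer state and action ids. The names `sarsa_update` and `q_table` and the sample numbers are hypothetical.

```python
def sarsa_update(q_table, s, a, r, s_next, a_next, alpha, gamma):
    """Apply one Sarsa update in place and return the TD error."""
    td_target = r + gamma * q_table[s_next][a_next]
    td_error = q_table[s][a] - td_target
    # Move the table entry a fraction alpha of the way toward the TD target.
    q_table[s][a] -= alpha * td_error
    return td_error


q_table = [[0.0, 0.0], [0.0, 2.0]]  # 2 states, 2 actions
err = sarsa_update(q_table, s=0, a=1, r=1.0, s_next=1, a_next=1, alpha=0.5, gamma=0.5)
print(err, q_table[0][1])  # -2.0 1.0
```

The next action passed in must be the one sampled from the policy being evaluated, since the table is meant to track the action-values of that policy.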
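Sarsa's name
use $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$ for updating $Q_\pi$: State-Action-Reward-State-Action (SARSA).
Sarsa: neural network version
value network version
approximate $Q_\pi(s, a)$ by the value network, $q(s, a; \mathbf{w})$.
$q$ is used as the critic who evaluates the actor.
we want to learn the parameter, $\mathbf{w}$.
TD target: $y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1}; \mathbf{w})$
TD error: $\delta_t = q(s_t, a_t; \mathbf{w}) - y_t$
loss: $\delta_t^2 / 2$
gradient descent: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}$
goal: learn the action-value function $Q_\pi$.
tabular version: finite states and actions; learn the table of $Q_\pi(s, a)$ directly.
value network version: approximate $Q_\pi$ by the value network $q(s, a; \mathbf{w})$.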
6. Q-learning
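Q-learning is also a TD algorithm.
Sarsa vs Q-learning
Sarsa is for training the action-value function, $Q_\pi(s, a)$.
TD target: $y_t = r_t + \gamma \cdot Q_\pi(s_{t+1}, a_{t+1})$.
we used Sarsa for updating the value network (critic).
Q-learning is for training the optimal action-value function, $Q^\star(s, a)$.
TD target: $y_t = r_t + \gamma \cdot \max_{a} Q^\star(s_{t+1}, a)$.
we used Q-learning to train DQN.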
derive TD Target
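we have proved that for all $\pi$: $Q_\pi(s_t, a_t) = \mathbb{E}\left[R_t + \gamma \cdot Q_\pi(S_{t+1}, A_{t+1})\right]$
if $\pi$ is the optimal policy $\pi^\star$, then $Q_{\pi^\star}(s_t, a_t) = \mathbb{E}\left[R_t + \gamma \cdot Q_{\pi^\star}(S_{t+1}, A_{t+1})\right]$
$Q_{\pi^\star}$ and $Q^\star$ both denote the optimal action-value function.
the action $A_{t+1}$ is computed by $A_{t+1} = \arg\max_{a} Q^\star(S_{t+1}, a)$, so $Q^\star(S_{t+1}, A_{t+1}) = \max_{a} Q^\star(S_{t+1}, a)$ and $Q^\star(s_t, a_t) = \mathbb{E}\left[R_t + \gamma \cdot \max_{a} Q^\star(S_{t+1}, a)\right]$
Monte Carlo approximation: replace $R_t$ and $S_{t+1}$ by the observed $r_t$ and $s_{t+1}$. the TD target is $y_t = r_t + \gamma \cdot \max_{a} Q^\star(s_{t+1}, a)$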
tabular version
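observe a transition $(s_t, a_t, r_t, s_{t+1})$.
TD target: $y_t = r_t + \gamma \cdot \max_{a} Q^\star(s_{t+1}, a)$.
TD error: $\delta_t = Q^\star(s_t, a_t) - y_t$.
update: $Q^\star(s_t, a_t) \leftarrow Q^\star(s_t, a_t) - \alpha \cdot \delta_t$.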
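A minimal sketch of one tabular Q-learning update, assuming the table is a list of lists indexed by integer state and action ids. The names `q_learning_update` and `q_table` and the sample numbers are hypothetical. The only difference from a Sarsa update is that the TD target takes the maximum over the next state's row instead of using a sampled next action.

```python
def q_learning_update(q_table, s, a, r, s_next, alpha, gamma):
    """Apply one Q-learning update in place and return the TD error."""
    td_target = r + gamma * max(q_table[s_next])
    td_error = q_table[s][a] - td_target
    q_table[s][a] -= alpha * td_error
    return td_error


q_table = [[0.0, 0.0], [1.0, 3.0]]  # 2 states, 2 actions
err = q_learning_update(q_table, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=0.5)
print(err, q_table[0][0])  # -2.5 1.25
```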
DQN version
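approximate $Q^\star(s, a)$ by DQN, $Q(s, a; \mathbf{w})$.
DQN controls the agent by: $a_t = \arg\max_{a} Q(s_t, a; \mathbf{w})$
we seek to learn the parameter, $\mathbf{w}$.
observe a transition $(s_t, a_t, r_t, s_{t+1})$.
TD target: $y_t = r_t + \gamma \cdot \max_{a} Q(s_{t+1}, a; \mathbf{w})$.
TD error: $\delta_t = Q(s_t, a_t; \mathbf{w}) - y_t$.
update: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial Q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}$.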
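The same update can be sketched with a function approximator. To stay self-contained, the sketch below assumes a linear model with one weight row per action as a stand-in for a deep network; for that model the derivative of the action-value with respect to the chosen action's row is just the state's feature vector. All names (`q_values`, `greedy_action`, `q_learning_step`, `w`) are hypothetical.

```python
def q_values(w, state):
    """Q(state, a; w) for every action a, with a linear model (one row per action)."""
    return [sum(wi * xi for wi, xi in zip(row, state)) for row in w]


def greedy_action(w, state):
    """Action chosen by the learned Q function: the argmax over actions."""
    qs = q_values(w, state)
    return max(range(len(qs)), key=qs.__getitem__)


def q_learning_step(w, s, a, r, s_next, alpha, gamma):
    """One gradient step on the squared TD error; returns the TD error."""
    td_target = r + gamma * max(q_values(w, s_next))
    td_error = q_values(w, s)[a] - td_target
    # Only row a receives a gradient, and that gradient is the feature vector s.
    w[a] = [wi - alpha * td_error * xi for wi, xi in zip(w[a], s)]
    return td_error


w = [[0.0, 0.0], [1.0, 0.0]]  # 2 actions, 2 state features
s, s_next = [1.0, 0.0], [0.0, 1.0]
chosen = greedy_action(w, s)
err = q_learning_step(w, s, a=0, r=1.0, s_next=s_next, alpha=0.5, gamma=0.5)
print(chosen, err, w)
```

As in the formulas above, the TD target is treated as a constant when differentiating, so only the prediction for the taken action is pulled toward it.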
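7. Multi-step TD
one-step TD targets. Sarsa: $y_t = r_t + \gamma \cdot Q_\pi(s_{t+1}, a_{t+1})$. Q-learning: $y_t = r_t + \gamma \cdot \max_{a} Q^\star(s_{t+1}, a)$.
multi-step return
identity: $U_t = \sum_{i=0}^{m-1} \gamma^i \cdot R_{t+i} + \gamma^m \cdot U_{t+m}$.
m-step TD target for Sarsa: $y_t = \sum_{i=0}^{m-1} \gamma^i \cdot r_{t+i} + \gamma^m \cdot Q_\pi(s_{t+m}, a_{t+m})$.
m-step TD target for Q-learning: $y_t = \sum_{i=0}^{m-1} \gamma^i \cdot r_{t+i} + \gamma^m \cdot \max_{a} Q^\star(s_{t+m}, a)$.
if m is suitably tuned, the m-step target works better than the one-step target.
Hessel et al. "Rainbow: Combining improvements in deep reinforcement learning." In AAAI, 2018.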
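A minimal sketch of the m-step TD target, assuming the m observed rewards are given as a list and the bootstrap value (the action-value at step t+m for Sarsa, or the maximum over actions for Q-learning) has already been computed by the caller. The function name and arguments are hypothetical.

```python
def m_step_target(rewards, bootstrap_value, gamma):
    """Return sum_{i<m} gamma^i * r_{t+i} + gamma^m * bootstrap_value, with m = len(rewards)."""
    m = len(rewards)
    discounted_rewards = sum(gamma ** i * r for i, r in enumerate(rewards))
    return discounted_rewards + gamma ** m * bootstrap_value


print(m_step_target([1.0, 2.0], bootstrap_value=4.0, gamma=0.5))  # 1 + 1 + 0.25 * 4 = 3.0
print(m_step_target([1.0], bootstrap_value=4.0, gamma=0.5))       # one-step target: 1 + 0.5 * 4 = 3.0
```

With a single reward the function reduces to the one-step target, so the same helper covers both cases.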
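8. Experience Replay
shortcomings of the TD updates used so far: (1) waste of experience; (2) correlated updates.
previously, we used the transitions $(s_t, a_t, r_t, s_{t+1})$, for $t = 1, 2, \dots$, sequentially to update $\mathbf{w}$, and discarded each transition after using it once.
consecutive states, $s_t$ and $s_{t+1}$, are strongly correlated (which is bad for training).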
experience replay
a transition: $(s_t, a_t, r_t, s_{t+1})$.
store the most recent n transitions in a replay buffer.
remove old transitions so that the buffer has at most n transitions.
the buffer capacity n is a tuning hyper-parameter.
Fedus et al. "Revisiting fundamentals of experience replay." In ICML, 2020.
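A replay buffer with a fixed capacity can be sketched with a bounded double-ended queue from the Python standard library, which drops the oldest item automatically when a new one is appended to a full queue. The class name `ReplayBuffer` and its methods are hypothetical, and uniform sampling without replacement is an assumption of this sketch.

```python
import random
from collections import deque


class ReplayBuffer:
    """Keeps at most `capacity` of the most recent transitions."""

    def __init__(self, capacity):
        # deque(maxlen=...) discards the oldest entry when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        """Uniformly sample `batch_size` distinct transitions."""
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.push(s=t, a=0, r=1.0, s_next=t + 1)
batch = buf.sample(2)
print(len(buf), list(buf.buffer), batch)
```

After five pushes into a buffer of capacity three, only the three most recent transitions remain, and each sampled batch is drawn from those.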
TD with Experience replay
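find $\mathbf{w}$ by minimizing $L(\mathbf{w}) = \frac{1}{T} \sum_{t=1}^{T} \frac{\delta_t^2}{2}$
stochastic gradient descent (SGD):
randomly sample a transition, $(s_i, a_i, r_i, s_{i+1})$, from the buffer.
compute TD error, $\delta_i$.
stochastic gradient: $\mathbf{g}_i = \frac{\partial \, \delta_i^2 / 2}{\partial \mathbf{w}} = \delta_i \cdot \frac{\partial Q(s_i, a_i; \mathbf{w})}{\partial \mathbf{w}}$
SGD: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \mathbf{g}_i$.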
benefits of experience replay
make the updates uncorrelated.
reuse collected experience many times.
prioritized experience replay
basic idea
not all transitions are equally important.
which kind of transition is more important, left or right?
how do we know which transition is important?
if a transition has a high TD error $|\delta_t|$, it will be given high priority.
importance sampling
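use importance sampling instead of uniform sampling.
option 1: sampling probability $p_t \propto |\delta_t| + \epsilon$, where $\epsilon$ is a small positive number that keeps $p_t$ nonzero.
option 2: sampling probability $p_t \propto \frac{1}{\mathrm{rank}(t)}$, where $\mathrm{rank}(t)$ is the rank of transition $t$ when transitions are sorted by $|\delta_t|$ in descending order.
in sum, a transition with big $|\delta_t|$ shall be given high priority.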
scaling learning rate
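SGD: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \mathbf{g}$, where $\alpha$ is the learning rate.
if uniform sampling is used, $\alpha$ is the same for all transitions.
if importance sampling is used, $\alpha$ shall be adjusted according to the importance.
scale the learning rate by $(n p_t)^{-\beta}$, where $\beta \in (0, 1)$. under uniform sampling $p_t = 1/n$, so the factor equals 1.
in the beginning, set $\beta$ small; increase it to 1 over time.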
update TD Error
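associate each transition, $(s_t, a_t, r_t, s_{t+1})$, with a TD error, $\delta_t$.
if a transition is newly collected, its $\delta_t$ is not yet known; simply set it to the maximum so that the transition gets the highest priority.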
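The two sampling options and the learning-rate scaling can be sketched as small helper functions. This is an illustration under stated assumptions, not a full prioritized replay implementation: the function names, the default value of `eps`, and the example TD errors are all hypothetical.

```python
def proportional_probabilities(td_errors, eps=1e-3):
    """Option 1: probability proportional to |TD error| + eps."""
    priorities = [abs(d) + eps for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def rank_based_probabilities(td_errors):
    """Option 2: probability proportional to 1 / rank, rank 1 = largest |TD error|."""
    order = sorted(range(len(td_errors)), key=lambda i: -abs(td_errors[i]))
    weights = [0.0] * len(td_errors)
    for rank, i in enumerate(order, start=1):
        weights[i] = 1.0 / rank
    total = sum(weights)
    return [w / total for w in weights]


def learning_rate_scale(p, n, beta):
    """Factor (n * p)^(-beta) that shrinks the step for frequently sampled transitions."""
    return (n * p) ** (-beta)


td_errors = [0.1, -2.0, 0.5]
prop = proportional_probabilities(td_errors)
rank = rank_based_probabilities(td_errors)
print(prop, rank)
print(learning_rate_scale(p=0.25, n=4, beta=0.5))  # uniform case: factor is 1.0
```

Transitions with a large absolute TD error are sampled more often under both options, and the scaling factor gives those same transitions a smaller effective learning rate to compensate.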
The content above is compiled from Wang Shusen's lecture notes.