RL Theory (3)

3. Policy-based reinforcement learning

Policy Function Approximation

policy function

  • the policy function π(a|s) is a probability density function (PDF).
  • it takes a state s as input.
  • it outputs a probability for every action, e.g., π(left|s) = 0.2, π(right|s) = 0.1, π(up|s) = 0.7.

  • the agent performs an action a randomly drawn from the distribution π(·|s).

policy network

policy network: use a neural net π(a|s; θ) to approximate the policy function π(a|s).

  • use the policy network π(a|s; θ) to approximate π(a|s).
  • θ: the trainable parameters of the neural net.
[Figure: policy network for the Super Mario problem]
  • here, ∑_{a∈A} π(a|s; θ) = 1, where A is the set of actions.
  • that is why we use a softmax activation in the output layer.
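As an illustration (not part of the original notes), here is a minimal PyTorch sketch of such a policy network; the state dimension, hidden width, and three-action layout are made-up assumptions.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Policy network pi(a|s; theta): state in, one probability per action out."""
    def __init__(self, state_dim=4, hidden_dim=64, num_actions=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
            nn.Softmax(dim=-1),   # outputs are non-negative and sum to 1
        )

    def forward(self, state):
        return self.net(state)

# the agent draws an action from the predicted distribution pi(.|s; theta)
policy = PolicyNet()
s = torch.randn(1, 4)                              # a made-up state
probs = policy(s)                                  # e.g. tensor([[0.2, 0.1, 0.7]])
action = torch.distributions.Categorical(probs).sample()
```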

state-value function

definition: discounted return U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + γ³·R_{t+3} + ⋯.

definition: action-value function Q_π(s_t, a_t) = E[U_t | S_t = s_t, A_t = a_t].

definition: state-value function V_π(s_t) = E_A[Q_π(s_t, A)] = ∑_a π(a|s_t)·Q_π(s_t, a).

policy-based RL

definition: approximate state-value function

V(s; θ) = ∑_a π(a|s; θ)·Q_π(s, a).

policy-based learning: learn θ that maximizes J(θ) = E_S[V(S; θ)].

how to improve θ? policy gradient ascent.

  • observe state s.
  • update the policy by gradient ascent: θ ← θ + β·∂V(s; θ)/∂θ, where β is the learning rate.

policy gradient

policy gradient: the derivative of V(s; θ) w.r.t. θ,

∂V(s; θ)/∂θ = ∑_a ∂π(a|s; θ)/∂θ · Q_π(s, a) = E_{A∼π(·|s;θ)}[ ∂ log π(A|s; θ)/∂θ · Q_π(s, A) ].

(this derivation is over-simplified: it treats Q_π(s, a) as if it did not depend on θ, but the resulting formula is correct.)

  • if the actions are discrete, use form 1: ∂V(s; θ)/∂θ = ∑_a ∂π(a|s; θ)/∂θ · Q_π(s, a).

  • if A = {left, right, up}, compute f(a, θ) = ∂π(a|s; θ)/∂θ · Q_π(s, a) for each action, so: ∂V(s; θ)/∂θ = f(left, θ) + f(right, θ) + f(up, θ).

this approach does not work for continuous actions (the sum would run over infinitely many actions).

  • if the actions are continuous, use form 2: ∂V(s; θ)/∂θ = E_{A∼π(·|s;θ)}[ ∂ log π(A|s; θ)/∂θ · Q_π(s, A) ]. the expectation has no closed form, so approximate it by Monte Carlo:

  1. randomly sample an action â according to the PDF π(·|s; θ).
  2. calculate g(â, θ) = ∂ log π(â|s; θ)/∂θ · Q_π(s, â).
  3. use g(â, θ) as an unbiased approximation of the policy gradient ∂V(s; θ)/∂θ.

this approach also works for discrete actions.

algorithm

  1. observe the state s_t.
  2. randomly sample an action a_t according to π(·|s_t; θ_t).
  3. compute q_t, an estimate of Q_π(s_t, a_t) (see the question below).
  4. differentiate the policy network: d_{θ,t} = ∂ log π(a_t|s_t; θ)/∂θ, evaluated at θ = θ_t.
  5. approximate the policy gradient: g(a_t, θ_t) = q_t·d_{θ,t}.
  6. update the policy network: θ_{t+1} = θ_t + β·g(a_t, θ_t), as sketched in the code below.
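Steps 3–6 condense to a few lines of PyTorch. The sketch below is an assumption-laden illustration: `policy` is a network like the earlier sketch, `optimizer` is e.g. `torch.optim.SGD(policy.parameters(), lr=beta)`, and `q_t` is whatever estimate of Q_π(s_t, a_t) is available.

```python
import torch

def policy_gradient_step(policy, optimizer, s_t, a_t, q_t):
    """One step of stochastic policy gradient ascent:
    theta <- theta + beta * q_t * d log pi(a_t|s_t; theta) / d theta."""
    probs = policy(s_t)                  # pi(.|s_t; theta), shape (1, num_actions)
    log_prob = torch.log(probs[0, a_t])  # log pi(a_t|s_t; theta)
    loss = -q_t * log_prob               # minimizing the negative == gradient ascent
    optimizer.zero_grad()
    loss.backward()                      # gradient of the loss is -q_t * d_{theta,t}
    optimizer.step()                     # net effect: theta <- theta + beta * q_t * d_{theta,t}
```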

question: how to compute q_t?

option 1: REINFORCE

  • play the game to the end and generate the trajectory s_1, a_1, r_1, s_2, a_2, r_2, …, s_T, a_T, r_T.
  • compute the discounted return u_t = ∑_{k=t}^{T} γ^{k−t}·r_k for all t (see the sketch below).
  • since Q_π(s_t, a_t) = E[U_t], we can use u_t to approximate Q_π(s_t, a_t), i.e., set q_t = u_t.
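A small helper for the reward-to-go computation in option 1 might look as follows; this is a sketch, not from the original notes, and the example rewards and γ are made up.

```python
def discounted_returns(rewards, gamma=0.99):
    """u_t = r_t + gamma*r_{t+1} + ... + gamma^(T-t)*r_T for every step of one episode."""
    returns, u = [], 0.0
    for r in reversed(rewards):
        u = r + gamma * u          # u_t = r_t + gamma * u_{t+1}
        returns.append(u)
    returns.reverse()
    return returns

# e.g. discounted_returns([1.0, 0.0, 2.0], gamma=0.9) ≈ [2.62, 1.8, 2.0]
```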

option 2: approximate Q_π(s, a) using another neural network; this is the actor-critic method.

4. Actor-Critic Methods

[Figure: the actor-critic method]

definition: state-value function V_π(s) = ∑_a π(a|s)·Q_π(s, a).

policy network (actor):

  • use a neural net π(a|s; θ) to approximate π(a|s).

value network (critic):

  • use a neural net q(s, a; w) to approximate Q_π(s, a).

so:

V_π(s) ≈ V(s; θ, w) = ∑_a π(a|s; θ)·q(s, a; w).

policy network (actor)

the structure is the same as in the Super Mario example: the state goes in, and a softmax layer outputs a probability for each action.

value network (critic)

[Figure: value network (critic) for Super Mario]
  • inputs: state s and action a.
  • output: the approximate action-value q(s, a; w) (a scalar).

train the networks

update the parameters θ and w.

  • update the policy network π(a|s; θ) to increase the state-value V(s; θ, w). the actor gradually performs better. supervision comes purely from the value network (critic).

  • update the value network q(s, a; w) to better estimate the return. the critic's judgement becomes more accurate. supervision comes purely from the rewards.

algorithm

  1. observe the state s_t.
  2. randomly sample an action a_t according to π(·|s_t; θ_t).
  3. perform a_t and observe the new state s_{t+1} and reward r_t.
  4. update w (in the value network) using temporal difference (TD).
  5. update θ (in the policy network) using the policy gradient.

update the value network q using TD

  • compute q(s_t, a_t; w_t) and q(s_{t+1}, a_{t+1}; w_t).
  • TD target: y_t = r_t + γ·q(s_{t+1}, a_{t+1}; w_t).
  • loss: L(w) = ½·[q(s_t, a_t; w) − y_t]².
  • gradient descent: w_{t+1} = w_t − α·∂L(w)/∂w |_{w=w_t}.

update the policy network π using the policy gradient

  • let g(a, θ) = ∂ log π(a|s_t; θ)/∂θ · q(s_t, a; w).
  • policy gradient: ∂V(s_t; θ, w)/∂θ = E_A[g(A, θ)].

Monte Carlo approximation: a_t is sampled from π(·|s_t; θ), so g(a_t, θ) is an unbiased estimate of the policy gradient. so:

θ_{t+1} = θ_t + β·g(a_t, θ_t).

Actor-critic algorithm

[Figure: the actor-critic algorithm]
  1. observe the state s_t and randomly sample a_t ∼ π(·|s_t; θ_t).
  2. perform a_t; the environment gives the new state s_{t+1} and reward r_t.
  3. randomly sample ã_{t+1} ∼ π(·|s_{t+1}; θ_t) (ã_{t+1} is not actually performed).
  4. evaluate the value network: q_t = q(s_t, a_t; w_t) and q_{t+1} = q(s_{t+1}, ã_{t+1}; w_t).
  5. compute the TD error: δ_t = q_t − (r_t + γ·q_{t+1}).
  6. differentiate the value network: d_{w,t} = ∂q(s_t, a_t; w)/∂w |_{w=w_t}.
  7. update the value network: w_{t+1} = w_t − α·δ_t·d_{w,t}.
  8. differentiate the policy network:

    d_{θ,t} = ∂ log π(a_t|s_t; θ)/∂θ |_{θ=θ_t}.

  9. update the policy network: θ_{t+1} = θ_t + β·q_t·d_{θ,t}.

(alternative step 9): θ_{t+1} = θ_t + β·δ_t·d_{θ,t}, i.e., use the TD error δ_t in place of q_t. this variant is called policy gradient with baseline.
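The nine steps can be condensed into one training-step function. The following PyTorch sketch is an illustration under assumptions, not the canonical implementation: it presumes an `actor` that maps a single batched state to action probabilities, a `critic` q(s, a; w) that maps a (state, action) pair to a scalar, and separate optimizers for θ and w.

```python
import torch

def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      s_t, a_t, r_t, s_next, gamma=0.99):
    """One actor-critic update following steps 1-9 above
    (s_t, s_next: shape (1, state_dim); a_t: integer action index; r_t: float)."""
    # step 3: sample (but do not perform) an action at the next state
    with torch.no_grad():
        a_next = torch.distributions.Categorical(actor(s_next)).sample()
        q_next = critic(s_next, a_next)          # step 4: q_{t+1}

    q_t = critic(s_t, a_t)                       # step 4: q_t
    delta = q_t - (r_t + gamma * q_next)         # step 5: TD error

    # steps 6-7: gradient descent on delta^2 / 2 updates w
    critic_opt.zero_grad()
    (0.5 * delta.pow(2).mean()).backward()
    critic_opt.step()

    # steps 8-9: policy gradient ascent on q_t * log pi(a_t|s_t; theta) updates theta
    log_prob = torch.log(actor(s_t)[0, a_t])
    actor_opt.zero_grad()
    (-(q_t.detach() * log_prob)).mean().backward()
    actor_opt.step()
    # for the "policy gradient with baseline" variant, use delta.detach()
    # in place of q_t.detach() above.
```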

5. SARSA

derive TD Target

definition: discounted return U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + ⋯.

so we have the identity: U_t = R_t + γ·U_{t+1}.

  • assume R_t depends on (S_t, A_t, S_{t+1}).

identity: Q_π(s_t, a_t) = E[R_t + γ·Q_π(S_{t+1}, A_{t+1}) | s_t, a_t], for all policies π.

TD learning: encourage the prediction Q_π(s_t, a_t) to approach the TD target r_t + γ·Q_π(s_{t+1}, a_{t+1}).

tabular version

  • we want to learn Q_π(s, a).
  • suppose the numbers of states and actions are finite.
  • draw a table with one entry per (state, action) pair, and learn the table entries.
[Figure: the state-action table]

algorithm

  1. observe a transition (s_t, a_t, r_t, s_{t+1}).
  2. sample a_{t+1} ∼ π(·|s_{t+1}), where π is the policy function.
  3. TD target: y_t = r_t + γ·Q_π(s_{t+1}, a_{t+1}).
  4. TD error: δ_t = Q_π(s_t, a_t) − y_t.
  5. update: Q_π(s_t, a_t) ← Q_π(s_t, a_t) − α·δ_t.
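A minimal tabular implementation of these five steps; the table sizes, α, and γ below are arbitrary choices for illustration.

```python
import numpy as np

num_states, num_actions = 16, 4          # hypothetical sizes of the table
Q = np.zeros((num_states, num_actions))  # the table approximating Q_pi(s, a)

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One tabular SARSA update from the tuple (s, a, r, s', a')."""
    y = r + gamma * Q[s_next, a_next]    # TD target
    delta = Q[s, a] - y                  # TD error
    Q[s, a] -= alpha * delta             # update one table entry
```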

SARSA's name

  • the update of Q_π uses the tuple (s_t, a_t, r_t, s_{t+1}, a_{t+1}).
  • state-action-reward-state-action (SARSA).

SARSA: neural network version

value network version:
  • approximate Q_π(s, a) by the value network q(s, a; w).
  • q(s, a; w) is used as the critic that evaluates the actor.
  • we want to learn the parameter w.
  • TD target: y_t = r_t + γ·q(s_{t+1}, a_{t+1}; w).

  • TD error: δ_t = q(s_t, a_t; w) − y_t.

  • loss: L(w) = δ_t²/2.
  • gradient: ∂L(w)/∂w = δ_t·∂q(s_t, a_t; w)/∂w.

  • gradient descent: w ← w − α·δ_t·∂q(s_t, a_t; w)/∂w.

[Figure: the SARSA value network]
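The same update in the neural-network version, sketched with PyTorch autograd; `q_net` is assumed to take a (state, action) pair and return a scalar, exactly like the critic earlier, and `optimizer` wraps its parameters w.

```python
import torch

def sarsa_td_step(q_net, optimizer, s, a, r, s_next, a_next, gamma=0.99):
    """One TD update of the SARSA value network q(s, a; w)."""
    with torch.no_grad():
        y = r + gamma * q_net(s_next, a_next)   # TD target y_t
    delta = q_net(s, a) - y                     # TD error delta_t
    loss = 0.5 * delta.pow(2).mean()            # L(w) = delta_t^2 / 2
    optimizer.zero_grad()
    loss.backward()                             # gradient = delta_t * dq/dw
    optimizer.step()                            # w <- w - alpha * gradient
```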

summary

  • goal: learn the action-value function Q_π(s, a).
  • tabular version: finite states and actions; learn the table entries directly.
  • value network version: learn the parameter w of the value network q(s, a; w).

6. Q-learning

Q-learning is also a TD algorithm.

SARSA vs. Q-learning

SARSA

  • SARSA trains the action-value function Q_π(s, a).
  • TD target: y_t = r_t + γ·Q_π(s_{t+1}, a_{t+1}).
  • we used SARSA to update the value network (critic).

Q-learning

  • Q-learning trains the optimal action-value function Q*(s, a).
  • TD target: y_t = r_t + γ·max_a Q*(s_{t+1}, a).
  • we used Q-learning to train DQN.

derive TD Target

  • we have proved that for all policies π: Q_π(s_t, a_t) = E[R_t + γ·Q_π(S_{t+1}, A_{t+1})].

  • if π is the optimal policy π*, then Q_{π*}(s_t, a_t) = E[R_t + γ·Q_{π*}(S_{t+1}, A_{t+1})].

  • Q_{π*} and Q* both denote the optimal action-value function.

identity: Q*(s_t, a_t) = E[R_t + γ·Q*(S_{t+1}, A_{t+1})].

  • the action A_{t+1} is computed by A_{t+1} = argmax_a Q*(S_{t+1}, a), so Q*(S_{t+1}, A_{t+1}) = max_a Q*(S_{t+1}, a).

so: Q*(s_t, a_t) = E[R_t + γ·max_a Q*(S_{t+1}, a)].

by Monte Carlo approximation of the expectation, the TD target is y_t = r_t + γ·max_a Q*(s_{t+1}, a).

tabular version

  • observe a transition (s_t, a_t, r_t, s_{t+1}).
  • TD target: y_t = r_t + γ·max_a Q*(s_{t+1}, a).
  • TD error: δ_t = Q*(s_t, a_t) − y_t.
  • update: Q*(s_t, a_t) ← Q*(s_t, a_t) − α·δ_t.
[Figure: the tabular Q-learning update]
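The tabular update can be written almost identically to the SARSA sketch, except that it bootstraps with the max over next actions; sizes and hyper-parameters below are again arbitrary.

```python
import numpy as np

num_states, num_actions = 16, 4               # hypothetical sizes
Q_star = np.zeros((num_states, num_actions))  # table approximating Q*(s, a)

def q_learning_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update: bootstrap with the max over next actions."""
    y = r + gamma * Q_star[s_next].max()      # TD target
    delta = Q_star[s, a] - y                  # TD error
    Q_star[s, a] -= alpha * delta             # update one table entry
```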

DQN version

  • approximate Q*(s, a) by a DQN, Q(s, a; w).
  • the DQN controls the agent by a_t = argmax_a Q(s_t, a; w).

  • we seek to learn the parameter w.
[Figure: the DQN]
  • observe a transition (s_t, a_t, r_t, s_{t+1}).
  • TD target: y_t = r_t + γ·max_a Q(s_{t+1}, a; w).
  • TD error: δ_t = Q(s_t, a_t; w) − y_t.
  • update:

    w ← w − α·δ_t·∂Q(s_t, a_t; w)/∂w.
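A hedged PyTorch sketch of the DQN version; the network shape is invented, and the transition is assumed to arrive as batched tensors with integer (long) actions.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Q(s, a; w): maps a state to one Q-value per action."""
    def __init__(self, state_dim=4, num_actions=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))

    def forward(self, s):
        return self.net(s)

def dqn_td_step(q_net, optimizer, s, a, r, s_next, gamma=0.99):
    """One Q-learning TD update of the parameters w.
    s, s_next: (batch, state_dim) floats; a: (batch,) long tensor; r: (batch,) floats."""
    with torch.no_grad():
        y = r + gamma * q_net(s_next).max(dim=-1).values      # TD target
    q_sa = q_net(s).gather(-1, a.unsqueeze(-1)).squeeze(-1)   # Q(s_t, a_t; w)
    delta = q_sa - y                                          # TD error
    loss = 0.5 * delta.pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```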

7. Multi-step TD

  • SARSA TD target: y_t = r_t + γ·q(s_{t+1}, a_{t+1}; w).
  • Q-learning TD target: y_t = r_t + γ·max_a Q(s_{t+1}, a; w).

multi-step return

U_t = R_t + γ·U_{t+1}; applying this recursion m times gives:

U_t = ∑_{i=0}^{m−1} γ^i·R_{t+i} + γ^m·U_{t+m}.

  • m-step TD target for SARSA: y_t = ∑_{i=0}^{m−1} γ^i·r_{t+i} + γ^m·q(s_{t+m}, a_{t+m}; w).

  • m-step TD target for Q-learning: y_t = ∑_{i=0}^{m−1} γ^i·r_{t+i} + γ^m·max_a Q(s_{t+m}, a; w).

if m is suitably tuned, the m-step target works better than the one-step target.

Hessel et al. Rainbow: Combining improvements in deep reinforcement learning. In AAAI, 2018.
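Computing the m-step TD target above is a small loop; the sketch below is an illustration that works for both SARSA and Q-learning by letting the caller pass the appropriate bootstrap value (the numbers in the usage comment are made up).

```python
def m_step_td_target(rewards, bootstrap, gamma=0.99):
    """y_t = sum_{i=0}^{m-1} gamma^i * r_{t+i} + gamma^m * bootstrap.

    `rewards` holds r_t, ..., r_{t+m-1}; `bootstrap` is q(s_{t+m}, a_{t+m}; w)
    for SARSA or max_a Q(s_{t+m}, a; w) for Q-learning."""
    y = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return y + (gamma ** len(rewards)) * bootstrap

# e.g. m_step_td_target([1.0, 0.5], bootstrap=2.0, gamma=0.9) -> 1.0 + 0.45 + 1.62 = 3.07
```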

8. Experience Replay

shortcoming 1: waste of experience.
shortcoming 2: correlated updates.

  • previously, we used each transition (s_t, a_t, r_t, s_{t+1}) once, sequentially, to update w, and then discarded it.

  • consecutive states s_t and s_{t+1} are strongly correlated (which is bad for SGD).

experience replay

  • a transition: (s_t, a_t, r_t, s_{t+1}).
  • store the n most recent transitions in a replay buffer.
  • remove old transitions so that the buffer holds at most n transitions.
  • the buffer capacity n is a tuning hyper-parameter.

Fedus et al. Revisiting fundamentals of experience replay. In ICML, 2020.

[Figure: the replay buffer]

TD with Experience replay

  • find w by minimizing L(w) = (1/T)·∑_t δ_t²/2, where δ_t is the TD error of the t-th transition.

  • stochastic gradient descent (SGD):
    1. randomly sample a transition (s_i, a_i, r_i, s_{i+1}) from the buffer.
    2. compute its TD error δ_i.
    3. stochastic gradient: g_i = ∂(δ_i²/2)/∂w = δ_i·∂Q(s_i, a_i; w)/∂w.
    4. SGD update: w ← w − α·g_i.
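A replay buffer of capacity n can be a simple deque. The sketch below (uniform sampling only, names hypothetical) shows how one transition would be drawn for each SGD step.

```python
import random
from collections import deque

class ReplayBuffer:
    """Keeps the n most recent transitions (s, a, r, s_next)."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # old transitions are dropped automatically

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self):
        return random.choice(self.buffer)      # uniform sampling of one transition

# usage (hypothetical):
#   buffer = ReplayBuffer(capacity=100_000)
#   buffer.add((s, a, r, s_next))
#   s_i, a_i, r_i, s_next_i = buffer.sample()  # then compute delta_i and do one SGD step
```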

benefits of experience replay

  • make the updates uncorrelated.
  • reuse collected experience many times.

prioritized experience replay

basic idea

  • not all transitions are equally important.
  • which kind of transition is more important, the one on the left or the one on the right (see the figure below)?
[Figure: two transitions that are not equally important]
  • how do we know which transition is important?
  • if a transition has a high TD error |δ_t|, it will be given high priority.

importance sampling

  • use importance sampling instead of uniform sampling.
  • option 1: sampling probability p_t ∝ |δ_t| + ε.
  • option 2: sort the transitions by |δ_t| in descending order and use sampling probability p_t ∝ 1/rank(t).
  • in sum, a transition with a big |δ_t| shall be given high priority.

scaling learning rate

  • SGD: w ← w − α·g, where α is the learning rate.
  • if uniform sampling is used, α is the same for all transitions.
  • if importance sampling is used, α shall be adjusted according to the importance.
  • scale the learning rate by (n·p_t)^{−β}, where n is the buffer size and β ∈ (0, 1).
  • in the beginning, set β small; increase β to 1 over time.

update TD Error

  • associate each transition (s_t, a_t, r_t, s_{t+1}) with its TD error δ_t.
  • if a transition is newly collected, simply set its δ_t to the maximum, so that it is sampled with the highest priority.
in sum: big |δ_t| ⇒ high sampling probability ⇒ small learning rate.
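Putting the pieces together, a proportional-priority sampler (option 1) with the (n·p_t)^(−β) learning-rate correction might look like the sketch below; the ε, β, and example TD errors are assumptions for illustration.

```python
import numpy as np

def sample_prioritized(td_errors, beta=0.4, eps=1e-6):
    """Pick one buffer index with probability p_t proportional to |delta_t| + eps (option 1),
    and return it together with the learning-rate scale (n * p_t)^(-beta)."""
    priorities = np.abs(np.asarray(td_errors)) + eps
    probs = priorities / priorities.sum()
    n = len(td_errors)                       # buffer size
    idx = np.random.choice(n, p=probs)
    lr_scale = (n * probs[idx]) ** (-beta)   # big |delta| -> high p_t -> small learning rate
    return idx, lr_scale

# e.g. td_errors = [0.1, 2.0, 0.5]: index 1 is sampled most often,
# but its update uses the smallest scaled learning rate.
```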

The content above is compiled from Wang Shusen's lecture notes.

This article was originally published on the WeChat official account 拒绝拍脑袋: RL理论(3).
