3. Policy-based reinforcement learning
Policy Function Approximation
policy function
- The policy function $\pi(a \mid s)$ is a probability density function (PDF).
- It takes the state $s$ as input.
- It outputs the probabilities of all the actions, e.g., $\pi(\text{left} \mid s) = 0.2$, $\pi(\text{right} \mid s) = 0.1$, $\pi(\text{up} \mid s) = 0.7$.
- The agent performs an action randomly drawn from this distribution.
policy network
Policy network: use a neural net $\pi(a \mid s; \boldsymbol{\theta})$ to approximate $\pi(a \mid s)$.
- Use the policy network $\pi(a \mid s; \boldsymbol{\theta})$ to approximate the policy function $\pi(a \mid s)$.
- $\boldsymbol{\theta}$: trainable parameters of the neural net.
Properties:
- $\pi(a \mid s; \boldsymbol{\theta}) \ge 0$ for every action.
- $\sum_{a \in \mathcal{A}} \pi(a \mid s; \boldsymbol{\theta}) = 1$; here, $\mathcal{A}$ denotes the set of all actions.
- That is why we use softmax activation in the output layer.
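To make the above concrete, here is a small sketch of such a policy network in PyTorch; the layer sizes, state dimension, and number of actions are placeholders chosen for illustration, not values prescribed by the notes.

```python
import torch
import torch.nn as nn

# Policy network sketch: state in, action probabilities out.
class PolicyNet(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden)
        self.fc2 = nn.Linear(hidden, num_actions)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.fc1(s))
        # softmax makes the outputs non-negative and sum to 1,
        # so they form a valid distribution pi(. | s; theta)
        return torch.softmax(self.fc2(x), dim=-1)

# the agent draws an action at random from pi(. | s; theta)
policy = PolicyNet(state_dim=4, num_actions=3)
s = torch.randn(1, 4)                         # placeholder state
probs = policy(s)                             # probabilities of all actions
a = torch.multinomial(probs, num_samples=1)   # randomly drawn action
```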
state-value function
- Definition: discounted return: $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \cdots$.
- Definition: action-value function: $Q_\pi(s_t, a_t) = \mathbb{E}\big[U_t \mid S_t = s_t, A_t = a_t\big]$.
- Definition: state-value function: $V_\pi(s_t) = \mathbb{E}_A\big[Q_\pi(s_t, A)\big] = \sum_a \pi(a \mid s_t)\, Q_\pi(s_t, a)$.
policy-based RL
- Definition: approximate state-value function: $V(s; \boldsymbol{\theta}) = \sum_a \pi(a \mid s; \boldsymbol{\theta})\, Q_\pi(s, a)$.
- Policy-based learning: learn $\boldsymbol{\theta}$ that maximizes $J(\boldsymbol{\theta}) = \mathbb{E}_S\big[V(S; \boldsymbol{\theta})\big]$.
How to improve $\boldsymbol{\theta}$? Policy gradient ascent:
- Observe state $s_t$.
- Update the policy by: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \beta \cdot \frac{\partial V(s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}$.
policy gradient
Policy gradient: the derivative of $V(s; \boldsymbol{\theta})$ w.r.t. $\boldsymbol{\theta}$:
$\frac{\partial V(s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \sum_a \frac{\partial \pi(a\mid s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s,a)$ (Form 1) $= \mathbb{E}_{A\sim\pi(\cdot\mid s;\boldsymbol{\theta})}\!\left[\frac{\partial \log \pi(A\mid s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s,A)\right]$ (Form 2).
The derivation here is over-simplified: it treats $Q_\pi(s,a)$ as if it did not depend on $\boldsymbol{\theta}$.
- If the actions are discrete, use Form 1: compute $f(a,\boldsymbol{\theta}) = \frac{\partial \pi(a\mid s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s,a)$ for every action $a \in \mathcal{A}$, so: $\frac{\partial V(s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \sum_{a\in\mathcal{A}} f(a,\boldsymbol{\theta})$. This approach does not work for continuous actions.
- If the actions are continuous, e.g., $\mathcal{A} = [0, 1]$, use Form 2:
  - Randomly sample an action $\hat{a}$ according to the PDF $\pi(\cdot \mid s; \boldsymbol{\theta})$.
  - Calculate $\mathbf{g}(\hat{a},\boldsymbol{\theta}) = \frac{\partial \log \pi(\hat{a}\mid s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot Q_\pi(s,\hat{a})$.
  - Use $\mathbf{g}(\hat{a},\boldsymbol{\theta})$ as an (unbiased) approximation to the policy gradient $\frac{\partial V(s;\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}$.
  This Monte Carlo approach also works for discrete actions.
algorithm
1. Observe the state $s_t$.
2. Randomly sample action $a_t$ according to $\pi(\cdot \mid s_t; \boldsymbol{\theta}_t)$.
3. Compute $q_t \approx Q_\pi(s_t, a_t)$ (some estimate).
4. Differentiate the policy network: $\mathbf{d}_{\theta,t} = \frac{\partial \log \pi(a_t \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\Big|_{\boldsymbol{\theta}=\boldsymbol{\theta}_t}$.
5. Approximate the policy gradient: $\mathbf{g}(a_t, \boldsymbol{\theta}_t) = q_t \cdot \mathbf{d}_{\theta,t}$.
6. Update the policy network: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta \cdot \mathbf{g}(a_t, \boldsymbol{\theta}_t)$.
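As an illustration of steps 4 to 6, here is a sketch of a single update in PyTorch, assuming `policy` is the PolicyNet sketched earlier, `a_t` is the index of the sampled action, and `q_t` is whatever estimate of $Q_\pi(s_t, a_t)$ is available (see the two options below). The learning rate $\beta$ lives inside the optimizer.

```python
import torch

def policy_gradient_step(policy, optimizer, s_t, a_t, q_t):
    probs = policy(s_t)                  # pi(. | s_t; theta), shape (1, |A|)
    log_prob = torch.log(probs[0, a_t])  # log pi(a_t | s_t; theta)
    loss = -q_t * log_prob               # ascent on q_t * log pi == descent on its negative
    optimizer.zero_grad()
    loss.backward()                      # g(a_t, theta) = q_t * d log pi / d theta
    optimizer.step()                     # theta <- theta + beta * g(a_t, theta)
```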
Question: how to compute $q_t \approx Q_\pi(s_t, a_t)$?
option 1: REINFORCE
- Play the game to the end and generate the trajectory: $s_1, a_1, r_1,\ s_2, a_2, r_2,\ \ldots,\ s_T, a_T, r_T$.
- Compute the discounted return $u_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ for all $t$.
- Since $Q_\pi(s_t, a_t) = \mathbb{E}[U_t]$, we can use the observed $u_t$ to approximate $Q_\pi(s_t, a_t)$, i.e., set $q_t = u_t$.
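A short helper for the return computation above; `rewards` is the list of observed rewards of one finished episode.

```python
def discounted_returns(rewards, gamma=0.99):
    """u_t = sum_{k=t}^{T} gamma^(k-t) * r_k, computed backwards in one pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # u_t = r_t + gamma * u_{t+1}
        returns[t] = running
    return returns
```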
Option 2: approximate $Q_\pi(s_t, a_t)$ using a neural network; this is the actor-critic method.
4. Actor-Critic Methods
Definition: state-value function: $V_\pi(s) = \sum_a \pi(a \mid s)\, Q_\pi(s, a)$.
Policy network (actor):
- Use a neural net $\pi(a \mid s; \boldsymbol{\theta})$ to approximate $\pi(a \mid s)$.
Value network (critic):
- Use a neural net $q(s, a; \mathbf{w})$ to approximate $Q_\pi(s, a)$.
So:
$V_\pi(s) \approx V(s; \boldsymbol{\theta}, \mathbf{w}) = \sum_a \pi(a \mid s; \boldsymbol{\theta})\, q(s, a; \mathbf{w})$.
policy network(actor)
The structure is the same as the policy network in the Super Mario example: the input is the state $s$, and the softmax output layer gives the probability of every action.
value network(critic)
- Inputs: state $s$ and action $a$.
- Output: approximate action-value $q(s, a; \mathbf{w})$ (a scalar).
train the networks
Training: update the parameters $\boldsymbol{\theta}$ and $\mathbf{w}$.
- Update the policy network $\pi(a \mid s; \boldsymbol{\theta})$ to increase the state-value $V(s; \boldsymbol{\theta}, \mathbf{w})$. The actor gradually performs better. Supervision is purely from the value network (critic).
- Update the value network $q(s, a; \mathbf{w})$ to better estimate the return. The critic's judgement becomes more accurate. Supervision is purely from the rewards.
algorithm
1. Observe the state $s_t$.
2. Randomly sample action $a_t$ according to $\pi(\cdot \mid s_t; \boldsymbol{\theta}_t)$.
3. Perform $a_t$ and observe the new state $s_{t+1}$ and reward $r_t$.
4. Update $\mathbf{w}$ (in the value network) using temporal difference (TD).
5. Update $\boldsymbol{\theta}$ (in the policy network) using the policy gradient.
update value network q using TD
- Compute $q(s_t, a_t; \mathbf{w}_t)$ and $q(s_{t+1}, a_{t+1}; \mathbf{w}_t)$.
- TD target: $y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1}; \mathbf{w}_t)$.
- Loss: $L(\mathbf{w}) = \frac{1}{2}\big[q(s_t, a_t; \mathbf{w}) - y_t\big]^2$.
- Gradient descent: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \frac{\partial L(\mathbf{w})}{\partial \mathbf{w}}\Big|_{\mathbf{w}=\mathbf{w}_t}$.
update policy network π using policy gradient
- Let $\mathbf{g}(a, \boldsymbol{\theta}) = \frac{\partial \log \pi(a \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \cdot q(s_t, a; \mathbf{w})$.
- Then $\frac{\partial V(s_t; \boldsymbol{\theta}, \mathbf{w})}{\partial \boldsymbol{\theta}} = \mathbb{E}_{A \sim \pi(\cdot \mid s_t; \boldsymbol{\theta})}\big[\mathbf{g}(A, \boldsymbol{\theta})\big]$.
- Monte Carlo approximation: sample $a \sim \pi(\cdot \mid s_t; \boldsymbol{\theta}_t)$ and use $\mathbf{g}(a, \boldsymbol{\theta}_t)$ as an unbiased estimate of the policy gradient. So:
  $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta \cdot \mathbf{g}(a, \boldsymbol{\theta}_t)$.
Actor-critic algorithm
1. Observe state $s_t$ and randomly sample $a_t \sim \pi(\cdot \mid s_t; \boldsymbol{\theta}_t)$.
2. Perform $a_t$; then the environment gives the new state $s_{t+1}$ and reward $r_t$.
3. Randomly sample $\tilde{a}_{t+1} \sim \pi(\cdot \mid s_{t+1}; \boldsymbol{\theta}_t)$ (it is not actually performed).
4. Evaluate the value network: $q_t = q(s_t, a_t; \mathbf{w}_t)$ and $q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1}; \mathbf{w}_t)$.
5. Compute the TD error: $\delta_t = q_t - (r_t + \gamma \cdot q_{t+1})$.
6. Differentiate the value network: $\mathbf{d}_{w,t} = \frac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}\Big|_{\mathbf{w}=\mathbf{w}_t}$.
7. Update the value network: $\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \delta_t \cdot \mathbf{d}_{w,t}$.
8. Differentiate the policy network: $\mathbf{d}_{\theta,t} = \frac{\partial \log \pi(a_t \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\Big|_{\boldsymbol{\theta}=\boldsymbol{\theta}_t}$.
9. Update the policy network: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta \cdot q_t \cdot \mathbf{d}_{\theta,t}$.

(Alternative step 9): $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta \cdot \delta_t \cdot \mathbf{d}_{\theta,t}$. This method is called policy gradient with baseline.
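The sketch below walks through one iteration of steps 1 to 9 in PyTorch. It reuses the PolicyNet from the earlier sketch and adds a value network; the one-hot action encoding, layer sizes, and function names are choices of this sketch, not part of the lecture. Here `a_t` is a long tensor holding the index of the performed action.

```python
import torch
import torch.nn as nn

# Value network (critic): q(s, a; w), one scalar per (state, action) pair.
class ValueNet(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.num_actions = num_actions
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        a_onehot = nn.functional.one_hot(a, self.num_actions).float()
        return self.net(torch.cat([s, a_onehot], dim=-1)).squeeze(-1)

def actor_critic_step(policy, critic, policy_opt, critic_opt,
                      s_t, a_t, r_t, s_next, gamma=0.99):
    # step 3: sample a_tilde ~ pi(. | s_{t+1}; theta); it is not performed
    with torch.no_grad():
        a_tilde = torch.multinomial(policy(s_next), 1).squeeze(-1)
        q_next = critic(s_next, a_tilde)              # q_{t+1}
    q_t = critic(s_t, a_t)                            # step 4
    td_error = q_t - (r_t + gamma * q_next)           # step 5: delta_t
    # steps 6-7: critic update (gradient of delta^2 / 2 w.r.t. w)
    critic_loss = 0.5 * td_error.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    # steps 8-9: actor update; replacing q_t with td_error here gives
    # the "policy gradient with baseline" variant
    log_prob = torch.log(policy(s_t)[0, a_t])
    actor_loss = (-q_t.detach() * log_prob).mean()
    policy_opt.zero_grad()
    actor_loss.backward()
    policy_opt.step()
```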
5. SARSA
derive TD Target
- Definition: discounted return: $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots$.
- So: $U_t = R_t + \gamma \cdot U_{t+1}$.
- Assume $R_t$ depends on $(S_t, A_t, S_{t+1})$.
- Then $Q_\pi(s_t, a_t) = \mathbb{E}[U_t \mid s_t, a_t] = \mathbb{E}[R_t \mid s_t, a_t] + \gamma \cdot \mathbb{E}[Q_\pi(S_{t+1}, A_{t+1}) \mid s_t, a_t]$.
- Identity: $Q_\pi(s_t, a_t) = \mathbb{E}\big[R_t + \gamma \cdot Q_\pi(S_{t+1}, A_{t+1})\big]$, for all $\pi$.
- TD learning: encourage $Q_\pi(s_t, a_t)$ to approach $r_t + \gamma \cdot Q_\pi(s_{t+1}, a_{t+1})$, an estimate of the expectation above.
tabular version
- We want to learn $Q_\pi(s, a)$.
- Suppose the numbers of states and actions are finite.
- Draw a table (one entry per $(s, a)$ pair) and learn the table.
algorithm
1. Observe a transition $(s_t, a_t, r_t, s_{t+1})$.
2. Sample $a_{t+1} \sim \pi(\cdot \mid s_{t+1})$, where $\pi$ is the policy function.
3. TD target: $y_t = r_t + \gamma \cdot Q_\pi(s_{t+1}, a_{t+1})$.
4. TD error: $\delta_t = Q_\pi(s_t, a_t) - y_t$.
5. Update: $Q_\pi(s_t, a_t) \leftarrow Q_\pi(s_t, a_t) - \alpha \cdot \delta_t$.
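A short sketch of this tabular update; the table shape (10 states, 4 actions) is a placeholder, and `pi` stands for any function that samples an action from the current policy.

```python
import numpy as np

Q = np.zeros((10, 4))   # placeholder Q-table: 10 states, 4 actions

def sarsa_update(Q, s_t, a_t, r_t, s_next, pi, alpha=0.1, gamma=0.9):
    a_next = pi(s_next)                    # sample a_{t+1} ~ pi(. | s_{t+1})
    y_t = r_t + gamma * Q[s_next, a_next]  # TD target
    delta = Q[s_t, a_t] - y_t              # TD error
    Q[s_t, a_t] -= alpha * delta           # update the table entry
    return a_next
```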
SARSA's name
- We use the tuple $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$ for updating $Q_\pi(s_t, a_t)$.
- State-Action-Reward-State-Action (SARSA).
SARSA: neural network (value network) version
- Approximate $Q_\pi(s, a)$ by the value network $q(s, a; \mathbf{w})$.
- $q(s, a; \mathbf{w})$ is used as the critic that evaluates the actor.
- We want to learn the parameter $\mathbf{w}$.
- TD target: $y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1}; \mathbf{w})$.
- TD error: $\delta_t = q(s_t, a_t; \mathbf{w}) - y_t$.
- Loss: $\delta_t^2 / 2$.
- Gradient: $\frac{\partial\, \delta_t^2/2}{\partial \mathbf{w}} = \delta_t \cdot \frac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}$.
- Gradient descent: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}$.
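A compact sketch of this update, assuming `critic` is the ValueNet from the actor-critic sketch and `a_next` is the action actually sampled from the policy at $s_{t+1}$. Note the contrast with the Q-learning/DQN update in the next section, whose TD target takes a max over actions instead of using the sampled $a_{t+1}$.

```python
import torch

def sarsa_nn_update(critic, optimizer, s_t, a_t, r_t, s_next, a_next, gamma=0.99):
    with torch.no_grad():
        y_t = r_t + gamma * critic(s_next, a_next)   # TD target (sampled a_{t+1})
    delta = critic(s_t, a_t) - y_t                   # TD error
    loss = 0.5 * delta.pow(2).mean()                 # loss = delta^2 / 2
    optimizer.zero_grad()
    loss.backward()                                  # gradient = delta * dq/dw
    optimizer.step()                                 # w <- w - alpha * gradient
```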
summary
- Goal: learn the action-value function $Q_\pi(s, a)$.
- Tabular version: finite states and actions; learn the table of $Q_\pi(s, a)$ values directly.
- Value network version: learn the parameter $\mathbf{w}$ of the value network $q(s, a; \mathbf{w})$.
6. Q-learning
Q-learning is also a TD algorithm.
SARSA vs Q-learning
SARSA
- SARSA is for training the action-value function $Q_\pi(s, a)$.
- TD target: $y_t = r_t + \gamma \cdot Q_\pi(s_{t+1}, a_{t+1})$.
- We used SARSA for updating the value network (critic).
Q-learning
- Q-learning is for training the optimal action-value function $Q^\star(s, a)$.
- TD target: $y_t = r_t + \gamma \cdot \max_a Q^\star(s_{t+1}, a)$.
- We used Q-learning for training DQN.
derive TD Target
- We have proved that, for all $\pi$: $Q_\pi(s_t, a_t) = \mathbb{E}\big[R_t + \gamma \cdot Q_\pi(S_{t+1}, A_{t+1})\big]$.
- If $\pi$ is the optimal policy $\pi^\star$, then $Q_{\pi^\star}(s_t, a_t) = \mathbb{E}\big[R_t + \gamma \cdot Q_{\pi^\star}(S_{t+1}, A_{t+1})\big]$.
- $Q_{\pi^\star}$ and $Q^\star$ both denote the optimal action-value function.
- Identity: $Q^\star(s_t, a_t) = \mathbb{E}\big[R_t + \gamma \cdot Q^\star(S_{t+1}, A_{t+1})\big]$.
- The action $A_{t+1}$ is computed by $A_{t+1} = \arg\max_a Q^\star(S_{t+1}, a)$, so $Q^\star(S_{t+1}, A_{t+1}) = \max_a Q^\star(S_{t+1}, a)$.
- So: $Q^\star(s_t, a_t) = \mathbb{E}\big[R_t + \gamma \cdot \max_a Q^\star(S_{t+1}, a)\big]$.
- Monte Carlo approximation: the TD target is $y_t = r_t + \gamma \cdot \max_a Q^\star(s_{t+1}, a)$.
tabular version
1. Observe a transition $(s_t, a_t, r_t, s_{t+1})$.
2. TD target: $y_t = r_t + \gamma \cdot \max_a Q^\star(s_{t+1}, a)$.
3. TD error: $\delta_t = Q^\star(s_t, a_t) - y_t$.
4. Update: $Q^\star(s_t, a_t) \leftarrow Q^\star(s_t, a_t) - \alpha \cdot \delta_t$.
DQN version
- Approximate $Q^\star(s, a)$ by DQN, $Q(s, a; \mathbf{w})$.
- DQN controls the agent by: $a_t = \arg\max_a Q(s_t, a; \mathbf{w})$.
- We seek to learn the parameter $\mathbf{w}$.
Algorithm:
1. Observe a transition $(s_t, a_t, r_t, s_{t+1})$.
2. TD target: $y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a; \mathbf{w})$.
3. TD error: $\delta_t = Q(s_t, a_t; \mathbf{w}) - y_t$.
4. Update: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial Q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}$.
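The sketch below implements this update in PyTorch. The network outputs one Q-value per action, so the max in the TD target is a max over the output vector; the layer sizes and function names are placeholders of this sketch, and `a_t` is a long tensor of action indices.

```python
import torch
import torch.nn as nn

# DQN sketch: state in, one Q-value per action out.
class DQN(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

def q_learning_update(dqn, optimizer, s_t, a_t, r_t, s_next, gamma=0.99):
    with torch.no_grad():
        y_t = r_t + gamma * dqn(s_next).max(dim=-1).values     # TD target
    q_sa = dqn(s_t).gather(-1, a_t.unsqueeze(-1)).squeeze(-1)  # Q(s_t, a_t; w)
    delta = q_sa - y_t                                         # TD error
    loss = 0.5 * delta.pow(2).mean()
    optimizer.zero_grad()
    loss.backward()                                            # delta * dQ/dw
    optimizer.step()                                           # w <- w - alpha * gradient
```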
7. Multi-step TD
One-step SARSA TD target: $y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1}; \mathbf{w})$. One-step Q-learning TD target: $y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a; \mathbf{w})$. Both use only a single reward, $r_t$.
multi-step return
- Discounted return: $U_t = R_t + \gamma \cdot U_{t+1}$.
- Applying the recursion m times: $U_t = \sum_{i=0}^{m-1} \gamma^i R_{t+i} + \gamma^m U_{t+m}$ (the m-step return).
- m-step TD target for SARSA: $y_t = \sum_{i=0}^{m-1} \gamma^i r_{t+i} + \gamma^m \cdot q(s_{t+m}, a_{t+m}; \mathbf{w})$.
- m-step TD target for Q-learning: $y_t = \sum_{i=0}^{m-1} \gamma^i r_{t+i} + \gamma^m \cdot \max_a Q(s_{t+m}, a; \mathbf{w})$.
- If m is suitably tuned, the m-step target works better than the one-step target.
Hessel et al. "Rainbow: Combining Improvements in Deep Reinforcement Learning." In AAAI, 2018.
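A small helper for the m-step target above; `rewards_m` holds the m observed rewards $r_t, \ldots, r_{t+m-1}$, and `bootstrap_value` is the value plugged in at step $t+m$: $q(s_{t+m}, a_{t+m}; \mathbf{w})$ for SARSA, or $\max_a Q(s_{t+m}, a; \mathbf{w})$ for Q-learning.

```python
def m_step_td_target(rewards_m, bootstrap_value, gamma=0.99):
    y = 0.0
    for i, r in enumerate(rewards_m):
        y += (gamma ** i) * r                         # sum_{i=0}^{m-1} gamma^i * r_{t+i}
    y += (gamma ** len(rewards_m)) * bootstrap_value  # + gamma^m * bootstrap value
    return y
```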
8. Experience Replay
Shortcoming 1: waste of experience. Shortcoming 2: correlated updates.
- Previously, we used each transition $(s_t, a_t, r_t, s_{t+1})$ sequentially to update $\mathbf{w}$ and then discarded it, which wastes experience.
- Consecutive states $s_t$ and $s_{t+1}$ are strongly correlated (which is bad for the updates).
experience replay
- A transition: $(s_t, a_t, r_t, s_{t+1})$.
- Store the recent n transitions in a replay buffer.
- Remove old transitions so that the buffer holds at most n transitions.
- The buffer capacity n is a tuning hyper-parameter.
Fedus et al. "Revisiting Fundamentals of Experience Replay." In ICML, 2020.
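A minimal replay buffer along the lines described above; the deque-based implementation and uniform sampling of a single transition are choices of this sketch.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int):
        # a deque with maxlen drops the oldest transition automatically
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self):
        return random.choice(self.buffer)   # uniform sampling of one transition

    def __len__(self):
        return len(self.buffer)
```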
TD with Experience replay
- Find $\mathbf{w}$ by minimizing $L(\mathbf{w}) = \frac{1}{T} \sum_{t=1}^{T} \frac{\delta_t^2}{2}$.
- Stochastic gradient descent (SGD):
  - Randomly sample a transition $(s_i, a_i, r_i, s_{i+1})$ from the buffer.
  - Compute the TD error $\delta_i$.
  - Stochastic gradient: $\mathbf{g}_i = \frac{\partial\, \delta_i^2/2}{\partial \mathbf{w}} = \delta_i \cdot \frac{\partial Q(s_i, a_i; \mathbf{w})}{\partial \mathbf{w}}$.
  - SGD: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \mathbf{g}_i$.
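Putting the pieces together, a sketch of one training step with experience replay, reusing the ReplayBuffer and the q_learning_update function from the earlier sketches; `transition` is the newly observed tuple $(s_t, a_t, r_t, s_{t+1})$ stored as tensors in the format those sketches expect.

```python
buffer = ReplayBuffer(capacity=10_000)   # the capacity n is a hyper-parameter

def replay_training_step(dqn, optimizer, transition, gamma=0.99):
    buffer.store(*transition)                # keep the new experience
    s_i, a_i, r_i, s_next = buffer.sample()  # random transition from the buffer
    q_learning_update(dqn, optimizer, s_i, a_i, r_i, s_next, gamma)  # one SGD step
```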
benefits of experience replay
- It makes the updates uncorrelated.
- It reuses the collected experience many times.
prioritized experience replay
basic idea
- Not all transitions are equally important.
- Which kind of transition is more important?
- How do we know which transition is important?
- If a transition has a high TD error $|\delta_t|$, it will be given high priority.
importance sampling
- Use non-uniform (importance) sampling instead of uniform sampling.
- Option 1: sampling probability $p_t \propto |\delta_t| + \epsilon$.
- Option 2: sampling probability $p_t \propto \frac{1}{\mathrm{rank}(t)}$, where $\mathrm{rank}(t)$ is the rank of $|\delta_t|$ in descending order.
- In sum, transitions with big $|\delta_t|$ shall be given high priority.
scaling learning rate
- SGD: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \mathbf{g}$, where $\alpha$ is the learning rate.
- If uniform sampling is used, $\alpha$ is the same for all transitions.
- If importance sampling is used, $\alpha$ shall be adjusted according to the importance.
- Scale the learning rate by $(n \cdot p_t)^{-\beta}$, where $\beta \in (0, 1)$.
- In the beginning, set $\beta$ small; increase $\beta$ to 1 over time.
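A sketch of option 1 together with the learning-rate scaling, using NumPy; the function name and the small constant added to $|\delta_t|$ are choices of this sketch.

```python
import numpy as np

def sample_prioritized(td_errors, base_lr, beta, eps=1e-6, rng=np.random):
    priorities = np.abs(td_errors) + eps        # priority ~ |delta_t| + eps
    probs = priorities / priorities.sum()       # sampling probability p_t
    n = len(td_errors)
    idx = rng.choice(n, p=probs)                # non-uniform (importance) sampling
    lr = base_lr * (n * probs[idx]) ** (-beta)  # learning rate scaled by (n * p_t)^(-beta)
    return idx, lr
```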
update TD Error
- Associate each transition $(s_t, a_t, r_t, s_{t+1})$ in the buffer with a TD error $\delta_t$; whenever a transition is used for an update, recompute and store its $\delta_t$.
- If a transition is newly collected, its $\delta_t$ is unknown; simply set its $\delta_t$ to the maximum, so that it gets the highest priority.
The content above is compiled from Wang Shusen's lecture notes.