

SLIDE 1

Reinforcement Learning for Control

Federico Nesti, f.nesti@santannapisa.it

Credits: Brian Douglas

SLIDE 2

Layout

  • Reinforcement Learning Recap
  • Applications:
    • Q-Learning
    • Deep Q-Learning
    • Policy Gradients
    • Actor Critic (DDPG)
  • Major Difficulties in Deep RL
SLIDE 3

Reinforcement Learning

[Figure: the agent-environment interaction loop: the agent acts on the environment, which returns a new state and a reward.]

SLIDE 4

Reinforcement Learning

[Figure: a gridworld example with a start state and terminal states: the agent starts at START and must reach GOAL while avoiding the HOLE cells.]
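To make the interaction loop concrete in code, here is a minimal sketch using OpenAI Gym's FrozenLake, which matches the gridworld shown above. The environment name and the 2020-era Gym API (where env.step returns a 4-tuple) are assumptions; the actual notebook may differ.

```python
# Minimal agent-environment loop on FrozenLake (a sketch; assumes the
# 2020-era OpenAI Gym API, where env.step returns a 4-tuple).
import gym

env = gym.make("FrozenLake-v0", is_slippery=False)  # deterministic variant

state = env.reset()
done = False
episode_return = 0.0
while not done:
    action = env.action_space.sample()         # random policy, for illustration
    state, reward, done, info = env.step(action)
    episode_return += reward                   # reward is 1 only at the GOAL
print("episode return:", episode_return)
```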

SLIDE 5

Reinforcement Learning

SLIDE 6

Layout

  • Reinforcement Learning Recap
  • Applications:
    • Q-Learning
    • Deep Q-Learning
    • Policy Gradients
    • Actor Critic (DDPG)
  • Major Difficulties in Deep RL


SLIDE 7

The two approaches of RL

Policy search («primal» formulation): the search for the optimal policy is done directly in the policy space.

Function-based approach («dual» formulation): a value function is learned first, and the policy is derived from it (e.g., by acting greedily with respect to the learned values).

SLIDE 8

TD-Learning and Q-Learning

TD-learning update for the state-value function:

$$V(s_t) \leftarrow V(s_t) + \alpha \big( \underbrace{r_t + \gamma V(s_{t+1})}_{\text{target}} - V(s_t) \big)$$

Q-learning update for the state-action value function:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( \underbrace{r_t + \gamma \max_{a} Q(s_{t+1}, a)}_{\text{target}} - Q(s_t, a_t) \big)$$

The quantity in parentheses (target minus current estimate) is the TD error.
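As a concrete instance of the update above, here is a minimal tabular Q-learning loop on FrozenLake. Hyperparameter values are illustrative, not the notebook's actual settings.

```python
# Tabular Q-learning with epsilon-greedy exploration, implementing the
# update rule shown above (hyperparameters are illustrative).
import numpy as np
import gym

env = gym.make("FrozenLake-v0", is_slippery=False)
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.1, 0.99, 0.1

for episode in range(5000):
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy: explore with probability eps, otherwise act greedily
        if np.random.rand() < eps:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done, _ = env.step(a)
        # TD error: bootstrapped target minus current estimate
        target = r + gamma * np.max(Q[s_next]) * (not done)
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
```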

SLIDE 9

Q-Learning Notebook (not slippery)

[Figure: the FrozenLake gridworld with numbered states, START, the HOLE cells, and GOAL.]

SLIDE 10

Q-Learning Notebook (SLIPPERY)

[Figure: the same numbered FrozenLake gridworld, now with slippery (stochastic) transitions.]

Remember: the reward is 1 only when the GOAL is reached, and 0 otherwise.

  • In the non-slippery case, the optimal solution is found: each state has a corresponding optimal action.
  • In the slippery case, where NO OPTIMAL SOLUTION CAN BE FOUND (transitions are stochastic, so no action sequence guarantees reaching the goal), the learned actions avoid all actions that could lead into a hole.

SLIDE 11

Deep Q-Learning

Extension of Q-learning to continuous state spaces: the Q-table is replaced by a neural network $Q_\theta$, trained to minimize the squared TD error

$$L(\theta) = \Big( \underbrace{r_t + \gamma \max_{a} Q_{\theta^-}(s_{t+1}, a)}_{\text{target}} - Q_\theta(s_t, a_t) \Big)^2$$

Problems with Deep Q-Learning

  • The target moves together with the trained parameters, which destabilizes learning. Solution: use of TARGET NETWORKS, a slowly-updated copy $Q_{\theta^-}$ that computes the target.
  • Consecutive transitions are strongly correlated, which hurts stochastic gradient descent. Solution: use of a REPLAY BUFFER with random sampling.

SLIDE 12

Deep Q-Learning

[Figure: the DQN training loop: transitions are stored in a replay buffer (evicting old data), minibatches are drawn by random sampling, and the target $y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')$ is computed by a separate target network.]

Deep Q-Learning Notebook

[Figure: visual gridworld with Agent, Hole, and GOAL cells.]
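A compact sketch of the ingredients in the diagram follows: a replay buffer with eviction, random minibatch sampling, and a separate target network. The network architecture, the one-hot state encoding, and all hyperparameters are illustrative assumptions, not the notebook's actual choices.

```python
# DQN ingredients from the diagram: replay buffer (with eviction), random
# sampling, and a target network. Sizes and rates are illustrative.
import random
from collections import deque

import numpy as np
import tensorflow as tf

n_states, n_actions = 16, 4              # assumed gridworld dimensions

def build_q_net():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(n_states,)),
        tf.keras.layers.Dense(n_actions),   # one Q-value per action
    ])

q_net = build_q_net()
target_net = build_q_net()
target_net.set_weights(q_net.get_weights())  # sync; re-copy every C steps

buffer = deque(maxlen=10000)        # old transitions evicted automatically
optimizer = tf.keras.optimizers.Adam(1e-3)
gamma = 0.99

def train_step(batch_size=32):
    # transitions are (one-hot state, action, reward, one-hot next state, done)
    batch = random.sample(buffer, batch_size)  # random sampling breaks correlation
    s, a, r, s2, done = (np.array(x, dtype=np.float32) for x in zip(*batch))
    # TD target computed with the frozen target network
    y = r + gamma * (1.0 - done) * target_net(s2).numpy().max(axis=1)
    with tf.GradientTape() as tape:
        q = tf.reduce_sum(
            q_net(s) * tf.one_hot(tf.cast(a, tf.int32), n_actions), axis=1)
        loss = tf.reduce_mean((y - q) ** 2)    # squared TD error
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
```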

SLIDE 13

Deep Q-Learning Notebook

A few pieces of advice:

  • Keep the exploration rate high and the learning rate low: many episodes are needed to reach satisfying solutions.
  • Always save good solutions (see the sketch below)!
  • When the agent is not learning, try adding noise!
  • The reward does not always increase: there is a certain variance across episodes.
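For the "always save good solutions" point, one simple pattern is to checkpoint the weights whenever the running-average reward improves. This is a sketch with hypothetical names (q_net is the Keras model from the DQN sketch above; the file name is made up).

```python
# Save the network whenever the recent average reward improves.
import numpy as np

episode_rewards = []
best_avg = -np.inf

def maybe_save(q_net, episode_reward, window=100):
    """Call once per episode; checkpoints on a new best running average."""
    global best_avg
    episode_rewards.append(episode_reward)
    avg = float(np.mean(episode_rewards[-window:]))
    if avg > best_avg:
        best_avg = avg
        q_net.save_weights("best_dqn.h5")   # hypothetical checkpoint path
```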

SLIDE 14

The two approaches of RL

Policy search («primal» formulation): the search for the optimal policy is done directly in the policy space.

Policy Search

SLIDE 15

Policy Gradients

$$\nabla_\theta J(\theta) = \nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$

Expand the expectation:

$$= \nabla_\theta \int P(\tau \mid \theta) \, R(\tau) \, d\tau$$

Bring the gradient under the integral and use the log-derivative trick:

$$= \int P(\tau \mid \theta) \, \nabla_\theta \log P(\tau \mid \theta) \, R(\tau) \, d\tau$$

Return to expectation form:

$$= \mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log P(\tau \mid \theta) \, R(\tau)\big]$$

Expression for the grad-log-prob of a trajectory:

$$\nabla_\theta \log P(\tau \mid \theta) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Policy Search

$R(\tau)$ is the return of a full episode. How to compute this? Sample full episodes and average (a Monte Carlo estimate); see the sketch below.
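A minimal sketch of the episode-return computation (the discount factor is illustrative; a common variant also uses the per-step reward-to-go instead of the single full-episode return):

```python
# Compute discounted returns for one full episode, accumulating backwards.
def discounted_returns(rewards, gamma=0.99):
    """rewards: per-step rewards of one full episode, in time order."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # R_t = r_t + gamma * R_{t+1}
        returns[t] = running
    return returns   # returns[0] equals the full-episode return R(tau)
```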

SLIDE 16

Pseudo-loss trick

Define the pseudo-loss

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t} \log \pi_\theta(a_t^i \mid s_t^i) \, R(\tau^i)$$

Its gradient is the sampled estimate of $-\nabla_\theta J(\theta)$: the minus sign changes gradient ascent on the expected return into gradient descent on the pseudo-loss.

Intuition: minimizing the pseudo-loss raises the log-probability of actions taken in high-return episodes and lowers it for low-return ones.
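In code, the pseudo-loss is a few lines with TensorFlow's automatic differentiation (which the warnings on the next slide refer to). The network shape and the two-action discrete space are assumptions for illustration:

```python
# REINFORCE step via the pseudo-loss, using TensorFlow 2 autodiff.
import tensorflow as tf

policy = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="tanh", input_shape=(4,)),
    tf.keras.layers.Dense(2),            # logits over 2 discrete actions
])
optimizer = tf.keras.optimizers.Adam(1e-2)

def reinforce_step(states, actions, returns):
    """states: (T, 4) float32; actions: (T,) int32; returns: (T,) float32."""
    with tf.GradientTape() as tape:
        logp = tf.nn.log_softmax(policy(states))
        logp_a = tf.reduce_sum(logp * tf.one_hot(actions, 2), axis=1)
        # Minus sign turns gradient ASCENT on J into gradient DESCENT
        # on the pseudo-loss.
        pseudo_loss = -tf.reduce_mean(logp_a * returns)
    grads = tape.gradient(pseudo_loss, policy.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy.trainable_variables))
```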

SLIDE 17

Pseudo-loss trick

Warnings:

  • This pseudo-loss is useful because we use the automatic differentiation of TensorFlow to compute the gradient.
  • This is not an actual loss function: it does not measure performance and has no meaning by itself. There is no guarantee that this works.
  • The loss is defined on a dataset that depends on the parameters (in standard gradient descent it should not).
  • Care only about the average reward, which is always a good performance indicator.

Policy Gradients Notebook

SLIDE 18

Beyond discrete action spaces: Deep Deterministic Policy Gradients

SLIDE 19

Deep Deterministic Policy Gradients

A deterministic actor $\mu_\theta(s)$ outputs continuous actions and is updated through the critic $Q_\phi$ by the chain rule:

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{s} \Big[ \nabla_a Q_\phi(s, a) \big|_{a=\mu_\theta(s)} \, \nabla_\theta \mu_\theta(s) \Big]$$

The critic is trained as in Deep Q-Learning, with a replay buffer and target networks.
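The chain-rule update translates directly into autodiff code: differentiate the critic's output with respect to the actor's parameters, through the action. A sketch with illustrative shapes and learning rates:

```python
# DDPG actor update: ascend Q(s, mu(s)) by descending its negative.
import tensorflow as tf

obs_dim, act_dim = 3, 1                 # assumed continuous-task dimensions

actor = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(obs_dim,)),
    tf.keras.layers.Dense(act_dim, activation="tanh"),   # bounded actions
])
critic = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          input_shape=(obs_dim + act_dim,)),
    tf.keras.layers.Dense(1),                            # Q(s, a)
])
actor_opt = tf.keras.optimizers.Adam(1e-4)

def actor_step(states):
    with tf.GradientTape() as tape:
        actions = actor(states)
        q = critic(tf.concat([states, actions], axis=1))
        # Autodiff applies the chain rule through a = mu(s).
        loss = -tf.reduce_mean(q)
    grads = tape.gradient(loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(grads, actor.trainable_variables))
```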

SLIDE 20

DDPG notebook

A few pieces of advice:

  • DDPG presents high reward variance. It is wise to keep track of all the best solutions and rank them in some clever way (e.g., prefer solutions with a correspondingly high average reward, or use some other performance measure, such as oscillations).
  • Sometimes no satisfying solution is found within the first N episodes. Don't worry: if you find a set of parameters with promising performance, save them and resume training from those parameters (see the sketch after this list). This might require smaller learning rates.
  • DDPG is very sensitive to hyperparameters, which must vary from case to case. No standard set of hyperparameters will work out of the box.
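For the resume-from-promising-parameters advice, a short sketch continuing the actor/critic code above (the checkpoint file names and the reduced learning rate are hypothetical):

```python
# Resume DDPG training from previously saved, promising parameters,
# with a smaller learning rate than the initial run.
actor.load_weights("promising_actor.h5")     # hypothetical checkpoint files
critic.load_weights("promising_critic.h5")
actor_opt = tf.keras.optimizers.Adam(1e-5)   # smaller than the initial 1e-4
```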
Layout

  • Reinforcement Learning Recap
  • Applications:
    • Q-Learning
    • Deep Q-Learning
    • Policy Gradients
    • Actor Critic (DDPG)
  • Major Difficulties in Deep RL
SLIDE 21

Algorithms recap

[Figure: algorithm families ranked along a sample-efficiency axis; computational efficiency runs in the opposite direction.]

  • Model-based shallow RL (off-policy): PILCO (most sample-efficient)
  • Model-based Deep RL (off-policy): PETS, Guided Policy Search
  • Replay Buffer / Value Estimation methods (off-policy): Q-Learning, DDPG, NAF, SAC
  • PG methods / Actor-Critic (on-policy): REINFORCE, TRPO
  • Fully online methods (on-policy): A3C
  • Gradient-free methods: Evolutionary (CMA-ES) (least sample-efficient)

Other useful references for RL

  • OpenAI Spinning Up: OpenAI's educational resource. It explains RL basics and the major Deep RL algorithms.
  • Sutton & Barto: Reinforcement Learning: An Introduction. Great book for the basics, full of examples. A bit hard to read at first, but a nice handbook.
  • Szepesvári: Algorithms for Reinforcement Learning. A handbook for the main RL algorithms.
  • S. Levine: Deep RL course @ Berkeley.
SLIDE 22

Practical problems in RL

  • Reward Hacking: since the reward is a scalar value, it is crucial to design it in such a way that there is no «clever way» to maximize it.
  • Optimal solution exists only for tabular methods: as soon as approximation or continuous spaces are used, no optimal solution is guaranteed.
  • RL suffers from the curse of dimensionality: exploration of the state space scales exponentially with the number of state-space dimensions.
  • Real-world systems often break if operated randomly.
  • Deep RL inherits all the advantages and challenges of Deep Learning (approximation and generalization, overfitting, high unpredictability, debugging difficulties).