
Reinforcement Learning, Kevin Spiteri, April 21, 2015 (PowerPoint presentation)



  1. Reinforcement Learning, Kevin Spiteri, April 21, 2015

  2. n-armed bandit

  3. n-armed bandit [Figure: three arms with true payoff probabilities 0.9, 0.5, and 0.1]

  4. n-armed bandit [Figure: arms 0.9, 0.5, 0.1, with the estimate for each arm initialized to 0.0]

  5. n-armed bandit [Figure: arms 0.9, 0.5, 0.1; the estimate, attempts, and payoff rows are all initialized to 0]

  6. n-armed bandit [Figure: the first pull pays off; 1 attempt and 1 payoff give an estimate of 1.0 for that arm]

  7. n-armed bandit [Figure: after 2 attempts and 1 payoff in total, the overall estimate is 0.5]

  8. Exploration [Figure: after 3 attempts and 2 payoffs in total, the overall estimate is 0.67]

  9. Going on … [Figure: after 300 attempts (280, 10, and 10 per arm) and 258 payoffs, the per-arm estimates are 0.9, 0.5, and 0.1, and the overall estimate is 0.86]

  10. Changing environment [Figure: the true payoff probabilities change to 0.7, 0.8, 0.1, while the estimates still show 0.9, 0.5, 0.1]

  11. Changing environment [Figure: after 600 attempts and 463 payoffs, the per-arm estimates are 0.8, 0.65, and 0.1, and the overall estimate has dropped to 0.77]

  12. Changing environment [Figure: after 1500 attempts and 1078 payoffs, the per-arm estimates are 0.74, 0.74, and 0.1, and the overall estimate is 0.72]

  13. n-armed bandit ● Optimal payoff (0.82 per round): 0.9 x 300 + 0.8 x 1200 = 1230 ● Actual payoff (0.72 per round): 0.9 x 280 + 0.5 x 10 + 0.1 x 10 + 0.7 x 1120 + 0.8 x 40 + 0.1 x 40 = 1078

  14. n-armed bandit ● Evaluation vs instruction. ● Discounting. ● Initial estimates. ● There is no single best or standard approach.
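
The bandit walkthrough above can be reproduced with a short simulation. Below is a minimal sketch, assuming an ε-greedy agent with sample-average estimates; the arm probabilities come from the slides, while ε, the seed, and the function name are illustrative choices, not the presenter's.

```python
import random

def run_bandit(true_probs, steps, epsilon=0.1, seed=0):
    """Epsilon-greedy n-armed bandit with sample-average estimates."""
    rng = random.Random(seed)
    n = len(true_probs)
    estimate = [0.0] * n   # estimated payoff probability per arm
    attempts = [0] * n     # pulls per arm
    payoff = [0] * n       # total payoff per arm
    for _ in range(steps):
        if rng.random() < epsilon:                          # explore
            arm = rng.randrange(n)
        else:                                               # exploit the current best estimate
            arm = max(range(n), key=lambda i: estimate[i])
        reward = 1 if rng.random() < true_probs[arm] else 0
        attempts[arm] += 1
        payoff[arm] += reward
        estimate[arm] = payoff[arm] / attempts[arm]         # sample average
    return estimate, attempts, payoff

print(run_bandit([0.9, 0.5, 0.1], 300))
```

Sample averages never forget old data, which is why the estimates in the slides adapt slowly after the probabilities change at round 300; a constant step size, estimate[arm] += alpha * (reward - estimate[arm]), tracks a changing environment faster.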

  15. Markov Decision Process (MDP)

  16. Markov Decision Process (MDP) ● States

  17. Markov Decision Process (MDP) ● States

  18. Markov Decision Process (MDP) ● States ● Actions [Figure: actions labelled a, b, c on the state diagram]

  19. Markov Decision Process (MDP) ● States ● Actions ● Model [Figure: action a leads to two different states with probabilities 0.75 and 0.25]

  20. Markov Decision Process (MDP) ● States ● Actions ● Model [Figure: action a leads to two different states with probabilities 0.75 and 0.25]

  21. Markov Decision Process (MDP) ● States ● Actions ● Model ● Reward [Figure: rewards 5, -1, and 0 attached to the transitions]

  22. Markov Decision Process (MDP) ● States ● Actions ● Model ● Reward ● Policy [Figure: rewards 5, -1, and 0 attached to the transitions]
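
In standard notation (the transcript does not capture the slide's own symbols, so this is just the usual textbook form), the ingredients listed above are:

```latex
\text{MDP} = (\mathcal{S}, \mathcal{A}, P, R), \qquad
P(s' \mid s, a) \ \text{(model)}, \qquad
R(s, a, s') \ \text{(reward)}, \qquad
\pi(a \mid s) \ \text{(policy)}.
```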

  23. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f)

  24. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f)

  25. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait

  26. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait [Figure: attempting from the hand reaches the basket with probability 0.25 and the floor with probability 0.75]

  27. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait [Figure: attempting from the hand reaches the basket with probability 0.25 and the floor with probability 0.75]

  28. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait [Figure: attempting from the hand reaches the basket (reward 5) with probability 0.25 and the floor (reward -1) with probability 0.75; waiting gives reward 0]

  29. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait [Figure: attempting from the hand reaches the basket (reward 5) with probability 0.25 and the floor (reward -1) with probability 0.75; waiting gives reward 0] Expected reward per round: 0.25 x 5 + 0.75 x (-1) = 0.5

  30. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait [Figure: as on slide 28, with an additional reward of -1 labelled next to the hand state]

  31. Markov Decision Process (MDP) ● States: ball on table (t), in hand (h), in basket (b), on floor (f) ● Actions: a) attempt, b) drop, c) wait [Figure: as on slide 28, with an additional reward of -1 labelled next to the hand state]
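
The expected-reward calculation on slide 29 can be checked with a tiny encoding of the model. This is a sketch under assumptions: the dict layout and the wait transition are illustrative, while the probabilities (0.25 / 0.75) and rewards (5, -1, 0) are the ones on the slides.

```python
# model[state][action] -> list of (probability, next_state, reward)
model = {
    "hand": {
        "attempt": [(0.25, "basket", 5), (0.75, "floor", -1)],  # numbers from slide 29
        "wait":    [(1.0,  "hand",   0)],                        # assumed self-transition
    },
}

def expected_reward(state, action):
    """Expected immediate reward of taking `action` in `state`."""
    return sum(p * r for p, _next_state, r in model[state][action])

print(expected_reward("hand", "attempt"))  # 0.25*5 + 0.75*(-1) = 0.5, as on slide 29
```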

  32. Reinforcement Learning Tools ● Dynamic Programming ● Monte Carlo Methods ● Temporal Difference Learning

  33. Grid World ● Reward: normal move -1, move over an obstacle -10 ● Best total reward: -15

  34. Optimal Policy

  35. Value Function
      -15  -8  -7   0
      -14  -9  -6  -1
      -13 -10  -5  -2
      -12 -11  -4  -3

  36. Initial Policy

  37. Policy Iteration
      -21 -11 -10   0
      -22 -12 -11  -1
      -23 -13 -12  -2
      -24 -14 -13  -3

  38. Policy Iteration
      -21 -11 -10   0
      -22 -12 -11  -1
      -23 -13 -12  -2
      -24 -14 -13  -3

  39. Policy Iteration
      -21 -11 -10   0
      -22 -12 -11  -1
      -23 -13 -12  -2
      -24 -14 -13  -3

  40. Policy Iteration

  41. Policy Iteration
      -21 -11 -10   0
      -22 -12 -11  -1
      -23 -13 -12  -2
      -15 -14  -4  -3

  42. Policy Iteration
      -21 -11 -10   0
      -22 -12 -11  -1
      -23 -13 -12  -2
      -15 -14  -4  -3

  43. Policy Iteration
      -21 -11 -10   0
      -22 -12 -11  -1
      -23 -13 -12  -2
      -15 -14  -4  -3

  44. Policy Iteration

  45. Policy Iteration
      -15  -8  -7   0
      -14  -9  -6  -1
      -13 -10  -5  -2
      -12 -11  -4  -3
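
The evaluate-then-improve loop shown on slides 36-45 can be sketched in a few lines. This is a generic sketch, not the presenter's code: the slides' obstacle layout is not recoverable from the transcript, so the MDP is passed in as a hypothetical dict mapping state -> action -> list of (probability, next_state, reward), with terminal states mapped to an empty dict.

```python
def policy_evaluation(mdp, policy, gamma=1.0, theta=1e-6):
    """Iteratively compute the value of every state under a fixed policy."""
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s, actions in mdp.items():
            if not actions:                      # terminal state: value stays 0
                continue
            v = sum(p * (r + gamma * V[s2]) for p, s2, r in actions[policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def policy_iteration(mdp, gamma=1.0):
    """Alternate policy evaluation and greedy policy improvement until stable."""
    policy = {s: next(iter(acts)) for s, acts in mdp.items() if acts}   # arbitrary start
    while True:
        V = policy_evaluation(mdp, policy, gamma)
        stable = True
        for s, acts in mdp.items():
            if not acts:
                continue
            best = max(acts, key=lambda a: sum(p * (r + gamma * V[s2])
                                               for p, s2, r in acts[a]))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```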

  46. Value Iteration
        0   0   0   0
        0   0   0   0
        0   0   0   0
        0   0   0   0

  47. Value Iteration
       -1  -1  -1   0
       -1  -1  -1  -1
       -1  -1  -1  -1
       -1  -1  -1  -1

  48. Value Iteration
       -2  -2  -2   0
       -2  -2  -2  -1
       -2  -2  -2  -2
       -2  -2  -2  -2

  49. Value Iteration
       -3  -3  -3   0
       -3  -3  -3  -1
       -3  -3  -3  -2
       -3  -3  -3  -3

  50. Value Iteration
      -15  -8  -7   0
      -14  -9  -6  -1
      -13 -10  -5  -2
      -12 -11  -4  -3
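
Slides 46-50 show successive sweeps of value iteration (0, -1, -2, -3, ... converging to the value function of slide 35). A minimal sketch, using the same hypothetical mdp dict format as the policy-iteration sketch above:

```python
def value_iteration(mdp, gamma=1.0, theta=1e-6):
    """Sweep the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s, actions in mdp.items():
            if not actions:                      # terminal state keeps value 0
                continue
            v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                    for outcomes in actions.values())
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V
```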

  51. Stochastic Model [Figure: the intended move is taken with probability 0.95; each of the two perpendicular slips has probability 0.025]

  52. Value Iteration (stochastic model: 0.95 intended move, 0.025 slip to either side)
      -19.2 -10.4  -9.3   0
      -18.1 -12.1  -8.2  -1.5
      -17.0 -13.6  -6.7  -2.9
      -15.7 -14.7  -5.1  -4.0

  53. Value Iteration (stochastic model: 0.95 / 0.025 / 0.025)
      -19.2 -10.4  -9.3   0
      -18.1 -12.1  -8.2  -1.5
      -17.0 -13.6  -6.7  -2.9
      -15.7 -14.7  -5.1  -4.0
      E.g. 13.6 = 0.950 x 13.1 + 0.025 x 27.0 + 0.025 x 16.7
           16.6 = 0.950 x 16.7 + 0.025 x 13.1 + 0.025 x 15.7

  54. Richard Bellman

  55. Bellman Equation
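
The equation itself is not captured in the transcript. For reference, the standard Bellman optimality equation for state values, which the value-iteration sweeps above apply repeatedly, is:

```latex
V^{*}(s) \;=\; \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s, a, s') + \gamma\, V^{*}(s') \,\bigr]
```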

  56. Reinforcement Learning Tools ● Dynamic Programming ● Monte Carlo Methods ● Temporal Difference Learning

  57. Monte Carlo Methods [Figure: grid world with the stochastic model (0.95 intended move, 0.025 slip to either side)]

  58. Monte Carlo Methods [Figure: grid world with the stochastic model (0.95 / 0.025 / 0.025)]

  59. Monte Carlo Methods [Figure: returns observed along one sampled episode: -32, -22, -10, 0, -21, -11]

  60. Monte Carlo Methods [Figure: grid world with the stochastic model (0.95 / 0.025 / 0.025)]

  61. Monte Carlo Methods [Figure: returns observed along one sampled episode: -21, -11, -10, 0]

  62. Monte Carlo Methods [Figure: grid world with the stochastic model (0.95 / 0.025 / 0.025)]

  63. Monte Carlo Methods [Figure: returns observed along one sampled episode: -32, -10, 0, -31, -21, -11]
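
The slides estimate a state's value by averaging the returns observed after visiting it in sampled episodes. A first-visit Monte Carlo evaluation sketch; the episode format (a list of (state, reward) pairs, with the reward received after leaving the state) is an assumed convention, not the slides':

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma=1.0):
    """First-visit Monte Carlo: average returns following each state's first visit."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        first_visit_return = {}
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = reward + gamma * G               # return from time t to the end
            first_visit_return[state] = G        # earlier visits overwrite later ones
        for state, g in first_visit_return.items():
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```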

  64. Q-Value [Figure: stochastic grid world (0.95 / 0.025 / 0.025); action values from one cell: -15, -10, -8, -20]

  65. Bellman Equation [Figure: the same action values -15, -10, -8, -20]

  66. Learning Rate ● We do not replace an old Q value with a new one. ● We move it toward the new value at a chosen learning rate. ● Learning rate too small: slow to converge. ● Learning rate too large: unstable. ● Will Dabney's PhD thesis: Adaptive Step-Sizes for Reinforcement Learning.
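
The update described above in one line; the numbers in the example call are illustrative only:

```python
def update(old_estimate, target, alpha=0.1):
    """Move the old estimate a fraction alpha of the way toward the new target."""
    return old_estimate + alpha * (target - old_estimate)

print(update(-15.0, -12.0))  # -15 + 0.1 * (-12 - (-15)) = -14.7
```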

  67. Reinforcement Learning Tools ● Dynamic Programming ● Monte Carlo Methods ● Temporal Difference Learning

  68. Richard Sutton

  69. Temporal Difference Learning ● Dynamic Programming: Learn a guess from other guesses (bootstrapping). ● Monte Carlo Methods: Learn without knowing the model.

  70. Temporal Difference Learning Temporal Difference: ● Learn a guess from other guesses (bootstrapping). ● Learn without knowing the model. ● Handles longer episodes better than Monte Carlo methods.

  71. Temporal Difference Learning Monte Carlo Methods: ● First run through the whole episode. ● Update state values at the end. Temporal Difference Learning: ● Update the state value at each step using earlier guesses.

  72. Monte Carlo Methods [Figure: stochastic grid world (0.95 / 0.025 / 0.025); returns along one sampled episode: -32, -10, 0, -31, -21, -11]

  73. Monte Carlo Methods [Figure: stochastic grid world (0.95 / 0.025 / 0.025); returns along one sampled episode: -32, -10, 0, -31, -21, -11]

  74. Temporal Difference [Figure: stochastic grid world (0.95 / 0.025 / 0.025); current value estimates along one episode: -19, -10, 0, -22, -18, -12]

  75. Temporal Difference [Figure: the current estimates from the previous slide shown together with new backed-up values -23, -28, -21, -11, -10 along the episode; the goal keeps value 0]

  76. Temporal Difference [Figure: each backed-up value is the step cost plus the value of the next state] 23 = 1 + 22; 28 = 10 + 18; 21 = 10 + 11; 11 = 1 + 10; 10 = 10 + 0
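
A tabular TD(0) sketch of the update illustrated above: each state's value moves toward the one-step target, reward plus the current estimate of the next state. The episode format (a list of (state, reward, next_state) steps) and the use of rewards rather than the slides' positive costs are assumptions:

```python
def td0_update(V, episode, alpha=0.1, gamma=1.0):
    """One pass of tabular TD(0) over a single episode; V maps state -> value."""
    for state, reward, next_state in episode:
        v = V.get(state, 0.0)
        target = reward + gamma * V.get(next_state, 0.0)   # bootstrap from the next state's estimate
        V[state] = v + alpha * (target - v)
    return V
```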

  77. Function Approximation ● Most problems have a large state space. ● We can generally design an approximation of the state space. ● Choosing a good approximation has a large influence on system performance.

  78. Mountain Car Problem

  79. Mountain Car Problem ● The car cannot drive straight to the top. ● The car can swing back and forth to gain momentum. ● We know x and ẋ. ● x and ẋ give an infinite state space. ● Random policy: may reach the top in 1000 steps. ● Optimal policy: may reach the top in 102 steps.

  80. Function Approximation ● We can partition the state space into a 200 x 200 grid. ● Coarse coding: different ways of partitioning the state space. ● We can approximate V = wᵀf. ● E.g. f = (x, ẋ, height, ẋ²)ᵀ. ● We can estimate w to solve the problem.
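
A linear value-approximation sketch in the spirit of the slide: V(s) ≈ w·f(s), with w adjusted by semi-gradient TD(0). The feature vector mirrors the slide's example f = (x, ẋ, height, ẋ²); the height function, step size, and function names are assumptions, not the presenter's choices:

```python
import math

def features(x, x_dot):
    height = math.sin(3 * x)                 # assumed hill shape, for illustration only
    return [x, x_dot, height, x_dot ** 2]

def v_hat(w, x, x_dot):
    """Approximate value: dot product of weights and features."""
    return sum(wi * fi for wi, fi in zip(w, features(x, x_dot)))

def semi_gradient_td_step(w, state, reward, next_state, alpha=0.01, gamma=1.0):
    """One semi-gradient TD(0) weight update; state = (x, x_dot)."""
    f = features(*state)
    delta = reward + gamma * v_hat(w, *next_state) - v_hat(w, *state)   # TD error
    return [wi + alpha * delta * fi for wi, fi in zip(w, f)]            # gradient of w.f is f
```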

  81. Problems with Reinforcement Learning ● The policy sometimes gets worse during learning: Safe Reinforcement Learning (Phil Thomas) guarantees an improvement over the current policy. ● Learned behaviour is very specific to the training task: Learning Parameterized Skills (Bruno Castro da Silva, PhD thesis).

  82. Checkers ● Arthur Samuel (IBM) 1959

  83. TD-Gammon ● Neural networks and temporal difference learning. ● Current programs play better than human experts. ● Expert work goes into input selection.
