Reinforcement Learning

Lecture 8 | COMP4050 Machine Learning | Fall 2015 | Derbinsky
Wentworth Institute of Technology | November 24, 2015


1. Reinforcement Learning: Lecture 8 (November 24, 2015)

2. Outline: 1. Context; 2. TD Learning; 3. Issues

3. Machine Learning Tasks
• Supervised: given a training set and a target variable, generalize; measured over a testing set
• Unsupervised: given a dataset, find "interesting" patterns; potentially no "right" answer
• Reinforcement: learn an optimal action policy over time; given an environment that provides states, affords actions, and provides feedback as numerical reward, maximize the expected future reward
  – Never given I/O pairs
  – Focus: online (balancing exploration/exploitation)

4. Success Stories

5. The Agent-Environment Interface
[Diagram: at each step the agent observes state s_t and emits action a_t; the (stochastic) environment responds with reward r_{t+1} and next state s_{t+1}]
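The interface boils down to a simple interaction loop. Here is a minimal Python sketch; the Env and Agent classes, their method names, and the toy dynamics are all assumptions for illustration, not part of the slides:

```python
import random

class Env:
    """A trivial stochastic environment with two states and two actions."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        r = 1.0 if a == self.s else 0.0   # reward r_{t+1}
        self.s = random.choice([0, 1])    # stochastic next state s_{t+1}
        return self.s, r

class Agent:
    def act(self, s):
        return random.choice([0, 1])      # placeholder policy

env, agent = Env(), Agent()
s = env.reset()
for t in range(5):
    a = agent.act(s)                      # agent emits action a_t from s_t
    s_next, r = env.step(a)               # environment returns r_{t+1}, s_{t+1}
    print(f"t={t}: s={s}, a={a}, r={r}, s'={s_next}")
    s = s_next
```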

6. Pole Balancing

7. Multi-Armed Bandit

8. Types of Tasks
• Some tasks are continuing, meaning they are an ongoing sequence of decisions
• Some tasks are episodic, meaning there exist terminal states that reset the problem

9. Policies
A policy π(s, a) is a function that associates a probability with taking a particular action in a particular state. The goal of RL is to learn an "effective" policy for a particular task.
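Concretely, a tabular stochastic policy can be stored as a map from each state to a distribution over actions. This sketch uses invented golf-flavored states and actions:

```python
import random

# A tabular stochastic policy pi(s, a): for each state, a probability
# distribution over actions. States and actions are invented examples.
pi = {
    "fairway": {"driver": 0.9, "putter": 0.1},
    "green":   {"driver": 0.05, "putter": 0.95},
}

def sample_action(pi, s):
    """Draw action a with probability pi(s, a)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "green"))  # "putter" about 95% of the time
```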

10. Objective
Select actions so that the sum of the discounted rewards the agent receives over the future is maximized, with discount rate 0 ≤ γ ≤ 1:

R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … = Σ_{k=0}^{∞} γ^k r_{t+k+1}
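As a quick worked example (not from the slides), the discounted return of a finite reward sequence can be computed directly from the definition:

```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k+1}, truncated to one episode's rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Rewards r_{t+1}, r_{t+2}, r_{t+3} observed after time t:
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0 + 0.81*2 = 2.62
```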

11. Environmental Modeling
• An important issue in RL is state representation
  – Current sensors (observability!)
  – Past history?
• A stochastic process has the Markov property if the conditional probability distribution of future states depends only upon the present state
  – Given the present, the future does not depend on the past
  – Memoryless, path-independent

12. Implications of the Markov Property
Often the process is not strictly Markovian, but we can either (i) approximate it as such and yield good results, or (ii) include a fixed window of history as state. Thus we can approximate

P(s_{t+1} = s′, r_{t+1} = r | s_t, a_t, s_{t−1}, a_{t−1}, …, r_1, s_0, a_0)

via

P(s_{t+1} = s′, r_{t+1} = r | s_t, a_t)
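Option (ii) can be implemented by folding the last k observations into a single state; a small sketch with a hypothetical observation stream:

```python
from collections import deque

k = 3
window = deque(maxlen=k)                 # fixed window of recent observations

for obs in ["o1", "o2", "o3", "o4"]:     # hypothetical observation stream
    window.append(obs)
    s = tuple(window)                    # hashable state = last k observations
    print(s)                             # ends at ('o2', 'o3', 'o4')
```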

13. Markov Decision Processes
If a process is Markovian, we can model it as a 5-tuple MDP (S, A, P(·,·), R(·,·), γ):
  – S: set of states
  – A: set of actions
  – P_a(s, s′): transition function, the probability that action a in state s leads to state s′
  – R_a(s, s′): immediate reward after transitioning from s to s′ via a
  – γ: discount rate
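One minimal way to encode such a 5-tuple in Python, loosely inspired by the recycling-robot example on the next slide (the states, actions, probabilities, and rewards here are all invented for illustration):

```python
# An MDP as (S, A, P, R, gamma). P[(s, a)] maps next states s' to
# probabilities; R[(s, a, s')] is the immediate reward.
S = {"high", "low"}                       # battery charge levels
A = {"search", "wait"}
P = {
    ("high", "search"): {"high": 0.7, "low": 0.3},
    ("high", "wait"):   {"high": 1.0},
    ("low",  "search"): {"low": 0.6, "high": 0.4},
    ("low",  "wait"):   {"low": 1.0},
}
R = {
    ("high", "search", "high"): 2.0, ("high", "search", "low"): 2.0,
    ("high", "wait",   "high"): 1.0,
    ("low",  "search", "low"):  2.0, ("low",  "search", "high"): -3.0,
    ("low",  "wait",   "low"):  1.0,
}
gamma = 0.9

# Sanity check: every transition distribution sums to 1.
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in P.values())
```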

14. Recycling Robot MDP

15. Value Functions
Almost all RL algorithms are based on estimating value functions: functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). Value functions are defined with respect to particular policies.

16. State-Value Function

V^π(s) = E_π[ R_t | s_t = s ] = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ]

17. Action-Value Function

Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ] = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a ]
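Both definitions can be read directly as Monte Carlo estimates: roll out many episodes under π and average the discounted returns. A minimal sketch, assuming hypothetical reset/step/policy callables (not from the slides) and truncating the infinite sum at a finite horizon:

```python
def estimate_v(reset, step, policy, gamma, n_episodes=1000, horizon=100):
    """Monte Carlo estimate of V^pi for the start state produced by reset():
    average the discounted return over rollouts that follow the policy.
    reset() -> s0 and step(a) -> (s', r) are assumed interfaces."""
    total = 0.0
    for _ in range(n_episodes):
        s, ret, discount = reset(), 0.0, 1.0
        for _ in range(horizon):          # truncate the infinite sum
            a = policy(s)
            s, r = step(a)
            ret += discount * r
            discount *= gamma
        total += ret
    return total / n_episodes
```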

18. Example: Golf

19. Temporal Difference (TD) Learning
• Combines ideas from Monte Carlo sampling and dynamic programming
• Learns directly from raw experience, without a model of environment dynamics
• Updates estimates based in part on other learned estimates, without waiting for a final outcome
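The prototypical instance is TD(0) for state values, which nudges V(s) toward the bootstrapped target r + γ V(s′) after every step. A minimal sketch; the learning rate α = 0.1 and the sample transition are illustrative assumptions:

```python
from collections import defaultdict

V = defaultdict(float)                    # value estimates, initialized to 0

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

# One observed transition (s, r, s'); the values are illustrative.
print(td0_update(V, "A", 1.0, "B"))       # td_error = 1.0 on the first visit
```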

20. Visual TD Learning

21. Q-Learning: Off-Policy TD Control
1. Initialize Q(s, a)
   – Random, optimistic, realistic, or from knowledge
2. Repeat (for each episode):
   a. Initialize s
   b. Repeat (for each step of episode), until s is terminal:
      i. Choose action a via Q
      ii. Take action, observe r, s′
      iii. Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]
      iv. s ← s′
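A tabular Python sketch of the loop above; the reset/step environment interface and the action list are assumptions for illustration (ε-greedy selection at step i is expanded on the next slide):

```python
import random
from collections import defaultdict

def q_learning(reset, step, actions, n_episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning. reset() -> s0; step(a) -> (s', r, done)."""
    Q = defaultdict(float)                          # 1. initialize Q(s, a)
    for _ in range(n_episodes):                     # 2. for each episode
        s, done = reset(), False                    # 2a. initialize s
        while not done:                             # 2b. for each step
            if random.random() < epsilon:           # i. choose action via Q
                a = random.choice(actions)          #    (epsilon-greedy)
            else:
                a = max(actions, key=lambda a: Q[(s, a)])
            s_next, r, done = step(a)               # ii. take action, observe
            best_next = max(Q[(s_next, a2)] for a2 in actions)
            # iii. Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next                              # iv. s = s'
    return Q
```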

22. Choosing Actions
• Given a Q function, a common approach to selecting actions is ε-greedy:
  1. Select a random value in [0, 1]
     – If > ε, take the action with the highest estimated value
     – Else, select an action randomly
• In the limit, every action will be sampled an infinite number of times
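The rule in isolation, as a sketch (the Q table, state, and action set are assumed):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability 1 - epsilon exploit the greedy action; otherwise
    explore uniformly, so every action keeps being sampled."""
    if random.random() > epsilon:                             # exploit
        return max(actions, key=lambda a: Q.get((s, a), 0.0))
    return random.choice(actions)                             # explore
```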

23. Function Representation
• Given large state-action spaces, there is a practical problem of how to sample the space, and how to represent it
• Modern approaches include hierarchical methods and neural networks
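As one illustration of a compact representation (a linear sketch, deliberately simpler than the hierarchical and neural approaches the slide names): approximate Q(s, a) ≈ w · φ(s, a) with a hand-coded feature map and update the weights by a semi-gradient step. The feature map and numbers here are invented:

```python
def phi(s, a):
    """Hand-coded features of a (state, action) pair (invented example)."""
    return [1.0, float(s), float(s) * (1.0 if a == "go" else 0.0)]

def q_hat(w, s, a):
    """Linear value estimate: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, phi(s, a)))

def sgd_update(w, s, a, target, alpha=0.01):
    """Semi-gradient step moving q_hat(s, a) toward a TD target."""
    error = target - q_hat(w, s, a)
    return [wi + alpha * error * xi for wi, xi in zip(w, phi(s, a))]

w = [0.0, 0.0, 0.0]
w = sgd_update(w, s=2.0, a="go", target=1.0)
print(q_hat(w, 2.0, "go"))  # 0.09: the estimate moved toward the target
```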

24. Application: Michigan Liar's Dice
• Multi-agent opponents
• Varying degrees of background knowledge
  – Opponent modeling
  – Probabilistic calculation
  – Symbolic heuristics

25. Evaluation: Learning vs. Static
[Plot: Games Won (0-250) vs. Blocks of Training (250 Games/Block)]

26. Evaluation: Learning vs. Learned
[Plot: Games Won (0-250) vs. Blocks of Training (250 Games/Block)]

27. Evaluation: Value-Function Initialization
[Plot: Games Won (0-250) vs. Blocks of Training (250 Games/Block), for agents PMH, PM, PH, P and PMH-0, PM-0, PH-0, P-0]

28. Summary
• Reinforcement Learning (RL) is the problem of learning an effective action policy for obtaining reward
• Most RL algorithms model the task as a Markov Decision Process (MDP) and estimate the value of states/state-action pairs in a value function
• Temporal-Difference (TD) learning is one effective method that is online and model-free
