
Reinforcement Learning: Lecture 8 - PowerPoint PPT Presentation



  1. Reinforcement Learning (Lecture 8). Wentworth Institute of Technology, COMP4050 Machine Learning, Fall 2015, Derbinsky. November 24, 2015.

  2. Outline: (1) Context; (2) TD Learning; (3) Issues

  3. Machine Learning Tasks
     • Supervised: given a training set and a target variable, generalize; performance is measured over a testing set.
     • Unsupervised: given a dataset, find "interesting" patterns; there is potentially no "right" answer.
     • Reinforcement: learn an optimal action policy over time. Given an environment that provides states, affords actions, and provides feedback as numerical reward, maximize the expected future reward. The learner is never given I/O pairs; the focus is online learning (balancing exploration and exploitation).

  4. Success Stories

  5. The Agent-Environment Interface
     [Figure: at each step t, the agent observes state s_t and takes action a_t; the (stochastic) environment responds with reward r_{t+1} and next state s_{t+1}.]

  6. Pole Balancing

  7. Multi-Armed Bandit

  8. Types of Tasks
     • Some tasks are continuous, meaning they are an ongoing sequence of decisions.
     • Some tasks are episodic, meaning there exist terminal states that reset the problem.

  9. Policies
     A policy π(s, a) is a function that associates a probability with taking a particular action in a particular state. The goal of RL is to learn an "effective" policy for a particular task.
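To make this concrete, a tabular stochastic policy can simply map each state to a distribution over actions. The sketch below is added for illustration, with hypothetical state/action names:

```python
import random

# A tabular stochastic policy: pi[s][a] = probability of taking action a in state s.
# The state/action names are hypothetical, loosely echoing the recycling-robot
# example later in the deck.
pi = {
    "high": {"search": 0.7, "wait": 0.3},
    "low":  {"search": 0.2, "wait": 0.5, "recharge": 0.3},
}

def sample_action(pi, state):
    """Draw an action according to the policy's probabilities in this state."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "high"))  # e.g. "search"
```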

  10. Objective
     Select actions so that the sum of the discounted rewards received over the future is maximized, with discount rate 0 ≤ γ ≤ 1:
     R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1}
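As a quick numeric check of the formula (the discount rate and reward sequence below are made up for illustration), the return for a finite reward sequence can be computed directly:

```python
# Discounted return: R_t = sum over k of gamma^k * r_{t+k+1}.
# gamma and the reward sequence are illustrative values only.
gamma = 0.9
rewards = [1.0, 1.0, 1.0, 10.0]  # r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4}

R_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(R_t)  # 1.0 + 0.9 + 0.81 + 7.29 = 10.0
```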

  11. Environmental Modeling
     • An important issue in RL is state representation: current sensors (observability!) and, possibly, past history.
     • A stochastic process has the Markov property if the conditional probability distribution of future states depends only upon the present state. Given the present, the future does not depend on the past: memoryless, pathless.

  12. Implications of the Markov Property
     Often the process is not strictly Markovian, but we can either (i) approximate it as such and yield good results, or (ii) include a fixed window of history as state. Thus we can approximate
     P(s_{t+1} = s', r_{t+1} = r | s_t, a_t, s_{t-1}, a_{t-1}, ..., r_1, s_0, a_0)
     via
     P(s_{t+1} = s', r_{t+1} = r | s_t, a_t)
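One minimal way to realize option (ii), sketched here with assumed names rather than anything from the slides, is to fold the last k observations into a single state:

```python
from collections import deque

# Approximate the Markov property by using a fixed window of recent
# observations as the state. k and the observation stream are illustrative.
k = 3
history = deque(maxlen=k)

def augmented_state(obs):
    """Return a hashable state built from the last k raw observations."""
    history.append(obs)
    return tuple(history)

for obs in ["o1", "o2", "o3", "o4"]:
    print(augmented_state(obs))
# ('o1',) ('o1', 'o2') ('o1', 'o2', 'o3') ('o2', 'o3', 'o4')
```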

  13. Markov Decision Processes
     If a process is Markovian, we can model it as a 5-tuple MDP (S, A, P(·,·), R(·,·), γ):
     • S: set of states
     • A: set of actions
     • P_a(s, s'): transition function
     • R_a(s, s'): immediate reward
     • γ: discount rate
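For concreteness, here is one (hypothetical) way to encode a tiny MDP's pieces as plain dictionaries; the states, actions, probabilities, and rewards are all invented for illustration:

```python
# S and A as lists; P[s][a] = list of (next_state, probability) pairs;
# R[s][a][s'] = immediate reward; gamma = discount rate. All values invented.
S = ["high", "low"]
A = ["search", "wait"]
gamma = 0.9

P = {
    "high": {"search": [("high", 0.8), ("low", 0.2)], "wait": [("high", 1.0)]},
    "low":  {"search": [("low", 0.6), ("high", 0.4)], "wait": [("low", 1.0)]},
}
R = {
    "high": {"search": {"high": 2.0, "low": 2.0}, "wait": {"high": 1.0}},
    "low":  {"search": {"low": 2.0, "high": -3.0}, "wait": {"low": 1.0}},
}
```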

  14. Recycling Robot MDP

  15. Value Functions
     Almost all RL algorithms are based on estimating value functions: functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). Value functions are defined with respect to particular policies.

  16. State-Value Function
     V^π(s) = E_π[ R_t | s_t = s ] = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ]

  17. Action-Value Function
     Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ] = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a ]
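Both definitions are expectations over returns, so one simple (illustrative, not from the slides) way to estimate them is to average sampled returns across many episodes; here for V^π, assuming a helper `sample_episode(state)` that follows π and returns the observed reward sequence:

```python
def mc_value_estimate(sample_episode, state, gamma=0.9, n_episodes=1000):
    """Monte Carlo estimate of V^pi(state): average the discounted return
    over episodes that start in `state` and follow the policy pi.
    `sample_episode` is an assumed helper returning [r_{t+1}, r_{t+2}, ...]."""
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_episode(state)
        total += sum(gamma**k * r for k, r in enumerate(rewards))
    return total / n_episodes
```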

  18. Example: Golf

  19. Temporal Difference (TD) Learning
     • Combines ideas from Monte Carlo sampling and dynamic programming.
     • Learns directly from raw experience, without a model of environment dynamics.
     • Updates estimates based in part on other learned estimates, without waiting for a final outcome (see the sketch below).
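A minimal sketch of the canonical instance, tabular TD(0) for state values (names assumed, not from the slides): the update V(s) ← V(s) + α[r + γ V(s') − V(s)] bootstraps from the current estimate of the next state, so no completed episode is required.

```python
from collections import defaultdict

V = defaultdict(float)  # state-value estimates, initialized to 0.0

def td0_update(s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma*V(s').
    The target uses the current estimate V(s'), not a completed return."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```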

  20. Visual TD Learning

  21. Q-Learning: Off-Policy TD Control
     1. Initialize Q(s, a): random, optimistic, realistic, or from knowledge
     2. Repeat (for each episode):
        a. Initialize s
        b. Repeat (for each step of episode), until s is terminal:
           i.   Choose action a via Q
           ii.  Take action, observe r, s'
           iii. Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
           iv.  s ← s'
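The pseudocode above maps almost line-for-line onto code. This sketch assumes a hypothetical `env` object with `reset()` and `step(action)` methods and a finite action list; it is an illustration, not the lecture's implementation:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning. Assumed interface: env.reset() -> initial state;
    env.step(a) -> (reward, next_state, done)."""
    Q = defaultdict(float)                      # 1. initialize Q(s, a) to zeros
    for _ in range(episodes):                   # 2. for each episode
        s = env.reset()                         # 2a. initialize s
        done = False
        while not done:                         # 2b. for each step of episode
            if random.random() > eps:           # i. choose action via Q (eps-greedy)
                a = max(actions, key=lambda act: Q[(s, act)])
            else:
                a = random.choice(actions)
            r, s_next, done = env.step(a)       # ii. take action, observe r, s'
            best_next = max(Q[(s_next, act)] for act in actions)
            # iii. off-policy update toward r + gamma * max_a' Q(s', a')
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next                          # iv. s <- s'
    return Q
```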

  22. Choosing Actions
     • Given a Q function, a common approach to selecting actions is ε-greedy:
       1. Draw a random value in [0, 1].
       2. If it is > ε, take the action with the highest estimated value; else, select an action at random.
     • In the limit, every action will be sampled an infinite number of times.
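A direct transcription of the rule (the Q-table and action list are assumed arguments):

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability 1 - eps act greedily w.r.t. Q; otherwise explore uniformly."""
    if random.random() > eps:
        return max(actions, key=lambda a: Q.get((s, a), 0.0))
    return random.choice(actions)
```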

  23. Function Representation
     • Given large state-action spaces, there is a practical problem of how to sample the space and how to represent the value function.
     • Modern approaches include hierarchical methods and neural networks.
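One classic representation (added here for illustration; the slide does not name it) approximates Q(s, a) as a linear function of features, so a weight vector replaces the full table:

```python
# Linear function approximation: Q(s, a) ~ w . phi(s, a), where `phi` is an
# assumed feature function mapping a state-action pair to a list of floats.
def q_hat(w, phi, s, a):
    return sum(wi * xi for wi, xi in zip(w, phi(s, a)))

def semi_gradient_update(w, phi, s, a, target, alpha=0.01):
    """Nudge the shared weights so q_hat(s, a) moves toward the TD target."""
    err = target - q_hat(w, phi, s, a)
    return [wi + alpha * err * xi for wi, xi in zip(w, phi(s, a))]
```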

  24. Application: Michigan Liar's Dice
     • Multi-agent opponents
     • Varying degrees of background knowledge: opponent modeling, probabilistic calculation, symbolic heuristics

  25. Evaluation: Learning vs. Static
     [Plot: Games Won (0-250) vs. Blocks of Training (250 games/block)]

  26. Evaluation: Learning vs. Learned
     [Plot: Games Won (0-250) vs. Blocks of Training (250 games/block)]

  27. Evaluation: Value-Function Initialization
     [Plot: Games Won (0-250) vs. Blocks of Training (250 games/block), comparing initializations PMH, PM, PH, P and PMH-0, PM-0, PH-0, P-0]

  28. Summary
     • Reinforcement Learning (RL) is the problem of learning an effective action policy for obtaining reward.
     • Most RL algorithms model the task as a Markov Decision Process (MDP) and estimate the value of states (or state-action pairs) in a value function.
     • Temporal-Difference (TD) learning is one effective method that is online and model-free.
