
Machine Learning and Data Mining: Reinforcement Learning. Markov Decision Processes. Kalev Kask. Overview: Intro; Markov Decision Processes; Reinforcement Learning; Sarsa; Q-learning; Exploration vs. Exploitation tradeoff.


  1. Policy Evaluation: Grid World

  2. Policy Evaluation: Grid World

  3. Policy Evaluation: Grid World

  4. Policy Evaluation: Grid World

  5. Most of the story in a nutshell:

  6. Finding Best Policy

  7. Lecture 3: Planning by Dynamic Programming. Policy Improvement (Policy Iteration). Given a policy π: evaluate the policy, v_π(s) = E[ R_{t+1} + γ R_{t+2} + … | S_t = s ]; then improve the policy by acting greedily with respect to v_π, π' = greedy(v_π). In the small gridworld the improved policy was already optimal, π' = π∗. In general, more iterations of improvement / evaluation are needed, but this process of policy iteration always converges to π∗.
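
As a concrete illustration (not part of the original slides), here is a minimal policy iteration sketch in Python. It assumes a hypothetical tabular model P, where P[s][a] is a list of (probability, next_state, reward) triples; all names are illustrative, not from the lecture.

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation: sweep the Bellman expectation backup
    over all states until the value function stops changing."""
    n_states = len(P)
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            a = policy[s]
            # v_pi(s) = sum_{s'} p(s',r|s,a) * (r + gamma * v_pi(s'))
            new_v = sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

def policy_iteration(P, n_actions, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    n_states = len(P)
    policy = np.zeros(n_states, dtype=int)   # arbitrary initial policy
    while True:
        v = policy_evaluation(P, policy, gamma)
        stable = True
        for s in range(n_states):
            # q_pi(s,a) for each action, then act greedily: pi'(s) = argmax_a q_pi(s,a)
            q = [sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, v
```

On a small gridworld the outer loop typically stabilises after a few improvement steps, matching the slide's remark that the improved policy was already optimal there.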

  8. Policy Iteration

  9. Lecture 3: Planning by Dynamic Programming. Policy Iteration. Policy evaluation: estimate v_π (iterative policy evaluation). Policy improvement: generate π' ≥ π (greedy policy improvement).

  10. Jack’s Car Rental

  11. Policy Iteration in Car Rental

  12. Lecture 3: Planning by Dynamic Programming. Policy Improvement (Policy Iteration).

  13. Lecture 3: Planning by Dynamic Programming. Policy Improvement (2). If improvements stop, q_π(s, π'(s)) = max_{a∈A} q_π(s, a) = q_π(s, π(s)) = v_π(s). Then the Bellman optimality equation has been satisfied: v_π(s) = max_{a∈A} q_π(s, a). Therefore v_π(s) = v∗(s) for all s ∈ S, so π is an optimal policy.
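
For reference, the chain of equalities above is the tight case of the standard policy-improvement argument (as in Sutton and Barto), which shows that acting greedily never makes a policy worse; a sketch of that inequality, not reproduced on the slide, is:

\[
v_\pi(s) \le q_\pi(s, \pi'(s))
        = \mathbb{E}\big[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s,\ A_t = \pi'(s) \big]
        \le \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma\, q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \big]
        \le \cdots \le v_{\pi'}(s).
\]

When improvement stops, the first inequality holds with equality in every state, which is exactly the condition the slide turns into the Bellman optimality equation.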

  14. Lecture 3: Planning by Dynamic Programming. Some Technical Questions (Contraction Mapping). How do we know that value iteration converges to v∗? Or that iterative policy evaluation converges to v_π? And therefore that policy iteration converges to v∗? Is the solution unique? How fast do these algorithms converge? These questions are resolved by the contraction mapping theorem.

  15. Lecture 3: Planning by Dynamic Programming. Value Function Space (Contraction Mapping). Consider the vector space V over value functions; it has |S| dimensions, and each point in this space fully specifies a value function v(s). What does a Bellman backup do to points in this space? We will show that it brings value functions closer, and therefore the backups must converge on a unique solution.

  16. Lecture 3: Planning by Dynamic Programming. Value Function ∞-Norm (Contraction Mapping). We will measure the distance between state-value functions u and v by the ∞-norm, i.e. the largest difference between state values: ||u − v||_∞ = max_{s∈S} |u(s) − v(s)|.
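
As a trivial but handy sketch (assuming value functions are stored as NumPy arrays indexed by state, an assumption not made on the slides), this distance is one line of code:

```python
import numpy as np

def sup_norm_distance(u, v):
    """Infinity-norm distance between two tabular value functions:
    the largest absolute difference over all states."""
    return float(np.max(np.abs(u - v)))
```

This is the natural stopping test for the iterative sketches above, e.g. sup_norm_distance(v_new, v_old) < tol.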

  17. Lecture 3: Planning by Dynamic Programming. Bellman Expectation Backup is a Contraction (Contraction Mapping).

  18. Lecture 3: Planning by Dynamic Programming. Contraction Mapping Theorem. Theorem (Contraction Mapping Theorem): for any metric space V that is complete (i.e. closed) under an operator T(v), where T is a γ-contraction, T converges to a unique fixed point, at a linear convergence rate of γ.
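
A short worked consequence (standard, not spelled out on the slide): because v∗ is a fixed point, applying a γ-contraction T repeatedly shrinks the distance to v∗ geometrically, which is what a linear convergence rate of γ means here:

\[
\| T^{k}(v_0) - v_* \|_\infty \;=\; \| T^{k}(v_0) - T^{k}(v_*) \|_\infty \;\le\; \gamma^{k}\, \| v_0 - v_* \|_\infty \;\longrightarrow\; 0 \quad \text{as } k \to \infty .
\]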

  19. Lecture 3: Planning by Dynamic Programming. Convergence of Iter. Policy Evaluation and Policy Iteration (Contraction Mapping). The Bellman expectation operator T^π has a unique fixed point; v_π is a fixed point of T^π (by the Bellman expectation equation). By the contraction mapping theorem, iterative policy evaluation converges on v_π and policy iteration converges on v∗.

  20. Lecture 3: Planning by Dynamic Programming. Bellman Optimality Backup is a Contraction (Contraction Mapping). Define the Bellman optimality backup operator T∗: T∗(v) = max_{a∈A} ( R^a + γ P^a v ). This operator is a γ-contraction, i.e. it makes value functions closer by at least γ (similar to the previous proof): ||T∗(u) − T∗(v)||_∞ ≤ γ ||u − v||_∞.
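
The key step behind this claim (a standard argument, only alluded to on the slide) is that taking a maximum is 1-Lipschitz, so for every state s:

\[
\big| T^*(u)(s) - T^*(v)(s) \big|
\;\le\; \max_{a \in \mathcal{A}} \Big| \gamma \sum_{s'} P^{a}_{ss'} \big( u(s') - v(s') \big) \Big|
\;\le\; \gamma\, \| u - v \|_\infty ,
\]

and taking the maximum over s gives the displayed bound ||T∗(u) − T∗(v)||_∞ ≤ γ ||u − v||_∞.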

  21. Lecture 3: Planning by Dynamic Programming. Convergence of Value Iteration (Contraction Mapping). The Bellman optimality operator T∗ has a unique fixed point; v∗ is a fixed point of T∗ (by the Bellman optimality equation). By the contraction mapping theorem, value iteration converges on v∗.

  22. Most of the story in a nutshell:

  23. Most of the story in a nutshell:

  24. Most of the story in a nutshell:

  25. Lecture 3: Planning by Dynamic Programming. Modified Policy Iteration (Extensions to Policy Iteration). Does policy evaluation need to converge to v_π? Or should we introduce a stopping condition, e.g. ε-convergence of the value function, or simply stop after k iterations of iterative policy evaluation? For example, in the small gridworld k = 3 was sufficient to achieve the optimal policy. Why not update the policy every iteration, i.e. stop after k = 1? This is equivalent to value iteration (next section).
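
A minimal sketch of the k-sweep idea, consistent with the hypothetical policy_evaluation/policy_iteration sketch given earlier (names are illustrative): run only a fixed number of Bellman expectation sweeps before improving, and note that with sweeps = 1 the scheme collapses into value iteration.

```python
def truncated_policy_evaluation(P, policy, v, gamma=0.9, sweeps=3):
    """Modified policy iteration: run a fixed number of Bellman
    expectation sweeps, starting from the current estimate v,
    instead of iterating to convergence."""
    for _ in range(sweeps):
        for s in range(len(P)):
            a = policy[s]
            v[s] = sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
    return v
```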

  26. Lecture 3: Planning by Dynamic Programming. Generalised Policy Iteration (Extensions to Policy Iteration). Policy evaluation: estimate v_π, using any policy evaluation algorithm. Policy improvement: generate π' ≥ π, using any policy improvement algorithm.

  27. Lecture 3: Planning by Dynamic Programming. Value Iteration in MDPs. Problem: find the optimal policy π. Solution: iterative application of the Bellman optimality backup, v_1 → v_2 → … → v∗, using synchronous backups: at each iteration k+1, for all states s ∈ S, update v_{k+1}(s) from v_k(s'). Convergence to v∗ will be proven later. Unlike policy iteration, there is no explicit policy, and intermediate value functions may not correspond to any policy.
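
A minimal value iteration sketch in Python, reusing the same hypothetical P[s][a] transition lists as before; note that, as the slide says, there is no explicit policy inside the loop, and a greedy policy is only extracted from v at the end:

```python
import numpy as np

def value_iteration(P, n_actions, gamma=0.9, tol=1e-6):
    """Iterate the Bellman optimality backup until the value function
    stops changing, then read off a greedy policy."""
    n_states = len(P)
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # v_{k+1}(s) = max_a sum_{s'} p(s',r|s,a) * (r + gamma * v_k(s'))
            new_v = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                        for a in range(n_actions))
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            break
    # Extract a greedy policy with respect to the converged values.
    policy = [int(np.argmax([sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                             for a in range(n_actions)]))
              for s in range(n_states)]
    return policy, v
```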

  28. Lecture 3: Planning by Dynamic Programming. Value Iteration (2) (Value Iteration in MDPs).

  29. Lecture 3: Planning by Dynamic Programming. Asynchronous Dynamic Programming (Extensions to Dynamic Programming). DP methods described so far used synchronous backups, i.e. all states are backed up in parallel. Asynchronous DP backs up states individually, in any order; for each selected state, apply the appropriate backup. This can significantly reduce computation, and it is guaranteed to converge if all states continue to be selected.

  30. Lecture 3: Planning by Dynamic Programming. Asynchronous Dynamic Programming. Three simple ideas for asynchronous dynamic programming: in-place dynamic programming, prioritised sweeping, real-time dynamic programming.
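
As an illustration of one of these ideas, here is a hedged sketch of prioritised sweeping: keep a priority queue of states ordered by the magnitude of their Bellman error, always back up the highest-error state in place, and re-prioritise its predecessors. The bellman_backup callable and the predecessors map are hypothetical interfaces, not something defined in the slides.

```python
import heapq

def prioritised_sweeping(bellman_backup, predecessors, n_states, theta=1e-5):
    """Asynchronous DP sketch: repeatedly back up the state with the
    largest Bellman error, updating the value function in place.

    bellman_backup(s, v) -> backed-up value of state s under the current v
    predecessors[s]      -> states whose backups depend on v[s]
    """
    v = [0.0] * n_states
    # Initial priorities: the Bellman error of every state (negated for a max-heap).
    pq = [(-abs(bellman_backup(s, v) - v[s]), s) for s in range(n_states)]
    heapq.heapify(pq)
    while pq:
        neg_err, s = heapq.heappop(pq)
        if -neg_err < theta:
            break                       # all recorded errors are below the threshold
        v[s] = bellman_backup(s, v)     # in-place backup of the selected state
        # Changing v[s] may create Bellman error at the predecessors of s.
        for p in predecessors[s]:
            err = abs(bellman_backup(p, v) - v[p])
            if err >= theta:
                heapq.heappush(pq, (-err, p))
    return v
```

Stale queue entries are harmless here: popping one just triggers a cheap re-backup. In-place dynamic programming is the simpler special case where states are swept in a fixed order but each backup immediately reuses the freshest values.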

  31. Lecture 3: Planning by Dynamic Programming. In-Place Dynamic Programming (Asynchronous Dynamic Programming).

  32. Lecture 3: Planning by Dynamic Programming. Prioritised Sweeping (Asynchronous Dynamic Programming).
