Machine Learning and Data Mining: Reinforcement Learning and Markov Decision Processes
Kalev Kask
Overview
- Intro
- Markov Decision Processes
- Reinforcement Learning
- Sarsa
- Q-learning
- Exploration vs Exploitation tradeoff
References:
- Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction
- David Silver, Reinforcement Learning course (video lectures):
  - https://www.youtube.com/watch?v=2pWv7GOvuf0
  - https://www.youtube.com/watch?v=lfHX2hHRMVQ
  - https://www.youtube.com/watch?v=Nd1-UUMVfz4
  - https://www.youtube.com/watch?v=PnHCvfgC_ZA
  - https://www.youtube.com/watch?v=0g4j2k_Ggc4
  - https://www.youtube.com/watch?v=UoPei5o4fps
Lecture 1: Introduction to Reinforcement Learning About RL
[Figure: Reinforcement Learning as a branch of Machine Learning, alongside Supervised and Unsupervised Learning]
Lecture 1: Introduction to Reinforcement Learning The RL Problem Environments
[Figure: agent-environment interaction loop: observation Ot, reward Rt, action At]
At each step t the agent:
- Executes action At
- Receives observation Ot
- Receives scalar reward Rt

The environment:
- Receives action At
- Emits observation Ot+1
- Emits scalar reward Rt+1

t increments at each environment step.
- Rewards may be delayed
- May need to sacrifice short-term reward
Lecture 1: Introduction to Reinforcement Learning The RL Problem Reward
Examples of delayed rewards:
- A financial investment (may take months to mature)
- Refuelling a helicopter (might prevent a crash in several hours)
- Blocking opponent moves (might help winning chances many moves from now)
Reinforcement Learning: learn a behavior strategy (policy) that maximizes the long-term sum of rewards in an unknown and stochastic environment (Emma Brunskill).

Planning under Uncertainty: learn a behavior strategy (policy) that maximizes the long-term sum of rewards in a known stochastic environment (Emma Brunskill).
Lecture 1: Introduction to Reinforcement Learning Problems within RL
[Figure: Atari agent-environment loop]
Example: Atari
- Rules of the game are unknown
- Learn directly from interactive game-play
- Pick actions on joystick, see pixels and scores
Lecture 2: Markov Decision Processes Markov Reward Processes Bellman Equation
- The Bellman equation is a linear equation, so it can be solved directly:

  v = R + γPv
  (I − γP)v = R
  v = (I − γP)⁻¹R

- Computational complexity is O(n³) for n states
- Direct solution is only possible for small MRPs
- There are many iterative methods for large MRPs (see the sketch after this list), e.g.:
  - Dynamic programming
  - Monte-Carlo evaluation
  - Temporal-Difference learning
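As a concrete illustration, a minimal NumPy sketch of the direct solution for a toy Markov Reward Process; the transition matrix P, reward vector R, and γ below are made-up values for illustration, not from the slides:

```python
import numpy as np

# Toy 3-state MRP (hypothetical values, for illustration only).
P = np.array([[0.5, 0.5, 0.0],   # transition probabilities P[s, s']
              [0.1, 0.6, 0.3],
              [0.0, 0.0, 1.0]])  # state 2 is absorbing
R = np.array([1.0, -2.0, 0.0])   # expected immediate reward per state
gamma = 0.9

# Direct solution of the Bellman equation: v = (I - gamma * P)^{-1} R.
# O(n^3) in the number of states, so only feasible for small MRPs.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)
```

Note that np.linalg.solve factorises (I − γP) rather than forming the inverse explicitly, which is both faster and numerically safer.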
Evaluating the Bellman equation translates into a 1-step lookahead (written out below).
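Written out in the notation these slides use elsewhere (P^a_ss', R^a_s), the one-step lookahead behind the Bellman expectation equation is:

```latex
v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)
           \Big( \mathcal{R}^a_s
                 + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \, v_\pi(s') \Big)
```

Each evaluation of v_π(s) looks exactly one step ahead: over actions weighted by the policy, and over successor states weighted by the transition model.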
Lecture 2: Markov Decision Processes Optimal Value Functions
Define a partial ordering over policies: π ≥ π' if vπ(s) ≥ vπ'(s), ∀s

Theorem: For any Markov Decision Process
- There exists an optimal policy π∗ that is better than or equal to all other policies, π∗ ≥ π, ∀π
- All optimal policies achieve the optimal value function, vπ∗(s) = v∗(s)
- All optimal policies achieve the optimal action-value function, qπ∗(s, a) = q∗(s, a)
Lecture 2: Markov Decision Processes Optimal Value Functions
- An optimal policy can be found by maximising over q∗(s, a):

  π∗(a|s) = 1 if a = argmax_{a∈A} q∗(s, a), and 0 otherwise

- There is always a deterministic optimal policy for any MDP
- If we know q∗(s, a), we immediately have the optimal policy (a sketch of this extraction follows)
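A minimal sketch of that extraction step, assuming q_star is a precomputed array of shape (n_states, n_actions); the names and toy numbers are ours, not the slides':

```python
import numpy as np

def greedy_policy(q_star):
    """Extract a deterministic optimal policy from the optimal
    action-value function: pi*(s) = argmax_a q*(s, a)."""
    return np.argmax(q_star, axis=1)

# Hypothetical 2-state, 3-action example.
q_star = np.array([[1.0, 2.5, 0.3],
                   [0.0, -1.0, 4.0]])
print(greedy_policy(q_star))  # -> [1 2]
```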
Lecture 2: Markov Decision Processes Bellman Optimality Equation
- The Bellman Optimality Equation is non-linear
- No closed form solution (in general)
- Many iterative solution methods:
  - Value Iteration
  - Policy Iteration
  - Q-learning
  - Sarsa
Lecture 1: Introduction to Reinforcement Learning Inside An RL Agent
[Figure: maze with Start and Goal cells]
Maze Example:
- Rewards: −1 per time-step
- Actions: N, E, S, W
- States: agent's location
Lecture 1: Introduction to Reinforcement Learning Inside An RL Agent
[Figure: maze with an arrow in each cell]
Arrows represent the policy π(s) for each state s
Lecture 1: Introduction to Reinforcement Learning Inside An RL Agent
[Figure: maze with a number in each cell]
Numbers represent the value vπ(s) of each state s
Lecture 1: Introduction to Reinforcement Learning Inside An RL Agent
[Figure: the agent's internal model of the maze]
- Agent may have an internal model of the environment
- Dynamics: how actions change the state
- Rewards: how much reward comes from each state
- The model may be imperfect
- Grid layout represents the transition model P^a_ss'
- Numbers represent the immediate reward R^a_s from each state s (same for all a)
Prediction and Control:
- Prediction: evaluate the future, given a policy
- Control: optimise the future, by finding the best policy
Lecture 1: Introduction to Reinforcement Learning Problems within RL
Two fundamental problems in sequential decision making:

Reinforcement Learning:
- The environment is initially unknown
- The agent interacts with the environment
- The agent improves its policy

Planning:
- A model of the environment is known
- The agent performs computations with its model (without any external interaction)
- The agent improves its policy
- a.k.a. deliberation, reasoning, introspection, pondering, thought, search
Lecture 1: Introduction to Reinforcement Learning Inside An RL Agent
An RL agent may include one or more of these components:
- Policy: agent's behaviour function
- Value function: how good is each state and/or action
- Model: agent's representation of the environment
Lecture 3: Planning by Dynamic Programming Introduction
- Dynamic programming assumes full knowledge of the MDP
- It is used for planning in an MDP
- For prediction:
  - Input: MDP (S, A, P, R, γ) and policy π
  - Output: value function vπ
- Or for control:
  - Input: MDP (S, A, P, R, γ)
  - Output: optimal value function v∗ and optimal policy π∗
Lecture 3: Planning by Dynamic Programming Policy Evaluation Iterative Policy Evaluation
- Problem: evaluate a given policy π
- Solution: iterative application of the Bellman expectation backup: v1 → v2 → ... → vπ
- Using synchronous backups:
  - At each iteration k + 1
  - For all states s ∈ S
  - Update vk+1(s) from vk(s'), where s' is a successor state of s
- We will discuss asynchronous backups later
- Convergence to vπ can be proven (a code sketch follows)
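A minimal sketch of synchronous iterative policy evaluation in NumPy; the array conventions (P[a, s, s'] for transitions, R[a, s] for expected rewards, pi[s, a] for the policy) are our choices, not the slides':

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, theta=1e-8):
    """Synchronous iterative policy evaluation.
    P: (n_actions, n_states, n_states) transition probabilities
    R: (n_actions, n_states) expected immediate rewards
    pi: (n_states, n_actions) action probabilities under the policy."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        # One-step lookahead for every state, using the old values v_k.
        q = R + gamma * P @ v                   # shape (n_actions, n_states)
        v_new = np.einsum('sa,as->s', pi, q)    # average over the policy
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new
```

Keeping two value arrays (v and v_new) is exactly the synchronous, all-states-in-parallel backup described above.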
Lecture 3: Planning by Dynamic Programming Policy Evaluation Example: Small Gridworld
- Undiscounted episodic MDP (γ = 1)
- Nonterminal states 1, ..., 14
- One terminal state (shown twice, as the two shaded squares)
- Actions leading out of the grid leave the state unchanged
- Reward is −1 until the terminal state is reached
- Agent follows the uniform random policy π(n|·) = π(e|·) = π(s|·) = π(w|·) = 0.25 (a worked encoding of this MDP follows)
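Continuing the sketch above (it reuses policy_evaluation), here is one possible encoding of this gridworld; we number states 0..15 with 0 and 15 terminal, which is our convention rather than the slides' 1..14 numbering:

```python
import numpy as np

# Small Gridworld: 4x4 grid, states 0..15; 0 and 15 are the terminal state.
moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]       # N, E, S, W
n = 16
P = np.zeros((4, n, n))
R = np.full((4, n), -1.0)                        # reward -1 until termination
for s in range(n):
    r, c = divmod(s, 4)
    for a, (dr, dc) in enumerate(moves):
        if s in (0, 15):                         # terminal: absorbing, reward 0
            P[a, s, s], R[a, s] = 1.0, 0.0
            continue
        nr, nc = r + dr, c + dc                  # off-grid moves leave s unchanged
        s2 = nr * 4 + nc if 0 <= nr < 4 and 0 <= nc < 4 else s
        P[a, s, s2] = 1.0

pi = np.full((n, 4), 0.25)                       # uniform random policy
v = policy_evaluation(P, R, pi, gamma=1.0)
print(np.round(v.reshape(4, 4)))                 # should show the 0, -14, -20, -22, ... pattern
```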
Most of the story in a nutshell: [summary figure omitted]
Lecture 3: Planning by Dynamic Programming Policy Iteration
Given a policy π:
- Evaluate the policy π: vπ(s) = E[Rt+1 + γRt+2 + ... | St = s]
- Improve the policy by acting greedily with respect to vπ: π' = greedy(vπ)

- In Small Gridworld the improved policy was already optimal, π' = π∗
- In general, more iterations of improvement / evaluation are needed
- But this process of policy iteration always converges to π∗ (see the sketch below)
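A minimal sketch of policy iteration under the same tabular conventions as the earlier snippets (names ours); the evaluation step here solves the linear system exactly, which assumes γ < 1 so the system is nonsingular:

```python
import numpy as np

def policy_iteration(P, R, gamma, theta=1e-8):
    """Alternate exact policy evaluation and greedy policy improvement
    until the policy is stable (assumes gamma < 1)."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)      # start with an arbitrary policy
    while True:
        # Policy evaluation: solve v = R_pi + gamma * P_pi v directly.
        P_pi = P[policy, np.arange(n_states)]   # (n_states, n_states)
        R_pi = R[policy, np.arange(n_states)]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to v.
        q = R + gamma * P @ v                   # (n_actions, n_states)
        new_policy = np.argmax(q, axis=0)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy
```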
Lecture 3: Planning by Dynamic Programming Policy Iteration
- Policy evaluation: estimate vπ (iterative policy evaluation)
- Policy improvement: generate π' ≥ π (greedy policy improvement)
Lecture 3: Planning by Dynamic Programming Policy Iteration Policy Improvement
If improvements stop:

  qπ(s, π'(s)) = max_{a∈A} qπ(s, a) = qπ(s, π(s)) = vπ(s)

Then the Bellman optimality equation has been satisfied:

  vπ(s) = max_{a∈A} qπ(s, a)

Therefore vπ(s) = v∗(s) for all s ∈ S, so π is an optimal policy.
Lecture 3: Planning by Dynamic Programming Contraction Mapping
- How do we know that value iteration converges to v∗?
- Or that iterative policy evaluation converges to vπ?
- And therefore that policy iteration converges to v∗?
- Is the solution unique?
- How fast do these algorithms converge?
- These questions are resolved by the contraction mapping theorem
Lecture 3: Planning by Dynamic Programming Contraction Mapping
- Consider the vector space V over value functions
- There are |S| dimensions
- Each point in this space fully specifies a value function v(s)
- What does a Bellman backup do to points in this space?
- We will show that it brings value functions closer
- And therefore the backups must converge on a unique solution
Lecture 3: Planning by Dynamic Programming Contraction Mapping
We will measure the distance between state-value functions u and v by the ∞-norm, i.e. the largest difference between state values:

  ||u − v||∞ = max_{s∈S} |u(s) − v(s)|
Lecture 3: Planning by Dynamic Programming Contraction Mapping
Theorem (Contraction Mapping Theorem)
For any metric space V that is complete (i.e. closed) under an operator T(v), where T is a γ-contraction:
- T converges to a unique fixed point
- At a linear convergence rate of γ
Lecture 3: Planning by Dynamic Programming Contraction Mapping
- The Bellman expectation operator T^π has a unique fixed point
- vπ is a fixed point of T^π (by the Bellman expectation equation)
- By the contraction mapping theorem, iterative policy evaluation converges on vπ
- And policy iteration converges on v∗
Lecture 3: Planning by Dynamic Programming Contraction Mapping
Define the Bellman optimality backup operator T∗:

  T∗(v) = max_{a∈A} (R^a + γP^a v)

This operator is a γ-contraction, i.e. it makes value functions closer by at least γ (similar to the previous proof):

  ||T∗(u) − T∗(v)||∞ ≤ γ||u − v||∞

(a proof sketch follows)
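The slide omits the argument; a standard one-line sketch (our reconstruction, following the usual proof) uses the fact that the max operator is non-expansive:

```latex
\begin{aligned}
\|T^*(u) - T^*(v)\|_\infty
  &= \max_{s} \Big|\max_{a}\big(\mathcal{R}^a_s + \gamma \mathcal{P}^a_{s} u\big)
              - \max_{a}\big(\mathcal{R}^a_s + \gamma \mathcal{P}^a_{s} v\big)\Big| \\
  &\le \gamma \max_{s,a} \big|\mathcal{P}^a_{s}(u - v)\big|
       \quad \text{(the max operator is non-expansive)} \\
  &\le \gamma \,\|u - v\|_\infty
       \quad \text{(each row } \mathcal{P}^a_{s} \text{ is a probability distribution)}
\end{aligned}
```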
Lecture 3: Planning by Dynamic Programming Contraction Mapping
- The Bellman optimality operator T∗ has a unique fixed point
- v∗ is a fixed point of T∗ (by the Bellman optimality equation)
- By the contraction mapping theorem, value iteration converges on v∗
Most of the story in a nutshell: [sequence of summary figures omitted]
Lecture 3: Planning by Dynamic Programming Policy Iteration Extensions to Policy Iteration
- Does policy evaluation need to converge to vπ?
- Or should we introduce a stopping condition, e.g. ε-convergence of the value function?
- Or simply stop after k iterations of iterative policy evaluation?
  For example, in the small gridworld k = 3 was sufficient to achieve the optimal policy.
- Why not update the policy every iteration, i.e. stop after k = 1?
- This is equivalent to value iteration (next section)
Lecture 3: Planning by Dynamic Programming Policy Iteration Extensions to Policy Iteration
- Policy evaluation: estimate vπ (any policy evaluation algorithm)
- Policy improvement: generate π' ≥ π (any policy improvement algorithm)
Lecture 3: Planning by Dynamic Programming Value Iteration Value Iteration in MDPs
- Problem: find the optimal policy π∗
- Solution: iterative application of the Bellman optimality backup: v1 → v2 → ... → v∗
- Using synchronous backups:
  - At each iteration k + 1
  - For all states s ∈ S
  - Update vk+1(s) from vk(s')
- Convergence to v∗ follows from the contraction mapping theorem above
- Unlike policy iteration, there is no explicit policy
- Intermediate value functions may not correspond to any policy (a code sketch follows)
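A minimal sketch under the same tabular conventions as the earlier snippets (names ours, not the slides'):

```python
import numpy as np

def value_iteration(P, R, gamma, theta=1e-8):
    """Iterate the Bellman optimality backup
    v_{k+1}(s) = max_a [R(s, a) + gamma * sum_s' P(s'|s, a) v_k(s')]."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        v_new = np.max(R + gamma * P @ v, axis=0)   # back up all states at once
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    # Recover a greedy (optimal) policy from the converged values.
    policy = np.argmax(R + gamma * P @ v, axis=0)
    return v, policy
```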
Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Asynchronous Dynamic Programming
- DP methods described so far used synchronous backups, i.e. all states are backed up in parallel
- Asynchronous DP backs up states individually, in any order
- For each selected state, apply the appropriate backup
- Can significantly reduce computation
- Guaranteed to converge if all states continue to be selected
Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Asynchronous Dynamic Programming
Three simple ideas for asynchronous dynamic programming (an in-place sketch follows this list):
- In-place dynamic programming
- Prioritised sweeping
- Real-time dynamic programming
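As an illustration of the first idea, a minimal in-place variant of value iteration: it keeps a single value array, so each backup within a sweep immediately uses the freshest values (array conventions as in the earlier sketches):

```python
import numpy as np

def in_place_value_iteration(P, R, gamma, n_sweeps=1000, theta=1e-8):
    """In-place asynchronous DP: one value array, updated state by state,
    so later backups within a sweep already see the new values."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    for _ in range(n_sweeps):
        delta = 0.0
        for s in range(n_states):
            old = v[s]
            v[s] = max(R[a, s] + gamma * P[a, s] @ v for a in range(n_actions))
            delta = max(delta, abs(v[s] - old))
        if delta < theta:
            break
    return v
```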
Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Asynchronous Dynamic Programming
- Idea: back up only the states that are relevant to the agent
- Use the agent's experience to guide the selection of states
- After each time-step St, At, Rt+1, back up the state St (a sketch follows)
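A sketch of the idea, assuming a hypothetical environment object with Gym-style reset() and step(action) methods, where step returns (next_state, reward); this interface is our assumption, not from the slides:

```python
import numpy as np

def real_time_dp(env, P, R, gamma, n_steps=10000):
    """Real-time DP: follow the agent's trajectory and back up
    only the states it actually visits."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    s = env.reset()
    for _ in range(n_steps):
        q = R[:, s] + gamma * P[:, s] @ v     # one-step lookahead at s only
        v[s] = np.max(q)                      # back up the visited state S_t
        a = int(np.argmax(q))                 # act greedily w.r.t. current v
        s, _reward = env.step(a)
    return v
```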
Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Full-width and sample backups
- DP uses full-width backups
- For each backup (sync or async):
  - Every successor state and action is considered
  - Using knowledge of the MDP transitions and reward function
- DP is effective for medium-sized problems (millions of states)
- For large problems DP suffers Bellman's curse of dimensionality:
  - The number of states n = |S| grows exponentially with the number of state variables
  - Even one backup can be too expensive
Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Full-width and sample backups
- In subsequent lectures we will consider sample backups
- Using sample rewards and sample transitions (S, A, R, S') instead of the reward function R and transition dynamics P
- Advantages (see the sketch below):
  - Model-free: no advance knowledge of the MDP required
  - Breaks the curse of dimensionality through sampling
  - Cost of a backup is constant, independent of n = |S|
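To make the constant cost concrete, here is a minimal TD(0)-style sample backup (temporal-difference learning is listed on the Bellman-equation slide earlier; the step-size alpha and the names here are our choices):

```python
def td0_backup(v, s, r, s_next, gamma, alpha=0.1):
    """One sample backup: update v[s] from a single observed
    transition (S, R, S'). Cost is O(1), independent of |S|."""
    td_target = r + gamma * v[s_next]
    v[s] += alpha * (td_target - v[s])
    return v
```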
Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Approximate Dynamic Programming
Lecture 4: Model-Free Prediction Monte-Carlo Learning
- MC methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transitions / rewards
- MC learns from complete episodes: no bootstrapping
- MC uses the simplest possible idea: value = mean return
- Caveat: MC can only be applied to episodic MDPs; all episodes must terminate
- MC methods can solve the RL problem by averaging sample returns
- MC is incremental episode by episode, but not step by step
- Approach: adapt generalised policy iteration to sample returns
- First policy evaluation, then policy improvement, then control
Lecture 4: Model-Free Prediction Monte-Carlo Learning
- Goal: learn vπ from episodes of experience under policy π: S1, A1, R2, ..., Sk ∼ π
- Recall that the return is the total discounted reward: Gt = Rt+1 + γRt+2 + ... + γ^(T−1) RT
- Recall that the value function is the expected return: vπ(s) = Eπ[Gt | St = s]
- Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return, because we do not have the model (a sketch follows)
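A minimal first-visit Monte-Carlo prediction sketch; episodes are assumed to be lists of (state, reward) pairs, which is our encoding, not the slides':

```python
from collections import defaultdict

def mc_prediction(episodes, gamma):
    """First-visit Monte-Carlo policy evaluation:
    v(s) = empirical mean of returns G_t following first visits to s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:               # episode: [(s0, r1), (s1, r2), ...]
        g = 0.0
        first_visit_returns = {}
        # Walk backwards, accumulating the discounted return G_t.
        for s, r in reversed(episode):
            g = r + gamma * g
            first_visit_returns[s] = g     # overwritten until the first visit wins
        for s, g in first_visit_returns.items():
            returns_sum[s] += g
            returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

Walking the episode backwards lets each G_t be accumulated in O(1) per step; overwriting the per-state entry as we move toward the start of the episode leaves exactly the first-visit return.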