Policy Iteration Comments (MDPs cont., Lecture 25, Nov 24 2008)

  1. Policy iteration comments
     • Each step of policy iteration is guaranteed to strictly improve the policy at some state when improvement is possible.
     • Converges to the optimal policy.
     • Gives the exact value of the optimal policy.

     Review
     • Critical components of MDPs:
       – State space and action space; transition model; reward function.
     • Value iteration:
       – U(s): the maximum expected sum of discounted rewards achievable starting at a particular state.
       – Bellman equation: U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s')
       – Bellman iteration: U_{i+1}(s) = R(s) + γ max_a Σ_s' T(s, a, s') U_i(s')
       – Optimal policy: π*(s) = argmax_a Σ_s' T(s, a, s') U*(s')

     Review: Policy Iteration
     • Start with a randomly chosen initial policy π_0.
     • Iterate until no change in utilities:
       1. Policy evaluation: given a policy π_i, calculate the utility U_i(s) of every state s under π_i by solving the resulting system of equations.
       2. Policy improvement: calculate the new policy π_{i+1} using one-step look-ahead based on U_i(s):
          π_{i+1}(s) = argmax_a Σ_s' T(s, a, s') U_i(s')

     Policy Iteration Example
     • Do one iteration of policy iteration on the MDP below. Assume an initial policy of π_1(Hungry) = Eat and π_1(Full) = Sleep. Let γ = 0.9.
     • [Figure: two-state MDP. States: Hungry (reward -10) and Full (reward +10). From Hungry: Eat goes to Full with probability 0.9 and stays in Hungry with probability 0.1; Watch TV stays in Hungry with probability 1.0. From Full: Sleep stays in Full with probability 0.8 and goes to Hungry with probability 0.2; Exercise goes to Hungry with probability 1.0.]

     Policy Iteration Example: Policy Evaluation Phase
     • Use the initial policy for Hungry: π_1(Hungry) = Eat.
       U_1(Hungry) = -10 + (0.9)[(0.1)U_1(Hungry) + (0.9)U_1(Full)]
       ⇒ U_1(Hungry) = -10 + (0.09)U_1(Hungry) + (0.81)U_1(Full)
       ⇒ (0.91)U_1(Hungry) - (0.81)U_1(Full) = -10
     • Use the initial policy for Full: π_1(Full) = Sleep.
       U_1(Full) = 10 + (0.9)[(0.8)U_1(Full) + (0.2)U_1(Hungry)]
       ⇒ U_1(Full) = 10 + (0.72)U_1(Full) + (0.18)U_1(Hungry)
       ⇒ (0.28)U_1(Full) - (0.18)U_1(Hungry) = 10
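
     The evaluation phase above is just a 2x2 linear system, U = R + γ T_π U, which the next page solves by hand. A minimal sketch of solving it directly, assuming NumPy (the variable and function names are mine; the numbers are the ones from the example):

         # Sketch (not from the lecture): exact policy evaluation for pi_1
         # (Hungry -> Eat, Full -> Sleep).  State order: [Hungry, Full].
         import numpy as np

         gamma = 0.9
         R = np.array([-10.0, 10.0])        # R(Hungry), R(Full)
         T_pi = np.array([[0.1, 0.9],       # Eat:   stay Hungry 0.1, go Full 0.9
                          [0.2, 0.8]])      # Sleep: go Hungry 0.2, stay Full 0.8

         # U = R + gamma * T_pi @ U  <=>  (I - gamma * T_pi) U = R,
         # i.e. exactly (0.91)U(Hungry) - (0.81)U(Full) = -10 and
         #              (0.28)U(Full) - (0.18)U(Hungry) = 10 from the slide.
         U = np.linalg.solve(np.eye(2) - gamma * T_pi, R)
         print("U1(Hungry) = %.2f, U1(Full) = %.2f" % tuple(U))   # ~48.62 and ~66.97

     (The exact solution, about 48.6 and 67.0, matches the rounded hand calculation of 48.7 and 67 on the next page.)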

  2. Policy Iteration Example
     Solve for U_1(Hungry) and U_1(Full):
       (0.91)U_1(Hungry) - (0.81)U_1(Full) = -10   ...(Equation 1)
       (0.28)U_1(Full) - (0.18)U_1(Hungry) = 10    ...(Equation 2)
     From Equation 1:
       (0.91)U_1(Hungry) = -10 + (0.81)U_1(Full)
       ⇒ U_1(Hungry) = (-10/0.91) + (0.81/0.91)U_1(Full)
       ⇒ U_1(Hungry) = -10.9 + (0.89)U_1(Full)
     Substitute U_1(Hungry) = -10.9 + (0.89)U_1(Full) into Equation 2:
       (0.28)U_1(Full) - (0.18)[-10.9 + (0.89)U_1(Full)] = 10
       ⇒ (0.28)U_1(Full) + 1.96 - (0.16)U_1(Full) = 10
       ⇒ (0.12)U_1(Full) = 8.04
       ⇒ U_1(Full) = 67
       ⇒ U_1(Hungry) = -10.9 + (0.89)(67) = -10.9 + 59.63 = 48.7

     Policy Iteration Example: Policy Improvement Phase
     (A code sketch of this one-step look-ahead appears at the end of this page.)
     • π_2(Hungry) = argmax over {Eat, WatchTV} of
         { T(Hungry, Eat, Full)U_1(Full) + T(Hungry, Eat, Hungry)U_1(Hungry)   [Eat]
           T(Hungry, WatchTV, Hungry)U_1(Hungry)                               [WatchTV] }
       = argmax { (0.9)U_1(Full) + (0.1)U_1(Hungry)   [Eat]
                  (1.0)U_1(Hungry)                    [WatchTV] }
       = argmax { (0.9)(67) + (0.1)(48.7) = 65.2      [Eat]
                  (1.0)(48.7) = 48.7                  [WatchTV] }
       = Eat
     • π_2(Full) = argmax over {Exercise, Sleep} of
         { T(Full, Exercise, Hungry)U_1(Hungry)                                [Exercise]
           T(Full, Sleep, Full)U_1(Full) + T(Full, Sleep, Hungry)U_1(Hungry)   [Sleep] }
       = argmax { (1.0)U_1(Hungry)                    [Exercise]
                  (0.8)U_1(Full) + (0.2)U_1(Hungry)   [Sleep] }
       = argmax { (1.0)(48.7) = 48.7                  [Exercise]
                  (0.8)(67) + (0.2)(48.7) = 63.34     [Sleep] }
       = Sleep
     • Result: π_2(Hungry) = Eat and π_2(Full) = Sleep.

     So far …
     • Given an MDP model, we know how to find optimal policies:
       – Value Iteration or Policy Iteration.
     • But what if we don't have any form of a model of the world (e.g., T and R)?
       – Like when we were babies . . .
       – All we can do is wander around the world observing what happens, getting rewarded and punished.
       – This is what reinforcement learning is about.
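
     A minimal sketch of the improvement step above in plain Python; the transition probabilities and the rounded utilities are taken from the example, while the data layout and the function name are my own illustration:

         # Sketch (not from the lecture): one-step look-ahead policy improvement
         # using the utilities U_1 computed in the evaluation phase (rounded as on the slides).
         U1 = {"Hungry": 48.7, "Full": 67.0}

         # T[s][a] maps successor states to probabilities, from the example MDP.
         T = {
             "Hungry": {"Eat":      {"Hungry": 0.1, "Full": 0.9},
                        "WatchTV":  {"Hungry": 1.0}},
             "Full":   {"Sleep":    {"Full": 0.8, "Hungry": 0.2},
                        "Exercise": {"Hungry": 1.0}},
         }

         def improve(state):
             # pi_2(s) = argmax_a sum_s' T(s, a, s') U_1(s')
             scores = {a: sum(p * U1[s2] for s2, p in succ.items())
                       for a, succ in T[state].items()}
             return max(scores, key=scores.get)

         print(improve("Hungry"))   # Eat    (65.17 vs 48.7)
         print(improve("Full"))     # Sleep  (63.34 vs 48.7)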

  3. Why not supervised learning?
     • In supervised learning, we had a teacher providing us with training examples with class labels.
     • [Table: a small example training set with features Has Fever, Has Cough, Has Breathing Problems, and Ate Chicken Recently, and class label Has Asian Bird Flu.]
     • The agent figures out how to predict the class label given the features.

     Can We Use Supervised Learning?
     • Now imagine a complex task such as learning to play a board game.
     • Suppose we took a supervised learning approach to learning an evaluation function.
     • For every possible position of your pieces, you would need a teacher to provide an accurate and consistent evaluation of that position.
       – This is not feasible.

     Trial and Error
     • A better approach: imagine we don't have a teacher.
     • Instead, the agent gets to experiment in its environment.
     • The agent tries out actions and discovers by itself which actions lead to a win or a loss.
     • The agent can learn an evaluation function that estimates the probability of winning from any given position.

     Reinforcement/Reward
     • The key to this trial-and-error approach is having some sort of feedback about what is good and what is bad.
     • We call this feedback reward or reinforcement.
     • In some environments, rewards are frequent:
       – Ping-pong: each point scored.
       – Learning to crawl: forward motion.
     • In other environments, reward is delayed:
       – Chess: reward only happens at the end of the game.

     Importance of Credit Assignment
     • When reward is delayed, we have to work out which of the actions taken along the way were actually responsible for it.

     Reinforcement
     • This is very similar to what happens in nature with animals and humans.
     • Positive reinforcement: happiness, pleasure, food.
     • Negative reinforcement: pain, hunger, loneliness.
     • What happens if we get agents to learn in this way? This leads us to the world of Reinforcement Learning.

  4. Reinforcement Learning in a nutshell
     "Imagine playing a new game whose rules you don't know; after a hundred or so moves, your opponent announces, 'You lose'."
     – Russell and Norvig, Introduction to Artificial Intelligence

     Reinforcement Learning
     • The agent is placed in an environment and must learn to behave optimally in it.
     • Assume that the world behaves like an MDP, except:
       – The agent can act, but does not know the transition model.
       – The agent observes its current state and its reward, but does not know the reward function.
     • Goal: learn an optimal policy. (A small sketch of this interaction loop follows below.)

     Factors that Make RL Difficult
     • Actions have non-deterministic effects,
       – which are initially unknown and must be learned.
     • Rewards / punishments can be infrequent:
       – Often at the end of long sequences of actions.
       – How do we determine which action(s) were really responsible for the reward or punishment? (the credit assignment problem)
     • The world is large and complex.
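
     As an illustrative sketch of this setting (everything here is my own example code, reusing the Hungry/Full MDP from the earlier slides): the environment samples transitions from T and hands back only the observed reward and next state, while the agent just acts and records experience, never touching T or R directly.

         # Sketch (not from the lecture): RL interaction loop.  The agent never
         # sees T or R, only sampled (state, action, reward, next_state) experience.
         import random

         # The environment knows the MDP (the Hungry/Full example); the agent does not.
         T = {("Hungry", "Eat"):      [("Hungry", 0.1), ("Full", 0.9)],
              ("Hungry", "WatchTV"):  [("Hungry", 1.0)],
              ("Full",   "Sleep"):    [("Full", 0.8), ("Hungry", 0.2)],
              ("Full",   "Exercise"): [("Hungry", 1.0)]}
         R = {"Hungry": -10.0, "Full": 10.0}
         ACTIONS = {"Hungry": ["Eat", "WatchTV"], "Full": ["Sleep", "Exercise"]}

         def step(state, action):
             # Environment side: sample s' ~ T(s, a, .) and return the observed (reward, s').
             succs, probs = zip(*T[(state, action)])
             next_state = random.choices(succs, weights=probs)[0]
             return R[state], next_state

         # Agent side: here just a random policy collecting (s, a, r, s') experience
         # that a learning algorithm could later use.
         state, experience = "Hungry", []
         for _ in range(10):
             action = random.choice(ACTIONS[state])
             reward, next_state = step(state, action)
             experience.append((state, action, reward, next_state))
             state = next_state
         print(experience[:3])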
