Reminders § 21 days until the American election. I voted. Did you? § Deadline to register to vote in PA is Monday, Oct 19. § HW4 due tonight at 11:59pm Eastern. § Quiz 5 on Adversarial Search is due tomorrow. § HW5 has been released. It will be due on Tuesday Oct 20. § No lecture on Thursday. § Midterm details: § * No HW from Oct 20-27. * Tues Oct 20: Practice midterm released (for credit) * Saturday Oct 24: Practice midterm is due. * Midterm available Monday Oct 26 and Tuesday Oct 27. * 3 hour block. Open book, open notes, no collaboration.
Markov Decision Processes Slides courtesy of Dan Klein and Pieter Abbeel University of California, Berkeley [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Stochastic Search Problems § Instead of dealing with situations where the environment is deterministic, MDPs deal with stochastic environments. § [Gridworld figure: terminal states +1 and –1; transition model for Action: Up is 0.8 up, 0.1 left, 0.1 right]
Defining MDPs § Markov decision processes: § Set of states S § Start state s_0 § Set of actions A § Transitions P(s'|s,a) (or T(s,a,s')) § Rewards R(s,a,s') (and discount γ) § MDP quantities so far: § Policy = choice of action for each state § Utility = sum of (discounted) rewards
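These pieces map directly onto a small data structure. Below is a minimal, hypothetical sketch in Python; the MDP class name, field names, and the T/R helpers are illustrative choices for these notes, not part of the course code.

```python
from typing import Dict, List, Tuple

# A minimal sketch of one possible MDP representation (illustrative, not course code).
class MDP:
    def __init__(self,
                 states: List[str],
                 actions: List[str],
                 transitions: Dict[Tuple[str, str], List[Tuple[float, str]]],
                 rewards: Dict[Tuple[str, str, str], float],
                 start: str,
                 gamma: float = 1.0):
        self.states = states            # set of states S
        self.actions = actions          # set of actions A
        self.transitions = transitions  # (s, a) -> list of (probability, s')
        self.rewards = rewards          # (s, a, s') -> R(s, a, s')
        self.start = start              # start state s_0
        self.gamma = gamma              # discount γ

    def T(self, s: str, a: str) -> List[Tuple[float, str]]:
        """Outcome distribution P(s' | s, a) as (prob, s') pairs; empty if a is unavailable."""
        return self.transitions.get((s, a), [])

    def R(self, s: str, a: str, s2: str) -> float:
        """Reward R(s, a, s') for the given transition."""
        return self.rewards.get((s, a, s2), 0.0)
```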
Solution == Policy § In search problems a solution was a plan: a sequence of actions that corresponded to the shortest path from the start to a goal. § Because of the non-determinism in MDPs we cannot simply give a sequence of actions. § Instead, the solution to an MDP is a policy. A policy maps each state onto the action to take if the agent is in that state. § π(s) = a
Optimal Quantities § The value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally § The value (utility) of a q-state (s,a): Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally § The optimal policy: π*(s) = optimal action from state s § [Expectimax tree: s is a state, (s,a) is a q-state, (s,a,s') is a transition]
The Bellman Equations How to be optimal: Step 1: Take correct first action Step 2: Keep being optimal
The Bellman Equations § Definition of “optimal utility” via expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values § These are the Bellman equations, and they characterize optimal values in a way we’ll use over and over
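In the T, R, γ notation defined earlier, the Bellman equations can be written as:

```latex
\begin{aligned}
V^*(s)   &= \max_{a} Q^*(s,a) \\
Q^*(s,a) &= \sum_{s'} T(s,a,s')\,\bigl[\, R(s,a,s') + \gamma\, V^*(s') \,\bigr]
\end{aligned}
```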
Example: Hyperdrive MDP § The Millennium Falcon needs to travel far far away, quickly § Three states: Cruising, Hyperspace, Crashed § Two actions: Maintain speed, Punch it § Punching it doubles the reward, even if it doesn’t work § Transitions and rewards (from the slide diagram): § Cruising, Maintain: stay Cruising with prob 1.0, reward +1 § Cruising, Punch it: stay Cruising with prob 0.5 or jump to Hyperspace with prob 0.5, reward +2 § Hyperspace, Maintain: drop to Cruising with prob 0.5 or stay Hyperspace with prob 0.5, reward +1 § Hyperspace, Punch it: Crashed with prob 1.0, reward -10 § Crashed: terminal (no further reward)
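Continuing the illustrative Python sketch from the MDP definition slide, one possible encoding of this MDP is below; the transition probabilities and rewards are a best-effort reading of the slide’s diagram.

```python
# Hyperdrive MDP encoded with the illustrative MDP class sketched earlier.
# Numbers are read off the slide diagram; treat the structure as an assumption.
hyperdrive = MDP(
    states=["Cruising", "Hyperspace", "Crashed"],
    actions=["Maintain", "PunchIt"],
    transitions={
        ("Cruising",   "Maintain"): [(1.0, "Cruising")],
        ("Cruising",   "PunchIt"):  [(0.5, "Cruising"), (0.5, "Hyperspace")],
        ("Hyperspace", "Maintain"): [(0.5, "Cruising"), (0.5, "Hyperspace")],
        ("Hyperspace", "PunchIt"):  [(1.0, "Crashed")],
        # "Crashed" is terminal: no actions available.
    },
    rewards={
        ("Cruising",   "Maintain", "Cruising"):   +1.0,
        ("Cruising",   "PunchIt",  "Cruising"):   +2.0,
        ("Cruising",   "PunchIt",  "Hyperspace"): +2.0,
        ("Hyperspace", "Maintain", "Cruising"):   +1.0,
        ("Hyperspace", "Maintain", "Hyperspace"): +1.0,
        ("Hyperspace", "PunchIt",  "Crashed"):    -10.0,
    },
    start="Cruising",
    gamma=1.0,  # the worked example below assumes no discount
)
```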
Value Iteration § Start with V_0(s) = 0: no time steps left means an expected reward sum of zero § Given the vector of V_k(s) values, do one ply of expectimax from each state: § V_{k+1}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_k(s') ] § Repeat until convergence § Complexity of each iteration: O(S²A) § Theorem: will converge to unique optimal values § Basic idea: approximations get refined towards optimal values § Policy may converge long before values do
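A compact, hypothetical implementation of this loop against the illustrative MDP class above (not the course’s released code): with γ < 1 it stops once values change by less than tol, otherwise it simply runs the given number of sweeps.

```python
def value_iteration(mdp: MDP, iterations: int = 100, tol: float = 1e-9) -> Dict[str, float]:
    """Iterate V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * V_k(s') ]."""
    V = {s: 0.0 for s in mdp.states}  # V_0(s) = 0
    for _ in range(iterations):
        newV = {}
        for s in mdp.states:
            q_values = [
                sum(p * (mdp.R(s, a, s2) + mdp.gamma * V[s2]) for p, s2 in mdp.T(s, a))
                for a in mdp.actions
                if mdp.T(s, a)
            ]
            newV[s] = max(q_values) if q_values else 0.0  # terminal states keep value 0
        if max(abs(newV[s] - V[s]) for s in mdp.states) < tol:
            return newV
        V = newV
    return V
```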
Computing Time-Limited Values
Example: Value Iteration § Hyperdrive MDP from the previous slide (diagram repeated). Assume no discount!
Example: Value Iteration § Hyperdrive MDP, no discount. Values computed bottom-up: § V_0: Cruising 0, Hyperspace 0, Crashed 0 § V_1: Cruising 2, Hyperspace 1, Crashed 0 § V_2: Cruising 3.5, Hyperspace 2.5, Crashed 0
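If the earlier hypothetical encoding of the hyperdrive MDP matches the diagram, the value-iteration sketch reproduces these numbers:

```python
# Reproducing the worked example with the sketch code above (gamma = 1, i.e. no discount).
V1 = value_iteration(hyperdrive, iterations=1)
V2 = value_iteration(hyperdrive, iterations=2)
print(V1)  # expected: {'Cruising': 2.0, 'Hyperspace': 1.0, 'Crashed': 0.0}
print(V2)  # expected: {'Cruising': 3.5, 'Hyperspace': 2.5, 'Crashed': 0.0}
```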
Value Iteration § Start with V_0(s) = 0: no time steps left means an expected reward sum of zero § Given the vector of V_k(s) values, do one ply of expectimax from each state: § V_{k+1}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_k(s') ] § Repeat until convergence § Complexity of each iteration: O(S²A) § Theorem: will converge to unique optimal values § Basic idea: approximations get refined towards optimal values § Policy may converge long before values do
Convergence* § How do we know the V_k vectors are going to converge? § Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values § Case 2: If the discount is less than 1 § Sketch: For any state, V_k and V_{k+1} can be viewed as depth-(k+1) expectimax results in nearly identical search trees § The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros § That last layer is at best all R_MAX § It is at worst R_MIN § But everything is discounted by γ^k that far out § So V_k and V_{k+1} are at most γ^k max|R| different § So as k increases, the values converge
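Written out, the bound from the sketch is:

```latex
% The two trees differ only in the bottom layer, whose contribution is discounted by gamma^k.
\[
  \bigl|\,V_{k+1}(s) - V_k(s)\,\bigr|
    \;\le\; \gamma^{k}\,\max_{s,a,s'} \bigl|R(s,a,s')\bigr| ,
  \qquad\text{so for } \gamma < 1,\;\; \max_s \bigl|V_{k+1}(s) - V_k(s)\bigr| \to 0 .
\]
```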
[Gridworld value iteration demo: V_k displayed for k = 0, 1, 2, …, 12 and k = 100. Noise = 0.2, Discount = 0.9, Living reward = 0]
Policy Methods
Policy Evaluation
Fixed Policies § Do the optimal action vs. do what π says to do § Expectimax trees max over all actions to compute the optimal values § If we fixed some policy π(s), then the tree would be simpler – only one action per state § … though the tree’s value would depend on which policy we fixed
Utilities for a Fixed Policy § Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy π § Define the utility of a state s, under a fixed policy π: § V^π(s) = expected total discounted rewards starting in s and following π § Recursive relation (one-step look-ahead / Bellman equation): § V^π(s) = Σ_{s'} T(s,π(s),s') [ R(s,π(s),s') + γ V^π(s') ]
Example: Policy Evaluation Always Go Right Always Go Forward
Example: Policy Evaluation Always Go Right Always Go Forward
Policy Evaluation § How do we calculate the V’s for a fixed policy π? § Idea 1: Turn the recursive Bellman equations into updates (like value iteration): § V^π_{k+1}(s) = Σ_{s'} T(s,π(s),s') [ R(s,π(s),s') + γ V^π_k(s') ] § Efficiency: O(S²) per iteration § Idea 2: Without the maxes, the Bellman equations are just a linear system § Solve with Matlab (or your favorite linear system solver)
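Both ideas are easy to sketch against the illustrative MDP class used above (hypothetical code, not the course’s). The exact solve assumes γ < 1, or absorbing terminal states, so that the matrix is invertible.

```python
import numpy as np

def policy_evaluation(mdp: MDP, policy: Dict[str, str],
                      iterations: int = 1000, tol: float = 1e-9) -> Dict[str, float]:
    """Idea 1: iterate the fixed-policy Bellman update until the values stop changing."""
    V = {s: 0.0 for s in mdp.states}
    for _ in range(iterations):
        newV = {}
        for s in mdp.states:
            a = policy.get(s)
            outcomes = mdp.T(s, a) if a is not None else []   # terminal states: empty
            newV[s] = sum(p * (mdp.R(s, a, s2) + mdp.gamma * V[s2]) for p, s2 in outcomes)
        if max(abs(newV[s] - V[s]) for s in mdp.states) < tol:
            return newV
        V = newV
    return V

def policy_evaluation_exact(mdp: MDP, policy: Dict[str, str]) -> Dict[str, float]:
    """Idea 2: without the max the equations are linear, so solve (I - gamma * T_pi) v = r_pi."""
    idx = {s: i for i, s in enumerate(mdp.states)}
    n = len(mdp.states)
    T_pi, r_pi = np.zeros((n, n)), np.zeros(n)
    for s in mdp.states:
        a = policy.get(s)
        for p, s2 in (mdp.T(s, a) if a is not None else []):
            T_pi[idx[s], idx[s2]] += p
            r_pi[idx[s]] += p * mdp.R(s, a, s2)
    v = np.linalg.solve(np.eye(n) - mdp.gamma * T_pi, r_pi)
    return {s: float(v[idx[s]]) for s in mdp.states}
```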
Policy Extraction
Computing Actions from Values § Let’s imagine we have the optimal values V*(s) § How should we act? § It’s not obvious! § We need to do a mini-expectimax (one step): § π*(s) = argmax_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V*(s') ] § This is called policy extraction, since it gets the policy implied by the values
Computing Actions from Q-Values § Let’s imagine we have the optimal q-values Q*(s,a) § How should we act? § Completely trivial to decide! § π*(s) = argmax_a Q*(s,a) § Important lesson: actions are easier to select from q-values than from values!
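As a hypothetical sketch using the same illustrative MDP class: extraction from values needs the model for a one-step expectimax, while extraction from Q-values is a bare argmax.

```python
def extract_policy_from_values(mdp: MDP, V: Dict[str, float]) -> Dict[str, str]:
    """Policy extraction from V: one step of expectimax per state (needs the model T and R)."""
    policy = {}
    for s in mdp.states:
        best_a, best_q = None, float("-inf")
        for a in mdp.actions:
            if not mdp.T(s, a):
                continue
            q = sum(p * (mdp.R(s, a, s2) + mdp.gamma * V[s2]) for p, s2 in mdp.T(s, a))
            if q > best_q:
                best_a, best_q = a, q
        if best_a is not None:
            policy[s] = best_a
    return policy

def extract_policy_from_q(states: List[str], actions: List[str],
                          Q: Dict[Tuple[str, str], float]) -> Dict[str, str]:
    """From Q-values the choice is trivial: pick argmax_a Q(s, a); no model needed."""
    return {s: max(actions, key=lambda a: Q.get((s, a), float("-inf"))) for s in states}
```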
Policy Iteration
Problems with Value Iteration § Value iteration repeats the Bellman updates: § V_{k+1}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_k(s') ] § Problem 1: It’s slow – O(S²A) per iteration § Problem 2: The “max” at each state rarely changes § Problem 3: The policy often converges long before the values
Policy Iteration § Alternative approach for optimal values: § Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence § Step 2: Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values § Repeat steps until policy converges § This is policy iteration § It’s still optimal! § Can converge (much) faster under some conditions
Policy Iteration § Step 1 (Policy Evaluation): For the fixed current policy π, find values with policy evaluation: § Iterate until values converge: § V^π_{k+1}(s) = Σ_{s'} T(s,π(s),s') [ R(s,π(s),s') + γ V^π_k(s') ] § Step 2 (Policy Improvement): For fixed values, get a better policy using policy extraction § One-step look-ahead: § π_new(s) = argmax_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V^π(s') ]
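Putting the two steps together with the helpers sketched earlier (again hypothetical code, not the course’s implementation):

```python
def policy_iteration(mdp: MDP, max_rounds: int = 100) -> Tuple[Dict[str, str], Dict[str, float]]:
    """Alternate policy evaluation and policy improvement until the policy stops changing."""
    # Start from an arbitrary policy: the first available action in each non-terminal state.
    policy = {}
    for s in mdp.states:
        for a in mdp.actions:
            if mdp.T(s, a):
                policy[s] = a
                break
    V = {s: 0.0 for s in mdp.states}
    for _ in range(max_rounds):
        V = policy_evaluation(mdp, policy)                # Step 1: evaluate the current policy
        new_policy = extract_policy_from_values(mdp, V)   # Step 2: one-step look-ahead improvement
        if new_policy == policy:                          # policy unchanged: it is optimal
            break
        policy = new_policy
    return policy, V
```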
Comparison § Both value iteration and policy iteration compute the same thing (all optimal values) § In value iteration: § Every iteration updates both the values and (implicitly) the policy § We don’t track the policy, but taking the max over actions implicitly recomputes it § In policy iteration: § We do several passes that update utilities with fixed policy (each pass is fast because we consider only one action, not all of them) § After the policy is evaluated, a new policy is chosen (slow like a value iteration pass) § The new policy will be better (or we’re done) § Both are dynamic programs for solving MDPs