
CS 573: Artificial Intelligence

Markov Decision Processes

Dan Weld
University of Washington

Slides by Dan Klein & Pieter Abbeel / UC Berkeley (http://ai.berkeley.edu) and by Mausam & Andrey Kolobov

Recap: Defining MDPs

§ Markov decision processes:

§ Set of states S
§ Start state s0
§ Set of actions A
§ Transitions P(s’|s,a) (or T(s,a,s’))
§ Rewards R(s,a,s’) (and discount γ)

§ MDP quantities so far:

§ Policy = choice of action for each state
§ Utility = sum of (discounted) rewards

[Diagram: one-step expectimax tree: state s, action a, q-state (s,a), transition (s,a,s’), next state s’]
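These components are everything an algorithm needs. As a concrete reference, here is a minimal sketch of one possible encoding in Python (the dictionary layout and names are my own illustrative assumptions, not from the slides); the same layout is reused in the value-iteration sketch later in the deck.

```python
# Hypothetical MDP encoding: T[(s, a)] lists (probability, next_state, reward)
# triples, so P(s'|s,a) and R(s,a,s') live in one table.
mdp = {
    "states": ["s0", "s1"],
    "start": "s0",
    "gamma": 0.9,  # discount γ
    "T": {
        ("s0", "go"): [(1.0, "s1", 5.0)],                    # deterministic step
        ("s1", "go"): [(0.5, "s0", 0.0), (0.5, "s1", 1.0)],  # stochastic step
    },
}
```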


Solving MDPs

§ Value Iteration

§ Asynchronous VI

§ Policy Iteration
§ Reinforcement Learning

V* = Optimal Value Function

The value (utility) of a state s:
V*(s) = “expected utility starting in s & acting optimally forever”


Q*

The value (utility) of the q-state (s,a):
Q*(s,a) = “expected utility of 1) starting in state s, 2) taking action a, and 3) acting optimally forever after that”

Q*(s,a) = the reward from executing a in s and ending in s’, plus… the discounted value of V*(s’)
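In symbols, this is exactly the backup formula used throughout the rest of the deck:

```latex
Q^*(s,a) \;=\; \sum_{s'} T(s,a,s')\,\bigl[\,R(s,a,s') \;+\; \gamma\, V^*(s')\,\bigr]
```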

π* Specifies the Optimal Policy

π*(s) = optimal action from state s


The Bellman Equations

How to be optimal:
  Step 1: Take the correct first action
  Step 2: Keep being optimal

The Bellman Equations

§ Definition of “optimal utility” via the expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values

§ These are the Bellman equations, and they characterize optimal values in a way we’ll use over and over

[Diagram: one-step expectimax tree (s, then (s,a), then s’). Photo: Richard Bellman (1920–1984).]
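For reference, the Bellman equations written out in the deck's notation:

```latex
V^*(s) \;=\; \max_a Q^*(s,a)
       \;=\; \max_a \sum_{s'} T(s,a,s')\,\bigl[\,R(s,a,s') \;+\; \gamma\, V^*(s')\,\bigr]
```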


[Screenshots: Gridworld values V* and Gridworld q-values Q*]


No End in Sight…

§ We’re doing way too much work with expectimax!

§ Problem 1: States are repeated
  § Idea: Only compute needed quantities once
  § Like graph search (vs. tree search)

§ Problem 2: Tree goes on forever
  § Rewards @ each step → V changes
  § Idea: Do a depth-limited computation, but with increasing depths until the change is small
  § Note: deep parts of the tree eventually don’t matter if γ < 1

Time-Limited Values

§ Key idea: time-limited values
§ Define Vk(s) to be the optimal value of s if the game ends in k more time steps

§ Equivalently, it’s what a depth-k expectimax would give from s

[Demo – time-limited values (L8D6)]


Value Iteration

[Diagram: Bellman backup: Vk+1(s) at the root, computed from Vk(s’) at the leaves via actions a and transitions (s,a,s’)]

§ Forall s, initialize V0(s) = 0   (no time steps left means an expected reward of zero)
§ Repeat {
    do Bellman backups, ∀ s, a:
      Qk+1(s, a) = Σs’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ]
      Vk+1(s) = maxa Qk+1(s, a)
    k += 1
  } until |Vk+1(s) – Vk(s)| < ε, forall s   (“convergence”)

Each Q/V update is called a “Bellman backup”; the overall scheme is successive approximation, i.e., dynamic programming.
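Below is a minimal runnable sketch of this loop in Python, using the dictionary MDP encoding sketched in the recap section (the function name and structure are my own, not from the slides):

```python
def value_iteration(mdp, eps=1e-6):
    """Repeat Bellman backups until no value changes by more than eps."""
    states, T, gamma = mdp["states"], mdp["T"], mdp["gamma"]
    V = {s: 0.0 for s in states}  # V0(s) = 0: no time steps left, zero reward
    while True:
        # Q_{k+1}(s,a) = sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * V_k(s') ]
        Q = {(s, a): sum(p * (r + gamma * V[s2]) for p, s2, r in outs)
             for (s, a), outs in T.items()}
        # V_{k+1}(s) = max_a Q_{k+1}(s,a); states with no actions keep value 0
        new_V = {s: max((q for (s2, _a), q in Q.items() if s2 == s), default=0.0)
                 for s in states}
        if all(abs(new_V[s] - V[s]) < eps for s in states):  # "convergence"
            return new_V, Q
        V = new_V
```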


Example: Value Iteration

Assume no discount (γ = 1) to keep the math simple!

Qk+1(s, a) = Σs’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ]
Vk+1(s) = maxa Qk+1(s, a)

The MDP is the racing car (the slide's icons: a cool state, a warm state, and a terminal overheated state). Driving slow always earns +1; driving fast earns +2 but from the cool state heats the engine half the time, while from the warm state it overheats, costing −10 and ending the game. Driving slow from the warm state cools the engine half the time.

§ Initialize V0 = 0 for all three states.

§ k = 1:
  Q1(cool, slow) = 1·(1 + 0) = 1
  Q1(cool, fast) = ½(2 + 0) + ½(2 + 0) = 2     → V1(cool) = 2
  Q1(warm, slow) = ½(1 + 0) + ½(1 + 0) = 1
  Q1(warm, fast) = −10 + 0 = −10                → V1(warm) = 1

§ k = 2:
  Q2(cool, slow) = 1·(1 + 2) = 3
  Q2(cool, fast) = ½(2 + 2) + ½(2 + 1) = 3.5   → V2(cool) = 3.5
  Q2(warm, slow) = ½(1 + 2) + ½(1 + 1) = 2.5
  Q2(warm, fast) = −10 + 0 = −10                → V2(warm) = 2.5

Collected values:

  k    cool   warm   overheated
  0     0      0      0
  1     2      1      0
  2     3.5    2.5    0
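To sanity-check these numbers, here is a self-contained sketch in Python (the state and action names are my labels for the slide's icons):

```python
# The racing MDP, encoded as above; T[(s, a)] lists (prob, next_state, reward).
T = {
    ("cool", "slow"): [(1.0, "cool", 1)],
    ("cool", "fast"): [(0.5, "cool", 2), (0.5, "warm", 2)],
    ("warm", "slow"): [(0.5, "cool", 1), (0.5, "warm", 1)],
    ("warm", "fast"): [(1.0, "overheated", -10)],  # overheating ends the game
}
states = ["cool", "warm", "overheated"]
gamma = 1.0  # no discount, as in the example

V = {s: 0.0 for s in states}  # V0 = 0 everywhere
for k in (1, 2):  # two Bellman backups, matching the slides
    Q = {(s, a): sum(p * (r + gamma * V[s2]) for p, s2, r in outs)
         for (s, a), outs in T.items()}
    V = {s: max((q for (s2, _a), q in Q.items() if s2 == s), default=0.0)
         for s in states}
    print(f"V{k} =", V)  # V1: cool 2.0, warm 1.0; V2: cool 3.5, warm 2.5
```

Note that with γ = 1 the values keep growing with k (driving slow earns +1 forever), which is why the example computes time-limited values rather than running to convergence.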


k=0 and k=1

[Screenshots: gridworld values Vk for k = 0 and k = 1. Noise = 0.2, discount = 0.9, living reward = 0.]

If the agent is in (4,3), it has only one legal action: get the jewel. It gets a reward, and the game is over. If the agent is in the pit, it has only one legal action: die. It gets a penalty, and the game is over. The agent does NOT get a reward for moving INTO (4,3).


k = 2 through k = 12, and k = 100

[Screenshots: gridworld values Vk after each iteration, with the same settings throughout: noise = 0.2, discount = 0.9, living reward = 0.]

VI: Policy Extraction

Computing Actions from Values

§ Let’s imagine we have the optimal values V*(s)
§ How should we act?

§ In general, it’s not obvious!

§ We need to do a mini-expectimax (one step), as written out below
§ This is called policy extraction, since it gets the policy implied by the values
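The one-step lookahead in symbols, using the deck's notation:

```latex
\pi^*(s) \;=\; \arg\max_a \sum_{s'} T(s,a,s')\,\bigl[\,R(s,a,s') \;+\; \gamma\, V^*(s')\,\bigr]
```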


Computing Actions from Q-Values

§ Let’s imagine we have the optimal q-values Q*(s,a)
§ How should we act?

§ Completely trivial to decide!

§ Important lesson: actions are easier to select from q-values than values!
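With q-values the lookahead has already been done, so selection is a single arg-max:

```latex
\pi^*(s) \;=\; \arg\max_a Q^*(s,a)
```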

Value Iteration - Recap

[Diagram: Bellman backup: Vk+1(s) at the root, computed from Vk(s’) at the leaves]

§ Forall s, initialize V0(s) = 0   (no time steps left means an expected reward of zero)
§ Repeat {
    do Bellman backups for all states s and all actions a:
      Qk+1(s, a) = Σs’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ]
      Vk+1(s) = maxa Qk+1(s, a)
    k += 1
  } until |Vk+1(s) – Vk(s)| < ε, forall s   (“convergence”)

§ Theorem: value iteration converges to the unique optimal values. (For γ < 1 each Bellman backup is a contraction in the max norm, so the Vk approach V* from any initialization.)