
SLIDE 1

Announcements

Homework 3: Game Trees (lead TA: Zhaoqing)
Due Tue 1 Oct at 11:59pm (deadline extended)

Homework 4: MDPs (lead TA: Iris)
Due Mon 7 Oct at 11:59pm

Project 2: Multi-Agent Search (lead TA: Zhaoqing)
Due Thu 10 Oct at 11:59pm
Office Hours

Iris: Mon 10.00am-noon, RI 237
JW: Tue 1.40pm-2.40pm, DG 111
Zhaoqing: Thu 9.00am-11.00am, HS 202
Eli: Fri 10.00am-noon, RY 207

CS 4100: Artificial Intelligence

Markov Decision Processes

Jan-Willem van de Meent Northeastern University

[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Non-Deterministic Search

Example: Grid World

A maze-like problem
The agent lives in a grid
Walls block the agent’s path

Noisy movement: actions do not always go as planned
80% of the time, the action North takes the agent North (if there is no wall there)
10% of the time, North takes the agent West; 10% East
If there is a wall in the direction the agent would have been taken, the agent stays put

The agent receives rewards each time step
Small “living” reward each step (can be negative)
Big rewards come at the end (good or bad)

Goal: maximize sum of rewards
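
The noisy movement rule can be sketched in a few lines of Python. This is a minimal illustration, not the course's gridworld code: the grid encoding ('#' for a wall), the (row, col) state format, and the names `MOVES`, `SIDEWAYS`, and `transition_distribution` are all assumptions made for this example.

```python
# Directions as (row, col) offsets; row 0 is the top row (an assumption of this sketch).
MOVES = {"North": (-1, 0), "South": (1, 0), "East": (0, 1), "West": (0, -1)}
# The two directions perpendicular to each intended action.
SIDEWAYS = {"North": ("West", "East"), "South": ("East", "West"),
            "East": ("North", "South"), "West": ("South", "North")}


def transition_distribution(grid, state, action, noise=0.2):
    """Return {next_state: probability} for taking `action` in `state`.

    `grid` is a list of strings where '#' marks a wall (hypothetical encoding).
    The intended move happens with probability 1 - noise; each sideways
    slip has probability noise / 2. Blocked moves leave the agent in place.
    """
    def move(direction):
        r, c = state
        dr, dc = MOVES[direction]
        nr, nc = r + dr, c + dc
        inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
        if not inside or grid[nr][nc] == "#":
            return state  # wall or grid edge: stay put
        return (nr, nc)

    left, right = SIDEWAYS[action]
    dist = {}
    for direction, prob in ((action, 1 - noise), (left, noise / 2), (right, noise / 2)):
        nxt = move(direction)
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist


grid = ["....",
        ".#..",
        "...."]
print(transition_distribution(grid, (2, 0), "North"))
```

From the bottom-left cell, North succeeds with probability 0.8, the slip to the West is blocked by the grid edge (so the agent stays put with probability 0.1), and the slip to the East has probability 0.1.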

Grid World Actions

[Figure panels: Deterministic Grid World | Stochastic Grid World]

Markov Decision Processes

An MDP is defined by

A set of states s ∈ S

A set of actions a ∈ A

A transition function T(s, a, s’)
Probability that a from s leads to s’, i.e., P(s’ | s, a)
Also called the model or the dynamics

A reward function R(s, a, s’)
Sometimes just R(s) or R(s’)

A start state

Maybe a terminal state

MDPs are non-deterministic search problems

One way to solve them is with expectimax search

We’ll have a new tool soon

[Demo – gridworld manual intro (L8D1)]

What is Markov about MDPs?

“Markov” generally means that given the current state, the future and the past are independent

For Markov decision processes, “Markov” means action outcomes depend only on the current state

This is just like search, where the successor function could only depend on the current state (not the history)

Andrey Markov (1856-1922)

Policies

In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal

For MDPs, we want an optimal policy π*: S → A

A policy π gives an action for each state

An optimal policy is one that maximizes expected utility

An explicit policy defines a reflex agent

Expectimax didn’t compute entire policies

It computed the action for a single state only

Optimal policy when R(s, a, s’) = -0.03 for all non-terminals s

SLIDE 2

Optimal Policies

[Figure panels: optimal policies for living rewards R(s) = -2.0, R(s) = -0.4, R(s) = -0.03, and R(s) = -0.01]

Example: Racing

A robot car wants to travel far, quickly
Three states: Cool, Warm, Overheated
Two actions: Slow, Fast
Going faster gets double reward

[Figure: transition diagram over the states Cool, Warm, and Overheated, with Slow and Fast actions, transition probabilities 0.5 and 1.0, and rewards +1, +2, and -10]
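
As a hedged illustration, the racing MDP can be written down as plain data following the (S, A, T, R) definition. The specific probabilities and rewards below follow the standard CS188 version of this diagram and should be read as an assumption, since the figure itself is not reproduced in the text. The (probability, next_state, reward) encoding and the helper names are hypothetical.

```python
# Racing MDP written as mdp[state][action] = [(probability, next_state, reward), ...].
# Numbers follow the standard CS188 racing diagram (assumed, not given in the text).
RACING = {
    "Cool": {
        "Slow": [(1.0, "Cool", 1)],
        "Fast": [(0.5, "Cool", 2), (0.5, "Warm", 2)],
    },
    "Warm": {
        "Slow": [(0.5, "Cool", 1), (0.5, "Warm", 1)],
        "Fast": [(1.0, "Overheated", -10)],
    },
    "Overheated": {},  # terminal state: no actions available
}


def expected_reward(mdp, state, action):
    """Expected immediate reward of taking `action` in `state`."""
    return sum(p * r for p, _next_state, r in mdp[state][action])


def check_probabilities(mdp):
    """True if the outcomes of every (state, action) pair sum to probability 1."""
    return all(abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9
               for actions in mdp.values() for outcomes in actions.values())


for s, actions in RACING.items():
    for a in actions:
        print(s, a, expected_reward(RACING, s, a))
```

Storing the reward next to each outcome keeps T(s, a, s’) and R(s, a, s’) in one place, which is convenient for the small tabular examples used here.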

Racing Search Tree

MDP Search Trees

Each MDP state projects an expectimax-like search tree

s is a state
(s, a) is a q-state
(s, a, s’) is called a transition, with probability T(s, a, s’) = P(s’ | s, a) and reward R(s, a, s’)

Utilities of Sequences

What preferences should an agent have over reward sequences?
More or less? [1, 2, 2] or [2, 3, 4]
Now or later? [0, 0, 1] or [1, 0, 0]

Discounting

It’s reasonable to maximize the sum of rewards

It’s also reasonable to prefer rewards now to rewards later

One solution: values of rewards decay exponentially

[Figure panels: Worth Now | Worth Next Step | Worth In Two Steps]

SLIDE 3

Discounting

How to discount?
Each time we descend a level, we multiply in the discount once

Why discount?
Sooner rewards probably do have higher utility than later rewards
Also helps our algorithms converge

Example: discount of 0.5
U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3
U([1,2,3]) < U([3,2,1])
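
The example can be checked with a short sketch. The function name `discounted_utility` is chosen for this illustration only.

```python
def discounted_utility(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


print(discounted_utility([1, 2, 3], 0.5))  # 1*1 + 0.5*2 + 0.25*3
print(discounted_utility([3, 2, 1], 0.5))  # 3*1 + 0.5*2 + 0.25*1
```

With a discount of 0.5 the first sequence is worth 2.75 and the second 4.25, so receiving the large reward first is preferred.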

Stationary Preferences

Theorem: if we assume stationary preferences

Then: there are only two ways to define utilities

Additive utility: U([r_0, r_1, r_2, ...]) = r_0 + r_1 + r_2 + ...

Discounted utility: U([r_0, r_1, r_2, ...]) = r_0 + γ r_1 + γ^2 r_2 + ...

Exercise: Discounting

Given:

Actions: East, West, and Exit (only available in exit states a, e)

Transitions: deterministic

Quiz 1: For γ = 1, what is the optimal policy?

Quiz 2: For γ = 0.1, what is the optimal policy?

Quiz 3: For which γ are West and East equally good when in state d?


γ^3 · 10 = γ · 1  ⇒  γ = √(1/10) ≈ 0.32
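
A small numeric check of the Quiz 3 answer. It assumes, as the answer line implies, a row of states a through e in which exiting at a pays 10 and exiting at e pays 1, so that from d the West route collects 10 after three moves and the East route collects 1 after one move. The function names are hypothetical.

```python
import math


def value_west(gamma):
    """From d: three moves West reach a, then Exit pays 10, discounted by gamma**3."""
    return gamma ** 3 * 10


def value_east(gamma):
    """From d: one move East reaches e, then Exit pays 1, discounted by gamma."""
    return gamma * 1


gamma_star = math.sqrt(1 / 10)  # the break-even discount
print(round(gamma_star, 2), value_west(gamma_star), value_east(gamma_star))

for gamma in (1.0, 0.1):
    better = "West" if value_west(gamma) > value_east(gamma) else "East"
    print(gamma, better)
```

Under these assumptions West is better from d when γ = 1 and East is better when γ = 0.1, with the two routes tied at the break-even discount.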


Infinite Utilities?!

Problem: What if the game lasts forever? Do we get infinite rewards?

Solutions:

Finite horizon: (similar to depth-limited search)
Terminate episodes after a fixed T steps (e.g. life)
Gives nonstationary policies (π depends on time left)

Discounting: use 0 < γ < 1
The discounted sum is then bounded: U([r_0, r_1, ...]) = Σ_t γ^t r_t ≤ R_max / (1 - γ)
Smaller γ means smaller “horizon” – shorter term focus

Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “overheated” for racing)

Recap: Defining MDPs

Markov decision processes:

Set of states S

Start state s_0

Set of actions A

Transitions P(s’ | s, a) (or T(s, a, s’))

Rewards R(s, a, s’) (and discount γ)

MDP quantities so far:

Policy = Choice of action for each state

Utility = sum of (discounted) rewards

Solving MDPs

Optimal Quantities

The value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally

The value (utility) of a q-state (s, a):
Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

The optimal policy:
π*(s) = optimal action from state s

s is a state
(s, a) is a q-state
(s, a, s’) is a transition

[Demo – gridworld values (L8D4)]

SLIDE 4

Gridworld V values

Noise = 0.2 Discount = 0.9 Living reward = 0

Gridworld Q values

Noise = 0.2 Discount = 0.9 Living reward = 0

Values of States

Fundamental operation: compute the (expectimax) value of a state
Expected utility under optimal action
Average sum of (discounted) rewards
This is just what expectimax computed!

Recursive definition of value (Bellman equations):
V*(s) = max_a Q*(s, a)
Q*(s, a) = Σ_s’ T(s, a, s’) [R(s, a, s’) + γ V*(s’)]
V*(s) = max_a Σ_s’ T(s, a, s’) [R(s, a, s’) + γ V*(s’)]

Racing Search Tree

We’re doing way too much work with expectimax!

Problem: States are repeated
Idea: Only compute needed quantities once

Problem: Tree goes on forever
Idea: Do a depth-limited computation, but with increasing depths until change is small
Note: deep parts of the tree eventually don’t matter if γ < 1

Time-Limited Values

Key idea: time-limited values

Define V_k(s) to be the optimal value of s if the game ends in k more time steps

Equivalently, it’s what a depth-k expectimax would give from s
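
The two ideas (compute each quantity once, and limit the depth) can be sketched as a memoized depth-k expectimax. This is an illustration under stated assumptions: the racing numbers follow the standard CS188 diagram, and the (probability, next_state, reward) encoding, the fixed `GAMMA`, and the name `time_limited_value` are hypothetical choices for this example.

```python
from functools import lru_cache

# Racing MDP as mdp[state][action] = [(probability, next_state, reward), ...].
# Numbers follow the standard CS188 racing diagram (assumed, not given in the text).
RACING = {
    "Cool": {"Slow": [(1.0, "Cool", 1)],
             "Fast": [(0.5, "Cool", 2), (0.5, "Warm", 2)]},
    "Warm": {"Slow": [(0.5, "Cool", 1), (0.5, "Warm", 1)],
             "Fast": [(1.0, "Overheated", -10)]},
    "Overheated": {},  # terminal
}
GAMMA = 1.0  # no discounting in this sketch


@lru_cache(maxsize=None)
def time_limited_value(state, k):
    """V_k(state): best expected reward sum when the game ends in k more steps."""
    actions = RACING[state]
    if k == 0 or not actions:
        return 0.0  # no steps left, or terminal state
    return max(
        sum(p * (r + GAMMA * time_limited_value(nxt, k - 1)) for p, nxt, r in outcomes)
        for outcomes in actions.values()
    )


for k in range(4):
    print(k, time_limited_value("Cool", k), time_limited_value("Warm", k))
```

The cache means each (state, k) pair is evaluated once, so the work grows with the number of states times the depth instead of with the size of the full search tree.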

[Demo – time-limited values (L8D6)]

[Figure: gridworld time-limited values for k = 0 (Noise = 0.2, Discount = 0.9, Living reward = 0)]

SLIDE 5

[Figures: gridworld time-limited values for k = 1 through k = 8 (Noise = 0.2, Discount = 0.9, Living reward = 0)]

SLIDE 6

[Figures: gridworld time-limited values for k = 9, 10, 11, 12, and 100 (Noise = 0.2, Discount = 0.9, Living reward = 0)]

Computing Time-Limited Values

Value Iteration

Start with V_0(s) = 0: no time steps left means an expected reward sum of zero

Given vector of V_k(s) values, use expectimax to compute V_{k+1}(s):
V_{k+1}(s) ← max_a Σ_s’ T(s, a, s’) [R(s, a, s’) + γ V_k(s’)]

Repeat until convergence

Complexity of each iteration: O(S^2 A)

Theorem: will converge to unique optimal values
Basic idea: approximations get refined towards optimal values
Policy may converge long before values do
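
A minimal sketch of value iteration, assuming the standard one-step update V_{k+1}(s) = max_a Σ_s’ T(s, a, s’) [R(s, a, s’) + γ V_k(s’)]. The racing numbers follow the standard CS188 diagram and are an assumption here; the (probability, next_state, reward) encoding and the parameter names `max_iterations` and `tol` are hypothetical.

```python
# Racing MDP as mdp[state][action] = [(probability, next_state, reward), ...].
# Numbers follow the standard CS188 racing diagram (assumed, not given in the text).
RACING = {
    "Cool": {"Slow": [(1.0, "Cool", 1)],
             "Fast": [(0.5, "Cool", 2), (0.5, "Warm", 2)]},
    "Warm": {"Slow": [(0.5, "Cool", 1), (0.5, "Warm", 1)],
             "Fast": [(1.0, "Overheated", -10)]},
    "Overheated": {},  # terminal
}


def value_iteration(mdp, gamma, max_iterations=1000, tol=1e-6):
    """Return the list [V_0, V_1, ...] of value tables produced by value iteration.

    Stops after `max_iterations` sweeps, or earlier once the largest change
    in any state's value drops below `tol`.
    """
    values = {s: 0.0 for s in mdp}  # V_0(s) = 0 for every state
    history = [values]
    for _ in range(max_iterations):
        new_values = {}
        for s, actions in mdp.items():
            # One q-value per action: expected reward plus discounted future value.
            q_values = [
                sum(p * (r + gamma * values[nxt]) for p, nxt, r in outcomes)
                for outcomes in actions.values()
            ]
            new_values[s] = max(q_values) if q_values else 0.0  # terminal stays 0
        change = max(abs(new_values[s] - values[s]) for s in mdp)
        values = new_values
        history.append(values)
        if change < tol:
            break
    return history


# Two undiscounted sweeps, then a discounted run to convergence.
print(value_iteration(RACING, gamma=1.0, max_iterations=2))
print(value_iteration(RACING, gamma=0.9)[-1])
```

Each sweep loops over every state, action, and successor, which is where the O(S^2 A) cost per iteration comes from. With γ = 1 the racing values keep growing, so the undiscounted call is capped at a fixed number of sweeps.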

SLIDE 7

Example: Value Iteration

       Cool   Warm   Overheated
V_0    0      0      0
V_1    2      1      0
V_2    3.5    2.5    0

Assume no discount!

Convergence*

How do we know the V_k vectors are going to converge?

Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values

Case 2: If the discount is less than 1

Sketch: For any state, V_k and V_{k+1} can be viewed as depth k+1 expectimax results in nearly identical search trees

The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros

That last layer is at best all R_max

It is at worst R_min

But everything is discounted by γ^k that far out

So V_k and V_{k+1} differ by at most γ^k max(|R_min|, |R_max|), which is no more than γ^k (R_max - R_min) whenever R_min ≤ 0 ≤ R_max

So as k increases, the values converge
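
The shrinking gap between successive value vectors can be observed numerically. The sketch below compares max_s |V_{k+1}(s) - V_k(s)| against γ^k times the largest reward magnitude on the racing example, where that magnitude (10) is itself below R_max - R_min (12). The racing numbers follow the standard CS188 diagram and are an assumption; `sweep` and `successive_gaps` are hypothetical helper names.

```python
# Racing MDP as mdp[state][action] = [(probability, next_state, reward), ...].
# Numbers follow the standard CS188 racing diagram (assumed, not given in the text).
RACING = {
    "Cool": {"Slow": [(1.0, "Cool", 1)],
             "Fast": [(0.5, "Cool", 2), (0.5, "Warm", 2)]},
    "Warm": {"Slow": [(0.5, "Cool", 1), (0.5, "Warm", 1)],
             "Fast": [(1.0, "Overheated", -10)]},
    "Overheated": {},  # terminal
}


def sweep(mdp, values, gamma):
    """Apply one value-iteration update to every state and return the new values."""
    new_values = {}
    for s, actions in mdp.items():
        q_values = [sum(p * (r + gamma * values[nxt]) for p, nxt, r in outcomes)
                    for outcomes in actions.values()]
        new_values[s] = max(q_values) if q_values else 0.0
    return new_values


def successive_gaps(mdp, gamma, steps):
    """Return [(max_s |V_{k+1}(s) - V_k(s)|, gamma**k * max|R|)] for k = 0..steps-1."""
    max_abs_reward = max(abs(r) for actions in mdp.values()
                         for outcomes in actions.values() for _, _, r in outcomes)
    values = {s: 0.0 for s in mdp}  # V_0
    gaps = []
    for k in range(steps):
        new_values = sweep(mdp, values, gamma)
        gap = max(abs(new_values[s] - values[s]) for s in mdp)
        gaps.append((gap, gamma ** k * max_abs_reward))
        values = new_values
    return gaps


for k, (gap, bound) in enumerate(successive_gaps(RACING, gamma=0.9, steps=5)):
    print(k, round(gap, 4), round(bound, 4))
```

Each printed gap stays below its bound, and both shrink geometrically with k, which is the behavior the sketch of Case 2 describes.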