

SLIDE 1

Announcements

  • Homework 3: Game Trees (lead TA: Zhaoqing)
      • Due Tue 1 Oct at 11:59pm (deadline extended)
  • Homework 4: MDPs (lead TA: Iris)
      • Due Mon 7 Oct at 11:59pm
  • Project 2: Multi-Agent Search (lead TA: Zhaoqing)
      • Due Thu 10 Oct at 11:59pm
  • Office Hours
      • Iris: Mon 10.00am-noon, RI 237
      • JW: Tue 1.40pm-2.40pm, DG 111
      • Zhaoqing: Thu 9.00am-11.00am, HS 202
      • Eli: Fri 10.00am-noon, RY 207

SLIDE 2

CS 4100: Artificial Intelligence

Markov Decision Processes

Jan-Willem van de Meent, Northeastern University

[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

SLIDE 3

Non-Deterministic Search

SLIDE 4

Example: Grid World

  • A maze-like problem
      • The agent lives in a grid
      • Walls block the agent’s path
  • Noisy movement: actions do not always go as planned (see the transition sketch below)
      • 80% of the time, the action North takes the agent North (if there is no wall there)
      • 10% of the time, North takes the agent West; 10% East
      • If there is a wall in the direction the agent would have been taken, the agent stays put
  • The agent receives rewards each time step
      • Small “living” reward each step (can be negative)
      • Big rewards come at the end (good or bad)
  • Goal: maximize sum of rewards
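
The noise model can be written as a distribution over successor states. A minimal sketch, assuming (as in the standard Grid World) that the 10% deviations are perpendicular to the intended direction; the grid encoding and the `is_wall` helper are illustrative, not the course's project code:

```python
# Perpendicular directions for the 80/10/10 noise model (illustrative).
LEFT_OF  = {"N": "W", "S": "E", "E": "N", "W": "S"}
RIGHT_OF = {"N": "E", "S": "W", "E": "S", "W": "N"}
MOVE     = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def transition_distribution(state, action, is_wall):
    """Return {next_state: probability}: 80% intended, 10% each side."""
    dist = {}
    for direction, prob in [(action, 0.8),
                            (LEFT_OF[action], 0.1),
                            (RIGHT_OF[action], 0.1)]:
        dx, dy = MOVE[direction]
        nxt = (state[0] + dx, state[1] + dy)
        if is_wall(nxt):          # blocked: the agent stays put
            nxt = state
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist
```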

SLIDE 5

Grid World Actions

Deterministic Grid World vs. Stochastic Grid World

SLIDE 6

Markov Decision Processes

  • An MDP is defined by:
      • A set of states s ∈ S
      • A set of actions a ∈ A
      • A transition function T(s, a, s’)
          • Probability that a from s leads to s’, i.e., P(s’ | s, a)
          • Also called the model or the dynamics
      • A reward function R(s, a, s’)
          • Sometimes just R(s) or R(s’)
      • A start state
      • Maybe a terminal state
  • MDPs are non-deterministic search problems
      • One way to solve them is with expectimax search
      • We’ll have a new tool soon

[Demo – gridworld manual intro (L8D1)]
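
These ingredients fit in a small container. A minimal sketch in Python (an illustrative interface, not the course's project code; the type aliases are assumptions):

```python
# An MDP bundles states, actions, transition model, rewards, and a start
# state, matching the definition above.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]   # e.g., a grid cell
Action = str              # e.g., "N", "S", "E", "W"

@dataclass
class MDP:
    states: List[State]
    actions: Callable[[State], List[Action]]                   # A(s)
    transition: Callable[[State, Action], Dict[State, float]]  # P(s' | s, a)
    reward: Callable[[State, Action, State], float]            # R(s, a, s')
    start: State
    terminals: List[State]                                     # may be empty
```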

SLIDE 7

What is Markov about MDPs?

  • “Markov” generally means that, given the current state, the future and the past are independent
  • For Markov decision processes, “Markov” means action outcomes depend only on the current state
  • This is just like search, where the successor function can only depend on the current state (not the history)

Andrey Markov (1856-1922)
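
In symbols, the Markov property above is the standard condition (the slide's equation did not survive extraction, so it is restated here in the document's notation):

    P(S_{t+1} = s’ | S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, …, S_0 = s_0)
        = P(S_{t+1} = s’ | S_t = s_t, A_t = a_t)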

SLIDE 8

Policies

  • In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal
  • For MDPs, we want an optimal policy π*: S → A
      • A policy π gives an action for each state
      • An optimal policy is one that maximizes expected utility
      • An explicit policy defines a reflex agent
  • Expectimax didn’t compute entire policies
      • It computed the action for a single state only

Optimal policy when R(s, a, s’) = -0.03 for all non-terminals s

SLIDE 9

Optimal Policies

Optimal policies shown for R(s) = -2.0, R(s) = -0.4, R(s) = -0.03, and R(s) = -0.01

SLIDE 10

Example: Racing

SLIDE 11

Example: Racing

  • A robot car wants to travel far, quickly
  • Three states: Cool, Warm, Overheated
  • Two actions: Slow, Fast
  • Going faster gets double reward

States: Cool, Warm, Overheated. Transition model (reconstructed from the diagram):

    Cool, Slow → Cool        (prob 1.0, reward +1)
    Cool, Fast → Cool        (prob 0.5, reward +2)
    Cool, Fast → Warm        (prob 0.5, reward +2)
    Warm, Slow → Cool        (prob 0.5, reward +1)
    Warm, Slow → Warm        (prob 0.5, reward +1)
    Warm, Fast → Overheated  (prob 1.0, reward -10)
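
The same model written as plain data, in a form we can reuse later for value iteration (a sketch; the dictionary layout is an illustrative choice, not the project code):

```python
# Racing MDP: (state, action) -> list of (next_state, prob, reward),
# matching the transition table above. "overheated" is terminal.
RACING = {
    ("cool", "slow"): [("cool", 1.0, +1)],
    ("cool", "fast"): [("cool", 0.5, +2), ("warm", 0.5, +2)],
    ("warm", "slow"): [("cool", 0.5, +1), ("warm", 0.5, +1)],
    ("warm", "fast"): [("overheated", 1.0, -10)],
}
```
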
SLIDE 12

Racing Search Tree

SLIDE 13

MDP Search Trees

  • Each MDP state projects an expectimax-like search tree
      • s is a state
      • (s, a) is a q-state
      • (s, a, s’) is called a transition
      • T(s, a, s’) = P(s’ | s, a)
      • R(s, a, s’) is the reward for that transition

SLIDE 14

Utilities of Sequences

SLIDE 15

Utilities of Sequences

  • What preferences should an agent have over reward sequences?
  • More or less? [1, 2, 2] or [2, 3, 4]
  • Now or later? [0, 0, 1] or [1, 0, 0]

SLIDE 16

Discounting

  • It’s reasonable to maximize the sum of rewards
  • It’s also reasonable to prefer rewards now to rewards later
  • One solution: values of rewards decay exponentially

A reward is worth 1 now, γ one step from now, and γ² two steps from now

SLIDE 17

Discounting

  • How to discount?
      • Each time we descend a level, we multiply in the discount once
  • Why discount?
      • Sooner rewards probably do have higher utility than later rewards
      • Also helps our algorithms converge
  • Example: discount of 0.5
      • U([1, 2, 3]) = 1·1 + 0.5·2 + 0.25·3
      • U([1, 2, 3]) < U([3, 2, 1])
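
A two-line check of the example above (illustrative code, not part of the slides):

```python
# Discounted utility of a reward sequence: sum over t of gamma**t * r_t.
def discounted_utility(rewards, gamma):
    return sum(gamma**t * r for t, r in enumerate(rewards))

assert discounted_utility([1, 2, 3], 0.5) == 1*1 + 0.5*2 + 0.25*3   # 2.75
assert discounted_utility([1, 2, 3], 0.5) < discounted_utility([3, 2, 1], 0.5)
```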

SLIDE 18

Stationary Preferences

  • Theorem: if we assume stationary preferences:

        [a₁, a₂, …] ≻ [b₁, b₂, …]  ⇔  [r, a₁, a₂, …] ≻ [r, b₁, b₂, …]

  • Then: there are only two ways to define utilities
      • Additive utility: U([r₀, r₁, r₂, …]) = r₀ + r₁ + r₂ + …
      • Discounted utility: U([r₀, r₁, r₂, …]) = r₀ + γr₁ + γ²r₂ + …

SLIDE 19

Exercise: Discounting

  • Given:
      • Actions: East, West, and Exit (only available in exit states a, e)
      • Transitions: deterministic
  • Quiz 1: For γ = 1, what is the optimal policy?
  • Quiz 2: For γ = 0.1, what is the optimal policy?
  • Quiz 3: For which γ are West and East equally good when in state d?

SLIDE 20

Exercise: Discounting

  • Given:
      • Actions: East, West, and Exit (only available in exit states a, e)
      • Transitions: deterministic
  • Quiz 1: For γ = 1, what is the optimal policy?
  • Quiz 2: For γ = 0.1, what is the optimal policy?
  • Quiz 3: For which γ are West and East equally good when in state d?

        γ³ · 10 = γ · 1   ⟹   γ = √(1/10) ≈ 0.32

SLIDE 21

Infinite Utilities?!

  • Problem: What if the game lasts forever? Do we get infinite rewards?
  • Solutions:
      • Finite horizon: (similar to depth-limited search)
          • Terminate episodes after a fixed T steps (e.g. life)
          • Gives nonstationary policies (π depends on time left)
      • Discounting: use 0 < γ < 1
          • Smaller γ means smaller “horizon” – shorter term focus
      • Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “overheated” for racing)
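
Why discounting keeps utilities finite: with every reward bounded by Rmax, the discounted sum is dominated by a geometric series (a standard bound, stated here for completeness):

    U([r₀, r₁, r₂, …]) = Σₜ γᵗ rₜ  ≤  Σₜ γᵗ Rmax  =  Rmax / (1 − γ)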

SLIDE 22

Recap: Defining MDPs

  • Markov decision processes:
      • Set of states S
      • Start state s0
      • Set of actions A
      • Transitions P(s’ | s, a) (or T(s, a, s’))
      • Rewards R(s, a, s’) (and discount γ)
  • MDP quantities so far:
      • Policy = choice of action for each state
      • Utility = sum of (discounted) rewards

SLIDE 23

Solving MDPs

SLIDE 24

Optimal Quantities

  • The value (utility) of a state s:
      • V*(s) = expected utility starting in s and acting optimally
  • The value (utility) of a q-state (s, a):
      • Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
  • The optimal policy:
      • π*(s) = optimal action from state s
  • As in the search tree: s is a state, (s, a) is a q-state, and (s, a, s’) is a transition

[Demo – gridworld values (L8D4)]

SLIDE 25

Gridworld V values

Noise = 0.2, Discount = 0.9, Living reward = 0

SLIDE 26

Gridworld Q values

Noise = 0.2, Discount = 0.9, Living reward = 0

SLIDE 27

Values of States

  • Fundamental operation: compute the (expectimax) value of a state
      • Expected utility under optimal action
      • Average sum of (discounted) rewards
      • This is just what expectimax computed!
  • Recursive definition of value (Bellman equations):

        V*(s) = max_a Q*(s, a)
        Q*(s, a) = Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]
        V*(s) = max_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]
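
As code, one Bellman backup over the dictionary encoding used for the racing MDP (a sketch, not the project API):

```python
# Expectimax value of `state` given value estimates V for successors.
# `mdp` maps (state, action) -> [(next_state, prob, reward), ...].
def q_value(mdp, state, action, V, gamma):
    return sum(p * (r + gamma * V[s2]) for s2, p, r in mdp[(state, action)])

def state_value(mdp, state, V, gamma):
    actions = [a for (s, a) in mdp if s == state]
    if not actions:        # terminal state: no actions, value 0
        return 0.0
    return max(q_value(mdp, state, a, V, gamma) for a in actions)
```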

SLIDE 28

Racing Search Tree

SLIDE 29

Racing Search Tree

SLIDE 30

Racing Search Tree

  • We’re doing way too much work with expectimax!
  • Problem: states are repeated
      • Idea: only compute needed quantities once
  • Problem: the tree goes on forever
      • Idea: do a depth-limited computation, but with increasing depths until change is small
      • Note: deep parts of the tree eventually don’t matter if γ < 1

SLIDE 31

Time-Limited Values

  • Key idea: time-limited values
  • Define Vk(s) to be the optimal value of s if the game ends in k more time steps
      • Equivalently, it’s what a depth-k expectimax would give from s

[Demo – time-limited values (L8D6)]

SLIDES 32-45

Gridworld time-limited values Vk for k = 0, 1, 2, …, 12, and k = 100
(one snapshot per slide; Noise = 0.2, Discount = 0.9, Living reward = 0)

SLIDE 46

Computing Time-Limited Values

SLIDE 47

Value Iteration

SLIDE 48

Value Iteration

  • Start with V0(s) = 0: no time steps left means an expected reward sum of zero
  • Given a vector of Vk(s) values, use expectimax to compute Vk+1(s):

        Vk+1(s) = max_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ]

  • Repeat until convergence
  • Complexity of each iteration: O(S²A)
  • Theorem: will converge to unique optimal values
      • Basic idea: approximations get refined towards optimal values
      • Policy may converge long before values do
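
Putting the backup in a loop gives the full algorithm. A self-contained sketch over the dictionary MDP encoding used earlier (illustrative, not the project code):

```python
# Value iteration over {(state, action): [(next_state, prob, reward), ...]}.
# Each sweep costs O(S^2 A) in the worst case, matching the slide.
def value_iteration(mdp, states, gamma, tol=1e-6, max_iters=10_000):
    V = {s: 0.0 for s in states}                       # V_0(s) = 0
    for _ in range(max_iters):
        V_new = {}
        for s in states:
            acts = [a for (s0, a) in mdp if s0 == s]
            V_new[s] = max(
                (sum(p * (r + gamma * V[s2]) for s2, p, r in mdp[(s, a)])
                 for a in acts),
                default=0.0,                           # terminal state
            )
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new                               # converged
        V = V_new
    return V

# Hypothetical usage with the RACING dict from the racing slide:
# V = value_iteration(RACING, ["cool", "warm", "overheated"], gamma=0.9)
```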

SLIDE 49

Example: Value Iteration

          Cool   Warm   Overheated
    V0:    0      0      0
    V1:    2      1      0
    V2:    3.5    2.5    0

Assume no discount (γ = 1)!
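
Checking the table by hand with the transition model reconstructed on the racing slide (γ = 1):

    V1(Cool) = max{ 1.0·(1+0), 0.5·(2+0) + 0.5·(2+0) } = max{1, 2} = 2
    V1(Warm) = max{ 0.5·(1+0) + 0.5·(1+0), 1.0·(−10+0) } = max{1, −10} = 1
    V2(Cool) = max{ 1.0·(1+2), 0.5·(2+2) + 0.5·(2+1) } = max{3, 3.5} = 3.5
    V2(Warm) = max{ 0.5·(1+2) + 0.5·(1+1), 1.0·(−10+0) } = max{2.5, −10} = 2.5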

SLIDE 50

Convergence*

  • How do we know the Vk vectors are going to converge?
  • Case 1: If the tree has maximum depth M, then VM holds the actual untruncated values
  • Case 2: If the discount is less than 1
      • Sketch: For any state, Vk and Vk+1 can be viewed as depth-(k+1) expectimax results in nearly identical search trees
      • The difference is that on the bottom layer, Vk+1 has actual rewards while Vk has zeros
      • That last layer is at best all Rmax and at worst all Rmin
      • But everything is discounted by γᵏ that far out
      • So Vk and Vk+1 are at most γᵏ (Rmax − Rmin) different
      • So as k increases, the values converge
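
The sketch compresses to a single bound: for every state s,

    |Vk+1(s) − Vk(s)|  ≤  γᵏ (Rmax − Rmin)  →  0 as k → ∞, for γ < 1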

SLIDE 51

Next Time: Policy-Based Methods