
Announcements

  • Homework 4: MDPs (lead TA: Iris)
    • Due Mon 7 Oct at 11:59pm
  • Project 2: Multi-Agent Search (lead TA: Zhaoqing)
    • Due Thu 10 Oct at 11:59pm
  • Office Hours
    • Iris: Mon 10.00am-noon, RI 237
    • JW: Tue 1.40pm-2.40pm, DG 111
    • Zhaoqing: Thu 9.00am-11.00am, HS 202
    • Eli: Fri 10.00am-noon, RY 207

CS 4100: Artificial Intelligence

Markov Decision Processes II

Jan-Willem van de Meent, Northeastern University

[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Example: Grid World

  • A maze-like problem
    • The agent lives in a grid
    • Walls block the agent’s path
  • Noisy movement: actions do not always go as planned
    • 80% of the time, the action North takes the agent North (if there is no wall there)
    • 10% of the time, North takes the agent West; 10% East
    • If there is a wall in the direction the agent would have been taken, the agent stays put
  • The agent receives rewards each time step
    • Small “living” reward each step (can be negative)
    • Big rewards come at the end (good or bad)
  • Goal: maximize sum of rewards
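To make the noisy transition model concrete, here is a minimal Python sketch (an illustration under assumed conventions, not code from the course: states are (x, y) cells, walls is a set of blocked cells, grid edges can be modeled as wall cells, and the helper names are hypothetical):

    # Directions to the left/right of each action, and the grid step for each action.
    LEFT_OF  = {"North": "West", "West": "South", "South": "East", "East": "North"}
    RIGHT_OF = {"North": "East", "East": "South", "South": "West", "West": "North"}
    STEP     = {"North": (0, 1), "South": (0, -1), "East": (1, 0), "West": (-1, 0)}

    def successors(state, action, walls):
        """Return (next_state, probability) pairs for a noisy action."""
        def move(direction):
            dx, dy = STEP[direction]
            nxt = (state[0] + dx, state[1] + dy)
            return state if nxt in walls else nxt   # blocked: the agent stays put
        return [(move(action), 0.8),                # 80%: intended direction
                (move(LEFT_OF[action]), 0.1),       # 10%: slips to the left
                (move(RIGHT_OF[action]), 0.1)]      # 10%: slips to the right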

Recap: MDPs

  • Markov decision processes:
    • Set of states S
    • Start state s0
    • Set of actions A
    • Transitions P(s’|s,a) (or T(s,a,s’))
    • Rewards R(s,a,s’) (and discount γ)
  • MDP quantities so far:
    • Policy = choice of action for each state
    • Utility = sum of (discounted) rewards


Optimal Quantities

  • The value (utility) of a state s:
    • V*(s) = expected utility starting in s and acting optimally
  • The value (utility) of a q-state (s,a):
    • Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
  • The optimal policy:
    • π*(s) = optimal action from state s

s is a state, (s, a) is a q-state, (s,a,s’) is a transition

[Demo – gridworld values (L8D4)]

Gridworld V*(s) values

Gridworld Q*(s,a) values

The Bellman Equations

How to be optimal:
Step 1: Take correct first action
Step 2: Keep being optimal


The Bellman Equations

  • Definition of “optimal utility” via expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values
  • These are the Bellman equations, and they characterize optimal values in a way we’ll use over and over

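The equations themselves appear only as images on the slide; written out in the notation above (standard form, not copied from the slide image), they are

    V^*(s)   = \max_a Q^*(s,a)
    Q^*(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]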

Value Iteration

  • Bellman equations characterize the optimal values
  • Value iteration computes updates
  • Value iteration is just a fixed-point solution method
    • … though the Vk vectors are also interpretable as time-limited values

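The update formula itself is an image on the slide; in standard form it is

    V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V_k(s') \right]

As a concrete illustration (a minimal sketch, not the course implementation), the loop below assumes a hypothetical interface in which states is a list, actions(s) returns the legal actions, and transitions(s, a) yields (next_state, probability, reward) triples:

    def value_iteration(states, actions, transitions, gamma, num_iters=100):
        # Start from V_0 = 0 everywhere, then apply batch Bellman updates.
        V = {s: 0.0 for s in states}
        for _ in range(num_iters):
            V_new = {}
            for s in states:
                # One-step lookahead: Q(s,a) for every legal action, then take the max.
                q_values = [sum(p * (r + gamma * V[s2])
                                for (s2, p, r) in transitions(s, a))
                            for a in actions(s)]
                V_new[s] = max(q_values) if q_values else 0.0
            V = V_new  # synchronous update: V_{k+1} computed entirely from V_k
        return V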

Convergence*

  • How do we know the Vk vectors are going to converge?
  • Case 1: If the tree has maximum depth M, then VM holds the actual untruncated values
  • Case 2: If the discount is less than 1
    • Sketch: For any state, Vk and Vk+1 can be viewed as depth-(k+1) expectimax results in nearly identical search trees
    • The difference is that on the bottom layer, Vk+1 has actual rewards while Vk has zeros
    • That last layer is at best all Rmax
    • It is at worst Rmin
    • But everything is discounted by γ^k that far out
    • So Vk and Vk+1 are at most γ^k (Rmax − Rmin) different
    • So as k increases, the values converge
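In symbols, the bound sketched above (my restatement of the last two bullets) is

    \max_s |V_{k+1}(s) - V_k(s)| \le \gamma^k \, (R_{\max} - R_{\min}),

which shrinks to zero as k grows whenever γ < 1, so the value estimates converge.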

Policy Methods

Policy Evaluation

Fixed Policies

  • Expectimax: compute max over all actions to compute the optimal values
  • For a fixed policy π(s), the tree would be simpler – only one action per state
    • … though the tree’s value would depend on which policy we use

[Diagrams: an expectimax tree (“do the optimal action”) vs. a fixed-policy tree (“do what π says to do”)]

Utilities for a Fixed Policy

  • Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy
  • Define the utility of a state s, under a fixed policy π:
    • Vπ(s) = expected total discounted rewards starting in s and following π
  • Recursive relation (one-step look-ahead / Bellman equation):
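The recursive relation is shown on the slide as a formula image; in the notation above it reads

    V^\pi(s) = \sum_{s'} T\bigl(s, \pi(s), s'\bigr) \left[ R\bigl(s, \pi(s), s'\bigr) + \gamma V^\pi(s') \right]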

Example: Policy Evaluation

Always Go Right vs. Always Go Forward


Example: Policy Evaluation

Always Go Right vs. Always Go Forward

Policy Evaluation

  • How do we calculate the V’s for a fixed policy π?
  • Idea 1: Turn the recursive Bellman equations into updates (like value iteration)
    • Efficiency: O(S²) per iteration
  • Idea 2: Without the maxes, the Bellman equations are just a linear system
    • Solve with Matlab (or your favorite linear system solver)
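A minimal sketch of Idea 2 in Python with numpy (illustration only; it assumes the same hypothetical interface as the earlier sketches, with states a list, policy a dict mapping state → action, and transitions(s, a) yielding (next_state, probability, reward) triples):

    import numpy as np

    def evaluate_policy_exact(states, policy, transitions, gamma):
        """Solve V^pi = r_pi + gamma * T_pi V^pi as the linear system (I - gamma T_pi) V = r_pi."""
        idx = {s: i for i, s in enumerate(states)}
        A = np.eye(len(states))        # starts as I, becomes (I - gamma * T_pi)
        b = np.zeros(len(states))      # expected one-step reward under the policy
        for s in states:
            i = idx[s]
            for (s2, p, r) in transitions(s, policy[s]):
                A[i, idx[s2]] -= gamma * p
                b[i] += p * r
        return dict(zip(states, np.linalg.solve(A, b)))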

Policy Extraction

Computing Actions from Values

  • Let’s imagine we have the optimal values V*(s)
  • How should we act?
    • It’s not obvious!
  • We need to do a mini-expectimax (one step)
  • This is called policy extraction, since it finds the policy implied by the values
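The one-step “mini-expectimax” used for extraction (shown as a formula image on the slide) is, in standard form,

    \pi^*(s) = \arg\max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]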

Computing Actions from Q-Values

  • Let’s imagine we have the optimal q-values:
  • How should we act?
    • Completely trivial to decide!
  • Important lesson: actions are easier to select from q-values than values!
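With q-values, the extraction step (again an image on the slide, standard form) collapses to a plain argmax, with no one-step lookahead over the transition model required:

    \pi^*(s) = \arg\max_a Q^*(s,a)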

Policy Iteration

Problems with Value Iteration

  • Value iteration repeats the Bellman updates:
  • Problem 1: It’s slow – O(S²A) per iteration
  • Problem 2: The “max” at each state rarely changes
  • Problem 3: The policy often converges long before the values


[Demo: value iteration (L9D2)]

[Demo frames: gridworld values after k = 0, 1, 2, …, 12 and k = 100 rounds of value iteration; Noise = 0.2, Discount = 0.9, Living reward = 0]

Policy Iteration

  • Alternative approach for optimal values:
    • Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
    • Step 2: Policy improvement: update the policy using one-step look-ahead with the converged (but not optimal!) utilities as future values
    • Repeat steps until the policy converges
  • This is policy iteration
    • It’s still optimal!
    • Can converge (much) faster under some conditions

Policy Iteration

  • Evaluation: For the fixed current policy π, find values Vπ (with policy evaluation):
    • Iterate until values converge:
  • Improvement: For fixed values, get a better policy (using policy extraction)
    • One-step look-ahead:
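The two update formulas on this slide are images in the original; in standard notation (not copied from the slide) they are

    Evaluation, iterated until convergence for the fixed policy π_i:
    V^{\pi_i}_{k+1}(s) \leftarrow \sum_{s'} T\bigl(s, \pi_i(s), s'\bigr) \left[ R\bigl(s, \pi_i(s), s'\bigr) + \gamma V^{\pi_i}_k(s') \right]

    Improvement, one-step look-ahead with the converged values:
    \pi_{i+1}(s) = \arg\max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^{\pi_i}(s') \right]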

Value Iteration vs Policy Iteration

  • Both value iteration and policy iteration compute the same thing (all optimal values)
  • In value iteration:
    • Every iteration updates both the values and (implicitly) the policy
    • We don’t extract the policy, but taking the max over actions implicitly (re)computes it
  • In policy iteration:
    • We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them)
    • After the policy is evaluated, we update the policy (slow like a value iteration pass)
    • The new policy will be better (or we’re done)
  • Both are dynamic programs for solving MDPs
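Putting the two steps together, here is a minimal policy-iteration sketch under the same hypothetical interface as the earlier snippets; it reuses the evaluate_policy_exact helper sketched under Policy Evaluation and assumes every state has at least one legal action:

    def policy_iteration(states, actions, transitions, gamma):
        policy = {s: actions(s)[0] for s in states}       # arbitrary initial policy
        while True:
            # Step 1: policy evaluation for the current fixed policy.
            V = evaluate_policy_exact(states, policy, transitions, gamma)
            # Step 2: policy improvement via one-step look-ahead.
            stable = True
            for s in states:
                best = max(actions(s),
                           key=lambda a: sum(p * (r + gamma * V[s2])
                                             for (s2, p, r) in transitions(s, a)))
                if best != policy[s]:
                    policy[s], stable = best, False
            if stable:                                     # policy unchanged: we're done
                return policy, V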


Summary: MDP Algorithms

  • So you want to…
    • Compute optimal values: use value iteration or policy iteration
    • Compute values for a particular policy: use policy evaluation
    • Turn your values into a policy: use policy extraction (one-step lookahead)
  • These all look the same!
    • They basically are – they are all variations of Bellman updates
    • They all use one-step lookahead expectimax fragments
    • They differ only in whether we plug in a fixed policy or do a max over actions

Double Bandits

Double-Bandit MDP

  • Actions: Blue, Red
  • States: Win, Lose

[Diagram: states W and L; the Blue action pays $1 with probability 1.0, the Red action pays $2 with probability 0.75 and $0 with probability 0.25]

No discount, 100 time steps, both states have the same value

Offline Planning

  • Solving MDPs is offline planning
    • You determine all quantities through computation
    • You need to know the details of the MDP
    • You do not actually play the game!

  Value of playing Red: 150    Value of playing Blue: 100
  (No discount, 100 time steps, both states have the same value)
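The 150 and 100 follow from a quick expected-reward calculation (my arithmetic, consistent with the values above):

    E[reward per step | Red]  = 0.75 × $2 + 0.25 × $0 = $1.50
    E[reward per step | Blue] = 1.00 × $1             = $1.00

    Over 100 undiscounted steps:  Red → 100 × 1.50 = 150,  Blue → 100 × 1.00 = 100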

Let’s Play!

$2 $2 $0 $2 $2 $2 $2 $0 $0 $0

Online Planning

  • Rules changed! Red’s win chance is different.

[Diagram: same double-bandit MDP, but Red’s payout probabilities are now unknown (“??” for the $2 and $0 outcomes); Blue still pays $1 with probability 1.0]

Let’s Play!

$0 $0 $0 $2 $0 $2 $0 $0 $0 $0

What Just Happened?

  • That wasn’t planning, it was learning!
    • Specifically, reinforcement learning
    • There was an MDP, but you couldn’t solve it with just computation
    • You needed to actually act to figure it out
  • Important ideas in reinforcement learning that came up
    • Exploration: you have to try unknown actions to get information
    • Exploitation: eventually, you have to use what you know to maximize returns
    • Regret: even if you learn intelligently, you make mistakes
    • Sampling: because of chance, you have to try things repeatedly
    • Difficulty: learning can be much harder than solving a known MDP


Next Time: Reinforcement Learning!