

SLIDE 1

CS 730/730W/830: Intro AI

MDP Wrap-Up ADP Q-Learning

Wheeler Ruml (UNH), Lecture 18, CS 730

handout: slides. Project proposals are due.

SLIDE 2

MDP Wrap-Up

■ RTDP
■ MDPs


SLIDE 3

Real-time Dynamic Programming


for a known MDP. Which states should we update? Initialize U to an upper bound, then update U as we follow the greedy policy from s0:

U(s) ← R(s) + γ max_a Σ_{s′} T(s, a, s′) U(s′)

This concentrates effort on states the agent is likely to visit (nice anytime profile).
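The trial loop might look like the following Python sketch; the MDP encoding (T as a dict mapping (s, a) to a list of (probability, successor) pairs) and the n_trials/horizon parameters are illustrative assumptions, not from the slides:

    import random

    def rtdp(S, A, T, R, gamma, s0, U_upper, n_trials=100, horizon=50):
        """Real-time dynamic programming sketch: initialize U to an
        upper bound, then back up only the states visited while
        following the greedy policy from s0."""
        U = {s: U_upper for s in S}          # optimistic initialization

        def expect(s, a):                    # sum_s' T(s,a,s') U(s')
            return sum(p * U[s2] for p, s2 in T[(s, a)])

        for _ in range(n_trials):
            s = s0
            for _ in range(horizon):
                # U(s) <- R(s) + gamma * max_a sum_s' T(s,a,s') U(s')
                U[s] = R[s] + gamma * max(expect(s, a) for a in A)
                a = max(A, key=lambda act: expect(s, act))   # greedy action
                probs, succs = zip(*T[(s, a)])
                s = random.choices(succs, weights=probs)[0]  # sample s'
        return U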

SLIDE 4

Summary of MDP Solving


■ value iteration: compute U^π*
    prioritized sweeping
    RTDP
■ policy iteration: compute U^π using
    linear algebra (exact)
    simplified value iteration (exact and faster?)
    modified PI (a few updates, so inexact)

SLIDE 5

Model-based Reinforcement Learning

■ ADP
■ Sweeping
■ Policy Iteration
■ Bandits
■ Break


SLIDE 6

Adaptive Dynamic Programming


‘model-based’; active vs passive. Learn T and R as we go, calculating π using MDP methods (eg, VI or PI):

until max-update ≤ (loss bound) (1 − γ)² / (2γ²):
    for each state s:
        U(s) ← R(s) + γ max_a Σ_{s′} T(s, a, s′) U(s′)

π(s) = argmax_a Σ_{s′} T(s, a, s′) U(s′)
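A minimal Python sketch of this loop, assuming a tabular maximum-likelihood model; the counter names N_sa and N_sas, the loss_bound parameter, and crediting the observed reward r to the current state are all illustrative choices:

    from collections import defaultdict

    class ADPAgent:
        """Model-based RL sketch: estimate T and R from experience,
        then re-solve the learned MDP with value iteration."""
        def __init__(self, S, A, gamma, loss_bound=0.01):
            self.S, self.A, self.gamma = S, A, gamma
            # stop when max-update <= (loss bound)(1-gamma)^2 / (2 gamma^2)
            self.eps = loss_bound * (1 - gamma) ** 2 / (2 * gamma ** 2)
            self.N_sa = defaultdict(int)     # counts of (s, a)
            self.N_sas = defaultdict(int)    # counts of (s, a, s')
            self.R = defaultdict(float)      # observed rewards
            self.U = {s: 0.0 for s in S}

        def T(self, s, a, s2):               # maximum-likelihood estimate
            n = self.N_sa[(s, a)]
            return self.N_sas[(s, a, s2)] / n if n else 0.0

        def q(self, s, a):                   # sum_s' T(s,a,s') U(s')
            return sum(self.T(s, a, s2) * self.U[s2] for s2 in self.S)

        def observe(self, s, a, s2, r):      # learn T and R as we go
            self.N_sa[(s, a)] += 1
            self.N_sas[(s, a, s2)] += 1
            self.R[s] = r
            self.solve()                     # recompute U (and hence pi)

        def solve(self):                     # value iteration on the model
            while True:
                delta = 0.0
                for s in self.S:
                    u = self.R[s] + self.gamma * max(
                        self.q(s, a) for a in self.A)
                    delta = max(delta, abs(u - self.U[s]))
                    self.U[s] = u
                if delta <= self.eps:        # until max-update <= bound
                    break

        def pi(self, s):                     # greedy one-step look-ahead
            return max(self.A, key=lambda a: self.q(s, a))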

SLIDE 7

Prioritized Sweeping


given an experience (s, a, s′, r):
    update the model
    update s
    repeat k times: do the highest-priority update

to update state s with change δ in U(s):
    update U(s)
    priority of s ← 0
    for each predecessor s′ of s:
        priority of s′ ← max of its current priority and max_a δ · T̂(s′, a, s)
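A Python sketch of the sweeping step, assuming the model has already been learned; T_hat[(s, a)] as a {successor: probability} dict and the preds predecessor lists are illustrative names:

    import heapq

    def prioritized_sweep(U, R, A, T_hat, preds, gamma, s0, k=5):
        """After state s0 changes, do the k highest-priority backups,
        pushing each change out to the predecessors of the updated
        state."""
        pq, priority = [], {}

        def backup(s):                       # returns the change delta in U(s)
            old = U[s]
            U[s] = R[s] + gamma * max(
                sum(p * U[s2] for s2, p in T_hat[(s, a)].items())
                for a in A)
            return abs(U[s] - old)

        def push(s, pri):
            if pri > priority.get(s, 0.0):   # keep the max priority seen
                priority[s] = pri
                heapq.heappush(pq, (-pri, s))  # max-heap via negation

        push(s0, float('inf'))               # the freshly changed state
        for _ in range(k):                   # repeat k times
            if not pq:
                break
            _, s = heapq.heappop(pq)         # highest-priority update
            priority[s] = 0.0                # priority of s <- 0
            delta = backup(s)
            for sp in preds[s]:              # propagate to predecessors
                push(sp, delta * max(T_hat[(sp, a)].get(s, 0.0)
                                     for a in A))
        return U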

SLIDE 8

Policy Iteration


repeat until π doesn’t change:
    given π, compute U^π(s) for all states
    given U, calculate the policy by one-step look-ahead

If π doesn’t change, U doesn’t either: we are at an equilibrium (= optimal π)!
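In Python, with the evaluation step done by repeated simplified backups rather than an exact linear-algebra solve (so this is really modified PI); T[(s, a)] as a {successor: probability} dict and the eval_iters parameter are assumptions:

    def policy_iteration(S, A, T, R, gamma, eval_iters=50):
        """Alternate policy evaluation and greedy policy improvement
        until the policy is stable."""
        def expect(U, s, a):                 # sum_s' T(s,a,s') U(s')
            return sum(p * U[s2] for s2, p in T[(s, a)].items())

        pi = {s: A[0] for s in S}            # arbitrary initial policy
        while True:
            # given pi, compute U^pi(s) for all states (approximately)
            U = {s: 0.0 for s in S}
            for _ in range(eval_iters):
                for s in S:
                    U[s] = R[s] + gamma * expect(U, s, pi[s])
            # given U, recalculate the policy by one-step look-ahead
            new_pi = {s: max(A, key=lambda a: expect(U, s, a)) for s in S}
            if new_pi == pi:                 # equilibrium: optimal policy
                return pi, U
            pi = new_pi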


SLIDE 10

Exploration vs Exploitation


problem: greedy action selection can get stuck exploiting a suboptimal policy (local minima). Fix: optimistic utilities U⁺ with an exploration function f:

U⁺(s) ← R(s) + γ max_a f( Σ_{s′} T(s, a, s′) U⁺(s′), N(a, s) )

where f(u, n) = R_max if n < k, and u otherwise
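A sketch of the optimistic backup; the N[(a, s)] visit counts and the k/R_max values are illustrative, since the slide leaves them as parameters:

    def exploratory_backup(s, U_plus, N, A, T, R, gamma, k=5, R_max=1.0):
        """Backup with the exploration function f: an action tried
        fewer than k times in s looks as good as R_max, so the greedy
        policy is driven to try it."""
        def f(u, n):
            return R_max if n < k else u     # f(u, n) from the slide

        U_plus[s] = R[s] + gamma * max(
            f(sum(p * U_plus[s2] for s2, p in T[(s, a)].items()),
              N[(a, s)])
            for a in A)
        return U_plus[s]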
SLIDE 11

Break


asst 4

final papers: writing-intensive

SLIDE 12

Model-free Reinforcement Learning

■ Q-Learning
■ Summary
■ EOLQs


SLIDE 18

Q-Learning


U(s) = R(s) + γ max_a Σ_{s′} T(s, a, s′) U(s′)

Q(s, a) = γ Σ_{s′} T(s, a, s′) (R(s′) + max_{a′} Q(s′, a′))

Given experience (s, a, s′, r):

Q(s, a) ← Q(s, a) + α(error)
Q(s, a) ← Q(s, a) + α(sensed − predicted)
Q(s, a) ← Q(s, a) + α(γ(r + max_{a′} Q(s′, a′)) − Q(s, a))

α ≈ 1/N? policy: choose a random action with probability 1/N?
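Putting the pieces together in a Python sketch; the class shape and the visit-count tables are illustrative, and the update deliberately follows the slide's convention of discounting r together with the next state's value:

    import random

    class QLearner:
        """Model-free RL sketch: learn Q(s, a) directly from
        (s, a, s', r) experiences, with no model of T or R."""
        def __init__(self, A, gamma):
            self.A, self.gamma = A, gamma
            self.Q = {}                      # Q(s, a), default 0
            self.N = {}                      # visit counts per state

        def update(self, s, a, s2, r):
            # Q(s,a) <- Q(s,a) + alpha * (sensed - predicted)
            alpha = 1.0 / self.N.get(s, 1)   # alpha ~ 1/N
            predicted = self.Q.get((s, a), 0.0)
            sensed = self.gamma * (r + max(self.Q.get((s2, a2), 0.0)
                                           for a2 in self.A))
            self.Q[(s, a)] = predicted + alpha * (sensed - predicted)

        def act(self, s):
            self.N[s] = self.N.get(s, 0) + 1
            if random.random() < 1.0 / self.N[s]:  # random w.p. 1/N
                return random.choice(self.A)
            return max(self.A, key=lambda a: self.Q.get((s, a), 0.0))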

SLIDE 19

Summary


Model known (solving the MDP):
■ value iteration
■ policy iteration: compute U^π using
    linear algebra
    simplified value iteration
    a few updates (modified PI)

Model unknown (RL):
■ ADP using
    value iteration
    a few updates (eg, prioritized sweeping)
■ Q-learning

SLIDE 20

EOLQs


What question didn’t you get to ask today?

What’s still confusing?

What would you like to hear more about?

Please write down your most pressing question about AI and put it in the box on your way out. Thanks!