13. Reinforcement Learning
Lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997
SLIDE 1
13. Reinforcement Learning [Read Chapter 13] [Exercises 13.1, 13.2, 13.4]
  • Control learning
  • Control policies that choose optimal actions
  • Q learning
  • Convergence
SLIDE 2
Control Learning
Consider learning to choose actions, e.g.,
  • Robot learning to dock on battery charger
  • Learning to choose actions to optimize factory output
  • Learning to play Backgammon
Note several problem characteristics:
  • Delayed reward
  • Opportunity for active exploration
  • Possibility that state only partially observable
  • Possible need to learn multiple tasks with same sensors/effectors
SLIDE 3
One Example: TD-Gammon [Tesauro, 1995]
Learn to play Backgammon
Immediate reward
  • +100 if win
  • -100 if lose
  • 0 for all other states
Trained by playing 1.5 million games against itself
Now approximately equal to best human player
SLIDE 4
Reinforcement Learning Problem

[Figure: agent-environment interaction loop. At each step the agent observes a state, performs an action, and receives a reward from the environment, producing the trajectory s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...]

Goal: learn to choose actions that maximize

  r_0 + γ r_1 + γ^2 r_2 + ...,  where 0 ≤ γ < 1
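To ground the goal formula numerically, here is a minimal Python sketch (my illustration, not from the slides) of the discounted return being maximized, assuming γ = 0.9 and a made-up reward sequence:

```python
# Minimal sketch: the discounted return r_0 + g*r_1 + g^2*r_2 + ...
# for a finite reward sequence (illustrative, not from the slides).
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# Example: three zero-reward steps, then a goal worth 100.
print(discounted_return([0, 0, 0, 100]))  # 0.9**3 * 100 = 72.9
```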
SLIDE 5
Markov Decision Processes
Assume
  • finite set of states S
  • set of actions A
  • at each discrete time agent observes state s_t ∈ S and chooses action a_t ∈ A
  • then receives immediate reward r_t
  • and state changes to s_{t+1}
  • Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
    – i.e., r_t and s_{t+1} depend only on current state and action
    – functions δ and r may be nondeterministic
    – functions δ and r not necessarily known to agent
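To make the formalism concrete, here is a minimal sketch (my own illustration, not from the slides) of a deterministic MDP as plain Python data; the particular states, actions, and rewards are hypothetical:

```python
# A tiny deterministic MDP (hypothetical example). delta maps
# (state, action) -> next state; r maps (state, action) -> reward.
S = ["s1", "s2", "G"]
A = ["left", "right"]

delta = {
    ("s1", "right"): "s2", ("s1", "left"): "s1",
    ("s2", "right"): "G",  ("s2", "left"): "s1",
    ("G",  "right"): "G",  ("G",  "left"): "G",   # G is absorbing
}
r = {(s, a): 0 for s in S for a in A}
r[("s2", "right")] = 100  # entering the goal pays 100
```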
SLIDE 6
Agent's Learning Task
Execute actions in environment, observe results, and
  • learn action policy π : S → A that maximizes
      E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ...]
    from any starting state in S
  • here 0 ≤ γ < 1 is the discount factor for future rewards
Note something new:
  • Target function is π : S → A
  • but we have no training examples of form ⟨s, a⟩
  • training examples are of form ⟨⟨s, a⟩, r⟩
SLIDE 7
Value Function
To begin, consider deterministic worlds...
For each possible policy π the agent might adopt, we can define an evaluation function over states

  V^π(s) ≡ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... ≡ Σ_{i=0}^∞ γ^i r_{t+i}

where r_t, r_{t+1}, ... are generated by following policy π starting at state s
Restated, the task is to learn the optimal policy π*

  π* ≡ argmax_π V^π(s), (∀s)
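In a deterministic world this definition can be evaluated by simply rolling the policy out and discounting. A sketch (my illustration, reusing the hypothetical delta and r dicts from the snippet after Slide 5); the horizon cutoff approximates the infinite sum:

```python
# Approximate V^pi(s) by following policy pi from s and summing
# discounted rewards over a truncated horizon.
def value(pi, s, delta, r, gamma=0.9, horizon=100):
    total = 0.0
    for i in range(horizon):
        a = pi[s]
        total += gamma ** i * r[(s, a)]
        s = delta[(s, a)]
    return total

pi = {"s1": "right", "s2": "right", "G": "right"}  # always move right
print(value(pi, "s1", delta, r))  # 0 + 0.9 * 100 = 90.0
```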
SLIDE 8

[Figure: a simple deterministic grid world with absorbing goal state G, shown four ways: the r(s, a) immediate reward values (100 for actions entering G, 0 elsewhere); the Q(s, a) values (100, 90, 81, 72, ...); the V*(s) values (100, 90, 81); and one optimal policy.]
SLIDE 9
What to Learn
We might try to have agent learn the evaluation function V^{π*} (which we write as V*)
It could then do a lookahead search to choose best action from any state s because

  π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]

A problem:
  • This works well if agent knows δ : S × A → S, and r : S × A → ℝ
  • But when it doesn't, it can't choose actions this way
SLIDE 10
Q Function
Define new function very similar to V*

  Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

If agent learns Q, it can choose optimal action even without knowing δ!

  π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
  π*(s) = argmax_a Q(s, a)

Q is the evaluation function the agent will learn
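The point of the slide in code form, a sketch assuming Q is stored as a dict keyed by (state, action) as in the earlier snippets: acting optimally reduces to an argmax over the learned table, with no reference to δ or r:

```python
# Greedy action selection from a learned Q table: no model needed.
def pi_star(s, Q, actions):
    return max(actions, key=lambda a: Q[(s, a)])
```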
SLIDE 11
Training Rule to Learn Q
Note Q and V* closely related:

  V*(s) = max_{a'} Q(s, a')

Which allows us to write Q recursively as

  Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
              = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

Nice! Let Q̂ denote the learner's current approximation to Q. Consider training rule

  Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

where s' is the state resulting from applying action a in state s
SLIDE 12
Q Learning for Deterministic Worlds
For each s, a initialize table entry Q̂(s, a) ← 0
Observe current state s
Do forever:
  • Select an action a and execute it
  • Receive immediate reward r
  • Observe the new state s'
  • Update the table entry for Q̂(s, a) as follows:
      Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
  • s ← s'
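A runnable sketch of this loop (my illustration; the slides give only the pseudocode above), using the toy MDP defined after Slide 5. Actions are chosen uniformly at random, which is enough exploration for this tiny deterministic world, and episodes restart from a random state so every ⟨s, a⟩ keeps being visited:

```python
import random

# Tabular Q-learning for the deterministic toy MDP defined earlier.
def q_learning(S, A, delta, r, gamma=0.9, steps=5000):
    Q = {(s, a): 0.0 for s in S for a in A}   # initialize Q̂(s, a) <- 0
    s = random.choice(S)
    for _ in range(steps):
        a = random.choice(A)                  # explore: random action
        reward, s_next = r[(s, a)], delta[(s, a)]
        Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in A)
        s = s_next
        if s == "G":                          # restart after absorbing goal
            s = random.choice(S)
    return Q

Q = q_learning(S, A, delta, r)
# Converges to Q[("s2","right")] == 100.0 and Q[("s1","right")] == 90.0
```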
SLIDE 13
Updating Q̂

[Figure: one grid-world step. The agent takes action a_right from initial state s1 to next state s2; the actions available in s2 carry Q̂ values 63, 81, and 100, and the Q̂ entry for (s1, a_right), initially 72, is refined to 90.]

  Q̂(s_1, a_right) ← r + γ max_{a'} Q̂(s_2, a')
                   ← 0 + 0.9 max{63, 81, 100}
                   ← 90

Notice if rewards non-negative, then

  (∀s, a, n)  Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)

and

  (∀s, a, n)  0 ≤ Q̂_n(s, a) ≤ Q(s, a)
SLIDE 14
Q̂ converges to Q
Consider case of deterministic world where each ⟨s, a⟩ is visited infinitely often.
Proof: Define a full interval to be an interval during which each ⟨s, a⟩ is visited. During each full interval the largest error in the Q̂ table is reduced by factor of γ.
Let Q̂_n be the table after n updates, and Δ_n be the maximum error in Q̂_n; that is

  Δ_n = max_{s,a} |Q̂_n(s, a) - Q(s, a)|

For any table entry Q̂_n(s, a) updated on iteration n + 1, the error in the revised estimate Q̂_{n+1}(s, a) is

  |Q̂_{n+1}(s, a) - Q(s, a)| = |(r + γ max_{a'} Q̂_n(s', a')) - (r + γ max_{a'} Q(s', a'))|
                             = γ |max_{a'} Q̂_n(s', a') - max_{a'} Q(s', a')|
                             ≤ γ max_{a'} |Q̂_n(s', a') - Q(s', a')|
                             ≤ γ max_{s'', a'} |Q̂_n(s'', a') - Q(s'', a')|

  |Q̂_{n+1}(s, a) - Q(s, a)| ≤ γ Δ_n
SLIDE 15
Note we used general fact that

  |max_a f_1(a) - max_a f_2(a)| ≤ max_a |f_1(a) - f_2(a)|
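The slides state this fact without proof; a short argument (my addition) in LaTeX:

```latex
% Proof of |max_a f_1(a) - max_a f_2(a)| <= max_a |f_1(a) - f_2(a)|.
\begin{align*}
\max_a f_1(a)
  &= \max_a \big( f_2(a) + (f_1(a) - f_2(a)) \big) \\
  &\le \max_a f_2(a) + \max_a \big( f_1(a) - f_2(a) \big) \\
  &\le \max_a f_2(a) + \max_a \lvert f_1(a) - f_2(a) \rvert ,
\end{align*}
% so max_a f_1(a) - max_a f_2(a) <= max_a |f_1(a) - f_2(a)|;
% swapping f_1 and f_2 gives the other direction, hence the absolute value.
```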
SLIDE 16
Nondeterministic Case
What if reward and next state are non-deterministic?
We redefine V, Q by taking expected values

  V^π(s) ≡ E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ...] ≡ E[Σ_{i=0}^∞ γ^i r_{t+i}]

  Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]
SLIDE 17
Nondeterministic Case
Q learning generalizes to nondeterministic worlds
Alter training rule to

  Q̂_n(s, a) ← (1 - α_n) Q̂_{n-1}(s, a) + α_n [r + γ max_{a'} Q̂_{n-1}(s', a')]

where

  α_n = 1 / (1 + visits_n(s, a))

Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992]
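A sketch of just the altered update (my illustration, reusing the dict-based Q table from the earlier snippets); visits counts how often each ⟨s, a⟩ has been updated, giving the decaying learning rate α_n from the slide:

```python
from collections import defaultdict

visits = defaultdict(int)  # update counts per (state, action)

# One nondeterministic Q-learning update with decaying learning rate.
def q_update(Q, s, a, reward, s_next, actions, gamma=0.9):
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```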
SLIDE 18
Temporal Difference Learning
Q learning: reduce discrepancy between successive Q estimates
One-step time difference:

  Q^(1)(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)

Why not two steps?

  Q^(2)(s_t, a_t) ≡ r_t + γ r_{t+1} + γ^2 max_a Q̂(s_{t+2}, a)

Or n?

  Q^(n)(s_t, a_t) ≡ r_t + γ r_{t+1} + ... + γ^(n-1) r_{t+n-1} + γ^n max_a Q̂(s_{t+n}, a)

Blend all of these:

  Q^λ(s_t, a_t) ≡ (1 - λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ^2 Q^(3)(s_t, a_t) + ... ]
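To see how the blend works numerically, a small sketch (illustrative; the names are mine) computing Q^(n) and a truncated Q^λ from a recorded trajectory. Here rewards holds r_t, ..., r_{t+N-1} and tail[k] stands for max_a Q̂(s_{t+k}, a), so tail needs N + 1 entries; truncating at the trajectory length approximates the infinite blend:

```python
# n-step estimate: n discounted rewards plus a bootstrapped tail.
def q_n(rewards, tail, n, gamma=0.9):
    ret = sum(gamma ** i * rewards[i] for i in range(n))
    return ret + gamma ** n * tail[n]

# Truncated lambda-return: geometrically weighted blend of Q^(1)..Q^(N).
def q_lambda(rewards, tail, lam=0.5, gamma=0.9):
    N = len(rewards)
    return (1 - lam) * sum(lam ** (n - 1) * q_n(rewards, tail, n, gamma)
                           for n in range(1, N + 1))
```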
SLIDE 19
Temporal Difference Learning

  Q^λ(s_t, a_t) ≡ (1 - λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ^2 Q^(3)(s_t, a_t) + ... ]

Equivalent expression:

  Q^λ(s_t, a_t) = r_t + γ [ (1 - λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1}) ]

TD(λ) algorithm uses above training rule
  • Sometimes converges faster than Q learning
  • converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992)
  • Tesauro's TD-Gammon uses this algorithm
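The equivalence is stated without derivation; a short check (my addition) in LaTeX, using the recursion Q^(n+1)(s_t, a_t) = r_t + γ Q^(n)(s_{t+1}, a_{t+1}) that follows directly from the definition of the n-step estimates:

```latex
% Derivation of the recursive form of Q^\lambda (added for completeness).
\begin{align*}
Q^{\lambda}(s_t,a_t)
 &= (1-\lambda)\sum_{n \ge 1} \lambda^{\,n-1} Q^{(n)}(s_t,a_t) \\
 &= (1-\lambda)\,Q^{(1)}(s_t,a_t)
    + (1-\lambda)\sum_{n \ge 1} \lambda^{\,n} Q^{(n+1)}(s_t,a_t) \\
 &= (1-\lambda)\big(r_t + \gamma \max_a \hat{Q}(s_{t+1},a)\big)
    + \lambda (1-\lambda) \sum_{n \ge 1} \lambda^{\,n-1}
      \big(r_t + \gamma\, Q^{(n)}(s_{t+1},a_{t+1})\big) \\
 &= r_t + \gamma \Big[ (1-\lambda)\max_a \hat{Q}(s_{t+1},a)
    + \lambda\, Q^{\lambda}(s_{t+1},a_{t+1}) \Big].
\end{align*}
```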
SLIDE 20
Subtleties and Ongoing Research
  • Replace Q̂ table with neural net or other generalizer
  • Handle case where state only partially observable
  • Design optimal exploration strategies
  • Extend to continuous action, state
  • Learn and use δ̂ : S × A → S
  • Relationship to dynamic programming