Introduction to Reinforcement Learning
Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison
[Based on slides from David Page, Mark Craven]
Goals for the lecture
you should understand the following concepts:
– the reinforcement learning task
– Markov decision processes
– value functions and value iteration
– Q functions and Q learning
– exploration vs. exploitation
Reinforcement learning (RL)
Task of an agent embedded in an environment:
repeat forever
1) sense world
2) reason
3) choose an action to perform
4) get feedback (usually reward = 0)
5) learn
The environment may be the physical world or an artificial one.
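A minimal sketch of this loop in Python; the LineWorld environment and RandomAgent below are hypothetical stand-ins invented for the example, not from the lecture:

import random

# Hypothetical toy environment: states 0..3 on a line,
# with reward only for reaching state 3.
class LineWorld:
    def __init__(self):
        self.state = 0

    def step(self, action):                     # action is -1 (left) or +1 (right)
        self.state = max(0, min(3, self.state + action))
        reward = 1 if self.state == 3 else 0    # feedback is usually reward = 0
        return reward, self.state

class RandomAgent:
    def choose_action(self, state):
        return random.choice([-1, +1])          # 2) reason, 3) choose an action

    def learn(self, state, action, reward, next_state):
        pass                                    # 5) a real agent updates estimates here

env, agent = LineWorld(), RandomAgent()
state = env.state                               # 1) sense world
for _ in range(20):                             # "repeat forever", truncated for the demo
    action = agent.choose_action(state)
    reward, next_state = env.step(action)       # 4) get feedback
    agent.learn(state, action, reward, next_state)
    state = next_state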
Example: backgammon (Tesauro's TD-Gammon)
– states: board configurations (30 pieces, 24 locations)
– actions: roll dice (e.g. 2, 5), then move one piece 2 and move one piece 5
– rewards: win, lose
TD-Gammon:
– trained against itself (300,000 games)
– as good as the best previous backgammon computer program (also by Tesauro)
– beat human champion
Example: Go
– states: board configurations (19x19 locations)
– actions: put one stone on some empty location
– rewards: win, lose
AlphaGo:
– beat Lee Sedol by 4-1
– shows performance superior to humans
– a key technique: reinforcement learning
Reinforcement learning as a Markov decision process (MDP)
[Figure: the agent-environment loop; the agent sends an action to the environment and receives back a state and a reward]
The interaction generates a trajectory s0, a0, r0, s1, a1, r1, s2, a2, r2, ...
At each step t, the agent observes state st ∈ S, then chooses action at ∈ A; it then receives reward rt and the environment moves to state st+1.
Goal: learn a policy π : S → A for choosing actions that maximizes the expected discounted return (defined below) for every possible starting state s0.
Markov assumption:
P(st+1 | st, at, st-1, at-1, ...) = P(st+1 | st, at)
The reward is similarly Markovian:
P(rt | st, at, st-1, at-1, ...) = P(rt | st, at)
The return to be maximized is the expected discounted sum of rewards
E[rt + γ rt+1 + γ^2 rt+2 + ...]
where 0 ≤ γ < 1 is the discount factor.
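For a concrete feel for discounting, a small computation (the reward sequence here is illustrative, not from the slides):

# Discounted return rt + γ rt+1 + γ^2 rt+2 + ... for one sample sequence.
gamma = 0.9
rewards = [0, 0, 100, 0, 0]                       # illustrative rewards
ret = sum(gamma**t * r for t, r in enumerate(rewards))
print(ret)                                        # 0.9^2 * 100 = 81.0 (up to rounding)

A reward that arrives later is worth less: here the single reward of 100, arriving two steps in the future, contributes only 81 to the return.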
Goal: find an optimal policy π* that maximizes the value obtained from every state s ∈ S.
Example: grid world
[Figure: grid world with absorbing goal state G; each arrow represents an action a and the associated number represents the deterministic reward r(s, a); the actions entering G are labeled 100, all other actions 0]
Value function
For a policy π, define its value function
Vπ(s) = E[ Σt=0..∞ γ^t rt ]
assuming the action sequence is chosen according to π starting at state s.
The optimal policy maximizes this value from every state; we'll denote the value function for this optimal policy as V*(s).
[Figure: the grid world under one particular policy π; Vπ(s) values are shown in red: 100, 90, 81, 73, 66]
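These numbers are consistent with γ = 0.9 and a reward of 100 for entering G: each extra step to the goal multiplies the return by γ, giving 100, then 0.9 × 100 = 90, 0.9 × 90 = 81, 0.9 × 81 ≈ 73, and 0.9 × 73 ≈ 66.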
[Figure: the same grid world; V*(s) values are shown in red: 100, 90, 100, 90, 81]
Value iteration
Compute the optimal value function iteratively via the update
V*t+1(s) ← maxa∈A [ r(s, a) + γ Σs'∈S P(s' | s, a) V*t(s') ]
initialize V(s) arbitrarily
loop until policy good enough {
  loop for s ∈ S {
    loop for a ∈ A {
      Q(s, a) ← r(s, a) + γ Σs'∈S P(s' | s, a) V(s')
    }
    V(s) ← maxa Q(s, a)
  }
}
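A Python sketch of this pseudocode for a small tabular MDP; the dict-based layout of P and R is an illustrative choice, not from the slides:

# P[s][a] is a list of (prob, next_state) pairs; R[s][a] is r(s, a).
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}           # initialize V(s) arbitrarily
    while True:                            # loop until policy good enough
        delta = 0.0
        for s in states:                   # loop for s in S
            q = {a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                 for a in actions}         # loop for a in A: Q(s, a)
            new_v = max(q.values())        # V(s) <- max_a Q(s, a)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:                    # "good enough": values stopped changing
            return V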
This procedure still works if we traverse the environment randomly instead of looping through each state and action methodically, but we must visit each state infinitely often. This matters for an agent that learns while wandering around its environment.
Q function
Define a new function Q, closely related to V*:
Q(s, a) = r(s, a) + γ Es'|s,a[ V*(s') ] = r(s, a) + γ Σs' P(s' | s, a) V*(s')
from which
V*(s) = maxa Q(s, a) and π*(s) = argmaxa Q(s, a)
If the agent knows Q(s, a), it can choose the optimal action without knowing P(s' | s, a), and it can learn Q(s, a) without knowing P(s' | s, a).
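To make the "no model needed" point concrete, a tiny sketch assuming Q is stored as a dict keyed by (state, action) (an illustrative layout):

# Given Q(s, a), acting optimally needs no transition model P(s' | s, a):
def state_value(Q, s, actions):
    return max(Q[(s, a)] for a in actions)            # V*(s) = max_a Q(s, a)

def greedy_action(Q, s, actions):
    return max(actions, key=lambda a: Q[(s, a)])      # pi*(s) = argmax_a Q(s, a)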
[Figure: grid world with goal G showing the r(s, a) (immediate reward) values: 100 on the actions entering G, 0 elsewhere]
[Figure: the same grid showing the Q(s, a) values: 100, 90, 81, 72, ...]
[Figure: the same grid showing the V*(s) values: 100, 90, 81]
Q learning for deterministic worlds
for each s, a: initialize table entry Q̂(s, a) ← 0
observe current state s
do forever {
  select an action a and execute it
  receive immediate reward r
  observe the new state s'
  update table entry: Q̂(s, a) ← r + γ maxa' Q̂(s', a')
  s ← s'
}
Example of one update, with γ = 0.9:
Q̂(s1, aright) ← r + γ maxa' Q̂(s2, a')
             = 0 + 0.9 × max{63, 81, 100}
             = 90
[Figure: the grid before and after the update; the entry Q̂(s1, aright) changes from 72 to 90]
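A Python sketch of the full loop, assuming a deterministic environment with a reset/step interface and ε-greedy action selection; both of those are assumptions made for this example, while the table update itself is the rule above:

import random
from collections import defaultdict

def q_learning(env, actions, gamma=0.9, steps=10000, eps=0.1):
    Q = defaultdict(float)                    # for each s, a: initialize Q̂(s, a) to 0
    s = env.reset()                           # observe current state s
    for _ in range(steps):                    # "do forever", truncated
        if random.random() < eps:             # select an action a ...
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: Q[(s, x)])
        r, s2 = env.step(a)                   # ... execute it; receive reward r
                                              # and observe the new state s'
        # Q̂(s, a) <- r + γ max_a' Q̂(s', a')
        Q[(s, a)] = r + gamma * max(Q[(s2, x)] for x in actions)
        s = s2                                # s <- s'
    return Q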
How does learning Q differ from learning V?
Using V to choose actions:
– need to have a 'next state' function to generate all possible next states
– choose the action leading to the next state with the highest V value
Using Q to choose actions:
– need only know which actions are legal
– generally choose the action with the highest Q value
Exploration vs. exploitation
During learning, the agent must balance exploration (trying actions to improve its Q̂ estimates) against exploitation (following the current policy to collect reward).
One common approach is to choose actions probabilistically:
P(ai | s) = c^Q̂(s, ai) / Σj c^Q̂(s, aj)
where c > 0 is a constant that determines how strongly selection favors actions with higher Q̂ values.
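A sketch of this selection rule in Python, with Q̂ again stored as a dict keyed by (state, action) (an illustrative layout):

import random

# Choose action ai with probability c**Q(s, ai) / sum_j c**Q(s, aj).
def softmax_action(Q, s, actions, c=2.0):
    weights = [c ** Q[(s, a)] for a in actions]   # larger c favors high-Q̂ actions more
    return random.choices(actions, weights=weights, k=1)[0]

With c close to 1 the choice is nearly uniform (mostly exploration); with large c it is nearly greedy (mostly exploitation). For large Q̂ values c ** Q can overflow, and a standard fix is to subtract the maximum Q̂ value before exponentiating.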