ARTIFICIAL INTELLIGENCE
Lecturer: Silja Renooij
Reinforcement learning
Utrecht University The Netherlands
These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html
– World’s best player of simulated soccer, 1999; Runner‐up 2000
– 10‐15% improvement over industry standard methods
– World's best assigner of radio channels to mobile telephone calls
– (Probably) world's best down‐peak elevator controller
– navigation, bi‐pedal walking, grasping, switching between skills...
– World's best backgammon & Go players
(Alpha Go: https://www.youtube.com/watch?v=SUbqykXVx0A)
[Diagram: a learning system receives input x from the environment together with training info (feedback), and produces output based on h(x)]
– Receive feedback in the form of rewards
– The agent's return in the long run is defined by the reward function
– Must (learn to) act so as to maximize expected return
– All learning is based on observed samples of outcomes!
[Diagram: the agent sends actions a to the environment and receives back a state s and a reward r]
[Diagram: agent-environment interaction loop: at time t the agent observes state s_t and reward r_t and chooses action a_t; the environment responds with reward r_t+1 and next state s_t+1; this repeats at times t+1, t+2, t+3, ...]
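In code, this interaction loop can be sketched as follows (a minimal illustration; the env and agent objects and their reset/step/act/observe methods are hypothetical placeholders, not something defined in these slides):

def run_episode(env, agent, max_steps=100):
    s = env.reset()                     # observe the initial state s_t
    total_return = 0.0
    for t in range(max_steps):
        a = agent.act(s)                # agent picks action a_t in state s_t
        s_next, r, done = env.step(a)   # environment returns reward r_t+1 and state s_t+1
        agent.observe(s, a, r, s_next)  # learning is based only on such observed samples
        total_return += r
        s = s_next
        if done:
            break
    return total_return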
– I.e. we don't know which states are good or what the actions do
– Must actually try out actions and states to learn
At each step, the robot has to decide whether to:
– (1) actively search for a can,
– (2) wait for someone to bring it a can, or
– (3) go to home base and recharge.
S = {high, low}
A(high) = {search, wait}
A(low) = {search, wait, recharge}
Rsearch = expected no. of cans while searching
Rwait = expected no. of cans while waiting
[Transition graph: (high, search): with probability α stay in high, with probability 1-α drop to low, reward Rsearch in both cases; (high, wait): stay in high, reward Rwait; (low, search): with probability β stay in low, reward Rsearch, with probability 1-β the battery runs out and the robot is rescued and recharged to high, reward -3; (low, wait): stay in low, reward Rwait; (low, recharge): go to high, reward 0]
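As a concrete illustration, the sketch below encodes this MDP in Python and runs a few sweeps of value iteration on it; the numeric values chosen for α, β, Rsearch, Rwait and the discount are illustrative assumptions, not taken from the slides:

# Recycling-robot MDP, encoded as {(state, action): [(prob, next_state, reward), ...]}.
# alpha, beta, R_SEARCH, R_WAIT and gamma are illustrative choices (assumptions).
alpha, beta = 0.9, 0.6
R_SEARCH, R_WAIT = 2.0, 1.0
gamma = 0.9

T = {
    ('high', 'search'):  [(alpha, 'high', R_SEARCH), (1 - alpha, 'low', R_SEARCH)],
    ('high', 'wait'):    [(1.0, 'high', R_WAIT)],
    ('low', 'search'):   [(beta, 'low', R_SEARCH), (1 - beta, 'high', -3.0)],
    ('low', 'wait'):     [(1.0, 'low', R_WAIT)],
    ('low', 'recharge'): [(1.0, 'high', 0.0)],
}
actions = {'high': ['search', 'wait'], 'low': ['search', 'wait', 'recharge']}

# Value iteration: V(s) <- max_a sum_s' T(s,a,s') * (R(s,a,s') + gamma * V(s'))
V = {'high': 0.0, 'low': 0.0}
for _ in range(100):
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in T[(s, a)])
                for a in actions[s])
         for s in actions}
print(V)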
MDP (model known), offline:
  Goal: compute V*, π*      Technique: value iteration / policy iteration (VI/PI)
Reinforcement learning (model unknown), online:
  Goal: compute Vπ          Technique: direct evaluation, TD-learning
  Goal: compute Q*, π*      Technique: Q-learning
Observed return from time t onward: the sum over t' = t, ..., T of γ^(t'-t) r_t', with T the final time step of the episode; direct evaluation estimates Vπ(s) by averaging these returns over the visits to s.
Assume: γ = 1
Episode 1: B, east, -1, C;  C, east, -1, D;  D, exit, +10
Episode 2: B, east, -1, C;  C, east, -1, D;  D, exit, +10
Episode 3: E, north, -1, C; C, east, -1, A;  A, exit, -10
Episode 4: E, north, -1, C; C, east, -1, D;  D, exit, +10
(each step lists: state, action, reward, next state)
Assume: γ = 1
Episode 1: B, east, -1, C;  C, east, -1, D;  D, exit, +10
Episode 2: B, east, -1, C;  C, east, -1, D;  D, exit, +10
Episode 3: E, north, -1, C; C, east, -1, A;  A, exit, -10
Episode 4: E, north, -1, C; C, east, -1, D;  D, exit, +10
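A sketch of direct evaluation on these four episodes (γ = 1): for every visit to a state, record the sum of rewards from that point to the end of the episode, then average per state; the Python encoding below is my own, but the numbers come from the episodes above:

# Each episode is a list of (state, action, reward, next_state); 'exit' ends the episode.
episodes = [
    [('B', 'east', -1, 'C'), ('C', 'east', -1, 'D'), ('D', 'exit', +10, None)],
    [('B', 'east', -1, 'C'), ('C', 'east', -1, 'D'), ('D', 'exit', +10, None)],
    [('E', 'north', -1, 'C'), ('C', 'east', -1, 'A'), ('A', 'exit', -10, None)],
    [('E', 'north', -1, 'C'), ('C', 'east', -1, 'D'), ('D', 'exit', +10, None)],
]
gamma = 1.0

returns = {}                                   # state -> list of observed returns
for episode in episodes:
    G = 0.0
    for s, a, r, s_next in reversed(episode):  # walk backwards, accumulating the return G
        G = r + gamma * G
        returns.setdefault(s, []).append(G)

V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(V)   # V(B) = 8, V(C) = 4, V(D) = 10, V(E) = -2, V(A) = -10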
[The twelve observed transitions are labeled t1-1 ... t4-3, i.e. transition j of episode i]
Direct evaluation, pros:
– easy to understand
– doesn't require any knowledge of T, R
– eventually computes the correct average values, using just sample transitions
Cons:
– wastes information about state connections
– each state must be learned separately, so it takes a long time to learn
If B and E both go to C under this policy, how can their values be different?
Assume: γ = 1, α = 1/2
Assume: γ = 1, α = 1/2
Assume: γ = 1, α = 1/2
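A minimal sketch of the TD(0) value update, using the γ = 1 and α = 1/2 assumed above; the two example transitions at the bottom are hypothetical, just to show the mechanics:

gamma, alpha = 1.0, 0.5          # the assumptions used on these slides

def td_update(V, s, r, s_next):
    # TD(0): V(s) <- (1 - alpha) * V(s) + alpha * (r + gamma * V(s'))
    sample = r + gamma * V.get(s_next, 0.0)
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample

V = {}                           # value estimates, initially 0
td_update(V, 'B', -1, 'C')       # V(B) becomes 0.5 * 0 + 0.5 * (-1 + 0) = -0.5
td_update(V, 'C', -1, 'D')       # V(C) becomes 0.5 * 0 + 0.5 * (-1 + 0) = -0.5
print(V)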
[Figure: golf example: the state-value function Vputt for the policy that always putts; the contours -1 to -6 give (minus) the number of strokes still needed from each location, and locations in the sand have value -∞ because a putt cannot get out of the sand]
[Diagram: a state s, its q-states (s, a), and the successor states s' reached from them]
V*(s) = max_a Q*(s, a)
Q*(s, a) = Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]
Q*(s, a) = Σ_s' T(s, a, s') [ R(s, a, s') + γ max_a' Q*(s', a') ]
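With a known model, these relations translate directly into code; a minimal sketch, reusing the {(state, action): [(prob, next_state, reward), ...]} transition format from the recycling-robot sketch above:

def q_from_v(T, actions, V, gamma):
    # Q*(s,a) = sum_s' T(s,a,s') * (R(s,a,s') + gamma * V*(s'))
    return {(s, a): sum(p * (r + gamma * V[s2]) for p, s2, r in T[(s, a)])
            for s in actions for a in actions[s]}

def v_from_q(Q, actions):
    # V*(s) = max_a Q*(s,a)
    return {s: max(Q[(s, a)] for a in actions[s]) for s in actions}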
Q-learning: after each observed sample (s, a, r, s'), update
Q(s, a) ← (1 - α) Q(s, a) + α [ r + γ max_a' Q(s', a') ]
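A minimal sketch of tabular Q-learning built on this update (the actions_in function, giving the actions available in a state, is a hypothetical placeholder; the values of α and γ are illustrative):

from collections import defaultdict

gamma, alpha = 1.0, 0.5
Q = defaultdict(float)                       # Q-table, all entries start at 0

def q_learning_update(s, a, r, s_next, actions_in):
    # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))
    if s_next is None:                       # terminal transition: no future value
        target = r
    else:
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions_in(s_next))
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target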
[Figure: golf example: contours (-1, -2, -3) of the optimal action-value function Q*(s, driver)]
[Worked example: tabular Q-learning updates along a trajectory of 'right' actions in a small gridworld]
In realistic problems, a table of Q-values does not scale:
– Too many states to visit them all in training
– Too many states to hold the Q-tables in memory
Instead, we want to:
– Learn about some small number of training states from experience
– Generalize that experience to new, similar situations (see the sketch below)
– This is a fundamental idea in machine learning!
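One common way to get this generalization is a linear, feature-based Q-function: represent Q(s, a) as a weighted sum of features and update the weights rather than one table entry per state. The sketch below is an illustration under that assumption (the feature function f and the actions_in function are hypothetical placeholders; this is not necessarily the exact scheme used in the rest of these slides):

gamma, alpha = 0.9, 0.1   # illustrative choices

def q_value(w, f, s, a):
    # Q(s,a) = sum_i w_i * f_i(s,a), with f(s, a) returning {feature_name: value}
    return sum(w.get(name, 0.0) * value for name, value in f(s, a).items())

def approx_q_update(w, f, s, a, r, s_next, actions_in):
    # difference = (r + gamma * max_a' Q(s',a')) - Q(s,a)
    # w_i <- w_i + alpha * difference * f_i(s,a)
    if s_next is None:
        target = r
    else:
        target = r + gamma * max(q_value(w, f, s_next, a2) for a2 in actions_in(s_next))
    difference = target - q_value(w, f, s, a)
    for name, value in f(s, a).items():
        w[name] = w.get(name, 0.0) + alpha * difference * value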