INF3490 - Biologically inspired computing: Reinforcement Learning
Weria Khaksar
October 10, 2018
"It would be in vain for one Intelligent Being, to set a Rule to the Actions of another, if he had it not in his Power, to reward the compliance with, and punish deviation from his Rule, by some Good and Evil, that is not the natural product and consequence of the action itself." - John Locke
[Slide images: Ghostbusters (1984), The Commuter (2018)]
From: "Deconstructing Reinforcement Learning", ICML 2009
▪ Barrett WAM robot learning to flip pancakes by reinforcement learning
▪ Socially Aware Motion Planning with Deep Reinforcement Learning
▪ Hierarchical Reinforcement Learning for Robot Navigation
▪ Google DeepMind's Deep Q-learning playing Atari Breakout
[Image: untrained classifier]
Recap - the perceptron: inputs $x_i$ are combined with weights $w_i$ into an activation
$a = \sum_{i=1}^{n} w_i x_i$,
and the output is thresholded:
$y = \begin{cases} 1 & \text{if } a \geq \theta \\ 0 & \text{if } a < \theta \end{cases}$
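A minimal sketch of this perceptron unit; the weights and threshold below are arbitrary examples:

    import numpy as np

    def perceptron(x, w, theta):
        """Weighted sum of the inputs followed by a hard threshold."""
        a = np.dot(w, x)                 # activation: a = sum_i w_i * x_i
        return 1 if a >= theta else 0

    # With these weights and threshold the unit computes a logical AND:
    print(perceptron(np.array([1, 1]), np.array([0.6, 0.6]), theta=1.0))  # -> 1
    print(perceptron(np.array([1, 0]), np.array([0.6, 0.6]), theta=1.0))  # -> 0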
Example: the payoff of a chess move may only become clear "50 chess moves later" - rewards can be delayed.
Example action: "Move piece from J1 to H1"
Example reward: "You took an opponent's piece." Reward = 1.
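A minimal sketch of this agent-environment loop; the env and agent interfaces here are hypothetical, not from the slides:

    def run_episode(env, agent):
        """One episode: the agent acts, the environment returns reward and next state."""
        state = env.reset()
        done, total_reward = False, 0.0
        while not done:
            action = agent.act(state)               # e.g. "move piece from J1 to H1"
            state, reward, done = env.step(action)  # e.g. reward = 1 for a capture
            agent.observe(reward, state)            # learn from the outcome
            total_reward += reward
        return total_reward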
▪ The next state $s_{t+1}$ is a function of the current state and action, $(s_t, a_t)$.
▪ Total reward: $R = \sum_{t=0}^{T-1} r_{t+1}$
▪ Future rewards may be uncertain, and we might care more about rewards that come soon. Therefore, we discount future rewards:
$R = \sum_{t=0}^{\infty} \gamma^t \, r_{t+1}, \qquad 0 \leq \gamma \leq 1$
$R_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}, \qquad 0 \leq \gamma \leq 1$
▪ Future reward:
$R = r_1 + r_2 + r_3 + \dots + r_n$
$R_t = r_t + r_{t+1} + r_{t+2} + \dots + r_n$
▪ Discounted future reward (the environment is stochastic):
$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{n-t} r_n$
$= r_t + \gamma (r_{t+1} + \gamma (r_{t+2} + \dots))$
$= r_t + \gamma R_{t+1}$
▪ A good strategy for an agent is to always choose the action that maximizes the (discounted) future reward.
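The recursion $R_t = r_t + \gamma R_{t+1}$ translates directly into code; a minimal sketch with a made-up reward sequence:

    def discounted_return(rewards, gamma):
        """Compute R_t = r_t + gamma * R_{t+1} by iterating backwards."""
        R = 0.0
        for r in reversed(rewards):
            R = r + gamma * R
        return R

    # One immediate reward plus a delayed reward of 10, discounted three steps:
    print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29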
Decay of the discount factor, $\gamma^n$:

  n | γ = 0.99 | γ = 0.95 | γ = 0.50 | γ = 0.05
  1 | 0.990000 | 0.950000 | 0.500000 | 0.050000
  2 | 0.980100 | 0.902500 | 0.250000 | 0.002500
  4 | 0.960596 | 0.814506 | 0.062500 | 0.000006
  8 | 0.922745 | 0.663420 | 0.003906 | 0.000000
 16 | 0.851458 | 0.440127 | 0.000015 | 0.000000
 32 | 0.724980 | 0.193711 | 0.000000 | 0.000000
 64 | 0.525596 | 0.037524 | 0.000000 | 0.000000
[Plot: $\gamma^n$ as a function of $n$ for $\gamma$ = 0.99, 0.95, 0.50 and 0.05]
➢ Greedy strategy: pure exploitation.
➢ ε-greedy strategy: exploitation with a little exploration.
➢ Soft-max strategy: choose action $a$ in state $s$ with probability
$P(Q_{s,t}(a)) = \frac{\exp(Q_{s,t}(a)/\tau)}{\sum_b \exp(Q_{s,t}(b)/\tau)}$
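A minimal sketch of the ε-greedy and soft-max selection rules, assuming Q holds the current action-value estimates for one state:

    import numpy as np

    rng = np.random.default_rng()

    def epsilon_greedy(Q, epsilon):
        """Exploit the best-valued action, but explore with probability epsilon."""
        if rng.random() < epsilon:
            return rng.integers(len(Q))          # explore: random action
        return int(np.argmax(Q))                 # exploit: current best action

    def softmax_action(Q, tau):
        """Sample an action with probability proportional to exp(Q(a)/tau)."""
        prefs = np.exp((Q - np.max(Q)) / tau)    # subtract max for numerical stability
        return rng.choice(len(Q), p=prefs / prefs.sum())

    Q = np.array([1.0, 2.0, 0.5])
    print(epsilon_greedy(Q, epsilon=0.1), softmax_action(Q, tau=0.5))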
Without the Markov property, the next state and reward may depend on the entire history:
$p(r_t = r', s_{t+1} = s' \mid s_t, a_t, r_{t-1}, \dots, r_1, s_1, a_1, s_0, a_0)$
With the Markov property, they depend only on the current state and action:
$p(r_t = r', s_{t+1} = s' \mid s_t, a_t)$
[Figure: a simple example of a Markov Decision Process]
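A minimal sketch of how such an MDP can be represented in code; the states, actions, rewards and transition probabilities below are invented for illustration:

    import random

    # P[(state, action)] -> list of (probability, next_state, reward)
    P = {
        ("tired", "rest"):  [(0.8, "rested", 0.0), (0.2, "tired", -1.0)],
        ("tired", "work"):  [(1.0, "tired", 1.0)],
        ("rested", "work"): [(1.0, "tired", 5.0)],
    }

    def step(state, action):
        """Sample the next state and reward; they depend only on (state, action)."""
        outcomes = P[(state, action)]
        weights = [p for p, _, _ in outcomes]
        _, s_next, r = random.choices(outcomes, weights=weights)[0]
        return s_next, r

    print(step("tired", "rest"))   # e.g. ('rested', 0.0)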
– The value of a state, $V(s)$, averaged over all possible actions in that state (the state-value function):
$V(s) = E[R_t \mid s_t = s] = E\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right]$
– The value of a state/action pair, $Q(s,a)$ (the action-value function):
$Q(s,a) = E[R_t \mid s_t = s, a_t = a] = E\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right]$
Both value functions can be estimated from experience.
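A minimal sketch of estimating $V(s)$ from experience with every-visit Monte Carlo; the episode format used here, a list of (state, reward received after leaving that state) pairs, is an assumption of this sketch:

    from collections import defaultdict

    def mc_state_values(episodes, gamma):
        """Every-visit Monte Carlo: V(s) is the average of the discounted
        returns observed after each visit to s."""
        returns = defaultdict(list)
        for episode in episodes:                     # [(state, reward), ...]
            R = 0.0
            for state, reward in reversed(episode):  # accumulate the return backwards
                R = reward + gamma * R
                returns[state].append(R)
        return {s: sum(rs) / len(rs) for s, rs in returns.items()}

    episodes = [[("A", 0.0), ("B", 1.0)], [("A", 0.0), ("B", 0.0)]]
    print(mc_state_values(episodes, gamma=0.9))      # {'B': 0.5, 'A': 0.45}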
[Figure: example environment with goal state "home"]
[Figure: the cliff-walking gridworld - Start, The Cliff, Goal]
In the cliff-walking task, the agent must travel from Start to Goal along the edge of a cliff; stepping into The Cliff yields a large negative reward and sends the agent back to Start.
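In Sutton and Barto's textbook treatment, cliff walking is the classic task for contrasting off-policy Q-learning with on-policy SARSA: Q-learning learns the short path along the cliff edge, while SARSA learns a safer path further from it. A minimal tabular Q-learning sketch, assuming a hypothetical env object with reset(), step(action) and a list env.actions (not from the slides):

    import random
    from collections import defaultdict

    def q_learning(env, episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
        """Tabular Q-learning with epsilon-greedy exploration."""
        Q = defaultdict(float)                  # Q[(state, action)], defaults to 0
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                # epsilon-greedy action selection
                if random.random() < epsilon:
                    a = random.choice(env.actions)
                else:
                    a = max(env.actions, key=lambda act: Q[(s, act)])
                s2, r, done = env.step(a)
                # off-policy update: bootstrap from the greedy action in s2
                best_next = 0.0 if done else max(Q[(s2, act)] for act in env.actions)
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s2
        return Q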
MarI/O - Machine Learning for Video Games