Reinforcement Learning
Philipp Koehn, 17 November 2015
Rewards

Agent takes actions. Agent occasionally receives a reward, maybe just at the end of the task, e.g., chess:
– agent has to decide on individual moves
– reward only at end: win/lose
Or a reward after every small success:
– ping pong: any point scored
– learning to crawl: any forward movement
(Figure: state map of the grid world with stochastic movement.)

$$R(s) = \begin{cases} -0.04 & \text{(small penalty) for nonterminal states} \\ \pm 1 & \text{for terminal states} \end{cases}$$
Utility-based agent:
– needs a model of the environment
– learns a utility function on states
– selects actions that maximize expected outcome utility

Q-learning agent:
– learns an action-utility function Q(s,a)
– does not need to model the outcomes of actions
– the function provides the expected utility of taking a given action in a given state

Reflex agent:
– learns a policy that maps states directly to actions
(Figure: state map of the grid world with stochastic movement and reward function.)

$$R(s) = \begin{cases} +1 & \text{for the goal} \\ -1 & \text{for the pit} \\ -0.04 & \text{for all other states} \end{cases}$$
→ new state becomes known
→ reward becomes known
(⇒ policy needs to be learned)
(Figure: grid world trial with reward-to-go values 0.72, 0.76, 0.80, 0.84, 0.88, 0.92, 0.96, 1.00.)
$$U^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(S_t)\right]$$
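This expectation can be approximated by sampling: run trials under π and average the discounted returns. A minimal sketch in Python (the reward sequence is assumed to come from a trial recorded elsewhere):

```python
def discounted_return(rewards, gamma=0.9):
    """One Monte Carlo sample of U^pi for the start state:
    the discounted sum of the rewards R(S_0), R(S_1), ... of a trial."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical trial in the grid world: three -0.04 steps,
# then the +1 terminal reward.
print(discounted_return([-0.04, -0.04, -0.04, 1.0], gamma=0.9))
```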
(Figure: grid world trial with reward-to-go values 0.72, 0.76, 0.80, 0.84, 0.88, 0.92, 0.96, 1.00.)

Reward-to-go samples per state:
– (1,1), one sample: 0.72
– (1,2), two samples: 0.76, 0.84
– (1,3), two samples: 0.80, 0.88

The running average of these samples will converge to the utility of the state.
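A sketch of this direct utility estimation, under the assumption that each trial is recorded as a list of (state, reward) pairs; it computes the reward-to-go of every visited state and averages the samples per state:

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """Average reward-to-go per state over all recorded trials."""
    samples = defaultdict(list)
    for trial in trials:
        reward_to_go = 0.0
        # Walk the trial backwards so reward-to-go accumulates correctly.
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            samples[state].append(reward_to_go)
    # Running average of the samples, which converges to U^pi(s).
    return {s: sum(v) / len(v) for s, v in samples.items()}
```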
$$U^\pi(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U^\pi(s')$$

(γ = reward decay)
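Given the reward function, transition model, and policy, these equations can be solved by iteration. A sketch, with hypothetical dictionaries R (state → reward), P ((state, action) → list of (successor, probability) pairs), and pi (state → action):

```python
def policy_evaluation(pi, R, P, states, gamma=0.9, iterations=100):
    """Repeatedly apply U(s) = R(s) + gamma * sum_s' P(s'|s,pi(s)) * U(s')."""
    U = {s: 0.0 for s in states}
    for _ in range(iterations):
        U = {s: R[s] + gamma * sum(p * U[s2] for s2, p in P[(s, pi[s])])
             for s in states}
    return U
```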
Need to learn:
– reward function: whenever a state is visited, record its reward (deterministic)
– transition model: collect statistics count(s,s′) of how often s′ is reached from s, and estimate the probability distribution

$$P(s' \mid s, \pi(s)) = \frac{\text{count}(s,s')}{\sum_{s''} \text{count}(s,s'')}$$

⇒ ingredients for a policy evaluation algorithm
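A sketch of the model-learning ingredient, assuming transitions (s, a, s′) are observed while following the policy:

```python
from collections import defaultdict

class TransitionModel:
    """Maximum-likelihood transition model estimated from counts."""
    def __init__(self):
        self.count = defaultdict(int)  # count[(s, a, s2)]
        self.total = defaultdict(int)  # total[(s, a)] = sum over all s2

    def observe(self, s, a, s2):
        self.count[(s, a, s2)] += 1
        self.total[(s, a)] += 1

    def prob(self, s2, s, a):
        """P(s2|s,a) = count(s,a,s2) / sum_s'' count(s,a,s'')."""
        n = self.total[(s, a)]
        return self.count[(s, a, s2)] / n if n else 0.0
```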
$$\Delta U^\pi(s) = \alpha \left( R(s) + \gamma\, U^\pi(s') - U^\pi(s) \right)$$

(α = learning rate)
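A sketch of one TD update, assuming utilities live in a dictionary U with missing states defaulting to 0:

```python
def td_update(U, s, s_next, reward, alpha=0.1, gamma=0.9):
    """Move U(s) toward the observed sample R(s) + gamma * U(s')."""
    u_s = U.get(s, 0.0)
    U[s] = u_s + alpha * (reward + gamma * U.get(s_next, 0.0) - u_s)
```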
Adaptive dynamic programming (ADP) converges faster than temporal difference learning (TD):
– both make adjustments to bring a state and its successors into agreement
– but: ADP adjusts for all possible successors, TD only for the observed successor
Tradeoff between:
– exploitation: following the optimal policy (as currently estimated)
– exploration: trying other actions to improve the estimates
Each lever has its own probability distribution over payoffs.

Tradeoff when choosing a lever:
– presume optimal payoff (pull the best lever seen so far)
– exploration (try a new lever)

Gittins index: formula for the solution
– uses payoff / number of times the lever was used
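A crude sketch of an index policy in the spirit of the slide's formula (empirical payoff per use; the true Gittins index is more involved). The lists payoff and pulls are assumed to be maintained by the caller:

```python
def pick_lever(payoff, pulls):
    """Pick the lever with the highest payoff / number-of-uses index.
    Unpulled levers get an infinite index, forcing initial exploration."""
    def index(i):
        return float("inf") if pulls[i] == 0 else payoff[i] / pulls[i]
    return max(range(len(pulls)), key=index)
```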
Greedy:
– carry out the optimal policy (as currently estimated) ⇒ maximize reward

Greedy in the limit of infinite exploration:
– with probability 1/t take a random action
– initially (t small) focus on exploration
– later (t large) focus on the optimal policy
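A minimal sketch of the 1/t scheme (t is assumed to start at 1 and count the steps taken so far):

```python
import random

def glie_action(t, actions, best_action):
    """With probability 1/t take a random action, otherwise be greedy.
    Exploration dominates early (t small) and fades as t grows."""
    if random.random() < 1.0 / t:
        return random.choice(actions)
    return best_action
```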
Value iteration update:

$$U(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} P(s' \mid s,a)\, U(s')$$

Update with exploration function f, where N(s,a) counts how often action a has been tried in state s:

$$U^+(s) \leftarrow R(s) + \gamma \max_a f\!\left(\sum_{s'} P(s' \mid s,a)\, U^+(s'),\; N(s,a)\right)$$

$$f(u,n) = \begin{cases} R^+ & \text{if } n < N_e \\ u & \text{otherwise} \end{cases}$$

$R^+$ is an optimistic estimate: the best possible reward obtainable in any state.
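A sketch of the optimistic update, reusing the R/P dictionaries from the policy evaluation sketch plus a visit-count table N; the constants R_PLUS and N_E are hypothetical choices:

```python
R_PLUS = 1.0  # optimistic estimate: best possible reward in any state
N_E = 5       # try each (state, action) pair at least this often

def f(u, n):
    """Exploration function: stay optimistic until tried N_E times."""
    return R_PLUS if n < N_E else u

def optimistic_update(s, U_plus, R, P, N, actions, gamma=0.9):
    """U+(s) <- R(s) + gamma * max_a f(sum_s' P(s'|s,a) U+(s'), N(s,a))."""
    return R[s] + gamma * max(
        f(sum(p * U_plus[s2] for s2, p in P[(s, a)]), N[(s, a)])
        for a in actions)
```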
$$Q(s,a) = R(s) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} Q(s',a')$$

Temporal difference update (needs no transition model):

$$\Delta Q(s,a) = \alpha \left( R(s) + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)$$
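A sketch of one Q-learning step; the Q table is assumed to be a dictionary keyed by (state, action), with missing entries defaulting to 0:

```python
def q_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """Move Q(s,a) toward the sample R(s) + gamma * max_a' Q(s',a')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (reward + gamma * best_next - q_sa)
```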
– Backgammon has $10^{20}$ states
– Chess has $10^{40}$ states
⇒ generalization over states needed
$$\hat{U}_\theta(s) = \theta_1 f_1(s) + \theta_2 f_2(s) + \dots + \theta_n f_n(s)$$

Features for chess:
– f1(s) = (number of white pawns) − (number of black pawns)
– f2(s) = (number of white rooks) − (number of black rooks)
– f3(s) = (number of white queens) − (number of black queens)
– f4(s) = king safety
– f5(s) = good pawn position
– etc.

⇒ reduction from $10^{40}$ states to, say, 20 parameters
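The linear combination itself is a one-liner; a sketch where features is a list of feature functions such as the material counts above:

```python
def u_hat(theta, features, state):
    """Linear utility estimate: U_theta(s) = sum_i theta_i * f_i(s)."""
    return sum(t * f(state) for t, f in zip(theta, features))
```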
$$\hat{U}_\theta(f_1,f_2) = \theta_0 + \theta_1 f_1 + \theta_2 f_2$$

Example: predicted utility $\hat{U}_\theta(1,1) = 0.8$, observed utility $u(1,1) = 0.4$.

Squared error:

$$E_\theta = \frac{1}{2}\left(\hat{U}_\theta(f_1,f_2) - u(f_1,f_2)\right)^2$$
$$\frac{\partial E_\theta}{\partial \theta_i} = \left(\hat{U}_\theta(f_1,f_2) - u(f_1,f_2)\right) f_i$$

Update each parameter against the gradient (µ = learning rate):

$$\Delta\theta_i = -\mu\, \frac{\partial E_\theta}{\partial \theta_i}$$

For the example above (f1 = f2 = 1):
– $\Delta\theta_1 = -\mu\,(\hat{U}_\theta(f_1,f_2) - u(f_1,f_2))\, f_1 = -\mu\,(0.8 - 0.4)\cdot 1 = -0.4\mu$
– $\Delta\theta_2 = -\mu\,(\hat{U}_\theta(f_1,f_2) - u(f_1,f_2))\, f_2 = -\mu\,(0.8 - 0.4)\cdot 1 = -0.4\mu$
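A sketch of this delta-rule step, with hypothetical starting weights; feature_values holds the constant feature 1 (for θ0) followed by the fi of the current state:

```python
def gradient_step(theta, feature_values, u_predicted, u_observed, mu=0.1):
    """Delta rule: theta_i <- theta_i - mu * (U_hat - u) * f_i."""
    error = u_predicted - u_observed
    return [t - mu * error * f for t, f in zip(theta, feature_values)]

# Slide's example: prediction 0.8, observation 0.4, f1 = f2 = 1
# (the leading 1.0 is the constant feature of theta_0); every
# weight moves by -0.4 * mu = -0.04 here.
print(gradient_step([0.3, 0.3, 0.2], [1.0, 1.0, 1.0], 0.8, 0.4, mu=0.1))
```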
⇒ we may want to use more complex features, e.g.:

$$f_3(s) = (x - x_{\text{goal}}) + (y - y_{\text{goal}})$$
$$\pi(s) = \arg\max_a \hat{Q}_\theta(s,a)$$

Stochastic (softmax) policy:

$$\pi_\theta(s,a) = \frac{1}{Z_s}\, e^{\hat{Q}_\theta(s,a)}$$
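A sketch of the softmax policy, where q_values maps each action to its $\hat{Q}_\theta(s,a)$ for the current state:

```python
import math
import random

def softmax_policy(q_values):
    """pi(s,a) = exp(Q(s,a)) / Z_s over the actions of one state."""
    z = sum(math.exp(q) for q in q_values.values())
    return {a: math.exp(q) / z for a, q in q_values.items()}

def sample_action(policy):
    """Draw an action according to its softmax probability."""
    r, acc = random.random(), 0.0
    for a, p in policy.items():
        acc += p
        if r < acc:
            return a
    return a  # guard against floating-point rounding
```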
⇒ optimizing the policy value ρ(θ) may be done in closed form
⇒ gradient descent by following the policy gradient
⇒ hill climbing: accept changes to θ if ρ(θ) improves
State variables: position x, angle θ, angular velocity θ̇
Passive reinforcement learning (fixed policy, partially observable environment, stochastic outcomes of actions):
– sampling (carrying out trials)
– adaptive dynamic programming
– temporal difference learning

Active reinforcement learning:
– greedy in the limit of infinite exploration
– following the optimal policy vs. exploration
– exploratory adaptive dynamic programming