Model-Free Control
(Reinforcement Learning)
and Deep Learning
MARC G. BELLEMARE
Google Brain (Montréal)
1956 1992 2015 2016

THE ARCADE LEARNING ENVIRONMENT (BELLEMARE ET AL., 2013)
Screen: 160 × 210 pixels. Reward: change in game score.
Fixed point of the projected Bellman operator:

$$\hat{Q} = \Pi T^\pi \hat{Q}$$

with the approximation error $\|\hat{Q} - Q^\pi\|$ scaling with $\frac{1}{1 - \gamma}$.
$$Q^*(x, a) = r(x, a) + \gamma\, \mathbb{E}_{x' \sim P} \max_{a' \in \mathcal{A}} Q^*(x', a')$$

$$M := \langle \mathcal{X}, \mathcal{A}, R, P, \gamma \rangle$$
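The fixed-point equation above can be solved by repeatedly applying the Bellman optimality backup. A minimal sketch, using a made-up random two-state, two-action MDP (the transition matrix `P` and rewards `R` are illustrative, not from the talk):

```python
import numpy as np

# Hypothetical MDP <X, A, R, P, gamma> with 2 states and 2 actions.
n_x, n_a, gamma = 2, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_x), size=(n_x, n_a))  # P[x, a] = distribution over x'
R = rng.standard_normal((n_x, n_a))               # r(x, a)

# Value iteration: Q <- r + gamma * E_{x'~P} max_{a'} Q(x', a')
Q = np.zeros((n_x, n_a))
for _ in range(1000):
    Q = R + gamma * P @ Q.max(axis=1)
```

Because the backup is a $\gamma$-contraction, `Q` converges to the unique fixed point $Q^*$ satisfying the equation above.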
BEEN SO SUCCESSFUL?
"... THE MAXIMIZATION OF SOME VALUE FUNCTION"
— SUTTON & BARTO (2017, IN PRESS)
[Diagram: agent–environment loop — state $s_t$, reward $r_t$, action $a_t$]
$$Q^\pi(x, a) := \mathbb{E}_{P, \pi} \Big[ \sum_{t=0}^{\infty} \gamma^t r(x_t, a_t) \Big]$$

$$Q^\pi(x, a) = r(x, a) + \gamma\, \mathbb{E}_{x' \sim P,\, a' \sim \pi}\, Q^\pi(x', a')$$

$$Q^*(x, a) = r(x, a) + \gamma\, \mathbb{E}_{x' \sim P} \max_{a' \in \mathcal{A}} Q^*(x', a')$$
Policy evaluation: $Q_{k+1} := T^\pi Q_k$, with $Q_k \to Q^\pi$ as $k \to \infty$.

Bellman optimality operator: $T Q(x, a) := r(x, a) + \gamma\, \mathbb{E}_{x' \sim P} \max_{a' \in \mathcal{A}} Q(x', a')$

Modified policy iteration: apply $T^\pi$ $m$ times between policy improvement steps.
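The iteration $Q_{k+1} := T^\pi Q_k$ can be checked numerically. A sketch on a made-up three-state MDP and random stochastic policy (all quantities illustrative):

```python
import numpy as np

# Hypothetical MDP and policy for illustrating policy evaluation.
n_x, n_a, gamma = 3, 2, 0.8
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_x), size=(n_x, n_a))  # P[x, a] over next states
R = rng.standard_normal((n_x, n_a))               # r(x, a)
pi = rng.dirichlet(np.ones(n_a), size=n_x)        # pi[x] = distribution over actions

def T_pi(Q):
    # (T^pi Q)(x, a) = r(x, a) + gamma * E_{x'~P, a'~pi} Q(x', a')
    V = (pi * Q).sum(axis=1)                      # E_{a'~pi} Q(x', a')
    return R + gamma * P @ V

Q = np.zeros((n_x, n_a))
for _ in range(500):
    Q = T_pi(Q)                                   # Q_k -> Q^pi
```

After enough iterations `Q` is (numerically) a fixed point of $T^\pi$, i.e. $Q^\pi$.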
xt+1 ∼ P(· | xt, at)
[Diagram: agent–environment loop — state $x_t$, reward $r_t$, action $a_t$]
Q-learning update with step size $\alpha_t$:

$$Q_{t+1}(x, a) := Q_t(x, a) + \alpha_t\, \delta_t$$

TD-error:

$$\delta_t := r(x, a) + \gamma \max_{a' \in \mathcal{A}} Q_t(x', a') - Q_t(x, a)$$
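The tabular update above is a one-liner. A minimal sketch, with a hypothetical transition (state indices, reward, and step size chosen for illustration):

```python
import numpy as np

# Tabular Q-learning update for the TD-error above.
n_x, n_a, gamma, alpha = 2, 2, 0.9, 0.1
Q = np.zeros((n_x, n_a))

def q_learning_step(x, a, r, x_next):
    # delta = r + gamma * max_{a'} Q(x', a') - Q(x, a)
    delta = r + gamma * Q[x_next].max() - Q[x, a]
    Q[x, a] += alpha * delta
    return delta

# One illustrative transition: from state 0, action 1, reward 1.0, to state 1.
delta = q_learning_step(x=0, a=1, r=1.0, x_next=1)
```

With `Q` initialized to zero, this single step gives $\delta = 1$ and moves $Q(0, 1)$ to $\alpha \cdot \delta = 0.1$.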
[1] Konda and Tsitsiklis (2004)
[2] Azar et al., Speedy Q-Learning (2011)
[3] Harutyunyan, Bellemare, Stepleton, Munos (2016)
[4] Munos, Stepleton, Harutyunyan, Bellemare (2016)
Parametric value function $Q(x, a, \theta)$ with weights $\theta$.
[1] Tsitsiklis and Van Roy (1997)
[1] Maei et al. (2009)
[2] Bellemare, Srinivasan, Ostrovski, Schaul, Saxton, Munos (2016)
[3] Touati et al. (2017)
Slide adapted from Ali Eslami
Graphic by Volodymyr Mnih
Mnih et al., 2015
Target: $r + \gamma \max_{a' \in \mathcal{A}} Q(x', a', \theta)$

TD-error: $r + \gamma \max_{a' \in \mathcal{A}} Q(x', a', \theta) - Q(x, a, \theta)$
Based on material by David Silver
$$L(\theta) = \mathbb{E}_{x, a, r, x' \sim D} \Big[ \big( r + \gamma \max_{a' \in \mathcal{A}} Q(x', a', \theta) - Q(x, a, \theta) \big)^2 \Big]$$
Semi-gradient of the loss (the target is treated as a constant):

$$\nabla_\theta L(\theta) \propto \mathbb{E} \Big[ \big( r + \gamma\, \mathbb{E}_{x' \sim P} \max_{a' \in \mathcal{A}} Q(x', a', \theta) - Q(x, a, \theta) \big)\, \nabla_\theta Q(x, a, \theta) \Big]$$
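With a linear parameterization $Q(x, a, \theta) = \phi(x, a)^\top \theta$, the semi-gradient above reduces to $\delta \cdot \phi(x, a)$. A sketch under that assumption; the features and the sampled transition are made up for illustration:

```python
import numpy as np

# Linear Q(x, a, theta) = phi(x, a)^T theta; semi-gradient TD update.
rng = np.random.default_rng(2)
d, n_a, gamma = 4, 3, 0.99
theta = rng.standard_normal(d)
phi = rng.standard_normal((n_a, d))        # phi[a]: features of (x, a)
phi_next = rng.standard_normal((n_a, d))   # features of (x', a')

def q(features, theta):
    return features @ theta

# One sampled transition (x, a, r, x'), with a and r chosen arbitrarily.
a, r = 0, 0.5
target = r + gamma * q(phi_next, theta).max()       # constant w.r.t. theta
delta = target - q(phi, theta)[a]                    # TD-error
grad = delta * phi[a]                                # semi-gradient direction
theta_new = theta + 0.01 * grad                      # descent on the squared TD-error
```

The update moves $Q(x, a, \theta)$ toward the (frozen) target, which is exactly what differentiating only through $Q(x, a, \theta)$ buys.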
Precup, Sutton, and Singh (2000)
Thomas and Brunskill (2016)
$$\mathcal{R} Q(x, a) := Q(x, a) + \sum_{t=0}^{\infty} (\lambda \gamma)^t \Big( \prod_{s=0}^{t-1} c_s \Big)\, \delta(x_t, a_t)$$
[1] Tsitsiklis and Van Roy (1997) [2] Munos, Stepleton, Harutyunyan, Bellemare (2016)
$$c_s := \min \Big( 1, \frac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)} \Big)$$
$$(1 - \lambda) \sum_{k=0}^{\infty} \lambda^k \underbrace{\Big[ \sum_{t=0}^{k} \gamma^t r(x_t, a_t) + \gamma^{k+1} Q(x_{k+1}, a_{k+1}) \Big]}_{\text{n-step return}} = Q(x, a) + \sum_{t=0}^{\infty} (\lambda \gamma)^t \delta(x_t, a_t)$$
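The identity between the mixture of n-step returns and the sum of discounted TD-errors can be verified numerically. A sketch on a random truncated trajectory (rewards and Q-values are made up; the infinite sums are cut off at a horizon where the tails are negligible):

```python
import numpy as np

# Numeric check of the lambda-return identity on a truncated trajectory.
rng = np.random.default_rng(3)
T, gamma, lam = 200, 0.9, 0.8
r = rng.standard_normal(T + 1)     # r(x_t, a_t) along the trajectory
q = rng.standard_normal(T + 2)     # Q(x_t, a_t) along the trajectory

# delta(x_t, a_t) = r_t + gamma * Q_{t+1} - Q_t
delta = r[:T + 1] + gamma * q[1:T + 2] - q[:T + 1]

# RHS: Q(x, a) + sum_t (lambda * gamma)^t delta_t
rhs = q[0] + sum((lam * gamma) ** t * delta[t] for t in range(T + 1))

# LHS: (1 - lambda) * sum_k lambda^k * [k-step return]
lhs = 0.0
for k in range(T + 1):
    n_step = sum(gamma ** t * r[t] for t in range(k + 1)) + gamma ** (k + 1) * q[k + 1]
    lhs += (1 - lam) * lam ** k * n_step
```

The two sides agree up to the truncation error, which here is of order $\lambda^{T+1}$ and far below floating-point noise.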
Optimality-preserving: $\lim_{k \to \infty} \max_{a \in \mathcal{A}} \tilde{Q}_k(x, a) = \max_{a \in \mathcal{A}} Q^*(x, a)$

Action gap: $\max_{a' \in \mathcal{A}} Q^*(x, a') - Q^*(x, a)$, and for a learned $Q$: $\max_{a' \in \mathcal{A}} Q(x, a') - Q(x, a)$
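A small sketch of the action gap and of a gap-increasing update in the spirit of advantage learning; the simplified operator form and the $\alpha$ parameter here are illustrative assumptions, not the paper's exact definition:

```python
import numpy as np

# Action gap: max_{a'} Q(x, a') - Q(x, a), for a hand-picked 2x2 Q-table.
alpha = 0.3
Q = np.array([[1.0, 0.8],
              [0.2, 0.5]])

gap = Q.max(axis=1, keepdims=True) - Q       # zero at the greedy action

# Gap-increasing update: push non-greedy values further below the maximum,
# leaving the greedy values (where gap == 0) untouched.
Q_al = Q - alpha * gap
gap_al = Q_al.max(axis=1, keepdims=True) - Q_al
```

By construction the per-state maxima are preserved while every non-greedy gap grows by a factor $(1 + \alpha)$, which is the qualitative property the action-gap slides are after.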
Bellemare, Ostrovski, Guez, Thomas, Munos (2016)
With Arthur Guez, Philip Thomas, Rémi Munos