SLIDE 1

Model-Free Control

(Reinforcement Learning)

and Deep Learning

MARC G. BELLEMARE

Google Brain (Montréal)

SLIDE 2
SLIDE 3

1956 1992 2016 2015

SLIDE 4

THE ARCADE LEARNING ENVIRONMENT (BELLEMARE ET AL., 2013)

SLIDE 5

210 × 160 pixels · 60 frames/second · 18 actions · Reward: change in score

SLIDE 6

  • 33,600 (discrete) dimensions
  • Up to 108,000 decisions/episode (30 minutes)
  • 60+ games: heterogeneous dynamical systems
SLIDE 7

DEEP LEARNING: AN AI SUCCESS STORY

SLIDE 8

$\hat{Q} = \Pi T^\pi \hat{Q}$

$\|\hat{Q} - Q^\pi\|_D \le \frac{1}{1-\gamma} \|\Pi Q^\pi - Q^\pi\|_D$

$Q^*(x, a) = r(x, a) + \gamma\, \mathbb{E}_P \max_{a' \in \mathcal{A}} Q^*(x', a')$

$M := \langle \mathcal{X}, \mathcal{A}, R, P, \gamma \rangle$

Theory vs. Practice

SLIDE 9

WHERE HAS MODEL-FREE CONTROL BEEN SO SUCCESSFUL?

  • Complex dynamical systems
  • Black-box simulators
  • High-dimensional state spaces
  • Long time horizons
  • Opponent / adversarial element
SLIDE 10

PRACTICAL CONSIDERATIONS

  • Are simulations reasonably cheap? → model-free
  • Is the notion of “state” complex? → model-free
  • Is there partial observability? → maybe model-free
  • Can the state space be enumerated? → value iteration
  • Is there an explicit model available? → model-based
SLIDE 11

OUTLINE OF TALK

Ideal case Practical case

SLIDE 12

WHAT’S REINFORCEMENT LEARNING, ANYWAY?

“ALL GOALS AND PURPOSES … CAN BE THOUGHT OF AS

THE MAXIMIZATION OF SOME VALUE FUNCTION”

– SUTTON & BARTO (2017, IN PRESS)

SLIDE 13

  • At each step t, the agent:
  • Observes a state $x_t$
  • Takes an action $a_t$
  • Receives a reward $r_t$
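As a rough illustration of this interaction loop (not part of the slides), here is a minimal Python sketch assuming a Gym-style environment with the classic reset()/step() 4-tuple interface; env and policy are placeholder names.

```python
def run_episode(env, policy, max_steps=108_000):
    """Agent-environment loop: observe x_t, take a_t, receive r_t."""
    x = env.reset()                       # initial state x_1
    total_reward = 0.0
    for t in range(max_steps):
        a = policy(x)                     # agent picks an action from the state
        x_next, r, done, _ = env.step(a)  # environment returns reward and next state
        total_reward += r
        x = x_next
        if done:
            break
    return total_reward
```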

SLIDE 14

THREE LEARNING PROBLEMS IN ONE

Optimal control · Policy evaluation · Stochastic approximation · Function approximation

SLIDE 15

BACKGROUND

  • Formalized as a Markov Decision Process: $M := \langle \mathcal{X}, \mathcal{A}, R, P, \gamma \rangle$
  • $R, P$: reward and transition functions
  • $\gamma$: discount factor
  • A trajectory is a sequence of interactions with the environment: $x_1, a_1, r_1, x_2, a_2, \ldots$

SLIDE 16

  • Policy $\pi$: a probability distribution over actions, $a_t \sim \pi(\cdot \mid x_t)$; if deterministic, $a_t = \pi(x_t)$
  • Transition function: $x_{t+1} \sim P(\cdot \mid x_t, a_t)$
  • Value function $Q^\pi(x, a)$: total discounted reward

    $Q^\pi(x, a) = \mathbb{E}_{P, \pi}\!\left[\sum_{t=0}^{\infty} \gamma^t r(x_t, a_t) \,\middle|\, x_0 = x, a_0 = a\right]$

  • As a vector in the space of value functions: $Q^\pi \in \mathcal{Q}$
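To make the discounted sum concrete, a small sketch (mine, not from the talk): the return of a single sampled trajectory. Averaging many such returns started from $(x, a)$ under $\pi$ gives a Monte Carlo estimate of $Q^\pi(x, a)$.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one trajectory (list of rewards)."""
    g = 0.0
    for r in reversed(rewards):   # accumulate backwards: g <- r + gamma * g
        g = r + gamma * g
    return g

# e.g. rewards r_0, r_1, r_2 observed after taking a in x:
# discounted_return([1.0, 0.0, 2.0], gamma=0.9) == 1 + 0.9*0 + 0.81*2 == 2.62
```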

SLIDE 17

  • “Maximize value function”: find $Q^*(x, a) := \max_\pi Q^\pi(x, a)$
  • Bellman’s equation:

    $Q^\pi(x, a) = \mathbb{E}_{P, \pi}\!\left[\sum_{t=0}^{\infty} \gamma^t r(x_t, a_t) \,\middle|\, x_0 = x, a_0 = a\right] = r(x, a) + \gamma\, \mathbb{E}_{P, \pi}\, Q^\pi(x', a')$

  • Optimality equation:

    $Q^*(x, a) = r(x, a) + \gamma\, \mathbb{E}_P \max_{a' \in \mathcal{A}} Q^*(x', a')$

SLIDE 18

BELLMAN OPERATOR

  • The Bellman operator: $T^\pi Q(x, a) := r(x, a) + \gamma\, \mathbb{E}_{x' \sim P,\, a' \sim \pi}\, Q(x', a')$
  • It is a $\gamma$-contraction: $\|T^\pi Q - Q^\pi\|_\infty \le \gamma\, \|Q - Q^\pi\|_\infty$
  • Fixed point: $Q^\pi = T^\pi Q^\pi$, so the iteration $Q_{k+1} := T^\pi Q_k$ converges to $Q^\pi$
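A tabular sketch of policy evaluation by repeated application of $T^\pi$, assuming the model is available as NumPy arrays (array names and shapes are illustrative assumptions): R[x, a] rewards, P[x, a, x'] transition probabilities, pi[x, a] action probabilities.

```python
import numpy as np

def bellman_operator(Q, R, P, pi, gamma):
    """One application of T^pi: R + gamma * E_{x'~P, a'~pi} Q(x', a')."""
    v_next = (pi * Q).sum(axis=1)    # E_{a'~pi} Q(x', a') for every state x'
    return R + gamma * P @ v_next    # expectation over x' via the transition matrix

def evaluate_policy(R, P, pi, gamma=0.99, iters=1000):
    Q = np.zeros_like(R)
    for _ in range(iters):           # gamma-contraction: Q_k converges to Q^pi
        Q = bellman_operator(Q, R, P, pi, gamma)
    return Q
```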

SLIDE 19

BELLMAN OPTIMALITY OPERATOR

  • $T Q(x, a) := r(x, a) + \gamma\, \mathbb{E}_{x' \sim P} \max_{a' \in \mathcal{A}} Q(x', a')$
  • Also a $\gamma$-contraction (beware! different proof): $\|T Q - Q^*\|_\infty \le \gamma\, \|Q - Q^*\|_\infty$
  • Fixed point is the optimal value function: $Q^* = T Q^* \ge Q^\pi$, and $Q_{k+1} := T Q_k$ converges to $Q^*$

SLIDE 20

MODEL-BASED ALGORITHMS

1. Value iteration: $Q_{k+1}(x, a) \leftarrow T Q_k(x, a) = r(x, a) + \gamma\, \mathbb{E}_P \max_{a' \in \mathcal{A}} Q_k(x', a')$
2. Policy iteration:
   a. $\pi_k = \arg\max_\pi T^\pi Q_k(x, a)$
   b. $Q_{k+1}(x, a) \leftarrow Q^{\pi_k}(x, a)$
3. Optimistic policy iteration: $Q_{k+1}(x, a) \leftarrow (T^{\pi_k})^m Q_k(x, a) = \underbrace{T^{\pi_k} \cdots T^{\pi_k}}_{m\ \text{times}} Q_k(x, a)$
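A value-iteration sketch under the same assumed array layout as the policy-evaluation sketch above (model-based: it uses the expectation over P directly rather than samples).

```python
import numpy as np

def value_iteration(R, P, gamma=0.99, iters=1000):
    """Q_{k+1} <- T Q_k = R + gamma * E_P[ max_a' Q_k(x', a') ]."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        v = Q.max(axis=1)            # max_a' Q_k(x', a') for every state x'
        Q = R + gamma * P @ v        # P[x, a, x'] are transition probabilities
    greedy_policy = Q.argmax(axis=1)
    return Q, greedy_policy
```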

SLIDE 21

POLICY ITERATION

Optimal control · Policy evaluation

   a. $\pi_k = \arg\max_\pi T^\pi Q_k(x, a)$
   b. $Q_{k+1}(x, a) \leftarrow Q^{\pi_k}(x, a)$

SLIDE 22

MODEL-FREE REINFORCEMENT LEARNING

  • Typically no access to P, R
  • Two options:
  • Learn a model (not in this talk)
  • Model-free: learn $Q^\pi$ or $Q^*$ directly from samples

SLIDE 23

Model-based vs. model-free:

  • Model-based: update $Q_{k+1} := T^\pi Q_k$ using the expectation $\mathbb{E}_P$ over next states
  • Model-free: the agent only observes samples $x_{t+1} \sim P(\cdot \mid x_t, a_t)$ through the state/action/reward loop
SLIDE 24

MODEL-FREE RL: SYNCHRONOUS UPDATES

  • For all x, a, sample $x' \sim P(\cdot \mid x, a)$, $a' \sim \pi(\cdot \mid x')$
  • The SARSA algorithm:

    $Q_{t+1}(x, a) \leftarrow (1 - \alpha_t) Q_t(x, a) + \alpha_t \hat{T}^\pi_t Q_t(x, a)$
    $\qquad = (1 - \alpha_t) Q_t(x, a) + \alpha_t \left[ r(x, a) + \gamma Q_t(x', a') \right]$
    $\qquad = Q_t(x, a) + \alpha_t \underbrace{\left( r(x, a) + \gamma Q_t(x', a') - Q_t(x, a) \right)}_{\text{TD-error } \delta}$

  • $\alpha_t \in [0, 1)$ is a step-size (sequence)
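The update rule itself is one line; a tabular sketch (the caller is assumed to have sampled x' and a' as on the slide):

```python
def sarsa_update(Q, x, a, r, x_next, a_next, alpha, gamma):
    """Q(x,a) <- Q(x,a) + alpha * (r + gamma * Q(x',a') - Q(x,a))."""
    td_error = r + gamma * Q[x_next, a_next] - Q[x, a]   # the TD-error delta
    Q[x, a] += alpha * td_error
    return td_error
```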

SLIDE 25

MODEL-FREE RL: Q-LEARNING

  • The Q-Learning algorithm: take a max at each iteration

    $Q_{t+1}(x, a) \leftarrow (1 - \alpha_t) Q_t(x, a) + \alpha_t \left[ r(x, a) + \gamma \max_{a' \in \mathcal{A}} Q_t(x', a') \right]$
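A tabular Q-Learning sketch: same shape as SARSA, but the bootstrap target takes the max over next actions. An ε-greedy behaviour policy is included for completeness (all names are illustrative, not from the talk).

```python
import numpy as np

def q_learning_update(Q, x, a, r, x_next, alpha, gamma):
    """Q(x,a) <- (1 - alpha) * Q(x,a) + alpha * (r + gamma * max_a' Q(x',a'))."""
    target = r + gamma * Q[x_next].max()
    Q[x, a] = (1 - alpha) * Q[x, a] + alpha * target

def epsilon_greedy(Q, x, epsilon, rng):
    """Behaviour policy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(Q[x].argmax())
```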

SLIDE 26

Optimal control · Policy evaluation · Stochastic approximation

  • Both SARSA and Q-Learning converge under the Robbins-Monro conditions
  • Not trivial! Interleaved learning problems

SLIDE 27

ASYNCHRONOUS UPDATES

  • The asynchronous case: learn from trajectories $x_1, a_1, r_1, x_2, a_2, \ldots \sim \pi, P$
  • Apply the update at each step:

    $Q(x_t, a_t) \leftarrow Q(x_t, a_t) + \alpha_t \left[ r_t + \gamma Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t) \right]$

  • This is the setting we usually deal with
  • Convergence even more delicate
SLIDE 28

OPEN QUESTIONS/ AREAS OF ACTIVE RESEARCH

  • Rates of convergence [1]
  • Variance reduction [2]
  • Convergence guarantees for multi-step methods [3, 4]
  • Off-policy learning: control from fixed behaviour [3, 4]

[1] Konda and Tsitsiklis (2004)
[2] Azar et al., Speedy Q-Learning (2011)
[3] Harutyunyan, Bellemare, Stepleton, Munos (2016)
[4] Munos, Stepleton, Harutyunyan, Bellemare (2016)

SLIDE 29

Optimal control · Policy evaluation · Stochastic approximation · Function approximation

SLIDE 30

Optimal control · Policy evaluation · Stochastic approximation · Function approximation

SLIDE 31

(VALUE) FUNCTION APPROXIMATION

  • Parametrize the value function: $Q^\pi(x, a) \approx Q(x, a, \theta)$
  • Learning now involves a projection step $\Pi$:

    $\Pi T^\pi Q(x, a, \theta_k):\quad \theta_{k+1} \leftarrow \arg\min_\theta \left\| T^\pi Q(x, a, \theta_k) - Q(x, a, \theta) \right\|_D$

  • This leads to additional, compounding error
  • Can cause divergence
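A sketch of the linear special case $Q(x, a, \theta) = \theta^\top \phi(x, a)$: each sample nudges $\theta$ toward the bootstrapped target, a stochastic stand-in for the projection step. The feature map phi is an assumed input, not something defined in the talk.

```python
def semi_gradient_sarsa_step(theta, phi, x, a, r, x_next, a_next, alpha, gamma):
    """One semi-gradient SARSA update for Q(x, a, theta) = theta . phi(x, a)."""
    q = theta @ phi(x, a)
    q_next = theta @ phi(x_next, a_next)
    td_error = r + gamma * q_next - q
    theta += alpha * td_error * phi(x, a)   # grad_theta Q(x, a, theta) = phi(x, a)
    return theta
```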
SLIDE 32

SOME CLASSIC RESULTS [1]

  • Linear approximation: $Q^\pi(x, a) \approx \theta^\top \phi(x, a)$
  • SARSA converges to $\hat{Q}$ satisfying $\hat{Q} = \Pi T^\pi \hat{Q}$, with

    $\|\hat{Q} - Q^\pi\|_D \le \frac{1}{1-\gamma} \|\Pi Q^\pi - Q^\pi\|_D$

  • Q-Learning may diverge!

[1] Tsitsiklis and Van Roy (1997)
SLIDE 33

OPEN QUESTIONS/ AREAS OF ACTIVE RESEARCH

  • Convergent, linear-time optimal control [1]
  • Exploration under function approximation [2]
  • Convergence of multi-step extensions [3]

[1] Maei et al. (2009)
[2] Bellemare, Srinivasan, Ostrovski, Schaul, Saxton, Munos (2016)
[3] Touati et al. (2017)

SLIDE 34

Ideal case Practical case

SLIDE 35

Ideal case Practical case

SLIDE 36

1956 1992 2016 2015

SLIDE 37

1956 1992 2016 2015

$Q^\pi(x, a) \approx \theta^\top \phi(x, a)$

SLIDE 38

1956 1992 2016 2015

SLIDE 39

DEEP LEARNING

Slide adapted from Ali Eslami

SLIDE 40

DEEP LEARNING

Graphic by Volodymyr Mnih

Diagram labels: loss $L(\theta)$, features $\phi_\theta(x, a)$, gradient $\nabla_\theta L(\theta)$

SLIDE 41

Mnih et al., 2015

SLIDE 42

DEEP REINFORCEMENT LEARNING

  • Value function as a Q-network: $Q(x, a, \theta)$
  • Objective function: mean squared error

    $L(\theta) := \mathbb{E}\Big[\big( \underbrace{r + \gamma \max_{a' \in \mathcal{A}} Q(x', a', \theta)}_{\text{target}} - Q(x, a, \theta) \big)^2\Big]$

  • Q-Learning gradient:

    $\nabla_\theta L(\theta) = \mathbb{E}\Big[\big( r + \gamma \max_{a' \in \mathcal{A}} Q(x', a', \theta) - Q(x, a, \theta) \big)\, \nabla_\theta Q(x, a, \theta)\Big]$

Based on material by David Silver
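A minimal PyTorch sketch of this objective and a gradient step, assuming a toy network and batch layout; this is a simplification, not the exact architecture or hyperparameters of Mnih et al. (2015).

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # toy Q-network
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)
gamma = 0.99

def dqn_loss(batch):
    x, a, r, x_next, done = batch                              # batched tensors
    q = q_net(x).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(x, a, theta)
    with torch.no_grad():                                      # semi-gradient: no grad through target
        target = r + gamma * (1.0 - done.float()) * q_net(x_next).max(dim=1).values
    return ((target - q) ** 2).mean()                          # mean squared error

# one gradient step on a sampled batch:
# loss = dqn_loss(batch); optimizer.zero_grad(); loss.backward(); optimizer.step()
```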

SLIDE 43

STABILITY ISSUES

  • Naive Q-Learning oscillates or diverges
  • 1. Data is sequential

Successive samples are non-iid

  • 2. Policy changes rapidly with Q-values

May oscillate; extreme data distributions

  • 3. Scale of rewards and Q-values is unknown

Naive gradients can be large; unstable backpropagation

Based on material by David Silver

SLIDE 44

DEEP Q-NETWORKS

  • 1. Use experience replay

Break correlations, learn from past policies

  • 2. Target network to keep target values fixed

Avoid oscillations

  • 3. Clip rewards

Provide robust gradients

Based on material by David Silver

SLIDE 45

EXPERIENCE REPLAY

  • Build dataset from agent’s experience
  • Take action according to ε-greedy policy
  • Store (x, a, r, x’, a’) in replay memory D
  • Sample transitions from D, perform asynchronous update:

    $L(\theta) = \mathbb{E}_{x, a, r, x', a' \sim D}\Big[\big( r + \gamma \max_{a' \in \mathcal{A}} Q(x', a', \theta) - Q(x, a, \theta) \big)^2\Big]$

  • Effectively avoids correlations within trajectories
  • Equivalent to planning with an empirical model

Based on material by David Silver
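A minimal replay-memory sketch: a bounded buffer of transitions plus uniform sampling. The capacity and batch size here are arbitrary placeholder values.

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)        # oldest transitions are discarded

    def store(self, x, a, r, x_next, done):
        self.buffer.append((x, a, r, x_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)  # uniform sampling breaks correlations
        return list(zip(*batch))                        # columns: states, actions, rewards, ...
```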

SLIDE 46

TARGET Q-NETWORK

  • To avoid oscillations, fix the parameters of the target in the loss function
  • Compute targets w.r.t. old parameters $\theta^-$:

    $r + \gamma \max_{a' \in \mathcal{A}} Q(x', a', \theta^-)$

  • As before, minimize the squared loss:

    $L(\theta) = \mathbb{E}_D\Big[\big( r + \gamma \max_{a' \in \mathcal{A}} Q(x', a', \theta^-) - Q(x, a, \theta) \big)^2\Big]$

  • Periodically update the target network: $\theta^- \leftarrow \theta$
  • Similar to policy iteration!

Based on material by David Silver
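Continuing the PyTorch sketch from slide 42 (same assumed q_net and gamma): a frozen copy of the network holds $\theta^-$, is used only inside the target, and is refreshed periodically.

```python
import copy
import torch

target_net = copy.deepcopy(q_net)                 # theta^- starts equal to theta

def dqn_loss_with_target(batch):
    x, a, r, x_next, done = batch
    q = q_net(x).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                          # target computed w.r.t. old parameters
        target = r + gamma * (1.0 - done.float()) * target_net(x_next).max(dim=1).values
    return ((target - q) ** 2).mean()

def update_target():                               # every C steps: theta^- <- theta
    target_net.load_state_dict(q_net.state_dict())
```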

SLIDE 47

CLIPPING REWARDS

  • Clip rewards in range [-1, +1]
  • Ensures gradients are well-conditioned
  • Also prevents value overestimation
  • Can no longer tell small and large rewards apart

Based on material by David Silver

SLIDE 48

Based on material by David Silver

SLIDE 49
SLIDE 50

Some Recent Research

SLIDE 51

ACTIVE RESEARCH: OFF-POLICY METHODS

  • Reusing data (e.g. from experience replay) can diverge with approximation:

    $Q(x, a, \theta) \leftarrow r(x, a) + \gamma\, \mathbb{E}_{x' \sim P_a} \max_{a' \in \mathcal{A}} Q(x', a', \theta)$, with $a \sim \mu$

  • Can use the importance sampling ratio $\dfrac{\pi(a \mid x)}{\mu(a \mid x)}$
  • But variance is high
  • Also safety issues: how to guarantee performance?

Precup, Sutton, and Singh (2000); Thomas and Brunskill (2016)
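A sketch of the simplest use of the ratio, ordinary importance sampling of a full-trajectory return (my illustration, not the specific estimators of the cited papers): the cumulative product of ratios is exactly what makes the variance high.

```python
def is_weighted_return(trajectory, pi, mu, gamma=0.99):
    """Reweight a return generated under mu so that it estimates the return under pi.

    trajectory: list of (x, a, r); pi, mu: arrays of action probabilities [X, A].
    """
    g, rho, discount = 0.0, 1.0, 1.0
    for (x, a, r) in trajectory:
        rho *= pi[x, a] / mu[x, a]    # cumulative importance ratio; can blow up
        g += discount * r
        discount *= gamma
    return rho * g
```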

SLIDE 52

ACTIVE RESEARCH: MULTI-STEP METHODS

  • Greater accuracy [1] from multi-step returns:

    $T^\lambda Q(x, a) := (1 - \lambda) \sum_{k=0}^{\infty} \lambda^k \Big[ \underbrace{\sum_{t=0}^{k} \gamma^t r(x_t, a_t) + \gamma^{k+1} Q(x_{k+1}, a_{k+1})}_{\text{$n$-step return}} \Big] = Q(x, a) + \sum_{t=0}^{\infty} (\lambda\gamma)^t \delta(x_t, a_t)$

  • Retrace(λ) [2] is both off-policy and multi-step:

    $\mathcal{R} Q(x, a) := Q(x, a) + \sum_{t=0}^{\infty} (\lambda\gamma)^t \Big( \prod_{s=1}^{t} c_s \Big) \delta(x_t, a_t)$, with $c_s := \min\Big\{ 1, \dfrac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)} \Big\}$

  • Convergence is surprisingly nontrivial, even without value approximation

[1] Tsitsiklis and Van Roy (1997)
[2] Munos, Stepleton, Harutyunyan, Bellemare (2016)
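A sketch of the TD-error form on the right-hand side above: accumulate $(\lambda\gamma)^t \delta_t$ along a sampled trajectory (on-policy; a Retrace-style version would also multiply in the truncated ratios $c_s$).

```python
def lambda_correction(Q, trajectory, gamma, lam):
    """Estimate T^lambda Q(x_0, a_0) - Q(x_0, a_0) = sum_t (lambda*gamma)^t * delta_t.

    trajectory: list of (x_t, a_t, r_t); the last element only provides the bootstrap pair.
    """
    total, weight = 0.0, 1.0
    for t in range(len(trajectory) - 1):
        x, a, r = trajectory[t]
        x1, a1, _ = trajectory[t + 1]
        delta = r + gamma * Q[x1, a1] - Q[x, a]   # one-step TD-error
        total += weight * delta
        weight *= lam * gamma                     # (lambda * gamma)^t
    return total
```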

SLIDE 53

ACTIVE RESEARCH: GAP-INCREASING OPERATORS

  • Action gap: $\max_{a' \in \mathcal{A}} Q^*(x, a') - Q^*(x, a)$
  • New operators that increase the action gap, e.g.

    $\tilde{T} Q(x, a) := T Q(x, a) - \beta \Big[ \max_{a' \in \mathcal{A}} Q(x, a') - Q(x, a) \Big]$, with $\beta \in [0, 1)$

  • Not necessarily contraction operators
  • Suboptimal Q-values may not converge
  • Yet: guaranteed convergence:

    $\lim_{k \to \infty} \max_{a \in \mathcal{A}} (\tilde{T})^k Q(x, a) = \max_{a \in \mathcal{A}} Q^*(x, a)$

Bellemare, Ostrovski, Guez, Thomas, Munos (2016)
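A tabular sketch of one application of $\tilde{T}$, reusing the assumed model arrays from the value-iteration sketch earlier.

```python
import numpy as np

def gap_increasing_operator(Q, R, P, gamma, beta):
    """T~ Q(x,a) = T Q(x,a) - beta * (max_a' Q(x,a') - Q(x,a)), beta in [0, 1)."""
    TQ = R + gamma * P @ Q.max(axis=1)           # Bellman optimality operator T
    gap = Q.max(axis=1, keepdims=True) - Q       # action gap at every (x, a)
    return TQ - beta * gap
```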

SLIDE 54

IN CONCLUSION

Optimal control · Policy evaluation · Stochastic approximation · Function approximation

SLIDE 55

Model-Free Control with Deep Learning

MARC G. BELLEMARE

  • M. Bowling
  • Y. Naddaf
  • J. Veness
  • G. Ostrovski

Arthur Guez Philip Thomas Rémi Munos

  • A. Harutyunyan
  • T. Stepleton
  • S. Srinivasan
  • T. Schaul
  • D. Saxton