Deep Reinforcement Learning Building Blocks


SLIDE 1

Deep Reinforcement Learning Building Blocks

Arjun Chandra Research Scientist Telenor Research / Telenor-NTNU AI Lab arjun.chandra@telenor.com @boelger

8 November 2017 https://join.slack.com/t/deep-rl-tutorial/signup

SLIDE 2

The Plan

  • The Problem
  • (deep) RL Concepts by Example
  • Problem Decomposition
  • Solution Methods
      • Value Based
      • Policy Based
      • Actor-Critic
SLIDE 3

how to make decisions over time to maximise my return / “long term reward”?

SLIDE 4

http://cs.stanford.edu/groups/littledog/
SLIDE 5

emergence of locomotion

https://arxiv.org/abs/1707.02286 https://deepmind.com/blog/producing-flexible-behaviours-simulated-environments/ https://www.youtube.com/watch?v=hx_bgoTF7bs
SLIDE 6

As we know…

late 1980s: Rich Sutton et al.
1993: RL for robots using NNs, L-J Lin, PhD thesis, CMU
1995: Gerald Tesauro
2004: Stanford (http://heli.stanford.edu/)
2013: Vlad Mnih et al.
2015: David Silver et al., Google DeepMind
SLIDE 7

Problem Characteristics

  • requires strategy
  • delayed consequences
  • dynamic
  • uncertainty/volatility
  • uncharted/unimagined/exception laden

Image credit: http://wonderfulengineering.com/inside-the-data-center-where-google-stores-all-its-data-pictures/
SLIDE 8

Solution

machines with agency that learn, plan, and act to find a strategy for solving the problem:

  • explore and exploit
  • probe and learn from feedback
  • autonomous to some extent
  • focus on the long-term objective

SLIDE 9

what is the sequence of actions I could take to maximise my return / “long term reward”?

SLIDE 10

Reinforcement Learning

[Agent–environment diagram: the agent takes actions in the problem/environment and receives observations and feedback on its actions; goal: maximise return E{R}; the agent may hold a model of the dynamics and a policy/value function π/Q.]
SLIDE 11

the excruciatingly awesome MDP game!

you ⇄ env

interact to maximise long term reward

Inspired by Rich Sutton's tutorial: https://www.youtube.com/watch?v=ggqnxyjaKe4

SLIDE 12

the MDP (S, A, P, R, γ)

https://github.com/traai/basic-rl

[Diagram: a two-state MDP with states A and B and actions 1 and 2 in each state; edges carry stochastic immediate rewards (R = -10±3, 10±3, 20±3, 40±3) and transition probabilities (P = 0.99, 0.01, 1.00).]

R: immediate reward function R(s, a)
P: state transition probability P(s'|s, a)
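Not from the talk: a minimal sketch of how this (S, A, P, R, γ) tuple could be written down in Python. The states, actions, and numbers mirror the diagram; the exact wiring of the low-probability branches and the value of gamma are assumptions.

```python
import random

GAMMA = 0.9  # discount; assumed, not given on the slide

# dynamics[(s, a)] = list of (probability, next_state, mean_reward);
# rewards carry +/-3 noise, matching the R=...±3 labels
dynamics = {
    ("A", 1): [(0.99, "A", -10.0), (0.01, "B", 40.0)],
    ("A", 2): [(1.00, "B", 10.0)],
    ("B", 1): [(0.99, "B", 40.0), (0.01, "A", -10.0)],
    ("B", 2): [(0.99, "A", 20.0), (0.01, "B", 20.0)],
}

def step(state, action):
    """Sample (next_state, reward) from P(s'|s, a) and R(s, a)."""
    outcomes = dynamics[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, mean_reward = random.choices(outcomes, weights=weights)[0]
    return next_state, random.gauss(mean_reward, 3.0)

print(step("A", 2))  # e.g. ('B', 11.7)
```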

SLIDE 13

the problem (cartoon of an MDP)

[Diagram: agent–environment loop labelled with state, reward, and action.]

SLIDE 14

agent's job/goal? maximise expected cumulative reward/return

[Timeline of rewards: r, r, r, r, r]
SLIDE 15

toy problem

home

SLIDE 16

state and action spaces

  • size of these spaces can be quite large
  • specifying the spaces is crucial in designing a good learning agent

SLIDE 17

5 integer values between 1 and 100: {22, 44, 12, 67, 9}
size of state space = 100 × 100 × 100 × 100 × 100

can quantise the state space differently:

5 values belonging to 2 classes: {1, 2, 1, 2, 1}
size of state space = 2 × 2 × 2 × 2 × 2

in the toy problem? 9

SLIDE 18

reward

taking an action in some state results in an immediate reward (can be negative)

SLIDE 19

home

SLIDE 20

home

-1
SLIDE 21

home

SLIDE 22

home

-1
SLIDE 23

what if the cat were to start here some other day?

[Grid frame: alternative start cell marked X; home unchanged; annotation: > -1]

SLIDE 24

the reward system should tell the agent:

what to achieve

rather than how to achieve it

SLIDE 25

reward

this is all the feedback an agent gets!

immediate!

SLIDE 26

reward

but the agent has to choose an action based on expected return

SLIDE 27

expected return

?

SLIDE 28

task

episodic (there is an end)

continual (there is no end)

SLIDE 29

episodic (there is an end)

an agent taking finite (say 5) steps till the end... should act based on e.g. the average of the following:

R0 = r1 + r2 + r3 + r4 + r5

SLIDE 30

continual (there is no end)

an agent can continue acting for infinite steps in time... should discount future rewards and act based on e.g. the average of the following:

R0 = r1 + γr2 + γ²r3 + γ³r4 + γ⁴r5 + …

SLIDE 31

discount

future reward is probably more uncertain than immediate reward

0 ≤ γ ≤ 1

shortsighted (γ = 0)? farsighted (γ = 1)?

R0 = r1 + γr2 + γ²r3 + γ³r4 + γ⁴r5 + …
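The discounted return above is one line of code; a small sketch (not from the slides):

```python
def discounted_return(rewards, gamma):
    """R0 = r1 + gamma*r2 + gamma^2*r3 + ... for rewards = [r1, r2, ...]."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([1, 1, 1, 1, 1], 0.0))  # 1.0  (shortsighted)
print(discounted_return([1, 1, 1, 1, 1], 0.9))  # 4.0951
print(discounted_return([1, 1, 1, 1, 1], 1.0))  # 5.0  (farsighted)
```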

SLIDE 32

E{Rt}, where Rt = Σ_{k=0}^{T} γ^k r_{t+k+1}

R0 = Σ_{k=0}^{T} γ^k r_{k+1}

SLIDE 33

[Diagram: a return decomposes into the immediate reward plus further reward possibilities.]
SLIDE 34

expected return

E{Rt}, where Rt = Σ_{k=0}^{T} γ^k r_{t+k+1}

SLIDE 35

but these expected returns are not known to the agent beforehand!

SLIDE 36

what knowledge might the agent try to acquire to behave properly?

?

SLIDE 37

ranking/probability of an action in some state bringing max expected return (long term value)?

SLIDE 38

[Grid: each state annotated with E{R}.]

expected long term value of being in each state, under some action selection scheme?

SLIDE 39

[Grid: each state–action pair annotated with E{R}.]

expected long term value of taking some action in each state, then behaving using some action selection scheme?

SLIDE 40

modelling dynamics / mapping the environment?

[Grid annotated with rewards -1, -1, -10, -1; caption: “If I go South, I will meet …”]
SLIDE 41

prediction problem: learn to predict expected long term reward/value
control problem: learn to find the optimal action selection scheme/policy

SLIDE 42

value: how good is an action/state
policy: action selection
model: predict next state/reward to look ahead/plan

SLIDE 43

types of RL agents?

value based
policy based
actor-critic: both value and policy
model based: value/policy + model of dynamics

SLIDE 44

we will focus on value based RL in the first half

SLIDE 45

action selection?

values of each possible action in the current state help select actions!

the expected return for carrying out an action is its value

SLIDE 46

policy can be derived from value

(e.g. act greedily)

SLIDE 47

<<expected returns unknown>>

<<actions based on unknowns>>

but what are these values?

SLIDE 48

value can be estimated by sampling the environment while acting using some policy

e.g. act, accumulate new reward (ground truth), and update

SLIDE 49

Q

the agent maintains values for actions within each state, and selects actions using these values under some “policy”

[Q-table: rows = states (a, b, c, …, n), columns = actions; entries are action values, e.g. 1, 2, -1, 3, -5, 6, 2, 4, 2, 3, 1, …, 7, 8, 7.]

SLIDE 50

V

the agent maintains state values, and selects actions using these values under some “policy”

[V-table: one value per state, e.g. 1, 2, 2, 3, 3, -5, 4, 2, …, 7.]

but… the agent needs a model of the environment!

SLIDE 51

home

9 states vs. 10^16992 states (pixels) vs. 10^308 states (ram) vs. continuous!
SLIDE 52

extract features that help generalise across states

SLIDE 53

[Diagram: a Q-function approximator maps state s (via state features) to action values Q(s, north), Q(s, south), Q(s, east), Q(s, west).]

SLIDE 54

Q and V both estimate E{Rt}

SLIDE 55

policy?

probability of choosing an action in a state / feature representation thereof

π

Qπ(s,a), Vπ(s)

SLIDE 56

usual policies

greedy: choose best action
ε-greedy: choose best action with probability 1-ε
soft-max: choose action with probability given by its value
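A sketch of the three schemes over a list of action values (reading "probability given by its value" as a softmax over exponentiated values; the temperature parameter is an assumption):

```python
import math
import random

def greedy(q_values):
    """Choose the best action."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def epsilon_greedy(q_values, epsilon=0.1):
    """Choose the best action with probability 1 - epsilon, else explore."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return greedy(q_values)

def soft_max(q_values, temperature=1.0):
    """Choose an action with probability given by its (exponentiated) value."""
    prefs = [math.exp(q / temperature) for q in q_values]
    return random.choices(range(len(q_values)), weights=prefs)[0]

q = [0.1, -0.5, 1.0, 0.0]
print(greedy(q), epsilon_greedy(q), soft_max(q))
```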

SLIDE 57

Image credit: Yamaguchi先生, http://en.wikipedia.org/wiki/File:Las_Vegas_slot_machines.jpg

exploration vs. exploitation: trial and error

  • a/b testing: try new website feature
  • ads: try new ads
  • smart camera networks: try new comm. protocol
  • game play: try new moves

Static, dynamic and adaptive heterogeneity in socio-economic distributed smart camera networks, P. R. Lewis, L. Esterle, A. Chandra, B. Rinner, J. Torresen, and X. Yao, ACM Transactions on Autonomous and Adaptive Systems (TAAS), ACM, 2015.

SLIDE 58

Q*: Q*(s,a) = maxπ Qπ(s,a)

V*: V*(s) = maxπ Vπ(s)

π*: π*(a|s) = 1 if a = argmaxa Q*(s,a), 0 otherwise

SLIDE 59

estimation?

the current state (or state-action pair) has an estimated value (say zero/random initially), which can be used together with rt+1 to update the value of the previous state (or state-action pair)

<<use currently visible returns to update values of where you are coming from>>

[Diagram: st, via action at, leads to st+1 and reward rt+1.]

SLIDE 60

i.e.

new value = old value + fraction of (currently visible returns - old value)

new value = (1 - fraction) × old value + fraction × currently visible returns
SLIDE 61

immediate reward rt+1 + further reward possibilities

Rt = Σ_{k=0}^{T} γ^k r_{t+k+1}, so E{Rt} = rt+1 + γE{Rt+1} ≈ rt+1 + γQ(st+1, at+1)

Q(s,a) target: rt+1 + γQ(s', a')

SLIDE 62

e.g.

V(s) ← V(s) + α( r_s^a + γV(s') − V(s) ),  under some policy π(a|s)

Q(s,a) ← Q(s,a) + α( r_s^a + γQ(s',a') − Q(s,a) ),  under some policy π(a|s)
SLIDE 63

e.g. update a lookup table maintaining expected returns

[The Q-table and V-table from before.]

SLIDE 64

let's play with a version of the above update rule:

Q(s,a) ← Q(s,a) + α( r_s^a + γQ(s',a') − Q(s,a) )

Q(s,a) ← Q(s,a) + α( r_s^a + γ maxa' Q(s',a') − Q(s,a) )
SLIDE 65

let's play with a version of the above update rule:

Q(s,a) ← Q(s,a) + α( r_s^a + γ maxa' Q(s',a') − Q(s,a) )

maxa' indicates a' to be the action with maximum value in the next state s'
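A minimal sketch of this update as code, using a zero-initialised lookup table (the dict-based table and action names are assumptions; the environment itself is left abstract):

```python
from collections import defaultdict

Q = defaultdict(float)            # Q[(s, a)], zero-initialised
alpha, gamma = 0.1, 0.5           # as fixed on the next slides
ACTIONS = ["N", "S", "E", "W"]

def q_learning_update(s, a, r, s_next, done):
    """Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    bootstrap = 0.0 if done else max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])

# a single -1 move from a fresh table leaves 0.1 * -1 = -0.1 behind,
# which is exactly what the walkthrough below shows
q_learning_update(s=1, a="E", r=-1.0, s_next=2, done=False)
print(Q[(1, "E")])  # -0.1
```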
SLIDE 66

our toy problem

lookup table

[Q-table: rows = cells 1–9, columns = actions N, S, E, W; grid world with home.]
SLIDE 67

our toy problem

lookup table

[Grid world, cells 1–9, with home; empty Q-table alongside.]

SLIDE 68

[Grid world, cells 1–9, home at 7.]

reward structure (per move)?

  • out of bounds: -5
  • to 5: -10
  • to 7/home: 10
  • to any cell except 5 and 7: -1

SLIDE 69

let's fix α = 0.1, γ = 0.5

[Grid world, cells 1–9, home at 7.]

SLIDES 70–87

episode 1 begins... say ε-greedy policy…

Q(s,a) ← Q(s,a) + α( r_s^a + γ maxa' Q(s',a') − Q(s,a) ),  α = 0.1, γ = 0.5

[One grid frame per slide (cells 1–9, home at 7), stepping through episode 1. With the table initialised to zero, each move writes one entry of size 0.1·r, since Q ← 0 + 0.1·(r + 0.5·0 − 0): ordinary moves (r = -1) leave -0.1, the out-of-bounds move (r = -5) leaves -0.5, the move into cell 5 (r = -10) leaves -1, and the final move into 7/home (r = 10) leaves 1.]

episode 1 ends.

SLIDE 88

let's work out the next episode, starting at state 4: go WEST and then SOUTH

how does the table change?

[Grid with the episode-1 entries: -0.1, -0.5, -0.1, -1, -0.1, 1.]

α = 0.1, γ = 0.5

SLIDE 89

[Grid after the two moves: the table now holds -0.1, -0.5, -0.1, -0.5, -1, 1, -0.1, 1; WEST from 4 (out of bounds) writes -0.5, and SOUTH from 4 (into home) writes 1.]

α = 0.1, γ = 0.5

SLIDE 90

and the next episode, starting at state 3

go WEST -> SOUTH -> WEST -> SOUTH

SLIDE 91

[Grid after this episode: the table holds -0.1, -0.5, -0.1, -1, -0.1, -0.5, -1, 1.9, -0.05, -0.1, 1; note the home entry has grown from 1 to 1.9 = 1 + 0.1·(10 − 1).]

over time, values will converge to optimal!

α = 0.1, γ = 0.5

SLIDE 92

what we just saw was some episodes of Q-learning

values update towards the value of the optimal policy: the target comes from the value of the assumed next best action

off-policy learning
SLIDE 93

SARSA-learning?

values update towards the value of the current policy: the target comes from the value of the actual next action

on-policy learning
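A sketch of the one line that separates the two algorithms (helper names and the dict-based Q are assumptions):

```python
def q_learning_target(Q, r, s_next, actions, gamma):
    # off-policy: bootstrap from the assumed next best action
    return r + gamma * max(Q[(s_next, a)] for a in actions)

def sarsa_target(Q, r, s_next, a_next, gamma):
    # on-policy: bootstrap from the action the policy actually takes
    return r + gamma * Q[(s_next, a_next)]

Q = {("s1", "N"): 0.0, ("s1", "S"): 1.0, ("s1", "E"): -0.5, ("s1", "W"): 0.0}
print(q_learning_target(Q, -1.0, "s1", ["N", "S", "E", "W"], 0.5))  # -0.5
print(sarsa_target(Q, -1.0, "s1", "E", 0.5))                        # -1.25
```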
SLIDE 94

Q vs. SARSA

Q: data not generated by the target policy
SARSA: data generated by the target policy

[Side-by-side comparison of learned routes; ε: 0.1, γ: 1.0.]

Image by Andreas Tille (own work) [GFDL (www.gnu.org/copyleft/fdl.html) or CC-BY-SA-3.0-2.5-2.0-1.0 (www.creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
Example credit Travis DeWolf: https://studywolf.wordpress.com/ and https://git.io/vFBvv
SLIDE 95

Problem Decomposition

the solution to a sub-problem informs the solution to the whole problem

nested sub-problems

SLIDE 96

Bellman Expectation Backup

[Backup diagrams: from s with v(s), through action a and reward r, to s' with v(s'); and from (s, a) with q(s, a) to (s', a') with q(s', a'). Value of a node = Σ P(path) × Value(path).]

Bellman expectation equations under a given policy:

qπ(s,a) = r_s^a + γ Σ_{s'} P_{ss'}^a Σ_{a'} π(a'|s') qπ(s',a')

vπ(s) = Σ_a π(a|s) ( r_s^a + γ Σ_{s'} P_{ss'}^a vπ(s') )

a system of linear equations; solution: the value of the policy
SLIDE 97

Bellman Optimality Backup

[Same backup diagrams as before.]

Bellman optimality equations under the optimal policy:

q*(s,a) = r_s^a + γ Σ_{s'} P_{ss'}^a maxa' q*(s',a')

v*(s) = maxa ( r_s^a + γ Σ_{s'} P_{ss'}^a v*(s') )

a system of non-linear equations; solution: the value of the optimal policy
SLIDE 98

Value Based

SLIDE 99

Dynamic Programming

[Small grid MDP: states 1–4 with home; actions E, N, W, S; edge rewards 10, -10, -1, -1, -5, -5, -5, -5.]

what's best to do? …using Bellman equations as iterative updates
SLIDE 100

[Same slide as SLIDE 99.]
SLIDE 101

Policy Iteration

Evaluate Policy (sweep states, apply Bellman expectation) ⇄ Improve Policy (greedy)

qπ(s,a) = r_s^a + γ Σ_{s'} P_{ss'}^a Σ_{a'} π(a'|s') qπ(s',a')

[Same small grid MDP as before.]
SLIDE 102

qπ(s,a) = r_s^a + γ Σ_{s'} P_{ss'}^a Σ_{a'} π(a'|s') qπ(s',a')

state 2: N: -5 + 0.9·0, E: -5 + 0.9·0, S: -10 + 0.9·0, W: -1 + 0.9·0  →  π(W|2) = 1.0 (greedy)

state 1: N: -5 + 0.9·0, E: -1 + 0.9·0, S: 10 + 0.9·0, W: -5 + 0.9·0  →  π(S|1) = 1.0 (greedy)

iteratively apply the Bellman expectation equations in an inner loop until values do not change much; then use the greedy policy, given the new values
SLIDE 103

Value Iteration

Find Optimal Value and Policy (sweep states, apply Bellman optimality)

q*(s,a) = r_s^a + γ Σ_{s'} P_{ss'}^a maxa' q*(s',a')

[Same small grid MDP as before.]
SLIDE 104

q*(s,a) = r_s^a + γ Σ_{s'} P_{ss'}^a maxa' q*(s',a')

state 2: N: -5 + 0.9·0, E: -5 + 0.9·0, S: -10 + 0.9·0, W: -1 + 0.9·0
state 1: N: -5 + 0.9·0, E: -1 + 0.9·0, S: 10 + 0.9·0, W: -5 + 0.9·0

iteratively apply the Bellman optimality equations until values do not change much
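A sketch of value iteration as a loop; the Bellman optimality equation is from the slide, but the tiny deterministic MDP wired up here is illustrative (the slide's full grid is not recoverable):

```python
gamma = 0.9

# trans[(s, a)] = (reward, next_state); "T" is terminal with value 0
trans = {
    (1, "S"): (10, "T"), (1, "E"): (-1, 2), (1, "N"): (-5, 1), (1, "W"): (-5, 1),
    (2, "W"): (-1, 1), (2, "S"): (-10, "T"), (2, "N"): (-5, 2), (2, "E"): (-5, 2),
}
states = [1, 2]
V = {s: 0.0 for s in states}
V["T"] = 0.0

for sweep in range(1000):
    delta = 0.0
    for s in states:
        # v*(s) = max_a ( r_s^a + gamma * v*(s') )
        v_new = max(r + gamma * V[s2]
                    for (s0, a), (r, s2) in trans.items() if s0 == s)
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < 1e-6:  # values do not change much any more
        break

print(V)  # V[1] = 10 (go S, home), V[2] = -1 + 0.9*10 = 8 (go W, then S)
```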
SLIDE 105

Bellman backups

[Diagram: value estimates v1, v2 and the optimal values, before and after a Bellman backup.]

the largest distance between values decreases after Bellman backups

SLIDE 106

From DP to Learning

full-width backups to sample backups
SLIDE 107

Full-width Backup

[Backup tree: every branch from the root is considered, down to terminal leaves T.]
SLIDE 108

Backup with Sample Return

[Backup tree: a single sampled trajectory from the root to a terminal leaf T.]
SLIDE 109

Backup with Guess

[Backup tree: one sampled step, then a guessed (bootstrapped) value in place of the rest of the trajectory.]
SLIDE 110

Incremental Updates

E{R} ≈ μk = (1/k) Σ_{τ=1}^{k} Rτ    (batched)

μk = μk−1 + (1/k)(Rk − μk−1)    (incremental)

μk = μk−1 + α(Rk − μk−1)    (running; saw this in Q-learning!)
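The three estimators, side by side (a sketch; the reward list is made up):

```python
rewards = [2.0, 4.0, 9.0, 1.0]

# batched: mu_k = (1/k) * sum of R_1..R_k
mu_batched = sum(rewards) / len(rewards)

# incremental: mu_k = mu_{k-1} + (1/k) * (R_k - mu_{k-1}); same answer
mu = 0.0
for k, R in enumerate(rewards, start=1):
    mu += (1.0 / k) * (R - mu)

# running: constant step size, tracks non-stationary returns;
# the same shape as the Q-learning update we just saw
alpha, mu_run = 0.1, 0.0
for R in rewards:
    mu_run += alpha * (R - mu_run)

print(mu_batched, mu, mu_run)  # 4.0, 4.0, ~1.4
```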

SLIDE 111

Sample and Bootstrap

[2×2 map of backups: bootstrapping (step returns/guess = shallow backup vs. full trajectory returns = deep backup) against sampling (sample backup vs. full-width backup); full-width extremes: dynamic programming and exhaustive search.]
SLIDE 112

It all comes down to:

estimating returns, and optimising towards achieving returns

[Loop diagram: s, r in; a out; via Q / V / π]
SLIDE 113

Q-learning

from full-width backups to sample backups; target policy: optimal
SLIDE 114

SARSA

from full-width backups to sample backups; target policy: same as behaviour policy
SLIDE 115

scaling up RL with function approximation

SLIDE 116

Approximate Q-learning

e.g. linear approximation:

Qθ(s,a) = θ0·f0(s,a) + θ1·f1(s,a) + ... + θn·fn(s,a)

Qtarget = r_s^a + γ maxa' Q(s',a')

θ ← θ − α ∇θ ½ (Qtarget − Qθ(s,a))²
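A sketch of the linear case as code; the feature values and constants are illustrative assumptions:

```python
def q_value(theta, features):
    """Q_theta(s,a) = theta_0*f_0(s,a) + ... + theta_n*f_n(s,a)."""
    return sum(t * f for t, f in zip(theta, features))

def q_update(theta, features, r, max_q_next, alpha=0.1, gamma=0.9):
    """theta <- theta - alpha * grad_theta 1/2 (Q_target - Q_theta(s,a))^2,
    which for a linear Q works out to theta + alpha*(Q_target - Q_theta)*f."""
    q_target = r + gamma * max_q_next
    td_error = q_target - q_value(theta, features)
    return [t + alpha * td_error * f for t, f in zip(theta, features)]

theta = q_update([0.0, 0.0, 0.0], features=[1.0, 0.5, 0.0],
                 r=-1.0, max_q_next=0.0)
print(theta)  # each weight moves in proportion to its feature: [-0.1, -0.05, 0.0]
```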

SLIDE 117

Say θ ∈ ℝ^{S×A}, so Qθ(s,a) = θsa

Qtarget = r_s^a + γ maxa' Q(s',a')

θsa ← θsa − α ∇θsa ½ (Qtarget − θsa)²
θsa ← θsa − α (−Qtarget + θsa)
θsa ← θsa + α (Qtarget − θsa)
θsa ← (1−α) θsa + α Qtarget    (tabular equivalent)

gradient updates equivalent to tabular Q updates

SLIDE 118

DQN

[Diagram: the agent observes an image and the change in score, and emits an action; components: Agent, Buffer, Goal, NN.]

Human-level control through deep reinforcement learning, Mnih et al., Nature 518, Feb 2015
SLIDE 119

human level game control

  • pixel input
  • 18 joystick/button positions output
  • change in game score as feedback
  • convolutional net representing Q
  • backpropagation for training!

Human-level control through deep reinforcement learning, Mnih et al., Nature 518, Feb 2015
http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html

SLIDE 120

neural network

SLIDE 121

backpropagation

What is the target against which to minimise error?

SLIDE 122

experience replay buffer

save transitions (st, at, rt+1, st+1) in memory; randomly sample from memory for training ≈ i.i.d.
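A sketch of such a buffer (capacity and batch size here are small illustrative values; DQN used far larger ones):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)  # oldest transitions fall out

    def save(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # uniform random sampling breaks temporal correlation ~ i.i.d.
        return random.sample(list(self.memory), batch_size)

buffer = ReplayBuffer()
for t in range(100):
    buffer.save(s=t, a=0, r=-1.0, s_next=t + 1, done=False)
print(len(buffer.sample(4)))  # 4 transitions drawn at random
```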
SLIDE 123

freeze target

hold the target network's parameters fixed for a while between periodic copies of the online parameters, so the target does not move with every update step

SLIDE 124

https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf

Human-level control through deep reinforcement learning, Mnih et al., Nature 518, Feb 2015
SLIDE 125

however training is SLOOOOOo….W

SLIDE 126

parallelise…

SLIDE 127

Parallel Asynchronous Training

  • shared parameters
  • parallel agents
  • lock-free updates
  • value and policy based methods

Asynchronous Methods for Deep Reinforcement Learning, Mnih et al., ICML 2016
https://youtu.be/0xo1Ldx3L5Q
SLIDE 128

shared params, parallel learners, HOGWILD! updates

[Diagram: many Agent Copies performing lock-free (HOGWILD!) updates against one shared Agent.]

https://github.com/traai/async-deep-rl

SLIDE 129

Policy Based

SLIDE 130

[Diagram: a policy network π maps state s (via features) to π(north|s), π(south|s), π(east|s), π(west|s): the policy π(a|s).]

SLIDE 131

Intuition

τ: s1, a1, r1, s2, a2, r2, ... , sH−1, aH−1, rH−1

[Three trajectories to home, with returns Rτ1 = 10, Rτ2 = 5, Rτ3 = 2.]

SLIDE 132

Intuition

τ: s1, a1, r1, s2, a2, r2, ... , sH−1, aH−1, rH−1

[Same three trajectories: Rτ1 = 10, Rτ2 = 5, Rτ3 = 2.]

probabilities are relative: make π(a|s) along paths with high return higher
SLIDE 133

Revisiting the Objective

maxθ Eτ{ Σ_{t=0}^{H−1} r(st, at) | πθ }

maxθ J(θ) = maxθ Στ P(τ|θ) R(τ)

τ: s1, a1, r1, s2, a2, r2, ... , sH−1, aH−1, rH−1
SLIDE 134

Samples Gradient

J(θ) = Στ P(τ|θ) R(τ);  maxθ J(θ);  θ ← θ + ∇θJ(θ)

∇θJ(θ) = ∇θ Στ P(τ|θ) R(τ)
= Στ ∇θP(τ|θ) R(τ)
= Στ (P(τ|θ)/P(τ|θ)) ∇θP(τ|θ) R(τ)
= Στ P(τ|θ) (∇θP(τ|θ)/P(τ|θ)) R(τ)
= Στ P(τ|θ) ∇θ logP(τ|θ) R(τ)

∇θJ(θ) ≈ (1/m) Σ_{i=1}^{m} ∇θ logP(τ^(i)|θ) R(τ^(i))    (gradient via sampling)
SLIDE 135

∇θ logP(τ|θ) = ∇θ log[ Π_{t=0}^{H−1} P(st+1|st, at) · πθ(at|st) ]
= ∇θ [ Σ_{t=0}^{H−1} logP(st+1|st, at) + Σ_{t=0}^{H−1} logπθ(at|st) ]    (dynamics model + policy)
= ∇θ Σ_{t=0}^{H−1} logπθ(at|st)    (the dynamics model does not depend on θ)
= Σ_{t=0}^{H−1} ∇θ logπθ(at|st)

∇θJ(θ) ≈ (1/m) Σ_{i=1}^{m} ( Σ_{t=0}^{H−1} ∇θ logπθ(at^(i)|st^(i)) ) R(τ^(i))
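A sketch of the resulting estimator; grad_log_pi stands in for what autodiff computes in practice, and the one-parameter example policy is made up:

```python
import math

def policy_gradient_estimate(trajectories, grad_log_pi, n_params):
    """grad J(theta) ~= (1/m) sum_i [ sum_t grad log pi(a_t|s_t) ] * R(tau_i).

    trajectories: list of (steps, R) pairs, steps a list of (s, a)."""
    m = len(trajectories)
    grad = [0.0] * n_params
    for steps, R in trajectories:
        for s, a in steps:
            for j, g in enumerate(grad_log_pi(s, a)):
                grad[j] += g * R / m  # the return R weights every step's score
    return grad

# tiny example: one-parameter policy pi(a=1|s) = sigmoid(theta), theta = 0
theta = 0.0
def grad_log_pi(s, a):
    p = 1.0 / (1.0 + math.exp(-theta))
    return [1.0 - p] if a == 1 else [-p]

# a high-return trajectory that took a=1, a low-return one that took a=0
print(policy_gradient_estimate([([("s", 1)], 10.0), ([("s", 0)], 2.0)],
                               grad_log_pi, n_params=1))  # positive: push up pi(1)
```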

SLIDE 136

∇θJ(θ) ≈ (1/m) Σ_{i=1}^{m} ( Σ_{t=0}^{H−1} ∇θ logπθ(at^(i)|st^(i)) ) R(τ^(i))

for each action at in state st during each of the m trajectories:

Δθ ∝ ∇θ logπθ(at|st) × R(τ)    to increase πθ(at|st)

SLIDE 137

Noisy Gradient

[The whole-trajectory return R multiplies the score of every one of the H steps, whatever each step contributed.]

Δθ ∝ ∇θ logπθ(at|st) × R    to increase πθ(at|st)

SLIDE 138

Reduce Noise

use only the return from time t onwards:

Δθ ∝ ∇θ logπθ(at|st) × R(τ, t onwards)    to increase πθ(at|st)
SLIDE 139

Reduce Noise

subtract a baseline b, e.g. V = E{R|s} (how much is the action better than average):

Δθ ∝ ∇θ logπθ(at|st) × ( R(τ, t onwards) − b )    to increase πθ(at|st)
SLIDE 140

Reduce Noise

with the state-value baseline b = V(st):

Δθ ∝ ∇θ logπθ(at|st) × ( R(τ, t onwards) − V(st) )    to increase πθ(at|st)
SLIDE 141

Actor-Critic

SLIDE 142

Reduce Noise

a critic estimates Q = E{R|s,a} = E{r + γV} (the expected long term value of an action):

Δθ ∝ ∇θ logπθ(at|st) × ( Q(st,at) − V(st) )    to increase πθ(at|st)
SLIDE 143

Reduce Noise

A = Q − V (the advantage of an action: how much is the action better than average)

Δθ ∝ ∇θ logπθ(at|st) × A(st,at)    to increase πθ(at|st)
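A sketch of this update direction; the critic V (here a plain dict), the grad_log_pi helper, and the constants are all illustrative assumptions:

```python
def actor_critic_step(theta, grad_log_pi, V, s, a, r, s_next, lr=0.01, gamma=0.99):
    """theta <- theta + lr * grad log pi(a|s) * A(s,a),
    with the critic's one-step estimate Q(s,a) ~ r + gamma * V(s')."""
    advantage = (r + gamma * V[s_next]) - V[s]  # A = Q - V
    return [t + lr * g * advantage for t, g in zip(theta, grad_log_pi(s, a))]

V = {"s0": 0.0, "s1": 1.0}
theta = actor_critic_step([0.0], lambda s, a: [0.5], V,
                          s="s0", a="go", r=1.0, s_next="s1")
print(theta)  # theta moves to make 'go' more likely (positive advantage)
```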

SLIDE 144

parallelise…

SLIDE 145

Parallel Asynchronous Training

  • shared parameters
  • parallel agents
  • lock-free updates
  • value and policy based methods

Asynchronous Methods for Deep Reinforcement Learning, Mnih et al., ICML 2016
https://youtu.be/0xo1Ldx3L5Q https://youtu.be/Ajjc08-iPx8 https://youtu.be/nMR5mjCFZCw
SLIDE 146

shared params, parallel learners, HOGWILD! updates

[Diagram: many Agent Copies performing lock-free (HOGWILD!) updates against one shared Agent.]

https://github.com/traai/async-deep-rl

SLIDE 147

PAAC (Parallel Advantage Actor-Critic)

  • 1 GPU/CPU
  • SOTA performance
  • reduced training time

Efficient Parallel Methods for Deep Reinforcement Learning, A. V. Clemente, H. N. Castejón, and A. Chandra, RLDM 2017
https://github.com/alfredvc/paac (Alfredo Clemente)
SLIDE 148

code for you to play with...

Rich Sutton's book examples (exhaustive, must try!):
https://github.com/ShangtongZhang/reinforcement-learning-an-introduction

Telenor's implementation of asynchronous parallel methods:
https://github.com/traai/async-deep-rl

Alfredo's faster parallel methods:
https://github.com/alfredvc/paac

++…

SLIDE 149

Next lecture: Applications (and some hacking) November 21, 2017

https://join.slack.com/t/deep-rl-tutorial/signup

Inspired to code/apply RL?