Introduction to Reinforcement Learning
Finale Doshi-Velez, Harvard University, Buenos Aires MLSS 2018


SLIDE 1

Introduction to Reinforcement Learning Finale Doshi-Velez

Harvard University Buenos Aires MLSS 2018

SLIDE 2

We often must make decisions under uncertainty.

How to get to work: walk or bus?

SLIDE 3

We often must make decisions under uncertainty.

What projects to work on?

https://imgs.xkcd.com/comics/automation.png

SLIDE 4

We often must make decisions under uncertainty.

How to improvise with a new recipe?

https://s-media-cache-ak0.pinimg.com/originals/23/ce/4b/23ce4b2fc9014b26d4b811209550ef5b.jpg

SLIDE 5

Some Real Applications of RL

SLIDE 6

Why are these problems hard?

  • Must learn from experience (may have prior experience on the same or a related task)
  • Delayed rewards: actions may have long-term effects (delayed credit assignment)
  • Explore or exploit? Learn and plan together.
  • Generalization (new developments; don’t assume all information has been identified)

SLIDE 7

Reinforcement learning formalizes this problem

World: black box. Agent: models, policies, etc. The agent sends actions to the world; the world returns an observation and a reward.

Objective: Maximize (finite or infinite horizon) E[ Σt γ^t rt ]
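To make the loop concrete, here is a minimal sketch in Python with a made-up two-state toy world (the class and function names are my own, not from the tutorial code): the agent sends actions, the black-box world returns observations and rewards, and one rollout gives a single sample of the discounted objective E[ Σt γ^t rt ].

```python
# A minimal sketch of the agent-world loop and the discounted-return objective.
# The two-state toy world below is invented for illustration.
import random

class ToyWorld:
    """Black-box world: the agent only ever sees observations and rewards."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 tries to flip the state (succeeds with prob 0.75); action 0 does nothing.
        if action == 1 and random.random() < 0.75:
            self.state = 1 - self.state
        reward = 1.0 if self.state == 1 else 0.0   # reward for being in state 1
        return self.state, reward

def rollout_return(world, policy, gamma=0.9, horizon=50):
    """Run one episode and accumulate one sample of sum_t gamma^t r_t."""
    obs = world.reset()
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy(obs)
        obs, reward = world.step(action)
        total += discount * reward
        discount *= gamma
    return total

print(rollout_return(ToyWorld(), policy=lambda s: 1))  # always take action 1
```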

SLIDE 8

Concept Check: Reward Adjustment

  • If I replace every reward r with r + c, does the policy change?
  • If I replace every reward r with c·r, does the policy change?

SLIDE 9

Key Terms

  • Policy π(s,a), or π(s) = a for a deterministic policy
  • State s
  • History h = { s0, a0, r0, s1, a1, … }

Markov Property: p( st+1 | ht ) = p( st+1 | ht-1 , st , at ) = p( st+1 | st , at ) … we'll come back to identifying state later!

SLIDE 10

Markov Decision Process

  • T( s' | s , a ) = Pr( state s' after taking action a in state s )
  • R( s , a , s' ) = E[ reward after taking action a in state s and transitioning to s' ] … but may depend on less, e.g. R( s , a ) or even R( s )

Notice that given a policy, we have a Markov chain to analyze!

[Diagram: a two-state Markov chain (State 0, State 1) with transitions labeled P = 1, R = 0; P = 1, R = 3; P = 1, R = 2; P = .25, R = 1; P = .75, R = 0.]
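As a concrete data structure, a small tabular MDP can be stored as arrays T[s, a, s'] and R[s, a, s']; a minimal sketch below, with numbers invented for illustration (not the exact chain from the slide):

```python
# A minimal sketch of storing a tabular MDP as transition and reward arrays.
import numpy as np

n_states, n_actions = 2, 2
T = np.zeros((n_states, n_actions, n_states))  # T[s, a, s'] = Pr(s' | s, a)
R = np.zeros((n_states, n_actions, n_states))  # R[s, a, s'] = expected reward

# Action 0: stay put, no reward.
T[0, 0, 0] = 1.0
T[1, 0, 1] = 1.0
# Action 1: from state 0, reach state 1 with prob .75; reward 1 on the .25 self-loop.
T[0, 1, 1], T[0, 1, 0] = 0.75, 0.25
R[0, 1, 0] = 1.0
# Action 1: from state 1, return to state 0 with reward 3.
T[1, 1, 0] = 1.0
R[1, 1, 0] = 3.0

assert np.allclose(T.sum(axis=2), 1.0)  # each (s, a) row must be a valid distribution
```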

SLIDE 11

How to Solve an MDP: Value Functions

Value: Vπ(s) = Eπ[ Σt γ^t rt | s0 = s ] … in s, follow π

SLIDE 12

How to Solve an MDP: Value Functions

Value: Vπ(s) = Eπ[ Σt γ^t rt | s0 = s ] … in s, follow π

SLIDE 13

Concept Check: Discounts

[Diagram: a start state S and policies A, B, C, with rewards of 4, 5, 5, and 1 along the paths.]

(1) As functions of γ, what are the values of policies A, B, and C?
(2) When is it better to do B? C?

SLIDE 14

How to Solve an MDP: Value Functions

Value: Vπ(s) = Eπ[ Σt γ^t rt | s0 = s ] … in s, follow π
Action-Value: Qπ(s,a) = Eπ[ Σt γ^t rt | s0 = s, a0 = a ] … in s, do a, then follow π

SLIDE 15

Expanding the expression...

Vπ(s) = Eπ[ Σt γ^t rt | s0 = s ]

Vπ(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Eπ[ Σt γ^t rt | s0 = s' ] ]

Term by term: the next action (you choose, via π), the next state (the world chooses, via T), the next reward, then the discounted future rewards.

SLIDE 16

Expanding the expression...

Vπ(s) = Eπ[ Σt γ^t rt | s0 = s ]

Vπ(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Eπ[ Σt γ^t rt | s0 = s' ] ]

Term by term: next action, next state, next reward, discounted future rewards.

SLIDE 17

Expanding the expression...

Vπ(s) = Eπ[ Σt γ^t rt | s0 = s ]

Vπ(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Eπ[ Σt γ^t rt | s0 = s' ] ]

Term by term: next action, next state, next reward, discounted future rewards.

Vπ(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Vπ(s') ]

SLIDE 18

Expanding the expression...

Vπ(s) = Eπ[ Σt γ^t rt | s0 = s ]

Vπ(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Eπ[ Σt γ^t rt | s0 = s' ] ]

Exercise: Rewrite in the finite-horizon case, making the rewards and transitions depend on time t … notice how thinking about the future is the same as thinking backward from the end!

Term by term: next action, next state, next reward, discounted future rewards.

Vπ(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Vπ(s') ]

SLIDE 19

Optimal Value Functions

V(s) = maxa Q(s,a)
V(s) = maxa Σs' T(s'|s,a) [ r(s,a,s') + γ V(s') ]

Don't average, take the best! The Q-table is the set of values Q(s,a).
Note: we still have problems: the system must be Markov in s, and the size of {s} might be large.

SLIDE 20

Can we solve this? Policy Evaluation

This is a system of linear equations!

Vπ(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Vπ(s') ]
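For a fixed policy this really is just linear algebra. Below is a minimal sketch (the helper name is mine) that solves (I − γ Pπ) Vπ = rπ directly, assuming tabular arrays T[s,a,s'], R[s,a,s'] and a stochastic policy pi[s,a]:

```python
# A minimal sketch: exact policy evaluation by solving a linear system.
import numpy as np

def policy_evaluation_exact(T, R, pi, gamma):
    n_states = T.shape[0]
    # P_pi[s, s'] = sum_a pi(a|s) T(s'|s, a)
    P_pi = np.einsum('sa,sap->sp', pi, T)
    # r_pi[s] = sum_a pi(a|s) sum_s' T(s'|s, a) R(s, a, s')
    r_pi = np.einsum('sa,sap,sap->s', pi, T, R)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Example (hypothetical): evaluate the uniform-random policy on the toy T, R above.
# pi = np.full((2, 2), 0.5); print(policy_evaluation_exact(T, R, pi, gamma=0.9))
```

A direct solve like this is fine when the number of states is small; otherwise the iterative version on the next slide is the practical choice.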

SLIDE 21

Can we solve this? Policy Evaluation

This is a system of linear equations! We can also solve it iteratively. The iteration will converge because the Bellman operator is a contraction: the initial value V0(s) is pushed into the past as the “collected data” r(s,a) takes over.

Vπ(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Vπ(s') ]

Vπ0(s) = c
Vπk(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Vπk−1(s') ]

SLIDE 22

Can we solve this? Policy Evaluation

This is a system of linear equations! We can also solve it iteratively. The iteration will converge because the Bellman operator is a contraction: the initial value V0(s) is pushed into the past as the “collected data” r(s,a) takes over. Finally, we can apply Monte Carlo: run many simulations from s and see what Vπ(s) comes out to be.

Vπ(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Vπ(s') ]

Vπ0(s) = c
Vπk(s) = Σa π(a|s) Σs' T(s'|s,a) [ r(s,a,s') + γ Vπk−1(s') ]
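A minimal sketch of the iterative version, under the same assumptions as before (tabular T, R, pi arrays; names are mine): repeatedly apply the Bellman operator until the values stop changing.

```python
# A minimal sketch of iterative policy evaluation.
import numpy as np

def policy_evaluation_iterative(T, R, pi, gamma, tol=1e-8):
    V = np.zeros(T.shape[0])                     # V_0(s) = c, here c = 0
    expected_R = (T * R).sum(axis=2)             # E[r | s, a]
    while True:
        # V_k(s) = sum_a pi(a|s) [ E[r|s,a] + gamma * sum_s' T(s'|s,a) V_{k-1}(s') ]
        Q = expected_R + gamma * T @ V           # shape (S, A)
        V_new = (pi * Q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```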

SLIDE 23

Policy Improvement Theorem

Let π, π' be two policies that are the same except for the action they recommend at state s. If Qπ( s, π'(s) ) > Qπ( s, π(s) ), then Vπ'(s) > Vπ(s). This gives us a way to improve policies: just be greedy with respect to Q!

SLIDE 24

Policy Iteration

Will converge; each step requires a potentially expensive policy evaluation computation.

Select some policy π → solve for Vπ → improve policy π → solve for Vπ → …
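Putting policy evaluation and the improvement theorem together gives the loop above; here is a minimal sketch, reusing the hypothetical policy_evaluation_exact helper from the earlier slide:

```python
# A minimal sketch of policy iteration: evaluate the current policy, then act greedily.
import numpy as np

def policy_iteration(T, R, gamma):
    n_states, n_actions = T.shape[0], T.shape[1]
    policy = np.zeros(n_states, dtype=int)             # arbitrary initial deterministic policy
    expected_R = (T * R).sum(axis=2)
    while True:
        pi = np.eye(n_actions)[policy]                 # one-hot (S, A) representation
        V = policy_evaluation_exact(T, R, pi, gamma)   # expensive step: solve for V^pi
        Q = expected_R + gamma * T @ V                 # greedy improvement w.r.t. Q^pi
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```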

SLIDE 25

Value Iteration

Vk(s) = maxa Σs' T(s'|s,a) [ r(s,a,s') + γ Vk−1(s') ]

Each backup folds a one-step policy evaluation and a policy improvement (the max over actions) into a single update.

Also converges (contraction). Note that in the tabular case, this is a bunch of inexpensive matrix operations!
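A minimal sketch of value iteration on the same tabular arrays (names are mine); the whole backup really is a couple of matrix operations:

```python
# A minimal sketch of value iteration.
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    V = np.zeros(T.shape[0])
    expected_R = (T * R).sum(axis=2)            # E[r | s, a]
    while True:
        Q = expected_R + gamma * T @ V          # Q_k(s, a)
        V_new = Q.max(axis=1)                   # don't average, take the best
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)      # optimal values and a greedy policy
        V = V_new
```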

SLIDE 26

Linear programming

min Σs μ(s) V(s)
s.t. V(s) ≥ Σs' T(s'|s,a) [ r(s,a,s') + γ V(s') ]   for all a, s

For any (positive) weights μ; equality holds for the best action at optimality.
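A minimal sketch of that LP using scipy.optimize.linprog, with the arbitrary choice μ(s) = 1 (the helper name is mine):

```python
# A minimal sketch of the LP formulation for an MDP.
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(T, R, gamma):
    n_states, n_actions = T.shape[0], T.shape[1]
    A_ub, b_ub = [], []
    for s in range(n_states):
        for a in range(n_actions):
            # V(s) >= sum_s' T(s'|s,a) [ r(s,a,s') + gamma V(s') ]
            # rewritten as (gamma * T[s,a,:] - e_s) @ V <= -sum_s' T(s'|s,a) r(s,a,s')
            row = gamma * T[s, a]                 # new array, safe to modify
            row[s] -= 1.0
            A_ub.append(row)
            b_ub.append(-np.dot(T[s, a], R[s, a]))
    res = linprog(c=np.ones(n_states),            # mu(s) = 1 for every state
                  A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_states)  # V is unconstrained in sign
    return res.x
```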

SLIDE 27

Learning from Experience: Reinforcement Learning

Now, instead of the transition T and reward R, we assume that we only have histories. Why is this case interesting?

  • May not have the model
  • Even if we have the model (e.g. the rules of Go, or Atari simulator code), learning from experience focuses attention on the right place

SLIDE 28

Taxonomy of Approaches

  • Forward Search/Monte Carlo: Simulate the future, pick the best one (with or without a model).
  • Value function: Learn V(s).
  • Policy Search: Parametrize the policy πθ(s) and search for the best parameters θ; often good for systems in which the cardinality of θ is small.

SLIDE 29

Taxonomy of Approaches

  • Forward Search/Monte Carlo: Simulate the future, pick the best one (with or without a model).
  • Value function: Learn V(s).
  • Policy Search: Parametrize the policy πθ(s) and search for the best parameters θ; often good for systems in which the cardinality of θ is small.

SLIDE 30

Monte Carlo Policy Evaluation

1) Generate N sequences of length T from state s0, following π, to estimate Vπ(s0).
2) If π has some randomness, or if we do s0, a0 and then follow π, we can do policy improvement.
… might need a lot of data! But okay if we have a black-box simulator.

SLIDE 31

Monte Carlo Policy Evaluation

1) Generate N sequences of length T from state s0, following π, to estimate Vπ(s0).
2) If π has some randomness, or if we do s0, a0 and then follow π, we can do policy improvement.
… might need a lot of data! But okay if we have a black-box simulator.

Add sophistication: UCT, MCTS
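A minimal sketch of Monte Carlo policy evaluation; here `simulate` is a stand-in for whatever black-box simulator you have and is assumed to return the rewards of one length-T rollout from s0 under π:

```python
# A minimal sketch of Monte Carlo policy evaluation with a black-box simulator.
import numpy as np

def mc_value_estimate(simulate, s0, policy, gamma, n_rollouts=1000, horizon=100):
    returns = []
    for _ in range(n_rollouts):
        rewards = simulate(s0, policy, horizon)               # one sampled trajectory from s0
        g = sum(gamma ** t * r for t, r in enumerate(rewards))
        returns.append(g)
    return np.mean(returns)                                   # estimate of V^pi(s0)
```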

SLIDE 32

Taxonomy of Approaches

  • Forward Search/Monte Carlo: Simulate the future, pick the best one (with or without a model).
  • Value function: Learn V(s).
  • Policy Search: Parametrize the policy πθ(s) and search for the best parameters θ; often good for systems in which the cardinality of θ is small.

SLIDE 33

Temporal Difference

Vπ(s) = Eπ[ Σt γ^t rt | s0 = s ] = Eπ[ r0 + γ Vπ(s') ]

(The first expectation is what a Monte Carlo estimate targets; the second is the dynamic-programming form.)

TD: Start with some V(s), do π(s), and update:

Vπ(s) ← Vπ(s) + αt ( r0 + γ Vπ(s') − Vπ(s) )

Original value, plus the temporal difference: the error between the sampled value of where you went and the stored value.

Will converge if Σt αt → ∞ and Σt αt² is finite.
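A minimal sketch of the tabular TD(0) update applied along one long stream of experience (the env_step and policy callables are placeholders, not the tutorial code):

```python
# A minimal sketch of tabular TD(0) policy evaluation from experience.
import numpy as np

def td0(env_step, policy, n_states, s0, gamma, alpha=0.1, n_steps=10000):
    V = np.zeros(n_states)
    s = s0
    for _ in range(n_steps):
        a = policy(s)
        s_next, r = env_step(s, a)                       # sample one transition
        V[s] += alpha * (r + gamma * V[s_next] - V[s])   # TD-error update
        s = s_next
    return V
```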

SLIDE 34

Monte Carlo (only one trajectory)

SLIDE 35

Value Iteration (all actions)

SLIDE 36

Temporal Difference

SLIDE 37

Example (S&B 6.4, Let γ = 1)

Two states (A,B). Two rewards (0,1). Suppose we have seen the histories: A0B0 B1 B1 B1 B1 B1 B1 B0 MC estimate of V(B)? TD estimate of V(B)?

SLIDE 38

Example (S&B 6.4, Let γ = 1)

Two states (A,B). Two rewards (0,1). Suppose we have seen the histories: A0B0 B1 B1 B1 B1 B1 B1 B0 MC estimate of V(B)? VMC(B) = ¾ TD estimate of V(B)? VTD(B) = ¾

SLIDE 39

Example (S&B 6.4, Let γ = 1)

Two states (A,B). Two rewards (0,1). Suppose we have seen the histories: A0B0 B1 B1 B1 B1 B1 B1 B0 MC estimate of V(B)? VMC(B) = ¾ TD estimate of V(B)? VTD(B) = ¾ MC estimate of V(A)? TD estimate of V(A)?

SLIDE 40

Example (S&B 6.4, Let γ = 1)

Two states (A,B). Two rewards (0,1). Suppose we have seen the histories: A0B0 B1 B1 B1 B1 B1 B1 B0 MC estimate of V(B)? VMC(B) = ¾ TD estimate of V(B)? VTD(B) = ¾ MC estimate of V(A)? VMC(A) = 0 TD estimate of V(A)? VTD(A) = ¾ (because A→B)

SLIDE 41

Concept Check: DP, MC, TD

[Diagram: states A, B, C, D, E, F with a reward of 100; values are initialized with the rewards.]

(1) What would one round of value iteration do?
(2) What would MC do after ABCF?
(3) What would TD do after ABCF? (α = 1)

SLIDE 42

From Policy Evaluation to Optimization

SARSA (on-policy): improve what you did.

Q(s,a) ← Q(s,a) + αt ( rt + γ Q(s',a') − Q(s,a) )

SLIDE 43

From Policy Evaluation to Optimization

SARSA (on-policy): improve what you did.
Q(s,a) ← Q(s,a) + αt ( rt + γ Q(s',a') − Q(s,a) )

Q-learning (off-policy): improve what you could do.
Q(s,a) ← Q(s,a) + αt ( rt + γ maxa' Q(s',a') − Q(s,a) )
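A minimal sketch of the two updates on a tabular Q array (helper names are mine); the only difference is which next-step value they bootstrap from:

```python
# A minimal sketch of the SARSA and Q-learning update rules.
import numpy as np

def q_learning_step(Q, s, a, r, s_next, gamma, alpha):
    # Off-policy: bootstrap from the best action you *could* take next.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def sarsa_step(Q, s, a, r, s_next, a_next, gamma, alpha):
    # On-policy: bootstrap from the action a' you actually take next.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```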

SLIDE 44

Concept Check

[Diagram: a gridworld with start state S and rewards of 100, +1, and +10.]

Let δ be the transition noise. All actions cost −0.1.
(1) What is the optimal policy for (γ=.1, δ=.5)? (γ=.1, δ=0)? (γ=.99, δ=.5)? (γ=.99, δ=0)?
(2) Using an ε-greedy policy with ε=.5, γ=.99, δ=0: what will SARSA learn? What will Q-learning learn?

SLIDE 45

MC + TD: Eligibility Traces

TD(0): V(s) ← V(s) + αt ( rt + γ V(s') − V(s) )   … biased estimate of the future
TD(1): V(s) ← V(s) + αt ( rt + γ rt+1 + γ² V(s'') − V(s) )   … less bias, more variance
…
Until we get to MC (all variance, no bias).

SLIDE 46

Eligibility traces average over all backups

Image from S&B.

Forward view (can't implement):

(1−λ) Σn λ^(n−1) [ rt + γ rt+1 + γ² rt+2 + … + γ^n V(st+n) ]

An average of the n-step returns (we don't know all these future values).

SLIDE 47

Eligibility traces average over all backups

Image from S&B

Backward view: credit assignment back in time.

zt(s) = γ λ zt−1(s)   for all s except st
zt(st) = 1 + γ λ zt−1(st)

∀ s: V(s) ← V(s) + αt zt(s) ( rt + γ V(st+1) − V(st) )
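A minimal sketch of the backward view, TD(λ) with accumulating traces, under the same placeholder env_step/policy assumptions as the TD(0) sketch:

```python
# A minimal sketch of TD(lambda) with accumulating eligibility traces.
import numpy as np

def td_lambda(env_step, policy, n_states, s0, gamma, lam, alpha=0.1, n_steps=10000):
    V = np.zeros(n_states)
    z = np.zeros(n_states)                      # eligibility trace z_t(s)
    s = s0
    for _ in range(n_steps):
        a = policy(s)
        s_next, r = env_step(s, a)
        z *= gamma * lam                        # decay all traces
        z[s] += 1.0                             # bump the state just visited
        delta = r + gamma * V[s_next] - V[s]    # TD error
        V += alpha * z * delta                  # every state gets credit via its trace
        s = s_next
    return V
```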

SLIDE 48

Interlude: What about actions??

Given some Q(s,a), how do you choose the action to take? Want to balance exploration with exploitation. Two simple strategies:

  • Epsilon-greedy: take argmaxa Q(s,a) with probability (1−ε), else take a random action.
  • Softmax: take actions with probability proportional to exp( τ Q(s,a) ).
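A minimal sketch of both strategies in one helper (the function name is mine; τ is an inverse temperature, matching the slide's exp( τ Q(s,a) ) convention, and rng is assumed to be a numpy Generator such as np.random.default_rng()):

```python
# A minimal sketch of epsilon-greedy and softmax action selection over Q(s, .).
import numpy as np

def select_action(Q_s, rng, epsilon=None, tau=None):
    if epsilon is not None:                          # epsilon-greedy
        if rng.random() < epsilon:
            return int(rng.integers(len(Q_s)))       # random exploratory action
        return int(np.argmax(Q_s))                   # greedy action
    prefs = tau * Q_s                                # softmax preferences
    prefs = prefs - prefs.max()                      # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(Q_s), p=probs))
```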

SLIDE 49

More general principles

Lots of research about curiosity, value of future information, etc. Important ideas:

  • Learning has utility (succeed-or-learn)
  • Optimism under uncertainty

Examples: interval exploration, UCB/UCT, E3, RMAX. Recent advances in PSRL.

SLIDE 50

Taxonomy of Approaches

  • Forward Search/Monte Carlo: Simulate the future, pick the best one (with or without a model).
  • Value function: Learn V(s).
  • Policy Search: Parametrize the policy πθ(s) and search for the best parameters θ; often good for systems in which the cardinality of θ is small.

Next Speaker: Sergey Levine

SLIDE 51

Practical time!

Clone the code from https://github.com/dtak/tutorial-rl.git and follow the instructions in tutorial.py.