

SLIDE 1

Reinforcement Learning

Robert Platt, Northeastern University

Some images and slides are used from:

  • CS188, UC Berkeley
  • Russell & Norvig, AIMA
SLIDE 2

Conception of agent

[Diagram: the agent acts on the world and senses the world]

SLIDE 3

RL conception of agent

[Diagram: the agent sends actions a to the world; the world returns states s and rewards r]

The agent takes actions a; it perceives states s and rewards r.

The transition model and reward function are initially unknown to the agent! (Value iteration assumed knowledge of these two things...)

SLIDE 4

Value iteration

We know the probabilities of moving in each direction when an action is executed.
We know the reward function.

Image: Berkeley CS188 course notes (downloaded Summer 2015)

SLIDE 5

Reinforcement Learning

Now we do not know the probabilities of moving in each direction when an action is executed.
We do not know the reward function.

Image: Berkeley CS188 course notes (downloaded Summer 2015)

SLIDE 6

The difference between RL and value iteration

Offline Solution (value iteration) vs. Online Learning (RL)

Image: Berkeley CS188 course notes (downloaded Summer 2015)

SLIDE 7

Value iteration vs RL

[Diagram: racing-car MDP with states Cool, Warm, and Overheated; actions Slow and Fast; transition probabilities 0.5 and 1.0; rewards +1 for Slow, +2 for Fast, and −10 for overheating]

RL still assumes that we have an MDP

Image: Berkeley CS188 course notes (downloaded Summer 2015)

SLIDE 8

Value iteration vs RL

[Diagram: the same racing-car MDP, now with T and R unknown]

RL still assumes that we have an MDP – but we assume we don't know T or R

Image: Berkeley CS188 course notes (downloaded Summer 2015)

SLIDE 9

RL example

https://www.youtube.com/watch?v=goqWX7bC-ZY

SLIDE 10

Model-based RL

  • 1. Estimate T, R by averaging experiences
  • 2. Solve for policy using value iteration
  • a. Choose an exploration policy – a policy that enables the agent to explore all relevant states
  • b. Follow the policy for a while
  • c. Estimate T and R

Image: Berkeley CS188 course notes (downloaded Summer 2015)

SLIDE 12

Model-based RL

  • 1. Estimate T, R by averaging experiences
  • 2. Solve for policy using value iteration
  • a. Choose an exploration policy – a policy that enables the agent to explore all relevant states
  • b. Follow the policy for a while
  • c. Estimate T and R

The transition estimate T(s, a, s') comes from the number of times the agent reached s' by taking a from s; the reward estimate R(s, a, s') comes from the set of rewards obtained when reaching s' by taking a from s.
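A minimal tabular sketch of this counting scheme, assuming experience is a list of (s, a, s', r) tuples (the names and data structures are my own illustration, not the slide's code):

```python
from collections import defaultdict

def estimate_model(experience):
    """Estimate T and R by averaging experiences.

    T_hat(s, a, s') = count(s, a, s') / count(s, a)
    R_hat(s, a, s') = average reward observed on that transition
    """
    counts = defaultdict(int)         # N(s, a, s')
    totals = defaultdict(int)         # N(s, a)
    reward_sums = defaultdict(float)  # sum of rewards at (s, a, s')

    for s, a, s2, r in experience:
        counts[(s, a, s2)] += 1
        totals[(s, a)] += 1
        reward_sums[(s, a, s2)] += r

    T_hat = {k: n / totals[k[:2]] for k, n in counts.items()}
    R_hat = {k: reward_sums[k] / n for k, n in counts.items()}
    return T_hat, R_hat
```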

What's wrong w/ this approach?

SLIDE 13

Model-based vs Model-free learning

Goal: compute the expected age of students in this class.

Known P(A): E[A] = Σ_a P(a) · a

Unknown P(A), “Model Based”: estimate P̂(a) = num(a)/N from samples, then E[A] ≈ Σ_a P̂(a) · a. Why does this work? Because eventually you learn the right model.

Unknown P(A), “Model Free”: without P(A), instead collect samples [a1, a2, …, aN] and average them directly: E[A] ≈ (1/N) Σ_i a_i. Why does this work? Because samples appear with the right frequencies.

Slide: Berkeley CS188 course notes (downloaded Summer 2015)
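The same contrast in runnable form; the ages and probabilities below are invented for illustration:

```python
import random

ages = [18, 19, 20, 21, 22]
probs = [0.1, 0.3, 0.3, 0.2, 0.1]   # true P(A); E[A] = 19.9
samples = random.choices(ages, weights=probs, k=10_000)

# Model-based: estimate P(a) from counts, then take the expectation.
p_hat = {a: samples.count(a) / len(samples) for a in set(samples)}
model_based = sum(a * p for a, p in p_hat.items())

# Model-free: average the samples directly.
model_free = sum(samples) / len(samples)

print(model_based, model_free)  # both approach 19.9
```

Note that with P̂ built from the very same samples the two numbers coincide exactly; the distinction matters once the learned model is reused for planning, as in model-based RL.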

SLIDE 14

RL: model-free learning approach to estimating the value function

Slide: Berkeley CS188 course notes (downloaded Summer 2015)

  • We want to improve our estimate of V by computing these averages:
  • Idea: Take samples of outcomes s’ (by doing the action!) and average

[Diagram: from state s, take action π(s), and observe sampled successor states s1′, s2′, s3′]
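In symbols, per the standard CS188 formulation (reconstructed here):

```latex
\mathrm{sample}_i = R\bigl(s, \pi(s), s_i'\bigr) + \gamma\, V_k^{\pi}(s_i'),
\qquad
V_{k+1}^{\pi}(s) \leftarrow \frac{1}{n} \sum_{i=1}^{n} \mathrm{sample}_i
```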


SLIDE 18

Sidebar: exponential moving average

Slide: Berkeley CS188 course notes (downloaded Summer 2015)

  • Exponential moving average
  • The running interpolation update:
  • Makes recent samples more important:
  • Forgets about the past (distant past values were wrong anyway)
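A minimal sketch of the running interpolation update, x̄ ← (1 − α) · x̄ + α · x (the function name is mine):

```python
def ema(samples, alpha=0.5):
    """Exponential moving average of a stream of samples."""
    x_bar = samples[0]
    for x in samples[1:]:
        x_bar = (1 - alpha) * x_bar + alpha * x  # running interpolation
    return x_bar
```

Unrolling the recursion shows why recent samples dominate: x_n gets weight α, x_{n−1} gets α(1 − α), x_{n−2} gets α(1 − α)², and so on.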
SLIDE 19

TD Value Learning

Slide: Berkeley CS188 course notes (downloaded Summer 2015)

  • Big idea: learn from every experience!
  • Update V(s) each time we experience a transition (s, a, s’, r)
  • Likely outcomes s’ will contribute updates more often
  • Temporal difference learning of values
  • Policy still fixed, still doing evaluation!
  • Move values toward value of whatever successor occurs: running average

Sample of V(s): sample = r + γ V^π(s’)
Update to V(s): V^π(s) ← (1 − α) V^π(s) + α · sample
Same update: V^π(s) ← V^π(s) + α (sample − V^π(s))
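A minimal tabular TD(0) sketch of this update; the environment interface (reset, step) is my own assumption:

```python
from collections import defaultdict

def td_value_learning(env, policy, episodes=1000, alpha=0.5, gamma=1.0):
    """TD policy evaluation: nudge V(s) toward each observed sample."""
    V = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s2, r, done = env.step(policy(s))   # experience (s, a, s', r)
            sample = r + gamma * V[s2]          # sample of V(s)
            V[s] += alpha * (sample - V[s])     # running average
            s = s2
    return V
```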

SLIDE 20

TD Value Learning: example

Slide: Berkeley CS188 course notes (downloaded Summer 2015)

Assume: γ = 1, α = 1/2

Observed transitions: none yet.

[Grid of states A, B, C, D, E. Initial value estimates: V(D) = 8; all others 0.]

SLIDE 21

TD Value Learning: example

Slide: Berkeley CS188 course notes (downloaded Summer 2015)

Assume: γ = 1, α = 1/2

Observed transition: B, east, C, −2

Update: V(B) ← (1 − α) V(B) + α [r + γ V(C)] = ½ · 0 + ½ · (−2 + 1 · 0) = −1

[Values after the update: V(B) = −1, V(D) = 8; all others 0.]

SLIDE 22

TD Value Learning: example

Slide: Berkeley CS188 course notes (downloaded Summer 2015)

Assume: γ = 1, α = 1/2

Observed transition: C, east, D, −2

Update: V(C) ← (1 − α) V(C) + α [r + γ V(D)] = ½ · 0 + ½ · (−2 + 1 · 8) = 3

[Values now: V(B) = −1, V(C) = 3, V(D) = 8; all others 0.]

SLIDE 23

What's the problem w/ TD Value Learning?


SLIDE 25

What's the problem w/ TD Value Learning?

Can't turn the estimated value function into a policy! This is how we did it when we were using value iteration:

π(s) = argmax_a Σ_s′ T(s, a, s′) [R(s, a, s′) + γ V(s′)]

Why can't we do this now? Because this argmax needs T and R, which the agent doesn't know. Solution: use TD value learning to estimate Q*, not V*; then π(s) = argmax_a Q(s, a) requires no model.

SLIDE 26

Detour: Q-Value Iteration

  • Value iteration: find successive (depth-limited) values
  • Start with V0(s) = 0, which we know is right
  • Given Vk, calculate the depth k+1 values for all states:
  • But Q-values are more useful, so compute them instead
  • Start with Q0(s,a) = 0, which we know is right
  • Given Qk, calculate the depth k+1 q-values for all q-states:
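The two updates the colons above refer to, in the standard CS188 notation:

```latex
V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma\, V_k(s')\bigr]

Q_{k+1}(s,a) \leftarrow \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma \max_{a'} Q_k(s',a')\bigr]
```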

Slide: Berkeley CS188 course notes (downloaded Summer 2015)

SLIDE 27

Q-Learning

Slide: Berkeley CS188 course notes (downloaded Summer 2015)

  • Q-Learning: sample-based Q-value iteration
  • Learn Q(s,a) values as you go
  • Receive a sample (s,a,s’,r)
  • Consider your old estimate:
  • Consider your new sample estimate:
  • Incorporate the new estimate into a running average:
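Filling in the referenced quantities: the new sample estimate is r + γ max_a′ Q(s′, a′), and the running average is Q(s, a) ← (1 − α) Q(s, a) + α · sample. A minimal tabular sketch; the environment interface is my own assumption:

```python
from collections import defaultdict
import random

def q_learning(env, episodes=5000, alpha=0.5, gamma=0.9, epsilon=0.1):
    """Sample-based Q-value iteration: learn Q(s, a) values as you go."""
    Q = defaultdict(float)   # Q[(s, a)], starts at 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:                 # explore (see SLIDE 29)
                a = random.choice(env.actions(s))
            else:                                         # exploit
                a = max(env.actions(s), key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)                     # sample (s, a, s', r)
            sample = r if done else r + gamma * max(
                Q[(s2, a2)] for a2 in env.actions(s2))    # no bootstrap at terminal
            Q[(s, a)] += alpha * (sample - Q[(s, a)])     # running average
            s = s2
    return Q
```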
SLIDE 28

Exploration vs. exploitation

Image: Berkeley CS188 course notes (downloaded Summer 2015)

SLIDE 29

Exploration vs. exploitation: ε-greedy action selection

Slide: Berkeley CS188 course notes (downloaded Summer 2015)

  • Several schemes for forcing exploration
  • Simplest: random actions (ε-greedy)
  • Every time step, flip a coin
  • With (small) probability ε, act randomly
  • With (large) probability 1−ε, act on the current policy
  • Problems with random actions?
  • You do eventually explore the space, but keep thrashing around once learning is done
  • One solution: lower ε over time
  • Another solution: exploration functions
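A standalone sketch of the coin-flip rule (names mine):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon act randomly; otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

Lowering epsilon over time (e.g. ε = 1/t) keeps the early exploration while eliminating the late thrashing.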
SLIDE 30

Generalizing across states

Slide: Berkeley CS188 course notes (downloaded Summer 2015)

  • Basic Q-Learning keeps a table of all q-values
  • In realistic situations, we cannot possibly learn about every single state!
  • Too many states to visit them all in training
  • Too many states to hold the q-tables in memory
  • Instead, we want to generalize:
  • Learn about some small number of training states from experience
  • Generalize that experience to new, similar situations
  • This is a fundamental idea in machine learning, and we’ll see it over and over again

SLIDE 31

Generalizing across states

Slide: Berkeley CS188 course notes (downloaded Summer 2015)

Let’s say we discover through experience that this state is bad. In naïve q-learning, we know nothing about this state... or even this one!
SLIDE 32

Feature-based representations

Slide: Berkeley CS188 course notes (downloaded Summer 2015)

  • Solution: describe a state using a vector of features (properties)
  • Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  • Example features:
  • Distance to closest ghost
  • Distance to closest dot
  • Number of ghosts
  • 1 / (dist to dot)²
  • Is Pacman in a tunnel? (0/1)
  • … etc.
  • Is it the exact state on this slide?
  • Can also describe a q-state (s, a) with features (e.g. action moves closer to food)
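A sketch of such a feature function for a Pacman-like q-state; every state field used here (next, ghosts, dots, distance_to) is hypothetical:

```python
def features(state, action):
    """Map a q-state (s, a) to a small dict of real-valued features."""
    nxt = state.next(action)   # hypothetical: state after taking the action
    dist_ghost = min(nxt.distance_to(g) for g in nxt.ghosts)
    dist_dot = min(nxt.distance_to(d) for d in nxt.dots)
    return {
        "bias": 1.0,
        "dist-to-closest-ghost": float(dist_ghost),
        "inverse-dist-to-dot-squared": 1.0 / (dist_dot ** 2 + 1e-6),
        "num-ghosts": float(len(nxt.ghosts)),
    }
```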

SLIDE 33

Linear value functions

Slide: Berkeley CS188 course notes (downloaded Summer 2015)

  • Using a feature representation, we can write a q function (or value function) for any state using a few weights:
  • Advantage: our experience is summed up in a few powerful numbers
  • Disadvantage: states may share features but actually be very different in value!
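The weighted form the first bullet refers to is Q(s, a) = w_1 f_1(s, a) + … + w_n f_n(s, a); evaluating it is one line (a sketch, reusing the feature dict from the previous slide's sketch):

```python
def q_value(weights, feats):
    """Linear Q: sum of weight * feature over the active features."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())
```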

SLIDE 34

Linear value functions

Slide: Berkeley CS188 course notes (downloaded Summer 2015)

  • Q-learning with linear Q-functions:
  • Intuitive interpretation:
  • Adjust weights of active features
  • E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state’s features
  • Formal justification: online least squares

Exact Q’s: Q(s, a) ← Q(s, a) + α · difference
Approximate Q’s: w_i ← w_i + α · difference · f_i(s, a), where difference = [r + γ max_a′ Q(s′, a′)] − Q(s, a)
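A sketch of the approximate update, continuing the hypothetical helpers above:

```python
def update_weights(weights, feats, sample, prediction, alpha=0.05):
    """w_i <- w_i + alpha * difference * f_i(s, a)."""
    difference = sample - prediction   # [r + gamma max_a' Q(s',a')] - Q(s,a)
    for name, value in feats.items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return weights
```

Only the features that were "on" (nonzero) get blamed or credited, which is exactly the intuitive interpretation above.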

SLIDE 35

Example: Q-Pacman

Slide: Berkeley CS188 course notes (downloaded Summer 2015)

SLIDE 36

Q-Learning and Least Squares

Slide: Berkeley CS188 course notes (downloaded Summer 2015)

SLIDE 37

Q-Learning and Least Squares

Slide: Berkeley CS188 course notes (downloaded Summer 2015)

[Worked example: a table of feature values with observed q-values, and the linear model’s predictions for each]

SLIDE 38

Optimization: Least Squares

Slide: Berkeley CS188 course notes (downloaded Summer 2015)

total error = Σ_i (y_i − ŷ_i)², where the prediction is ŷ = Σ_k w_k f_k(x)

[Figure: a fitted line, an observation, its prediction, and the residual (“error”) between them]

SLIDE 39

Minimizing Error

Slide: Berkeley CS188 course notes (downloaded Summer 2015)

Approximate q update explained: imagine we had only one point x, with features f(x), target value y, and weights w. Define the squared error

error(w) = ½ (y − Σ_k w_k f_k(x))²

Its gradient with respect to a weight w_m is

∂error/∂w_m = −(y − Σ_k w_k f_k(x)) f_m(x)

so a gradient-descent step is

w_m ← w_m + α (y − Σ_k w_k f_k(x)) f_m(x)

Here y is the “target” (r + γ max_a′ Q(s′, a′)) and Σ_k w_k f_k(x) is the “prediction” Q(s, a), so gradient descent on the squared error yields exactly the approximate q update.
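A tiny numeric sketch of one such gradient step (all numbers invented):

```python
f = {"bias": 1.0, "dist": 0.5}    # features f(x)
w = {"bias": 0.2, "dist": -0.4}   # current weights
y, alpha = 1.5, 0.1               # target (r + gamma max_a' Q(s',a')) and step size

prediction = sum(w[k] * f[k] for k in f)       # 0.2 - 0.2 = 0.0
for k in f:
    w[k] += alpha * (y - prediction) * f[k]    # w_m += alpha * (y - pred) * f_m(x)

print(w)   # {'bias': 0.35, 'dist': -0.325}
```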