SLIDE 1

Reinforcement Learning

Rob Platt, Northeastern University. Some images and slides are used from: AIMA, CS188 UC Berkeley.

SLIDE 2

The previous session discussed sequential decision-making problems where the transition model and reward function were known. In many problems, the model and reward are not known in advance; the agent must learn how to act through experience with the world. This session discusses reinforcement learning (RL), where an agent receives a reinforcement signal.

Reinforcement Learning (RL)

SLIDE 3

Exploration of the world must be balanced with exploitation of knowledge gained through experience. Reward may be received long after the important choices have been made, so credit must be assigned to earlier decisions. Must generalize from limited experience.

Challenges in RL

SLIDE 4

Conception of agent

[Diagram: the agent acts on the world and senses it]

SLIDE 5

RL conception of agent

[Diagram: the agent sends actions a to the world and receives states s and rewards r]

The agent takes actions; the agent perceives states and rewards.

Transition model and reward function are initially unknown to the agent! – value iteration assumed knowledge of these two things...

SLIDE 6

Value iteration

We know the probabilities of moving in each direction when an action is executed. We know the reward function.

[Gridworld figure: terminal rewards +1 and −1]
SLIDE 7

Value iteration

We know the probabilities of moving in each direction when an action is executed. We know the reward function.

[Gridworld figure: terminal rewards +1 and −1]
SLIDE 8

Value iteration vs RL

RL still assumes that we have an MDP

[Racing-car MDP diagram: states Cool, Warm, Overheated; actions Slow and Fast; transition probabilities 0.5 and 1.0; rewards +1, +2, and −10]
SLIDE 9

Value iteration vs RL

[Racing-car MDP diagram, as on the previous slide]

RL still assumes that we have an MDP:
– we know S and A
– we still want to calculate an optimal policy
BUT:
– we do not know T or R
– we need to figure out T and R by trying out actions and seeing what happens

SLIDE 10

Example: Learning to Walk

Initial; A Learning Trial; After Learning [1K Trials]

[Kohl and Stone, ICRA 2004]

SLIDE 11

Example: Learning to Walk

Initial

[Kohl and Stone, ICRA 2004]

SLIDE 12

Example: Learning to Walk

Training

[Kohl and Stone, ICRA 2004]

SLIDE 13

Example: Learning to Walk

Finished

[Kohl and Stone, ICRA 2004]

SLIDE 14

Toddler robot uses RL to learn to walk

Tedrake et al., 2005

SLIDE 15

The next homework assignment!

SLIDE 16

Model-based RL

  • 1. estimate T and R by averaging experiences
      – a. choose an exploration policy (one that enables the agent to explore all relevant states)
      – b. follow the policy for a while
      – c. estimate T and R
  • 2. solve for a policy in the estimated MDP (e.g., value iteration)

SLIDE 17

Model-based RL

  • 1. estimate T and R by averaging experiences
      – a. choose an exploration policy (one that enables the agent to explore all relevant states)
      – b. follow the policy for a while
      – c. estimate T and R
  • 2. solve for a policy in the estimated MDP (e.g., value iteration)

Estimate T(s, a, s') from the number of times the agent reached s' by taking a from s, and R(s, a, s') from the set of rewards obtained when reaching s' by taking a from s.

SLIDE 18
Model-based RL

  • 1. estimate T and R by averaging experiences
      – a. choose an exploration policy (one that enables the agent to explore all relevant states)
      – b. follow the policy for a while
      – c. estimate T and R
  • 2. solve for a policy in the estimated MDP (e.g., value iteration)

Estimate T(s, a, s') from the number of times the agent reached s' by taking a from s, and R(s, a, s') from the set of rewards obtained when reaching s' by taking a from s.
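As a concrete illustration (not from the slides), here is a minimal sketch of the estimation step, assuming experience is logged as (s, a, s', r) tuples; the function name and data layout are choices made here, not the lecture's:

    from collections import defaultdict

    def estimate_model(transitions):
        """Estimate T(s'|s,a) and R(s,a,s') by averaging experiences."""
        counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s_next] = visit count
        reward_sums = defaultdict(float)                 # reward_sums[(s, a, s_next)] = summed rewards
        for s, a, s_next, r in transitions:
            counts[(s, a)][s_next] += 1
            reward_sums[(s, a, s_next)] += r

        T, R = {}, {}
        for (s, a), next_counts in counts.items():
            total = sum(next_counts.values())
            for s_next, n in next_counts.items():
                T[(s, a, s_next)] = n / total                        # relative frequency
                R[(s, a, s_next)] = reward_sums[(s, a, s_next)] / n  # average observed reward
        return T, R

Step 2 would then run value iteration on the estimated T and R to obtain a policy.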

What is a downside of this approach?

SLIDE 19

[Grid diagram: states A–E]

Example: Model-based RL

Blue arrows denote the policy. States: a, b, c, d, e. Actions: l, r, u, d. Observations (s, a, s'):

  • 1. b, r, c
  • 2. e, u, c
  • 3. c, r, d
  • 4. b, r, a
  • 5. b, r, c
  • 6. e, u, c
  • 7. e, u, c

Estimates: ?

SLIDE 20

[Grid diagram: states A–E]

Example: Model-based RL

Blue arrows denote the policy. States: a, b, c, d, e. Actions: l, r, u, d. Observations (s, a, s'):

  • 1. b, r, c
  • 2. e, u, c
  • 3. c, r, d
  • 4. b, r, a
  • 5. b, r, c
  • 6. e, u, c
  • 7. e, u, c

Estimates: P(c | e, u) = 1, P(c | b, r) = 0.66, P(a | b, r) = 0.33, P(d | c, r) = 1

SLIDE 21

Model-based vs Model-free

Suppose you want to calculate the average age of the people in this classroom, where each sample is the age of a randomly selected person. The slide shows two methods as equations, Method 1 and Method 2: one estimates the distribution of ages and computes the expected age under it, the other averages the sampled ages directly.

SLIDE 22

Model-based vs Model-free

Suppose you want to calculate the average age of the people in this classroom, where each sample is the age of a randomly selected person. The slide shows two methods as equations, Method 1 and Method 2: one estimates the distribution of ages and computes the expected age under it, the other averages the sampled ages directly. One of these is model-based and the other is model-free (why?).
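Written out (a reconstruction, since the slide's own equations did not survive extraction), with N sampled ages $a_1, \dots, a_N$:

Model-based: first estimate the distribution, then take the expectation under it:
$$\hat{P}(a) = \frac{\mathrm{num}(a)}{N}, \qquad E[A] \approx \sum_a \hat{P}(a)\,a$$

Model-free: average the samples directly:
$$E[A] \approx \frac{1}{N}\sum_{i=1}^{N} a_i$$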

SLIDE 23

Remember this equation? Is this model-based or model-free?

Model-free estimate of the value function

SLIDE 24

Remember this equation? Is this model-based or model-free? How do you make it model-free?

Model-free estimate of the value function

SLIDE 25

Remember this equation? Let's think about this equation first:

Model-free estimate of the value function
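The equation referred to here did not survive extraction; it is presumably the policy-evaluation (Bellman) equation from the previous session, roughly:

$$V^\pi(s) \;=\; \sum_{s'} T(s, \pi(s), s')\,\big[ R(s, \pi(s), s') + \gamma V^\pi(s') \big] \;=\; \mathbb{E}\big[ R(s, \pi(s), s') + \gamma V^\pi(s') \big]$$

The left-hand side is the thing being estimated; the right-hand side is an expectation over next states, which can be replaced by an average over sampled transitions, giving the sample-based estimate discussed on the following slides.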

SLIDE 26

Thing being estimated; expectation

Model-free estimate of the value function

SLIDE 27

Thing being estimated; expectation; sample-based estimate

Model-free estimate of the value function

SLIDE 28

Model-free estimate of the value function

How would we use this equation?
– get a bunch of sample transitions
– for each sample, calculate the quantity inside the expectation
– average the results...

SLIDE 29

Suppose we have a random variable X and we want to estimate the mean from samples x_1, …, x_k.

After k samples: $\hat{x}_k = \frac{1}{k}\sum_{i=1}^{k} x_i$

Can show that: $\hat{x}_k = \hat{x}_{k-1} + \frac{1}{k}(x_k - \hat{x}_{k-1})$

Can be written: $\hat{x}_k = \hat{x}_{k-1} + \alpha(k)\,(x_k - \hat{x}_{k-1})$

The learning rate $\alpha(k)$ can be a function other than $1/k$; loose conditions on the learning rate ensure convergence to the mean. If the learning rate is constant, the weights of older samples decay exponentially at the rate $(1 - \alpha)$: the estimate forgets about the past (distant past values were wrong anyway).

Update rule (weighted moving average): $\hat{x} \leftarrow \hat{x} + \alpha\,(x - \hat{x})$
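A minimal sketch of these two updates in code (an illustration, not from the slides; the constant rate 0.1 is an arbitrary choice):

    def running_mean(samples):
        """Incremental mean with learning rate 1/k; reproduces the batch average exactly."""
        x_hat = 0.0
        for k, x in enumerate(samples, start=1):
            x_hat += (1.0 / k) * (x - x_hat)
        return x_hat

    def moving_average(samples, alpha=0.1):
        """Constant learning rate: older samples are forgotten at rate (1 - alpha)."""
        x_hat = samples[0]
        for x in samples[1:]:
            x_hat += alpha * (x - x_hat)
        return x_hat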

SLIDE 30

Suppose we have a random variable X and we want to estimate the mean from samples x_1, …, x_k.

After k samples: $\hat{x}_k = \frac{1}{k}\sum_{i=1}^{k} x_i$

Can show that: $\hat{x}_k = \hat{x}_{k-1} + \frac{1}{k}(x_k - \hat{x}_{k-1})$

Can be written: $\hat{x}_k = \hat{x}_{k-1} + \alpha(k)\,(x_k - \hat{x}_{k-1})$

Weighted moving average. After several samples, just drop the subscripts: $\hat{x} \leftarrow \hat{x} + \alpha\,(x - \hat{x})$
SLIDE 31

Suppose we have a random variable X and we want to estimate the mean from samples x_1, …, x_k.

After k samples: $\hat{x}_k = \frac{1}{k}\sum_{i=1}^{k} x_i$

Can show that: $\hat{x}_k = \hat{x}_{k-1} + \frac{1}{k}(x_k - \hat{x}_{k-1})$

Can be written: $\hat{x}_k = \hat{x}_{k-1} + \alpha(k)\,(x_k - \hat{x}_{k-1})$

Weighted moving average; just drop the subscripts: $\hat{x} \leftarrow \hat{x} + \alpha\,(x - \hat{x})$

This is called TD value learning; the thing inside the square brackets is called the "TD error".
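The value-function form of this update (the one the "square brackets" remark refers to) did not survive extraction; the standard TD(0) value-learning update is:

$$V^\pi(s) \;\leftarrow\; V^\pi(s) + \alpha \big[\, r + \gamma V^\pi(s') - V^\pi(s) \,\big]$$

where the bracketed term $r + \gamma V^\pi(s') - V^\pi(s)$ is the TD error.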
SLIDE 32

TD Value Learning: example

[Grid diagram: states A–E with their current value estimates]

SLIDE 33

TD Value Learning: example

[Grid diagram: states A–E with their current value estimates]

Observed transition: B, east, C, reward −2

SLIDE 34

TD Value Learning: example

[Grid diagram: states A–E with their current value estimates]

Observed transition: B, east, C, reward −2

SLIDE 35

TD Value Learning: example

[Grid diagram: states A–E with their updated value estimates]

Observed transitions: B, east, C, reward −2; then C, east, D, reward −2

SLIDE 36

TD Value Learning: example

[Grid diagram: states A–E with their updated value estimates]

Observed transitions: B, east, C, reward −2; then C, east, D, reward −2
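A minimal sketch of this example in code (not from the slides). The learning rate α = 1/2, discount γ = 1, and initial value estimates are assumptions, chosen to be consistent with the numbers visible on the slides (V(D) = 8; then V(B) becomes −1 and V(C) becomes 3):

    def td_value_update(V, s, s_next, r, alpha=0.5, gamma=1.0):
        """One TD(0) backup: move V[s] toward the sample r + gamma * V[s_next]."""
        td_error = r + gamma * V[s_next] - V[s]
        V[s] += alpha * td_error

    V = {'A': 0.0, 'B': 0.0, 'C': 0.0, 'D': 8.0, 'E': 0.0}   # assumed initial estimates
    td_value_update(V, 'B', 'C', r=-2)   # B, east, C, -2  ->  V['B'] = -1.0
    td_value_update(V, 'C', 'D', r=-2)   # C, east, D, -2  ->  V['C'] = 3.0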

SLIDE 37

What's the problem w/ TD Value Learning?

SLIDE 38

What's the problem w/ TD Value Learning?

Can't turn the estimated value function into a policy! This is how we did it when we were using value iteration: Why can't we do this now?

SLIDE 39

What's the problem w/ TD Value Learning?

Can't turn the estimated value function into a policy! This is how we did it when we were using value iteration: Why can't we do this now? Solution: Use TD value learning to estimate Q*, not V*
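The policy-extraction equation alluded to here (a reconstruction; it did not survive extraction) is the one-step lookahead used with value iteration:

$$\pi(s) = \arg\max_a \sum_{s'} T(s, a, s')\,\big[ R(s, a, s') + \gamma V(s') \big]$$

Extracting a policy this way requires T and R, which a model-free learner does not have. With Q-values, the greedy policy is simply $\pi(s) = \arg\max_a Q(s, a)$.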

SLIDE 40

How do we estimate Q?

V*(s): the value of being in state s and acting optimally. Q*(s, a): the value of taking action a from state s and then acting optimally. Use this equation inside the value iteration loop we studied last lecture...
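The equation referred to (a reconstruction; it did not survive extraction) is presumably the Bellman relationship between the two:

$$Q^*(s, a) = \sum_{s'} T(s, a, s')\,\big[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \big], \qquad V^*(s) = \max_a Q^*(s, a)$$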

SLIDE 41

Model-free reinforcement learning

Life consists of a sequence of tuples like this: (s, a, s', r'). Use these tuples to get an estimate of Q(s, a). How?

SLIDE 42

Model-free reinforcement learning

Here's how we estimated V: So do the same thing for Q:

SLIDE 43

Model-free reinforcement learning

Here's how we estimated V; so do the same thing for Q. This is called Q-learning, the most famous type of RL.
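The update equations themselves did not survive extraction; the standard forms are the TD value update from the earlier slides and its Q analogue, the Q-learning update:

$$Q(s, a) \;\leftarrow\; Q(s, a) + \alpha \big[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\big]$$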

SLIDE 44

Model-free reinforcement learning

Here's how we estimated V; so do the same thing for Q. [Figure: Q-values learned using Q-learning]

SLIDE 45

Q-Learning
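A minimal sketch of tabular Q-learning in code (an illustration, not from the slides; the environment interface env.reset(), env.step(), env.actions() and the ε-greedy exploration are assumptions used to make the sketch self-contained):

    import random
    from collections import defaultdict

    def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
        """Tabular Q-learning with epsilon-greedy exploration."""
        Q = defaultdict(float)                      # Q[(s, a)], defaults to 0
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                if random.random() < epsilon:       # explore
                    a = random.choice(env.actions(s))
                else:                               # exploit
                    a = max(env.actions(s), key=lambda act: Q[(s, act)])
                s_next, r, done = env.step(s, a)
                best_next = 0.0 if done else max(Q[(s_next, act)] for act in env.actions(s_next))
                # off-policy backup: bootstrap off the best next action, not the one taken
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s_next
        return Q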

SLIDE 46

Q-Learning: properties

Q-learning converges to optimal Q-values if:

  • 1. it explores every s, a, s' transition sufficiently often
  • 2. the learning rate approaches zero (eventually)

Key insight: Q-value estimates converge even if experience is obtained using a suboptimal policy. This is called off-policy learning

SLIDE 47

SARSA

Q-learning vs. SARSA update rules
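The side-by-side update rules (a reconstruction; the slide's equations did not survive extraction) are, in their standard forms:

Q-learning: $Q(s, a) \leftarrow Q(s, a) + \alpha \big[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\big]$

SARSA: $Q(s, a) \leftarrow Q(s, a) + \alpha \big[\, r + \gamma Q(s', a') - Q(s, a) \,\big]$, where $a'$ is the action actually taken in $s'$.

SARSA is on-policy (it evaluates the policy that generated the data, exploration included); Q-learning is off-policy (it backs up the greedy next action regardless of what was actually taken).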

SLIDE 48

Q-learning vs SARSA

Which path does SARSA learn? Which one does Q-learning learn?

SLIDE 49

Q-learning vs SARSA

SLIDE 50

Exploration vs exploitation

Think about how we choose actions: But: if we only take “greedy” actions, then how do we explore? – if we don't explore new states, then how do we learn anything new?

SLIDE 51

Exploration vs exploitation

Think about how we choose actions: But: if we only take “greedy” actions, then how do we explore? – if we don't explore new states, then how do we learn anything new? Taking only greedy actions makes it more likely that you get stuck in local minima in the policy space.

SLIDE 52

Exploration vs exploitation

But: if we only take “greedy” actions, then how do we explore? – if we don't explore new states, then how do we learn anything new? Choose a random action an ε fraction of the time; otherwise, take the greedy action.
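A minimal sketch of this ε-greedy rule (not from the slides; the Q-table and action list are assumed to exist):

    import random

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        """With probability epsilon pick a random action; otherwise act greedily w.r.t. Q."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])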

SLIDE 53

Function approximation

[Gridworld figure: terminal rewards +1 and −1]

So far, the policy is distinct for each state – knowing something about this state tells us nothing about what to do in other states.
SLIDE 54

Function approximation

So far, the policy is distinct for each state – knowing something about this state tells us nothing about what to do in other states.

But what if you have a large state space? How should these states generalize?

SLIDE 55

Solution: describe a state using a vector of features (properties).

Features are functions from states to real numbers (often 0/1) that capture important properties of the state. Example features:

  • distance to closest ghost
  • distance to closest dot
  • number of ghosts
  • 1 / (distance to dot)^2
  • is Pacman in a tunnel? (0/1)
  • … etc.

Can also describe a q-state (s, a) with features (e.g., "action moves closer to food").

Feature-based representations

SLIDE 56

Using a feature representation, we can write a Q-function (or value function) for any state using a few weights. Advantage: our experience is summed up in a few powerful numbers. Disadvantage: states may share features but actually be very different in value!

Linear value functions
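The linear form referred to (a reconstruction; the slide's equations did not survive extraction) is a weighted sum of features:

$$V(s) = w_1 f_1(s) + \dots + w_n f_n(s), \qquad Q(s, a) = w_1 f_1(s, a) + \dots + w_n f_n(s, a)$$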

SLIDE 57

Q-learning with linear Q-functions. Intuitive interpretation: adjust the weights of active features. E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features.

Formal justification: online least squares.

The slide contrasts exact Q updates with approximate Q updates.

Approximate Q-learning
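The exact and approximate updates contrasted above are, in their standard forms (a reconstruction; the slide's equations did not survive extraction):

$$\text{difference} = \big[\, r + \gamma \max_{a'} Q(s', a') \,\big] - Q(s, a)$$

Exact Q's: $Q(s, a) \leftarrow Q(s, a) + \alpha \cdot \text{difference}$

Approximate Q's: $w_i \leftarrow w_i + \alpha \cdot \text{difference} \cdot f_i(s, a)$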

SLIDE 58

Example: Q-Pacman