SLIDE 1

CS325 Artificial Intelligence

  • Ch. 21 – Reinforcement Learning

Cengiz Günay, Emory Univ. Spring 2013

SLIDE 2

Rats!

Fundooprofessor

A rat is put in a cage with a lever. Each lever press sends a signal to the rat’s brain, to its reward center.

SLIDE 3

Rats!

Fundooprofessor

A rat is put in a cage with a lever. Each lever press sends a signal to the rat’s brain, to its reward center. The rat presses the lever continuously until . . .

SLIDE 4

Rats!

Fundooprofessor

A rat is put in a cage with a lever. Each lever press sends a signal to the rat’s brain, to its reward center. The rat presses the lever continuously until . . . it dies, because it stops eating and drinking.

SLIDE 5

[Image: Wikipedia.org]

SLIDE 6

Dopamine Neurons Respond to Novelty

sciencemuseum.org.uk

Schultz et al. (1997)

SLIDE 7

Dopamine Neurons Respond to Novelty

sciencemuseum.org.uk

It turns out: novelty detection = the Temporal Difference rule in Reinforcement Learning (Sutton and Barto, 1981)

Schultz et al. (1997)

SLIDE 8

SLIDE 9

[Diagram: learning agent and environment, with Performance standard, Critic (feedback), Learning element (changes, knowledge, learning goals), Problem generator, Performance element, Sensors, and Actuators.]

SLIDE 10

Entry/Exit Surveys

Exit survey: Planning Under Uncertainty

  • Why can’t we use a regular MDP for partially-observable situations?
  • Give an example where you think MDPs would help you solve a problem in your daily life.

Entry survey: Reinforcement Learning (0.25 points of final grade)

  • In a partially-observable scenario, can reinforcement learning be used to learn MDP rewards?
  • How can we improve MDPs by using the plan-execute cycle?

SLIDE 11

Blindfolded MDPs: Enter Reinforcement Learning

[4×3 grid world: goal G in row a, agent xₜ in row b, start S in row c.]

What if the agent does not know anything about:
  • where walls are
  • where goals/penalties are

SLIDE 12

Blindfolded MDPs: Enter Reinforcement Learning

[4×3 grid world: goal G in row a, agent xₜ in row b, start S in row c.]

What if the agent does not know anything about:
  • where walls are
  • where goals/penalties are

Can we use the plan-execute cycle?

SLIDE 13

Blindfolded MDPs: Enter Reinforcement Learning

[4×3 grid world: goal G in row a, agent xₜ in row b, start S in row c.]

What if the agent does not know anything about:
  • where walls are
  • where goals/penalties are

Can we use the plan-execute cycle?
  • Explore first
  • Update the world state based on reward/reinforcement

⇒ Reinforcement Learning (see the Scholarpedia article)

SLIDE 14

Where Does Reinforcement Learning Fit?

Machine learning so far:

SLIDE 15

Where Does Reinforcement Learning Fit?

Machine learning so far:
  • Unsupervised learning: find regularities in input data, x

SLIDE 16

Where Does Reinforcement Learning Fit?

Machine learning so far:
  • Unsupervised learning: find regularities in input data, x
  • Supervised learning: find a mapping between input and output, f(x) → y

SLIDE 17

Where Does Reinforcement Learning Fit?

Machine learning so far:
  • Unsupervised learning: find regularities in input data, x
  • Supervised learning: find a mapping between input and output, f(x) → y
  • Reinforcement learning: find a mapping between states and actions, s → a

SLIDE 18

Where Does Reinforcement Learning Fit?

Machine learning so far:
  • Unsupervised learning: find regularities in input data, x
  • Supervised learning: find a mapping between input and output, f(x) → y
  • Reinforcement learning: find a mapping between states and actions, s → a (by finding the optimal policy, π(s) → a)

SLIDE 19

Where Does Reinforcement Learning Fit?

Machine learning so far:
  • Unsupervised learning: find regularities in input data, x
  • Supervised learning: find a mapping between input and output, f(x) → y
  • Reinforcement learning: find a mapping between states and actions, s → a (by finding the optimal policy, π(s) → a)

Which is it?

(S = supervised, U = unsupervised, R = reinforcement)
  • Speech recognition: connect sounds to transcripts
  • Star data: find groupings from spectral emissions
  • Rat presses lever: gets reward based on certain conditions
  • Elevator controller: multiple elevators, minimize wait time

SLIDE 20

Where Does Reinforcement Learning Fit?

Machine learning so far:
  • Unsupervised learning: find regularities in input data, x
  • Supervised learning: find a mapping between input and output, f(x) → y
  • Reinforcement learning: find a mapping between states and actions, s → a (by finding the optimal policy, π(s) → a)

Which is it?

(S = supervised, U = unsupervised, R = reinforcement)
  • Speech recognition: connect sounds to transcripts → S
  • Star data: find groupings from spectral emissions → U
  • Rat presses lever: gets reward based on certain conditions → R
  • Elevator controller: multiple elevators, minimize wait time → R

SLIDE 21

But, Wasn’t That What Markov Decision Processes Were?

Find the optimal policy to maximize reward:

π(s) = argmax_π E[ Σ_{t=0}^{∞} γ^t R(s, π(s), s′) ],

with reward at a state, R(s), or from an action, R(s, a, s′).

SLIDE 22

But, Wasn’t That What Markov Decision Processes Were?

Find the optimal policy to maximize reward:

π(s) = argmax_π E[ Σ_{t=0}^{∞} γ^t R(s, π(s), s′) ],

with reward at a state, R(s), or from an action, R(s, a, s′).

By estimating utility values:

V(s) ← max_a [ γ Σ_{s′} P(s′|s, a) V(s′) ] + R(s),

with transition probabilities P(s′|s, a).

SLIDE 23

But, Wasn’t That What Markov Decision Processes Were?

Find the optimal policy to maximize reward:

π(s) = argmax_π E[ Σ_{t=0}^{∞} γ^t R(s, π(s), s′) ],

with reward at a state, R(s), or from an action, R(s, a, s′).

By estimating utility values:

V(s) ← max_a [ γ Σ_{s′} P(s′|s, a) V(s′) ] + R(s),

with transition probabilities P(s′|s, a).

Assumes we know R(s) and P(s′|s, a).
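Not in the original slides: a minimal Python sketch of value iteration for this known-model case; the states, actions, rewards, and transitions containers are illustrative assumptions.

```python
# Value iteration when the model is known:
#   V(s) <- R(s) + gamma * max_a sum_{s'} P(s'|s,a) * V(s')
# transitions[(s, a)] is assumed to be a list of (s_next, probability) pairs.

def value_iteration(states, actions, rewards, transitions, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(sum(p * V[s2] for s2, p in transitions[(s, a)])
                       for a in actions)
            new_v = rewards[s] + gamma * best
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:          # values have converged
            return V
```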

SLIDE 24

Blindfolded Agent Must Learn From Rewards

Don’t know R(s) or P(s′|s, a). What to do?

SLIDE 25

Blindfolded Agent Must Learn From Rewards

Don’t know R(s) or P(s′|s, a). What to do? Use Reinforcement Learning (RL) to explore and find rewards

SLIDE 26

Blindfolded Agent Must Learn From Rewards

Don’t know R(s) or P(s′|s, a). What to do? Use Reinforcement Learning (RL) to explore and find rewards

Agent types:

  Agent type         knows    learns     uses
  Utility agent      P        R → U      U
  Q-learning (RL)             Q(s, a)    Q
  Reflex                      π(s)       π(s)

SLIDE 27

Video: Backgammon and Choppers

SLIDE 28

How Much to Learn?

1 Passive RL: Simple Case

  • Keep the policy π(s) fixed, learn the rest
  • Always do the same actions, and learn utilities
  • Examples: public transit commute; learning a difficult game

SLIDE 29

How Much to Learn?

1 Passive RL: Simple Case

  • Keep the policy π(s) fixed, learn the rest
  • Always do the same actions, and learn utilities
  • Examples: public transit commute; learning a difficult game

2 Active RL

  • Learn the policy at the same time
  • Help exploration by changing the policy
  • Example: driving your own car

SLIDE 30

RL in Practice: Temporal Difference (TD) Rule

Animals use a derivative. Remember value iteration:

V(s) ← max_a [ γ Σ_{s′} P(s′|s, a) V(s′) ] + R(s).

SLIDE 31

RL in Practice: Temporal Difference (TD) Rule

Animals use a derivative. Remember value iteration:

V(s) ← max_a [ γ Σ_{s′} P(s′|s, a) V(s′) ] + R(s).

TD rule: use the derivative when going s → s′:

V(s) ← V(s) + α [ R(s) + γ V(s′) − V(s) ],

where α is the learning rate and γ is the discount factor.

SLIDE 32

RL in Practice: Temporal Difference (TD) Rule

Animals use a derivative. Remember value iteration:

V(s) ← max_a [ γ Σ_{s′} P(s′|s, a) V(s′) ] + R(s).

TD rule: use the derivative when going s → s′:

V(s) ← V(s) + α [ R(s) + γ V(s′) − V(s) ],

where α is the learning rate and γ is the discount factor. It’s even simpler than before!
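Not in the original slides, just to make the rule concrete: a minimal TD(0) update in Python; state names and values are illustrative.

```python
# One TD(0) update for an observed transition s -> s_next:
#   V(s) <- V(s) + alpha * (R(s) + gamma * V(s') - V(s))
def td_update(V, s, s_next, reward, alpha=0.1, gamma=1.0):
    td_error = reward + gamma * V[s_next] - V[s]   # the "derivative" (prediction error)
    V[s] += alpha * td_error
    return V[s]

# Example: observing a3 -> a4 with R(a3) = 0 and V(a4) = 1, alpha = 1/2:
V = {"a3": 0.0, "a4": 1.0}
td_update(V, "a3", "a4", reward=0.0, alpha=0.5)    # V(a3) becomes 0.5
```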

SLIDE 33

Passive RL: Simple Case

[4×3 grid world: +1 goal in row a, agent xₜ and −1 in row b, start S in row c.]

  • Keep the same policy
  • That is, follow the same path and update values, V(s)

SLIDE 34

Passive RL: Simple Case

[4×3 grid world: +1 goal in row a, agent xₜ and −1 in row b, start S in row c.]

  • Keep the same policy
  • That is, follow the same path and update values, V(s)
  • To mimic increasing confidence, reduce the learning rate with the number of visits, N(s):

α = 1 / (N(s) + 1),

like in simulated annealing.

SLIDE 35

Passive RL: Simple Case

[4×3 grid world: +1 goal in row a, agent xₜ and −1 in row b, start S in row c.]

  • Keep the same policy
  • That is, follow the same path and update values, V(s)
  • To mimic increasing confidence, reduce the learning rate with the number of visits, N(s):

α = 1 / (N(s) + 1),

like in simulated annealing.

TD rule:

V(s) ← V(s) + 1/(N(s) + 1) [ R(s) + γ V(s′) − V(s) ]
SLIDE 36

Passive RL: Simple Case (2)

[4×3 grid world with the fixed policy drawn in: row a: → → → +1; row b: ↑, xₜ, −1; row c: S.]

V(s) ← V(s) + Δ,    Δ = 1/(N(s) + 1) [ R(s) + γ V(s′) − V(s) ]

For simplicity, γ = 1.

SLIDE 37

Passive RL: Simple Case (2)

[4×3 grid world with the fixed policy drawn in: row a: → → → +1; row b: ↑, xₜ, −1; row c: S.]

V(s) ← V(s) + Δ,    Δ = 1/(N(s) + 1) [ R(s) + γ V(s′) − V(s) ]

For simplicity, γ = 1.

  transition    N    V(s)    Δ
  a3 → a4       1            1/2

SLIDE 38

Passive RL: Simple Case (2)

[4×3 grid world with the fixed policy drawn in: row a: → → → +1; row b: ↑, xₜ, −1; row c: S.]

V(s) ← V(s) + Δ,    Δ = 1/(N(s) + 1) [ R(s) + γ V(s′) − V(s) ]

For simplicity, γ = 1.

  transition    N    V(s)    Δ
  a3 → a4       1            1/2
  a2 → a3       2            1/6

SLIDE 39

Passive RL: Simple Case (2)

[4×3 grid world with the fixed policy drawn in: row a: → → → +1; row b: ↑, xₜ, −1; row c: S.]

V(s) ← V(s) + Δ,    Δ = 1/(N(s) + 1) [ R(s) + γ V(s′) − V(s) ]

For simplicity, γ = 1.

  transition    N    V(s)    Δ
  a3 → a4       1            1/2
  a2 → a3       2            1/6
  a3 → a4       2    1/2     1/6

SLIDE 40

Passive RL: Simple Case (2)

[4×3 grid world with the fixed policy drawn in: row a: → → → +1; row b: ↑, xₜ, −1; row c: S.]

V(s) ← V(s) + Δ,    Δ = 1/(N(s) + 1) [ R(s) + γ V(s′) − V(s) ]

For simplicity, γ = 1.

  transition    N    V(s)    Δ
  a3 → a4       1            1/2
  a2 → a3       2            1/6
  a3 → a4       2    1/2     1/6

Convergence time?
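Not in the original slides: a small Python sketch of this passive-TD walk along the fixed path, assuming R = 0 for non-terminal states and a terminal value V(a4) = +1; it reproduces the Δ values in the table.

```python
# Passive TD along the fixed path a2 -> a3 -> a4, with gamma = 1 and alpha = 1/(N(s)+1).
V = {"a2": 0.0, "a3": 0.0, "a4": 1.0}   # terminal a4 pinned at +1
N = {"a2": 0, "a3": 0}

def passive_td_step(s, s_next, reward=0.0, gamma=1.0):
    N[s] += 1
    alpha = 1.0 / (N[s] + 1)
    delta = alpha * (reward + gamma * V[s_next] - V[s])
    V[s] += delta
    return delta

for trial in range(2):                               # two trials along the same path
    for s, s_next in [("a2", "a3"), ("a3", "a4")]:
        d = passive_td_step(s, s_next)
        print(trial + 1, s, "->", s_next, "N =", N[s], "Δ =", round(d, 3))
# Trial 1: a3 -> a4 gives Δ = 1/2 (a2 -> a3 gives Δ = 0, not shown in the table).
# Trial 2: a2 -> a3 gives Δ = 1/6, and a3 -> a4 gives Δ = 1/6, matching the table.
```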

SLIDE 41

Passive RL: Problems?

[Plots: utility estimates vs. number of trials for states (1,1), (1,3), (2,1), (3,3), (4,3); RMS error in utility vs. number of trials.]

SLIDE 42

Passive RL: Problems?

[Plots: utility estimates vs. number of trials for states (1,1), (1,3), (2,1), (3,3), (4,3); RMS error in utility vs. number of trials.]

  • Limited by the constant policy?
  • Do rarely visited states give poor estimates?

SLIDE 43

Active RL: Example

  • Greedy algorithm
  • After updating V(s) and N(s), recalculate the policy π(s)

SLIDE 44

Active RL: Example

  • Greedy algorithm
  • After updating V(s) and N(s), recalculate the policy π(s)

[Plot: RMS error and policy loss vs. number of trials; 4×3 grid world with −1 and +1 terminal states.]

SLIDE 45

Active RL: Example

  • Greedy algorithm
  • After updating V(s) and N(s), recalculate the policy π(s)

[Plot: RMS error and policy loss vs. number of trials; 4×3 grid world with −1 and +1 terminal states.]

The greedy algorithm cannot find the optimal policy ⇒ it needs more exploration

SLIDE 46

How to Improve Active RL?

Source of errors:

  Reason for error:    sampling   policy
  V too low
  V too high
  increase N helps?

SLIDE 47

How to Improve Active RL?

Source of errors:

  Reason for error:    sampling   policy
  V too low            T
  V too high           T
  increase N helps?    T

SLIDE 48

How to Improve Active RL?

Source of errors:

  Reason for error:    sampling   policy
  V too low            T          T
  V too high           T          F
  increase N helps?    T          F

SLIDE 49

How to Improve Active RL?

Source of errors:

  Reason for error:    sampling   policy
  V too low            T          T
  V too high           T          F
  increase N helps?    T          F

Exploration vs. Exploitation:
  • We can’t do without it
  • We can’t live with too much of it

SLIDE 50

How to Improve Active RL?

Source of errors:

  Reason for error:    sampling   policy
  V too low            T          T
  V too high           T          F
  increase N helps?    T          F

Exploration vs. Exploitation:
  • We can’t do without it
  • We can’t live with too much of it

Exploration: minimize it, use random moves?
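Not in the original slides: one common way to mix in random moves is ε-greedy action selection; a minimal sketch, with the Q-table and action list as illustrative assumptions.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random move), otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```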

SLIDE 51

Exploring Agent

[4×3 grid world with all values optimistically initialized to +1; agent xₜ in row b, start S in row c.]

  • Initialize all V(s) = +R (e.g., +1) until N(s) > e, an exploration threshold
  • Then use the learned V(s)
  • Wait until confidence is built
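Not in the original slides: a minimal sketch of the optimistic value used while exploring; R_PLUS and N_E are illustrative names for +R and the threshold e.

```python
R_PLUS = 1.0   # optimistic reward estimate, +R
N_E = 5        # exploration threshold, e

def exploration_value(V, N, s):
    """Return the optimistic value +R until state s has been visited more than N_E times."""
    if N.get(s, 0) <= N_E:
        return R_PLUS          # still exploring: assume the best
    return V.get(s, 0.0)       # confidence built: use the learned value
```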

SLIDE 52

Exploring Agent Does Much Better

[Plots: utility estimates vs. number of trials for states (1,1), (1,2), (1,3), (2,3), (3,2), (3,3), (4,3); RMS error and policy loss vs. number of trials.]

SLIDE 53

Q-Learning

Instead of V(s), use Q(s, a), where V(s) = max_a Q(s, a); then value iteration becomes

Q(s, a) = R(s) + γ Σ_{s′} P(s′|s, a) max_{a′} Q(s′, a′)

SLIDE 54

Q-Learning

Instead of V(s), use Q(s, a), where V(s) = max_a Q(s, a); then value iteration becomes

Q(s, a) = R(s) + γ Σ_{s′} P(s′|s, a) max_{a′} Q(s′, a′)

State of the art, but also has problems with dimensionality
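Not in the original slides: in practice Q-learning uses a TD-style update that needs no model of P(s′|s, a); a minimal tabular sketch with illustrative names.

```python
from collections import defaultdict

Q = defaultdict(float)   # Q-table: (state, action) -> estimated value

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```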

SLIDE 55

Q-Learning in Real World Problems

SLIDE 56

Q-Learning in Real World Problems

Translate the problem space to a feature space: s = [f₁, . . . , fₘ]
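Not in the original slides: with a feature representation, Q can be approximated rather than tabulated, e.g. linearly as Q(s, a) ≈ Σᵢ wᵢ fᵢ(s, a); a minimal sketch where the features function is an illustrative assumption.

```python
def q_value(w, features, s, a):
    """Linear approximation: Q(s, a) ~= sum_i w[i] * f_i(s, a)."""
    return sum(wi * fi for wi, fi in zip(w, features(s, a)))

def q_feature_update(w, features, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Gradient-style TD update of the weights for one observed step (s, a, r, s_next)."""
    best_next = max(q_value(w, features, s_next, a2) for a2 in actions)
    td_error = r + gamma * best_next - q_value(w, features, s, a)
    return [wi + alpha * td_error * fi for wi, fi in zip(w, features(s, a))]
```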
