SLIDE 1

Reinforcement learning

Chapter 21, Sections 1–4

SLIDE 2

Outline

♦ Examples
♦ Learning a value function for a fixed policy – temporal difference learning
♦ Q-learning
♦ Function approximation
♦ Exploration

SLIDE 3

Reinforcement learning

Agent is in an MDP or POMDP environment
Only feedback for learning is percept + reward
Agent must learn a policy in some form (see the sketch after this list):
– transition model T(s, a, s′) plus value function U(s)
– Q(a, s) = expected utility if we do a in s and then act optimally
– policy π(s)
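A minimal Python sketch of these three representations; the names are illustrative, not from the chapter, with defaultdicts standing in for learned tables:

```python
from collections import defaultdict

# 1. Model-based: transition model T(s, a, s') plus value function U(s).
T = defaultdict(float)   # T[(s, a, s_next)] = estimated transition probability
U = defaultdict(float)   # U[s] = estimated utility of state s

# 2. Action-value function: Q(a, s), usable without a transition model.
Q = defaultdict(float)   # Q[(a, s)] = expected utility of doing a in s, then acting optimally

# 3. Direct policy: pi maps each state to an action.
pi = {}                  # pi[s] = action to take in s

def greedy_action(s, actions):
    """Decide from Q alone: no T(s, a, s') needed."""
    return max(actions, key=lambda a: Q[(a, s)])
```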

SLIDE 4

Example: 4×3 world

[Figure: the 4×3 grid world; START at (1,1), terminal states +1 at (4,3) and −1 at (4,2)]

Sample trials (each nonterminal step incurs reward −.04):

(1,1) −.04→ (1,2) −.04→ (1,3) −.04→ (1,2) −.04→ (1,3) −.04→ · · · (4,3) +1
(1,1) −.04→ (1,2) −.04→ (1,3) −.04→ (2,3) −.04→ (3,3) −.04→ · · · (4,3) +1
(1,1) −.04→ (2,1) −.04→ (3,1) −.04→ (3,2) −.04→ (4,2) −1
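A minimal sketch of generating trials like these, assuming the standard 4×3 dynamics (0.8 chance of the intended move, 0.1 each perpendicular, bumps leave the agent in place); `step`, `trial`, and `policy` are illustrative names:

```python
import random

WALL, TERMINALS = (2, 2), {(4, 3): +1, (4, 2): -1}
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}

def step(s, a):
    """Stochastic motion: 0.8 intended direction, 0.1 each perpendicular."""
    r = random.random()
    a = a if r < 0.8 else (PERP[a][0] if r < 0.9 else PERP[a][1])
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    if nxt == WALL or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        nxt = s                          # bumped into a wall: stay put
    return nxt

def trial(policy, s=(1, 1)):
    """Run one trial from START; return visited states and total reward."""
    states, total = [s], 0.0
    while s not in TERMINALS:
        total -= 0.04                    # -.04 reward for each nonterminal step
        s = step(s, policy(s))
        states.append(s)
    return states, total + TERMINALS[s]

# e.g. trial(lambda s: "up") -> one random trajectory like those above
```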

SLIDE 5

Example: Backgammon

[Figure: backgammon board with points numbered 1–24]

Reward for win/loss only in terminal states, otherwise zero
TD-Gammon learns Û(s), represented as a three-layer neural network
Combined with depth-2 or depth-3 search: one of the top three players in the world

SLIDE 6

Example: Animal learning

RL studied experimentally for more than 60 years in psychology
Rewards: food, pain, hunger, recreational pharmaceuticals, etc.
Example: bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies
Bees have a direct neural connection from nectar intake measurement to motor planning area

SLIDE 7

Example: Autonomous helicopter

Reward = −(squared deviation from desired state)
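A one-line sketch of this quadratic reward, assuming `state` and `desired` are numeric vectors (position, orientation, velocities, and so on; the names are illustrative):

```python
import numpy as np

def reward(state, desired):
    """Negative squared deviation from the desired state."""
    diff = np.asarray(state, dtype=float) - np.asarray(desired, dtype=float)
    return -float(diff @ diff)
```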

SLIDE 8

Example: Autonomous helicopter

SLIDE 9

Temporal difference learning

Fix a policy π, execute it, learn U^π(s)

Bellman equation:

U^π(s) = R(s) + γ Σ_{s′} T(s, π(s), s′) U^π(s′)

TD update adjusts the utility estimate to agree with the Bellman equation:

U^π(s) ← U^π(s) + α (R(s) + γ U^π(s′) − U^π(s))

Essentially using sampling from the environment instead of exact summation
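A minimal tabular sketch of policy evaluation with this TD update, assuming an illustrative `env` with `reset() -> s` and `step(a) -> (s_next, done)`, a reward function `R(s)`, and a fixed policy `pi`:

```python
from collections import defaultdict

def td_policy_evaluation(env, pi, R, trials=500, alpha=0.1, gamma=1.0):
    U = defaultdict(float)                       # U^pi(s), initially zero
    for _ in range(trials):
        s, done = env.reset(), False
        while not done:
            s_next, done = env.step(pi(s))
            # move U(s) toward the sampled one-step target R(s) + gamma*U(s')
            U[s] += alpha * (R(s) + gamma * U[s_next] - U[s])
            s = s_next
        U[s] = R(s)                              # terminal utility is its reward
    return U
```

With α decreasing appropriately over repeated visits to each state, these estimates converge to U^π(s).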

SLIDE 10

TD performance

[Plots: utility estimates for states (1,1), (1,3), (2,1), (3,3), (4,3) vs. number of trials; RMS error in utility vs. number of trials]

SLIDE 11

Q-learning

One drawback of learning U(s): still need T(s, a, s′) to make decisions

Q(a, s) = expected utility if we do a in s and then act optimally

Bellman equation:

Q(a, s) = R(s) + γ Σ_{s′} T(s, a, s′) max_{a′} Q(a′, s′)

Q-learning update:

Q(a, s) ← Q(a, s) + α (R(s) + γ max_{a′} Q(a′, s′) − Q(a, s))

Q-learning is a model-free method for learning and decision making (so cannot use the model to constrain Q-values, do mental simulation, etc.)
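A minimal tabular sketch of this update, reusing the illustrative `env` interface from the TD sketch plus an action list `ACTIONS`; note that no transition model T appears anywhere:

```python
import random
from collections import defaultdict

def q_learning(env, R, ACTIONS, trials=500, alpha=0.1, gamma=1.0, eps=0.1):
    Q = defaultdict(float)                       # Q[(a, s)], initially zero
    for _ in range(trials):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy choice; exploration is discussed on a later slide
            if random.random() < eps:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q[(act, s)])
            s_next, done = env.step(a)
            best_next = max(Q[(act, s_next)] for act in ACTIONS)
            # model-free update toward the sampled target
            Q[(a, s)] += alpha * (R(s) + gamma * best_next - Q[(a, s)])
            s = s_next
        for act in ACTIONS:
            Q[(act, s)] = R(s)                   # pin terminal values to their reward
    return Q
```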

SLIDE 12

Function approximation

For real problems, cannot represent U or Q as a table!!

Typically use linear function approximation:

Û_θ(s) = θ₁ f₁(s) + θ₂ f₂(s) + · · · + θₙ fₙ(s)

Use a gradient step to modify the θ parameters:

θᵢ ← θᵢ + α [R(s) + γ Û_θ(s′) − Û_θ(s)] ∂Û_θ(s)/∂θᵢ

θᵢ ← θᵢ + α [R(s) + γ max_{a′} Q̂_θ(a′, s′) − Q̂_θ(a, s)] ∂Q̂_θ(a, s)/∂θᵢ

Often very effective in practice, but convergence not guaranteed
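A minimal sketch of one such gradient step for the linear case, where ∂Û_θ(s)/∂θᵢ is just fᵢ(s); `features` and the (s, r, s′) sample are assumed helpers:

```python
import numpy as np

def td_gradient_step(theta, features, s, r, s_next, alpha=0.01, gamma=1.0):
    """One TD gradient step on a linear value function U_theta(s) = theta . f(s)."""
    f_s = features(s)                            # feature vector f(s)
    delta = r + gamma * theta @ features(s_next) - theta @ f_s
    # for a linear approximator, the gradient of U_theta(s) w.r.t. theta is f(s)
    return theta + alpha * delta * f_s
```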

SLIDE 13

Exploration

How should the agent behave? Choose action with highest expected utility?

[Plot: RMS error and policy loss vs. number of trials]

[Figure: the 4×3 grid world with terminal states +1 and −1]

Exploration vs. exploitation: occasionally try “suboptimal” actions!!
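A minimal sketch of one such scheme, ε-greedy with a per-state decaying ε (a GLIE scheme: greedy in the limit, yet every action is tried infinitely often); `Q` and `ACTIONS` are the illustrative names from the earlier sketches, and `n_visits` is an assumed counter:

```python
import random
from collections import defaultdict

n_visits = defaultdict(int)                      # how often each state was seen

def choose_action(Q, s, ACTIONS):
    eps = 1.0 / (1 + n_visits[s])                # explore less as s grows familiar
    n_visits[s] += 1
    if random.random() < eps:
        return random.choice(ACTIONS)            # occasionally try a "suboptimal" action
    return max(ACTIONS, key=lambda a: Q[(a, s)])
```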

SLIDE 14

Summary

Reinforcement learning methods find approximate solutions to MDPs
Work directly from experience in the environment
Need not be given a transition model a priori
Q-learning is completely model-free
Function approximation (e.g., a linear combination of features) helps RL scale up to very large MDPs
Exploration is required for convergence to optimal solutions
