
SLIDE 1

Reinforcement Learning

Steven J Zeil

Old Dominion Univ.

Fall 2010

Outline: Introduction, Model-based Learning, Temporal Difference Learning, Partially Observable States

SLIDE 2

Reinforcement Learning

Learning policies for which the ultimate payoff comes only after many steps

e.g., games, robotics

Unlike supervised learning, correct I/O pairs are not available

There may not be a single “correct” output

Heavy emphasis on on-line learning

SLIDE 7

Short-term versus Long-term Reward

Goal is to optimize a reward that may be given at the end of a sequence of state transitions

Approximated by a series of immediate rewards after each transition

Requires balance of short-term versus long-term planning

At any given step, the agent may engage in exploitation of what we know or exploration of unknown states

SLIDE 8

Basic Components

set of states S
set of actions A
rules for transitioning between states
rules for the immediate reward of a transition
rules for what the agent can observe
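To make these components concrete, here is a minimal sketch (not from the slides) of how they might be encoded for a tiny, fully specified problem. The state and action names, probabilities, and rewards are made up for illustration.

```python
# Hypothetical encoding of the basic RL components for a tiny problem.
states = ["s0", "s1"]                # set of states S
actions = ["left", "right"]          # set of actions A

# Transition rules: P(s' | s, a), stored as {(s, a): {s': prob}}
P = {
    ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"):  {"s0": 0.7, "s1": 0.3},
    ("s1", "right"): {"s0": 0.0, "s1": 1.0},
}

# Immediate reward of a transition: E[r | s, a]
R = {
    ("s0", "left"): 0.0, ("s0", "right"): 1.0,
    ("s1", "left"): 0.5, ("s1", "right"): -1.0,
}

# Observation rule (only interesting for partially observable problems):
# here the state is fully observed, so the observation is the state itself.
def observe(state, action):
    return state
```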

SLIDE 9

K-armed Bandit

Among K levers, choose the one that pays best

Q(a): value of action a. Reward is r_a. Set Q(a) = r_a. Choose a* if Q(a*) = max_a Q(a)

Rewards stochastic (keep an expected reward): Q_{t+1}(a) ← Q_t(a) + η [r_{t+1}(a) − Q_t(a)]
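A minimal sketch of the stochastic update above. The lever payout probabilities, the learning rate η, and the exploration rate are made-up illustration values.

```python
import random

K = 3
true_means = [0.2, 0.5, 0.8]      # hypothetical expected payouts of the K levers
Q = [0.0] * K                     # estimated value Q(a) of each action
eta = 0.1                         # learning rate

def pull(a):
    """Stochastic reward r_a: 1 with probability true_means[a], else 0."""
    return 1.0 if random.random() < true_means[a] else 0.0

for t in range(1000):
    # Explore occasionally, otherwise choose a* = argmax_a Q(a)
    a = random.randrange(K) if random.random() < 0.1 else max(range(K), key=lambda i: Q[i])
    r = pull(a)
    Q[a] += eta * (r - Q[a])      # Q_{t+1}(a) = Q_t(a) + eta * (r_{t+1}(a) - Q_t(a))

print(Q)   # estimates drift toward the true means; lever 2 ends up highest
```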

SLIDE 10

K-armed Bandit variants

This problem becomes more interesting if we don’t know all the r_a

Trade-off of exploitation and exploration

SLIDE 11

Policies and Cumulative Rewards

Policy: π : S → A, with a_t = π(s_t)

Value of a policy: V^π(s_t), the cumulative reward starting from s_t

Finite-horizon (episodic): V^π(s_t) = E[ Σ_{i=1}^{T} r_{t+i} ]

Infinite horizon: V^π(s_t) = E[ Σ_{i=1}^{∞} γ^{i−1} r_{t+i} ], where 0 ≤ γ < 1 is the discount rate
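As a quick illustration of the two cumulative-value definitions, the sketch below evaluates the finite-horizon and discounted sums for a made-up reward sequence and discount rate.

```python
# Rewards r_{t+1}, r_{t+2}, ... received after time t (illustration values).
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]
gamma = 0.9   # discount rate, 0 <= gamma < 1

# Finite-horizon (episodic) value: plain sum of the T rewards.
V_finite = sum(rewards)

# Infinite-horizon value: sum of gamma^(i-1) * r_{t+i} over the observed rewards.
V_discounted = sum(gamma ** i * r for i, r in enumerate(rewards))

print(V_finite, V_discounted)   # 6.0 and 0.81 + 5 * 0.9**4 = 4.0905
```

Distant rewards are worth less under discounting, which is what makes the infinite sum finite when γ < 1.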

SLIDE 12

State-Action pairs

V(s_t) is a measure of how good it is for the agent to be in state s_t

Alternatively, we can talk about Q(s_t, a_t): how good it is to perform action a_t when in state s_t

Q*(s_t, a_t) is the expected cumulative reward of action a_t taken in state s_t, assuming we follow an optimal policy afterwards

SLIDE 13

Optimal Policies

V*(s_t) = max_π V^π(s_t), ∀s_t

V*(s_t) = max_{a_t} Q*(s_t, a_t)

Bellman’s equation:

V*(s_t) = max_{a_t} ( E[r_{t+1}] + γ Σ_{s_{t+1}} P(s_{t+1} | s_t, a_t) V*(s_{t+1}) )

Q*(s_t, a_t) = E[r_{t+1}] + γ Σ_{s_{t+1}} P(s_{t+1} | s_t, a_t) max_{a_{t+1}} Q*(s_{t+1}, a_{t+1})

Choose the a_t that maximizes Q*(s_t, a_t) (greedy)
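A small sketch of a single Bellman backup for a hypothetical two-state, two-action problem: given P(s'|s,a), the expected rewards, and current estimates of V*, it computes Q*(s,a) and the greedy action in each state. All numbers are illustrative only.

```python
states, actions = [0, 1], [0, 1]
gamma = 0.9

# P[s][a][s'] = P(s' | s, a) and R[s][a] = E[r_{t+1} | s, a]  (made-up numbers)
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[0.0, 1.0],
     [0.5, 2.0]]

V = [0.0, 10.0]   # current estimates of V*(s)

for s in states:
    # One Bellman backup: Q(s,a) = E[r|s,a] + gamma * sum_s' P(s'|s,a) V(s')
    Q = [R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in states) for a in actions]
    greedy = max(actions, key=lambda a: Q[a])
    print(s, Q, "greedy action:", greedy)
```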

SLIDE 14

Model-based Learning

The environment model, P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t), is known

There is no need for exploration

Can be solved using dynamic programming

SLIDE 15

Value Iteration

Initialize all V(s) to arbitrary values
repeat
    for all s ∈ S do
        for all a ∈ A do
            Q(s, a) ← E[r | s, a] + γ Σ_{s′∈S} P(s′ | s, a) V(s′)
        end for
        V(s) ← max_a Q(s, a)
    end for
until V(s) converges
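The pseudocode above translates almost line for line into Python. The sketch below runs it on a hypothetical two-state, two-action MDP (the transition probabilities and rewards are made up), sweeping all states until the values stop changing.

```python
states, actions, gamma = [0, 1], [0, 1], 0.9

# Known model: P[s][a][s'] = P(s' | s, a), R[s][a] = E[r | s, a]  (illustration values)
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[0.0, 1.0],
     [0.5, 2.0]]

V = [0.0 for _ in states]          # initialize V(s) arbitrarily
while True:
    V_new = list(V)
    for s in states:
        Q = [R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in states)
             for a in actions]
        V_new[s] = max(Q)          # V(s) <- max_a Q(s, a)
    if max(abs(V_new[s] - V[s]) for s in states) < 1e-8:   # convergence test
        break
    V = V_new

print(V_new)
```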

SLIDE 16

Policy Iteration

Initialize a policy π arbitrarily
repeat
    π′ ← π
    Compute the values V^π(s) using π by solving Bellman’s equation
    Improve the policy by choosing the best a at each state
until π = π′
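A compact sketch of policy iteration on the same kind of made-up two-state model: policy evaluation solves Bellman's (linear) equation for V^π exactly with numpy, and policy improvement picks the best action in each state. The model values are illustrative, not from the slides.

```python
import numpy as np

gamma = 0.9
# P[s][a][s'] and R[s][a] for a hypothetical 2-state, 2-action MDP
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.5, 2.0]])
n_states, n_actions = R.shape

pi = np.zeros(n_states, dtype=int)          # initialize a policy arbitrarily
while True:
    # Policy evaluation: solve V = R_pi + gamma * P_pi V  (Bellman's equation for pi)
    P_pi = np.array([P[s, pi[s]] for s in range(n_states)])
    R_pi = np.array([R[s, pi[s]] for s in range(n_states)])
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

    # Policy improvement: choose the best a at each state
    Q = R + gamma * P @ V                   # Q[s, a]
    pi_new = np.argmax(Q, axis=1)
    if np.array_equal(pi_new, pi):          # until pi = pi'
        break
    pi = pi_new

print(pi, V)
```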

SLIDE 17

Temporal Difference Learning

If we do not know the entire environment, we must do some exploration

Exploration will, in effect, take a sample from P(s_{t+1} | s_t, a_t) and p(r_{t+1} | s_t, a_t)

Use the reward received in the next time step to update the value of the current state (or state-action pair)

SLIDE 18

ε-greedy

For some ε:

with probability 1 − ε, choose the action a with the highest estimated value (exploit)

otherwise choose a random action (explore)

Softmax: P(a | s) = exp(Q(s, a) / T) / Σ_{b∈A} exp(Q(s, b) / T)

Simulated annealing with temperature T: start with a large T (near-uniform exploration) and decrease it over time, moving gradually toward greedy exploitation
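Both selection rules on this slide are easy to express directly. The sketch below assumes Q is a dict from actions to estimated values; the action names and numbers are made up.

```python
import math
import random

Q = {"left": 1.0, "right": 2.0, "listen": 0.5}   # illustrative value estimates

def epsilon_greedy(Q, epsilon=0.1):
    """With probability 1 - epsilon exploit the best-known action, else explore."""
    if random.random() < epsilon:
        return random.choice(list(Q))
    return max(Q, key=Q.get)

def softmax_choice(Q, T=1.0):
    """P(a) proportional to exp(Q(a) / T); lower T behaves more greedily."""
    weights = [math.exp(Q[a] / T) for a in Q]
    return random.choices(list(Q), weights=weights, k=1)[0]

print(epsilon_greedy(Q), softmax_choice(Q, T=0.5))
```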

SLIDE 19

Nondeterministic Rewards and Actions

When next states and rewards are nondeterministic (there is an opponent, or randomness in the environment), we keep running averages (expected values) instead of making direct assignments

Q-learning update:

Q̂(s_t, a_t) ← Q̂(s_t, a_t) + η [ r_{t+1} + γ max_{a_{t+1}} Q̂(s_{t+1}, a_{t+1}) − Q̂(s_t, a_t) ]

Off-policy (Q-learning) vs on-policy (Sarsa)
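The off-policy/on-policy distinction shows up only in the target term of the update. A minimal sketch (function and variable names are mine, not from the slides):

```python
# Q is a dict mapping (state, action) to the current estimate Q_hat(s, a).

def q_learning_update(Q, s, a, r, s_next, actions, eta=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best action in s_next, whatever we actually do next."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += eta * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, eta=0.1, gamma=0.9):
    """On-policy (Sarsa): bootstrap from the action a_next the policy actually chose."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += eta * (target - Q[(s, a)])
```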

SLIDE 20

Q-learning

Initialize all Q(s, a) to arbitrary values
for all episodes do
    Initialize s
    repeat
        Choose a using a policy derived from Q (e.g., ε-greedy)
        Take action a, observe r and s′
        Update Q(s, a) (previous slide)
        s ← s′
    until s is a terminal state
end for
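A direct translation of the loop above, assuming a hypothetical episodic environment object `env` with `reset()` and `step(a)` returning (next state, reward, done flag); none of these names come from the slides.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, eta=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                       # Q(s, a), arbitrary (zero) initial values
    for _ in range(episodes):
        s = env.reset()                          # initialize s
        done = False
        while not done:                          # repeat until s is terminal
            # choose a from a policy derived from Q (epsilon-greedy)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)        # take action a, observe r and s'
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += eta * (r + gamma * best_next - Q[(s, a)])
            s = s_next                           # s <- s'
    return Q
```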

SLIDE 21

Partially Observable States

The agent does not know its state but receives an observation o_{t+1}, drawn from p(o_{t+1} | s_t, a_t), which can be used to infer a belief about the underlying states

SLIDE 22

The Tiger Problem

Two doors, behind one of which there is a tiger

p: probability that the tiger is behind the left door

z is the hidden state: the location of the tiger

Rewards r(A, Z):

    r(A, Z)        Tiger left    Tiger right
    Open left         −100           +80
    Open right         +90          −100

Expected rewards: R(a_L) = −100p + 80(1 − p),  R(a_R) = 90p − 100(1 − p)
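The two expected rewards are simple functions of the belief p; a short sketch that evaluates them and reports which door is worth opening for a few sample beliefs (the p values are arbitrary).

```python
def R_open_left(p):
    """Expected reward of opening the left door when the tiger is behind it with prob p."""
    return -100 * p + 80 * (1 - p)

def R_open_right(p):
    return 90 * p - 100 * (1 - p)

for p in (0.2, 0.5, 0.8):
    better = "open left" if R_open_left(p) > R_open_right(p) else "open right"
    print(p, round(R_open_left(p), 1), round(R_open_right(p), 1), better)

# The two lines cross where -100p + 80(1-p) = 90p - 100(1-p), i.e. p = 180/370 ≈ 0.486.
```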

SLIDE 23

. . . with Microphones

We can sense, with a reward of R(a_S) = −1

We have unreliable sensors:
P(O_L | Z_L) = 0.7,  P(O_L | Z_R) = 0.3
P(O_R | Z_L) = 0.3,  P(O_R | Z_R) = 0.7
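Hearing the tiger changes the belief p by Bayes' rule; the sketch below applies the 0.7/0.3 sensor model from the slide (the Bayes update itself is standard, not spelled out on the slide).

```python
def updated_belief(p, heard_left, p_correct=0.7):
    """Posterior P(tiger left | observation), from prior p and the 0.7-accurate sensor."""
    if heard_left:
        like_left, like_right = p_correct, 1 - p_correct   # P(O_L | Z_L), P(O_L | Z_R)
    else:
        like_left, like_right = 1 - p_correct, p_correct   # P(O_R | Z_L), P(O_R | Z_R)
    return like_left * p / (like_left * p + like_right * (1 - p))

print(updated_belief(0.5, heard_left=True))    # 0.7: one left-ward observation from an even prior
print(updated_belief(0.7, heard_left=True))    # about 0.845 after a second consistent observation
```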

SLIDE 24

Effects of Sensors
