

SLIDE 1

Introduction to Reinforcement Learning and Q-Learning

Skyler Seto (ss3349)
May 2, 2016


SLIDE 2

Outline

1. Reinforcement Learning and Markov Decision Process
2. Q-Learning
3. Q-Learning Convergence


SLIDE 3

Introduction

How does an agent behave?

1. An agent can be a passive learner, lounging around analyzing data, then constructing its model.
2. An agent can be an active learner, learning to act on the fly given sequences of the form (state, action, reward).

In this talk, we consider an agent of the second kind, one who actively learns from the environment.


SLIDE 4

Markov Decision Process (MDP)

Definition. The MDP framework consists of the four elements (S, A, R, P):

  • S is the finite set of possible states,
  • A is the finite set of possible actions,
  • R is the reward model R : S × A → ℝ,
  • P is the transition model P(s′|s, a), with Σ_{s′∈S} P(s′|s, a) = 1.
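To make the definition concrete, here is a minimal sketch of the four-element container in Python; the class name MDP and its field layout are illustrative choices, not notation from the talk.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    """A finite MDP (S, A, R, P) stored as plain dictionaries."""
    states: set        # S: finite set of states
    actions: set       # A: finite set of actions
    rewards: dict      # R: maps (s, a) to a real-valued reward
    transitions: dict  # P: maps (s, a) to {s_next: probability}

    def validate(self, tol=1e-9):
        # Each P(.|s, a) must be a probability distribution over S.
        for (s, a), dist in self.transitions.items():
            total = sum(dist.values())
            assert abs(total - 1.0) < tol, f"P(.|{s},{a}) sums to {total}"
```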


SLIDE 5

Robot Navigation

1. State space S is the set of all possible locations and directions.
2. Action space A is the set of possible motions: move forward, backward, etc.
3. Reward model R rewards the robot positively if it gets to the goal, and negatively if it hits an obstacle.
4. Transition probability accounts for some probability that the robot moves forward, doesn't move, or moves forward twice.
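A toy instance of this robot, reduced to a four-cell corridor, might look as follows; the cell layout and the specific probabilities 0.8/0.1/0.1 are illustrative, not taken from the talk.

```python
# A 4-cell corridor: the robot starts at cell 0, the goal is cell 3.
# "Forward" advances one cell w.p. 0.8, stays put w.p. 0.1, and
# overshoots two cells w.p. 0.1, clamped at the end of the corridor.
STATES = [0, 1, 2, 3]
ACTIONS = ["forward"]

def transition(s, a):
    """Return {next_state: prob} for P(.|s, a), merging clamped outcomes."""
    assert a == "forward"
    dist = {}
    for step, p in ((1, 0.8), (0, 0.1), (2, 0.1)):
        y = min(s + step, 3)                # clamp at the corridor's end
        dist[y] = dist.get(y, 0.0) + p
    return dist

def reward(s, a):
    """+1 when acting in the cell adjacent to the goal, 0 otherwise."""
    return 1.0 if s == 2 else 0.0
```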


SLIDE 6

Markov Decision Process Diagram

Figure: Two-Step Markov Decision Process


SLIDE 7

Properties of MDP

1. The reward function R(s, a) is deterministic and time-homogeneous.
2. P(s_{t+1}|s_t, a_t) is independent of t and thus time-homogeneous.
3. The transition model is Markovian: the next state depends only on the current state and action.


SLIDE 8

Reinforcement Learning in the MDP

1. Consider a partially known model (S, A, R, P) where S and A are known, but R and P must be learned as the agent acts.
2. Define the policy for the MDP, π_t : S → A, as the solution to the MDP.
3. What is the optimal policy π* the agent should learn in order to maximize its total expected discounted reward (with discount factor γ)?


SLIDE 9

Outline

1. Reinforcement Learning and Markov Decision Process
2. Q-Learning
3. Q-Learning Convergence


SLIDE 10

Value and Optimal Value

Definition. For a given policy π and discounted reward factor γ, the value of a state s is

    V^π(s) = R_s(π(s)) + γ Σ_{y∈S} P_{s,y}(π(s)) V^π(y)

and the optimal value is

    V*(s) = V^{π*}(s) = max_a [ R_s(a) + γ Σ_{y∈S} P_{s,y}(a) V^{π*}(y) ].
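These equations suggest a direct computation. Below is a minimal sketch of iterative policy evaluation for V^π under the tabular assumptions above; the function name and the data layout (R and P as dicts keyed by (s, a)) are illustrative.

```python
def evaluate_policy(states, policy, R, P, gamma, tol=1e-8):
    """Iterate V(s) <- R(s, pi(s)) + gamma * sum_y P(y|s, pi(s)) * V(y).

    R[(s, a)] is a scalar; P[(s, a)] maps next states y to probabilities.
    For gamma < 1 this contraction converges to V_pi.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            v = R[(s, a)] + gamma * sum(p * V[y] for y, p in P[(s, a)].items())
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V
```

Replacing π(s) by a max over actions in the update gives value iteration for V*, i.e., the second equation.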


SLIDE 11

Q Function

Definition. For a policy π, define the Q values (action-values) as

    Q^π(s, a) = R_s(a) + γ Σ_{y∈S} P_{s,y}(a) V^π(y) = E_y [R_s(a) + γ V^π(y)].

The Q value is the expected discounted reward for executing action a at state s and following policy π thereafter.


SLIDE 12

Q Values for the Optimal Policy

1. Let Q*(s, a) = Q^{π*}(s, a) be the optimal action-values,
2. V*(s) = max_a Q*(s, a) be the optimal value,
3. π*(s) = arg max_a Q*(s, a) be the optimal policy.

If an agent learns the Q values, it can easily determine the optimal action.
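Item 3 is what makes Q-learning attractive: once Q* is known, the optimal action can be read off with a single argmax, with no model of R or P required. A one-line sketch with Q stored as a dict of dicts (the layout is illustrative):

```python
def greedy_action(Q, s):
    """pi*(s) = argmax_a Q*(s, a), read directly from the table Q[s][a]."""
    return max(Q[s], key=Q[s].get)
```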


SLIDE 13

Q-Learning

In Q-learning the agent experiences a sequence of stages. At the nth stage, the agent:

  • observes its current state x_n,
  • performs an action a_n,
  • observes the subsequent state transition to y_n,
  • receives reward r_n,


SLIDE 14

Q-Learning

  • updates its Q function with learning factor α_n according to:

    If s = x_n and a = a_n:
        Q_n(s, a) = (1 − α_n) Q_{n−1}(s, a) + α_n [r_n + γ V_{n−1}(y_n)]
    Otherwise:
        Q_n(s, a) = Q_{n−1}(s, a)

    where V_{n−1}(y) = max_b Q_{n−1}(y, b).
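Transcribed into code, the update touches only the visited pair (x_n, a_n), which matches the "otherwise" case leaving all other entries unchanged. A minimal tabular sketch, with Q as a dict of dicts and illustrative names:

```python
def q_learning_step(Q, x, a, r, y, alpha, gamma):
    """Q_n(x, a) = (1 - alpha) * Q_{n-1}(x, a) + alpha * (r + gamma * V_{n-1}(y))."""
    V_y = max(Q[y].values())  # V_{n-1}(y) = max_b Q_{n-1}(y, b)
    Q[x][a] = (1 - alpha) * Q[x][a] + alpha * (r + gamma * V_y)
```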


SLIDE 15

Outline

1. Reinforcement Learning and Markov Decision Process
2. Q-Learning
3. Q-Learning Convergence


SLIDE 16

Convergence Theorem

Let n_i(s, a) be the ith time that action a is tried in state s.

Theorem. Given bounded rewards |r_n| ≤ R, learning rates 0 ≤ α_n < 1, and

    Σ_{i=1}^∞ α_{n_i(s,a)} = ∞,    Σ_{i=1}^∞ [α_{n_i(s,a)}]² < ∞    ∀ s, a,

then Q_n(s, a) → Q*(s, a) almost surely as n → ∞.
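For example, the schedule α_{n_i(s,a)} = 1/i satisfies both conditions, since Σ_i 1/i diverges while Σ_i 1/i² converges. A constant learning rate violates the second condition, and the first condition implicitly requires that every state-action pair be tried infinitely often.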


SLIDE 17

Action Replay Process (ARP)

1. Let S = {s}, A = {a} be the sets of states and actions of the original MDP.
2. Create an infinite deck of cards, with the jth card from the bottom having (s_j, a_j, y_j, r_j, α_j) written on it.
3. Additionally, take the bottom card to have the value Q_0(s, a) for all s and a.
4. We define the ARP to have state space S′ = {(s, n)} and action space A′ = A = {a}.


SLIDE 18

State Transitions in the ARP

Given current state (s, n) and action a, we determine the next state by:

1. Removing all cards for stages after n.
2. Finding the first t, searching from the top of the (remaining) deck, where s_t = s and a_t = a.
3. Flipping a biased coin with heads probability α_t:
  • If the coin is heads, return reward r_t and transition to the state (y_t, t − 1); the process then continues on the remaining deck below card t.
  • If the coin is tails, find another card with s and a further down the deck.
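A minimal sketch of one such transition, assuming the deck is stored as a 1-indexed list of cards (s_j, a_j, y_j, r_j, α_j) and the bottom card's values in a dict q0 (both representations are illustrative):

```python
import random

def arp_transition(deck, q0, s, n, a, rng=random):
    """From ARP state (s, n), take action a: scan down from card n for a card
    matching (s, a); accept it with probability alpha_t, else keep scanning.
    Returns (reward, next_state) with next_state = (y_t, t - 1), or
    (q0[(s, a)], None) once the bottom card is reached (absorption)."""
    for t in range(n, 0, -1):          # deck[0] is a dummy; cards are 1-indexed
        s_t, a_t, y_t, r_t, alpha_t = deck[t]
        if s_t == s and a_t == a and rng.random() < alpha_t:
            return r_t, (y_t, t - 1)   # heads: take this card's transition
    return q0[(s, a)], None            # tails (or no match) all the way down
```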


SLIDE 19

Transition Probability for ARP

1. Define the expected reward of card n determined by the ARP as R^{(n)}_s(a).
2. Define the transition probability for the ARP as P^{ARP}_{(x,n),(y,m)}(a), with

    P^{(n)}_{x,y}(a) = Σ_{m=1}^{n−1} P^{ARP}_{(x,n),(y,m)}(a).


SLIDE 20

Lemma A: Q_n are Optimal for the ARP

Lemma. Q_n(s, a) = Q*_{ARP}((s, n), a); that is, the Q_n(s, a) are the optimal action-values for ARP states (s, n) and ARP actions a.


SLIDE 21

Lemma A: Q_n are Optimal for the ARP

The ARP was constructed to have this property. At n = 0, Q_0(s, a) is the optimal (and only possible) action-value of (s, 0) and a, so

    Q_0(s, a) = Q*_{ARP}((s, 0), a).

It is easy to see by induction that for all a and s, and for any n,

    Q_n(s, a) = Q*_{ARP}((s, n), a).


SLIDE 22

Lemma B: Convergence of Transitions and Rewards

Lemma. With probability 1, the probabilities P^{(n)}_{x,y}(a) and expected rewards R^{(n)}_x(a) in the ARP converge to the transition matrices and expected rewards of the true process as the card level n → ∞.


SLIDE 23

Lemma B: Convergence of Transitions and Rewards

It is a standard stochastic-approximation theorem that if X_n is updated according to

    X_{n+1} = X_n + β_n (ξ_n − X_n),

where 0 ≤ β_n < 1, Σ_{n=1}^∞ β_n = ∞, Σ_{n=1}^∞ β_n² < ∞, and ξ_n is a bounded random variable with mean Ξ, then X_n → Ξ almost surely.
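As a quick sanity check, the theorem can be simulated with β_n = 1/n, which satisfies both summability conditions, and a bounded ξ_n; this toy script (all choices illustrative) settles near Ξ = 1:

```python
import random

def stochastic_approximation(steps=100_000, seed=0):
    """Iterate X <- X + beta_n * (xi_n - X) with beta_n = 1/n and
    xi_n uniform on [0, 2], a bounded variable with mean Xi = 1."""
    rng = random.Random(seed)
    x = 0.0
    for n in range(1, steps + 1):
        x += (1.0 / n) * (rng.uniform(0.0, 2.0) - x)
    return x

print(stochastic_approximation())  # ~1.0, approaching Xi as steps grows
```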


SLIDE 24

Lemma B: Convergence of Transitions and Rewards

  • Define n_i = n_i(x, a).
  • If R^{(n_i)}_x(a) is the expected immediate reward for performing action a from state x at card n_i, then

    R^{(n_{i+1})}_x(a) = R^{(n_i)}_x(a) + α_{n_{i+1}} (r_{n_{i+1}} − R^{(n_i)}_x(a)).

    Since R is written in the form above, Ξ = E[r_{n_{i+1}}] = R_x(a), and R^{(n_i)}_x(a) → R_x(a).
  • Similarly, P^{(n_i)}_{x,y}(a) → P_{x,y}(a), since

    P^{(n_{i+1})}_{x,y}(a) = P^{(n_i)}_{x,y}(a) + α_{n_{i+1}} (I{y_{n_{i+1}} = y} − P^{(n_i)}_{x,y}(a)).


SLIDE 25

Lemma C: Close Action-Values

Lemma. Consider executing a series of t actions in the ARP and in the real process. If the probabilities P^{(n)}_{x,y}(a) and expected rewards R^{(n)}_x(a) at the appropriate levels of the ARP for each of the actions are close to P_{x,y}(a) and R_x(a), respectively, then the value of the series of actions in the ARP, Q_{ARP}(x, a_1, . . . , a_t), will be close to its value in the true process, Q(x, a_1, . . . , a_t).


SLIDE 26

Finishing the Convergence Proof

  • Lemma B bounds the distance between P^{(n)}_{x,y}(a) and P_{x,y}(a), and between R^{(n)}_x(a) and R_x(a).
  • Lemma C shows that if the transition probabilities and rewards are close, then the values of the action sequences Q_{ARP}((s, n), a_1, . . . , a_t) and Q(s, a_1, . . . , a_t) must be close too.
  • By Lemmas B and C, the action-values are close, and so are the optimal action-values Q*_{ARP}((s, n), a_1, . . . , a_t) and Q*(s, a_1, . . . , a_t).
  • By Lemma A, the optimal Q value for the nth level of the ARP is Q_n, and so Q_n(s, a) → Q*(s, a).


SLIDE 27

References

1. Pfeffer, A., Parkes, D., Adams, R. "Markov Decision Processes", Harvard Extension School, CSCI E-181, 2014.
2. Pfeffer, A., Parkes, D., Adams, R. "Reinforcement Learning", Harvard Extension School, CSCI E-181, 2014.
3. Watkins, C., Dayan, P. "Technical Note: Q-Learning", Machine Learning, 8, 279-292. Kluwer Academic Publishers, Boston (1992).
