SLIDE 1

Reinforcement Learning

SLIDE 2

Reinforcement Learning

  • Now that you know a little about Optimal Control Theory, you actually have some knowledge of RL.
  • RL shares its overall goal with OCT: solving for a control policy such that the cumulative cost is minimized; it is good for solving problems that involve a long-term versus short-term reward trade-off.
  • But OCT assumes perfect knowledge of the system’s description in the form of a model and ensures strong guarantees, while RL operates directly on measured data and rewards from interaction with the environment.

SLIDE 3

RL in Robotics

  • Reinforcement learning (RL) enables a robot to autonomously discover an optimal behavior through trial-and-error interactions with its environment.
  • The designer of a control task provides feedback in terms of a scalar objective function that measures the one-step performance of the robot.
  • Problems are often high-dimensional with continuous states and actions, and the state is often partially observable.
  • Experience on a real physical system is tedious to obtain, expensive, and often hard to reproduce.

SLIDE 4

Problem Definition

  • A reinforcement learning problem typically includes:
  • A set of states: $S$
  • A set of actions: $A$
  • Transition rules: $P^a_{ss'} : S \times S \times A \mapsto \mathbb{R}$, with $0 \le P^a_{ss'} \le 1$ and $\sum_{s'} P^a_{ss'} = 1$
  • Reward function: $r : S \mapsto \mathbb{R}$
  • Here we assume full observability but with a stochastic transition model.
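For concreteness, such a problem could be stored as nested dictionaries. This is only a sketch; the two states, two actions, and numbers below are made up for illustration.

```python
# Hypothetical two-state, two-action MDP (names and numbers are made up).
# P[s][a][s2] is the transition probability P^a_{ss'}; r[s] is the reward r(s).
P = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.1, "s1": 0.9}, "go": {"s0": 0.7, "s1": 0.3}},
}
r = {"s0": 0.0, "s1": 1.0}

# Each row of P must be a probability distribution:
# 0 <= P^a_{ss'} <= 1 and the sum over s' of P^a_{ss'} equals 1.
for s in P:
    for a in P[s]:
        assert all(0.0 <= p <= 1.0 for p in P[s][a].values())
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```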

SLIDE 5

Long-term Expected Return

  • Finite-horizon expected return: $J = E\left[\sum_{k=0}^{H} r_k\right]$
  • Infinite-horizon return with a discount factor γ: $J = E\left[\sum_{k=0}^{\infty} \gamma^k r_k\right]$
  • In the limit when γ approaches 1, the metric approaches what is known as the average-reward criterion: $J = \lim_{H \to \infty} E\left[\frac{1}{H}\sum_{k=0}^{H} r_k\right]$
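As a quick numeric illustration (the reward sequence and γ below are invented, and the last line is of course only a finite-horizon approximation of the average-reward limit):

```python
# Hypothetical reward sequence r_0, ..., r_H and discount factor.
rewards = [0.0, 1.0, 0.0, 2.0, 1.0]
gamma = 0.9

finite_horizon = sum(rewards)                                    # J = sum_k r_k
discounted = sum(gamma**k * rk for k, rk in enumerate(rewards))  # J = sum_k gamma^k r_k
average_reward = sum(rewards) / len(rewards)                     # J ~ (1/H) sum_k r_k

print(finite_horizon, discounted, average_reward)
```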

SLIDE 6

Value Function

  • Recall from optimal control theory: v(x) = “minimal total cost for completing the task starting from state x”.
  • The value function that follows a particular policy Π, $V^\Pi(s) : S \mapsto \mathbb{R}$:
$$V^\Pi(s) = E_\Pi\{R_t \mid s_t = s\} = E_\Pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right\}$$
where $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$ and γ is the discount factor.
  • The optimal value function: $V^*(s) = \max_\Pi V^\Pi(s)$

SLIDE 7

Policy

  • Deterministic policy: $\Pi : S \mapsto A$
  • Probabilistic policy: $\Pi : S \times A \mapsto \mathbb{R}$
  • The optimal policy: $\Pi^* = \arg\max_\Pi V^\Pi(s)$

SLIDE 8

Exploration and Exploitation

  • To gain information about the rewards and the behavior of the system, the agent needs to explore by considering previously unused actions or actions it is uncertain about.
  • It needs to decide whether to stick to well-known actions with high rewards or to try new things in order to discover new strategies with an even higher reward.
  • This problem is commonly known as the exploration-exploitation trade-off (a small ε-greedy sketch follows below).
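One simple and common way to trade the two off is an ε-greedy rule. This is only a sketch, assuming a tabular Q stored as a dictionary keyed by (state, action):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action (explore),
    otherwise pick the action with the highest Q(s, a) (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```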

SLIDE 9

  • Value Function Approach: Dynamic Programming (Value Iteration, Policy Iteration), Monte Carlo, Temporal Difference (TD(λ), SARSA, Q-learning)
  • Policy Search Approach: Policy Gradient, Expectation–Maximization, Information-Theoretic, Path Integral
  • Actor-Critic Approach

SLIDE 10

Bellman Equation

  • The expected long-term reward of a policy can be expressed in a recursive formulation:
$$V^\Pi(s) = E_\Pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right\} = \sum_{s'} P^{\Pi(s)}_{ss'}\left(r(s') + \gamma E_\Pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\middle|\, s_{t+1} = s'\right\}\right) = \sum_{s'} P^{\Pi(s)}_{ss'}\left(r(s') + \gamma V^\Pi(s')\right)$$

SLIDE 11

Value Iteration

  • Value iteration starts with a guess V(0) of the optimal value function and constructs a sequence of improved guesses.
  • This process is guaranteed to converge to the optimal value function V* in a finite number of iterations.
$$V^{(i+1)}(s) = \max_\Pi \sum_{s'} P^{\Pi(s)}_{ss'}\left(r(s') + \gamma V^{(i)}(s')\right)$$
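A minimal tabular sketch of this update, assuming the P[s][a][s'] / r[s] dictionary encoding used earlier and a convergence tolerance chosen arbitrarily:

```python
def value_iteration(P, r, gamma=0.9, tol=1e-6):
    """Iterate V(s) <- max_a sum_{s'} P^a_{ss'} (r(s') + gamma V(s'))
    until the largest change falls below tol."""
    V = {s: 0.0 for s in P}                      # V^(0): initial guess
    while True:
        delta = 0.0
        for s in P:
            backup = max(
                sum(p * (r[s2] + gamma * V[s2]) for s2, p in P[s][a].items())
                for a in P[s]
            )
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < tol:
            return V
```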

SLIDE 12

Policy Iteration

  • Find the optimal policy by iterating two procedures until convergence:

  • Policy Evaluation
  • Policy Improvement
SLIDE 13

Policy Evaluation

Input: a policy $\Pi$
Output: $V^\Pi$

Step 1: Arbitrarily initialize $V(s), \forall s \in S$
Step 2: Repeat
    For each $s$:
        $a = \Pi(s)$
        $V(s) = \sum_{s'} P^a_{ss'}\left(r(s') + \gamma V(s')\right)$
    Until convergence
Step 3: Output $V(s)$
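The same procedure as Python, again assuming the dictionary encoding of P and r from the earlier sketch and a deterministic policy given as a dict pi[s] = a:

```python
def policy_evaluation(P, r, pi, gamma=0.9, tol=1e-6):
    """Iteratively apply V(s) = sum_{s'} P^{pi(s)}_{ss'} (r(s') + gamma V(s'))."""
    V = {s: 0.0 for s in P}                      # Step 1: arbitrary initialization
    while True:                                  # Step 2: repeat until convergence
        delta = 0.0
        for s in P:
            a = pi[s]
            new_v = sum(p * (r[s2] + gamma * V[s2]) for s2, p in P[s][a].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V                             # Step 3: output V
```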

SLIDE 14

Policy Improvement

Input: $V^\Pi$
Output: $\Pi'$

Step 1: For each $s$: $\Pi'(s) = \arg\max_a \sum_{s'} P^a_{ss'}\left(r(s') + \gamma V^\Pi(s')\right)$
Step 2: Output $\Pi'(s)$
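And the corresponding greedy improvement step, under the same assumed encoding:

```python
def policy_improvement(P, r, V, gamma=0.9):
    """For every state, pick the action maximizing the one-step lookahead
    sum_{s'} P^a_{ss'} (r(s') + gamma V(s'))."""
    return {
        s: max(
            P[s],
            key=lambda a: sum(p * (r[s2] + gamma * V[s2]) for s2, p in P[s][a].items()),
        )
        for s in P
    }
```

Policy iteration (Slide 12) then simply alternates policy_evaluation and policy_improvement until the improved policy stops changing.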

SLIDE 15

Monte Carlo Approach

  • Both value iteration and policy iteration use a dynamic programming approach.
  • The dynamic programming approach requires a transition model, P, which is often unavailable in real-world problems.
  • The Monte Carlo algorithm does not require a model to be known. Instead, it generates samples to approximate the value function.

SLIDE 16

The Q function

  • Introduce the Q function, $Q^\Pi(s, a) : S \times A \mapsto \mathbb{R}$:
$$Q^\Pi(s, a) = E_\Pi\{R_t \mid s_t = s, a_t = a\} = E_\Pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right\}$$
  • Use the Q(s, a) function instead of the value function V(s) because, in the absence of a transition model, the values with respect to all possible actions at s must be stored explicitly.
  • The optimal Q function: $Q^*(s, a) = \max_\Pi Q^\Pi(s, a)$, with $V^*(s) = \max_a Q^*(s, a)$ and $\Pi^*(s) = \arg\max_a Q^*(s, a)$
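Those last two identities are easy to read off a tabular Q. A small sketch, assuming Q is stored as a dict keyed by (state, action):

```python
def greedy_from_q(Q, states, actions):
    """Recover V*(s) = max_a Q*(s, a) and Pi*(s) = argmax_a Q*(s, a) from a Q table."""
    V = {s: max(Q[(s, a)] for a in actions) for s in states}
    pi = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
    return V, pi
```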

SLIDE 17

Monte Carlo Policy Iteration

Step 1: Arbitrarily initialize $Q(s, a)$, set $\Pi(s, a) = \frac{1}{|A|}\ \forall s \in S, \forall a \in A$, and an empty list return(s, a)
Step 2: Repeat for many times
    (Policy evaluation)
    Generate an episode $s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} \cdots$ using $\Pi$
    For each pair (s, a) in the episode:
        Compute the long-term return R from (s, a)
        Append R to return(s, a)
    Assign the average of return(s, a) to Q(s, a)
    Continue... (policy improvement on the next slide)

SLIDE 18

Monte Carlo Policy Iteration

    (Policy improvement)
    For all $s$:
        $a^* = \arg\max_a Q(s, a)$
        $\Pi(s, a) = \begin{cases} 1 - \epsilon & \text{if } a = a^* \\ \frac{\epsilon}{|A| - 1} & \text{if } a \ne a^* \end{cases}$
Step 3: Output $\Pi(s) = \arg\max_a Q(s, a)$

$\epsilon$ is a small number, which affects exploration and exploitation (a combined sketch of Slides 17-18 follows below).
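Putting Slides 17-18 together as an every-visit Monte Carlo control sketch. The episode length cap, starting-state choice, episode count, and ε value are assumptions for illustration; P and r follow the earlier dictionary encoding.

```python
import random
from collections import defaultdict

def sample_from(dist):
    """Draw a key from a {outcome: probability} dictionary."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def mc_policy_iteration(P, r, gamma=0.9, epsilon=0.1, episodes=5000, max_steps=50):
    """Every-visit Monte Carlo control with an epsilon-soft policy (Slides 17-18)."""
    states = list(P)
    actions = list(next(iter(P.values())))            # assumes a shared action set
    Q = defaultdict(float)                            # Step 1: arbitrary Q(s, a)
    returns = defaultdict(list)                       # the return(s, a) lists
    pi = {s: {a: 1.0 / len(actions) for a in actions} for s in states}

    for _ in range(episodes):                         # Step 2: repeat many times
        # Generate an episode s0 -a0-> s1 -a1-> ... using the current policy.
        s, episode = random.choice(states), []
        for _ in range(max_steps):
            a = sample_from(pi[s])
            s2 = sample_from(P[s][a])
            episode.append((s, a, r[s2]))
            s = s2

        # Policy evaluation: average the observed returns of each (s, a) pair.
        G = 0.0
        for s, a, reward in reversed(episode):
            G = reward + gamma * G
            returns[(s, a)].append(G)
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])

        # Policy improvement: make pi epsilon-greedy with respect to Q.
        for s in states:
            best = max(actions, key=lambda a: Q[(s, a)])
            for a in actions:
                pi[s][a] = 1 - epsilon if a == best else epsilon / (len(actions) - 1)

    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}  # Step 3
```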

SLIDE 19

Temporal Difference Learning

  • A problem with Monte Carlo learning is that it takes a lot of time to simulate/execute the episodes.
  • Temporal Difference (TD) learning is a combination of Monte Carlo and dynamic programming.
  • It updates the value function based on previously learned estimates.

SLIDE 20

Policy Iteration in TD

Step 1: Arbitrarily initialize $Q(s, a)$ and set $\Pi(s, a) = \frac{1}{|A|}$
Step 2: Repeat for each episode
    s = initial state of the episode
    a = sample from $\Pi(s, a)$
    Repeat for each step in the episode:
        s' = new state reached by taking action a from s
        Continue...

SLIDE 21

Policy Iteration in TD

        $a^* = \arg\max_a Q(s', a)$
        $\Pi(s', a) = \begin{cases} 1 - \epsilon & \text{if } a = a^* \\ \frac{\epsilon}{|A| - 1} & \text{if } a \ne a^* \end{cases}$
        a' = sample from $\Pi(s', a)$
        $Q(s, a) = Q(s, a) + \alpha\left[r(s') + \gamma Q(s', a') - Q(s, a)\right]$
        s = s'; a = a'
    until s is the terminal state
Step 3: Output $\Pi(s) = \arg\max_a Q(s, a)$
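A tabular SARSA sketch corresponding to Slides 20-21. The step cap, α, ε, episode count, and the way terminal states are passed in are assumptions for the example; P and r follow the earlier dictionary encoding.

```python
import random
from collections import defaultdict

def sarsa(P, r, terminal=frozenset(), gamma=0.9, alpha=0.1,
          epsilon=0.1, episodes=5000, max_steps=100):
    """Tabular SARSA, the TD-style policy iteration of Slides 20-21.
    `terminal` is the (possibly empty) set of terminal states."""
    states = list(P)
    actions = list(next(iter(P.values())))            # assumes a shared action set
    Q = defaultdict(float)                            # Step 1: arbitrary Q(s, a)

    def eps_greedy(s):
        """Sample an action from the epsilon-greedy policy Pi(s, .)."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):                         # Step 2: for each episode
        s = random.choice(states)
        a = eps_greedy(s)
        for _ in range(max_steps):                    # cap in case no terminal state is hit
            s2 = random.choices(list(P[s][a]), weights=list(P[s][a].values()))[0]
            a2 = eps_greedy(s2)
            # Q(s,a) <- Q(s,a) + alpha [ r(s') + gamma Q(s',a') - Q(s,a) ]
            Q[(s, a)] += alpha * (r[s2] + gamma * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
            if s in terminal:                         # "until s is the terminal state"
                break

    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}  # Step 3
```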

SLIDE 22

n-step TD and Linear Combination

[Figure: backup diagrams for 1-step TD, 2-step TD, …, n-step TD, and Monte Carlo, showing the states s, s', s'', …, s_n and actions a, a', … visited by each backup.]

  • The n-step returns are combined with weights $(1 - \lambda), (1 - \lambda)\lambda, \ldots, \lambda^{n-1}$.
  • $\lambda = 0$ recovers the 1-step TD method; $\lambda = 1$ recovers the Monte Carlo method.
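For reference (this is the standard textbook forward-view definition, not spelled out on the slide), those weights correspond to the λ-return, a linear combination of the n-step returns:

```latex
% Forward view of TD(lambda): a weighted mix of n-step returns R_t^{(n)}
R_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} R_t^{(n)},
\qquad
R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{\,n-1} r_{t+n} + \gamma^{\,n} V(s_{t+n})
```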