Reinforcement Learning
- Now that you know a little about Optimal Control Theory, you already have some knowledge of RL.
- RL shares its overall goal with OCT: solving for a control policy such that the cumulative cost is minimized. It is well suited to problems that involve a long-term versus short-term reward trade-off.
- But OCT assumes perfect knowledge of the system’s
description in the form of a model and ensures strong guarantees, while RL operates directly on measured data and rewards from interaction with the environment.
RL in Robotics
- Reinforcement learning (RL) enables a robot to autonomously
discover an optimal behavior through trial-and-error interactions with its environment.
- The designer of a control task provides feedback in terms of a
scalar objective function that measures the one-step performance of the robot.
- Problems are often high-dimensional with continuous states
and actions, and the state is often partially observable.
- Experience on a real physical system is tedious to obtain,
expensive and often hard to reproduce.
Problem Definition
- A reinforcement learning problem typically includes:
  - A set of states: $S$
  - A set of actions: $A$
  - Transition rules: $P^a_{ss'} : S \times S \times A \mapsto \mathbb{R}$, with $0 \le P^a_{ss'} \le 1$ and $\sum_{s'} P^a_{ss'} = 1$
  - Reward function: $r : S \mapsto \mathbb{R}$
- Here we assume full observability but with a stochastic transition model (see the sketch below).
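As a rough illustration (not from the slides), the components above can be written out for a toy two-state problem; the state names, action names, and numbers are invented for the example.

```python
# Hypothetical two-state, two-action MDP, to make S, A, P^a_{ss'} and r concrete.
S = ["s0", "s1"]
A = ["stay", "go"]

# Transition rules P[a][s][s'], each entry in [0, 1] and each row summing to 1.
P = {
    "stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.0, "s1": 1.0}},
    "go":   {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.7, "s1": 0.3}},
}

# Reward function r : S -> R.
r = {"s0": 0.0, "s1": 1.0}

# Sanity check: every transition distribution sums to 1.
for a in A:
    for s in S:
        assert abs(sum(P[a][s].values()) - 1.0) < 1e-9
```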
Long-term Expected Return
- Finite-horizon expected return:
  $J = E\left[\sum_{k=0}^{H} r_k\right]$
- Infinite-horizon return with a discount factor $\gamma$:
  $J = E\left[\sum_{k=0}^{\infty} \gamma^k r_k\right]$
- In the limit as $\gamma$ approaches 1, the metric approaches what is known as the average-reward criterion:
  $J = \lim_{H \to \infty} E\left[\frac{1}{H}\sum_{k=0}^{H} r_k\right]$
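As a quick illustration (the reward sequence below is made up), the three criteria can be computed from one sampled reward sequence; the expectation $E$ is approximated here by a single trajectory.

```python
# Illustrative reward sequence r_0, ..., r_H from one sampled trajectory.
rewards = [1.0, 0.0, 2.0, 1.0, 0.5]
gamma = 0.9

# Finite-horizon expected return: plain sum up to the horizon H.
J_finite = sum(rewards)

# Infinite-horizon discounted return (truncated at the available samples).
J_discounted = sum(gamma**k * r_k for k, r_k in enumerate(rewards))

# Average-reward criterion: mean reward per step.
J_average = sum(rewards) / len(rewards)

print(J_finite, J_discounted, J_average)
```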
Value Function
- Recall from optimal control theory, v(x) = "minimal total cost for completing the task starting from state x".
- A value function that follows a particular policy $\Pi$, $V^\Pi(s) : S \mapsto \mathbb{R}$:
  $V^\Pi(s) = E_\Pi\{R_t \mid s_t = s\} = E_\Pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\right\}$
  where $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$ and $\gamma$ is the discount factor.
- The optimal value function:
  $V^*(s) = \max_\Pi V^\Pi(s)$
Policy
- Deterministic policy: $\Pi : S \mapsto A$
- Probabilistic policy: $\Pi : S \times A \mapsto \mathbb{R}$
- The optimal policy: $\Pi^* = \arg\max_\Pi V^\Pi(s)$
Exploration and Exploitation
- To gain information about the rewards and the behavior of the
system, the agent needs to explore by considering previously unused actions or actions it is uncertain about.
- The agent needs to decide whether to stick to well-known actions with high rewards or to try new actions in order to discover strategies with an even higher reward.
- This problem is commonly known as the exploration-
exploitation trade-off.
[Diagram: taxonomy of RL methods]
- Value Function Approach: Dynamic Programming (Value Iteration, Policy Iteration), Monte Carlo, Temporal Difference (TD(λ), SARSA, Q-learning)
- Policy Search Approach: Policy Gradient, Expectation–Maximization, Information-Theoretic, Path Integral
- Actor-Critic Approach
Bellman Equation
- The expected long-term reward of a policy can be expressed in
a recursive formulation.
$V^\Pi(s) = E_\Pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\right\}$
$\quad = \sum_{s'} P^{\Pi(s)}_{ss'}\left(r(s') + \gamma\, E_\Pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\Big|\, s_{t+1} = s'\right\}\right)$
$\quad = \sum_{s'} P^{\Pi(s)}_{ss'}\left(r(s') + \gamma V^\Pi(s')\right)$
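Because this recursion is linear in $V^\Pi$ for a fixed policy, it can be solved exactly for small discrete problems. A minimal NumPy sketch, assuming a made-up 3-state chain with the policy already folded into the transition matrix:

```python
import numpy as np

gamma = 0.9

# P_pi[s, s'] = P^{Pi(s)}_{ss'} for a fixed (hypothetical) policy Pi on 3 states.
P_pi = np.array([[0.8, 0.2, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.1, 0.0, 0.9]])

# Reward received upon entering each successor state s'.
r = np.array([0.0, 1.0, 5.0])

# Bellman equation: V = P_pi (r + gamma V)  =>  (I - gamma P_pi) V = P_pi r
V = np.linalg.solve(np.eye(3) - gamma * P_pi, P_pi @ r)
print(V)
```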
Value Iteration
- Value iteration starts with a guess V(0) of the optimal value function and constructs a sequence of improved guesses.
- This process is guaranteed to converge to the optimal value function V* (see the sketch after the update rule below).
$V^{(i+1)}(s) = \max_\Pi \sum_{s'} P^{\Pi(s)}_{ss'}\left(r(s') + \gamma V^{(i)}(s')\right)$
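A minimal value-iteration sketch in NumPy, assuming a tabular toy model (all numbers, the tolerance, and the iteration cap are arbitrary):

```python
import numpy as np

gamma, tol = 0.9, 1e-8

# P[a, s, s']: transition probabilities; r[s']: reward on entering s'. Toy values.
P = np.array([[[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],   # action 0
              [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.2, 0.0, 0.8]]])  # action 1
r = np.array([0.0, 0.0, 1.0])

V = np.zeros(3)                              # initial guess V^(0)
for _ in range(10_000):
    # Q[a, s] = sum_{s'} P^a_{ss'} (r(s') + gamma V(s'))
    Q = P @ (r + gamma * V)
    V_new = Q.max(axis=0)                    # greedy backup over actions
    converged = np.max(np.abs(V_new - V)) < tol
    V = V_new
    if converged:
        break

print(V, Q.argmax(axis=0))                   # optimal values and greedy policy
```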
Policy Iteration
- Find the optimal policy by iterating two procedures until
convergence
- Policy Evaluation
- Policy Improvement
Policy Evaluation
Input: $\Pi$    Output: $V^\Pi$
Step 1: Arbitrarily initialize $V(s)$, $\forall s \in S$
Step 2: Repeat until convergence
          For each $s$:
            $a = \Pi(s)$
            $V(s) = \sum_{s'} P^a_{ss'}\left(r(s') + \gamma V(s')\right)$
Step 3: Output $V(s)$
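A small sketch of the evaluation loop above on a tabular toy model; the policy is a hypothetical fixed array mapping each state to one action.

```python
import numpy as np

gamma, tol = 0.9, 1e-8

# Toy model: P[a, s, s'] transition probabilities, r[s'] reward on entering s'.
P = np.array([[[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],
              [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.2, 0.0, 0.8]]])
r = np.array([0.0, 0.0, 1.0])
pi = np.array([1, 1, 0])                 # hypothetical deterministic policy Pi(s)

V = np.zeros(3)                          # Step 1: arbitrary initialization
while True:                              # Step 2: repeat until convergence
    V_new = np.array([P[pi[s], s] @ (r + gamma * V) for s in range(3)])
    if np.max(np.abs(V_new - V)) < tol:
        break
    V = V_new

print(V_new)                             # Step 3: output V^Pi
```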
Policy Improvement
Input: $V^\Pi$    Output: $\Pi'$
Step 1: For each $s$:
          $\Pi'(s) = \arg\max_a \sum_{s'} P^a_{ss'}\left(r(s') + \gamma V^\Pi(s')\right)$
Step 2: Output $\Pi'(s)$
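The greedy improvement step, sketched for the same toy model; alternating it with the evaluation loop above gives policy iteration.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],
              [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.2, 0.0, 0.8]]])
r = np.array([0.0, 0.0, 1.0])

def improve(V_pi):
    """Pi'(s) = argmax_a sum_{s'} P^a_{ss'} (r(s') + gamma V^Pi(s'))."""
    Q = P @ (r + gamma * V_pi)       # Q[a, s]
    return Q.argmax(axis=0)

V_pi = np.array([0.0, 0.5, 10.0])    # pretend this came from policy evaluation
print(improve(V_pi))                 # improved policy Pi'
```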
Monte Carlo Approach
- Both value iteration and policy iteration use a dynamic programming approach.
- The dynamic programming approach requires a transition model, P, which is often unavailable in real-world problems.
- The Monte Carlo algorithm does not require a model to be known. Instead, it generates sample episodes to approximate the value function.
The Q function
- Introduce the Q function, $Q^\Pi(s, a) : S \times A \mapsto \mathbb{R}$:
  $Q^\Pi(s, a) = E_\Pi\{R_t \mid s_t = s, a_t = a\} = E_\Pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a\right\}$
- Use the Q(s, a) function instead of the value function V(s) because, in the absence of a transition model, the values with respect to all possible actions at s must be stored explicitly.
- The optimal Q function:
  $Q^*(s, a) = \max_\Pi Q^\Pi(s, a)$, with $V^*(s) = \max_a Q^*(s, a)$ and $\Pi^*(s) = \arg\max_a Q^*(s, a)$
Monte Carlo Policy Iteration
Step 1: Arbitrarily initialize $Q(s, a)$, set $\Pi(s, a) = \frac{1}{|A|}$ for all $s \in S$, $a \in A$, and create an empty list return(s, a).
Step 2 (policy evaluation): Repeat many times
          Generate an episode $s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} \cdots$ using $\Pi$
          For each pair $(s, a)$ in the episode:
            Compute the long-term return R from (s, a)
            Append R to return(s, a)
          Assign the average of return(s, a) to Q(s, a)
Continue...
Monte Carlo Policy Iteration
Policy improvement: For all $s$
          $a^* = \arg\max_a Q(s, a)$
          $\Pi(s, a) = \begin{cases} 1 - \epsilon & \text{if } a = a^* \\ \frac{\epsilon}{|A| - 1} & \text{if } a \ne a^* \end{cases}$
          ($\epsilon$ is a small number, which balances exploration and exploitation)
Step 3: Output $\Pi(s) = \arg\max_a Q(s, a)$
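A compact sketch of this Monte Carlo control loop (every-visit returns, ε-greedy improvement) on a made-up episodic chain environment; every environment detail and hyperparameter here is invented for illustration.

```python
import random
from collections import defaultdict

gamma, epsilon, n_actions = 0.95, 0.1, 2

def step(s, a):
    """Toy episodic environment: action 1 tends to move right; state 3 is terminal."""
    s2 = min(s + 1, 3) if (a == 1 or random.random() < 0.2) else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0)

Q = defaultdict(float)
returns = defaultdict(list)                          # return(s, a) lists
policy = defaultdict(lambda: [1.0 / n_actions] * n_actions)

for _ in range(2000):                                # Step 2: repeat for many episodes
    # Generate an episode using the current stochastic policy Pi.
    episode, s = [], 0
    while s != 3 and len(episode) < 50:
        a = random.choices(range(n_actions), weights=policy[s])[0]
        s2, rwd = step(s, a)
        episode.append((s, a, rwd))
        s = s2
    # Policy evaluation: append each observed return and average into Q.
    G = 0.0
    for (s, a, rwd) in reversed(episode):
        G = rwd + gamma * G
        returns[(s, a)].append(G)
        Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
    # Policy improvement: epsilon-greedy around a* = argmax_a Q(s, a).
    for s in {s for (s, _, _) in episode}:
        a_star = max(range(n_actions), key=lambda a: Q[(s, a)])
        policy[s] = [1 - epsilon if a == a_star else epsilon / (n_actions - 1)
                     for a in range(n_actions)]

# Step 3: output the greedy policy.
print({s: max(range(n_actions), key=lambda a: Q[(s, a)]) for s in range(3)})
```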
Temporal Difference Learning
- A problem with Monte Carlo learning is that it takes a lot of time to simulate/execute the episodes.
- Temporal Difference (TD) learning is a combination of Monte
Carlo and dynamic programming.
- Update the value function based on previously learned
estimates.
Policy Iteration in TD
Step 1: Arbitrarily initialize $Q(s, a)$ and set $\Pi(s, a) = \frac{1}{|A|}$
Step 2: Repeat for each episode
          s = initial state of the episode
          a = generate a sample from Π(s, a)
          Repeat for each step in the episode:
            s' = new state reached by taking action a from s
Continue...
Policy Iteration in TD
            $a^* = \arg\max_a Q(s', a)$
            $\Pi(s', a) = \begin{cases} 1 - \epsilon & \text{if } a = a^* \\ \frac{\epsilon}{|A| - 1} & \text{if } a \ne a^* \end{cases}$
            a' = generate a sample from Π(s', a)
            $Q(s, a) = Q(s, a) + \alpha\left[r(s') + \gamma Q(s', a') - Q(s, a)\right]$
            s = s',  a = a'
          until s is the terminal state
Step 3: Output $\Pi(s) = \arg\max_a Q(s, a)$
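The update above is the SARSA rule. A minimal tabular sketch on the same kind of made-up chain environment (all environment details, hyperparameters, and the step cap are invented for illustration):

```python
import random
from collections import defaultdict

gamma, alpha, epsilon, n_actions = 0.95, 0.1, 0.1, 2

def step(s, a):
    """Toy episodic environment: action 1 tends to move right; state 3 is terminal."""
    s2 = min(s + 1, 3) if (a == 1 or random.random() < 0.2) else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0)

def eps_greedy(Q, s):
    """Sample an action from the epsilon-greedy policy Pi(s, .)."""
    a_star = max(range(n_actions), key=lambda a: Q[(s, a)])
    if random.random() < epsilon:
        return random.choice([a for a in range(n_actions) if a != a_star])
    return a_star

Q = defaultdict(float)
for _ in range(2000):                        # Step 2: repeat for each episode
    s, steps = 0, 0
    a = eps_greedy(Q, s)
    while s != 3 and steps < 200:            # repeat for each step until terminal
        s2, rwd = step(s, a)
        a2 = eps_greedy(Q, s2)
        # TD update: Q(s,a) += alpha [ r(s') + gamma Q(s',a') - Q(s,a) ]
        Q[(s, a)] += alpha * (rwd + gamma * Q[(s2, a2)] - Q[(s, a)])
        s, a = s2, a2
        steps += 1

# Step 3: output the greedy policy Pi(s) = argmax_a Q(s, a)
print({s: max(range(n_actions), key=lambda a: Q[(s, a)]) for s in range(3)})
```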
n-step TD and Linear Combination
[Diagram: backup diagrams for 1-step TD, 2-step TD, ..., and Monte Carlo. TD(λ) combines the n-step returns with weights (1 − λ), (1 − λ)λ, ..., λ^{n−1}; λ = 0 recovers the 1-step TD method and λ = 1 recovers the Monte Carlo method.]