

  1. Reinforcement Learning

  2. Reinforcement Learning • Now that you know a little about Optimal Control Theory, you actually have some knowledge of RL. • RL shares its overall goal with OCT: solving for a control policy such that the cumulative cost is minimized. Both are well suited to problems that involve a long-term versus short-term reward trade-off. • But OCT assumes perfect knowledge of the system’s description in the form of a model and ensures strong guarantees, while RL operates directly on measured data and rewards obtained from interaction with the environment.

  3. RL in Robotics • Reinforcement learning (RL) enables a robot to autonomously discover an optimal behavior through trial-and-error interactions with its environment. • The designer of a control task provides feedback in terms of a scalar objective function that measures the one-step performance of the robot. • Problems are often high-dimensional with continuous states and actions, and the state is often partially observable. • Experience on a real physical system is tedious to obtain, expensive and often hard to reproduce.

  4. Problem Definition • A reinforcement learning problem typically includes: • A set of states S • A set of actions A • Transition rules P^a_{ss′}: S × S × A → ℝ, with 0 ≤ P^a_{ss′} ≤ 1 and Σ_{s′} P^a_{ss′} = 1 • A reward function r: S → ℝ • Here we assume full observability but with a stochastic transition model.
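A minimal sketch of these ingredients as plain data structures. The two-state, two-action MDP below is a made-up toy example (the names P and r are illustrative, not from the slides):

    import numpy as np

    # Hypothetical toy MDP: transition probabilities P[a, s, s'] and a
    # reward r[s'] that depends only on the successor state.
    states = [0, 1]
    actions = [0, 1]
    P = np.array([
        [[0.9, 0.1],     # action 0: rows are s, columns are s'
         [0.2, 0.8]],
        [[0.5, 0.5],     # action 1
         [0.0, 1.0]],
    ])
    r = np.array([0.0, 1.0])

    # Each row of P[a] must be a probability distribution over successors.
    assert np.allclose(P.sum(axis=2), 1.0)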

  5. Long-term Expected Return • Finite-horizon expected return: J = E[ Σ_{k=0}^{H} r_k ] • Infinite-horizon return with a discount factor γ: J = E[ Σ_{k=0}^{∞} γ^k r_k ] • In the limit as γ approaches 1, the metric approaches what is known as the average-reward criterion: J = lim_{H→∞} (1/H) E[ Σ_{k=0}^{H} r_k ]
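As a sanity check, all three criteria can be estimated from a sampled reward sequence; the rewards and γ below are arbitrary illustration values:

    rewards = [1.0, 0.0, 2.0, 1.0]      # one sampled trajectory r_0 ... r_H
    gamma = 0.9

    finite_horizon = sum(rewards)                                        # E[ sum r_k ]
    discounted = sum(gamma**k * r_k for k, r_k in enumerate(rewards))    # E[ sum gamma^k r_k ]
    average_reward = sum(rewards) / len(rewards)                         # (1/H) E[ sum r_k ]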

  6. Value Function • Recall from optimal control theory: v(x) = “minimal total cost for completing the task starting from state x”. • The value function of a particular policy Π is V^Π(s): S → ℝ, defined as V^Π(s) = E_Π{ R_t | s_t = s } = E_Π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }, where R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + · · · and γ is the discount factor. • The optimal value function: V*(s) = max_Π V^Π(s)

  7. Policy • Deterministic policy: Π: S → A • Probabilistic policy: Π: S × A → ℝ • The optimal policy: Π* = arg max_Π V^Π(s)

  8. Exploration and Exploitation • To gain information about the rewards and the behavior of the system, the agent needs to explore by considering previously unused actions or actions it is uncertain about. • It must decide whether to stick to well-known actions with high rewards or to try new things in order to discover new strategies with an even higher reward. • This problem is commonly known as the exploration-exploitation trade-off.
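One common heuristic for this trade-off, which reappears in the Monte Carlo and TD slides below, is ε-greedy action selection. A minimal sketch, assuming a tabular Q stored as a dict keyed by (state, action):

    import random

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        """Pick the greedy action w.p. 1 - epsilon, otherwise explore uniformly."""
        if random.random() < epsilon:
            return random.choice(actions)             # explore
        return max(actions, key=lambda a: Q[(s, a)])  # exploit current estimates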

  9. [Taxonomy diagram of RL methods] • Value Function Approach: Dynamic Programming (Value Iteration, Policy Iteration), Monte Carlo, Temporal Difference (TD(λ), SARSA, Q-learning) • Policy Search Approach: Policy Gradient, Expectation–Maximization, Information-Theoretic, Path Integral • Actor-Critic Approach

  10. Bellman Equation • The expected long-term reward of a policy can be expressed in a recursive formulation: V^Π(s) = E_Π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s } = Σ_{s′} P^{Π(s)}_{ss′} ( r(s′) + γ E_Π{ Σ_{k=0}^{∞} γ^k r_{t+k+2} | s_{t+1} = s′ } ) = Σ_{s′} P^{Π(s)}_{ss′} ( r(s′) + γ V^Π(s′) )

  11. Value Iteration • Value iteration starts with a guess V^(0) of the optimal value function and constructs a sequence of improved guesses: V^(i+1)(s) = max_Π Σ_{s′} P^{Π(s)}_{ss′} ( r(s′) + γ V^(i)(s′) ) • This process is guaranteed to converge to the optimal value function V*.
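A minimal sketch of this backup, written as a max over actions and using the (|A|, |S|, |S|)-shaped P and |S|-vector r conventions of the toy MDP sketched earlier (the discount and tolerance values are illustrative):

    import numpy as np

    def value_iteration(P, r, gamma=0.9, tol=1e-8):
        """V(s) <- max_a sum_s' P[a, s, s'] * (r[s'] + gamma * V[s'])."""
        V = np.zeros(P.shape[1])                 # initial guess V^(0)
        while True:
            Q_values = P @ (r + gamma * V)       # shape (|A|, |S|)
            V_new = Q_values.max(axis=0)         # greedy backup over actions
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new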

  12. Policy Iteration • Find the optimal policy by iterating two procedures until convergence • Policy Evaluation • Policy Improvement

  13. Policy Evaluation • Input: Π   Output: V^Π • Step 1: Arbitrarily initialize V(s), ∀s ∈ S • Step 2: Repeat until convergence: for each s, set a = Π(s) and update V(s) = Σ_{s′} P^a_{ss′} ( r(s′) + γ V(s′) ) • Step 3: Output V(s)
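A sketch of these steps under the same array conventions, assuming a deterministic policy given as an array of action indices:

    import numpy as np

    def policy_evaluation(P, r, policy, gamma=0.9, tol=1e-8):
        """Iterate V(s) <- sum_s' P[policy[s], s, s'] * (r[s'] + gamma * V[s'])."""
        n_states = P.shape[1]
        V = np.zeros(n_states)                        # Step 1: arbitrary init
        while True:                                   # Step 2: sweep until convergence
            V_new = np.array([
                P[policy[s], s] @ (r + gamma * V)     # expected backup under a = Pi(s)
                for s in range(n_states)
            ])
            if np.max(np.abs(V_new - V)) < tol:
                return V_new                          # Step 3: output V
            V = V_new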

  14. Policy Improvement • Input: V^Π   Output: Π′ • Step 1: For each s, set Π′(s) = arg max_a Σ_{s′} P^a_{ss′} ( r(s′) + γ V^Π(s′) ) • Step 2: Output Π′(s)
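A matching sketch of the greedy improvement step; combined with the policy_evaluation sketch above it gives the alternating loop of slide 12:

    import numpy as np

    def policy_improvement(P, r, V, gamma=0.9):
        """Pi'(s) = argmax_a sum_s' P[a, s, s'] * (r[s'] + gamma * V[s'])."""
        Q_values = P @ (r + gamma * V)      # shape (|A|, |S|)
        return Q_values.argmax(axis=0)      # greedy action index per state

    def policy_iteration(P, r, gamma=0.9):
        """Alternate evaluation and improvement until the policy stops changing."""
        policy = np.zeros(P.shape[1], dtype=int)
        while True:
            V = policy_evaluation(P, r, policy, gamma)
            new_policy = policy_improvement(P, r, V, gamma)
            if np.array_equal(new_policy, policy):
                return policy, V
            policy = new_policy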

  15. Monte Carlo Approach • Both value iteration and policy iteration use a dynamic programming approach. • The dynamic programming approach requires a transition model P, which is often unavailable in real-world problems. • A Monte Carlo algorithm does not require a model to be known. Instead, it generates samples to approximate the value function.

  16. The Q function • Introduce the Q function Q^Π(s, a): S × A → ℝ, defined as Q^Π(s, a) = E_Π{ R_t | s_t = s, a_t = a } = E_Π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a } • Use the Q(s, a) function instead of the value function V(s) because, in the absence of a transition model, the values with respect to all possible actions at s must be stored explicitly. • The optimal Q function: Q*(s, a) = max_Π Q^Π(s, a), with V*(s) = max_a Q*(s, a) and Π*(s) = arg max_a Q*(s, a)
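A small sketch of storing Q explicitly as a table and reading the greedy quantities off it (the defaultdict keyed by (state, action) is just one possible representation):

    from collections import defaultdict

    Q = defaultdict(float)     # tabular Q: one entry per (state, action), no model needed
    actions = [0, 1]           # placeholder action set

    def greedy_value(Q, s, actions):
        """V*(s) approximated as max_a Q(s, a)."""
        return max(Q[(s, a)] for a in actions)

    def greedy_policy(Q, s, actions):
        """Pi*(s) approximated as argmax_a Q(s, a)."""
        return max(actions, key=lambda a: Q[(s, a)])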

  17. Monte Carlo Policy Iteration • Step 1: Arbitrarily initialize Q(s, a), set Π(s, a) = 1/|A|, and create an empty list return(s, a), ∀s ∈ S, ∀a ∈ A • Step 2: Repeat many times: • Generate an episode using Π: s_0 —a_0→ s_1 —a_1→ s_2 —a_2→ · · · • Policy evaluation: for each pair (s, a) in the episode, compute the long-term return R from (s, a), append R to return(s, a), and assign the average of return(s, a) to Q(s, a) • Policy improvement: continued on the next slide...

  18. Monte Carlo Policy Iteration • Policy improvement: for all s, let a* = arg max_a Q(s, a) and set Π(s, a) = 1 − ε if a = a*, and Π(s, a) = ε / (|A| − 1) if a ≠ a* • Step 3: Output Π(s) = arg max_a Q(s, a) • ε is a small number, which controls the trade-off between exploration and exploitation
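A compact sketch of the whole loop for an episodic task, reusing the epsilon_greedy helper from slide 8. The env object with reset()/step() returning (next state, reward, done) is an assumed toy interface, not something specified on the slides:

    from collections import defaultdict

    def mc_policy_iteration(env, actions, episodes=10_000, gamma=0.9, epsilon=0.1):
        """Every-visit Monte Carlo control with an epsilon-soft policy."""
        Q = defaultdict(float)
        returns = defaultdict(list)                       # the return(s, a) lists
        for _ in range(episodes):
            # Generate one episode with the current epsilon-greedy policy.
            episode, s, done = [], env.reset(), False
            while not done:
                a = epsilon_greedy(Q, s, actions, epsilon)
                s_next, reward, done = env.step(a)
                episode.append((s, a, reward))
                s = s_next
            # Policy evaluation: average the sampled returns behind each (s, a).
            G = 0.0
            for s, a, reward in reversed(episode):
                G = reward + gamma * G                    # long-term return from (s, a)
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
        # Step 3: output the greedy policy.
        return {s: max(actions, key=lambda a: Q[(s, a)]) for s, _ in Q}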

  19. Temporal Difference Learning • A problem with Monte Carlo learning is that it takes a lot of time to simulate/execute the episodes. • Temporal Difference (TD) learning is a combination of Monte Carlo and dynamic programming. • It updates the value function based on previously learned estimates.

  20. Policy Iteration in TD • Step 1: Arbitrarily initialize Q(s, a) and set Π(s, a) = 1/|A| • Step 2: Repeat for each episode: • s = initial state of the episode • a = a sample drawn from Π(s, ·) • Repeat for each step of the episode: • s′ = new state reached by taking action a from s • continued on the next slide...

  21. Policy Iteration in TD • a* = arg max_a Q(s′, a) • Π(s′, a) = 1 − ε if a = a*, and Π(s′, a) = ε / (|A| − 1) if a ≠ a* • a′ = a sample drawn from Π(s′, ·) • Q(s, a) = Q(s, a) + α [ r(s′) + γ Q(s′, a′) − Q(s, a) ] • s = s′, a = a′ • Until s is the terminal state • Step 3: Output Π(s) = arg max_a Q(s, a)
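This update rule is SARSA. A minimal sketch under the same assumed env interface and epsilon_greedy helper as before (the α, γ and ε values are illustrative):

    from collections import defaultdict

    def td_policy_iteration(env, actions, episodes=10_000,
                            alpha=0.1, gamma=0.9, epsilon=0.1):
        """On-policy TD control (SARSA): update Q from (s, a, r, s', a')."""
        Q = defaultdict(float)
        for _ in range(episodes):
            s = env.reset()
            a = epsilon_greedy(Q, s, actions, epsilon)
            done = False
            while not done:
                s_next, reward, done = env.step(a)
                a_next = epsilon_greedy(Q, s_next, actions, epsilon)
                # Move Q(s, a) toward the one-step bootstrapped target.
                Q[(s, a)] += alpha * (reward + gamma * Q[(s_next, a_next)] - Q[(s, a)])
                s, a = s_next, a_next
        return {s: max(actions, key=lambda a: Q[(s, a)]) for s, _ in Q}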

  22. n-step TD and Linear Combination • [Backup diagrams comparing the 1-step TD, 2-step TD, and Monte Carlo targets.] • TD(λ) forms a linear combination of n-step returns, weighting the n-step return by (1 − λ) λ^(n−1). • λ = 0 recovers the 1-step TD method; λ = 1 recovers the Monte Carlo method.
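A small sketch of how those weights combine sampled n-step returns into the λ-return; the list of returns is a placeholder, and the tail return absorbs the remaining weight so the weights sum to one:

    def lambda_return(n_step_returns, lam):
        """Weight G_n by (1 - lam) * lam**(n - 1); the last return gets lam**(N - 1)."""
        N = len(n_step_returns)
        weighted = sum((1 - lam) * lam**(n - 1) * n_step_returns[n - 1]
                       for n in range(1, N))
        return weighted + lam**(N - 1) * n_step_returns[-1]

    # lam = 0 reduces to the 1-step TD target, lam = 1 to the Monte Carlo return.
    print(lambda_return([1.0, 2.0, 3.0], 0.0))   # -> 1.0
    print(lambda_return([1.0, 2.0, 3.0], 1.0))   # -> 3.0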
