

  1. Reinforcement Learning. Michèle Sebag; TP: Herilalaina Rakotoarison. TAO, CNRS / INRIA / Université Paris-Sud. Jan. 14th, 2019. Credit for slides: Richard Sutton, Freek Stulp, Olivier Pietquin.

  2. Where we are. The MDP is the main building block. General settings: model-based vs. model-free; finite state spaces (dynamic programming, discrete RL) vs. infinite state spaces (optimal control, continuous RL). Last course: function approximation. This course: direct policy search; evolutionary robotics.

  3. Position of the problem. Notations: state space $\mathcal{S}$; action space $\mathcal{A}$; transition model $p(s, a, s') \mapsto [0, 1]$; bounded reward $r(s)$. Mainstream RL is based on values: $V^* : \mathcal{S} \to \mathbb{R}$, with $\pi^*(s) = \arg\mathrm{opt}_{a \in \mathcal{A}} \sum_{s'} p(s, a, s')\, V^*(s')$; or $Q^* : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, with $\pi^*(s) = \arg\mathrm{opt}_{a \in \mathcal{A}} Q^*(s, a)$. What we want is $\pi : \mathcal{S} \to \mathcal{A}$. Aren't we learning something more complex than needed?... ⇒ Let us consider direct policy search.
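A small illustration of the contrast (not from the slides; the tabular $Q^*$ and the toy sizes are assumptions): the value-based route derives the policy from $Q^*$, whereas direct policy search optimizes the policy object itself.

```python
import numpy as np

# Illustration only (not from the slides): contrasting the value-based route,
# which derives pi*(s) = argopt_a Q*(s, a), with direct policy search, where
# the policy itself is the object being optimized. Sizes and Q* are made up.
n_states, n_actions = 5, 3
rng = np.random.default_rng(0)
Q_star = rng.normal(size=(n_states, n_actions))   # stand-in for a learned Q*

# Value-based route: the policy is a by-product of Q*.
pi_from_Q = Q_star.argmax(axis=1)

# Direct policy search route: the policy (here a table, in general a parameter
# vector theta) is what the optimizer manipulates directly.
pi_direct = np.zeros(n_states, dtype=int)
print(pi_from_Q, pi_direct)
```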

  4. From RL to Direct Policy Search. Direct policy search requires defining: the search space (representation of solutions); the optimization criterion; the optimization algorithm.

  5. Examples. (Figure-only slide.)

  6. Representation 1: explicit representation (≡ policy space). $\pi$ is represented as a function from $\mathcal{S}$ to $\mathcal{A}$. Non-parametric representation, e.g. a decision tree or a random forest. Parametric representation: given a function space, $\pi$ is defined by a vector of parameters $\theta$; $\pi_\theta$ may be a linear function on $\mathcal{S}$, a radial basis function on $\mathcal{S}$, or a (deep) neural net. E.g., in the linear case, given $s \in \mathcal{S} = \mathbb{R}^d$ and $\theta \in \mathbb{R}^d$, $\pi_\theta(s) = \langle s, \theta \rangle$.
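A minimal sketch of such parametric policies (illustration only; the dimensions and the softmax variant for discrete actions are assumptions, not from the slides):

```python
import numpy as np

# Sketch of parametric policies: pi_theta(s) = <s, theta> for a continuous
# 1-D action, and a softmax variant for a discrete action space.
def linear_policy(theta, s):
    """Continuous action: inner product of state features and parameters."""
    return float(np.dot(s, theta))

def softmax_policy(Theta, s, rng):
    """Discrete actions: one parameter vector per action, sample from softmax."""
    logits = Theta @ s                        # shape: (n_actions,)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)

rng = np.random.default_rng(0)
s = rng.normal(size=4)                        # a state in R^4
theta = rng.normal(size=4)
Theta = rng.normal(size=(3, 4))               # 3 discrete actions
print(linear_policy(theta, s), softmax_policy(Theta, s, rng))
```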

  7. Representation 2: implicit representation, for example trajectory generators. $\pi(s)$ is obtained by solving an auxiliary problem. For instance: define desired trajectories (dynamic movement primitives); the trajectory is $\tau = f(\theta)$; the action consists of getting back to the trajectory given the current state $s$ (see the sketch below).
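A deliberately simplified sketch of this idea (an assumption, not the slides' dynamic-movement-primitive formulation): the parameters $\theta$ define a desired trajectory, and the action is a PD correction that drives the system back toward it.

```python
import numpy as np

# Illustrative sketch only: an "implicit" policy that tracks a parameterized
# desired trajectory tau = f(theta) with a simple PD controller. The trajectory
# generator and the gains are assumptions chosen for brevity.
def desired_trajectory(theta, t):
    """tau = f(theta): here a sine whose amplitude and frequency are the parameters."""
    amplitude, frequency = theta
    return amplitude * np.sin(frequency * t)

def policy(theta, s, t, kp=2.0, kd=0.5, dt=0.05):
    """Action = drive the system back toward the desired trajectory."""
    pos, vel = s
    target = desired_trajectory(theta, t)
    target_vel = (desired_trajectory(theta, t + dt) - target) / dt
    return kp * (target - pos) + kd * (target_vel - vel)

theta = np.array([1.0, 2.0])
print(policy(theta, s=(0.3, 0.0), t=0.1))
```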

  8. Direct policy search in RL. Two approaches: model-free and model-based. History: model-free approaches came first; they work well but (i) require many examples and (ii) these examples must be used in a smart way. Model-based approaches are more recent; they proceed by (i) modelling the MDP from examples (this learning step has to be smart) and (ii) using the model as if it were a simulator. Important point: the model must give both a prediction and a confidence interval (this will be very important for exploration).

  9. Outline: DPS, the model-free approach; DPS, the model-based approach; Gaussian processes; evolutionary robotics (reminder, evolution of morphology, others).

  10. The model-free approach. Algorithm: (1) explore: generate trajectories $\tau_i = (s_{i,t}, a_{i,t})_{t=1}^{T}$ following $\pi_{\theta_k}$; (2) evaluate: compute the quality of trajectories (episode-based) or of (state, action) pairs (step-based); (3) update: compute $\theta_{k+1}$. Two modes. Episode-based: learn a distribution $D_k$ over $\Theta$; draw $\theta$ from $D_k$, generate a trajectory, measure its quality; bias $D_k$ toward the high-quality regions of $\Theta$. Step-based: draw $a_t$ from $\pi(s_t, \theta_k)$; measure $q_\theta(s, a)$ from the cumulative reward gathered after having visited $(s, a)$. (A sketch of the episode-based loop follows.)
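A minimal sketch of the episode-based loop (the toy objective standing in for a rollout, the population size, and the elite fraction are assumptions):

```python
import numpy as np

# Minimal sketch of the episode-based mode: keep a Gaussian distribution D_k
# over Theta, draw candidate theta's, evaluate each by the return of one
# rollout, and bias D_k toward the high-quality region of Theta.
rng = np.random.default_rng(0)

def rollout_return(theta):
    """Stand-in for 'generate a trajectory with pi_theta and measure its quality'."""
    return -np.sum((theta - 1.5) ** 2) + 0.1 * rng.standard_normal()

mean, std = np.zeros(3), np.ones(3)                      # D_0 = N(mean, diag(std^2))
for k in range(50):
    thetas = mean + std * rng.standard_normal((20, 3))   # draw theta ~ D_k
    returns = np.array([rollout_return(t) for t in thetas])
    elite = thetas[returns.argsort()[-5:]]               # best candidates
    mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3   # bias D_k -> D_{k+1}
print(mean)    # approaches [1.5, 1.5, 1.5]
```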

  11-12. Model-free episode-based DPS: PROS. Getting rid of the Markovian assumption. Example: a rover on Mars must take a picture of region 1, region 2, ... (the reward depends on which regions have already been photographed, not only on the current state).

  13. PROS, 2. Hopes of scalability: with respect to continuous state spaces; no divergence even under function approximation. Tackling more ambitious goals (see also evolutionary RL): partial observability does not hurt convergence (though it increases the computational cost); optimizing the controller (software) and also the morphology of the robot (hardware); possibly considering the co-operation of several robots...

  14. Model-free episode-based DPS: CONS. The global-optimum properties are lost: in general this is not a well-posed optimization problem, and the Bellman equation is lost ⇒ larger variance of the solutions. It is a noisy optimization problem: a policy $\pi$ induces a distribution over trajectories (depending on the starting point and on noise in the environment, sensors, actuators...), with $V(\theta) \stackrel{\mathrm{def}}{=} \mathbb{E}\left[\sum_t \gamma^t r_{t+1} \mid \theta\right]$, or equivalently $V(\theta) \stackrel{\mathrm{def}}{=} \mathbb{E}_\theta[J(\mathrm{trajectory})]$. In practice, $V(\theta) \approx \frac{1}{K} \sum_{i=1}^{K} J(\mathrm{trajectory}_i)$. How many trajectories are needed? This requires tons of examples (see the sketch below).
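A small illustration of why this matters (the toy dynamics and reward are assumptions): the Monte Carlo estimate of $V(\theta)$ is noisy, and its spread only shrinks as $1/\sqrt{K}$.

```python
import numpy as np

# Illustration: the Monte Carlo estimate V(theta) ~ (1/K) sum_i J(trajectory_i)
# is noisy; its standard deviation across repeated estimations shrinks as 1/sqrt(K).
def trajectory_return(theta, rng, horizon=50, gamma=0.95):
    """One noisy rollout of a toy 1-D system controlled by a = theta * s."""
    s, ret = 1.0, 0.0
    for t in range(horizon):
        s = 0.9 * s + theta * s + 0.1 * rng.standard_normal()   # noisy dynamics
        ret += (gamma ** t) * (-s ** 2)                          # reward r(s) = -s^2
    return ret

rng = np.random.default_rng(0)
theta = -0.5
for K in (1, 10, 100, 1000):
    estimates = [np.mean([trajectory_return(theta, rng) for _ in range(K)])
                 for _ in range(20)]
    print(f"K={K:5d}  mean={np.mean(estimates):8.3f}  std={np.std(estimates):6.3f}")
```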

  15. CONS, 2. The in-situ vs. in-silico dilemma: in-situ, launch the robot in real life and observe what happens; in-silico, use a simulator; but is the simulator realistic? The exploration vs. exploitation dilemma: it arises both when generating the new trajectories and when updating the current solution, $\theta_{t+1} = \theta_t + \alpha_t \nabla V(\theta_t)$, which is very sensitive to the learning rate $\alpha_t$.

  16. The model-free approach: how. An optimization objective and an optimization mechanism: gradient-based optimization; define basis functions $\phi_i$ and learn the weights $\alpha_i$; or use black-box optimization.

  17. Cumulative value and its gradient. The cumulative discounted value is $V(s_0) = r(s_0) + \sum_{t \geq 1} \gamma^t r(s_t)$, with $s_{t+1}$ the state following $s_t$ under policy $\pi_\theta$. The gradient can be estimated by finite differences: $\frac{\partial V(s_0, \theta)}{\partial \theta} \approx \frac{V(s_0, \theta + \epsilon) - V(s_0, \theta - \epsilon)}{2\epsilon}$. The model $p(s_{t+1} \mid s_t, a_t, \theta)$ is not required, but it is useful. Large variance! Many samples are needed. A trick: when using a simulator, fix the random seed and reset; then $V(s_0, \theta)$ has no variance, and its gradient estimate has a much smaller variance (see the sketch below).
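A sketch of the finite-difference estimator and of the fixed-seed trick (toy rollout; the dynamics and the number of repetitions are assumptions):

```python
import numpy as np

# Illustration: central finite-difference estimate of dV/dtheta, with and
# without the "fix the random seed" trick (common random numbers).
def V(theta, seed, horizon=50, gamma=0.95):
    rng = np.random.default_rng(seed)
    s, ret = 1.0, 0.0
    for t in range(horizon):
        s = 0.9 * s + theta * s + 0.1 * rng.standard_normal()
        ret += (gamma ** t) * (-s ** 2)
    return ret

def fd_gradient(theta, eps, seed_plus, seed_minus):
    return (V(theta + eps, seed_plus) - V(theta - eps, seed_minus)) / (2 * eps)

theta, eps = -0.5, 0.05
naive = [fd_gradient(theta, eps, 2 * i, 2 * i + 1) for i in range(200)]  # fresh noise
crn   = [fd_gradient(theta, eps, i, i) for i in range(200)]              # same seed twice
print("independent-noise estimator std:", np.std(naive))
print("fixed-seed estimator std:       ", np.std(crn))   # much smaller
```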

  18. Average value and its gradient. With no discount, the long-term average reward is $V(s) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[\sum_t r(s_t) \mid s_0 = s\right]$. Assumption: the Markov chain is ergodic (after a while, the initial state does not matter). Then $V(s)$ does not depend on $s$, and one can estimate the fraction of time spent in state $s$, $q(\theta, s) = \Pr_\theta(S = s)$. This yields another value to optimize: $V(\theta) = \mathbb{E}_\theta[r(S)] = \sum_s r(s)\, q(\theta, s)$.
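A tiny illustration of these quantities (the two-state chain and its rewards are assumptions): $q(\theta, \cdot)$ and $V(\theta)$ estimated from one long trajectory.

```python
import numpy as np

# Illustration: estimate the occupancy q(theta, s) and the average reward
# V(theta) = sum_s r(s) q(theta, s) from one long trajectory; by ergodicity
# the result does not depend on the initial state.
P = np.array([[0.9, 0.1],     # transition matrix induced by some pi_theta
              [0.4, 0.6]])
r = np.array([0.0, 1.0])      # reward per state

rng = np.random.default_rng(0)
s, visits = 0, np.zeros(2)
for _ in range(100_000):
    visits[s] += 1
    s = rng.choice(2, p=P[s])

q = visits / visits.sum()            # fraction of time spent in each state
print("q(theta, .):", q)             # ~ stationary distribution of P
print("V(theta):  ", float(r @ q))   # long-term average reward
```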

  19. Model-free direct policy search. Algorithm: (1) $V(\theta) = \mathbb{E}_\theta[r(S)] = \sum_s r(s)\, q(\theta, s)$; (2) compute or estimate the gradient $\nabla V(\theta)$; (3) $\theta_{t+1} = \theta_t + \alpha_t \nabla V(\theta_t)$. Computing the derivative: $\nabla V = \nabla \sum_s r(s)\, q(\theta, s) = \sum_s r(s)\, \nabla q(\theta, s) = \mathbb{E}_{S, \theta}\!\left[ r(S)\, \frac{\nabla q(\theta, S)}{q(\theta, S)} \right] = \mathbb{E}_{S, \theta}\left[ r(S)\, \nabla \log q(\theta, S) \right]$. An unbiased estimate of the gradient (the integral becomes an empirical sum): $\hat{\nabla} V = \frac{1}{N} \sum_i r(s_i)\, \frac{\nabla q(\theta, s_i)}{q(\theta, s_i)}$.
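A minimal sketch of this log-derivative estimator (the softmax distribution over three states and the rewards are assumptions):

```python
import numpy as np

# Score-function (log-derivative) estimator: S ~ q_theta, a softmax over three
# "states" with rewards r; grad V is estimated as the empirical mean of
# r(S) * grad log q(theta, S), then used for gradient ascent.
r = np.array([0.0, 1.0, 3.0])

def q(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_log_q(theta, s):
    g = -q(theta)
    g[s] += 1.0                      # d/dtheta_j log q_theta(s) = 1{j=s} - q_j
    return g

rng = np.random.default_rng(0)
theta, alpha = np.zeros(3), 0.1
for _ in range(1000):
    samples = rng.choice(3, size=100, p=q(theta))
    grad_hat = np.mean([r[s] * grad_log_q(theta, s) for s in samples], axis=0)
    theta += alpha * grad_hat        # gradient ascent on V(theta) = E[r(S)]
print(q(theta))                      # probability mass concentrates on the best state
```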

  20. The success matching principle: $\pi_{\mathrm{new}}(a \mid s) \propto \mathrm{Success}(s, a, \theta) \cdot \pi_{\mathrm{old}}(a \mid s)$. Different computations of "Success": draw $\theta \sim D_k$, generate a trajectory and evaluate $V(\theta)$; transform the evaluation into a (non-negative) probability $w_k$; find the mixture policy $\pi_{k+1}$ with $p(a \mid s) \propto \sum_k w_k\, p(a \mid s, \theta_k)$; find $\theta_{k+1}$ accounting for $\pi_{k+1}$; update $D_k$ and iterate (see the sketch after the next slide).

  21. Computing the weights: $w_k = \exp\left(\beta\, (V(\theta_k) - \min V)\right)$, where $\beta$ is the temperature of the optimization, as in simulated annealing. Example: $w_k = \exp\left(10\, \frac{V(\theta_k) - \min V}{\max V - \min V}\right)$.
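A sketch of a success-matching-style update with such exponential weights, re-fitting a Gaussian $D_k$ over $\Theta$ (the toy objective, the population size, and $\beta$ are assumptions):

```python
import numpy as np

# Episode-based, success-matching-style update: draw theta ~ D_k, evaluate
# V(theta), turn evaluations into exponential weights, and re-fit a Gaussian D_{k+1}.
def rollout_return(theta, rng):
    return -np.sum((theta - 2.0) ** 2) + 0.1 * rng.standard_normal()

rng = np.random.default_rng(0)
mean, cov, beta = np.zeros(2), np.eye(2), 10.0
for k in range(60):
    thetas = rng.multivariate_normal(mean, cov, size=30)        # theta ~ D_k
    V = np.array([rollout_return(t, rng) for t in thetas])
    w = np.exp(beta * (V - V.min()) / (V.max() - V.min() + 1e-12))
    w /= w.sum()                                                # non-negative weights
    mean = w @ thetas                                           # weighted mean
    centered = thetas - mean
    cov = (w[:, None] * centered).T @ centered + 1e-4 * np.eye(2)  # weighted covariance
print(mean)   # approaches [2.0, 2.0]
```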

  22. Model-free direct policy search, summary. Algorithm: define the criterion to be optimized (cumulative value or average value); define the search space ($\Theta$, a parametric representation of $\pi$); optimize it, $\theta_k \to \theta_{k+1}$, either with gradient approaches or by updating a distribution $D_k$ on $\Theta$; in the step-based mode or the success-matching case, find the next best $q^*_{k+1}(s, a)$, then find $\theta_{k+1}$ such that $Q^{\pi_{\theta_{k+1}}} = q^*_{k+1}$. Pros: it works. Cons: it requires tons of examples, and the optimization process is difficult to tune (the learning rate is difficult to adjust; regularization, e.g. via a KL divergence, is badly needed and difficult to adjust).

  23. Outline: DPS, the model-free approach; DPS, the model-based approach; Gaussian processes; evolutionary robotics (reminder, evolution of morphology, others).

  24. Direct policy search: the model-based approach. Algorithm: (1) use data $\tau_i = (s_{i,t}, a_{i,t})_{t=1}^{T}$ to learn a forward model $\hat{p}(s' \mid s, a)$; (2) use the model as a simulator (you need both the estimate and the confidence of the estimate, for exploration); (3) optimize the policy; (4) use the policy on the robot and improve the model. (A sketch follows.)
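An end-to-end toy sketch of this loop (the linear system, the least-squares forward model, and random search as the policy optimizer are all assumptions chosen for brevity):

```python
import numpy as np

# Toy model-based loop: collect data on the "real" system, fit a forward model,
# then optimize a policy inside the learned model used as a simulator.
rng = np.random.default_rng(0)

# 1. Collect data (s, u, s') on the real system s' = 0.9 s + 0.5 u + noise,
#    using an exploratory (random) policy.
S = rng.normal(size=500)
U = rng.normal(size=500)
S_next = 0.9 * S + 0.5 * U + 0.05 * rng.standard_normal(500)

# 2. Learn the forward model by least squares: s' ~ a_hat * s + b_hat * u.
X = np.stack([S, U], axis=1)
a_hat, b_hat = np.linalg.lstsq(X, S_next, rcond=None)[0]

# 3. Optimize a linear policy u = theta * s inside the learned model.
def simulated_return(theta, horizon=30):
    s, ret = 1.0, 0.0
    for _ in range(horizon):
        s = a_hat * s + b_hat * (theta * s)
        ret += -s ** 2
    return ret

candidates = np.linspace(-3.0, 1.0, 200)
theta_star = candidates[np.argmax([simulated_return(t) for t in candidates])]
# 4. On a real robot, one would now run theta_star, collect new data, and refine the model.
print("learned model:", a_hat, b_hat, " best theta in model:", theta_star)
```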

  25. Outline: DPS, the model-free approach; DPS, the model-based approach; Gaussian processes; evolutionary robotics (reminder, evolution of morphology, others).

  26-28. Learning the model: modeling and predicting (figure-only slides). When optimizing a model, it is very useful to have a measure of uncertainty on the prediction.

  29-36. Learning the model, 2: Gaussian processes (a sequence of figure-only slides). http://www.gaussianprocess.org/ (A GP regression sketch follows.)

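A minimal Gaussian-process regression sketch (using scikit-learn, which is an assumption; the slides only point to gaussianprocess.org): the GP returns both a prediction and a standard deviation, which is exactly what the model-based approach needs for exploration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# A GP forward model gives both a prediction and an uncertainty estimate;
# here it is fit on toy 1-D data standing in for (state-action, next-state) pairs.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))                 # observed inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)  # observed outputs

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, y)

X_query = np.linspace(-5, 5, 9).reshape(-1, 1)
mean, std = gp.predict(X_query, return_std=True)     # prediction + confidence
for x, m, s in zip(X_query[:, 0], mean, std):
    print(f"x={x:5.2f}  mean={m:6.3f}  std={s:5.3f}")  # std grows away from the data
```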
