Neural Fitted Actor-Critic
Matthieu Zimmer, Alain Dutech, Yann Boniface
University of Lorraine, LORIA
8th July 2016


  1. Title slide: Neural Fitted Actor-Critic. Matthieu Zimmer, Alain Dutech, Yann Boniface. University of Lorraine, LORIA. 8th July 2016.

  2. Outline
     1. Background
     2. Neural Fitted Actor-Critic
     3. Future works

  3. Reinforcement Learning

  4. Reinforcement Learning
     Optimization problem: find a policy $\pi : S \to A$ that maximizes the expected discounted return
     $$ \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right] $$
     A toy sketch of this return follows.
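To make the objective concrete, here is a minimal sketch of the discounted return being maximized; the function name, discount factor, and reward values are illustrative, not from the talk.

```python
# A minimal sketch, assuming a finite rollout of scalar rewards.
def discounted_return(rewards, gamma=0.99):
    """Compute sum_{t=0}^{T-1} gamma^t * r_t for one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A reward of 1 received two steps in the future is worth gamma^2 today.
print(discounted_return([0.0, 0.0, 1.0]))  # 0.99**2 = 0.9801
```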

  5. Constraints and Motivations
     Reinforcement learning + developmental robotics:
     1. Continuous environments
     2. No prior models of the agent or the environment
     3. Non-linear approximators (neural networks)
     4. No prior goal states or trajectories

  6. How to solve reinforcement learning problems?
     Actor-only: learn $\pi : S \to A$ directly. Play with $\pi_k$, observe the return
     $\sum_{t=0}^{\infty} \gamma^t r_t$, and update to $\pi_{k+1}$.

  7. How to solve reinforcement learning problems? (build-up)
     Critic-only: learn $Q : S \times A \to \mathbb{R}$. Play, update $Q_k$ to $Q_{k+1}$ from the
     observed returns, and deduce the policy $\pi_k$ from $Q_k$.

  8. How to solve reinforcement learning problems? (build-up)
     Actor-critic: learn both $\pi : S \to A$ and $V : S \to \mathbb{R}$. Play with $\pi_k$,
     update the critic $V_k \to V_{k+1}$, and use it to update the actor $\pi_k \to \pi_{k+1}$.
     A toy sketch contrasting the three families follows.
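The hypothetical sketch below contrasts the three interfaces: the actor-only scheme acts with $\pi$ directly, the critic-only scheme deduces the greedy action from $Q$, and the actor-critic scheme acts with $\pi$ while a value critic supplies a TD error. All functions and values are illustrative stand-ins, not the talk's implementations.

```python
import numpy as np

# Toy stand-ins for the three function classes.
pi = lambda s: np.tanh(s.sum())          # actor:  pi : S -> A
q  = lambda s, a: -(pi(s) - a) ** 2      # critic: Q : S x A -> R
v  = lambda s: q(s, pi(s))               # critic: V : S -> R

s, s_next, r, gamma = np.array([0.1, -0.2]), np.array([0.0, 0.1]), 1.0, 0.99

# Actor-only: act with pi and improve pi_k -> pi_{k+1} from sampled returns.
a = pi(s)

# Critic-only: deduce the greedy action from Q over a discrete action set.
a_greedy = max([-1.0, 0.0, 1.0], key=lambda act: q(s, act))

# Actor-critic: act with pi, but drive its update with the critic's TD error.
delta = r + gamma * v(s_next) - v(s)
```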

  9. State of the art
     Critic-only: Fitted Q Iteration; Q-Learning, Sarsa
     Actor-only: evolutionary algorithms (CMA-ES, ...); PI²
     Actor-critic: Natural Actor-Critic; CACLA

  10. State of the art: unsatisfied constraints
      Constraints violated: (1) no continuous environments, (2) needs prior models of the
      agent or environment, (3) linear approximators only, (4) needs prior goal states or
      trajectories.
      Fitted Q Iteration: (1)
      Q-Learning, Sarsa: (1)
      Evolutionary algorithms (CMA-ES): poor data efficiency
      PI²: (3), (4)
      Natural Actor-Critic: (3), (4)
      CACLA: poor data efficiency, many meta-parameters

  11. Landscape of algorithms
      [Figure: algorithms positioned along two axes, decisional complexity versus data
      required, with an "ideal algorithm" marked.]

  12. Landscape of algorithms
      [Same figure with NFQ placed on it.]

  13. Neural Fitted Q (NFQ)
      $$ Q_{k+1} = \operatorname*{arg\,min}_{Q \in \mathcal{F}_c} \sum_{t=1}^{N} \Big( Q(s_t, a_t) - \big( r_{t+1} + \gamma \max_{a' \in A} Q_k(s_{t+1}, a') \big) \Big)^2 $$
      $$ \pi^*(s) = \operatorname*{arg\,max}_{a \in A} Q(s, a) $$
      [Figure: feed-forward network with inputs $s_1, s_2, s_3, a_1, a_2$, one hidden
      layer, and output $Q(s, a)$.]
      A sketch of one fitted iteration follows.
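A minimal sketch of one NFQ iteration, with scikit-learn's `MLPRegressor` standing in for the Rprop-trained network of the slide; the function name, batch format, and discrete action set are assumptions for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor  # stand-in for the Rprop-trained net

def nfq_iteration(q_net, batch, actions, gamma=0.99):
    """One fitted iteration. batch: (s, a, r, s_next) tuples; actions: discrete set.
    Assumes q_net has been fitted at least once, so it plays the role of Q_k."""
    X, y = [], []
    for s, a, r, s_next in batch:
        # Frozen Bellman target: r + gamma * max_{a'} Q_k(s', a')
        q_next = max(q_net.predict([np.append(s_next, a2)])[0] for a2 in actions)
        X.append(np.append(s, a))
        y.append(r + gamma * q_next)
    q_net.fit(X, y)  # batch regression of Q_{k+1} on the frozen targets
    return q_net
```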

  14. CACLA
      Temporal-difference error: $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
      Critic update: $V_{k+1}(s_t) = V_k(s_t) + \alpha_v \delta_t$, implemented per parameter as
      $$ \theta^V_{i,k+1} = \theta^V_{i,k} + \alpha_v \, \delta_t \, \frac{\partial V_k(s_t)}{\partial \theta^V_{i,k}} $$
      Actor update:
      $$ \theta_{t+1} = \theta_t + \begin{cases} \alpha_a \,(a_t - u_t)\, \dfrac{\partial u_t(s_t)}{\partial \theta_t}, & \text{if } \delta_t > 0 \\ 0, & \text{otherwise} \end{cases} $$
      A sketch of one CACLA step follows.
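A minimal sketch of one CACLA step, using linear function approximation over features `phi(s)` so both update rules stay visible (the talk uses neural networks trained the same way); all names and values are illustrative.

```python
import numpy as np

def cacla_step(theta_v, theta_u, phi, s, u, a, r, s_next,
               gamma=0.99, alpha_v=0.1, alpha_a=0.1):
    """theta_v: critic weights, theta_u: actor weights, u = pi(s), a = explored action."""
    delta = r + gamma * theta_v @ phi(s_next) - theta_v @ phi(s)  # TD error
    theta_v = theta_v + alpha_v * delta * phi(s)                  # critic: TD(0) step
    if delta > 0:                                                 # actor moves toward the
        theta_u = theta_u + alpha_a * (a - u) * phi(s)            # explored action only
    return theta_v, theta_u, delta                                # if it improved on V

phi = lambda s: np.array([1.0, s, s * s])  # toy feature map
tv, tu, d = cacla_step(np.zeros(3), np.zeros(3), phi,
                       s=0.5, u=0.1, a=0.3, r=1.0, s_next=0.4)
```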

  15. Neural Fitted Actor-Critic
      [Diagram: the agent interacts with the environment with $a \sim \pi$ and stores
      transitions in $D_\pi$ (step 1); then, offline, the actor is updated by Rprop toward
      $\{s_t, a_t\}$ when $\delta > 0$ and toward $\{s_t, u_t\}$ otherwise (step 2a), and the
      critic $V_k \to V_{k+1}$ is updated by Rprop regression on $\{s_t, v_{k,t}\}$ (step 2b);
      the process repeats.]

  16. Neural Fitted Actor-Critic
      Critic (fitted value regression):
      $$ V_{k+1} \leftarrow \operatorname*{arg\,min}_{V \in \mathcal{F}_c} \sum_{s_t \in D_\pi} \Big( V(s_t) - \big( r_{t+1} + \gamma V_k(s_{t+1}) \big) \Big)^2 $$
      Actor (regression toward the kept actions):
      $$ \pi_{k+1} \leftarrow \operatorname*{arg\,min}_{\pi \in \mathcal{F}_a} \sum_{s_t \in D_\pi} \big( \pi(s_t) - \hat{a}_t \big)^2, \qquad \hat{a}_t = \begin{cases} a_t, & \text{if } \delta_t > 0 \\ u_t, & \text{otherwise} \end{cases} $$
      [Figures: a value network with inputs $s_1, s_2, s_3$ and output $V(s)$; a policy
      network with inputs $s_1, s_2, s_3$ and outputs $a_1, a_2$.]
      A sketch of one iteration as two batch regressions follows.
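The sketch below renders one NFAC iteration as the two batch regressions of the slide, again with scikit-learn MLPs standing in for the Rprop-trained networks; the episode format and function names are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor  # stand-ins for the Rprop nets

def nfac_iteration(v_net, pi_net, episode, gamma=0.99):
    """episode: (s, u, a, r, s_next) tuples with u = pi(s), a = explored action.
    Assumes v_net has been fitted at least once, so it plays the role of V_k."""
    S = np.array([s for s, u, a, r, s2 in episode])
    # Critic targets: r_{t+1} + gamma * V_k(s_{t+1})
    v_tgt = np.array([r + gamma * v_net.predict([s2])[0]
                      for s, u, a, r, s2 in episode])
    delta = v_tgt - v_net.predict(S)              # TD errors under V_k
    # Actor targets, CACLA-style: keep a_t only where it improved on V_k
    a_tgt = np.array([a if d > 0 else u
                      for (s, u, a, r, s2), d in zip(episode, delta)])
    v_net.fit(S, v_tgt)    # critic: fitted value regression
    pi_net.fit(S, a_tgt)   # actor: regress toward the kept actions
    return v_net, pi_net
```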

  17. Experimental Results

  18. Landscape of algorithms
      [Same figure with NFQ, NFAC, CACLA, and CMA-ES placed relative to the ideal
      algorithm.]

  19. Landscape of algorithms
      [Same figure, adding DDPG and NAF with a question mark over their placement.]

  20. Methods landscape
      [Same figure, adding NFAC+.]

  21. Toward better data efficiency
      Fitted Actor-Critic:
      $$ Q^\pi_{k+1} = \operatorname*{arg\,min}_{Q \in \mathcal{F}_c} \sum_{t=1}^{N} c(a_t \mid s_t) \Big( Q(s_t, a_t) - \big( r_{t+1} + \gamma Q^\pi_k(s_{t+1}, \pi(s_{t+1})) \big) \Big)^2 $$
      $$ \pi_{k+1} = \operatorname*{arg\,max}_{\pi \in \mathcal{F}_a} \sum_{t=1}^{N} Q_{k+1}\big( s_t, \pi(s_t) \big) $$
      with the truncated importance weight
      $$ c(a_t \mid s_t) = \min\left( 1, \frac{\pi(a_t \mid s_t)}{\pi_0(a_t \mid s_t)} \right) $$
      A sketch of this weight follows.
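To make the truncated importance weight concrete, here is a small sketch with Gaussian policies standing in for $\pi$ and $\pi_0$; the densities and all parameters are illustrative assumptions.

```python
import numpy as np

def truncated_weight(a, mu, mu0, sigma=0.2):
    """c(a|s) = min(1, pi(a|s) / pi_0(a|s)), with illustrative Gaussian policies
    centered at mu (current) and mu0 (behavior)."""
    density = lambda x, m: np.exp(-0.5 * ((x - m) / sigma) ** 2)
    return min(1.0, density(a, mu) / density(a, mu0))

# An action more likely under the current policy than the old one is clipped
# at 1; the reverse case is down-weighted in the fitted critic regression.
print(truncated_weight(a=0.3, mu=0.25, mu0=0.6))
```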

  22. Conclusion & Further Works
      Neural Fitted Actor-Critic:
      - Compare it to DDPG
      - Keep and reuse previously collected data
      - Guided exploration of the sensorimotor space
      - Increase the dimension of states and actions
      - Redefine the reward function for each new sub-goal
