 
Recurrent Predictive State Policy (RPSP) Networks
ICML 2018, Stockholm, Sweden, July 12, 2018
Zita Marinho (zmarinho@cmu.edu)
Co-authors: Ahmed Hefny, CMU (equal contribution); Wen Sun, CMU; Siddhartha S. Srinivasa, UW/CMU; Geoffrey J. Gordon, CMU/Microsoft
Affiliations: Instituto de Telecomunicações; Instituto de Sistemas e Robótica, Instituto Superior Técnico; Robotics Institute, Carnegie Mellon University
Policy learning and model learning
[Figure: a policy $\pi$ maps partial observations $o_1, o_2, \dots, o_t$ (robot joint angles) to actions $a_1, a_2, \dots, a_t$ (robot joint torques).]
Recurrent Predictive State Policy nets | zmarinho@cmu.edu | ICML 2018 - poster #200
Recurrent Predictive State Policy Nets
[Figure: RPSP network. Labels: observations, states (predictive states), sampled actions, prediction loss $\ell_{\text{pred}}$.]
Predictive State Representations
History: $h_t = [o_{t-n:t}, a_{t-n:t}]$
Predictive state: $q_t \to \mathbb{E}[o_{t:t+k} \mid h_t; a_{t:t+k}]$
• $q_t$ is a sufficient statistic of conditional future observations
• Boots et al. 2009; TPSRs, Rosencrantz et al. 2004; Littman et al. 2001; Jaeger et al. 1998
[Figure: timeline of observations $o_1, \dots, o_{t+k+1}$ and actions $a_1, \dots, a_{t+k+1}$, split at time $t$ into history and future.]
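The history and future windows on this slide can be made concrete with a small sketch. The function name `history_future_windows` and the exact indexing convention (history covers the `n` steps before `t`, future covers the `k` steps starting at `t`) are assumptions made for illustration; the slide does not state whether its ranges are inclusive.

```python
import numpy as np

def history_future_windows(obs, act, n, k):
    """Split one trajectory into (history, future obs, future actions) triples.

    obs: array (T, d_o); act: array (T, d_a).
    Assumed convention: the history at time t stacks steps t-n .. t-1,
    and the future holds steps t .. t+k-1.
    """
    obs = np.asarray(obs, dtype=float)
    act = np.asarray(act, dtype=float)
    T = obs.shape[0]
    histories, futures, future_actions = [], [], []
    for t in range(n, T - k + 1):
        # h_t = [o_{t-n:t}, a_{t-n:t}] flattened into one feature vector
        h_t = np.concatenate([obs[t - n:t].ravel(), act[t - n:t].ravel()])
        histories.append(h_t)
        futures.append(obs[t:t + k].ravel())
        future_actions.append(act[t:t + k].ravel())
    return np.array(histories), np.array(futures), np.array(future_actions)
```

Averaging the future windows over samples that share similar histories and future actions gives an empirical counterpart of the conditional expectation that defines $q_t$.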
Predictive State Representations: Prediction
Prediction applies $W_{\text{pred}}$ to $q_t$: a linear transformation in feature space (RKHS).
[Figure: timeline of observations $o_1, \dots, o_{t+k+1}$ and actions $a_1, \dots, a_{t+k+1}$, split at time $t$ into history and future.]
Predictive State Representations: Filtering
PSR filter state update: $q_{t+1} = f_{\text{cond}}(W_{\text{ext}}\, q_t, a_t, o_t)$
In an RKHS this is kernel Bayes' rule (Fukumizu et al. 2013).
[Figure: timeline of observations and actions; $W_{\text{ext}}$ extends $q_t$, which is then conditioned on $a_t$ and $o_t$ to give $q_{t+1}$.]
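To make the update $q_{t+1} = f_{\text{cond}}(W_{\text{ext}}\, q_t, a_t, o_t)$ concrete, here is a minimal sketch in a finite, discrete setting, where conditioning reduces to ordinary Bayes' rule. This is not the kernel Bayes' rule used in the RKHS case; the toy model and the helper `make_toy_w_ext` are assumptions chosen only so the example runs.

```python
import numpy as np

def make_toy_w_ext(T, O):
    """Build a toy extension operator from a small discrete model (hypothetical helper).

    T[a][s_next, s] = P(s_next | s, a); O[o, s_next] = P(o | s_next).
    W_ext[a, o] = diag(O[o]) @ T[a], so W_ext @ q is the joint over (o, s_next).
    """
    T = np.asarray(T, dtype=float)
    O = np.asarray(O, dtype=float)
    # Broadcasting: (n_act, 1, d, d) * (1, n_obs, d, 1) -> (n_act, n_obs, d, d)
    return T[:, None, :, :] * O[None, :, :, None]

def f_cond(extended, a, o):
    """Condition the extended state on the action taken and the observation seen."""
    joint = extended[a, o]        # unnormalized next state, shape (d,)
    return joint / joint.sum()    # Bayes' rule: divide by P(o | history, a)

def psr_filter(W_ext, q0, actions, observations):
    """Run the recursive state update along one trajectory."""
    q = np.asarray(q0, dtype=float)
    states = [q]
    for a, o in zip(actions, observations):
        extended = W_ext @ q      # shape (n_act, n_obs, d)
        q = f_cond(extended, a, o)
        states.append(q)
    return np.stack(states)
```

In this toy case the state is a belief vector over latent states, so each filtered state stays on the probability simplex.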
Predictive State Representations
How do we learn a PSR, i.e. $W_{\text{pred}}$, $W_{\text{ext}}$ and $q_0$? (Boots et al. 2011; Hefny et al. 2015; Sun et al. 2015)
• no reward signal needed
• reduction to supervised learning
[Figure: timeline of observations $o_1, \dots, o_{t+k+1}$ and actions $a_1, \dots, a_{t+k+1}$ around the state $q_t$.]
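The phrase "reduction to supervised learning" can be illustrated with the basic building block such methods rely on: a regularized linear regression, for example from history features to future-observation features. The sketch below is a plain ridge regression, assumed here for illustration; the cited learning procedures are more involved, and the function name and the regularizer `lam` are not from the slides.

```python
import numpy as np

def ridge_regression(X, Y, lam=1e-3):
    """Return W minimizing ||X W^T - Y||^2 + lam * ||W||^2.

    X: (n_samples, d_x) input features (e.g. history features).
    Y: (n_samples, d_y) targets (e.g. future-observation features).
    The returned W has shape (d_y, d_x); predictions are X @ W.T.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    d = X.shape[1]
    # Normal equations: (X^T X + lam I) W^T = X^T Y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y).T
```

No reward appears anywhere in this fit, which is why the model can be initialized from logged observations and actions alone.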
Recurrent Predictive State Policy Nets
Why use PSRs as a filter?
• Consistent initialization: predictive state + method of moments
• Non-linear dynamics: kernel-based representation
• Scalable learning algorithm: random projections
• Robustness and sample efficiency: local refinement by BPTT
[Figure: PSR filter mapping observations to PSR states.]
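One common way to combine a kernel-based representation with random projections is random Fourier features, which approximate an RBF kernel with a finite feature map. Whether this is the exact scheme behind the slide's bullets is an assumption; the sketch below only shows the general technique, and the function name and parameters are illustrative.

```python
import numpy as np

def random_fourier_features(X, n_features, bandwidth, rng):
    """Map rows of X to features whose inner products approximate the RBF kernel
    k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    # Frequencies drawn from the kernel's spectral density (a Gaussian)
    W = rng.normal(scale=1.0 / bandwidth, size=(d, n_features))
    # Random phases make the cosine features unbiased for the kernel
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
```

With such features, operators like $W_{\text{pred}}$ and $W_{\text{ext}}$ become finite matrices, so learning cost scales with the number of features rather than the number of training samples.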
Recurrent Predictive State Policy Nets
[Figure: RPSP network $\pi_\theta$. A PSR filter with parameters $\theta_{\text{PSR}}$ maps observations to PSR states; a reactive policy with parameters $\theta_{\text{re}}$ maps PSR states to a distribution from which actions are sampled.]
Z. Marinho, A. Hefny, W. Sun, G. Gordon, S. Srinivasa. ICML 2018 (under review)
RPSP Initialization
PSR initialization with the method of moments:
• efficient and consistent (Boots et al. 2011; Hefny et al. 2015)
• does not require interaction (reward signal)
• differentiable, so it can be trained end-to-end
Cited: Downey et al. 2017
[Figure: RPSP network $\pi_\theta$ with PSR filter ($\theta_{\text{PSR}}$), PSR states, reactive policy ($\theta_{\text{re}}$), observations and sampled actions.]
RPSP Optimization
• Cumulative reward (actions $a_t$, rewards $r_t$): accomplish the task
• Prediction error $\ell_{\text{pred}}$ (predicted $\hat{o}_t$ vs. observed $o_t$): keep the model accurate
[Figure: PSR states $q_t$ connect observations to actions and rewards.]
Algorithm
1. Initialize PSR
[Figure: RPSP network with the label "initialize" next to the parameters $\theta_{\text{re}}$ and $\theta_{\text{PSR}}$; actions are sampled from the policy.]
Algorithm
1. Initialize PSR
2. Optimize on a batch of trajectories
[Figure: trajectories of observations $o_1, \dots, o_{t+k}$, actions $a_1, \dots, a_{t+k}$ and rewards $r_1, \dots, r_{t+k}$ feed the prediction loss $\ell_{\text{pred}}$ and the policy objective $J(\pi_\theta)$.]
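The two quantities on this slide, the policy objective $J(\pi_\theta)$ and the prediction loss $\ell_{\text{pred}}$, can be combined into a single scalar to minimize over a batch. The weighted-sum form and the weights `alpha_reward` and `alpha_pred` below are assumptions for illustration; the slides do not specify how the two terms are balanced.

```python
import numpy as np

def combined_loss(rewards, predicted_obs, observed_obs,
                  alpha_reward=1.0, alpha_pred=1.0):
    """Weighted sum of negative cumulative reward and squared prediction error.

    rewards: (N, T) rewards for N trajectories of length T.
    predicted_obs, observed_obs: (N, T, d_o).
    alpha_reward, alpha_pred: hypothetical trade-off weights.
    """
    rewards = np.asarray(rewards, dtype=float)
    err = np.asarray(predicted_obs, dtype=float) - np.asarray(observed_obs, dtype=float)
    # Empirical J: cumulative reward per trajectory, averaged over the batch
    J = rewards.sum(axis=1).mean()
    # Empirical l_pred: squared error per time step, averaged over batch and time
    l_pred = (err ** 2).sum(axis=2).mean()
    return -alpha_reward * J + alpha_pred * l_pred
```

Setting `alpha_pred=0` recovers a pure reward objective, which makes it easy to check how much the prediction term contributes.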
RPSP Optimization
Learning via policy gradient, with two schemes: alternating optimization and joint optimization.
REINFORCE ("vanilla" policy gradient; Williams 1992):
- higher variance
+ faster, simpler
Natural gradient (Schulman et al. 2015):
- requires Hessian-vector multiplications
+ smoother policy changes
• direct policy estimation
• applicable to any robust gradient optimizer
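As a reminder of what the REINFORCE estimator computes, here is a minimal sketch for a linear-Gaussian policy with scalar actions. The policy form, the per-time-step baseline and all names are assumptions for illustration; in an RPSP network the policy inputs would be the PSR states $q_t$ rather than raw observations.

```python
import numpy as np

def reinforce_gradient(states, actions, returns, theta, sigma=1.0, baseline=None):
    """Score-function gradient estimate for the policy a ~ N(theta . s, sigma^2).

    states: (N, T, d); actions: (N, T); returns: (N, T) reward-to-go.
    baseline: scalar or (T,) array; defaults to the per-time-step mean return.
    """
    states = np.asarray(states, dtype=float)
    actions = np.asarray(actions, dtype=float)
    returns = np.asarray(returns, dtype=float)
    theta = np.asarray(theta, dtype=float)
    if baseline is None:
        baseline = returns.mean(axis=0)
    mean = states @ theta                                          # (N, T)
    # grad_theta log N(a; theta . s, sigma^2) = (a - theta . s) / sigma^2 * s
    score = ((actions - mean) / sigma ** 2)[..., None] * states    # (N, T, d)
    advantage = (returns - baseline)[..., None]                    # (N, T, 1)
    # Sum over time, average over trajectories
    return (score * advantage).sum(axis=1).mean(axis=0)
```

Subtracting a baseline does not bias the estimate but reduces the variance noted on the slide; the result can be passed to any gradient-based optimizer as an ascent direction.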
Experiments
OpenAI Gym MuJoCo environments:
• partial observations (joint positions only, no velocities)
• continuous observations
• continuous actions
Environments: Swimmer, CartPole, Walker2d, Hopper
Panel labels: 3 joints, 2 joints, 8 joints, 5 joints; 6 DoFs, 3 DoFs, 2 DoFs, 1 DoF
Experiments
[Figure: cross-environment performance.]
Conclusions
• combine a PSR filter with a reactive network for partially observable environments
• make use of consistent initialization methods for the filter
• make use of the prediction loss to improve the policy
• end-to-end policy learning algorithm
Thank you!
Questions? Come see us at poster #200.
zmarinho@cmu.edu
This research was supported by the Portuguese Foundation of Science and Technology under grant SFRH/BD/52015/2012.