Recurrent Predictive State Policy (RPSP) Networks, ICML 2018



SLIDE 1

Recurrent Predictive State Policy (RPSP) Networks

Zita Marinho
zmarinho@cmu.edu

Robotics Institute, Carnegie Mellon University
Instituto de Telecomunicações, Instituto de Sistemas e Robótica, Instituto Superior Técnico

Co-authors: Ahmed Hefny, CMU (equal contribution); Wen Sun, CMU; Siddhartha S. Srinivasa, UW/CMU; Geoffrey J. Gordon, CMU/Microsoft

ICML 2018, Stockholm, Sweden
July 12, 2018

SLIDE 2

zmarinho@cmu.edu | ICML 2018 - poster #200 |

Recurrent Predictive State Policy nets

Policy learning and model learning

Partial observations (robot joint angles): o_1, o_2, …, o_t, …
Actions (robot joint torques): a_1, a_2, …, a_t, …
Policy: π

SLIDE 3

Recurrent Predictive State Policy Nets

[Figure: network diagram. Labels: actions, observations, states, predictive states, sample, ℓ_pred.]

SLIDE 4

Predictive State Representations

[Figure: timeline of actions a_1, a_2, …, a_{t-1}, a_t, …, a_{t+k}, a_{t+k+1} and observations o_1, o_2, …, o_{t-1}, o_t, …, o_{t+k}, o_{t+k+1}, split at time t into history and future.]

History: h_t = [o_{t-n:t}, a_{t-n:t}]

Predictive state: q_t → E[o_{t:t+k} | h_t; a_{t:t+k}]

The predictive state is a sufficient statistic of conditional future observations.

Boots et al. 2009; TPSRs, Rosencrantz et al. 2004; Littman et al. 2001; Jaeger et al. 1998
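
The history and future windows in the definition above can be made concrete with a small sketch. It is only an illustration: the function name `windows`, the half-open indexing convention, the window lengths `n` and `k`, and the toy trajectory are assumptions, not taken from the slides.

```python
def windows(obs, acts, t, n, k):
    """Return (history, future_obs, future_acts) around time index t.

    history     = (obs[t-n:t], acts[t-n:t])  : what has been seen so far
    future_obs  = obs[t:t+k+1]               : what the state q_t should predict
    future_acts = acts[t:t+k+1]              : actions the prediction is conditioned on
    """
    if t - n < 0 or t + k + 1 > len(obs):
        raise ValueError("window does not fit inside the trajectory")
    history = (obs[t - n:t], acts[t - n:t])
    return history, obs[t:t + k + 1], acts[t:t + k + 1]


# Toy scalar trajectory (hypothetical data).
obs = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
acts = [1, 0, 1, 1, 0, 1]
history, fut_o, fut_a = windows(obs, acts, t=3, n=2, k=1)
```

Sliding `t` along a trajectory yields the (history, future) pairs from which a predictive state could be estimated.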

SLIDE 5

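Predictive State Representations

[Figure: timeline of actions a_1, …, a_{t+k+1} and observations o_1, …, o_{t+k+1}, split into history and future, with the predictive state q_t between them.]

Prediction: W_pred, a linear transformation in feature space (RKHS), applied to the predictive state q_t.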

SLIDE 6

Predictive State Representations

[Figure: timeline of actions a_1, …, a_{t+k+1} and observations o_1, …, o_{t+k+1}, with the state update q_t → q_{t+1}.]

Filtering (state update) with the PSR filter:

q_{t+1} = f_cond(W_ext q_t, a_t, o_t)

In an RKHS, the conditioning step f_cond is kernel Bayes' rule (Fukumizu et al. 2013).
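
As a rough illustration of the update q_{t+1} = f_cond(W_ext q_t, a_t, o_t), the sketch below covers only the simplest finite, discrete case, where conditioning reduces to "select the slice for (a_t, o_t), then normalize". It is not the kernel Bayes' rule version from the slide. The function name `psr_filter_step`, the tensor layout of `W_ext`, and the toy two-state HMM (whose belief vector stands in for the state) are all assumptions made for this example.

```python
import numpy as np


def psr_filter_step(q, W_ext, a, o):
    """One filter update q_t -> q_{t+1} for a finite, discrete system.

    W_ext has shape (num_actions, num_obs, d, d). W_ext[a, o] @ q is the part
    of the extended state consistent with taking action a and observing o.
    Dividing by its total mass is the conditioning step (Bayes' rule).
    """
    unnormalized = W_ext[a, o] @ q
    mass = unnormalized.sum()
    if mass <= 0:
        raise ValueError("observation has zero probability under the current state")
    return unnormalized / mass


# Toy two-state HMM with a single action (made-up numbers).
T = np.array([[0.9, 0.2], [0.1, 0.8]])  # T[i, j] = P(next state i | state j)
O = np.array([[0.8, 0.3], [0.2, 0.7]])  # O[o, s] = P(observation o | state s)
W_ext = np.stack([T @ np.diag(O[o]) for o in range(2)])[None]  # shape (1, 2, 2, 2)
q0 = np.array([0.5, 0.5])
q1 = psr_filter_step(q0, W_ext, a=0, o=0)
```

Because every step is a matrix product followed by a division, the update is differentiable in `W_ext`, which is what makes refinement by backpropagation possible in principle.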

SLIDE 7

Predictive State Representations

[Figure: timeline of actions a_1, …, a_{t+k+1} and observations o_1, …, o_{t+k+1}, with the predictive state q_t.]

How do we learn a PSR (W_pred, W_ext, q_0)?

By reduction to supervised learning (Boots et al. 2011, Hefny et al. 2015, Sun et al. 2015).

No reward signal is needed.
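
To give a feel for what "reduction to supervised learning" can look like, here is a minimal sketch that fits a linear map from history features to future features with ridge regression. It is a single regression on synthetic data, not the full procedure of the cited papers; the names `ridge_fit`, `H`, `F`, and the regularizer `lam` are assumptions for illustration.

```python
import numpy as np


def ridge_fit(X, Y, lam=1e-3):
    """Closed-form solution of min_W ||X W - Y||^2 + lam * ||W||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)


rng = np.random.default_rng(0)
H = rng.normal(size=(500, 4))                        # hypothetical history features
W_true = rng.normal(size=(4, 3))                     # ground-truth map (synthetic)
F = H @ W_true + 0.01 * rng.normal(size=(500, 3))    # noisy future features
W_hat = ridge_fit(H, F)
```

Only observations and actions enter the regression, which is consistent with the slide's point that no reward signal is needed at this stage.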

SLIDE 8

Recurrent Predictive State Policy Nets

[Figure: PSR filter mapping observations to PSR states.]

Why use PSRs as filter?

  • Consistent initialization
  • Non-linear dynamics
  • Scalable learning algorithm
  • Robustness and sample efficiency
  • Predictive State + Method of moments
  • Kernel-based representation
  • Random projections
  • Local refinement by BPTT
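
One common way to combine a kernel-based representation with random projections is random Fourier features, which approximate a Gaussian kernel with a finite feature map. Whether the slides mean exactly this construction is an assumption; the function name `rff`, the bandwidth, and the feature count below are illustrative choices.

```python
import numpy as np


def rff(X, num_features, bandwidth, rng):
    """Random Fourier features z(x) with z(x) . z(y) ~ exp(-||x - y||^2 / (2 * bandwidth^2))."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / bandwidth, size=(d, num_features))  # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)           # random phases
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Z = rff(X, num_features=5000, bandwidth=1.0, rng=rng)

approx = Z @ Z.T                                            # feature-space inner products
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
exact = np.exp(-sq_dists / 2.0)                             # Gaussian kernel, bandwidth 1
```

With features like these, linear operations on `Z` stand in for operations in the RKHS, which keeps the learning problem finite-dimensional.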

SLIDE 9

Recurrent Predictive State Policy Nets

[Figure: RPSP network. A PSR filter (parameters θ_PSR) maps actions and observations to PSR states; a reactive policy (parameters θ_re) samples the action. Together they define the policy π_θ.]

Z. Marinho, A. Hefny, W. Sun, G. Gordon, S. Srinivasa. ICML 2018 (under review)

SLIDE 10

RPSP Initialization

[Figure: RPSP network. A PSR filter (parameters θ_PSR) maps actions and observations to PSR states; a reactive policy (parameters θ_re) samples the action. Together they define the policy π_θ.]

PSR initialization with Method of Moments

  • efficient and consistent
  • does not require interaction (reward signal)
  • differentiable, so it can be trained end-to-end

Downey et al. 2017, Boots et al. 2011, Hefny et al. 2015

SLIDE 11

RPSP Optimization

[Figure: RPSP network with labels: actions a_t, observations o_t, predicted observations ô_t, PSR states q_t, reward r_t.]

Two objectives:

  • Cumulative reward: accomplish the task
  • Prediction error ℓ_pred: keep the model accurate
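
The two objectives can be combined into a single scalar loss. The sketch below is one plausible way to do that, under stated assumptions: discounting with `gamma`, a mean-squared prediction error on scalar observations, and the weights `alpha_reward` and `alpha_pred` are all illustrative choices that do not appear on the slide.

```python
def discounted_return(rewards, gamma):
    """Sum over t of gamma^t * r_t for one trajectory."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def prediction_loss(predicted, observed):
    """Mean squared error between predicted and actual scalar observations."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def combined_loss(rewards, predicted, observed,
                  gamma=0.99, alpha_reward=1.0, alpha_pred=1.0):
    """Loss to minimize: negative return plus weighted prediction error."""
    return (-alpha_reward * discounted_return(rewards, gamma)
            + alpha_pred * prediction_loss(predicted, observed))


# Toy trajectory (made-up numbers).
loss = combined_loss(rewards=[1.0, 1.0, 1.0],
                     predicted=[0.0, 1.0], observed=[0.0, 2.0],
                     gamma=0.5)
```

Raising `alpha_pred` puts more pressure on keeping the model accurate; raising `alpha_reward` favors task performance.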

SLIDE 12

Algorithm

1. Initialize PSR

[Figure: RPSP network with reactive-policy parameters θ_re and PSR parameters θ_PSR; the PSR part is marked "initialize".]

SLIDE 13

Algorithm

1. Initialize PSR
2. Optimize on a batch of trajectories

[Figure: trajectory of actions a_1, …, a_{t+k}, observations o_1, …, o_{t+k} and rewards r_1, …, r_{t+k}, with the prediction loss ℓ_pred and the return J(π_θ).]

SLIDE 14

RPSP Optimization

Learning via policy gradient: joint optimization, REINFORCE, alternate optimization

“Vanilla” Policy Gradient (Williams 1992)
  + faster, simpler
  - higher variance

Natural Gradient (Schulman et al. 2015)
  + smoother policy changes
  - requires Hessian-vector multiplication

  • direct policy estimation
  • applicable to any robust gradient optimizer
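
For readers unfamiliar with the "vanilla" policy gradient, the following is a minimal REINFORCE-style sketch for a linear Gaussian policy. It assumes one-step episodes (so the return equals the reward), uses no baseline, and is unrelated to the RPSP architecture itself; `reinforce_gradient`, the toy reward, and all numbers are hypothetical.

```python
import numpy as np


def reinforce_gradient(states, actions, returns, theta, sigma):
    """Score-function estimate of grad_theta J for a ~ N(theta . s, sigma^2).

    Uses grad_theta log pi(a | s) = (a - theta . s) / sigma^2 * s,
    weighted by the return of each sample and averaged.
    """
    grad = np.zeros_like(theta)
    for s, a, G in zip(states, actions, returns):
        grad += G * (a - theta @ s) / sigma**2 * s
    return grad / len(states)


rng = np.random.default_rng(0)
theta = np.zeros(1)
sigma = 1.0
states = np.ones((5000, 1))                       # constant one-dimensional state
actions = states @ theta + sigma * rng.normal(size=5000)
returns = -(actions - 2.0) ** 2                   # toy reward, best action is 2
grad = reinforce_gradient(states, actions, returns, theta, sigma)
```

The estimate points toward larger `theta`, that is, toward the better action. Its sample-to-sample spread is large, which is the "higher variance" drawback listed above.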

SLIDE 15

Experiments

OpenAI Gym MuJoCo environments

  • partial observations (joints only, no velocities)
  • continuous observations
  • continuous actions

Walker2d: 8 joints, 6 DoFs
Hopper: 5 joints, 3 DoFs
Swimmer: 3 joints, 2 DoFs
CartPole: 2 joints, 1 DoF
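
The partial-observation setting (joints only, no velocities) can be mimicked by dropping the velocity entries of a full state vector. The sketch below assumes, hypothetically, that positions come first and velocities last; the actual layout depends on the environment, and `drop_velocities` is not a Gym function.

```python
def drop_velocities(full_state, num_positions):
    """Return only the joint-position part of a state vector.

    Assumes (hypothetically) that the first num_positions entries are joint
    positions and the remaining entries are velocities.
    """
    if not 0 <= num_positions <= len(full_state):
        raise ValueError("num_positions must lie within the state vector")
    return list(full_state[:num_positions])


# Toy state: 2 positions followed by 2 velocities (made-up numbers).
full_state = [0.1, -0.3, 1.5, 0.0]
partial_obs = drop_velocities(full_state, num_positions=2)
```

A policy that sees only `partial_obs` must infer velocities from the history of observations, which is the role of the filter.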

SLIDE 16

Experiments

Cross-environment performance

SLIDE 17

Conclusions

  • combine a PSR filter with a reactive network for partially observable environments

  • make use of consistent initialization methods for the filter
  • make use of prediction loss to improve policy
  • end-to-end policy learning algorithm

SLIDE 18

zmarinho@cmu.edu

This research was supported by the Portuguese Foundation of Science and Technology under grant SFRH/BD/52015/2012.

Questions?

Thank you!

Come see us at poster #200