Transfer of Samples in Policy Search via Multiple Importance Sampling
Andrea Tirinzoni, Mattia Salvini, and Marcello Restelli
36th International Conference on Machine Learning, Long Beach, California

1 Motivation
Policy Search (PS): very effective RL technique for continuous control tasks [Heess et al., 2017] [OpenAI, 2018] [Vinyals et al., 2017]
High sample complexity remains a major limitation
Samples available from several sources are discarded:
  Different policies
  Different environments
Transfer of Samples
Tirinzoni et al. ICML 2019

2 Transfer of Samples
Source tasks $M_1, \dots, M_m$: source task $M_j$ provides trajectories $\tau_{i,j} \sim (\pi_{\theta_j}, P_j)$, for $j = 1, \dots, m$
Target task $M$: policy $\pi_\theta$, transition model $P$
Existing works: batch value-based settings [Lazaric et al., 2008, Taylor et al., 2008, Lazaric and Restelli, 2011, Laroche and Barlier, 2017, Tirinzoni et al., 2018]
Extension to online PS algorithms is not trivial

3 Transferring Samples in Policy Search
Goal: Transfer source trajectories to improve the target gradient estimation
Multiple Importance Sampling (MIS) Gradient Estimator:

$$\nabla_\theta^{\mathrm{MIS}} J(\theta) := \frac{1}{n} \sum_{j=1}^{m} \sum_{i=1}^{n_j} \underbrace{w(\tau_{i,j})}_{\text{weights}} \, \underbrace{g_\theta(\tau_{i,j})}_{\text{gradient}}, \qquad w(\tau) := \frac{p(\tau \mid \theta, P)}{\sum_{j=1}^{m} \alpha_j \, p(\tau \mid \theta_j, P_j)}$$

Unbiased and bounded weights
Easily combined with other variance reduction techniques
Effective sample size ≡ Transferable knowledge → Adaptive batch size
Provably robust to negative transfer
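The estimator above can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the trajectory log-densities are assumed to be given, the coefficients $\alpha_j$ are assumed to be the sample proportions $n_j / n$ (the usual balance heuristic, which the slides do not state), and the effective sample size uses the common diagnostic $(\sum_i w_i)^2 / \sum_i w_i^2$, which may differ from the definition used in the paper.

```python
import numpy as np


def mis_weights(logp_target, logp_sources, alphas):
    """MIS weights w(tau) = p(tau|theta,P) / sum_j alpha_j p(tau|theta_j,P_j).

    logp_target:  shape (n,)   log p(tau | theta, P) for each pooled trajectory
    logp_sources: shape (n, m) log p(tau | theta_j, P_j) for each source j
    alphas:       shape (m,)   positive mixture coefficients
                               (assumed here to be n_j / n)
    """
    logp_target = np.asarray(logp_target, dtype=float)
    logp_sources = np.asarray(logp_sources, dtype=float)
    log_terms = logp_sources + np.log(np.asarray(alphas, dtype=float))
    # Log-sum-exp over the m sources, for numerical stability.
    shift = log_terms.max(axis=1, keepdims=True)
    log_mix = shift[:, 0] + np.log(np.exp(log_terms - shift).sum(axis=1))
    return np.exp(logp_target - log_mix)


def mis_gradient(weights, grads):
    """(1/n) * sum_i w_i * g_i, with per-trajectory gradients of shape (n, d)."""
    weights = np.asarray(weights, dtype=float)
    grads = np.asarray(grads, dtype=float)
    return (weights[:, None] * grads).mean(axis=0)


def effective_sample_size(weights):
    """Common ESS diagnostic (sum w)^2 / sum w^2 (an assumption, see lead-in)."""
    weights = np.asarray(weights, dtype=float)
    return weights.sum() ** 2 / (weights ** 2).sum()


# Sanity check: if every source coincides with the target, all weights are 1,
# the estimate reduces to the plain sample mean, and the ESS equals n.
logp = np.array([-1.0, -2.0, -0.5, -3.0])
w = mis_weights(logp, np.stack([logp, logp], axis=1), [0.5, 0.5])
g = np.arange(8.0).reshape(4, 2)  # placeholder per-trajectory gradients
grad_estimate = mis_gradient(w, g)
ess = effective_sample_size(w)
```

A low effective sample size signals that the source trajectories carry little information about the target, which is the quantity the slide ties to the adaptive batch size.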

4 Estimating the Transition Models
Problem: $P$ unknown → Importance weights cannot be computed
Solution: Online minimization of an upper bound on the expected MSE of $\nabla_\theta^{\mathrm{MIS}} J(\theta)$
Obtain principled estimates even without target samples
Can be efficiently optimized for:
  Discrete set of models
  Reproducing Kernel Hilbert Spaces (RKHS) → Closed-form solution
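To see why an unknown $P$ blocks the computation of the weights, it helps to expand the trajectory density. The slides do not spell out this factorization; the sketch below assumes the standard Markov form, with $\mu$ a hypothetical initial-state distribution shared by all tasks and $T$ the trajectory length:

```latex
p(\tau \mid \theta, P) = \mu(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)

w(\tau) = \frac{\prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)}
               {\sum_{j=1}^{m} \alpha_j \prod_{t=0}^{T-1} \pi_{\theta_j}(a_t \mid s_t)\, P_j(s_{t+1} \mid s_t, a_t)}
```

Under these assumptions, the transition terms cancel only when $P_j = P$ for every $j$, leaving a ratio of policy probabilities. When the environments differ, the numerator depends on the target model $P$, so it has to be known or estimated.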

5 Empirical Results
[Figure: expected return vs. episodes on two tasks, Cartpole and Minigolf; curves: No Transfer, Sample reuse, Known Models, Unknown models]
Good performance with both known and unknown models
Very effective sample reuse from different policies but same environment

6 Thank you!
andrea.tirinzoni@polimi.it
https://github.com/AndreaTirinzoni/
Meet us at poster #118 @ Pacific Ballroom

7 References
Hammersley, J. and Handscomb, D. (1964). Monte Carlo Methods. Methuen's monographs on applied probability and statistics. Methuen.
Heess, N., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., Erez, T., Wang, Z., Eslami, A., Riedmiller, M., et al. (2017). Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286.
Laroche, R. and Barlier, M. (2017). Transfer reinforcement learning with shared dynamics. In AAAI.
Lazaric, A. and Restelli, M. (2011). Transfer from multiple MDPs. In Advances in Neural Information Processing Systems.
Lazaric, A., Restelli, M., and Bonarini, A. (2008). Transfer of samples in batch reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning.

8 References (cont.)
OpenAI (2018). Learning dexterous in-hand manipulation. CoRR, abs/1808.00177.
Precup, D. (2000). Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80.
Taylor, M. E., Jong, N. K., and Stone, P. (2008). Transferring instances for model-based reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 488–505. Springer.
Tirinzoni, A., Sessa, A., Pirotta, M., and Restelli, M. (2018). Importance weighted transfer of samples in reinforcement learning. In International Conference on Machine Learning, pages 4943–4952.

9 References (cont.)
Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J., et al. (2017). StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782.
