SLIDE 1

Optimistic Policy Optimization via Multiple Importance Sampling

Matteo Papini, Alberto Maria Metelli, Lorenzo Lupo, Marcello Restelli

11th June 2019, Thirty-sixth International Conference on Machine Learning (ICML), Long Beach, CA, USA

SLIDE 2

Policy Optimization

• Parameter space $\Theta \subseteq \mathbb{R}^d$
• A parametric policy for each $\theta \in \Theta$
• Each inducing a distribution $p_\theta$ over trajectories
• A return $R(\tau)$ for every trajectory $\tau$
• Goal: $\max_{\theta \in \Theta} J(\theta) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau)]$
• Iterative optimization (e.g., gradient ascent)

[Diagram: a parameter $\theta \in \Theta$ induces a distribution $p_\theta$ over the trajectory space $\mathcal{T}$; a sampled trajectory $\tau$ yields a return $R(\tau)$, and $\theta$ is scored by its expected return $J(\theta)$.]
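To make the loop concrete, here is a minimal sketch of one gradient-ascent step, assuming parameter-based exploration with a Gaussian distribution over $\theta$ (in the spirit of PGPE [7]). `sample_return` is a hypothetical stand-in for an environment rollout, not part of the paper's code.

```python
import numpy as np

def sample_return(theta: np.ndarray) -> float:
    """Roll out one trajectory tau ~ p_theta and return R(tau).
    Hypothetical placeholder for an environment-specific rollout."""
    raise NotImplementedError

def pgpe_step(mu: np.ndarray, sigma: float,
              n_samples: int = 20, lr: float = 0.1) -> np.ndarray:
    """One gradient-ascent step on J: sample parameters from a Gaussian
    N(mu, sigma^2 I) and follow a score-function gradient estimate."""
    thetas = mu + sigma * np.random.randn(n_samples, mu.size)
    returns = np.array([sample_return(th) for th in thetas])
    # grad_mu log N(theta; mu, sigma^2 I) = (theta - mu) / sigma^2
    grad = np.mean((thetas - mu) / sigma**2 * returns[:, None], axis=0)
    return mu + lr * grad
```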


SLIDE 8

Exploration in Policy Optimization

• Continuous decision process $\Rightarrow$ exploration is difficult
• Policy gradient methods tend to be greedy (e.g., TRPO [6], PGPE [7])
• Existing exploration is mainly undirected (e.g., entropy bonus [2])
• Lack of theoretical guarantees

If only this were a Correlated Multi-Armed Bandit...


SLIDE 14

Policy Optimization as a Correlated MAB

• Arms: parameters $\theta$
• Payoff: expected return $J(\theta)$
• Continuous MAB [3]: we need structure
• Arm correlation [5] through trajectory distributions
• Importance Sampling (IS)

[Diagram: two arms $\theta_A, \theta_B \in \Theta$ induce trajectory distributions $p_{\theta_A}$ and $p_{\theta_B}$ with payoffs $J(\theta_A)$ and $J(\theta_B)$; IS transfers samples between the two distributions.]
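A sketch of how IS lets one arm's samples inform another: assuming, for illustration only, that arms are Gaussian distributions over parameters with known densities, payoffs observed under a behavioral arm are reweighted by the density ratio. The names and the Gaussian form are assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def is_estimate(target_mean, behav_mean, cov, xs, returns):
    """Importance-sampling estimate of J at `target_mean` from samples
    `xs` (and their payoffs) drawn under `behav_mean`: each payoff is
    reweighted by the density ratio p_target(x) / p_behav(x)."""
    w = (multivariate_normal.pdf(xs, mean=target_mean, cov=cov)
         / multivariate_normal.pdf(xs, mean=behav_mean, cov=cov))
    return float(np.mean(w * returns))
```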


SLIDE 19

OPTIMIST

A UCB-like index [4]:

$$B_t(\theta) = \underbrace{\hat{J}_t(\theta)}_{\text{ESTIMATE}} + \underbrace{C\sqrt{\frac{d_2(p_\theta \,\|\, \Phi_t)\,\log\frac{1}{\delta_t}}{t}}}_{\text{EXPLORATION BONUS}}$$

• ESTIMATE: $\hat{J}_t(\theta)$ is a truncated multiple importance sampling (MIS) estimator [8, 1], reusing trajectories from all past distributions $p_{\theta_1}, p_{\theta_2}, \dots, p_{\theta_{t-1}}$
• EXPLORATION BONUS: a distributional distance $d_2$ of $p_\theta$ from the mixture $\Phi_t$ of previous solutions

Select $\theta_t = \arg\max_{\theta \in \Theta} B_t(\theta)$

[Diagram: MIS combines trajectories drawn from $p_{\theta_1}, p_{\theta_2}, \dots, p_{\theta_{t-1}}$ to evaluate a candidate $p_\theta$.]
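A rough sketch of how this index could be computed. The `density` and `renyi_d2` callables and the truncation threshold `M_t = sqrt(t)` are illustrative placeholders, not the paper's exact estimator or constants.

```python
import numpy as np

def mis_weight(x, theta, past_thetas, density):
    """Balance-heuristic MIS weight of sample x for target p_theta,
    against the uniform mixture of all past sampling distributions."""
    mixture = np.mean([density(x, th) for th in past_thetas])
    return density(x, theta) / mixture

def optimist_index(theta, samples, returns, past_thetas, density,
                   renyi_d2, t, delta_t, C=1.0):
    """B_t(theta) = truncated-MIS estimate + exploration bonus."""
    w = np.array([mis_weight(x, theta, past_thetas, density)
                  for x in samples])
    M_t = np.sqrt(t)  # illustrative truncation schedule (clips heavy tails)
    j_hat = np.mean(np.minimum(w, M_t) * returns)
    bonus = C * np.sqrt(renyi_d2(theta, past_thetas)
                        * np.log(1.0 / delta_t) / t)
    return j_hat + bonus
```

Even in this simplified form, the argmax over $\theta$ is nonconvex, which is what motivates the discretization caveat in the empirical section below.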


SLIDE 22

Sublinear Regret

$$\text{Regret}(T) = \sum_{t=0}^{T} \big( J(\theta^*) - J(\theta_t) \big)$$

Compact, $d$-dimensional parameter space $\Theta$. Under mild assumptions on the policy class, with high probability:

$$\text{Regret}(T) = \widetilde{O}\big(\sqrt{dT}\big)$$
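The regret definition above translates directly to code; this small sketch assumes the optimal value $J(\theta^*)$ is known, as in synthetic benchmarks.

```python
import numpy as np

def cumulative_regret(j_star: float, j_history) -> float:
    """Regret(T): sum of per-iteration gaps J(theta*) - J(theta_t),
    given the sequence of expected returns of the selected arms."""
    return float(np.sum(j_star - np.asarray(j_history)))
```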


SLIDE 25

Empirical Results

River Swim

[Plot: cumulative return vs. episodes (1,000 to 5,000) on the River Swim domain, comparing OPTIMIST and PGPE.]

Caveats
• Easy implementation only for parameter-based exploration [7]
• Difficult optimization $\Rightarrow$ discretization ...
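To illustrate the discretization caveat, here is a hypothetical grid search over $\Theta$; `index_fn` stands for any callable evaluating $B_t(\theta)$ and is not from the paper's code.

```python
import numpy as np
from itertools import product

def argmax_on_grid(index_fn, lows, highs, points_per_dim=20):
    """Maximize a candidate index over a uniform grid covering Theta.
    The grid size grows exponentially in the dimension d, which is why
    discretization is listed as a caveat."""
    axes = [np.linspace(lo, hi, points_per_dim)
            for lo, hi in zip(lows, highs)]
    grid = list(product(*axes))
    values = [index_fn(np.array(th)) for th in grid]
    return np.array(grid[int(np.argmax(values))])
```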


SLIDE 27

Thank You for Your Attention!

• Poster #103
• Code: github.com/WolfLo/optimist
• Contact: matteo.papini@polimi.it
• Web page: t3p.github.io/icml19

SLIDE 28

References

[1] Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. (2013). Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711-7717.
[2] Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 1856-1865.
[3] Kleinberg, R., Slivkins, A., and Upfal, E. (2013). Bandits and experts in metric spaces. arXiv preprint arXiv:1312.1277.
[4] Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22.
[5] Pandey, S., Chakrabarti, D., and Agarwal, D. (2007). Multi-armed bandit problems with dependent arms. In Proceedings of the 24th International Conference on Machine Learning, pages 721-728. ACM.
[6] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning, pages 1889-1897.
[7] Sehnke, F., Osendorfer, C., Rückstieß, T., Graves, A., Peters, J., and Schmidhuber, J. (2008). Policy gradients with parameter-based exploration for control. In International Conference on Artificial Neural Networks, pages 387-396. Springer.
[8] Veach, E. and Guibas, L. J. (1995). Optimally combining sampling techniques for Monte Carlo rendering. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '95), pages 419-428. ACM Press.
