Reproducible, Reusable, and Robust Reinforcement Learning, by Joelle Pineau (PowerPoint presentation transcript)



SLIDE 1

Reproducible, Reusable, and Robust Reinforcement Learning

Joelle Pineau

Facebook AI Research, Montreal School of Computer Science, McGill University Neural Information Processing Systems (NeurIPS) December 5, 2018

SLIDE 2

Reusability. Reproducibility. Robustness.

Using the same materials as were used by the original investigator.

Bollen et al. National Science Foundation, 2015.

“Reproducibility refers to the ability of a researcher to duplicate the results of a prior study…. Reproducibility is a minimum necessary condition for a finding to be believable and informative.”

SLIDE 3

Reproducibility crisis in science (2016)

https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970



SLIDE 5

Reinforcement learning (RL)


[Figure: agent-environment loop; the agent receives state and reward from the environment and emits an action.]

Learn π = a strategy to find this cheese!

  • Very general framework for sequential decision-making.
  • Learning by trial-and-error, from sparse feedback.
  • Improves with experience, in real-time.
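The loop above can be sketched in a few lines of code. This toy 1-D "find the cheese" corridor and the function `run_episode` are invented here for illustration; they are not from the talk.

```python
import random

# The agent-environment loop as a minimal sketch: a random agent on a
# toy 1-D corridor, with the cheese at the far end (invented example).
def run_episode(length=5, max_steps=50, seed=0):
    rng = random.Random(seed)
    pos, total_reward = 0, 0.0
    for _ in range(max_steps):
        action = rng.choice([-1, 1])             # trial and error
        pos = max(0, min(length, pos + action))  # environment transition
        if pos == length:                        # sparse feedback: cheese!
            total_reward += 1.0
            break
    return total_reward

r = run_episode()   # 1.0 if the cheese was found within max_steps
```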

SLIDE 6

Impressive successes in games!


[Image: ELF game platform]

SLIDE 7

RL applications beyond games

  • Robotics
  • Video games
  • Conversational systems
  • Medical intervention
  • Algorithm improvement
  • Crop management
  • Personalized tutoring
  • Energy trading
  • Autonomous driving
  • Prosthetic arm control
  • Forest fire management
  • Financial trading
  • Many more!


SLIDE 8

Adaptive neurostimulation


[Figure: adaptive neurostimulation as an RL loop: state, reward, action.]

Panuccio, Guez, Vincent, Avoli, Pineau, Exp Neurol, 2013

SLIDE 9

[Figure: RL in simulation vs. RL in the real world, where only ~10^1 to 10^2 trials are available.]

SLIDE 10

25+ years of RL papers


  • P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, D. Meger. Deep Reinforcement Learning that Matters. AAAI 2018 (+updates).

[Chart: number of RL papers per year, over 25+ years.]

SLIDE 11

RL via Policy gradient methods

Maximize the expected return, J(\theta, s_0) = \mathbb{E}[\, r_0 + r_1 + \dots + r_T \mid s_0 \,], using gradient ascent:

\frac{\partial J(\theta, s_0)}{\partial \theta} = \sum_s \mu_{\pi_\theta}(s \mid s_0) \sum_a \frac{\partial \pi_\theta(a \mid s)}{\partial \theta} \, Q_{\pi_\theta}(s, a)

where \mu_{\pi_\theta} is the state distribution under the policy and Q_{\pi_\theta} is the value function.

[Figure: a neural network with parameters \theta maps the state to a policy \pi_\theta(a \mid s) over actions a_1, a_2, \dots, a_k.]
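As an illustration of this gradient, here is a minimal REINFORCE-style update on a toy two-armed bandit. Everything in this sketch (the bandit, reward means, and learning rate) is invented for illustration; real policy gradient implementations operate on full trajectories.

```python
import math
import random

# REINFORCE-style sketch of the policy gradient above, on a toy
# two-armed bandit with a softmax policy over parameters theta.
def softmax(theta):
    z = [math.exp(t) for t in theta]
    total = sum(z)
    return [x / total for x in z]

def reinforce(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]            # policy parameters, one per arm
    true_mean = [0.0, 1.0]        # arm 1 has the higher expected reward
    for _ in range(steps):
        p = softmax(theta)
        a = 0 if rng.random() < p[0] else 1
        r = true_mean[a] + rng.gauss(0, 0.1)
        # grad of log pi(a) wrt theta_i for a softmax policy: 1[a=i] - p_i
        for i in range(2):
            theta[i] += lr * r * ((1.0 if i == a else 0.0) - p[i])
    return softmax(theta)

probs = reinforce()   # the policy should concentrate on the better arm
```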

SLIDE 12

Policy gradient papers

» Evolution-Guided Policy Gradient in Reinforcement Learning
» On Learning Intrinsic Rewards for Policy Gradient Methods
» Evolved Policy Gradients
» Policy Optimization via Importance Sampling
» Dual Policy Iteration
» Post: Device Placement with Cross-Entropy Minimization and Proximal Policy Optimization
» Genetic-Gated Networks for Deep Reinforcement Learning
» Simple random search of static linear policies is competitive for reinforcement learning
» Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models
» …

Most papers use same policy gradient baseline algorithms.


NeurIPS’18

Many more at ICLR’18, ICML’18, AAAI’18, EWRL’18, CoRL’18, …

SLIDE 13

Policy gradient baseline algorithms

Same standard baselines used in all of these papers:

» Trust Region Policy Optimization (TRPO), Schulman et al. 2015.
» Proximal Policy Optimization (PPO), Schulman et al. 2017.
» Deep Deterministic Policy Gradients (DDPG), Lillicrap et al. 2015.
» Actor-Critic using Kronecker-Factored Trust Region (ACKTR), Wu et al. 2017.
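For concreteness, PPO's clipped surrogate objective (Schulman et al. 2017) can be sketched as follows. The function name and example numbers are mine, not from any of these codebases; `ratio` denotes pi_new(a|s) / pi_old(a|s), and the advantage estimates are assumed given.

```python
# Sketch of PPO's clipped surrogate objective: take the pessimistic
# minimum of the unclipped and clipped importance-weighted advantages.
def ppo_clip_objective(ratios, advantages, eps=0.2):
    total = 0.0
    for r, adv in zip(ratios, advantages):
        clipped = max(min(r, 1.0 + eps), 1.0 - eps)  # clip ratio to [1-eps, 1+eps]
        total += min(r * adv, clipped * adv)          # pessimistic (clipped) term
    return total / len(ratios)
```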


SLIDE 14

Consider the MuJoCo simulator:

[Figure: learning curves for four algorithms (Alg. 1 to Alg. 4) on HalfCheetah.]

Robustness of policy gradient algorithms

Video taken from: https://gym.openai.com/envs/HalfCheetah-v1


SLIDE 15

Consider the MuJoCo simulator:


[Figure: the same four algorithms (Alg. 1 to Alg. 4) compared across three MuJoCo environments.]

Robustness of policy gradient algorithms

SLIDE 16

Codebase comparison

TRPO implementations:



SLIDE 18

Effect of hyperparameter configurations

Unit activation:


Policy network structure:

SLIDE 19

An intricate interplay of hyperparameters!

How motivated are we to find the best hyperparameters for our baselines?
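The search space grows multiplicatively. A tiny grid over baseline hyperparameters can be sketched as follows; the grid values are invented, and the point is that a fair comparison means spending this tuning effort on the baselines too.

```python
import itertools

# A tiny hyperparameter grid for a baseline (values invented):
# every added axis multiplies the number of configurations to try.
grid = {
    "lr": [1e-4, 3e-4, 1e-3],
    "batch_size": [32, 64],
    "activation": ["tanh", "relu"],
}
# Cartesian product of all axes: 3 * 2 * 2 = 12 configurations.
configs = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
```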


SLIDE 20

Fair comparison is easy, right?


Same amount of data. Same amount of computation.

SLIDE 21

Let’s look a little closer


[Figure: two learning curves, each averaged over n = 5 runs.]

SLIDE 22

Let’s look a little closer


Both are the same TRPO code with the best hyperparameter configuration (n = 5 runs each)!
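A hypothetical illustration of this effect: one noisy training procedure, evaluated on two disjoint sets of n = 5 seeds, can produce averages that look like two different algorithms. All numbers below are invented.

```python
import random
import statistics

# One "algorithm" with large seed-to-seed variance (as on MuJoCo tasks);
# two disjoint seed sets give two different-looking average results.
def final_return(seed):
    rng = random.Random(seed)
    return 3000.0 + rng.gauss(0, 1500.0)   # invented seed variance

group_a = [final_return(s) for s in range(5)]      # "algorithm A"
group_b = [final_return(s) for s in range(5, 10)]  # "algorithm B" (same code!)
gap = abs(statistics.mean(group_a) - statistics.mean(group_b))
```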

SLIDE 23

How should we measure performance of the learned policy?

  • Average return over test trials?
  • Confidence interval?

How do we pick n?

[Figure: per-algorithm return distributions for Alg. 1 to Alg. 4.]
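One common answer, sketched here with made-up returns, is to report a bootstrap confidence interval over the n evaluation runs: if the interval is very wide, n is probably too small.

```python
import random
import statistics

# Bootstrap 95% confidence interval for the mean return over n runs.
def bootstrap_ci(returns, iters=10000, seed=0):
    rng = random.Random(seed)
    n = len(returns)
    # Resample with replacement, collect the resampled means, take quantiles.
    means = sorted(
        statistics.mean(rng.choices(returns, k=n)) for _ in range(iters)
    )
    return means[int(0.025 * iters)], means[int(0.975 * iters)]

returns = [48.0, 55.0, 60.0, 41.0, 52.0]   # n = 5 evaluation runs (invented)
lo, hi = bootstrap_ci(returns)             # a wide interval suggests n is too small
```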

SLIDE 24

How many trials?

SLIDE 25

Consider the case of n=10


[Figure: returns of individual runs (x-axis from 10 to 70), with the baseline to beat marked.]

SLIDE 26

Consider the case of n=10


[Figure: all n = 10 runs vs. only the top-3 results, with the baseline to beat marked in each panel.]

  • Strong positive bias: seems to beat the baseline!
  • Variance appears much smaller.
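This selection bias is easy to simulate; the returns below are synthetic, not from any benchmark.

```python
import random
import statistics

# Simulate reporting only the top-3 of n=10 runs: the reported average
# is positively biased relative to the honest average over all runs.
rng = random.Random(0)
runs = [rng.gauss(50.0, 10.0) for _ in range(10)]   # true mean is 50
top3 = sorted(runs, reverse=True)[:3]
all_mean = statistics.mean(runs)     # honest estimate
top3_mean = statistics.mean(top3)    # biased estimate
```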

SLIDE 27

https://www.alexirpan.com/2018/02/14/rl-hard.html

SLIDE 28

From fair comparisons…

  • Different methods have distinct sets of hyperparameters.
  • Different methods exhibit variable sensitivity to hyperparams.
  • What method is best often depends on data/compute budget.


to robust conclusions.

SLIDE 29

We surveyed 50 RL papers from 2018 (published at NeurIPS, ICML, ICLR). Yes:

  • Paper has experiments: 100%
  • Paper uses neural networks: 90%
  • All hyperparams for proposed algorithm are provided: 90%
  • All hyperparams for baselines are provided: 60%
  • Code is linked: 55%
  • Method for choosing hyperparams is specified: 20%
  • Evaluations on some variation of a hold-out test set: 10%
  • Significance testing applied: 5%
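Given that only about 5% of surveyed papers apply significance testing, here is a minimal Welch's t-statistic between two algorithms' evaluation returns. The numbers are illustrative; a complete test would also derive the degrees of freedom and a p-value.

```python
import math
import statistics

# Welch's t-statistic for two independent samples with unequal variances:
# t = (mean_x - mean_y) / sqrt(var_x / n_x + var_y / n_y)
def welch_t(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    vx, vy = statistics.variance(xs), statistics.variance(ys)  # sample variance
    return (mx - my) / math.sqrt(vx / len(xs) + vy / len(ys))

# Invented per-run returns for two algorithms:
t = welch_t([55, 60, 52, 58, 61], [50, 49, 53, 47, 52])
```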

SLIDE 30

Let's add a little shade!

[Same survey results as the previous slide, with the low percentages shaded.]

SLIDE 31

How about a reproducibility checklist?

SLIDE 32

How about a reproducibility checklist?

For all algorithms presented, check if you include:
☐ A clear description of the algorithm.
☐ An analysis of the complexity (time, space, sample size) of the algorithm.
☐ A link to downloadable source code, including all dependencies.

For any theoretical claim, check if you include:
☐ A statement of the result.
☐ A clear explanation of any assumptions.
☐ A complete proof of the claim.

SLIDE 33

How about a reproducibility checklist?

For all algorithms presented, check if you include:
☐ A clear description of the algorithm.
☐ An analysis of the complexity (time, space, sample size) of the algorithm.
☐ A link to downloadable source code, including all dependencies.

For any theoretical claim, check if you include:
☐ A statement of the result.
☐ A clear explanation of any assumptions.
☐ A complete proof of the claim.

For all figures and tables that present empirical results, check if you include:
☐ A complete description of the data collection process, including sample size.
☐ A link to a downloadable version of the dataset or simulation environment.
☐ An explanation of how samples were allocated for training / validation / testing.
☐ An explanation of any data that was excluded.
☐ The range of hyper-parameters considered, the method to select the best hyper-parameter configuration, and the specification of all hyper-parameters used to generate results.
☐ The exact number of evaluation runs.
☐ A description of how experiments were run.
☐ A clear definition of the specific measure or statistics used to report results.
☐ Clearly defined error bars.
☐ A description of results including central tendency (e.g. mean) and variation (e.g. stddev).
☐ The computing infrastructure used.

SLIDE 34

The role of infrastructure on reproducibility



SLIDE 36

Myth or fact?


Reinforcement Learning is the only case of ML where it is acceptable to test on your training set.

SLIDE 37

Myth or fact?

The RL generalization roadmap:

Classical RL: train/test on the same task. → … → AGI: test on anything!

SLIDE 38

Myth or fact?

The RL generalization roadmap:

Classical RL: train/test on the same task. → Separate tasks for train/test. → AGI: test on anything!

SLIDE 39

Myth or fact?

The RL generalization roadmap:

Classical RL: train/test on the same task. → Separate random seeds for train/test. → Separate tasks for train/test. → AGI: test on anything!

SLIDE 40

Generalization in RL

Results from Zhang, Ballas, Pineau, arXiv 2018. See also Zhang, Vinyals, Munos, Bengio 2018.

\mathrm{Err} = \frac{1}{N} \sum_{i=1}^{N} R\big(\pi \mid s_0 \sim \mathcal{D}_{\text{train}}\big) \; - \; \frac{1}{M} \sum_{j=1}^{M} R\big(\pi \mid s_0 \sim \mathcal{D}_{\text{test}}\big)

(the generalization error: average return from training start states minus average return from held-out start states)
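The definition above can be computed directly from evaluation returns. The helper name and the return values below are invented for illustration.

```python
import statistics

# Generalization error as on the slide: average return on training
# start states minus average return on held-out start states.
def generalization_error(train_returns, test_returns):
    return statistics.mean(train_returns) - statistics.mean(test_returns)

# Invented per-episode returns; a large positive gap indicates overfitting.
err = generalization_error([90.0, 88.0, 92.0], [70.0, 65.0, 75.0])
```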

SLIDE 41

Generalization in RL

Results from Zhang, Ballas, Pineau, arXiv 2018. See also Zhang, Vinyals, Munos, Bengio 2018.

[Same generalization-error definition, evaluated on the standard Acrobot simulator.]

SLIDE 42

Generalization in RL

Results from Zhang, Ballas, Pineau, arXiv 2018. See also Zhang, Vinyals, Munos, Bengio 2018.

[Same generalization-error definition, on the simulator from JC Gamboa Higuera, D. Meger, G. Dudek, ICRA'17.]

SLIDE 43

Natural world has incredible complexity!


SLIDE 44

Many RL benchmarks are ridiculously simple!

  • Low-dimensional state space (MuJoCo)
  • Small number of actions (ALE)
  • Few initial states
  • Deterministic transitions and rewards
  • Short description length, e.g. <100 KB

Easy to memorize! Brittle to perturbations.


SLIDE 45

Natural world => RL simulation

Zhang, Ballas, Pineau, arXiv 2018

Lantana camara!

[Figure: RL actions in the simulated environment.]

SLIDE 46

Natural world => RL simulation


Lantana camara!

Zhang, Ballas, Pineau, ArXiv 2018

SLIDE 47

SLIDE 48

Real-world video => RL simulation


Breakout (Atari)

Zhang, Wu, Pineau, 2018

SLIDE 49

Real-world video => RL simulation


Breakout (Atari)

[Figure: original Breakout frames vs. frames with natural video in the background.]

What is going on?

  • Add random video in the background: "natural" noise + game strategy.
  • Different train/test videos => clear train/test separation.
  • Fast and plentiful data acquisition.
  • Easy replication and comparison.

Zhang, Wu, Pineau, 2018
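The augmentation itself can be sketched as simple compositing: wherever a game frame is background, substitute the corresponding pixel from a video frame. The threshold and frame values below are invented, and the actual pipeline in Zhang, Wu, Pineau 2018 may differ.

```python
# Background-substitution sketch: where the game frame is (near-)black
# background, use the pixel from a natural video frame instead.
# Frames are nested lists of grayscale ints; the threshold is illustrative.
def composite(game_frame, video_frame, bg_threshold=10):
    return [
        [v if g <= bg_threshold else g
         for g, v in zip(game_row, video_row)]
        for game_row, video_row in zip(game_frame, video_frame)
    ]

game = [[0, 200], [0, 0]]       # 200 = game object, 0 = background
video = [[37, 99], [120, 64]]
out = composite(game, video)
```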

SLIDE 50

What is next? Embodied Intelligence via Photorealistic Simulators

Colleagues at FAIR + Georgia Tech + FRL

Whelan et al., 2018 (Facebook Reality Labs)

Multi-task RL in Photorealistic Simulators

SLIDE 51

Myth or fact?

The RL generalization roadmap:

Classical RL: train/test on the same task. → Separate random seeds for train/test. → Separate image/video backgrounds. → Multi-task photorealistic simulator. → AGI: test on anything!

Reinforcement Learning is the only case of ML where it is acceptable to test on your training set.

Not necessarily!

SLIDE 52

Step out into the real-world!


SLIDE 53

Reusability. Reproducibility. Robustness.

Science is a collective institution that aims to understand and explain.

SLIDE 54

[The reproducibility checklist from Slide 33, repeated.]

Reusability. Reproducibility. Robustness.

Science is a collective institution that aims to understand and explain.

SLIDE 55

SLIDE 56

SLIDE 57

Major Contributors:

RL Reproducibility: Peter Henderson, Riashat Islam, Phil Bachman, Doina Precup, David Meger, Joshua Romoff

Reproducibility Challenge: H. Larochelle, R. Nan Ke, K. Sinha, G. Fried

Natural RL: Amy Zhang, Nicolas Ballas, Yuxin Wu

Reasoning and Learning Lab @ McGill, MILA (RLLab) @ McGill, FAIR Montreal

SLIDE 58

Thank you!