Introduction to Evolution Strategy Algorithms




  1. INTRODUCTION TO EVOLUTION STRATEGY ALGORITHMS (James Gleeson, Eric Langlois, William Saunders)

  2. REINFORCEMENT LEARNING CHALLENGES
  - g(θ) is a discrete function of theta: how do we get a gradient ∇_θ g? Backprop struggles with a discrete ∇_θ g, local minima, and a sparse reward signal g(θ).
  - Credit assignment problem: Bob got a great bonus this year... what did Bob do to earn his bonus? Time horizon: 1 year. [+] Met all his deadlines. [+] Took an ML course 3 years ago.
  - IDEA: Let's just treat g like a black-box function when optimizing it.
  - Evolution strategy: "Try different θ" and see what works. If we find good θ's, keep them and discard the bad ones. Recombine θ_1 and θ_2 to form a new (possibly better) θ_3.

  3. EVOLUTION STRATEGY ALGORITHMS
  - The template:
  - "Sample" the new generation: generate some parameter vectors for your neural networks (e.g. an MNIST ConvNet).
  - Fitness: evaluate how well each neural network performs on a training set.
  - "Prepare" to sample the next generation: given how well each "mutant" performed, apply natural selection and keep the good ones. The ones that remain "recombine" to form the next generation (see the sketch below).
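A minimal sketch of this sample / evaluate / select / recombine loop, assuming a generic `fitness` function to maximize over a flat parameter vector (the names `simple_es`, `popsize`, `elite_frac` are illustrative, not from the slides):

```python
import numpy as np

def simple_es(fitness, dim, popsize=50, elite_frac=0.25, sigma=0.1, generations=100):
    """Toy evolution-strategy loop: sample, evaluate, keep the best, recombine."""
    mean = np.zeros(dim)                      # current "parent" parameter vector
    n_elite = max(1, int(popsize * elite_frac))
    for _ in range(generations):
        # "Sample" a new generation of parameter vectors around the current mean.
        population = mean + sigma * np.random.randn(popsize, dim)
        # Fitness: evaluate how well each mutant performs (higher is better here).
        scores = np.array([fitness(p) for p in population])
        # Natural selection: keep the top-performing mutants.
        elite = population[np.argsort(scores)[-n_elite:]]
        # "Recombine": the next generation is sampled around the elite mean.
        mean = elite.mean(axis=0)
    return mean
```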

  4. SCARY "TEST FUNCTIONS" (1)
  - Rastrigin function. [Plots: surface and contour views of the Rastrigin test function.]
  - Lots of local optima; it will be difficult to optimize with backprop + SGD!
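For reference, the standard Rastrigin function the slide plots can be written as follows (a sketch; the slide itself shows no code):

```python
import numpy as np

def rastrigin(x):
    """Rastrigin test function (global minimum 0 at x = 0); highly multimodal."""
    x = np.asarray(x, dtype=float)
    return 10.0 * x.size + np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x))
```

To minimize it with the `simple_es` sketch above, maximize its negation, e.g. `simple_es(lambda p: -rastrigin(p), dim=2)`.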

  5. SCARY "TEST FUNCTIONS" (2)
  - Schaffer function.
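The slide does not say which Schaffer variant it plots; a common choice in ES demos is Schaffer function N.2, sketched here for reference:

```python
import numpy as np

def schaffer_n2(x, y):
    """Schaffer function N.2 (global minimum 0 at the origin); a common 2-D variant."""
    num = np.sin(x**2 - y**2) ** 2 - 0.5
    den = (1.0 + 0.001 * (x**2 + y**2)) ** 2
    return 0.5 + num / den
```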

  6. WHAT WE WANT TO DO: "TRY DIFFERENT θ"
  - [Animations: the Schaffer and Rastrigin test functions, optimized with the CMA-ES algorithm.]

  7. CMA-ES: HIGH-LEVEL OVERVIEW
  - Step 1: Calculate the fitness of the current generation, sampled from the search distribution P(θ).
  - Step 2: Natural selection! Keep the top 25% (purple dots).
  - Step 3: Recombine to form the new generation: a discrepancy between the mean of the previous generation and the top 25% will cast a wider net!
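A heavily simplified sketch of these three steps, assuming a `fitness` function to maximize. Real CMA-ES additionally maintains evolution paths and a global step size; this toy version only refits a full-covariance Gaussian to the elite measured around the previous mean, which is what produces the "wider net" effect described above:

```python
import numpy as np

def cma_es_like(fitness, dim, popsize=100, elite_frac=0.25, generations=200):
    """Simplified CMA-ES-style loop: sample, keep the top 25%, refit the Gaussian."""
    mean = np.zeros(dim)
    cov = np.eye(dim)
    n_elite = max(2, int(popsize * elite_frac))
    for _ in range(generations):
        # Step 1: sample the current generation and compute fitness.
        population = np.random.multivariate_normal(mean, cov, size=popsize)
        scores = np.array([fitness(p) for p in population])
        # Step 2: natural selection -- keep the top 25%.
        elite = population[np.argsort(scores)[-n_elite:]]
        # Step 3: recombine. Deviations are measured from the PREVIOUS mean, so a
        # large move of the elite relative to the old mean widens the next search.
        centered = elite - mean
        cov = centered.T @ centered / n_elite + 1e-8 * np.eye(dim)
        mean = elite.mean(axis=0)
    return mean
```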

  8. ES: LESS COMPUTATIONALLY EXPENSIVE
  - IDEA: Sample neural-network parameters θ from a multivariate Gaussian with a diagonal covariance matrix, P(θ) = N(µ, Σ).
  - Update the sampling parameters [µ, Σ] using the REINFORCE gradient estimate.
  - θ: the neural-network parameters. [µ, Σ]: the parameters for sampling neural-network parameters.
  - Adaptive σ and µ.
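A sketch of this idea, with illustrative names (`diagonal_gaussian_es`, `lr`): the per-coordinate mean µ and standard deviation σ are updated with the score-function (REINFORCE) gradient of the Gaussian log-density, weighted by normalized fitness:

```python
import numpy as np

def diagonal_gaussian_es(fitness, dim, popsize=50, lr=0.03, generations=200):
    """ES over a diagonal-covariance Gaussian: theta ~ N(mu, diag(sigma^2));
    both mu and sigma are adapted via the REINFORCE gradient estimate."""
    mu = np.zeros(dim)
    sigma = 0.5 * np.ones(dim)                 # per-parameter standard deviations
    for _ in range(generations):
        thetas = mu + sigma * np.random.randn(popsize, dim)
        scores = np.array([fitness(t) for t in thetas])
        adv = (scores - scores.mean()) / (scores.std() + 1e-8)   # normalized fitness
        # Score-function gradients of log N(theta; mu, sigma^2) w.r.t. mu and sigma.
        grad_mu = ((thetas - mu) / sigma**2) * adv[:, None]
        grad_sigma = (((thetas - mu) ** 2 - sigma**2) / sigma**3) * adv[:, None]
        mu += lr * grad_mu.mean(axis=0)
        sigma += lr * grad_sigma.mean(axis=0)
        sigma = np.maximum(sigma, 1e-3)        # keep the search distribution alive
    return mu
```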

  9. ES: EVEN LESS COMPUTATIONALLY EXPENSIVE
  - IDEA: Just use the same σ and µ for each parameter → sample neural-network parameters from an "isotropic Gaussian": N(µ, σ²I).
  - Seems suspiciously simple... but it can compete!
  - OpenAI ES paper: σ is a hyperparameter (constant σ and µ). 1 set of hyperparameters for Atari, 1 set of hyperparameters for Mujoco. Competes with A3C and TRPO performance.
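The corresponding update in the OpenAI ES paper is θ ← θ + η · 1/(nσ) · Σ_i F_i ε_i, sketched below. Details the paper also uses (rank-based fitness shaping, antithetic sampling, weight decay) are omitted, and the function and argument names are illustrative:

```python
import numpy as np

def openai_es_step(theta, fitness, npop=100, sigma=0.1, lr=0.01):
    """One isotropic-Gaussian ES update: theta += lr * 1/(npop*sigma) * sum_i F_i * eps_i."""
    eps = np.random.randn(npop, theta.size)          # eps_i ~ N(0, I)
    returns = np.array([fitness(theta + sigma * e) for e in eps])
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    grad = (eps.T @ returns) / (npop * sigma)        # ES gradient estimate
    return theta + lr * grad
```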

  10. EVOLUTION STRATEGIES AS A SCALABLE ALTERNATIVE TO REINFORCEMENT LEARNING (James Gleeson, Eric Langlois, William Saunders)

  11. TODAY'S RL LANDSCAPE AND RECENT SUCCESS
  - Discrete action tasks: learning to play Atari from raw pixels; an expert-level Go player.
  - Continuous action tasks: "hopping" locomotion.
  - Q-learning: learn the action-value function. Policy gradient (e.g. TRPO): learn the policy directly.
  - In both cases: approximate the function using a neural network, trained with gradients computed via backpropagation (i.e. the chain rule).
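For concreteness, the two objectives in their standard textbook forms (the slide's own equations were not recoverable from the transcript, so these are assumed, not quoted):

```latex
% Q-learning: regress Q_w toward the one-step bootstrap target.
\[
L(w) \;=\; \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q_w(s', a') - Q_w(s, a)\big)^{2}\Big]
\]
% Policy gradient (REINFORCE / likelihood-ratio form), as used by TRPO-style methods.
\[
\nabla_{\theta} J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_{\theta}}\Big[\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, R(\tau)\Big]
\]
```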

  12. MOTIVATION: PROBLEMS WITH BACKPROPAGATION
  - Backpropagation isn't perfect:
  - GPU memory requirements.
  - Difficult to parallelize.
  - Cannot be applied directly to non-differentiable functions, e.g. discrete functions G(θ) (the topic of this course).
  - Exploding gradients (e.g. for RNNs).
  - You have a datacenter, and cycles to spend on an RL problem.

  13. AN ALTERNATIVE TO BACKPROPAGATION: EVOLUTION STRATEGY (ES)
  - Claim: we can estimate the gradient of the objective G(θ) using no derivatives of G(θ), and have it be embarrassingly parallel.
  - Proof sketch: a 2nd-order Taylor series approximation, plus the fact that G(θ) is independent of ε.
  - No chain rule / backprop required!
  - Relevant to our course: G(θ) could be a discrete function of θ.
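A reconstruction of that Taylor-series argument, assuming ε ~ N(0, I) and writing H for the Hessian of G at θ:

```latex
\[
\frac{1}{\sigma}\,\mathbb{E}_{\epsilon}\big[G(\theta + \sigma\epsilon)\,\epsilon\big]
\;\approx\;
\frac{1}{\sigma}\,\mathbb{E}_{\epsilon}\Big[\Big(G(\theta)
  + \sigma\,\epsilon^{\top}\nabla_{\theta}G(\theta)
  + \tfrac{\sigma^{2}}{2}\,\epsilon^{\top}H\,\epsilon\Big)\epsilon\Big]
\;=\;
\nabla_{\theta}G(\theta).
\]
% The first term vanishes because G(\theta) is independent of \epsilon and \mathbb{E}[\epsilon]=0,
% the second gives \mathbb{E}[\epsilon\epsilon^{\top}]\nabla_{\theta}G(\theta) = \nabla_{\theta}G(\theta),
% and the third vanishes because odd Gaussian moments are zero.
```

Only evaluations of G(θ + σε) appear on the left-hand side, which is why no derivatives of G and no backprop are needed.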

  14. THE MAIN CONTRIBUTION OF THIS PAPER
  - Criticisms: evolution strategies aren't new! Common sense says the variance/bias of this gradient estimator will be too high, making the algorithm unstable on today's problems!
  - This paper aims to refute your common sense, with a comparison against state-of-the-art RL algorithms:
  - Atari: half the games do better than a recent algorithm (A3C), half the games do worse.
  - Mujoco: can match state-of-the-art policy gradients on continuous action tasks.
  - Linear speedups with more compute nodes: 1 day with A3C → 1 hour with ES.

  15. FIRST ATTEMPT AT ES: THE SEQUENTIAL ALGORITHM
  - Gradient estimator needed for updating θ: sample perturbations ε_i; in RL, the fitness G(θ) is defined as the episode return.
  - Algorithm: generate n random perturbations of θ, sequentially run each mutant, then compute the gradient estimate.
  - Embarrassingly parallel! For each Worker_i, i = 1..n: Worker_i computes G_i in parallel.
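A sketch of the sequential algorithm, assuming `run_episode(theta)` returns the episode return G (names are illustrative); each iteration of the inner loop is what a Worker_i could instead compute in parallel:

```python
import numpy as np

def es_sequential(theta, run_episode, n=50, sigma=0.1, lr=0.01, iterations=100):
    """Sequential ES: perturb theta n times, run each mutant, form the gradient estimate."""
    for _ in range(iterations):
        eps = np.random.randn(n, theta.size)          # n random perturbations of theta
        G = np.empty(n)
        for i in range(n):                            # sequential: one mutant at a time
            G[i] = run_episode(theta + sigma * eps[i])
        grad = (eps.T @ G) / (n * sigma)              # ES gradient estimate
        theta = theta + lr * grad
    return theta
```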

  16. SECOND ATTEMPT: THE PARALLEL ALGORITHM
  - Embarrassingly parallel!
  - KEY IDEA: minimize communication cost. Avoid sending the length-|θ| vector ε; send the length-1 scalar G_i instead.
  - How? Each worker reconstructs the random perturbation vector ε... how? Make the initial random seed of each Worker_i globally known.
  - With G_i and ε_i known by everyone, each worker computes the same gradient estimate.
  - Tradeoff: redundant computation in exchange for a |θ| → 1 reduction in message size.
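A sketch of one worker's side of this scheme. Only scalar returns are communicated; each ε_i is regenerated locally from worker i's globally known seed, so every worker computes the identical update (the helper name and signature are illustrative, not the paper's code):

```python
import numpy as np

def worker_update(theta, all_seeds, all_returns, sigma=0.1, lr=0.01):
    """One worker's view of the parallel ES update.
    Each worker broadcast only its scalar return G_i; the perturbations eps_i are
    reconstructed locally from the globally known per-worker seeds, so the
    |theta|-sized vectors never go over the network."""
    n = len(all_seeds)
    grad = np.zeros_like(theta)
    for i in range(n):
        rng = np.random.RandomState(all_seeds[i])     # seed known to everyone
        eps_i = rng.randn(theta.size)                 # reconstruct worker i's perturbation
        grad += all_returns[i] * eps_i                # redundant but communication-free
    grad /= n * sigma
    return theta + lr * grad                          # identical result on every worker
```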

  17. EXPERIMENT: HOW WELL DOES IT SCALE?
  - Linearly! With diminishing returns, which are often inevitable.
  - [Plot: actual speedup vs. ideal (perfectly linear) speedup; 200 cores, 60 minutes.]
  - Criticism: are the diminishing returns due to (a) increased communication cost from more workers, or (b) less reduction in the variance of the gradient estimate from more workers?

  18. INTRINSIC DIMENSIONALITY OF THE PROBLEM
  - The ES update ≈ finite differences in some random direction ε → does the # of update steps scale with |θ|?
  - Argument: the # of update steps in ES scales with the intrinsic dimensionality of θ needed for the problem, not with the length of θ.
  - Justification, e.g. simple linear regression: double |θ| → |θ'|, with the perturbations to the duplicated weights drawn as ε_1 ~ ε_2 ~ N(µ, σ²). After adjusting η and σ, the update step has the same effect → the same # of update steps.
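One way to read the "finite differences in a random direction" claim (a sketch of the reasoning, not quoted from the slide):

```latex
\[
\frac{1}{\sigma}\,G(\theta + \sigma\epsilon)\,\epsilon
\quad\text{has the same expectation as}\quad
\frac{G(\theta + \sigma\epsilon) - G(\theta)}{\sigma}\,\epsilon
\;\approx\; \big(\epsilon^{\top}\nabla_{\theta}G(\theta)\big)\,\epsilon ,
\]
% since subtracting the baseline G(\theta) contributes G(\theta)\,\mathbb{E}[\epsilon] = 0.
```

Each update therefore probes the directional derivative of G along a single random direction ε.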

  19. WHEN IS ES A BETTER CHOICE THAN POLICY GRADIENTS?
  - How do we compute gradients?
  - Policy gradients: the policy network outputs a softmax of probabilities over the discrete actions, and we sample an action randomly.
  - Evolution strategy (ES): we randomly perturb our parameters θ, then select actions according to the perturbed policy.
  - The variance of the policy-gradient estimate grows linearly with the length of the episode (credit assignment problem); the discount γ only fixes this for short-term returns.
  - The ES estimate is independent of episode length: ES makes fewer (potentially incorrect) assumptions.
  - ASIDE: in case you forgot, for independent X & Y: Var(X + Y) = Var(X) + Var(Y).
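A sketch of why the policy-gradient variance grows with the episode length T while the ES estimate does not; this uses the independence aside above and is an assumption about the slide's exact argument:

```latex
\[
\nabla_{\theta}\log p(\tau;\theta) \;=\; \sum_{t=1}^{T} \underbrace{\nabla_{\theta}\log \pi_{\theta}(a_t \mid s_t)}_{X_t}
\qquad\Rightarrow\qquad
\mathrm{Var}\Big[\sum_{t=1}^{T} X_t\Big] \;=\; \sum_{t=1}^{T}\mathrm{Var}[X_t] \;=\; O(T)
\]
% if the per-timestep terms X_t are roughly independent, whereas ES samples a single
% perturbation \epsilon per episode, so T does not enter its estimator's variance.
```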

  20. EXPERIMENT: ES ISN'T SENSITIVE TO THE LENGTH OF EPISODE τ
  - Frame-skip F: the agent can select an action every F frames of input pixels.
  - E.g. F = 4: on frame 1 the agent selects an action; on the remaining frames it is forced to take the Noop action.
  - IDEA: artificially inflate the length of an episode τ while keeping ≈ the same policy (playing Pong with frameskip).
  - Argument: since the ES algorithm doesn't make any assumption about the time horizon γ (decaying reward), it is less sensitive to long episodes τ (i.e. the credit assignment problem).

  21. EXPERIMENT: LEARNED PERFORMANCE
  - The authors looked at:
  - Discrete action tasks -- Atari
  - Continuous action tasks -- Mujoco

  22. EXPERIMENT: DISCRETE ACTION TASKS -- ATARI
  - Paper's claim: "Given the same amount of compute time as other algorithms, compared to A3C, ES does better on 21 games, worse on 29" (50 games in total).
  - Slightly misleading claim if you aren't reading carefully: counting which algorithm achieves the best score on each game gives 4 (8%), 19 (38%), 7 (14%), 9 (18%), 11 (22%) across the compared algorithms, so A3C still does better on most games across all algorithms → ES is still beaten by other algorithms when it beats A3C.

  23. EXPERIMENT: CONTINUOUS ACTION TASKS -- MUJOCO
  - Sample complexity: how many steps in the environment were needed to reach X% of policy-gradient performance?
  - Ratio = (# ES timesteps) / (# TRPO timesteps): < 1 → better sample complexity; > 1 → worse sample complexity.
  - Harder tasks: at most 10x more samples required. Simpler tasks: as few as 0.33x the samples required.

  24. SUMMARY: EVOLUTION STRATEGY
  - ES is a viable alternative to current RL algorithms:
  - Q-learning: learn the action-value function. Policy gradient (e.g. TRPO): learn the policy directly.
  - ES: treat the problem like a black box, perturb θ and evaluate the fitness F(θ).
  - No potentially incorrect assumptions about the credit assignment problem (e.g. time horizon γ).
  - No backprop required.
  - Embarrassingly parallel.
  - Lower GPU memory requirements.

