INTRODUCTION TO EVOLUTION STRATEGY ALGORITHMS
James Gleeson Eric Langlois William Saunders
REINFORCEMENT LEARNING CHALLENGES
Credit assignment problem: Bob got a great bonus this year! …what did Bob do to earn his bonus? Time horizon: 1 year. [+] Met all his deadlines. [+] Took an ML course 3 years ago.
F(θ) is a discrete function of θ… How do we get a gradient ∇_θ F? Obstacles: F(θ) is discrete, backprop gets stuck in local minima, and the reward signal is sparse.
IDEA: Let's just treat F like a black-box function when optimizing it. "Try different θ's, and see what works." If we find good θ's, keep them and discard the bad ones. Recombine θ_1 and θ_2 to form a new (possibly better) θ_3. This is an evolution strategy.
- The template:
1. "Sample" a new generation: generate some parameter vectors for your neural networks (e.g. MNIST ConvNet parameters).
2. Fitness: evaluate how well each neural network performs on a training set.
3. Natural selection! Keep the good ones.
4. "Prepare" to sample the new generation: given how well each "mutant" performed, the ones that remain "recombine" to form the next generation.
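A rough Python sketch of this template; the fitness callable, population size, and keep-the-top-quarter selection rule are illustrative assumptions, not the exact recipe from the slides:

import numpy as np

def evolution_strategy_template(fitness, dim, pop_size=50, n_generations=100, sigma=0.1):
    """Generic ES template: sample, evaluate fitness, select, recombine."""
    mu = np.zeros(dim)  # current "parent" parameter vector
    for _ in range(n_generations):
        # 1. "Sample" a new generation of parameter vectors (mutants).
        population = mu + sigma * np.random.randn(pop_size, dim)
        # 2. Fitness: evaluate how well each parameter vector performs.
        scores = np.array([fitness(theta) for theta in population])
        # 3. Natural selection: keep the best quarter of the population.
        elite = population[np.argsort(scores)[-pop_size // 4:]]
        # 4. Recombine the survivors to form the next generation's parent.
        mu = elite.mean(axis=0)
    return mu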
Test function: the Rastrigin function. Lots of local optima; it would be difficult to optimize with backprop + SGD! A second test function: the Schaffer function.
[Plots: the Rastrigin and Schaffer functions over θ.]
Step 1: Calculate the fitness F of the current generation.
Step 2: Natural selection! Keep the top 25% (the purple dots in the plot).
Step 3: Recombine to form the new generation: a large discrepancy between the mean of the previous generation and the mean of the top 25% will cast a wider net!
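A small concrete run of these three steps on the Rastrigin function; the population size, elite fraction, and the particular way σ is widened below are illustrative choices, not the exact algorithm from the slides:

import numpy as np

def rastrigin(theta):
    """Rastrigin test function; we maximize its negative, so fitness = -rastrigin."""
    return 10 * len(theta) + np.sum(theta**2 - 10 * np.cos(2 * np.pi * theta))

mu, sigma = np.random.uniform(-5, 5, size=2), 1.0     # initial sampling parameters
for generation in range(200):
    population = mu + sigma * np.random.randn(100, 2)          # sample a generation
    fitness = -np.array([rastrigin(t) for t in population])    # Step 1: fitness
    elite = population[np.argsort(fitness)[-25:]]              # Step 2: keep top 25%
    new_mu = elite.mean(axis=0)                                 # Step 3: recombine
    # Illustrative heuristic: if the elite mean moved far from the old mean,
    # cast a wider net next generation.
    sigma = 0.9 * sigma + np.linalg.norm(new_mu - mu)
    mu = new_mu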
p(θ): the distribution parameters (µ and σ) used for sampling the neural-network parameters θ. One choice: adaptive σ and µ.
Constant σ and µ
- IDEA: Just use the same σ and µ for each parameter → sample the neural-network parameters from an "isotropic Gaussian".
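For illustration (the shapes and values below are made up), an isotropic Gaussian simply means one scalar σ shared by every coordinate of θ:

import numpy as np

theta = np.zeros(100_000)   # mean µ: the current neural-network parameters, flattened
sigma = 0.02                # one scalar σ shared by every parameter (a hyperparameter)

# Isotropic Gaussian sample: every coordinate gets the same standard deviation.
theta_mutant = theta + sigma * np.random.randn(theta.size)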
- Seems suspiciously simple… but it can compete!
- OpenAI ES paper:
  - σ is a hyperparameter
  - 1 set of hyperparameters for Atari
  - 1 set of hyperparameters for Mujoco
  - Competes with A3C and TRPO performance
Q-learning: learn the action-value function Q(s, a).
- Discrete action tasks:
  - Learning to play Atari from raw pixels
  - Expert-level Go player
Policy gradient (e.g. TRPO): learn the policy directly.
- Continuous action tasks:
  - "Hopping" locomotion
In both cases, approximate the function with a neural network and train it using gradients computed via backpropagation (i.e. the chain rule).
- Backpropagation isn't perfect:
  - GPU memory requirements
  - Difficult to parallelize
  - Cannot apply directly to non-differentiable functions
    - e.g. discrete functions F(θ) (the topic of this course)
  - Exploding gradients (e.g. for RNNs)
- You have a datacenter, and cycles to spend on the RL problem.
…and can it be embarrassingly parallel? Yes: the ES gradient estimate needs no derivatives of F. Proof sketch: write the objective as an expectation over perturbed parameters θ̃ ∼ N(θ, σ²I). The integrand F(θ̃) itself is independent of θ; only the Gaussian density depends on θ, so the gradient of the log-density does the work:
∇_θ E_{θ̃∼N(θ,σ²I)}[F(θ̃)] = E_{θ̃}[F(θ̃) ∇_θ log p(θ̃ | θ)] = (1/σ) E_{ε∼N(0,I)}[F(θ + σε) ε]
No derivatives of F(θ) and no chain rule / backprop required! F(θ) can even be a discrete function of θ (relevant to our course). Claim: the estimator can also be justified by a 2nd-order Taylor series approximation.
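The 2nd-order Taylor claim can be made concrete; this is a sketch of the standard argument, assuming F is twice differentiable and σ is small, reconstructed here rather than copied from the slides:

% Second-order Taylor expansion around θ, with ε ~ N(0, I):
F(\theta + \sigma\epsilon) \approx F(\theta)
    + \sigma\,\epsilon^{\top}\nabla F(\theta)
    + \tfrac{\sigma^{2}}{2}\,\epsilon^{\top}\nabla^{2}F(\theta)\,\epsilon

% Plug this into the estimator; odd Gaussian moments vanish and E[ε ε^T] = I, so:
\frac{1}{\sigma}\,\mathbb{E}_{\epsilon}\!\left[F(\theta + \sigma\epsilon)\,\epsilon\right]
    \approx \nabla F(\theta) + O(\sigma^{2})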
- Criticisms:
- This paper aims to refute your common sense:
- Comparison against state-of-the-art RL algorithms:
- Atari:
ES does better than a recent algorithm (A3C) on half the games, and worse on the other half.
- Mujoco:
Can match state-of-the-art policy gradients on continuous action tasks. Linear speedups with more compute nodes: 1 day with A3C → 1 hour with ES.
- Evolution strategies aren't new!
- Common sense:
Generate n random perturbations of θ
A gradient estimator is needed for updating θ. In RL, the fitness F(θ) is defined as the total reward that the policy with parameters θ collects over an episode.
Sequentially run each mutant θ + σε_i, then compute the gradient estimate:
∇_θ E[F] ≈ (1/(nσ)) Σ_{i=1..n} F(θ + σε_i) ε_i
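A serial Python sketch of this estimator; the fitness interface and hyperparameters are assumptions made for illustration, not the paper's actual code:

import numpy as np

def es_update(theta, fitness, n=50, sigma=0.1, lr=0.01):
    """One ES update: sample n mutants, run each one, combine into a gradient estimate.

    `fitness(theta)` should run one episode with policy parameters `theta`
    and return the total reward (a single scalar); it is treated as a black box.
    """
    epsilons = np.random.randn(n, theta.size)            # n random perturbations of θ
    returns = np.array([fitness(theta + sigma * eps)     # sequentially run each mutant
                        for eps in epsilons])
    # Gradient estimate: (1 / (n·σ)) · Σ_i F(θ + σ·ε_i) · ε_i
    grad = (epsilons.T @ returns) / (n * sigma)
    return theta + lr * grad                              # gradient ascent on fitness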
Sampling and evaluation are embarrassingly parallel! For each worker_i, i = 1..n: worker_i computes F_i = F(θ + σε_i) in parallel.
- KEY IDEA: Minimize communication cost
How? Each worker reconstructs the random perturbation vectors ε_i locally. …How? Make the initial random seed of each worker_i globally known. With all F_i and ε_i known by everyone, each worker computes the same gradient estimate. Embarrassingly parallel! Trade-off: redundant computation in exchange for messages of a single scalar instead of size |θ|.
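A minimal sketch of the shared-seed trick; the worker functions, seed table, and constants below are hypothetical names chosen to illustrate the communication pattern, not the paper's implementation:

import numpy as np

N_WORKERS, SIGMA, LR = 4, 0.1, 0.01
SEEDS = list(range(N_WORKERS))          # globally known seeds, broadcast once at startup

def perturbation(seed, dim):
    """Any worker can rebuild worker `seed`'s perturbation from the shared seed."""
    return np.random.RandomState(seed).randn(dim)

def worker_step(worker_id, theta, fitness):
    """Each worker evaluates only its own mutant and returns a single scalar."""
    eps = perturbation(SEEDS[worker_id], theta.size)
    return fitness(theta + SIGMA * eps)                  # the only value sent over the network

def local_update(theta, all_returns):
    """Every worker receives the n scalars, rebuilds all ε_i, and applies the same update."""
    grad = np.zeros_like(theta)
    for seed, ret in zip(SEEDS, all_returns):
        grad += ret * perturbation(seed, theta.size)     # redundant recomputation of ε_i
    return theta + LR * grad / (len(SEEDS) * SIGMA)

Each worker only ever sends its scalar return; everything of size |θ| is recomputed locally, which is exactly the redundant-computation-for-small-messages trade-off described above.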
- Linearly!
[Plot: actual speedup vs. ideal speedup (perfectly linear); roughly 200 cores, 60 minutes.]
Criticism: Are the diminishing returns due to:
- the added cost from more workers?
- the diminishing benefit to the gradient estimate from more workers?
ES is essentially finite differences in some random direction ε, so does the # of update steps scale with |θ|? Justification that it need not, e.g. simple linear regression: double the dimension, |θ| → |θ'| = 2|θ|; after adjusting η and σ, the update step has the same effect → same # of update steps.
ASIDE: In case you forgot, for independent X and Y: Var(X + Y) = Var(X) + Var(Y).
Policy gradients: the policy network outputs a softmax of probabilities over the discrete actions, and we sample an action randomly at every step. The variance of the gradient estimate therefore grows linearly with the length of the episode. The discount γ only fixes this for short-term returns!
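A sketch of the variance argument, under the simplifying assumption that the per-timestep terms are roughly independent (the symbols here are the usual REINFORCE notation, not taken from the slides):

% Score-function (REINFORCE) gradient for one episode of length T:
\hat{g} = \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, R_t

% If the T per-timestep terms are (roughly) independent, their variances add:
\operatorname{Var}(\hat{g}) = \sum_{t=1}^{T}
    \operatorname{Var}\!\big(\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, R_t\big)
    = O(T)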
Evolution strategies (ES): we randomly perturb our parameters once per episode, θ̃ = θ + σε, and then select actions according to the perturbed policy. The variance of the gradient estimate is independent of the episode length. On the credit assignment problem, ES makes fewer (potentially incorrect) assumptions. How do we compute gradients? From the fitness of the perturbed parameters, as in the estimator above.
- Frame-skip F:
- The agent can select an action every F frames
- E.g. F = 4
Argument: since the ES algorithm doesn't make any assumption about the time horizon γ (decaying reward), it is less sensitive to long episodes τ (i.e. the credit assignment problem). Playing Pong with a different frame-skip → the same policy π.
- The authors looked at:
  - Discrete action tasks (Atari)
  - Continuous action tasks (Mujoco)
- Paper's claim:
50 games in total: 4 (8%); best score on 19 (38%); 11 (22%); 7 (14%); 9 (18%).
Slightly misleading claim if you aren't reading carefully: A3C still does better on most games, and across all algorithms ES is still beaten by some other algorithm even when it beats A3C.
Sample complexity: how many steps in the environment were needed to reach X% of policy-gradient performance? Ratio = (# ES environment steps) / (# policy-gradient environment steps): < 1 → better sample complexity; > 1 → worse sample complexity. Simpler tasks: as few as 0.33× the samples required. Harder tasks: at most 10× more samples required.
- Evolution strategies are a viable alternative to current RL algorithms.
- ES: treat the problem like a black box, perturb θ and evaluate the fitness F(θ):
- No potentially incorrect assumptions about the credit assignment problem (e.g. time horizon γ)
- No backprop required
- Embarrassingly parallel
- Lower GPU memory requirements