Evolution Strategies
Distributed deep reinforcement learning
Steven Schmatz
@stevenschmatz Evolution Strategies November 21, 2017
(blog.otoro.net)
(reinforcement learning)
Reinforcement learning violates three assumptions that supervised learning relies on:
- Stationary distribution
- Independence
- Clear input-output relationship
Stationary distribution: the training data changes as you act differently.

Independence: adjacent game frames are usually very similar.

Clear input-output relationship: there can be a large delay between action and reward.
Model: a policy that maps observations to actions, parameterized by weights θ.

Training objective (our objective): maximize the expected total reward, J(θ) = E[R].

Weight update (our weight update): gradient ascent on J, θ ← θ + α ∇θ J(θ).
- What if our reward function is highly nonlinear?
- How far should we step?
- What if our reward is received much later?
- What if our policy is non-differentiable?
At each iteration:
1. Generate candidate solutions from a probability distribution.
2. Evaluate a fitness function for each candidate.
3. Aggregate the results and discard bad candidates.
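The loop above can be sketched in a few lines of NumPy. This is only an illustration: the fitness function, population size, and elite fraction are made-up placeholders, not anything specific from the talk.

```python
import numpy as np

def evolution_strategy(fitness, dim, pop_size=50, elite_frac=0.2, iters=100, seed=0):
    """Generic ES loop: sample candidates, score them, keep the best."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)          # center of the current search distribution
    sigma = 0.5                   # fixed standard deviation
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(iters):
        # 1. Generate candidate solutions from the search distribution.
        candidates = mean + sigma * rng.standard_normal((pop_size, dim))
        # 2. Evaluate a fitness function for each candidate.
        scores = np.array([fitness(c) for c in candidates])
        # 3. Aggregate the results and discard bad candidates.
        elite = candidates[np.argsort(scores)[-n_elite:]]
        mean = elite.mean(axis=0)
    return mean

# Toy fitness: negative squared distance from (3, -2); the maximum is at (3, -2).
best = evolution_strategy(lambda x: -np.sum((x - np.array([3.0, -2.0]))**2), dim=2)
```

Note that the loop never computes a gradient of the fitness function; it only ranks candidates, which is why the policy need not be differentiable.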
Basic idea:
Select the single best previous solution, and add Gaussian noise. (Keep standard deviation fixed.)
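A minimal sketch of this simplest variant, greedy hill-climbing with fixed Gaussian noise (toy fitness and all parameter values are illustrative assumptions):

```python
import numpy as np

def simple_es(fitness, x0, pop_size=50, sigma=0.3, iters=200, seed=0):
    """Simplest ES: perturb the single best solution with fixed Gaussian noise."""
    rng = np.random.default_rng(seed)
    best = np.asarray(x0, dtype=float)
    for _ in range(iters):
        # Candidates = best-so-far plus Gaussian noise (standard deviation fixed).
        pop = best + sigma * rng.standard_normal((pop_size, best.size))
        scores = [fitness(p) for p in pop]
        champ = pop[int(np.argmax(scores))]
        # Greedily keep whichever single solution scores highest so far.
        if fitness(champ) > fitness(best):
            best = champ
    return best

# Toy fitness with its maximum at the origin.
peak = simple_es(lambda x: -np.sum(x**2), x0=[5.0, 5.0])
```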
Basic idea:
Only keep the top-performing 10% of solutions. Randomly select two of them and recombine: assign each parameter value at random from either parent, and add fixed Gaussian noise.
Example:
Combine (1, 2, 3) and (4, 5, 6): each child coordinate comes from one parent or the other, e.g. (1, 5, 3) or (4, 2, 6), plus noise.
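The recombination step can be sketched as follows (the noise scale is an illustrative assumption; the talk only says it is fixed):

```python
import random

def crossover(parent_a, parent_b, sigma=0.1, rng=random):
    """Recombine two parents: each gene is taken from either parent at
    random, then fixed Gaussian noise (mutation) is added."""
    child = [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]
    return [g + rng.gauss(0.0, sigma) for g in child]

random.seed(0)
child = crossover([1, 2, 3], [4, 5, 6])
# Each gene of `child` lies near 1 or 4, near 2 or 5, and near 3 or 6.
```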
Basic idea:
Select the best 25% of the population. Calculate a covariance matrix of these best 25%. (represents a promising area to search for new candidates) Generate new candidates using the per- parameter means and variances.
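One generation of this idea can be sketched with NumPy. This is a simplified cross-entropy-style illustration of fitting a Gaussian to the elites, not the full CMA-ES update (which also adapts step sizes and accumulates evolution paths); the fitness function and sizes are made up.

```python
import numpy as np

def next_generation(population, fitness, pop_size, elite_frac=0.25, rng=None):
    """Fit a Gaussian to the best 25% and sample new candidates from it."""
    if rng is None:
        rng = np.random.default_rng(0)
    scores = np.array([fitness(x) for x in population])
    n_elite = max(2, int(len(population) * elite_frac))
    elite = population[np.argsort(scores)[-n_elite:]]
    mean = elite.mean(axis=0)                        # per-parameter means
    cov = np.cov(elite, rowvar=False)                # covariance of the elites
    cov += 1e-6 * np.eye(len(mean))                  # ridge for numerical safety
    # New candidates are drawn from the promising region the elites define.
    return rng.multivariate_normal(mean, cov, size=pop_size)

rng = np.random.default_rng(1)
pop = rng.standard_normal((40, 2)) * 3.0             # random initial population
for _ in range(30):
    pop = next_generation(pop, lambda x: -np.sum(x**2), pop_size=40, rng=rng)
```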
Problem:
Computing, storing, and sampling a full covariance matrix scales quadratically with the number of parameters, so this becomes impractical for neural networks with many thousands of weights.
Basic idea:
Treat the problem a bit differently: maximize expected fitness under a Gaussian search distribution,
J(θ) = E[F(θ + σε)], with ε ~ N(0, I).
This objective is smooth even when F is not, and its gradient can be estimated from fitness evaluations alone:
∇θ J(θ) = (1/σ) E[F(θ + σε) ε].
Then use the gradient with your favorite SGD optimizer.
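A minimal sketch of this gradient estimator, assuming the score-function identity above and plugging the estimate into plain gradient ascent (the toy fitness and all hyperparameters are illustrative):

```python
import numpy as np

def es_gradient(F, theta, sigma=0.1, n=100, rng=None):
    """Monte Carlo estimate of grad_theta E[F(theta + sigma * eps)]
    via (1 / (n * sigma)) * sum_i F(theta + sigma * eps_i) * eps_i."""
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal((n, theta.size))
    returns = np.array([F(theta + sigma * e) for e in eps])
    return (eps * returns[:, None]).sum(axis=0) / (n * sigma)

# Plug the estimate into any SGD-style optimizer (plain gradient ascent here).
rng = np.random.default_rng(0)
theta = np.array([2.0, -1.0])
for _ in range(300):
    theta = theta + 0.05 * es_gradient(lambda x: -np.sum(x**2), theta, n=200, rng=rng)
```

Note that `F` is only ever evaluated, never differentiated, which is what makes the method work for non-differentiable policies.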
Basic idea:
Similar to Natural ES, but σ constant.
Note: to parallelize, workers only need to exchange (random seed, reward) pairs!
Initialize:
1. Create a shared list of random seeds, one per worker, and the initial parameters.

Repeat:
1. Each worker samples its own Gaussian perturbation from its seed.
2. Each worker evaluates the perturbed policy to get a scalar return.
3. Each worker communicates its scalar return to all nodes.
4. Each worker reconstructs the perturbations of all other nodes using the known random seeds.
5. Every worker applies the same parameter update, so all nodes stay in sync.
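The seed trick can be simulated in a single process, with each "worker" as a loop index (the toy fitness, seven workers, and all hyperparameters are illustrative assumptions):

```python
import numpy as np

def distributed_es_step(theta, fitness, seeds, sigma=0.1, lr=0.01):
    """One ES iteration as every worker would run it. Workers share only
    scalar returns; perturbations are reconstructed from the shared seed
    list, never transmitted."""
    # Each worker samples its own perturbation from its seed and
    # broadcasts a single scalar return.
    returns = []
    for seed in seeds:
        eps = np.random.default_rng(seed).standard_normal(theta.size)
        returns.append(fitness(theta + sigma * eps))
    # Every worker now reconstructs all perturbations from the seed list
    # and applies the identical update, keeping parameters in sync.
    grad = np.zeros_like(theta)
    for seed, ret in zip(seeds, returns):
        eps = np.random.default_rng(seed).standard_normal(theta.size)
        grad += ret * eps
    grad /= len(seeds) * sigma
    return theta + lr * grad

theta = np.array([1.5, -0.5])
workers = list(range(7))  # one seed slot per worker
for step in range(400):
    # Fresh seeds each iteration so perturbations are not reused.
    seeds = [w + 7 * step for w in workers]
    theta = distributed_es_step(theta, lambda x: -np.sum(x**2), seeds)
```

Reconstructing a perturbation costs one RNG draw per worker, which is why shipping a seed list once beats shipping full parameter vectors every step.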
Communication per iteration is a single scalar per machine. Other distributed architectures (e.g., Gorila) must communicate entire parameter lists.
- Non-differentiable policies! (hard attention!)
- No backprop!
- 3x computation time decrease!
- Sparse rewards!
- Learn long-term policies in hard environments!
- And CPUs are much cheaper than GPUs!
- Not useful for supervised learning, where good, reliable gradients are available.
- Data inefficient: about 3–10x less data efficient.
If you have a large number of CPU cores (>100), evolution strategies may be a good bet.
Appendix