Evolution Strategies: Distributed deep reinforcement learning
Steven Schmatz (@stevenschmatz), November 21, 2017 (blog.otoro.net)
SLIDE 1

Evolution Strategies

Distributed deep reinforcement learning

Steven Schmatz

@stevenschmatz Evolutionary Strategies November 21, 2017

(blog.otoro.net)

SLIDE 2

Deep Reinforcement Learning

SLIDE 3

Agenda

1. Why is deep reinforcement learning hard?
2. How does evolution strategies (ES) help?
3. Advice on applying ES to real-world problems
SLIDE 4

RL in a nutshell

(reinforcement learning)

SLIDE 5

Deep RL in a nutshell

SLIDE 6

Deep CNNs are useful.

SLIDE 7

Assumptions of supervised learning

  • Stationary distribution
  • Independence of examples
  • Clear input-output relationship

SLIDE 8

RL violates these assumptions. 😮

  • Stationary distribution
  • Independence of examples
  • Clear input-output relationship

SLIDE 9

Stationary distribution

The training data changes as you act differently.

RL violates these assumptions. 😮
SLIDE 10

Independence of examples

Adjacent game frames are usually very similar.

RL violates these assumptions. 😮
SLIDE 11

Clear input-output relationship

There can be a large delay between action and reward.

RL violates these assumptions. 😮
SLIDE 12

Deep Q-Learning

Model

Training objective
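The model and training-objective equations did not survive extraction. For reference, the standard deep Q-learning objective (the slide most likely showed this or an equivalent form) is:

```latex
% The Q-network Q(s, a; \theta) approximates the optimal action-value function.
% Training minimizes the squared temporal-difference error against a target
% network with frozen parameters \theta^-:
L(\theta) = \mathbb{E}_{(s, a, r, s')}\!\left[
    \Big( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \Big)^{2}
\right]
```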

SLIDE 13

Policy gradients

(our objective) (our weight update)
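The objective and weight-update formulas are missing from the transcript; the standard policy-gradient (REINFORCE) forms, which the slide presumably showed, are:

```latex
% Objective: expected return over trajectories \tau sampled from the policy
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]
% Weight update: the likelihood-ratio gradient estimate, followed by an
% ascent step with learning rate \alpha
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
    \right],
\qquad
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)
```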

SLIDE 14

Policy gradients

  • What if our reward function is highly nonlinear?
  • How far should we step?
  • What if our reward is received much later?
  • What if our policy is non-differentiable?

SLIDE 15

Local optima

SLIDE 16

Local optima

SLIDE 17

Black-box optimization

SLIDE 18

ES to the rescue!

At each iteration:

1. Generate candidate solutions from old candidates by adding noise.
2. Evaluate a fitness function for each candidate.
3. Aggregate the results and discard bad candidates.
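These three steps can be sketched in plain Python. The quadratic fitness function, population size, and noise scale below are illustrative choices, not from the talk:

```python
import random

def fitness(candidate):
    # Illustrative fitness: maximize -(x^2 + y^2), optimum at (0, 0).
    x, y = candidate
    return -(x * x + y * y)

def es_step(population, sigma=0.1, keep=0.2):
    """One ES iteration: perturb survivors, evaluate, discard the worst."""
    # 1. Generate candidates from old candidates by adding Gaussian noise.
    children = [tuple(p + random.gauss(0, sigma) for p in parent)
                for parent in population for _ in range(5)]
    # 2. Evaluate the fitness function for each candidate.
    scored = sorted(children, key=fitness, reverse=True)
    # 3. Aggregate the results and discard bad candidates.
    return scored[:max(1, int(len(scored) * keep))]

population = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(20)]
for _ in range(50):
    population = es_step(population)
best = max(population, key=fitness)
```

Every concrete ES variant on the following slides is a different choice of how to do steps 1 and 3.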

SLIDE 19

Simple ES

Basic idea:

Select the single best previous solution, and add Gaussian noise. (Keep standard deviation fixed.)
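A minimal sketch of this keep-the-single-best strategy; the one-dimensional fitness function and the hyperparameters are invented for the demo:

```python
import random

def fitness(x):
    # Illustrative objective with its peak at x = 3.
    return -(x - 3.0) ** 2

def simple_es(x0, sigma=0.5, population=50, iterations=100):
    """Keep only the single best solution; mutate it with fixed-sigma noise."""
    best = x0
    for _ in range(iterations):
        # Generate candidates by adding Gaussian noise to the current best.
        candidates = [best + random.gauss(0, sigma) for _ in range(population)]
        # Select the single best solution (keeping `best` makes it elitist).
        best = max(candidates + [best], key=fitness)
    return best

result = simple_es(0.0)
```

Because the standard deviation never adapts, this variant can be slow to fine-tune once it gets close to an optimum, which motivates the adaptive variants below.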

SLIDE 20

Genetic ES

Basic idea:

Only keep the top performing 10% of solutions. Randomly select two solutions. Recombine them by randomly assigning each parameter value from either parent. (and add fixed Gaussian noise.)

Example:

Combine (1, 2, 3) and (4, 5, 6):

  • (1, 5, 6)
  • (4, 2, 3)
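The recombination step above, sketched in Python; the `recombine` helper name and noise scale are illustrative:

```python
import random

def recombine(parent_a, parent_b, sigma=0.1):
    """Assign each parameter from either parent at random, then add noise."""
    return [random.choice(pair) + random.gauss(0, sigma)
            for pair in zip(parent_a, parent_b)]

# Combining (1, 2, 3) and (4, 5, 6): each slot is a coin flip between the
# parents, e.g. (1, 5, 6) or (4, 2, 3) before the Gaussian noise is added.
child = recombine((1, 2, 3), (4, 5, 6))
```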
SLIDE 21

CMA-ES

Basic idea:

Select the best 25% of the population. Calculate a covariance matrix of these best 25% (this represents a promising area to search for new candidates). Generate new candidates using the per-parameter means and variances.
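A sketch of the simplified variant the slide describes, fitting per-parameter means and variances to the elite 25% (real CMA-ES samples from the full covariance matrix, which also captures correlations between parameters). The fitness function is invented for the demo:

```python
import random
import statistics

def fitness(candidate):
    # Illustrative fitness: optimum at (2, -1).
    x, y = candidate
    return -((x - 2.0) ** 2 + (y + 1.0) ** 2)

def cma_like_step(population, n_children=40):
    """Fit per-parameter means/variances to the elite 25%, then resample."""
    # Select the best 25% of the population.
    elite = sorted(population, key=fitness,
                   reverse=True)[:max(2, len(population) // 4)]
    # Per-parameter mean and standard deviation of the elite; the small floor
    # on the stdev keeps the search from collapsing entirely.
    dims = list(zip(*elite))
    means = [statistics.mean(d) for d in dims]
    stdevs = [statistics.stdev(d) + 1e-3 for d in dims]
    # Generate new candidates from the fitted distribution.
    return [tuple(random.gauss(m, s) for m, s in zip(means, stdevs))
            for _ in range(n_children)]

population = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(40)]
for _ in range(30):
    population = cma_like_step(population)
best = max(population, key=fitness)
```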

SLIDE 22

CMA-ES

Basic idea:

Select the best 25% of the population. Calculate a covariance matrix of these best 25% (this represents a promising area to search for new candidates). Generate new candidates using the per-parameter means and variances.

Problem:

The covariance matrix is quadratic in the number of parameters, so this does not scale to neural networks with millions of weights. 😮

SLIDE 23

Natural ES

Basic idea:

Treat the problem a bit differently: Then use the gradient with your favorite SGD optimizer:
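The equations are missing from the transcript; the score-function formulation that natural-evolution-strategy methods use (and that the OpenAI ES work builds on) is:

```latex
% Treat the problem differently: optimize a Gaussian-smoothed objective,
% the expected fitness under perturbations of the parameters \theta.
\eta(\theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}
    \left[ F(\theta + \sigma \epsilon) \right]
% Its gradient has an estimator that needs no derivatives of F at all,
% so it can be fed to your favorite SGD optimizer:
\nabla_\theta \eta(\theta) = \frac{1}{\sigma}\,
    \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}
    \left[ F(\theta + \sigma \epsilon)\, \epsilon \right]
```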

SLIDE 24

OpenAI ES

Basic idea:

Similar to Natural ES, but σ constant.

SLIDE 25

OpenAI ES

Basic idea:

Similar to Natural ES, but σ constant.

Note: to parallelize, we only need to know (random seed, fitness) pairs!
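The reason a pair suffices: the perturbation vector itself never needs to be sent, because any node can regenerate it from the seed alone. A small demonstration (the 4-parameter "policy" is hypothetical):

```python
import random

def perturbation(seed, n_params, sigma=0.02):
    """Deterministically regenerate a noise vector from its seed alone."""
    rng = random.Random(seed)
    return [sigma * rng.gauss(0, 1) for _ in range(n_params)]

# A worker samples its noise from seed 123 and reports only (123, fitness).
noise_on_worker = perturbation(123, n_params=4)
# Every other node reconstructs the identical vector from the seed.
noise_reconstructed = perturbation(123, n_params=4)
```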

SLIDE 26

Parallelization

(diagram: seven numbered worker nodes)

Initialize:

1. Create a shared list of all random seeds, one per worker.

SLIDE 27

Parallelization

Repeat:

1. Sample

SLIDE 28

Parallelization

Repeat:

1. Sample
2. Evaluate

SLIDE 29

Parallelization

Repeat:

1. Sample
2. Evaluate
3. Communicate to all nodes

SLIDE 30

Parallelization

Repeat:

1. Sample
2. Evaluate
3. Communicate to all nodes
4. Reconstruct for all other nodes using known random seeds

SLIDE 31

Parallelization

Repeat:

1. Sample
2. Evaluate
3. Communicate to all nodes
4. Reconstruct for all other nodes using known random seeds
5. Update parameters
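Putting the parallelization slides together: a single-process simulation of the protocol, where each "worker" sees only the shared seed list and the broadcast scalar returns. The three-parameter policy, fitness function, and hyperparameters are stand-ins, not from the talk; subtracting the mean return is a standard variance-reduction baseline:

```python
import random

SIGMA, ALPHA, N_WORKERS, N_PARAMS = 0.1, 0.05, 7, 3

def fitness(theta):
    # Stand-in objective: optimum at theta = (1, 1, 1).
    return -sum((t - 1.0) ** 2 for t in theta)

def noise(seed):
    # Any node can reconstruct worker `seed`'s perturbation deterministically.
    rng = random.Random(seed)
    return [rng.gauss(0, 1) for _ in range(N_PARAMS)]

theta = [0.0] * N_PARAMS
for _ in range(200):
    # Initialize: shared list of random seeds, one per worker.
    seeds = [random.randrange(10 ** 6) for _ in range(N_WORKERS)]
    # 1-2. Each worker samples its perturbation and evaluates fitness.
    returns = [fitness([t + SIGMA * e for t, e in zip(theta, noise(s))])
               for s in seeds]
    # 3. Communicate: only these scalar returns are broadcast.
    mean_r = sum(returns) / N_WORKERS
    # 4-5. Every node reconstructs all perturbations from the known seeds
    #      and applies the same gradient-estimate update in lockstep.
    grad = [sum((r - mean_r) * noise(s)[i] for r, s in zip(returns, seeds))
            / (N_WORKERS * SIGMA) for i in range(N_PARAMS)]
    theta = [t + ALPHA * g for t, g in zip(theta, grad)]
```

Because all nodes run the same deterministic update, the parameter vector never has to cross the network, which is exactly the efficiency claim on the next slide.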

SLIDE 32

Efficiency

  • The only information communicated at each iteration is a single scalar per machine.
  • Most distributed update mechanisms (A3C, Gorila) must communicate entire parameter lists.
  • Result: linear horizontal parallelization.
SLIDE 33

Efficiency
SLIDE 34

Benefits

  • Non-differentiable policies! (hard attention!)
  • No backprop!
  • 3x computation time decrease!
  • Sparse rewards!
  • Learn long-term policies in hard environments!
  • And much cheaper than GPUs!

SLIDE 35

Drawbacks

  • Not useful for supervised learning, where good, reliable gradients are available.
  • Data inefficient: about 3–10x less data efficient.

SLIDE 36

Bottom Line

If you have a large number of CPU cores (>100), or if you have sparse rewards, evolution strategies may be a good bet.

SLIDE 37

SLIDE 38

Appendix