SLIDE 1

Stochastic Search using the Natural Gradient

Efficient Natural Evolution Strategies (eNES)

Yi Sun, Daan Wierstra, Tom Schaul, and Jürgen Schmidhuber
{yi,daan,tom,juergen}@idsia.ch

IDSIA, Galleria 2, Manno 6928, Switzerland

June 17th, 2009

SLIDE 2

Blackbox Optimization

Goal: Maximize some unknown 'fitness' function f(z), z ∈ ℝ^d.

[Figure: example fitness landscape.]

Challenge: Complex fitness landscapes.

- Local optima, saddle points, etc.
- Highly non-isotropic (ill-shaped) local behavior.

[Figure: contours of an ill-shaped, non-isotropic fitness landscape.]

- Correlation between all dimensions.
- Expensive fitness evaluations.
- High dimensionality, d up to hundreds.

Powerful methods are required to solve such problems.

SLIDE 3

Stochastic Search Algorithms

Basic idea: Optimization using a population of samples.

Typical flow of a stochastic search algorithm:

Initialization → Sampling from the search distribution → Evaluating fitnesses of samples → Updating the search distribution → (loop back to sampling)
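Below is a minimal sketch of this generic loop. The fitness function, the sampler, and the update rule are caller-supplied placeholders (hypothetical names, not from the paper); the concrete eNES choices follow on the next slides.

```python
import numpy as np

def stochastic_search(f, theta0, sample, update, n=50, iters=100):
    """Generic stochastic-search loop.
    f: fitness function; theta0: initial search-distribution parameters;
    sample(theta, n): draw n candidates; update(theta, Z, fit): new parameters."""
    theta = theta0
    for _ in range(iters):
        Z = sample(theta, n)                 # sample from the search distribution
        fit = np.array([f(z) for z in Z])    # evaluate fitnesses
        theta = update(theta, Z, fit)        # update the search distribution
    return theta
```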

SLIDE 4

Stochastic Gradient Ascent

Let p(·|θ) be the search distribution. We want to update θ towards better expected fitness:

    J(θ) = E[f | θ] = ∫ f(z) p(z|θ) dz.

The most straightforward way is gradient ascent:

    θ ← θ + α ∇_θ J(θ).

We can compute the 'vanilla' gradient as

    ∇_θ J(θ) = ∫ f(z) ∇_θ p(z|θ) dz
             = ∫ f(z) p(z|θ) · [∇_θ p(z|θ) / p(z|θ)] dz    (log-likelihood trick)
             = E[ f(z) ∇_θ log p(z|θ) | θ ].

SLIDE 5

Stochastic Gradient Ascent

Using the Monte-Carlo estimate

    ∇_θ J(θ) = E[ f(z) ∇_θ log p(z|θ) | θ ]
             ≈ (1/n) ∑_{i=1}^{n} f(z_i) ∇_θ log p(z_i|θ)
             = (1/n) G f,

with

    G = [∇_θ log p(z_1|θ), ..., ∇_θ log p(z_n|θ)],    f = [f(z_1), ..., f(z_n)]ᵀ.

Now the problem is to compute ∇_θ log p(z|θ). A closed-form expression can be obtained if p(z|θ) is a Gaussian distribution.
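As a small sketch, the estimate (1/n) G f can be assembled as follows, assuming a caller-supplied score function (the closed-form scores are only derived on the next slide):

```python
import numpy as np

def vanilla_gradient(scores, fitnesses):
    """scores: list of score vectors ∇_θ log p(z_i|θ);
    fitnesses: f(z_1), ..., f(z_n). Returns (1/n) G f."""
    G = np.column_stack(scores)             # columns of G are the score vectors
    f = np.asarray(fitnesses, dtype=float)
    return G @ f / len(f)
```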

SLIDE 6

The Gaussian Search Distribution

The search distribution is given by

    p(z|θ) = N(z | x, C).

We use the parameter set θ = ⟨x, A⟩, with A being the Cholesky factor of C, i.e., A is an upper triangular matrix (UTM) and C = AᵀA.

There is no redundancy in θ, since C is symmetric.

∇_θ log p(z|θ) can be computed in closed form:

    ∇_x log p(z|θ) = C⁻¹ (z − x),
    ∇_A log p(z|θ) = A⁻ᵀ (z − x)(z − x)ᵀ C⁻¹ − (diag A)⁻¹.

The estimated gradient ∇ˢ_θ J(θ) (the sample estimate of ∇_θ J(θ)) can then be computed from ∇_θ log p(z_1|θ), ..., ∇_θ log p(z_n|θ).
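A minimal sketch of these closed-form scores, assuming the ⟨x, A⟩ parameterization above; the upper-triangular restriction is applied at the end, since only those entries of A are free parameters:

```python
import numpy as np

def gaussian_scores(z, x, A):
    """Return (∇_x log p, ∇_A log p) for p(z|θ) = N(z | x, AᵀA),
    with A upper triangular."""
    C = A.T @ A
    s = z - x
    Cinv_s = np.linalg.solve(C, s)          # C⁻¹ (z − x)
    At_inv_s = np.linalg.solve(A.T, s)      # A⁻ᵀ (z − x)
    grad_x = Cinv_s
    grad_A = np.outer(At_inv_s, Cinv_s) - np.diag(1.0 / np.diag(A))
    return grad_x, np.triu(grad_A)          # keep only the free entries of A
```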

SLIDE 7

Stochastic Gradient Ascent

    θ ← θ + α ∇ˢ_θ J(θ) = θ + (α/n) G f

SLIDE 8

Novel Ideas in eNES

1. Use the Natural Gradient instead of the vanilla gradient.
2. The natural gradient is computed in an Exact and Efficient way.
3. Use Importance Mixing to reuse previously evaluated samples.
4. Introduce an Optimal Fitness Baseline to reduce the variance of the gradient estimation.

SLIDE 9

1. Why Natural Gradient?

The vanilla gradient doesn't work well:

- Over-aggressive steps on ridges.
- Too-small steps on plateaus.
- Slow or premature convergence, non-robust performance.

Basic idea of the natural gradient:

- Steepest-ascent direction when correlations between the elements of θ are taken into account.
- Re-weights gradient elements according to their respective uncertainties.
- Isotropic convergence on ill-shaped fitness surfaces.

SLIDE 10

1. Formulation of Natural Gradient

Assume the distance between two adjacent distributions p(·|θ) and p(·|θ + δθ) is defined by their KL divergence. The natural gradient ∇̃_θ J(θ) is then given by the necessary condition

    F ∇̃_θ J(θ) = ∇_θ J(θ),

where F is the Fisher information matrix (FIM) of θ (intuitively, the normalized covariance of the gradient):

    F = E[ (∇_θ log p(z|θ)) (∇_θ log p(z|θ))ᵀ ].

In general, F may not be invertible. If F is invertible, we can compute the (estimated) natural gradient as

    ∇̃_θ J(θ) = F⁻¹ ∇_θ J(θ),    ∇̃ˢ_θ J(θ) = F⁻¹ ∇ˢ_θ J(θ).
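In practice the defining condition is a linear system, so a sketch need not form F⁻¹ explicitly:

```python
import numpy as np

def natural_gradient(F, grad_J):
    """Solve F g = ∇_θ J(θ) for the natural gradient g,
    rather than inverting F (cheaper and numerically safer)."""
    return np.linalg.solve(F, grad_J)
```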

SLIDE 11

2. Property of FIM in the Gaussian Case

Let θ = ⟨x, A⟩. Under this setting we find (quite luckily):

- The Fisher information matrix is indeed invertible.
- The Fisher information matrix is block diagonal:

      F = blkdiag(C⁻¹, F_1, ..., F_d).

- C⁻¹ is the FIM for x.
- F_k is the FIM for the (d − k + 1 non-zero elements in the) k-th row of A.
- The FIM suggests a natural grouping of the elements of θ.

SLIDE 12

2. Efficient Inverse of FIM

Computing the natural gradient requires the inverse of F.

- Naively, F is a matrix of size O(d²) × O(d²), so inverting F requires O(d⁶) time.
- We already found that F is block diagonal, so inverting F requires only O(d⁴) time.
- We can do better! Using the special form of each sub-block, the complexity is reduced to O(d³).

The estimated natural gradient is then computed as

    ∇̃ˢ_θ J(θ) = (1/n) F⁻¹ G f,

with overall complexity O(d³).
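A sketch of the block-diagonal shortcut (this is the generic O(d⁴) version; the paper's O(d³) scheme additionally exploits the internal structure of each F_k, which is not reproduced here):

```python
import numpy as np

def blockwise_solve(blocks, grad_parts):
    """blocks: FIM sub-blocks [C⁻¹-block for x, F_1, ..., F_d];
    grad_parts: matching segments of the vanilla gradient.
    Solves each small system instead of one O(d²) x O(d²) system."""
    return [np.linalg.solve(B, g) for B, g in zip(blocks, grad_parts)]
```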

SLIDE 13

3. Importance Mixing

At each cycle, we need to evaluate n new samples. It is common that the updated θ^(t) is close to θ^(t−1).

Problem: Redundant fitness evaluations in the overlapping high-density area.

Importance mixing: Generate samples in less-explored areas, while keeping the updated batch conformed to the new search distribution.

Reusing samples means fewer fitness evaluations.

SLIDE 14

3. Importance Mixing

Formally, importance mixing is carried out by two rejection-sampling passes.

Forward pass: For each sample z from the previous batch, accept it with probability

    min{ 1, p(z|θ^(t)) / p(z|θ^(t−1)) }.

Backward pass: Accept each newly generated sample z with probability

    max{ 0, 1 − p(z|θ^(t−1)) / p(z|θ^(t)) },

until the batch size is reached.
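A sketch of the two passes, using SciPy's Gaussian density for illustration. (The paper also enforces a minimal refresh rate so the backward pass always makes progress; that detail is omitted here, as it is on the slide.)

```python
import numpy as np
from scipy.stats import multivariate_normal

def importance_mixing(old_batch, x_old, C_old, x_new, C_new, n, rng):
    """Reuse old samples where the old and new densities overlap, then top up."""
    p_old = multivariate_normal(x_old, C_old).pdf
    p_new = multivariate_normal(x_new, C_new).pdf
    # Forward pass: keep old samples that still fit the new distribution.
    batch = [z for z in old_batch
             if rng.random() < min(1.0, p_new(z) / p_old(z))]
    # Backward pass: draw fresh samples, biased toward less-explored areas.
    while len(batch) < n:
        z = rng.multivariate_normal(x_new, C_new)
        if rng.random() < max(0.0, 1.0 - p_old(z) / p_new(z)):
            batch.append(z)
    return np.array(batch[:n])
```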

SLIDE 15

4. Optimal Fitness Baseline

A typical problem with Monte-Carlo gradient estimation is that the variance is too big. A fitness baseline is introduced to reduce the variance:

    ∇_θ J = ∇_θ ∫ f(z) p(z|θ) dz − ∇_θ ∫ b · p(z|θ) dz        (the second term equals 0)
          = ∇_θ ∫ [f(z) − b] p(z|θ) dz,

where b is called the fitness baseline.

Adding the baseline b doesn't affect the expectation of ∇_θ J, but it does affect the variance of the estimate. For the natural gradient,

    V[∇̃_θ J(θ)] ∝ b² E[uᵀu] − 2b E[uᵀv] + const,

with u = F⁻¹ ∇_θ log p(z|θ) and v = f(z) · u.
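Spelling out the one-step minimization behind the optimal baseline on the next slide (differentiate the quadratic form above with respect to b and set it to zero):

```latex
\frac{\mathrm{d}}{\mathrm{d}b}\Bigl(b^{2}\,\mathbb{E}[u^{\top}u] - 2b\,\mathbb{E}[u^{\top}v]\Bigr)
  = 2b\,\mathbb{E}[u^{\top}u] - 2\,\mathbb{E}[u^{\top}v] = 0
\quad\Longrightarrow\quad
b^{*} = \frac{\mathbb{E}[u^{\top}v]}{\mathbb{E}[u^{\top}u]}.
```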

SLIDE 16

4. Optimal Fitness Baseline

V[∇̃_θ J(θ)] is a quadratic form in b, so we can minimize it. The optimal fitness baseline is given by

    b = E[uᵀv] / E[uᵀu] ≈ (∑_{i=1}^{n} u_iᵀ v_i) / (∑_{i=1}^{n} u_iᵀ u_i).

The natural gradient is then estimated by

    ∇̃ˢ_θ J(θ) = (1/n) F⁻¹ G (f − b).

Better: use different baselines b_j for different (groups of) parameters θ_j, further reducing the variance. The block-diagonal structure of F suggests using a block fitness baseline, where a different baseline value is computed for each group of parameters in θ.
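A sketch of the scalar baseline estimate (the block variant would repeat this per parameter group). Since v_i = f(z_i) u_i, we have u_iᵀ v_i = f(z_i) u_iᵀ u_i:

```python
import numpy as np

def optimal_baseline(U, fitnesses):
    """U: array of shape (n, dim), rows u_i = F⁻¹ ∇_θ log p(z_i|θ);
    fitnesses: f(z_1), ..., f(z_n). Returns the scalar baseline b."""
    f = np.asarray(fitnesses, dtype=float)
    uu = np.einsum('ij,ij->i', U, U)   # u_iᵀ u_i
    return (f * uu).sum() / uu.sum()   # Σ u_iᵀ v_i / Σ u_iᵀ u_i
```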

SLIDE 17

Putting Things Together

Initialization
loop:
    Update the population using importance mixing
    Evaluate the newly generated samples
    Compute the optimal baseline b and ∇̃ˢ_θ J(θ)
    Update: θ ← θ + α ∇̃ˢ_θ J(θ)
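As an end-to-end illustration, here is a minimal runnable sketch of this loop, restricted for brevity to adapting the mean x under a fixed covariance C. In that special case the FIM for x is C⁻¹, so u_i = F⁻¹ C⁻¹(z_i − x) = z_i − x, and the baselined natural-gradient step reduces to (α/n) Σ (f_i − b)(z_i − x). The full eNES additionally adapts A and reuses samples via importance mixing; the fitness function below is a toy example, not from the paper:

```python
import numpy as np

def sphere(z):
    return -np.sum(z ** 2)          # toy fitness; maximum at the origin

def nes_mean_only(f, x0, C, n=50, alpha=0.1, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        Z = rng.multivariate_normal(x, C, size=n)   # sample the batch
        fit = np.array([f(z) for z in Z])           # evaluate fitnesses
        u = Z - x                                   # u_i = z_i − x (see above)
        uu = np.einsum('ij,ij->i', u, u)
        b = (fit * uu).sum() / uu.sum()             # optimal scalar baseline
        x = x + alpha * ((fit - b)[:, None] * u).mean(axis=0)
    return x

print(nes_mean_only(sphere, x0=[3.0, -2.0], C=np.eye(2)))  # converges near [0, 0]
```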

SLIDE 18

Empirical Results - Standard Blackbox Benchmarks

[Figure: fitness vs. number of evaluations on 50-dimensional unimodal benchmarks: Cigar, DiffPow, Ellipsoid, ParabR, Schwefel, SharpR, Sphere, Tablet.]

SLIDE 19

Empirical Results - Multimodal

[Figure: eNES search trajectory on a deceptive multimodal fitness landscape.]

eNES is able to jump over deceptive local optima.

SLIDE 20

Empirical Results - Double Pole Balancing

[Figure: cart-pole system with two poles; force F applied to the cart at position x, pole angles β₁ and β₂.]

Non-Markovian double pole balancing, average numbers of evaluations:

Method   SANE     ESP    NEAT   CMA    CoSyNE  FEM    NES
Eval.    262,700  7,374  6,929  3,521  1,249   2,099  1,753

SLIDE 21

Summary

- We derived a clear blackbox optimization algorithm from first principles.
- Derivation of the exact Fisher information matrix.
- Efficient computation of the FIM inverse.
- Importance mixing reduces the number of fitness evaluations.
- Optimal fitness baselines improve the performance.
- Competitive performance on standard benchmarks, including non-Markovian double pole balancing tasks.

SLIDE 22

Empirical Results - Importance Mixing and Optimal Baseline

Percentage of runs that prematurely converged, varying the type of fitness baseline used:

Baseline   Premature convergence
None       52%
Uniform    50%
Block      0%

Importance mixing reduces the number of fitness evaluations by a factor of 3 to 4.