Stochastic Search Using the Natural Gradient

Stochastic Search Using the Natural Gradient: Efficient Natural Evolution Strategies (eNES). Yi Sun, Daan Wierstra, Tom Schaul, and Jürgen Schmidhuber, {yi,daan,tom,juergen}@idsia.ch, IDSIA, Galleria 2, Manno 6928, Switzerland. June 17th, 2009.


The Gaussian Search Distribution

The search distribution is given by $p(z \mid \theta) = \mathcal{N}(z \mid x, C)$. We use the parameter set $\theta = \langle x, A \rangle$, with $A$ being the Cholesky factor of $C$, i.e., $A$ is an upper triangular matrix (UTM) and $C = A^\top A$. There is no redundancy in $\theta$, since $C$ is symmetric.

$\nabla_\theta \log p(z \mid \theta)$ can be computed in closed form:

$$\nabla_x \log p(z \mid \theta) = C^{-1}(z - x),$$
$$\nabla_A \log p(z \mid \theta) = A^{-\top}(z - x)(z - x)^\top C^{-1} - \operatorname{diag}(A)^{-1},$$

where the gradient with respect to $A$ is restricted to its upper-triangular elements.

$\nabla^s_\theta J(\theta)$, the sample estimate of the gradient, can then be computed from $\nabla_\theta \log p(z_1 \mid \theta), \ldots, \nabla_\theta \log p(z_n \mid \theta)$.
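
To make the closed-form expressions concrete, here is a minimal NumPy sketch of both gradients; the function name and the use of dense inverses rather than triangular solves are illustrative choices, not the authors' implementation.

```python
import numpy as np

def log_density_gradients(z, x, A):
    """Gradients of log N(z | x, C) with C = A^T A, A upper triangular.
    Sketch only: dense inverses are used for clarity, not efficiency."""
    diff = z - x
    A_inv = np.linalg.inv(A)
    C_inv = A_inv @ A_inv.T                  # C^{-1} = A^{-1} A^{-T}
    grad_x = C_inv @ diff                    # nabla_x log p(z | theta)
    # nabla_A log p(z | theta): upper-triangular part of
    #   A^{-T} (z - x)(z - x)^T C^{-1} - diag(A)^{-1}
    full = A_inv.T @ np.outer(diff, diff) @ C_inv - np.diag(1.0 / np.diag(A))
    grad_A = np.triu(full)                   # only d(d+1)/2 free entries in A
    return grad_x, grad_A
```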

Stochastic Gradient Ascent

The parameters are updated by stochastic gradient ascent:

$$\theta \leftarrow \theta + \alpha \nabla^s_\theta J(\theta) = \theta + \frac{\alpha}{n} G f,$$

where $G = [\nabla_\theta \log p(z_1 \mid \theta), \ldots, \nabla_\theta \log p(z_n \mid \theta)]$ collects the score vectors as columns and $f = (f(z_1), \ldots, f(z_n))^\top$ collects the fitness values.
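
One update step in code, treating $\theta$ as a flat parameter vector; a minimal sketch using the slide's $G$ and $f$ notation, with illustrative names.

```python
import numpy as np

def vanilla_gradient_step(theta, grads, fitness, alpha):
    """One step of theta <- theta + (alpha / n) * G f, where `grads` holds
    the per-sample score vectors grad_theta log p(z_i | theta)."""
    G = np.column_stack(grads)          # shape: (dim theta, n)
    f = np.asarray(fitness)
    return theta + (alpha / len(f)) * (G @ f)
```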

Novel Ideas in eNES

1. Use the natural gradient instead of the vanilla gradient.
2. Compute the natural gradient in an exact and efficient way.
3. Use importance mixing to reuse previously evaluated samples.
4. Introduce the optimal fitness baseline to reduce the variance of the gradient estimation.

1. Why Natural Gradient?

The vanilla gradient doesn't work well:
- Over-aggressive steps on ridges.
- Steps that are too small on plateaus.
- Slow or premature convergence; non-robust performance.

Basic idea of the natural gradient:
- It gives the steepest-ascent direction when correlations between the elements of $\theta$ are taken into account.
- It re-weights the gradient elements according to their respective uncertainties.
- It yields isotropic convergence on ill-shaped fitness surfaces.

1. Formulation of Natural Gradient

Assume the distance between two adjacent distributions $p(\cdot \mid \theta)$ and $p(\cdot \mid \theta + \delta\theta)$ is defined by their KL divergence. The natural gradient $\tilde{\nabla}_\theta J(\theta)$ is then given by the necessary condition

$$F \tilde{\nabla}_\theta J(\theta) = \nabla_\theta J(\theta).$$

$F$ is the Fisher information matrix (FIM) of $\theta$ (intuitively, the normalized covariance of the gradient):

$$F = \mathbb{E}\left[ \left(\nabla_\theta \log p(z \mid \theta)\right) \left(\nabla_\theta \log p(z \mid \theta)\right)^\top \right].$$

In general, $F$ may not be invertible. If $F$ is invertible, we can compute the (estimated) natural gradient as

$$\tilde{\nabla}_\theta J(\theta) = F^{-1} \nabla_\theta J(\theta), \qquad \tilde{\nabla}^s_\theta J(\theta) = F^{-1} \nabla^s_\theta J(\theta).$$
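
For illustration, the definition can be turned directly into a Monte-Carlo estimator: approximate $F$ by the empirical covariance of the score vectors and solve the linear system. This generic sketch is not how eNES computes it (the empirical $F$ can be singular for small $n$); eNES uses the exact, structured $F$ of the next slides.

```python
import numpy as np

def mc_natural_gradient(grads, fitness):
    """Generic Monte-Carlo natural gradient: F ~ (1/n) sum_i s_i s_i^T over
    score vectors s_i, then solve F @ ng = grad_J. Sketch only; raises if
    the empirical Fisher matrix is singular."""
    G = np.column_stack(grads)                 # score vectors as columns
    n = G.shape[1]
    F = (G @ G.T) / n                          # empirical Fisher matrix
    grad_J = (G @ np.asarray(fitness)) / n     # vanilla gradient estimate
    return np.linalg.solve(F, grad_J)          # F^{-1} grad_J
```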

2. Property of FIM in the Gaussian Case

Let $\theta = \langle x, A \rangle$. Under this setting, we find (quite luckily):

- The Fisher information matrix is indeed invertible.
- The Fisher information matrix is block diagonal:

$$F = \begin{bmatrix} C^{-1} & & & \\ & F_1 & & \\ & & \ddots & \\ & & & F_d \end{bmatrix}.$$

- $C^{-1}$ is the FIM for $x$.
- $F_k$ is the FIM for (the $d - k + 1$ non-zero elements in) the $k$-th row of $A$.
- The FIM suggests a natural grouping of the elements in $\theta$.

2. Efficient Inverse of FIM

The computation of the natural gradient requires the inverse of $F$.

- Naively, $F$ is a matrix of size $O(d^2) \times O(d^2)$, so computing $F^{-1}$ requires $O(d^6)$ operations.
- We already found that $F$ is block diagonal, so computing $F^{-1}$ requires only $O(d^4)$.
- We can do better: using the special form of each sub-block, the complexity is reduced to $O(d^3)$.

The estimated natural gradient is then computed as

$$\tilde{\nabla}^s_\theta J(\theta) = \frac{1}{n} F^{-1} G f,$$

with overall complexity $O(d^3)$.
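
The structural point, inverting blocks instead of the full matrix, can be sketched as follows; the grouping of the blocks ($x$ first, then one block per row of $A$) follows the previous slide, while the paper's $O(d^3)$ per-block trick is not reproduced here.

```python
import numpy as np

def blockwise_natural_gradient(F_blocks, g_blocks, n):
    """Compute (1/n) F^{-1} (G f) block by block for a block-diagonal
    F = diag(C^{-1}, F_1, ..., F_d). Each g_blocks[k] is the slice of
    G f for that parameter group. Sketch: blocks are solved densely."""
    return [np.linalg.solve(F_k, g_k) / n
            for F_k, g_k in zip(F_blocks, g_blocks)]
```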

3. Importance Mixing

- At each cycle, we need to evaluate $n$ new samples.
- It is common that the updated $\theta^{(t)}$ is close to $\theta^{(t-1)}$.
- Problem: redundant fitness evaluations in the overlapping high-density area.
- Importance mixing: generate samples in less-explored areas, while keeping the updated batch conforming to the new search distribution.
- Reusing samples means fewer fitness evaluations.

3. Importance Mixing (cont.)

Formally, importance mixing is carried out by two rejection-sampling passes.

Forward pass: for each sample $z$ from the previous batch, accept it with probability

$$\min\left\{ 1,\ \frac{p(z \mid \theta^{(t)})}{p(z \mid \theta^{(t-1)})} \right\}.$$

Backward pass: accept each newly generated sample $z$ with probability

$$\max\left\{ 0,\ 1 - \frac{p(z \mid \theta^{(t-1)})}{p(z \mid \theta^{(t)})} \right\},$$

until the batch size is reached.
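
In code, the two passes look roughly as follows; `logp_old`, `logp_new`, and `sample_new` are hypothetical callables for $\log p(\cdot \mid \theta^{(t-1)})$, $\log p(\cdot \mid \theta^{(t)})$, and sampling from the new distribution. The paper additionally uses a minimal refresh rate, omitted here for clarity.

```python
import numpy as np

def importance_mixing(old_batch, logp_old, logp_new, sample_new, n, rng):
    """Two rejection-sampling passes, as on the slide (refresh rate omitted)."""
    # Forward pass: keep old samples that are still likely under the new
    # distribution, so their fitness values can be reused.
    batch = [z for z in old_batch
             if rng.random() < min(1.0, np.exp(logp_new(z) - logp_old(z)))]
    batch = batch[:n]
    # Backward pass: draw new samples, preferring regions the old
    # distribution covered poorly, until the batch is full again.
    while len(batch) < n:
        z = sample_new()
        if rng.random() < max(0.0, 1.0 - np.exp(logp_old(z) - logp_new(z))):
            batch.append(z)
    return batch
```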

4. Optimal Fitness Baseline

A typical problem with Monte-Carlo gradient estimation is that the variance is too large. A fitness baseline is introduced to reduce the variance:

$$\nabla_\theta J = \nabla_\theta \int f(z)\, p(z \mid \theta)\, dz - \underbrace{\nabla_\theta \int b\, p(z \mid \theta)\, dz}_{=\,0} = \nabla_\theta \int \left[ f(z) - b \right] p(z \mid \theta)\, dz,$$

where $b$ is called the fitness baseline.

- Adding the baseline $b$ won't affect the expectation of $\nabla_\theta J$.
- But it does affect the variance of the estimate. For the natural gradient,

$$\mathbb{V}[\tilde{\nabla}_\theta J(\theta)] \propto b^2\, \mathbb{E}[u^\top u] - 2b\, \mathbb{E}[u^\top v] + \text{const},$$

with $u = F^{-1} \nabla_\theta \log p(z \mid \theta)$ and $v = f(z)\, u$.
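
Since the variance is quadratic in $b$, the minimizer follows from setting the derivative to zero; this yields the optimal baseline of the next slide:

```latex
\frac{\partial}{\partial b}\Big( b^2\,\mathbb{E}[u^\top u] - 2b\,\mathbb{E}[u^\top v] \Big)
  = 2b\,\mathbb{E}[u^\top u] - 2\,\mathbb{E}[u^\top v] = 0
  \;\Longrightarrow\;
  b^* = \frac{\mathbb{E}[u^\top v]}{\mathbb{E}[u^\top u]}.
```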

4. Optimal Fitness Baseline (cont.)

$\mathbb{V}[\tilde{\nabla}_\theta J(\theta)]$ is a quadratic form in $b$, so we can minimize it. The optimal fitness baseline is given by

$$b^* = \frac{\mathbb{E}[u^\top v]}{\mathbb{E}[u^\top u]} \simeq \frac{\sum_{i=1}^n u_i^\top v_i}{\sum_{i=1}^n u_i^\top u_i}.$$

The natural gradient is then estimated by

$$\tilde{\nabla}^s_\theta J(\theta) = \frac{1}{n} F^{-1} G \left( f - b^* \right).$$

- Better: different baselines $b_j$ for different (groups of) parameters $\theta_j$ further reduce the variance.
- The block-diagonal structure of $F$ suggests using a block fitness baseline, where a separate baseline value is computed for each group of parameters in $\theta$.
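
The sample estimate of $b^*$ is a one-liner; $u_i$ and $v_i$ follow the slide's definitions (the per-block variant applies the same formula within each parameter group).

```python
def optimal_baseline(us, vs):
    """b* ~ (sum_i u_i^T v_i) / (sum_i u_i^T u_i), with
    u_i = F^{-1} grad_theta log p(z_i | theta) and v_i = f(z_i) u_i.
    `us` and `vs` are lists of NumPy vectors."""
    num = sum(u @ v for u, v in zip(us, vs))
    den = sum(u @ u for u in us)
    return num / den
```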

Putting Things Together

Initialization, then loop:

1. Update the population using importance mixing.
2. Evaluate the newly generated samples.
3. Compute the optimal baseline $b^*$ and $\tilde{\nabla}^s_\theta J(\theta)$.
4. Update $\theta \leftarrow \theta + \alpha \tilde{\nabla}^s_\theta J(\theta)$.

A runnable toy version of this cycle is sketched below.
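
As an end-to-end illustration only: a stripped-down loop with a fixed isotropic covariance (so the natural gradient for $x$ reduces to $z - x$), a single scalar baseline, and no importance mixing. This is a toy under those stated simplifications, not the full eNES algorithm.

```python
import numpy as np

def toy_natural_es(f, x0, n=20, alpha=0.1, iters=200, seed=0):
    """Stripped-down cycle: sample, evaluate, baseline-corrected natural
    gradient for the mean only, ascent step. C is fixed to the identity."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        U = rng.standard_normal((n, len(x)))     # u_i = z_i - x (natural grad of x)
        Z = x + U                                # sampled batch
        fit = np.array([f(z) for z in Z])        # evaluate new samples
        sq = np.einsum('ij,ij->i', U, U)         # ||u_i||^2
        b = (fit * sq).sum() / sq.sum()          # optimal scalar baseline b*
        x = x + (alpha / n) * (U.T @ (fit - b))  # ascent step on the mean
    return x

# Example: maximize f(z) = -||z||^2; the mean converges toward the optimum 0.
print(toy_natural_es(lambda z: -np.sum(z**2), [3.0, 2.0]))
```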

Empirical Results - Standard Blackbox Benchmarks

[Figure: -fitness versus number of evaluations on the Unimodal-50 benchmark suite: Cigar, DiffPow, Ellipsoid, ParabR, Schwefel, SharpR, Sphere, Tablet.]

Empirical Results - Multimodal

[Figure: eNES search trajectory on a 2D multimodal fitness landscape.]

eNES is able to jump over deceptive local optima.

Empirical Results - Double Pole Balancing

[Figure: cart-pole system with two poles β1 and β2, applied force F, cart position x.]

Non-Markovian double pole balancing, average numbers of evaluations:

Method | SANE    | ESP   | NEAT  | CMA   | CoSyNE | FEM   | NES
Eval.  | 262,700 | 7,374 | 6,929 | 3,521 | 1,249  | 2,099 | 1,753
