Guided Evolutionary Strategies
Augmenting random search with surrogate gradients


SLIDE 1

Guided Evolutionary Strategies

Augmenting random search with surrogate gradients

Niru Maheswaranathan // Google Research, Brain Team
Joint work with: Luke Metz, George Tucker, Dami Choi, Jascha Sohl-Dickstein

SLIDE 2

Optimizing with surrogate gradients

Surrogate gradients: directions that are correlated with the true gradient (but may be biased)

Example applications:

  • Neural networks with non-differentiable layers
  • Meta-learning (where computing an exact meta-gradient is costly)
  • Gradients from surrogate models (synthetic gradients, black-box attacks)
SLIDE 7

Optimizing with surrogate gradients

Surrogate gradients sit on a spectrum between zeroth- and first-order optimization:

Zeroth-order
  • only function values, f(x)

Guided ES
  • surrogate gradients

First-order
  • gradient information, ∇f(x)
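To make the zeroth-order end of this spectrum concrete, here is a minimal NumPy sketch (an illustration, not the talk's code) of an antithetic random-search gradient estimate that touches only function values:

```python
import numpy as np

def vanilla_es_grad(f, x, sigma=0.1, num_pairs=100, rng=None):
    """Zeroth-order (vanilla ES) gradient estimate from function values only.

    Uses antithetic sampling: for each perturbation eps ~ N(0, sigma^2 I),
    evaluate f at x + eps and x - eps and form a finite-difference
    estimate along eps.
    """
    rng = np.random.default_rng(rng)
    n = x.size
    g = np.zeros(n)
    for _ in range(num_pairs):
        eps = sigma * rng.standard_normal(n)
        g += eps * (f(x + eps) - f(x - eps))
    return g / (2 * sigma**2 * num_pairs)

# On a quadratic f(x) = 0.5 * ||x||^2 the true gradient is x, so the
# estimate should point (noisily) in the same direction as x.
f = lambda x: 0.5 * np.sum(x**2)
x = np.array([1.0, -2.0, 3.0])
g = vanilla_es_grad(f, x, num_pairs=500, rng=0)
```

No gradient of `f` is ever taken; the estimate is unbiased but its variance grows with the parameter dimension, which is the weakness that surrogate gradients are meant to address.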

SLIDE 11

Guided evolutionary strategies

[Schematic: loss surface]

Guiding distribution: sample perturbations

ε ∼ N(0, Σ)

Gradient estimate:

g = β / (2σ²P) Σ_{i=1}^{P} εᵢ (f(x + εᵢ) − f(x − εᵢ))

SLIDE 16

Guided evolutionary strategies

Choosing the guiding distribution

Standard (vanilla) ES: identity covariance (α = 1 below)

Σ = (α/n) I + ((1 − α)/k) UUᵀ

β: hyperparameter
n: parameter dimension

SLIDE 17

Guided evolutionary strategies

Choosing the guiding distribution

Guided ES: identity + low-rank covariance

Σ = (α/n) I + ((1 − α)/k) UUᵀ,  U ∈ ℝⁿˣᵏ

β: hyperparameter
n: parameter dimension
k: subspace dimension

Guiding subspace: the columns of U are surrogate gradients
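Putting the pieces together, here is an illustrative NumPy sketch (a reading of the slides, not the authors' released code) that samples perturbations from the guiding distribution without ever forming Σ explicitly, then computes the antithetic gradient estimate:

```python
import numpy as np

def guided_es_grad(f, x, U, alpha=0.5, beta=2.0, sigma=0.1,
                   num_pairs=50, rng=None):
    """Guided ES gradient estimate (sketch).

    Perturbations are drawn from N(0, sigma^2 * Sigma) with
    Sigma = (alpha / n) I + ((1 - alpha) / k) U U^T,
    where the columns of U (shape n x k) are surrogate gradients.
    """
    rng = np.random.default_rng(rng)
    n, k = U.shape
    g = np.zeros(n)
    for _ in range(num_pairs):
        # eps ~ N(0, sigma^2 * Sigma), sampled as a sum of an isotropic
        # component and a component in the guiding subspace spanned by U:
        eps = sigma * (np.sqrt(alpha / n) * rng.standard_normal(n)
                       + np.sqrt((1 - alpha) / k) * U @ rng.standard_normal(k))
        g += eps * (f(x + eps) - f(x - eps))
    return beta / (2 * sigma**2 * num_pairs) * g

# Example: a quadratic loss with a biased surrogate gradient as the guide.
f = lambda x: 0.5 * np.sum(x**2)
x = np.array([1.0, -2.0, 3.0])
surrogate = x + 1.0                              # true gradient plus a bias
U = (surrogate / np.linalg.norm(surrogate)).reshape(-1, 1)
g = guided_es_grad(f, x, U, num_pairs=200, rng=0)
```

The two-term sampling trick reproduces the covariance Σ exactly (the cross term vanishes in expectation), so only an n×k matrix is ever stored.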

SLIDE 18

Demo: perturbed quadratic

Quadratic function with a bias added to the gradient
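The demo setup can be reproduced in a few lines (a hypothetical reconstruction; the actual demo code is not shown in the slides). Gradient descent that follows the biased surrogate gradient converges to the wrong point, which is what motivates correcting it with function evaluations as in Guided ES:

```python
import numpy as np

# Hypothetical reconstruction of the "perturbed quadratic" demo:
# a quadratic loss whose available gradient has a fixed bias added.
n = 10
rng = np.random.default_rng(0)
bias = rng.standard_normal(n)

def loss(x):
    return 0.5 * np.sum(x**2)        # true minimum at x = 0

def surrogate_grad(x):
    return x + bias                   # true gradient plus a constant bias

# Descending the surrogate gradient converges to x = -bias, not x = 0:
x = np.ones(n)
for _ in range(1000):
    x = x - 0.1 * surrogate_grad(x)

residual_loss = loss(x)  # stuck near 0.5 * ||bias||^2 rather than 0
```

The fixed point of the update x ← x − 0.1·(x + bias) is x = −bias, so the loss plateaus at a nonzero value no matter how long you run.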

SLIDE 21

Example applications

  • Unrolled optimization: surrogate gradient from one step of BPTT
  • Synthetic gradients: surrogate gradient from a synthetic-gradient model

SLIDE 23

Summary

Guided Evolutionary Strategies: an optimization algorithm for when you only have access to surrogate gradients

Learn more at our poster: Pacific Ballroom #146
Code: brain-research/guided-evolutionary-strategies
Twitter: @niru_m

SLIDE 24

Choosing optimal hyperparameters

Guided ES: identity + low-rank covariance

Σ = (α/n) I + ((1 − α)/k) UUᵀ

[Plot: optimal hyperparameter (α)]