Gradient Descent: The Ultimate Optimizer, Erik Meijer (@headinthebox)



slide-1
SLIDE 1

Copenhagen Denmark

Gradient Descent: The Ultimate Optimizer Erik Meijer

@headinthebox

slide-2
SLIDE 2
slide-3
SLIDE 3

We all want to write cool apps like this ...

slide-4
SLIDE 4

Software 1.0

slide-5
SLIDE 5
slide-6
SLIDE 6

Augustin-Louis Cauchy 1789-1857

slide-7
SLIDE 7

What if we feed the examples/tests to a mathematician or machine and let it deduce the code for us?

slide-8
SLIDE 8
slide-9
SLIDE 9
slide-10
SLIDE 10
slide-11
SLIDE 11

Physicists and mathematicians have been doing curve fitting and function approximation for centuries. Just saying.

slide-12
SLIDE 12

Galileo Galilei 1564-1642
Joseph Fourier 1768-1830
Henri Padé 1863-1953

slide-13
SLIDE 13

“Everything interesting in CS has already been invented by mathematicians at least 100 years ago.” @headinthebox

slide-14
SLIDE 14
slide-15
SLIDE 15
slide-16
SLIDE 16

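$$\mathrm{Fourier}(x) = a_0 + \sum_{i} a_i \cos\left(\frac{i x \pi}{L}\right) + \sum_{i} b_i \sin\left(\frac{i x \pi}{L}\right)$$

$$\mathrm{Pade}_{N,M}(x) = \frac{\sum_{i=0}^{N} a_i x^i}{1 + \sum_{i=1}^{M} b_i x^i}$$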

slide-17
SLIDE 17

We’ll jump on the latest Computer Science bandwagon; Deep Learning, using Artificial Neural Networks!!!!!!!!!!!!!!!!!!

slide-18
SLIDE 18

George Cybenko, 1989

slide-19
SLIDE 19

Activation function, weights/parameters, linear algebra/map-reduce

slide-20
SLIDE 20
slide-21
SLIDE 21

One input, one weight, identity

slide-22
SLIDE 22

strong assumption

slide-23
SLIDE 23
slide-24
SLIDE 24

var a: ℝ = …
val η: ℝ = … some tiny value of your choosing …

fun model(x: ℝ): ℝ = a*x
fun loss(y: ℝ, ŷ: ℝ): ℝ = (y-ŷ)²

fun train(n: Int, samples: Sequence<Pair<ℝ,ℝ>>) {
  repeat(n) { epoch(samples) }
}

fun epoch(samples: Sequence<Pair<ℝ,ℝ>>) {
  samples.forEach { (x,y) ➝
    val e = loss(y, model(x))
    val de/da = 2*a*x²-2*x*y
    a -= η*de/da
  }
}

syntax cheat syntax cheat
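The pseudocode on this slide can be made to compile with a few substitutions. The following is a minimal sketch, assuming ℝ stands for Double, that the square is written as a product, and that the mutable weight is passed around explicitly; the names `dLossDa` and `fitSlope` are hypothetical and not from the talk.

```kotlin
// Assumption: the slide's ℝ is Double; dLossDa and fitSlope are hypothetical names.
fun model(a: Double, x: Double): Double = a * x

fun loss(y: Double, yHat: Double): Double = (y - yHat) * (y - yHat)

// Hand-derived d/da of (y - a*x)^2, i.e. the slide's 2*a*x² - 2*x*y.
fun dLossDa(a: Double, x: Double, y: Double): Double = 2 * a * x * x - 2 * x * y

// Plain per-sample gradient descent on the single weight a.
fun fitSlope(samples: List<Pair<Double, Double>>, epochs: Int, eta: Double): Double {
    var a = 0.0
    repeat(epochs) {
        for ((x, y) in samples) {
            a -= eta * dLossDa(a, x, y)
        }
    }
    return a
}
```

Fed samples drawn from y = 2x with a small step size, `fitSlope` should settle near 2; too large a step size makes the same loop diverge, which is the problem the later slides address.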

slide-25
SLIDE 25


slide-26
SLIDE 26
slide-27
SLIDE 27
slide-28
SLIDE 28

Differentiable Programming, I told you so!

slide-29
SLIDE 29
slide-30
SLIDE 30

f(x) = 3x² + 4
f’(x) = 6x

slide-31
SLIDE 31

f(a+bε) = f(a)+f’(a)bε

Read ⇦ that again!
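A worked instance may help here. It assumes the usual dual-number convention ε² = 0, which the slide does not state explicitly, and reuses f(x) = 3x² + 4 from the previous slide:

$$f(a+b\varepsilon) = 3(a+b\varepsilon)^2 + 4 = 3a^2 + 6ab\,\varepsilon + 3b^2\varepsilon^2 + 4 = (3a^2 + 4) + (6a)\,b\,\varepsilon = f(a) + f'(a)\,b\,\varepsilon$$

The ε² term vanishes, so the coefficient of ε is exactly the derivative times b.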

slide-32
SLIDE 32

High school math review: Sum Rule

The derivative of (u+v) with respect to x
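The formula itself did not survive in this transcript; the standard statement of the sum rule is:

$$\frac{d(u+v)}{dx} = \frac{du}{dx} + \frac{dv}{dx}$$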

slide-33
SLIDE 33

High school math review: Product Rule

The derivative of (u*v*w) with respect to x
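The formula itself did not survive in this transcript; the standard product rule for three factors is:

$$\frac{d(u\,v\,w)}{dx} = \frac{du}{dx}\,v\,w + u\,\frac{dv}{dx}\,w + u\,v\,\frac{dw}{dx}$$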

slide-34
SLIDE 34

High school math review: Chain Rule

The derivative of (f∘g) with respect to x
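The formula itself did not survive in this transcript; the standard statement of the chain rule is:

$$\frac{d\,(f\circ g)(x)}{dx} = f'(g(x))\,g'(x)$$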

slide-35
SLIDE 35

Sum Rule

(a+(da/dx)ε) + (c+(dc/dx)ε)
= { dual number }
(a+c) + (da/dx+dc/dx)ε
= { sum rule }
(a+c) + (d(a+c)/dx)ε

Product Rule

(a+(da/dx)ε) * (b+(db/dx)ε)
= { dual number }
(a*b) + (a*(db/dx)+(da/dx)*b)ε
= { product rule }
(a*b) + (d(a*b)/dx)ε

Chain Rule

f(a+(da/dx)ε)
= { dual number }
f(a) + (df(a)/da)(da/dx)ε
= { chain rule }
f(a) + (df(a)/dx)ε

Your high school education was a total waste of time!

slide-36
SLIDE 36

class 𝔼(val r: ℝ, val ε: ℝ=1.0)

fun sin(x: 𝔼): 𝔼 = 𝔼(r=sin(x.r), ε=cos(x.r)*x.ε)
fun cos(x: 𝔼): 𝔼 = 𝔼(r=cos(x.r), ε=-sin(x.r)*x.ε)

operator fun 𝔼.times(that: 𝔼): 𝔼 =
  𝔼(r=this.r*that.r, ε=this.ε*that.r + this.r*that.ε)
operator fun 𝔼.plus(that: 𝔼): 𝔼 = 𝔼(r=this.r+that.r, ε=this.ε+that.ε)
operator fun 𝔼.minus(that: 𝔼): 𝔼 = 𝔼(r=this.r-that.r, ε=this.ε-that.ε)
operator fun 𝔼.unaryMinus(): 𝔼 = 𝔼(r=-this.r, ε=-this.ε)
operator fun 𝔼.div(that: 𝔼): 𝔼 =
  (this.r/that.r).let{ 𝔼(r=it, ε=this.ε/that.r - it*that.ε/that.r) }
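The dual-number class on this slide translates almost directly into compilable Kotlin. Below is a minimal sketch, assuming ℝ is Double and renaming 𝔼 to `Dual`; unlike the slide, constants default to ε = 0 and variables are created through a hypothetical helper `variable`, so that only the input of interest carries a derivative. `poly` is likewise a hypothetical example function.

```kotlin
// Assumption: Dual stands in for the slide's 𝔼; `variable` and `poly` are hypothetical.
class Dual(val r: Double, val eps: Double = 0.0)

// A variable is seeded with derivative 1 (dx/dx = 1); plain constants carry 0.
fun variable(x: Double) = Dual(x, 1.0)

operator fun Dual.plus(that: Dual) = Dual(r + that.r, eps + that.eps)
operator fun Dual.minus(that: Dual) = Dual(r - that.r, eps - that.eps)
operator fun Dual.unaryMinus() = Dual(-r, -eps)

// Product rule: d(u*v) = du*v + u*dv
operator fun Dual.times(that: Dual) = Dual(r * that.r, eps * that.r + r * that.eps)

// Quotient rule written as du/v - (u/v)*dv/v
operator fun Dual.div(that: Dual): Dual {
    val q = r / that.r
    return Dual(q, eps / that.r - q * that.eps / that.r)
}

// Chain rule for the two elementary functions shown on the slide.
fun sin(x: Dual) = Dual(kotlin.math.sin(x.r), kotlin.math.cos(x.r) * x.eps)
fun cos(x: Dual) = Dual(kotlin.math.cos(x.r), -kotlin.math.sin(x.r) * x.eps)

// f(x) = 3x² + 4 from the earlier slide, written with dual arithmetic.
fun poly(x: Dual): Dual = Dual(3.0) * x * x + Dual(4.0)
```

Evaluating `poly(variable(2.0))` yields the value in `r` and the derivative 6x in `eps` from a single forward pass, with no hand-derived formula.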

slide-37
SLIDE 37

var a: ℝ = …
val η: ℝ = …

fun model(x: ℝ): ℝ = a*x
fun loss(y: ℝ, ŷ: ℝ): ℝ = (y-ŷ)²

fun train(n: Int, samples: Sequence<Pair<ℝ,ℝ>>) {
  repeat(n) { epoch(samples) }
}

fun epoch(samples: Sequence<Pair<ℝ,ℝ>>) {
  samples.forEach { (x,y) ➝
    val e = loss(y, model(x))
    val de/da = 2*a*x²-2*x*y
    a -= η*de/da
  }
}

slide-38
SLIDE 38

var a: 𝔼 = 𝔼(…)
val η: ℝ = …

fun model(x: ℝ): 𝔼 = a*x
fun loss(y: ℝ, ŷ: 𝔼): 𝔼 = (y-ŷ)²

fun epoch(samples: List<Pair<ℝ,ℝ>>) {
  samples.forEach { (x,y) ➝
    val e = loss(y, model(x))
    val de/da: ℝ = e.ε
    a -= η*(de/da)
  }
}

slide-39
SLIDE 39
slide-40
SLIDE 40
slide-41
SLIDE 41
slide-42
SLIDE 42

val x = ⅅ(3.0)
val y = ⅅ(5.0)
val z = x*y // dz/dx + dz/dy
// 𝔼(r=15.0, ε=8.0)

slide-43
SLIDE 43
slide-44
SLIDE 44
slide-45
SLIDE 45
operator fun ℝ.times(that: List<ℝ>) = that.map{ this*it }
operator fun List<ℝ>.times(that: ℝ) = this.map{ it*that }
operator fun List<ℝ>.unaryMinus() = this.map{ -it }
operator fun List<ℝ>.plus(that: List<ℝ>) = this.zip(that){ x,y ➝ x+y }
operator fun List<ℝ>.minus(that: List<ℝ>) = this.zip(that){ x,y ➝ x-y }
operator fun List<ℝ>.div(that: ℝ) = this.map{ it/that }

class 𝔼(val r: ℝ, val ε: List<ℝ>)

fun sin(x: 𝔼): 𝔼 = 𝔼(r=sin(x.r), ε=cos(x.r)*x.ε)

operator fun 𝔼.times(that: 𝔼): 𝔼 =
  𝔼(r=this.r*that.r, ε=this.ε*that.r + this.r*that.ε)

That’s all that needs to change

slide-46
SLIDE 46

Mathematically, by changing numbers to lists, we upgraded from dual numbers to synthetic differential geometry and deep category theory

slide-47
SLIDE 47

val x = ⅅ(3.0, 0.th)
val y = ⅅ(5.0, 1.th)
val z = x*y // [∂z/∂x, ∂z/∂y]
// 𝔼(r=15.0, ε=[5.0, 3.0])
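The list-valued variant can be sketched in compilable Kotlin as well. This assumes ℝ is Double and that the slide's `i.th` means "seed the i-th slot of the derivative list with 1"; the names `DualN`, `variable`, and `productGradient` are hypothetical.

```kotlin
// Assumption: DualN stands in for the list-valued 𝔼/ⅅ of the slides.
class DualN(val r: Double, val eps: List<Double>)

// Hypothetical stand-in for `i.th`: a variable whose derivative list is the
// index-th unit vector of the given size.
fun variable(value: Double, index: Int, size: Int) =
    DualN(value, List(size) { if (it == index) 1.0 else 0.0 })

operator fun DualN.plus(that: DualN) =
    DualN(r + that.r, eps.zip(that.eps) { p, q -> p + q })

// Product rule applied slot by slot: each slot tracks one partial derivative.
operator fun DualN.times(that: DualN) =
    DualN(r * that.r, eps.zip(that.eps) { p, q -> p * that.r + r * q })

// The slide's example: z = x*y with x = 3 and y = 5.
fun productGradient(): DualN {
    val x = variable(3.0, 0, 2)
    val y = variable(5.0, 1, 2)
    return x * y
}
```

Because each variable owns its own slot, the two partial derivatives stay separate instead of being summed into one number as in the scalar version.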

slide-48
SLIDE 48

var κ: ⅅ = ⅅ(1e-20, 0.th)
var η: ⅅ = ⅅ(1e-20, 1.th)
var a = ⅅ(Math.random(), 2.th)
val 𝛿 = 1e-80

fun model(x: ℝ): ⅅ = a*x
fun loss(y: ℝ, ŷ: ⅅ): ⅅ = (y-ŷ)²

fun epoch(samples: List<Pair<ℝ,ℝ>>) {
  lateinit var e: ⅅ
  samples.forEach { (x,y) ➝
    val (∂e/∂κ, ∂e/∂η, ∂e/∂a) = loss(y, model(x))
    κ -= 𝛿 * ∂e/∂κ
    η -= κ * ∂e/∂η
    a -= η * ∂e/∂a
  }
}
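The idea of letting gradient descent tune its own step size can be tried out in a reduced form. The sketch below is not the slide's three-level scheme: it assumes a single learned step size η with a fixed meta-step κ, and it re-seeds the derivative lists on every step instead of letting them accumulate. ℝ is taken to be Double, and `StepDual` and `trainWithLearnedStep` are hypothetical names.

```kotlin
// Assumption: a simplified one-level variant, not the slide's κ/η/a stack.
class StepDual(val r: Double, val eps: List<Double>)

operator fun StepDual.minus(that: StepDual) =
    StepDual(r - that.r, eps.zip(that.eps) { p, q -> p - q })

operator fun StepDual.times(that: StepDual) =
    StepDual(r * that.r, eps.zip(that.eps) { p, q -> p * that.r + r * q })

operator fun StepDual.times(k: Double) = StepDual(r * k, eps.map { it * k })

operator fun Double.minus(that: StepDual) = StepDual(this - that.r, that.eps.map { -it })

// Returns the fitted weight and the final step size. Slot 0 tracks ∂/∂η, slot 1 ∂/∂a.
fun trainWithLearnedStep(samples: List<Pair<Double, Double>>, epochs: Int): Pair<Double, Double> {
    var eta = 0.01      // initial step size, itself updated by gradient descent
    val kappa = 1e-4    // fixed meta-step size
    var aPrev = 0.0     // weight before the current update
    var gPrev = 0.0     // gradient ∂e/∂a seen on the previous sample
    repeat(epochs) {
        for ((x, y) in samples) {
            val etaD = StepDual(eta, listOf(1.0, 0.0))
            // The weight update is written with duals, so a depends on η.
            val a = StepDual(aPrev, listOf(0.0, 1.0)) - etaD * gPrev
            val err = y - a * x
            val e = err * err
            val (dEta, dA) = e.eps
            eta -= kappa * dEta
            aPrev = a.r
            gPrev = dA
        }
    }
    return Pair(aPrev, eta)
}
```

The key point is that the weight update `a = aPrev - η*gPrev` is itself differentiated, so the loss carries a derivative with respect to η. On samples from y = 2x the weight should approach 2 while η moves away from its initial value; how this behaves on other data depends on κ and is not something this sketch establishes.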

slide-49
SLIDE 49

[Plot legend: Error, η, κ, a]

slide-50
SLIDE 50

Wait what, can’t you do any non-toy examples?

Yes we can!

slide-51
SLIDE 51
slide-52
SLIDE 52
slide-53
SLIDE 53
slide-54
SLIDE 54

Choosing the correct hyperparameter is essential. (Lower is better.)

slide-55
SLIDE 55
slide-56
SLIDE 56
slide-57
SLIDE 57

How do we pick the meta-step size?

slide-58
SLIDE 58
slide-59
SLIDE 59

Stack a small number of hyper-...-hyperparameter layers and pick a tiny number for the last, fixed one.

slide-60
SLIDE 60

https://arxiv.org/pdf/1909.13371.pdf

slide-61
SLIDE 61

fun id(x: ⅅ) = ⅅ(r=x.r, ε=1.0*x.ε)
var x = ⅅ(0.0, n.th); repeat(n){ x = id(x) }; x.ε

∂id(xn)/∂xn*(…*(∂id(xn)/∂xn*[…, ∂xn/∂xn])…)

slide-62
SLIDE 62
slide-63
SLIDE 63
slide-64
SLIDE 64
slide-65
SLIDE 65

(...([] ++ x1) ++ ...) ++ xn is slow O(n²)
x1++(... ++(xn++[])...) is fast O(n)
= ((x1++)∘…∘(xn++)) []

Thinking Fast not Slow

Represent lists by functions !%#@&?
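Representing a list by the function that prepends it to something else can be sketched as follows. Kotlin's built-in `List` copies on `+`, so this sketch assumes a small hand-rolled linked list to make the cost argument hold; `Chain`, `Link`, `End`, `DList`, `dlistOf`, `then`, and `materialize` are all hypothetical names.

```kotlin
// Assumption: a minimal immutable linked list, so prepending never copies the tail.
sealed class Chain<out T>
object End : Chain<Nothing>()
class Link<T>(val head: T, val tail: Chain<T>) : Chain<T>()

// Prepending xs costs O(xs.size), independent of how long tail already is.
fun <T> prepend(xs: List<T>, tail: Chain<T>): Chain<T> =
    xs.foldRight(tail) { x, acc -> Link(x, acc) }

// A list is represented by the function "prepend me to whatever comes next".
typealias DList<T> = (Chain<T>) -> Chain<T>

fun <T> dlistOf(xs: List<T>): DList<T> = { tail -> prepend(xs, tail) }

// Appending is function composition; no elements are touched until materialize().
infix fun <T> DList<T>.then(that: DList<T>): DList<T> =
    { tail -> this.invoke(that.invoke(tail)) }

// Applying the composed function to the empty list builds the result right to left.
fun <T> DList<T>.materialize(): List<T> {
    val out = mutableListOf<T>()
    var cur: Chain<T> = this.invoke(End)
    while (true) {
        val link = cur as? Link<T> ?: break
        out.add(link.head)
        cur = link.tail
    }
    return out
}
```

However the `then` calls are grouped, the final application always unfolds as x1 ++ (... ++ (xn ++ [])), which is the fast, right-nested order from the slide.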

slide-66
SLIDE 66
slide-67
SLIDE 67

(...([] ++ x1) ++ ...) ++ xn is slow O(n²)
x1++(... ++(xn++[])...) is fast O(n)

∂id(xn)/∂xn*(…*(∂id(xn)/∂xn*[…, ∂xn/∂xn])…) is slow O(n²)
(…(∂id(xn)/∂xn*∂id(xn)/∂xn)*…)*[…, ∂xn/∂xn] is fast O(n)

Thinking Fast not Slow

slide-68
SLIDE 68
slide-69
SLIDE 69

Chain Rule

〚f’(x.r)*x.ε〛(c)
= { 〚a〛(b) = b*a }
c*(f’(x.r)*x.ε)
= { associativity }
(c*f’(x.r))*x.ε
= { 〚a〛(b) = b*a }
〚x.ε〛(c*f’(x.r))
= { 〚x.ε〛 = x.ɞ }
x.ɞ(c*f’(x.r))
= { abstraction }
{ x.ɞ(it*f’(x.r)) }(c)

Product Rule

〚this.ε*that.r + this.r*that.ε〛(c)
= { commutativity }
〚that.r*this.ε + this.r*that.ε〛(c)
= { 〚a〛(b) = b*a }
c*(that.r*this.ε + this.r*that.ε)
= { distributivity }
c*(that.r*this.ε) + c*(this.r*that.ε)
= { associativity }
(c*that.r)*this.ε + (c*this.r)*that.ε
= { definition of 〚〛 }
〚this.ε〛(c*that.r) + 〚that.ε〛(c*this.r)
= { 〚x.ε〛 = x.ɞ }
this.ɞ(c*that.r) + that.ɞ(c*this.r)
= { abstraction }
{ this.ɞ(it*that.r) + that.ɞ(it*this.r) }(c)

slide-70
SLIDE 70

class ⅅ(val r: ℝ, val ɞ: (ℝ)->ℝ={it})

/* df(a)/dx = (df(a)/da)*(da/dx) */
fun sin(x: ⅅ): ⅅ = ⅅ(r = sin(x.r), ɞ = { x.ɞ(it*cos(x.r)) })

/* d(a*b)/dx = (da/dx)*b + a*(db/dx) */
operator fun ⅅ.times(that: ⅅ): ⅅ =
  ⅅ(r = this.r * that.r, ɞ = { this.ɞ(it*that.r) + that.ɞ(it*this.r) })

slide-71
SLIDE 71
slide-72
SLIDE 72

repeat(n) { x = x*x }; x.ɞ(1.0)
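The closure-based class from slide 70 and the repeated squaring on this slide can be run together to see what goes wrong. The sketch below assumes ℝ is Double and renames ⅅ to `Rev` and ɞ to `back`; `cube` and `leafCallsAfterSquaring` are hypothetical helpers, the latter counting how often the innermost variable's closure is invoked.

```kotlin
// Assumption: Rev stands in for the slide's ⅅ, `back` for its ɞ closure.
class Rev(val r: Double, val back: (Double) -> Double = { it })

// Chain rule: pass the incoming derivative, scaled by cos, on to x.
fun sin(x: Rev) = Rev(kotlin.math.sin(x.r)) { d -> x.back(d * kotlin.math.cos(x.r)) }

// Product rule: each operand's closure is called once per use.
operator fun Rev.times(that: Rev) =
    Rev(r * that.r) { d -> this.back(d * that.r) + that.back(d * this.r) }

// x³ built from two multiplications; back(1.0) returns its derivative 3x².
fun cube(x: Rev): Rev = x * x * x

// Squares a value n times, then counts calls that reach the leaf closure.
fun leafCallsAfterSquaring(n: Int): Int {
    var calls = 0
    var v = Rev(1.0) { d -> calls++; d }
    repeat(n) { v = v * v }
    v.back(1.0)
    return calls
}
```

Each `x*x` invokes the closure of `x` twice, so n squarings reach the leaf 2ⁿ times. The derivative is correct, but the work grows exponentially with the depth of shared subexpressions.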

slide-73
SLIDE 73
slide-74
SLIDE 74
slide-75
SLIDE 75
slide-76
SLIDE 76
slide-77
SLIDE 77

class 𝔼(val r: ℝ, var ε: ℝ = 0.0, var n: Int = 0, val ɞ: (ℝ)➝ℝ = { it }) {
  fun ɞ(d: ℝ): ℝ {
    ε += d
    if(--n == 0) { return ɞ.invoke(ε) } else { return 0.0 }
  }
}

fun 𝔼.backward(d: ℝ = 1.0) { this.n++; this.ɞ(d) }

fun sin(x: 𝔼): 𝔼 = 𝔼(r=sin(x.r), ɞ={ x.ɞ(it*cos(x.r)) }).also{ x.n++ }

operator fun 𝔼.times(that: 𝔼): 𝔼 =
  𝔼(r=this.r*that.r, ɞ={ this.ɞ(it*that.r); that.ɞ(it*this.r) })
    .also{ this.n++; that.n++ }

slide-78
SLIDE 78
slide-79
SLIDE 79

data class 𝔼(val r: ℝ, var ε: ℝ = 0.0, var n: Int = 0, val ɞ: (ℝ)➝Unit={it}) {
  fun ɞ(d: ℝ) { ε += d; if(--n == 0) { ɞ.invoke(ε) } }
}

fun sin(x: 𝔼): 𝔼 = 𝔼(r=sin(x.r), ɞ={ x.ɞ(it*cos(x.r)) }).also{ x.n++ }

operator fun 𝔼.times(that: 𝔼): 𝔼 =
  𝔼(r=this.r*that.r, ɞ={ this.ɞ(it*that.r); that.ɞ(it*this.r) })
    .also{ this.n++; that.n++ }

val x:𝔼 = …; val y:𝔼 = …
val z = e[x,y]; z.backward()
val ∂z/∂x = x.ε; val ∂z/∂y = y.ε
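The reference-counting trick on this slide can be sketched in compilable form. The version below assumes ℝ is Double and uses hypothetical names: `Node` for 𝔼, `grad` for ε, `pending` for n, `receive` for the member ɞ, and `push` for the stored closure. `gradAfterSquaring` is a hypothetical helper for the repeated-squaring example.

```kotlin
// Assumption: Node stands in for the slide's 𝔼; all member names are hypothetical.
class Node(val r: Double, private val push: (Double) -> Unit = {}) {
    var grad = 0.0      // accumulated derivative (the slide's ε)
        private set
    var pending = 0     // number of consumers that have not reported yet (the slide's n)

    // Accumulate one contribution; propagate only once every consumer has reported.
    fun receive(d: Double) {
        grad += d
        if (--pending == 0) push(grad)
    }

    // The output node has no consumer, so it registers itself once before starting.
    fun backward(d: Double = 1.0) { pending++; receive(d) }
}

fun sin(x: Node): Node =
    Node(kotlin.math.sin(x.r)) { d -> x.receive(d * kotlin.math.cos(x.r)) }
        .also { x.pending++ }

operator fun Node.times(that: Node): Node =
    Node(r * that.r) { d -> this.receive(d * that.r); that.receive(d * this.r) }
        .also { this.pending++; that.pending++ }

// Squares x n times and returns d(result)/dx; each node propagates exactly once.
fun gradAfterSquaring(x: Node, n: Int): Double {
    var v = x
    repeat(n) { v = v * v }
    v.backward()
    return x.grad
}
```

Because a node forwards its accumulated derivative only after its last consumer has reported, every closure runs once, and the repeated squaring costs a number of steps proportional to n rather than 2ⁿ.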

slide-80
SLIDE 80
slide-81
SLIDE 81

typealias Cont = ⅅ.(()->Unit)->Unit

class ⅅ(val r: ℝ, var ε: ℝ = 1.0, val ɞ: Cont = { κ ➝ κ() })

fun sin(x: ⅅ): ⅅ = ⅅ(r = sin(x.r), ɞ = { κ ➝ x.ɞ { this.ε = cos(x.r)*x.ε; κ() }})

operator fun ⅅ.times(that: ⅅ): ⅅ = ⅅ(this@times.r*that.r, ɞ =
  { κ ➝ this@times.ɞ { that.ɞ { this.ε = this@times.ε*that.r + this@times.r*that.ε; κ() }}})

Forward CPS

slide-82
SLIDE 82

class ⅅ(val r: ℝ, var ε: ℝ = 0.0, val ɞ: Cont = { κ ➝ κ() })

fun sin(x: ⅅ): ⅅ = ⅅ(r = sin(x.r), ɞ = { κ ➝ x.ɞ { κ(); x.ε += this.ε*cos(x.r) }})

operator fun ⅅ.times(that: ⅅ): ⅅ = ⅅ(r = this@times.r * that.r, ɞ = {
  κ ➝ this@times.ɞ { that.ɞ { κ(); this@times.ε += this.ε*that.r; that.ε += this.ε*this@times.r }}})

Backward CPS

slide-83
SLIDE 83
slide-84
SLIDE 84
slide-85
SLIDE 85

#KotlinConf

THANK YOU AND REMEMBER TO VOTE

Erik Meijer @headinthebox