Basics of Numerical Optimization: Preliminaries

Ju Sun

Computer Science & Engineering University of Minnesota, Twin Cities

February 11, 2020

1 / 24


Supervised learning as function approximation

– Underlying true function: f0
– Training data: {xi, yi} with yi ≈ f0(xi)
– Choose a family of functions H, so that ∃ f ∈ H with f and f0 close
– Find f, i.e., optimization: min_{f∈H} Σi ℓ(yi, f(xi)) + Ω(f)
– Approximation capacity: universal approximation theorems (UAT) ⟹ replace H by DNN_W, i.e., a deep neural network with weights W
– Optimization: min_W Σi ℓ(yi, DNN_W(xi)) + Ω(W)
– Generalization: how to avoid over-complicated DNN_W in view of UAT

Now we start to focus on optimization.

2 / 24
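
A minimal numerical sketch of the training objective above (added for illustration, not part of the original slides; the linear model, squared loss, and ℓ2 regularizer are assumptions chosen for concreteness):

```python
import numpy as np

# Toy instance of  min_W  sum_i loss(y_i, model_W(x_i)) + Omega(W),
# with an assumed linear model, squared loss, and l2 regularizer.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))                               # inputs x_i in R^5
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(100)  # targets y_i

def objective(w, lam=1e-2):
    residual = y - X @ w              # y_i - f_w(x_i)
    return np.sum(residual ** 2) + lam * np.sum(w ** 2)

print(objective(np.zeros(5)))
```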


Outline

Elements of multivariate calculus
Optimality conditions of unconstrained optimization

3 / 24


Recommended references

[Munkres, 1997, Zorich, 2015, Coleman, 2012]

4 / 24


Our notation

– scalars: x, vectors: x, matrices: X, tensors: X, sets: S
– vectors are always column vectors, unless stated otherwise
– xi: i-th element of x, xij: (i, j)-th element of X, xi: i-th row of X as a row vector, xj: j-th column of X as a column vector
– R: real numbers, R+: positive reals, Rn: space of n-dimensional vectors, Rm×n: space of m × n matrices, Rm×n×k: space of m × n × k tensors, etc.
– [n] := {1, . . . , n}

5 / 24


Differentiability — first order

Consider f(x): Rn → Rm
– Definition: f is first-order differentiable at a point x if there exists a matrix B ∈ Rm×n such that
  ‖f(x + δ) − f(x) − Bδ‖2 / ‖δ‖2 → 0 as δ → 0,
  i.e., f(x + δ) = f(x) + Bδ + o(‖δ‖2) as δ → 0.
– B is called the (Fréchet) derivative. When m = 1, b⊺ (i.e., B⊺) is called the gradient, denoted ∇f(x). For general m, B is also called the Jacobian matrix, denoted Jf(x).
– Calculation: bij = ∂fi/∂xj (x)
– Sufficient condition: if all partial derivatives exist and are continuous at x, then f(x) is differentiable at x.

6 / 24
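
The entrywise formula bij = ∂fi/∂xj can be sanity-checked numerically against finite differences; a small sketch (added here; the test map is an assumption):

```python
import numpy as np

def f(x):
    # f : R^3 -> R^2, a small assumed test map
    return np.array([x[0] * x[1], np.sin(x[2]) + x[0] ** 2])

def jacobian_analytic(x):
    # b_ij = d f_i / d x_j
    return np.array([[x[1],     x[0], 0.0],
                     [2 * x[0], 0.0,  np.cos(x[2])]])

def jacobian_fd(x, h=1e-6):
    # central finite differences, one column per coordinate direction
    J = np.zeros((2, 3))
    for j in range(3):
        e = np.zeros(3); e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x = np.array([0.3, -1.2, 0.7])
print(np.max(np.abs(jacobian_analytic(x) - jacobian_fd(x))))  # ~1e-10
```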


Calculus rules

Assume f, g: Rn → Rm are differentiable at a point x ∈ Rn.
– Linearity: λ1 f + λ2 g is differentiable at x and ∇[λ1 f + λ2 g](x) = λ1 ∇f(x) + λ2 ∇g(x)
– Product: assume m = 1; fg is differentiable at x and ∇[fg](x) = f(x) ∇g(x) + g(x) ∇f(x)
– Quotient: assume m = 1 and g(x) ≠ 0; f/g is differentiable at x and ∇[f/g](x) = (g(x) ∇f(x) − f(x) ∇g(x)) / g²(x)
– Chain rule: let f: Rn → Rm and h: Rm → Rk, where f is differentiable at x, y = f(x), and h is differentiable at y. Then h ∘ f: Rn → Rk is differentiable at x, and
  J_{h∘f}(x) = Jh(f(x)) Jf(x).
  When k = 1, ∇[h ∘ f](x) = Jf(x)⊺ ∇h(f(x)).

7 / 24
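
The k = 1 form of the chain rule, ∇[h ∘ f](x) = Jf(x)⊺ ∇h(f(x)), can be checked numerically; a small sketch with an assumed pair f, h:

```python
import numpy as np

# f : R^2 -> R^3 and h : R^3 -> R, both assumed test functions
f  = lambda x: np.array([x[0] ** 2, x[0] * x[1], np.sin(x[1])])
Jf = lambda x: np.array([[2 * x[0], 0.0],
                         [x[1],     x[0]],
                         [0.0,      np.cos(x[1])]])
h  = lambda y: y[0] + 2 * y[1] * y[2]
gh = lambda y: np.array([1.0, 2 * y[2], 2 * y[1]])   # gradient of h

x = np.array([0.5, -0.8])
grad_chain = Jf(x).T @ gh(f(x))                      # chain-rule gradient

# central finite differences on h(f(x)) for comparison
grad_fd = np.zeros(2)
for j in range(2):
    e = np.zeros(2); e[j] = 1e-6
    grad_fd[j] = (h(f(x + e)) - h(f(x - e))) / 2e-6
print(grad_chain, grad_fd)                           # should agree to ~1e-9
```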


Differentiability — second order

Consider f(x): Rn → R and assume f is 1st-order differentiable in a small ball around x
– Write ∂²f/∂xj∂xi (x) := ∂/∂xj (∂f/∂xi)(x), provided the right side is well defined
– Symmetry: if both ∂²f/∂xj∂xi (x) and ∂²f/∂xi∂xj (x) exist and both are continuous at x, then they are equal.
– Hessian (matrix): ∇²f(x) := [∂²f/∂xj∂xi (x)]_{j,i} ∈ Rn×n, whose (j, i)-th element is ∂²f/∂xj∂xi (x).   (1)
– ∇²f is symmetric.
– Sufficient condition: if all ∂²f/∂xj∂xi (x) exist and are continuous, then f is 2nd-order differentiable at x (not the converse; we omit the definition due to its technicality).

8 / 24
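
A finite-difference sketch of the entries ∂²f/∂xj∂xi, which also illustrates the symmetry claim (the test function is an assumption added here):

```python
import numpy as np

def f(x):
    # smooth assumed test function R^3 -> R
    return x[0] ** 2 * x[1] + np.exp(x[1] * x[2])

def hessian_fd(f, x, h=1e-5):
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            # central second difference for d^2 f / dx_i dx_j
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h ** 2)
    return H

H = hessian_fd(f, np.array([0.4, -0.3, 1.1]))
print(np.max(np.abs(H - H.T)))   # ~0: the Hessian is symmetric
```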


Taylor’s theorem

Vector version: consider f(x): Rn → R
– If f is 1st-order differentiable at x, then f(x + δ) = f(x) + ⟨∇f(x), δ⟩ + o(‖δ‖2) as δ → 0.
– If f is 2nd-order differentiable at x, then f(x + δ) = f(x) + ⟨∇f(x), δ⟩ + ½ ⟨δ, ∇²f(x) δ⟩ + o(‖δ‖2²) as δ → 0.

Matrix version: consider f(X): Rm×n → R
– If f is 1st-order differentiable at X, then f(X + Δ) = f(X) + ⟨∇f(X), Δ⟩ + o(‖Δ‖F) as Δ → 0.
– If f is 2nd-order differentiable at X, then f(X + Δ) = f(X) + ⟨∇f(X), Δ⟩ + ½ ⟨Δ, ∇²f(X) Δ⟩ + o(‖Δ‖F²) as Δ → 0.

9 / 24
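
The o(‖δ‖2) and o(‖δ‖2²) remainders can be seen numerically: shrinking δ by 10 should shrink the 1st- and 2nd-order remainders faster than 10× and 100× respectively. A small sketch with an assumed test function:

```python
import numpy as np

f    = lambda x: np.sum(np.exp(x)) + x[0] * x[1]                 # assumed test function
grad = lambda x: np.exp(x) + np.array([x[1], x[0]])
hess = lambda x: np.diag(np.exp(x)) + np.array([[0., 1.], [1., 0.]])

x, d = np.array([0.2, -0.5]), np.array([1.0, 2.0])
for t in [1e-1, 1e-2, 1e-3]:
    delta = t * d
    r1 = f(x + delta) - (f(x) + grad(x) @ delta)
    r2 = f(x + delta) - (f(x) + grad(x) @ delta + 0.5 * delta @ hess(x) @ delta)
    print(t, r1, r2)   # r1 = O(t^2), r2 = O(t^3): both are o(t) and o(t^2) resp.
```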


Taylor approximation — asymptotic uniqueness

Let f: R → R be k times differentiable (k ≥ 1 an integer) at a point x. If P(δ) is a k-th order polynomial satisfying f(x + δ) − P(δ) = o(δ^k) as δ → 0, then P(δ) = Pk(δ) := f(x) + Σ_{i=1}^{k} (1/i!) f^(i)(x) δ^i.

Generalization to the vector version
– Assume f(x): Rn → R is 1st-order differentiable at x. If P(δ) := f(x) + ⟨v, δ⟩ satisfies f(x + δ) − P(δ) = o(‖δ‖2) as δ → 0, then P(δ) = f(x) + ⟨∇f(x), δ⟩, i.e., the 1st-order Taylor expansion.
– Assume f(x): Rn → R is 2nd-order differentiable at x. If P(δ) := f(x) + ⟨v, δ⟩ + ½ ⟨δ, Hδ⟩ with H symmetric satisfies f(x + δ) − P(δ) = o(‖δ‖2²) as δ → 0, then P(δ) = f(x) + ⟨∇f(x), δ⟩ + ½ ⟨δ, ∇²f(x) δ⟩, i.e., the 2nd-order Taylor expansion.

We can read off ∇f and ∇²f if we know the expansion! Similarly for the matrix version. See Chap 5 of [Coleman, 2012] for other forms of Taylor's theorem and proofs of the asymptotic uniqueness.

10 / 24
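
A small worked example of "reading off" derivatives from an expansion (added for illustration; the least-squares objective is an assumed choice): take f(x) = ‖y − Ax‖2², expand at x, and match terms against the 2nd-order Taylor form.

```latex
% f(x) = \|y - Ax\|_2^2; expand at x and match the Taylor form
\begin{aligned}
f(x+\delta) &= \|y - Ax - A\delta\|_2^2 \\
            &= f(x)
               + \big\langle \underbrace{-2A^\top (y - Ax)}_{=\,\nabla f(x)},\ \delta \big\rangle
               + \tfrac{1}{2}\big\langle \delta,\ \underbrace{2A^\top A}_{=\,\nabla^2 f(x)}\ \delta \big\rangle .
\end{aligned}
```

The remainder is exactly zero here, so asymptotic uniqueness identifies ∇f(x) = −2A⊺(y − Ax) and ∇²f(x) = 2A⊺A.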


Asymptotic uniqueness — why interesting?

Two ways of deriving gradients and Hessians (Recall HW0!)

11 / 24


Asymptotic uniqueness — why interesting?

Think of neural networks with identity activation functions:

f(W) = Σi ‖yi − Wk Wk−1 ⋯ W2 W1 xi‖F²

How to derive the gradient?
– Scalar chain rule?
– Vector chain rule?
– First-order Taylor expansion

Why interesting? See e.g., [Kawaguchi, 2016, Lampinen and Ganguli, 2018]

12 / 24
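
A sketch of the Taylor-expansion route for one factor of this deep linear network: perturbing Wj and collecting the first-order term gives ∇_{Wj} f = −2 Σi A⊺ ri (B xi)⊺, with A = Wk⋯Wj+1, B = Wj−1⋯W1, and ri = yi − Wk⋯W1 xi. This derivation is standard but added here; the sizes and data below are assumptions. A finite-difference check:

```python
import numpy as np

rng = np.random.default_rng(0)
dims = [4, 6, 5, 3]                      # x in R^4, y in R^3, k = 3 factors
Ws = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(3)]
X = rng.standard_normal((4, 20))         # columns are the x_i
Y = rng.standard_normal((3, 20))         # columns are the y_i

def f(Ws):
    P = Ws[2] @ Ws[1] @ Ws[0]
    return np.sum((Y - P @ X) ** 2)

# analytic gradient w.r.t. the middle factor W_2 (Python index 1)
A = Ws[2]                                # product of factors after W_2
B = Ws[0]                                # product of factors before W_2
R = Y - Ws[2] @ Ws[1] @ Ws[0] @ X        # residuals r_i as columns
grad = -2 * A.T @ R @ (B @ X).T

# finite-difference check of one entry
E = np.zeros_like(Ws[1]); E[2, 3] = 1e-6
Wp = [Ws[0], Ws[1] + E, Ws[2]]
Wm = [Ws[0], Ws[1] - E, Ws[2]]
print(grad[2, 3], (f(Wp) - f(Wm)) / 2e-6)   # should agree
```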


Directional derivatives and curvatures

Consider f(x): Rn → R
– Directional derivative: Dv f(x) := d/dt f(x + tv) |_{t=0}
– When f is 1st-order differentiable at x, Dv f(x) = ⟨∇f(x), v⟩.
– Now Dv f(x): Rn → R; what is Du(Dv f)(x)?  Du(Dv f)(x) = ⟨u, ∇²f(x) v⟩.
– When u = v, Du(Du f)(x) = ⟨u, ∇²f(x) u⟩ = d²/dt² f(x + tu) |_{t=0}.
– ⟨u, ∇²f(x) u⟩ / ‖u‖2² is the directional curvature along u, independent of the norm of u

13 / 24
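
A numeric sketch of these identities (the test function is an assumption): compare Dv f(x) with ⟨∇f(x), v⟩, and the second directional derivative with ⟨v, ∇²f(x) v⟩.

```python
import numpy as np

f    = lambda x: x[0] ** 2 * x[1] + np.sin(x[1])      # assumed test function
grad = lambda x: np.array([2 * x[0] * x[1], x[0] ** 2 + np.cos(x[1])])
hess = lambda x: np.array([[2 * x[1],  2 * x[0]],
                           [2 * x[0], -np.sin(x[1])]])

x = np.array([0.7, -0.4])
v = np.array([1.0, 2.0])
t = 1e-4

# directional derivative: d/dt f(x + t v) at t = 0 vs <grad f(x), v>
print((f(x + t * v) - f(x - t * v)) / (2 * t), grad(x) @ v)

# directional 2nd derivative: d^2/dt^2 f(x + t v) at t = 0 vs <v, hess f(x) v>
print((f(x + t * v) - 2 * f(x) + f(x - t * v)) / t ** 2, v @ hess(x) @ v)
```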


Directional curvature

⟨u, ∇²f(x) u⟩ / ‖u‖2² is the directional curvature along u, independent of the norm of u

Blue: negative curvature (bending down); Red: positive curvature (bending up)

14 / 24


Outline

Elements of multivariate calculus
Optimality conditions of unconstrained optimization

15 / 24


Optimization problems

"Nothing takes place in the world whose meaning is not that of some maximum or minimum." – Euler

min_x f(x)  s. t.  x ∈ C

– x: optimization variables, f(x): objective function, C: constraint (or feasible) set
– C consists of discrete values (e.g., {−1, +1}n): discrete optimization; C consists of continuous values (e.g., Rn, [0, 1]n): continuous optimization
– C the whole space Rn: unconstrained optimization; C a strict subset of the space: constrained optimization

We focus on continuous, unconstrained optimization here.

16 / 24


Global and local mins

Credit: study.com

Let f(x): Rn → R and consider min_{x∈Rn} f(x)

– x0 is a local minimizer if ∃ ε > 0 such that f(x0) ≤ f(x) for all x satisfying ‖x − x0‖2 < ε. The value f(x0) is called a local minimum.
– x0 is a global minimizer if f(x0) ≤ f(x) for all x ∈ Rn. The value f(x0) is called the global minimum.

17 / 24


A naive solution

Grid search
– For a 1-D problem, assume we know the global min lies in [−1, 1]
– Take uniformly spaced grid points in [−1, 1] so that any adjacent points are separated by ε.
– Need O(ε−1) points to get an ε-close point to the global min by exhaustive search

For n-D problems, need O(ε−n) computation. Better characterization of the local/global mins may help avoid this.

18 / 24
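
A tiny sketch of 1-D grid search and of why the cost blows up with dimension (the objective below is an assumption added for illustration):

```python
import numpy as np

f = lambda x: (x - 0.3) ** 2 + 0.1 * np.sin(20 * x)   # assumed 1-D objective

eps = 1e-3
grid = np.arange(-1.0, 1.0 + eps, eps)     # O(1/eps) points covering [-1, 1]
x_best = grid[np.argmin(f(grid))]
print(len(grid), x_best)

# In n dimensions the same resolution needs roughly (2/eps)**n points:
for n in [1, 2, 5, 10]:
    print(n, (2 / eps) ** n)               # grows exponentially in n
```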


First-order optimality condition

Necessary condition: assume f is 1st-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0.

Intuition: ∇f is the "rate of change" of the function value. If the rate is not zero at x0, it is possible to decrease f along −∇f(x0).

Taylor's: f(x0 + δ) = f(x0) + ⟨∇f(x0), δ⟩ + o(‖δ‖2). If x0 is a local min:
– For all δ sufficiently small, f(x0 + δ) − f(x0) = ⟨∇f(x0), δ⟩ + o(‖δ‖2) ≥ 0
– For all δ sufficiently small, the sign of ⟨∇f(x0), δ⟩ + o(‖δ‖2) is determined by the sign of ⟨∇f(x0), δ⟩, i.e., ⟨∇f(x0), δ⟩ ≥ 0.
– So for all δ sufficiently small, ⟨∇f(x0), δ⟩ ≥ 0 and ⟨∇f(x0), −δ⟩ = −⟨∇f(x0), δ⟩ ≥ 0 ⟹ ⟨∇f(x0), δ⟩ = 0
– So ∇f(x0) = 0.

19 / 24


First-order optimality condition

Necessary condition: assume f is 1st-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0. When is it sufficient? For convex functions.

Credit: Wikipedia

– Geometric def.: a function for which any line segment connecting two points of its graph always lies above the graph
– Algebraic def.: ∀ x, y and α ∈ [0, 1]: f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y).

Any convex function has only one local minimum (value!), which is also global!

Proof sketch: if x, z are both local minimizers and f(z) < f(x), then f(αz + (1 − α)x) ≤ αf(z) + (1 − α)f(x) < αf(x) + (1 − α)f(x) = f(x). But αz + (1 − α)x → x as α → 0, contradicting that x is a local minimizer.

20 / 24
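
The algebraic definition can be spot-checked numerically; a minimal sketch for an assumed convex function and random points and α:

```python
import numpy as np

f = lambda x: np.log(np.sum(np.exp(x)))     # log-sum-exp, a convex function

rng = np.random.default_rng(0)
ok = True
for _ in range(1000):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    a = rng.uniform()
    ok &= f(a * x + (1 - a) * y) <= a * f(x) + (1 - a) * f(y) + 1e-12
print(ok)   # True: the convexity inequality holds on every sampled triple
```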


First-order optimality condition

Necessary condition: assume f is 1st-order differentiable at x0. If x0 is a local minimizer, then ∇f(x0) = 0.

Sufficient condition: assume f is convex and 1st-order differentiable. If ∇f(x) = 0 at a point x = x0, then x0 is a local/global minimizer.

– Convex analysis (i.e., theory) and optimization (i.e., numerical methods) are relatively mature. Recommended resources: analysis: [Hiriart-Urruty and Lemaréchal, 2001], optimization: [Boyd and Vandenberghe, 2004]
– We don't assume convexity unless stated, as DNN objectives are almost always nonconvex.

21 / 24


Second-order optimality condition

Necessary condition: assume f(x) is 2nd-order differentiable at x0. If x0 is a local min, then ∇f(x0) = 0 and ∇²f(x0) ⪰ 0 (i.e., positive semidefinite).

Sufficient condition: assume f(x) is 2nd-order differentiable at x0. If ∇f(x0) = 0 and ∇²f(x0) ≻ 0 (i.e., positive definite), then x0 is a local min.

Taylor's: f(x0 + δ) = f(x0) + ⟨∇f(x0), δ⟩ + ½ ⟨δ, ∇²f(x0) δ⟩ + o(‖δ‖2²).
– If x0 is a local min, ∇f(x0) = 0 (1st-order condition) and f(x0 + δ) = f(x0) + ½ ⟨δ, ∇²f(x0) δ⟩ + o(‖δ‖2²).
– So f(x0 + δ) − f(x0) = ½ ⟨δ, ∇²f(x0) δ⟩ + o(‖δ‖2²) ≥ 0 for all δ sufficiently small
– For all δ sufficiently small, the sign of ½ ⟨δ, ∇²f(x0) δ⟩ + o(‖δ‖2²) is determined by the sign of ½ ⟨δ, ∇²f(x0) δ⟩ ⟹ ½ ⟨δ, ∇²f(x0) δ⟩ ≥ 0
– So ∇²f(x0) ⪰ 0.

22 / 24


What’s in between?

2nd-order sufficient: ∇f(x0) = 0 and ∇²f(x0) ≻ 0
2nd-order necessary: ∇f(x0) = 0 and ∇²f(x0) ⪰ 0

Examples: f(x, y) = x² − y² and g(x, y) = x³ − y³, with

∇f = [2x, −2y]⊺, ∇²f = diag(2, −2);  ∇g = [3x², −3y²]⊺, ∇²g = diag(6x, −6y).

At the origin both gradients vanish. ∇²f(0, 0) is indefinite, so the origin fails the 2nd-order necessary condition (a saddle of f); ∇²g(0, 0) = 0 satisfies the necessary condition, yet the origin is not a local min of g.

23 / 24
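
A quick check of the gap between the two conditions on the examples above (code added for illustration; the functions are recovered from the displayed gradients):

```python
import numpy as np

f = lambda x, y: x ** 2 - y ** 2     # gradient (2x, -2y), Hessian diag(2, -2)
g = lambda x, y: x ** 3 - y ** 3     # gradient (3x^2, -3y^2), Hessian diag(6x, -6y)

Hf = np.diag([2.0, -2.0])            # Hessian of f at the origin
Hg = np.diag([0.0, 0.0])             # Hessian of g at the origin

print(np.linalg.eigvalsh(Hf))        # one positive, one negative: indefinite (saddle)
print(np.linalg.eigvalsh(Hg))        # all zero: PSD, necessary condition holds...
print(g(-0.1, 0.0))                  # ...yet g(-0.1, 0) < g(0, 0): not a local min
```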

References

[Boyd and Vandenberghe, 2004] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

[Coleman, 2012] Coleman, R. (2012). Calculus on Normed Vector Spaces. Springer New York.

[Hiriart-Urruty and Lemaréchal, 2001] Hiriart-Urruty, J.-B. and Lemaréchal, C. (2001). Fundamentals of Convex Analysis. Springer Berlin Heidelberg.

[Kawaguchi, 2016] Kawaguchi, K. (2016). Deep learning without poor local minima. arXiv:1605.07110.

[Lampinen and Ganguli, 2018] Lampinen, A. K. and Ganguli, S. (2018). An analytic theory of generalization dynamics and transfer learning in deep linear networks. arXiv:1809.10374.

[Munkres, 1997] Munkres, J. R. (1997). Analysis On Manifolds. Taylor & Francis Inc.

[Zorich, 2015] Zorich, V. A. (2015). Mathematical Analysis I. Springer Berlin Heidelberg.

24 / 24