SLIDE 1

Summary

Key topics.
◮ Familiarity with form of basic network gradient.
◮ Deep network initialization.
◮ Minibatches.
◮ Momentum.
Next time: convexity.

17 / 42

SLIDE 2

Part 2: convexity

SLIDE 3

Why convexity?

Deep networks are not convex in their parameters. Why study convexity?
◮ Convexity is pervasive in ML and mathematics; e.g., our losses for deep learning are still convex.
◮ Convexity exemplifies nice “local-to-global” structure.

18 / 42

SLIDE 4
6. Convex sets and functions
SLIDE 6

Convex sets

A set S is convex if, for every pair of points x, x′ in S, the line segment between x and x′ is also contained in S. ({x, x′} ⊆ S ⟹ [x, x′] ⊆ S.)

[Figure: four example sets, labeled convex, not convex, convex, convex.]

Examples:
◮ All of Rᵈ.
◮ The empty set.
◮ Half-spaces: {x ∈ Rᵈ : a⊤x ≤ b}.
◮ Intersections of convex sets.
◮ Polyhedra: {x ∈ Rᵈ : Ax ≤ b} = ⋂ᵢ₌₁ᵐ {x ∈ Rᵈ : aᵢ⊤x ≤ bᵢ}.
◮ Convex hulls: conv(S) := {Σᵢ₌₁ᵏ αᵢxᵢ : k ∈ N, xᵢ ∈ S, αᵢ ≥ 0, Σᵢ₌₁ᵏ αᵢ = 1}.
(Infinite convex hulls: intersection of all convex supersets.)
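The segment definition is easy to spot-check numerically. Below is a minimal sketch, not from the slides, assuming numpy; the polyhedron {x ∈ R² : Ax ≤ b} (the particular A and b) is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary polyhedron {x in R^2 : Ax <= b} (illustrative choice).
A = np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 0.0])

def in_polyhedron(x):
    # Membership test, with a small tolerance for floating point.
    return np.all(A @ x <= b + 1e-12)

# Convexity spot-check: for random pairs of feasible points, every sampled
# point on the connecting segment should also be feasible.
for _ in range(1000):
    x, xp = rng.uniform(-1.0, 2.0, size=(2, 2))
    if in_polyhedron(x) and in_polyhedron(xp):
        alpha = rng.uniform()
        assert in_polyhedron((1 - alpha) * x + alpha * xp)
```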

19 / 42

SLIDE 7

Convex functions from convex sets

The epigraph of a function f is the area above the curve:

epi(f) := {(x, y) ∈ Rᵈ⁺¹ : y ≥ f(x)}.

A function is convex if its epigraph is convex.

[Figure: a non-convex and a convex function, shown with their epigraphs.]

20 / 42

SLIDE 9

Convex functions (standard definition)

A function f : Rᵈ → R is convex if for any x, x′ ∈ Rᵈ and α ∈ [0, 1],

f((1 − α)x + αx′) ≤ (1 − α) · f(x) + α · f(x′).

[Figure: a non-convex and a convex function, with the chord between x and x′ drawn.]

Examples:
◮ f(x) = cx for any c > 0 (on R).
◮ f(x) = |x|ᶜ for any c ≥ 1 (on R).
◮ f(x) = b⊤x for any b ∈ Rᵈ.
◮ f(x) = ‖x‖ for any norm ‖·‖.
◮ f(x) = x⊤Ax for symmetric positive semidefinite A.
◮ f(x) = ln(Σᵢ₌₁ᵈ exp(xᵢ)), which approximates maxᵢ xᵢ.
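The chord inequality in the definition can be tested numerically on random points. A minimal sketch, assuming numpy, using the log-sum-exp example from the list (the helper name and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def logsumexp(x):
    # f(x) = ln(sum_i exp(x_i)), computed stably by factoring out max(x).
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

# Spot-check f((1-a)x + a x') <= (1-a) f(x) + a f(x') on random inputs.
for _ in range(1000):
    x, xp = rng.normal(size=(2, 5))
    a = rng.uniform()
    lhs = logsumexp((1 - a) * x + a * xp)
    rhs = (1 - a) * logsumexp(x) + a * logsumexp(xp)
    assert lhs <= rhs + 1e-9
```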

21 / 42

SLIDE 15

Example verification: norms

Is f(x) = ‖x‖ convex? Pick any α ∈ [0, 1] and any x, x′ ∈ Rᵈ.

f((1 − α)x + αx′) = ‖(1 − α)x + αx′‖
≤ ‖(1 − α)x‖ + ‖αx′‖ (triangle inequality)
= (1 − α)‖x‖ + α‖x′‖ (homogeneity)
= (1 − α)f(x) + αf(x′).

Yes, f is convex.

22 / 42

SLIDE 16

Operations preserving convexity

Summations: if f₁, . . . , fₖ are convex and α₁, . . . , αₖ are nonnegative, then x → α₁f₁(x) + · · · + αₖfₖ(x) is convex.

Affine composition: if f is convex, then for any A ∈ Rᵐ×ᵈ and b ∈ Rᵐ, x → f(Ax + b) is convex.

Maxima: if f₁, . . . , fₖ are convex, then x → maxᵢ fᵢ(x) is convex. (Infinitely many functions: use a supremum.)
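As a quick illustration of the maxima rule, the sketch below (assuming numpy; the random affine pieces are arbitrary) builds x → maxᵢ(aᵢ⊤x + bᵢ) and spot-checks the chord inequality:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each piece x -> a_i^T x + b_i is affine (hence convex); the pointwise
# maximum of convex functions is convex.
A = rng.normal(size=(4, 3))  # arbitrary slopes a_1, ..., a_4
b = rng.normal(size=4)       # arbitrary offsets

def f(x):
    return np.max(A @ x + b)

for _ in range(1000):
    x, xp = rng.normal(size=(2, 3))
    a = rng.uniform()
    assert f((1 - a) * x + a * xp) <= (1 - a) * f(x) + a * f(xp) + 1e-9
```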

23 / 42

SLIDE 17

Example: linear classification and margin losses

If ℓ is convex and the predictor is linear, then the empirical risk is convex:
◮ Define ℓᵢ(w) = ℓ(w⊤xᵢyᵢ), convex since it is the composition of a convex function with an affine map;
◮ thus the empirical risk R̂(w) = (1/n) Σᵢ₌₁ⁿ ℓ(w⊤xᵢyᵢ) = (1/n) Σᵢ₌₁ⁿ ℓᵢ(w) is a nonnegative combination of convex functions, and hence convex (spot-checked numerically in the sketch below).
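A minimal numerical sanity check of this argument, assuming numpy; the toy data X, y are arbitrary, and the loss is the logistic loss from later in the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary toy data: n examples x_i in R^d with labels y_i in {-1, +1}.
n, d = 100, 5
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

def risk(w):
    # Empirical risk with logistic loss ln(1 + exp(-z)) at z = w^T x_i y_i.
    z = (X @ w) * y
    return np.mean(np.log1p(np.exp(-z)))

# The chord inequality holds along any segment in parameter space.
for _ in range(200):
    w, wp = rng.normal(size=(2, d))
    a = rng.uniform()
    assert risk((1 - a) * w + a * wp) <= (1 - a) * risk(w) + a * risk(wp) + 1e-9
```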

24 / 42

SLIDE 18
7. Various forms of convexity
SLIDE 20

Convexity of differentiable functions

Differentiable functions. If f : Rᵈ → R is differentiable, then f is convex if and only if

f(x) ≥ f(x₀) + ∇f(x₀)⊤(x − x₀) for all x, x₀ ∈ Rᵈ.

Note: this implies increasing slopes: (∇f(x) − ∇f(y))⊤(x − y) ≥ 0.

[Figure: f(x) and its tangent a(x) = f(x₀) + f′(x₀)(x − x₀) at x₀.]

Twice-differentiable functions. If f : Rᵈ → R is twice differentiable, then f is convex if and only if ∇²f(x) ⪰ 0 for all x ∈ Rᵈ (i.e., the Hessian, the matrix of second derivatives, is positive semidefinite for all x).
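The second-order condition is the easiest to check mechanically for quadratics. A minimal sketch, assuming numpy; the PSD matrix A is an arbitrary example, and f(x) = x⊤Ax is from the earlier list:

```python
import numpy as np

rng = np.random.default_rng(0)

# For f(x) = x^T A x with symmetric A, the Hessian is the constant matrix
# 2A; f is convex iff 2A is positive semidefinite.
M = rng.normal(size=(4, 4))
A = M @ M.T  # symmetric PSD by construction

eigvals = np.linalg.eigvalsh(2 * A)
print("min Hessian eigenvalue:", eigvals.min())  # >= 0 up to roundoff
assert eigvals.min() >= -1e-9
```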

25 / 42

SLIDE 29

Verifying convexity of differentiable functions

Is f(x) = x⁴ convex? Use the second-order condition for convexity:

∂/∂x f(x) = 4x³,  ∂²/∂x² f(x) = 12x² ≥ 0.

Yes, f is convex.

Is f(x) = exp(⟨a, x⟩) convex? Use the first-order condition for convexity.

∇f(x) = exp(⟨a, x⟩) · ∇⟨a, x⟩ = exp(⟨a, x⟩) · a (chain rule).

Difference between f and its affine approximation:

f(x) − (f(x₀) + ⟨∇f(x₀), x − x₀⟩) = exp(⟨a, x⟩) − exp(⟨a, x₀⟩) − exp(⟨a, x₀⟩)·⟨a, x − x₀⟩
= exp(⟨a, x₀⟩) · (exp(⟨a, x − x₀⟩) − 1 − ⟨a, x − x₀⟩) ≥ 0

(because 1 + z ≤ eᶻ for all z ∈ R). Yes, f is convex.

26 / 42

SLIDE 31

Strict convexity

Function values: ∀x ≠ y, ∀α ∈ (0, 1): f(αx + (1 − α)y) < αf(x) + (1 − α)f(y).

Derivatives: ∀x ≠ y, f(y) > f(x) + ∇f(x)⊤(y − x).

Hessians: ∀x, ∇²f(x) ≻ 0.

27 / 42

SLIDE 33

λ-strong convexity.

Function values: ∀x, y, ∀α ∈ [0, 1]: f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) − (λ/2)·α(1 − α)·‖x − y‖².

Derivatives: ∀x, y: f(y) ≥ f(x) + ∇f(x)⊤(y − x) + (λ/2)·‖y − x‖².

Hessians: ∀x, ∇²f(x) ⪰ λI.
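For quadratics the Hessian characterization makes λ directly computable. A minimal sketch, assuming numpy; the symmetric matrix H is an arbitrary example:

```python
import numpy as np

# For f(x) = (1/2) x^T H x the Hessian is H everywhere, so f is
# lam-strongly convex exactly when the smallest eigenvalue of H is >= lam.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])  # arbitrary symmetric example
lam = np.linalg.eigvalsh(H).min()
print(f"f is {lam:.3f}-strongly convex")  # H - lam * I is PSD
```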

28 / 42

SLIDE 36

Convexity of key losses.

Logistic loss z → ln(1 + exp(−z)) is strictly convex. (E.g., verify that the second derivative is positive.)

Squared (margin) loss z → (1/2)(1 − z)² is 1-strongly convex. (E.g., its second derivative is the constant 1.)

Combined with our earlier linear prediction calculation, logistic regression and least squares are convex!
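Both second-derivative claims are quick to verify numerically. A small sketch, assuming numpy; the probe points are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Second derivative of the logistic loss ln(1 + exp(-z)) is
# sigmoid(z) * sigmoid(-z): always positive (strict convexity), but it
# decays to 0 as |z| grows, so no lam > 0 works (no strong convexity).
for z in [-10.0, 0.0, 10.0]:
    print(z, sigmoid(z) * sigmoid(-z))

# The squared margin loss (1/2)(1 - z)^2 has constant second derivative 1,
# which is what 1-strong convexity asserts.
```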

29 / 42

SLIDE 37
8. Convex optimization problems
SLIDE 45

Optimization problems

A typical optimization problem (in standard form) is written as

min_{x ∈ Rᵈ} f₀(x)  s.t.  fᵢ(x) ≤ 0 for all i = 1, . . . , n.

◮ f₀ : Rᵈ → R is the objective function;
◮ f₁, . . . , fₙ : Rᵈ → R are the constraint functions;
◮ the inequalities fᵢ(x) ≤ 0 are the constraints;
◮ A := {x ∈ Rᵈ : fᵢ(x) ≤ 0 for all i = 1, 2, . . . , n} is the feasible region.
◮ Goal: find x ∈ A so that f₀(x) is as small as possible.
◮ The (optimal) value of the optimization problem is the smallest value f₀(x) achieved by a feasible point x ∈ A.
◮ A point x ∈ A achieving the optimal value is a (global) minimizer.

30 / 42

SLIDE 47

Convex optimization problems

Standard form of a convex optimization problem:

min_{x ∈ Rᵈ} f₀(x)  s.t.  fᵢ(x) ≤ 0 for all i = 1, . . . , n,

where f₀, f₁, . . . , fₙ : Rᵈ → R are convex functions.

Fact: the feasible set A := {x ∈ Rᵈ : fᵢ(x) ≤ 0 for all i = 1, 2, . . . , n} is a convex set.

(SVMs next week will give us an example.)

31 / 42

SLIDE 51

Local minimizers

Consider an optimization problem (not necessarily convex):

min_{x ∈ Rᵈ} f₀(x)  s.t.  x ∈ A.

We say x̃ ∈ A is a local minimizer if there is an “open ball” U := {x ∈ Rᵈ : ‖x − x̃‖₂ < r} of positive radius r > 0 such that x̃ is a global minimizer for

min_{x ∈ Rᵈ} f₀(x)  s.t.  x ∈ A ∩ U.

Nothing looks better than x̃ in the immediate vicinity of x̃.

[Figure: a locally optimal point of a non-convex objective.]

This is one local-to-global consequence of convexity; more generally, tangents lower bound the function everywhere.

32 / 42

SLIDE 52

Local minimizers of convex problems

If the optimization problem is convex, and x̃ ∈ A is a local minimizer, then it is also a global minimizer.

[Figure: the local minimizer coincides with the global one.]

33 / 42

SLIDE 53
9. Convergence rates for gradient descent
SLIDE 54

Gradient descent

1. Let w₀ ∈ Rᵈ be given.
2. For i ∈ (0, 1, . . . , t − 1):
   2.1 wᵢ₊₁ := wᵢ − ηᵢ∇f(wᵢ).

Intuition: convexity implies “no bumps”.
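A minimal implementation of this loop, assuming numpy; the least-squares objective, step size, and iteration count are illustrative choices:

```python
import numpy as np

def gradient_descent(grad, w0, eta, t):
    # w_{i+1} := w_i - eta * grad(w_i), run for t steps (constant step size).
    w = np.asarray(w0, dtype=float)
    for _ in range(t):
        w = w - eta * grad(w)
    return w

# Example: f(w) = (1/2) ||Xw - y||^2 with gradient X^T (Xw - y).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
grad = lambda w: X.T @ (X @ w - y)
eta = 1.0 / np.linalg.eigvalsh(X.T @ X).max()  # 1/beta for this quadratic
w = gradient_descent(grad, np.zeros(3), eta, 500)
print(np.linalg.norm(grad(w)))  # near 0: an approximate critical point
```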

34 / 42

SLIDE 56

Smoothness

To analyze gradient descent, we’ll use a notion of gradient stability.

λ-strong convexity was a Taylor lower bound: ∀w, w′, f(w′) ≥ f(w) + ∇f(w)⊤(w′ − w) + (λ/2)·‖w′ − w‖².

Say f : Rᵈ → R is β-smooth when the reverse holds: ∀w, w′, f(w′) ≤ f(w) + ∇f(w)⊤(w′ − w) + (β/2)·‖w′ − w‖².

(This is also called having Lipschitz gradients.)
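The β-smoothness upper bound can be spot-checked in one dimension. A minimal sketch, assuming numpy, for the logistic loss, whose second derivative is at most 1/4 (so β = 1/4 works):

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda z: np.log1p(np.exp(-z))      # logistic loss
g = lambda z: -1.0 / (1.0 + np.exp(z))  # its derivative
beta = 0.25  # second derivative sigmoid(z) * sigmoid(-z) is at most 1/4

# Spot-check the quadratic upper bound defining beta-smoothness.
for _ in range(1000):
    w, wp = rng.normal(scale=3.0, size=2)
    upper = f(w) + g(w) * (wp - w) + 0.5 * beta * (wp - w) ** 2
    assert f(wp) <= upper + 1e-9
```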

35 / 42

SLIDE 59

GD for smooth, non-convex functions.

Theorem (smoothness leads to approximate critical points). Let w₀ be given, and wᵢ₊₁ := wᵢ − η∇f(wᵢ). If f is β-smooth and η = 1/β, then

min_{i ≤ t} ‖∇f(wᵢ₋₁)‖² ≤ (1/t) Σᵢ₌₁ᵗ ‖∇f(wᵢ₋₁)‖² ≤ (2β/t)·(f(w₀) − min_w f(w)).

Proof. Combining the definitions with the choice of iterates gives, for each i ≤ t,

f(wᵢ) ≤ f(wᵢ₋₁) + ∇f(wᵢ₋₁)⊤(wᵢ − wᵢ₋₁) + (β/2)·‖wᵢ − wᵢ₋₁‖² = f(wᵢ₋₁) − (1/(2β))·‖∇f(wᵢ₋₁)‖².

Averaging these inequalities over i ≤ t gives

(1/t) Σᵢ₌₁ᵗ ‖∇f(wᵢ₋₁)‖² ≤ (2β/t)·(f(w₀) − f(wₜ)).
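The guarantee is easy to check empirically, even on a non-convex function. A minimal sketch, assuming numpy; the test function f(x) = x² + sin(5x), its smoothness constant, the start point, and the grid estimate of min f are all illustrative choices:

```python
import numpy as np

f  = lambda x: x ** 2 + np.sin(5 * x)      # non-convex test function
fg = lambda x: 2 * x + 5 * np.cos(5 * x)   # its derivative
beta = 27.0  # |f''(x)| = |2 - 25 sin(5x)| <= 27, so f is 27-smooth

w, t = 3.0, 200
grad_sq = []
for _ in range(t):
    grad_sq.append(fg(w) ** 2)
    w -= fg(w) / beta  # eta = 1/beta

f_star = f(np.linspace(-3, 3, 100001)).min()  # grid estimate of min f
bound = 2 * beta / t * (f(3.0) - f_star)
print(min(grad_sq), "<=", bound)  # min grad norm^2 <= (2 beta / t)(f(w0) - min f)
```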

36 / 42

SLIDE 61

GD for smooth, convex functions.

Theorem. Let w₀ be given, and wᵢ₊₁ := wᵢ − η∇f(wᵢ). If convex f is β-smooth and η = 1/β, then for all u ∈ Rᵈ,

f(wₜ) − f(u) ≤ (1/t) Σᵢ₌₁ᵗ (f(wᵢ) − f(u)) ≤ (β/(2t))·(‖w₀ − u‖² − ‖wₜ − u‖²).

Proof. For each i ≤ t, using the previous proof,

‖wᵢ − u‖² = ‖wᵢ₋₁ − u‖² − 2η∇f(wᵢ₋₁)⊤(wᵢ₋₁ − u) + η²‖∇f(wᵢ₋₁)‖²
≤ ‖wᵢ₋₁ − u‖² + 2η·(f(u) − f(wᵢ₋₁)) + 2η²β·(f(wᵢ₋₁) − f(wᵢ))
= ‖wᵢ₋₁ − u‖² + (2/β)·(f(u) − f(wᵢ)).

Rearranging and then averaging these inequalities over i ≤ t gives the bound.
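The convex rate can be checked the same way on least squares. A minimal sketch, assuming numpy; the data, comparator u, and iteration count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# f(w) = (1/2) ||Xw - y||^2 is convex and beta-smooth with
# beta = largest eigenvalue of X^T X.
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
f  = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
fg = lambda w: X.T @ (X @ w - y)
beta = np.linalg.eigvalsh(X.T @ X).max()

w0 = np.zeros(3)
u = np.linalg.lstsq(X, y, rcond=None)[0]  # comparator: a minimizer

w, t = w0.copy(), 100
for _ in range(t):
    w -= fg(w) / beta  # eta = 1/beta
print(f(w) - f(u), "<=", beta / (2 * t) * np.sum((w0 - u) ** 2))
```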

37 / 42

SLIDE 62
10. Convexity and differentiability
SLIDE 63

Convexity and differentiability

Many useful convex functions are not differentiable, e.g., x → |x|. Question: how can we do gradient descent?

38 / 42

SLIDE 64

Subgradients

Derivatives give tangents and descent directions: f(w′) ≥ f(x) + ∇f(x)⊤(w′ − x).

Subdifferential set: ∂f(x) := {s ∈ Rᵈ : ∀w′, f(w′) ≥ f(x) + s⊤(w′ − x)}.
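For x → |x| the subdifferential is explicit ({−1} for x < 0, {+1} for x > 0, and [−1, 1] at 0), and descending along any subgradient with a decaying step size still reaches the minimizer. A minimal sketch, assuming numpy; the starting point and step schedule are illustrative:

```python
import numpy as np

def subgrad_abs(x):
    # sign(x) is a valid subgradient of |x| everywhere (sign(0) = 0 lies
    # in the subdifferential [-1, 1] at x = 0).
    return np.sign(x)

# Subgradient descent with decaying steps (needed: |x| is not smooth,
# so a constant step would oscillate forever).
x = 5.0
for i in range(1, 1001):
    x -= subgrad_abs(x) / np.sqrt(i)
print(x)  # oscillates toward the minimizer 0
```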

39 / 42

SLIDE 66

Subgradients: first-order condition.

Suppose f : Rᵈ → R is convex. First-order condition: for any y ∈ Rᵈ,

0 ∈ ∂f(y) ⟺ f(y) = infₓ f(x).

Magic of convexity: local information gives global structure.

40 / 42

SLIDE 67

Subgradients: Jensen’s inequality.

If f : Rᵈ → R is convex, then Ef(X) ≥ f(EX).

Proof. Set y := EX, and pick any s ∈ ∂f(EX). Then

Ef(X) ≥ E[f(y) + s⊤(X − y)] = f(y) + s⊤E(X − y) = f(y).

Note. This inequality comes up often!
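A quick Monte Carlo illustration, assuming numpy; the choices f(x) = eˣ (convex) and X ~ N(0, 1), for which Ef(X) = e^{1/2}, are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=1_000_000)   # samples of X ~ N(0, 1)
print(np.exp(X).mean())          # approx E f(X) = e^{1/2} ~ 1.649
print(np.exp(X.mean()))          # approx f(E X)  = e^0     = 1.0
# Ef(X) >= f(EX), with a strict gap here since exp is strictly convex.
```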

41 / 42

SLIDE 68
11. Summary
SLIDE 69

Summary

◮ Convex sets and functions.
◮ Ways to verify convexity.
◮ Strict convexity, strong convexity.
◮ Optimization problems and related terminology (feasible set, etc.).
◮ Intuition for gradient descent convergence: local-to-global structure, no bumps.
◮ Jensen’s inequality.

42 / 42