SLIDE 1

Statistical Machine Learning

Lecture 04: Optimization Refresher

Kristian Kersting TU Darmstadt

Summer Term 2020

K. Kersting, based on slides from J. Peters · Statistical Machine Learning · Summer Term 2020

SLIDE 2

Today’s Objectives

Make you remember calculus and teach you advanced topics! Brute-force right through optimization!

Covered topics:

  • Unconstrained Optimization
  • Lagrangian Optimization
  • Numerical Methods (Gradient Descent)

Go deeper?

  • Take the Optimization class of Prof. von Stryk / SIM!
  • Read Convex Optimization by Boyd & Vandenberghe: http://www.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf

SLIDE 3

Outline

  • 1. Motivation
  • 2. Convexity

      Convex Sets
      Convex Functions

  • 3. Unconstrained & Constrained Optimization
  • 4. Numerical Optimization
  • 5. Wrap-Up
SLIDE 4
  • 1. Motivation

Outline

  • 1. Motivation
  • 2. Convexity

      Convex Sets
      Convex Functions

  • 3. Unconstrained & Constrained Optimization
  • 4. Numerical Optimization
  • 5. Wrap-Up
SLIDE 5
  • 1. Motivation

“All learning problems are essentially optimization problems on data.”

Christopher G. Atkeson, Professor at CMU

SLIDE 6
  • 1. Motivation

Robot Arm

You want to predict the torques of a robot arm

y = I q̈ − µ q̇ + mlg sin(q) = [q̈  q̇  sin(q)] [I  −µ  mlg]⊺ = φ(x)⊺ θ

Can we do this with a data set?

D = {(xᵢ, yᵢ) | i = 1, …, n}

Yes, by minimizing the sum of squared errors:

minθ J(θ, D) = ∑ᵢ₌₁ⁿ (yᵢ − φ(xᵢ)⊺ θ)²

Carl Friedrich Gauss (1777–1855)

Note that this is just one way to measure an error...
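The torque model above is linear in the parameters θ, so the unconstrained least-squares problem has a closed-form solution. A minimal sketch (my own illustration, not from the slides, with a synthetic data set and made-up parameter values):

```python
import numpy as np

# Synthetic robot-arm data (assumed values for illustration): x = (q, q_dot, q_ddot), y = torque
rng = np.random.default_rng(0)
n = 200
q = rng.uniform(-1.0, 1.0, n)
q_dot = rng.uniform(-1.0, 1.0, n)
q_ddot = rng.uniform(-1.0, 1.0, n)
theta_true = np.array([0.5, -0.1, 2.0])            # (I, -mu, m*l*g), made up for the demo

def phi(q, q_dot, q_ddot):
    """Feature map phi(x) = (q_ddot, q_dot, sin(q)) from the slide."""
    return np.stack([q_ddot, q_dot, np.sin(q)], axis=-1)

Phi = phi(q, q_dot, q_ddot)
y = Phi @ theta_true + 0.01 * rng.normal(size=n)   # noisy torques

# Minimize J(theta, D) = sum_i (y_i - phi(x_i)^T theta)^2  (ordinary least squares)
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta_hat)                                    # close to theta_true
```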

SLIDE 7
  • 1. Motivation

Will the previous method work?

Sure! But the solution may be faulty (physically implausible), e.g., m = −1 kg, ... Hence, we need to ensure some extra conditions, and our problem becomes a constrained optimization problem:

minθ J(θ, D) = ∑ᵢ₌₁ⁿ (yᵢ − φ(xᵢ)⊺ θ)²   s.t.   g(θ, D) ≥ 0

where g(θ, D) = (θ₁, −θ₂)⊺

SLIDE 8
  • 1. Motivation

Motivation

ALL learning problems are optimization problems. In any learning system, we have:

  • 1. Parameters θ to enable learning
  • 2. Data set D to learn from
  • 3. A cost function J(θ, D) to measure our performance
  • 4. Some assumptions on the data, expressed as equality and inequality constraints, f(θ, D) = 0 and g(θ, D) ≥ 0

How can we solve such problems in general?

SLIDE 9
  • 1. Motivation

Optimization problems in Machine Learning

Machine Learning tells us how to come up with data-based cost functions such that optimization can solve them!

SLIDE 10
  • 1. Motivation

Most Cost Functions are Useless

Good Machine Learning tells us how to come up with data-based cost functions such that optimization can solve them efficiently!

SLIDE 11
  • 1. Motivation

Good cost functions should be Convex

Ideally, the Cost Functions should be Convex!

SLIDE 12
  • 2. Convexity

Outline

  • 1. Motivation
  • 2. Convexity

      Convex Sets
      Convex Functions

  • 3. Unconstrained & Constrained Optimization
  • 4. Numerical Optimization
  • 5. Wrap-Up
SLIDE 13
  • 2. Convexity : Convex Sets

Convex Sets

A set C ⊆ Rⁿ is convex if ∀x, y ∈ C and ∀α ∈ [0, 1]:

αx + (1 − α) y ∈ C

This is the equation of the line segment between x and y, i.e., for a given α, the point αx + (1 − α) y lies on the line segment between x and y.

SLIDE 14
  • 2. Convexity : Convex Sets

Examples of Convex Sets

  • All of Rⁿ (obvious)
  • Non-negative orthant Rⁿ₊: let x ⪰ 0, y ⪰ 0, clearly αx + (1 − α) y ⪰ 0
  • Norm balls: let ‖x‖ ≤ 1, ‖y‖ ≤ 1, then ‖αx + (1 − α) y‖ ≤ ‖αx‖ + ‖(1 − α) y‖ = α‖x‖ + (1 − α)‖y‖ ≤ 1

SLIDE 15
  • 2. Convexity : Convex Sets

Examples of Convex Sets

Affine subspaces (linear manifold): Ax = b, Ay = b, then A (αx + (1 − α) y) = αAx + (1 − α) Ay = αb + (1 − α) b = b

SLIDE 16
  • 2. Convexity : Convex Functions

Convex Functions

A function f : Rⁿ → R is convex if ∀x, y ∈ dom(f) and ∀α ∈ [0, 1]:

f(αx + (1 − α) y) ≤ α f(x) + (1 − α) f(y)
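One way to build intuition (my own illustration, not from the slides) is to test the defining inequality numerically on random points; such a check can only refute convexity, never prove it:

```python
import numpy as np

def seems_convex(f, dim=2, trials=10_000, seed=0):
    """Randomized check of f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y).
    It can only refute convexity, never prove it."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.normal(size=dim), rng.normal(size=dim)
        a = rng.uniform()
        if f(a * x + (1 - a) * y) > a * f(x) + (1 - a) * f(y) + 1e-9:
            return False
    return True

print(seems_convex(lambda x: np.sum(x ** 2)))      # quadratic: True
print(seems_convex(np.linalg.norm))                # l2 norm: True
print(seems_convex(lambda x: np.sin(np.sum(x))))   # not convex: typically False
```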

SLIDE 17
  • 2. Convexity : Convex Functions

Examples of Convex Functions

  • Linear/affine functions: f(x) = b⊺x + c
  • Quadratic functions: f(x) = ½ x⊺Ax + b⊺x + c, where A ⪰ 0 (positive semidefinite matrix)

SLIDE 18
  • 2. Convexity : Convex Functions

Examples of Convex Functions

  • Norms (such as ℓ1 and ℓ2): ‖αx + (1 − α) y‖ ≤ ‖αx‖ + ‖(1 − α) y‖ = α‖x‖ + (1 − α)‖y‖
  • Log-sum-exp (aka softmax, a smooth approximation to the maximum function, often used in machine learning):

f(x) = log ∑ᵢ₌₁ⁿ exp(xᵢ)
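As an aside (my own illustration, not from the slides): log-sum-exp is usually computed with the maximum subtracted first, so that the exponentials cannot overflow:

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x))): shift by max(x) so exp() cannot overflow."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1001.0, 1002.0])
print(logsumexp(x))                                             # ~1002.41; naive np.log(np.sum(np.exp(x))) overflows
print(np.max(x) <= logsumexp(x) <= np.max(x) + np.log(len(x)))  # True: smooth upper bound on the max
```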

SLIDE 19
  • 2. Convexity : Convex Functions

Important Convex Functions from Classification

SVM (hinge) loss:

f(w) = [1 − yᵢ xᵢ⊺ w]₊

where [z]₊ = max(0, z). Binary logistic loss:

f(w) = log(1 + exp(−yᵢ xᵢ⊺ w))

SLIDE 20
  • 2. Convexity : Convex Functions

First-Order Convexity Condition

Suppose f : Rⁿ → R is differentiable. Then f is convex iff ∀x, y ∈ dom(f):

f(y) ≥ f(x) + ∇ₓf(x)⊺ (y − x)

SLIDE 21
  • 2. Convexity : Convex Functions

First-Order Convexity Condition - generally...

The subgradient, or subdifferential set, ∂f(x) of f at x is

∂f(x) = {g : f(y) ≥ f(x) + g⊺ (y − x), ∀y}

Differentiability is not a requirement!

SLIDE 22
  • 2. Convexity : Convex Functions

Second-Order Convexity Condition

Suppose f : Rⁿ → R is twice differentiable. Then f is convex iff ∀x ∈ dom(f):

∇²ₓ f(x) ⪰ 0

SLIDE 23
  • 2. Convexity : Convex Functions

Ideal Machine Learning Cost Functions

minθ  J(θ, D) = Convex Function
 s.t. f(θ, D) = Affine/Linear Function
      g(θ, D) ≥ Convex Set

SLIDE 24
  • 2. Convexity : Convex Functions

Why are these conditions nice?

  • Local solutions are globally optimal!
  • Fast and well-studied optimizers have existed for a long time!

SLIDE 25
  • 3. Unconstrained & Constrained Optimization

Outline

  • 1. Motivation
  • 2. Convexity

      Convex Sets
      Convex Functions

  • 3. Unconstrained & Constrained Optimization
  • 4. Numerical Optimization
  • 5. Wrap-Up
SLIDE 26
  • 3. Unconstrained & Constrained Optimization

Unconstrained optimization

Can you solve this problem?

maxθ J(θ) = 1 − θ₁² − θ₂²

With θ* = (0, 0)⊺, J* = 1. For any other θ ≠ 0, J < 1.

SLIDE 27
  • 3. Unconstrained & Constrained Optimization

Constrained optimization

Can you solve this problem?

maxθ J(θ) = 1 − θ₁² − θ₂²
s.t.  f(θ) = θ₁ + θ₂ − 1 = 0

First approach: convert the problem to an unconstrained problem
Second approach: Lagrange multipliers

SLIDE 28
  • 3. Unconstrained & Constrained Optimization

Key Insight

Taylor expansion in a vicinity of θA:

f(θA + δθ) ≈ f(θA) + δθ⊺ ∇f(θA)

If the displacement δθ is such that the gradient is normal to it,

δθ⊺ ∇f(θA) = 0,

then f(θA + δθ) = f(θA).

SLIDE 29
  • 3. Unconstrained & Constrained Optimization

Key Insight

We have to seek a point such that

∇J(θ) + λ⊺ ∇f(θ) = 0

where λ are the Lagrange multipliers. Hence, we have the Lagrangian function

L(θ, λ) = J(θ) + λ⊺ f(θ)

SLIDE 30
  • 3. Unconstrained & Constrained Optimization

Back to our problem...

Can you solve this problem?

maxθ J(θ) = 1 − θ₁² − θ₂²
s.t.  f(θ) = θ₁ + θ₂ − 1 = 0

We can write the Lagrangian

L(θ, λ) = (1 − θ₁² − θ₂²) + λ (θ₁ + θ₂ − 1)
SLIDE 31
  • 3. Unconstrained & Constrained Optimization

The optimal solution

L(θ, λ) = (1 − θ₁² − θ₂²) + λ (θ₁ + θ₂ − 1)

∇θ₁ L = −2θ₁ + λ = 0
∇θ₂ L = −2θ₂ + λ = 0
∇λ L = θ₁ + θ₂ − 1 = 0

θ₁* = θ₂* = λ*/2 = ½,   i.e., λ* = 1
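As a sanity check (my own illustration, not from the slides), the same optimum can be reproduced numerically, e.g. with SciPy's SLSQP solver applied to the equivalent minimization problem:

```python
import numpy as np
from scipy.optimize import minimize

# max 1 - th1^2 - th2^2  s.t.  th1 + th2 - 1 = 0   <=>   min th1^2 + th2^2 - 1 with the same constraint
res = minimize(fun=lambda th: th[0] ** 2 + th[1] ** 2 - 1.0,
               x0=np.zeros(2),
               constraints=[{"type": "eq", "fun": lambda th: th[0] + th[1] - 1.0}],
               method="SLSQP")
print(res.x)   # ~[0.5, 0.5], matching the Lagrange-multiplier solution
```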

SLIDE 32
  • 3. Unconstrained & Constrained Optimization

General Formulation

For a problem written in the form

maxθ J(θ)   s.t.   f(θ) = 0,   g(θ) ≥ 0

we have the Lagrangian

L(θ, λ, µ) = J(θ) + λ⊺ f(θ) + µ⊺ g(θ)

SLIDE 33
  • 3. Unconstrained & Constrained Optimization

Lagrangian Dual Formulation

The Primal Problem, with corresponding primal variables θ, is

minθ J(θ)   s.t.   gᵢ(θ) ≤ 0   ∀i = 1, . . . , m

where each equality constraint can be converted into two equivalent inequality constraints (f = 0 ≡ f ≥ 0 ∧ f ≤ 0). Hence we have the Lagrangian

L(θ, λ) = J(θ) + λ⊺ g(θ)

The Dual Problem¹, with corresponding dual variables λ, is

maxλ G(λ) = maxλ minθ L(θ, λ)   s.t.   λᵢ ≥ 0   ∀i = 1, . . . , m

¹ In words: add the constraints to the objective function using nonnegative Lagrange multipliers. Then solve for the primal variables θ that minimize this; the solution gives the primal variables θ as functions of the Lagrange multipliers. Now maximize this with respect to the dual variables under the derived constraints on the dual variables (including at least the nonnegativity constraints).
SLIDE 34
  • 3. Unconstrained & Constrained Optimization

Lagrangian Dual Formulation

Why maximization? If λ* is the solution of the dual problem, then G(λ*) is a lower bound for the primal problem, due to two concepts:

  • Minimax inequality: for any function of two arguments φ(x, y), the maximin is less than or equal to the minimax:

maxy minx φ(x, y) ≤ minx maxy φ(x, y)

  • Weak duality: the primal values are always greater than or equal to the dual values:

minθ maxλ≥0 L(θ, λ) ≥ maxλ≥0 minθ L(θ, λ)

Check Boyd, Convex Optimization, Ch. 5 for more detailed information.

SLIDE 35
  • 3. Unconstrained & Constrained Optimization

Duality Gap and Strong Duality

The duality gap is the difference between the values of any primal solution and any dual solution. It is always greater than or equal to 0, due to weak duality.

The duality gap is zero if and only if strong duality holds.

SLIDE 36
  • 3. Unconstrained & Constrained Optimization

Lagrangian Dual Formulation

Why do we care about the dual formulation?

  • minθ L(θ, λ) is an unconstrained problem for a given λ
  • If it is easy to solve, the overall problem is easy to solve, because G(λ) is a concave function and thus easy to optimize, even though J and gᵢ may be nonconvex

In ML, the dual is often more useful than the primal!

SLIDE 37
  • 3. Unconstrained & Constrained Optimization

General Recipe to Solve Optimization Problems with the Lagrangian Dual Formulation

We want to solve

minθ J(θ)   s.t.   gᵢ(θ) ≤ 0   ∀i = 1, . . . , m

(Assume J and gᵢ are both differentiable functions.)

  • Write down the Lagrangian L(θ, λ) = J(θ) + λ⊺ g(θ)
  • Solve the problem minθ L(θ, λ): differentiate L w.r.t. θ, set to zero, and write the solution θ* as a function of λ
  • Substitute θ* back into the Lagrangian, G(λ) = L(θ*, λ) = J(θ*) + λ⊺ g(θ*), and solve the optimization problem maxλ G(λ) s.t. λᵢ ≥ 0 ∀i
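A tiny worked example of this recipe (an illustration, not from the slides): minimize J(θ) = θ² subject to g(θ) = 1 − θ ≤ 0.

L(θ, λ) = θ² + λ (1 − θ)
∂L/∂θ = 2θ − λ = 0  ⟹  θ* = λ/2
G(λ) = L(θ*, λ) = λ²/4 + λ (1 − λ/2) = λ − λ²/4
maxλ≥0 G(λ):  G′(λ) = 1 − λ/2 = 0  ⟹  λ* = 2, θ* = 1

Indeed G(λ*) = 1 = J(θ*), so the duality gap is zero in this example.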

SLIDE 38
  • 4. Numerical Optimization

Outline

  • 1. Motivation
  • 2. Convexity

      Convex Sets
      Convex Functions

  • 3. Unconstrained & Constrained Optimization
  • 4. Numerical Optimization
  • 5. Wrap-Up
SLIDE 39
  • 4. Numerical Optimization

For some problems we do not know how to compute the solution analytically. What can we do in that situation? We solve it numerically using a computer!

SLIDE 40
  • 4. Numerical Optimization

Evaluation of Numerical Algorithms

The performance of different optimization algorithms can be measured by answering the following questions

  • Does the algorithm converge to the optimal solution?
  • How many steps does it take to converge?
  • Is the convergence smooth or bumpy?
  • Does it work for all types of functions, or just for a special type (for instance, convex functions)?
  • ...

SLIDE 41
  • 4. Numerical Optimization

Test Functions

To answer these questions, we evaluate the performance on a set of well-known functions with interesting properties.

Quadratic function (contour plot over θ₁, θ₂):

J(θ) = (θ₁ − 5)² + (θ₁ − 5)(θ₂ − 5) + (θ₂ − 5)²

Rosenbrock function (contour plot over θ₁, θ₂):

J(θ) = (θ₂ − θ₁²)² + 0.01 (1 − θ₁)²
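For reference (my own illustration, not from the slides), both test functions are easy to define and probe in code:

```python
import numpy as np

def quadratic(theta):
    """Quadratic test function from the slide, minimum at theta = (5, 5)."""
    t1, t2 = theta[0] - 5.0, theta[1] - 5.0
    return t1 ** 2 + t1 * t2 + t2 ** 2

def rosenbrock(theta):
    """Rosenbrock-type test function from the slide: a curved, narrow valley."""
    return (theta[1] - theta[0] ** 2) ** 2 + 0.01 * (1.0 - theta[0]) ** 2

print(quadratic(np.array([5.0, 5.0])))    # 0.0, the global minimum
print(rosenbrock(np.array([1.0, 1.0])))   # 0.0, the global minimum
```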

SLIDE 42
  • 4. Numerical Optimization

Numerical Optimization - Key Ideas

[Contour plot: the quadratic test function over θ₁, θ₂]

  • Find a δθ such that J(θ + α δθ) < J(θ)
  • Iterative update rules like θₙ₊₁ = θₙ + α δθ
  • Key questions: What is a good direction δθ? What is a good step size α?

SLIDE 43
  • 4. Numerical Optimization

Line Search vs Constant Learning Rate

Update rule: θₙ₊₁ = θₙ + αₙ δθₙ

Optimal step size by line search:
αₙ = arg minα J(θₙ + α δθₙ)

[Contour plot: quadratic test function, line-search step sizes]

Other step sizes: αₙ = const or αₙ = 1/n

[Contour plot: quadratic test function, constant/decaying step sizes]
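A minimal sketch (my own illustration, using the quadratic test function from the previous slide) comparing a constant step size with a crude grid-based line search:

```python
import numpy as np

def J(theta):
    """Quadratic test function from the slides, minimum at (5, 5)."""
    t1, t2 = theta[0] - 5.0, theta[1] - 5.0
    return t1 ** 2 + t1 * t2 + t2 ** 2

def grad_J(theta):
    t1, t2 = theta[0] - 5.0, theta[1] - 5.0
    return np.array([2 * t1 + t2, t1 + 2 * t2])

def descend(theta, steps=50, line_search=False, alpha=0.1):
    for _ in range(steps):
        d = -grad_J(theta)                      # steepest-descent direction
        if line_search:                         # crude line search over a grid of alphas
            alphas = np.linspace(1e-3, 1.0, 100)
            alpha = min(alphas, key=lambda a: J(theta + a * d))
        theta = theta + alpha * d
    return theta

print(descend(np.array([-5.0, 10.0])))                     # constant step size alpha = 0.1
print(descend(np.array([-5.0, 10.0]), line_search=True))   # per-step line search
```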

SLIDE 44
  • 4. Numerical Optimization

Method 1 - Axial Iteration (aka coordinate descent)

Alternate minimization over both axes!

[Contour plot: axial iteration on the quadratic test function]

SLIDE 45
  • 4. Numerical Optimization

Method 2 - Steepest descent

What you usually know as gradient descent: move in the direction of the gradient ∇J(θ)

[Contour plot: steepest descent on the quadratic test function]

SLIDE 46
  • 4. Numerical Optimization

Method 2 - Steepest descent

  • The gradient is perpendicular to the contour lines
  • After each line minimization, the new gradient is always orthogonal to the previous step direction (true for any line minimization)
  • Consequently, the iterations tend to zig-zag down the valley in a very inefficient manner

SLIDE 47
  • 4. Numerical Optimization

Method 2 - Steepest descent

A very basic but important word of caution about a common source of errors:

Remember that the gradient points in the direction of steepest ascent. Pay attention to the problem you're trying to solve!

  • For maxθ J(θ), the update rule becomes θ ← θ + α ∇θJ
  • For minθ J(θ), the update rule becomes θ ← θ − α ∇θJ

with α > 0
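A tiny sketch of the sign convention (my own illustration), maximizing the earlier toy objective J(θ) = 1 − θ₁² − θ₂² by gradient ascent:

```python
import numpy as np

def J(theta):
    return 1.0 - theta[0] ** 2 - theta[1] ** 2

def grad_J(theta):
    return np.array([-2.0 * theta[0], -2.0 * theta[1]])

theta = np.array([2.0, -3.0])
alpha = 0.1
for _ in range(100):
    theta = theta + alpha * grad_J(theta)   # '+' because we MAXIMIZE J; use '-' when minimizing
print(theta, J(theta))                      # theta -> (0, 0), J -> 1, the maximum
```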

SLIDE 48
  • 4. Numerical Optimization

Steepest descent on the Rosenbrock function

The algorithm crawls down the valley...

[Contour plot: steepest descent on the Rosenbrock function]

SLIDE 49
  • 4. Numerical Optimization

Method 3 - Newton’s Method

Taylor approximations can approximate functions locally. For instance:

J(θ + δθ) ≈ J(θ) + ∇θJ(θ)⊺ δθ + ½ δθ⊺ ∇²θJ(θ) δθ = c + g⊺ δθ + ½ δθ⊺ H δθ = J̃(δθ)

where g is the Jacobian and H is the Hessian.

We can minimize quadratic functions straightforwardly:

δθ = arg minδθ J̃(δθ) = arg minδθ (c + g⊺ δθ + ½ δθ⊺ H δθ)

SLIDE 50
  • 4. Numerical Optimization

Method 3 - Newton’s Method

We want to solve

δθ = arg minδθ J̃(δθ) = arg minδθ (c + g⊺ δθ + ½ δθ⊺ H δθ)

This leads to computing

∇δθ J̃(δθ) = ∇δθ (c + g⊺ δθ + ½ δθ⊺ H δθ) = g + H δθ = 0

which yields the solution δθ = −H⁻¹ g
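A minimal Newton iteration on the Rosenbrock test function (my own illustration, with hand-derived gradient and Hessian; not the exact setup behind the slide's iteration counts):

```python
import numpy as np

def J(t):
    """Rosenbrock-type test function from the slides."""
    return (t[1] - t[0] ** 2) ** 2 + 0.01 * (1.0 - t[0]) ** 2

def grad(t):
    return np.array([-4.0 * t[0] * (t[1] - t[0] ** 2) - 0.02 * (1.0 - t[0]),
                     2.0 * (t[1] - t[0] ** 2)])

def hess(t):
    return np.array([[12.0 * t[0] ** 2 - 4.0 * t[1] + 0.02, -4.0 * t[0]],
                     [-4.0 * t[0], 2.0]])

theta = np.array([-1.5, 2.0])
for _ in range(20):
    delta = -np.linalg.solve(hess(theta), grad(theta))   # Newton step: -H^{-1} g
    theta = theta + delta                                 # (a line search on delta would be safer)
print(theta, J(theta))                                    # -> close to the minimum at (1, 1)
```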

SLIDE 51
  • 4. Numerical Optimization

Method 3 - Newton’s Method

  • For quadratic J(θ), the optimal solution is found in one step
  • θₙ₊₁ = θₙ − H⁻¹(θₙ) g(θₙ) has quadratic convergence
  • The solution δθ = −H⁻¹g is guaranteed to be downhill if H is positive definite
  • Rather than jumping straight to the predicted solution at δθ = −H⁻¹g, it is better to do a line search θₙ₊₁ = θₙ − α H⁻¹ g
  • For H = I, this is just steepest descent

[Contour plot: Newton's method on the quadratic test function]

SLIDE 52
  • 4. Numerical Optimization

Newton’s Method on Rosenbrock’s Function

The algorithm converges in only 15 iterations, compared to 101 for conjugate gradients (to come later) and 300 for regular gradient descent.

[Contour plot: Newton's method on the Rosenbrock function]

What is the problem with this method? (δθ = −H⁻¹g)

Computing the Hessian matrix at each iteration – this is not always feasible and often too expensive

SLIDE 53
  • 4. Numerical Optimization

Quasi-Newton Method: BFGS

Approximate the Hessian matrix using the following ideas:

  • Hessians change slowly
  • Hessians are symmetric
  • Derivatives interpolate

These lead to the optimization problem

min ‖H − Hₙ‖   s.t.   H = H⊺,   H(θₙ₊₁ − θₙ) = g(θₙ₊₁) − g(θₙ)

SLIDE 54
  • 4. Numerical Optimization

Quasi-Newton Method: BFGS

Thus the inverse Hessian can be updated iteratively:

H⁻¹ₙ₊₁ = (I − (sₙ yₙ⊺)/(sₙ⊺ yₙ)) H⁻¹ₙ (I − (sₙ yₙ⊺)/(sₙ⊺ yₙ))⊺ + (sₙ sₙ⊺)/(sₙ⊺ yₙ)

where yₙ = g(θₙ₊₁) − g(θₙ) and sₙ = θₙ₊₁ − θₙ
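A compact BFGS sketch (my own illustration) that applies this inverse-Hessian update together with a simple backtracking line search, again on the Rosenbrock-type test function:

```python
import numpy as np

def J(t):
    return (t[1] - t[0] ** 2) ** 2 + 0.01 * (1.0 - t[0]) ** 2

def grad(t):
    return np.array([-4.0 * t[0] * (t[1] - t[0] ** 2) - 0.02 * (1.0 - t[0]),
                     2.0 * (t[1] - t[0] ** 2)])

theta = np.array([-1.5, 2.0])
H_inv = np.eye(2)                           # initial inverse-Hessian approximation
for _ in range(100):
    g = grad(theta)
    d = -H_inv @ g                          # quasi-Newton search direction
    alpha = 1.0                             # backtracking (Armijo) line search
    while J(theta + alpha * d) > J(theta) + 1e-4 * alpha * (g @ d) and alpha > 1e-10:
        alpha *= 0.5
    s = alpha * d                           # s_n = theta_{n+1} - theta_n
    y = grad(theta + s) - g                 # y_n = g(theta_{n+1}) - g(theta_n)
    theta = theta + s
    if s @ y > 1e-12:                       # curvature condition keeps H_inv positive definite
        rho = 1.0 / (s @ y)
        V = np.eye(2) - rho * np.outer(s, y)
        H_inv = V @ H_inv @ V.T + rho * np.outer(s, s)   # BFGS inverse-Hessian update
print(theta, J(theta))                      # -> approximately (1, 1)
```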

SLIDE 55
  • 4. Numerical Optimization

Quasi-Newton Method: BFGS

  • The first step can be fully off due to the initialization, but slight errors can be helpful along the way
  • For reasonable dimensions, BFGS is the preferred method

[Contour plots: BFGS on the quadratic and Rosenbrock test functions]

SLIDE 56
  • 4. Numerical Optimization

Method 4 - Conjugate Gradients (a sketch)

  • The method of conjugate gradients chooses successive descent directions δθₙ such that it is guaranteed to reach the minimum in a finite number of steps
  • Each δθₙ is chosen to be conjugate to all previous search directions with respect to the Hessian H:

δθₙ⊺ H δθⱼ = 0   for 0 ≤ j < n

  • The resulting search directions are mutually linearly independent
  • Remarkably, δθₙ can be chosen using only the knowledge of δθₙ₋₁, ∇J(θₙ) and ∇J(θₙ₋₁):

δθₙ = ∇θJ(θₙ) + [∇θJ(θₙ)⊺ ∇θJ(θₙ) / (∇θJ(θₙ₋₁)⊺ ∇θJ(θₙ₋₁))] δθₙ₋₁
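A compact Fletcher-Reeves sketch (my own illustration, written with the common d = −∇J sign convention and a crude grid line search), applied to the quadratic test function:

```python
import numpy as np

def J(t):
    t1, t2 = t[0] - 5.0, t[1] - 5.0
    return t1 ** 2 + t1 * t2 + t2 ** 2

def grad(t):
    t1, t2 = t[0] - 5.0, t[1] - 5.0
    return np.array([2 * t1 + t2, t1 + 2 * t2])

theta = np.array([-5.0, 10.0])
g = grad(theta)
d = -g                                       # first direction: steepest descent
for _ in range(2):                           # a 2-D quadratic needs at most 2 conjugate steps
    alphas = np.linspace(0.0, 2.0, 2001)     # crude line minimization along d
    alpha = min(alphas, key=lambda a: J(theta + a * d))
    theta = theta + alpha * d
    g_new = grad(theta)
    beta = (g_new @ g_new) / (g @ g)         # Fletcher-Reeves coefficient
    d = -g_new + beta * d                    # new direction, conjugate w.r.t. H on quadratics
    g = g_new
print(theta, J(theta))                       # -> close to (5, 5) in 2 steps
```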

SLIDE 57
  • 4. Numerical Optimization

Method 4 - Conjugate Gradients

  • It uses first derivatives only, but avoids "undoing" previous work
  • An N-dimensional quadratic form can be minimized in at most N conjugate descent steps

SLIDE 58
  • 4. Numerical Optimization

Method 4 - Conjugate Gradients

Three different starting points; the minimum is reached in exactly 2 steps.

[Three contour plots: conjugate gradients on the quadratic test function from different starting points]

SLIDE 59
  • 4. Numerical Optimization

Conjugate Gradients on Rosenbrock’s Function

  • The algorithm converges in 101 iterations
  • Far superior to steepest descent, but slower than Newton's method
  • However, it avoids computing the Hessian, which can be expensive in higher dimensions

[Contour plot: conjugate gradients on the Rosenbrock function]

SLIDE 60
  • 4. Numerical Optimization

Conjugate Gradients vs BFGS

  • BFGS is more costly than CG per iteration
  • BFGS converges in fewer steps than CG
  • BFGS has less of a tendency to get "stuck"
  • BFGS requires algorithmic "hacks" to achieve significant descent in each iteration
  • Which one is better depends on your problem!

SLIDE 61
  • 4. Numerical Optimization

Performance Issues

  • Number of iterations required
  • Cost per iteration
  • Memory footprint
  • Region of convergence
  • Is the cost function noisy?

SLIDE 62
  • 5. Wrap-Up

Outline

  • 1. Motivation
  • 2. Convexity

      Convex Sets
      Convex Functions

  • 3. Unconstrained & Constrained Optimization
  • 4. Numerical Optimization
  • 5. Wrap-Up
SLIDE 63
  • 5. Wrap-Up

You know now:

  • How machine learning relates to optimization
  • What a good cost function looks like
  • What convex sets and functions are
  • Why convex functions are important in machine learning
  • What unconstrained and constrained optimization are
  • What the Lagrangian is
  • Different numerical optimization methods

SLIDE 64
  • 5. Wrap-Up

Self-Test Questions

  • Why is optimization important for machine learning?
  • What do well-formulated learning problems look like?
  • What is a convex set and what is a convex function?
  • How do I find the maximum of a vector-valued function?
  • How to deal with constrained optimization problems?
  • How to solve such problems numerically?

SLIDE 65
  • 5. Wrap-Up

Homework

Reading Assignment for next lecture

  • Bishop, ch. 1.5
  • Murphy, ch. 5.7
