SLIDE 1

Numerical Optimization

Shan-Hung Wu

shwu@cs.nthu.edu.tw

Department of Computer Science, National Tsing Hua University, Taiwan

Machine Learning

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 1 / 74

SLIDE 2

Outline

1  Numerical Computation
2  Optimization Problems
3  Unconstrained Optimization
     Gradient Descent
     Newton's Method
4  Optimization in ML: Stochastic Gradient Descent
     Perceptron
     Adaline
     Stochastic Gradient Descent
5  Constrained Optimization
6  Optimization in ML: Regularization
     Linear Regression
     Polynomial Regression
     Generalizability & Regularization
7  Duality*


SLIDE 4

Numerical Computation

Machine learning algorithms usually require a large amount of numerical computation involving real numbers.
However, real numbers cannot be represented precisely using a finite amount of memory.
Watch out for numerical errors when implementing machine learning algorithms.

SLIDE 7

Overflow and Underflow I

Consider the softmax function softmax : ℝ^d → ℝ^d:

    softmax(x)_i = exp(x_i) / Σ_{j=1}^d exp(x_j)

It is commonly used to transform a group of real values into "probabilities".
Analytically, if x_i = c for all i, then softmax(x)_i = 1/d.
Numerically, this may not hold when |c| is large:
    A large positive c causes overflow.
    A large negative c causes underflow and a divide-by-zero error.
How can we avoid these errors?

SLIDE 10

Overflow and Underflow II

Instead of evaluating softmax(x) directly, we can transform x into z = x − (max_i x_i)·1 and then evaluate softmax(z). Letting m = max_i x_i:

    softmax(z)_i = exp(x_i − m) / Σ_j exp(x_j − m)
                 = (exp(x_i)/exp(m)) / Σ_j (exp(x_j)/exp(m))
                 = exp(x_i) / Σ_j exp(x_j)
                 = softmax(x)_i

No overflow: the largest entry of z is 0, so exp(z_i) ≤ exp(0) = 1.
The denominator is at least 1, so there is no divide-by-zero error.
What are the numerical issues of log softmax(z)? How can we stabilize it? [Homework]
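The max-shift trick above can be sketched in a few lines of NumPy (a minimal illustration; the function names are ours, not a library API):

```python
import numpy as np

def softmax_naive(x):
    # Direct evaluation: exp() overflows once entries exceed ~709 in float64
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # Shift by m = max_i x_i: the largest exponent becomes exp(0) = 1,
    # so there is no overflow and the denominator is at least 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1000.0, 1000.0, 1000.0])
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(x))   # [nan nan nan] -- inf/inf after overflow
print(softmax_stable(x))      # each entry 1/d = 1/3, as predicted analytically
```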

SLIDE 14

Poor Conditioning I

"Conditioning" refers to how much the input of a function can change given a small change in the output.
Suppose we want to solve for x in f(x) = Ax = y, where A⁻¹ exists.
The condition number of A can be expressed as

    κ(A) = max_{i,j} |λ_i / λ_j|,

where the λ's are the eigenvalues of A.
We say the problem is poorly (or ill-) conditioned when κ(A) is large:
    It is hard to solve x = A⁻¹y precisely given a rounded y.
    A⁻¹ amplifies pre-existing numeric errors.
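The error amplification can be demonstrated numerically (an illustrative sketch; the matrix is chosen by us to have κ(A) = 10⁶):

```python
import numpy as np

# A poorly conditioned matrix: eigenvalues 1 and 1e-6, so kappa(A) = 1e6
A = np.diag([1.0, 1e-6])
print(np.linalg.cond(A))  # ~1e6

y = np.array([1.0, 1.0])
x = np.linalg.solve(A, y)

# Round y slightly, simulating limited precision...
y_rounded = y + np.array([0.0, 1e-8])
x_perturbed = np.linalg.solve(A, y_rounded)

# ...and A^{-1} amplifies the tiny input error by a factor of ~kappa(A)
print(np.linalg.norm(x_perturbed - x))  # ~1e-2: a 1e6-fold amplification of 1e-8
```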

SLIDE 19

Poor Conditioning II

Consider the contours of f(x) = (1/2) xᵀAx + bᵀx + c, where A is symmetric.
When κ(A) is large, f stretches space differently along different attribute directions:
    The surface is flat in some directions but steep in others.
    It is hard to numerically solve f′(x) = 0 ⟹ x = −A⁻¹b.


SLIDE 22

Optimization Problems

An optimization problem is to minimize a cost function f : ℝ^d → ℝ:

    min_x f(x)  subject to  x ∈ C,

where C ⊆ ℝ^d is called the feasible set, containing the feasible points.
Alternatively, we may maximize an objective function; maximizing f is equivalent to minimizing −f.
If C = ℝ^d, we say the optimization problem is unconstrained.
C can be a set of function constraints, i.e., C = {x : g⁽ⁱ⁾(x) ≤ 0}_i.
Sometimes we single out the equality constraints: C = {x : g⁽ⁱ⁾(x) ≤ 0, h⁽ʲ⁾(x) = 0}_{i,j}.
    Each equality constraint can be written as two inequality constraints.

SLIDE 27

Minimums and Optimal Points

Critical points: {x : f′(x) = 0}
Minima: {x : f′(x) = 0 and H(f)(x) ≻ O}, where H(f)(x) is the Hessian matrix (containing curvatures) of f at point x
Maxima: {x : f′(x) = 0 and H(f)(x) ≺ O}
Plateau or saddle points: {x : f′(x) = 0 and H(f)(x) = O or indefinite}
y* = min_{x∈C} f(x) ∈ ℝ is called the global minimum
    Global minima vs. local minima
x* = argmin_{x∈C} f(x) is called the optimal point

SLIDE 33

Convex Optimization Problems

An optimization problem is convex iff
1  f is convex by having a "convex hull" surface, i.e., H(f)(x) ⪰ O, ∀x
2  the g⁽ⁱ⁾(x)'s are convex and the h⁽ʲ⁾(x)'s are affine

Convex problems are "easier" since:
    Local minima are necessarily global minima.
    There are no saddle points.
    We can get the global minimum by solving f′(x) = 0.

SLIDE 37

Analytical Solutions vs. Numerical Solutions I

Consider the problem:

    argmin_x (1/2)‖Ax − b‖² + λ‖x‖²

Analytical solutions? The cost function

    f(x) = (1/2) xᵀ(AᵀA + λI)x − bᵀAx + (1/2)‖b‖²

is convex. Solving f′(x) = xᵀ(AᵀA + λI) − bᵀA = 0, we have

    x* = (AᵀA + λI)⁻¹ Aᵀb

SLIDE 40

Analytical Solutions vs. Numerical Solutions II

Problem (A ∈ ℝ^{n×d}, b ∈ ℝⁿ, λ ∈ ℝ):

    argmin_{x∈ℝ^d} (1/2)‖Ax − b‖² + λ‖x‖²

Analytical solution: x* = (AᵀA + λI)⁻¹ Aᵀb
In practice, we may not be able to solve f′(x) = 0 analytically and get x* in a closed form.
    E.g., when λ = 0 and n < d.
Even if we can, the computational cost may be too high.
    E.g., inverting AᵀA + λI ∈ ℝ^{d×d} takes O(d³) time.
Numerical methods: since numerical errors are inevitable anyway, why not just obtain an approximation of x*?
    Starting from x⁽⁰⁾, iteratively calculate x⁽¹⁾, x⁽²⁾, ... such that f(x⁽¹⁾) ≥ f(x⁽²⁾) ≥ ...
    This usually requires much less time to reach a good enough x⁽ᵗ⁾ ≈ x*.
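The closed-form solution above can be computed directly (a minimal sketch on synthetic data of our choosing; solving the linear system is preferred over forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 3, 0.1
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Closed-form solution x* = (A^T A + lambda*I)^{-1} A^T b,
# obtained via a linear solve rather than an explicit O(d^3) inversion
x_star = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

# Check optimality: the gradient (A^T A + lambda*I) x - A^T b vanishes at x*
grad = (A.T @ A + lam * np.eye(d)) @ x_star - A.T @ b
print(np.allclose(grad, 0.0))  # True
```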


SLIDE 46

Unconstrained Optimization

Problem: min_{x∈ℝ^d} f(x), where f : ℝ^d → ℝ is not necessarily convex.

SLIDE 47

General Descent Algorithm

Input: x⁽⁰⁾ ∈ ℝ^d, an initial guess
repeat
    Determine a descent direction d⁽ᵗ⁾ ∈ ℝ^d;
    Line search: choose a step size (learning rate) η⁽ᵗ⁾ > 0 such that f(x⁽ᵗ⁾ + η⁽ᵗ⁾d⁽ᵗ⁾) is minimal along the ray x⁽ᵗ⁾ + ηd⁽ᵗ⁾, η > 0;
    Update rule: x⁽ᵗ⁺¹⁾ ← x⁽ᵗ⁾ + η⁽ᵗ⁾d⁽ᵗ⁾;
until the convergence criterion is satisfied;

Convergence criteria: ‖x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾‖ ≤ ε, ‖∇f(x⁽ᵗ⁺¹⁾)‖ ≤ ε, etc.
The line search step can be skipped by letting η⁽ᵗ⁾ be a small constant.
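The loop above can be sketched generically (an illustration under the slide's "constant η, skip line search" simplification; the function names are ours):

```python
import numpy as np

def descent(direction, x0, eta=0.1, eps=1e-6, max_iter=10_000):
    """Generic descent loop: the caller supplies the direction rule d(x).

    Uses a fixed step size eta (i.e., the line-search step is skipped)
    and the convergence criterion ||x_next - x|| <= eps.
    """
    x = x0
    for _ in range(max_iter):
        x_next = x + eta * direction(x)
        if np.linalg.norm(x_next - x) <= eps:  # convergence criterion
            return x_next
        x = x_next
    return x

# Example: minimize f(x) = ||x||^2 with the negative gradient as direction
grad = lambda x: 2 * x
x_opt = descent(lambda x: -grad(x), np.array([3.0, -4.0]))
print(x_opt)  # close to the minimizer [0, 0]
```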


SLIDE 50

Gradient Descent I

By Taylor's theorem, we can approximate f locally at point x⁽ᵗ⁾ using a linear function f̃, i.e.,

    f(x) ≈ f̃(x; x⁽ᵗ⁾) = f(x⁽ᵗ⁾) + ∇f(x⁽ᵗ⁾)ᵀ(x − x⁽ᵗ⁾)

for x close enough to x⁽ᵗ⁾.
This implies that if we pick a nearby x⁽ᵗ⁺¹⁾ that decreases f̃, we are likely to decrease f as well.
We can pick x⁽ᵗ⁺¹⁾ = x⁽ᵗ⁾ − η∇f(x⁽ᵗ⁾) for some small η > 0, since

    f̃(x⁽ᵗ⁺¹⁾) = f(x⁽ᵗ⁾) − η‖∇f(x⁽ᵗ⁾)‖² ≤ f̃(x⁽ᵗ⁾)

SLIDE 53

Gradient Descent II

Input: x⁽⁰⁾ ∈ ℝ^d, an initial guess; a small η > 0
repeat
    x⁽ᵗ⁺¹⁾ ← x⁽ᵗ⁾ − η∇f(x⁽ᵗ⁾);
until the convergence criterion is satisfied;

SLIDE 54

Is Negative Gradient a Good Direction? I

Update rule: x⁽ᵗ⁺¹⁾ ← x⁽ᵗ⁾ − η∇f(x⁽ᵗ⁾)
Yes: ∇f(x⁽ᵗ⁾) ∈ ℝ^d is the steepest ascent direction of f at point x⁽ᵗ⁾, so −∇f(x⁽ᵗ⁾) ∈ ℝ^d is the steepest descent direction.
But why?

SLIDE 57

Is Negative Gradient a Good Direction? II

Consider the slope of f in a given direction u at point x⁽ᵗ⁾.
This is the directional derivative of f, i.e., the derivative of the function f(x⁽ᵗ⁾ + εu) with respect to ε, evaluated at ε = 0.
By the chain rule, we have

    ∂/∂ε f(x⁽ᵗ⁾ + εu) = ∇f(x⁽ᵗ⁾ + εu)ᵀu,

which equals ∇f(x⁽ᵗ⁾)ᵀu when ε = 0.

Theorem (Chain Rule)
Let g : ℝ → ℝ^d and f : ℝ^d → ℝ. Then

    (f ∘ g)′(x) = f′(g(x)) g′(x) = ∇f(g(x))ᵀ [g₁′(x), ..., g_d′(x)]ᵀ.

SLIDE 59

Is Negative Gradient a Good Direction? III

To find the direction that decreases f the fastest at x⁽ᵗ⁾, we solve the problem:

    argmin_{u, ‖u‖=1} ∇f(x⁽ᵗ⁾)ᵀu = argmin_{u, ‖u‖=1} ‖∇f(x⁽ᵗ⁾)‖ ‖u‖ cos θ,

where θ is the angle between u and ∇f(x⁽ᵗ⁾).
This amounts to solving argmin_θ cos θ, i.e., θ = π.
So u* = −∇f(x⁽ᵗ⁾) (up to normalization) is the steepest descent direction of f at point x⁽ᵗ⁾.
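The claim can be checked empirically by sampling directional derivatives over random unit directions (a sketch on an example function of our choosing):

```python
import numpy as np

rng = np.random.default_rng(1)

# f(x) = x1^2 + 3*x2^2, with gradient [2*x1, 6*x2]
grad = lambda x: np.array([2 * x[0], 6 * x[1]])
x = np.array([1.0, 1.0])
g = grad(x)

# Directional derivative g^T u for many random unit directions u
us = rng.standard_normal((1000, 2))
us /= np.linalg.norm(us, axis=1, keepdims=True)
slopes = us @ g

# No sampled direction descends faster than the normalized negative
# gradient, whose slope attains the minimum value -||g|| (Cauchy-Schwarz)
u_star = -g / np.linalg.norm(g)
print(slopes.min() >= u_star @ g)  # True
```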

SLIDE 61

How to Set the Learning Rate η? I

Too small an η results in slow descent and many iterations.
Too large an η may overshoot the optimal point along the gradient and go uphill.
One way to set a better η is to leverage the curvature of f:
    The more curvy f is at point x⁽ᵗ⁾, the smaller the η.

SLIDE 63

How to Set the Learning Rate η? II

By Taylor's theorem, we can approximate f locally at point x⁽ᵗ⁾ using a quadratic function f̃:

    f(x) ≈ f̃(x; x⁽ᵗ⁾) = f(x⁽ᵗ⁾) + ∇f(x⁽ᵗ⁾)ᵀ(x − x⁽ᵗ⁾) + (1/2)(x − x⁽ᵗ⁾)ᵀ H(f)(x⁽ᵗ⁾) (x − x⁽ᵗ⁾)

for x close enough to x⁽ᵗ⁾, where H(f)(x⁽ᵗ⁾) ∈ ℝ^{d×d} is the (symmetric) Hessian matrix of f at x⁽ᵗ⁾.

Line search at step t:

    argmin_η f̃(x⁽ᵗ⁾ − η∇f(x⁽ᵗ⁾)) = argmin_η f(x⁽ᵗ⁾) − η∇f(x⁽ᵗ⁾)ᵀ∇f(x⁽ᵗ⁾) + (η²/2) ∇f(x⁽ᵗ⁾)ᵀ H(f)(x⁽ᵗ⁾) ∇f(x⁽ᵗ⁾)

If ∇f(x⁽ᵗ⁾)ᵀ H(f)(x⁽ᵗ⁾) ∇f(x⁽ᵗ⁾) > 0, we can solve ∂/∂η f̃(x⁽ᵗ⁾ − η∇f(x⁽ᵗ⁾)) = 0 and get:

    η⁽ᵗ⁾ = ∇f(x⁽ᵗ⁾)ᵀ∇f(x⁽ᵗ⁾) / (∇f(x⁽ᵗ⁾)ᵀ H(f)(x⁽ᵗ⁾) ∇f(x⁽ᵗ⁾))
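This curvature-based step size can be demonstrated on a quadratic, where the formula gives the exact line-search minimizer at every step (a sketch; the test matrix and starting point are ours):

```python
import numpy as np

# Quadratic f(x) = 1/2 x^T A x with an ill-conditioned Hessian (kappa = 50)
A = np.diag([1.0, 50.0])
grad = lambda x: A @ x  # Hessian is the constant matrix A

x = np.array([50.0, 1.0])
for _ in range(1000):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    # Curvature-aware step: eta = (g^T g) / (g^T H g)
    eta = (g @ g) / (g @ A @ g)
    x = x - eta * g

print(np.linalg.norm(x))  # essentially 0: converged to the minimizer
```

Note the zig-zag path this takes on an ill-conditioned quadratic, which motivates the discussion of gradient descent's problems below.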

SLIDE 65

Problems of Gradient Descent

Gradient descent is designed to find the steepest descent direction at step x⁽ᵗ⁾.
It is not aware of the conditioning of the Hessian matrix H(f)(x⁽ᵗ⁾).
If H(f)(x⁽ᵗ⁾) has a large condition number, then f is curvy in some directions but flat in others at x⁽ᵗ⁾.
E.g., suppose f is a quadratic function whose Hessian has a large condition number. A step in gradient descent may:
    Overshoot the optimal points along flat attributes.
    "Zig-zag" around a narrow valley.
Why not take conditioning into account when picking descent directions?


SLIDE 70

Newton's Method I

By Taylor's theorem, we can approximate f locally at point x⁽ᵗ⁾ using a quadratic function f̃, i.e.,

    f(x) ≈ f̃(x; x⁽ᵗ⁾) = f(x⁽ᵗ⁾) + ∇f(x⁽ᵗ⁾)ᵀ(x − x⁽ᵗ⁾) + (1/2)(x − x⁽ᵗ⁾)ᵀ H(f)(x⁽ᵗ⁾) (x − x⁽ᵗ⁾)

for x close enough to x⁽ᵗ⁾.
If f is strictly convex (i.e., H(f)(a) ≻ O, ∀a), we can find the x⁽ᵗ⁺¹⁾ that minimizes f̃ in order to decrease f.
Solving ∇f̃(x⁽ᵗ⁺¹⁾; x⁽ᵗ⁾) = 0, we have

    x⁽ᵗ⁺¹⁾ = x⁽ᵗ⁾ − H(f)(x⁽ᵗ⁾)⁻¹ ∇f(x⁽ᵗ⁾)

H(f)(x⁽ᵗ⁾)⁻¹ acts as a "corrector" to the negative gradient.

SLIDE 72

Newton's Method II

Input: x⁽⁰⁾ ∈ ℝ^d, an initial guess; η > 0
repeat
    x⁽ᵗ⁺¹⁾ ← x⁽ᵗ⁾ − η H(f)(x⁽ᵗ⁾)⁻¹ ∇f(x⁽ᵗ⁾);
until the convergence criterion is satisfied;

In practice, we multiply the shift by a small η > 0 to make sure that x⁽ᵗ⁺¹⁾ is close to x⁽ᵗ⁾.

SLIDE 74

Newton's Method III

If f is a positive definite quadratic function, then only one step is required.
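A damped Newton iteration can be sketched as follows, including the one-step behavior on a positive definite quadratic (illustrative; the function names are ours, and we solve the linear system rather than invert the Hessian):

```python
import numpy as np

def newton(grad, hess, x0, eta=1.0, eps=1e-8, max_iter=100):
    # Damped Newton update: x <- x - eta * H(f)(x)^{-1} grad(x)
    x = x0
    for _ in range(max_iter):
        step = np.linalg.solve(hess(x), grad(x))  # avoids explicit inverse
        x_next = x - eta * step
        if np.linalg.norm(x_next - x) <= eps:
            return x_next
        x = x_next
    return x

# On a positive definite quadratic f(x) = 1/2 x^T A x - b^T x,
# a full Newton step (eta = 1) lands on the minimizer A^{-1} b at once
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_opt = newton(lambda x: A @ x - b, lambda x: A, np.zeros(2))
print(np.allclose(x_opt, np.linalg.solve(A, b)))  # True
```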

SLIDE 75

General Functions

Update rule: x⁽ᵗ⁺¹⁾ ← x⁽ᵗ⁾ − η H(f)(x⁽ᵗ⁾)⁻¹ ∇f(x⁽ᵗ⁾)
What if f is not strictly convex? Then H(f)(x⁽ᵗ⁾) ⪯ O or indefinite.
The Levenberg–Marquardt extension:

    x⁽ᵗ⁺¹⁾ = x⁽ᵗ⁾ − η (H(f)(x⁽ᵗ⁾) + αI)⁻¹ ∇f(x⁽ᵗ⁾)

for some α > 0.
With a large α, it degenerates into gradient descent with learning rate 1/α.

Input: x⁽⁰⁾ ∈ ℝ^d, an initial guess; η > 0; α > 0
repeat
    x⁽ᵗ⁺¹⁾ ← x⁽ᵗ⁾ − η (H(f)(x⁽ᵗ⁾) + αI)⁻¹ ∇f(x⁽ᵗ⁾);
until the convergence criterion is satisfied;
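A single Levenberg–Marquardt step can be sketched as below, including the large-α limit noted above (an illustration on a saddle function of our choosing):

```python
import numpy as np

def lm_step(grad, hess, x, eta=1.0, alpha=1e-2):
    """One Levenberg-Marquardt step: shifting the Hessian by alpha*I keeps
    the linear system positive definite even where H(f)(x) is indefinite."""
    d = x.shape[0]
    return x - eta * np.linalg.solve(hess(x) + alpha * np.eye(d), grad(x))

# f(x) = x1^2 - x2^2 has the indefinite Hessian diag(2, -2) everywhere;
# a plain Newton step would be attracted to the saddle point at the origin.
grad = lambda x: np.array([2 * x[0], -2 * x[1]])
hess = lambda x: np.diag([2.0, -2.0])

x = np.array([1.0, 1.0])
# With a large alpha the step approaches gradient descent with rate 1/alpha
x_next = lm_step(grad, hess, x, alpha=100.0)
print(x_next)  # close to x - (1/100) * grad(x) = [0.98, 1.02]
```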

SLIDE 78

Gradient Descent vs. Newton's Method

[Figure: steps of gradient descent vs. steps of Newton's method on Rosenbrock's banana function.]
Newton's method takes only 6 steps in total.

SLIDE 79

Problems of Newton's Method

Computing H(f)(x⁽ᵗ⁾)⁻¹ is slow:
    It takes O(d³) time at each step, much slower than the O(d) of gradient descent.
x⁽ᵗ⁺¹⁾ = x⁽ᵗ⁾ − η H(f)(x⁽ᵗ⁾)⁻¹ ∇f(x⁽ᵗ⁾) is imprecise due to numerical errors:
    H(f)(x⁽ᵗ⁾) may have a large condition number.
It is attracted to saddle points (when f is not convex):
    The x⁽ᵗ⁺¹⁾ solved from ∇f̃(x⁽ᵗ⁺¹⁾; x⁽ᵗ⁾) = 0 is a critical point.


slide-85
SLIDE 85

Who is Afraid of Non-convexity?

In ML, the function to solve is usually the cost function C(w) of a model F = {f : f parametrized by w}

Many ML models have convex cost functions in order to take advantage of convex optimization

E.g., perceptron, linear regression, logistic regression, SVMs, etc.

However, in deep learning, the cost function of a neural network is typically not convex

We will discuss techniques that tackle non-convexity later

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 35 / 74

slide-86
SLIDE 86

Assumption on Cost Functions

In ML, we usually assume that the (real-valued) cost function is Lipschitz continuous and/or has Lipschitz continuous derivatives

I.e., the rate of change of C is bounded by a Lipschitz constant K: |C(w^(1)) − C(w^(2))| ≤ K‖w^(1) − w^(2)‖, ∀w^(1), w^(2)

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 36 / 74
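As an illustrative check (a toy example of my own, not from the slides), the inequality can be verified numerically for C(w) = (1/2)‖w‖² on the unit ball, where K = 1 works because ‖∇C(w)‖ = ‖w‖ ≤ 1 there:

```python
import numpy as np

def C(w):
    return 0.5 * np.dot(w, w)   # a simple convex cost function

# On the unit ball, ||grad C(w)|| = ||w|| <= 1, so K = 1 is a valid constant
K = 1.0
rng = np.random.default_rng(0)
ok = True
for _ in range(1000):
    # sample two points inside the unit ball
    w1 = rng.normal(size=5); w1 /= max(1.0, np.linalg.norm(w1))
    w2 = rng.normal(size=5); w2 /= max(1.0, np.linalg.norm(w2))
    ok &= abs(C(w1) - C(w2)) <= K * np.linalg.norm(w1 - w2) + 1e-12
```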

slide-87
SLIDE 87

Outline

1

Numerical Computation

2

Optimization Problems

3

Unconstrained Optimization Gradient Descent Newton’s Method

4

Optimization in ML: Stochastic Gradient Descent Perceptron Adaline Stochastic Gradient Descent

5

Constrained Optimization

6

Optimization in ML: Regularization Linear Regression Polynomial Regression Generalizability & Regularization

7

Duality*

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 37 / 74

slide-92
SLIDE 92

Perceptron & Neurons

Perceptron, proposed in the 1950's by Rosenblatt, is one of the first ML algorithms for binary classification

Inspired by the McCulloch-Pitts (MCP) neuron, published in 1943

Our brains consist of interconnected neurons

Each neuron takes signals from other neurons as input

If the accumulated signal exceeds a certain threshold, an output signal is generated

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 38 / 74

slide-95
SLIDE 95

Model

Binary classification problem:

Training dataset: X = {(x^(i), y^(i))}_i, where x^(i) ∈ R^D and y^(i) ∈ {−1, 1}

Output: a function f(x) = ŷ such that ŷ is close to the true label y

Model: {f : f(x; w, b) = sign(w⊤x − b)}

sign(a) = 1 if a ≥ 0; otherwise −1

For simplicity, we use the shorthand f(x; w) = sign(w⊤x), where w = [−b, w_1, ··· , w_D]⊤ and x = [1, x_1, ··· , x_D]⊤

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 39 / 74

slide-96
SLIDE 96

Iterative Training Algorithm I

1

Initialize w^(0) and learning rate η > 0

2

Epoch: for each example (x^(t), y^(t)), update w by w^(t+1) = w^(t) + η(y^(t) − ŷ^(t))x^(t), where ŷ^(t) = f(x^(t); w^(t)) = sign(w^(t)⊤x^(t))

3

Repeat the epoch several times (or until convergence)

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 40 / 74
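A minimal NumPy sketch of the algorithm above (the function name and toy data are my own; the bias is folded into w as on the earlier Model slide):

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=10):
    """X: N x D inputs, y: labels in {-1, +1}. Prepends a constant 1 feature
    so the bias is folded into w, as on the slide."""
    Xa = np.hstack([np.ones((len(X), 1)), X])        # x = [1, x1, ..., xD]
    w = np.zeros(Xa.shape[1])                        # w(0)
    for _ in range(epochs):                          # repeat epochs
        for x_t, y_t in zip(Xa, y):
            y_hat = 1.0 if w @ x_t >= 0 else -1.0    # sign(w^T x)
            w += eta * (y_t - y_hat) * x_t           # update rule
    return w

# Linearly separable toy data: label is the sign of x1
X = np.array([[2.0], [1.0], [-1.0], [-2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = train_perceptron(X, y)
preds = np.where(np.hstack([np.ones((4, 1)), X]) @ w >= 0, 1.0, -1.0)
```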

slide-101
SLIDE 101

Iterative Training Algorithm II

Update rule: w^(t+1) = w^(t) + η(y^(t) − ŷ^(t))x^(t)

If ŷ^(t) is correct, we have w^(t+1) = w^(t)

If ŷ^(t) is incorrect, we have w^(t+1) = w^(t) + 2ηy^(t)x^(t)

If y^(t) = 1, the updated prediction is more likely to be positive, as sign(w^(t+1)⊤x^(t)) = sign(w^(t)⊤x^(t) + c) for some c > 0

If y^(t) = −1, the updated prediction is more likely to be negative

Does not converge if the dataset cannot be separated by a hyperplane

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 41 / 74

slide-102
SLIDE 102

Outline

1

Numerical Computation

2

Optimization Problems

3

Unconstrained Optimization Gradient Descent Newton’s Method

4

Optimization in ML: Stochastic Gradient Descent Perceptron Adaline Stochastic Gradient Descent

5

Constrained Optimization

6

Optimization in ML: Regularization Linear Regression Polynomial Regression Generalizability & Regularization

7

Duality*

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 42 / 74

slide-104
SLIDE 104

ADAptive LInear NEuron (Adaline)

Proposed in the 1960's by Widrow et al.

Defines and minimizes a cost function for training: argmin_w C(w; X) = argmin_w (1/2) Σ_{i=1}^N (y^(i) − w⊤x^(i))²

Links numerical optimization to ML

The sign function is only used for binary prediction after training

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 43 / 74

slide-106
SLIDE 106

Training Using Gradient Descent

Update rule: w^(t+1) = w^(t) − η∇C(w^(t)) = w^(t) + η Σ_i (y^(i) − w^(t)⊤x^(i))x^(i)

Since the cost function is convex, the training iterations will converge

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 44 / 74
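A minimal sketch of this batch update in NumPy (the toy data and hyperparameters are illustrative assumptions of my own):

```python
import numpy as np

def train_adaline(X, y, eta=0.01, epochs=500):
    """Batch gradient descent on C(w; X) = 1/2 sum_i (y(i) - w^T x(i))^2."""
    Xa = np.hstack([np.ones((len(X), 1)), X])   # fold the bias into w
    w = np.zeros(Xa.shape[1])
    for _ in range(epochs):
        residual = y - Xa @ w                   # y(i) - w^T x(i) for all i
        w += eta * Xa.T @ residual              # w <- w - eta * grad C(w)
    return w

# Toy 1-D data; sign(.) is applied only after training, as on the slide
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w = train_adaline(X, y)
preds = np.where(np.hstack([np.ones((4, 1)), X]) @ w >= 0, 1.0, -1.0)
```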

slide-107
SLIDE 107

Outline

1

Numerical Computation

2

Optimization Problems

3

Unconstrained Optimization Gradient Descent Newton’s Method

4

Optimization in ML: Stochastic Gradient Descent Perceptron Adaline Stochastic Gradient Descent

5

Constrained Optimization

6

Optimization in ML: Regularization Linear Regression Polynomial Regression Generalizability & Regularization

7

Duality*

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 45 / 74

slide-110
SLIDE 110

Cost as an Expectation

In ML, the cost function to minimize is usually a sum of losses over training examples

E.g., in Adaline, the sum of square losses: argmin_w C(w; X) = argmin_w (1/2) Σ_{i=1}^N (y^(i) − w⊤x^(i))²

Let the examples be i.i.d. samples of random variables (x, y)

We effectively minimize an estimate of E[C(w)] over the distribution P(x, y): argmin_w E_{x,y∼P}[C(w)]

P(x, y) may be unknown

Since the problem is stochastic by nature, why not make the training algorithm stochastic too?

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 46 / 74

slide-113
SLIDE 113

Stochastic Gradient Descent

Input: w^(0) ∈ R^d an initial guess, η > 0, M ≥ 1
repeat
    Epoch: randomly partition the training set X into minibatches {X^(j)}_j, |X^(j)| = M;
    foreach j do w^(t+1) ← w^(t) − η∇C(w^(t); X^(j));
until convergence criterion is satisfied;

C(w; X^(j)) is still an estimate of E_{x,y∼P}[C(w)]

The X^(j) are samples of the same distribution P(x, y)

It's common to set M = 1 on a single machine

E.g., the update rule for Adaline becomes w^(t+1) = w^(t) + η(y^(t) − w^(t)⊤x^(t))x^(t), which is similar to that of Perceptron

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 47 / 74
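The epoch/minibatch loop above can be sketched as follows for the Adaline cost (a toy illustration; the data, η, and M are my own choices):

```python
import numpy as np

def sgd_adaline(X, y, eta=0.05, M=2, epochs=300, seed=0):
    """Minibatch SGD for the Adaline cost C(w; X) = 1/2 sum (y - w^T x)^2."""
    rng = np.random.default_rng(seed)
    Xa = np.hstack([np.ones((len(X), 1)), X])       # fold bias into w
    w = np.zeros(Xa.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(Xa))              # random partition of X
        for j in range(0, len(Xa), M):              # minibatches X(j), |X(j)| = M
            b = idx[j:j + M]
            grad = -Xa[b].T @ (y[b] - Xa[b] @ w)    # grad of C(w; X(j))
            w -= eta * grad
    return w

# Noise-free data from y = 1 + 2x, so SGD should recover w close to [1, 2]
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
w = sgd_adaline(X, y)
```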

slide-117
SLIDE 117

SGD vs. GD

Each iteration can run much faster when M ≪ N

Converges faster (in both #epochs and time) with large datasets

Supports online learning

But may wander around the optimal points

In practice, we set η = O(t⁻¹)

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 48 / 74

slide-118
SLIDE 118

Outline

1

Numerical Computation

2

Optimization Problems

3

Unconstrained Optimization Gradient Descent Newton’s Method

4

Optimization in ML: Stochastic Gradient Descent Perceptron Adaline Stochastic Gradient Descent

5

Constrained Optimization

6

Optimization in ML: Regularization Linear Regression Polynomial Regression Generalizability & Regularization

7

Duality*

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 49 / 74

slide-120
SLIDE 120

Constrained Optimization

Problem: min_x f(x) subject to x ∈ C

f : R^d → R is not necessarily convex

C = {x : g^(i)(x) ≤ 0, h^(j)(x) = 0}_{i,j}

Iterative descent algorithm?

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 50 / 74

slide-122
SLIDE 122

Common Methods

Projected gradient descent: if x^(t) falls outside C at step t, we “project” the point back onto the tangent space (edge) of C

Penalty/barrier methods: convert the constrained problem into one or more unconstrained ones

And more...

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 51 / 74

slide-126
SLIDE 126

Karush-Kuhn-Tucker (KKT) Methods I

Converts the problem min_x f(x) subject to x ∈ {x : g^(i)(x) ≤ 0, h^(j)(x) = 0}_{i,j} into min_x max_{α,β,α≥0} L(x, α, β) = min_x max_{α,β,α≥0} f(x) + Σ_i α_i g^(i)(x) + Σ_j β_j h^(j)(x)

min_x max_{α,β} L means “minimize L with respect to x, at which L is maximized with respect to α and β”

The function L(x, α, β) is called the (generalized) Lagrangian

α and β are called KKT multipliers

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 52 / 74

slide-129
SLIDE 129

Karush-Kuhn-Tucker (KKT) Methods II

Converts the problem min_x f(x) subject to x ∈ {x : g^(i)(x) ≤ 0, h^(j)(x) = 0}_{i,j} into min_x max_{α,β,α≥0} L(x, α, β) = min_x max_{α,β,α≥0} f(x) + Σ_i α_i g^(i)(x) + Σ_j β_j h^(j)(x)

Observe that for any feasible point x, we have max_{α,β,α≥0} L(x, α, β) = f(x)

The optimal feasible point is unchanged

And for any infeasible point x, we have max_{α,β,α≥0} L(x, α, β) = ∞

Infeasible points will never be optimal (if there are feasible points)

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 53 / 74

slide-131
SLIDE 131

Alternate Iterative Algorithm

min_x max_{α,β,α≥0} f(x) + Σ_i α_i g^(i)(x) + Σ_j β_j h^(j)(x)

“Large” α and β create a “barrier” for feasible solutions

Input: x^(0) an initial guess, α^(0) = 0, β^(0) = 0
repeat
    Solve x^(t+1) = argmin_x L(x; α^(t), β^(t)) using some iterative algorithm starting at x^(t);
    if x^(t+1) ∉ C then
        Increase α^(t) to get α^(t+1);
        Get β^(t+1) by increasing the magnitude of β^(t) and setting sign(β^(t+1)_j) = sign(h^(j)(x^(t+1)));
    end
until x^(t+1) ∈ C;

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 54 / 74
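A closely related penalty-method sketch on a 1-D toy problem (my own illustration, not the exact pseudocode above: it uses a quadratic penalty α·max(g(x), 0)² for the inner unconstrained solve, and the inner step size is tuned to this particular f and g):

```python
def penalty_method(f_grad, g, g_grad, x, alpha=1.0, rho=2.0,
                   inner=200, outer=25, tol=1e-6):
    """Quadratic-penalty sketch: minimize f(x) + alpha * max(g(x), 0)^2
    by gradient descent, raising alpha until x is (nearly) feasible."""
    for _ in range(outer):
        eta = 1.0 / (2.0 + 2.0 * alpha)   # step ~ 1/curvature for this toy f
        for _ in range(inner):            # inner unconstrained solve
            viol = max(g(x), 0.0)
            x = x - eta * (f_grad(x) + 2.0 * alpha * viol * g_grad(x))
        if g(x) <= tol:                   # x is (nearly) in C: stop
            break
        alpha *= rho                      # raise the "barrier"
    return x

# min (x - 2)^2  subject to  g(x) = x - 1 <= 0; the optimum is x* = 1
f_grad = lambda x: 2.0 * (x - 2.0)
g = lambda x: x - 1.0
g_grad = lambda x: 1.0
x = penalty_method(f_grad, g, g_grad, x=0.0)
```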

slide-137
SLIDE 137

KKT Conditions

Theorem (KKT Conditions) If x* is an optimal point, then there exist KKT multipliers α* and β* such that the Karush-Kuhn-Tucker (KKT) conditions are satisfied:

Lagrangian stationarity: ∇L(x*, α*, β*) = 0

Primal feasibility: g^(i)(x*) ≤ 0 and h^(j)(x*) = 0 for all i and j

Dual feasibility: α* ≥ 0

Complementary slackness: α*_i g^(i)(x*) = 0 for all i

Only a necessary condition for x* being optimal

Sufficient if the original problem is convex

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 55 / 74
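The conditions can be checked on a tiny worked example (my own, not from the slides): minimize f(x) = x² subject to g(x) = 1 − x ≤ 0. The Lagrangian is L(x, α) = x² + α(1 − x), and (x*, α*) = (1, 2) satisfies all four conditions:

```python
# min f(x) = x^2  subject to  g(x) = 1 - x <= 0
# L(x, a) = x^2 + a * (1 - x);  dL/dx = 2x - a
x_star, a_star = 1.0, 2.0

stationarity = 2 * x_star - a_star        # dL/dx at (x*, a*), should be 0
primal_feasible = (1 - x_star) <= 0       # g(x*) <= 0
dual_feasible = a_star >= 0               # a* >= 0
slackness = a_star * (1 - x_star)         # a* g(x*), should be 0
```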

slide-143
SLIDE 143

Complementary Slackness

Why α*_i g^(i)(x*) = 0?

For x* to be feasible, we must have g^(i)(x*) ≤ 0

If g^(i) is active (i.e., g^(i)(x*) = 0), then α*_i g^(i)(x*) = 0

If g^(i) is inactive (i.e., g^(i)(x*) < 0), then to maximize the α_i g^(i)(x*) term in the Lagrangian in terms of α_i subject to α_i ≥ 0, we must have α*_i = 0

Again α*_i g^(i)(x*) = 0

So what?

α*_i > 0 implies g^(i)(x*) = 0

Once x* is solved, we can quickly find the active inequality constraints by checking α*_i > 0

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 56 / 74

slide-144
SLIDE 144

Outline

1

Numerical Computation

2

Optimization Problems

3

Unconstrained Optimization Gradient Descent Newton’s Method

4

Optimization in ML: Stochastic Gradient Descent Perceptron Adaline Stochastic Gradient Descent

5

Constrained Optimization

6

Optimization in ML: Regularization Linear Regression Polynomial Regression Generalizability & Regularization

7

Duality*

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 57 / 74

slide-145
SLIDE 145

Outline

1

Numerical Computation

2

Optimization Problems

3

Unconstrained Optimization Gradient Descent Newton’s Method

4

Optimization in ML: Stochastic Gradient Descent Perceptron Adaline Stochastic Gradient Descent

5

Constrained Optimization

6

Optimization in ML: Regularization Linear Regression Polynomial Regression Generalizability & Regularization

7

Duality*

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 58 / 74

slide-149
SLIDE 149

The Regression Problem

Given a training dataset X = {(x^(i), y^(i))}_{i=1}^N

x^(i) ∈ R^D are called explanatory variables (attributes/features); y^(i) ∈ R are called response/target variables (labels)

Goal: to find a function f(x) = ŷ such that ŷ is close to the true label y

Example: to predict the price of a stock tomorrow

Could you define a model F = {f} and a cost function C[f]?

How about “relaxing” Adaline by removing the sign function when making the final prediction?

Adaline: ŷ = sign(w⊤x − b); Regressor: ŷ = w⊤x − b

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 59 / 74

slide-151
SLIDE 151

Linear Regression I

Model: F = {f : f(x; w, b) = w⊤x − b}

Shorthand: f(x; w) = w⊤x, where w = [−b, w_1, ··· , w_D]⊤ and x = [1, x_1, ··· , x_D]⊤

Cost function and optimization problem: argmin_w (1/2) Σ_{i=1}^N ‖y^(i) − w⊤x^(i)‖² = argmin_w (1/2)‖y − Xw‖²

X = [1 x^(1)⊤; ... ; 1 x^(N)⊤] ∈ R^{N×(D+1)} is the design matrix, and y = [y^(1), ··· , y^(N)]⊤ is the label vector

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 60 / 74

slide-152
SLIDE 152

Linear Regression II

argmin_w (1/2) Σ_{i=1}^N ‖y^(i) − w⊤x^(i)‖² = argmin_w (1/2)‖y − Xw‖²

Basically, we fit a hyperplane to the training data

Each f(x) = w⊤x − b ∈ F is a hyperplane in the graph

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 61 / 74

slide-153
SLIDE 153

Training Using Gradient Descent

argmin_w (1/2) Σ_{i=1}^N ‖y^(i) − w⊤x^(i)‖² = argmin_w (1/2)‖y − Xw‖²

Batch: w^(t+1) = w^(t) + η Σ_{i=1}^N (y^(i) − w^(t)⊤x^(i))x^(i) = w^(t) + ηX⊤(y − Xw^(t))

Stochastic (with minibatch size M = 1): w^(t+1) = w^(t) + η(y^(t) − w^(t)⊤x^(t))x^(t)

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 62 / 74
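A quick NumPy check (with synthetic data of my own) that the batch update w ← w + ηX⊤(y − Xw) converges to the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])   # design matrix
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true                                                # noise-free labels

# Batch gradient descent: w <- w + eta * X^T (y - X w)
w = np.zeros(3)
eta = 0.01
for _ in range(5000):
    w += eta * X.T @ (y - X @ w)

# Should agree with the least-squares solution of the normal equations
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
```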

slide-159
SLIDE 159

Evaluation Metrics of Regression Models

Given a training/testing set X = {(x^(i), y^(i))}_{i=1}^N, how do we evaluate the predictions ŷ^(i) made by a function f?

Sum of Square Errors (SSE): Σ_{i=1}^N (y^(i) − ŷ^(i))²

Mean Square Error (MSE): (1/N) Σ_{i=1}^N (y^(i) − ŷ^(i))²

Relative Square Error (RSE): Σ_{i=1}^N (y^(i) − ŷ^(i))² / Σ_{i=1}^N (y^(i) − ȳ)², where ȳ = (1/N) Σ_i y^(i)

What does it mean? It compares f with the dummy prediction ȳ

Coefficient of Determination: R² = 1 − RSE ∈ [0, 1]

The higher the better

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 63 / 74
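These metrics are direct to compute; a small sketch (the function names are my own):

```python
import numpy as np

def sse(y, y_hat):
    # Sum of Square Errors
    return float(np.sum((y - y_hat) ** 2))

def mse(y, y_hat):
    # Mean Square Error
    return sse(y, y_hat) / len(y)

def rse(y, y_hat):
    # Relative Square Error: compares f against the dummy prediction y_bar
    return sse(y, y_hat) / float(np.sum((y - y.mean()) ** 2))

def r2(y, y_hat):
    # Coefficient of Determination
    return 1.0 - rse(y, y_hat)

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.0, 2.0, 3.0, 4.0])   # perfect predictions give R^2 = 1
```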

slide-160
SLIDE 160

Outline

1

Numerical Computation

2

Optimization Problems

3

Unconstrained Optimization Gradient Descent Newton’s Method

4

Optimization in ML: Stochastic Gradient Descent Perceptron Adaline Stochastic Gradient Descent

5

Constrained Optimization

6

Optimization in ML: Regularization Linear Regression Polynomial Regression Generalizability & Regularization

7

Duality*

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 64 / 74

slide-162
SLIDE 162

Polynomial Regression

In practice, the relationship between the explanatory variables and the target variables may not be linear

Polynomial regression fits a high-order polynomial to the training data

How?

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 65 / 74

slide-166
SLIDE 166

Data Augmentation

Suppose D = 2, i.e., x = [x_1, x_2]⊤

Linear model: F = {f : f(x; w) = w⊤x + w_0 = w_0 + w_1x_1 + w_2x_2}

Quadratic model: F = {f : f(x; w) = w_0 + w_1x_1 + w_2x_2 + w_3x_1² + w_4x_1x_2 + w_5x_2²}

We can simply augment the data dimension to reduce a quadratic model to a linear one

This is a general technique in ML to “transform” a linear model into a nonlinear one

How many variables are there to solve in w for a polynomial regression problem of degree P? [Homework]

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 66 / 74
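For the D = 2 case above, the augmentation is just a fixed feature map; a minimal sketch (the helper name is my own):

```python
import numpy as np

def quadratic_features(X):
    """Map [x1, x2] to [x1, x2, x1^2, x1*x2, x2^2]; a linear model on the
    augmented features is a quadratic model on the original ones."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

X = np.array([[1.0, 2.0], [3.0, 0.5]])
Phi = quadratic_features(X)   # each row gains the degree-2 monomials
```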

slide-167
SLIDE 167

Outline

1

Numerical Computation

2

Optimization Problems

3

Unconstrained Optimization Gradient Descent Newton’s Method

4

Optimization in ML: Stochastic Gradient Descent Perceptron Adaline Stochastic Gradient Descent

5

Constrained Optimization

6

Optimization in ML: Regularization Linear Regression Polynomial Regression Generalizability & Regularization

7

Duality*

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 67 / 74

slide-172
SLIDE 172

Regularization

There’s another major difference between the ML algorithms and

  • ptimization techniques:

We usually care about the testing performance rather than the training performance

E.g., in classification, we report the testing accuracy

Goal: to learn a function that generalizes to unseen data well Regularization: techniques that improve the generalizability of the learned function How to regularize the linear regression? argmin

w

1 2kyXwk2

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 68 / 74
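For reference before regularizing, the plain objective argmin_w (1/2)‖y − Xw‖^2 has the closed-form solution w = (XᵀX)⁻¹Xᵀy from the normal equations. A minimal NumPy sketch (the data are made up, noise-free so the check is exact):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true  # noise-free target for an exact recovery check

# Normal equations: the gradient of (1/2)||y - Xw||^2 is X^T(Xw - y) = 0
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With no noise and full-rank X, the least-squares fit recovers w_true; on real data this minimizes only the training error, which is what regularization will temper.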

slide-176
SLIDE 176

Regularized Linear Regression

One way to improve the generalizability of f is to make it "flat":

argmin_{w ∈ R^D, b} (1/2) ‖y − (Xw + b1)‖^2  subject to ‖w‖^2 ≤ T
= argmin_{w ∈ R^{D+1}} (1/2) ‖y − Xw‖^2  subject to wᵀSw ≤ T

where S = diag([0, 1, ..., 1]ᵀ) ∈ R^{(D+1)×(D+1)} (b is not regularized)

We will explain why this works later

How to solve this problem? Using the KKT method, we have

argmin_w max_{α, α ≥ 0} L(w, α) = argmin_w max_{α, α ≥ 0} (1/2) ( ‖y − Xw‖^2 + α(wᵀSw − T) )

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 69 / 74

slide-178
SLIDE 178

Alternate Iterative Algorithm

argmin_w max_{α, α ≥ 0} L(w, α) = argmin_w max_{α, α ≥ 0} (1/2) ( ‖y − Xw‖^2 + α(wᵀSw − T) )

Input: an initial guess w^(0) ∈ R^d, α^(0) = 0, δ > 0
repeat
    Solve w^(t+1) = argmin_w L(w; α^(t)) using some iterative algorithm starting at w^(t);
    if w^(t+1)ᵀ w^(t+1) > T then
        α^(t+1) = α^(t) + δ in order to increase L(w^(t+1); α);
    end
until w^(t+1)ᵀ w^(t+1) ≤ T;

We could also solve w^(t+1) analytically from ∂L(w; α^(t))/∂w = 0:

w^(t+1) = ( XᵀX + α^(t)S )⁻¹ Xᵀy

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 70 / 74
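A minimal NumPy sketch of this scheme, using the analytic inner solve w = (XᵀX + αS)⁻¹Xᵀy and checking the constraint wᵀSw ≤ T from the problem statement (the data, δ, and T below are illustrative assumptions, not from the slides):

```python
import numpy as np

def regularized_lr(X, y, T, delta=0.5, max_iter=1000):
    """Alternate iterative scheme: analytic inner solve for w, then grow the
    multiplier alpha until the constraint w^T S w <= T is satisfied."""
    D1 = X.shape[1]                        # D + 1 (first column of X is all ones)
    S = np.diag([0.0] + [1.0] * (D1 - 1))  # the bias w0 is not regularized
    alpha = 0.0
    w = None
    for _ in range(max_iter):
        w = np.linalg.solve(X.T @ X + alpha * S, X.T @ y)
        if w @ S @ w <= T:
            return w, alpha
        alpha += delta                     # tighten the constraint next round
    return w, alpha

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
w_true = np.array([0.5, 1.0, -1.0])
y = X @ w_true + 0.1 * rng.normal(size=50)
w, alpha = regularized_lr(X, y, T=0.1)
```

Since S zeroes out the bias entry, growing α shrinks only the non-bias weights, so wᵀSw eventually falls below T and the loop terminates.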


slide-180
SLIDE 180

Dual Problem

Given a problem (called primal problem): p⇤ = min

x

max

α,β,α0L(x,α,β)

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 72 / 74

slide-181
SLIDE 181

Dual Problem

Given a problem (called primal problem): p⇤ = min

x

max

α,β,α0L(x,α,β)

We define its dual problem as: d⇤ = max

α,β,α0min x L(x,α,β)

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 72 / 74

slide-182
SLIDE 182

Dual Problem

Given a problem (called primal problem): p⇤ = min

x

max

α,β,α0L(x,α,β)

We define its dual problem as: d⇤ = max

α,β,α0min x L(x,α,β)

By the max-min inequality, we have d⇤  p⇤ [Homework]

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 72 / 74

slide-183
SLIDE 183

Dual Problem

Given a problem (called primal problem): p⇤ = min

x

max

α,β,α0L(x,α,β)

We define its dual problem as: d⇤ = max

α,β,α0min x L(x,α,β)

By the max-min inequality, we have d⇤  p⇤ [Homework] (p⇤ d⇤) is called the duality gap

p⇤ and d⇤ are called the primal and dual values, respectively

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 72 / 74
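The max-min inequality can be illustrated numerically on a finite grid of (x, α) values (a sketch assuming NumPy; the random table L is made up, with rows indexing x and columns indexing α):

```python
import numpy as np

rng = np.random.default_rng(3)
L = rng.normal(size=(5, 4))       # L[i, j] = L(x_i, alpha_j) on a finite grid

p_star = L.max(axis=1).min()      # min over x of max over alpha (primal value)
d_star = L.min(axis=0).max()      # max over alpha of min over x (dual value)
# Weak duality d* <= p* holds for any L; p* - d* is the duality gap
```

For any column j, min_i L[i, j] ≤ L[i*, j] ≤ max_j L[i*, j] = p* where i* attains p*, so d* ≤ p* regardless of the entries of L.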

slide-185
SLIDE 185

Strong Duality

Strong duality holds if d* = p*

When will it happen? If the primal problem has a solution and is convex

Why consider the dual problem?

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 73 / 74

slide-188
SLIDE 188

Example

Consider a primal problem:

argmin_{x ∈ R^d} (1/2) ‖x‖^2  subject to Ax ≥ b, A ∈ R^{n×d}
= argmin_x max_{α, α ≥ 0} (1/2) ‖x‖^2 − αᵀ(Ax − b)

Convex, so strong duality holds

We can get the same solution via the dual problem:

argmax_{α, α ≥ 0} min_x (1/2) ‖x‖^2 − αᵀ(Ax − b)

Solving min_x L(x, α) analytically, we have x* = Aᵀα

Substituting this into the dual, we get

argmax_{α, α ≥ 0} −(1/2) ‖Aᵀα‖^2 + bᵀα

We now solve n variables instead of d (beneficial when n ≪ d)

Shan-Hung Wu (CS, NTHU) Numerical Optimization Machine Learning 74 / 74
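A numerical check of this example (a sketch assuming NumPy; A and b are made up, and we assume the unconstrained dual maximizer already satisfies α ≥ 0, so the nonnegativity constraint is inactive): the dual stationarity condition AAᵀα = b gives α*, then x* = Aᵀα*, and the primal and dual values coincide.

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])       # n = 2 constraints, d = 3 variables
b = np.array([2.0, 2.0])

# Unconstrained maximizer of -(1/2)||A^T a||^2 + b^T a solves AA^T a = b;
# here its entries come out nonnegative, so the constraint a >= 0 is inactive
alpha = np.linalg.solve(A @ A.T, b)
x = A.T @ alpha                       # recover the primal solution x* = A^T a

p_val = 0.5 * x @ x                                         # primal objective
d_val = -0.5 * (A.T @ alpha) @ (A.T @ alpha) + b @ alpha    # dual objective
```

The dual works in R^n (2 variables) rather than R^d (3 variables), matching the slide's point that the dual is cheaper when n ≪ d.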