

SLIDE 1

CS/ECE/ISyE 524 Introduction to Optimization Spring 2017–18

10. Regularization

• More on tradeoffs
• Regularization
• Effect of using different norms
• Example: hovercraft revisited

Laurent Lessard (www.laurentlessard.com)

SLIDE 2

Review of tradeoffs

Recap of tradeoffs:

• We want to make both $J_1(x)$ and $J_2(x)$ small, subject to constraints.
• Choose a parameter $\lambda > 0$ and solve

$$\begin{aligned} \underset{x}{\text{minimize}} \quad & J_1(x) + \lambda J_2(x) \\ \text{subject to:} \quad & \text{constraints} \end{aligned}$$

• Each $\lambda > 0$ yields a solution $\hat{x}_\lambda$.
• We can visualize the tradeoff by plotting $J_2(\hat{x}_\lambda)$ vs $J_1(\hat{x}_\lambda)$. This is called the Pareto curve (a code sketch of such a sweep follows below).
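As an illustration of this recipe, here is a minimal Julia sketch that sweeps $\lambda$ and collects Pareto-curve points, using the regularized least-squares tradeoff that appears later in the lecture; the instance ($A$, $b$, and the $\lambda$ grid) is made up:

```julia
using LinearAlgebra

# hypothetical two-cost instance: J1(x) = ‖Ax - b‖², J2(x) = ‖x‖²
A = randn(10, 3)
b = randn(10)

# sweep λ, solve each weighted problem, record (J1(x̂λ), J2(x̂λ))
pareto = Tuple{Float64,Float64}[]
for λ in 10.0 .^ range(-3, 3; length = 25)
    x̂ = (A'*A + λ*I) \ (A'*b)      # minimizer of J1(x) + λ·J2(x)
    push!(pareto, (norm(A*x̂ - b)^2, norm(x̂)^2))
end
# plotting the second coordinate against the first traces out the Pareto curve
```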

SLIDE 3

Multi-objective tradeoff

• A similar procedure applies if we have more than two costs we'd like to make small, e.g. $J_1$, $J_2$, $J_3$.
• Choose parameters $\lambda > 0$ and $\mu > 0$, then solve:

$$\begin{aligned} \underset{x}{\text{minimize}} \quad & J_1(x) + \lambda J_2(x) + \mu J_3(x) \\ \text{subject to:} \quad & \text{constraints} \end{aligned}$$

• Each pair $\lambda > 0$, $\mu > 0$ yields a solution $\hat{x}_{\lambda,\mu}$.
• We can visualize the tradeoff by plotting $J_3(\hat{x}_{\lambda,\mu})$ vs $J_2(\hat{x}_{\lambda,\mu})$ vs $J_1(\hat{x}_{\lambda,\mu})$ on a 3D plot. You then obtain a Pareto surface.

SLIDE 4

Minimum-norm as a regularization

• When $Ax = b$ is underdetermined ($A$ is wide), we can resolve the ambiguity by adding a cost function, e.g. min-norm least squares:

$$\begin{aligned} \underset{x}{\text{minimize}} \quad & \|x\|^2 \\ \text{subject to:} \quad & Ax = b \end{aligned}$$

• Alternative approach: express it as a tradeoff!

$$\underset{x}{\text{minimize}} \quad \|Ax - b\|^2 + \lambda \|x\|^2$$

Tradeoffs of this type are called regularization, and $\lambda$ is called the regularization parameter or regularization weight.

• If we let $\lambda \to \infty$, we just obtain $\hat{x} = 0$.
• If we let $\lambda \to 0$, we obtain the minimum-norm solution!

SLIDE 5

Proof of minimum-norm equivalence

$$\underset{x}{\text{minimize}} \quad \|Ax - b\|^2 + \lambda \|x\|^2$$

Equivalent to the least squares problem:

$$\underset{x}{\text{minimize}} \quad \left\| \begin{bmatrix} A \\ \sqrt{\lambda}\, I \end{bmatrix} x - \begin{bmatrix} b \\ 0 \end{bmatrix} \right\|^2$$

The solution is found via the pseudoinverse (for a tall matrix):

$$\hat{x} = \left( \begin{bmatrix} A \\ \sqrt{\lambda}\, I \end{bmatrix}^{T} \begin{bmatrix} A \\ \sqrt{\lambda}\, I \end{bmatrix} \right)^{-1} \begin{bmatrix} A \\ \sqrt{\lambda}\, I \end{bmatrix}^{T} \begin{bmatrix} b \\ 0 \end{bmatrix} = (A^T A + \lambda I)^{-1} A^T b$$
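This equivalence is easy to sanity-check numerically. A minimal sketch with a made-up wide system (the specific $A$, $b$, and $\lambda$ are arbitrary):

```julia
using LinearAlgebra

# hypothetical wide system: 2 equations, 4 unknowns
A = [1.0 2.0 3.0 4.0;
     2.0 0.0 1.0 1.0]
b = [5.0, 3.0]
λ = 0.1
n = size(A, 2)

# closed form: (AᵀA + λI)⁻¹ Aᵀb
x1 = (A'*A + λ*I) \ (A'*b)

# stacked least squares: minimize ‖[A; √λ I]x - [b; 0]‖²
x2 = [A; sqrt(λ)*I(n)] \ [b; zeros(n)]

@assert x1 ≈ x2    # both formulations give the same solution
```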

SLIDE 6

Proof of minimum-norm equivalence

Solution of the 2-norm regularization is: $\hat{x} = (A^T A + \lambda I)^{-1} A^T b$

• We can't simply set $\lambda \to 0$, because $A$ is wide, and therefore $A^T A$ will not be invertible.
• Instead, use the fact that $A^T A A^T + \lambda A^T$ can be factored two ways:

$$(A^T A + \lambda I) A^T = A^T A A^T + \lambda A^T = A^T (A A^T + \lambda I)$$

Since $(A^T A + \lambda I) A^T = A^T (A A^T + \lambda I)$, multiplying by the inverses on both sides yields:

$$A^T (A A^T + \lambda I)^{-1} = (A^T A + \lambda I)^{-1} A^T$$
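This identity can also be verified numerically; a minimal sketch with a made-up wide matrix:

```julia
using LinearAlgebra

A = randn(2, 4)    # hypothetical wide matrix
λ = 0.5

# both sides of Aᵀ(AAᵀ + λI)⁻¹ = (AᵀA + λI)⁻¹Aᵀ
lhs = A' * inv(A*A' + λ*I)
rhs = inv(A'*A + λ*I) * A'

@assert lhs ≈ rhs
```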

SLIDE 7

Proof of minimum-norm equivalence

Solution of the 2-norm regularization is: $\hat{x} = (A^T A + \lambda I)^{-1} A^T b$. By the identity above, it is also equal to: $\hat{x} = A^T (A A^T + \lambda I)^{-1} b$.

• Since $A A^T$ is invertible, we can take the limit $\lambda \to 0$ by just setting $\lambda = 0$.
• In the limit: $\hat{x} = A^T (A A^T)^{-1} b$. This is the exact solution to the minimum-norm least squares problem we found before!
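A sketch of this limit in code, reusing the hypothetical wide system from the earlier check: as $\lambda$ shrinks, the regularized solution converges to the minimum-norm one.

```julia
using LinearAlgebra

A = [1.0 2.0 3.0 4.0;
     2.0 0.0 1.0 1.0]    # hypothetical wide matrix
b = [5.0, 3.0]

xreg(λ) = A' * ((A*A' + λ*I) \ b)    # regularized solution (wide form)
xmn = A' * ((A*A') \ b)              # minimum-norm solution Aᵀ(AAᵀ)⁻¹b

for λ in (1.0, 1e-2, 1e-4, 1e-6)
    println("λ = $λ:  deviation from min-norm = ", norm(xreg(λ) - xmn))
end
```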

SLIDE 8

Tradeoff visualization

$$\underset{x}{\text{minimize}} \quad \|Ax - b\|^2 + \lambda \|x\|^2$$

The Pareto curve plots $\|x\|^2$ against $\|Ax - b\|^2$ as $\lambda$ varies:

• as $\lambda \to 0$: the curve approaches the endpoint $\left(0, \|A^\dagger b\|^2\right)$ (zero residual; the minimum-norm solution)
• as $\lambda \to \infty$: the curve approaches the endpoint $\left(\|b\|^2, 0\right)$ (i.e. $\hat{x} = 0$)

[Plot: Pareto curve with $\|Ax - b\|^2$ on the horizontal axis and $\|x\|^2$ on the vertical axis, joining these two endpoints.]
SLIDE 9

Regularization

Regularization: an additional penalty term added to the cost function to encourage a solution with desirable properties.

Regularized least squares:

$$\underset{x}{\text{minimize}} \quad \|Ax - b\|^2 + \lambda R(x)$$

• $R(x)$ is the regularizer (penalty function)
• $\lambda$ is the regularization parameter
• The model has different names depending on $R(x)$.

SLIDE 10

Regularization

$$\underset{x}{\text{minimize}} \quad \|Ax - b\|^2 + \lambda R(x)$$

1. If $R(x) = \|x\|_2^2 = x_1^2 + x_2^2 + \cdots + x_n^2$, it is called L2 regularization, Tikhonov regularization, or ridge regression, depending on the application. It has the effect of smoothing the solution.
2. If $R(x) = \|x\|_1 = |x_1| + |x_2| + \cdots + |x_n|$, it is called L1 regularization or LASSO. It has the effect of sparsifying the solution ($\hat{x}$ will have few nonzero entries); see the sketch after this list.
3. If $R(x) = \|x\|_\infty = \max\{|x_1|, |x_2|, \ldots, |x_n|\}$, it is called L∞ regularization, and it has the effect of equalizing the solution (makes most components equal).
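As an illustration of case 2, a minimal JuMP sketch of L1-regularized least squares, with each $|x_i|$ modeled by an auxiliary variable; the data, the value of $\lambda$, and the solver choice (Ipopt) are all hypothetical:

```julia
using JuMP, Ipopt

# hypothetical data
A = [1.0 2.0 3.0 4.0;
     2.0 0.0 1.0 1.0]
b = [5.0, 3.0]
λ = 0.1
n = size(A, 2)

model = Model(Ipopt.Optimizer)
set_silent(model)
@variable(model, x[1:n])
@variable(model, t[1:n] >= 0)     # t_i equals |x_i| at the optimum
@constraint(model,  x .<= t)      # t_i ≥ x_i
@constraint(model, -t .<= x)      # t_i ≥ -x_i
@objective(model, Min, sum((A*x .- b).^2) + λ*sum(t))
optimize!(model)
value.(x)    # small entries are pushed (numerically) to zero as λ grows
```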

SLIDE 11

Norm balls

For a norm $\|\cdot\|_p$, the norm ball of radius $r$ is the set:

$$B_r = \{x \in \mathbb{R}^n \mid \|x\|_p \le r\}$$

[Plots: the three unit balls in $\mathbb{R}^2$. $\|x\|_2 \le 1$ ($x^2 + y^2 \le 1$) is a disk; $\|x\|_1 \le 1$ ($|x| + |y| \le 1$) is a diamond; $\|x\|_\infty \le 1$ ($\max\{|x|, |y|\} \le 1$) is a square.]

SLIDE 12

Simple example

Consider the minimum-norm problem for different norms:

$$\begin{aligned} \underset{x}{\text{minimize}} \quad & \|x\|_p \\ \text{subject to:} \quad & Ax = b \end{aligned}$$

• The set of solutions to $Ax = b$ is an affine subspace.
• The solution is the point of that subspace belonging to the smallest norm ball.
• For $p = 2$, this occurs at the perpendicular distance from the origin.

[Plot: a line of solutions to $Ax = b$, tangent to the smallest 2-norm ball at the foot of the perpendicular from the origin.]

SLIDE 13

Simple example

• For $p = 1$, this occurs at one of the axes.
• Sparsifying behavior.

[Plot: the same line, touching the smallest 1-norm ball (diamond) at a point on a coordinate axis.]

• For $p = \infty$, this occurs at equal values of the coordinates.
• Equalizing behavior.

[Plot: the same line, touching the smallest ∞-norm ball (square) at a corner, where $|x| = |y|$.]

SLIDE 14

Another simple example

Suppose we have data points $\{y_1, \ldots, y_m\} \subset \mathbb{R}$, and we would like to find the best estimator for the data according to different norms. Suppose the data is sorted: $y_1 \le \cdots \le y_m$.

$$\underset{x}{\text{minimize}} \quad \left\| \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix} - \begin{bmatrix} x \\ \vdots \\ x \end{bmatrix} \right\|_p$$

• $p = 2$: $\hat{x} = \frac{1}{m}(y_1 + \cdots + y_m)$. This is the mean of the data.
• $p = 1$: $\hat{x} = y_{\lceil m/2 \rceil}$. This is the median of the data.
• $p = \infty$: $\hat{x} = \frac{1}{2}(y_1 + y_m)$. This is the mid-range of the data.

Julia demo: Data Norm.ipynb
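As a quick stand-in for the demo notebook, a brute-force sketch that recovers all three estimators on made-up data (the data values and grid resolution are arbitrary):

```julia
using Statistics

y = [1.0, 2.0, 3.5, 7.0, 10.0]    # hypothetical sorted data

# p-norm objective (for finite p we can minimize the p-th power instead)
obj(x, p) = p == Inf ? maximum(abs.(y .- x)) : sum(abs.(y .- x).^p)

# brute-force minimizer over a fine grid
xs = range(minimum(y), maximum(y); length = 100_001)
best(p) = xs[argmin([obj(x, p) for x in xs])]

println("p = 2: ", best(2),   "  vs mean:      ", mean(y))
println("p = 1: ", best(1),   "  vs median:    ", median(y))
println("p = ∞: ", best(Inf), "  vs mid-range: ", (minimum(y) + maximum(y))/2)
```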

SLIDE 15

Example: hovercraft revisited

One-dimensional version of the hovercraft problem:

• Start at $x_1 = 0$ with $v_1 = 0$ (at rest at position zero).
• Finish at $x_{50} = 100$ with $v_{50} = 0$ (at rest at position 100).
• Same simple dynamics as before:

$$x_{t+1} = x_t + v_t, \qquad v_{t+1} = v_t + u_t \qquad \text{for } t = 1, 2, \ldots, 49$$

• Decide thruster inputs $u_1, u_2, \ldots, u_{49}$.
• This time: minimize $\|u\|_p$.

SLIDE 16

Example: hovercraft revisited

$$\begin{aligned} \underset{x_t,\, v_t,\, u_t}{\text{minimize}} \quad & \|u\|_p \\ \text{subject to:} \quad & x_{t+1} = x_t + v_t \quad \text{for } t = 1, \ldots, 49 \\ & v_{t+1} = v_t + u_t \quad \text{for } t = 1, \ldots, 49 \\ & x_1 = 0, \quad x_{50} = 100 \\ & v_1 = 0, \quad v_{50} = 0 \end{aligned}$$

• This model has 149 variables (50 positions, 50 velocities, and 49 thrusts), but it is very easy to understand; a direct transcription is sketched below.
• We can simplify the model considerably...
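A direct JuMP transcription of the model above, shown for $p = 2$; the solver choice is hypothetical (the course demo Hover 1D.ipynb presumably does something similar):

```julia
using JuMP, Ipopt

T = 50
model = Model(Ipopt.Optimizer)
set_silent(model)
@variable(model, x[1:T])       # positions
@variable(model, v[1:T])       # velocities
@variable(model, u[1:T-1])     # thruster inputs
@constraint(model, [t = 1:T-1], x[t+1] == x[t] + v[t])    # dynamics
@constraint(model, [t = 1:T-1], v[t+1] == v[t] + u[t])
@constraint(model, x[1] == 0)
@constraint(model, x[T] == 100)
@constraint(model, v[1] == 0)
@constraint(model, v[T] == 0)
@objective(model, Min, sum(u[t]^2 for t = 1:T-1))    # p = 2: minimize ‖u‖₂²
optimize!(model)
u_opt = value.(u)
```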

SLIDE 17

Model simplification

$$x_{t+1} = x_t + v_t, \qquad v_{t+1} = v_t + u_t \qquad \text{for } t = 1, 2, \ldots, 49$$

Unrolling the velocity recursion:

$$v_{50} = v_{49} + u_{49} = v_{48} + u_{48} + u_{49} = \cdots = v_1 + (u_1 + u_2 + \cdots + u_{49})$$

SLIDE 18

Model simplification

$$x_{t+1} = x_t + v_t, \qquad v_{t+1} = v_t + u_t \qquad \text{for } t = 1, 2, \ldots, 49$$

Unrolling the position recursion:

$$x_{50} = x_{49} + v_{49} = x_{48} + 2v_{48} + u_{48} = x_{47} + 3v_{47} + 2u_{47} + u_{48} = \cdots = x_1 + 49 v_1 + (48 u_1 + 47 u_2 + \cdots + 2 u_{47} + u_{48})$$
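Both unrolled formulas are easy to sanity-check by simulating the dynamics with arbitrary inputs; a minimal sketch (the random inputs are only for testing):

```julia
T = 50
u = randn(T - 1)               # arbitrary thruster inputs
x = zeros(T); v = zeros(T)     # start at rest at the origin
for t in 1:T-1                 # simulate the dynamics
    x[t+1] = x[t] + v[t]
    v[t+1] = v[t] + u[t]
end

# unrolled formulas for v50 and x50
@assert v[T] ≈ v[1] + sum(u)
@assert x[T] ≈ x[1] + 49*v[1] + sum((49 - t)*u[t] for t in 1:49)
```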

SLIDE 19

Model simplification

$$x_{t+1} = x_t + v_t, \qquad v_{t+1} = v_t + u_t \qquad \text{for } t = 1, 2, \ldots, 49$$

Using the unrolled expressions, the constraints can be rewritten as:

$$\begin{bmatrix} 48 & 47 & \cdots & 2 & 1 & 0 \\ 1 & 1 & \cdots & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_{49} \end{bmatrix} = \begin{bmatrix} x_{50} - x_1 - 49 v_1 \\ v_{50} - v_1 \end{bmatrix}$$

(Note that $u_{49}$ does not affect $x_{50}$, hence the zero in the first row.)

• So we don't need the intermediate variables $x_t$ and $v_t$!

Julia demo: Hover 1D.ipynb
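With the reduced constraints, the $p = 2$ case becomes exactly the minimum-norm least squares problem from earlier, so it has a closed-form solution. A minimal sketch (the plotting done in the notebook is omitted):

```julia
using LinearAlgebra

# reduced 2×49 constraint matrix from above
A = vcat([collect(48.0:-1:1); 0.0]',   # row 1: 48, 47, ..., 1, 0
         ones(1, 49))                  # row 2: all ones
b = [100.0, 0.0]    # [x50 - x1 - 49v1, v50 - v1] for the given boundary conditions

# minimum ‖u‖₂ thrust profile: u = Aᵀ(AAᵀ)⁻¹b
u = A' * ((A * A') \ b)
```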

SLIDE 20

Results

1. Minimizing $\|u\|_2^2$ (smooth)
   [Plot: thrust vs. time; a smooth profile varying between about $-0.3$ and $0.3$.]
2. Minimizing $\|u\|_1$ (sparse)
   [Plot: thrust vs. time; mostly zero, with a few large thrusts between about $-3$ and $3$.]
3. Minimizing $\|u\|_\infty$ (equalized)
   [Plot: thrust vs. time; thrusts held at about $\pm 0.2$.]
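The sparse and equalized cases are linear programs. As an illustration, a sketch of the $\|u\|_\infty$ case using the reduced constraints from slide 19 (the solver choice is hypothetical; any LP-capable solver works):

```julia
using JuMP, Ipopt

# reduced constraints from slide 19
A = vcat([collect(48.0:-1:1); 0.0]', ones(1, 49))
b = [100.0, 0.0]

model = Model(Ipopt.Optimizer)
set_silent(model)
@variable(model, u[1:49])
@variable(model, s >= 0)          # s equals ‖u‖∞ at the optimum
@constraint(model, A*u .== b)
@constraint(model,  u .<= s)      # |u_t| ≤ s for all t
@constraint(model, -s .<= u)
@objective(model, Min, s)
optimize!(model)
value.(u)    # equalized thrust profile
```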

SLIDE 21

Tradeoff studies

1. Minimizing $\|u\|_2^2 + \lambda \|u\|_1$ (smooth and sparse)
   [Plot: thrust vs. time, between about $-0.4$ and $0.4$.]
2. Minimizing $\|u\|_\infty + \lambda \|u\|_1$ (equalized and sparse)
   [Plot: thrust vs. time, between about $-0.6$ and $0.6$.]
3. Minimizing $\|u\|_2^2 + \lambda \|u\|_\infty$ (equalized and smooth)
   [Plot: thrust vs. time, between about $-0.3$ and $0.3$.]
