

slide-1
SLIDE 1

Nondifferentiable Convex Functions

DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis

http://www.cims.nyu.edu/~cfgranda/pages/OBDA_fall17/index.html
Carlos Fernandez-Granda

slide-2
SLIDE 2

Applications Subgradients Optimization methods

slide-3
SLIDE 3

Regression

The aim is to learn a function h that relates

◮ a response or dependent variable y
◮ several observed variables x1, x2, . . . , xp, known as covariates, features or independent variables

The response is assumed to be of the form

y = h(x) + z

where x ∈ Rp contains the features and z is noise

slide-4
SLIDE 4

Linear regression

The regression function h is assumed to be linear:

y(i) = (x(i))T β∗ + z(i),   1 ≤ i ≤ n

Our aim is to estimate β∗ ∈ Rp from the data

slide-5
SLIDE 5

Linear regression

In matrix form

[ y(1) ]   [ x(1)1  x(1)2  · · ·  x(1)p ] [ β∗1 ]   [ z(1) ]
[ y(2) ] = [ x(2)1  x(2)2  · · ·  x(2)p ] [ β∗2 ] + [ z(2) ]
[ · · · ]   [ · · ·    · · ·   · · ·   · · · ] [ · · · ]   [ · · · ]
[ y(n) ]   [ x(n)1  x(n)2  · · ·  x(n)p ] [ β∗p ]   [ z(n) ]

Equivalently,

y = X β∗ + z

slide-6
SLIDE 6

Sparse linear regression

Only a subset of the features is relevant (a model selection problem)

Two objectives:

◮ Good fit to the data: ||X β − y||₂² should be as small as possible
◮ Use a small number of features: β should be as sparse as possible

slide-7
SLIDE 7

Sparse linear regression

If only two features j and l are relevant,

[ y(1) ]   [ x(1)j  x(1)l ]           [ z(1) ]
[ y(2) ] = [ x(2)j  x(2)l ] [ β∗j ] + [ z(2) ]
[ · · · ]   [ · · ·    · · ·  ] [ β∗l ]   [ · · · ]
[ y(n) ]   [ x(n)j  x(n)l ]           [ z(n) ]

           [ x(1)1  · · ·  x(1)j  · · ·  x(1)l  · · ·  x(1)p ] [ · · · ]   [ z(1) ]
         = [ x(2)1  · · ·  x(2)j  · · ·  x(2)l  · · ·  x(2)p ] [ β∗j ] + [ z(2) ]
           [ · · ·                                       · · · ] [ · · · ]   [ · · · ]
           [ x(n)1  · · ·  x(n)j  · · ·  x(n)l  · · ·  x(n)p ] [ β∗l ]   [ z(n) ]
                                                               [ · · · ]

         = X β∗ + z

where β∗ is zero except in entries j and l

slide-8
SLIDE 8

Sparse linear regression with 2 features

y := α x1 + z

X := [ x1  x2 ]

||x1||₂ = 1,   ||x2||₂ = 1,   ⟨x1, x2⟩ = ρ

slide-9
SLIDE 9

Least squares: not sparse

βLS = (XTX)⁻¹ XT y

    = [ 1  ρ ]⁻¹ [ x1T y ]
      [ ρ  1 ]   [ x2T y ]

    = 1/(1 − ρ²) [  1  −ρ ] [ α + x1T z  ]
                 [ −ρ   1 ] [ αρ + x2T z ]

    = [ α ] + 1/(1 − ρ²) [ ⟨x1 − ρ x2, z⟩ ]
      [ 0 ]              [ ⟨x2 − ρ x1, z⟩ ]
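
To make this concrete, here is a small numerical sketch (not from the slides; the sample size, ρ, and noise level are arbitrary choices): with two correlated unit-norm features and y = α x1 + z, the least-squares estimate generally has two nonzero coefficients even though only x1 is relevant.

# Illustrative sketch only; rho, n, and the noise level are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
n, alpha, rho = 100, 1.0, 0.9

x1 = rng.standard_normal(n)
x1 /= np.linalg.norm(x1)                          # ||x1||_2 = 1
x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)
x2 /= np.linalg.norm(x2)                          # ||x2||_2 = 1, <x1, x2> close to rho

y = alpha * x1 + 0.01 * rng.standard_normal(n)    # only x1 is relevant
X = np.column_stack([x1, x2])

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_ls)                                    # both entries nonzero in general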

slide-10
SLIDE 10

The lasso

Idea: Use ℓ1-norm regularization to promote sparse coefficients

βlasso := arg minβ  ½ ||y − X β||₂² + λ ||β||₁
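
The following sketch (not part of the slides) runs the lasso on data with only two relevant features out of ten, using scikit-learn's Lasso; note that scikit-learn scales the quadratic term by 1/(2n), so its alpha plays the role of λ only up to that factor.

# Sketch assuming scikit-learn is available; alpha corresponds to lambda up to a 1/n factor.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:2] = [1.0, 0.5]                        # only two relevant features
y = X @ beta_true + 0.1 * rng.standard_normal(n)

for lam in [1e-3, 1e-2, 1e-1, 1.0]:
    coef = Lasso(alpha=lam).fit(X, y).coef_
    print(lam, np.round(coef, 3))                 # coefficients become sparser as lam grows
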
slide-11
SLIDE 11

Nonnegative weighted sums

The weighted sum of m convex functions f1, . . . , fm,

f := ∑_{i=1}^m αi fi,

is convex if α1, . . . , αm ∈ R are nonnegative

slide-12
SLIDE 12

Nonnegative weighted sums

The weighted sum of m convex functions f1, . . . , fm f :=

m

  • i=1

αi fi is convex if α1, . . . , α ∈ R are nonnegative Proof: f (θ x + (1 − θ) y)

slide-13
SLIDE 13

Nonnegative weighted sums

The weighted sum of m convex functions f1, . . . , fm f :=

m

  • i=1

αi fi is convex if α1, . . . , α ∈ R are nonnegative Proof: f (θ x + (1 − θ) y) =

m

  • i=1

αi fi (θ x + (1 − θ) y)

slide-14
SLIDE 14

Nonnegative weighted sums

The weighted sum of m convex functions f1, . . . , fm f :=

m

  • i=1

αi fi is convex if α1, . . . , α ∈ R are nonnegative Proof: f (θ x + (1 − θ) y) =

m

  • i=1

αi fi (θ x + (1 − θ) y) ≤

m

  • i=1

αi (θfi ( x) + (1 − θ) fi ( y))

slide-15
SLIDE 15

Nonnegative weighted sums

The weighted sum of m convex functions f1, . . . , fm,

f := ∑_{i=1}^m αi fi,

is convex if α1, . . . , αm ∈ R are nonnegative

Proof:

f(θ x + (1 − θ) y) = ∑_{i=1}^m αi fi(θ x + (1 − θ) y)
                   ≤ ∑_{i=1}^m αi ( θ fi(x) + (1 − θ) fi(y) )
                   = θ f(x) + (1 − θ) f(y)

slide-16
SLIDE 16

Regularized least-squares

Regularized least-squares cost functions of the form

||A x − y||₂² + λ ||x||

are convex

slide-17
SLIDE 17

It works

[Figure: lasso coefficients as a function of the regularization parameter]

slide-18
SLIDE 18

Ridge regression doesn’t work

[Figure: ridge-regression coefficients as a function of the regularization parameter]

slide-19
SLIDE 19

Prostate cancer data set

◮ 8 features (age, weight, analysis results)
◮ Response: Prostate-specific antigen (PSA), associated to cancer
◮ Training set: 60 patients
◮ Test set: 37 patients

slide-20
SLIDE 20

Prostate cancer data set

[Figure: coefficient values, training loss, and test loss as a function of λ]

slide-21
SLIDE 21

Principal component analysis

Given n data vectors x1, x2, . . . , xn ∈ Rd,

1. Center the data: ci = xi − av(x1, x2, . . . , xn), 1 ≤ i ≤ n
2. Group the centered data as columns of a matrix C = [ c1  c2  · · ·  cn ]
3. Compute the SVD of C

The left singular vectors are the principal directions
The principal values are the coefficients of the centered vectors in the basis of principal directions.
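
A minimal numpy sketch of these three steps (the toy data matrix is an arbitrary choice for illustration):

# PCA via the SVD of the centered data matrix; columns of X are the data vectors.
import numpy as np

X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.1, 5.9, 8.0],
              [0.5, 1.0, 1.4, 2.1]])              # d = 3, n = 4 (toy data)

C = X - X.mean(axis=1, keepdims=True)             # 1. center, 2. group as columns of C
U, s, Vt = np.linalg.svd(C, full_matrices=False)  # 3. SVD of C

principal_directions = U                          # left singular vectors
coefficients = U.T @ C                            # coefficients in the principal-direction basis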

slide-22
SLIDE 22

Example

C := [ −2 −1 0 1 2
       −2 −1 0 1 2
       −2 −1 0 1 2 ]

slide-23
SLIDE 23

Principal component analysis

slide-24
SLIDE 24

Example

C := [ −2 −1 5 1 2
       −2 −1 0 1 2
       −2 −1 0 1 2 ]

slide-25
SLIDE 25

Principal component analysis

slide-26
SLIDE 26

Outliers

Problem: Outliers distort principal directions

Model: Data equals low-rank component + sparse component, Y = L + S

Idea: Fit model to data, then apply PCA to L

slide-27
SLIDE 27

Robust PCA

Data: Y ∈ Rn×m

Robust PCA estimator of the low-rank component:

LRPCA := arg min_L  ||L||∗ + λ ||Y − L||₁

where λ > 0 is a regularization parameter and ||·||₁ is the ℓ1 norm of the vectorized matrix

Robust PCA estimator of the sparse component: SRPCA := Y − LRPCA
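
Because the cost function is convex, the estimator can be computed with a generic convex solver. A minimal sketch using cvxpy (the solver choice and the toy matrix are assumptions, not part of the slides):

# Sketch assuming cvxpy is installed; Y is a toy data matrix with one outlier.
import numpy as np
import cvxpy as cp

Y = np.outer(np.ones(3), np.array([-2.0, -1.0, 0.0, 1.0, 2.0]))  # rank-1 toy matrix
Y[0, 2] += 5.0                                                    # sparse corruption
lam = 1.0 / np.sqrt(Y.shape[0])

L = cp.Variable(Y.shape)
cost = cp.norm(L, "nuc") + lam * cp.sum(cp.abs(Y - L))            # ||L||_* + lam ||Y - L||_1
cp.Problem(cp.Minimize(cost)).solve()

L_rpca = L.value
S_rpca = Y - L_rpca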

slide-28
SLIDE 28

Example

[Figure: entries of the low-rank and sparse estimates as a function of λ]

slide-29
SLIDE 29

λ = 1/√n

[Figure: estimated low-rank component L and sparse component S]

slide-30
SLIDE 30

Large λ

L S

slide-31
SLIDE 31

Small λ

L S

slide-32
SLIDE 32

Background subtraction


slide-33
SLIDE 33

Background subtraction

Matrix with vectorized frames as columns

Static image: Y = [ x  x  · · ·  x ] = x [ 1  1  · · ·  1 ]

Slowly varying background: Low-rank
Rapidly varying foreground: Sparse

slide-34
SLIDE 34

Frame 17


slide-35
SLIDE 35

Low-rank component


slide-36
SLIDE 36

Sparse component


slide-37
SLIDE 37

Frame 42


slide-38
SLIDE 38

Low-rank component


slide-39
SLIDE 39

Sparse component


slide-40
SLIDE 40

Frame 75


slide-41
SLIDE 41

Low-rank component


slide-42
SLIDE 42

Sparse component


slide-43
SLIDE 43

Applications Subgradients Optimization methods

slide-44
SLIDE 44

Gradient

A differentiable function f : Rn → R is convex if and only if for every x, y ∈ Rn

f(y) ≥ f(x) + ∇f(x)T (y − x)

slide-45
SLIDE 45

Gradient

x f ( y)

slide-46
SLIDE 46

Subgradient

The subgradient of f : Rn → R at x ∈ Rn is a vector g ∈ Rn such that

f(y) ≥ f(x) + gT (y − x),   for all y ∈ Rn

Geometrically, the hyperplane

Hg := { y | y[n + 1] = gT ( y[1], . . . , y[n] ) }

is a supporting hyperplane of the epigraph at x

The set of all subgradients at x is called the subdifferential

slide-47
SLIDE 47

Subgradients

slide-48
SLIDE 48

Subgradient of differentiable function

If a function is differentiable, the only subgradient at each point is the gradient

slide-49
SLIDE 49

Proof

Assume g is a subgradient at x. For any α ≥ 0,

f(x + α ei) ≥ f(x) + gT (α ei) = f(x) + g[i] α
f(x − α ei) ≥ f(x) − gT (α ei) = f(x) − g[i] α

slide-50
SLIDE 50

Proof

Assume g is a subgradient at x. For any α ≥ 0,

f(x + α ei) ≥ f(x) + gT (α ei) = f(x) + g[i] α
f(x − α ei) ≥ f(x) − gT (α ei) = f(x) − g[i] α

Combining both inequalities,

( f(x) − f(x − α ei) ) / α  ≤  g[i]  ≤  ( f(x + α ei) − f(x) ) / α

slide-51
SLIDE 51

Proof

Assume g is a subgradient at x. For any α ≥ 0,

f(x + α ei) ≥ f(x) + gT (α ei) = f(x) + g[i] α
f(x − α ei) ≥ f(x) − gT (α ei) = f(x) − g[i] α

Combining both inequalities,

( f(x) − f(x − α ei) ) / α  ≤  g[i]  ≤  ( f(x + α ei) − f(x) ) / α

Letting α → 0 implies g[i] = ∂f(x) / ∂x[i]

slide-52
SLIDE 52

Subgradient

A function f : Rn → R is convex if and only if it has a subgradient at every point

It is strictly convex if and only if for all x ∈ Rn there exists g ∈ Rn such that

f(y) > f(x) + gT (y − x),   for all y ≠ x

slide-53
SLIDE 53

Optimality condition for nondifferentiable functions

If 0 is a subgradient of f at x, then f ( y) ≥ f ( x) + 0T ( y − x)

slide-54
SLIDE 54

Optimality condition for nondifferentiable functions

If 0 is a subgradient of f at x, then f ( y) ≥ f ( x) + 0T ( y − x) = f ( x) for all y ∈ Rn

slide-55
SLIDE 55

Optimality condition for nondifferentiable functions

If 0 is a subgradient of f at x, then f ( y) ≥ f ( x) + 0T ( y − x) = f ( x) for all y ∈ Rn Under strict convexity the minimum is unique

slide-56
SLIDE 56

Sum of subgradients

Let g1 and g2 be subgradients at x ∈ Rn of f1 : Rn → R and f2 : Rn → R

g := g1 + g2 is a subgradient of f := f1 + f2 at x

slide-57
SLIDE 57

Sum of subgradients

Let g1 and g2 be subgradients at x ∈ Rn of f1 : Rn → R and f2 : Rn → R

  • g :=

g1+ g2 is a subgradient of f := f1 + f2 at x Proof: For any y ∈ Rn f ( y) = f1 ( y) + f2 ( y)

slide-58
SLIDE 58

Sum of subgradients

Let g1 and g2 be subgradients at x ∈ Rn of f1 : Rn → R and f2 : Rn → R

  • g :=

g1+ g2 is a subgradient of f := f1 + f2 at x Proof: For any y ∈ Rn f ( y) = f1 ( y) + f2 ( y) ≥ f1 ( x) + g T

1 (

y − x) + f2 ( y) + g T

2 (

y − x)

slide-59
SLIDE 59

Sum of subgradients

Let g1 and g2 be subgradients at x ∈ Rn of f1 : Rn → R and f2 : Rn → R

g := g1 + g2 is a subgradient of f := f1 + f2 at x

Proof: For any y ∈ Rn

f(y) = f1(y) + f2(y)
     ≥ f1(x) + g1T (y − x) + f2(x) + g2T (y − x)
     = f(x) + gT (y − x)

slide-60
SLIDE 60

Subgradient of scaled function

Let g1 be a subgradient at x ∈ Rn of f1 : Rn → R

For any η ≥ 0, g2 := η g1 is a subgradient of f2 := η f1 at x

slide-61
SLIDE 61

Subgradient of scaled function

Let g1 be a subgradient at x ∈ Rn of f1 : Rn → R For any η ≥ 0 g2 := η g1 is a subgradient of f2 := ηf1 at x Proof: For any y ∈ Rn f2 ( y) = ηf1 ( y)

slide-62
SLIDE 62

Subgradient of scaled function

Let g1 be a subgradient at x ∈ Rn of f1 : Rn → R For any η ≥ 0 g2 := η g1 is a subgradient of f2 := ηf1 at x Proof: For any y ∈ Rn f2 ( y) = ηf1 ( y) ≥ η

  • f1 (

x) + g T

1 (

y − x)

slide-63
SLIDE 63

Subgradient of scaled function

Let g1 be a subgradient at x ∈ Rn of f1 : Rn → R

For any η ≥ 0, g2 := η g1 is a subgradient of f2 := η f1 at x

Proof: For any y ∈ Rn

f2(y) = η f1(y) ≥ η ( f1(x) + g1T (y − x) ) = f2(x) + g2T (y − x)

slide-64
SLIDE 64

Subdifferential of absolute value

f(x) = |x|

slide-65
SLIDE 65

Subdifferential of absolute value

At x ≠ 0, f(x) = |x| is differentiable, so g = sign(x)

At x = 0, we need f(0 + y) ≥ f(0) + g (y − 0)

slide-66
SLIDE 66

Subdifferential of absolute value

At x ≠ 0, f(x) = |x| is differentiable, so g = sign(x)

At x = 0, we need f(0 + y) ≥ f(0) + g (y − 0), i.e. |y| ≥ g y

slide-67
SLIDE 67

Subdifferential of absolute value

At x ≠ 0, f(x) = |x| is differentiable, so g = sign(x)

At x = 0, we need f(0 + y) ≥ f(0) + g (y − 0), i.e. |y| ≥ g y

This holds if and only if |g| ≤ 1

slide-68
SLIDE 68

Subdifferential of ℓ1 norm

g is a subgradient of the ℓ1 norm at x ∈ Rn if and only if

g[i] = sign(x[i])   if x[i] ≠ 0
|g[i]| ≤ 1          if x[i] = 0
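
A quick numerical check of this characterization (not from the slides): build g = sign(x) with arbitrary values in [−1, 1] on the zero entries and verify the subgradient inequality on random points.

# Verify ||y||_1 >= ||x||_1 + g^T (y - x) for a subgradient g of the l1 norm at x.
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.5, 0.0, -0.3, 0.0])
g = np.sign(x)
g[x == 0] = rng.uniform(-1.0, 1.0, size=int(np.sum(x == 0)))  # any values in [-1, 1]

for _ in range(1000):
    y = rng.standard_normal(x.size)
    assert np.abs(y).sum() >= np.abs(x).sum() + g @ (y - x) - 1e-12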

slide-69
SLIDE 69

Proof

g is a subgradient of ||·||₁ at x if and only if g[i] is a subgradient of |·| at x[i] for all 1 ≤ i ≤ n

slide-70
SLIDE 70

Proof

If g is a subgradient of ||·||1 at x then for any y ∈ R |y| = | x[i]| + || x + (y − x[i]) ei||1 − || x||1

slide-71
SLIDE 71

Proof

If g is a subgradient of ||·||1 at x then for any y ∈ R |y| = | x[i]| + || x + (y − x[i]) ei||1 − || x||1 ≥ | x[i]| + || x||1 + g T (y − x[i]) ei − || x||1

slide-72
SLIDE 72

Proof

If g is a subgradient of ||·||₁ at x then for any y ∈ R

|y| = |x[i]| + ||x + (y − x[i]) ei||₁ − ||x||₁
    ≥ |x[i]| + ||x||₁ + gT (y − x[i]) ei − ||x||₁
    = |x[i]| + g[i] (y − x[i])

so g[i] is a subgradient of |·| at x[i] for all 1 ≤ i ≤ n

slide-73
SLIDE 73

Proof

If g[i] is a subgradient of |·| at | x[i]| for 1 ≤ i ≤ n then for any y ∈ Rn || y||1 =

n

  • i=1

| y [i]|

slide-74
SLIDE 74

Proof

If g[i] is a subgradient of |·| at | x[i]| for 1 ≤ i ≤ n then for any y ∈ Rn || y||1 =

n

  • i=1

| y [i]| ≥

n

  • i=1

| x[i]| + g[i] ( y [i] − x[i])

slide-75
SLIDE 75

Proof

If g[i] is a subgradient of |·| at x[i] for 1 ≤ i ≤ n then for any y ∈ Rn

||y||₁ = ∑_{i=1}^n |y[i]|
       ≥ ∑_{i=1}^n ( |x[i]| + g[i] (y[i] − x[i]) )
       = ||x||₁ + gT (y − x)

so g is a subgradient of ||·||₁ at x

slide-76
SLIDE 76

Subdifferential of ℓ1 norm

slide-77
SLIDE 77

Subdifferential of ℓ1 norm

slide-78
SLIDE 78

Subdifferential of ℓ1 norm

slide-79
SLIDE 79

Subdifferential of the nuclear norm

Let X ∈ Rm×n be a rank-r matrix with SVD USV T, where U ∈ Rm×r, V ∈ Rn×r and S ∈ Rr×r

A matrix G is a subgradient of the nuclear norm at X if and only if

G := UV T + W

where W satisfies

||W|| ≤ 1,   UTW = 0,   WV = 0

slide-80
SLIDE 80

Proof

By Pythagoras’ theorem, for any x ∈ Rn with unit ℓ2 norm we have

||Prow(X) x||₂² + ||Prow(X)⊥ x||₂² = ||x||₂² = 1

slide-81
SLIDE 81

Proof

By Pythagoras’ Theorem, for any x ∈ Rm with unit ℓ2 norm we have

  • Prow(X)

x

  • 2

2 +

  • Prow(X)⊥

x

  • 2

2 = ||

x||2

2 = 1

The rows of UV T are in row (X) and the rows of W in row (X)⊥, so ||G||2 := max {||

x||2=1 | x∈Rn}

||G x||2

2

slide-82
SLIDE 82

Proof

By Pythagoras’ Theorem, for any x ∈ Rm with unit ℓ2 norm we have

  • Prow(X)

x

  • 2

2 +

  • Prow(X)⊥

x

  • 2

2 = ||

x||2

2 = 1

The rows of UV T are in row (X) and the rows of W in row (X)⊥, so ||G||2 := max {||

x||2=1 | x∈Rn}

||G x||2

2

= max {||

x||2=1 | x∈Rn}

  • UV T

x

  • 2

2 + ||W

x||2

2

slide-83
SLIDE 83

Proof

By Pythagoras’ Theorem, for any x ∈ Rm with unit ℓ2 norm we have

  • Prow(X)

x

  • 2

2 +

  • Prow(X)⊥

x

  • 2

2 = ||

x||2

2 = 1

The rows of UV T are in row (X) and the rows of W in row (X)⊥, so ||G||2 := max {||

x||2=1 | x∈Rn}

||G x||2

2

= max {||

x||2=1 | x∈Rn}

  • UV T

x

  • 2

2 + ||W

x||2

2

= max {||

x||2=1 | x∈Rn}

  • UV T Prow(X)

x

  • 2

2 +

  • W Prow(X)⊥

x

  • 2

2

slide-84
SLIDE 84

Proof

By Pythagoras’ Theorem, for any x ∈ Rm with unit ℓ2 norm we have

  • Prow(X)

x

  • 2

2 +

  • Prow(X)⊥

x

  • 2

2 = ||

x||2

2 = 1

The rows of UV T are in row (X) and the rows of W in row (X)⊥, so ||G||2 := max {||

x||2=1 | x∈Rn}

||G x||2

2

= max {||

x||2=1 | x∈Rn}

  • UV T

x

  • 2

2 + ||W

x||2

2

= max {||

x||2=1 | x∈Rn}

  • UV T Prow(X)

x

  • 2

2 +

  • W Prow(X)⊥

x

  • 2

2

  • UV T
  • 2
  • Prow(X)

x

  • 2

2 + ||W ||2

  • Prow(X)⊥

x

  • 2

2

slide-85
SLIDE 85

Proof

By Pythagoras’ theorem, for any x ∈ Rn with unit ℓ2 norm we have

||Prow(X) x||₂² + ||Prow(X)⊥ x||₂² = ||x||₂² = 1

The rows of UV T are in row(X) and the rows of W in row(X)⊥, so

||G||² = max_{||x||₂=1, x ∈ Rn} ||G x||₂²
       = max_{||x||₂=1, x ∈ Rn} ( ||UV T x||₂² + ||W x||₂² )
       = max_{||x||₂=1, x ∈ Rn} ( ||UV T Prow(X) x||₂² + ||W Prow(X)⊥ x||₂² )
       ≤ max_{||x||₂=1, x ∈ Rn} ( ||UV T||² ||Prow(X) x||₂² + ||W||² ||Prow(X)⊥ x||₂² )
       ≤ 1

slide-86
SLIDE 86

Hölder’s inequality for matrices

For any matrix A ∈ Rm×n,

||A||∗ = sup_{||B|| ≤ 1, B ∈ Rm×n} ⟨A, B⟩

slide-87
SLIDE 87

Proof

For any matrix Y ∈ Rm×n

||Y||∗ ≥ ⟨G, Y⟩ = ⟨G, X⟩ + ⟨G, Y − X⟩ = ⟨UV T, X⟩ + ⟨W, X⟩ + ⟨G, Y − X⟩
slide-88
SLIDE 88

Proof

UTW = 0 implies

⟨W, X⟩ = ⟨W, USV T⟩ = ⟨UTW, SV T⟩ = 0

⟨UV T, X⟩
slide-89
SLIDE 89

Proof

UTW = 0 implies W , X =

  • W , USV T

=

  • UTW , SV T

= 0

  • UV T, X
  • = tr
  • VUTX
slide-90
SLIDE 90

Proof

UTW = 0 implies W , X =

  • W , USV T

=

  • UTW , SV T

= 0

  • UV T, X
  • = tr
  • VUTX
  • = tr
  • VUTUSV T
slide-91
SLIDE 91

Proof

UTW = 0 implies W , X =

  • W , USV T

=

  • UTW , SV T

= 0

  • UV T, X
  • = tr
  • VUTX
  • = tr
  • VUTUSV T

= tr

  • V TV S
slide-92
SLIDE 92

Proof

UTW = 0 implies W , X =

  • W , USV T

=

  • UTW , SV T

= 0

  • UV T, X
  • = tr
  • VUTX
  • = tr
  • VUTUSV T

= tr

  • V TV S
  • = tr (S)
slide-93
SLIDE 93

Proof

UTW = 0 implies

⟨W, X⟩ = ⟨W, USV T⟩ = ⟨UTW, SV T⟩ = 0

⟨UV T, X⟩ = tr( V UTX ) = tr( V UTUSV T ) = tr( V TV S ) = tr(S) = ||X||∗

slide-94
SLIDE 94

Proof

For any matrix Y ∈ Rm×n

||Y||∗ ≥ ⟨G, Y⟩ = ⟨G, X⟩ + ⟨G, Y − X⟩
       = ⟨UV T, X⟩ + ⟨W, X⟩ + ⟨G, Y − X⟩
       = ||X||∗ + ⟨G, Y − X⟩

slide-95
SLIDE 95

Sparse linear regression with 2 features

y := α x1 + z

X := [ x1  x2 ]

||x1||₂ = 1,   ||x2||₂ = 1,   ⟨x1, x2⟩ = ρ

slide-96
SLIDE 96

Analysis of lasso estimator

Let α ≥ 0. Then

βlasso = [ α + x1T z − λ ]
         [       0        ]

as long as

|x2T z − ρ x1T z| / (1 − |ρ|)  ≤  λ  ≤  α + x1T z

slide-97
SLIDE 97

Lasso estimator

[Figure: lasso coefficients as a function of the regularization parameter]

slide-98
SLIDE 98

Optimality condition for nondifferentiable functions

If 0 is a subgradient of f at x, then f ( y) ≥ f ( x) + 0T ( y − x) = f ( x) for all y ∈ Rn Under strict convexity the minimum is unique

slide-99
SLIDE 99

Proof

The cost function is strictly convex if n ≥ 2 and |ρ| ≠ 1

Aim: Show that there is a subgradient equal to 0 at a 1-sparse solution

slide-100
SLIDE 100

Proof

The gradient of the quadratic term

q(β) := ½ ||X β − y||₂²

at βlasso equals

∇q(βlasso) = XT ( X βlasso − y )

slide-101
SLIDE 101

Proof

If only the first entry is nonzero and nonnegative

  • gℓ1 :=

1 γ

  • is a subgradient of the ℓ1 norm at

βlasso for any γ ∈ R such that |γ| ≤ 1

slide-102
SLIDE 102

Proof

If only the first entry is nonzero and nonnegative

  • gℓ1 :=

1 γ

  • is a subgradient of the ℓ1 norm at

βlasso for any γ ∈ R such that |γ| ≤ 1 In that case glasso := ∇q

  • βlasso
  • + λ

gℓ1 is a subgradient of the cost function at βlasso

slide-103
SLIDE 103

Proof

If only the first entry is nonzero and nonnegative,

gℓ1 := [ 1 ]
       [ γ ]

is a subgradient of the ℓ1 norm at βlasso for any γ ∈ R such that |γ| ≤ 1

In that case

glasso := ∇q(βlasso) + λ gℓ1

is a subgradient of the cost function at βlasso

If glasso = 0 then βlasso is the unique solution

slide-104
SLIDE 104

Proof

  • glasso := X T

X βlasso − y

  • + λ

1 γ

slide-105
SLIDE 105

Proof

  • glasso := X T

X βlasso − y

  • + λ

1 γ

  • = X T
  • βlasso[1]

x1 − α x1 − z

  • + λ

1 γ

slide-106
SLIDE 106

Proof

  • glasso := X T

X βlasso − y

  • + λ

1 γ

  • = X T
  • βlasso[1]

x1 − α x1 − z

  • + λ

1 γ

  • =

  x T

1

  • βlasso[1]

x1 − α x1 − z

  • + λ
  • x T

2

  • βlasso[1]

x1 − α x1 − z

  • + λγ

 

slide-107
SLIDE 107

Proof

glasso := XT ( X βlasso − y ) + λ [ 1 ]
                                   [ γ ]

        = XT ( βlasso[1] x1 − α x1 − z ) + λ [ 1 ]
                                              [ γ ]

        = [ x1T ( βlasso[1] x1 − α x1 − z ) + λ  ]
          [ x2T ( βlasso[1] x1 − α x1 − z ) + λγ ]

        = [ βlasso[1] − α − x1T z + λ       ]
          [ ρ βlasso[1] − ρα − x2T z + λγ ]

slide-108
SLIDE 108

Proof

  • glasso =
  • βlasso[1] − α −

xT

1

z + λ ρ βlasso[1] − ρα − xT

2

z + λγ

  • Equal to

0 if

slide-109
SLIDE 109

Proof

  • glasso =
  • βlasso[1] − α −

xT

1

z + λ ρ βlasso[1] − ρα − xT

2

z + λγ

  • Equal to

0 if

  • βlasso[1] = α +

xT

1

z − λ

slide-110
SLIDE 110

Proof

  • glasso =
  • βlasso[1] − α −

xT

1

z + λ ρ βlasso[1] − ρα − xT

2

z + λγ

  • Equal to

0 if

  • βlasso[1] = α +

xT

1

z − λ γ = ρα + xT

2

z − ρ βlasso[1] λ

slide-111
SLIDE 111

Proof

glasso = [ βlasso[1] − α − x1T z + λ       ]
         [ ρ βlasso[1] − ρα − x2T z + λγ ]

Equal to 0 if

βlasso[1] = α + x1T z − λ

γ = ( ρα + x2T z − ρ βlasso[1] ) / λ = ( x2T z − ρ x1T z ) / λ + ρ

slide-112
SLIDE 112

Proof

We still need to check that it’s a valid subgradient at βlasso, i.e.

◮ βlasso[1] is nonnegative:  λ ≤ α + x1T z
◮ |γ| ≤ 1:

|γ| ≤ |x2T z − ρ x1T z| / λ + |ρ| ≤ 1

which holds if

λ ≥ |x2T z − ρ x1T z| / (1 − |ρ|)
slide-113
SLIDE 113

Robust PCA

Data: Y ∈ Rn×m

Robust PCA estimator of the low-rank component:

LRPCA := arg min_L  ||L||∗ + λ ||Y − L||₁

where λ > 0 is a regularization parameter and ||·||₁ is the ℓ1 norm of the vectorized matrix

Robust PCA estimator of the sparse component: SRPCA := Y − LRPCA

slide-114
SLIDE 114

Example

Y := [ −2 −1 α 1 2
       −2 −1 0 1 2
       −2 −1 0 1 2 ]

slide-115
SLIDE 115

Analysis of robust PCA estimator

The robust PCA estimates of both components are exact for any value of α as long as

2/√30 < λ < √(2/3)

slide-116
SLIDE 116

Example

[Figure: entries of the low-rank and sparse estimates as a function of λ]

slide-117
SLIDE 117

Optimality + uniqueness condition

Let Y := L∗ + S∗ where L∗, S∗ ∈ Rm×n and L∗ = UL∗ SL∗ VL∗T has rank r, with UL∗ ∈ Rm×r, VL∗ ∈ Rn×r, SL∗ ∈ Rr×r

Assume there exists G∗ := UL∗VL∗T + W where W satisfies

||W|| < 1,   UL∗TW = 0,   WVL∗ = 0,

and there also exists a matrix Gℓ1 satisfying

Gℓ1[i, j] = −sign(S∗[i, j])   if S∗[i, j] ≠ 0,   (1)
|Gℓ1[i, j]| < 1               otherwise,          (2)

such that G∗ + λGℓ1 = 0

Then the solution to the robust PCA problem is unique and equal to L∗

slide-118
SLIDE 118

Optimality + uniqueness condition

G∗ := UL∗VL∗T + W is a subgradient of the nuclear norm at L∗

Gℓ1 is a subgradient of ||· − Y||₁ at L∗

G∗ + λGℓ1 is a subgradient of the cost function at L∗

G∗ + λGℓ1 = 0 implies that L∗ is a solution (uniqueness is more difficult to prove)

slide-119
SLIDE 119

Example

Y := [ −2 −1 α 1 2
       −2 −1 0 1 2
       −2 −1 0 1 2 ]

We want to show that the solution is

L∗ := [ −2 −1 0 1 2        S∗ := [ 0 0 α 0 0
        −2 −1 0 1 2                0 0 0 0 0
        −2 −1 0 1 2 ]              0 0 0 0 0 ]

slide-120
SLIDE 120

Example

L∗ :=   −2 −1 1 2 −2 −1 1 2 −2 −1 1 2  

slide-121
SLIDE 121

Example

L∗ :=   −2 −1 1 2 −2 −1 1 2 −2 −1 1 2   =   1 √ 3   1 1 1     √ 30 1 √ 10

  • −2

−1 1 2

slide-122
SLIDE 122

Example

L∗ := [ −2 −1 0 1 2
        −2 −1 0 1 2
        −2 −1 0 1 2 ]

    = ( 1/√3 [ 1 ; 1 ; 1 ] ) · √30 · ( 1/√10 [ −2 −1 0 1 2 ] )

UL∗VL∗T = 1/√30 [ 1
                   1 ] [ −2 −1 0 1 2 ]
                 [ 1 ]

slide-123
SLIDE 123

Example

G∗ = UL∗V T

L∗ + W =

1 √ 30   1 1 1   −2 −1 1 2

  • + W
slide-124
SLIDE 124

Example

G∗ = UL∗V T

L∗ + W =

1 √ 30   1 1 1   −2 −1 1 2

  • + W

Gℓ1 =   g1 g2 − sign (α) g3 g4 g5 g6 g7 g8 g9 g10 g11 g12 g13 g14  

slide-125
SLIDE 125

Example

G∗ + λGℓ1 = 1 √ 30      −2 −1 1 2 −2 −1 1 2 −2 −1 1 2      + λ      − sign (α)      +          

slide-126
SLIDE 126

Example

G∗ + λGℓ1 = 1 √ 30      −2 −1 1 2 −2 −1 1 2 −2 −1 1 2      + λ      − sign (α)      +      λ sign (α)     

slide-127
SLIDE 127

Example

G∗ + λGℓ1 = 1 √ 30      −2 −1 1 2 −2 −1 1 2 −2 −1 1 2      + λ     

2 λ √ 30 1 λ √ 30

− sign (α) −

1 λ √ 30

2 λ √ 30 2 λ √ 30 1 λ √ 30

1 λ √ 30

2 λ √ 30 2 λ √ 30 1 λ √ 30

1 λ √ 30

2 λ √ 30

     +      λ sign (α)     

slide-128
SLIDE 128

Example

G∗ + λGℓ1 = 1 √ 30      −2 −1 1 2 −2 −1 1 2 −2 −1 1 2      + λ     

2 λ √ 30 1 λ √ 30

− sign (α) −

1 λ √ 30

2 λ √ 30 2 λ √ 30 1 λ √ 30

1 λ √ 30

2 λ √ 30 2 λ √ 30 1 λ √ 30

1 λ √ 30

2 λ √ 30

     +      λ sign (α)      WV = 0

slide-129
SLIDE 129

Example

G∗ + λGℓ1 = 1 √ 30      −2 −1 1 2 −2 −1 1 2 −2 −1 1 2      + λ     

2 λ √ 30 1 λ √ 30

− sign (α) −

1 λ √ 30

2 λ √ 30 2 λ √ 30 1 λ √ 30

1 λ √ 30

2 λ √ 30 2 λ √ 30 1 λ √ 30

1 λ √ 30

2 λ √ 30

     +      λ sign (α)      WV = 0

slide-130
SLIDE 130

Example

G∗ + λGℓ1 = 1 √ 30      −2 −1 1 2 −2 −1 1 2 −2 −1 1 2      + λ     

2 λ √ 30 1 λ √ 30

− sign (α) −

1 λ √ 30

2 λ √ 30 2 λ √ 30 1 λ √ 30

1 λ √ 30

2 λ √ 30 2 λ √ 30 1 λ √ 30

1 λ √ 30

2 λ √ 30

     +      λ sign (α)      WV = 0 UTW = 0

slide-131
SLIDE 131

Example

G∗ + λGℓ1 = 1 √ 30      −2 −1 1 2 −2 −1 1 2 −2 −1 1 2      + λ     

2 λ √ 30 1 λ √ 30

− sign (α) −

1 λ √ 30

2 λ √ 30 2 λ √ 30 1 λ √ 30

1 λ √ 30

2 λ √ 30 2 λ √ 30 1 λ √ 30

1 λ √ 30

2 λ √ 30

     +      λ sign (α) − λ sign(α)

2

− λ sign(α)

2

     WV = 0 UTW = 0

slide-132
SLIDE 132

Example

G∗ + λGℓ1 = 1 √ 30      −2 −1 1 2 −2 −1 1 2 −2 −1 1 2      + λ     

2 λ √ 30 1 λ √ 30

− sign (α) −

1 λ √ 30

2 λ √ 30 2 λ √ 30 1 λ √ 30 λ sign(α) 2λ

1 λ √ 30

2 λ √ 30 2 λ √ 30 1 λ √ 30 λ sign(α) 2λ

1 λ √ 30

2 λ √ 30

     +      λ sign (α) − λ sign(α)

2

− λ sign(α)

2

     WV = 0 UTW = 0

slide-133
SLIDE 133

Example

G∗ + λGℓ1 = 1 √ 30      −2 −1 1 2 −2 −1 1 2 −2 −1 1 2      + λ     

2 λ √ 30 1 λ √ 30

− sign (α) −

1 λ √ 30

2 λ √ 30 2 λ √ 30 1 λ √ 30 λ sign(α) 2λ

1 λ √ 30

2 λ √ 30 2 λ √ 30 1 λ √ 30 λ sign(α) 2λ

1 λ √ 30

2 λ √ 30

     +      λ sign (α) − λ sign(α)

2

− λ sign(α)

2

     WV = 0 UTW = 0 |Gℓ1[i, j]| < 1 for S∗[i, j] = 0?

slide-134
SLIDE 134

Example

G∗ + λGℓ1 = 1 √ 30      −2 −1 1 2 −2 −1 1 2 −2 −1 1 2      + λ     

2 λ √ 30 1 λ √ 30

− sign (α) −

1 λ √ 30

2 λ √ 30 2 λ √ 30 1 λ √ 30 λ sign(α) 2λ

1 λ √ 30

2 λ √ 30 2 λ √ 30 1 λ √ 30 λ sign(α) 2λ

1 λ √ 30

2 λ √ 30

     +      λ sign (α) − λ sign(α)

2

− λ sign(α)

2

     WV = 0 UTW = 0 |Gℓ1[i, j]| < 1 for S∗[i, j] = 0? λ > 2/ √ 30

slide-135
SLIDE 135

Example

G∗ + λGℓ1 = 1 √ 30      −2 −1 1 2 −2 −1 1 2 −2 −1 1 2      + λ     

2 λ √ 30 1 λ √ 30

− sign (α) −

1 λ √ 30

2 λ √ 30 2 λ √ 30 1 λ √ 30 λ sign(α) 2λ

1 λ √ 30

2 λ √ 30 2 λ √ 30 1 λ √ 30 λ sign(α) 2λ

1 λ √ 30

2 λ √ 30

     +      λ sign (α) − λ sign(α)

2

− λ sign(α)

2

     WV = 0 UTW = 0 |Gℓ1[i, j]| < 1 for S∗[i, j] = 0? λ > 2/ √ 30 ||W || < 1?

slide-136
SLIDE 136

Example

G∗ + λGℓ1 = UL∗VL∗T + λ Gℓ1 + W = 0, with

UL∗VL∗T = 1/√30 [ 1 ; 1 ; 1 ] [ −2 −1 0 1 2 ]

Gℓ1 = [ 2/(λ√30)  1/(λ√30)  −sign(α)    −1/(λ√30)  −2/(λ√30)
        2/(λ√30)  1/(λ√30)   sign(α)/2  −1/(λ√30)  −2/(λ√30)
        2/(λ√30)  1/(λ√30)   sign(α)/2  −1/(λ√30)  −2/(λ√30) ]

W = λ sign(α) [ 0 0   1   0 0
                0 0 −1/2  0 0
                0 0 −1/2  0 0 ]

WV = 0   UTW = 0

|Gℓ1[i, j]| < 1 for S∗[i, j] = 0?   Yes, if λ > 2/√30

||W|| < 1?   Yes, if λ < √(2/3)
slide-137
SLIDE 137

Applications Subgradients Optimization methods

slide-138
SLIDE 138

Subgradient method

Optimization problem: minimize f(x), where f is convex but nondifferentiable

Subgradient-method iteration:

x(0) = arbitrary initialization
x(k+1) = x(k) − αk g(k)

where g(k) is a subgradient of f at x(k)

slide-139
SLIDE 139

Least-squares regression with ℓ1-norm regularization

minimize 1 2 ||A x − y||2

2 + λ ||

x||1 Subgradient at x (k)

  • g (k)
slide-140
SLIDE 140

Least-squares regression with ℓ1-norm regularization

minimize 1 2 ||A x − y||2

2 + λ ||

x||1 Subgradient at x (k)

  • g (k) = AT

A x (k) − y

  • + λ sign
  • x (k)
slide-141
SLIDE 141

Least-squares regression with ℓ1-norm regularization

minimize  ½ ||A x − y||₂² + λ ||x||₁

Subgradient at x(k):

g(k) = AT ( A x(k) − y ) + λ sign( x(k) )

Subgradient-method iteration:

x(0) = arbitrary initialization
x(k+1) = x(k) − αk ( AT ( A x(k) − y ) + λ sign( x(k) ) )
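
A minimal sketch of this iteration in Python (the problem sizes, data, and the diminishing step-size schedule α0/√k are arbitrary illustrative choices):

# Subgradient method for (1/2)||Ax - y||_2^2 + lam ||x||_1 with steps a0 / sqrt(k).
import numpy as np

rng = np.random.default_rng(0)
m, n, lam, a0 = 200, 100, 0.1, 0.01
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[:5] = 1.0
y = A @ x_true + 0.01 * rng.standard_normal(m)

x = np.zeros(n)                                    # arbitrary initialization
for k in range(1, 1001):
    g = A.T @ (A @ x - y) + lam * np.sign(x)       # subgradient at x^(k)
    x = x - (a0 / np.sqrt(k)) * g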
slide-142
SLIDE 142

Convergence of subgradient method

It is not a descent method

Convergence rate can be shown to be O(1/ε²)

Diminishing step sizes are necessary for convergence

Experiment: minimize ½ ||A x − y||₂² + λ ||x||₁ with A ∈ R2000×1000, y = A x∗ + z, where x∗ is 100-sparse and z is iid Gaussian

slide-143
SLIDE 143

Convergence of subgradient method

[Figure: relative error (f(x(k)) − f(x∗)) / f(x∗) vs. iteration k for step sizes α0, α0/√k, and α0/k]

slide-144
SLIDE 144

Convergence of subgradient method

[Figure: relative error (f(x(k)) − f(x∗)) / f(x∗) over 5000 iterations for the three step-size choices]

slide-145
SLIDE 145

Composite functions

Interesting class of functions for data analysis:

f(x) + h(x)

f convex and differentiable, h convex but not differentiable

Example: ½ ||A x − y||₂² + λ ||x||₁

slide-146
SLIDE 146

Motivation

Aim: Minimize a convex differentiable function f

Idea: Iteratively minimize the first-order approximation, while staying close to the current point

x(0) = arbitrary initialization
x(k+1) = arg min_x  f(x(k)) + ∇f(x(k))T (x − x(k)) + (1 / 2αk) ||x − x(k)||₂²

where αk is a parameter that determines how close we stay

slide-147
SLIDE 147

Motivation

Linear approximation+ ℓ2 term is convex ∇

  • f
  • x (k)

+ ∇f

  • x (k)T
  • x −

x (k) + 1 2 αk

  • x −

x (k)

  • 2

2

slide-148
SLIDE 148

Motivation

Linear approximation+ ℓ2 term is convex ∇

  • f
  • x (k)

+ ∇f

  • x (k)T
  • x −

x (k) + 1 2 αk

  • x −

x (k)

  • 2

2

  • = ∇f
  • x (k)

+ x − x (k) αk

slide-149
SLIDE 149

Motivation

Linear approximation+ ℓ2 term is convex ∇

  • f
  • x (k)

+ ∇f

  • x (k)T
  • x −

x (k) + 1 2 αk

  • x −

x (k)

  • 2

2

  • = ∇f
  • x (k)

+ x − x (k) αk Setting the gradient to zero

  • x (k+1) = arg min
  • x f
  • x (k)

+ ∇f

  • x (k)T
  • x −

x (k) + 1 2 αk

  • x −

x (k)

  • 2

2

slide-150
SLIDE 150

Motivation

The linear approximation + ℓ2 term is convex:

∇( f(x(k)) + ∇f(x(k))T (x − x(k)) + (1 / 2αk) ||x − x(k)||₂² ) = ∇f(x(k)) + (x − x(k)) / αk

Setting the gradient to zero:

x(k+1) = arg min_x  f(x(k)) + ∇f(x(k))T (x − x(k)) + (1 / 2αk) ||x − x(k)||₂²
       = x(k) − αk ∇f(x(k))
slide-151
SLIDE 151

Proximal gradient method

Idea: Minimize the local first-order approximation + h

x(k+1) = arg min_x  f(x(k)) + ∇f(x(k))T (x − x(k)) + (1 / 2αk) ||x − x(k)||₂² + h(x)
       = arg min_x  ½ ||x − ( x(k) − αk ∇f(x(k)) )||₂² + αk h(x)
       = prox_{αk h}( x(k) − αk ∇f(x(k)) )

Proximal operator:

prox_h(y) := arg min_x  h(x) + ½ ||y − x||₂²

slide-152
SLIDE 152

Proximal gradient method

Method to solve the optimization problem

minimize f(x) + h(x),

where f is differentiable and prox_h is tractable

Proximal-gradient iteration:

x(0) = arbitrary initialization
x(k+1) = prox_{αk h}( x(k) − αk ∇f(x(k)) )
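
A generic sketch of the iteration; grad_f and prox_h are placeholders the caller supplies (they are not defined on the slides), with prox_h(v, alpha) returning prox_{alpha h}(v).

# Generic proximal-gradient iteration with a constant step size.
import numpy as np

def proximal_gradient(grad_f, prox_h, x0, alpha, iters=500):
    """Iterate x <- prox_{alpha h}(x - alpha * grad_f(x))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = prox_h(x - alpha * grad_f(x), alpha)
    return x
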
slide-153
SLIDE 153

Interpretation as a fixed-point method

A vector x∗ is a solution to

minimize f(x) + h(x)

if and only if it is a fixed point of the proximal-gradient iteration for any α > 0:

x∗ = prox_{α h}( x∗ − α ∇f(x∗) )

slide-154
SLIDE 154

Proof

x∗ is the solution to

min_x  α h(x) + ½ ||x∗ − α ∇f(x∗) − x||₂²    (3)

if and only if there is a subgradient g of h at x∗ such that

slide-155
SLIDE 155

Proof

  • x ∗ is the solution to

min

  • x

α h ( x) + 1 2 || x ∗ − α ∇f ( x ∗) − x||2

2

(3) if and only if there is a subgradient g of h at x ∗ such that α∇f ( x ∗) + α g =

slide-156
SLIDE 156

Proof

  • x ∗ is the solution to

min

  • x

α h ( x) + 1 2 || x ∗ − α ∇f ( x ∗) − x||2

2

(3) if and only if there is a subgradient g of h at x ∗ such that α∇f ( x ∗) + α g =

  • x ∗ minimizes f + h if and only if there is a subgradient

g of h at x ∗ such that

slide-157
SLIDE 157

Proof

x∗ is the solution to

min_x  α h(x) + ½ ||x∗ − α ∇f(x∗) − x||₂²    (3)

if and only if there is a subgradient g of h at x∗ such that

α ∇f(x∗) + α g = 0

x∗ minimizes f + h if and only if there is a subgradient g of h at x∗ such that

∇f(x∗) + g = 0

slide-158
SLIDE 158

Proximal operator of ℓ1 norm

The proximal operator of the ℓ1 norm is the soft-thresholding operator

prox_{α ||·||₁}(y) = Sα(y)

where α > 0 and

Sα(y)[i] := y[i] − sign(y[i]) α   if |y[i]| ≥ α
            0                      otherwise
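
In code the soft-thresholding operator is a one-liner (sketch):

import numpy as np

def soft_threshold(y, alpha):
    """Entrywise prox of alpha * ||.||_1: shrink each entry toward 0 by alpha."""
    return np.sign(y) * np.maximum(np.abs(y) - alpha, 0.0)
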
slide-159
SLIDE 159

Proof

α ||x||₁ + ½ ||y − x||₂² = ∑_{i=1}^m ( α |x[i]| + ½ (y[i] − x[i])² )

We can just consider

w(x) := α |x| + ½ (y − x)² = (y² + x²)/2 + α |x| − y x

slide-160
SLIDE 160

Proof

If x ≥ 0 w (x) = y2 + x2 2 − (y − α) x w′ (x) =

slide-161
SLIDE 161

Proof

If x ≥ 0,

w(x) = (y² + x²)/2 − (y − α) x
w′(x) = x − (y − α)

slide-162
SLIDE 162

Proof

If x ≥ 0 w (x) = y2 + x2 2 − (y − α) x w′ (x) = x − (y − α) If y ≥ α minimum at

slide-163
SLIDE 163

Proof

If x ≥ 0 w (x) = y2 + x2 2 − (y − α) x w′ (x) = x − (y − α) If y ≥ α minimum at x := y − α

slide-164
SLIDE 164

Proof

If x ≥ 0 w (x) = y2 + x2 2 − (y − α) x w′ (x) = x − (y − α) If y ≥ α minimum at x := y − α If y < α minimum at

slide-165
SLIDE 165

Proof

If x ≥ 0,

w(x) = (y² + x²)/2 − (y − α) x
w′(x) = x − (y − α)

If y ≥ α, the minimum is at x := y − α
If y < α, the minimum is at 0

slide-166
SLIDE 166

Proof

If x < 0 w (x) = y2 + x2 2 − (y + α) x w′ (x) =

slide-167
SLIDE 167

Proof

If x < 0,

w(x) = (y² + x²)/2 − (y + α) x
w′(x) = x − (y + α)

slide-168
SLIDE 168

Proof

If x < 0 w (x) = y2 + x2 2 − (y + α) x w′ (x) = x − (y + α) If y ≤ −α minimum at

slide-169
SLIDE 169

Proof

If x < 0 w (x) = y2 + x2 2 − (y + α) x w′ (x) = x − (y + α) If y ≤ −α minimum at x := y + α

slide-170
SLIDE 170

Proof

If x < 0 w (x) = y2 + x2 2 − (y + α) x w′ (x) = x − (y + α) If y ≤ −α minimum at x := y + α If y ≥ −α minimum at

slide-171
SLIDE 171

Proof

If x < 0,

w(x) = (y² + x²)/2 − (y + α) x
w′(x) = x − (y + α)

If y ≤ −α, the minimum is at x := y + α
If y ≥ −α, the minimum is at 0

slide-172
SLIDE 172

Proof

If −α ≤ y ≤ α minimum at x := 0 If y ≥ α minimum at x := y − α or at x := 0, but w (y − α)

slide-173
SLIDE 173

Proof

If −α ≤ y ≤ α minimum at x := 0 If y ≥ α minimum at x := y − α or at x := 0, but w (y − α) = α (y − α) + α2 2

slide-174
SLIDE 174

Proof

If −α ≤ y ≤ α minimum at x := 0 If y ≥ α minimum at x := y − α or at x := 0, but w (y − α) = α (y − α) + α2 2 = αy − α2 2

slide-175
SLIDE 175

Proof

If −α ≤ y ≤ α, the minimum is at x := 0

If y ≥ α, the minimum is at x := y − α or at x := 0, but

w(y − α) = α (y − α) + α²/2 = α y − α²/2 ≤ y²/2 = w(0)

because (y − α)² ≥ 0

slide-176
SLIDE 176

Proof

If −α ≤ y ≤ α, the minimum is at x := 0

If y ≥ α, the minimum is at x := y − α or at x := 0, but

w(y − α) = α (y − α) + α²/2 = α y − α²/2 ≤ y²/2 = w(0)

because (y − α)² ≥ 0

The same argument applies for y ≤ −α

slide-177
SLIDE 177

Iterative Shrinkage-Thresholding Algorithm (ISTA)

The proximal gradient method for the problem

minimize  ½ ||A x − y||₂² + λ ||x||₁

is called ISTA

ISTA iteration:

x(0) = arbitrary initialization
x(k+1) = S_{αk λ}( x(k) − αk AT ( A x(k) − y ) )
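
A minimal ISTA sketch (the constant step size 1/||A||₂², the reciprocal of the Lipschitz constant of the gradient, is one standard choice; it is not prescribed by the slides):

# ISTA for (1/2)||Ax - y||_2^2 + lam ||x||_1.
import numpy as np

def soft_threshold(y, alpha):
    return np.sign(y) * np.maximum(np.abs(y) - alpha, 0.0)

def ista(A, y, lam, iters=500):
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2        # constant step <= 1/L
    x = np.zeros(A.shape[1])                       # arbitrary initialization
    for _ in range(iters):
        x = soft_threshold(x - alpha * A.T @ (A @ x - y), alpha * lam)
    return x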

slide-178
SLIDE 178

Fast Iterative Shrinkage-Thresholding Algorithm (FISTA)

ISTA can be accelerated using Nesterov’s accelerated gradient method

FISTA iteration:

x(0) = arbitrary initialization
z(0) = x(0)
x(k+1) = S_{αk λ}( z(k) − αk AT ( A z(k) − y ) )
z(k+1) = x(k+1) + k/(k + 3) ( x(k+1) − x(k) )
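
The corresponding FISTA sketch only changes where the proximal step is evaluated (same assumptions on the step size as in the ISTA sketch above):

# FISTA: proximal step at the extrapolated point z^(k), momentum weight k/(k+3).
import numpy as np

def soft_threshold(y, alpha):
    return np.sign(y) * np.maximum(np.abs(y) - alpha, 0.0)

def fista(A, y, lam, iters=500):
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    z = x.copy()
    for k in range(iters):
        x_next = soft_threshold(z - alpha * A.T @ (A @ z - y), alpha * lam)
        z = x_next + (k / (k + 3)) * (x_next - x)
        x = x_next
    return x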

slide-179
SLIDE 179

Convergence of proximal gradient method

Without acceleration:

◮ Descent method
◮ Convergence rate can be shown to be O(1/ε) with constant step or backtracking line search

With acceleration:

◮ Not a descent method
◮ Convergence rate can be shown to be O(1/√ε) with constant step or backtracking line search

Experiment: minimize ½ ||A x − y||₂² + λ ||x||₁ with A ∈ R2000×1000, y = A x0 + z, x0 100-sparse and z iid Gaussian

slide-180
SLIDE 180

Convergence of proximal gradient method

[Figure: relative error (f(x(k)) − f(x∗)) / f(x∗) vs. iteration k for the subgradient method (α0/√k), ISTA, and FISTA]

slide-181
SLIDE 181

Coordinate descent

Idea: Solve the n-dimensional problem

minimize c( x[1], x[2], . . . , x[n] )

by solving a sequence of 1D problems

Coordinate-descent iteration:

x(0) = arbitrary initialization
x(k+1)[i] = arg min_α  c( x(k)[1], . . . , α, . . . , x(k)[n] )   for some 1 ≤ i ≤ n
slide-182
SLIDE 182

Coordinate descent

Convergence is guaranteed for functions of the form

f(x) + ∑_{i=1}^n hi( x[i] )

where f is convex and differentiable and h1, . . . , hn are convex

slide-183
SLIDE 183

Least-squares regression with ℓ1-norm regularization

h(x) := ½ ||A x − y||₂² + λ ||x||₁

The solution to the subproblem min_{x[i]} h( x[1], . . . , x[i], . . . , x[n] ) is

x∗[i] = Sλ(γi) / ||Ai||₂²

where Ai is the ith column of A and

γi := ∑_{l=1}^m Ali ( y[l] − ∑_{j≠i} Alj x[j] )
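
A coordinate-descent sketch for this problem, cycling over the coordinates (sweep count and data are up to the user; they are not specified on the slides):

# Coordinate descent for (1/2)||Ax - y||_2^2 + lam ||x||_1.
import numpy as np

def soft_threshold(y, alpha):
    return np.sign(y) * np.maximum(np.abs(y) - alpha, 0.0)

def lasso_coordinate_descent(A, y, lam, sweeps=100):
    m, n = A.shape
    x = np.zeros(n)
    col_sq = np.sum(A ** 2, axis=0)                # ||A_i||_2^2 for each column
    for _ in range(sweeps):
        for i in range(n):
            r = y - A @ x + A[:, i] * x[i]         # residual with feature i excluded
            gamma_i = A[:, i] @ r
            x[i] = soft_threshold(gamma_i, lam) / col_sq[i]
    return x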