Nondifferentiable Convex Functions
DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis
http://www.cims.nyu.edu/~cfgranda/pages/OBDA_fall17/index.html
Carlos Fernandez-Granda
Applications Subgradients Optimization methods
Regression
The aim is to learn a function h that relates
◮ a response or dependent variable y
◮ to several observed variables x1, x2, . . . , xp, known as covariates, features or independent variables
The response is assumed to be of the form y = h(x) + z, where x ∈ Rp contains the features and z is noise
Linear regression
The regression function h is assumed to be linear:
$$ y^{(i)} = (x^{(i)})^T \beta^* + z^{(i)}, \qquad 1 \le i \le n $$
Our aim is to estimate β∗ ∈ Rp from the data
Linear regression
In matrix form,
$$
\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}
=
\begin{bmatrix}
x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_p \\
x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_p \\
\vdots & \vdots & \ddots & \vdots \\
x^{(n)}_1 & x^{(n)}_2 & \cdots & x^{(n)}_p
\end{bmatrix}
\begin{bmatrix} \beta^*_1 \\ \beta^*_2 \\ \vdots \\ \beta^*_p \end{bmatrix}
+
\begin{bmatrix} z^{(1)} \\ z^{(2)} \\ \vdots \\ z^{(n)} \end{bmatrix}
$$
Equivalently,
$$ y = X \beta^* + z $$
Sparse linear regression
Only a subset of the features is relevant (a model selection problem)
Two objectives:
◮ Good fit to the data: $\| X\beta - y \|_2^2$ should be as small as possible
◮ Use a small number of features: β should be as sparse as possible
Sparse linear regression
If only features j and l are relevant,
$$
\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}
=
\begin{bmatrix}
x^{(1)}_j & x^{(1)}_l \\
x^{(2)}_j & x^{(2)}_l \\
\vdots & \vdots \\
x^{(n)}_j & x^{(n)}_l
\end{bmatrix}
\begin{bmatrix} \beta^*_j \\ \beta^*_l \end{bmatrix}
+
\begin{bmatrix} z^{(1)} \\ z^{(2)} \\ \vdots \\ z^{(n)} \end{bmatrix}
=
\begin{bmatrix}
x^{(1)}_1 & \cdots & x^{(1)}_j & \cdots & x^{(1)}_l & \cdots & x^{(1)}_p \\
x^{(2)}_1 & \cdots & x^{(2)}_j & \cdots & x^{(2)}_l & \cdots & x^{(2)}_p \\
\vdots &  & \vdots &  & \vdots &  & \vdots \\
x^{(n)}_1 & \cdots & x^{(n)}_j & \cdots & x^{(n)}_l & \cdots & x^{(n)}_p
\end{bmatrix}
\begin{bmatrix} 0 \\ \vdots \\ \beta^*_j \\ \vdots \\ \beta^*_l \\ \vdots \\ 0 \end{bmatrix}
+
\begin{bmatrix} z^{(1)} \\ z^{(2)} \\ \vdots \\ z^{(n)} \end{bmatrix}
= X \beta^* + z
$$
Sparse linear regression with 2 features
$$ y := \alpha\, x_1 + z, \qquad X := \begin{bmatrix} x_1 & x_2 \end{bmatrix}, \qquad \|x_1\|_2 = 1, \quad \|x_2\|_2 = 1, \quad \langle x_1, x_2 \rangle = \rho $$
Least squares: not sparse
$$
\beta_{\mathrm{LS}} = \left( X^T X \right)^{-1} X^T y
= \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1}
  \begin{bmatrix} x_1^T y \\ x_2^T y \end{bmatrix}
= \frac{1}{1-\rho^2}
  \begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix}
  \begin{bmatrix} \alpha + x_1^T z \\ \alpha\rho + x_2^T z \end{bmatrix}
= \begin{bmatrix}
    \alpha + \frac{1}{1-\rho^2}\,\langle x_1 - \rho\, x_2,\; z\rangle \\
    \frac{1}{1-\rho^2}\,\langle x_2 - \rho\, x_1,\; z\rangle
  \end{bmatrix}
$$
The lasso
Idea: Use ℓ1-norm regularization to promote sparse coefficients
$$ \beta_{\mathrm{lasso}} := \arg\min_{\beta} \; \frac{1}{2} \left\| y - X\beta \right\|_2^2 + \lambda \left\| \beta \right\|_1 $$
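A minimal numerical sketch (not part of the slides): fitting the lasso on synthetic sparse data with scikit-learn and comparing the number of nonzero coefficients with least squares. The data and the regularization value are assumptions for illustration; note that scikit-learn scales the data-fit term by 1/(2n), so its alpha corresponds to λ/n in the formulation above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p, s = 60, 20, 3                       # samples, features, nonzero coefficients
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s] = [2.0, -1.5, 1.0]          # sparse ground truth (made-up values)
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# scikit-learn minimizes (1/(2n))||y - X b||_2^2 + alpha * ||b||_1
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
ls = LinearRegression(fit_intercept=False).fit(X, y)

print("nonzeros (lasso):        ", np.count_nonzero(lasso.coef_))
print("nonzeros (least squares):", np.count_nonzero(ls.coef_))
```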
Nonnegative weighted sums
The weighted sum of m convex functions f1, . . . , fm,
$$ f := \sum_{i=1}^m \alpha_i f_i, $$
is convex if α1, . . . , αm ∈ R are nonnegative
Proof:
$$ f(\theta x + (1-\theta) y) = \sum_{i=1}^m \alpha_i f_i(\theta x + (1-\theta) y) \le \sum_{i=1}^m \alpha_i \left( \theta f_i(x) + (1-\theta) f_i(y) \right) = \theta f(x) + (1-\theta) f(y) $$
Regularized least-squares
Regularized least-squares cost functions of the form $\|Ax - y\|_2^2 + \|x\|$, where $\|\cdot\|$ is any norm, are convex
It works
[Figure: lasso coefficients as a function of the regularization parameter]
Ridge regression doesn’t work
[Figure: ridge-regression coefficients as a function of the regularization parameter]
Prostate cancer data set
◮ 8 features (age, weight, analysis results)
◮ Response: prostate-specific antigen (PSA), associated with cancer
◮ Training set: 60 patients
◮ Test set: 37 patients
Prostate cancer data set
[Figure: lasso coefficient values, training loss and test loss as a function of λ]
Principal component analysis
Given n data vectors x1, x2, . . . , xn ∈ Rd,
1. Center the data: ci = xi − av(x1, x2, . . . , xn), 1 ≤ i ≤ n
2. Group the centered data as columns of a matrix C = [c1 c2 · · · cn]
3. Compute the SVD of C
The left singular vectors are the principal directions
The principal values are the coefficients of the centered vectors in the basis of principal directions
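A minimal sketch (not part of the slides) of the three steps in NumPy; the data matrix is a made-up example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))          # d = 5, n = 20 data vectors as columns

# 1. Center the data
C = X - X.mean(axis=1, keepdims=True)

# 2.-3. SVD of the centered matrix
U, s, Vt = np.linalg.svd(C, full_matrices=False)

principal_directions = U                  # left singular vectors
coefficients = np.diag(s) @ Vt            # coefficients in the basis of principal directions

# Sanity check: the centered data are recovered from directions and coefficients
assert np.allclose(C, principal_directions @ coefficients)
```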
Example
$$ C := \begin{bmatrix} -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \end{bmatrix} $$
Principal component analysis
Example
$$ C := \begin{bmatrix} -2 & -1 & 5 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \end{bmatrix} $$
Principal component analysis
Outliers
Problem: Outliers distort the principal directions
Model: Data equals a low-rank component plus a sparse component, Y = L + S
Idea: Fit the model to the data, then apply PCA to L
Robust PCA
Data: Y ∈ Rn×m
Robust PCA estimator of the low-rank component:
$$ L_{\mathrm{RPCA}} := \arg\min_{L} \; \|L\|_* + \lambda \|Y - L\|_1 $$
where λ > 0 is a regularization parameter
Robust PCA estimator of the sparse component: $S_{\mathrm{RPCA}} := Y - L_{\mathrm{RPCA}}$
$\|\cdot\|_1$ denotes the ℓ1 norm of the vectorized matrix
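A minimal sketch (not part of the slides) of solving the robust PCA problem with CVXPY on a small matrix; the data, the value of λ and the use of CVXPY are assumptions.

```python
import cvxpy as cp
import numpy as np

Y = np.array([[-2., -1., 4., 1., 2.],
              [-2., -1., 0., 1., 2.],
              [-2., -1., 0., 1., 2.]])    # rank-1 pattern plus one outlier

lam = 0.5                                  # regularization parameter (assumed)
L = cp.Variable(Y.shape)
cost = cp.norm(L, "nuc") + lam * cp.sum(cp.abs(Y - L))  # entrywise l1 on Y - L
cp.Problem(cp.Minimize(cost)).solve()

L_rpca = L.value
S_rpca = Y - L_rpca
print(np.round(L_rpca, 2))                 # approximately the rank-1 component
print(np.round(S_rpca, 2))                 # approximately the outlier
```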
Example
[Figure: entries of the low-rank and sparse components of the robust PCA solution as a function of λ]
[Figures: recovered L and S for λ = 1/√n, for a large λ, and for a small λ]
Background subtraction
Matrix with vectorized frames as columns
Static image:
$$ Y = \begin{bmatrix} x & x & \cdots & x \end{bmatrix} = x \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix} $$
Slowly varying background: low rank
Rapidly varying foreground: sparse
[Figures: frames 17, 42 and 75 of the video, each shown with the corresponding low-rank (background) and sparse (foreground) components]
Applications Subgradients Optimization methods
Gradient
A differentiable function f : Rn → R is convex if and only if for every x, y ∈ Rn
$$ f(y) \ge f(x) + \nabla f(x)^T (y - x) $$
Subgradient
A subgradient of f : Rn → R at x ∈ Rn is a vector g ∈ Rn such that
$$ f(y) \ge f(x) + g^T (y - x) \qquad \text{for all } y \in \mathbb{R}^n $$
Geometrically, the hyperplane
$$ H_g := \left\{ y \in \mathbb{R}^{n+1} \;\middle|\; y[n+1] = f(x) + g^T \left( \begin{bmatrix} y[1] & \cdots & y[n] \end{bmatrix}^T - x \right) \right\} $$
is a supporting hyperplane of the epigraph at x
The set of all subgradients at x is called the subdifferential
Subgradients
Subgradient of differentiable function
If a function is differentiable, the only subgradient at each point is the gradient
Proof
Assume g is a subgradient at x. For any α ≥ 0,
$$ f(x + \alpha e_i) \ge f(x) + g^T \alpha e_i = f(x) + g[i]\,\alpha $$
$$ f(x - \alpha e_i) \ge f(x) - g^T \alpha e_i = f(x) - g[i]\,\alpha $$
Combining both inequalities,
$$ \frac{f(x) - f(x - \alpha e_i)}{\alpha} \le g[i] \le \frac{f(x + \alpha e_i) - f(x)}{\alpha} $$
Letting α → 0 implies $g[i] = \frac{\partial f(x)}{\partial x[i]}$
Subgradient
A function f : Rn → R is convex if and only if it has a subgradient at every point
It is strictly convex if and only if for all x ∈ Rn there exists g ∈ Rn such that
$$ f(y) > f(x) + g^T (y - x) \qquad \text{for all } y \ne x $$
Optimality condition for nondifferentiable functions
If 0 is a subgradient of f at x, then
$$ f(y) \ge f(x) + 0^T (y - x) = f(x) \qquad \text{for all } y \in \mathbb{R}^n, $$
so x is a global minimum
Under strict convexity the minimum is unique
Sum of subgradients
Let g1 and g2 be subgradients at x ∈ Rn of f1 : Rn → R and f2 : Rn → R
Then g := g1 + g2 is a subgradient of f := f1 + f2 at x
Proof: For any y ∈ Rn,
$$ f(y) = f_1(y) + f_2(y) \ge f_1(x) + g_1^T (y - x) + f_2(x) + g_2^T (y - x) = f(x) + g^T (y - x) $$
Subgradient of scaled function
Let g1 be a subgradient at x ∈ Rn of f1 : Rn → R
For any η ≥ 0, g2 := η g1 is a subgradient of f2 := η f1 at x
Proof: For any y ∈ Rn,
$$ f_2(y) = \eta f_1(y) \ge \eta \left( f_1(x) + g_1^T (y - x) \right) = f_2(x) + g_2^T (y - x) $$
Subdifferential of absolute value
f(x) = |x|
Subdifferential of absolute value
At x ≠ 0, f(x) = |x| is differentiable, so the only subgradient is g = sign(x)
At x = 0, we need f(0 + y) ≥ f(0) + g(y − 0), i.e. |y| ≥ g y for all y
This holds if and only if |g| ≤ 1
Subdifferential of ℓ1 norm
g is a subgradient of the ℓ1 norm at x ∈ Rn if and only if
$$ g[i] = \operatorname{sign}(x[i]) \quad \text{if } x[i] \ne 0, \qquad |g[i]| \le 1 \quad \text{if } x[i] = 0 $$
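A small numerical check (not part of the slides) that a vector g constructed according to this characterization satisfies the subgradient inequality; the test vector x and the random trial points are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.5, 0.0, -2.0, 0.0])

g = np.sign(x)                        # sign(x[i]) where x[i] != 0, 0 elsewhere
g[x == 0] = rng.uniform(-1, 1, np.count_nonzero(x == 0))  # any value in [-1, 1]

for _ in range(1000):
    y = rng.standard_normal(x.size)
    # check ||y||_1 >= ||x||_1 + g^T (y - x)
    assert np.sum(np.abs(y)) >= np.sum(np.abs(x)) + g @ (y - x) - 1e-12
print("subgradient inequality holds for all sampled y")
```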
Proof
g is a subgradient of ||·||1 at x if and only if g[i] is a subgradient of |·| at x[i] for all 1 ≤ i ≤ n
Proof
If g is a subgradient of ||·||1 at x, then for any y ∈ R
$$ |y| = |x[i]| + \left\| x + (y - x[i])\, e_i \right\|_1 - \|x\|_1 \ge |x[i]| + \|x\|_1 + g^T (y - x[i])\, e_i - \|x\|_1 = |x[i]| + g[i]\,(y - x[i]), $$
so g[i] is a subgradient of |·| at x[i] for all 1 ≤ i ≤ n
Proof
If g[i] is a subgradient of |·| at x[i] for 1 ≤ i ≤ n, then for any y ∈ Rn
$$ \|y\|_1 = \sum_{i=1}^n |y[i]| \ge \sum_{i=1}^n \left( |x[i]| + g[i]\,(y[i] - x[i]) \right) = \|x\|_1 + g^T (y - x), $$
so g is a subgradient of ||·||1 at x
Subdifferential of the nuclear norm
Let X ∈ Rm×n be a rank-r matrix with SVD $U S V^T$, where U ∈ Rm×r, V ∈ Rn×r and S ∈ Rr×r
A matrix G is a subgradient of the nuclear norm at X if and only if
$$ G := U V^T + W $$
where W satisfies
$$ \|W\| \le 1, \qquad U^T W = 0, \qquad W V = 0 $$
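A small numerical check (not part of the slides): constructing G = UVᵀ + W for a random low-rank X and verifying the subgradient inequality for the nuclear norm. The matrix sizes and the way W is generated are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-r matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U, Vt = U[:, :r], Vt[:r, :]                                      # rank-r SVD

# W: spectral norm <= 1, orthogonal to the column and row spaces of X
A = rng.standard_normal((m, n))
W = (np.eye(m) - U @ U.T) @ A @ (np.eye(n) - Vt.T @ Vt)
W /= np.linalg.norm(W, 2)                                        # ||W|| = 1
G = U @ Vt + W

nuc = lambda M: np.linalg.norm(M, "nuc")
for _ in range(200):
    Y = rng.standard_normal((m, n))
    # check ||Y||_* >= ||X||_* + <G, Y - X>
    assert nuc(Y) >= nuc(X) + np.sum(G * (Y - X)) - 1e-9
print("subgradient inequality holds for all sampled Y")
```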
Proof
By Pythagoras' theorem, for any x ∈ Rn with unit ℓ2 norm we have
$$ \left\| P_{\operatorname{row}(X)}\, x \right\|_2^2 + \left\| P_{\operatorname{row}(X)^\perp}\, x \right\|_2^2 = \|x\|_2^2 = 1 $$
The rows of $UV^T$ are in row(X) and the rows of W are in row(X)⊥, so
$$ \|G\|_2^2 := \max_{\|x\|_2 = 1} \|G x\|_2^2 = \max_{\|x\|_2 = 1} \left( \left\| U V^T x \right\|_2^2 + \left\| W x \right\|_2^2 \right) $$
$$ \qquad = \max_{\|x\|_2 = 1} \left( \left\| U V^T P_{\operatorname{row}(X)}\, x \right\|_2^2 + \left\| W P_{\operatorname{row}(X)^\perp}\, x \right\|_2^2 \right) $$
$$ \qquad \le \max_{\|x\|_2 = 1} \left( \left\| U V^T \right\|_2^2 \left\| P_{\operatorname{row}(X)}\, x \right\|_2^2 + \|W\|_2^2 \left\| P_{\operatorname{row}(X)^\perp}\, x \right\|_2^2 \right) \le 1 $$
Hölder’s inequality for matrices
For any matrix A ∈ Rm×n,
$$ \|A\|_* = \sup_{\{\|B\| \le 1 \,\mid\, B \in \mathbb{R}^{m \times n}\}} \langle A, B \rangle $$
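A small numerical illustration (not part of the slides): the supremum is attained at B = UVᵀ, and random matrices with spectral norm at most 1 never exceed it; the test matrix is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

inner = lambda M, N: np.sum(M * N)               # trace inner product <M, N>
print(np.linalg.norm(A, "nuc"), inner(A, U @ Vt))  # equal up to round-off

for _ in range(1000):
    B = rng.standard_normal(A.shape)
    B /= max(1.0, np.linalg.norm(B, 2))          # enforce spectral norm <= 1
    assert inner(A, B) <= np.linalg.norm(A, "nuc") + 1e-9
```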
Proof
$U^T W = 0$ implies
$$ \langle W, X \rangle = \left\langle W, U S V^T \right\rangle = \left\langle U^T W, S V^T \right\rangle = 0 $$
Moreover,
$$ \left\langle U V^T, X \right\rangle = \operatorname{tr}\left( V U^T X \right) = \operatorname{tr}\left( V U^T U S V^T \right) = \operatorname{tr}\left( V^T V S \right) = \operatorname{tr}(S) = \|X\|_* $$
Hence, by Hölder's inequality (using $\|G\| \le 1$), for any matrix Y ∈ Rm×n
$$ \|Y\|_* \ge \langle G, Y \rangle = \langle G, X \rangle + \langle G, Y - X \rangle = \left\langle U V^T, X \right\rangle + \langle W, X \rangle + \langle G, Y - X \rangle = \|X\|_* + \langle G, Y - X \rangle, $$
so G is a subgradient of the nuclear norm at X
Sparse linear regression with 2 features
$$ y := \alpha\, x_1 + z, \qquad X := \begin{bmatrix} x_1 & x_2 \end{bmatrix}, \qquad \|x_1\|_2 = 1, \quad \|x_2\|_2 = 1, \quad \langle x_1, x_2 \rangle = \rho $$
Analysis of lasso estimator
Let α ≥ 0. Then
$$ \beta_{\mathrm{lasso}} = \begin{bmatrix} \alpha + x_1^T z - \lambda \\ 0 \end{bmatrix} $$
as long as
$$ \frac{\left| x_2^T z - \rho\, x_1^T z \right|}{1 - |\rho|} \le \lambda \le \alpha + x_1^T z $$
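A minimal numerical check (not part of the slides) of this closed form using scikit-learn on synthetic two-feature data. The values of n, ρ, α, the noise level and λ are assumptions chosen so that λ falls inside the stated range; scikit-learn's alpha equals λ/n because it scales the data-fit term by 1/(2n).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, rho, alpha = 200, 0.6, 1.0
x1 = rng.standard_normal(n); x1 /= np.linalg.norm(x1)
u = rng.standard_normal(n); u -= (u @ x1) * x1; u /= np.linalg.norm(u)
x2 = rho * x1 + np.sqrt(1 - rho**2) * u        # unit norm, <x1, x2> = rho
z = 0.001 * rng.standard_normal(n)
y = alpha * x1 + z
X = np.column_stack([x1, x2])

lam = 0.05                                      # assumed value inside the stated range
beta = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y).coef_
print(beta)                                     # approx [alpha + x1^T z - lam, 0]
print(alpha + x1 @ z - lam)
```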
Lasso estimator
[Figure: lasso coefficients for the two-feature example as a function of the regularization parameter]
Optimality condition for nondifferentiable functions
If 0 is a subgradient of f at x, then $f(y) \ge f(x) + 0^T (y - x) = f(x)$ for all y ∈ Rn, so x is a global minimum
Under strict convexity the minimum is unique
Proof
The cost function is strictly convex if n ≥ 2 and |ρ| ≠ 1
Aim: Show that there is a subgradient equal to 0 at a 1-sparse solution
Proof
The gradient of the quadratic term
$$ q(\beta) := \frac{1}{2} \left\| X\beta - y \right\|_2^2 $$
at $\beta_{\mathrm{lasso}}$ equals
$$ \nabla q\left( \beta_{\mathrm{lasso}} \right) = X^T \left( X \beta_{\mathrm{lasso}} - y \right) $$
Proof
If only the first entry of $\beta_{\mathrm{lasso}}$ is nonzero and nonnegative, then
$$ g_{\ell_1} := \begin{bmatrix} 1 \\ \gamma \end{bmatrix} $$
is a subgradient of the ℓ1 norm at $\beta_{\mathrm{lasso}}$ for any γ ∈ R such that |γ| ≤ 1
In that case
$$ g_{\mathrm{lasso}} := \nabla q\left( \beta_{\mathrm{lasso}} \right) + \lambda\, g_{\ell_1} $$
is a subgradient of the cost function at $\beta_{\mathrm{lasso}}$
If $g_{\mathrm{lasso}} = 0$ then $\beta_{\mathrm{lasso}}$ is the unique solution
Proof
$$
g_{\mathrm{lasso}} := X^T \left( X \beta_{\mathrm{lasso}} - y \right) + \lambda \begin{bmatrix} 1 \\ \gamma \end{bmatrix}
= X^T \left( \beta_{\mathrm{lasso}}[1]\, x_1 - \alpha\, x_1 - z \right) + \lambda \begin{bmatrix} 1 \\ \gamma \end{bmatrix}
$$
$$
= \begin{bmatrix}
    x_1^T \left( \beta_{\mathrm{lasso}}[1]\, x_1 - \alpha\, x_1 - z \right) + \lambda \\
    x_2^T \left( \beta_{\mathrm{lasso}}[1]\, x_1 - \alpha\, x_1 - z \right) + \lambda \gamma
  \end{bmatrix}
= \begin{bmatrix}
    \beta_{\mathrm{lasso}}[1] - \alpha - x_1^T z + \lambda \\
    \rho\, \beta_{\mathrm{lasso}}[1] - \rho\, \alpha - x_2^T z + \lambda \gamma
  \end{bmatrix}
$$
Proof
$$
g_{\mathrm{lasso}} = \begin{bmatrix}
    \beta_{\mathrm{lasso}}[1] - \alpha - x_1^T z + \lambda \\
    \rho\, \beta_{\mathrm{lasso}}[1] - \rho\, \alpha - x_2^T z + \lambda \gamma
  \end{bmatrix}
$$
is equal to 0 if
$$ \beta_{\mathrm{lasso}}[1] = \alpha + x_1^T z - \lambda, \qquad \gamma = \frac{\rho\, \alpha + x_2^T z - \rho\, \beta_{\mathrm{lasso}}[1]}{\lambda} = \frac{x_2^T z - \rho\, x_1^T z}{\lambda} + \rho $$
Proof
We still need to check that this is a valid subgradient at $\beta_{\mathrm{lasso}}$, i.e.
◮ $\beta_{\mathrm{lasso}}[1]$ is nonnegative, which holds if $\lambda \le \alpha + x_1^T z$
◮ |γ| ≤ 1:
$$ |\gamma| \le \frac{\left| x_2^T z - \rho\, x_1^T z \right|}{\lambda} + |\rho| \le 1, $$
which holds if
$$ \lambda \ge \frac{\left| x_2^T z - \rho\, x_1^T z \right|}{1 - |\rho|} $$
Robust PCA
Data: Y ∈ Rn×m
Robust PCA estimator of the low-rank component:
$$ L_{\mathrm{RPCA}} := \arg\min_{L} \; \|L\|_* + \lambda \|Y - L\|_1 $$
where λ > 0 is a regularization parameter
Robust PCA estimator of the sparse component: $S_{\mathrm{RPCA}} := Y - L_{\mathrm{RPCA}}$
$\|\cdot\|_1$ denotes the ℓ1 norm of the vectorized matrix
Example
$$ Y := \begin{bmatrix} -2 & -1 & \alpha & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \end{bmatrix} $$
Analysis of robust PCA estimator
The robust PCA estimates of both components are exact for any value of α as long as
$$ \frac{2}{\sqrt{30}} < \lambda < \frac{2}{3} $$
Example
[Figure: entries of the low-rank and sparse components of the robust PCA solution as a function of λ]
Optimality + uniqueness condition
Let Y := L∗ + S∗, where L∗, S∗ ∈ Rm×n and $L^* = U_{L^*} S_{L^*} V_{L^*}^T$ has rank r, with $U_{L^*} \in \mathbb{R}^{m \times r}$, $V_{L^*} \in \mathbb{R}^{n \times r}$, $S_{L^*} \in \mathbb{R}^{r \times r}$
Assume there exists $G^* := U_{L^*} V_{L^*}^T + W$, where W satisfies
$$ \|W\| < 1, \qquad U_{L^*}^T W = 0, \qquad W V_{L^*} = 0, $$
and there also exists a matrix $G_{\ell_1}$ satisfying
$$ G_{\ell_1}[i, j] = -\operatorname{sign}\left( S^*[i, j] \right) \quad \text{if } S^*[i, j] \ne 0, \tag{1} $$
$$ \left| G_{\ell_1}[i, j] \right| < 1 \quad \text{otherwise}, \tag{2} $$
such that $G^* + \lambda G_{\ell_1} = 0$
Then the solution to the robust PCA problem is unique and equal to L∗
Optimality + uniqueness condition
$G^* := U_{L^*} V_{L^*}^T + W$ is a subgradient of the nuclear norm at L∗
$G_{\ell_1}$ is a subgradient of $\|\cdot - Y\|_1$ at L∗
$G^* + \lambda G_{\ell_1}$ is a subgradient of the cost function at L∗
$G^* + \lambda G_{\ell_1} = 0$ implies that L∗ is a solution (uniqueness is more difficult to prove)
Example
$$ Y := \begin{bmatrix} -2 & -1 & \alpha & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \end{bmatrix} $$
We want to show that the solution is
$$ L^* := \begin{bmatrix} -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \end{bmatrix}, \qquad S^* := \begin{bmatrix} 0 & 0 & \alpha & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} $$
Example
$$ L^* := \begin{bmatrix} -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \end{bmatrix} = \frac{1}{\sqrt{3}} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \cdot \sqrt{30} \cdot \frac{1}{\sqrt{10}} \begin{bmatrix} -2 & -1 & 0 & 1 & 2 \end{bmatrix} $$
$$ U_{L^*} V_{L^*}^T = \frac{1}{\sqrt{30}} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \begin{bmatrix} -2 & -1 & 0 & 1 & 2 \end{bmatrix} = \frac{1}{\sqrt{30}} \begin{bmatrix} -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \end{bmatrix} $$
Example
$$ G^* = U_{L^*} V_{L^*}^T + W = \frac{1}{\sqrt{30}} \begin{bmatrix} -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \end{bmatrix} + W $$
$$ G_{\ell_1} = \begin{bmatrix} g_1 & g_2 & -\operatorname{sign}(\alpha) & g_3 & g_4 \\ g_5 & g_6 & g_7 & g_8 & g_9 \\ g_{10} & g_{11} & g_{12} & g_{13} & g_{14} \end{bmatrix} $$
Example
Choosing the free entries of $G_{\ell_1}$ and the matrix W as
$$ G_{\ell_1} = \begin{bmatrix} \frac{2}{\lambda\sqrt{30}} & \frac{1}{\lambda\sqrt{30}} & -\operatorname{sign}(\alpha) & -\frac{1}{\lambda\sqrt{30}} & -\frac{2}{\lambda\sqrt{30}} \\ \frac{2}{\lambda\sqrt{30}} & \frac{1}{\lambda\sqrt{30}} & \frac{\operatorname{sign}(\alpha)}{2} & -\frac{1}{\lambda\sqrt{30}} & -\frac{2}{\lambda\sqrt{30}} \\ \frac{2}{\lambda\sqrt{30}} & \frac{1}{\lambda\sqrt{30}} & \frac{\operatorname{sign}(\alpha)}{2} & -\frac{1}{\lambda\sqrt{30}} & -\frac{2}{\lambda\sqrt{30}} \end{bmatrix}, \qquad W = \begin{bmatrix} 0 & 0 & \lambda \operatorname{sign}(\alpha) & 0 & 0 \\ 0 & 0 & -\frac{\lambda \operatorname{sign}(\alpha)}{2} & 0 & 0 \\ 0 & 0 & -\frac{\lambda \operatorname{sign}(\alpha)}{2} & 0 & 0 \end{bmatrix} $$
yields
$$ G^* + \lambda G_{\ell_1} = \frac{1}{\sqrt{30}} \begin{bmatrix} -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \end{bmatrix} + W + \lambda G_{\ell_1} = 0 $$
Checking the conditions:
◮ $W V_{L^*} = 0$ and $U_{L^*}^T W = 0$
◮ $|G_{\ell_1}[i, j]| < 1$ for $S^*[i, j] = 0$? Holds if $\lambda > 2/\sqrt{30}$
◮ $\|W\| < 1$? Holds if $\lambda < 2/3$
Applications Subgradients Optimization methods
Subgradient method
Optimization problem: minimize f(x), where f is convex but nondifferentiable
Subgradient-method iteration:
$$ x^{(0)} = \text{arbitrary initialization} $$
$$ x^{(k+1)} = x^{(k)} - \alpha_k\, g^{(k)}, $$
where $g^{(k)}$ is a subgradient of f at $x^{(k)}$
Least-squares regression with ℓ1-norm regularization
$$ \text{minimize} \quad \frac{1}{2} \|A x - y\|_2^2 + \lambda \|x\|_1 $$
Subgradient at $x^{(k)}$:
$$ g^{(k)} = A^T \left( A x^{(k)} - y \right) + \lambda\, \operatorname{sign}\left( x^{(k)} \right) $$
Subgradient-method iteration:
$$ x^{(0)} = \text{arbitrary initialization} $$
$$ x^{(k+1)} = x^{(k)} - \alpha_k \left( A^T \left( A x^{(k)} - y \right) + \lambda\, \operatorname{sign}\left( x^{(k)} \right) \right) $$
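A minimal sketch (not part of the slides) of this subgradient iteration with a diminishing step size α0/√k on synthetic data; the problem sizes, step size and regularization parameter are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 200, 100, 0.5
A = rng.standard_normal((m, n))
x_true = np.zeros(n); x_true[:5] = rng.standard_normal(5)
y = A @ x_true + 0.01 * rng.standard_normal(m)

cost = lambda x: 0.5 * np.sum((A @ x - y) ** 2) + lam * np.sum(np.abs(x))

x = np.zeros(n)
alpha0 = 1e-3
best = cost(x)
for k in range(1, 2001):
    g = A.T @ (A @ x - y) + lam * np.sign(x)     # subgradient at x
    x = x - alpha0 / np.sqrt(k) * g              # diminishing step alpha0/sqrt(k)
    best = min(best, cost(x))                    # not a descent method: track best cost
print(best)
```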
Convergence of subgradient method
It is not a descent method
The convergence rate can be shown to be O(1/ε²)
Diminishing step sizes are necessary for convergence
Experiment: minimize $\frac{1}{2} \|A x - y\|_2^2 + \lambda \|x\|_1$ with A ∈ R2000×1000, y = A x∗ + z, where x∗ is 100-sparse and z is iid Gaussian
Convergence of subgradient method
[Figure: relative cost error (f(x(k)) − f(x∗))/f(x∗) versus iteration k for step sizes α0, α0/√k and α0/k]
Convergence of subgradient method
[Figure: the same experiment over 5000 iterations]
Composite functions
An interesting class of functions for data analysis: f(x) + h(x), where f is convex and differentiable and h is convex but not differentiable
Example: $\frac{1}{2} \|A x - y\|_2^2 + \lambda \|x\|_1$
Motivation
Aim: Minimize a convex differentiable function f
Idea: Iteratively minimize a first-order approximation, while staying close to the current point
$$ x^{(0)} = \text{arbitrary initialization} $$
$$ x^{(k+1)} = \arg\min_x \; f\left( x^{(k)} \right) + \nabla f\left( x^{(k)} \right)^T \left( x - x^{(k)} \right) + \frac{1}{2 \alpha_k} \left\| x - x^{(k)} \right\|_2^2, $$
where αk is a parameter that determines how close we stay
Motivation
The linear approximation plus the ℓ2 term is convex, with gradient
$$ \nabla \left( f\left( x^{(k)} \right) + \nabla f\left( x^{(k)} \right)^T \left( x - x^{(k)} \right) + \frac{1}{2 \alpha_k} \left\| x - x^{(k)} \right\|_2^2 \right) = \nabla f\left( x^{(k)} \right) + \frac{x - x^{(k)}}{\alpha_k} $$
Setting the gradient to zero,
$$ x^{(k+1)} = \arg\min_x \; f\left( x^{(k)} \right) + \nabla f\left( x^{(k)} \right)^T \left( x - x^{(k)} \right) + \frac{1}{2 \alpha_k} \left\| x - x^{(k)} \right\|_2^2 = x^{(k)} - \alpha_k \nabla f\left( x^{(k)} \right), $$
which is exactly a gradient-descent step
Proximal gradient method
Idea: Minimize the local first-order approximation plus h
$$ x^{(k+1)} = \arg\min_x \; f\left( x^{(k)} \right) + \nabla f\left( x^{(k)} \right)^T \left( x - x^{(k)} \right) + \frac{1}{2 \alpha_k} \left\| x - x^{(k)} \right\|_2^2 + h(x) $$
$$ \qquad = \arg\min_x \; \frac{1}{2} \left\| x - \left( x^{(k)} - \alpha_k \nabla f\left( x^{(k)} \right) \right) \right\|_2^2 + \alpha_k\, h(x) $$
$$ \qquad = \operatorname{prox}_{\alpha_k h}\left( x^{(k)} - \alpha_k \nabla f\left( x^{(k)} \right) \right) $$
Proximal operator:
$$ \operatorname{prox}_h(y) := \arg\min_x \; h(x) + \frac{1}{2} \|y - x\|_2^2 $$
Proximal gradient method
Method to solve the optimization problem
$$ \text{minimize} \quad f(x) + h(x), $$
where f is differentiable and $\operatorname{prox}_h$ is tractable
Proximal-gradient iteration:
$$ x^{(0)} = \text{arbitrary initialization} $$
$$ x^{(k+1)} = \operatorname{prox}_{\alpha_k h}\left( x^{(k)} - \alpha_k \nabla f\left( x^{(k)} \right) \right) $$
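A minimal generic sketch (not part of the slides) of the proximal-gradient iteration, taking ∇f and the proximal operator of h as arguments; the helper name proximal_gradient and the example call are assumptions for illustration.

```python
import numpy as np

def proximal_gradient(grad_f, prox_h, x0, alpha, n_iters=100):
    """Run x <- prox_{alpha*h}(x - alpha*grad_f(x)) with a fixed step alpha.

    prox_h(v, alpha) should return prox_{alpha*h}(v)."""
    x = x0.copy()
    for _ in range(n_iters):
        x = prox_h(x - alpha * grad_f(x), alpha)
    return x

# Example: with h = 0 the iteration reduces to gradient descent on f(x) = ||x||^2 / 2
x = proximal_gradient(grad_f=lambda x: x, prox_h=lambda v, a: v,
                      x0=np.array([5.0, -3.0]), alpha=0.5)
print(x)    # close to the minimizer at the origin
```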
Interpretation as a fixed-point method
A vector x∗ is a solution to
$$ \text{minimize} \quad f(x) + h(x) $$
if and only if it is a fixed point of the proximal-gradient iteration for any α > 0:
$$ x^* = \operatorname{prox}_{\alpha h}\left( x^* - \alpha \nabla f(x^*) \right) $$
Proof
x∗ is the solution to
$$ \min_x \; \alpha\, h(x) + \frac{1}{2} \left\| x^* - \alpha \nabla f(x^*) - x \right\|_2^2 \tag{3} $$
if and only if there is a subgradient g of h at x∗ such that
$$ \alpha \nabla f(x^*) + \alpha\, g = 0 $$
x∗ minimizes f + h if and only if there is a subgradient g of h at x∗ such that
$$ \nabla f(x^*) + g = 0 $$
Proximal operator of ℓ1 norm
The proximal operator of the ℓ1 norm is the soft-thresholding operator:
$$ \operatorname{prox}_{\alpha \|\cdot\|_1}(y) = \mathcal{S}_\alpha(y), $$
where α > 0 and
$$ \mathcal{S}_\alpha(y)_i := \begin{cases} y_i - \operatorname{sign}(y_i)\, \alpha & \text{if } |y_i| \ge \alpha \\ 0 & \text{otherwise} \end{cases} $$
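A minimal sketch (not part of the slides) of soft-thresholding in NumPy, checked against a brute-force grid minimization of the one-dimensional proximal problem; the values of y and α are assumptions.

```python
import numpy as np

def soft_threshold(y, alpha):
    """Entrywise soft-thresholding: proximal operator of alpha * l1 norm."""
    return np.sign(y) * np.maximum(np.abs(y) - alpha, 0.0)

y, alpha = 1.3, 0.5
grid = np.linspace(-5, 5, 100001)
brute = grid[np.argmin(alpha * np.abs(grid) + 0.5 * (y - grid) ** 2)]
print(soft_threshold(y, alpha), brute)     # both approximately 0.8
```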
Proof
$$ \alpha \|x\|_1 + \frac{1}{2} \|y - x\|_2^2 = \sum_{i=1}^m \left( \alpha\, |x[i]| + \frac{1}{2} \left( y[i] - x[i] \right)^2 \right) $$
The problem is separable, so we can just consider
$$ w(x) := \alpha |x| + \frac{1}{2} (y - x)^2 = \frac{y^2 + x^2}{2} + \alpha |x| - yx $$
Proof
If x ≥ 0,
$$ w(x) = \frac{y^2 + x^2}{2} - (y - \alpha)\, x, \qquad w'(x) = x - (y - \alpha) $$
If y ≥ α, the minimum is at x := y − α
If y < α, the minimum is at 0
Proof
If x < 0,
$$ w(x) = \frac{y^2 + x^2}{2} - (y + \alpha)\, x, \qquad w'(x) = x - (y + \alpha) $$
If y ≤ −α, the minimum is at x := y + α
If y ≥ −α, the minimum is at 0
Proof
If −α ≤ y ≤ α, the minimum is at x := 0
If y ≥ α, the minimum is at x := y − α or at x := 0, but
$$ w(y - \alpha) = \alpha (y - \alpha) + \frac{\alpha^2}{2} = \alpha y - \frac{\alpha^2}{2} \le \frac{y^2}{2} = w(0) $$
because (y − α)² ≥ 0
The same argument applies for y < −α
Iterative Shrinkage-Thresholding Algorithm (ISTA)
The proximal gradient method applied to the problem
$$ \text{minimize} \quad \frac{1}{2} \|A x - y\|_2^2 + \lambda \|x\|_1 $$
is called ISTA
ISTA iteration:
$$ x^{(0)} = \text{arbitrary initialization} $$
$$ x^{(k+1)} = \mathcal{S}_{\alpha_k \lambda}\left( x^{(k)} - \alpha_k A^T \left( A x^{(k)} - y \right) \right) $$
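A minimal sketch (not part of the slides) of ISTA with the constant step αk = 1/‖A‖²; the synthetic data and λ are assumptions.

```python
import numpy as np

def ista(A, y, lam, n_iters=500):
    """Iterative shrinkage-thresholding with constant step alpha = 1/||A||_2^2."""
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - y)
        z = x - alpha * grad
        x = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)  # soft-threshold
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 100))
x_true = np.zeros(100); x_true[:5] = [3, -2, 1.5, -1, 2]
y = A @ x_true + 0.01 * rng.standard_normal(200)
print(np.round(ista(A, y, lam=0.5)[:8], 2))   # approximately recovers the sparse coefficients
```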
Fast Iterative Shrinkage-Thresholding Algorithm (FISTA)
ISTA can be accelerated using Nesterov's accelerated gradient method
FISTA iteration:
$$ x^{(0)} = \text{arbitrary initialization}, \qquad z^{(0)} = x^{(0)} $$
$$ x^{(k+1)} = \mathcal{S}_{\alpha_k \lambda}\left( z^{(k)} - \alpha_k A^T \left( A z^{(k)} - y \right) \right) $$
$$ z^{(k+1)} = x^{(k+1)} + \frac{k}{k + 3} \left( x^{(k+1)} - x^{(k)} \right) $$
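A minimal sketch (not part of the slides) of FISTA with the k/(k+3) extrapolation above and a constant step 1/‖A‖²; it can be run on the same synthetic data as the ISTA sketch.

```python
import numpy as np

def fista(A, y, lam, n_iters=500):
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    z = x.copy()
    for k in range(n_iters):
        grad = A.T @ (A @ z - y)
        w = z - alpha * grad
        x_new = np.sign(w) * np.maximum(np.abs(w) - alpha * lam, 0.0)  # soft-threshold
        z = x_new + k / (k + 3) * (x_new - x)    # momentum extrapolation
        x = x_new
    return x
```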
Convergence of proximal gradient method
Without acceleration:
◮ Descent method
◮ Convergence rate can be shown to be O(1/ε) with a constant step or backtracking line search
With acceleration:
◮ Not a descent method
◮ Convergence rate can be shown to be O(1/√ε) with a constant step or backtracking line search
Experiment: minimize $\frac{1}{2} \|A x - y\|_2^2 + \lambda \|x\|_1$ with A ∈ R2000×1000, y = A x0 + z, x0 100-sparse and z iid Gaussian
Convergence of proximal gradient method
[Figure: relative cost error versus iteration k for the subgradient method (step α0/√k), ISTA and FISTA]
Coordinate descent
Idea: Solve the n-dimensional problem
$$ \text{minimize} \quad c\left( x[1], x[2], \ldots, x[n] \right) $$
by solving a sequence of one-dimensional problems
Coordinate-descent iteration:
$$ x^{(0)} = \text{arbitrary initialization} $$
$$ x^{(k+1)}[i] = \arg\min_{\alpha} \; c\left( x^{(k)}[1], \ldots, \alpha, \ldots, x^{(k)}[n] \right) \quad \text{for some } 1 \le i \le n $$
Coordinate descent
Convergence is guaranteed for functions of the form
$$ f(x) + \sum_{i=1}^n h_i(x[i]), $$
where f is convex and differentiable and h1, . . . , hn are convex
Least-squares regression with ℓ1-norm regularization
$$ h(x) := \frac{1}{2} \|A x - y\|_2^2 + \lambda \|x\|_1 $$
The solution to the subproblem
$$ \min_{x[i]} \; h\left( x[1], \ldots, x[i], \ldots, x[n] \right) $$
is
$$ x^*[i] = \frac{\mathcal{S}_\lambda(\gamma_i)}{\|A_i\|_2^2}, $$
where Ai is the ith column of A and
$$ \gamma_i := \sum_{l=1}^m A_{li} \left( y[l] - \sum_{j \ne i} A_{lj}\, x[j] \right) $$
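A minimal sketch (not part of the slides) of cyclic coordinate descent using the closed-form update above; the synthetic data and λ are assumptions.

```python
import numpy as np

def soft_threshold(t, lam):
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def coordinate_descent(A, y, lam, n_sweeps=100):
    m, n = A.shape
    x = np.zeros(n)
    col_sq = np.sum(A ** 2, axis=0)               # ||A_i||_2^2 for each column
    r = y - A @ x                                  # residual, kept up to date
    for _ in range(n_sweeps):
        for i in range(n):
            r += A[:, i] * x[i]                    # remove coordinate i from residual
            gamma = A[:, i] @ r                    # gamma_i from the formula above
            x[i] = soft_threshold(gamma, lam) / col_sq[i]
            r -= A[:, i] * x[i]                    # add updated coordinate back
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
x_true = np.zeros(50); x_true[:3] = [2.0, -1.0, 1.5]
y = A @ x_true + 0.01 * rng.standard_normal(200)
print(np.round(coordinate_descent(A, y, lam=0.5)[:5], 2))
```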