

SLIDE 1

Sparse regression

DS-GA 1013 / MATH-GA 2824 Mathematical Tools for Data Science

https://cims.nyu.edu/~cfgranda/pages/MTDS_spring20/index.html

Carlos Fernandez-Granda

SLIDE 2

Sparse regression

Linear regression is challenging when the number of features p is large

Solution: Select a subset of features I ⊂ {1, . . . , p} such that

$$y \approx \sum_{i \in I} \beta[i]\, x[i]$$

Equivalently, find a sparse coefficient vector β ∈ Rᵖ such that y ≈ ⟨x, β⟩

Problem: How to promote sparsity?
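To see why explicit subset selection is hard, note that with p features there are 2ᵖ candidate subsets. The following minimal sketch (synthetic data; all names are illustrative, and the n × p design convention is the transpose of the X used later in these slides) brute-forces all subsets of a fixed size, which is only feasible for tiny p:

```python
# Brute-force subset selection on synthetic data: a sketch of why this
# approach does not scale. With p features there are 2^p candidate subsets;
# here we only search subsets of a fixed size k.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 50, 10, 2
X = rng.standard_normal((n, p))             # n x p design matrix
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]                 # only features 0 and 1 are active
y = X @ beta_true + 0.1 * rng.standard_normal(n)

def fit_error(cols):
    """Squared error of the least-squares fit restricted to these features."""
    coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    return np.sum((y - X[:, cols] @ coef) ** 2)

best = min(itertools.combinations(range(p), k), key=lambda s: fit_error(list(s)))
print(best)  # typically (0, 1), the truly active features
```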

SLIDE 3

Toy problem

Find t such that

$$v_t := \begin{bmatrix} t \\ t - 1 \\ t - 1 \end{bmatrix}$$

is sparse. Equivalently, find arg minₜ ‖v_t‖₀

SLIDE 4

ℓ0 “norm”

The number of nonzero entries of a vector

Not a norm! It is not absolutely homogeneous: for any nonzero x,

$$\|2x\|_0 = \|x\|_0 \neq 2\,\|x\|_0$$
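A quick numeric check of the failed homogeneity (a minimal NumPy sketch):

```python
# The "l0 norm" counts nonzero entries; scaling a vector does not change it,
# so absolute homogeneity (||2x|| = 2||x||) fails.
import numpy as np

def l0(x):
    return np.count_nonzero(x)

x = np.array([1.0, 0.0, -2.0])
print(l0(2 * x), l0(x), 2 * l0(x))  # 2 2 4: ||2x||_0 = ||x||_0 != 2||x||_0
```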

SLIDE 5

Toy problem

[Figure: ‖v_t‖₀ as a function of t]

SLIDE 6

Alternative strategy

Minimize another norm: f(t) := ‖v_t‖

SLIDE 7

Toy problem

[Figure: ‖v_t‖₀, ‖v_t‖₁, ‖v_t‖₂, and ‖v_t‖∞ as functions of t]
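The figure can be reproduced numerically. This sketch evaluates each norm of v_t on a grid and prints its minimizer; the ℓ1 norm, like the ℓ0 “norm”, is minimized at the sparse point t = 1, while the ℓ2 and ℓ∞ norms are not:

```python
# Evaluate ||v_t|| for v_t = (t, t-1, t-1) on a grid under several norms
# and report the minimizer of each.
import numpy as np

t = np.linspace(-0.4, 1.4, 1801)
v = np.stack([t, t - 1, t - 1])                     # each column is one v_t

norms = {
    "l0":   np.count_nonzero(np.abs(v) > 1e-9, axis=0),
    "l1":   np.sum(np.abs(v), axis=0),
    "l2":   np.sqrt(np.sum(v ** 2, axis=0)),
    "linf": np.max(np.abs(v), axis=0),
}
for name, values in norms.items():
    print(name, round(t[np.argmin(values)], 3))
# l0 -> 1.0, l1 -> 1.0, l2 -> 0.667, linf -> 0.5
```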

SLIDE 8

◮ The lasso
◮ Convexity
◮ Subgradients
◮ Analysis of the lasso estimator for a simple example

SLIDE 9

Sparse linear regression

Find a small subset of useful features: a model selection problem

Two objectives:

◮ Good fit to the data: ‖X^T β − y‖₂² should be as small as possible
◮ Using a small number of features: β should be as sparse as possible

SLIDE 10

The lasso

Uses ℓ1-norm regularization to promote sparse coefficients:

$$\beta_{\text{lasso}} := \arg\min_{\beta} \; \frac{1}{2} \left\| y - X^T \beta \right\|_2^2 + \lambda \|\beta\|_1$$
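As a concrete illustration, here is a minimal sketch using scikit-learn's Lasso (assuming the library is available; the data are synthetic). Note the scaling convention: sklearn minimizes (1/(2n))‖y − Xw‖₂² + α‖w‖₁, so α plays the role of λ/n in the cost above.

```python
# Fit the lasso on synthetic data and inspect the support of the estimate.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))     # n x p design (transpose of the slides' X)
beta_true = np.zeros(p)
beta_true[[0, 3]] = [2.0, -1.5]     # sparse ground truth
y = X @ beta_true + 0.1 * rng.standard_normal(n)

model = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
print(np.flatnonzero(model.coef_))  # typically [0 3]: a sparse support
```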

SLIDE 11

Temperature prediction via linear regression

◮ Dataset of hourly temperatures measured at weather stations all over the US
◮ Goal: Predict the temperature in Jamestown (North Dakota) from the other temperatures
◮ Response: Temperature in Jamestown
◮ Features: Temperatures at 133 other stations (p = 133) in 2015
◮ Test set: 10³ measurements
◮ Additional test set: All measurements from 2016

SLIDE 12

Ridge regression (n := 135)

[Figure: ridge-regression coefficients as a function of the regularization parameter λ/n (log scale, 10⁻¹ to 10⁶); highlighted coefficients: Wolf Point MT, Aberdeen SD, Buffalo SD]

SLIDE 13

Lasso (n := 135)

[Figure: lasso coefficients as a function of the regularization parameter λ/n (log scale, 10⁻⁵ to 10²); highlighted coefficients: Wolf Point MT, Aberdeen SD, Buffalo SD]

SLIDE 14

Lasso (n := 135)

[Figure: training and validation error (average error, degrees Celsius) as a function of the regularization parameter (log scale, 10⁻⁵ to 10²)]
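A hedged sketch of how such curves are produced: sweep the regularization parameter over a log grid, fit on the training set, and evaluate on held-out data. The station dataset is not reproduced here, so synthetic stand-in data are used.

```python
# Train/validation error as a function of the lasso regularization parameter.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n_train, n_val, p = 135, 200, 133
X = rng.standard_normal((n_train + n_val, p))
beta_true = np.zeros(p)
beta_true[:3] = [1.0, 0.5, -0.5]
y = X @ beta_true + 0.5 * rng.standard_normal(n_train + n_val)
X_tr, y_tr = X[:n_train], y[:n_train]
X_va, y_va = X[n_train:], y[n_train:]

for alpha in np.logspace(-4, 1, 6):            # alpha ~ lambda/n, log grid
    m = Lasso(alpha=alpha, fit_intercept=False, max_iter=100_000).fit(X_tr, y_tr)
    err_tr = np.abs(y_tr - m.predict(X_tr)).mean()
    err_va = np.abs(y_va - m.predict(X_va)).mean()
    print(f"alpha={alpha:.0e}  train={err_tr:.2f}  val={err_va:.2f}")
# training error grows with alpha; validation error is U-shaped
```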

SLIDE 15

Lasso

[Figure: regularization parameter (log scale, 10⁻⁵ to 10⁻²) as a function of the number of training data n (log scale, 10² to 10⁴)]

SLIDE 16

Ridge-regression coefficients

[Figure: ridge-regression coefficients as a function of the number of training data (10² to 10⁴); highlighted coefficients: Wolf Point MT, Aberdeen SD, Buffalo SD]

SLIDE 17

Lasso coefficients

[Figure: lasso coefficients as a function of the number of training data (10² to 10⁴); highlighted coefficients: Wolf Point MT, Aberdeen SD, Buffalo SD]

SLIDE 18

Results

[Figure: average error (degrees Celsius) as a function of the number of training data (10² to 10⁴): training error, test error, and 2016 test error for ridge regression (RR) and the lasso]

SLIDE 19

◮ The lasso
◮ Convexity
◮ Subgradients
◮ Analysis of the lasso estimator for a simple example

SLIDE 20

Convex functions

A function f : Rⁿ → R is convex if for any x, y ∈ Rⁿ and any θ ∈ (0, 1)

$$\theta f(x) + (1 - \theta)\, f(y) \geq f(\theta x + (1 - \theta)\, y)$$

SLIDE 21

Convex functions

[Figure: the chord θf(x) + (1 − θ)f(y) between (x, f(x)) and (y, f(y)) lies above f(θx + (1 − θ)y)]

SLIDE 22

Strictly convex functions

A function f : Rⁿ → R is strictly convex if for any x, y ∈ Rⁿ with x ≠ y and any θ ∈ (0, 1)

$$\theta f(x) + (1 - \theta)\, f(y) > f(\theta x + (1 - \theta)\, y)$$

SLIDE 23

Linear and quadratic functions

Linear functions are convex (with equality):

$$f(\theta x + (1 - \theta)\, y) = \theta f(x) + (1 - \theta)\, f(y)$$

Positive definite quadratic forms are strictly convex

SLIDE 24

Norms are convex

For any x, y ∈ Rⁿ and any θ ∈ (0, 1), by the triangle inequality and homogeneity,

$$\|\theta x + (1 - \theta)\, y\| \leq \|\theta x\| + \|(1 - \theta)\, y\| = \theta \|x\| + (1 - \theta) \|y\|$$

SLIDE 25

ℓ0 “norm” is not convex

Let x := (1, 0) and y := (0, 1). For any θ ∈ (0, 1)

$$\|\theta x + (1 - \theta)\, y\|_0 = 2 > \theta \|x\|_0 + (1 - \theta) \|y\|_0 = 1$$
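These two facts can be checked numerically: sample random pairs to test the convexity inequality for the ℓ1 norm, and plug in the counterexample above for the ℓ0 “norm” (a minimal sketch):

```python
# The l1 norm passes the convexity inequality on random samples;
# the l0 "norm" fails it at the counterexample from this slide.
import numpy as np

rng = np.random.default_rng(0)
l1 = lambda v: np.sum(np.abs(v))
l0 = lambda v: float(np.count_nonzero(v))

for _ in range(1000):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    th = rng.uniform(0, 1)
    assert l1(th * x + (1 - th) * y) <= th * l1(x) + (1 - th) * l1(y) + 1e-9

x, y, th = np.array([1.0, 0.0]), np.array([0.0, 1.0]), 0.5
print(l0(th * x + (1 - th) * y), th * l0(x) + (1 - th) * l0(y))  # 2.0 > 1.0
```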

SLIDE 26

Is the lasso cost function convex?

If f is strictly convex and g is convex, what about h := f + λg (λ > 0)?

$$\begin{aligned}
h(\theta x + (1-\theta)\, y) &= f(\theta x + (1-\theta)\, y) + \lambda g(\theta x + (1-\theta)\, y) \\
&< \theta f(x) + (1-\theta)\, f(y) + \lambda\theta g(x) + \lambda(1-\theta)\, g(y) \\
&= \theta h(x) + (1-\theta)\, h(y)
\end{aligned}$$

so h is strictly convex

SLIDE 27

Lasso cost function is convex

A sum of convex functions is convex

If at least one summand is strictly convex, the sum is strictly convex

Scaling by a positive factor preserves convexity

The lasso cost function is convex!

SLIDE 28

Local minima are global

Any local minimum of a convex function is also a global minimum

SLIDE 29

Strictly convex functions

Strictly convex functions have at most one global minimum

Proof: Assume two distinct minima x ≠ y exist, with value v_min. Then

$$f(0.5x + 0.5y) < 0.5 f(x) + 0.5 f(y) = v_{\min},$$

which contradicts v_min being the minimum value

SLIDE 30

◮ The lasso
◮ Convexity
◮ Subgradients
◮ Analysis of the lasso estimator for a simple example

SLIDE 31

Epigraph

The epigraph of f : Rⁿ → R is

$$\operatorname{epi}(f) := \left\{ x \in \mathbb{R}^{n+1} \,\middle|\, f\big(x[1], \ldots, x[n]\big) \leq x[n+1] \right\}$$

SLIDE 32

Epigraph

[Figure: a function f and its epigraph epi(f)]

SLIDE 33

Supporting hyperplane

A hyperplane H is a supporting hyperplane of a set S at x if

◮ H and S intersect at x
◮ S is contained in one of the half-spaces bounded by H

SLIDE 34

Supporting hyperplane

SLIDE 35

Subgradient

A function f : Rⁿ → R is convex if and only if its epigraph has a supporting hyperplane at every point (x, f(x))

It is strictly convex if and only if, for every x ∈ Rⁿ, the epigraph intersects the supporting hyperplane only at that point

SLIDE 36

Subgradients

A subgradient of f : Rⁿ → R at x ∈ Rⁿ is a vector g ∈ Rⁿ such that

$$f(y) \geq f(x) + g^T (y - x) \quad \text{for all } y \in \mathbb{R}^n$$

The hyperplane

$$H_g := \left\{ y \in \mathbb{R}^{n+1} \,\middle|\, y[n+1] = f(x) + g^T \big( (y[1], \ldots, y[n]) - x \big) \right\}$$

is a supporting hyperplane of the epigraph at x

The set of all subgradients at x is called the subdifferential

SLIDE 37

Subgradients

SLIDE 38

Subgradient of differentiable function

If a function is differentiable, the only subgradient at each point is the gradient

SLIDE 39

Proof

Assume g is a subgradient at x. For any α ≥ 0 and any standard basis vector e_i,

$$f(x + \alpha e_i) \geq f(x) + g^T \alpha e_i = f(x) + g[i]\,\alpha$$

$$f(x) \leq f(x - \alpha e_i) + g^T \alpha e_i = f(x - \alpha e_i) + g[i]\,\alpha$$

Combining both inequalities,

$$\frac{f(x) - f(x - \alpha e_i)}{\alpha} \leq g[i] \leq \frac{f(x + \alpha e_i) - f(x)}{\alpha}$$

Letting α → 0 implies $g[i] = \frac{\partial f(x)}{\partial x[i]}$

SLIDE 40

Optimality condition for nondifferentiable functions

x is a minimum of f if and only if the zero vector is a subgradient of f at x:

$$f(y) \geq f(x) + 0^T (y - x) = f(x) \quad \text{for all } y \in \mathbb{R}^n$$

Under strict convexity the minimum is unique

SLIDE 41

Sum of subgradients

Let g₁ and g₂ be subgradients at x ∈ Rⁿ of f₁ : Rⁿ → R and f₂ : Rⁿ → R. Then g := g₁ + g₂ is a subgradient of f := f₁ + f₂ at x

Proof: For any y ∈ Rⁿ

$$f(y) = f_1(y) + f_2(y) \geq f_1(x) + g_1^T (y - x) + f_2(x) + g_2^T (y - x) = f(x) + g^T (y - x)$$

SLIDE 42

Subgradient of scaled function

Let g₁ be a subgradient at x ∈ Rⁿ of f₁ : Rⁿ → R. For any α ≥ 0, g₂ := αg₁ is a subgradient of f₂ := αf₁ at x

Proof: For any y ∈ Rⁿ

$$f_2(y) = \alpha f_1(y) \geq \alpha \left( f_1(x) + g_1^T (y - x) \right) = f_2(x) + g_2^T (y - x)$$

SLIDE 43

Subdifferential of absolute value

At x ≠ 0, f(x) = |x| is differentiable, so the only subgradient is g = sign(x)

At x = 0, we need f(0 + y) ≥ f(0) + g(y − 0), i.e. |y| ≥ gy for all y, which holds if and only if |g| ≤ 1
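A small numeric confirmation: test the defining inequality |y| ≥ gy at x = 0 on a grid of candidate slopes g.

```python
# The inequality |y| >= g*y for all y holds exactly when |g| <= 1.
import numpy as np

ys = np.linspace(-2, 2, 401)
for g in [-1.5, -1.0, 0.0, 0.7, 1.0, 1.5]:
    ok = bool(np.all(np.abs(ys) >= g * ys - 1e-12))
    print(f"g = {g:+.1f}: subgradient of |x| at 0? {ok}")
# True exactly for |g| <= 1
```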

SLIDE 44

Subdifferential of absolute value

[Figure: f(x) = |x| and its supporting lines at x = 0]

SLIDE 45

Subdifferential of ℓ1 norm

g is a subgradient of the ℓ1 norm at x ∈ Rⁿ if and only if

◮ g[i] = sign(x[i]) if x[i] ≠ 0
◮ |g[i]| ≤ 1 if x[i] = 0
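This characterization translates directly into a membership test (a minimal sketch; the function name is illustrative):

```python
# Test whether g is a subgradient of the l1 norm at x, entrywise.
import numpy as np

def in_l1_subdifferential(g, x, tol=1e-9):
    active = x != 0
    return (np.allclose(g[active], np.sign(x[active]), atol=tol)
            and bool(np.all(np.abs(g[~active]) <= 1 + tol)))

x = np.array([2.0, 0.0, -1.0])
print(in_l1_subdifferential(np.array([1.0, 0.3, -1.0]), x))  # True
print(in_l1_subdifferential(np.array([1.0, 1.2, -1.0]), x))  # False
```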

SLIDE 46

Proof (one direction)

Assume g[i] is a subgradient of |·| at x[i] for 1 ≤ i ≤ n. For any y ∈ Rⁿ

$$\|y\|_1 = \sum_{i=1}^n |y[i]| \geq \sum_{i=1}^n \big( |x[i]| + g[i]\,(y[i] - x[i]) \big) = \|x\|_1 + g^T (y - x)$$

SLIDE 47

Subdifferential of ℓ1 norm

SLIDE 48

Subdifferential of ℓ1 norm

SLIDE 49

Subdifferential of ℓ1 norm

SLIDE 50

◮ The lasso
◮ Convexity
◮ Subgradients
◮ Analysis of the lasso estimator for a simple example

SLIDE 51

Additive model

$$\tilde{y}_{\text{train}} := X^T \beta_{\text{true}} + \tilde{z}_{\text{train}}$$

Goal: Gain intuition about why the lasso promotes sparse solutions

SLIDE 52

Decomposition of lasso cost function

$$\arg\min_{\beta} \left\| \tilde{y}_{\text{train}} - X^T\beta \right\|_2^2 + \lambda \|\beta\|_1
= \arg\min_{\beta} \; (\beta - \beta_{\text{true}})^T X X^T (\beta - \beta_{\text{true}}) + \lambda \|\beta\|_1 - 2\,\tilde{z}_{\text{train}}^T X^T \beta$$
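The identity holds because expanding the square shows the two costs differ only by terms that do not depend on β (namely ‖z̃_train‖₂² and 2 z̃_train^T X^T β_true). A quick numeric check of that claim, using the slides' p × n convention for X:

```python
# Verify that the two cost functions differ by a beta-independent constant,
# so they have the same minimizer.
import numpy as np

rng = np.random.default_rng(0)
p, n, lam = 5, 30, 0.3
X = rng.standard_normal((p, n))                  # p x n, as in the slides
beta_true = np.array([1.0, -2.0, 0.0, 0.0, 0.5])
z = 0.1 * rng.standard_normal(n)
y = X.T @ beta_true + z

def cost_left(b):
    return np.sum((y - X.T @ b) ** 2) + lam * np.sum(np.abs(b))

def cost_right(b):
    d = b - beta_true
    return d @ (X @ X.T) @ d + lam * np.sum(np.abs(b)) - 2 * z @ (X.T @ b)

diffs = [cost_left(b) - cost_right(b) for b in rng.standard_normal((4, p))]
print(np.allclose(diffs, diffs[0]))              # True: constant offset only
```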

SLIDE 53

Sparse regression with two features

One true feature: ỹ := x_true + z̃

We fit a model using an additional feature:

$$X := \begin{bmatrix} x_{\text{true}} & x_{\text{other}} \end{bmatrix}^T, \qquad \beta_{\text{true}} := \begin{bmatrix} 1 \\ 0 \end{bmatrix}$$

SLIDE 54

$(\beta - \beta_{\text{true}})^T X X^T (\beta - \beta_{\text{true}})$

[Figure: contour lines of the quadratic term over (β[1], β[2]), centered at β_true]

SLIDE 55

$\|\beta\|_1$

[Figure: contour lines of the ℓ1 norm over (β[1], β[2])]

SLIDE 56

$(\beta - \beta_{\text{true}})^T X X^T (\beta - \beta_{\text{true}}) + \lambda \|\beta\|_1$

[Figure: contour lines of the penalized cost over (β[1], β[2]), with β_true and β_lasso marked]

SLIDE 57

$(\beta - \beta_{\text{true}})^T X X^T (\beta - \beta_{\text{true}}) + \lambda \|\beta\|_1 - 2\,\tilde{z}_{\text{train}}^T X^T \beta$

[Figure: contour lines of the full lasso cost over (β[1], β[2]), with β_OLS, β_lasso, and β_true marked]

SLIDE 58

$(\beta - \beta_{\text{true}})^T X X^T (\beta - \beta_{\text{true}}) + \lambda \|\beta\|_1 - 2\,\tilde{z}_{\text{train}}^T X^T \beta$

[Figure: contour lines of the full lasso cost over (β[1], β[2]), with β_OLS, β_lasso, and β_true marked]

SLIDE 59

$(\beta - \beta_{\text{true}})^T X X^T (\beta - \beta_{\text{true}}) + \lambda \|\beta\|_1 - 2\,\tilde{z}_{\text{train}}^T X^T \beta$

[Figure: contour lines of the full lasso cost over (β[1], β[2]), with β_OLS, β_lasso, and β_true marked]

SLIDE 60

$(\beta - \beta_{\text{true}})^T X X^T (\beta - \beta_{\text{true}}) + \lambda \|\beta\|_1 - 2\,\tilde{z}_{\text{train}}^T X^T \beta$

[Figure: contour lines of the full lasso cost over (β[1], β[2]), with β_OLS, β_lasso, and β_true marked]

SLIDE 61

λ = 0.02

[Figure: lasso cost contours over (β[1], β[2]) for λ = 0.02, with β_true marked]

SLIDE 62

λ = 0.2

[Figure: lasso cost contours over (β[1], β[2]) for λ = 0.2, with β_true marked]

SLIDE 63

λ = 2

[Figure: lasso cost contours over (β[1], β[2]) for λ = 2, with β_true marked]

SLIDE 64

λ = 4

[Figure: lasso cost contours over (β[1], β[2]) for λ = 4, with β_true marked]

SLIDE 65

Sparse regression with two features

Feature vectors and noise are fixed n-dimensional vectors: y := x_true + z

We fit a model using an additional feature:

$$X := \begin{bmatrix} x_{\text{true}} & x_{\text{other}} \end{bmatrix}^T, \qquad \beta_{\text{true}} := \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad \|x_{\text{true}}\|_2 = \|x_{\text{other}}\|_2 = 1$$
SLIDE 66

Sparse regression with two features

If λ satisfies

$$\frac{\left| x_{\text{other}}^T z - \rho\, x_{\text{true}}^T z \right|}{1 - |\rho|} \leq \lambda \leq 1 + x_{\text{true}}^T z,$$

where ρ := x_true^T x_other, then the lasso coefficient estimate equals

$$\beta_{\text{lasso}} = \begin{bmatrix} 1 + x_{\text{true}}^T z - \lambda \\ 0 \end{bmatrix}$$
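This closed form can be verified numerically. The sketch below solves the lasso with a generic proximal-gradient (ISTA) loop, a standard solver that is not part of the slides, and compares the result with the formula; the noise is kept small so that the stated condition on λ holds.

```python
# Verify the two-feature closed form: solve the lasso by proximal gradient
# (ISTA) and compare with 1 + x_true^T z - lambda.
import numpy as np

rng = np.random.default_rng(3)
n, lam = 200, 0.2
x_true = rng.standard_normal(n);  x_true /= np.linalg.norm(x_true)
x_other = rng.standard_normal(n); x_other /= np.linalg.norm(x_other)
z = 0.01 * rng.standard_normal(n)        # small noise: the condition on lam holds
y = x_true + z
A = np.stack([x_true, x_other], axis=1)  # n x 2 design (A = X^T in slide notation)

L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
b = np.zeros(2)
for _ in range(5000):
    w = b - A.T @ (A @ b - y) / L        # gradient step on (1/2)||y - Ab||^2
    b = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)   # soft thresholding

print(b)                                 # second coefficient is (numerically) 0
print(1 + x_true @ z - lam)              # matches b[0]
```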

SLIDE 67

Lasso coefficients

[Figure: lasso coefficients (0 to 1) as a function of the regularization parameter (0 to 0.2)]

SLIDE 68

Analyzing the lasso

How do we prove this? There is no closed-form solution!

Show that there is a horizontal supporting hyperplane of the cost function at β_lasso

Equivalently, show that zero is a subgradient of the lasso cost function at β_lasso

SLIDE 69

Subgradients of lasso cost function

Gradient of $\frac{1}{2}\left\|X^T\beta - y\right\|_2^2$ at β_lasso:

$$X \left( X^T \beta_{\text{lasso}} - y \right)$$

Subgradient of the ℓ1 norm at β_lasso, if only the first entry is nonzero and positive:

$$g_{\ell_1} := \begin{bmatrix} 1 \\ \gamma \end{bmatrix}, \qquad |\gamma| \leq 1$$

Subgradient of the lasso cost function at such a β_lasso:

$$g_{\text{lasso}} := X \left( X^T \beta_{\text{lasso}} - y \right) + \lambda \begin{bmatrix} 1 \\ \gamma \end{bmatrix}, \qquad |\gamma| \leq 1$$
SLIDE 70

Subgradients of lasso cost function

Using ‖x_true‖₂ = 1 and ρ := x_true^T x_other,

$$\begin{aligned}
g_{\text{lasso}} &:= X \left( X^T \beta_{\text{lasso}} - y \right) + \lambda \begin{bmatrix} 1 \\ \gamma \end{bmatrix} \\
&= X \left( \beta_{\text{lasso}}[1]\, x_{\text{true}} - x_{\text{true}} - z \right) + \lambda \begin{bmatrix} 1 \\ \gamma \end{bmatrix} \\
&= \begin{bmatrix} x_{\text{true}}^T \left( (\beta_{\text{lasso}}[1] - 1)\, x_{\text{true}} - z \right) + \lambda \\ x_{\text{other}}^T \left( (\beta_{\text{lasso}}[1] - 1)\, x_{\text{true}} - z \right) + \lambda \gamma \end{bmatrix} \\
&= \begin{bmatrix} \beta_{\text{lasso}}[1] - 1 - x_{\text{true}}^T z + \lambda \\ \rho\,(\beta_{\text{lasso}}[1] - 1) - x_{\text{other}}^T z + \lambda \gamma \end{bmatrix}
\end{aligned}$$

SLIDE 71

Is zero a valid subgradient?

Setting g_lasso = 0 yields

$$\beta_{\text{lasso}}[1] = 1 - \lambda + x_{\text{true}}^T z, \qquad \gamma = \frac{x_{\text{other}}^T z - \rho\, x_{\text{true}}^T z}{\lambda} + \rho$$

We need β_lasso[1] ≥ 0, which holds when

$$\lambda \leq 1 + x_{\text{true}}^T z$$

We need |γ| ≤ 1, which holds when

$$\frac{\left| x_{\text{other}}^T z - \rho\, x_{\text{true}}^T z \right|}{1 - |\rho|} \leq \lambda$$