SLIDE 1

NIPS03 Workshop: l1-norm regularization

Ji Zhu (Michigan), Saharon Rosset (IBM T. J. Watson), Trevor Hastie (Stanford), Rob Tibshirani (Stanford)

SLIDE 2

Agenda

• Regularized optimization problems
• The l1-norm penalty
• Motivations:
  – Statistical advantage: sparsity (feature selection)
  – Computational advantage: piecewise linear solution paths

SLIDE 3

Prediction problems

• Training data $(x_1, y_1), \ldots, (x_n, y_n)$
• Input $x_i \in \mathbb{R}^p$
• Output $y_i$
  – Regression: $y_i \in \mathbb{R}$
  – Two-class classification: $y_i \in \{1, -1\}$
• High-dimensional data modeling: $p \gg n$
• Wish to find a prediction model for future data: $x \in \mathbb{R}^p \rightarrow y \in \mathbb{R}$ or $\{1, -1\}$
SLIDE 4

The regularized optimization problem

$$\hat{\beta}(\lambda) = \arg\min_{\beta}\; L(y, X\beta) + \lambda J(\beta)$$

where

• $L(\cdot\,,\cdot): \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}$ is a convex, non-negative loss functional.
• $J: \mathbb{R}^p \rightarrow \mathbb{R}$ is a convex, non-negative penalty functional. Almost exclusively we use $J(\beta) = \|\beta\|_q^q$, $q \geq 1$.
• $\lambda \geq 0$ is a tuning parameter:
  – As $\lambda \rightarrow 0$, we get non-regularized models.
  – As $\lambda \rightarrow \infty$, $\hat{\beta}(\lambda) \rightarrow 0$.
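To make the generic formulation above concrete, here is a minimal NumPy/SciPy sketch (not from the talk) that minimizes $L(y, X\beta) + \lambda J(\beta)$ for one smooth choice of the pair: squared-error loss and the $q = 2$ penalty. The toy data and variable names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def fit_regularized(X, y, lam):
    """Minimize L(y, X beta) + lam * J(beta) with squared-error loss
    and J(beta) = ||beta||_2^2 (both smooth, so a gradient method applies)."""
    p = X.shape[1]

    def objective(beta):
        resid = y - X @ beta
        return resid @ resid + lam * (beta @ beta)

    def grad(beta):
        return -2 * X.T @ (y - X @ beta) + 2 * lam * beta

    return minimize(objective, np.zeros(p), jac=grad, method="L-BFGS-B").x

# As lam -> 0 we approach the unregularized fit; as lam grows, beta shrinks toward 0.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X[:, 0] + 0.1 * rng.standard_normal(50)
for lam in [0.0, 1.0, 1000.0]:
    print(lam, np.round(fit_regularized(X, y, lam)[:3], 3))
```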
SLIDE 5

Examples

• Traditional methods
  – Ridge regression: $L(y, X\beta) = \sum_i (y_i - x_i^T\beta)^2$, $J(\beta) = \|\beta\|_2^2$
  – Penalized logistic regression: $L(y, X\beta) = \sum_i \log\!\left(1 + e^{-y_i x_i^T\beta}\right)$, $J(\beta) = \|\beta\|_q^q$
• Modern methods
  – Support vector machines: $L(y, X\beta) = \sum_i \left(1 - y_i x_i^T\beta\right)_+$, $J(\beta) = \|\beta\|_2^2$
  – AdaBoost (approximately): $L(y, X\beta) = \sum_i e^{-y_i x_i^T\beta}$, $J(\beta) = \|\beta\|_1$

See Rosset, Zhu and Hastie 2003.

SLIDE 6

The l1-norm penalty

A canonical example: the lasso (Tibshirani 1996; Efron, Hastie, Johnstone and Tibshirani 2002)

$$\hat{\beta}(\lambda) = \arg\min_{\beta}\; \sum_i (y_i - x_i^T\beta)^2 + \lambda\|\beta\|_1$$

Two properties:
– Sparse solution (feature selection)
– Piecewise linear coefficient paths
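As a quick illustration (not part of the original slides), the lasso can be fit with scikit-learn; note that `sklearn.linear_model.Lasso` minimizes $\frac{1}{2n}\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$, so its `alpha` plays the role of $\lambda$ up to the $1/(2n)$ scaling of the loss.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]          # only 3 non-zero true coefficients
y = X @ beta_true + rng.standard_normal(n)

# Larger alpha (i.e. larger lambda) -> sparser solution.
for alpha in [0.01, 0.1, 0.5]:
    beta_hat = Lasso(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:4}: {np.sum(beta_hat != 0)} non-zero coefficients")
```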

SLIDE 7

[Figure: lasso coefficient paths, $\hat{\beta}_j$ plotted against $\|\hat{\beta}(\lambda)\|_1$.]
SLIDE 8

Sparsity

• The l1-norm penalty causes some of the coefficients $\hat{\beta}_j$ to be exactly zero.
• The l1-norm penalty allows continuous feature selection as $\lambda$ changes.

[Figure: coefficient paths $\hat{\beta}_j$ for the 1-norm SVM (plotted against $\|\hat{\beta}\|_1$) and the 2-norm SVM (plotted against $\|\hat{\beta}\|_2^2$).]
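A rough way to reproduce the comparison in the figure (an illustration, not the talk's code): scikit-learn's `LinearSVC` supports both an l1 and an l2 penalty with the squared hinge loss, and only the l1-penalized fit drives coefficients exactly to zero.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n, p = 200, 100
X = rng.standard_normal((n, p))
y = np.sign(X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(n))  # only 2 relevant features

# Squared hinge loss with an l1 vs l2 penalty; C is the inverse of lambda.
svm_l1 = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.1, max_iter=10000).fit(X, y)
svm_l2 = LinearSVC(penalty="l2", loss="squared_hinge", dual=False, C=0.1, max_iter=10000).fit(X, y)

print("1-norm SVM non-zero coefficients:", np.sum(svm_l1.coef_ != 0))
print("2-norm SVM non-zero coefficients:", np.sum(svm_l2.coef_ != 0))
```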
SLIDE 9

Existence and uniqueness of the sparse solution

$$\hat{\beta}(\lambda) = \arg\min_{\beta}\; L(y, X\beta) + \lambda\|\beta\|_1$$

• There exists a solution which has at most $n$ non-zero coefficients.
• Under mild conditions, the sparse solution is unique.

Rosset, Zhu and Hastie 2003
SLIDE 10

Bet on sparsity

• $p \gg n$
• Sparse scenario: only a small number of the true coefficients $\beta_j$ are non-zero.
• In the sparse scenario, the l1-norm penalty works better than the l2-norm penalty.
• In the non-sparse scenario, neither the l1-norm penalty nor the l2-norm penalty works well.

Friedman, Hastie, Rosset, Tibshirani and Zhu 2003
SLIDE 11

Bet on sparsity simulation

• Regression: $n = 50$, $p = 300$
• Sparse scenario: $\beta_j \sim N(0, 1)$, $j = 1, \ldots, 10$ or $30$; the other $\beta_j = 0$
• Non-sparse scenario: $\beta_j \sim N(0, 1)$, $j = 1, \ldots, 300$

[Figure: percentage variance explained versus noise-to-signal ratio, in six panels: Lasso/Gaussian, Ridge/Gaussian, Lasso/Subset 10, Ridge/Subset 10, Lasso/Subset 30, Ridge/Subset 30.]
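A minimal sketch of this kind of simulation (my reconstruction of the setup, not the authors' code; the fixed noise level and the cross-validated choice of the tuning parameters are assumptions):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(0)
n, p, n_test = 50, 300, 1000

for k in [10, 30, 300]:                       # sparse (10 or 30 non-zero) vs non-sparse (300)
    beta = np.zeros(p)
    beta[:k] = rng.normal(size=k)             # beta_j ~ N(0, 1), other beta_j = 0
    X = rng.standard_normal((n, p))
    X_test = rng.standard_normal((n_test, p))
    signal, signal_test = X @ beta, X_test @ beta
    y = signal + np.std(signal) * rng.standard_normal(n)   # noise-to-signal ratio ~ 1 (assumed)

    lasso = LassoCV(cv=5).fit(X, y)
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 20)).fit(X, y)
    for name, model in [("lasso", lasso), ("ridge", ridge)]:
        pve = 1 - np.mean((model.predict(X_test) - signal_test) ** 2) / np.var(signal_test)
        print(f"k={k:3d} {name}: percentage variance explained = {pve:.2f}")
```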

SLIDE 12

Computational advantage of the l1-norm penalty

• When $L$ is piecewise quadratic as a function of $\beta$, the solution path $\hat{\beta}(\lambda)$ is piecewise linear as a function of $\lambda$.
• Consequences:
  – Efficient algorithm to compute the exact whole solution path $\{\hat{\beta}(\lambda) : 0 \leq \lambda \leq \infty\}$
  – Facilitates the selection of the tuning parameter $\lambda$

[Figure: piecewise linear lasso coefficient paths, $\hat{\beta}_j$ plotted against $\|\hat{\beta}(\lambda)\|_1$.]
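One concrete way to compute the exact piecewise-linear lasso path is scikit-learn's LARS implementation (shown here as an illustration; it is not the algorithm code from the talk):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.standard_normal(n)

# lars_path returns the breakpoints of the piecewise-linear path:
# between consecutive alphas, the coefficients move along straight lines.
alphas, active, coefs = lars_path(X, y, method="lasso")
print("number of breakpoints (joints):", len(alphas))
print("coefficients at the least-regularized end:", np.round(coefs[:3, -1], 2))
```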
SLIDE 13

Examples

$$L(y, X\beta) = \sum_i \ell(y_i,\, x_i^T\beta)$$

Regression (residual $r = y - x^T\beta$):

– Squared error loss: $\ell(r) = r^2$
– Huber's loss with a fixed knot $\delta$ (more robust):
  $$\ell(r) = \begin{cases} r^2 & \text{if } |r| \leq \delta \\ 2\delta|r| - \delta^2 & \text{otherwise} \end{cases}$$
– Absolute value loss: $\ell(r) = |r|$ (non-differentiable at $r = 0$)
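The three regression losses translate directly to NumPy; the following is a small sketch of the formulas above, with `delta` standing for the fixed knot $\delta$:

```python
import numpy as np

def squared_error(r):
    return r ** 2

def huber(r, delta=1.0):
    """Huber's loss: quadratic for |r| <= delta, linear beyond (more robust)."""
    r = np.abs(r)
    return np.where(r <= delta, r ** 2, 2 * delta * r - delta ** 2)

def absolute_value(r):
    return np.abs(r)

r = np.linspace(-3, 3, 7)
print(squared_error(r))
print(huber(r, delta=1.0))     # grows only linearly for large residuals
print(absolute_value(r))
```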
SLIDE 14

Classification (margin $r = y\, x^T\beta$):

– Squared hinge loss: $\ell(r) = (1 - r)_+^2$
– Huberized squared hinge loss (more robust):
  $$\ell(r) = \begin{cases} (1-\delta)^2 + 2(1-\delta)(\delta - r) & \text{if } r \leq \delta \\ (1-r)^2 & \text{if } \delta < r \leq 1 \\ 0 & \text{otherwise} \end{cases}$$
– Hinge loss: $\ell(r) = (1 - r)_+$ (non-differentiable at $r = 1$)
SLIDE 15

Illustration: regression

• $n = 100$, $p = 80$. All $x_{ij}$ are i.i.d. $N(0, 1)$ and the true model is: $y_i = 10 \cdot x_{i1} + \epsilon_i$
• $\epsilon_i \overset{\text{iid}}{\sim} 0.9 \cdot N(0, 1) + 0.1 \cdot N(0, 100)$
• Compare Huber's loss and squared error loss + l1-norm penalty
SLIDE 16

[Figure: coefficient paths $\hat{\beta}_j$ plotted against $\|\hat{\beta}(\lambda)\|_1$, for squared-error loss (LASSO) and for the Huberized loss.]
SLIDE 17

Dexter demonstration

[Figure: Dexter validation error and number of non-zero coefficients, plotted against $\|\hat{\beta}\|_1$.]
SLIDE 18

Computational cost

• Efficient algorithms are available to compute the exact whole solution path $\{\hat{\beta}(\lambda) : 0 \leq \lambda \leq \infty\}$. See Rosset and Zhu 2003.
• Approximate estimate of the computational cost:
  – Assume the number of joints (breakpoints of the piecewise linear path) is $O(n + p)$
  – Total cost is $O(n^2 p)$
  – Linear in $p$ even when $p \gg n$
SLIDE 19

Summary

• What are good (loss $L$, penalty $J$) pairs? How should we determine the value of the tuning parameter $\lambda$?
• We use statistical motivations of robustness and sparsity to select interesting (loss $L$, penalty $J$) pairs.
• The resulting methods are adaptable (because we can choose an optimal tuning parameter), efficient (because we can generate the whole regularized path efficiently), and robust (because we use robust loss functions).