Low ℓ1 Norm and Guarantees on Sparsifiability
Shai Shalev-Shwartz & Nathan Srebro
Toyota Technological Institute--Chicago
ICML/COLT/UAI workshop, July 2008
Motivation

Problem I:
    w₀ = argmin_w E[L(⟨w, x⟩, y)]   s.t.   ‖w‖₀ ≤ S

Problem II:
    w₁ = argmin_w E[L(⟨w, x⟩, y)]   s.t.   ‖w‖₁ ≤ B

Under strict assumptions on the data distribution (e.g. the features are not correlated), w₁ is also sparse.
But what if w₁ is not sparse?
Sparsification

Sparsification procedure: given a predictor w with ‖w‖₁ = B, produce a predictor w̃ with ‖w̃‖₀ = S.

Constraint: E[L(⟨w̃, x⟩, y)] ≤ E[L(⟨w, x⟩, y)] + ε
Goal: the minimal S that satisfies the constraint
Question: how does S depend on B and ε?
Main Result

Theorem: For any predictor w, λ-Lipschitz loss function L, distribution D over X × Y, and desired accuracy ε, there exists w̃ such that
    E[L(⟨w̃, x⟩, y)] ≤ E[L(⟨w, x⟩, y)] + ε   and   ‖w̃‖₀ = O(λ² ‖w‖₁² / ε²)

Tightness: there exist a data distribution, a loss function, and a dense predictor w with loss ℓ such that Ω(‖w‖₁² / ε²) features are needed to achieve loss ℓ + ε.

Sparsifying by taking the largest weights or by following the ℓ1 regularization path might fail.
A low ℓ2 norm predictor does not guarantee a sparse predictor.
Main Result (cont.)

[Diagram] Distribution D + loss L → (convex opt.) → low ℓ1 predictor w → (randomized sparsification / forward selection procedure) → sparse predictor w̃
Randomized Sparsification Procedure

[Figure: the normalized weights |w₁|/Z, …, |wₙ|/Z (with Z = ‖w‖₁) shown next to the normalized counts |w̃₁|/Z′, …, |w̃ₙ|/Z′]

Sparsification procedure (sketched in code below):
For j = 1, …, S:
  - sample a coordinate i from the distribution P, where Pᵢ ∝ |wᵢ|
  - add: |w̃ᵢ| ← |w̃ᵢ| + 1

Guarantee
Assume: X = {x : ‖x‖∞ ≤ 1}, Y is an arbitrary set, D is an arbitrary distribution over X × Y, and the loss L : ℝ × Y → ℝ is λ-Lipschitz w.r.t. its 1st argument.
If S ≥ Ω(λ² ‖w‖₁² log(1/δ) / ε²), then with probability at least 1 − δ,
    E[L(⟨w̃, x⟩, y)] − E[L(⟨w, x⟩, y)] ≤ ε
- Requires access to w
- Does not require access to D
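The following is a minimal Python/NumPy sketch of this randomized sparsification step. The slides only specify the sampling and counting loop; keeping the sign of each sampled coordinate and rescaling the counts by ‖w‖₁/S (so that w̃ equals w in expectation, matching the normalized figure) are assumptions of this sketch.

```python
import numpy as np

def randomized_sparsify(w, S, rng=None):
    """Sample S coordinates with probability proportional to |w_i| and
    build a sparse predictor from the empirical counts.

    Sign handling and the final rescaling by ||w||_1 / S (which makes
    E[w_tilde] = w) are assumptions; the slides only give the sampling step.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(w, dtype=float)
    Z = np.abs(w).sum()                       # Z = ||w||_1
    p = np.abs(w) / Z                         # P_i proportional to |w_i|
    counts = np.zeros_like(w)
    idx = rng.choice(len(w), size=S, p=p)     # S independent draws
    np.add.at(counts, idx, 1.0)               # |w_tilde_i| <- |w_tilde_i| + 1
    return np.sign(w) * counts * (Z / S)      # at most S nonzero coordinates

# Example: reduce a dense predictor to at most 3 nonzeros
w_tilde = randomized_sparsify(np.array([0.5, -0.3, 0.1, 0.05, -0.05]), S=3)
```

Because each draw picks coordinate i with probability proportional to |wᵢ|, the number of draws S needed to preserve the loss scales with ‖w‖₁² rather than with the dimension, which is exactly the dependence in the guarantee above.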
Tightness

Data distribution (spread the 'information' about the label among all features):
  - Y ∈ {±1} with P(Y = +1) = P(Y = −1) = 1/2
  - Xᵢ ∈ {±1} with P(Xᵢ = y | Y = y) = (1 + 1/B) / 2

[Figure: Y with arrows to X₁, …, Xₙ]
(a small sampler for this construction is sketched below)
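To make the construction concrete, here is a small Python sampler for this distribution, assuming the features take values in {±1}; the function name and the Monte-Carlo check at the end are illustrative and not part of the slides.

```python
import numpy as np

def sample_tightness_data(n, B, m, rng=None):
    """Draw m examples: y is +/-1 with probability 1/2 each, and each of
    the n features equals y with probability (1 + 1/B)/2 and -y otherwise."""
    rng = np.random.default_rng() if rng is None else rng
    y = rng.choice([-1.0, 1.0], size=m)
    agree = rng.random((m, n)) < (1 + 1.0 / B) / 2
    x = np.where(agree, y[:, None], -y[:, None])
    return x, y

# The dense predictor w_i = B/n has ||w||_1 = B and small absolute loss:
n, B, m = 10_000, 10.0, 2_000
x, y = sample_tightness_data(n, B, m)
w = np.full(n, B / n)
print(np.mean(np.abs(x @ w - y)))   # at most about B / sqrt(n) = 0.1
```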
Tightness (cont.)

Dense predictor: wᵢ = B/n, and thus ‖w‖₁ = B and
    E[|⟨w, x⟩ − y|] ≤ B/√n
(this bound is worked out below)

Sparse predictor: any u with E[|⟨u, x⟩ − y|] ≤ ε must satisfy
    ‖u‖₀ = Ω(B² / ε²)

The proof uses a generalization of the Khintchine inequality: if x = (x₁, …, xₙ) are independent random variables with P[xₖ = 1] ∈ (5%, 95%) and Q is a degree-d polynomial, then
    E[|Q(x)|] ≥ (0.2)^d (E[Q(x)²])^{1/2}

[Figure: Y with arrows to X₁, …, Xₙ; P(Y = ±1) = 1/2, P(Xᵢ = y | y) = (1 + 1/B) / 2]
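For completeness, here is a short derivation of the dense-predictor bound E[|⟨w, x⟩ − y|] ≤ B/√n. It is a standard second-moment argument (independence of the features given y, plus Jensen's inequality) and is not spelled out on the slides.

```latex
% Dense predictor w_i = B/n on the tightness distribution.
% Given y, the features are independent with E[x_i | y] = y/B and Var(x_i | y) <= 1.
\begin{align*}
E[\langle w, x\rangle \mid y]
  &= \sum_{i=1}^{n} \frac{B}{n}\cdot\frac{y}{B} = y,\\
E\bigl[(\langle w, x\rangle - y)^2 \mid y\bigr]
  &= \sum_{i=1}^{n} \frac{B^2}{n^2}\,\mathrm{Var}(x_i \mid y) \;\le\; \frac{B^2}{n},\\
E\bigl[\,\lvert\langle w, x\rangle - y\rvert\,\bigr]
  &\le \sqrt{E\bigl[(\langle w, x\rangle - y)^2\bigr]} \;\le\; \frac{B}{\sqrt{n}}.
\end{align*}
```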
Low ℓ2 norm does not guarantee sparsifiability

Same data distribution as before, with B = ε√n.

Dense predictor: wᵢ = B/n, so
    E[|⟨w, x⟩ − y|] ≤ B/√n = ε   and   ‖w‖₂ = B/√n = ε

Sparse predictor: any u with E[|⟨u, x⟩ − y|] ≤ 2ε must use almost all features:
    ‖u‖₀ = Ω(B² / ε²) = Ω(n)

ℓ1 captures sparsity, but ℓ2 doesn't!
Sparsifying by zeroing small weights fails

Data distribution:
  - Y ∈ {±1} with P(Y = ±1) = 1/2
  - Z₁, …, Zₛ are noisy copies of Y, with P(Z₁ = y | y) = (1 + 2/3)/2, …, P(Zₛ = y | y) = (1 + 1/3)/2
  - X₁, …, X_{sn} are noisy copies of the Z's, with P(Xⱼ = z_{⌈j/s⌉} | z_{⌈j/s⌉}) = 7/8

[Figure: Y → Z₁, …, Zₛ → X₁, …, X_{sn}; the features copying the more informative Z's receive the larger weights]

Keeping only the largest weights therefore retains redundant copies of a few informative Z's and discards the rest; the initial weights on the regularization path also fail on this example.
Intermediate Summary

We answer a fundamental question: how much sparsity does a low ℓ1 norm guarantee?
    ‖w̃‖₀ ≤ O(‖w‖₁² / ε²), and this is tight.
It is achievable by a simple randomized procedure.
Coming next: a direct approach also works!

[Diagram] Distribution D + loss L → (convex opt.) → low ℓ1 predictor w → (randomized sparsification / forward selection procedure) → sparse predictor w̃
Greedy Forward Selection

Step 1: Define a slightly modified loss function
    L̃(v, y) = min_u [ (λ²/ε)(u − v)² + L(u, y) ]
Using infimal convolution theory, it can be shown that L̃ has a Lipschitz-continuous derivative and that for all v, y:
    |L(v, y) − L̃(v, y)| ≤ ε/4

Step 2: Apply forward greedy selection on L̃ (sketched in code below):
  - Initialize w^1 = 0
  - Choose a feature using the largest element of the gradient
  - Choose a step size η_t (a closed-form solution exists)
  - Update w^(t+1) = (1 − η_t) w^t + η_t B e_{j_t}
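Below is a minimal sketch of this forward greedy (Frank-Wolfe style) loop over the ℓ1 ball of radius B. The gradient-based feature choice and the update rule follow the slide; the fixed step-size schedule η_t = 2/(t + 2) is used in place of the closed-form line search mentioned there, and the sign on the chosen vertex is made explicit. The helper grad_loss (the derivative of the smoothed loss with respect to the margins) is an assumed interface.

```python
import numpy as np

def forward_greedy_selection(X, y, B, grad_loss, T):
    """Forward greedy selection over the l1 ball of radius B.

    X: (m, n) data matrix, y: (m,) labels.
    grad_loss(v, y): derivative of the smoothed loss w.r.t. the margins v = X @ w.
    Each iteration touches at most one new feature, so ||w||_0 <= T.
    """
    m, n = X.shape
    w = np.zeros(n)                              # w^1 = 0
    for t in range(T):
        g = X.T @ grad_loss(X @ w, y) / m        # gradient of the empirical risk
        j = int(np.argmax(np.abs(g)))            # feature with the largest gradient entry
        eta = 2.0 / (t + 2)                      # simple schedule instead of line search
        w *= (1.0 - eta)                         # w^{t+1} = (1 - eta) w^t + eta * (+/- B) e_j
        w[j] += eta * (-B * np.sign(g[j]))       # move toward the best vertex of the l1 ball
    return w
```

A concrete grad_loss for the smoothed hinge of the next slide is sketched after that slide.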
Greedy Forward Selection (cont.)

Example -- hinge loss: L(v, y) = max{0, 1 − v}. The smoothed loss is

    L̃(v, y) = 0                    if v > 1
    L̃(v, y) = (1/ε)(v − 1)²        if v ∈ [1 − ε/2, 1]
    L̃(v, y) = (1 − ε/4) − v        otherwise

(a code sketch of this smoothed hinge follows below)
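A short sketch of this smoothed hinge and its derivative, usable as the grad_loss argument of the forward-selection sketch above. The piecewise form follows the slide; the vectorized implementation and the (unused) y argument are conveniences of this sketch.

```python
import numpy as np

def smoothed_hinge(v, eps):
    """Smoothed hinge: 0 for v >= 1, (v-1)^2/eps on [1 - eps/2, 1],
    and (1 - eps/4) - v below that; within eps/4 of max{0, 1 - v} everywhere."""
    v = np.asarray(v, dtype=float)
    return np.where(v >= 1.0, 0.0,
                    np.where(v >= 1.0 - eps / 2, (v - 1.0) ** 2 / eps,
                             (1.0 - eps / 4) - v))

def smoothed_hinge_grad(v, y, eps):
    """Derivative of the smoothed hinge w.r.t. the margin v; y is unused
    because the slide's loss already takes the margin as its first argument."""
    v = np.asarray(v, dtype=float)
    return np.where(v >= 1.0, 0.0,
                    np.where(v >= 1.0 - eps / 2, 2.0 * (v - 1.0) / eps, -1.0))

# Hypothetical usage with the forward-selection sketch above:
# w = forward_greedy_selection(X, y, B=10.0,
#                              grad_loss=lambda v, y: smoothed_hinge_grad(v, y, eps=0.1),
#                              T=200)
```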
Guarantees

Theorem: Let X = {x : ‖x‖∞ ≤ 1}, let Y be an arbitrary set, let D be an arbitrary distribution over X × Y, and let the loss L : ℝ × Y → ℝ be proper, convex, and λ-Lipschitz w.r.t. its 1st argument. Forward greedy selection on L̃ finds w̃ with
    ‖w̃‖₀ = O(λ² B² / ε²)
such that for any w with ‖w‖₁ ≤ B:
    E[L(⟨w̃, x⟩, y)] − E[L(⟨w, x⟩, y)] ≤ ε
Related Work

ℓ1 norm and sparsity: Donoho provides sufficient conditions under which the minimizer of the ℓ1 norm is also sparse. But what if these conditions are not met?

Compressed sensing: the ℓ1 norm recovers a sparse predictor, but only under severe assumptions on the design matrix (in our case, the training examples).

Converse question: does small ‖w̃‖₀ imply small ‖w‖₁? Servedio gives a partial answer for linear classification; Wainwright gives a partial answer for the Lasso.

Sparsification: a randomized sparsification procedure was previously proposed by Schapire et al.; however, their bound depends on the training set size. Lee, Bartlett, and Williamson addressed a similar question for the special case of the squared-error loss. Zhang presented a forward greedy procedure for twice-differentiable losses.
Summary

[Diagram] Distribution D + loss L → (convex opt.) → low ℓ1 predictor w → (randomized sparsification / forward selection) → sparse predictor w̃ with
    ‖w̃‖₀ ≤ O(‖w‖₁² / ε²), and this is tight.
Taking the largest weights or following the regularization path may fail.