
Low norm and ℓ1 guarantees on Sparsifiability - PowerPoint PPT Presentation



1. Low norm and ℓ1 guarantees on Sparsifiability. Shai Shalev-Shwartz & Nathan Srebro, Toyota Technological Institute--Chicago. ICML/COLT/UAI workshop, July 2008.

2. Motivation. Problem I: w_0 = argmin_w E[L(⟨w, x⟩, y)] s.t. ‖w‖_0 ≤ S.

3. Motivation. Problem I: w_0 = argmin_w E[L(⟨w, x⟩, y)] s.t. ‖w‖_0 ≤ S. Problem II: w_1 = argmin_w E[L(⟨w, x⟩, y)] s.t. ‖w‖_1 ≤ B.

4. Motivation. Problem I: w_0 = argmin_w E[L(⟨w, x⟩, y)] s.t. ‖w‖_0 ≤ S. Problem II: w_1 = argmin_w E[L(⟨w, x⟩, y)] s.t. ‖w‖_1 ≤ B. Strict assumptions on the data distribution ⇒ w_1 is also sparse. But what if w_1 is not sparse?

5. Motivation. Problem I: w_0 = argmin_w E[L(⟨w, x⟩, y)] s.t. ‖w‖_0 ≤ S. Problem II: w_1 = argmin_w E[L(⟨w, x⟩, y)] s.t. ‖w‖_1 ≤ B. Strict assumptions on the data distribution (e.g., features are not correlated) ⇒ w_1 is also sparse. But what if w_1 is not sparse?

6. Sparsification. Predictor w with ‖w‖_1 = B → sparsification procedure → predictor w̃ with ‖w̃‖_0 = S.

7. Sparsification. Predictor w with ‖w‖_1 = B → sparsification procedure → predictor w̃ with ‖w̃‖_0 = S. Constraint: E[L(⟨w̃, x⟩, y)] ≤ E[L(⟨w, x⟩, y)] + ε. Goal: the minimal S that satisfies the constraint. Question: how does S depend on B and ε?

8. Main Result.
Theorem: for any predictor w, λ-Lipschitz loss function L, distribution D over X × Y, and desired accuracy ε, there exists w̃ such that E[L(⟨w̃, x⟩, y)] ≤ E[L(⟨w, x⟩, y)] + ε and ‖w̃‖_0 = O((λ‖w‖_1/ε)²).
Tightness: there are a data distribution, a loss function, and a dense predictor w with loss l for which Ω(‖w‖_1²/ε²) features are needed to reach loss l + ε.
Sparsifying by taking the largest weights, or by following the ℓ1 regularization path, might fail. A low ℓ2 norm does not imply a sparse predictor.

9. Main Result (cont.). [Diagram] Distribution D and loss L.

10. Main Result (cont.). [Diagram] Distribution D and loss L → convex opt. → low-ℓ1 predictor w.

11. Main Result (cont.). [Diagram] Distribution D and loss L → convex opt. → low-ℓ1 predictor w → randomized sparsification → sparse predictor w̃.

12. Main Result (cont.). [Diagram] Distribution D and loss L → convex opt. → low-ℓ1 predictor w → randomized sparsification → sparse predictor w̃; a forward selection procedure goes directly from D and L to the sparse predictor.

13. Randomized Sparsification Procedure. [Figure: weights |w_1|, …, |w_n| with normalizer Z, mapped to sparsified weights |w̃_1|, …, |w̃_n| with normalizer Z′.] Procedure: for j = 1, …, S, sample an index i from the distribution P_i ∝ |w_i| and update |w̃_i| ← |w̃_i| + 1.

15. Randomized Sparsification Procedure.
Procedure: for j = 1, …, S, sample an index i from the distribution P_i ∝ |w_i| and update |w̃_i| ← |w̃_i| + 1.
Guarantee: assume X = {x : ‖x‖_∞ ≤ 1}, Y is an arbitrary set, D is an arbitrary distribution over X × Y, and the loss L : ℝ × Y → ℝ is λ-Lipschitz w.r.t. its first argument. If S ≥ Ω(λ²‖w‖_1² log(1/δ)/ε²), then with probability at least 1 − δ, E[L(⟨w̃, x⟩, y)] − E[L(⟨w, x⟩, y)] ≤ ε.
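In code, the sampled-counts procedure is only a few lines. The sketch below is a minimal Python implementation under one assumption of mine: the counts are rescaled by ‖w‖_1/S so that w̃ is an unbiased estimate of w (the slide only shows unnormalized counts with normalizers Z and Z′).

```python
import numpy as np

def randomized_sparsify(w, S, seed=None):
    """Return a predictor w~ with at most S nonzeros by sampling S coordinates
    of w with probability proportional to |w_i| (a sketch of the procedure)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=float)
    B = np.abs(w).sum()                          # ||w||_1
    p = np.abs(w) / B                            # P_i proportional to |w_i|
    idx = rng.choice(len(w), size=S, p=p)        # S i.i.d. index draws
    counts = np.bincount(idx, minlength=len(w))  # how often each coordinate was picked
    return np.sign(w) * counts * (B / S)         # rescaled so that E[w~] = w
```

With S on the order of λ²‖w‖_1² log(1/δ)/ε², the guarantee above says that the excess loss of w̃ is at most ε with probability at least 1 − δ.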

16. Randomized Sparsification Procedure. [Diagram] Distribution D and loss L → convex opt. → low-ℓ1 predictor w → randomized sparsification → sparse predictor w̃. The randomized sparsification step requires access to w but does not require access to D.

17. Tightness. Data distribution: spread the 'information' about the label among all the features. [Figure: label Y with P(Y = ±1) = 1/2; features X_1, …, X_n with P(X_i = y | y) = (1 + 1/B)/2.]

18. Tightness (cont.). [Figure: label Y with P(Y = ±1) = 1/2; features X_1, …, X_n with P(X_i = y | y) = (1 + 1/B)/2.]
Dense predictor: w_i = B/n, so ‖w‖_1 = B and E[|⟨w, x⟩ − y|] ≤ B/√n.
Sparse predictor: any u with E[|⟨u, x⟩ − y|] ≤ ε must satisfy ‖u‖_0 = Ω(B²/ε²).

19. Tightness (cont.). [Figure: label Y with P(Y = ±1) = 1/2; features X_1, …, X_n with P(X_i = y | y) = (1 + 1/B)/2.]
Dense predictor: w_i = B/n, so ‖w‖_1 = B and E[|⟨w, x⟩ − y|] ≤ B/√n.
Sparse predictor: any u with E[|⟨u, x⟩ − y|] ≤ ε must satisfy ‖u‖_0 = Ω(B²/ε²).
The proof uses a generalization of the Khintchine inequality: if x = (x_1, …, x_n) are independent random variables with P[x_k = 1] ∈ (5%, 95%) and Q is a degree-d polynomial, then E[|Q(x)|] ≥ (0.2)^d · E[|Q(x)|²]^{1/2}.
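The dense-predictor bound E[|⟨w, x⟩ − y|] ≤ B/√n stated above follows from a short variance computation; the derivation below is a sketch that assumes the features are {±1}-valued and conditionally independent given y (which is how the construction spreads the label information).

```latex
% For w_i = B/n on the construction above:
\[
  \Pr[X_i = y \mid y] = \tfrac{1 + 1/B}{2}
  \;\Longrightarrow\;
  \mathbb{E}[X_i \mid y] = \tfrac{y}{B},\qquad
  \operatorname{Var}(X_i \mid y) = 1 - \tfrac{1}{B^2} \le 1,
\]
\[
  \mathbb{E}[\langle w, x\rangle \mid y]
     = \tfrac{B}{n}\sum_{i=1}^{n}\tfrac{y}{B} = y,\qquad
  \operatorname{Var}(\langle w, x\rangle \mid y)
     = \tfrac{B^2}{n^2}\sum_{i=1}^{n}\Bigl(1 - \tfrac{1}{B^2}\Bigr) \le \tfrac{B^2}{n},
\]
\[
  \mathbb{E}\bigl[\,|\langle w, x\rangle - y|\,\bigr]
     \le \sqrt{\mathbb{E}\bigl[(\langle w, x\rangle - y)^2\bigr]}
     \le \frac{B}{\sqrt{n}} .
\]
```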

20. Low ℓ2 norm does not guarantee sparsifiability. Same data distribution as before, with B = ε√n.
Dense predictor: w_i = B/n, so E[|⟨w, x⟩ − y|] ≤ B/√n = ε and ‖w‖_2 = B/√n = ε.
Sparse predictor: any u with E[|⟨u, x⟩ − y|] ≤ 2ε must use almost all of the features: ‖u‖_0 = Ω(B²/ε²) = Ω(n).
ℓ1 captures sparsity but ℓ2 doesn't!
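A quick numerical check of this counterexample is below. It is only a sketch: the values of n, ε, and the sample size are arbitrary choices of mine, and it tests only uniform-weight predictors supported on the first k features, so it illustrates (rather than proves) that a constant fraction of the n features is needed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, m = 2_500, 0.1, 4_000
B = eps * np.sqrt(n)                            # B = eps * sqrt(n) = 5 here

y = rng.choice([-1.0, 1.0], size=m)
agree = rng.random((m, n)) < (1 + 1 / B) / 2    # P(X_i = y | y)
X = np.where(agree, y[:, None], -y[:, None])    # features in {-1, +1}

for k in (25, 250, 625, n):
    u = np.zeros(n)
    u[:k] = B / k                               # same l1 budget spread over k features
    loss = np.abs(X @ u - y).mean()             # empirical E|<u, x> - y|, roughly B/sqrt(k)
    print(f"k = {k:5d}   loss ~ {loss:.3f}   (target 2*eps = {2 * eps:.1f})")
```

The printed losses decay roughly like B/√k, so getting below 2ε requires k on the order of n, in line with the Ω(n) claim.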

21. Sparsifying by zeroing small weights fails. [Figure: label Y with P(Y = ±1) = 1/2; intermediate variables Z_1, …, Z_s with P(Z_1 = y | y) = (1 + 2/3)/2, …, P(Z_s = y | y) = (1 + 1/3)/2; observed features X_1, …, X_{sn} with P(X_j = z_⌈j/s⌉ | z_⌈j/s⌉) = 7/8.]

22. Sparsifying by zeroing small weights fails. [Figure: label Y with P(Y = ±1) = 1/2; intermediate variables Z_1, …, Z_s with P(Z_1 = y | y) = (1 + 2/3)/2, …, P(Z_s = y | y) = (1 + 1/3)/2; observed features X_1, …, X_{sn} with P(X_j = z_⌈j/s⌉ | z_⌈j/s⌉) = 7/8.] The figure is annotated 'larger weights'.

23. Sparsifying by zeroing small weights fails. [Figure: label Y with P(Y = ±1) = 1/2; intermediate variables Z_1, …, Z_s with P(Z_1 = y | y) = (1 + 2/3)/2, …, P(Z_s = y | y) = (1 + 1/3)/2; observed features X_1, …, X_{sn} with P(X_j = z_⌈j/s⌉ | z_⌈j/s⌉) = 7/8.] The figure is annotated 'larger weights' and 'initial weights on the regularization path': following the ℓ1 regularization path also fails on this example.

24. Intermediate Summary. We answer a fundamental question: how much sparsity does a low ℓ1 norm guarantee? ‖w̃‖_0 ≤ O(‖w‖_1²/ε²). This is tight, and it is achievable by a simple randomized procedure. Coming next: a direct approach also works!

25. Intermediate Summary. We answer a fundamental question: how much sparsity does a low ℓ1 norm guarantee? ‖w̃‖_0 ≤ O(‖w‖_1²/ε²). This is tight, and it is achievable by a simple randomized procedure. Coming next: a direct approach also works! [Diagram] Distribution D and loss L → convex opt. → low-ℓ1 predictor w → randomized sparsification → sparse predictor w̃; a forward selection procedure goes directly from D and L to the sparse predictor.

26. Greedy Forward Selection.
Step 1: define a slightly modified loss function L̃(v, y) = min_u [ (λ²/ε)(u − v)² + L(u, y) ]. Using infimal convolution theory, it can be shown that L̃ has a Lipschitz-continuous derivative and that |L(v, y) − L̃(v, y)| ≤ ε/4 for all v, y.
Step 2: apply forward greedy selection on L̃. Initialize w_1 = 0. At each step, choose the feature corresponding to the largest element of the gradient, choose the step size η_t (a closed-form solution exists), and update w_{t+1} = (1 − η_t) w_t + η_t B e_{j_t}.
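The two steps above translate into a short routine. The sketch below runs the forward greedy (Frank-Wolfe style) selection over the ℓ1 ball of radius B; the closed-form step size mentioned on the slide is replaced by the standard η_t = 2/(t + 2) schedule, and the function name and signature are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def greedy_forward_selection(X, y, B, T, grad_loss):
    """Forward greedy selection over the l1 ball of radius B (a sketch).

    grad_loss(p, y) must return the derivative of the smoothed loss L~ with
    respect to the predictions p = X @ w."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        g = X.T @ grad_loss(X @ w, y) / n   # gradient of the empirical smoothed risk
        j = np.argmax(np.abs(g))            # feature with the largest |gradient| entry
        eta = 2.0 / (t + 2)                 # simple step-size schedule (assumption)
        w *= 1 - eta                        # w_{t+1} = (1 - eta_t) w_t + eta_t * (+/- B) e_j
        w[j] -= eta * np.sign(g[j]) * B     # move against the sign of the gradient
    return w                                # at most T nonzero coordinates after T steps
```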

27. Greedy Forward Selection. Example, the hinge loss: L(v, y) = max{0, 1 − v}. Its smoothed version is L̃(v, y) = 0 if v > 1; (1/ε)(v − 1)² if v ∈ [1 − ε/2, 1]; (1 − ε/4) − v otherwise.

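For concreteness, here is a small helper implementing the derivative of the smoothed hinge above, matching its three pieces (zero beyond margin 1, linear on the quadratic region [1 − ε/2, 1], constant −1 below it). Treating the argument as the margin y·⟨w, x⟩ with labels in {−1, +1} is an assumption about how the label enters; the function can be passed to the forward-selection sketch above as grad_loss.

```python
import numpy as np

def smoothed_hinge_grad(p, y, eps=0.5):
    """Derivative of the smoothed hinge L~ with respect to the prediction p
    (a sketch; the margin is taken to be m = y * p)."""
    m = y * p
    dm = np.where(m > 1.0, 0.0,
         np.where(m >= 1.0 - eps / 2.0, (2.0 / eps) * (m - 1.0), -1.0))
    return y * dm                           # chain rule: dL~/dp = y * dL~/dm
```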

29. Guarantees.
Theorem: let X = {x : ‖x‖_∞ ≤ 1}, let Y be an arbitrary set, let D be an arbitrary distribution over X × Y, and let the loss L : ℝ × Y → ℝ be proper, convex, and λ-Lipschitz w.r.t. its first argument. Forward greedy selection on L̃ finds w̃ with ‖w̃‖_0 = O(λ²B²/ε²) such that, for any w with ‖w‖_1 ≤ B, E[L(⟨w̃, x⟩, y)] − E[L(⟨w, x⟩, y)] ≤ ε.
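Putting the two sketches together on synthetic data (the data, the radius B, the number of steps T, and the smoothing ε below are illustrative assumptions, not values from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1_000, 200
X = rng.choice([-1.0, 1.0], size=(n, d))        # features with ||x||_inf <= 1
y = np.sign(X[:, :5].sum(axis=1) + 0.5)         # labels driven by the first 5 features

w_tilde = greedy_forward_selection(
    X, y, B=5.0, T=30,
    grad_loss=lambda p, yy: smoothed_hinge_grad(p, yy, eps=0.5))

print("nonzeros:", np.count_nonzero(w_tilde))   # at most T = 30 by construction
```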

30. Related Work.
ℓ1 norm and sparsity: Donoho provides sufficient conditions under which the minimizer of the ℓ1 norm is also sparse. But what if these conditions are not met?
Compressed sensing: the ℓ1 norm recovers a sparse predictor, but only under severe assumptions on the design matrix (in our case, the training examples).
Converse question: does small ‖w̃‖_0 imply small ‖w‖_1? Servedio gives a partial answer for the case of linear classification; Wainwright gives a partial answer for the Lasso.
Sparsification: a randomized sparsification procedure was previously proposed by Schapire et al., but their bound depends on the training set size. Lee, Bartlett, and Williamson addressed a similar question for the special case of the squared-error loss. Zhang presented a forward greedy procedure for twice-differentiable losses.

31. Summary. [Diagram] Distribution D and loss L → convex opt. → low-ℓ1 predictor w → randomized sparsification → sparse predictor w̃ with ‖w̃‖_0 ≤ O(‖w‖_1²/ε²), and this is tight; sparsifying by taking the largest weights does not come with this guarantee.

32. Summary. [Diagram] Distribution D and loss L → convex opt. → low-ℓ1 predictor w → randomized sparsification → sparse predictor w̃ with ‖w̃‖_0 ≤ O(‖w‖_1²/ε²), and this is tight; a forward selection procedure reaches the sparse predictor directly, while taking the largest weights or following the regularization path does not come with this guarantee.

33. Summary. [Diagram] Distribution D and loss L → convex opt. → low-ℓ1 predictor w → randomized sparsification → sparse predictor w̃ with ‖w̃‖_0 ≤ O(‖w‖_1²/ε²), and this is tight; a forward selection procedure reaches the sparse predictor directly, while taking the largest weights or following the regularization path does not come with this guarantee. Convex optimization under an ℓ2 constraint yields a low-ℓ2 predictor, which does not guarantee sparsifiability.
