1. RegML 2016 Class 6: Structured Sparsity
Lorenzo Rosasco, UNIGE-MIT-IIT
June 30, 2016

2. Exploiting structure
The building blocks of a function can be more structured than single variables.
L.Rosasco, RegML 2016

3. Sparsity
Variables are divided into non-overlapping groups.

4. Group sparsity
◮ f(x) = Σ_{j=1}^d w_j x_j
◮ w = (w_1, …, w_d) is partitioned into blocks w(1), …, w(G)
◮ each group G_g has size |G_g|, so w(g) ∈ ℝ^{|G_g|}

5. Group sparsity regularization
Regularization exploiting structure:
R_group(w) = Σ_{g=1}^G ‖w(g)‖ = Σ_{g=1}^G √( Σ_{j=1}^{|G_g|} (w(g))_j² )

6. Group sparsity regularization (cont.)
Compare to
Σ_{g=1}^G ‖w(g)‖² = Σ_{g=1}^G Σ_{j=1}^{|G_g|} (w(g))_j²

7. Group sparsity regularization (cont.)
…or to
Σ_{g=1}^G ‖w(g)‖₁ = Σ_{g=1}^G Σ_{j=1}^{|G_g|} |(w(g))_j|

8. ℓ1−ℓ2 norm
We take the ℓ2 norm of each group, forming the vector (‖w(1)‖, …, ‖w(G)‖), and then the ℓ1 norm of this vector:
Σ_{g=1}^G ‖w(g)‖
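As a quick numeric illustration of the three quantities above (a minimal sketch: the vector and the groups are made up, not from the slides):

```python
import numpy as np

# hypothetical example: d = 4 variables in two non-overlapping groups
w = np.array([3.0, 4.0, 0.0, 1.0])
groups = [[0, 1], [2, 3]]

# l2 norm of each group: (||w(1)||, ||w(2)||) = (5, 1)
group_norms = np.array([np.linalg.norm(w[g]) for g in groups])

r_group  = group_norms.sum()         # l1 of the group norms: 5 + 1 = 6
sum_sq   = (group_norms ** 2).sum()  # sum of squared group norms: 25 + 1 = 26
plain_l1 = np.abs(w).sum()           # plain l1 norm: 3 + 4 + 0 + 1 = 8
```

Note that r_group sits strictly between the two alternatives: it is the quantity that induces sparsity at the level of whole groups.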

9. Group Lasso
min_w (1/n)‖X̂w − ŷ‖² + λ Σ_{g=1}^G ‖w(g)‖
◮ reduces to the Lasso if all groups have cardinality one

10. Computations
min_w (1/n)‖X̂w − ŷ‖² + λ Σ_{g=1}^G ‖w(g)‖, where the penalty term is non-differentiable.
The problem is convex and non-smooth, but has composite structure, so proximal gradient descent applies:
w_{t+1} = Prox_{γλ R_group}( w_t − γ (2/n) X̂⊤(X̂ w_t − ŷ) )

11. Block thresholding
It can be shown that
Prox_{λ R_group}(w) = ( Prox_{λ‖·‖}(w(1)), …, Prox_{λ‖·‖}(w(G)) )
with
(Prox_{λ‖·‖}(w(g)))_j = w(g)_j − λ w(g)_j / ‖w(g)‖   if ‖w(g)‖ > λ
(Prox_{λ‖·‖}(w(g)))_j = 0                            if ‖w(g)‖ ≤ λ
◮ Entire groups of coefficients are set to zero!
◮ Reduces to soft thresholding if all groups have cardinality one
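The block thresholding rule and the proximal iteration can be sketched in NumPy as follows (a minimal illustration, not the course code; `block_soft_threshold` and `group_lasso_ista` are hypothetical names and the step size / iteration count are arbitrary):

```python
import numpy as np

def block_soft_threshold(w, groups, lam):
    """Prox of lam * sum_g ||w(g)||: shrink each group's l2 norm by lam,
    zeroing every group whose norm is at most lam."""
    out = np.zeros_like(w, dtype=float)
    for g in groups:
        ng = np.linalg.norm(w[g])
        if ng > lam:
            out[g] = (1.0 - lam / ng) * w[g]
    return out

def group_lasso_ista(X, y, groups, lam, gamma, iters=500):
    """Proximal gradient for (1/n)||Xw - y||^2 + lam * sum_g ||w(g)||."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = (2.0 / n) * X.T @ (X @ w - y)      # gradient of the smooth part
        w = block_soft_threshold(w - gamma * grad, groups, gamma * lam)
    return w
```

With the identity design the iteration reaches the closed-form prox solution: whole groups are zeroed out, exactly as the slide claims.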

12. Other norms: ℓ1−ℓp norms
R(w) = Σ_{g=1}^G ‖w(g)‖_p = Σ_{g=1}^G ( Σ_{j=1}^{|G_g|} |(w(g))_j|^p )^{1/p}

13. Overlapping groups
Variables are divided into possibly overlapping groups.

14. Regularization with overlapping groups
Group Lasso:
R_GL(w) = Σ_{g=1}^G ‖w(g)‖

15. Regularization with overlapping groups (cont.)
→ Whole groups are zeroed out, so the selected variables form intersections of group complements (the complement of a union of groups).

16. Regularization with overlapping groups
Let w̄(g) ∈ ℝ^d be equal to w(g) on group G_g and zero otherwise.

17. Regularization with overlapping groups (cont.)
Group Lasso with overlap:
R_GLO(w) = inf{ Σ_{g=1}^G ‖w(g)‖ : w(1), …, w(G) s.t. w = Σ_{g=1}^G w̄(g) }

18. Regularization with overlapping groups (cont.)
◮ There are multiple ways to write w = Σ_{g=1}^G w̄(g); the infimum picks the cheapest decomposition.

19. Regularization with overlapping groups (cont.)
◮ The selected variables are unions of groups!

20. An equivalence
It holds that
min_w (1/n)‖X̂w − ŷ‖² + λ R_GLO(w)  ⇔  min_{w̃} (1/n)‖X̃w̃ − ŷ‖² + λ Σ_{g=1}^G ‖w(g)‖
◮ X̃ is the matrix obtained by replicating the columns/variables of X̂ group by group
◮ w̃ = (w(1), …, w(G)) is a vector with (non-overlapping!) groups
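The column-replication trick can be sketched numerically (a toy sketch: the design matrix, groups, and coefficients are made up):

```python
import numpy as np

# hypothetical setting: d = 3 variables, overlapping groups G1 = {0,1}, G2 = {1,2}
groups = [[0, 1], [1, 2]]
X = np.arange(12.0).reshape(4, 3)

# X_tilde replicates each group's columns; its variables form non-overlapping blocks
X_tilde = np.hstack([X[:, g] for g in groups])   # shape (4, 4)

# a solution w_tilde = (w(1), w(2)) of the replicated problem maps back via w = sum_g w_bar(g)
w_tilde = np.array([1.0, 2.0, 3.0, 4.0])
w = np.zeros(3)
offset = 0
for g in groups:
    w[g] += w_tilde[offset:offset + len(g)]      # variable 1 receives contributions from both groups
    offset += len(g)
```

Predictions agree by construction, X̃w̃ = Σ_g X̂|_{G_g} w(g) = X̂ Σ_g w̄(g) = X̂w, which is the heart of the equivalence.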

21. An equivalence (cont.)
Indeed,
min_w (1/n)‖X̂w − ŷ‖² + λ inf_{w(1),…,w(G): Σ_g w̄(g) = w} Σ_{g=1}^G ‖w(g)‖
= inf_{w(1),…,w(G)} (1/n)‖X̂ Σ_{g=1}^G w̄(g) − ŷ‖² + λ Σ_{g=1}^G ‖w(g)‖
= inf_{w(1),…,w(G)} (1/n)‖Σ_{g=1}^G X̂ w̄(g) − ŷ‖² + λ Σ_{g=1}^G ‖w(g)‖
= inf_{w(1),…,w(G)} (1/n)‖Σ_{g=1}^G X̂|_{G_g} w(g) − ŷ‖² + λ Σ_{g=1}^G ‖w(g)‖
= min_{w̃} (1/n)‖X̃w̃ − ŷ‖² + λ Σ_{g=1}^G ‖w(g)‖,
where X̂|_{G_g} denotes X̂ restricted to the columns in G_g.

22. Computations
◮ One can use block thresholding with replicated variables ⇒ potentially wasteful
◮ The proximal operator of R_GLO can be computed efficiently, but not in closed form

23. More structure
Structured overlapping groups:
◮ trees
◮ DAGs
◮ …
Structure can be exploited in computations…

24. Beyond linear models
Consider a dictionary made by the union of G distinct dictionaries:
f(x) = Σ_{g=1}^G f_g(x) = Σ_{g=1}^G Φ_g(x)⊤ w(g),
where each dictionary defines a feature map Φ_g(x) = (φ^g_1(x), …, φ^g_{p_g}(x)).
Easy extension with the usual change of variable…

25. Representer theorems
Let
f(x) = Σ_{g=1}^G x⊤ w̄(g) = Σ_{g=1}^G x̄(g)⊤ w̄(g) = Σ_{g=1}^G f_g(x)

26. Representer theorems (cont.)
Idea: show that
w̄(g) = Σ_{i=1}^n x̄(g)_i c(g)_i,
i.e.
f_g(x) = Σ_{i=1}^n x̄(g)⊤ x̄(g)_i c(g)_i = Σ_{i=1}^n K_g(x, x_i) c(g)_i,
since x̄(g)⊤ x̄(g)_i = Φ_g(x)⊤ Φ_g(x_i) = K_g(x, x_i).

27. Representer theorems (cont.)
Note that in this case
‖f_g‖² = ‖w̄(g)‖² = c(g)⊤ X̂(g) X̂(g)⊤ c(g) = c(g)⊤ K̂(g) c(g),
where K̂(g) = X̂(g) X̂(g)⊤ is the kernel matrix of the g-th group.

28. Coefficients update
c_{t+1} = Prox_{γλ R_group}( c_t − γ (K̂ c_t − ŷ) ),
where K̂ = (K̂(1), …, K̂(G)) and c_t = (c_t(1), …, c_t(G)).
Block thresholding: it can be shown that
(Prox_{λ‖·‖}(c(g)))_j = c(g)_j − λ c(g)_j / √(c(g)⊤ K̂(g) c(g))   if ‖f_g‖ > λ
(Prox_{λ‖·‖}(c(g)))_j = 0                                         if ‖f_g‖ ≤ λ,
with ‖f_g‖ = √(c(g)⊤ K̂(g) c(g)).

29. Non-parametric sparsity
f(x) = Σ_{g=1}^G f_g(x)
f_g(x) = Σ_{i=1}^n x̄(g)⊤ x̄(g)_i c(g)_i   ↦   f_g(x) = Σ_{i=1}^n K_g(x, x_i) c(g)_i
with (K_1, …, K_G) a family of kernels, and
Σ_{g=1}^G ‖w(g)‖   ⇒   Σ_{g=1}^G ‖f_g‖_{K_g}

30. ℓ1 MKL
inf_{w(1),…,w(G): Σ_g w̄(g)=w} (1/n)‖X̂w − ŷ‖² + λ Σ_{g=1}^G ‖w(g)‖
⇓
min_{f_1,…,f_G} (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ Σ_{g=1}^G ‖f_g‖_{K_g}   s.t. f = Σ_{g=1}^G f_g

31. ℓ2 MKL
Σ_{g=1}^G ‖w(g)‖²   ⇒   Σ_{g=1}^G ‖f_g‖²_{K_g}
Corresponds to using the kernel
K(x, x′) = Σ_{g=1}^G K_g(x, x′)
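Since ℓ2 MKL corresponds to using the sum kernel, it reduces to ordinary kernel ridge regression with K = Σ_g K_g. A minimal sketch (made-up 1-D data; `gaussian_kernel` is a hypothetical helper and the widths 0.1, 0.5, 1.0 are arbitrary choices):

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between row-sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# synthetic 1-D regression data
X = np.linspace(0.0, 1.0, 20)[:, None]
y = np.sin(2.0 * np.pi * X[:, 0])

# l2 MKL: just sum the kernels and run kernel ridge regression
K = sum(gaussian_kernel(X, X, s) for s in (0.1, 0.5, 1.0))
lam = 1e-3
c = np.linalg.solve(K + lam * len(X) * np.eye(len(X)), y)
y_hat = K @ c
```

This is why ℓ2 MKL is so much cheaper than ℓ1 MKL: no per-kernel optimization is needed, only one linear system with the combined kernel.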

32. ℓ1 or ℓ2 MKL?
◮ ℓ2 is *much* faster
◮ ℓ1 can be useful if only a few kernels are relevant

33. Why MKL?
◮ Data fusion: different features
◮ Model selection, e.g. Gaussian kernels with different widths
◮ Richer models: many kernels!

34. MKL & kernel learning
It can be shown that
min_{f_1,…,f_G} (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ Σ_{g=1}^G ‖f_g‖_{K_g}   s.t. f = Σ_{g=1}^G f_g
⇔
min_{K∈𝒦} min_{f∈H_K} (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ ‖f‖²_K,
where 𝒦 = { K | K = Σ_g α_g K_g, α_g ≥ 0 }.

35. Sparsity beyond vectors
Recall multi-variable regression:
(x_i, y_i)_{i=1}^n, x_i ∈ ℝ^d, y_i ∈ ℝ^T, f(x) = x⊤W with W of size d × T,
min_W ‖X̂W − Ŷ‖²_F + λ Tr(W A W⊤)
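For the special case A = I the penalty Tr(WAW⊤) = ‖W‖²_F, and the problem decouples into one ridge regression per task with a shared closed-form solution. A minimal sketch (synthetic data; names and sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 50, 5, 3
X = rng.standard_normal((n, d))     # shared design matrix
Y = rng.standard_normal((n, T))     # one output column per task
lam = 0.1

# with A = I, setting the gradient of ||XW - Y||_F^2 + lam * Tr(W W^T) to zero
# gives the normal equations (X^T X + lam I) W = X^T Y
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)   # shape (d, T)
```

A general positive-definite A couples the tasks; the identity case above only illustrates the objective, not the structured variants on the next slide.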

36. Sparse regularization
◮ We have seen Tr(WW⊤) = Σ_{j=1}^d Σ_{t=1}^T (W_{t,j})²
◮ We could now consider Σ_{j=1}^d Σ_{t=1}^T |W_{t,j}|
◮ …

37. Spectral norms / p-Schatten norms
◮ We have seen Tr(WW⊤) = Σ_{i=1}^{min{d,T}} σ_i²
◮ We could now consider R(W) = ‖W‖_* = Σ_{i=1}^{min{d,T}} σ_i, the nuclear norm
◮ or R(W) = ( Σ_{i=1}^{min{d,T}} σ_i^p )^{1/p}, the p-Schatten norm
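All of these norms can be evaluated from the singular values of W; a small numeric sketch (the matrix below is made up so the singular values are 4 and 3):

```python
import numpy as np

W = np.array([[3.0, 0.0],
              [0.0, 4.0],
              [0.0, 0.0]])                      # d = 3, T = 2
s = np.linalg.svd(W, compute_uv=False)          # singular values, descending: (4, 3)

frob2   = (s ** 2).sum()                        # Tr(W W^T) = 16 + 9 = 25
nuclear = s.sum()                               # ||W||_* = 4 + 3 = 7
schatten_p = lambda p: (s ** p).sum() ** (1.0 / p)   # p-Schatten norm
```

As a sanity check, the 2-Schatten norm is the Frobenius norm (here 5), while the nuclear norm (p = 1) plays the role of the ℓ1 norm on the spectrum and promotes low rank.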
