RegML 2016 Class 6: Structured Sparsity (Lorenzo Rosasco)


SLIDE 1

RegML 2016 Class 6 Structured sparsity

Lorenzo Rosasco, UNIGE-MIT-IIT, June 30, 2016

SLIDE 2

Exploiting structure

Building blocks of a function can be more structured than single variables.

L.Rosasco, RegML 2016

SLIDE 3

Sparsity

Variables divided into non-overlapping groups.

SLIDE 4

Group sparsity

◮ f(x) = \sum_{j=1}^{d} w_j x_j
◮ w = (\underbrace{w_1, \dots}_{w^{(1)}}, \dots, \underbrace{\dots, w_d}_{w^{(G)}}), the variables partitioned into groups G_1, \dots, G_G
◮ each group G_g has size |G_g|, so w^{(g)} \in \mathbb{R}^{|G_g|}

SLIDE 5

Group sparsity regularization

Regularization exploiting structure:

R_{group}(w) = \sum_{g=1}^{G} \|w^{(g)}\| = \sum_{g=1}^{G} \sqrt{\sum_{j=1}^{|G_g|} \big(w^{(g)}_j\big)^2}

SLIDE 6

Group sparsity regularization

Regularization exploiting structure:

R_{group}(w) = \sum_{g=1}^{G} \|w^{(g)}\| = \sum_{g=1}^{G} \sqrt{\sum_{j=1}^{|G_g|} \big(w^{(g)}_j\big)^2}

Compare to

\sum_{g=1}^{G} \|w^{(g)}\|^2 = \sum_{g=1}^{G} \sum_{j=1}^{|G_g|} \big(w^{(g)}_j\big)^2

SLIDE 7

Group sparsity regularization

Regularization exploiting structure:

R_{group}(w) = \sum_{g=1}^{G} \|w^{(g)}\| = \sum_{g=1}^{G} \sqrt{\sum_{j=1}^{|G_g|} \big(w^{(g)}_j\big)^2}

Compare to

\sum_{g=1}^{G} \|w^{(g)}\|^2 = \sum_{g=1}^{G} \sum_{j=1}^{|G_g|} \big(w^{(g)}_j\big)^2

and

\sum_{g=1}^{G} \|w^{(g)}\|_1 = \sum_{g=1}^{G} \sum_{j=1}^{|G_g|} \big|(w^{(g)})_j\big|

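As a quick numerical illustration of the three penalties above, the following sketch (not from the slides; the group partition and weight values are made up) computes each one with NumPy:

```python
import numpy as np

def group_penalty(w, groups):
    """ell1-ell2 penalty: sum over groups of the Euclidean norm of each block."""
    return sum(np.linalg.norm(w[g]) for g in groups)

def ridge_penalty(w, groups):
    """Sum of squared group norms: equals the squared ell2 norm of w."""
    return sum(np.linalg.norm(w[g]) ** 2 for g in groups)

def l1_penalty(w, groups):
    """Sum of group ell1 norms: equals the ell1 norm of w."""
    return sum(np.abs(w[g]).sum() for g in groups)

# Toy weight vector with two groups, of sizes 2 and 3 (hypothetical values).
groups = [np.array([0, 1]), np.array([2, 3, 4])]
w = np.array([3.0, 4.0, 0.0, 0.0, 2.0])

r_group = group_penalty(w, groups)   # ||(3,4)|| + ||(0,0,2)|| = 5 + 2 = 7
r_ridge = ridge_penalty(w, groups)   # 25 + 4 = 29
r_l1 = l1_penalty(w, groups)         # (3+4) + 2 = 9
```

Note how the group penalty sits between the other two: it is an ℓ1 norm across groups but an ℓ2 norm within each group.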
SLIDE 8

ℓ1 − ℓ2 norm

We take the ℓ2 norm of each group, and then the ℓ1 norm of the resulting vector (\|w^{(1)}\|, \dots, \|w^{(G)}\|):

\sum_{g=1}^{G} \|w^{(g)}\|

SLIDE 9

Group Lasso

\min_{w} \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \sum_{g=1}^{G} \|w^{(g)}\|

◮ reduces to the Lasso if groups have cardinality one

SLIDE 10

Computations

\min_{w} \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \underbrace{\sum_{g=1}^{G} \|w^{(g)}\|}_{\text{non-differentiable}}

Convex and non-smooth, but with composite structure:

w_{t+1} = \mathrm{Prox}_{\gamma\lambda R_{group}}\Big( w_t - \gamma \frac{2}{n} \hat{X}^\top(\hat{X}w_t - \hat{y}) \Big)

SLIDE 11

Block thresholding

It can be shown that

\mathrm{Prox}_{\lambda R_{group}}(w) = \big(\mathrm{Prox}_{\lambda\|\cdot\|}(w^{(1)}), \dots, \mathrm{Prox}_{\lambda\|\cdot\|}(w^{(G)})\big)

\big(\mathrm{Prox}_{\lambda\|\cdot\|}(w^{(g)})\big)_j =
\begin{cases}
w^{(g)}_j - \lambda \dfrac{w^{(g)}_j}{\|w^{(g)}\|} & \|w^{(g)}\| > \lambda \\
0 & \|w^{(g)}\| \le \lambda
\end{cases}

◮ Entire groups of coefficients are set to zero!
◮ Reduces to soft thresholding if groups have cardinality one

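A minimal NumPy sketch of block thresholding and of the resulting proximal gradient iteration (the groups, data, and step size below are made-up illustrations, not from the slides):

```python
import numpy as np

def block_threshold(w, groups, lam):
    """Prox of lam * sum_g ||w^(g)||: shrink each group's norm by lam, zeroing small groups."""
    out = np.zeros_like(w)
    for g in groups:
        norm = np.linalg.norm(w[g])
        if norm > lam:
            out[g] = w[g] * (1.0 - lam / norm)  # keep direction, shrink length
    return out

def group_lasso_step(w, X, y, groups, lam, gamma):
    """One iteration w_{t+1} = Prox_{gamma*lam*R}( w_t - gamma*(2/n) X^T (X w - y) )."""
    n = X.shape[0]
    grad = (2.0 / n) * X.T @ (X @ w - y)
    return block_threshold(w - gamma * grad, groups, gamma * lam)

groups = [np.array([0, 1]), np.array([2, 3])]
w = np.array([3.0, 4.0, 0.3, 0.4])

# First group has norm 5 > 1, so it is shrunk to norm 4;
# second group has norm 0.5 <= 1, so it is zeroed entirely.
shrunk = block_threshold(w, groups, lam=1.0)
```

This is exactly the "entire groups set to zero" behavior: the threshold acts on the group norm, not on individual coefficients.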
SLIDE 12

Other norms

ℓ1 − ℓp norms:

R(w) = \sum_{g=1}^{G} \|w^{(g)}\|_p = \sum_{g=1}^{G} \Big( \sum_{j=1}^{|G_g|} \big|(w^{(g)})_j\big|^p \Big)^{1/p}

SLIDE 13

Overlapping groups

Variables divided into possibly overlapping groups.

SLIDE 14

Regularization with overlapping groups

Group Lasso:

R_{GL}(w) = \sum_{g=1}^{G} \|w^{(g)}\|

SLIDE 15

Regularization with overlapping groups

Group Lasso:

R_{GL}(w) = \sum_{g=1}^{G} \|w^{(g)}\|

→ The selected variables form an intersection of group complements (whole groups are zeroed out, so the support is the complement of a union of groups)

SLIDE 16

Regularization with overlapping groups

Let \bar{w}^{(g)} \in \mathbb{R}^d be equal to w^{(g)} on the group G_g and zero otherwise.

SLIDE 17

Regularization with overlapping groups

Let \bar{w}^{(g)} \in \mathbb{R}^d be equal to w^{(g)} on the group G_g and zero otherwise.

Group Lasso with overlap:

R_{GLO}(w) = \inf\Big\{ \sum_{g=1}^{G} \|w^{(g)}\| \;:\; w^{(1)}, \dots, w^{(G)} \text{ s.t. } w = \sum_{g=1}^{G} \bar{w}^{(g)} \Big\}

SLIDE 18

Regularization with overlapping groups

Let \bar{w}^{(g)} \in \mathbb{R}^d be equal to w^{(g)} on the group G_g and zero otherwise.

Group Lasso with overlap:

R_{GLO}(w) = \inf\Big\{ \sum_{g=1}^{G} \|w^{(g)}\| \;:\; w^{(1)}, \dots, w^{(G)} \text{ s.t. } w = \sum_{g=1}^{G} \bar{w}^{(g)} \Big\}

◮ Multiple ways to write w = \sum_{g=1}^{G} \bar{w}^{(g)}

SLIDE 19

Regularization with overlapping groups

Let \bar{w}^{(g)} \in \mathbb{R}^d be equal to w^{(g)} on the group G_g and zero otherwise.

Group Lasso with overlap:

R_{GLO}(w) = \inf\Big\{ \sum_{g=1}^{G} \|w^{(g)}\| \;:\; w^{(1)}, \dots, w^{(G)} \text{ s.t. } w = \sum_{g=1}^{G} \bar{w}^{(g)} \Big\}

◮ Multiple ways to write w = \sum_{g=1}^{G} \bar{w}^{(g)}
◮ Selected variables are unions of groups!

SLIDE 20

An equivalence

It holds that

\min_{w} \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda R_{GLO}(w) \;\Longleftrightarrow\; \min_{\tilde{w}} \frac{1}{n}\|\tilde{X}\tilde{w} - \hat{y}\|^2 + \lambda \sum_{g=1}^{G} \|w^{(g)}\|

◮ \tilde{X} is the matrix obtained by replicating columns/variables
◮ \tilde{w} = (w^{(1)}, \dots, w^{(G)}), a vector with (non-overlapping!) groups

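The replicated design matrix can be built directly; a sketch (the overlapping groups and the data dimensions below are hypothetical):

```python
import numpy as np

# Hypothetical overlapping groups over d = 4 variables: variable 2 is shared.
groups = [[0, 1, 2], [2, 3]]

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))  # n = 10 samples, d = 4 variables

# X_tilde replicates the columns of each group side by side,
# so the shared variable 2 appears twice.
X_tilde = np.hstack([X[:, g] for g in groups])

# w_tilde lives in R^5 with *non-overlapping* groups of sizes 3 and 2,
# so the standard group lasso machinery applies to (X_tilde, w_tilde).
tilde_groups = [np.arange(3), np.arange(3, 5)]
```

After solving the non-overlapping problem in the replicated space, a solution for the original problem is recovered by summing the replicated copies back, w = Σ_g w̄^(g).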
SLIDE 21

An equivalence (cont.)

Indeed,

\min_{w} \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \inf_{\substack{w^{(1)}, \dots, w^{(G)} \\ \text{s.t. } \sum_{g=1}^{G} \bar{w}^{(g)} = w}} \sum_{g=1}^{G} \|w^{(g)}\|

= \inf_{\substack{w^{(1)}, \dots, w^{(G)} \\ \text{s.t. } \sum_{g=1}^{G} \bar{w}^{(g)} = w}} \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \sum_{g=1}^{G} \|w^{(g)}\|

= \inf_{w^{(1)}, \dots, w^{(G)}} \frac{1}{n}\Big\|\hat{X}\Big(\sum_{g=1}^{G} \bar{w}^{(g)}\Big) - \hat{y}\Big\|^2 + \lambda \sum_{g=1}^{G} \|w^{(g)}\|

= \inf_{w^{(1)}, \dots, w^{(G)}} \frac{1}{n}\Big\|\sum_{g=1}^{G} \hat{X}_{|G_g} w^{(g)} - \hat{y}\Big\|^2 + \lambda \sum_{g=1}^{G} \|w^{(g)}\|

= \min_{\tilde{w}} \frac{1}{n}\|\tilde{X}\tilde{w} - \hat{y}\|^2 + \lambda \sum_{g=1}^{G} \|w^{(g)}\|

SLIDE 22

Computations

◮ Can use block thresholding with the replicated variables ⇒ potentially wasteful
◮ The proximal operator for R_{GLO} can be computed efficiently, but not in closed form

SLIDE 23

More structure

Structured overlapping groups:

◮ trees
◮ DAGs
◮ . . .

Structure can be exploited in computations. . .

SLIDE 24

Beyond linear models

Consider a dictionary made by the union of distinct dictionaries:

f(x) = \sum_{g=1}^{G} f_g(x) = \sum_{g=1}^{G} \Phi_g(x)^\top w^{(g)},

where each dictionary defines a feature map \Phi_g(x) = (\phi^g_1(x), \dots, \phi^g_{p_g}(x)).

Easy extension with the usual change of variable...

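Such a union of dictionaries can be sketched as follows (the two feature maps, polynomial and Gaussian, are hypothetical choices for illustration):

```python
import numpy as np

# Two hypothetical dictionaries/feature maps on x in R:
# a few polynomial features and a couple of Gaussian bumps.
def phi_poly(x):
    return np.array([x, x ** 2])

def phi_gauss(x):
    return np.array([np.exp(-(x - 1.0) ** 2), np.exp(-(x + 1.0) ** 2)])

def f(x, w_poly, w_gauss):
    """f(x) = Phi_1(x)^T w^(1) + Phi_2(x)^T w^(2): one additive term per dictionary."""
    return phi_poly(x) @ w_poly + phi_gauss(x) @ w_gauss

# With the Gaussian group set to zero, only the polynomial part contributes:
# f(2) = 1*2 + 0.5*4 = 4.
val = f(2.0, w_poly=np.array([1.0, 0.5]), w_gauss=np.zeros(2))
```

Group sparsity on (w^(1), w^(2)) then selects or discards whole dictionaries at once.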
SLIDE 25

Representer theorems

Let

f(x) = x^\top\Big(\sum_{g=1}^{G} \bar{w}^{(g)}\Big) = \sum_{g=1}^{G} \bar{x}^{(g)\top} \bar{w}^{(g)} = \sum_{g=1}^{G} f_g(x).

SLIDE 26

Representer theorems

Let

f(x) = x^\top\Big(\sum_{g=1}^{G} \bar{w}^{(g)}\Big) = \sum_{g=1}^{G} \bar{x}^{(g)\top} \bar{w}^{(g)} = \sum_{g=1}^{G} f_g(x).

Idea: show that \bar{w}^{(g)} = \sum_{i=1}^{n} \bar{x}^{(g)}_i c^{(g)}_i, i.e.

f_g(x) = \sum_{i=1}^{n} \bar{x}^{(g)\top} \bar{x}^{(g)}_i c^{(g)}_i = \sum_{i=1}^{n} \underbrace{x^{(g)\top} x^{(g)}_i}_{\Phi_g(x)^\top \Phi_g(x_i) = K_g(x, x_i)} c^{(g)}_i

SLIDE 27

Representer theorems

Let

f(x) = x^\top\Big(\sum_{g=1}^{G} \bar{w}^{(g)}\Big) = \sum_{g=1}^{G} \bar{x}^{(g)\top} \bar{w}^{(g)} = \sum_{g=1}^{G} f_g(x).

Idea: show that \bar{w}^{(g)} = \sum_{i=1}^{n} \bar{x}^{(g)}_i c^{(g)}_i, i.e.

f_g(x) = \sum_{i=1}^{n} \bar{x}^{(g)\top} \bar{x}^{(g)}_i c^{(g)}_i = \sum_{i=1}^{n} \underbrace{x^{(g)\top} x^{(g)}_i}_{\Phi_g(x)^\top \Phi_g(x_i) = K_g(x, x_i)} c^{(g)}_i

Note that in this case

\|f_g\|^2 = \|w^{(g)}\|^2 = c^{(g)\top} \underbrace{\hat{X}^{(g)} \hat{X}^{(g)\top}}_{\hat{K}^{(g)}} c^{(g)}

SLIDE 28

Coefficients update

c_{t+1} = \mathrm{Prox}_{\gamma\lambda R_{group}}\big( c_t - \gamma(\hat{K} c_t - \hat{y}) \big)

where \hat{K} = (\hat{K}^{(1)}, \dots, \hat{K}^{(G)}) and c_t = (c_t^{(1)}, \dots, c_t^{(G)}).

Block Thresholding: it can be shown that

\big(\mathrm{Prox}_{\lambda\|\cdot\|}(c^{(g)})\big)_j =
\begin{cases}
c^{(g)}_j - \lambda \dfrac{c^{(g)}_j}{\|f_g\|} & \|f_g\| > \lambda \\
0 & \|f_g\| \le \lambda
\end{cases}
\qquad \text{with } \|f_g\| = \sqrt{c^{(g)\top} \hat{K}^{(g)} c^{(g)}}

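In the kernel parametrization the group norm ‖f_g‖ is computed through the Gram matrix; a NumPy sketch for one group (the Gram matrix and coefficients below are toy, made-up data):

```python
import numpy as np

def kernel_block_threshold(c, K, lam):
    """Prox step on one group's coefficients: threshold on ||f_g|| = sqrt(c^T K c)."""
    norm_fg = np.sqrt(c @ K @ c)
    if norm_fg <= lam:
        return np.zeros_like(c)       # the whole function f_g is discarded
    return c * (1.0 - lam / norm_fg)  # otherwise the coefficients are shrunk

# Toy positive-semidefinite Gram matrix (linear kernel of random features).
rng = np.random.default_rng(1)
Xg = rng.standard_normal((5, 3))
Kg = Xg @ Xg.T
c = rng.standard_normal(5)

c_zeroed = kernel_block_threshold(c, Kg, lam=1e6)  # huge lambda: group is zeroed
c_kept = kernel_block_threshold(c, Kg, lam=0.0)    # lambda = 0: coefficients unchanged
```

The only difference from the parametric case is that the thresholding test uses the RKHS norm of f_g rather than the Euclidean norm of a weight block.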
SLIDE 29

Non-parametric sparsity

f(x) = \sum_{g=1}^{G} f_g(x), \qquad f_g(x) = \sum_{i=1}^{n} x^{(g)\top} x^{(g)}_i (c^{(g)})_i \;\to\; f_g(x) = \sum_{i=1}^{n} K_g(x, x_i)(c^{(g)})_i

with (K_1, \dots, K_G) a family of kernels, and

\sum_{g=1}^{G} \|w^{(g)}\| \;\Longrightarrow\; \sum_{g=1}^{G} \|f_g\|_{K_g}

SLIDE 30

ℓ1 MKL

\inf_{\substack{w^{(1)}, \dots, w^{(G)} \\ \text{s.t. } \sum_{g=1}^{G} \bar{w}^{(g)} = w}} \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \sum_{g=1}^{G} \|w^{(g)}\|

⇓

\min_{\substack{f_1, \dots, f_G \\ \text{s.t. } \sum_{g=1}^{G} f_g = f}} \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \sum_{g=1}^{G} \|f_g\|_{K_g}

SLIDE 31

ℓ2 MKL

\sum_{g=1}^{G} \|w^{(g)}\|^2 \;\Longrightarrow\; \sum_{g=1}^{G} \|f_g\|^2_{K_g}

Corresponds to using the kernel

K(x, x') = \sum_{g=1}^{G} K_g(x, x')

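So ℓ2 MKL reduces to kernel ridge regression with the sum kernel; a sketch with two Gaussian kernels of different widths (the widths, data, and regularization value are made-up choices):

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 2))
y = np.sin(X[:, 0])

# Sum kernel K = K_1 + K_2: this is what ell-2 MKL uses implicitly.
K = gaussian_kernel(X, X, 0.5) + gaussian_kernel(X, X, 2.0)

# Kernel ridge regression with the sum kernel: solve (K + lam*n*I) c = y.
lam, n = 0.1, len(y)
c = np.linalg.solve(K + lam * n * np.eye(n), y)
y_hat = K @ c
```

A new point x is predicted by evaluating both kernels against the training points and summing, i.e. with the same sum kernel.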
SLIDE 32

ℓ1 or ℓ2 MKL

◮ ℓ2 is *much* faster
◮ ℓ1 can be useful if only a few kernels are relevant

SLIDE 33

Why MKL?

◮ Data fusion: different features
◮ Model selection, e.g. Gaussian kernels with different widths
◮ Richer models: many kernels!

SLIDE 34

MKL & kernel learning

It can be shown that

\min_{\substack{f_1, \dots, f_G \\ \text{s.t. } \sum_{g=1}^{G} f_g = f}} \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \sum_{g=1}^{G} \|f_g\|_{K_g}
\;=\; \min_{K \in \mathcal{K}} \; \min_{f \in \mathcal{H}_K} \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|^2_K

where \mathcal{K} = \{ K \mid K = \sum_g K_g \alpha_g, \; \alpha_g \ge 0 \}.

SLIDE 35

Sparsity beyond vectors

Recall multi-variable regression: (x_i, y_i)_{i=1}^{n} with x_i \in \mathbb{R}^d, y_i \in \mathbb{R}^T,

f(x) = x^\top \underbrace{W}_{d \times T}

\min_{W} \|\hat{X}W - \hat{Y}\|^2_F + \lambda \, \mathrm{Tr}(W A W^\top)

SLIDE 36

Sparse regularization

◮ We have seen

\mathrm{Tr}(WW^\top) = \sum_{j=1}^{d} \sum_{t=1}^{T} (W_{t,j})^2

◮ We could now consider

\sum_{j=1}^{d} \sum_{t=1}^{T} |W_{t,j}|

◮ . . .

SLIDE 37

Spectral Norms/p-Schatten norms

◮ We have seen

\mathrm{Tr}(WW^\top) = \sum_{i=1}^{\min\{d,T\}} \sigma_i^2

◮ We could now consider

R(W) = \|W\|_* = \sum_{i=1}^{\min\{d,T\}} \sigma_i, \quad \text{the nuclear norm}

◮ or

R(W) = \Big(\sum_{i=1}^{\min\{d,T\}} \sigma_i^p\Big)^{1/p}, \quad \text{the p-Schatten norm}

SLIDE 38

Nuclear norm regularization

\min_{W} \|\hat{X}W - \hat{Y}\|^2_F + \lambda \|W\|_*

SLIDE 39

Computations

W_{t+1} = \mathrm{Prox}_{\gamma\lambda\|\cdot\|_*}\Big( W_t - 2\gamma\,\hat{X}^\top(\hat{X}W_t - \hat{Y}) \Big)

Let W = U\Sigma V^\top with \Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_p). Then

\mathrm{Prox}_{\lambda\|\cdot\|_*}(W) = U \, \mathrm{diag}\big(\mathrm{Prox}_{\lambda\|\cdot\|_1}(\sigma_1, \dots, \sigma_p)\big) V^\top

i.e. soft thresholding of the singular values.

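Singular-value soft thresholding can be sketched in a few lines of NumPy (the test matrix below, with singular values 3 and 1, is made up for illustration):

```python
import numpy as np

def prox_nuclear(W, lam):
    """Prox of lam * ||.||_*: soft-threshold the singular values of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_thr = np.maximum(s - lam, 0.0)   # soft thresholding, as for the ell-1 norm
    return U @ np.diag(s_thr) @ Vt

# Diagonal matrix with singular values 3 and 1.
W = np.diag([3.0, 1.0])

# With lam = 2 the singular values become (1, 0): the result has rank 1,
# mirroring how block thresholding zeroes out entire groups.
W_thr = prox_nuclear(W, lam=2.0)
```

Just as the group prox can kill whole groups, this prox can kill whole singular directions, which is what drives low-rank solutions.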
SLIDE 40

This class

◮ Structured sparsity
◮ MKL
◮ Matrix sparsity

SLIDE 41

Next class

Data representation!
