Feature Selection & the Shapley-Folkman Theorem


1. Feature Selection & the Shapley-Folkman Theorem. Alexandre d'Aspremont, CNRS & D.I., École Normale Supérieure. With Armin Askari, Laurent El Ghaoui (UC Berkeley) and Quentin Rebjock (EPFL). CIRM, Luminy, March 2020.

2. Introduction: feature selection.
- Reduce the number of variables while preserving classification performance.
- Often improves test performance, especially when samples are scarce.
- Helps interpretation.
Classical examples: LASSO, $\ell_1$-logistic regression, RFE-SVM, ...

3. Introduction: feature selection.
RNA classification. Find genes which best discriminate cell type (lung cancer vs. control): 35238 genes, 2695 examples. [Lachmann et al., 2018]
[Plot: objective value ($\times 10^{11}$) versus number of features $k$, for $k$ from 0 to 35000.]
Best ten genes: MT-CO3, MT-ND4, MT-CYB, RP11-217O12.1, LYZ, EEF1A1, MT-CO1, HBA2, HBB, HBA1.

4. Introduction: feature selection.
Applications. Mapping brain activity by fMRI. From the PARIETAL team at INRIA.

5. Introduction: feature selection.
fMRI: many voxels and very few samples lead to false discoveries. See the Wired article on Bennett et al., "Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument For Proper Multiple Comparisons Correction", Journal of Serendipitous and Unexpected Results, 2010.

6. Introduction: linear models.
Linear models select features from large weights $w$.
- LASSO solves $\min_w \|Xw - y\|_2^2 + \lambda \|w\|_1$, with linear prediction given by $w^T x$.
- Linear SVM solves $\min_w \sum_i \max\{0, 1 - y_i w^T x_i\} + \lambda \|w\|_2^2$, with linear classification rule $\mathrm{sign}(w^T x)$; both selectors are sketched in code below.
In practice:
- Relatively high complexity on very large-scale data sets.
- Recovery results require uncorrelated features (incoherence, RIP, etc.).
- Cheaper featurewise methods (ANOVA, TF-IDF, etc.) have relatively poor performance.
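Both selectors fit in a few lines with scikit-learn. A minimal sketch on synthetic data; the dataset, penalty values, and top-5 threshold are illustrative assumptions, and sklearn's Lasso scales the quadratic loss slightly differently from the formula above.

```python
# Minimal sketch: feature selection from large linear-model weights.
# Assumes scikit-learn; data, penalties, and thresholds are illustrative.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y_reg = X[:, :5] @ np.ones(5) + 0.1 * rng.standard_normal(200)  # 5 true features
y_clf = np.sign(y_reg)

# LASSO (sklearn minimizes ||Xw - y||^2 / (2n) + alpha ||w||_1):
# nonzero coordinates of w select features.
lasso = Lasso(alpha=0.1).fit(X, y_reg)
print("LASSO selects:", np.flatnonzero(lasso.coef_))

# Linear SVM with hinge loss; select the largest |w_j|.
svm = LinearSVC(C=1.0, loss="hinge", dual=True).fit(X, y_clf)
w = svm.coef_.ravel()
print("Top-5 SVM features:", np.argsort(-np.abs(w))[:5])
```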

7. Outline.
- Sparse Naive Bayes
- The Shapley-Folkman theorem
- Duality gap bounds
- Numerical performance

8. Multinomial Naive Bayes.
In the multinomial model,
$$\log \mathrm{Prob}(x \mid C_\pm) = x^\top \log \theta^\pm + \log \frac{\left(\sum_{j=1}^m x_j\right)!}{\prod_{j=1}^m x_j!}.$$
Training by maximum likelihood:
$$(\theta^+_*, \theta^-_*) = \mathop{\rm argmax}_{\substack{1^\top \theta^+ = 1^\top \theta^- = 1 \\ \theta^+, \theta^- \in [0,1]^m}} \; f_+^\top \log \theta^+ + f_-^\top \log \theta^-.$$
Linear classification rule: for a given test point $x \in \mathbb{R}^m$, set $\hat{y}(x) = \mathrm{sign}(v + w^\top x)$, where $w := \log \theta^+_* - \log \theta^-_*$ and $v := \log \mathrm{Prob}(C_+) - \log \mathrm{Prob}(C_-)$.
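The maximizer has the closed form $\theta^\pm_* = f_\pm / (1^\top f_\pm)$ (normalized class count vectors), so training is one pass over the data. A small sketch, assuming X is a nonnegative count matrix and y takes values in {-1, +1}; the smoothing constant is an added assumption to keep the logarithms finite.

```python
# Closed-form multinomial naive Bayes training and the linear rule above.
# X: (n_samples, m) nonnegative counts; y: labels in {-1, +1}.
import numpy as np

def train_mnb(X, y, smoothing=1e-10):
    f_pos = X[y > 0].sum(axis=0) + smoothing   # class count vector f+
    f_neg = X[y < 0].sum(axis=0) + smoothing   # class count vector f-
    theta_pos = f_pos / f_pos.sum()            # MLE: theta+ = f+ / (1^T f+)
    theta_neg = f_neg / f_neg.sum()
    w = np.log(theta_pos) - np.log(theta_neg)  # w = log theta+ - log theta-
    v = np.log((y > 0).mean()) - np.log((y < 0).mean())  # log prior ratio
    return v, w

def predict_mnb(v, w, X):
    return np.sign(v + X @ w)                  # y_hat(x) = sign(v + w^T x)
```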

9. Sparse Naive Bayes.
Naive feature selection. Make $w := \log \theta^+_* - \log \theta^-_*$ sparse. Solve
$$(\theta^+_*, \theta^-_*) = \mathop{\rm argmax} \; f_+^\top \log \theta^+ + f_-^\top \log \theta^- \quad \text{subject to} \quad \|\theta^+ - \theta^-\|_0 \le k, \;\; 1^\top \theta^+ = 1^\top \theta^- = 1, \;\; \theta^+, \theta^- \ge 0,$$ (SMNB)
where $k \ge 0$ is a target number of features. Features for which $\theta^+_i = \theta^-_i$ can be discarded. This is a nonconvex problem:
- Convex relaxation?
- Approximation bounds?

10. Sparse Naive Bayes.
Convex relaxation. The dual is very simple.
Sparse multinomial naive Bayes [Askari, A., El Ghaoui, 2019]. Let $\phi(k)$ be the optimal value of (SMNB). Then $\phi(k) \le \psi(k)$, where $\psi(k)$ is the optimal value of the following one-dimensional convex optimization problem:
$$\psi(k) := C + \min_{\alpha \in [0,1]} s_k(h(\alpha)),$$ (USMNB)
where $C$ is a constant, $s_k(\cdot)$ is the sum of the top $k$ entries of its vector argument, and for $\alpha \in (0,1)$,
$$h(\alpha) := f_+ \circ \log f_+ + f_- \circ \log f_- - (f_+ + f_-) \circ \log(f_+ + f_-) - f_+ \log \alpha - f_- \log(1 - \alpha).$$
Solved by bisection, with near-linear complexity $O(n + k \log k)$. Approximation bounds?
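A hedged sketch of evaluating this relaxation, following the formulas above with the constant $C$ dropped (it does not affect the minimizer); substituting scipy's bounded scalar minimizer for the bisection described on the slide is my own shortcut, not the reference implementation.

```python
# Evaluate min over alpha of s_k(h(alpha)) for (USMNB), up to the constant C.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import xlogy   # xlogy(a, b) = a*log(b) with 0*log(0) = 0

def usmnb_bound(f_pos, f_neg, k):
    # alpha-independent part of h(alpha)
    base = (xlogy(f_pos, f_pos) + xlogy(f_neg, f_neg)
            - xlogy(f_pos + f_neg, f_pos + f_neg))

    def objective(alpha):
        h = base - f_pos * np.log(alpha) - f_neg * np.log(1.0 - alpha)
        return np.sort(h)[-k:].sum()           # s_k: sum of the top-k entries

    res = minimize_scalar(objective, bounds=(1e-9, 1 - 1e-9), method="bounded")
    return res.x, res.fun                      # minimizer alpha*, psi(k) - C
```

Every coordinate of $h(\alpha)$ is convex in $\alpha$ on $(0,1)$, so $s_k(h(\alpha))$, a maximum of sums of convex functions, is convex too, and any scalar method for unimodal functions recovers the global minimizer.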

11. Outline.
- Sparse Naive Bayes
- The Shapley-Folkman theorem
- Duality gap bounds
- Numerical performance

12. Shapley-Folkman Theorem.
Minkowski sum. Given sets $X, Y \subset \mathbb{R}^d$, we have
$$X + Y = \{x + y : x \in X, \; y \in Y\}$$
(figure from the CGAL User and Reference Manual).
Convex hull. Given subsets $V_i \subset \mathbb{R}^d$, we have
$$\mathrm{Co}\Big(\sum_i V_i\Big) = \sum_i \mathrm{Co}(V_i).$$
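For finite point sets the Minkowski sum is directly computable as all pairwise sums; a toy illustration (the example sets are mine):

```python
# Minkowski sum X + Y = {x + y : x in X, y in Y} for finite sets in R^d.
import numpy as np

def minkowski_sum(X, Y):
    # all pairwise sums, shape (len(X) * len(Y), d)
    return (X[:, None, :] + Y[None, :, :]).reshape(-1, X.shape[1])

X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[0.0, 0.0], [0.0, 1.0]])
print(minkowski_sum(X, Y))   # the four corners of the unit square
```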

13. Shapley-Folkman Theorem.
[Figure: the $\ell_{1/2}$ ball, the Minkowski average of two and of ten such balls, and the convex hull.]
[Figure: Minkowski sum of the first five digits (obtained by sampling).]

14. Shapley-Folkman Theorem.
Shapley-Folkman Theorem [Starr, 1969]. Suppose $V_i \subset \mathbb{R}^d$, $i = 1, \ldots, n$, and
$$x \in \mathrm{Co}\Big(\sum_{i=1}^n V_i\Big) = \sum_{i=1}^n \mathrm{Co}(V_i),$$
then
$$x \in \sum_{i \in [1,n] \setminus \mathcal{S}_x} V_i + \sum_{i \in \mathcal{S}_x} \mathrm{Co}(V_i),$$
where $|\mathcal{S}_x| \le d$.
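A concrete instance with $d = 1$ and $V_i = \{0, 1\}$ makes the statement tangible: any $x \in \mathrm{Co}(\sum_i V_i) = [0, n]$ splits into $n$ summands with at most one taken from $\mathrm{Co}(V_i) = [0, 1]$ rather than from $V_i$ itself. A small sketch of this construction (mine, for illustration):

```python
# Shapley-Folkman with d = 1, V_i = {0, 1}: at most one non-extreme summand.
import math

def sf_decompose(x, n):
    k = int(math.floor(x))                     # k summands equal to 1
    frac = x - k                               # at most one summand in (0, 1)
    terms = [1.0] * k + ([frac] if frac > 0 else [])
    terms += [0.0] * (n - len(terms))          # the rest equal to 0
    return terms

print(sf_decompose(3.7, 6))   # [1.0, 1.0, 1.0, 0.7, 0.0, 0.0]
```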

15. Shapley-Folkman Theorem.
Proof sketch. Write $x \in \sum_{i=1}^n \mathrm{Co}(V_i)$, i.e.
$$\begin{pmatrix} x \\ 1_n \end{pmatrix} = \sum_{i=1}^n \sum_{j=1}^{d+1} \lambda_{ij} \begin{pmatrix} v_{ij} \\ e_i \end{pmatrix}, \quad \text{for } \lambda \ge 0.$$
Conic Carathéodory then yields a representation with at most $n + d$ nonzero coefficients. A pigeonhole argument concludes: each of the $n$ blocks needs at least one nonzero $\lambda_{ij}$, so at most $d$ blocks can carry more than one; a block with a single nonzero coefficient gives $x_i \in V_i$, the others give $x_i \in \mathrm{Co}(V_i)$. The number of nonzero $\lambda_{ij}$ thus controls the gap with the convex hull.

16. Shapley-Folkman: geometric consequences.
- If the sets $V_i \subset \mathbb{R}^d$ are uniformly bounded with $\mathrm{rad}(V_i) \le R$, then
$$d_H\left(\frac{\sum_{i=1}^n V_i}{n}, \; \mathrm{Co}\Big(\frac{\sum_{i=1}^n V_i}{n}\Big)\right) \le R \, \frac{\min\{n, d\}}{n},$$
where $\mathrm{rad}(V) = \inf_{x \in V} \sup_{y \in V} \|x - y\|$.
- In particular, when $d$ is fixed and $n \to \infty$,
$$\frac{\sum_{i=1}^n V_i}{n} \to \mathrm{Co}\Big(\frac{\sum_{i=1}^n V_i}{n}\Big)$$
in the Hausdorff metric, with rate $O(1/n)$; a numeric check follows below.
- The same holds for many other nonconvexity measures [Fradelizi et al., 2017].
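A quick numeric check of the rate in the simplest case, $V_i = \{0, 1\} \subset \mathbb{R}$ (a toy choice of mine): the scaled sum is the grid $\{0, 1/n, \ldots, 1\}$, its convex hull is $[0, 1]$, and the Hausdorff distance $1/(2n)$ indeed sits below the bound $R \min\{n, d\}/n = 1/n$.

```python
# Hausdorff distance between (1/n)(V_1 + ... + V_n) and its convex hull,
# for V_i = {0, 1} in R^1, versus the bound R * min{n, d} / n.
import numpy as np

R, d = 1.0, 1                                  # rad({0, 1}) = 1, dimension 1
for n in (2, 10, 100):
    grid = np.arange(n + 1) / n                # the scaled Minkowski sum
    dense = np.linspace(0.0, 1.0, 10001)       # dense sample of Co = [0, 1]
    d_h = np.max(np.min(np.abs(dense[:, None] - grid[None, :]), axis=1))
    print(n, round(float(d_h), 4), "<=", R * min(n, d) / n)   # ~1/(2n) <= 1/n
```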

17. Outline.
- Sparse Naive Bayes
- The Shapley-Folkman theorem
- Duality gap bounds
- Numerical performance

18. Nonconvex Optimization.
Separable nonconvex problem. Solve
$$\text{minimize} \;\; \sum_{i=1}^n f_i(x_i) \quad \text{subject to} \;\; Ax \le b,$$ (P)
in the variables $x_i \in \mathbb{R}^{d_i}$ with $d = \sum_{i=1}^n d_i$, where the $f_i$ are lower semicontinuous and $A \in \mathbb{R}^{m \times d}$. Take the dual twice to form a convex relaxation:
$$\text{minimize} \;\; \sum_{i=1}^n f_i^{**}(x_i) \quad \text{subject to} \;\; Ax \le b,$$ (CoP)
in the variables $x_i \in \mathbb{R}^{d_i}$.

19. Nonconvex Optimization.
Convex envelope. The biconjugate $f^{**}$ satisfies $\mathrm{epi}(f^{**}) = \mathrm{Co}(\mathrm{epi}(f))$, which means that $f^{**}(x)$ and $f(x)$ match at extreme points of $\mathrm{epi}(f^{**})$. Define the lack of convexity as
$$\rho(f) := \sup_{x \in \mathrm{dom}(f)} \{f(x) - f^{**}(x)\}.$$
Example. [Figure: $\mathrm{Card}(x)$ and $|x|$ on $[-1, 1]$.] The $\ell_1$ norm is the convex envelope of $\mathrm{Card}(x)$ on $[-1, 1]$.
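On this example the lack of convexity is easy to evaluate: $\rho(\mathrm{Card}) = \sup_x \{\mathrm{Card}(x) - |x|\} = 1$ on $[-1, 1]$, approached as $x \to 0$ with $x \neq 0$. A short numeric check (the grid resolution is an arbitrary choice of mine):

```python
# Check rho(Card) = sup (Card(x) - |x|) = 1 on [-1, 1], up to grid resolution.
import numpy as np

x = np.linspace(-1.0, 1.0, 2001)
card = (x != 0).astype(float)      # Card(x) for scalar x: 0 iff x = 0
envelope = np.abs(x)               # convex envelope f**(x) = |x| on [-1, 1]
print(np.max(card - envelope))     # 0.999, approaching 1 near x = 0
```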

20. Nonconvex Optimization.
Writing the epigraph of problem (P) as in [Lemaréchal and Renaud, 2001],
$$G_r := \Big\{ (r_0, r) \in \mathbb{R}^{1+m} : \sum_{i=1}^n f_i(x_i) \le r_0, \; Ax - b \le r, \; x \in \mathbb{R}^d \Big\},$$
we can write the dual function of (P) as
$$\Psi(\lambda) := \inf \big\{ r_0 + \lambda^\top r : (r_0, r) \in G_r^{**} \big\},$$
in the variable $\lambda \in \mathbb{R}^m$, where $G_r^{**} = \mathrm{Co}(G_r)$ is the closed convex hull of the epigraph $G_r$. Because the constraints are affine, (P) and (CoP) have the same dual [Lemaréchal and Renaud, 2001, Th. 2.11], given by
$$\sup_{\lambda \ge 0} \Psi(\lambda)$$ (D)
in the variable $\lambda \in \mathbb{R}^m$. Roughly, if $G_r^{**} = G_r$, there is no duality gap in (P).

21. Nonconvex Optimization.
Epigraph and duality gap. Define
$$F_i := \big\{ (f_i^{**}(x_i), A_i x_i) : x_i \in \mathbb{R}^{d_i} \big\},$$
where $A_i \in \mathbb{R}^{m \times d_i}$ is the $i$th block of $A$.
- The epigraph $G_r^{**}$ can be written as a Minkowski sum of the $F_i$:
$$G_r^{**} = \sum_{i=1}^n F_i + (0, -b) + \mathbb{R}_+^{m+1}.$$
- Shapley-Folkman at $x \in G_r^{**}$ shows $f_i^{**}(x_i) = f_i(x_i)$ for all but at most $m + 1$ terms in the objective.
- As $n \to \infty$ with $m/n \to 0$, $G_r$ gets closer to its convex hull $G_r^{**}$ and the duality gap becomes negligible.

22. Bound on duality gap.
A priori bound on the duality gap of
$$\text{minimize} \;\; \sum_{i=1}^n f_i(x_i) \quad \text{subject to} \;\; Ax \le b,$$
where $A \in \mathbb{R}^{m \times d}$.
Proposition [Aubin and Ekeland, 1976; Ekeland and Temam, 1999] (a priori bounds on the duality gap). Suppose the functions $f_i$ in (P) satisfy Assumption (...). There is a point $x^\star \in \mathbb{R}^d$ at which the primal optimal value of (CoP) is attained, such that
$$\underbrace{\sum_{i=1}^n f_i^{**}(x_i^\star)}_{\text{CoP}} \;\le\; \underbrace{\sum_{i=1}^n f_i(\hat{x}_i^\star)}_{\text{P}} \;\le\; \underbrace{\sum_{i=1}^n f_i^{**}(x_i^\star)}_{\text{CoP}} + \underbrace{\sum_{i=1}^{m+1} \rho(f_{[i]})}_{\text{gap}},$$
where $\hat{x}^\star$ is an optimal point of (P) and $\rho(f_{[1]}) \ge \rho(f_{[2]}) \ge \cdots \ge \rho(f_{[n]})$. A toy instance follows below.
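A toy instance of the bound (the problem data is an illustrative assumption, not from the slides): take $f_i = \mathrm{Card}$ on $[0, 1]$, whose envelope is $f_i^{**}(x) = x$ with $\rho(f_i) = 1$, and a single constraint $\sum_i x_i \ge c$, so $m = 1$. Then CoP $= c$, P $= \lceil c \rceil$, and the gap never exceeds $(m+1)\rho = 2$ while becoming negligible relative to the optimum as $n$ and $c$ grow.

```python
# Duality gap for sum of Card(x_i) on [0, 1] with one constraint sum(x) >= c.
import math

rho, m = 1.0, 1                      # rho(Card) = 1 on [0, 1]; one constraint
for n, c in [(10, 3.4), (100, 37.2), (1000, 512.9)]:
    cop = c                          # relaxation: minimize sum(x_i) s.t. sum >= c
    p = math.ceil(c)                 # (P): needs ceil(c) nonzero coordinates
    assert p - cop <= (m + 1) * rho  # a priori bound on the gap
    print(n, "gap:", round(p - cop, 2), "relative:", round((p - cop) / p, 4))
```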
