Network Flow Algorithms for Structured Sparsity
  1. Network Flow Algorithms for Structured Sparsity
     Julien Mairal (1), Rodolphe Jenatton (2), Guillaume Obozinski (2), Francis Bach (2)
     (1) UC Berkeley   (2) INRIA - SIERRA Project-Team
     Bellevue, ICML Workshop, July 2011

  2. What this work is about
     Sparse and structured linear models.
     Optimization for group Lasso with overlapping groups.
     Links between sparse regularization and network flow optimization.
     Related publications:
     [1] J. Mairal, R. Jenatton, G. Obozinski and F. Bach. Network Flow Algorithms for Structured Sparsity. NIPS, 2010.
     [2] R. Jenatton, J. Mairal, G. Obozinski and F. Bach. Proximal Methods for Hierarchical Sparse Coding. JMLR, to appear.
     [3] R. Jenatton, J. Mairal, G. Obozinski and F. Bach. Proximal Methods for Sparse Hierarchical Dictionary Learning. ICML, 2010.

  3. Part I: Introduction to Structured Sparsity

  4. Sparse Linear Model: Machine Learning Point of View
     Let (y_i, x_i), i = 1, ..., n, be a training set, where the vectors x_i are in R^p and are called features.
     The scalars y_i are in {−1, +1} for binary classification problems, in R for regression problems.
     We assume there is a relation y ≈ w⊤x, and solve
         min_{w ∈ R^p}  (1/n) Σ_{i=1}^n ℓ(y_i, w⊤x_i)  +  λ Ω(w)
                        [empirical risk]                  [regularization]

  5. Sparse Linear Models: Machine Learning Point of View
     A few examples:
         Ridge regression:     min_{w ∈ R^p} (1/n) Σ_{i=1}^n (y_i − w⊤x_i)^2 + λ‖w‖_2^2.
         Linear SVM:           min_{w ∈ R^p} (1/n) Σ_{i=1}^n max(0, 1 − y_i w⊤x_i) + λ‖w‖_2^2.
         Logistic regression:  min_{w ∈ R^p} (1/n) Σ_{i=1}^n log(1 + e^{−y_i w⊤x_i}) + λ‖w‖_2^2.
     The squared ℓ2-norm induces "smoothness" in w. When one knows in advance that w should be sparse,
     one should use a sparsity-inducing regularization such as the ℓ1-norm [Chen et al., 1999, Tibshirani, 1996].
     How can one add a-priori knowledge in the regularization?
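
To make these objectives concrete, here is a minimal NumPy sketch (not part of the original slides) that evaluates the three regularized empirical risks above on synthetic data; the names (X, y, w, lam) and the data are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))          # rows are the feature vectors x_i
y = np.sign(X @ rng.standard_normal(p))  # labels in {-1, +1}
w = rng.standard_normal(p)
lam = 0.1

l2_sq = lam * np.sum(w ** 2)             # squared l2 penalty

ridge    = np.mean((y - X @ w) ** 2) + l2_sq
svm      = np.mean(np.maximum(0.0, 1.0 - y * (X @ w))) + l2_sq
logistic = np.mean(np.log1p(np.exp(-y * (X @ w)))) + l2_sq

# Swapping the penalty for lam * np.sum(np.abs(w)) gives the l1-regularized
# (sparsity-inducing) counterparts discussed on the following slides.
```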

  6. Sparse Linear Models: Signal Processing Point of View
     Let y in R^n be a signal.
     Let X = [x^1, ..., x^p] ∈ R^{n×p} be a set of normalized "basis vectors". We call it a dictionary.
     X is "adapted" to y if it can represent it with a few basis vectors, that is, there exists a sparse
     vector w in R^p such that y ≈ Xw. We call w the sparse code.
         y ≈ x^1 w_1 + x^2 w_2 + ... + x^p w_p = Xw,   with y ∈ R^n, X ∈ R^{n×p}, and w ∈ R^p sparse.

  7. Sparse Linear Models: the Lasso / Basis Pursuit
     Signal processing: X is a dictionary in R^{n×p},
         min_{w ∈ R^p} (1/2)‖y − Xw‖_2^2 + λ‖w‖_1.
     Machine learning:
         min_{w ∈ R^p} (1/(2n)) Σ_{i=1}^n (y_i − x_i⊤w)^2 + λ‖w‖_1  =  min_{w ∈ R^p} (1/(2n))‖y − X⊤w‖_2^2 + λ‖w‖_1,
     with X ≜ [x_1, ..., x_n] and y ≜ [y_1, ..., y_n]⊤.
     Useful tool in signal processing, machine learning, statistics, neuroscience, ... as long as one wishes to select features.
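
As a hedged illustration of the machine-learning formulation, the sketch below uses scikit-learn's Lasso, whose objective (1/(2n))‖y − Xw‖_2^2 + α‖w‖_1 matches the formula above with λ = α; the synthetic data and names are assumptions, not from the slides.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
w_true = np.zeros(p)
w_true[:5] = [1.5, -2.0, 1.0, 0.8, -1.2]          # only 5 active features
y = X @ w_true + 0.01 * rng.standard_normal(n)

model = Lasso(alpha=0.05).fit(X, y)
print(np.flatnonzero(model.coef_))                # indices of the selected (non-zero) coefficients
```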

  8. Group Sparsity-Inducing Norms
         min_{w ∈ R^p} f(w) + λ Ω(w)
         [f: data-fitting term, Ω: sparsity-inducing norm]
     The most popular choice for Ω: the ℓ1-norm, ‖w‖_1 = Σ_{j=1}^p |w_j|.
     However, the ℓ1-norm encodes poor information, just cardinality!
     Another popular choice for Ω: the ℓ1-ℓq norm [Turlach et al., 2005], with q = 2 or q = ∞,
         Ω(w) = Σ_{g ∈ G} ‖w_g‖_q,   with G a partition of {1, ..., p}.
     The ℓ1-ℓq norm sets to zero groups of non-overlapping variables (as opposed to single variables for the ℓ1-norm).
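
A short sketch (illustrative names only, not from the slides) of how the ℓ1-ℓq norm is computed for a partition of the variables:

```python
import numpy as np

def group_norm(w, partition, q=2):
    """l1-lq norm: sum over the groups g of the partition of ||w_g||_q."""
    if q == 2:
        return sum(np.linalg.norm(w[g]) for g in partition)
    return sum(np.max(np.abs(w[g])) for g in partition)   # q = infinity

w = np.array([0.0, 0.0, 0.0, 1.5, -2.0, 0.3])
partition = [[0, 1, 2], [3, 4, 5]]                         # non-overlapping groups
print(group_norm(w, partition, q=2))                       # only the second group contributes
print(group_norm(w, partition, q=np.inf))
```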

  9. Structured Sparsity with Overlapping Groups
     Warning: under the name "structured sparsity" appear in fact significantly different formulations!
     1. Non-convex:
        zero-tree wavelets [Shapiro, 1993];
        sparsity patterns are in a predefined collection [Baraniuk et al., 2010];
        select a union of groups [Huang et al., 2009];
        structure via Markov random fields [Cevher et al., 2008].
     2. Convex:
        tree structure [Zhao et al., 2009];
        non-zero patterns are a union of groups [Jacob et al., 2009];
        zero patterns are a union of groups [Jenatton et al., 2009];
        other norms [Micchelli et al., 2010].

  10. Sparsity-Inducing Norms
          Ω(w) = Σ_{g ∈ G} ‖w_g‖_q
      What happens when the groups overlap? [Jenatton et al., 2009]
      Inside the groups, the ℓ2-norm (or ℓ∞-norm) does not promote sparsity.
      Variables belonging to the same groups are encouraged to be set to zero together.
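
An illustrative sketch (assumed names, not from the slides): two vectors with the same ℓ1-norm and the same number of non-zeros, where the one whose non-zeros are concentrated in fewer groups is penalized less. This is one way to see why variables of the same group tend to be zeroed together, so that zero patterns are unions of groups [Jenatton et al., 2009].

```python
import numpy as np

def omega(w, groups):
    """Sum over (possibly overlapping) groups g of ||w_g||_2."""
    return sum(np.linalg.norm(w[g]) for g in groups)

groups = [[0, 1, 2], [2, 3, 4], [4, 5]]          # variables 2 and 4 belong to two groups
a = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])     # non-zeros concentrated in one group
b = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])     # same l1-norm, non-zeros spread over all groups
print(omega(a, groups), omega(b, groups))        # a is penalized less than b
```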

  11. Examples of sets of groups G [Jenatton et al., 2009]
      Selection of contiguous patterns on a sequence, p = 6.
      G is the set of blue groups.
      Any union of blue groups set to zero leads to the selection of a contiguous pattern.

  12. Hierarchical Norms [Zhao et al., 2009]
      A node can be active only if its ancestors are active.
      The selected patterns are rooted subtrees.

  13. Part II: How do we optimize these cost functions?

  14. Different strategies
          min_{w ∈ R^p} f(w) + λ Σ_{g ∈ G} ‖w_g‖_q
      Generic methods: QP, CP, subgradient descent.
      Augmented Lagrangian, ADMM [Mairal et al., 2011, Qi and Goldfarb, 2011].
      Nesterov's smoothing technique [Chen et al., 2010].
      Hierarchical case: proximal methods [Jenatton et al., 2010a].
      For q = ∞: proximal gradient methods with network flow optimization [Mairal et al., 2010];
      also proximal gradient methods with inexact proximal operator [Jenatton et al., 2010a, Liu and Ye, 2010].
      For q = 2: reweighted-ℓ2 [Jenatton et al., 2010b, Micchelli et al., 2010].

  15. First-order/proximal methods
          min_{w ∈ R^p} f(w) + λ Ω(w)
      f is strictly convex and differentiable with a Lipschitz gradient.
      Generalizes the idea of gradient descent:
          w^{k+1} ← argmin_{w ∈ R^p}  f(w^k) + ∇f(w^k)⊤(w − w^k)  +  (L/2)‖w − w^k‖_2^2  +  λ Ω(w)
                                      [linear approximation]         [quadratic term]
                  = argmin_{w ∈ R^p}  (1/2)‖w − (w^k − (1/L)∇f(w^k))‖_2^2  +  (λ/L) Ω(w).
      When λ = 0, w^{k+1} ← w^k − (1/L)∇f(w^k); this is equivalent to a classical gradient descent step.
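
A minimal sketch of this update as code, under assumed names (grad_f, prox, L) and not the authors' implementation: one proximal-gradient step maps w to prox(w − (1/L)∇f(w), λ/L). A concrete prox (soft-thresholding for the ℓ1-norm) is plugged in after the next slide.

```python
import numpy as np

def proximal_gradient_step(w, grad_f, prox, L, lam):
    """One proximal-gradient step: minimize the linearization of f plus the
    quadratic term, i.e. apply the prox operator to a gradient step."""
    return prox(w - grad_f(w) / L, lam / L)

# With lam = 0 and prox(u, t) = u (the identity), this reduces to a plain
# gradient step w - (1/L) * grad_f(w), as noted on the slide.
```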

  16. First-order/proximal methods
      They require solving efficiently the proximal operator
          min_{w ∈ R^p} (1/2)‖u − w‖_2^2 + λ Ω(w).
      For the ℓ1-norm, this amounts to a soft-thresholding:
          w*_i = sign(u_i) (|u_i| − λ)_+.
      There exist accelerated versions based on Nesterov's optimal first-order method (gradient method
      with "extrapolation") [Beck and Teboulle, 2009, Nesterov, 2007, 1983], suited for large-scale experiments.
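
A self-contained, illustrative sketch (synthetic data and assumed names, not from the slides): the ℓ1 proximal operator as soft-thresholding, iterated inside the plain (non-accelerated) proximal-gradient update from the previous slide to solve a small Lasso problem.

```python
import numpy as np

def soft_threshold(u, t):
    """Prox of t * ||.||_1: w_i = sign(u_i) * max(|u_i| - t, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

rng = np.random.default_rng(0)
n, p = 100, 30
X = rng.standard_normal((n, p))
w_true = np.zeros(p)
w_true[:3] = [2.0, -1.5, 1.0]                     # sparse ground truth
y = X @ w_true + 0.01 * rng.standard_normal(n)

lam = 0.05
grad_f = lambda w: X.T @ (X @ w - y) / n          # gradient of f(w) = (1/2n)||y - Xw||_2^2
L = np.linalg.norm(X, 2) ** 2 / n                 # Lipschitz constant of grad_f

w = np.zeros(p)
for _ in range(300):                              # plain proximal-gradient (ISTA) iterations
    w = soft_threshold(w - grad_f(w) / L, lam / L)
print(np.flatnonzero(np.abs(w) > 1e-6))           # should approximately recover the support {0, 1, 2}
```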

  17. Tree-structured groups
      Proposition [Jenatton, Mairal, Obozinski, and Bach, 2010a]
      If G is a tree-structured set of groups, i.e., ∀ g, h ∈ G, g ∩ h = ∅ or g ⊂ h or h ⊂ g,
      then for q = 2 or q = ∞, defining Prox^g and Prox_Ω as
          Prox^g : u ↦ argmin_{w ∈ R^p} (1/2)‖u − w‖_2^2 + λ‖w_g‖_q,
          Prox_Ω : u ↦ argmin_{w ∈ R^p} (1/2)‖u − w‖_2^2 + λ Σ_{g ∈ G} ‖w_g‖_q,
      and if the groups are sorted from the leaves to the root, then
          Prox_Ω = Prox^{g_m} ∘ ... ∘ Prox^{g_1}.
      → Tree-structured regularization: efficient linear-time algorithm.
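
An illustrative sketch of this composition result for q = 2, where the single-group prox is block soft-thresholding; the tiny tree of groups and all names are assumptions, not the authors' code.

```python
import numpy as np

def prox_group_l2(u, g, lam):
    """Prox of lam * ||w_g||_2: block soft-thresholding on the coordinates in g."""
    w = u.copy()
    norm_g = np.linalg.norm(u[g])
    if norm_g <= lam:
        w[g] = 0.0
    else:
        w[g] = (1.0 - lam / norm_g) * u[g]
    return w

# A tiny tree-structured set of groups on 4 variables, already sorted leaves -> root.
groups_leaves_to_root = [[1], [2], [3], [0, 1, 2, 3]]

u = np.array([0.4, 1.0, -0.2, 0.05])
w = u.copy()
for g in groups_leaves_to_root:                   # composing the group prox operators
    w = prox_group_l2(w, g, lam=0.3)
print(w)   # leaves with small values are zeroed, then the root group shrinks the rest
```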

  18. General Overlapping Groups for q = ∞
      Dual formulation [Jenatton, Mairal, Obozinski, and Bach, 2010a]
      The solutions w* and ξ* of the following optimization problems
          (Primal)   min_{w ∈ R^p}  (1/2)‖u − w‖_2^2 + λ Σ_{g ∈ G} ‖w_g‖_∞,
          (Dual)     min_{ξ ∈ R^{p×|G|}}  (1/2)‖u − Σ_{g ∈ G} ξ^g‖_2^2   s.t. ∀ g ∈ G, ‖ξ^g‖_1 ≤ λ and ξ^g_j = 0 if j ∉ g,
      satisfy
          (Primal-dual relation)   w* = u − Σ_{g ∈ G} ξ*^g.
      The dual formulation has more variables, but no overlapping constraints.
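
To illustrate the primal-dual relation numerically, the sketch below solves the dual by simple block coordinate descent, projecting each ξ^g onto the ℓ1-ball of radius λ (the constraint dual to the ℓ∞-norm), and returns w = u − Σ_g ξ^g. This is only a slow, illustrative scheme under assumed names, not the network-flow algorithm introduced on the next slide.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto the l1-ball of the given radius (Duchi et al., 2008)."""
    if np.sum(np.abs(v)) <= radius:
        return v.copy()
    a = np.sort(np.abs(v))[::-1]
    css = np.cumsum(a)
    rho = np.nonzero(a * np.arange(1, a.size + 1) > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def prox_overlapping_linf(u, groups, lam, n_sweeps=200):
    """Prox of lam * sum_g ||w_g||_inf via block coordinate descent on the dual."""
    xi = [np.zeros_like(u) for _ in groups]
    for _ in range(n_sweeps):
        for k, g in enumerate(groups):
            residual = u - sum(xi) + xi[k]        # u minus the other dual variables
            xi[k][:] = 0.0
            xi[k][g] = project_l1_ball(residual[g], lam)
    return u - sum(xi)                            # primal-dual relation: w* = u - sum_g xi_g

groups = [[0, 1, 2], [2, 3], [3, 4, 5]]           # overlapping groups on 6 variables
u = np.array([0.3, -1.2, 0.8, 0.1, -0.4, 2.0])
print(prox_overlapping_linf(u, groups, lam=0.5))
```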

  19. General Overlapping Groups for q = ∞ [Mairal, Jenatton, Obozinski, and Bach, 2010]
      First step: flip the signs of u.
      The dual is then equivalent to a quadratic min-cost flow problem:
          min_{ξ ∈ R_+^{p×|G|}}  (1/2)‖u − Σ_{g ∈ G} ξ^g‖_2^2   s.t. ∀ g ∈ G, Σ_{j ∈ g} ξ^g_j ≤ λ and ξ^g_j = 0 if j ∉ g.
