

  1. Trimming the ℓ1 Regularizer: Statistical Analysis, Optimization, and Applications to Deep Learning. Jihun Yun¹, Peng Zheng², Eunho Yang¹,³, Aurélie C. Lozano⁴, Aleksandr Aravkin². ¹KAIST, ²University of Washington, ³AITRICS, ⁴IBM T.J. Watson Research Center. arcprime@kaist.ac.kr. International Conference on Machine Learning, June 12, 2019.

  2. Table of Contents: 1. Introduction and Setup; 2. Statistical Analysis; 3. Optimization; 4. Experiments & Applications to Deep Learning.

  4. ℓ1 Regularization is Popular. High-dimensional data (n ≪ p) with ℓ1 regularization: genomic data, matrix completion, deep learning, etc. Figure: (a) sparse linear models, (b) sparse graphical models, (c) matrix completion, (d) sparse neural networks.

  5. Concrete Example 1: Lasso* (Sparse Linear Regression).
$$\hat{\theta} \in \operatorname*{argmin}_{\theta \in \Omega} \; \frac{1}{2n} \lVert y - X\theta \rVert_2^2 + \lambda_n \lVert \theta \rVert_1$$
* R. Tibshirani. Regression shrinkage and selection via the lasso. JRSS, Series B, 1996.
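As a concrete illustration (not from the slides), scikit-learn's Lasso fits exactly this objective, with its alpha parameter playing the role of λ_n; the data below are synthetic:

```python
# Minimal Lasso sketch: scikit-learn minimizes
# (1/(2n)) * ||y - X theta||_2^2 + alpha * ||theta||_1,
# matching the objective above with alpha in the role of lambda_n.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 100, 500, 10                  # n << p high-dimensional regime
X = rng.standard_normal((n, p))
theta_true = np.zeros(p)
theta_true[:s] = 1.0                    # s-sparse ground truth
y = X @ theta_true + 0.1 * rng.standard_normal(n)

model = Lasso(alpha=0.1).fit(X, y)
print("nonzeros:", np.count_nonzero(model.coef_))
```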

  6. Concrete Example 2: Graphical Lasso* (Sparse Concentration Matrix).
$$\hat{\Theta} \in \operatorname*{argmin}_{\Theta \in \mathcal{S}^p_{++}} \; \operatorname{trace}(\hat{\Sigma}\Theta) - \log\det(\Theta) + \lambda_n \lVert \Theta \rVert_{1,\text{off}}$$
where $\hat{\Sigma}$ is a sample covariance matrix, $\mathcal{S}^p_{++}$ the set of symmetric and strictly positive definite matrices, and $\lVert \Theta \rVert_{1,\text{off}}$ the ℓ1 norm on the off-diagonal elements of Θ.
* P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. EJS, 2011.
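For illustration (again not from the slides), scikit-learn's GraphicalLasso estimates a sparse precision matrix under this same ℓ1-penalized log-determinant objective, penalizing the off-diagonal entries; alpha plays the role of λ_n and the data are synthetic:

```python
# Sketch: sparse precision (concentration) matrix estimation via the
# graphical lasso objective above.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=np.zeros(5), cov=np.eye(5), size=200)

gl = GraphicalLasso(alpha=0.2).fit(X)   # alpha ~ lambda_n
print(np.round(gl.precision_, 2))       # estimated sparse Theta
```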

  7. Concrete Example 3: Group ℓ1 on the Network Pruning Task* (Structured Sparsity of Weight Parameters).
$$\hat{\theta} \in \operatorname*{argmin}_{\theta \in \Omega} \; \mathcal{L}(\theta; \mathcal{D}) + \lambda_n \lVert \theta \rVert_{\mathcal{G}}$$
where θ is the collection of weight parameters of the neural network, $\mathcal{L}$ the neural network loss (e.g., softmax), and $\lVert \theta \rVert_{\mathcal{G}}$ a group sparsity regularizer, for example $\lVert \theta \rVert_{\mathcal{G}} = \sum_{g \in \mathcal{G}} \lVert \theta_g \rVert_2$ over groups g.
Figure: encouraging group sparsity by pruning synapses and pruning neurons (before vs. after pruning).
* W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning Structured Sparsity in Deep Neural Networks. NIPS, 2016.
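As a sketch of how such a group penalty looks in practice (an illustration, not the paper's code), one can add a sum of per-neuron ℓ2 norms to a PyTorch training loss; the layer sizes, data, and λ value below are made up for the example:

```python
# Sketch: group-l1 penalty ||theta||_G = sum_g ||theta_g||_2 in PyTorch,
# with each group g = the incoming weights of one output neuron, so a
# whole neuron is pruned when its group is driven to zero.
import torch
import torch.nn as nn

layer = nn.Linear(20, 10)
lam = 1e-3  # illustrative lambda_n

def group_l1(weight: torch.Tensor) -> torch.Tensor:
    return weight.norm(p=2, dim=1).sum()  # l2 norm of each row, then sum

x, target = torch.randn(32, 20), torch.randn(32, 10)
loss = nn.functional.mse_loss(layer(x), target) + lam * group_l1(layer.weight)
loss.backward()
```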

  8. Shrinkage Bias of the Standard ℓ1 Penalty. The ℓ1 penalty is proportional to the magnitude of the parameters, so the larger a parameter is, the larger the shrinkage bias it suffers. Despite the popularity of the ℓ1 penalty (and its strong statistical guarantees): is it really good enough?
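To make the bias concrete (a worked illustration, not from the slides): the proximal operator of λ|θ| is soft-thresholding, which subtracts the full λ from every surviving coordinate, so even large, clearly relevant parameters are biased downward by the same amount as small ones:

```python
# Soft-thresholding, the proximal operator of the l1 penalty: every
# surviving entry is shrunk by lambda, regardless of its magnitude.
import numpy as np

def soft_threshold(theta, lam):
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

theta = np.array([5.0, 1.0, 0.05])
print(soft_threshold(theta, lam=0.1))  # [4.9, 0.9, 0.0] -- the 5.0 shrinks too
```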

  9. Non-convex Regularizers: Previous Work. For amenable non-convex regularizers (such as SCAD* and MCP**):
⊲ Amenable regularizer: resembles ℓ1 at the origin and has a vanishing derivative in the tail → coordinate-wise decomposable.
⊲ Loh & Wainwright*** provide the statistical analysis of amenable regularizers.
What about more structurally complex regularizers?
* J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Jour. Amer. Stat. Ass., 96(456):1348-1360, December 2001.
** C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894-942, 2010.
*** P. Loh and M. J. Wainwright. Regularized M-estimators with non-convexity: statistical and algorithmic theory for local optima. JMLR, 2015.
*** P. Loh and M. J. Wainwright. Support recovery without incoherence: A case for nonconvex regularization. The Annals of Statistics, 2017.

  11. Trimmed ℓ1 Penalty: Definition. In this paper, we study the Trimmed ℓ1 penalty, a new class of regularizers.
Definition: for a parameter vector θ ∈ ℝᵖ, we ℓ1-penalize every entry except the h largest in magnitude (we call h the trimming parameter).
Figure: a parameter vector (the darker the color, the larger the value) with the h largest entries penalty-free; only the smallest p − h entries are penalized.

  14. Trimmed ℓ1 Penalty: First Formulation. We can formalize this by defining the order statistics of the parameter vector, |θ₍₁₎| ≥ |θ₍₂₎| ≥ ⋯ ≥ |θ₍ₚ₎|. M-estimation with the Trimmed ℓ1 penalty is then
$$\operatorname*{minimize}_{\theta \in \Omega} \; \mathcal{L}(\theta; \mathcal{D}) + \lambda_n R(\theta; h), \qquad R(\theta; h) = \sum_{j=h+1}^{p} |\theta_{(j)}|,$$
the sum of the p − h smallest entries in absolute value. Importantly, the Trimmed ℓ1 penalty is neither amenable nor coordinate-wise separable.
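A direct translation of this definition into code (a sketch, not from the slides):

```python
# Trimmed-l1 value R(theta; h): sum of the p - h smallest entries of |theta|.
import numpy as np

def trimmed_l1(theta: np.ndarray, h: int) -> float:
    abs_sorted = np.sort(np.abs(theta))[::-1]  # |theta_(1)| >= ... >= |theta_(p)|
    return abs_sorted[h:].sum()                # skip the h largest entries

theta = np.array([3.0, -0.5, 0.1, -2.0])
print(trimmed_l1(theta, h=2))  # 0.5 + 0.1 = 0.6; h = 0 recovers ||theta||_1
```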

  15. M-estimation with the Trimmed ℓ1 Penalty: Second Formulation. We can rewrite the M-estimation problem by introducing an additional variable w:
$$\operatorname*{minimize}_{\theta \in \Omega,\; w \in [0,1]^p} \; F(\theta, w) := \mathcal{L}(\theta; \mathcal{D}) + \lambda_n \sum_{j=1}^{p} w_j |\theta_j| \quad \text{such that} \quad \mathbf{1}^\top w \ge p - h$$
The variable w encodes the sparsity pattern and order information of θ. In the ideal case, w_j = 0 for the h largest entries and w_j = 1 for the p − h smallest entries. If we set the trimming parameter h = 0, this is just the standard ℓ1 penalty.

  18. Second Formulation: Important Properties. For the objective F(θ, w) above: F is a weighted-ℓ1-regularized objective when w is fixed, and linear in w when θ is fixed. However, F is jointly non-convex in (θ, w) because of the coupling between θ and w. We use this second formulation for optimization, since it avoids sorting the parameter vector; a minimal alternating sketch follows.
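The two properties suggest alternating minimization (a sketch under the assumption of a squared loss; this is an illustration, not the paper's exact algorithm): take a proximal-gradient step in θ with w fixed, then update w in closed form with θ fixed, since a linear objective over {w ∈ [0,1]ᵖ : 𝟏ᵀw ≥ p − h} is minimized by setting w_j = 0 on the h largest |θ_j| and w_j = 1 elsewhere.

```python
# Sketch (not the paper's exact algorithm): alternating minimization of
# F(theta, w) = (1/2n)||y - X theta||_2^2 + lam * sum_j w_j |theta_j|
# over theta in R^p and w in [0,1]^p with 1^T w >= p - h.
import numpy as np

def weighted_soft_threshold(z, thresholds):
    # prox of sum_j thresholds_j * |z_j|
    return np.sign(z) * np.maximum(np.abs(z) - thresholds, 0.0)

def trimmed_lasso(X, y, lam, h, n_iters=500):
    n, p = X.shape
    theta, w = np.zeros(p), np.ones(p)
    step = n / np.linalg.norm(X, 2) ** 2  # 1/L for the smooth squared loss
    for _ in range(n_iters):
        # theta-step: proximal gradient on the weighted-l1 objective (w fixed)
        grad = X.T @ (X @ theta - y) / n
        theta = weighted_soft_threshold(theta - step * grad, step * lam * w)
        # w-step: closed form (theta fixed) -- F is linear in w, so leave
        # the h largest |theta_j| penalty-free and penalize the rest fully
        w = np.ones(p)
        w[np.argsort(-np.abs(theta))[:h]] = 0.0
    return theta
```

With h = 0 every w_j stays 1 and the loop reduces to plain proximal gradient (ISTA) for the Lasso, consistent with the remark on the previous slide.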

  19. Trimmed ℓ1 Penalty: Unit Balls Visualization. Figure: unit balls of the Trimmed ℓ1 penalty for θ = (θ₁, θ₂, θ₃) in three-dimensional space, shown for h = 0, h = 1, and h = 2 (h = 0 gives the standard ℓ1 ball).
