
Machine learning and convex optimization with submodular functions
Francis Bach, Sierra project-team, INRIA - École Normale Supérieure
Workshop on combinatorial optimization, Cargèse, 2013


1. Submodular and base polyhedra - Properties
• Submodular polyhedron: $P(F) = \{ s \in \mathbb{R}^p : \forall A \subset V,\ s(A) \leq F(A) \}$
• Base polyhedron: $B(F) = P(F) \cap \{ s(V) = F(V) \}$
• Many facets (up to $2^p$), many extreme points (up to $p!$)

2. Submodular and base polyhedra - Properties
• Submodular polyhedron: $P(F) = \{ s \in \mathbb{R}^p : \forall A \subset V,\ s(A) \leq F(A) \}$
• Base polyhedron: $B(F) = P(F) \cap \{ s(V) = F(V) \}$
• Many facets (up to $2^p$), many extreme points (up to $p!$)
• Fundamental property (Edmonds, 1970): if $F$ is submodular, maximizing linear functions may be done by a "greedy algorithm"
– Let $w \in \mathbb{R}^p_+$ be such that $w_{j_1} \geq \cdots \geq w_{j_p}$
– Let $s_{j_k} = F(\{j_1, \ldots, j_k\}) - F(\{j_1, \ldots, j_{k-1}\})$ for $k \in \{1, \ldots, p\}$
– Then $f(w) = \max_{s \in P(F)} w^\top s = \max_{s \in B(F)} w^\top s$, and both maxima are attained at the $s$ defined above
• Simple proof by convex duality
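
As a concrete illustration of the greedy algorithm, here is a minimal Python sketch (not from the talk); it assumes $F$ is provided as a callable on frozensets of indices with $F(\emptyset) = 0$, and the name `greedy_base` is ours:

```python
import numpy as np

def greedy_base(F, w):
    """Edmonds' greedy algorithm (sketch): F is a submodular set-function given as a
    callable on frozensets of indices with F(frozenset()) == 0; returns the base
    s in B(F) maximizing w^T s (also over P(F) when w >= 0)."""
    order = np.argsort(-np.asarray(w, dtype=float))   # j_1, ..., j_p with decreasing w
    s = np.zeros(len(w))
    prefix, prev = set(), 0.0
    for j in order:
        prefix.add(int(j))
        val = F(frozenset(prefix))
        s[j] = val - prev                 # s_{j_k} = F({j_1..j_k}) - F({j_1..j_{k-1}})
        prev = val
    return s                              # w @ s equals the Lovász extension f(w)

# Toy example: F(A) = min(|A|, 1) is submodular; its Lovász extension is max(w) for w >= 0.
F = lambda A: min(len(A), 1)
w = np.array([0.3, 0.7, 0.1])
s = greedy_base(F, w)
print(s, w @ s)                            # s in B(F), w @ s = 0.7
```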

3. Submodular functions - Links with convexity
• Theorem (Lovász, 1982): if $F$ is submodular, then
  $\min_{A \subset V} F(A) = \min_{w \in \{0,1\}^p} f(w) = \min_{w \in [0,1]^p} f(w)$
• Consequence: submodular function minimization may be done in polynomial time (through the ellipsoid algorithm)
• Representation of $f(w)$ as a support function (Edmonds, 1970): $f(w) = \max_{s \in B(F)} s^\top w$
– The maximizer $s$ may be found efficiently through the greedy algorithm

4. Outline
1. Submodular functions
– Review and examples of submodular functions
– Links with convexity through the Lovász extension
2. Submodular minimization
– Non-smooth convex optimization
– Parallel algorithm for a special case
3. Structured sparsity-inducing norms
– Relaxation of the penalization of supports by submodular functions
– Extensions (symmetric, $\ell_q$-relaxation)

5. Submodular function minimization - Dual problem
• Let $F : 2^V \to \mathbb{R}$ be a submodular function (such that $F(\emptyset) = 0$)
• Convex duality (Edmonds, 1970), with $s_-(V) = \sum_{k \in V} \min\{s_k, 0\}$:
  $\min_{A \subset V} F(A) = \min_{w \in [0,1]^p} f(w) = \min_{w \in [0,1]^p} \max_{s \in B(F)} w^\top s$
  $= \max_{s \in B(F)} \min_{w \in [0,1]^p} w^\top s = \max_{s \in B(F)} s_-(V)$

6. Exact submodular function minimization - Combinatorial algorithms
• Algorithms based on $\min_{A \subset V} F(A) = \max_{s \in B(F)} s_-(V)$
• Output the subset $A$ and a base $s \in B(F)$ as a certificate of optimality
• Best algorithms have polynomial complexity (Schrijver, 2000; Iwata et al., 2001; Orlin, 2009), typically $O(p^6)$ or more
• Update a sequence of convex combinations of vertices of $B(F)$ obtained from the greedy algorithm using a specific order
– Based only on function evaluations
• Recent algorithms use efficient reformulations in terms of generalized graph cuts (Jegelka et al., 2011)

7. Approximate submodular function minimization
• For most machine learning applications, no need to obtain the exact minimum
– For convex optimization, see, e.g., Bottou and Bousquet (2008)
• $\min_{A \subset V} F(A) = \min_{w \in \{0,1\}^p} f(w) = \min_{w \in [0,1]^p} f(w)$

8. Approximate submodular function minimization
• For most machine learning applications, no need to obtain the exact minimum
– For convex optimization, see, e.g., Bottou and Bousquet (2008)
• $\min_{A \subset V} F(A) = \min_{w \in \{0,1\}^p} f(w) = \min_{w \in [0,1]^p} f(w)$
• Important properties of $f$ for convex optimization
– Polyhedral function
– Representation as a maximum of linear functions: $f(w) = \max_{s \in B(F)} w^\top s$
• Stability vs. speed vs. generality vs. ease of implementation

9. Projected subgradient descent (Shor et al., 1985)
• Subgradient of $f(w) = \max_{s \in B(F)} s^\top w$ obtained through the greedy algorithm
• Use projected subgradient descent to minimize $f$ on $[0,1]^p$
– Iteration: $w_t = \Pi_{[0,1]^p}\big( w_{t-1} - \tfrac{C}{\sqrt{t}}\, s_t \big)$ where $s_t \in \partial f(w_{t-1})$
– Convergence rate: $f(w_t) - \min_{w \in [0,1]^p} f(w) \leq O\big(\sqrt{p}/\sqrt{t}\big)$ with primal/dual guarantees (Nesterov, 2003)
• Fast iterations but slow convergence
– Need $O(p/\varepsilon^2)$ iterations to reach precision $\varepsilon$
– Need $O(p^2/\varepsilon^2)$ function evaluations to reach precision $\varepsilon$
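
The following Python sketch (ours, reusing `greedy_base` and a set-function `F` as in the earlier sketch) illustrates projected subgradient descent with the $C/\sqrt{t}$ step size; the sup-level-set rounding used to report a candidate set is our addition, not part of the slide:

```python
import numpy as np

def sfm_subgradient(F, p, iters=1000, C=1.0):
    """Projected subgradient sketch for min_{w in [0,1]^p} f(w)."""
    w = 0.5 * np.ones(p)
    best_A, best_val = set(), F(frozenset())              # start from the empty set
    for t in range(1, iters + 1):
        s = greedy_base(F, w)                             # s is a subgradient of f at w
        w = np.clip(w - (C / np.sqrt(t)) * s, 0.0, 1.0)   # projection onto [0,1]^p
        for thr in np.unique(w):                          # round: try all sup-level sets of w
            A = frozenset(k for k in range(p) if w[k] >= thr)
            if F(A) < best_val:
                best_A, best_val = set(A), F(A)
    return best_A, best_val
```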

10. Ellipsoid method (Nemirovski and Yudin, 1983)
• Build a sequence of minimum-volume ellipsoids that enclose the set of solutions
[Figure: nested ellipsoids $E_0, E_1, E_2$]
• Cost of a single iteration: $p$ function evaluations and $O(p^3)$ operations
• Number of iterations: $2 p^2 \log\big( \frac{\max_{A \subset V} F(A) - \min_{A \subset V} F(A)}{\varepsilon} \big)$
– $O(p^5)$ operations and $O(p^3)$ function evaluations
• Slow in practice (the bound is "tight")

11. Analytic center cutting planes (Goffin and Vial, 1993)
• Center-of-gravity method
– Improves on the convergence rate of the ellipsoid method
– Cannot be computed easily
• Analytic center of a polytope defined by $a_i^\top w \leq b_i$, $i \in I$:
  $\min_{w \in \mathbb{R}^p} - \sum_{i \in I} \log(b_i - a_i^\top w)$
• Analytic center cutting planes (ACCPM)
– Each iteration has complexity $O(p^2 |I| + |I|^3)$ using Newton's method
– No linear convergence rate
– Good performance in practice

12. Simplex method for submodular minimization
• Mentioned by Girlich and Pisaruk (1997); McCormick (2005)
• Formulation as a linear program: $s \in B(F) \Leftrightarrow s = S^\top \eta$, $\eta \geq 0$, $\eta^\top 1_d = 1$, where the rows of $S \in \mathbb{R}^{d \times p}$ are the extreme points of $B(F)$
  $\max_{s \in B(F)} s_-(V) = \max_{\eta \geq 0,\ \eta^\top 1_d = 1} \sum_{i=1}^p \min\{ (S^\top \eta)_i, 0 \}$
  $= \max_{\eta \geq 0,\ \alpha \geq 0,\ \beta \geq 0} -\beta^\top 1_p \quad \text{such that } S^\top \eta - \alpha + \beta = 0,\ \eta^\top 1_d = 1$
• Column generation for simplex methods: only access the rows of $S$ by maximizing linear functions
– No complexity bound; may reach the global optimum if enough iterations are performed

13. Separable optimization on the base polyhedron
• Optimization of convex functions of the form $\Psi(w) + f(w)$, with $f$ the Lovász extension of $F$ and $\Psi(w) = \sum_{k \in V} \psi_k(w_k)$
• Structured sparsity
– Total variation denoising, isotonic regression
– Regularized risk minimization penalized by the Lovász extension

14. Total variation denoising (Chambolle, 2005)
• $F(A) = \sum_{k \in A,\ j \in V \setminus A} d(k,j) \ \Rightarrow\ f(w) = \sum_{k,j \in V} d(k,j)\, (w_k - w_j)_+$
• $d$ symmetric $\Rightarrow$ $f$ = total variation

15. Isotonic regression
• Given real numbers $x_i$, $i = 1, \ldots, p$
– Find $y \in \mathbb{R}^p$ that minimizes $\frac{1}{2} \sum_{i=1}^p (x_i - y_i)^2$ such that $\forall i,\ y_i \leq y_{i+1}$
• For a directed chain, $f(y) = 0$ if and only if $\forall i,\ y_i \leq y_{i+1}$
• Minimize $\frac{1}{2} \sum_{i=1}^p (x_i - y_i)^2 + \lambda f(y)$ for $\lambda$ large
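
A short Python sketch of the pool-adjacent-violators idea for this problem (our illustration, not the talk's code):

```python
import numpy as np

def pav_increasing(x):
    """Pool-adjacent-violators sketch: minimize (1/2) sum_i (x_i - y_i)^2
    subject to y_1 <= y_2 <= ... <= y_p."""
    x = np.asarray(x, dtype=float)
    values, weights = [], []                                 # one entry per pooled block
    for xi in x:
        values.append(xi)
        weights.append(1.0)
        while len(values) > 1 and values[-2] > values[-1]:   # monotonicity violated: pool
            w = weights[-2] + weights[-1]
            v = (weights[-2] * values[-2] + weights[-1] * values[-1]) / w
            values[-2:] = [v]
            weights[-2:] = [w]
    return np.concatenate([np.full(int(wi), vi) for vi, wi in zip(values, weights)])

print(pav_increasing([3.0, 1.0, 2.0]))   # -> [2. 2. 2.]
```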

16. Separable optimization on the base polyhedron
• Optimization of convex functions of the form $\Psi(w) + f(w)$, with $f$ the Lovász extension of $F$ and $\Psi(w) = \sum_{k \in V} \psi_k(w_k)$
• Structured sparsity
– Total variation denoising, isotonic regression
– Regularized risk minimization penalized by the Lovász extension

17. Separable optimization on the base polyhedron
• Optimization of convex functions of the form $\Psi(w) + f(w)$, with $f$ the Lovász extension of $F$ and $\Psi(w) = \sum_{k \in V} \psi_k(w_k)$
• Structured sparsity
– Total variation denoising, isotonic regression
– Regularized risk minimization penalized by the Lovász extension
• Proximal methods (see second part)
– Minimize $\Psi(w) + f(w)$ for smooth $\Psi$ as soon as the following "proximal" problem may be solved efficiently:
  $\min_{w \in \mathbb{R}^p} \tfrac{1}{2} \| w - z \|_2^2 + f(w) = \min_{w \in \mathbb{R}^p} \sum_{k=1}^p \tfrac{1}{2} (w_k - z_k)^2 + f(w)$
• Submodular function minimization

18. Separable optimization on the base polyhedron - Convex duality
• Let $\psi_k : \mathbb{R} \to \mathbb{R}$, $k \in \{1, \ldots, p\}$, be $p$ functions. Assume
– Each $\psi_k$ is strictly convex
– $\sup_{\alpha \in \mathbb{R}} \psi'_k(\alpha) = +\infty$ and $\inf_{\alpha \in \mathbb{R}} \psi'_k(\alpha) = -\infty$
– Denote by $\psi^*_1, \ldots, \psi^*_p$ their Fenchel conjugates (which then have full domain)

19. Separable optimization on the base polyhedron - Convex duality
• Let $\psi_k : \mathbb{R} \to \mathbb{R}$, $k \in \{1, \ldots, p\}$, be $p$ functions. Assume
– Each $\psi_k$ is strictly convex
– $\sup_{\alpha \in \mathbb{R}} \psi'_k(\alpha) = +\infty$ and $\inf_{\alpha \in \mathbb{R}} \psi'_k(\alpha) = -\infty$
– Denote by $\psi^*_1, \ldots, \psi^*_p$ their Fenchel conjugates (which then have full domain)
• Then
  $\min_{w \in \mathbb{R}^p} f(w) + \sum_{j=1}^p \psi_j(w_j) = \min_{w \in \mathbb{R}^p} \max_{s \in B(F)} w^\top s + \sum_{j=1}^p \psi_j(w_j)$
  $= \max_{s \in B(F)} \min_{w \in \mathbb{R}^p} w^\top s + \sum_{j=1}^p \psi_j(w_j) = \max_{s \in B(F)} - \sum_{j=1}^p \psi^*_j(-s_j)$

20. Separable optimization on the base polyhedron - Equivalence with submodular function minimization
• For $\alpha \in \mathbb{R}$, let $A^\alpha \subset V$ be a minimizer of $A \mapsto F(A) + \sum_{j \in A} \psi'_j(\alpha)$
• Let $w^*$ be the unique minimizer of $w \mapsto f(w) + \sum_{j=1}^p \psi_j(w_j)$
• Proposition (Chambolle and Darbon, 2009):
– Given $A^\alpha$ for all $\alpha \in \mathbb{R}$, then $\forall j,\ w^*_j = \sup(\{ \alpha \in \mathbb{R},\ j \in A^\alpha \})$
– Given $w^*$, then $A \mapsto F(A) + \sum_{j \in A} \psi'_j(\alpha)$ has minimal minimizer $\{ w^* > \alpha \}$ and maximal minimizer $\{ w^* \geq \alpha \}$
• Separable optimization is equivalent to a sequence of submodular function minimizations
– NB: extension of known results from parametric max-flow

21. Equivalence with submodular function minimization - Proof sketch (Bach, 2011b)
• Duality gap for $\min_{w \in \mathbb{R}^p} f(w) + \sum_{j=1}^p \psi_j(w_j) = \max_{s \in B(F)} - \sum_{j=1}^p \psi^*_j(-s_j)$:
  $f(w) + \sum_{j=1}^p \psi_j(w_j) + \sum_{j=1}^p \psi^*_j(-s_j)$
  $= \big( f(w) - w^\top s \big) + \sum_{j=1}^p \big( \psi_j(w_j) + \psi^*_j(-s_j) + w_j s_j \big)$
  $= \int_{-\infty}^{+\infty} \Big[ \big(F + \psi'(\alpha)\big)\big(\{ w \geq \alpha \}\big) - \big(s + \psi'(\alpha)\big)_-(V) \Big] \, d\alpha$
• Duality gap for the convex problem = sum of duality gaps for combinatorial problems

22. Separable optimization on the base polyhedron - Quadratic case
• Let $F$ be a submodular function and $w \in \mathbb{R}^p$ the unique minimizer of $w \mapsto f(w) + \tfrac{1}{2} \| w \|_2^2$. Then:
(a) $s = -w$ is the point in $B(F)$ with minimum $\ell_2$-norm
(b) For all $\lambda \in \mathbb{R}$, the maximal minimizer of $A \mapsto F(A) + \lambda |A|$ is $\{ w \geq -\lambda \}$ and the minimal minimizer is $\{ w > -\lambda \}$
• Consequences
– Threshold the minimum-norm point of $B(F)$ at 0 to minimize $F$ (Fujishige and Isotani, 2011)
– Minimizing submodular functions with cardinality constraints (Nagano et al., 2011)
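
A small Python sketch of how these consequences are used, assuming `w_star` approximates the minimizer above (e.g. produced by the min-norm-point or conditional-gradient routines discussed below); the function name is ours:

```python
import numpy as np

def minimizers_from_prox(w_star, lam=0.0):
    """Sketch of consequences (a)-(b): given (an approximation of) the minimizer
    w_star of f(w) + ||w||^2/2, threshold it to recover the minimal and maximal
    minimizers of A -> F(A) + lam*|A|; lam = 0 gives minimizers of F itself."""
    w_star = np.asarray(w_star, dtype=float)
    minimal = {k for k in range(len(w_star)) if w_star[k] > -lam}
    maximal = {k for k in range(len(w_star)) if w_star[k] >= -lam}
    return minimal, maximal
```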

23. From convex to combinatorial optimization and vice versa...
• Solving $\min_{w \in \mathbb{R}^p} \sum_{k \in V} \psi_k(w_k) + f(w)$ to solve $\min_{A \subset V} F(A)$
– Thresholding solutions $w$ at zero if $\forall k \in V,\ \psi'_k(0) = 0$
– For quadratic functions $\psi_k(w_k) = \tfrac{1}{2} w_k^2$, equivalent to projecting $0$ onto $B(F)$ (Fujishige, 2005)

24. From convex to combinatorial optimization and vice versa...
• Solving $\min_{w \in \mathbb{R}^p} \sum_{k \in V} \psi_k(w_k) + f(w)$ to solve $\min_{A \subset V} F(A)$
– Thresholding solutions $w$ at zero if $\forall k \in V,\ \psi'_k(0) = 0$
– For quadratic functions $\psi_k(w_k) = \tfrac{1}{2} w_k^2$, equivalent to projecting $0$ onto $B(F)$ (Fujishige, 2005)
• Solving $\min_{A \subset V} F(A) - t(A)$ to solve $\min_{w \in \mathbb{R}^p} \sum_{k \in V} \psi_k(w_k) + f(w)$
– General decomposition strategy (Groenevelt, 1991)
– Efficient only when submodular minimization is efficient

25. Solving $\min_{A \subset V} F(A) - t(A)$ to solve $\min_{w \in \mathbb{R}^p} \sum_{k \in V} \psi_k(w_k) + f(w)$
• General recursive divide-and-conquer algorithm (Groenevelt, 1991)
• NB: dual version of Fujishige (2005)
1. Compute the minimizer $t \in \mathbb{R}^p$ of $\sum_{j \in V} \psi^*_j(-t_j)$ s.t. $t(V) = F(V)$
2. Compute a minimizer $A$ of $F(A) - t(A)$
3. If $A = V$, then $t$ is optimal. Exit.
4. Compute a minimizer $s_A$ of $\sum_{j \in A} \psi^*_j(-s_j)$ over $s \in B(F_A)$, where $F_A : 2^A \to \mathbb{R}$ is the restriction of $F$ to $A$, i.e., $F_A(B) = F(B)$ for $B \subset A$
5. Compute a minimizer $s_{V \setminus A}$ of $\sum_{j \in V \setminus A} \psi^*_j(-s_j)$ over $s \in B(F^A)$, where $F^A(B) = F(A \cup B) - F(A)$ for $B \subset V \setminus A$
6. Concatenate $s_A$ and $s_{V \setminus A}$. Exit.
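
A hedged Python sketch of this divide-and-conquer recursion for the quadratic case $\psi_k(w_k) = w_k^2/2$, assuming an exact submodular-minimization oracle `sfm(G, V)` is available (both `sfm` and `min_norm_base` are our names, not a library API):

```python
import numpy as np

def min_norm_base(F, V, sfm):
    """Divide-and-conquer sketch for the quadratic case (psi*_j(-s_j) = s_j^2/2):
    returns the minimum-norm point of B(F) on the ground set V as a dict
    {element: coordinate}. `sfm(G, V)` is assumed to return a minimizer of G
    over subsets of V."""
    V = list(V)
    if not V:
        return {}
    t = F(frozenset(V)) / len(V)                      # step 1: t_j = F(V)/|V|, so t(V) = F(V)
    A = set(sfm(lambda B: F(frozenset(B)) - t * len(B), V))   # step 2: minimize F(A) - t(A)
    if F(frozenset(A)) - t * len(A) >= -1e-12:        # step 3: V itself is a minimizer, t optimal
        return {j: t for j in V}
    F_restr = lambda B: F(frozenset(B))               # step 4: restriction of F to A
    F_contr = lambda B: F(frozenset(set(B) | A)) - F(frozenset(A))  # step 5: contraction of F by A
    s = min_norm_base(F_restr, A, sfm)
    s.update(min_norm_base(F_contr, set(V) - A, sfm)) # step 6: concatenate
    return s
```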

26. Solving $\min_{w \in \mathbb{R}^p} \sum_{k \in V} \psi_k(w_k) + f(w)$ to solve $\min_{A \subset V} F(A)$
• Dual problem: $\max_{s \in B(F)} - \sum_{j=1}^p \psi^*_j(-s_j)$
• Constrained optimization when linear functions can be maximized
– Frank-Wolfe algorithms
• Two main types for convex functions

27. Approximate quadratic optimization on $B(F)$
• Goal: $\min_{w \in \mathbb{R}^p} \tfrac{1}{2} \| w \|_2^2 + f(w) = \max_{s \in B(F)} - \tfrac{1}{2} \| s \|_2^2$
• Can only maximize linear functions on $B(F)$
• Two types of "Frank-Wolfe" algorithms
1. Active-set algorithm ($\Leftrightarrow$ min-norm-point)
– Sequence of maximizations of linear functions over $B(F)$, plus overheads (affine projections)
– Finite convergence, but no complexity bounds

28. Minimum-norm-point algorithm (Wolfe, 1976)
[Figure: panels (a)-(f) illustrating successive iterations of the minimum-norm-point algorithm on a small example]

29. Approximate quadratic optimization on $B(F)$
• Goal: $\min_{w \in \mathbb{R}^p} \tfrac{1}{2} \| w \|_2^2 + f(w) = \max_{s \in B(F)} - \tfrac{1}{2} \| s \|_2^2$
• Can only maximize linear functions on $B(F)$
• Two types of "Frank-Wolfe" algorithms
1. Active-set algorithm ($\Leftrightarrow$ min-norm-point)
– Sequence of maximizations of linear functions over $B(F)$, plus overheads (affine projections)
– Finite convergence, but no complexity bounds
2. Conditional gradient
– Sequence of maximizations of linear functions over $B(F)$
– Approximate optimality bound
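
A minimal Python sketch of the conditional-gradient variant with exact line search, reusing `greedy_base` from the earlier sketch as the linear-maximization oracle (our illustration, not the talk's implementation):

```python
import numpy as np

def frank_wolfe_min_norm(F, p, iters=100):
    """Conditional-gradient sketch for min_{s in B(F)} ||s||^2/2, i.e. for the
    proximal problem min_w f(w) + ||w||^2/2 with w = -s.
    The linear oracle argmin_{v in B(F)} <g, v> is the greedy algorithm run on -g."""
    s = greedy_base(F, np.zeros(p))                    # any base as a starting point
    for _ in range(iters):
        g = s                                          # gradient of ||s||^2/2
        v = greedy_base(F, -g)                         # Frank-Wolfe vertex
        d = v - s
        if d @ d < 1e-12:
            break
        gamma = np.clip(-(g @ d) / (d @ d), 0.0, 1.0)  # exact line search on [0, 1]
        s = s + gamma * d
    return s, -s                                       # s ~ min-norm point, w ~ -s
```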

30. Conditional gradient with line search
[Figure: panels (a)-(i) illustrating successive iterations of conditional gradient with line search on a small example]

31. Approximate quadratic optimization on $B(F)$
• Proposition: $t$ steps of conditional gradient (with line search) output $s_t \in B(F)$ and $w_t = -s_t$ such that
  $f(w_t) + \tfrac{1}{2} \| w_t \|_2^2 - \mathrm{OPT} \ \leq\ f(w_t) + \tfrac{1}{2} \| w_t \|_2^2 + \tfrac{1}{2} \| s_t \|_2^2 \ \leq\ \frac{2 D^2}{t}$

32. Approximate quadratic optimization on $B(F)$
• Proposition: $t$ steps of conditional gradient (with line search) output $s_t \in B(F)$ and $w_t = -s_t$ such that
  $f(w_t) + \tfrac{1}{2} \| w_t \|_2^2 - \mathrm{OPT} \ \leq\ f(w_t) + \tfrac{1}{2} \| w_t \|_2^2 + \tfrac{1}{2} \| s_t \|_2^2 \ \leq\ \frac{2 D^2}{t}$
• Improved primal candidate through isotonic regression
– $f(w)$ is linear on any set of $w$ with fixed ordering
– May be optimized using isotonic regression ("pool-adjacent-violators") in $O(n)$ (see, e.g., Best and Chakravarti, 1990)
– Given $w_t = -s_t$, keep the ordering and reoptimize

33. Approximate quadratic optimization on $B(F)$
• Proposition: $t$ steps of conditional gradient (with line search) output $s_t \in B(F)$ and $w_t = -s_t$ such that
  $f(w_t) + \tfrac{1}{2} \| w_t \|_2^2 - \mathrm{OPT} \ \leq\ f(w_t) + \tfrac{1}{2} \| w_t \|_2^2 + \tfrac{1}{2} \| s_t \|_2^2 \ \leq\ \frac{2 D^2}{t}$
• Improved primal candidate through isotonic regression
– $f(w)$ is linear on any set of $w$ with fixed ordering
– May be optimized using isotonic regression ("pool-adjacent-violators") in $O(n)$ (see, e.g., Best and Chakravarti, 1990)
– Given $w_t = -s_t$, keep the ordering and reoptimize
• Better bound for submodular function minimization?

34. From quadratic optimization on $B(F)$ to submodular function minimization
• Proposition: if $w$ is $\varepsilon$-optimal for $\min_{w \in \mathbb{R}^p} \tfrac{1}{2} \| w \|_2^2 + f(w)$, then at least one level set $A$ of $w$ is $\frac{\sqrt{\varepsilon p}}{2}$-optimal for submodular function minimization
• If $\varepsilon = \frac{2 D^2}{t}$, then $\frac{\sqrt{\varepsilon p}}{2} = \frac{D \sqrt{p}}{\sqrt{2t}}$ $\Rightarrow$ no provable gains, but:
– Bound on the iterates $A_t$ (with additional assumptions)
– Possible thresholding for acceleration

35. From quadratic optimization on $B(F)$ to submodular function minimization
• Proposition: if $w$ is $\varepsilon$-optimal for $\min_{w \in \mathbb{R}^p} \tfrac{1}{2} \| w \|_2^2 + f(w)$, then at least one level set $A$ of $w$ is $\frac{\sqrt{\varepsilon p}}{2}$-optimal for submodular function minimization
• If $\varepsilon = \frac{2 D^2}{t}$, then $\frac{\sqrt{\varepsilon p}}{2} = \frac{D \sqrt{p}}{\sqrt{2t}}$ $\Rightarrow$ no provable gains, but:
– Bound on the iterates $A_t$ (with additional assumptions)
– Possible thresholding for acceleration
• Lower complexity bound for SFM
– Conjecture: no algorithm based only on a sequence of greedy algorithms obtained from linear combinations of bases can improve on the subgradient bound (after $p/2$ iterations)

36. Simulations on standard benchmark - "DIMACS Genrmf-wide", $p = 430$
• Submodular function minimization
– (Left) dual suboptimality $\log_{10}(\min(F) - s_-(V))$; (Right) primal suboptimality $\log_{10}(F(A) - \min(F))$, as functions of the iteration count
– Compared methods: MNP, CG-LS, CG-1/t, SD-1/t^{1/2}, SD-Polyak, Ellipsoid, Simplex, ACCPM, ACCPM-simp.
[Figure: convergence plots over 1500 iterations]

37. Simulations on standard benchmark - "DIMACS Genrmf-long", $p = 575$
• Submodular function minimization
– (Left) dual suboptimality $\log_{10}(\min(F) - s_-(V))$; (Right) primal suboptimality $\log_{10}(F(A) - \min(F))$, as functions of the iteration count
– Compared methods: MNP, CG-LS, CG-1/t, SD-1/t^{1/2}, SD-Polyak, Ellipsoid, Simplex, ACCPM, ACCPM-simp.
[Figure: convergence plots over 1000 iterations]

38. Simulations on standard benchmark
• Separable quadratic optimization
– (Left) dual suboptimality $\log_{10}(\mathrm{OPT} + \|s\|^2/2)$; (Right) primal suboptimality $\log_{10}(\|w\|^2/2 + f(w) - \mathrm{OPT})$ (in dashed, before the pool-adjacent-violators correction)
– Compared methods: MNP, CG-LS, CG-1/t
[Figure: convergence plots over 1500 iterations]

39. Outline
1. Submodular functions
– Review and examples of submodular functions
– Links with convexity through the Lovász extension
2. Submodular minimization
– Non-smooth convex optimization
– Parallel algorithm for a special case
3. Structured sparsity-inducing norms
– Relaxation of the penalization of supports by submodular functions
– Extensions (symmetric, $\ell_q$-relaxation)

40. From submodular minimization to proximal problems
• Summary: several optimization problems
– Discrete problem: $\min_{A \subset V} F(A) = \min_{w \in \{0,1\}^p} f(w)$
– Continuous problem: $\min_{w \in [0,1]^p} f(w)$
– Proximal problem (P): $\min_{w \in \mathbb{R}^p} \tfrac{1}{2} \| w \|_2^2 + f(w)$
• Solving (P) is equivalent to minimizing $F(A) + \lambda |A|$ for all $\lambda$
– $\arg\min_{A \subseteq V} F(A) + \lambda |A| = \{ k,\ w_k \geq -\lambda \}$
• Much simpler problem, but no gains in terms of (provable) complexity
– See Bach (2011a)

41. Decomposable functions
• $F$ may often be decomposed as the sum of $r$ "simple" functions: $F(A) = \sum_{j=1}^r F_j(A)$
– Each $F_j$ may be minimized efficiently
– Example: 2D grid = vertical chains + horizontal chains
• Komodakis et al. (2011); Kolmogorov (2012); Stobbe and Krause (2010); Savchynskyy et al. (2011)
– Dual decomposition approach, but slow non-smooth problem

42. Decomposable functions and proximal problems (Jegelka, Bach, and Sra, 2013)
• Dual problem:
  $\min_{w \in \mathbb{R}^p} f_1(w) + f_2(w) + \tfrac{1}{2} \| w \|_2^2 = \min_{w \in \mathbb{R}^p} \max_{s_1 \in B(F_1)} s_1^\top w + \max_{s_2 \in B(F_2)} s_2^\top w + \tfrac{1}{2} \| w \|_2^2$
  $= \max_{s_1 \in B(F_1),\ s_2 \in B(F_2)} - \tfrac{1}{2} \| s_1 + s_2 \|_2^2$
• Finding the closest point between two polytopes
– Several alternatives: block coordinate ascent, Douglas-Rachford splitting (Bauschke et al., 2004)
– (a) no parameters, (b) parallelizable
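
A Python sketch of the block-coordinate-ascent alternative (alternating projections), assuming a projection oracle `project_base(F, z)` onto $B(F)$; such an oracle could in principle be built from the min-norm routines above because $B(F - z) = B(F) - z$, but the signature here is hypothetical:

```python
import numpy as np

def alternating_projections(F1, F2, p, project_base, iters=50):
    """Block-coordinate-ascent sketch for
        max_{s1 in B(F1), s2 in B(F2)}  -||s1 + s2||^2 / 2,
    i.e. alternating projections between B(F1) and -B(F2).
    `project_base(F, z)` is an assumed oracle returning argmin_{s in B(F)} ||s - z||^2/2."""
    s1 = np.zeros(p)
    s2 = np.zeros(p)
    for _ in range(iters):
        s1 = project_base(F1, -s2)        # exact maximization over s1 with s2 fixed
        s2 = project_base(F2, -s1)        # exact maximization over s2 with s1 fixed
    w = -(s1 + s2)                        # primal candidate for min f1 + f2 + ||w||^2/2
    return w, s1, s2
```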

43. Experiments
• Graph cuts on a 500 × 500 image
– (Left) discrete gaps, smooth problems; (Right) discrete gaps, non-smooth problems ($\log_{10}$ duality gap vs. iteration)
– Compared methods: grad-accel, BCD, DR, BCD-para, DR-para, primal-sgd, dual-sgd-P, dual-sgd-F, dual-smooth, primal-smooth
[Figure: convergence plots]
• Matlab/C implementation 10 times slower than C code for graph cut
– Easy to code and parallelizable

44. Parallelization
• Multiple cores: speedup factor of 40 iterations of DR as a function of the number of cores
[Figure: speedup factor vs. number of cores (up to 8)]

45. Outline
1. Submodular functions
– Review and examples of submodular functions
– Links with convexity through the Lovász extension
2. Submodular minimization
– Non-smooth convex optimization
– Parallel algorithm for a special case
3. Structured sparsity-inducing norms
– Relaxation of the penalization of supports by submodular functions
– Extensions (symmetric, $\ell_q$-relaxation)

46. Structured sparsity through submodular functions - References and links
• References on submodular functions
– Submodular Functions and Optimization (Fujishige, 2005)
– Tutorial paper based on convex optimization (Bach, 2011b): www.di.ens.fr/~fbach/submodular_fot.pdf
• Structured sparsity through convex optimization
– Algorithms (Bach, Jenatton, Mairal, and Obozinski, 2011): www.di.ens.fr/~fbach/bach_jenatton_mairal_obozinski_FOT.pdf
– Theory/applications (Bach, Jenatton, Mairal, and Obozinski, 2012): www.di.ens.fr/~fbach/stat_science_structured_sparsity.pdf
– Matlab/R/Python codes: http://www.di.ens.fr/willow/SPAMS/
• Slides: www.di.ens.fr/~fbach/fbach_cargese_2013.pdf

47. Sparsity in supervised machine learning
• Observed data $(x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}$, $i = 1, \ldots, n$
– Response vector $y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$
– Design matrix $X = (x_1, \ldots, x_n)^\top \in \mathbb{R}^{n \times p}$
• Regularized empirical risk minimization:
  $\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \ell(y_i, w^\top x_i) + \lambda \Omega(w) = \min_{w \in \mathbb{R}^p} L(y, Xw) + \lambda \Omega(w)$
• Norm $\Omega$ to promote sparsity
– Square loss + $\ell_1$-norm $\Rightarrow$ basis pursuit in signal processing (Chen et al., 2001), Lasso in statistics/machine learning (Tibshirani, 1996)
– Proxy for interpretability
– Allows high-dimensional inference: $\log p = O(n)$
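
For the square loss with the $\ell_1$-norm, a standard solver is proximal gradient (ISTA); the following Python sketch (ours, not from the talk) makes the soft-thresholding step explicit:

```python
import numpy as np

def lasso_ista(X, y, lam, iters=500):
    """Proximal-gradient (ISTA) sketch for min_w (1/(2n))||y - Xw||^2 + lam*||w||_1."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n            # Lipschitz constant of the smooth part
    w = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n             # gradient of the square loss
        z = w - grad / L
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft thresholding
    return w
```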

48. Sparsity in unsupervised machine learning
• Multiple responses/signals $y = (y^1, \ldots, y^k) \in \mathbb{R}^{n \times k}$
  $\min_{X = (x^1, \ldots, x^p)} \ \min_{w^1, \ldots, w^k \in \mathbb{R}^p} \sum_{j=1}^k \big( L(y^j, X w^j) + \lambda \Omega(w^j) \big)$

49. Sparsity in unsupervised machine learning
• Multiple responses/signals $y = (y^1, \ldots, y^k) \in \mathbb{R}^{n \times k}$
  $\min_{X = (x^1, \ldots, x^p)} \ \min_{w^1, \ldots, w^k \in \mathbb{R}^p} \sum_{j=1}^k \big( L(y^j, X w^j) + \lambda \Omega(w^j) \big)$
• Only responses are observed $\Rightarrow$ dictionary learning
– Learn $X = (x^1, \ldots, x^p) \in \mathbb{R}^{n \times p}$ such that $\forall j,\ \| x^j \|_2 \leq 1$
– Olshausen and Field (1997); Elad and Aharon (2006); Mairal et al. (2009a)
• Sparse PCA: replace $\| x^j \|_2 \leq 1$ by $\Theta(x^j) \leq 1$

50. Sparsity in signal processing
• Multiple responses/signals $x = (x^1, \ldots, x^k) \in \mathbb{R}^{n \times k}$
  $\min_{D = (d^1, \ldots, d^p)} \ \min_{\alpha^1, \ldots, \alpha^k \in \mathbb{R}^p} \sum_{j=1}^k \big( L(x^j, D \alpha^j) + \lambda \Omega(\alpha^j) \big)$
• Only responses are observed $\Rightarrow$ dictionary learning
– Learn $D = (d^1, \ldots, d^p) \in \mathbb{R}^{n \times p}$ such that $\forall j,\ \| d^j \|_2 \leq 1$
– Olshausen and Field (1997); Elad and Aharon (2006); Mairal et al. (2009a)
• Sparse PCA: replace $\| d^j \|_2 \leq 1$ by $\Theta(d^j) \leq 1$

51. Why structured sparsity?
• Interpretability
– Structured dictionary elements (Jenatton et al., 2009b)
– Dictionary elements "organized" in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

52. Structured sparse PCA (Jenatton et al., 2009b)
[Figure: raw data vs. sparse PCA]
• Unstructured sparse PCA $\Rightarrow$ many zeros do not lead to better interpretability

53. Structured sparse PCA (Jenatton et al., 2009b)
[Figure: raw data vs. sparse PCA]
• Unstructured sparse PCA $\Rightarrow$ many zeros do not lead to better interpretability

54. Structured sparse PCA (Jenatton et al., 2009b)
[Figure: raw data vs. structured sparse PCA]
• Enforce selection of convex nonzero patterns $\Rightarrow$ robustness to occlusion in face identification

55. Structured sparse PCA (Jenatton et al., 2009b)
[Figure: raw data vs. structured sparse PCA]
• Enforce selection of convex nonzero patterns $\Rightarrow$ robustness to occlusion in face identification

56. Why structured sparsity?
• Interpretability
– Structured dictionary elements (Jenatton et al., 2009b)
– Dictionary elements "organized" in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

  57. Modelling of text corpora (Jenatton et al., 2010)

58. Why structured sparsity?
• Interpretability
– Structured dictionary elements (Jenatton et al., 2009b)
– Dictionary elements "organized" in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

59. Why structured sparsity?
• Interpretability
– Structured dictionary elements (Jenatton et al., 2009b)
– Dictionary elements "organized" in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)
• Stability and identifiability
• Prediction or estimation performance
– When prior knowledge matches data (Haupt and Nowak, 2006; Baraniuk et al., 2008; Jenatton et al., 2009a; Huang et al., 2009)
• Numerical efficiency
– Non-linear variable selection with $2^p$ subsets (Bach, 2008)

60. Classical approaches to structured sparsity
• Many application domains
– Computer vision (Cevher et al., 2008; Mairal et al., 2009b)
– Neuro-imaging (Gramfort and Kowalski, 2009; Jenatton et al., 2011)
– Bio-informatics (Rapaport et al., 2008; Kim and Xing, 2010)
• Non-convex approaches
– Haupt and Nowak (2006); Baraniuk et al. (2008); Huang et al. (2009)
• Convex approaches
– Design of sparsity-inducing norms

61. Why $\ell_1$-norms lead to sparsity?
• Example 1: quadratic problem in 1D, i.e., $\min_{x \in \mathbb{R}} \tfrac{1}{2} x^2 - xy + \lambda |x|$
• Piecewise quadratic function with a kink at zero
– Derivative at $0^+$: $g_+ = \lambda - y$; at $0^-$: $g_- = -\lambda - y$
– $x = 0$ is the solution iff $g_+ \geq 0$ and $g_- \leq 0$ (i.e., $|y| \leq \lambda$)
– $x \geq 0$ is the solution iff $g_+ \leq 0$ (i.e., $y \geq \lambda$) $\Rightarrow x^* = y - \lambda$
– $x \leq 0$ is the solution iff $g_- \geq 0$ (i.e., $y \leq -\lambda$) $\Rightarrow x^* = y + \lambda$
• Solution $x^* = \mathrm{sign}(y)(|y| - \lambda)_+$ = soft thresholding

62. Why $\ell_1$-norms lead to sparsity?
• Example 1: quadratic problem in 1D, i.e., $\min_{x \in \mathbb{R}} \tfrac{1}{2} x^2 - xy + \lambda |x|$
• Piecewise quadratic function with a kink at zero
• Solution $x^* = \mathrm{sign}(y)(|y| - \lambda)_+$ = soft thresholding
[Figure: soft-thresholding map $x^*(y)$, flat on $[-\lambda, \lambda]$]
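
A two-line Python check of this closed form (our illustration):

```python
import numpy as np

def soft_threshold(y, lam):
    """Closed-form solution of min_x x^2/2 - x*y + lam*|x| (the slide's Example 1)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

print(soft_threshold(np.array([-2.0, -0.5, 0.3, 1.5]), 1.0))   # -> [-1.  0.  0.  0.5]
```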

63. Why $\ell_1$-norms lead to sparsity?
• Example 2: minimize a quadratic function $Q(w)$ subject to $\| w \|_1 \leq T$
– Coupled soft thresholding
• Geometric interpretation
– NB: penalizing is "equivalent" to constraining
[Figure: $\ell_1$-ball in the $(w_1, w_2)$ plane with level sets of $Q$]
• Non-smooth optimization!

64. Gaussian hare ($\ell_2$) vs. Laplacian tortoise ($\ell_1$)
• Smooth vs. non-smooth optimization
• See Bach, Jenatton, Mairal, and Obozinski (2011)

65. Sparsity-inducing norms
• Popular choice for $\Omega$: the $\ell_1$-$\ell_2$ norm
  $\sum_{G \in H} \| w_G \|_2 = \sum_{G \in H} \Big( \sum_{j \in G} w_j^2 \Big)^{1/2}$
– with $H$ a partition of $\{1, \ldots, p\}$
– The $\ell_1$-$\ell_2$ norm sets to zero groups of non-overlapping variables (as opposed to single variables for the $\ell_1$-norm)
– For the square loss: group Lasso (Yuan and Lin, 2006)
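
A short Python sketch (ours) of the $\ell_1$-$\ell_2$ norm for a partition $H$ (0-based indices here) and of its proximal operator, the block soft thresholding used by group-Lasso solvers:

```python
import numpy as np

def group_norm(w, H):
    """l1-l2 norm: sum of l2-norms of the groups in the partition H."""
    return sum(np.linalg.norm(w[list(G)]) for G in H)

def prox_group_norm(w, H, lam):
    """Proximal operator of lam * group_norm (block soft thresholding)."""
    out = np.array(w, dtype=float)
    for G in H:
        G = list(G)
        nrm = np.linalg.norm(out[G])
        out[G] = 0.0 if nrm <= lam else (1.0 - lam / nrm) * out[G]
    return out

# Example with p = 5 and H = {{0,1}, {2,3,4}}
w = np.array([0.1, -0.2, 2.0, 0.0, -1.0])
H = [{0, 1}, {2, 3, 4}]
print(group_norm(w, H), prox_group_norm(w, H, 0.5))   # first group is zeroed out
```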

66. Unit norm balls - Geometric interpretation
[Figure: unit balls of $\| w \|_2$, $\| w \|_1$, and $\sqrt{w_1^2 + w_2^2} + |w_3|$]

67. Sparsity-inducing norms
• Popular choice for $\Omega$: the $\ell_1$-$\ell_2$ norm
  $\sum_{G \in H} \| w_G \|_2 = \sum_{G \in H} \Big( \sum_{j \in G} w_j^2 \Big)^{1/2}$
– with $H$ a partition of $\{1, \ldots, p\}$
– The $\ell_1$-$\ell_2$ norm sets to zero groups of non-overlapping variables (as opposed to single variables for the $\ell_1$-norm)
– For the square loss: group Lasso (Yuan and Lin, 2006)
• What if the set of groups $H$ is not a partition anymore?
• Is there any systematic way?

68. $\ell_1$-norm = convex envelope of the cardinality of the support
• Let $w \in \mathbb{R}^p$, $V = \{1, \ldots, p\}$, and $\mathrm{Supp}(w) = \{ j \in V,\ w_j \neq 0 \}$
• Cardinality of support: $\| w \|_0 = \mathrm{Card}(\mathrm{Supp}(w))$
• Convex envelope = largest convex lower bound (see, e.g., Boyd and Vandenberghe, 2004)
[Figure: $\|w\|_0$ and $\|w\|_1$ on $[-1, 1]$]
• $\ell_1$-norm = convex envelope of the $\ell_0$-quasi-norm on the $\ell_\infty$-ball $[-1, 1]^p$

69. Convex envelopes of general functions of the support (Bach, 2010)
• Let $F : 2^V \to \mathbb{R}$ be a set-function
– Assume $F$ is non-decreasing (i.e., $A \subset B \Rightarrow F(A) \leq F(B)$)
– Explicit prior knowledge on supports (Haupt and Nowak, 2006; Baraniuk et al., 2008; Huang et al., 2009)
• Define $\Theta(w) = F(\mathrm{Supp}(w))$: how to get its convex envelope?
1. Possible if $F$ is also submodular
2. Allows a unified theory and algorithm
3. Provides new regularizers

70. Submodular functions and structured sparsity
• Let $F : 2^V \to \mathbb{R}$ be a non-decreasing submodular set-function
• Proposition: the convex envelope of $\Theta : w \mapsto F(\mathrm{Supp}(w))$ on the $\ell_\infty$-ball is $\Omega : w \mapsto f(|w|)$, where $f$ is the Lovász extension of $F$
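
Since $\Omega(w) = f(|w|)$, the norm can be evaluated with the greedy algorithm; a small Python sketch reusing `greedy_base` from the first sketch (the toy set-functions below are our examples):

```python
import numpy as np

def omega(F, w):
    """Evaluate Omega(w) = f(|w|) with the greedy algorithm."""
    a = np.abs(np.asarray(w, dtype=float))
    return float(a @ greedy_base(F, a))          # f(|w|) = max_{s in B(F)} |w|^T s

# Two sanity checks (toy non-decreasing submodular functions):
#   F(A) = |A|        gives Omega(w) = ||w||_1
#   F(A) = min(|A|,1) gives Omega(w) = ||w||_inf
w = np.array([0.5, -2.0, 1.0])
print(omega(lambda A: len(A), w))           # 3.5 = ||w||_1
print(omega(lambda A: min(len(A), 1), w))   # 2.0 = ||w||_inf
```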
