
Learning with Submodular Functions

Francis Bach, Sierra project-team, INRIA - École Normale Supérieure
Machine Learning Summer School, Kyoto, September 2012


  1. Examples of submodular functions: Flows
  • Net-flows from multi-sink, multi-source networks (Megiddo, 1974)
  • See details in www.di.ens.fr/~fbach/submodular_fot.pdf
  • Efficient formulation for set covers

  2. Examples of submodular functions: Matroids
  • The pair (V, I) is a matroid, with I its family of independent sets, iff:
  (a) ∅ ∈ I
  (b) I_1 ⊂ I_2 ∈ I ⇒ I_1 ∈ I
  (c) for all I_1, I_2 ∈ I, |I_1| < |I_2| ⇒ ∃ k ∈ I_2 \ I_1, I_1 ∪ {k} ∈ I
  • The rank function of the matroid, defined as F(A) = max_{I ⊆ A, I ∈ I} |I|, is submodular (direct proof)
  • Graphic matroid (more later!)
  – V is the edge set of a certain graph G = (U, V)
  – I = set of subsets of edges which do not contain any cycle
  – F(A) = |U| minus the number of connected components of the subgraph induced by A

  3. Outline
  1. Submodular functions
  – Definitions
  – Examples of submodular functions
  – Links with convexity through the Lovász extension
  2. Submodular optimization
  – Minimization
  – Links with convex optimization
  – Maximization
  3. Structured sparsity-inducing norms
  – Norms with overlapping groups
  – Relaxation of the penalization of supports by submodular functions

  4. Choquet integral - Lovász extension
  • Subsets may be identified with elements of {0,1}^p
  • Given any set-function F and w such that w_{j_1} ≥ ... ≥ w_{j_p}, define:
  $$f(w) = \sum_{k=1}^{p} w_{j_k}\big[F(\{j_1,\dots,j_k\}) - F(\{j_1,\dots,j_{k-1}\})\big]
         = \sum_{k=1}^{p-1} (w_{j_k} - w_{j_{k+1}})\,F(\{j_1,\dots,j_k\}) + w_{j_p}\,F(\{j_1,\dots,j_p\})$$
  (Figure: vertices of the cube {0,1}^3 identified with subsets, e.g., (1,0,1) ~ {1,3}, (1,1,1) ~ {1,2,3}, (0,0,0) ~ {}.)
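To make the definition concrete, here is a minimal sketch (not from the slides) that computes the Lovász extension of an arbitrary set function given as a Python callable; the name `lovasz_extension` and the representation of sets as Python sets are our own choices.

```python
import numpy as np

def lovasz_extension(F, w):
    """Lovász extension f(w) of a set function F: set -> float with F(set()) = 0.

    Sorts the components of w in decreasing order and accumulates the
    marginal gains F({j1,...,jk}) - F({j1,...,j_{k-1}}), as on this slide.
    """
    w = np.asarray(w, dtype=float)
    value, prefix, prev_F = 0.0, set(), 0.0
    for j in np.argsort(-w):        # j_1, ..., j_p with w_{j_1} >= ... >= w_{j_p}
        prefix.add(int(j))
        cur_F = F(prefix)
        value += w[j] * (cur_F - prev_F)
        prev_F = cur_F
    return value

# Usage: f extends F from {0,1}^p to R^p, so f(1_A) = F(A).
F_card = lambda A: len(A) ** 0.5                   # concave function of cardinality
print(lovasz_extension(F_card, [1.0, 0.0, 1.0]))   # equals F({0, 2}) = sqrt(2)
```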

  5. Choquet integral - Lovász extension: Properties
  $$f(w) = \sum_{k=1}^{p} w_{j_k}\big[F(\{j_1,\dots,j_k\}) - F(\{j_1,\dots,j_{k-1}\})\big]
         = \sum_{k=1}^{p-1} (w_{j_k} - w_{j_{k+1}})\,F(\{j_1,\dots,j_k\}) + w_{j_p}\,F(\{j_1,\dots,j_p\})$$
  • For any set-function F (even not submodular):
  – f is piecewise-linear and positively homogeneous
  – If w = 1_A, then f(w) = F(A) ⇒ extension from {0,1}^p to R^p

  6. Choquet integral - Lovász extension: Example with p = 2
  • If w_1 ≥ w_2: f(w) = F({1}) w_1 + [F({1,2}) − F({1})] w_2
  • If w_1 ≤ w_2: f(w) = F({2}) w_2 + [F({1,2}) − F({2})] w_1
  (Figure: the level set {w ∈ R², f(w) = 1} is displayed in blue, through the points (1,0)/F({1}), (0,1)/F({2}), and (1,1)/F({1,2}).)
  • NB: compact formulation f(w) = −[F({1}) + F({2}) − F({1,2})] min{w_1, w_2} + F({1}) w_1 + F({2}) w_2

  7. Submodular functions: Links with convexity
  • Theorem (Lovász, 1982): F is submodular if and only if f is convex
  • Proof requires additional notions:
  – Submodular and base polyhedra

  8. Submodular and base polyhedra - Definitions
  • Submodular polyhedron: P(F) = {s ∈ R^p : ∀ A ⊆ V, s(A) ≤ F(A)}, with s(A) = Σ_{k∈A} s_k
  • Base polyhedron: B(F) = P(F) ∩ {s(V) = F(V)}
  (Figure: P(F) and B(F) in two and three dimensions.)
  • Property: P(F) has non-empty interior

  9. Submodular and base polyhedra - Properties
  • Submodular polyhedron: P(F) = {s ∈ R^p : ∀ A ⊆ V, s(A) ≤ F(A)}
  • Base polyhedron: B(F) = P(F) ∩ {s(V) = F(V)}
  • Many facets (up to 2^p), many extreme points (up to p!)

  10. Submodular and base polyhedra - Properties
  • Submodular polyhedron: P(F) = {s ∈ R^p : ∀ A ⊆ V, s(A) ≤ F(A)}
  • Base polyhedron: B(F) = P(F) ∩ {s(V) = F(V)}
  • Many facets (up to 2^p), many extreme points (up to p!)
  • Fundamental property (Edmonds, 1970): if F is submodular, maximizing linear functions may be done by a "greedy algorithm":
  – Let w ∈ R^p_+ be such that w_{j_1} ≥ ... ≥ w_{j_p}
  – Let s_{j_k} = F({j_1,...,j_k}) − F({j_1,...,j_{k−1}}) for k ∈ {1,...,p}
  – Then f(w) = max_{s∈P(F)} w⊤s = max_{s∈B(F)} w⊤s
  – Both maxima are attained at the s defined above
  • Simple proof by convex duality
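A minimal sketch of Edmonds' greedy algorithm, mirroring the `lovasz_extension` sketch above (our own code): it returns the base s ∈ B(F) maximizing w⊤s, with optimal value f(w).

```python
import numpy as np

def greedy_base(F, w):
    """Edmonds' greedy algorithm: for submodular F with F(set()) = 0, returns
    s in B(F) maximizing w's; the maximum equals the Lovász extension f(w)."""
    w = np.asarray(w, dtype=float)
    s = np.zeros_like(w)
    prefix, prev_F = set(), 0.0
    for j in np.argsort(-w):        # decreasing order of w
        prefix.add(int(j))
        cur_F = F(prefix)
        s[j] = cur_F - prev_F       # marginal gain of adding j
        prev_F = cur_F
    return s

F_card = lambda A: len(A) ** 0.5
s = greedy_base(F_card, [3.0, 1.0, 2.0])
print(s, np.dot(s, [3.0, 1.0, 2.0]))   # w's equals f(w)
```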

  11. Greedy algorithm - Proof
  • Introduce a Lagrange multiplier λ_A ∈ R_+ for each constraint s⊤1_A = s(A) ≤ F(A):
  $$\begin{aligned}
  \max_{s \in P(F)} w^\top s
  &= \min_{\lambda_A \ge 0,\, A \subseteq V}\ \max_{s \in \mathbb{R}^p}\ \Big\{ w^\top s - \sum_{A \subseteq V} \lambda_A\,[s(A) - F(A)] \Big\} \\
  &= \min_{\lambda_A \ge 0,\, A \subseteq V}\ \max_{s \in \mathbb{R}^p}\ \Big\{ \sum_{A \subseteq V} \lambda_A F(A) + \sum_{k=1}^{p} s_k \Big( w_k - \sum_{A \ni k} \lambda_A \Big) \Big\} \\
  &= \min_{\lambda_A \ge 0,\, A \subseteq V}\ \sum_{A \subseteq V} \lambda_A F(A) \quad \text{such that } \forall k \in V,\ w_k = \sum_{A \ni k} \lambda_A
  \end{aligned}$$
  • Define λ_{{j_1,...,j_k}} = w_{j_k} − w_{j_{k+1}} for k ∈ {1,...,p−1}, λ_V = w_{j_p}, and λ_A = 0 otherwise
  – This λ is dual feasible, and the primal/dual costs are both equal to f(w)

  12. Proof of greedy algorithm - Showing primal feasibility
  • Assume (wlog) j_k = k, and write A = (u_1, v_1] ∪ ... ∪ (u_m, v_m] as a union of disjoint intervals of indices. Then:
  $$\begin{aligned}
  s(A) &= \sum_{k=1}^{m} s((u_k, v_k]) && \text{by modularity} \\
       &= \sum_{k=1}^{m} \big[ F((0, v_k]) - F((0, u_k]) \big] && \text{by definition of } s \\
       &\le \sum_{k=1}^{m} \big[ F((u_1, v_k]) - F((u_1, u_k]) \big] && \text{by submodularity} \\
       &= F((u_1, v_1]) + \sum_{k=2}^{m} \big[ F((u_1, v_k]) - F((u_1, u_k]) \big] \\
       &\le F((u_1, v_1]) + \sum_{k=2}^{m} \big[ F((u_1, v_1] \cup (u_2, v_k]) - F((u_1, v_1] \cup (u_2, u_k]) \big] && \text{by submodularity} \\
       &= F((u_1, v_1] \cup (u_2, v_2]) + \sum_{k=3}^{m} \big[ F((u_1, v_1] \cup (u_2, v_k]) - F((u_1, v_1] \cup (u_2, u_k]) \big]
  \end{aligned}$$
  • By repeatedly applying submodularity, we get s(A) ≤ F((u_1, v_1] ∪ ... ∪ (u_m, v_m]) = F(A), i.e., s ∈ P(F)

  13. Greedy algorithm for matroids
  • The pair (V, I) is a matroid, with I its family of independent sets, iff:
  (a) ∅ ∈ I
  (b) I_1 ⊂ I_2 ∈ I ⇒ I_1 ∈ I
  (c) for all I_1, I_2 ∈ I, |I_1| < |I_2| ⇒ ∃ k ∈ I_2 \ I_1, I_1 ∪ {k} ∈ I
  • The rank function, defined as F(A) = max_{I ⊆ A, I ∈ I} |I|, is submodular
  • Greedy algorithm:
  – Since F(A ∪ {k}) − F(A) ∈ {0, 1}, s ∈ {0,1}^p ⇒ w⊤s = Σ_{k: s_k = 1} w_k
  – Start with A = ∅, order the weights w_k in decreasing order, and sequentially add element k to A if the set A remains independent
  • Graphic matroid: this is Kruskal's algorithm for the maximum-weight spanning tree! (see the sketch below)
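As an illustration of the matroid greedy algorithm, here is a minimal Kruskal-style sketch (our own code, with a hypothetical edge-list format `(weight, u, v)`): sort edges by decreasing weight and keep an edge whenever it joins two different components, i.e., whenever the edge set stays independent in the graphic matroid.

```python
def max_weight_spanning_forest(n_vertices, edges):
    """Greedy algorithm on the graphic matroid = Kruskal's algorithm.

    edges: list of (weight, u, v). Returns a max-weight spanning forest.
    Union-find tracks connected components; adding an edge keeps the edge
    set independent iff its endpoints lie in different components.
    """
    parent = list(range(n_vertices))

    def find(x):                      # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for weight, u, v in sorted(edges, reverse=True):  # decreasing weight
        ru, rv = find(u), find(v)
        if ru != rv:                  # no cycle created => still independent
            parent[ru] = rv
            forest.append((weight, u, v))
    return forest

print(max_weight_spanning_forest(4, [(3, 0, 1), (2, 1, 2), (1, 0, 2), (5, 2, 3)]))
```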

  14. Submodular functions: Links with convexity
  • Theorem (Lovász, 1982): F is submodular if and only if f is convex
  • Proof:
  1. If F is submodular, f is a maximum of linear functions ⇒ f is convex
  2. If f is convex, let A, B ⊆ V:
  – 1_{A∪B} + 1_{A∩B} = 1_A + 1_B has components equal to 0 (on V \ (A∪B)), 2 (on A∩B) and 1 (on AΔB = (A\B) ∪ (B\A))
  – Thus f(1_{A∪B} + 1_{A∩B}) = F(A∪B) + F(A∩B), since the level sets of 1_{A∪B} + 1_{A∩B} are the nested sets A∩B ⊆ A∪B
  – By homogeneity and convexity, f(1_A + 1_B) ≤ f(1_A) + f(1_B) = F(A) + F(B), and thus F is submodular

  15. Submodular functions: Links with convexity
  • Theorem (Lovász, 1982): If F is submodular, then
  $$\min_{A \subseteq V} F(A) = \min_{w \in \{0,1\}^p} f(w) = \min_{w \in [0,1]^p} f(w)$$
  • Proof:
  1. Since f is an extension of F: min_{A⊆V} F(A) = min_{w∈{0,1}^p} f(w) ≥ min_{w∈[0,1]^p} f(w)
  2. Any w ∈ [0,1]^p may be decomposed as w = Σ_{i=1}^m λ_i 1_{B_i}, where B_1 ⊂ ... ⊂ B_m = V, λ ≥ 0 and Σ_i λ_i ≤ 1:
  – Then f(w) = Σ_{i=1}^m λ_i F(B_i) ≥ (Σ_{i=1}^m λ_i) min_{A⊆V} F(A) ≥ min_{A⊆V} F(A) (because min_{A⊆V} F(A) ≤ 0)
  – Thus min_{w∈[0,1]^p} f(w) ≥ min_{A⊆V} F(A)

  16. Submodular functions: Links with convexity
  • Theorem (Lovász, 1982): If F is submodular, then
  $$\min_{A \subseteq V} F(A) = \min_{w \in \{0,1\}^p} f(w) = \min_{w \in [0,1]^p} f(w)$$
  • Consequence: submodular function minimization may be done in polynomial time
  – Ellipsoid algorithm: polynomial time, but slow in practice

  17. Submodular functions - Optimization
  • Submodular function minimization in O(p^6)
  – Schrijver (2000); Iwata et al. (2001); Orlin (2009)
  • Efficient active-set algorithm with no complexity bound
  – Based on the efficient computability of the support function
  – Fujishige and Isotani (2011); Wolfe (1976)
  • Special cases with faster algorithms: cuts, flows
  • Active area of research
  – Machine learning: Stobbe and Krause (2010); Jegelka, Lin, and Bilmes (2011)
  – Combinatorial optimization: see Satoru Iwata's talk
  – Convex optimization: see next part of the tutorial

  18. Submodular functions - Summary
  • F: 2^V → R is submodular if and only if
  $$\forall A, B \subseteq V,\quad F(A) + F(B) \ge F(A \cap B) + F(A \cup B)
    \quad\Leftrightarrow\quad \forall k \in V,\ A \mapsto F(A \cup \{k\}) - F(A) \text{ is non-increasing}$$

  19. Submodular functions - Summary
  • F: 2^V → R is submodular if and only if
  $$\forall A, B \subseteq V,\quad F(A) + F(B) \ge F(A \cap B) + F(A \cup B)
    \quad\Leftrightarrow\quad \forall k \in V,\ A \mapsto F(A \cup \{k\}) - F(A) \text{ is non-increasing}$$
  • Intuition 1: defined like concave functions ("diminishing returns")
  – Example: F: A ↦ g(Card(A)) is submodular if g is concave

  20. Submodular functions - Summary
  • F: 2^V → R is submodular if and only if
  $$\forall A, B \subseteq V,\quad F(A) + F(B) \ge F(A \cap B) + F(A \cup B)
    \quad\Leftrightarrow\quad \forall k \in V,\ A \mapsto F(A \cup \{k\}) - F(A) \text{ is non-increasing}$$
  • Intuition 1: defined like concave functions ("diminishing returns")
  – Example: F: A ↦ g(Card(A)) is submodular if g is concave
  • Intuition 2: behave like convex functions
  – Polynomial-time minimization, conjugacy theory

  21. Submodular functions - Examples
  • Concave functions of the cardinality: g(|A|)
  • Cuts
  • Entropies
  – H((X_k)_{k∈A}) from p random variables X_1, ..., X_p
  – Gaussian variables: H((X_k)_{k∈A}) ∝ log det Σ_{AA}
  – Functions of eigenvalues of sub-matrices
  • Network flows
  – Efficient representation for set covers
  • Rank functions of matroids
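The diminishing-returns characterization is easy to test exhaustively for small p; the following sketch (our own helper, exponential in p, for illustration only) checks the submodular inequality on two of the examples above.

```python
from itertools import combinations

def is_submodular(F, V):
    """Brute-force check of F(A) + F(B) >= F(A|B) + F(A&B) over all pairs
    of subsets of V (exponential in |V|: use only for tiny ground sets)."""
    subsets = [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]
    return all(F(A) + F(B) >= F(A | B) + F(A & B) - 1e-12
               for A in subsets for B in subsets)

V = range(4)
print(is_submodular(lambda A: len(A) ** 0.5, V))   # concave of cardinality: True
print(is_submodular(lambda A: len(A) ** 2, V))     # convex of cardinality: False
```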

  22. Submodular functions - Lovász extension
  • Given any set-function F and w such that w_{j_1} ≥ ... ≥ w_{j_p}, define:
  $$f(w) = \sum_{k=1}^{p} w_{j_k}\big[F(\{j_1,\dots,j_k\}) - F(\{j_1,\dots,j_{k-1}\})\big]
         = \sum_{k=1}^{p-1} (w_{j_k} - w_{j_{k+1}})\,F(\{j_1,\dots,j_k\}) + w_{j_p}\,F(\{j_1,\dots,j_p\})$$
  – If w = 1_A, f(w) = F(A) ⇒ extension from {0,1}^p to R^p (subsets may be identified with elements of {0,1}^p)
  – f is piecewise affine and positively homogeneous
  • F is submodular if and only if f is convex
  – Minimizing f(w) over w ∈ [0,1]^p is equivalent to minimizing F over 2^V

  23. Submodular functions - Submodular polyhedra
  • Submodular polyhedron: P(F) = {s ∈ R^p : ∀ A ⊆ V, s(A) ≤ F(A)}
  • Base polyhedron: B(F) = P(F) ∩ {s(V) = F(V)}
  • Link with the Lovász extension (Edmonds, 1970; Lovász, 1982):
  – if w ∈ R^p_+, then max_{s∈P(F)} w⊤s = f(w)
  – if w ∈ R^p, then max_{s∈B(F)} w⊤s = f(w)
  • Maximizer obtained by the greedy algorithm:
  – Sort the components of w as w_{j_1} ≥ ... ≥ w_{j_p}
  – Set s_{j_k} = F({j_1,...,j_k}) − F({j_1,...,j_{k−1}})
  • Other operations on submodular polyhedra (see, e.g., Bach, 2011)

  24. Outline
  1. Submodular functions
  – Definitions
  – Examples of submodular functions
  – Links with convexity through the Lovász extension
  2. Submodular optimization
  – Minimization
  – Links with convex optimization
  – Maximization
  3. Structured sparsity-inducing norms
  – Norms with overlapping groups
  – Relaxation of the penalization of supports by submodular functions

  25. Submodular optimization problems: Outline
  • Submodular function minimization
  – Properties of minimizers
  – Combinatorial algorithms
  – Approximate minimization of the Lovász extension
  • Convex optimization with the Lovász extension
  – Separable optimization problems
  – Application to submodular function minimization
  • Submodular function maximization
  – Simple algorithms with approximate optimality guarantees

  26. Submodularity (almost) everywhere: Clustering
  • Semi-supervised clustering ⇒ submodular function minimization

  27. Submodularity (almost) everywhere: Graph cuts
  • Submodular function minimization

  28. Submodular function minimization: Properties
  • Let F: 2^V → R be a submodular function (such that F(∅) = 0)
  • Optimality conditions: A ⊆ V is a minimizer of F if and only if A is a minimizer of F over all subsets of A and all supersets of A
  – Proof: F(A) + F(B) ≥ F(A∪B) + F(A∩B)
  • Lattice of minimizers: if A and B are minimizers, so are A∪B and A∩B

  29. Submodular function minimization: Dual problem
  • Let F: 2^V → R be a submodular function (such that F(∅) = 0)
  • Convex duality:
  $$\min_{A \subseteq V} F(A) = \min_{w \in [0,1]^p} f(w)
    = \min_{w \in [0,1]^p} \max_{s \in B(F)} w^\top s
    = \max_{s \in B(F)} \min_{w \in [0,1]^p} w^\top s
    = \max_{s \in B(F)} s_-(V)$$
  (with s_-(V) = Σ_{k∈V} min{s_k, 0})
  • Optimality conditions: the pair (A, s) is optimal if and only if s ∈ B(F), {s < 0} ⊆ A ⊆ {s ≤ 0}, and s(A) = F(A)
  – Proof: F(A) ≥ s(A) = s(A ∩ {s < 0}) + s(A ∩ {s > 0}) ≥ s(A ∩ {s < 0}) ≥ s_-(V)

  30. Exact submodular function minimization: Combinatorial algorithms
  • Algorithms based on min_{A⊆V} F(A) = max_{s∈B(F)} s_-(V)
  • Output the subset A and a base s ∈ B(F) such that A is tight for s and {s < 0} ⊆ A ⊆ {s ≤ 0}, as a certificate of optimality
  • The best algorithms have polynomial complexity (Schrijver, 2000; Iwata et al., 2001; Orlin, 2009), typically O(p^6) or more
  • They update a sequence of convex combinations of vertices of B(F), obtained from the greedy algorithm using a specific order:
  – Based only on function evaluations
  • Recent algorithms use efficient reformulations in terms of generalized graph cuts (Jegelka et al., 2011)

  31. Exact submodular function minimization: Symmetric submodular functions
  • A submodular function F is said to be symmetric if for all B ⊆ V, F(V \ B) = F(B)
  – Then, by applying submodularity, ∀ A ⊆ V, F(A) ≥ 0
  • Examples: undirected cuts, mutual information
  • Minimization in O(p^3) over all non-trivial subsets of V (Queyranne, 1998)
  • NB: extends to the minimization of posimodular functions (Nagamochi and Ibaraki, 1998), i.e., of functions that satisfy
  $$\forall A, B \subseteq V,\quad F(A) + F(B) \ge F(A \setminus B) + F(B \setminus A)$$

  32. Approximate submodular function minimization
  • For most machine learning applications, there is no need to obtain an exact minimum
  – For convex optimization, see, e.g., Bottou and Bousquet (2008)
  $$\min_{A \subseteq V} F(A) = \min_{w \in \{0,1\}^p} f(w) = \min_{w \in [0,1]^p} f(w)$$

  33. Approximate submodular function minimization
  • For most machine learning applications, there is no need to obtain an exact minimum
  – For convex optimization, see, e.g., Bottou and Bousquet (2008)
  $$\min_{A \subseteq V} F(A) = \min_{w \in \{0,1\}^p} f(w) = \min_{w \in [0,1]^p} f(w)$$
  • Subgradients of f(w) = max_{s∈B(F)} s⊤w are obtained through the greedy algorithm
  • Projected subgradient descent to minimize f on [0,1]^p:
  – Iteration: w_t = Π_{[0,1]^p}( w_{t−1} − (C/√t) s_t ), where s_t ∈ ∂f(w_{t−1})
  – Convergence rate: f(w_t) − min_{w∈[0,1]^p} f(w) ≤ C/√t, with primal/dual guarantees (Nesterov, 2003; Bach, 2011)

  34. Approximate submodular function minimization: Projected subgradient descent
  • Assume (wlog) that ∀ k ∈ V, F({k}) ≥ 0 and F(V \ {k}) ≤ F(V)
  • Denote D² = Σ_{k∈V} [ F({k}) + F(V \ {k}) − F(V) ]
  • Iteration: w_t = Π_{[0,1]^p}( w_{t−1} − (D/√(pt)) s_t ), with s_t ∈ argmax_{s∈B(F)} w_{t−1}⊤ s (see the sketch below)
  • Proposition: t iterations of subgradient descent output a set A_t (and a certificate of optimality s_t) such that
  $$F(A_t) - \min_{B \subseteq V} F(B) \;\le\; F(A_t) - (s_t)_-(V) \;\le\; \frac{D\,p^{1/2}}{\sqrt{t}}$$
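A minimal sketch of this scheme (our own code, reusing the `greedy_base` oracle from earlier as subgradient oracle); the primal candidate is the best sup-level set of the current iterate, and the returned gap is the certificate F(A) − s_−(V) from the slide.

```python
import numpy as np

def sfm_subgradient(F, p, n_iter=1000):
    """Projected subgradient descent for min_{A subset of V} F(A).
    F maps a Python set to a float with F(set()) = 0."""
    V = set(range(p))
    D = max(sum(F({k}) + F(V - {k}) - F(V) for k in range(p)), 0.0) ** 0.5
    w = np.full(p, 0.5)
    best_A, best_val = set(), 0.0                   # empty set is a valid candidate
    for t in range(1, n_iter + 1):
        s = greedy_base(F, w)                       # subgradient of f at w
        w = np.clip(w - D / np.sqrt(p * t) * s, 0.0, 1.0)
        prefix = set()
        for j in np.argsort(-w):                    # scan all sup-level sets of w
            prefix.add(int(j))
            val = F(prefix)
            if val < best_val:
                best_A, best_val = set(prefix), val
    gap = best_val - np.minimum(s, 0.0).sum()       # F(A) - s_-(V) bounds the error
    return best_A, best_val, gap
```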

  35. Submodular optimization problems: Outline
  • Submodular function minimization
  – Properties of minimizers
  – Combinatorial algorithms
  – Approximate minimization of the Lovász extension
  • Convex optimization with the Lovász extension
  – Separable optimization problems
  – Application to submodular function minimization
  • Submodular function maximization
  – Simple algorithms with approximate optimality guarantees

  36. Separable optimization on the base polyhedron
  • Optimization of convex functions of the form Ψ(w) + f(w), with f the Lovász extension of F
  • Structured sparsity:
  – Regularized risk minimization penalized by the Lovász extension
  – Total variation denoising; isotonic regression

  37. Total variation denoising (Chambolle, 2005)
  • F(A) = Σ_{k∈A, j∈V\A} d(k, j)  ⇒  f(w) = Σ_{k,j∈V} d(k, j) (w_k − w_j)_+
  • d symmetric ⇒ f = total variation
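As a quick numerical check of this identity, one can reuse the `lovasz_extension` sketch from earlier: for a cut function on a chain graph, f reduces to the (anisotropic) total variation. The chain structure and unit weights d below are our own toy choices.

```python
import numpy as np

p = 4
d = np.zeros((p, p))
for k in range(p - 1):                  # undirected chain graph with unit weights
    d[k, k + 1] = d[k + 1, k] = 1.0

F_cut = lambda A: sum(d[k, j] for k in A for j in set(range(p)) - A)

w = np.array([0.2, 0.9, 0.4, 0.4])
tv = sum(d[k, j] * max(w[k] - w[j], 0.0) for k in range(p) for j in range(p))
print(lovasz_extension(F_cut, w), tv)   # both evaluate the total variation of w
```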

  38. Isotonic regression
  • Given real numbers x_i, i = 1, ..., p:
  – Find y ∈ R^p that minimizes (1/2) Σ_{i=1}^p (x_i − y_i)², such that ∀ i, y_i ≤ y_{i+1}
  (Figure: noisy data x and its monotonic fit y.)
  • For a directed chain, f(y) = 0 if and only if ∀ i, y_i ≤ y_{i+1}
  • Minimize (1/2) Σ_{i=1}^p (x_i − y_i)² + λ f(y) for λ large

  39. Separable optimization on the base polyhedron
  • Optimization of convex functions of the form Ψ(w) + f(w), with f the Lovász extension of F
  • Structured sparsity:
  – Regularized risk minimization penalized by the Lovász extension
  – Total variation denoising; isotonic regression

  40. Separable optimization on the base polyhedron
  • Optimization of convex functions of the form Ψ(w) + f(w), with f the Lovász extension of F
  • Structured sparsity:
  – Regularized risk minimization penalized by the Lovász extension
  – Total variation denoising; isotonic regression
  • Proximal methods (see next part of the tutorial):
  – Minimize Ψ(w) + f(w) for smooth Ψ as soon as the following "proximal" problem may be solved efficiently:
  $$\min_{w \in \mathbb{R}^p} \frac{1}{2}\|w - z\|_2^2 + f(w) = \min_{w \in \mathbb{R}^p} \sum_{k=1}^{p} \frac{1}{2}(w_k - z_k)^2 + f(w)$$
  • Submodular function minimization

  41. Separable optimization on the base polyhedron: Convex duality
  • Let ψ_j: R → R, j ∈ {1, ..., p}, be p functions. Assume:
  – Each ψ_j is strictly convex
  – sup_{α∈R} ψ′_j(α) = +∞ and inf_{α∈R} ψ′_j(α) = −∞
  – Denote by ψ*_1, ..., ψ*_p their Fenchel conjugates (which then have full domain)

  42. Separable optimization on the base polyhedron: Convex duality
  • Let ψ_j: R → R, j ∈ {1, ..., p}, be p functions. Assume:
  – Each ψ_j is strictly convex
  – sup_{α∈R} ψ′_j(α) = +∞ and inf_{α∈R} ψ′_j(α) = −∞
  – Denote by ψ*_1, ..., ψ*_p their Fenchel conjugates (which then have full domain)
  $$\min_{w \in \mathbb{R}^p} f(w) + \sum_{j=1}^{p} \psi_j(w_j)
    = \min_{w \in \mathbb{R}^p} \max_{s \in B(F)} w^\top s + \sum_{j=1}^{p} \psi_j(w_j)
    = \max_{s \in B(F)} \min_{w \in \mathbb{R}^p} w^\top s + \sum_{j=1}^{p} \psi_j(w_j)
    = \max_{s \in B(F)} - \sum_{j=1}^{p} \psi_j^*(-s_j)$$

  43. Separable optimization on the base polyhedron: Equivalence with submodular function minimization
  • For α ∈ R, let A^α ⊆ V be a minimizer of A ↦ F(A) + Σ_{j∈A} ψ′_j(α)
  • Let u be the unique minimizer of w ↦ f(w) + Σ_{j=1}^p ψ_j(w_j)
  • Proposition (Chambolle and Darbon, 2009):
  – Given A^α for all α ∈ R, then ∀ j, u_j = sup({α ∈ R, j ∈ A^α})
  – Given u, then A ↦ F(A) + Σ_{j∈A} ψ′_j(α) has minimal minimizer {u > α} and maximal minimizer {u ≥ α}
  • Separable optimization is equivalent to a sequence of submodular function minimizations

  44. Equivalence with submodular function minimization: Proof sketch (Bach, 2011)
  • Duality gap for min_{w∈R^p} f(w) + Σ_{j=1}^p ψ_j(w_j) = max_{s∈B(F)} − Σ_{j=1}^p ψ*_j(−s_j):
  $$\begin{aligned}
  & f(w) + \sum_{j=1}^{p} \psi_j(w_j) + \sum_{j=1}^{p} \psi_j^*(-s_j) \\
  &= \big[ f(w) - w^\top s \big] + \sum_{j=1}^{p} \big[ \psi_j(w_j) + \psi_j^*(-s_j) + w_j s_j \big] \\
  &= \int_{-\infty}^{+\infty} \Big[ (F + \psi'(\alpha))(\{w \ge \alpha\}) - (s + \psi'(\alpha))_-(V) \Big] \, d\alpha
  \end{aligned}$$
  • The duality gap for the convex problem is a sum (integral) of duality gaps for combinatorial problems

  45. Separable optimization on the base polyhedron: Quadratic case
  • Let F be a submodular function and w ∈ R^p the unique minimizer of w ↦ f(w) + (1/2)‖w‖²₂. Then:
  (a) s = −w is the point in B(F) with minimum ℓ₂-norm
  (b) For all λ ∈ R, the maximal minimizer of A ↦ F(A) + λ|A| is {w ≥ −λ} and the minimal minimizer is {w > −λ}
  • Consequences:
  – Threshold at 0 the minimum-norm point in B(F) to minimize F (Fujishige and Isotani, 2011)
  – Minimizing submodular functions with cardinality constraints (Nagano et al., 2011)

  46. From convex to combinatorial optimization and vice versa...
  • Solving min_{w∈R^p} Σ_{k∈V} ψ_k(w_k) + f(w) to solve min_{A⊆V} F(A):
  – Threshold the solution w at zero, provided ∀ k ∈ V, ψ′_k(0) = 0
  – For quadratic functions ψ_k(w_k) = (1/2) w_k², this is equivalent to projecting 0 onto B(F) (Fujishige, 2005): the minimum-norm-point algorithm (Fujishige and Isotani, 2011)

  47. From convex to combinatorial optimization and vice versa...
  • Solving min_{w∈R^p} Σ_{k∈V} ψ_k(w_k) + f(w) to solve min_{A⊆V} F(A):
  – Threshold the solution w at zero, provided ∀ k ∈ V, ψ′_k(0) = 0
  – For quadratic functions ψ_k(w_k) = (1/2) w_k², this is equivalent to projecting 0 onto B(F) (Fujishige, 2005): the minimum-norm-point algorithm (Fujishige and Isotani, 2011)
  • Solving min_{A⊆V} F(A) − t(A) to solve min_{w∈R^p} Σ_{k∈V} ψ_k(w_k) + f(w):
  – General decomposition strategy (Groenevelt, 1991)
  – Efficient only when submodular minimization is efficient

  48. Solving min_{A⊆V} F(A) − t(A) to solve min_{w∈R^p} Σ_{k∈V} ψ_k(w_k) + f(w)
  • General recursive divide-and-conquer algorithm (Groenevelt, 1991)
  • NB: dual version of Fujishige (2005)
  1. Compute the minimizer t ∈ R^p of Σ_{j∈V} ψ*_j(−t_j) s.t. t(V) = F(V)
  2. Compute a minimizer A of F − t
  3. If A = V, then t is optimal. Exit.
  4. Compute a minimizer s_A of Σ_{j∈A} ψ*_j(−s_j) over s ∈ B(F_A), where F_A: 2^A → R is the restriction of F to A, i.e., F_A(B) = F(B) for B ⊆ A
  5. Compute a minimizer s_{V\A} of Σ_{j∈V\A} ψ*_j(−s_j) over s ∈ B(F^A), where F^A(B) = F(A ∪ B) − F(A) for B ⊆ V\A is the contraction of F by A
  6. Concatenate s_A and s_{V\A}. Exit.
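For intuition, here is a toy recursion for the quadratic case ψ_k(w_k) = w_k²/2 (so ψ*_k(−s_k) = s_k²/2, and step 1 spreads F(V) uniformly). It is our own sketch, not the slides' implementation: step 2 uses brute-force submodular minimization, so it only makes sense for tiny ground sets.

```python
from itertools import combinations

def min_norm_base(F, V):
    """Toy divide-and-conquer (Groenevelt, 1991), quadratic case: returns the
    minimum-norm point s of B(F) as a dict {element: s_j}."""
    V = tuple(V)
    if not V:
        return {}
    c = F(set(V)) / len(V)                  # step 1: uniform t with t(V) = F(V)
    best_A, best_val = set(), 0.0           # step 2: minimize F - t (note (F-t)(V) = 0)
    for r in range(len(V) + 1):
        for B in combinations(V, r):
            val = F(set(B)) - c * len(B)
            if val < best_val - 1e-12:
                best_A, best_val = set(B), val
    if best_val >= -1e-12:                  # step 3: min is 0, attained at V => t optimal
        return {j: c for j in V}
    A, FA = best_A, F(best_A)
    s = min_norm_base(F, sorted(A))                                    # step 4: restriction
    s.update(min_norm_base(lambda B: F(A | set(B)) - FA,
                           sorted(set(V) - A)))                        # step 5: contraction
    return s                                # step 6: concatenate

print(min_norm_base(lambda A: len(A) ** 0.5, range(3)))  # symmetric F: uniform base
```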

  49. Solving min_{w∈R^p} Σ_{k∈V} ψ_k(w_k) + f(w) to solve min_{A⊆V} F(A)
  • Dual problem: max_{s∈B(F)} − Σ_{j=1}^p ψ*_j(−s_j)
  • Constrained optimization when linear functions can be maximized:
  – Frank-Wolfe algorithms
  • Two main types for convex functions

  50. Approximate quadratic optimization on B(F)
  • Goal: min_{w∈R^p} (1/2)‖w‖²₂ + f(w) = max_{s∈B(F)} −(1/2)‖s‖²₂
  • We can only maximize linear functions on B(F)
  • Two types of "Frank-Wolfe" algorithms:
  • 1. Active-set algorithm (⇔ min-norm-point)
  – Sequence of maximizations of linear functions over B(F), plus overheads (affine projections)
  – Finite convergence, but no complexity bounds

  51. Minimum-norm-point algorithm
  (Figure: six snapshots (a)-(f) of the min-norm-point algorithm on a small polytope.)

  52. Approximate quadratic optimization on B(F)
  • Goal: min_{w∈R^p} (1/2)‖w‖²₂ + f(w) = max_{s∈B(F)} −(1/2)‖s‖²₂
  • We can only maximize linear functions on B(F)
  • Two types of "Frank-Wolfe" algorithms:
  • 1. Active-set algorithm (⇔ min-norm-point)
  – Sequence of maximizations of linear functions over B(F), plus overheads (affine projections)
  – Finite convergence, but no complexity bounds
  • 2. Conditional gradient (see the sketch below)
  – Sequence of maximizations of linear functions over B(F)
  – Approximate optimality bound
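A minimal sketch of the conditional-gradient variant for min_w ½‖w‖² + f(w), equivalently projecting 0 onto B(F); it reuses the `greedy_base` oracle from earlier, and the exact line search is the standard closed form for a quadratic objective (our own code, not the slides').

```python
import numpy as np

def min_norm_point_fw(F, p, n_iter=200):
    """Conditional gradient (Frank-Wolfe) with exact line search for
    min_{s in B(F)} 0.5 * ||s||^2, whose solution s* gives w* = -s*
    minimizing f(w) + 0.5 * ||w||^2 (slide 45)."""
    s = greedy_base(F, np.zeros(p))        # any base of B(F) to start
    for _ in range(n_iter):
        v = greedy_base(F, -s)             # argmin_{v in B(F)} <s, v>
        d = v - s                          # Frank-Wolfe direction
        gap = -s.dot(d)                    # Frank-Wolfe duality gap
        if gap <= 1e-12:
            break
        gamma = min(1.0, gap / d.dot(d))   # exact line search for the quadratic
        s = s + gamma * d
    return -s                              # w = -s; threshold at 0 to minimize F
```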

  53. Conditional gradient with line search
  (Figure: nine snapshots (a)-(i) of conditional gradient with line search on a small polytope.)

  54. Approximate quadratic optimization on B(F)
  • Proposition: t steps of conditional gradient (with line search) output s_t ∈ B(F) and w_t = −s_t such that
  $$f(w_t) + \frac{1}{2}\|w_t\|_2^2 - \mathrm{OPT} \;\le\; f(w_t) + \frac{1}{2}\|w_t\|_2^2 + \frac{1}{2}\|s_t\|_2^2 \;\le\; \frac{2D^2}{t}$$

  55. Approximate quadratic optimization on B(F)
  • Proposition: t steps of conditional gradient (with line search) output s_t ∈ B(F) and w_t = −s_t such that
  $$f(w_t) + \frac{1}{2}\|w_t\|_2^2 - \mathrm{OPT} \;\le\; f(w_t) + \frac{1}{2}\|w_t\|_2^2 + \frac{1}{2}\|s_t\|_2^2 \;\le\; \frac{2D^2}{t}$$
  • Improved primal candidate through isotonic regression:
  – f(w) is linear on any set of w with fixed ordering
  – May be optimized using isotonic regression ("pool-adjacent-violators") in O(n) (see, e.g., Best and Chakravarti, 1990); a sketch follows below
  – Given w_t = −s_t, keep the ordering and reoptimize
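A minimal pool-adjacent-violators sketch (our own code) for the plain problem of projecting a sequence onto non-increasing sequences (the ordering kept from w_t); the reoptimization on the slide applies it to the linearization of f plus the quadratic term.

```python
def pav_decreasing(x):
    """Pool-adjacent-violators: least-squares projection of x onto the
    set of non-increasing sequences, in O(n) amortized time."""
    means, sizes = [], []     # blocks of (mean, size)
    for v in x:
        means.append(float(v)); sizes.append(1)
        while len(means) > 1 and means[-2] < means[-1]:  # monotonicity violated: pool
            m2, n2 = means.pop(), sizes.pop()
            m1, n1 = means.pop(), sizes.pop()
            means.append((m1 * n1 + m2 * n2) / (n1 + n2)); sizes.append(n1 + n2)
    y = []
    for m, n in zip(means, sizes):
        y.extend([m] * n)     # expand pooled blocks back to full length
    return y

print(pav_decreasing([3.0, 1.0, 2.0, 0.0]))  # -> [3.0, 1.5, 1.5, 0.0]
```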

  56. Approximate quadratic optimization on B(F)
  • Proposition: t steps of conditional gradient (with line search) output s_t ∈ B(F) and w_t = −s_t such that
  $$f(w_t) + \frac{1}{2}\|w_t\|_2^2 - \mathrm{OPT} \;\le\; f(w_t) + \frac{1}{2}\|w_t\|_2^2 + \frac{1}{2}\|s_t\|_2^2 \;\le\; \frac{2D^2}{t}$$
  • Improved primal candidate through isotonic regression:
  – f(w) is linear on any set of w with fixed ordering
  – May be optimized using isotonic regression ("pool-adjacent-violators") in O(n) (see, e.g., Best and Chakravarti, 1990)
  – Given w_t = −s_t, keep the ordering and reoptimize
  • A better bound for submodular function minimization?

  57. From quadratic optimization on B(F) to submodular function minimization
  • Proposition: if w is ε-optimal for min_{w∈R^p} (1/2)‖w‖²₂ + f(w), then at least one level set A of w is √(εp/2)-optimal for submodular function minimization
  • If ε = 2D²/t, then √(εp/2) = Dp^{1/2}/√t ⇒ no provable gains, but:
  – Bound on the iterates A_t (with additional assumptions)
  – Possible thresholding for acceleration

  58. From quadratic optimization on B(F) to submodular function minimization
  • Proposition: if w is ε-optimal for min_{w∈R^p} (1/2)‖w‖²₂ + f(w), then at least one level set A of w is √(εp/2)-optimal for submodular function minimization
  • If ε = 2D²/t, then √(εp/2) = Dp^{1/2}/√t ⇒ no provable gains, but:
  – Bound on the iterates A_t (with additional assumptions)
  – Possible thresholding for acceleration
  • Lower complexity bound for SFM:
  – Proposition: no algorithm based only on a sequence of greedy algorithms obtained from linear combinations of bases can improve on the subgradient bound (after p/2 iterations)

  59. Simulations on standard benchmark: "DIMACS Genrmf-wide", p = 430
  • Submodular function minimization
  – (Left) optimal value minus dual function values (s_t)_−(V) (in dashed, certified duality gap)
  – (Right) primal function values F(A_t) minus optimal value
  (Figure: log₁₀ gaps vs. number of iterations for min-norm-point, cond-grad, cond-grad-w, cond-grad-1/t, and subgrad-des.)

  60. Simulations on standard benchmark: "DIMACS Genrmf-long", p = 575
  • Submodular function minimization
  – (Left) optimal value minus dual function values (s_t)_−(V) (in dashed, certified duality gap)
  – (Right) primal function values F(A_t) minus optimal value
  (Figure: log₁₀ gaps vs. number of iterations for min-norm-point, cond-grad, cond-grad-w, cond-grad-1/t, and subgrad-des.)

  61. Simulations on standard benchmark
  • Separable quadratic optimization
  – (Left) optimal value minus dual function values −(1/2)‖s_t‖²₂ (in dashed, certified duality gap)
  – (Right) primal function values f(w_t) + (1/2)‖w_t‖²₂ minus optimal value (in dashed, before the pool-adjacent-violators correction)
  (Figure: log₁₀ gaps vs. number of iterations for min-norm-point, cond-grad, and cond-grad-1/t.)

  62. Submodularity (almost) everywhere: Sensor placement
  • Each sensor covers a certain area (Krause and Guestrin, 2005)
  – Goal: maximize coverage
  • Submodular function maximization
  • Extension to experimental design (Seeger, 2009)

  63. Submodular function maximization
  • Occurs in various forms in applications, but is NP-hard
  • Unconstrained maximization: Feige et al. (2007) show that for non-negative functions, a random subset already achieves at least 1/4 of the optimal value, while local search techniques achieve at least 1/2
  • Maximizing non-decreasing submodular functions under a cardinality constraint:
  – The greedy algorithm achieves (1 − 1/e) of the optimal value (see the sketch below)
  – Proof: Nemhauser et al. (1978)
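A minimal sketch of the greedy algorithm for max_{|A| ≤ k} F(A) with F non-decreasing submodular (our own code; standard lazy-evaluation speedups are omitted, and the toy coverage instance is made up for illustration).

```python
def greedy_max(F, V, k):
    """Greedy maximization of a non-decreasing submodular F under |A| <= k.
    Achieves at least (1 - 1/e) of the optimum (Nemhauser et al., 1978)."""
    A = set()
    for _ in range(k):
        gains = {j: F(A | {j}) - F(A) for j in set(V) - A}
        j_best = max(gains, key=gains.get)
        if gains[j_best] <= 0:             # no element improves F: stop early
            break
        A.add(j_best)
    return A

# Toy coverage example: each element covers a set of targets, F = total coverage.
cover = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c"}, 3: {"d"}}
F_cov = lambda A: len(set().union(*(cover[j] for j in A)) if A else set())
print(greedy_max(F_cov, cover.keys(), 2))  # a 3-target set, e.g. {0, 3}
```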

  64. Maximization with cardinality constraint
  • Let A* = {b_1, ..., b_k} be a maximizer of F with k elements, let a_j be the j-th element selected by the greedy algorithm, A_j = {a_1, ..., a_j}, and ρ_j = F({a_1,...,a_j}) − F({a_1,...,a_{j−1}}). Then:
  $$\begin{aligned}
  F(A^*) &\le F(A^* \cup A_{j-1}) && \text{because } F \text{ is non-decreasing} \\
  &= F(A_{j-1}) + \sum_{i=1}^{k} \big[ F(A_{j-1} \cup \{b_1,\dots,b_i\}) - F(A_{j-1} \cup \{b_1,\dots,b_{i-1}\}) \big] \\
  &\le F(A_{j-1}) + \sum_{i=1}^{k} \big[ F(A_{j-1} \cup \{b_i\}) - F(A_{j-1}) \big] && \text{by submodularity} \\
  &\le F(A_{j-1}) + k\rho_j && \text{by definition of the greedy algorithm} \\
  &= \sum_{i=1}^{j-1} \rho_i + k\rho_j
  \end{aligned}$$
  • Minimizing Σ_{i=1}^k ρ_i subject to these constraints yields the worst case ρ_j = (k−1)^{j−1} k^{−j} F(A*), hence F(A_k) = Σ_{i=1}^k ρ_i ≥ [1 − (1 − 1/k)^k] F(A*) ≥ (1 − 1/e) F(A*)

  65. Submodular optimization problems: Summary
  • Submodular function minimization
  – Properties of minimizers
  – Combinatorial algorithms
  – Approximate minimization of the Lovász extension
  • Convex optimization with the Lovász extension
  – Separable optimization problems
  – Application to submodular function minimization
  • Submodular function maximization
  – Simple algorithms with approximate optimality guarantees

  66. Outline
  1. Submodular functions
  – Definitions
  – Examples of submodular functions
  – Links with convexity through the Lovász extension
  2. Submodular optimization
  – Minimization
  – Links with convex optimization
  – Maximization
  3. Structured sparsity-inducing norms
  – Norms with overlapping groups
  – Relaxation of the penalization of supports by submodular functions

  67. Sparsity in supervised machine learning
  • Observed data (x_i, y_i) ∈ R^p × R, i = 1, ..., n
  – Response vector y = (y_1, ..., y_n)⊤ ∈ R^n
  – Design matrix X = (x_1, ..., x_n)⊤ ∈ R^{n×p}
  • Regularized empirical risk minimization:
  $$\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, w^\top x_i) + \lambda\,\Omega(w) = \min_{w \in \mathbb{R}^p} L(y, Xw) + \lambda\,\Omega(w)$$
  • Norm Ω to promote sparsity:
  – square loss + ℓ₁-norm ⇒ basis pursuit in signal processing (Chen et al., 2001), Lasso in statistics/machine learning (Tibshirani, 1996)
  – Proxy for interpretability
  – Allows high-dimensional inference: log p = O(n)
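To connect with the proximal problem of slide 40, here is a minimal proximal-gradient (ISTA) sketch for the Lasso, where the prox of the ℓ₁-norm is coordinate-wise soft-thresholding; the step size, regularization level, and synthetic data below are illustrative assumptions, not from the slides.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for min_w 1/(2n) * ||y - Xw||^2 + lam * ||w||_1."""
    n, p = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)     # 1 / Lipschitz constant of the gradient
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n           # gradient of the smooth part
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
w_true = np.zeros(20); w_true[:3] = [2.0, -1.5, 1.0]   # sparse ground truth
y = X @ w_true + 0.1 * rng.standard_normal(50)
print(np.round(lasso_ista(X, y, lam=0.1), 2))          # mostly zeros off the support
```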

  68. Sparsity in unsupervised machine learning
  • Multiple responses/signals y = (y^1, ..., y^k) ∈ R^{n×k}:
  $$\min_{X = (x^1, \dots, x^p)}\ \min_{w^1, \dots, w^k \in \mathbb{R}^p}\ \sum_{j=1}^{k} \big[ L(y^j, X w^j) + \lambda\,\Omega(w^j) \big]$$

  69. Sparsity in unsupervised machine learning
  • Multiple responses/signals y = (y^1, ..., y^k) ∈ R^{n×k}:
  $$\min_{X = (x^1, \dots, x^p)}\ \min_{w^1, \dots, w^k \in \mathbb{R}^p}\ \sum_{j=1}^{k} \big[ L(y^j, X w^j) + \lambda\,\Omega(w^j) \big]$$
  • Only the responses are observed ⇒ dictionary learning
  – Learn X = (x^1, ..., x^p) ∈ R^{n×p} such that ∀ j, ‖x^j‖₂ ≤ 1
  – Olshausen and Field (1997); Elad and Aharon (2006); Mairal et al. (2009a)
  • Sparse PCA: replace ‖x^j‖₂ ≤ 1 by Θ(x^j) ≤ 1
