Selective Linearization Method for Statistical Learning Problems


  1. Selective Linearization Method for Statistical Learning Problems. Yu Du (yu.du@ucdenver.edu), joint work with Andrzej Ruszczynski. DIMACS Workshop on ADMM and Proximal Splitting Methods in Optimization, June 2018.

  2. Agenda
  1. Introduction to multi-block convex non-smooth optimization: motivating examples; problem formulation.
  2. Review of related existing methods: proximal point and operator splitting methods; bundle method; alternating linearization method (ALIN).
  3. Selective linearization (SLIN) for multi-block convex optimization: the SLIN method; global convergence; convergence rate.
  4. Numerical illustration: three-block fused lasso; overlapping group lasso; regularized support vector machine problem.
  5. Conclusions and ongoing work.

  3. Motivating examples for multi-block structured regularization
  min_{x ∈ ℝ^n} F(x) = f(x) + Σ_{i=1}^{N} h_i(B_i x)
  Figure: [Demiralp et al., 2013]. Figure: [Zhou et al., 2015].

  4. Motivating examples for multi-block structured regularization
  min_{S ∈ 𝒮} ‖P_Ω(S) − P_Ω(A)‖_F² + γ‖RS‖_1 + τ‖S‖_*
  min_w (1/2)‖b − Aw‖_2² + λ Σ_{j=1}^{K} d_j ‖w_{T_j}‖
  Figure: [Zhou et al., 2015]. Figure: [Demiralp et al., 2013].

  5. Problem formulation for multi-block convex optimization
  min_{x ∈ ℝ^n} F(x) = f_1(x) + f_2(x) + … + f_N(x),
  where f_1, f_2, …, f_N : ℝ^n → ℝ are convex functions. We introduced the selective linearization (SLIN) algorithm for multi-block non-smooth convex optimization. Global convergence is guaranteed, and an almost O(1/k) convergence rate holds when only one of the N functions is strongly convex, where k is the iteration number (Du, Lin, and Ruszczynski, 2017).
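As a concrete instance, the three-block fused lasso named in the agenda can be split as follows (a standard splitting, shown here as an assumed illustration rather than a quotation of the slides):

```latex
\min_{x \in \mathbb{R}^n}\;
\underbrace{\tfrac{1}{2}\|Ax - b\|_2^2}_{f_1(x)}
\;+\;
\underbrace{\lambda_1 \|x\|_1}_{f_2(x)}
\;+\;
\underbrace{\lambda_2 \sum_{i=2}^{n} |x_i - x_{i-1}|}_{f_3(x)}
```

Here f_1 is smooth (and strongly convex when A has full column rank), while f_2 and f_3 are convex and nonsmooth, so N = 3 and only one of the blocks needs to be strongly convex for the stated rate.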

  6. Review of related existing methods
  The selective linearization algorithm builds on ideas of the proximal point algorithm (Rockafellar, 1976); of operator splitting methods (Douglas and Rachford, 1956), (Lions and Mercier, 1979), developed further by (Eckstein and Bertsekas, 1992) and (Bauschke and Combettes, 2011); of bundle methods (Kiwiel, 1985), (Ruszczynski, 2006); and of the alternating linearization method (Kiwiel, Rosa, and Ruszczynski, 1999).
  Proximal point method: for min_{x ∈ ℝ^n} F(x), where F : ℝ^n → ℝ is a convex function, construct the proximal step
  prox_F(x^k) = argmin_x { F(x) + (ρ/2)‖x − x^k‖² }
  and iterate x^{k+1} = prox_F(x^k), k = 1, 2, …. The iterates are known to converge to the minimum of F(·) (Rockafellar, 1976).
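A minimal sketch of the proximal point iteration, assuming the concrete choice F(x) = ‖x‖_1 so that the proximal step has the closed form of soft-thresholding (this choice of F and the parameter values are illustrative assumptions, not from the talk):

```python
import numpy as np

def prox_l1(v, t):
    """Proximal step for F(x) = ||x||_1: argmin_x ||x||_1 + (1/(2t))*||x - v||^2 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_point(x0, rho=1.0, max_iter=50):
    """Proximal point iterations x_{k+1} = prox_F(x_k) for F(x) = ||x||_1."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        # argmin_x ||x||_1 + (rho/2)||x - x_k||^2, i.e. soft-thresholding with threshold 1/rho
        x = prox_l1(x, 1.0 / rho)
    return x

print(proximal_point([3.0, -2.0, 0.5]))  # the iterates converge to the minimizer of F, here the origin
```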

  7. Bundle method
  The main idea of the bundle method (Kiwiel, 1985), (Ruszczynski, 2006) is to replace the problem min_{x ∈ ℝ^n} F(x) with a sequence of approximate problems of the form (cutting-plane approximation)
  min_{x ∈ ℝ^n} F̃^k(x) + (ρ/2)‖x − x^k‖².
  Here F̃^k(·) is a piecewise linear convex lower approximation of the function F(·):
  F̃^k(x) = max_{j ∈ J^k} { F(z^j) + ⟨g^j, x − z^j⟩ },
  with some earlier generated solutions z^j and subgradients g^j ∈ ∂F(z^j), j ∈ J^k, where J^k ⊆ {1, …, k}. The solution of the proximal step is subject to a sufficient improvement test, which decides whether the proximal center is moved to the current solution.

  8. Bundle method (cont.)
  F̃^k(x) = max_{j ∈ J^k} { F(z^j) + ⟨g^j, x − z^j⟩ }

  9. Bundle method (cont.)
  Bundle method with multiple cuts
  Step 1 (initialization): Set k = 1, J^1 = {1}, z^1 = x^1, and select g^1 ∈ ∂F(z^1). Choose a parameter β ∈ (0, 1) and a stopping precision ε > 0.
  Step 2: z^{k+1} ← argmin_x { F̃^k(x) + (ρ/2)‖x − x^k‖² }.
  Step 3: If F(x^k) − F̃^k(z^{k+1}) ≤ ε, stop; otherwise continue.
  Step 4 (update test): If F(z^{k+1}) ≤ F(x^k) − β (F(x^k) − F̃^k(z^{k+1})), set x^{k+1} = z^{k+1} (descent step); otherwise set x^{k+1} = x^k (null step).
  Step 5: Select a set J^{k+1} so that J^k ∪ {k+1} ⊇ J^{k+1} ⊇ {k+1} ∪ { j ∈ J^k : F(z^j) + ⟨g^j, z^{k+1} − z^j⟩ = F̃^k(z^{k+1}) }. Increase k by 1 and go to Step 2.
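A hedged sketch of the multiple-cut bundle method above; the master problem in Step 2 is handed to cvxpy, Step 5 simply keeps all cuts, and the test problem, parameter values, and use of cvxpy are illustrative assumptions rather than part of the talk:

```python
import numpy as np
import cvxpy as cp

def bundle_multiple_cuts(F, subgrad, x1, rho=1.0, beta=0.5, eps=1e-6, max_iter=100):
    """Bundle method with multiple cuts for min F(x), F convex and possibly nonsmooth."""
    x = np.array(x1, dtype=float)
    bundle = [(x.copy(), np.asarray(subgrad(x), dtype=float))]   # pairs (z_j, g_j)
    for _ in range(max_iter):
        # Step 2: min_x  max_j { F(z_j) + <g_j, x - z_j> } + (rho/2)||x - x_k||^2
        z = cp.Variable(x.size)
        cuts = cp.hstack([gj @ (z - zj) + float(F(zj)) for zj, gj in bundle])
        cp.Problem(cp.Minimize(cp.max(cuts) + (rho / 2) * cp.sum_squares(z - x))).solve()
        z_new = np.asarray(z.value).ravel()
        model = max(F(zj) + gj @ (z_new - zj) for zj, gj in bundle)   # value of the cutting-plane model
        if F(x) - model <= eps:                       # Step 3: stopping test
            break
        if F(z_new) <= F(x) - beta * (F(x) - model):  # Step 4: descent step, else null step
            x = z_new
        bundle.append((z_new, np.asarray(subgrad(z_new), dtype=float)))  # Step 5: keep all cuts
    return x

# example: minimize F(x) = ||x - c||_1
c = np.array([1.0, -2.0, 0.5])
F = lambda x: float(np.sum(np.abs(x - c)))
subgrad = lambda x: np.sign(x - c)
print(bundle_multiple_cuts(F, subgrad, np.zeros(3)))   # approaches c
```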

  10. Bundle method (cont.)
  Bundle method with cut aggregation
  Step 1 (initialization): Set k = 1, J^1 = {1}, z^1 = x^1, and select g^1 ∈ ∂F(z^1). Choose a parameter β ∈ (0, 1) and a stopping precision ε > 0.
  Step 2: z^{k+1} ← argmin_x { F̃^k(x) + (ρ/2)‖x − x^k‖² }, where F̃^k(x) = max { F̄^k(x), F(z^k) + ⟨g^k, x − z^k⟩ }.
  Step 3: If F(x^k) − F̃^k(z^{k+1}) ≤ ε, stop; otherwise continue.
  Step 4: If F(z^{k+1}) ≤ F(x^k) − β (F(x^k) − F̃^k(z^{k+1})), set x^{k+1} = z^{k+1} (descent step); otherwise set x^{k+1} = x^k (null step).
  Step 5: Define F̄^{k+1}(x) = θ_k F̄^k(x) + (1 − θ_k) (F(z^k) + ⟨g^k, x − z^k⟩), where θ_k ∈ [0, 1] is such that the gradient of F̄^{k+1}(·) equals the subgradient of F̃^k(·) at z^{k+1} that satisfies the optimality conditions of the subproblem. Increase k by 1 and go to Step 2.
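One way to see where θ_k comes from (a standard argument sketched here to fill in a step the slide leaves implicit, not a quotation of the talk): the optimality condition of the Step 2 subproblem identifies an aggregate subgradient of the model at z^{k+1},

```latex
s^k := \rho\,(x^k - z^{k+1}) \in \partial \tilde{F}^k(z^{k+1}),
\qquad
s^k = \theta_k \nabla \bar{F}^k + (1 - \theta_k)\, g^k
\quad \text{for some } \theta_k \in [0, 1],
```

because F̃^k is the pointwise maximum of two affine pieces, so its subdifferential at z^{k+1} is the convex hull of the gradients of the active pieces. Defining F̄^{k+1} with exactly this θ_k makes its gradient equal to s^k.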

  11. Bundle method (cont.)
  Convergence of the bundle method (in both versions) for convex functions is well known (Kiwiel, 1985), (Ruszczynski, 2006).
  Theorem. Suppose Argmin F ≠ ∅ and ε = 0. Then a point x* ∈ Argmin F exists such that lim_{k→∞} x^k = lim_{k→∞} z^k = x*.
  Convergence rate: (Du and Ruszczynski, 2017) proved that the bundle method for nonsmooth optimization achieves solution accuracy ε in at most O(ln(1/ε)/ε) iterations if the function is strongly convex. The result holds for both the multiple-cut and the cut-aggregation versions of the method.

  12. Operator splitting for two-block convex optimization
  Operator splitting: min_{x ∈ ℝ^n} f(x) + h(x). The solution x̂ satisfies 0 ∈ ∂f(x̂) + ∂h(x̂), where we can consider the two subdifferentials as two maximal monotone operators M_1 and M_2 on the space ℝ^n: 0 ∈ (M_1 + M_2)(x̂).
  Standard ADMM: min_{x ∈ ℝ^n, y ∈ ℝ^m} f(x) + h(y) s.t. Mx − y = 0. ADMM for this problem takes the following form, for some scalar parameter c > 0:
  x^{k+1} ∈ argmin_{x ∈ ℝ^n} { f(x) + h(y^k) + ⟨λ^k, Mx − y^k⟩ + (c/2)‖Mx − y^k‖² }
  y^{k+1} ∈ argmin_{y ∈ ℝ^m} { f(x^{k+1}) + h(y) + ⟨λ^k, Mx^{k+1} − y⟩ + (c/2)‖Mx^{k+1} − y‖² }
  λ^{k+1} = λ^k + c (Mx^{k+1} − y^{k+1}).
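A minimal sketch of the ADMM iteration above, specialized to the lasso with f(x) = (1/2)‖Ax − b‖², h(y) = λ‖y‖_1 and M = I, so that both minimizations have closed forms (the problem instance and parameter values are assumptions for illustration, not from the slides):

```python
import numpy as np

def admm_lasso(A, b, lam, c=1.0, max_iter=200):
    """ADMM for min 0.5*||Ax - b||^2 + lam*||y||_1  s.t.  x - y = 0  (i.e. M = I)."""
    n = A.shape[1]
    x, y, lmbda = np.zeros(n), np.zeros(n), np.zeros(n)   # lmbda is the dual variable
    K = A.T @ A + c * np.eye(n)                            # same matrix in every x-update
    Atb = A.T @ b
    for _ in range(max_iter):
        # x-update: argmin_x 0.5||Ax-b||^2 + <lmbda, x - y> + (c/2)||x - y||^2
        x = np.linalg.solve(K, Atb + c * y - lmbda)
        # y-update: argmin_y lam||y||_1 + <lmbda, x - y> + (c/2)||x - y||^2  (soft-thresholding)
        v = x + lmbda / c
        y = np.sign(v) * np.maximum(np.abs(v) - lam / c, 0.0)
        # dual update
        lmbda = lmbda + c * (x - y)
    return y

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20); x_true[:3] = [2.0, -1.0, 0.5]
b = A @ x_true + 0.01 * rng.standard_normal(50)
print(admm_lasso(A, b, lam=0.1))   # approximately recovers the sparse x_true
```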

  13. Alternating linearization method for two-block convex optimization
  The alternating linearization method (ALIN) (Kiwiel, Rosa, and Ruszczynski, 1999) adapted ideas of the operator splitting methods and bundle methods and proved global convergence. ALIN (Lin, Pham, and Ruszczynski, 2014) has been successfully applied to two-block structured regularization problems in statistical learning.
  Algorithm: Alternating Linearization (ALIN) for min_{x ∈ ℝ^n} f(x) + h(x)
  1: repeat
  2:   x̃_h ← argmin { f̃(x) + h(x) + (1/2)‖x − x̂‖²_D }
  3:   g_h ← −g_f − D(x̃_h − x̂)
  4:   if (update test for x̃_h) then x̂ ← x̃_h end if
  5:   x̃_f ← argmin { f(x) + h̃(x) + (1/2)‖x − x̂‖²_D }
  6:   g_f ← −g_h − D(x̃_f − x̂)
  7:   if (update test for x̃_f) then x̂ ← x̃_f end if
  8: until (stopping test)
  Here f̃(x) and h̃(x) are the linear approximations of f(x) and h(x), and D is a positive definite diagonal matrix.
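A hedged Python sketch of the ALIN loop above for the same lasso instance (f(x) = (1/2)‖Ax − b‖² smooth, h(x) = λ‖x‖_1, D = dI), so that both subproblems have closed forms; the simple sufficient-decrease form of the update test below is an assumption standing in for the talk's test, not a quotation of it:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def alin_lasso(A, b, lam, d=1.0, beta=0.5, max_iter=200):
    """Sketch of ALIN for min f(x) + h(x) with f(x) = 0.5||Ax-b||^2, h(x) = lam*||x||_1, D = d*I."""
    n = A.shape[1]
    f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
    h = lambda x: lam * np.sum(np.abs(x))
    F = lambda x: f(x) + h(x)
    x_hat = np.zeros(n)
    z_f, g_f = x_hat.copy(), A.T @ (A @ x_hat - b)    # linearization point and gradient of f
    K = A.T @ A + d * np.eye(n)
    for _ in range(max_iter):
        # h-subproblem: argmin f~(x) + h(x) + (d/2)||x - x_hat||^2  (soft-thresholding)
        x_h = soft_threshold(x_hat - g_f / d, lam / d)
        g_h = -g_f - d * (x_h - x_hat)                # subgradient of h identified at x_h
        model = f(z_f) + g_f @ (x_h - z_f) + h(x_h)   # f~(x_h) + h(x_h)
        if F(x_h) <= F(x_hat) - beta * (F(x_hat) - model):   # simplified update test (assumption)
            x_hat = x_h
        # f-subproblem: argmin f(x) + h~(x) + (d/2)||x - x_hat||^2  (regularized least squares)
        x_f = np.linalg.solve(K, A.T @ b - g_h + d * x_hat)
        z_f, g_f = x_f, -g_h - d * (x_f - x_hat)      # gradient of f identified at x_f
        model = f(x_f) + h(x_h) + g_h @ (x_f - x_h)   # f(x_f) + h~(x_f)
        if F(x_f) <= F(x_hat) - beta * (F(x_hat) - model):
            x_hat = x_f
    return x_hat
```

The subgradients g_h and g_f exchanged between the two subproblems play the role of the dual information in operator splitting.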

  14. Selective linearization (SLIN) algorithm
  Objective: min F(x) = Σ_{i=1}^{N} f_i(x), where f_1, f_2, …, f_N : ℝ^n → ℝ are convex functions. They can be nonsmooth.
  Every iteration, we choose an index j according to a selection rule (the largest gap between the function value and its linear approximation) and solve the f_j-subproblem
  min_x f_j(x) + Σ_{i ≠ j} f̃_i^k(x) + (1/2)‖x − x^k‖²_D.
  Each f̃_i^k(x) is a first-order linearization of f_i(x), and x^k is a proximal center that is updated over the iterations. After the f_j-subproblem is solved, f_j is linearized using its subgradient at the current solution z_j^k, in preparation for the next subproblem: f̃_i^k(x) = f_i(z_i^k) + ⟨g_i^k, x − z_i^k⟩. We denote the approximating function F̃^k(x) = f_j(x) + Σ_{i ≠ j} f̃_i^k(x).
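A hedged Python sketch of the SLIN iteration above for prox-friendly blocks, with D = ρI. The slide states the selection rule and the center update only informally, so two details below are assumptions: the gap f_i − f̃_i is measured at the latest subproblem solution, and the proximal center is moved under a simple sufficient-decrease test.

```python
import numpy as np

def slin(fs, proxes, x0, rho=1.0, beta=0.5, max_iter=200):
    """Sketch of selective linearization for min sum_i f_i(x).
    fs[i](x) returns f_i(x); proxes[i](v, t) returns argmin_x f_i(x) + (1/(2t))*||x - v||^2."""
    N = len(fs)
    x_hat = np.array(x0, dtype=float)
    # initialize each linearization with one proximal step per block (an assumption),
    # so that every g[i] is a valid subgradient of f_i at z_pts[i]
    z_pts = [proxes[i](x_hat, 1.0 / rho) for i in range(N)]
    g = [rho * (x_hat - z_pts[i]) for i in range(N)]
    z = z_pts[-1].copy()
    F = lambda x: sum(fi(x) for fi in fs)
    f_tilde = lambda i, x: fs[i](z_pts[i]) + g[i] @ (x - z_pts[i])
    for _ in range(max_iter):
        # selection rule: block with the largest gap f_i(z) - f~_i(z) at the latest solution
        j = max(range(N), key=lambda i: fs[i](z) - f_tilde(i, z))
        # f_j-subproblem: min f_j(x) + sum_{i != j} f~_i(x) + (rho/2)||x - x_hat||^2,
        # which reduces to a proximal step on f_j at a shifted point
        s = sum(g[i] for i in range(N) if i != j)
        z = proxes[j](x_hat - s / rho, 1.0 / rho)
        g[j] = rho * (x_hat - z) - s           # subgradient of f_j at z (optimality condition)
        z_pts[j] = z.copy()
        # proximal-center update: simple sufficient-decrease test (assumption)
        model = fs[j](z) + sum(f_tilde(i, z) for i in range(N) if i != j)
        if F(z) <= F(x_hat) - beta * (F(x_hat) - model):
            x_hat = z.copy()
    return x_hat

# example with two prox-friendly blocks: f_1(x) = 0.5*||x - a||^2 and f_2(x) = ||x||_1
a = np.array([1.0, -3.0, 0.0])
f1, prox1 = lambda x: 0.5 * np.sum((x - a) ** 2), lambda v, t: (t * a + v) / (t + 1.0)
f2, prox2 = lambda x: np.sum(np.abs(x)), lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
print(slin([f1, f2], [prox1, prox2], np.zeros(3)))   # approaches the soft-thresholding of a, i.e. [0, -2, 0]
```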
