
Inertial Block Proximal Methods for Non-Convex Non-Smooth Optimization



  1. Inertial Block Proximal Methods for Non-Convex Non-Smooth Optimization. L. T. K. Hien (University of Mons), N. Gillis (University of Mons), P. Patrinos (KU Leuven). The 37th International Conference on Machine Learning (ICML 2020).

  2. Overview
     1. Problem set-up: motivation; block coordinate descent methods
     2. The proposed methods: IBP and IBPG; extension to Bregman divergence
     3. Convergence analysis: subsequential convergence; global convergence
     4. Application to NMF
     5. Preliminary numerical results

  3. Problem set-up. We consider the following non-smooth non-convex optimization problem
     $$\min_{x \in E} F(x), \quad \text{where } F(x) := f(x) + g(x), \qquad (1)$$
     and $x$ is partitioned into $s$ blocks/groups of variables: $x = (x_1, \dots, x_s) \in E = E_1 \times \dots \times E_s$, with $E_i$, $i = 1, \dots, s$, being finite-dimensional real linear spaces equipped with the norm $\|\cdot\|_{(i)}$ and the inner product $\langle \cdot, \cdot \rangle_{(i)}$; $f : E \to \mathbb{R}$ is a continuous but possibly non-smooth non-convex function; and $g(x) = \sum_{i=1}^{s} g_i(x_i)$, where the $g_i : E_i \to \mathbb{R} \cup \{+\infty\}$, $i = 1, \dots, s$, are proper and lower semi-continuous functions.

  4. Nonnegative matrix factorization – A motivation. NMF: given $X \in \mathbb{R}_+^{m \times n}$ and an integer $r < \min(m, n)$, solve
     $$\min_{U, V} \ \tfrac{1}{2} \|X - UV\|_F^2 \quad \text{such that } U \in \mathbb{R}_+^{m \times r} \text{ and } V \in \mathbb{R}_+^{r \times n}, \text{ i.e., } U \geq 0, \ V \geq 0.$$
     NMF is a key problem in data analysis and machine learning, with applications in image processing, document classification, hyperspectral unmixing, and audio source separation.

  5. Nonnegative matrix factorization – A motivation. The NMF problem above can be cast in the form (1) in two ways.
     With two blocks: let $f(U, V) = \tfrac{1}{2}\|X - UV\|_F^2$, $g_1(U) = I_{\mathbb{R}_+^{m \times r}}(U)$, and $g_2(V) = I_{\mathbb{R}_+^{r \times n}}(V)$. NMF is rewritten as
     $$\min_{U, V} \ f(U, V) + g_1(U) + g_2(V).$$
     With $2r$ blocks (the columns of $U$ and the rows of $V$): let $f(U_{:1}, \dots, U_{:r}, V_{1:}, \dots, V_{r:}) = \tfrac{1}{2}\big\|X - \sum_{i=1}^{r} U_{:i} V_{i:}\big\|_F^2$, $g_i(U_{:i}) = I_{\mathbb{R}_+^{m}}(U_{:i})$ and $g_{i+r}(V_{i:}) = I_{\mathbb{R}_+^{n}}(V_{i:})$ for $i = 1, \dots, r$. NMF is rewritten as
     $$\min_{U, V} \ f(U_{:1}, \dots, U_{:r}, V_{1:}, \dots, V_{r:}) + \sum_{i=1}^{r} g_i(U_{:i}) + \sum_{i=r+1}^{2r} g_i(V_{(i-r):}).$$
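To make the two block partitions concrete, here is a minimal NumPy sketch (not from the slides; the dimensions and random data are purely illustrative) that evaluates the same NMF objective through the full-matrix view $f(U, V)$ and through the rank-one sum over the $2r$ blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 50, 40, 5                      # illustrative sizes
X = rng.random((m, n))                   # data matrix X >= 0
U = rng.random((m, r))                   # block 1 (or its columns U[:, i], i = 1..r)
V = rng.random((r, n))                   # block 2 (or its rows V[i, :], i = 1..r)

# Two-block view: f(U, V) = 1/2 ||X - U V||_F^2
f_two_blocks = 0.5 * np.linalg.norm(X - U @ V, "fro") ** 2

# 2r-block view: the same residual written as a sum of rank-one terms
approx = sum(np.outer(U[:, i], V[i, :]) for i in range(r))
f_rank_one = 0.5 * np.linalg.norm(X - approx, "fro") ** 2

assert np.isclose(f_two_blocks, f_rank_one)  # same objective, different block structure
```

The choice of partition matters algorithmically: with $2r$ blocks each subproblem involves a single column or row, which is cheaper and often admits a simpler update.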

  6. Non-negative approximate canonical polyadic decomposition (NCPD). We consider the following NCPD problem: given a non-negative tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}$ and a specified rank $r$, solve
     $$\min_{X^{(1)}, \dots, X^{(N)}} \ f := \tfrac{1}{2}\big\|\mathcal{T} - X^{(1)} \circ \dots \circ X^{(N)}\big\|_F^2 \quad \text{such that } X^{(n)} \in \mathbb{R}_+^{I_n \times r}, \ n = 1, \dots, N, \qquad (2)$$
     where the Frobenius norm of a tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}$ is defined as $\|\mathcal{T}\|_F^2 = \sum_{i_1, \dots, i_N} \mathcal{T}_{i_1 i_2 \dots i_N}^2$, and the tensor product $\mathcal{X} = X^{(1)} \circ \dots \circ X^{(N)}$ is defined as $\mathcal{X}_{i_1 i_2 \dots i_N} = \sum_{j=1}^{r} X^{(1)}_{i_1 j} X^{(2)}_{i_2 j} \cdots X^{(N)}_{i_N j}$ for $i_n \in \{1, \dots, I_n\}$, $n = 1, \dots, N$. Here $X^{(n)}_{ij}$ is the $(i, j)$-th element of $X^{(n)}$. Let $g_i(X^{(i)}) = I_{\mathbb{R}_+^{I_i \times r}}(X^{(i)})$. NCPD is rewritten as
     $$\min_{X^{(1)}, \dots, X^{(N)}} \ f(X^{(1)}, \dots, X^{(N)}) + \sum_{i=1}^{N} g_i(X^{(i)}).$$
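As a sanity check on the notation, here is a small NumPy sketch (illustrative sizes, not from the slides) that builds the CP reconstruction $X^{(1)} \circ \dots \circ X^{(N)}$ via the elementwise definition above for $N = 3$ and evaluates the NCPD objective.

```python
import numpy as np

rng = np.random.default_rng(0)
I1, I2, I3, r = 6, 5, 4, 3                                # order N = 3, rank r
T = rng.random((I1, I2, I3))                              # non-negative data tensor
X1, X2, X3 = (rng.random((I, r)) for I in (I1, I2, I3))   # factors X^(n) >= 0

# CP reconstruction: X[i1, i2, i3] = sum_j X1[i1, j] * X2[i2, j] * X3[i3, j]
X = np.einsum("aj,bj,cj->abc", X1, X2, X3)

# NCPD objective: f = 1/2 || T - X^(1) o X^(2) o X^(3) ||_F^2
f = 0.5 * np.sum((T - X) ** 2)
print(f)
```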

  7. Block Coordinate Descent Methods.
     Algorithm 1: General framework of BCD methods.
     1: Initialize: choose an initial point $x^{(0)}$ and other parameters.
     2: for $k = 1, \dots$ do
     3:   for $i = 1, \dots, s$ do
     4:     Fix the latest values of the blocks $j \neq i$: $\big(x_1^{(k)}, \dots, x_{i-1}^{(k)}, x_i, x_{i+1}^{(k-1)}, \dots, x_s^{(k-1)}\big)$.
     5:     Update block $i$ to get $\big(x_1^{(k)}, \dots, x_{i-1}^{(k)}, x_i^{(k)}, x_{i+1}^{(k-1)}, \dots, x_s^{(k-1)}\big)$.
     6:   end for
     7: end for
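A minimal Python skeleton of this framework, assuming a hypothetical `update_block(x, i, k)` routine that performs whichever block update is used (classical, proximal, or proximal gradient; see the next slides); only the loop structure comes from Algorithm 1.

```python
def bcd(x0, update_block, s, num_outer_iters):
    """General BCD framework (Algorithm 1): cycle over the s blocks, always
    using the latest values of the other blocks when updating block i."""
    x = list(x0)                          # x = (x_1, ..., x_s), one entry per block
    for k in range(1, num_outer_iters + 1):
        for i in range(s):
            # blocks 0..i-1 already hold their k-th values,
            # blocks i+1..s-1 still hold their (k-1)-th values
            x[i] = update_block(x, i, k)
        # at this point x holds x^(k)
    return x
```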

  8. Block Coordinate Descent Methods. Denote $f_i^{(k)}(x_i) := f\big(x_1^{(k)}, \dots, x_{i-1}^{(k)}, x_i, x_{i+1}^{(k-1)}, \dots, x_s^{(k-1)}\big)$. (First-order) BCD methods can typically be classified into three categories.
     1. Classical BCD methods update each block of variables as follows:
     $$x_i^{(k)} = \operatorname*{argmin}_{x_i \in E_i} \ f_i^{(k)}(x_i) + g_i(x_i).$$
     ⊕ They converge to a stationary point under suitable convexity assumptions.
     ⊖ They may fail to converge for some non-convex problems.
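For intuition, a minimal sketch of the classical BCD update on the two-block matrix factorization from the NMF slides, but without the nonnegativity constraint so that each exact subproblem $\operatorname{argmin}_U \tfrac{1}{2}\|X - UV\|_F^2$ has a closed-form least-squares solution; this is just alternating least squares, used here only to illustrate the exact-minimization update (the constrained subproblems of NMF have no such closed form).

```python
import numpy as np

def classical_bcd_two_blocks(X, r, num_iters=50, seed=0):
    """Classical BCD on min_{U,V} 1/2 ||X - U V||_F^2 (no constraints):
    each block is minimized exactly via a linear least-squares solve."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, r))
    V = rng.random((r, n))
    for _ in range(num_iters):
        # Exact minimization over U with V fixed: solve V^T U^T ~= X^T in least squares.
        U = np.linalg.lstsq(V.T, X.T, rcond=None)[0].T
        # Exact minimization over V with U fixed: solve U V ~= X in least squares.
        V = np.linalg.lstsq(U, X, rcond=None)[0]
    return U, V
```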

  9. Block Coordinate Descent Methods.
     2. Proximal BCD methods update each block of variables as follows:
     $$x_i^{(k)} = \operatorname*{argmin}_{x_i \in E_i} \ f_i^{(k)}(x_i) + g_i(x_i) + \frac{1}{2\beta_i^{(k)}} \big\|x_i - x_i^{(k-1)}\big\|^2.$$
     ⊕ The authors of [1] established, for the first time, the convergence of $\{x^{(k)}\}$ to a critical point of $F$ in the non-convex setting with $s = 2$.
     [1] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
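A minimal sketch of one proximal BCD block update, under the simplifying assumption that the partial function $f_i^{(k)}$ is a least-squares term and $g_i \equiv 0$, so that the subproblem $\min_x \tfrac{1}{2}\|Ax - b\|^2 + \tfrac{1}{2\beta}\|x - x_{\text{prev}}\|^2$ has a closed-form solution; in the general non-smooth case the subproblem would require its own solver.

```python
import numpy as np

def proximal_bcd_step(A, b, x_prev, beta):
    """One proximal BCD update for f_i(x) = 1/2 ||A x - b||^2 and g_i = 0:
    argmin_x 1/2 ||A x - b||^2 + 1/(2*beta) ||x - x_prev||^2,
    i.e. solve the normal equations (A^T A + I/beta) x = A^T b + x_prev/beta."""
    n = A.shape[1]
    lhs = A.T @ A + np.eye(n) / beta
    rhs = A.T @ b + x_prev / beta
    return np.linalg.solve(lhs, rhs)

# usage: one block update with a moderate proximal parameter
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
x_new = proximal_bcd_step(A, b, x_prev=np.zeros(2), beta=1.0)
```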

  10. Block Coordinate Descent Methods.
     3. Proximal gradient BCD methods update each block of variables as follows:
     $$x_i^{(k)} = \operatorname*{argmin}_{x_i \in E_i} \ \big\langle \nabla f_i^{(k)}\big(x_i^{(k-1)}\big), x_i - x_i^{(k-1)} \big\rangle + g_i(x_i) + \frac{1}{2\beta_i^{(k)}} \big\|x_i - x_i^{(k-1)}\big\|^2.$$
     When $g_i(x_i) = I_{X_i}(x_i)$ and $\|\cdot\|$ is the Frobenius norm, we have
     $$x_i^{(k)} = \operatorname{Proj}_{X_i}\big(x_i^{(k-1)} - \beta_i^{(k)} \nabla f_i^{(k)}\big(x_i^{(k-1)}\big)\big).$$
     ⊕ In the general non-convex setting, Bolte et al. [2] proved the convergence of $\{x^{(k)}\}$ to a critical point of $F$ when $s = 2$.
     [2] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1):459–494, Aug 2014.
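A minimal sketch of the projected-gradient form of this update for the two-block NMF instance ($g_i$ the indicator of the nonnegative orthant, so the projection is an elementwise max with 0); the step sizes $1/\|VV^\top\|_2$ and $1/\|U^\top U\|_2$ are the usual Lipschitz-based choices and are an assumption here, not prescribed by this slide.

```python
import numpy as np

def proximal_gradient_bcd_nmf(X, U, V, num_iters=100):
    """Alternating projected-gradient (proximal gradient BCD) updates for NMF:
    U <- Proj_{>=0}(U - beta_U * grad_U f),  V <- Proj_{>=0}(V - beta_V * grad_V f),
    with f(U, V) = 1/2 ||X - U V||_F^2."""
    for _ in range(num_iters):
        beta_U = 1.0 / np.linalg.norm(V @ V.T, 2)        # 1 / Lipschitz constant of grad_U f
        U = np.maximum(0.0, U - beta_U * (U @ V - X) @ V.T)
        beta_V = 1.0 / np.linalg.norm(U.T @ U, 2)        # 1 / Lipschitz constant of grad_V f
        V = np.maximum(0.0, V - beta_V * U.T @ (U @ V - X))
    return U, V
```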

  11. Gradient descent method. When $E = \mathbb{R}^n$, $s = 1$, $g(x) = 0$, and $\|\cdot\|$ is the Frobenius norm, proximal gradient BCD amounts to the gradient descent method for the unconstrained optimization problem $\min_{x \in \mathbb{R}^n} f(x)$:
     $$x^{k+1} = x^k - \beta_k \nabla f(x^k).$$
     Some remarks:
     - It is a descent method when $\beta_k$ is appropriately chosen.
     - In the convex setting, the method does not have the optimal convergence rate.

  12. Acceleration by extrapolation.
     Heavy-ball method of Polyak [3]:
     $$x^{k+1} = x^k - \beta_k \nabla f(x^k) + \theta_k (x^k - x^{k-1}).$$
     Accelerated gradient method of Nesterov [4]:
     $$y^k = x^k + \theta_k (x^k - x^{k-1}), \qquad x^{k+1} = y^k - \beta_k \nabla f(y^k) = x^k - \beta_k \nabla f(y^k) + \theta_k (x^k - x^{k-1}).$$
     Some remarks: they are not descent methods; in the convex setting, these methods are proved to achieve the optimal convergence rate.
     [3] B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
     [4] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2), 1983.
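A minimal sketch contrasting the two extrapolation schemes on a smooth $f$, assuming constant step size `beta` and momentum `theta` supplied by the caller (their schedules are not specified on this slide).

```python
import numpy as np

def heavy_ball(grad_f, x0, beta, theta, num_iters):
    """Polyak's heavy ball: x_{k+1} = x_k - beta * grad f(x_k) + theta * (x_k - x_{k-1})."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(num_iters):
        x_next = x - beta * grad_f(x) + theta * (x - x_prev)
        x_prev, x = x, x_next
    return x

def nesterov(grad_f, x0, beta, theta, num_iters):
    """Nesterov's accelerated gradient: extrapolate first to y_k = x_k + theta * (x_k - x_{k-1}),
    then take the gradient step at the extrapolated point y_k."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(num_iters):
        y = x + theta * (x - x_prev)
        x_next = y - beta * grad_f(y)
        x_prev, x = x, x_next
    return x

# usage on a toy quadratic f(x) = 1/2 ||A x - b||^2
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad_f = lambda x: A.T @ (A @ x - b)
x_hb = heavy_ball(grad_f, np.zeros(2), beta=0.05, theta=0.5, num_iters=200)
x_ag = nesterov(grad_f, np.zeros(2), beta=0.05, theta=0.5, num_iters=200)
```

The key structural difference is where the gradient is evaluated: at the current point $x^k$ for heavy ball, at the extrapolated point $y^k$ for Nesterov's method.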

  13. Let's recall.
     1. Classical BCD: $x_i^{(k)} = \operatorname*{argmin}_{x_i \in E_i} \ f_i^{(k)}(x_i) + g_i(x_i)$.
     2. Proximal BCD: $x_i^{(k)} = \operatorname*{argmin}_{x_i \in E_i} \ f_i^{(k)}(x_i) + g_i(x_i) + \frac{1}{2\beta_i^{(k)}} \|x_i - x_i^{(k-1)}\|^2$.
     3. Proximal gradient BCD: $x_i^{(k)} = \operatorname*{argmin}_{x_i \in E_i} \ \big\langle \nabla f_i^{(k)}\big(x_i^{(k-1)}\big), x_i - x_i^{(k-1)} \big\rangle + g_i(x_i) + \frac{1}{2\beta_i^{(k)}} \|x_i - x_i^{(k-1)}\|^2$.

  14. The proposed methods: IBP and IBPG.
     Algorithm 2: IBP.
     Initialize: choose $\tilde{x}^{(0)} = \tilde{x}^{(-1)}$.
     for $k = 1, \dots$ do
       $x^{(k,0)} = \tilde{x}^{(k-1)}$.
       for $j = 1, \dots, T_k$ do
         Choose $i \in \{1, \dots, s\}$. Let $y_i$ be the value of the $i$-th block before it was updated to $x_i^{(k,j-1)}$.
         Extrapolate
         $$\hat{x}_i = x_i^{(k,j-1)} + \alpha_i^{(k,j)} \big(x_i^{(k,j-1)} - y_i\big), \qquad (3)$$
         and compute
         $$x_i^{(k,j)} = \operatorname*{argmin}_{x_i} \ f_i^{(k,j)}(x_i) + g_i(x_i) + \frac{1}{2\beta_i^{(k,j)}} \|x_i - \hat{x}_i\|^2. \qquad (4)$$
         Let $x_{i'}^{(k,j)} = x_{i'}^{(k,j-1)}$ for $i' \neq i$.
       end for
       Update $\tilde{x}^{(k)} = x^{(k,T_k)}$.
     end for

     Algorithm 3: IBPG.
     Initialize: choose $\tilde{x}^{(0)} = \tilde{x}^{(-1)}$.
     for $k = 1, \dots$ do
       $x^{(k,0)} = \tilde{x}^{(k-1)}$.
       for $j = 1, \dots, T_k$ do
         Choose $i \in \{1, \dots, s\}$. Let $y_i$ be the value of the $i$-th block before it was updated to $x_i^{(k,j-1)}$.
         Extrapolate
         $$\hat{x}_i = x_i^{(k,j-1)} + \alpha_i^{(k,j)} \big(x_i^{(k,j-1)} - y_i\big), \qquad \grave{x}_i = x_i^{(k,j-1)} + \gamma_i^{(k,j)} \big(x_i^{(k,j-1)} - y_i\big), \qquad (5)$$
         and compute
         $$x_i^{(k,j)} = \operatorname*{argmin}_{x_i} \ \big\langle \nabla f_i^{(k,j)}(\grave{x}_i), x_i - x_i^{(k,j-1)} \big\rangle + g_i(x_i) + \frac{1}{2\beta_i^{(k,j)}} \|x_i - \hat{x}_i\|^2. \qquad (6)$$
         Let $x_{i'}^{(k,j)} = x_{i'}^{(k,j-1)}$ for $i' \neq i$.
       end for
       Update $\tilde{x}^{(k)} = x^{(k,T_k)}$.
     end for
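To connect the template to a concrete problem, here is a minimal sketch of IBPG-style updates for the two-block NMF instance. It assumes one cyclic pass per outer loop ($T_k = s = 2$), a fixed extrapolation weight `alpha` used for both extrapolation points (i.e. $\gamma = \alpha$, so $\grave{x}_i = \hat{x}_i$), and Lipschitz-based step sizes; the paper's actual parameter schedules and safeguards are not reproduced here.

```python
import numpy as np

def ibpg_nmf(X, U, V, alpha=0.5, num_outer_iters=100):
    """IBPG-style updates for min 1/2 ||X - U V||_F^2 + indicators of nonnegativity,
    with two blocks (U and V). For each block: extrapolate from its previous value,
    take a gradient step of f at the extrapolated point, keep the proximal center at
    the (same) extrapolated point, then project onto the nonnegative orthant."""
    U_old, V_old = U.copy(), V.copy()
    for _ in range(num_outer_iters):
        # --- Update block U ---
        beta_U = 1.0 / np.linalg.norm(V @ V.T, 2)       # step size from Lipschitz constant
        U_hat = U + alpha * (U - U_old)                 # extrapolation (gamma = alpha assumed)
        grad_U = (U_hat @ V - X) @ V.T                  # grad of f w.r.t. U at the extrapolated point
        U_old, U = U, np.maximum(0.0, U_hat - beta_U * grad_U)
        # --- Update block V ---
        beta_V = 1.0 / np.linalg.norm(U.T @ U, 2)
        V_hat = V + alpha * (V - V_old)
        grad_V = U.T @ (U @ V_hat - X)
        V_old, V = V, np.maximum(0.0, V_hat - beta_V * grad_V)
    return U, V
```

With $g_i$ an indicator of the nonnegative orthant, the subproblem (6) reduces exactly to the projected step used above: $x_i^{(k,j)} = \operatorname{Proj}_{\geq 0}\big(\hat{x}_i - \beta_i^{(k,j)} \nabla f_i^{(k,j)}(\grave{x}_i)\big)$.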

  15. An illustration.
     Assumption 1. For all $k$, all blocks are updated after the $T_k$ iterations performed within the $k$-th outer loop, and there exists a positive constant $\bar{T}$ such that $s \leq T_k \leq \bar{T}$.
