 
Inertial Block Proximal Methods for Non-Convex Non-Smooth Optimization

L. T. K. Hien (1), N. Gillis (1), P. Patrinos (2)
(1) University of Mons, (2) KU Leuven

The 37th International Conference on Machine Learning (ICML 2020)
Overview

1. Problem set up
   - Motivation
   - Block Coordinate Descent Methods
2. The proposed methods: IBP and IBPG
   - Extension to Bregman divergence
3. Convergence Analysis
   - Subsequential convergence
   - Global convergence
4. Application to NMF
5. Preliminary numerical results
Problem set up

We consider the following non-smooth non-convex optimization problem

$$\min_{x \in E} F(x), \quad \text{where } F(x) := f(x) + g(x), \tag{1}$$

and $x$ is partitioned into $s$ blocks/groups of variables: $x = (x_1, \dots, x_s) \in E = E_1 \times \dots \times E_s$, where

- $E_i$, $i = 1, \dots, s$, are finite-dimensional real linear spaces equipped with the norm $\|\cdot\|_{(i)}$ and the inner product $\langle \cdot, \cdot \rangle_{(i)}$,
- $f : E \to \mathbb{R}$ is a continuous but possibly non-smooth non-convex function, and
- $g(x) = \sum_{i=1}^{s} g_i(x_i)$, where $g_i : E_i \to \mathbb{R} \cup \{+\infty\}$, $i = 1, \dots, s$, are proper and lower semi-continuous functions.
Nonnegative matrix factorization – A motivation

NMF: Given $X \in \mathbb{R}_+^{m \times n}$ and the integer $r < \min(m, n)$, solve

$$\min_{U \ge 0,\, V \ge 0} \ \frac{1}{2} \|X - UV\|_F^2 \quad \text{such that } U \in \mathbb{R}_+^{m \times r} \text{ and } V \in \mathbb{R}_+^{r \times n}.$$

NMF is a key problem in data analysis and machine learning, with applications in image processing, document classification, hyperspectral unmixing, and audio source separation.
NMF can be written in the form (1) in two ways.

Two-block formulation. Let $f(U, V) = \frac{1}{2} \|X - UV\|_F^2$, $g_1(U) = I_{\mathbb{R}_+^{m \times r}}(U)$, and $g_2(V) = I_{\mathbb{R}_+^{r \times n}}(V)$. NMF is rewritten as

$$\min_{U, V} \ f(U, V) + g_1(U) + g_2(V).$$

$2r$-block formulation. Let $f(U_{:i}, V_{i:}) = \frac{1}{2} \big\|X - \sum_{i=1}^{r} U_{:i} V_{i:}\big\|_F^2$, $g_i(U_{:i}) = I_{\mathbb{R}_+^{m}}(U_{:i})$, $i = 1, \dots, r$, and $g_{i+r}(V_{i:}) = I_{\mathbb{R}_+^{n}}(V_{i:})$, $i = 1, \dots, r$. NMF is rewritten as

$$\min_{U_{:i},\, V_{i:}} \ f(U_{:i}, V_{i:}) + \sum_{i=1}^{r} g_i(U_{:i}) + \sum_{i=1}^{r} g_{i+r}(V_{i:}).$$
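The two formulations share the same objective because $UV$ equals the sum of the rank-one terms $U_{:i} V_{i:}$. A minimal NumPy sketch of that identity follows; the function names `nmf_objective` and `rank_one_sum` and the test sizes are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def nmf_objective(X, U, V):
    """f(U, V) = 0.5 * ||X - U V||_F^2."""
    residual = X - U @ V
    return 0.5 * np.sum(residual * residual)

def rank_one_sum(U, V):
    """Sum over i of the outer products U[:, i] V[i, :]."""
    return sum(np.outer(U[:, i], V[i, :]) for i in range(U.shape[1]))

rng = np.random.default_rng(0)
m, n, r = 6, 5, 3                      # hypothetical sizes with r < min(m, n)
U = rng.random((m, r))                 # nonnegative factors
V = rng.random((r, n))
X = U @ V                              # exactly factorizable data

same_product = np.allclose(U @ V, rank_one_sum(U, V))
f_at_factors = nmf_objective(X, U, V)  # zero, since X was built as U @ V
```

Working with columns of $U$ and rows of $V$ as separate blocks gives $2r$ small subproblems instead of two large ones, which is the trade-off the second formulation expresses.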
Non-negative approximate canonical polyadic decomposition (NCPD)

We consider the following NCPD problem: given a non-negative tensor $T \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}$ and a specified order $r$, solve

$$\min_{X^{(1)}, \dots, X^{(N)}} \ f := \frac{1}{2} \big\| T - X^{(1)} \circ \dots \circ X^{(N)} \big\|_F^2 \quad \text{such that } X^{(n)} \in \mathbb{R}_+^{I_n \times r}, \ n = 1, \dots, N, \tag{2}$$

where the Frobenius norm of a tensor $T \in \mathbb{R}^{I_1 \times I_2 \times \dots \times I_N}$ is defined as

$$\|T\|_F = \sqrt{\sum_{i_1, \dots, i_N} T_{i_1 i_2 \dots i_N}^2},$$

and the tensor product $X = X^{(1)} \circ \dots \circ X^{(N)}$ is defined as

$$X_{i_1 i_2 \dots i_N} = \sum_{j=1}^{r} X^{(1)}_{i_1 j} X^{(2)}_{i_2 j} \cdots X^{(N)}_{i_N j}, \quad i_n \in \{1, \dots, I_n\}, \ n = 1, \dots, N.$$

Here $X^{(n)}_{ij}$ is the $(i, j)$-th element of $X^{(n)}$. Let $g_i(X^{(i)}) = I_{\mathbb{R}_+^{I_i \times r}}(X^{(i)})$. NCPD is rewritten as

$$\min_{X^{(1)}, \dots, X^{(N)}} \ f(X^{(1)}, \dots, X^{(N)}) + \sum_{i=1}^{N} g_i(X^{(i)}).$$
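For concreteness, the tensor product and objective (2) can be sketched for the case $N = 3$ with `numpy.einsum`. The names `cp_tensor` and `ncpd_objective`, the restriction to three factors, and the sizes are assumptions made for this illustration.

```python
import numpy as np

def cp_tensor(A, B, C):
    """X[i, j, k] = sum_q A[i, q] * B[j, q] * C[k, q]  (the N = 3 case)."""
    return np.einsum("iq,jq,kq->ijk", A, B, C)

def ncpd_objective(T, A, B, C):
    """0.5 * ||T - A o B o C||_F^2 with the tensor Frobenius norm."""
    return 0.5 * np.sum((T - cp_tensor(A, B, C)) ** 2)

rng = np.random.default_rng(1)
r = 3                                  # hypothetical order
A = rng.random((2, r))                 # X^(1), nonnegative
B = rng.random((4, r))                 # X^(2), nonnegative
C = rng.random((5, r))                 # X^(3), nonnegative
T = cp_tensor(A, B, C)                 # tensor with an exact decomposition

# One entry written out as the explicit sum over j from the definition.
entry_by_hand = sum(A[1, q] * B[2, q] * C[3, q] for q in range(r))
```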
Block Coordinate Descent Methods

Algorithm 1: General framework of BCD methods.

Initialize: Choose an initial point $x^{(0)}$ and other parameters.
for $k = 1, \dots$ do
    for $i = 1, \dots, s$ do
        Fix the latest values of the blocks $j \ne i$: $\big(x_1^{(k)}, \dots, x_{i-1}^{(k)}, x_i, x_{i+1}^{(k-1)}, \dots, x_s^{(k-1)}\big)$.
        Update block $i$ to get $\big(x_1^{(k)}, \dots, x_{i-1}^{(k)}, x_i^{(k)}, x_{i+1}^{(k-1)}, \dots, x_s^{(k-1)}\big)$.
    end for
end for
Block Coordinate Descent Methods

Denote

$$f_i^{(k)}(x_i) := f\big(x_1^{(k)}, \dots, x_{i-1}^{(k)}, x_i, x_{i+1}^{(k-1)}, \dots, x_s^{(k-1)}\big).$$

(First-order) BCD methods can typically be classified into three categories:

1. Classical BCD methods update each block of variables as follows:

$$x_i^{(k)} = \operatorname*{argmin}_{x_i \in E_i} \ f_i^{(k)}(x_i) + g_i(x_i).$$

⊕ They converge to a stationary point under suitable convexity assumptions.
⊖ They fail to converge for some non-convex problems.
Block Coordinate Descent Methods

2. Proximal BCD methods update each block of variables as follows:

$$x_i^{(k)} = \operatorname*{argmin}_{x_i \in E_i} \ f_i^{(k)}(x_i) + g_i(x_i) + \frac{1}{2\beta_i^{(k)}} \big\|x_i - x_i^{(k-1)}\big\|^2.$$

⊕ The authors of [1] established, for the first time, the convergence of $\{x^{(k)}\}$ to a critical point of $F$ in the non-convex setting with $s = 2$.

[1] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Lojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
Block Coordinate Descent Methods

3. Proximal gradient BCD methods update each block of variables as follows:

$$x_i^{(k)} = \operatorname*{argmin}_{x_i \in E_i} \ \big\langle \nabla f_i^{(k)}\big(x_i^{(k-1)}\big), x_i - x_i^{(k-1)} \big\rangle + g_i(x_i) + \frac{1}{2\beta_i^{(k)}} \big\|x_i - x_i^{(k-1)}\big\|^2.$$

When $g_i(x_i) = I_{X_i}(x_i)$ and $\|\cdot\|$ is the Frobenius norm, we have

$$x_i^{(k)} = \operatorname{Proj}_{X_i}\Big(x_i^{(k-1)} - \beta_i^{(k)} \nabla f_i^{(k)}\big(x_i^{(k-1)}\big)\Big).$$

⊕ In the general non-convex setting, Bolte et al. [2] proved the convergence of $\{x^{(k)}\}$ to a critical point of $F$ when $s = 2$.

[2] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1):459–494, Aug 2014.
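As an illustration of the projected form of the proximal gradient BCD update, here is a minimal NumPy sketch for the two-block NMF formulation, where the projection onto the nonnegative orthant is an entrywise maximum with zero. The step size $\beta = 1/L$, with $L$ the spectral norm of $VV^\top$ (respectively $U^\top U$), is an assumed choice for this sketch and is not prescribed by the slide. The function name `prox_grad_bcd_nmf` is hypothetical.

```python
import numpy as np

def prox_grad_bcd_nmf(X, r, iters=50, seed=0):
    """Cyclic projected-gradient BCD on f(U, V) = 0.5 * ||X - U V||_F^2."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U, V = rng.random((m, r)), rng.random((r, n))
    history = [0.5 * np.linalg.norm(X - U @ V) ** 2]
    for _ in range(iters):
        # Block U: gradient of f in U is (U V - X) V^T, Lipschitz constant ||V V^T||_2.
        beta_U = 1.0 / max(np.linalg.norm(V @ V.T, 2), 1e-12)
        U = np.maximum(U - beta_U * (U @ V - X) @ V.T, 0.0)
        # Block V: uses the freshly updated U, as in the BCD framework.
        beta_V = 1.0 / max(np.linalg.norm(U.T @ U, 2), 1e-12)
        V = np.maximum(V - beta_V * U.T @ (U @ V - X), 0.0)
        history.append(0.5 * np.linalg.norm(X - U @ V) ** 2)
    return U, V, history

X = np.random.default_rng(42).random((8, 6))   # hypothetical nonnegative data
U, V, history = prox_grad_bcd_nmf(X, r=3)
```

With this step size each block update cannot increase the objective, so `history` is non-increasing; that monotonicity is what the inertial variants give up in exchange for speed.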
Gradient descent method

When $E = \mathbb{R}^n$, $s = 1$, $g(x) = 0$ and $\|\cdot\|$ is the Frobenius norm, proximal gradient BCD amounts to the gradient descent method for the unconstrained optimization problem $\min_{x \in \mathbb{R}^n} f(x)$:

$$x^{k+1} = x^k - \beta_k \nabla f(x^k).$$

Some remarks:
- It is a descent method when $\beta_k$ is appropriately chosen.
- In the convex setting, the method does not have the optimal convergence rate.
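The descent property can be checked on a small example. The sketch below assumes a strongly convex quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$ and a constant step $\beta_k = 1/L$, where $L$ is the largest eigenvalue of $A$; both the test problem and the helper name `gradient_descent` are illustrative choices.

```python
import numpy as np

def gradient_descent(grad, x0, beta, iters):
    """Iterate x^{k+1} = x^k - beta * grad(x^k) and return all iterates."""
    x = np.asarray(x0, dtype=float)
    trajectory = [x.copy()]
    for _ in range(iters):
        x = x - beta * grad(x)
        trajectory.append(x.copy())
    return trajectory

A = np.diag([1.0, 10.0])               # hypothetical quadratic, L = 10
b = np.array([1.0, 1.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

def grad_f(x):
    return A @ x - b

trajectory = gradient_descent(grad_f, [0.0, 0.0], beta=1.0 / 10.0, iters=200)
values = [f(x) for x in trajectory]    # objective value along the iterates
x_star = np.linalg.solve(A, b)         # exact minimizer, for comparison
```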
Acceleration by extrapolation

Heavy-ball method of Polyak [3]:

$$x^{k+1} = x^k - \beta_k \nabla f(x^k) + \theta_k (x^k - x^{k-1}).$$

Accelerated gradient method of Nesterov [4]:

$$y^k = x^k + \theta_k (x^k - x^{k-1}),$$
$$x^{k+1} = y^k - \beta_k \nabla f(y^k) = x^k - \beta_k \nabla f(y^k) + \theta_k (x^k - x^{k-1}).$$

Some remarks:
- They are not descent methods.
- In the convex setting, these methods are proved to achieve the optimal convergence rate.

[3] B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[4] Y. Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 27(2), 1983.
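The two extrapolated updates differ only in where the gradient is evaluated: at $x^k$ for the heavy-ball method and at the extrapolated point $y^k$ for Nesterov's method. The sketch below runs both, next to plain gradient descent, on an ill-conditioned quadratic. The constant parameters (the classical choices for a quadratic with smallest and largest eigenvalues $\mu$ and $L$) and the test problem are assumptions of this example; the slide allows iteration-dependent $\beta_k$ and $\theta_k$.

```python
import numpy as np

def heavy_ball(grad, x0, beta, theta, iters):
    """x^{k+1} = x^k - beta * grad(x^k) + theta * (x^k - x^{k-1})."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(iters):
        x_prev, x = x, x - beta * grad(x) + theta * (x - x_prev)
    return x

def nesterov(grad, x0, beta, theta, iters):
    """y^k = x^k + theta * (x^k - x^{k-1});  x^{k+1} = y^k - beta * grad(y^k)."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(iters):
        y = x + theta * (x - x_prev)
        x_prev, x = x, y - beta * grad(y)
    return x

def plain_gd(grad, x0, beta, iters):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - beta * grad(x)
    return x

mu, L = 1.0, 100.0                      # hypothetical eigenvalue range
A = np.diag([mu, L])
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)

def grad_f(x):
    return A @ x - b

s_mu, s_L = np.sqrt(mu), np.sqrt(L)
x0, iters = np.zeros(2), 300
x_hb = heavy_ball(grad_f, x0, 4.0 / (s_L + s_mu) ** 2,
                  ((s_L - s_mu) / (s_L + s_mu)) ** 2, iters)
x_nes = nesterov(grad_f, x0, 1.0 / L, (s_L - s_mu) / (s_L + s_mu), iters)
x_gd = plain_gd(grad_f, x0, 1.0 / L, iters)

err_hb = np.linalg.norm(x_hb - x_star)
err_nes = np.linalg.norm(x_nes - x_star)
err_gd = np.linalg.norm(x_gd - x_star)
```

After the same number of iterations, both extrapolated methods are far closer to the minimizer than gradient descent with step $1/L$ on this problem.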
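Let's recall

1. Classical BCD

$$x_i^{(k)} = \operatorname*{argmin}_{x_i \in E_i} \ f_i^{(k)}(x_i) + g_i(x_i).$$

2. Proximal BCD

$$x_i^{(k)} = \operatorname*{argmin}_{x_i \in E_i} \ f_i^{(k)}(x_i) + g_i(x_i) + \frac{1}{2\beta_i^{(k)}} \big\|x_i - x_i^{(k-1)}\big\|^2.$$

3. Proximal gradient BCD

$$x_i^{(k)} = \operatorname*{argmin}_{x_i \in E_i} \ \big\langle \nabla f_i^{(k)}\big(x_i^{(k-1)}\big), x_i \big\rangle + g_i(x_i) + \frac{1}{2\beta_i^{(k)}} \big\|x_i - x_i^{(k-1)}\big\|^2.$$

(The inner product here omits the constant term $-\big\langle \nabla f_i^{(k)}(x_i^{(k-1)}), x_i^{(k-1)} \big\rangle$, which does not change the minimizer.)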
The proposed methods: IBP and IBPG

Algorithm 2: IBP

Initialize: Choose $\tilde{x}^{(0)} = \tilde{x}^{(-1)}$.
for $k = 1, \dots$ do
    $x^{(k,0)} = \tilde{x}^{(k-1)}$.
    for $j = 1, \dots, T_k$ do
        Choose $i \in \{1, \dots, s\}$. Let $y_i$ be the value of the $i$-th block before it was updated to $x_i^{(k,j-1)}$.
        Extrapolate
            $$\hat{x}_i = x_i^{(k,j-1)} + \alpha_i^{(k,j)} \big(x_i^{(k,j-1)} - y_i\big), \tag{3}$$
        and compute
            $$x_i^{(k,j)} = \operatorname*{argmin}_{x_i} \ F_i^{(k,j)}(x_i) + \frac{1}{2\beta_i^{(k,j)}} \|x_i - \hat{x}_i\|^2. \tag{4}$$
        Let $x_{i'}^{(k,j)} = x_{i'}^{(k,j-1)}$ for $i' \ne i$.
    end for
    Update $\tilde{x}^{(k)} = x^{(k,T_k)}$.
end for

Algorithm 3: IBPG

Initialize: Choose $\tilde{x}^{(0)} = \tilde{x}^{(-1)}$.
for $k = 1, \dots$ do
    $x^{(k,0)} = \tilde{x}^{(k-1)}$.
    for $j = 1, \dots, T_k$ do
        Choose $i \in \{1, \dots, s\}$. Let $y_i$ be the value of the $i$-th block before it was updated to $x_i^{(k,j-1)}$.
        Extrapolate
            $$\hat{x}_i = x_i^{(k,j-1)} + \alpha_i^{(k,j)} \big(x_i^{(k,j-1)} - y_i\big), \qquad \grave{x}_i = x_i^{(k,j-1)} + \gamma_i^{(k,j)} \big(x_i^{(k,j-1)} - y_i\big), \tag{5}$$
        and compute
            $$x_i^{(k,j)} = \operatorname*{argmin}_{x_i} \ \big\langle \nabla f_i^{(k,j)}(\grave{x}_i), x_i - x_i^{(k,j-1)} \big\rangle + g_i(x_i) + \frac{1}{2\beta_i^{(k,j)}} \|x_i - \hat{x}_i\|^2. \tag{6}$$
        Let $x_{i'}^{(k,j)} = x_{i'}^{(k,j-1)}$ for $i' \ne i$.
    end for
    Update $\tilde{x}^{(k)} = x^{(k,T_k)}$.
end for
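To make the IBPG update concrete, the following sketch applies it to the two-block NMF formulation. When $g_i$ is the indicator of the nonnegative orthant, update (6) reduces to projecting $\hat{x}_i - \beta_i^{(k,j)} \nabla f_i^{(k,j)}(\grave{x}_i)$ onto that orthant. Several choices here are assumptions made only for illustration: cyclic block selection with $T_k = s = 2$, constant extrapolation parameters `alpha` and `gamma`, and the step size $\beta = 1/L$. The conditions on these parameters required by the convergence analysis are not reproduced in this sketch.

```python
import numpy as np

def nmf_objective(X, U, V):
    return 0.5 * np.linalg.norm(X - U @ V) ** 2

def ibpg_nmf(X, r, outer_iters=100, alpha=0.3, gamma=0.3, seed=0):
    """IBPG-style sketch for NMF: two extrapolated points per block update."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U, V = rng.random((m, r)), rng.random((r, n))
    U_prev, V_prev = U.copy(), V.copy()   # mirrors x~(0) = x~(-1)
    history = [nmf_objective(X, U, V)]
    for _ in range(outer_iters):
        # Block U: y_i is U_prev, the value of U before its last update.
        beta = 1.0 / max(np.linalg.norm(V @ V.T, 2), 1e-12)
        U_hat = U + alpha * (U - U_prev)       # point used in the proximal term
        U_grave = U + gamma * (U - U_prev)     # point where the gradient is taken
        grad = (U_grave @ V - X) @ V.T
        U_prev, U = U, np.maximum(U_hat - beta * grad, 0.0)
        # Block V: same pattern, using the freshly updated U.
        beta = 1.0 / max(np.linalg.norm(U.T @ U, 2), 1e-12)
        V_hat = V + alpha * (V - V_prev)
        V_grave = V + gamma * (V - V_prev)
        grad = U.T @ (U @ V_grave - X)
        V_prev, V = V, np.maximum(V_hat - beta * grad, 0.0)
        history.append(nmf_objective(X, U, V))
    return U, V, history

X = np.random.default_rng(42).random((8, 6))   # hypothetical nonnegative data
U, V, history = ibpg_nmf(X, r=3)
# With alpha = gamma = 0 the sketch reduces to proximal gradient BCD.
_, _, plain_history = ibpg_nmf(X, r=3, alpha=0.0, gamma=0.0)
```

Keeping `alpha` and `gamma` as separate arguments reflects the two distinct extrapolation points in (5); setting both to zero recovers the non-inertial method, whose objective values are non-increasing.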
An illustration

Assumption 1. For all $k$, all blocks are updated after the $T_k$ iterations performed within the $k$-th outer loop, and there exists a positive constant $\bar{T}$ such that $s \le T_k \le \bar{T}$.