 
              Scaling Methods to obtain Doubly stochastic matrices Krishna Acharya, Nacim Oijid December 10, 2019 1 / 44
Overview Definition of the problem 1 Applications 2 A Natural Algorithm - Sinkhorn and Knopp 3 Conditions on A 4 Describe types of convergence analysis 5 Knight 2008 6 Livne and Golub 7 Brief Description of Knight and Ruiz 8 Newton’s method 9 10 Splitting method 11 Some Key results from KR 2013 12 Numerical Comparison 2 / 44
Balancing a matrix Given a nonnegative square matrix A. Find diagonal matrices X and Y , such that XAY is doubly stochastic. 3 / 44
Preconditioning Linear Systems We are interested in z , the solution of Az = b Let A 1 = XAY be doubly stochastic, and we can obtain solution of A 1 z 1 = Xb . Then z = Yz 1 Perhaps A 1 z 1 = Xb is more numerically stable, so it helps to do this scaling Also note that A and A 1 have the same sparsity. 4 / 44
Back to Balancing find positive vectors r and c such that D ( r ) AD ( c ) is doubly stochastic find positive vectors r and c such that D ( r ) AD ( c ) e = e and D ( c ) A T D ( r ) e = e r = D ( Ac ) − 1 e and c = D ( A T r ) − 1 e 5 / 44
Iterative method The fixed point earlier leads to an iterative method, where c k +1 = D ( A T r k ) − 1 e and r k +1 = D ( Ac k +1 ) − 1 e In MATLAB c k +1 = 1 ./ ( A ′ ∗ r k ) , r k +1 = 1 ./ ( A ∗ c k +1 ) r 0 = e corresponds to the Sinkhorn and Knopp algorithm(’52). Lets see a run on   3 1 0 1 2 0   2 0 1 6 / 44
Important Qs Do the c k and r k converge to r and c? Which non negative A ’s is this possible for? Linear convergence, quadratic ...? 7 / 44
Some relevant definitions For A ∈ R n × n , a collection of n elements of A is called a diagonal of A, provided no two of the elements belong to the same row or column of A. A nonzero diagonal of A is a diagonal not containing any 0’s. If G is the bipartite graph whose adjacency matrix is A, then non zero diagonals correspond to the perfect matchings of G. 8 / 44
Matrices for which balancing is possible Matrix has support if it has at least one non zero diagonal a (0-1) matrix A has total support provided each of its 1’s belongs to a non zero diagonal. That is, if Aij � = 0 we can find a permutation, B, of the rows of A which puts Aij on the leading diagonal and for which | b kk | > 0 for all k 9 / 44
SK theorem Theorem If A ∈ R n × n is non negative, then a necessary and sufficient condition that there exists a doubly stochastic matrix P of the form DAE where D and E are diagonal matrices with positive main diagonals, is that A has total support. If P exists, then it is unique. D and E are also unique up to a scalar multiple iff A is fully indecomposable. Theorem A necessary and sufficient condition that the SK algorithm converges is that A has total support 10 / 44
Convergence analysis There are many papers on the convergence of r k and c k , The main idea in these papers is to show that on each iteration a special function(e.g entropy, log barrier, permanent) is decreasing. For a matrix that can be balanced, a lower bound is known → it corresponds to the doubly stochastic case. Convex Optimization - Much faster scaling, Cohen Log barrier methods(Kalantari and Khachiyan 1996) Entropy minimization Connections to permanent of a matrix Topological methods 11 / 44
Linear convergence for symmetric A, Knight 2008 Considers the case where A is symmetric. SK algorithm can be applied, Looks for diagonal matrix D , such that DAD is doubly stochastic. The symmetric analogues of SK algorithm are x = D ( Ax ) − 1 e and the iterative step x k = D ( Ax k − 1 ) − 1 e . 12 / 44
Rate of convergence analysis, Knight 2008 Theorem If A is fully indecomposable then the SK algorithm will converge linearly to vectors r ∗ and c ∗ , such that D ( r ∗ ) AD ( c ∗ ) = P where P is doubly stochastic. Furthermore, there exists K ∈ Z such that for k ≥ K. � r k +1 � � r ∗ � � � r k � � r ∗ � � � � � ≤ σ 2 − − � � � � 2 c k +1 c ∗ c k c ∗ � � � Where σ 2 is the second singular value of P. 13 / 44
What LG 2004 achieve Tackle the problem of Bi-normalization Provide an Iterative algorithm BIN for scaling all rows and columns of a real symmetric matrix to unit-2 norm. 14 / 44
Important notions LG 2004 ˜ A = DAD B ij = A 2 ij j ˜ A 2 j d 2 i A 2 ij d 2 � ij = � j = c ∀ i ∈ [ n ] We are looking for positive solution ˜ x to B ( x ) x = be , where b is a positive real and B ( x ) = D ( Bx ) Equivalently(used in GS method) we are looking for a positive solution to ( I n − 1 n ee T ) B ( x ) x = be − 1 n ee T be 15 / 44
Existence and uniqueness Matrix is defined to be scalable if B ( x ) x = be has a positive solution. What we are trying to solve is an n × n system of quadratic equations in x 1 , . . . x n . (positive x). Solution does not always exist. Theorem If a matrix is scalable, it is either (a) not diluted ,or (b) n = 2 and � 0 � a A = a 0 Diluted, if ∃ i ∈ [ n ] , A ii � = 0 and A kj = 0 ∀ k � = i and j � = i 16 / 44
BIN Algorithm Algorithm 1: BIN Input: A , n × n real symm matrix, d 0 ∈ R n ,TOL-tolerance Output: ˜ A , d ∈ R n n ) 2 ) T ; x = (( d 0 1 ) 2 , . . . , ( d 0 B = A ◦ A ; Do Update rule; while s ( x ) > TOL · ¯ β ( x ) and sweeps < MaxSweeps do for i ← 1 to n do Solve equation i for x i keeping all other x j fixed at current values; end Do update rule ; end Update rule: d i = ¯ β − 1 / 2 x i ∀ i ∈ [ n ] ˜ A = DAD k ( x k β k − ¯ s ( x ) = ( 1 β ) 2 ) 1 / 2 � n 17 / 44
Notations Let D be the operator that transform a vector into a diagonal matrix of his elements :   x 1 0 . . . . . . 0   x 1 . ... ... .   0 x 2 . . .     .  . .  ... ... ...   . . D : x = �→ D ( x ) =   . . .   .    .  .  ... ... ...  .    . 0  x n   0 . . . . . . 0 x n 18 / 44
Notations Let D be the operator that transform a vector into a diagonal matrix of his elements :   x 1 0 . . . . . . 0   x 1 . ... ... .   0 x 2 . . .     .  . .  ... ... ...   . . D : x = �→ D ( x ) =   . . .   .    .  .  ... ... ...  .    . 0  x n   0 . . . . . . 0 x n   1 . . let e represent the vector   .   1 19 / 44
Quick recall on iterativ method We are looking for 2 vectors r and c such that D ( r ) AD ( c ) is doubly stochastic. That means D ( r ) Ac = e and D ( c ) A T r = e 20 / 44
Quick recall on iterativ method We are looking for 2 vectors r and c such that D ( r ) AD ( c ) is doubly stochastic. That means D ( r ) Ac = e and D ( c ) A T r = e we have the following : c = D ( A T r ) − 1 e (1) r = D ( Ac ) − 1 e . (2) 21 / 44
Quick recall on iterativ method We are looking for 2 vectors r and c such that D ( r ) AD ( c ) is doubly stochastic. That means D ( r ) Ac = e and D ( c ) A T r = e we have the following : c = D ( A T r ) − 1 e (1) r = D ( Ac ) − 1 e . (2) iterative method : c k +1 = D ( A T r k ) − 1 e , r k +1 = D ( Ac ) − 1 e So, if A is symmetric, by uniqueness of r and c , we have r = c . So, the unique solution x ∗ is solution of the equation f ( x ∗ ) = D ( x ∗ ) Ax ∗ − e = 0 (3) 22 / 44
Quick recall on iterativ method We are looking for 2 vectors r and c such that D ( r ) AD ( c ) is doubly stochastic. That means D ( r ) Ac = e and D ( c ) A T r = e we have the following : c = D ( A T r ) − 1 e (1) r = D ( Ac ) − 1 e . (2) iterative method : c k +1 = D ( A T r k ) − 1 e , r k +1 = D ( Ac ) − 1 e So, if A is symmetric, by uniqueness of r and c , we have r = c . So, the unique solution x ∗ is solution of the equation f ( x ∗ ) = D ( x ∗ ) Ax ∗ − e = 0 (3) The iterative step therefore is x k +1 = D ( Ax k ) − 1 e . But we don’t do the iterations as that would be SK. We will go back to the equation 3 and find a solution of it with Newton’s method. 23 / 44
rows/columns scaling Algorithm 2: Rows Columns scaling Input: a m × n matrix A A (0) ← A ; D (0) ← I m ; E (0) ← I n ; for k ← 0 to convergence do � || r k R ← D ( i || ); � || c k C ← D ( j || ); A ( k +1) ← R − 1 A ( k ) C − 1 ; D ( k +1) ← D ( k ) R − 1 ; C ( k +1) ← E ( k ) C − 1 ; end 24 / 44
Newton’s method remind : Newton’s method (on the blackboard) 25 / 44
Newton’s method Let J ( X ) be the jacobian of f . We have ∂ J ( x ) = ∂ x ( D ( x ) Ax − e ) = D ( x ) A + D ( Ax ) 26 / 44
Newton’s method Let J ( X ) be the jacobian of f . We have ∂ J ( x ) = ∂ x ( D ( x ) Ax − e ) = D ( x ) A + D ( Ax ) result : x k +1 − x k = − J ( x ) − 1 f ( x ) (4) D ( x ∗ ) Ax ∗ − e � − 1 � � � x k +1 = x k − D ( x ) A + D ( Ax ) (5) 27 / 44
Newton’s method Let J ( X ) be the jacobian of f . We have ∂ J ( x ) = ∂ x ( D ( x ) Ax − e ) = D ( x ) A + D ( Ax ) result : x k +1 − x k = − J ( x ) − 1 f ( x ) (4) D ( x ∗ ) Ax ∗ − e � − 1 � � � x k +1 = x k − D ( x ) A + D ( Ax ) (5) we can arrange this equation : ( A + D ( x k ) − 1 D ( Ax k )) x k +1 = Ax k + D ( x k ) − 1 e (6) 28 / 44
Recommend
More recommend