On the Linear Algebra Employed in the MOSEK Conic Optimizer


  1. On the Linear Algebra Employed in the MOSEK Conic Optimizer. Monday, Jul 13. Erling D. Andersen, www.mosek.com

  2. MOSEK summary
• Solves general LP, convex QP, CQP, SDP, and MIP problems.
• Version 8: work in progress.

  3. The conic optimization problem

$$
\begin{aligned}
\min \quad & \sum_{j=1}^{n} c_j x_j + \sum_{j=1}^{\bar n} \langle \bar C_j, \bar X_j \rangle \\
\text{subject to} \quad & \sum_{j=1}^{n} a_{ij} x_j + \sum_{j=1}^{\bar n} \langle \bar A_{ij}, \bar X_j \rangle = b_i, \quad i = 1, \ldots, m, \\
& x \in K, \qquad \bar X_j \succeq 0, \quad j = 1, \ldots, \bar n.
\end{aligned}
$$

Explanation:
• $x_j$ is a scalar variable.
• $\bar X_j$ is a square matrix variable.

  4.
• $K$ represents a Cartesian product of conic quadratic cones, e.g. $x_1 \ge \| x_{2:n} \|$.
• $\bar X_j \succeq 0$ means $\bar X_j = \bar X_j^T$ and $\bar X_j$ is PSD.
• $\bar C_j$ and $\bar A_{ij}$ are required to be symmetric.
• $\langle A, B \rangle := \mathrm{tr}(A^T B)$ (see the quick check below).
• Dimensions are large.
• Data matrices are typically sparse.
• $A$ usually has at most 10 nonzeros per column on average.
• $\bar A_{ij}$ contains few nonzeros and/or is low rank.
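
A quick NumPy check (my own illustration, not from the slides) that the matrix inner product defined above agrees with the dot product of the vectorized matrices:

```python
# <A, B> = tr(A^T B) equals the elementwise sum A_ij B_ij, i.e. vec(A)^T vec(B).
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
assert np.isclose(np.trace(A.T @ B), A.ravel() @ B.ravel())
```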

  5. The algorithm: simplified version
• Step 1: Set up the homogeneous and self-dual model.
• Step 2: Choose a starting point.
• Step 3: Compute the Nesterov-Todd search direction.
• Step 4: Take a step.
• Step 5: Stop if the trial solution is good enough.
• Step 6: Go to Step 3.

  6. Search direction computation. Requires the solution of

$$
\begin{bmatrix}
-(W W^T)^{-1} & 0 & A^T \\
0 & -(\bar W \bar W^T)^{-1} & \bar A^T \\
A & \bar A & 0
\end{bmatrix}
\begin{bmatrix} d_x \\ d_{\bar x} \\ d_y \end{bmatrix}
=
\begin{bmatrix} r_x \\ r_{\bar x} \\ r_y \end{bmatrix}
$$

where
• $W$ and $\bar W$ are nonsingular block diagonal matrices.
• $W W^T$ is a diagonal matrix plus low-rank terms.
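
A toy NumPy sketch (my illustration, not MOSEK code) that assembles and solves this block system directly; diagonal matrices stand in for the NT scalings, and in practice the system is never formed explicitly (see the next slide):

```python
import numpy as np

rng = np.random.default_rng(1)
n, nbar, m = 5, 4, 3
A = rng.standard_normal((m, n))
Abar = rng.standard_normal((m, nbar))
W = np.diag(rng.uniform(0.5, 2.0, n))        # stand-in for block-diagonal W
Wbar = np.diag(rng.uniform(0.5, 2.0, nbar))  # stand-in for block-diagonal Wbar

# Assemble the 3x3 block KKT system from the slide.
K = np.block([
    [-np.linalg.inv(W @ W.T), np.zeros((n, nbar)),           A.T],
    [np.zeros((nbar, n)),     -np.linalg.inv(Wbar @ Wbar.T), Abar.T],
    [A,                       Abar,                          np.zeros((m, m))],
])
r = rng.standard_normal(n + nbar + m)        # (r_x, r_xbar, r_y)
d = np.linalg.solve(K, r)                    # (d_x, d_xbar, d_y)
```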

  7. Reduced Schur system approach. We have

$$
\big( (A W)(A W)^T + (\bar A \bar W)(\bar A \bar W)^T \big) d_y = \cdots
$$

and

$$
\begin{aligned}
d_x &= -(W W^T)(r_x - A^T d_y), \\
d_{\bar x} &= -(\bar W \bar W^T)(r_{\bar x} - \bar A^T d_y).
\end{aligned}
$$

Cons:
• Dense columns cause issues.
• Numerical stability: bad condition number.
Pros:
• A symmetric positive definite system.
• Use Cholesky with no pivoting.
• Employed in major commercial solvers.
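
A sketch of the reduction on the same toy data as above. The right-hand side, elided as "$\cdots$" on the slide, is my derivation from eliminating $d_x$ and $d_{\bar x}$ in the block system; Cholesky applies because the Schur matrix is symmetric positive definite:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Same seed and draw order as the previous sketch, so d_y below equals the
# last m entries of d computed there.
rng = np.random.default_rng(1)
n, nbar, m = 5, 4, 3
A = rng.standard_normal((m, n))
Abar = rng.standard_normal((m, nbar))
W = np.diag(rng.uniform(0.5, 2.0, n))
Wbar = np.diag(rng.uniform(0.5, 2.0, nbar))
r_x = rng.standard_normal(n)
r_xbar = rng.standard_normal(nbar)
r_y = rng.standard_normal(m)

# Symmetric positive definite Schur matrix: Cholesky needs no pivoting.
S = (A @ W) @ (A @ W).T + (Abar @ Wbar) @ (Abar @ Wbar).T
# Eliminating d_x and d_xbar yields this right-hand side ("..." on the slide).
rhs = r_y + A @ (W @ W.T) @ r_x + Abar @ (Wbar @ Wbar.T) @ r_xbar
d_y = cho_solve(cho_factor(S), rhs)

# Back-substitute for the remaining directions, as on the slide.
d_x = -(W @ W.T) @ (r_x - A.T @ d_y)
d_xbar = -(Wbar @ Wbar.T) @ (r_xbar - Abar.T @ d_y)
```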

  8. Computing the Schur matrix. Assumptions:
• Focus on $(\bar A \bar W)(\bar A \bar W)^T = \bar A \bar W \bar W^T \bar A^T$.
• Only one matrix variable; the general case follows easily.
• The NT search direction implies

$$
\bar W = R \otimes R \quad \text{and} \quad \bar W^T = R^T \otimes R^T,
$$

where the Kronecker product $\otimes$ is defined as

$$
R \otimes R = \begin{bmatrix} R_{11} R & R_{12} R & \cdots \\ R_{21} R & R_{22} R & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}.
$$
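
A small check (my example) of this scaling structure via the Kronecker mixed-product rule, which gives $\bar W \bar W^T = (R R^T) \otimes (R R^T)$:

```python
import numpy as np

rng = np.random.default_rng(2)
R = rng.standard_normal((3, 3))
Wbar = np.kron(R, R)                         # Wbar = R (x) R
assert np.allclose(Wbar @ Wbar.T, np.kron(R @ R.T, R @ R.T))
```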

  9. Fact:

$$
e_k^T \bar A \bar W \bar W^T \bar A^T e_l = \mathrm{vec}(\bar A_k)^T \, \mathrm{vec}(R R^T \bar A_l R R^T).
$$
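
A numerical verification of this fact on random symmetric data (my toy example; vec is taken column-major so that $(M \otimes M)\,\mathrm{vec}(X) = \mathrm{vec}(M X M)$ for symmetric $M = R R^T$):

```python
import numpy as np

rng = np.random.default_rng(3)
p, m = 3, 4                                  # matrix order, number of rows of Abar
R = rng.standard_normal((p, p))
M = R @ R.T

def vec(X):
    return X.ravel(order="F")                # column-major vectorization

As = []                                      # random symmetric Abar_1, ..., Abar_m
for _ in range(m):
    G = rng.standard_normal((p, p))
    As.append(G + G.T)

Abar = np.vstack([vec(Ak) for Ak in As])     # rows are vec(Abar_k)^T
lhs = Abar @ np.kron(M, M) @ Abar.T          # Abar Wbar Wbar^T Abar^T
rhs = np.array([[vec(Ak) @ vec(M @ Al @ M) for Al in As] for Ak in As])
assert np.allclose(lhs, rhs)
```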

  10. Exploiting sparsity 1. Compute the lower triangular part of

$$
\bar A \bar W \bar W^T \bar A^T =
\begin{bmatrix} \mathrm{vec}(\bar A_1)^T \\ \vdots \\ \mathrm{vec}(\bar A_m)^T \end{bmatrix}
\begin{bmatrix} \mathrm{vec}(R R^T \bar A_1 R R^T) & \cdots & \mathrm{vec}(R R^T \bar A_m R R^T) \end{bmatrix},
$$

so the $l$th column is computed as

$$
e_k^T \bar A \bar W \bar W^T \bar A^T e_l = \mathrm{vec}(\bar A_k)^T \, \mathrm{vec}(R R^T \bar A_l R R^T), \quad \text{for } k \ge l.
$$

Avoid computing

$$
e_k^T \bar A \bar W \bar W^T \bar A^T e_l = \mathrm{vec}(\bar A_k)^T \, \mathrm{vec}(R R^T \bar A_l R R^T) = 0
$$

when $\bar A_k = 0$ or $\bar A_l = 0$.
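
A minimal sketch of this column-wise computation (schur_lower is a hypothetical helper name; dense arrays stand in for MOSEK's sparse structures, and zero data matrices are represented as None):

```python
import numpy as np

def schur_lower(As, M):
    """Lower triangle of Abar Wbar Wbar^T Abar^T, skipping zero data matrices."""
    m = len(As)
    S = np.zeros((m, m))
    for l in range(m):
        if As[l] is None:                    # Abar_l = 0: whole column is zero
            continue
        T = M @ As[l] @ M                    # R R^T Abar_l R R^T, once per column
        for k in range(l, m):                # only k >= l is needed
            if As[k] is None:                # Abar_k = 0: entry is zero
                continue
            S[k, l] = np.sum(As[k] * T)      # vec(Abar_k)^T vec(T)
    return S
```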

  11. Exploiting sparsity 2. Moreover,
• $R$ is a dense square matrix.
• $\bar A_i$ is typically extremely sparse, e.g. $\bar A_i = e_k e_k^T$, as observed by J. Sturm, for instance.
• WLOG assume $\bar A_i = U_i V_i^T + (U_i V_i^T)^T$, because $U_i = \bar A_i / 2$ and $V_i = I$ is a valid choice.
• In practice $U_i$ and $V_i$ are sparse and low rank, e.g. have few columns.
• The new idea!

  12. Recall that

$$
e_k^T \bar A \bar W \bar W^T \bar A^T e_l = \mathrm{vec}(\bar A_k)^T \, \mathrm{vec}(R R^T \bar A_l R R^T)
$$

must be computed for all $k \ge l$, and that

$$
R R^T \bar A_l R R^T = R R^T \big( U_l V_l^T + (U_l V_l^T)^T \big) R R^T = \hat U_l \hat V_l^T + (\hat U_l \hat V_l^T)^T,
$$

where $\hat U_l := R R^T U_l$ and $\hat V_l := R R^T V_l$.
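
A toy check (my example) of this low-rank update identity:

```python
import numpy as np

rng = np.random.default_rng(4)
p, r = 4, 2                                  # matrix order, columns in the factors
R = rng.standard_normal((p, p))
M = R @ R.T
U = rng.standard_normal((p, r))
V = rng.standard_normal((p, r))

Abar_l = U @ V.T + (U @ V.T).T               # symmetric low-rank data matrix
Uhat, Vhat = M @ U, M @ V                    # hat U_l := R R^T U_l, hat V_l := R R^T V_l
assert np.allclose(M @ Abar_l @ M, Uhat @ Vhat.T + (Uhat @ Vhat.T).T)
```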

  13.
• $\hat U_l$ and $\hat V_l$ are dense matrices.
• Sparsity in $U_l$ and $V_l$ is exploited.
• Low-rank structure is exploited.
• Is all of $\hat U_l$ and $\hat V_l$ required?

  14. Observe that

$$
e_i^T \big( U_k V_k^T + (U_k V_k^T)^T \big) = 0, \quad \forall i \notin I_k,
$$

where $I_k := \{ i \mid U_{k,i:} \ne 0 \vee V_{k,i:} \ne 0 \}$. Now

$$
\begin{aligned}
\mathrm{vec}(\bar A_k)^T \, \mathrm{vec}(R R^T \bar A_l R R^T)
&= \mathrm{vec}\big( U_k V_k^T + (U_k V_k^T)^T \big)^T \mathrm{vec}\big( \hat U_l \hat V_l^T + (\hat U_l \hat V_l^T)^T \big) \\
&= \sum_i 2 (U_k e_i)^T \big( \hat U_l \hat V_l^T + (\hat U_l \hat V_l^T)^T \big) (V_k e_i).
\end{aligned}
$$

Therefore, only the rows of $\hat U_l$ and $\hat V_l$ corresponding to $\bigcup_{k \ge l} I_k$ are needed.

  15. Proposed algorithm:
• Compute $I := \bigcup_{k \ge l} I_k$.
• Compute the rows $\hat U_{l,I:}$ and $\hat V_{l,I:}$.
• Compute $\sum_i 2 (U_k e_i)^T \big( \hat U_l \hat V_l^T + (\hat U_l \hat V_l^T)^T \big) (V_k e_i)$ (a sketch follows below).
Possible improvements:
• Exploit the special case $U_{k,:j} = \alpha V_{k,:j}$.
• Exploit dense computations, e.g. level 3 BLAS, when possible and worthwhile.
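
A NumPy sketch of the proposed algorithm as I read it from slides 14-15 (schur_lowrank is a hypothetical name; dense arrays stand in for sparse data, and the index sets $I_k$ are found from explicit zeros):

```python
import numpy as np

def schur_lowrank(Us, Vs, M):
    """Lower triangle of the SDP Schur part from Abar_k = U_k V_k^T + (U_k V_k^T)^T."""
    m = len(Us)
    S = np.zeros((m, m))
    # Nonzero-row index sets I_k of (U_k, V_k).
    I = [np.flatnonzero(np.any(Us[k] != 0, axis=1) | np.any(Vs[k] != 0, axis=1))
         for k in range(m)]
    for l in range(m):
        need = np.unique(np.concatenate(I[l:]))          # union of I_k for k >= l
        Uhat = M[need] @ Us[l]                           # needed rows of hat U_l
        Vhat = M[need] @ Vs[l]                           # needed rows of hat V_l
        pos = {row: j for j, row in enumerate(need)}
        for k in range(l, m):
            idx = np.array([pos[row] for row in I[k]], dtype=int)
            Mkl = Uhat[idx] @ Vhat[idx].T                # restricted hat U_l hat V_l^T
            Mkl = Mkl + Mkl.T
            Uk, Vk = Us[k][I[k]], Vs[k][I[k]]
            S[k, l] = 2.0 * np.sum(Uk * (Mkl @ Vk))      # sum_i 2 (U_k e_i)^T M (V_k e_i)
    return S

# Sanity check against the direct formula vec(Abar_k)^T vec(M Abar_l M).
rng = np.random.default_rng(5)
p, r, m = 5, 2, 3
G = rng.standard_normal((p, p))
M = G @ G.T
Us = [rng.standard_normal((p, r)) for _ in range(m)]
Vs = [rng.standard_normal((p, r)) for _ in range(m)]
S = schur_lowrank(Us, Vs, M)
for k in range(m):
    for l in range(k + 1):
        Ak = Us[k] @ Vs[k].T + Vs[k] @ Us[k].T
        Al = Us[l] @ Vs[l].T + Vs[l] @ Us[l].T
        assert np.isclose(S[k, l], np.sum(Ak * (M @ Al @ M)))
```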

  16. Summary:
• Exploits sparsity, as done in SeDuMi by Sturm.
• Is also able to exploit low-rank structure.
• Not implemented yet!

  17. Linear algebra summary
• Sparse matrix operations, e.g. multiplications.
• Large sparse matrix factorization, e.g. Cholesky.
  • Including ordering (AMD, GP).
  • Dense column detection and handling.
• Dense sequential level 1, 2, 3 BLAS operations.
  • Inside sparse Cholesky, for instance.
• The sequential Intel Math Kernel Library is employed extensively.
• Eigenvalue computations.
• What about parallelization?
  • Modern computers have many cores, typically from 4 to 12.
  • A recent customer example had 80.

  18. The parallelization challenge on shared memory
• A computer has many cores.
• Parallelization using native threads is cumbersome and error prone.
• Employ a parallelization framework, e.g. Cilk or OpenMP.
Other issues:
• Exploit caches.
• Do not overload the memory bus.
• Avoid too fine-grained parallelism, due to threading overhead.

  19. Cilk summary:
• An extension to C and C++.
• Has a runtime environment that executes tasks in parallel on a number of workers.
• Handles the load balancing.
• Allows nested/recursive parallelism, e.g.
  • parallel dense matrix multiplication within a parallelized sparse Cholesky;
  • a parallel IPM within B&B.
• Is run-to-run deterministic.
  • Care must be taken in floating point computations.
• Supported by the Intel C compiler, GCC, and Clang.

  20. Example: parallelized dense syrk. The dense level 3 BLAS syrk operation computes $C = A A^T$. Parallelized version using Cilk:

    if C is small
        C = A A^T
    else
        cilk_spawn syrk: C11 = A_{1:} A_{1:}^T
        cilk_spawn gemm: C21 = A_{2:} A_{1:}^T
        cilk_spawn syrk: C22 = A_{2:} A_{2:}^T
        cilk_sync

Use of recursion is allowed!
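
For reference, a serial Python sketch of the same recursive blocking (my illustration; in Cilk each of the three sub-block computations would be a cilk_spawn, followed by a cilk_sync before returning):

```python
import numpy as np

def syrk(A, threshold=64):
    """C = A A^T by recursive blocking on the rows of A."""
    n = A.shape[0]
    if n <= threshold:                       # small case: one direct BLAS call
        return A @ A.T
    h = n // 2
    A1, A2 = A[:h], A[h:]
    C = np.empty((n, n))
    C[:h, :h] = syrk(A1, threshold)          # C11 = A1 A1^T  (cilk_spawn syrk)
    C[h:, :h] = A2 @ A1.T                    # C21 = A2 A1^T  (cilk_spawn gemm)
    C[h:, h:] = syrk(A2, threshold)          # C22 = A2 A2^T  (cilk_spawn syrk)
    C[:h, h:] = C[h:, :h].T                  # fill C12 by symmetry (BLAS syrk
    return C                                 # would keep only one triangle)
```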

  21. Our experience with Cilk
• Cilk is easy to learn, i.e. 3 new keywords.
• Nested/recursive parallelism is allowed.
• Useful for both sparse and dense matrix computations.
• Efficient parallelization is nevertheless hard.

  22. Summary and conclusions
• I am behind schedule with MOSEK version 8.
• Proposed a new algorithm for computing the Schur matrix in the semidefinite case.
• Discussed the use of a task-based parallelization framework, exemplified by Cilk.
• Slides: https://mosek.com/resources/presentations
