  1. Algorithms & Techniques for Dense Linear Algebra over Small Finite Fields Martin R. Albrecht (martinralbrecht+summerschool@googlemail.com) POLSYS Team, UPMC, Paris, France ECRYPT II PhD Summer School

  2. Outline: F_2 (Gray Codes, Multiplication, Elimination); F_p; F_{2^e} (Precomputation Tables, Karatsuba Multiplication, Performance); F_p[x]

  3. Outline: F_2 (Gray Codes, Multiplication, Elimination); F_p; F_{2^e} (Precomputation Tables, Karatsuba Multiplication, Performance); F_p[x]

  4. The M4RI Library
  ◮ available under the GPL Version 2 or later (GPLv2+)
  ◮ provides basic arithmetic (addition, equality testing, stacking, augmenting, sub-matrices, randomisation, etc.)
  ◮ asymptotically fast multiplication
  ◮ asymptotically fast elimination
  ◮ some multi-core support
  ◮ Linux, Mac OS X (x86 and PPC), OpenSolaris (Sun Studio Express) and Windows (Cygwin)
  http://m4ri.sagemath.org

  5. F_2
  ◮ the field with two elements
  ◮ addition is logical bitwise XOR (⊕): 0⊕0=0, 0⊕1=1, 1⊕0=1, 1⊕1=0
  ◮ multiplication is logical bitwise AND (⊙): 0⊙0=0, 0⊙1=0, 1⊙0=0, 1⊙1=1
  ◮ 64 (128) basic operations in at most one CPU cycle
  ◮ ... arithmetic is rather cheap
  Memory access is the expensive operation, not arithmetic.
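The word-level view above can be sketched in Python (my own illustration, not M4RI code): rows over F_2 are packed into 64-bit machine words, so a single XOR instruction adds 64 field elements at once.

```python
WORD = 64

def pack(bits):
    """Pack a list of 0/1 entries into a list of 64-bit words."""
    words = [0] * ((len(bits) + WORD - 1) // WORD)
    for i, b in enumerate(bits):
        if b:
            words[i // WORD] |= 1 << (i % WORD)
    return words

def row_add(dst, src):
    """dst + src over F2: one XOR per word covers 64 entries."""
    return [d ^ s for d, s in zip(dst, src)]

a = pack([1, 0, 1, 1])   # 0b1101 = 13
b = pack([1, 1, 0, 1])   # 0b1011 = 11
print(row_add(a, b))     # [6], i.e. the row (0, 1, 1, 0)
```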

  6. Outline: F_2 (Gray Codes, Multiplication, Elimination); F_p; F_{2^e} (Precomputation Tables, Karatsuba Multiplication, Performance); F_p[x]

  7. Gray Codes The Gray code [Gra53], named after Frank Gray and also known as the reflected binary code, is a numbering system in which two consecutive values differ in only one digit.

  8. Gray Code Examples
  2-bit: 00, 01, 11, 10
  3-bit: 000, 001, 011, 010, 110, 111, 101, 100
  4-bit: 0000, 0001, 0011, 0010, 0110, 0111, 0101, 0100, 1100, 1101, 1111, 1110, 1010, 1011, 1001, 1000
  The (n+1)-bit code is obtained by copying the n-bit code (⇓), appending it in reverse order (⇑), and prefixing the first half with 0 and the second half with 1.

  9. Applications Gray codes are used in various applications where all vectors over small finite fields need to be enumerated, such as:
  ◮ matrix multiplication;
  ◮ fast exhaustive search of Boolean polynomial systems;
  ◮ cube attacks on Grain-128.
  Gray codes are a pretty basic part of the cryptographer's toolkit because they allow one to reduce the cost of enumerating all vectors over F_2 of length n from n · 2^(n−1) to 2^n − 1 additions.
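The enumeration works because the i-th reflected code is simply i ⊕ (i >> 1), so each step flips exactly one bit and costs one vector addition. A small Python check (my own sketch):

```python
def gray(i):
    """i-th reflected binary Gray code."""
    return i ^ (i >> 1)

n = 3
codes = [gray(i) for i in range(2 ** n)]

# Consecutive codes differ in exactly one bit, so enumerating all 2^n
# vectors costs only 2^n - 1 single-bit updates.
assert all(bin(codes[i] ^ codes[i + 1]).count("1") == 1
           for i in range(len(codes) - 1))
print([format(c, "03b") for c in codes])
# ['000', '001', '011', '010', '110', '111', '101', '100']
```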

  10. Outline: F_2 (Gray Codes, Multiplication, Elimination); F_p; F_{2^e} (Precomputation Tables, Karatsuba Multiplication, Performance); F_p[x]

  11. M4RM [ADKF70] I Consider C = A · B (A is m × ℓ and B is ℓ × n). A can be divided into ℓ/k vertical "stripes" A_0 ... A_{(ℓ−1)/k} of k columns each. B can be divided into ℓ/k horizontal "stripes" B_0 ... B_{(ℓ−1)/k} of k rows each. We have:
  C = A · B = Σ_{i=0}^{(ℓ−1)/k} A_i · B_i.

  12. M4RM [ADKF70] II [Worked example: a 4 × 4 matrix A is split into column stripes A_0, A_1 and B into row stripes B_0, B_1 with k = 2; the products A_0 · B_0 and A_1 · B_1 are computed and summed to give C = A · B.]
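The stripe identity can be checked numerically. The 4 × 4 matrices below are my own (the slide's original numbers did not survive extraction); with k = 2 the full product equals the sum of the two stripe products:

```python
def matmul2(A, B):
    """Naive matrix product over F2 (entries are 0/1 ints)."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B))) % 2
             for j in range(len(B[0]))] for i in range(len(A))]

def matadd2(A, B):
    return [[a ^ b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

A = [[1, 1, 0, 1],
     [1, 1, 1, 1],
     [0, 0, 0, 1],
     [0, 1, 1, 1]]
B = [[1, 0, 1, 1],
     [0, 1, 1, 0],
     [1, 1, 0, 1],
     [0, 1, 0, 1]]

A0 = [row[:2] for row in A]   # first k = 2 columns of A
A1 = [row[2:] for row in A]   # last  k = 2 columns of A
B0 = B[:2]                    # first k = 2 rows of B
B1 = B[2:]                    # last  k = 2 rows of B

C = matadd2(matmul2(A0, B0), matmul2(A1, B1))
assert C == matmul2(A, B)     # the stripe sum equals the full product
```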

  13. M4RM: Algorithm, O(n^3 / log n)
  1 begin
  2     C ← create an m × n matrix with all entries 0;
  3     k ← ⌊log n⌋;
  4     for 0 ≤ i < (ℓ/k) do
  5         T ← MakeTable(B, i × k, 0, k);  // create table of 2^k − 1 linear combinations
  6         for 0 ≤ j < m do
  7             id ← ReadBits(A, j, i × k, k);  // read index for table T
  8             add row id from T to row j of C;
  9     return C;
  Algorithm 1: M4RM
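A compact Python sketch of the algorithm (my own code; the names MakeTable and ReadBits follow the pseudocode). Matrices are lists of row integers with bit t holding column t, and the table is filled with one row addition per entry by walking the Gray code:

```python
def make_table(B_rows, k):
    """All 2^k linear combinations of k rows of B, one XOR per entry."""
    T = [0] * (1 << k)
    for i in range(1, 1 << k):
        g = i ^ (i >> 1)                    # Gray code of i
        prev = (i - 1) ^ ((i - 1) >> 1)     # previous Gray code
        bit = (g ^ prev).bit_length() - 1   # index of the single flipped bit
        T[g] = T[prev] ^ B_rows[bit]
    return T

def read_bits(row, start, k):
    """k bits of a row integer, starting at position start."""
    return (row >> start) & ((1 << k) - 1)

def m4rm(A, B, ell, k):
    """C = A * B over F2; A has ell-bit rows, B has ell rows."""
    C = [0] * len(A)
    for i in range(0, ell, k):
        kk = min(k, ell - i)
        T = make_table(B[i:i + kk], kk)
        for j, row in enumerate(A):
            C[j] ^= T[read_bits(row, i, kk)]
    return C

# 2x2 demo: A = [[1,1],[0,1]], B = [[1,0],[1,1]], bit t = column t
assert m4rm([0b11, 0b10], [0b01, 0b11], ell=2, k=2) == [0b10, 0b11]
```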

  14. Strassen-Winograd [Str69] Multiplication
  ◮ fastest known practical algorithm
  ◮ complexity: O(n^(log2 7))
  ◮ linear algebra constant: ω = log2 7
  ◮ M4RM can be used as base case for small dimensions → optimisation of this base case
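A minimal recursive sketch over F_2 (my own illustration using classic Strassen with 7 recursive products; the Winograd variant saves a few additions but keeps the same 7 products, and this is not M4RI's implementation). Over F_2 subtraction equals addition, so every ± becomes XOR:

```python
def add(A, B):
    return [[a ^ b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def naive(A, B):
    """Cubic base-case product over F2 (M4RM would be used here instead)."""
    n = len(A)
    return [[sum(A[i][t] & B[t][j] for t in range(n)) & 1
             for j in range(n)] for i in range(n)]

def strassen(A, B, cutoff=2):
    """Strassen recursion over F2 for power-of-two dimensions."""
    n = len(A)
    if n <= cutoff:
        return naive(A, B)
    h = n // 2
    q = lambda M, r, c: [row[c:c + h] for row in M[r:r + h]]
    A11, A12, A21, A22 = q(A, 0, 0), q(A, 0, h), q(A, h, 0), q(A, h, h)
    B11, B12, B21, B22 = q(B, 0, 0), q(B, 0, h), q(B, h, 0), q(B, h, h)
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, add(B12, B22), cutoff)
    M4 = strassen(A22, add(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(add(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(add(A12, A22), add(B21, B22), cutoff)
    C11 = add(add(M1, M4), add(M5, M7))
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(M1, M2), add(M3, M6))
    return ([C11[i] + C12[i] for i in range(h)] +
            [C21[i] + C22[i] for i in range(h)])

A = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 1], [0, 0, 1, 1]]
B = [[1, 1, 0, 0], [0, 1, 1, 1], [1, 0, 1, 0], [1, 1, 1, 0]]
assert strassen(A, B) == naive(A, B)
```

In M4RI the recursion bottoms out in M4RM rather than the cubic `naive` above, which is exactly the base-case optimisation the slide mentions.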

  15. Cache Friendly M4RM I
  1 begin
  2     C ← create an m × n matrix with all entries 0;
  3     for 0 ≤ i < (ℓ/k) do
  4         T ← MakeTable(B, i × k, 0, k);  // this is cheap in terms of memory access
  5         for 0 ≤ j < m do
  6             id ← ReadBits(A, j, i × k, k);  // for each load of row j we take care of only k bits
  7             add row id from T to row j of C;
  8     return C;

  16. Cache Friendly M4RM II
  1 begin
  2     C ← create an m × n matrix with all entries 0;
  3     for 0 ≤ start < m/bs do
  4         for 0 ≤ i < (ℓ/k) do
  5             T ← MakeTable(B, i × k, 0, k);  // we regenerate T for each block
  6             for 0 ≤ s < bs do
  7                 j ← start × bs + s;
  8                 id ← ReadBits(A, j, i × k, k);
  9                 add row id from T to row j of C;
  10    return C;

  17. t > 1 Gray Code Tables I
  ◮ actual arithmetic is quite cheap compared to memory reads and writes
  ◮ the cost of memory accesses greatly depends on where in memory data is located
  ◮ try to fill all of L1 with Gray code tables
  ◮ Example: k = 10 and 1 Gray code table → 10 bits at a time; k = 9 and 2 Gray code tables → still the same memory for the tables, but 18 bits at once
  ◮ the price is one extra row addition, which is cheap if the operands are all in cache

  18. t > 1 Gray Code Tables II
  1 begin
  2     C ← create an m × n matrix with all entries 0;
  3     for 0 ≤ i < (ℓ/(2k)) do
  4         T0 ← MakeTable(B, i × 2k, 0, k);
  5         T1 ← MakeTable(B, i × 2k + k, 0, k);
  6         for 0 ≤ j < m do
  7             id0 ← ReadBits(A, j, i × 2k, k);
  8             id1 ← ReadBits(A, j, i × 2k + k, k);
  9             add row id0 from T0 and row id1 from T1 to row j of C;
  10    return C;
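The footprint argument behind choosing two tables, in numbers (my own arithmetic check, counting table rows; actual bytes depend on the matrix width n):

```python
one_table_k10 = 2 ** 10        # rows for a single k = 10 table
two_tables_k9 = 2 * 2 ** 9     # rows for two k = 9 tables

assert one_table_k10 == two_tables_k9   # identical L1 footprint ...
print("bits handled per pass:", 10, "->", 2 * 9)   # ... but 18 > 10 bits
```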

  19. Performance: Multiplication [Plot: execution time t (1s to 31s) against matrix dimension n (2,000 to 26,000), comparing Magma and M4RI.] Figure: 2.66 GHz Intel i7, 4GB RAM

  20. Outline: F_2 (Gray Codes, Multiplication, Elimination); F_p; F_{2^e} (Precomputation Tables, Karatsuba Multiplication, Performance); F_p[x]

  21. PLE Decomposition I Definition (PLE) Let A be an m × n matrix over a field K. A PLE decomposition of A is a triple of matrices P, L and E such that P is an m × m permutation matrix, L is a unit lower triangular matrix, E is an m × n matrix in row-echelon form, and A = PLE. PLE decomposition can be computed in place, that is, L and E are stored in A and P is stored as an m-vector.

  22. PLE Decomposition II From the PLE decomposition we can
  ◮ read the rank r,
  ◮ read the row rank profile (pivots),
  ◮ compute the null space,
  ◮ solve y = Ax for x, and
  ◮ compute the (reduced) row echelon form.
  C.-P. Jeannerod, C. Pernet, and A. Storjohann. Rank-profile revealing Gaussian elimination and the CUP matrix decomposition. arXiv:1112.5717, 35 pages, 2012.
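An iterative PLE sketch over F_2 in pure Python (my own toy code; M4RI's version is block recursive and works in place). It returns the permutation as a list p with (PA)[i] = A[p[i]], and the rank and pivots fall out directly:

```python
def ple(A):
    """PLE over F2: returns (p, L, E, pivots) with A[p[i]] = (L*E)[i]."""
    m, n = len(A), len(A[0])
    E = [row[:] for row in A]       # copy of A, reduced to row echelon form
    L = [[int(i == j) for j in range(m)] for i in range(m)]
    p = list(range(m))
    r, c, pivots = 0, 0, []
    while r < m and c < n:
        piv = next((i for i in range(r, m) if E[i][c]), None)
        if piv is None:             # no pivot in this column: skip it
            c += 1
            continue
        E[r], E[piv] = E[piv], E[r]                   # swap rows of E ...
        p[r], p[piv] = p[piv], p[r]                   # ... record in p ...
        L[r][:r], L[piv][:r] = L[piv][:r], L[r][:r]   # ... permute L's history
        for i in range(r + 1, m):   # eliminate below the pivot
            if E[i][c]:
                L[i][r] = 1
                E[i] = [x ^ y for x, y in zip(E[i], E[r])]
        pivots.append(c)
        r, c = r + 1, c + 1
    return p, L, E, pivots

A = [[1, 1, 0], [1, 1, 1], [0, 0, 1]]
p, L, E, pivots = ple(A)
assert len(pivots) == 2            # rank
assert pivots == [0, 2]            # row rank profile
LE = [[sum(L[i][t] & E[t][j] for t in range(3)) & 1 for j in range(3)]
      for i in range(3)]
assert [A[p[i]] for i in range(3)] == LE   # P*A == L*E over F2
```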

  23. Block Recursive PLE Decomposition O(n^ω) I

  24. Block Recursive PLE Decomposition O(n^ω) II

  25. Block Recursive PLE Decomposition O(n^ω) III: A_NE ← L_NW^(−1) × A_NE

  26. Block Recursive PLE Decomposition O(n^ω) IV: A_SE ← A_SE + A_SW × A_NE

  27. Block Recursive PLE Decomposition O(n^ω) V

  28. Block Recursive PLE Decomposition O(n^ω) VI

  29. Block Iterative PLE Decomposition I We need an efficient base case for PLE decomposition:
  ◮ block recursive PLE decomposition gives rise to a block iterative PLE decomposition
  ◮ choose blocks of size k = log n and use M4RM for the "update" multiplications
  ◮ this gives a complexity of O(n^3 / log n)

  30. Block Iterative PLE Decomposition II

  31. Block Iterative PLE Decomposition III

  32. Block Iterative PLE Decomposition IV: A_NE ← L^(−1) × A_NE

  33. Block Iterative PLE Decomposition V

  34. Block Iterative PLE Decomposition VI: A_SE ← A_SE + A_SW × A_NE

  35. Block Iterative PLE Decomposition VII

  36. Block Iterative PLE Decomposition VIII

  37. Block Iterative PLE Decomposition IX: A_NE ← L^(−1) × A_NE

  38. Block Iterative PLE Decomposition X: A_SE ← A_SE + A_SW × A_NE

  39. Block Iterative PLE Decomposition XI

  40. Performance: Reduced Row Echelon Form [Plot: execution time t (1s to 31s) against matrix dimension n (2,000 to 26,000) for Magma and M4RI; fitted constants c_MAGMA ≈ 6.8 · 10^(−12) and c_M4RI ≈ 4.3 · 10^(−12).] Figure: 2.66 GHz Intel i7, 4GB RAM

  41. Performance: Row Echelon Form Using one core – on sage.math – we can compute the echelon form of a 500,000 × 500,000 dense random matrix over F_2 in 9711 seconds = 2.7 hours (c ≈ 10^(−12)). Using four cores we can compute the echelon form of a random dense 500,000 × 500,000 matrix in 3806 seconds = 1.05 hours.

  42. Caveat: Sensitivity to Sparsity [Plot: execution time t (1s to 6s) against non-zero elements per row (2 to 18), comparing Magma, M4RI, and PLE.] Figure: Gaussian elimination of 10,000 × 10,000 matrices on Intel 2.33GHz Xeon E5345 comparing Magma 2.17-12 and M4RI 20111004.
