  1. Optimizing linear maps modulo 2 (i.e.: fast xor sequences for bitsliced software) D. J. Bernstein University of Illinois at Chicago NSF ITR–0716498

  2. Example: size-4 poly Karatsuba. Start with size 2: F = F0 + F1 x, G = G0 + G1 x, H0 = F0 G0, H2 = F1 G1, H1 = (F0 + F1)(G0 + G1) − H0 − H2, so FG = H0 + H1 x + H2 x^2. Substitute x = t^2 etc.: F = f0 + f1 t + f2 t^2 + f3 t^3, G = g0 + g1 t + g2 t^2 + g3 t^3, H0 = (f0 + f1 t)(g0 + g1 t), H2 = (f2 + f3 t)(g2 + g3 t), H1 = (f0 + f2 + (f1 + f3) t)(g0 + g2 + (g1 + g3) t) − H0 − H2, so FG = H0 + H1 t^2 + H2 t^4.
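The size-4 split can be checked with a short sketch (Python; the function names are mine): pack GF(2) polynomials into integer bits, do the three size-2 carry-less multiplications, and compare against schoolbook carry-less multiplication.

```python
def clmul(a, b):
    """Schoolbook carry-less (GF(2)[t]) product of bit-packed polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba4(f, g):
    """Size-4 Karatsuba over GF(2): f, g are 4-bit polynomials
    f0 + f1 t + f2 t^2 + f3 t^3, split at x = t^2."""
    f0, f1 = f & 0b11, f >> 2                # (f0 + f1 t) and (f2 + f3 t)
    g0, g1 = g & 0b11, g >> 2
    h0 = clmul(f0, g0)
    h2 = clmul(f1, g1)
    h1 = clmul(f0 ^ f1, g0 ^ g1) ^ h0 ^ h2   # middle term; mod 2, minus is xor
    return h0 ^ (h1 << 2) ^ (h2 << 4)        # H0 + H1 t^2 + H2 t^4

# exhaustive check over all pairs of 4-bit polynomials
assert all(karatsuba4(f, g) == clmul(f, g)
           for f in range(16) for g in range(16))
```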

  3. Initial linear computation: f0 + f2; f1 + f3; g0 + g2; g1 + g3; algebraic complexity 4. Three size-2 mults producing H0 = p0 + p1 t + p2 t^2; H2 = q0 + q1 t + q2 t^2; H0 + H1 + H2 = r0 + r1 t + r2 t^2. Final linear reconstruction: H1 = (r0 − p0 − q0) + (r1 − p1 − q1) t + (r2 − p2 − q2) t^2, algebraic complexity 6; FG = H0 + H1 t^2 + H2 t^4, algebraic complexity 2.

  4. Let’s look more closely at the reconstruction: h0 = p0; h1 = p1; h2 = p2 + (r0 − p0 − q0); h3 = (r1 − p1 − q1); h4 = (r2 − p2 − q2) + q0; h5 = q1; h6 = q2.

  5. Let’s look more closely at the reconstruction: h0 = p0; h1 = p1; h2 = p2 + (r0 − p0 − q0); h3 = (r1 − p1 − q1); h4 = (r2 − p2 − q2) + q0; h5 = q1; h6 = q2. Can observe manually that p2 ⊕ q0 is repeated. See, e.g., 2000 Bernstein.

  6. Some addition-chain algorithms will automatically find this speedup. Consider, e.g., the greedy additive CSE algorithm from 1997 Paar: find the most popular input pair i0, i1; compute i0 ⊕ i1; simplify using i0 ⊕ i1; repeat. This algorithm would have automatically found p2 ⊕ q0 inside the Karatsuba reconstruction.
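A minimal sketch of this greedy pair-sharing idea (Python; the function name and the cost accounting are mine, not Paar's exact formulation): each row is the set of input indices it sums, and the most frequently co-occurring pair is repeatedly replaced by a fresh variable. On the reconstruction above it first extracts p2 ⊕ q0 and ends at 7 xors.

```python
from collections import Counter
from itertools import combinations

def greedy_cse(rows):
    """Greedy additive CSE in the spirit of 1997 Paar: repeatedly
    introduce a new variable for the most popular co-occurring pair,
    then rewrite the rows.  Returns (xor_count, new_vars)."""
    rows = [set(r) for r in rows]
    next_var = max(max(r) for r in rows) + 1
    new_vars = []
    while True:
        counts = Counter()
        for r in rows:
            counts.update(combinations(sorted(r), 2))
        if not counts or counts.most_common(1)[0][1] < 2:
            break                              # no pair shared by two rows
        (a, b), _ = counts.most_common(1)[0]
        new_vars.append((a, b, next_var))      # next_var := a xor b
        for r in rows:
            if a in r and b in r:
                r -= {a, b}
                r.add(next_var)
        next_var += 1
    # one xor per new variable, plus k-1 xors to sum each k-term row
    xors = len(new_vars) + sum(max(len(r) - 1, 0) for r in rows)
    return xors, new_vars

# Karatsuba reconstruction; inputs p0,p1,p2,q0,q1,q2,r0,r1,r2 = 0..8
H = [{0}, {1}, {0, 2, 3, 6}, {1, 4, 7}, {2, 3, 5, 8}, {4}, {5}]
xors, new_vars = greedy_cse(H)
assert new_vars[0][:2] == (2, 3)   # the shared p2 xor q0
assert xors == 7
```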

  7. Today’s algorithm: “xor largest.” Start with the matrix mod 2 for the desired linear map. h0: 100000000 h1: 010000000 h2: 101100100 h3: 010010010 h4: 001101001 h5: 000010000 h6: 000001000 Each row has the coefficients of p0, p1, p2, q0, q1, q2, r0, r1, r2.

  8. Replace largest row by its xor with second-largest row. 100000000 010000000 001100100 010010010 001101001 000010000 000001000 Recursively compute this, and finish with one xor.

  9. If two largest rows don’t have same first bit, change largest row by clearing first bit. 000000000 010000000 001100100 010010010 001101001 000010000 000001000 Recursively compute this, and finish with one xor (often just a copy).

  10. Continue in the same way: 100000000 010000000 101100100 010010010 001101001 000010000 000001000 (starting matrix again)

  11. Continue in the same way: 100000000 010000000 001100100 010010010 001101001 000010000 000001000 plus 1 xor.

  12. Continue in the same way: 000000000 010000000 001100100 010010010 001101001 000010000 000001000 plus 1 xor, 1 input load.

  13. Continue in the same way: 000000000 010000000 001100100 000010010 001101001 000010000 000001000 plus 2 xors, 1 input load.

  14. Continue in the same way: 000000000 000000000 001100100 000010010 001101001 000010000 000001000 plus 2 xors, 2 input loads.

  15. Continue in the same way: 000000000 000000000 001100100 000010010 000001101 000010000 000001000 plus 3 xors, 2 input loads.

  16. Continue in the same way: 000000000 000000000 000100100 000010010 000001101 000010000 000001000 plus 4 xors, 3 input loads.

  17. Continue in the same way: 000000000 000000000 000000100 000010010 000001101 000010000 000001000 plus 5 xors, 4 input loads.

  18. Continue in the same way: 000000000 000000000 000000100 000000010 000001101 000010000 000001000 plus 6 xors, 4 input loads.

  19. Continue in the same way: 000000000 000000000 000000100 000000010 000001101 000000000 000001000 plus 6 xors, 5 input loads.

  20. Continue in the same way: 000000000 000000000 000000100 000000010 000000101 000000000 000001000 plus 7 xors, 5 input loads.

  21. Continue in the same way: 000000000 000000000 000000100 000000010 000000101 000000000 000000000 plus 7 xors, 6 input loads.

  22. Continue in the same way: 000000000 000000000 000000100 000000010 000000001 000000000 000000000 plus 8 xors, 6 input loads.

  23. Continue in the same way: 000000000 000000000 000000000 000000010 000000001 000000000 000000000 plus 8 xors, 7 input loads.

  24. Continue in the same way: 000000000 000000000 000000000 000000000 000000001 000000000 000000000 plus 8 xors, 8 input loads.

  25. Continue in the same way: 000000000 000000000 000000000 000000000 000000000 000000000 000000000 plus 8 xors, 9 input loads.

  26. Continue in the same way: 000000000 000000000 000000000 000000000 000000000 000000000 000000000 plus 8 xors, 9 input loads. “Is this supposed to be an interesting algorithm?”

  27. Another example: 000100000 000010000 100101100 010010010 001001101 000000010 000000001 Same matrix, but inputs in a different order: first the r’s (used once each), then the p’s (used twice each), then the q’s (used twice each).

  28. Another example: 000100000 000010000 000101100 010010010 001001101 000000010 000000001 plus 1 xor, 1 input load.

  29. Another example: 000100000 000010000 000101100 000010010 001001101 000000010 000000001 plus 2 xors, 2 input loads.

  30. Another example: 000100000 000010000 000101100 000010010 000001101 000000010 000000001 plus 3 xors, 3 input loads.

  31. Another example: 000100000 000010000 000001100 000010010 000001101 000000010 000000001 plus 4 xors, 3 input loads.

  32. Another example: 000000000 000010000 000001100 000010010 000001101 000000010 000000001 plus 4 xors, 4 input loads.

  33. Another example: 000000000 000010000 000001100 000000010 000001101 000000010 000000001 plus 5 xors, 4 input loads.

  34. Another example: 000000000 000000000 000001100 000000010 000001101 000000010 000000001 plus 5 xors, 5 input loads.

  35. Another example: 000000000 000000000 000001100 000000010 000000001 000000010 000000001 plus 6 xors, 5 input loads.

  36. Another example: 000000000 000000000 000000100 000000010 000000001 000000010 000000001 plus 7 xors, 6 input loads.

  37. Another example: 000000000 000000000 000000000 000000010 000000001 000000010 000000001 plus 7 xors, 7 input loads.

  38. Another example: 000000000 000000000 000000000 000000000 000000001 000000010 000000001 plus 7 xors, 7 input loads.

  39. Another example: 000000000 000000000 000000000 000000000 000000001 000000000 000000001 plus 7 xors, 8 input loads.

  40. Another example: 000000000 000000000 000000000 000000000 000000000 000000000 000000001 plus 7 xors, 8 input loads.

  41. Another example: 000000000 000000000 000000000 000000000 000000000 000000000 000000000 plus 7 xors, 9 input loads. Algorithm found the speedup.

  42. Another example: 000000000 000000000 000000000 000000000 000000000 000000000 000000000 plus 7 xors, 9 input loads. Algorithm found the speedup. Also has other useful features.
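The whole walk-through can be mechanized. Below is a small simulator of the “xor largest” rule (Python; a sketch with cost accounting inferred from the traces above: xoring with the second-largest row costs one xor, clearing a leading bit costs one input load plus one xor unless the row empties, and an exact duplicate row is a free copy).

```python
def xor_largest(rows):
    """Simulate the 'xor largest' reduction on a list of rows
    (ints, one bit per input column).  Returns (xors, input_loads)."""
    rows = list(rows)
    xors = loads = 0
    while any(rows):
        order = sorted(range(len(rows)), key=rows.__getitem__, reverse=True)
        i, j = order[0], order[1]
        big, second = rows[i], rows[j]
        top = 1 << (big.bit_length() - 1)
        if big == second:
            rows[i] = 0              # output i is a plain copy of output j
        elif second & top:
            rows[i] = big ^ second   # one two-operand xor: h_i ^= h_j
            xors += 1
        else:
            rows[i] = big ^ top      # bring in the input for the leading bit
            loads += 1
            if rows[i]:
                xors += 1            # load-xor; if the row emptied, just a load
    return xors, loads

A = [0b100000000, 0b010000000, 0b101100100, 0b010010010,
     0b001101001, 0b000010000, 0b000001000]
B = [0b000100000, 0b000010000, 0b100101100, 0b010010010,
     0b001001101, 0b000000010, 0b000000001]
assert xor_largest(A) == (8, 9)   # first trace: 8 xors, 9 input loads
assert xor_largest(B) == (7, 9)   # reordered inputs: the speedup is found
```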

  43. Memory friendliness: Algorithm writes only to the output registers. No temporary storage. n inputs, n outputs: 2n registers total with 0 loads, 0 stores. Or n + 1 registers with n loads, 0 stores: each input is read only once. Or n registers with n loads, 0 stores, if platform has a load-xor insn.

  44. Two-operand friendliness: Platform with a ← a ⊕ b but without a ← b ⊕ c uses only n extra copies. Naive column sweep also uses n + 1 registers, n loads, but usually many more xors. Input partitioning (e.g., 1956 Lupanov) uses somewhat more xors, copies; somewhat more registers. Greedy additive CSE uses somewhat fewer xors but many more copies, registers.

  45. For an n × m matrix: m inputs and n outputs. The xor-largest algorithm uses on average ≈ mn/lg n two-operand xors; n copies; m loads; n + 1 regs.

  46. For an n × m matrix: m inputs and n outputs. The xor-largest algorithm uses on average ≈ mn/lg n two-operand xors; n copies; m loads; n + 1 regs. Pippenger’s algorithm uses ≈ mn/lg mn three-operand xors but seems to need many regs. Pippenger proved that his algebraic complexity was near optimal for most matrices (at least without mod 2), but didn’t consider regs, two-operand complexity, etc.

  47. Case study of benefits produced by xor-largest: 131-bit conversion from poly basis to normal basis. “Random” 131 × 131 matrix. On Cell (128 registers, ≈ 1 xor per cycle), bitsliced code took ≈ 9600 cycles. Output of xor-largest: code with only 3380 xors, fitting into 132 registers. Schwabe tuned asm for Cell: ≈ 4000 cycles.

  48. Inspiration: 1989 Bos–Coster. 000100000 = 32 000010000 = 16 100101100 = 300 010010010 = 146 001001101 = 77 000000010 = 2 000000001 = 1 Goal: Compute 32x, 16x, 300x, 146x, 77x, 2x, 1x.

  49. Reduce largest row: 000100000 = 32 000010000 = 16 010011010 = 154 010010010 = 146 001001101 = 77 000000010 = 2 000000001 = 1 Integer subtraction of 146 from 300.
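The integer analogue can be sketched as follows (Python; a naive toy version of the 1989 Bos–Coster reduction with my own accounting: each largest-minus-second-largest step is one point addition read in reverse, an exact duplicate is a copy, and the last surviving multiplier is finished by double-and-add).

```python
import bisect

def bos_coster(targets):
    """Naive Bos-Coster reduction: repeatedly replace the largest
    multiplier by its difference with the second largest.  Read in
    reverse, each step is one point addition:
    big*P = (big - second)*P + second*P.
    Returns (additions, reduction_steps)."""
    vals = sorted(t for t in targets if t)
    adds, steps = 0, []
    while len(vals) > 1:
        big = vals.pop()
        second = vals[-1]
        d = big - second
        steps.append((big, second))
        if d:
            adds += 1
            bisect.insort(vals, d)
        # d == 0: big*P is a plain copy of second*P
    # finish the last multiplier by double-and-add (doublings not counted)
    adds += bin(vals[0]).count('1') - 1
    return adds, steps

adds, steps = bos_coster([32, 16, 300, 146, 77, 2, 1])
assert steps[0] == (300, 146)   # the subtraction shown on this slide
assert adds == 12
```

This naive version degrades when one multiplier dominates the rest (e.g. [2**40, 1]); practical Bos–Coster variants add halving/doubling rules for that case.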
