  1. Optimizing linear maps modulo 2 (i.e.: fast xor sequences for bitsliced software) D. J. Bernstein University of Illinois at Chicago NSF ITR–0716498

  2. Example: size-4 poly Karatsuba. Start with size 2: F = F0 + F1 x, G = G0 + G1 x, H0 = F0 G0, H2 = F1 G1, H1 = (F0 + F1)(G0 + G1) − H0 − H2, so FG = H0 + H1 x + H2 x^2. Substitute x = t^2 etc.: F = f0 + f1 t + f2 t^2 + f3 t^3, G = g0 + g1 t + g2 t^2 + g3 t^3, H0 = (f0 + f1 t)(g0 + g1 t), H2 = (f2 + f3 t)(g2 + g3 t), H1 = (f0 + f2 + (f1 + f3) t)(g0 + g2 + (g1 + g3) t) − H0 − H2, so FG = H0 + H1 t^2 + H2 t^4.
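The size-4 split can be checked with a short sketch (Python; the function names are mine): pack GF(2) polynomials into integer bits, do the three size-2 carry-less multiplications, and compare against schoolbook carry-less multiplication.

```python
def clmul(a, b):
    """Schoolbook carry-less (GF(2)[t]) product of bit-packed polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba4(f, g):
    """Size-4 Karatsuba over GF(2): f, g are 4-bit polynomials
    f0 + f1 t + f2 t^2 + f3 t^3, split at x = t^2."""
    f0, f1 = f & 0b11, f >> 2                # (f0 + f1 t) and (f2 + f3 t)
    g0, g1 = g & 0b11, g >> 2
    h0 = clmul(f0, g0)
    h2 = clmul(f1, g1)
    h1 = clmul(f0 ^ f1, g0 ^ g1) ^ h0 ^ h2   # middle term; mod 2, minus is xor
    return h0 ^ (h1 << 2) ^ (h2 << 4)        # H0 + H1 t^2 + H2 t^4

# exhaustive check over all pairs of 4-bit polynomials
assert all(karatsuba4(f, g) == clmul(f, g)
           for f in range(16) for g in range(16))
```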

  3. Initial linear computation: f0 + f2; f1 + f3; g0 + g2; g1 + g3; algebraic complexity 4. Three size-2 mults producing H0 = p0 + p1 t + p2 t^2; H2 = q0 + q1 t + q2 t^2; H0 + H1 + H2 = r0 + r1 t + r2 t^2. Final linear reconstruction: H1 = (r0 − p0 − q0) + (r1 − p1 − q1) t + (r2 − p2 − q2) t^2, algebraic complexity 6; FG = H0 + H1 t^2 + H2 t^4, algebraic complexity 2.

  4. Let’s look more closely at the reconstruction: h0 = p0; h1 = p1; h2 = p2 + (r0 − p0 − q0); h3 = (r1 − p1 − q1); h4 = (r2 − p2 − q2) + q0; h5 = q1; h6 = q2.

  5. Let’s look more closely at the reconstruction: h0 = p0; h1 = p1; h2 = p2 + (r0 − p0 − q0); h3 = (r1 − p1 − q1); h4 = (r2 − p2 − q2) + q0; h5 = q1; h6 = q2. Can observe manually that p2 ⊕ q0 is repeated. See, e.g., 2000 Bernstein.

  6. Some addition-chain algorithms will automatically find this speedup. Consider, e.g., the greedy additive CSE algorithm from 1997 Paar: find the most popular input pair i0, i1; compute i0 ⊕ i1; simplify using i0 ⊕ i1; repeat. This algorithm would have automatically found p2 ⊕ q0 inside the Karatsuba reconstruction.
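A minimal sketch of this greedy pair-sharing idea (Python; the function name and the cost accounting are mine, not Paar's exact formulation): each row is the set of input indices it sums, and the most frequently co-occurring pair is repeatedly replaced by a fresh variable. On the reconstruction above it first extracts p2 ⊕ q0 and ends at 7 xors.

```python
from collections import Counter
from itertools import combinations

def greedy_cse(rows):
    """Greedy additive CSE in the spirit of 1997 Paar: repeatedly
    introduce a new variable for the most popular co-occurring pair,
    then rewrite the rows.  Returns (xor_count, new_vars)."""
    rows = [set(r) for r in rows]
    next_var = max(max(r) for r in rows) + 1
    new_vars = []
    while True:
        counts = Counter()
        for r in rows:
            counts.update(combinations(sorted(r), 2))
        if not counts or counts.most_common(1)[0][1] < 2:
            break                              # no pair shared by two rows
        (a, b), _ = counts.most_common(1)[0]
        new_vars.append((a, b, next_var))      # next_var := a xor b
        for r in rows:
            if a in r and b in r:
                r -= {a, b}
                r.add(next_var)
        next_var += 1
    # one xor per new variable, plus k-1 xors to sum each k-term row
    xors = len(new_vars) + sum(max(len(r) - 1, 0) for r in rows)
    return xors, new_vars

# Karatsuba reconstruction; inputs p0,p1,p2,q0,q1,q2,r0,r1,r2 = 0..8
H = [{0}, {1}, {0, 2, 3, 6}, {1, 4, 7}, {2, 3, 5, 8}, {4}, {5}]
xors, new_vars = greedy_cse(H)
assert new_vars[0][:2] == (2, 3)   # the shared p2 xor q0
assert xors == 7
```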

  7. Today’s algorithm: “xor largest.” Start with the matrix mod 2 for the desired linear map. h0: 100000000 h1: 010000000 h2: 101100100 h3: 010010010 h4: 001101001 h5: 000010000 h6: 000001000 Each row has the coefficients of p0, p1, p2, q0, q1, q2, r0, r1, r2.

  8. Replace largest row by its xor with second-largest row. 100000000 010000000 001100100 010010010 001101001 000010000 000001000 Recursively compute this, and finish with one xor.

  9. If two largest rows don’t have same first bit, change largest row by clearing first bit. 000000000 010000000 001100100 010010010 001101001 000010000 000001000 Recursively compute this, and finish with one xor (often just a copy).

  10. Continue in the same way: 100000000 010000000 101100100 010010010 001101001 000010000 000001000 (starting matrix again)

  11. Continue in the same way: 100000000 010000000 001100100 010010010 001101001 000010000 000001000 plus 1 xor.

  12. Continue in the same way: 000000000 010000000 001100100 010010010 001101001 000010000 000001000 plus 1 xor, 1 input load.

  13. Continue in the same way: 000000000 010000000 001100100 000010010 001101001 000010000 000001000 plus 2 xors, 1 input load.

  14. Continue in the same way: 000000000 000000000 001100100 000010010 001101001 000010000 000001000 plus 2 xors, 2 input loads.

  15. Continue in the same way: 000000000 000000000 001100100 000010010 000001101 000010000 000001000 plus 3 xors, 2 input loads.

  16. Continue in the same way: 000000000 000000000 000100100 000010010 000001101 000010000 000001000 plus 4 xors, 3 input loads.

  17. Continue in the same way: 000000000 000000000 000000100 000010010 000001101 000010000 000001000 plus 5 xors, 4 input loads.

  18. Continue in the same way: 000000000 000000000 000000100 000000010 000001101 000010000 000001000 plus 6 xors, 4 input loads.

  19. Continue in the same way: 000000000 000000000 000000100 000000010 000001101 000000000 000001000 plus 6 xors, 5 input loads.

  20. Continue in the same way: 000000000 000000000 000000100 000000010 000000101 000000000 000001000 plus 7 xors, 5 input loads.

  21. Continue in the same way: 000000000 000000000 000000100 000000010 000000101 000000000 000000000 plus 7 xors, 6 input loads.

  22. Continue in the same way: 000000000 000000000 000000100 000000010 000000001 000000000 000000000 plus 8 xors, 6 input loads.

  23. Continue in the same way: 000000000 000000000 000000000 000000010 000000001 000000000 000000000 plus 8 xors, 7 input loads.

  24. Continue in the same way: 000000000 000000000 000000000 000000000 000000001 000000000 000000000 plus 8 xors, 8 input loads.

  25. Continue in the same way: 000000000 000000000 000000000 000000000 000000000 000000000 000000000 plus 8 xors, 9 input loads.

  26. Continue in the same way: 000000000 000000000 000000000 000000000 000000000 000000000 000000000 plus 8 xors, 9 input loads. “Is this supposed to be an interesting algorithm?”

  27. Another example: 000100000 000010000 100101100 010010010 001001101 000000010 000000001 Same matrix, but inputs in a different order: first the r’s (used once each), then the p’s (used twice each), then the q’s (used twice each).

  28. Another example: 000100000 000010000 000101100 010010010 001001101 000000010 000000001 plus 1 xor, 1 input load.

  29. Another example: 000100000 000010000 000101100 000010010 001001101 000000010 000000001 plus 2 xors, 2 input loads.

  30. Another example: 000100000 000010000 000101100 000010010 000001101 000000010 000000001 plus 3 xors, 3 input loads.

  31. Another example: 000100000 000010000 000001100 000010010 000001101 000000010 000000001 plus 4 xors, 3 input loads.

  32. Another example: 000000000 000010000 000001100 000010010 000001101 000000010 000000001 plus 4 xors, 4 input loads.

  33. Another example: 000000000 000010000 000001100 000000010 000001101 000000010 000000001 plus 5 xors, 4 input loads.

  34. Another example: 000000000 000000000 000001100 000000010 000001101 000000010 000000001 plus 5 xors, 5 input loads.

  35. Another example: 000000000 000000000 000001100 000000010 000000001 000000010 000000001 plus 6 xors, 5 input loads.

  36. Another example: 000000000 000000000 000000100 000000010 000000001 000000010 000000001 plus 7 xors, 6 input loads.

  37. Another example: 000000000 000000000 000000000 000000010 000000001 000000010 000000001 plus 7 xors, 7 input loads.

  38. Another example: 000000000 000000000 000000000 000000000 000000001 000000010 000000001 plus 7 xors, 7 input loads.

  39. Another example: 000000000 000000000 000000000 000000000 000000001 000000000 000000001 plus 7 xors, 8 input loads.

  40. Another example: 000000000 000000000 000000000 000000000 000000000 000000000 000000001 plus 7 xors, 8 input loads.

  41. Another example: 000000000 000000000 000000000 000000000 000000000 000000000 000000000 plus 7 xors, 9 input loads. Algorithm found the speedup.

  42. Another example: 000000000 000000000 000000000 000000000 000000000 000000000 000000000 plus 7 xors, 9 input loads. Algorithm found the speedup. Also has other useful features.
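The whole walk-through can be mechanized. Below is a small simulator of the “xor largest” rule (Python; a sketch with cost accounting inferred from the traces above: xoring with the second-largest row costs one xor, clearing a leading bit costs one input load plus one xor unless the row empties, and an exact duplicate row is a free copy).

```python
def xor_largest(rows):
    """Simulate the 'xor largest' reduction on a list of rows
    (ints, one bit per input column).  Returns (xors, input_loads)."""
    rows = list(rows)
    xors = loads = 0
    while any(rows):
        order = sorted(range(len(rows)), key=rows.__getitem__, reverse=True)
        i, j = order[0], order[1]
        big, second = rows[i], rows[j]
        top = 1 << (big.bit_length() - 1)
        if big == second:
            rows[i] = 0              # output i is a plain copy of output j
        elif second & top:
            rows[i] = big ^ second   # one two-operand xor: h_i ^= h_j
            xors += 1
        else:
            rows[i] = big ^ top      # bring in the input for the leading bit
            loads += 1
            if rows[i]:
                xors += 1            # load-xor; if the row emptied, just a load
    return xors, loads

A = [0b100000000, 0b010000000, 0b101100100, 0b010010010,
     0b001101001, 0b000010000, 0b000001000]
B = [0b000100000, 0b000010000, 0b100101100, 0b010010010,
     0b001001101, 0b000000010, 0b000000001]
assert xor_largest(A) == (8, 9)   # first trace: 8 xors, 9 input loads
assert xor_largest(B) == (7, 9)   # reordered inputs: the speedup is found
```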

  43. Memory friendliness: Algorithm writes only to the output registers. No temporary storage. n inputs, n outputs: 2n registers total with 0 loads, 0 stores. Or n + 1 registers with n loads, 0 stores: each input is read only once. Or n registers with n loads, 0 stores, if platform has a load-xor insn.

  44. Two-operand friendliness: Platform with a ← a ⊕ b but without a ← b ⊕ c uses only n extra copies. Naive column sweep also uses n + 1 registers, n loads, but usually many more xors. Input partitioning (e.g., 1956 Lupanov) uses somewhat more xors, copies; somewhat more registers. Greedy additive CSE uses somewhat fewer xors but many more copies, registers.

  45. For an n × m matrix: m inputs and n outputs. The xor-largest algorithm uses on average ≈ mn/lg n two-operand xors; n copies; m loads; n + 1 regs.

  46. For an n × m matrix: m inputs and n outputs. The xor-largest algorithm uses on average ≈ mn/lg n two-operand xors; n copies; m loads; n + 1 regs. Pippenger’s algorithm uses ≈ mn/lg mn three-operand xors but seems to need many regs. Pippenger proved that his algebraic complexity was near optimal for most matrices (at least without mod 2), but didn’t consider regs, two-operand complexity, etc.

  47. Case study of benefits produced by xor-largest: 131-bit conversion from poly basis to normal basis. “Random” 131 × 131 matrix. On Cell (128 registers, ≈ 1 xor per cycle), bitsliced code took ≈ 9600 cycles. Output of xor-largest: code with only 3380 xors, fitting into 132 registers. Schwabe tuned asm for Cell: ≈ 4000 cycles.

  48. Inspiration: 1989 Bos–Coster. 000100000 = 32 000010000 = 16 100101100 = 300 010010010 = 146 001001101 = 77 000000010 = 2 000000001 = 1 Goal: Compute 32x, 16x, 300x, 146x, 77x, 2x, 1x.

  49. Reduce largest row: 000100000 = 32 000010000 = 16 010011010 = 154 010010010 = 146 001001101 = 77 000000010 = 2 000000001 = 1 Integer subtraction of 146 from 300.
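The integer analogue can be sketched as follows (Python; a naive toy version of the 1989 Bos–Coster reduction with my own accounting: each largest-minus-second-largest step is one point addition read in reverse, an exact duplicate is a copy, and the last surviving multiplier is finished by double-and-add).

```python
import bisect

def bos_coster(targets):
    """Naive Bos-Coster reduction: repeatedly replace the largest
    multiplier by its difference with the second largest.  Read in
    reverse, each step is one point addition:
    big*P = (big - second)*P + second*P.
    Returns (additions, reduction_steps)."""
    vals = sorted(t for t in targets if t)
    adds, steps = 0, []
    while len(vals) > 1:
        big = vals.pop()
        second = vals[-1]
        d = big - second
        steps.append((big, second))
        if d:
            adds += 1
            bisect.insort(vals, d)
        # d == 0: big*P is a plain copy of second*P
    # finish the last multiplier by double-and-add (doublings not counted)
    adds += bin(vals[0]).count('1') - 1
    return adds, steps

adds, steps = bos_coster([32, 16, 300, 146, 77, 2, 1])
assert steps[0] == (300, 146)   # the subtraction shown on this slide
assert adds == 12
```

This naive version degrades when one multiplier dominates the rest (e.g. [2**40, 1]); practical Bos–Coster variants add halving/doubling rules for that case.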
