

SLIDE 1

McBits: fast constant-time code-based cryptography
(to appear at CHES 2013)

D. J. Bernstein
University of Illinois at Chicago & Technische Universiteit Eindhoven

Joint work with:
Tung Chou, Technische Universiteit Eindhoven
Peter Schwabe, Radboud University Nijmegen

Objectives
Set new speed records for public-key cryptography.

SLIDES 2-6

Objectives
Set new speed records for public-key cryptography.
... at a high security level.
... including protection against quantum computers.
... including full protection against cache-timing attacks, branch-prediction attacks, etc.
... using code-based crypto with a solid track record.
... all of the above at once.

SLIDES 7-10

Examples of the competition
Some cycle counts on h9ivy (Intel Core i5-3210M, Ivy Bridge) from bench.cr.yp.to:

mceliece encrypt       61440  (2008 Biswas–Sendrier; 2^80 security)
gls254 DH              77468  (binary elliptic curve; CHES 2013)
kumfp127g DH          116944  (hyperelliptic; Eurocrypt 2013)
curve25519 DH         182632  (conservative elliptic curve)
mceliece decrypt     1219344
ronald1024 decrypt   1340040

SLIDES 11-17

New decoding speeds
(n, t) = (4096, 41); 2^128 security: 60493 Ivy Bridge cycles.
Talk will focus on this case. (Decryption is slightly slower: includes hash, cipher, MAC.)
(n, t) = (2048, 32); 2^80 security: 26544 Ivy Bridge cycles.
All load/store addresses and all branch conditions are public. Eliminates cache-timing attacks etc.
Similar improvements for CFS.

SLIDES 18-23

Constant-time fanaticism
The extremist's approach to eliminate timing attacks:
handle all secret data using only bit operations: XOR (^), AND (&), etc.
We take this approach.
"How can this be competitive in speed? Are you really simulating field multiplication with hundreds of bit operations instead of simple log tables?"

SLIDES 24-28

Yes, we are. Not as slow as it sounds!
On a typical 32-bit CPU, the XOR instruction is actually a 32-bit XOR, operating in parallel on vectors of 32 bits.
Low-end smartphone CPU: 128-bit XOR every cycle.
Ivy Bridge: 256-bit XOR every cycle, or three 128-bit XORs.
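The bit-operations-only discipline can be illustrated with a minimal sketch (not code from the paper): a secret-dependent choice is computed with masks rather than branches, so no branch condition depends on the secret. Python's big integers stand in for fixed-width machine words here; real implementations do this in C or assembly.

```python
# Constant-time selection: no branch or memory address depends on `bit`.
# A sketch of the bit-operations-only style; Python ints stand in for
# fixed-width registers.

def ct_select(bit, a, b):
    """Return a if bit == 1, else b, using only bit operations."""
    mask = -bit                      # bit = 1 -> all-ones mask; bit = 0 -> 0
    return (a & mask) | (b & ~mask)  # picks a or b without branching
```

The same mask trick extends to constant-time conditional swaps and to scanning a whole table instead of indexing it with a secret.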
SLIDES 29-34

Not immediately obvious that this "bitslicing" saves time for, e.g., multiplication in F_{2^12}. But quite obvious that it saves time for addition in F_{2^12}.
Typical decoding algorithms have add, mult roughly balanced.
Coming next: how to save many adds and most mults. Nice synergy with bitslicing.
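The addition claim can be made concrete with a small illustrative sketch (not the McBits code): store 64 elements of F_{2^12} as 12 "bit-planes", one 64-bit word per bit position. Addition in F_{2^12} is coefficient-wise XOR, so adding 64 pairs of elements costs just 12 word XORs.

```python
# Bitsliced addition in F_{2^12} (illustrative sketch, not McBits code).
# 64 field elements are stored as 12 words; word j holds bit j of all
# 64 elements.

W = 64  # elements per slice (one 64-bit register's worth)

def bitslice(elems):
    """Transpose 64 12-bit elements into 12 bit-planes."""
    assert len(elems) == W
    return [sum(((e >> j) & 1) << i for i, e in enumerate(elems))
            for j in range(12)]

def unbitslice(planes):
    """Inverse transpose: 12 bit-planes back to 64 elements."""
    return [sum(((p >> i) & 1) << j for j, p in enumerate(planes))
            for i in range(W)]

def bitsliced_add(a, b):
    """64 additions in F_{2^12} = just 12 XORs on 64-bit words."""
    return [pa ^ pb for pa, pb in zip(a, b)]
```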

SLIDES 35-40

The additive FFT
Fix n = 4096 = 2^12, t = 41.
Big final decoding step is to find all roots in F_{2^12} of f = c_41 x^41 + ... + c_0 x^0.
For each α ∈ F_{2^12}, compute f(α) by Horner's rule: 41 adds, 41 mults.
Or use Chien search: compute c_i g^i, c_i g^{2i}, c_i g^{3i}, etc. Cost per point: again 41 adds, 41 mults.
Our cost: 6.01 adds, 2.09 mults.

SLIDES 41-45

Asymptotics: normally t ∈ Θ(n / lg n), so Horner's rule costs Θ(nt) = Θ(n^2 / lg n).
Wait a minute. Didn't we learn in school that the FFT evaluates an n-coeff polynomial at n points using n^{1+o(1)} operations? Isn't this better than n^2 / lg n?

SLIDES 46-49

Standard radix-2 FFT:
Want to evaluate f = c_0 + c_1 x + ... + c_{n-1} x^{n-1} at all the nth roots of 1.
Write f as f_0(x^2) + x f_1(x^2).
Observe big overlap between
  f(α) = f_0(α^2) + α f_1(α^2),
  f(-α) = f_0(α^2) - α f_1(α^2).
f_0 has n/2 coeffs; evaluate at (n/2)nd roots of 1 by same idea recursively. Similarly f_1.
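A toy version of this recursion over the complex numbers (a generic textbook sketch, not the paper's code) makes the shared f_0, f_1 evaluations explicit:

```python
# Radix-2 FFT sketch: evaluate c_0 + c_1 x + ... + c_{n-1} x^{n-1}
# at all nth roots of 1 (n a power of 2), reusing each f0/f1 value
# for the +/- pair of evaluation points.
import cmath

def fft(c):
    n = len(c)
    if n == 1:
        return c[:]
    f0 = fft(c[0::2])   # f0 (even-index coeffs) at the (n/2)nd roots of 1
    f1 = fft(c[1::2])   # f1 (odd-index coeffs) at the (n/2)nd roots of 1
    out = [0j] * n
    for k in range(n // 2):
        alpha = cmath.exp(-2j * cmath.pi * k / n)
        out[k] = f0[k] + alpha * f1[k]           # f(alpha)
        out[k + n // 2] = f0[k] - alpha * f1[k]  # f(-alpha): same f0[k], f1[k]
    return out
```

Each level does Θ(n) work and halves the problem, giving the Θ(n lg n) total the slide alludes to.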

SLIDES 50-53

Useless in char 2: -α = α. Standard workarounds are painful. FFT considered impractical.
1988 Wang–Zhu, independently 1989 Cantor: "additive FFT" in char 2. Still quite expensive.
1996 von zur Gathen–Gerhard: some improvements.
2010 Gao–Mateer: much better additive FFT.
We use Gao–Mateer, plus some new improvements.


slide-57
SLIDE 57

Gao and Mateer evaluate f = c_0 + c_1 x + ... + c_{n-1} x^{n-1} on a size-n F_2-linear space.

Main idea: write f as f_0(x^2 + x) + x f_1(x^2 + x). Big overlap between f(α) = f_0(α^2 + α) + α f_1(α^2 + α) and f(α + 1) = f_0(α^2 + α) + (α + 1) f_1(α^2 + α). "Twist" to ensure 1 ∈ space. Then {α^2 + α} is a size-(n/2) F_2-linear space. Apply the same idea recursively.
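The overlap is easy to check concretely: α and α + 1 share the value α^2 + α, since (α+1)^2 + (α+1) = α^2 + α in char 2. A toy verification in GF(2^4) — the modulus x^4 + x + 1 and the degree-3 example polynomial are illustrative choices, not parameters from the talk:

```python
# GF(2^4) with modulus x^4 + x + 1; elements are ints 0..15, addition is XOR.
POLY, M = 0b10011, 4

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def gf_eval(coeffs, x):
    r = 0
    for c in reversed(coeffs):  # Horner's rule
        r = gf_mul(r, x) ^ c
    return r

# For f = c0 + c1 x + c2 x^2 + c3 x^3, substituting y = x^2 + x gives
# f = f0(y) + x*f1(y) with f0 = [c0, c2^c3] and f1 = [c1^c2^c3, c3]
# (using x^2 = y + x and x^3 = x*y + y + x in char 2).
c = [3, 7, 9, 6]
f0 = [c[0], c[2] ^ c[3]]
f1 = [c[1] ^ c[2] ^ c[3], c[3]]

for a in range(16):
    y = gf_mul(a, a) ^ a                 # y = a^2 + a, shared by a and a+1
    s, t = gf_eval(f0, y), gf_eval(f1, y)
    assert gf_eval(c, a) == s ^ gf_mul(a, t)          # f(a)   = f0(y) + a*f1(y)
    assert gf_eval(c, a ^ 1) == s ^ gf_mul(a ^ 1, t)  # f(a+1) = f0(y) + (a+1)*f1(y)
```

Each pair {α, α+1} reuses the same f_0 and f_1 evaluations, halving the problem size per level — the additive analogue of the radix-2 split.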


slide-61
SLIDE 61

We generalize to f = c_0 + c_1 x + ... + c_t x^t for any t < n. ⇒ several optimizations, not all of which are automated by simply tracking zeros. For t = 0: copy c_0. For t ∈ {1, 2}: f_1 is a constant. Instead of multiplying this constant by each α, multiply only by generators and compute subset sums.
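For t = 1 the generator trick is easy to show concretely: α ↦ c_1·α is F_2-linear, so multiplying c_1 by the m generators and XOR-combining covers all 2^m points. A sketch in GF(2^4), with the modulus x^4 + x + 1 chosen for illustration:

```python
# GF(2^4) with modulus x^4 + x + 1; elements are ints, addition is XOR.
POLY, M = 0b10011, 4

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def eval_deg1_everywhere(c0, c1, gens):
    """Evaluate f = c0 + c1*x at every point of the F2-span of gens,
    using only len(gens) multiplications plus subset sums (XORs)."""
    points, prods = [0], [0]
    for g in gens:
        cg = gf_mul(c1, g)                 # one mult per generator
        points += [p ^ g for p in points]  # enumerate the span
        prods += [p ^ cg for p in prods]   # c1*alpha by F2-linearity
    return points, [c0 ^ p for p in prods]
```

With m = 4 generators this costs 4 multiplications instead of 16 — the savings the slide describes for small t.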


slide-65
SLIDE 65

Syndrome computation. Initial decoding step: compute
s_0 = r_1 + r_2 + ... + r_n,
s_1 = r_1 α_1 + r_2 α_2 + ... + r_n α_n,
s_2 = r_1 α_1^2 + r_2 α_2^2 + ... + r_n α_n^2,
...,
s_t = r_1 α_1^t + r_2 α_2^t + ... + r_n α_n^t.

Here r_1, r_2, ..., r_n are received bits scaled by Goppa constants. Typically one precomputes the matrix mapping bits to syndrome. Not as slow as Chien search, but still n^{2+o(1)} and a huge secret key.


slide-70
SLIDE 70

Compare to multipoint evaluation:
f(α_1) = c_0 + c_1 α_1 + ... + c_t α_1^t,
f(α_2) = c_0 + c_1 α_2 + ... + c_t α_2^t,
...,
f(α_n) = c_0 + c_1 α_n + ... + c_t α_n^t.

Matrix for syndrome computation is the transpose of the matrix for multipoint evaluation.
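The transpose relationship is a one-line check: both matrices have entries α_j^i, indexed (point, power) for evaluation and (power, point) for syndromes. A small demonstration in GF(2^4) (illustrative modulus and points):

```python
# GF(2^4) with modulus x^4 + x + 1.
POLY, M = 0b10011, 4

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def gf_pow(a, k):
    r = 1
    for _ in range(k):
        r = gf_mul(r, a)
    return r

alphas, t = [1, 2, 3, 4, 5], 2
# Row j of E maps coefficients (c_0..c_t) to the value f(alpha_j).
E = [[gf_pow(a, i) for i in range(t + 1)] for a in alphas]
# Row i of S maps (r_1..r_n) to the syndrome s_i.
S = [[gf_pow(a, i) for a in alphas] for i in range(t + 1)]
assert S == [list(col) for col in zip(*E)]   # S is exactly E transposed
```

So any fast linear algorithm for multiplying by E (multipoint evaluation) yields, via the transposition principle on the next slides, an equally fast algorithm for multiplying by S (syndrome computation).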

slide-71
SLIDE 71

Amazing consequence: syndrome computation uses as few operations as multipoint evaluation. Eliminate the precomputed matrix.


slide-75
SLIDE 75

Transposition principle: if a linear algorithm computes a matrix M, then reversing the edges and exchanging inputs with outputs computes the transpose of M. 1956 Bordewijk; independently 1957 Lupanov for Boolean matrices. 1973 Fiduccia analysis: transposition preserves the number of mults, and preserves the number of adds plus the number of nontrivial outputs.


slide-82
SLIDE 82

We built a transposing compiler producing C code. Too many variables for m = 13; gcc ran out of memory. Used the qhasm register allocator to optimize the variables. Worked, but not very quickly. Wrote a faster register allocator. Still excessive code size. Built a new interpreter, allowing some code compression. Still big; still some overhead.


slide-86
SLIDE 86

Better solution: stared at the additive FFT and wrote down the transposition directly, with the same loops etc. Small code, no overhead. Speedups of the additive FFT translate easily to the transposed algorithm. Further savings: merged the first stage with scaling by the Goppa constants.


slide-90
SLIDE 90

Secret permutation. The additive FFT produces the f values at field elements in a standard order. This is not the order needed in code-based crypto! Must apply a secret permutation, part of the secret key. Same issue for the syndrome. Solution: Batcher sorting. Almost done with a faster solution: a Beneš network.
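Applying a permutation via sorting works as follows: tag each value with its target position, then run a sorting network — Batcher's odd-even mergesort — whose compare-exchange schedule depends only on n, never on the data, which is what makes a constant-time implementation possible. A Python sketch for power-of-two sizes; in the real constant-time code each compare-exchange would be branch-free:

```python
def apply_permutation(perm, values):
    """Route values[j] to position perm[j] by sorting (perm[j], values[j])
    pairs with Batcher's odd-even mergesort.  The sequence of
    compare-exchange positions is fixed by n alone."""
    a = list(zip(perm, values))
    n = len(a)                   # assumed a power of 2
    p = 1
    while p < n:
        k = p
        while k >= 1:
            for j in range(k % p, n - k, 2 * k):
                for i in range(min(k, n - j - k)):
                    if (i + j) // (2 * p) == (i + j + k) // (2 * p):
                        # compare-exchange on positions i+j and i+j+k
                        if a[i + j][0] > a[i + j + k][0]:
                            a[i + j], a[i + j + k] = a[i + j + k], a[i + j]
            k //= 2
        p *= 2
    return [v for _, v in a]
```

The network uses Θ(n log^2 n) compare-exchanges; the Beneš network mentioned on the slide routes the same permutation with Θ(n log n) conditional swaps.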


slide-93
SLIDE 93

Results. 60493 Ivy Bridge cycles:
8622 for permutation,
20846 for syndrome,
7714 for BM (Berlekamp–Massey),
14794 for roots,
8520 for permutation.
Code will be public domain. We're still speeding it up. More information: cr.yp.to/papers.html#mcbits
