The tangent FFT - D. J. Bernstein, University of Illinois at Chicago



SLIDE 1

The tangent FFT

D. J. Bernstein
University of Illinois at Chicago

SLIDE 2

Advertisement: "SPEED: Software Performance Enhancement for Encryption and Decryption." A workshop on software speeds for secret-key cryptography and public-key cryptography. Amsterdam, June 11-12, 2007. http://www.hyperelliptic.org/SPEED

SLIDE 3

The convolution problem. How quickly can we multiply polynomials in the ring R[x]? Answer depends on degrees, representation of polynomials, number of polynomials, etc. Answer also depends on definition of "quickly": many models of computation; many interesting cost measures.

SLIDE 4

Assume two inputs f, g ∈ R[x] with deg f < m and deg g ≤ n - m, so deg fg < n. Assume f, g, fg are represented as coefficient sequences. How quickly can we compute the n coeffs of fg, given f's m coeffs and g's n - m + 1 coeffs?

Inputs: (f_0, f_1, ..., f_{m-1}) ∈ R^m and (g_0, g_1, ..., g_{n-m}) ∈ R^{n-m+1}.

Output: (h_0, h_1, ..., h_{n-1}) ∈ R^n with h_0 + h_1 x + ⋯ + h_{n-1} x^{n-1} = (f_0 + f_1 x + ⋯)(g_0 + g_1 x + ⋯).
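A direct schoolbook sketch of this problem (the Python rendering and function name are mine, not from the talk):

```python
def convolve(f, g, n):
    """Schoolbook product: f has m coeffs, g has n-m+1 coeffs, fg has n coeffs."""
    h = [0] * n
    for j, fj in enumerate(f):       # each pair (f_j, g_k) contributes to h_{j+k}
        for k, gk in enumerate(g):
            h[j + k] += fj * gk
    return h
```

This costs m(n-m+1) R-mults; the FFT-based methods in later slides do asymptotically better.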
SLIDE 5

Assume R-algebraic algorithms (without divisions or branches): chains of binary R-adds u, v ↦ u + v, binary R-subs u, v ↦ u - v, and binary R-mults u, v ↦ uv, starting from inputs and constants.

Example (1963 Karatsuba), as a dataflow chain multiplying f_0 + f_1 x by g_0 + g_1 x: h_0 = f_0 g_0; h_2 = f_1 g_1; h_1 = (f_0 + f_1)(g_0 + g_1) - h_0 - h_2.
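The Karatsuba chain above, sketched as code (naming mine): three mults instead of four.

```python
def karatsuba_deg1(f0, f1, g0, g1):
    """(f0 + f1 x)(g0 + g1 x) = h0 + h1 x + h2 x^2: 3 R-mults, 2 R-adds, 2 R-subs."""
    h0 = f0 * g0
    h2 = f1 * g1
    h1 = (f0 + f1) * (g0 + g1) - h0 - h2   # recover the cross term without a 4th mult
    return h0, h1, h2
```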
SLIDE 6

Cost measure for this talk: total R-algebraic complexity. Cost 1 for a binary R-add; cost 1 for a binary R-sub; cost 1 for a binary R-mult; cost 0 for a constant in R.

Many real-world computations use (e.g.) the Pentium M's floating-point operations to approximate operations in R. Properly scheduled operations achieve Pentium M cycles ≈ total R-algebraic complexity.
SLIDE 7

Fast Fourier transforms. Define ζ_n ∈ C as exp(2πi/n). Define T: C[x]/(x^n - 1) ↪↠ C^n as f ↦ (f(1), f(ζ_n), ..., f(ζ_n^{n-1})).

Can very quickly compute T. First publication of a fast algorithm: 1866 Gauss. Easy to see that Gauss's FFT uses O(n lg n) arithmetic operations if n ∈ {1, 2, 4, 8, ...}.

Several subsequent reinventions, ending with 1965 Cooley-Tukey.
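A sketch of T via the radix-2 recursion, using the slide's convention ζ_n = exp(2πi/n) (this rendering is mine):

```python
import cmath

def T(f):
    """(f(1), f(z), ..., f(z^(n-1))) for z = exp(2*pi*i/n); n a power of 2."""
    n = len(f)
    if n == 1:
        return f[:]
    even, odd = T(f[0::2]), T(f[1::2])        # two half-size subproblems
    out = [0] * n
    for k in range(n // 2):
        t = cmath.exp(2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t                   # f(z^k)
        out[k + n // 2] = even[k] - t          # f(z^(k+n/2)), since z^(n/2) = -1
    return out
```

Each level does n/2 C-mults and n C-add/subs, over lg n levels, which gives the O(n lg n) bound.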

SLIDE 8

The inverse map is also very fast. Multiplication in C^n is very fast.

1966 Sande, 1966 Stockham: can very quickly multiply in C[x]/(x^n - 1) or C[x] or R[x] by mapping C[x]/(x^n - 1) to C^n.

Given f, g ∈ C[x]/(x^n - 1): compute fg as T^{-1}(T(f) T(g)).

Given f, g ∈ C[x] with deg fg < n: compute fg from its image in C[x]/(x^n - 1).

R-algebraic complexity: O(n lg n).
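The Sande/Stockham convolution method, sketched (function names mine); T^{-1} is the same transform at the inverse roots, scaled by 1/n:

```python
import cmath

def T(f, sign=1):
    """sign=+1: coefficients -> values at powers of z = exp(2*pi*i/n);
    sign=-1: the same map with z replaced by 1/z."""
    n = len(f)
    if n == 1:
        return f[:]
    even, odd = T(f[0::2], sign), T(f[1::2], sign)
    out = [0] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def cyclic_mul(f, g):
    """fg in C[x]/(x^n - 1), computed as T^(-1)(T(f) T(g))."""
    n = len(f)
    H = [a * b for a, b in zip(T(f), T(g))]
    return [h / n for h in T(H, sign=-1)]
```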
SLIDE 9

A closer look at costs. More precise analysis of the Gauss FFT (and Cooley-Tukey FFT): C[x]/(x^n - 1) ↪↠ C^n using (1/2) n lg n binary C-adds, (1/2) n lg n binary C-subs, (1/2) n lg n binary C-mults, if n ∈ {1, 2, 4, 8, ...}.

(a, b) ∈ R^2 represents a + bi ∈ C. C-add, C-sub, C-mult cost 2, 2, 6:

(a, b) + (c, d) = (a + c, b + d),
(a, b) - (c, d) = (a - c, b - d),
(a, b)(c, d) = (ac - bd, ad + bc).
SLIDE 10

Total cost: 5n lg n.

Easily save some time: eliminate mults by 1; absorb mults by -1, i, -i into subsequent operations; simplify mults by √i using, e.g., (a, b)(1/√2, 1/√2) = ((a - b)/√2, (a + b)/√2).

Cost 5n lg n - 10n + 16 to map C[x]/(x^n - 1) ↪↠ C^n, if n ∈ {4, 8, 16, 32, ...}.
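The √i simplification, sketched (naming mine): multiplying by (1/√2)(1 + i) needs only 2 real mults after one add and one sub.

```python
import math

def mul_by_sqrt_i(a, b):
    """(a, b)(1/sqrt2, 1/sqrt2) = ((a - b)/sqrt2, (a + b)/sqrt2):
    2 real mults + 2 real add/subs, versus the generic 4 mults + 2 add/subs."""
    r = 1 / math.sqrt(2)
    return ((a - b) * r, (a + b) * r)
```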
SLIDE 11

What about the cost of convolution? 5n lg n + O(n) to compute T(f); 5n lg n + O(n) to compute T(g); O(n) to multiply in C^n; a similar 5n lg n + O(n) for T^{-1}.

Total cost 15n lg n + O(n) to compute fg ∈ C[x]/(x^n - 1) given f, g ∈ C[x]/(x^n - 1).

Total cost (15/2) n lg n + O(n) to compute fg ∈ R[x]/(x^n - 1) given f, g ∈ R[x]/(x^n - 1): map R[x]/(x^n - 1) ↪↠ R^2 × C^{n/2-1} (Gauss) to save half the time.
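The real-input halving comes from conjugate symmetry; a quick self-contained check (rendering mine):

```python
import cmath

def T(f):
    """(f(1), f(z), ..., f(z^(n-1))), z = exp(2*pi*i/n), radix-2 recursion."""
    n = len(f)
    if n == 1:
        return f[:]
    even, odd = T(f[0::2]), T(f[1::2])
    out = [0] * n
    for k in range(n // 2):
        t = cmath.exp(2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

# For real f: T(f)[n-k] = conj(T(f)[k]), and T(f)[0], T(f)[n/2] are real,
# so the image is determined by a point of R^2 x C^(n/2 - 1).
```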

SLIDE 12

1968 R. Yavne: Can do better! Cost 4n lg n - 6n + 8 to map C[x]/(x^n - 1) ↪↠ C^n, if n ∈ {2, 4, 8, 16, ...}.
SLIDE 13

1968 R. Yavne: Can do better! Cost 4n lg n - 6n + 8 to map C[x]/(x^n - 1) ↪↠ C^n, if n ∈ {2, 4, 8, 16, ...}.

2004 James Van Buskirk: Can do better! Cost (34/9) n lg n + O(n).

Expositions of the new algorithm: Frigo and Johnson, in IEEE Trans. Signal Processing; Lundy and Van Buskirk, in Computing; Bernstein, this talk, expanding an old idea of Fiduccia.

SLIDE 14

Van Buskirk, comp.arch, January 2005: “Have you ever considered changing djbfft to get better opcounts along the lines of home.comcast.net/~kmbtib?”

SLIDE 15

Van Buskirk, comp.arch, January 2005: "Have you ever considered changing djbfft to get better opcounts along the lines of home.comcast.net/~kmbtib?"

Bernstein, comp.arch: "What do you mean, 'better opcounts'? The algebraic complexity ... of a size-2^k complex DFT has stood at (3k - 3)2^k + 4 additions and (k - 3)2^k + 4 multiplications since 1968."

SLIDE 16

Van Buskirk, comp.arch, January 2005: "Have you ever considered changing djbfft to get better opcounts along the lines of home.comcast.net/~kmbtib?"

Bernstein, comp.arch: "What do you mean, 'better opcounts'? The algebraic complexity ... of a size-2^k complex DFT has stood at (3k - 3)2^k + 4 additions and (k - 3)2^k + 4 multiplications since 1968."

Van Buskirk, comp.arch: "Oh, you're so 20th century, Dan. Read the handwriting on the wall."
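The counts quoted here are consistent with slide 12's total; a quick check (script mine):

```python
def yavne_adds(k):
    return (3 * k - 3) * 2**k + 4     # real additions for a size-2^k complex DFT

def yavne_mults(k):
    return (k - 3) * 2**k + 4         # real multiplications

def yavne_total(n, lgn):
    return 4 * n * lgn - 6 * n + 8    # slide 12: 4 n lg n - 6n + 8
```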

SLIDE 17

Understanding the FFT. The FFT trick: C[x]/(x^n - 1) ↪↠ C[x]/(x^{n/2} - 1) × C[x]/(x^{n/2} + 1) by the unique C[x]-algebra morphism. Cost 2n: n/2 C-adds, n/2 C-subs.

e.g. n = 4: C[x]/(x^4 - 1) ↪↠ C[x]/(x^2 - 1) × C[x]/(x^2 + 1) by g_0 + g_1 x + g_2 x^2 + g_3 x^3 ↦ ((g_0 + g_2) + (g_1 + g_3) x, (g_0 - g_2) + (g_1 - g_3) x).

Representation: (g_0, g_1, g_2, g_3) ↦ ((g_0 + g_2), (g_1 + g_3)); ((g_0 - g_2), (g_1 - g_3)).
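The FFT trick on coefficient vectors, sketched (naming mine):

```python
def split(g):
    """C[x]/(x^n - 1) -> C[x]/(x^(n/2) - 1) x C[x]/(x^(n/2) + 1):
    reduce using x^(n/2) = 1 in the first factor, x^(n/2) = -1 in the second."""
    h = len(g) // 2
    return ([g[j] + g[j + h] for j in range(h)],   # n/2 C-adds
            [g[j] - g[j + h] for j in range(h)])   # n/2 C-subs
```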
SLIDE 18

Recurse: C[x]/(x^{n/2} - 1) ↪↠ C[x]/(x^{n/4} - 1) × C[x]/(x^{n/4} + 1); similarly C[x]/(x^{n/2} + 1) ↪↠ C[x]/(x^{n/4} - i) × C[x]/(x^{n/4} + i); continue down to C[x]/(x - ζ)'s.

General case: C[x]/(x^n - ζ^2) ↪↠ C[x]/(x^{n/2} - ζ) × C[x]/(x^{n/2} + ζ) by g_0 + ⋯ + g_{n/2} x^{n/2} + ⋯ ↦ ((g_0 + ζ g_{n/2}) + (g_1 + ζ g_{n/2+1}) x + ⋯, (g_0 - ζ g_{n/2}) + (g_1 - ζ g_{n/2+1}) x + ⋯).

Cost 5n: n/2 C-mults, n/2 C-adds, n/2 C-subs. Recurse, eliminate easy mults. Cost 5n lg n - 10n + 16.
SLIDE 19

Alternative: the twisted FFT. After the previous C[x]/(x^n - 1) ↪↠ C[x]/(x^{n/2} - 1) × C[x]/(x^{n/2} + 1), apply the unique C-algebra morphism C[x]/(x^{n/2} + 1) ↪↠ C[y]/(y^{n/2} - 1) that maps x to ζ_n y.

g_0 + g_1 x + ⋯ + g_{n/2} x^{n/2} + ⋯ ↦ ((g_0 + g_{n/2}) + (g_1 + g_{n/2+1}) x + ⋯, (g_0 - g_{n/2}) + ζ_n (g_1 - g_{n/2+1}) y + ⋯).

Again cost 5n: n/2 C-mults, n/2 C-adds, n/2 C-subs. Eliminate easy mults, recurse. Cost 5n lg n - 10n + 16.
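The twist step alone, sketched (naming mine): with m = n/2, coefficient j picks up a factor ζ_{2m}^j = ζ_n^j.

```python
import cmath

def twist(p):
    """The morphism C[x]/(x^m + 1) -> C[y]/(y^m - 1) taking x to zeta_{2m} y."""
    m = len(p)
    zeta = cmath.exp(2j * cmath.pi / (2 * m))
    return [zeta**j * c for j, c in enumerate(p)]
```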
SLIDE 20

The split-radix FFT. The FFT and the twisted FFT end up with the same number of mults by ζ_n, the same number of mults by ζ_{n/2}, the same number of mults by ζ_{n/4}, etc. Is this necessary? No! Split-radix FFT: more easy mults. "Don't twist until you see the whites of their i's."

(Can use the same idea to speed up the Schönhage-Strassen algorithm for integer multiplication.)
SLIDE 21

Cost 2n: C[x]/(x^n - 1) ↪↠ C[x]/(x^{n/2} - 1) × C[x]/(x^{n/2} + 1).

Cost n: C[x]/(x^{n/2} + 1) ↪↠ C[x]/(x^{n/4} - i) × C[x]/(x^{n/4} + i).

Cost 6(n/4): C[x]/(x^{n/4} - i) ↪↠ C[y]/(y^{n/4} - 1) by x ↦ ζ_n y.

Cost 6(n/4): C[x]/(x^{n/4} + i) ↪↠ C[y]/(y^{n/4} - 1) by x ↦ ζ_n^{-1} y.

Overall cost 6n to split into 1/2, 1/4, 1/4; entropy 1.5 bits. Eliminate easy mults, recurse. Cost 4n lg n - 6n + 8, exactly as in 1968 Yavne.
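A runnable split-radix sketch (decimation-in-time variant; the slides' polynomial view is the dual decimation-in-frequency factorization; rendering mine, with the slide's convention ζ_n = exp(2πi/n)):

```python
import cmath

def split_radix(f):
    """T(f) = (f(1), f(z), ..., f(z^(n-1))), z = exp(2*pi*i/n), n a power of 2:
    one half-size subproblem (even coefficients) plus two twisted
    quarter-size subproblems (coefficients 1 mod 4 and 3 mod 4)."""
    n = len(f)
    if n == 1:
        return f[:]
    if n == 2:
        return [f[0] + f[1], f[0] - f[1]]
    u = split_radix(f[0::2])
    z = split_radix(f[1::4])
    zp = split_radix(f[3::4])
    out = [0] * n
    for k in range(n // 4):
        a = cmath.exp(2j * cmath.pi * k / n) * z[k]        # twist by z^k
        b = cmath.exp(2j * cmath.pi * 3 * k / n) * zp[k]   # twist by z^(3k)
        out[k] = u[k] + (a + b)
        out[k + n // 2] = u[k] - (a + b)
        out[k + n // 4] = u[k + n // 4] + 1j * (a - b)     # mults by i are free
        out[k + 3 * n // 4] = u[k + n // 4] - 1j * (a - b)
    return out
```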

SLIDE 22

The tangent FFT. Several ways to achieve cost 6 for mult by e^{iθ}. One approach: factor e^{iθ} as (1 + i tan θ) cos θ. Cost 2 for mult by cos θ. Cost 4 for mult by 1 + i tan θ.

For stability and symmetry, use max{|cos θ|, |sin θ|} instead of cos θ.

Surprise (Van Buskirk): Can merge some cost-2 mults!
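The factorization, sketched (naming mine; assumes cos θ ≠ 0 and omits the stable max{|cos|, |sin|} rescaling):

```python
import math

def mul_by_phase(a, b, theta):
    """(a + b i) e^(i theta) via e^(i theta) = (1 + i tan theta) cos theta."""
    t = math.tan(theta)
    re, im = a - b * t, b + a * t      # mult by (1 + i t): 2 mults, 1 add, 1 sub
    c = math.cos(theta)
    return (re * c, im * c)            # mult by cos theta: 2 mults
```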

SLIDE 23

Rethink the basis of C[x]/(x^n - 1). Instead of 1, x, ..., x^{n-1} use 1/s_{n,0}, x/s_{n,1}, ..., x^{n-1}/s_{n,n-1}, where

s_{n,k} = max{|cos(2πk/n)|, |sin(2πk/n)|} · max{|cos(2πk/(n/4))|, |sin(2πk/(n/4))|} · max{|cos(2πk/(n/16))|, |sin(2πk/(n/16))|} ⋯.

Now (g_0, g_1, ..., g_{n-1}) represents g_0/s_{n,0} + ⋯ + g_{n-1} x^{n-1}/s_{n,n-1}.

Note that s_{n,k} = s_{n,k+n/4}. Note that ζ_n^k (s_{n/4,k}/s_{n,k}) has the form (1 + i tan θ) or (cot θ + i), up to sign, with θ = 2πk/n.
SLIDE 24

Cost 2n: C[x]/(x^n - 1), basis x^k/s_{n,k}, ↪↠ C[x]/(x^{n/2} - 1), basis x^k/s_{n,k}, × C[x]/(x^{n/2} + 1), basis x^k/s_{n,k}.

Cost n: C[x]/(x^{n/2} - 1), basis x^k/s_{n,k}, ↪↠ C[x]/(x^{n/4} - 1), basis x^k/s_{n,k}, × C[x]/(x^{n/4} + 1), basis x^k/s_{n,k}.

Cost n: C[x]/(x^{n/2} + 1), basis x^k/s_{n,k}, ↪↠ C[x]/(x^{n/4} - i), basis x^k/s_{n,k}, × C[x]/(x^{n/4} + i), basis x^k/s_{n,k}.
SLIDE 25

Cost n/2 - 2: C[x]/(x^{n/4} - 1), basis x^k/s_{n,k}, ↪↠ C[x]/(x^{n/4} - 1), basis x^k/s_{n/4,k}.

Cost n/2 - 2: C[x]/(x^{n/4} + 1), basis x^k/s_{n,k}, ↪↠ C[x]/(x^{n/4} + 1), basis x^k/s_{n/2,k}.

Cost n/2: C[x]/(x^{n/4} + 1), basis x^k/s_{n/2,k}, ↪↠ C[x]/(x^{n/8} - i), basis x^k/s_{n/2,k}, × C[x]/(x^{n/8} + i), basis x^k/s_{n/2,k}.
SLIDE 26

Cost 4(n/4) - 6: C[x]/(x^{n/4} - i), basis x^k/s_{n,k}, ↪↠ C[y]/(y^{n/4} - 1), basis y^k/s_{n/4,k}.

Cost 4(n/4) - 6: C[x]/(x^{n/4} + i), basis x^k/s_{n,k}, ↪↠ C[y]/(y^{n/4} - 1), basis y^k/s_{n/4,k}.

Cost 4(n/8) - 6: C[x]/(x^{n/8} - i), basis x^k/s_{n/2,k}, ↪↠ C[y]/(y^{n/8} - 1), basis y^k/s_{n/8,k}.

Cost 4(n/8) - 6: C[x]/(x^{n/8} + i), basis x^k/s_{n/2,k}, ↪↠ C[y]/(y^{n/8} - 1), basis y^k/s_{n/8,k}.
SLIDE 27

Overall cost 8.5n - 28 to split into 1/4, 1/4, 1/4, 1/8, 1/8; entropy 9/4 bits.

Recurse: (34/9) n lg n + O(n).

What if the input is in C[x]/(x^n - 1) with the usual basis 1, x, ..., x^{n-1}? Could scale immediately, but faster to scale upon twist. Cost

(34/9) n lg n - (124/27) n - 2 lg n - (2/9)(-1)^{lg n} lg n + (16/27)(-1)^{lg n} + 8,

exactly as in 2004 Van Buskirk.
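The entropy bookkeeping behind both leading constants, sketched (helper name mine): if one pass costs c·n + O(1) and leaves subproblems of sizes p_i·n, the recursion gives ~ (c/H) n lg n with H = -Σ p_i lg p_i.

```python
import math

def leading_coeff(cost_per_n, parts):
    """Leading n lg n coefficient of C(n) = cost_per_n * n + sum C(parts[i] * n)."""
    H = -sum(p * math.log2(p) for p in parts)   # entropy of the split, in bits
    return cost_per_n / H
```

Split-radix spends 6n to split into 1/2, 1/4, 1/4 (entropy 1.5), giving 4 n lg n; the tangent FFT spends 8.5n for entropy 9/4, giving (34/9) n lg n.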

SLIDE 28

Easily handle R[x]/(x^n + 1) by mapping to C[x]/(x^{n/2} - i). Easily handle R[x]/(x^n - 1) by mapping to R[x]/(x^{n/2} - 1) × R[x]/(x^{n/2} + 1).

Cost (17/9) n lg n + O(n) for R[x]/(x^n - 1) ↪↠ R^2 × C^{n/2-1}, so cost (17/3) n lg n + O(n) to compute fg ∈ R[x]/(x^n - 1) given f, g ∈ R[x]/(x^n - 1).

Cost (17/3) n lg n + O(n) for a size-n convolution. Open: Can 17/3 be improved?