SLIDE 1 The tangent FFT
University of Illinois at Chicago
SLIDE 2
Advertisement: “SPEED: Software Performance Enhancement for Encryption and Decryption.” A workshop on software speeds for secret-key cryptography and public-key cryptography. Amsterdam, June 11–12, 2007. http://www.hyperelliptic.org/SPEED
SLIDE 3
The convolution problem How quickly can we multiply polynomials in the ring R[x]? Answer depends on degrees, representation of polynomials, number of polynomials, etc. Answer also depends on definition of “quickly.” Many models of computation; many interesting cost measures.
SLIDE 4 Assume two inputs
f, g ∈ R[x], deg f < m, deg g ≤ n − m, so deg fg < n,
with f, g, fg represented as coefficient sequences.
How quickly can we compute the n coeffs of fg, given f's m coeffs and g's n − m + 1 coeffs?
Inputs (f_0, f_1, …, f_{m−1}) ∈ R^m, (g_0, g_1, …, g_{n−m}) ∈ R^{n−m+1}.
Output (h_0, h_1, …, h_{n−1}) ∈ R^n with
h_0 + h_1 x + ⋯ + h_{n−1} x^{n−1} =
(f_0 + f_1 x + ⋯ + f_{m−1} x^{m−1})(g_0 + g_1 x + ⋯ + g_{n−m} x^{n−m}).
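The problem statement can be made concrete with a schoolbook convolution, the Θ(m(n−m+1))-mult baseline that the fast algorithms below improve on. A minimal sketch (my illustration, not from the slides; the helper name `convolve` is mine):

```python
def convolve(f, g, n):
    """Return the n coeffs of f*g, given f's coeffs and g's coeffs.

    Schoolbook method: one R-mult and one R-add per pair (f_j, g_k)."""
    h = [0] * n
    for j, fj in enumerate(f):
        for k, gk in enumerate(g):
            h[j + k] += fj * gk
    return h

# m = 3, n = 4: (1 + 2x + 3x^2)(4 + 5x) = 4 + 13x + 22x^2 + 15x^3
print(convolve([1, 2, 3], [4, 5], 4))  # -> [4, 13, 22, 15]
```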
SLIDE 5 Assume R-algebraic algorithms (without divisions, branches): chains of binary R-adds u, v ↦ u + v, binary R-subs u, v ↦ u − v, binary R-mults u, v ↦ uv, starting from inputs and constants.
Example (1963 Karatsuba): compute (f_0 + f_1 x)(g_0 + g_1 x) with 3 mults: f_0 g_0, f_1 g_1, and (f_0 + f_1)(g_0 + g_1), whose difference from f_0 g_0 + f_1 g_1 gives the middle coefficient.
[Diagram: arithmetic circuit on inputs f_0, f_1, g_0, g_1.]
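Karatsuba's trick from the slide above, as a short sketch (my code; the function name is mine):

```python
def karatsuba_deg1(f0, f1, g0, g1):
    """(f0 + f1 x)(g0 + g1 x) using 3 R-mults instead of 4."""
    lo = f0 * g0                               # mult 1
    hi = f1 * g1                               # mult 2
    mid = (f0 + f1) * (g0 + g1) - lo - hi      # mult 3, plus 2 adds, 2 subs
    return [lo, mid, hi]                       # coeffs of h0 + h1 x + h2 x^2

print(karatsuba_deg1(1, 2, 3, 4))  # (1 + 2x)(3 + 4x) -> [3, 10, 8]
```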
SLIDE 6
Cost measure for this talk: total R-algebraic complexity. Cost 1 for binary R-add; cost 1 for binary R-sub; cost 1 for binary R-mult; cost 0 for constant in R.
Many real-world computations use (e.g.) the Pentium M's floating-point operations to approximate operations in R. Properly scheduled operations achieve Pentium M cycles ≈ total R-algebraic complexity.
SLIDE 7 Fast Fourier transforms
Define ζ_n ∈ C as exp(2πi/n).
Define T : C[x]/(x^n − 1) → C^n as f ↦ (f(1), f(ζ_n), …, f(ζ_n^{n−1})).
Can very quickly compute T.
First publication of fast algorithm: 1866 Gauss. Easy to see that Gauss's FFT uses O(n lg n) arithmetic operations if n ∈ {1, 2, 4, 8, …}.
Several subsequent reinventions, ending with 1965 Cooley–Tukey.
SLIDE 8
Inverse map T^{−1} is also very fast. Multiplication in C^n is very fast.
1966 Sande, 1966 Stockham: Can very quickly multiply in C[x]/(x^n − 1) or C[x] or R[x] by mapping C[x]/(x^n − 1) to C^n.
Given f, g ∈ C[x]/(x^n − 1): compute fg as T^{−1}(T(f) T(g)).
Given f, g ∈ C[x] with deg fg < n: compute fg from its image in C[x]/(x^n − 1).
R-algebraic complexity O(n lg n).
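The Sande/Stockham recipe T^{−1}(T(f) T(g)) can be sketched in a few lines with a textbook radix-2 FFT (my code, not djbfft; real implementations schedule operations far more carefully):

```python
import cmath

def fft(f, sign=1):
    """Radix-2 FFT. With sign=+1 this is T: f in C[x]/(x^n - 1) maps to
    (f(1), f(z), ..., f(z^(n-1))), z = exp(2*pi*i/n).
    With sign=-1 it is the conjugate transform, i.e. n * T^(-1)."""
    n = len(f)
    if n == 1:
        return list(f)
    even = fft(f[0::2], sign)   # values of the even-index half
    odd = fft(f[1::2], sign)    # values of the odd-index half
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def cyclic_mul(f, g):
    """fg in C[x]/(x^n - 1), computed as T^(-1)(T(f) T(g))."""
    n = len(f)
    prod = [a * b for a, b in zip(fft(f), fft(g))]
    return [h / n for h in fft(prod, sign=-1)]
```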
SLIDE 9 A closer look at costs
More precise analysis of Gauss FFT (and Cooley–Tukey FFT): C[x]/(x^n − 1) → C^n using
(1/2) n lg n binary C-adds,
(1/2) n lg n binary C-subs,
(1/2) n lg n binary C-mults,
if n ∈ {1, 2, 4, 8, …}.
(a, b) ∈ R^2 represents a + bi ∈ C. C-add, C-sub, C-mult cost 2, 2, 6:
(a, b) + (c, d) = (a + c, b + d),
(a, b) − (c, d) = (a − c, b − d),
(a, b)(c, d) = (ac − bd, ad + bc).
SLIDE 10 Total cost 5n lg n.
Easily save some time: eliminate mults by 1; absorb mults by −1, i, −i into subsequent operations; simplify mults by √i using, e.g.,
(a, b)(1/√2, 1/√2) = ((a − b)/√2, (a + b)/√2).
Cost 5n lg n − 10n + 16 to map C[x]/(x^n − 1) → C^n, if n ∈ {4, 8, 16, 32, …}.
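The √2 identity above is easy to check directly: a mult by √i = (1 + i)/√2 costs only 2 real mults and 2 real adds instead of the generic 6 operations. A sketch (my illustration):

```python
import cmath
import math

def mul_sqrt_i(a, b):
    """(a + bi) * (1 + i)/sqrt(2): 2 real mults (by the constant 1/sqrt(2))
    and 2 real adds, versus cost 6 for a generic C-mult."""
    return ((a - b) / math.sqrt(2), (a + b) / math.sqrt(2))
```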
SLIDE 11
What about cost of convolution?
5n lg n + O(n) to compute T(f),
5n lg n + O(n) to compute T(g),
O(n) to multiply in C^n,
similar 5n lg n + O(n) for T^{−1}.
Total cost 15n lg n + O(n) to compute fg ∈ C[x]/(x^n − 1) given f, g ∈ C[x]/(x^n − 1).
Total cost (15/2) n lg n + O(n) to compute fg ∈ R[x]/(x^n − 1) given f, g ∈ R[x]/(x^n − 1): map R[x]/(x^n − 1) → R^2 × C^{n/2−1} (Gauss) to save half the time.
SLIDE 12
1968 R. Yavne: Can do better!
Cost 4n lg n − 6n + 8 to map C[x]/(x^n − 1) → C^n, if n ∈ {2, 4, 8, 16, …}.
SLIDE 13
1968 R. Yavne: Can do better!
Cost 4n lg n − 6n + 8 to map C[x]/(x^n − 1) → C^n, if n ∈ {2, 4, 8, 16, …}.
2004 James Van Buskirk: Can do better! Cost (34/9) n lg n + O(n).
Expositions of the new algorithm: Frigo, Johnson, in IEEE Trans. Signal Processing; Lundy, Van Buskirk, in Computing; Bernstein, this talk, expanding an old idea of Fiduccia.
SLIDE 14
Van Buskirk, comp.arch, January 2005: “Have you ever considered changing djbfft to get better opcounts along the lines of home.comcast.net/~kmbtib?”
SLIDE 15 Van Buskirk, comp.arch, January 2005: “Have you ever considered changing djbfft to get better opcounts along the lines of home.comcast.net/~kmbtib?”
Bernstein, comp.arch: “What do you mean, ‘better opcounts’? The algebraic complexity … of a size-2^k complex DFT has stood at (3k − 3)2^k + 4 additions and (k − 3)2^k + 4 multiplications since 1968.”
SLIDE 16 Van Buskirk, comp.arch, January 2005: “Have you ever considered changing djbfft to get better opcounts along the lines of home.comcast.net/~kmbtib?”
Bernstein, comp.arch: “What do you mean, ‘better opcounts’? The algebraic complexity … of a size-2^k complex DFT has stood at (3k − 3)2^k + 4 additions and (k − 3)2^k + 4 multiplications since 1968.”
Van Buskirk, comp.arch: “Oh, you're so 20th century, Dan. Read the handwriting on the wall.”
SLIDE 17 Understanding the FFT
The FFT trick: C[x]/(x^n − 1) → C[x]/(x^{n/2} − 1) × C[x]/(x^{n/2} + 1) by the unique C[x]-algebra morphism. Cost 2n: n/2 C-adds, n/2 C-subs.
e.g. n = 4: C[x]/(x^4 − 1) → C[x]/(x^2 − 1) × C[x]/(x^2 + 1) by
g_0 + g_1 x + g_2 x^2 + g_3 x^3 ↦
(g_0 + g_2) + (g_1 + g_3) x,
(g_0 − g_2) + (g_1 − g_3) x.
Representation: (g_0, g_1, g_2, g_3) ↦ ((g_0 + g_2), (g_1 + g_3)), ((g_0 − g_2), (g_1 − g_3)).
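In coefficient terms the FFT trick is a fold of the top half of the vector onto the bottom half. A sketch of the cost-2n step (my code), matching the n = 4 example above:

```python
def fft_trick(g):
    """C[x]/(x^n - 1) -> C[x]/(x^(n/2) - 1) x C[x]/(x^(n/2) + 1):
    reduce g mod x^(n/2) -+ 1, using x^(n/2) = +1 resp. -1."""
    n = len(g)
    lo, hi = g[: n // 2], g[n // 2 :]
    return ([a + b for a, b in zip(lo, hi)],   # image mod x^(n/2) - 1
            [a - b for a, b in zip(lo, hi)])   # image mod x^(n/2) + 1
```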
SLIDE 18 Recurse: C[x]/(x^{n/2} − 1) → C[x]/(x^{n/4} − 1) × C[x]/(x^{n/4} + 1);
similarly C[x]/(x^{n/2} + 1) → C[x]/(x^{n/4} − i) × C[x]/(x^{n/4} + i);
continue down to linear factors C[x]/(x − ζ) for nth roots ζ of 1.
General case: C[x]/(x^n − α^2) → C[x]/(x^{n/2} − α) × C[x]/(x^{n/2} + α) by
g_0 + ⋯ + g_{n/2} x^{n/2} + ⋯ ↦
(g_0 + α g_{n/2}) + (g_1 + α g_{n/2+1}) x + ⋯,
(g_0 − α g_{n/2}) + (g_1 − α g_{n/2+1}) x + ⋯.
Cost 5n: n/2 C-mults, n/2 C-adds, n/2 C-subs.
Recurse, eliminate easy mults. Cost 5n lg n − 10n + 16.
SLIDE 19 Alternative: the twisted FFT
After previous C[x]/(x^n − 1) → C[x]/(x^{n/2} − 1) × C[x]/(x^{n/2} + 1), apply the unique C-algebra morphism C[x]/(x^{n/2} + 1) → C[y]/(y^{n/2} − 1) that maps x to ζ_n y.
g_0 + g_1 x + ⋯ + g_{n/2} x^{n/2} + ⋯ ↦
(g_0 + g_{n/2}) + (g_1 + g_{n/2+1}) x + ⋯,
(g_0 − g_{n/2}) + ζ_n (g_1 − g_{n/2+1}) y + ⋯.
Again cost 5n: n/2 C-mults, n/2 C-adds, n/2 C-subs.
Eliminate easy mults, recurse. Cost 5n lg n − 10n + 16.
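On coefficient vectors, the twist x ↦ ζ_n y just scales coefficient k by ζ_n^k. A sketch (my code), checked by comparing evaluations on both sides of the morphism:

```python
import cmath

def twist(c, n):
    """Twisted-FFT step: image of c under the C-algebra morphism
    C[x]/(x^(n/2) + 1) -> C[y]/(y^(n/2) - 1), x -> zeta_n * y.
    c holds the n/2 coefficients of a residue mod x^(n/2) + 1."""
    zeta = cmath.exp(2j * cmath.pi / n)
    return [ck * zeta**k for k, ck in enumerate(c)]
```

Evaluating the twisted polynomial at a square root y of 1 must agree with evaluating the original at x = ζ_n y, which is a square root of −1.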
SLIDE 20 The split-radix FFT
FFT and twisted FFT end up with same number of mults by ζ_n, same number of mults by ζ_n^2, same number of mults by ζ_n^3, etc. Is this necessary? No!
Split-radix FFT: more easy mults. “Don't twist until you see the whites of their i's.”
(Can use same idea to speed up Schönhage–Strassen for integer multiplication.)
SLIDE 21 Cost 2n: C[x]/(x^n − 1) → C[x]/(x^{n/2} − 1) × C[x]/(x^{n/2} + 1).
Cost n: C[x]/(x^{n/2} + 1) → C[x]/(x^{n/4} − i) × C[x]/(x^{n/4} + i).
Cost 6(n/4): C[x]/(x^{n/4} − i) → C[y]/(y^{n/4} − 1) by x ↦ ζ_n y.
Cost 6(n/4): C[x]/(x^{n/4} + i) → C[y]/(y^{n/4} − 1) by x ↦ ζ_n^{−1} y.
Overall cost 6n to split into 1/2, 1/4, 1/4; entropy 1.5 bits.
Eliminate easy mults, recurse. Cost 4n lg n − 6n + 8, exactly as in 1968 Yavne.
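A quick consistency check on Yavne's count (my derivation, not from the slides): subtracting Y(n/2) + 2Y(n/4) from the closed form shows that after eliminating easy mults, the split above must cost 6n − 16 per level, and the closed form satisfies that recurrence exactly:

```python
def Y(n):
    """Yavne's split-radix operation count 4 n lg n - 6 n + 8, n a power of 2."""
    lg = n.bit_length() - 1
    return 4 * n * lg - 6 * n + 8

# One split-radix level: one size-n/2 subproblem, two size-n/4 subproblems,
# plus 6n operations minus a constant 16 saved on easy mults.
for k in range(2, 20):
    n = 2 ** k
    assert Y(n) == Y(n // 2) + 2 * Y(n // 4) + 6 * n - 16
```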
SLIDE 22 The tangent FFT
Several ways to achieve cost 6 for mult by e^{iθ}.
One approach: Factor e^{iθ} as (1 + i tan θ) cos θ.
Cost 2 for mult by cos θ.
Cost 4 for mult by 1 + i tan θ.
For stability and symmetry, use max{|cos θ|, |sin θ|} instead of cos θ.
Surprise (Van Buskirk): Can merge some cost-2 mults!
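The factorization e^{iθ} = (1 + i tan θ) cos θ splits a cost-6 C-mult into a cost-4 part and a cost-2 part; the point of the tangent FFT is that the cost-2 scalings can sometimes be merged. A sketch of the split (my code; tan θ and cos θ are precomputed constants, cost 0):

```python
import cmath
import math

def mul_phase(a, b, theta):
    """(a + bi) * e^(i*theta) via (1 + i tan theta) * cos theta:
    cost 4 for the (1 + i tan theta) part, cost 2 for the cos theta scaling."""
    t = math.tan(theta)              # precomputed constant
    c = math.cos(theta)              # precomputed constant
    re, im = a - b * t, b + a * t    # mult by 1 + i tan theta: 2 mults, 2 adds
    return (re * c, im * c)          # mult by cos theta: 2 mults
```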
SLIDE 23 Rethink basis of C[x]/(x^n − 1).
Instead of 1, x, …, x^{n−1} use 1/s_{n,0}, x/s_{n,1}, …, x^{n−1}/s_{n,n−1} where
s_{n,k} = max{|cos(2πk/n)|, |sin(2πk/n)|}
· max{|cos(2πk/(n/4))|, |sin(2πk/(n/4))|}
· max{|cos(2πk/(n/16))|, |sin(2πk/(n/16))|} ⋯.
Now (g_0, g_1, …, g_{n−1}) represents g_0/s_{n,0} + ⋯ + g_{n−1} x^{n−1}/s_{n,n−1}.
Note that s_{n,k} = s_{n,k+n/4}.
Note that ζ_n^k (s_{n/4,k}/s_{n,k}) is ±(1 + i tan(2πk/n)) or ±(cot(2πk/n) + i).
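The two "Note that" facts above can be checked numerically. A sketch (my code) of s_{n,k}; the product is effectively finite, since the factors with denominator ≤ 4 equal 1 for integer k:

```python
import cmath
import math

def s(n, k):
    """s_{n,k}: product of max(|cos(2 pi k/m)|, |sin(2 pi k/m)|)
    over m = n, n/4, n/16, ... (trailing factors are 1)."""
    prod, m = 1.0, n
    while m >= 2:
        angle = 2 * math.pi * k / m
        prod *= max(abs(math.cos(angle)), abs(math.sin(angle)))
        m //= 4
    return prod
```

The ratio ζ_n^k s_{n/4,k}/s_{n,k} always has one component of absolute value exactly 1 (the real part for ±(1 + i tan θ), the imaginary part for ±(cot θ + i)), which is what makes it a cost-4 mult.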
SLIDE 24 Cost 2n:
C[x]/(x^n − 1), basis x^k/s_{n,k} →
C[x]/(x^{n/2} − 1), basis x^k/s_{n,k} × C[x]/(x^{n/2} + 1), basis x^k/s_{n,k}.
Cost n:
C[x]/(x^{n/2} − 1), basis x^k/s_{n,k} →
C[x]/(x^{n/4} − 1), basis x^k/s_{n,k} × C[x]/(x^{n/4} + 1), basis x^k/s_{n,k}.
Cost n:
C[x]/(x^{n/2} + 1), basis x^k/s_{n,k} →
C[x]/(x^{n/4} − i), basis x^k/s_{n,k} × C[x]/(x^{n/4} + i), basis x^k/s_{n,k}.
SLIDE 25 Cost n/2 − 2:
C[x]/(x^{n/4} − 1), basis x^k/s_{n,k} →
C[x]/(x^{n/4} − 1), basis x^k/s_{n/4,k}.
Cost n/2 − 2:
C[x]/(x^{n/4} + 1), basis x^k/s_{n,k} →
C[x]/(x^{n/4} + 1), basis x^k/s_{n/2,k}.
Cost n/2:
C[x]/(x^{n/4} + 1), basis x^k/s_{n/2,k} →
C[x]/(x^{n/8} − i), basis x^k/s_{n/2,k} × C[x]/(x^{n/8} + i), basis x^k/s_{n/2,k}.
SLIDE 26 Cost 4(n/4) − 6:
C[x]/(x^{n/4} − i), basis x^k/s_{n,k} →
C[y]/(y^{n/4} − 1), basis y^k/s_{n/4,k}.
Cost 4(n/4) − 6:
C[x]/(x^{n/4} + i), basis x^k/s_{n,k} →
C[y]/(y^{n/4} − 1), basis y^k/s_{n/4,k}.
Cost 4(n/8) − 6:
C[x]/(x^{n/8} − i), basis x^k/s_{n/2,k} →
C[y]/(y^{n/8} − 1), basis y^k/s_{n/8,k}.
Cost 4(n/8) − 6:
C[x]/(x^{n/8} + i), basis x^k/s_{n/2,k} →
C[y]/(y^{n/8} − 1), basis y^k/s_{n/8,k}.
SLIDE 27 Overall cost 8.5n − 28 to split into 1/4, 1/4, 1/4, 1/8, 1/8; entropy 9/4 bits.
Recurse: (34/9) n lg n + O(n).
What if input is in C[x]/(x^n − 1) with usual basis 1, x, …, x^{n−1}? Could scale immediately, but faster to scale upon twist. Cost
(34/9) n lg n − (124/27) n − 2 lg n − (2/9)(−1)^{lg n} lg n + (16/27)(−1)^{lg n} + 8,
exactly as in 2004 Van Buskirk.
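The leading constant 34/9 = 8.5/(9/4) can be sanity-checked numerically from the five subproblems above. A sketch (my code; the small base-case costs are placeholders of my choosing and only shift the O(n) term, not the leading constant):

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def C(k):
    """Cost on size n = 2^k from the tangent FFT split:
    C(n) = 3 C(n/4) + 2 C(n/8) + 8.5 n - 28."""
    if k <= 2:
        return Fraction([0, 4, 16][k])   # placeholder base cases
    return 3 * C(k - 2) + 2 * C(k - 3) + Fraction(17, 2) * 2**k - 28

# For C(n) = a n lg n + O(n), the combination C(4n) - 4 C(n) = 8 a n + o(n)
# cancels the linear term, exposing the leading constant a.
n = 2 ** 20
a = (C(22) - 4 * C(20)) / (8 * n)   # should be very close to 34/9
```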
SLIDE 28 Easily handle R[x]/(x^n + 1) by mapping to C[x]/(x^{n/2} − i).
Easily handle R[x]/(x^n − 1) by mapping to R[x]/(x^{n/2} − 1) × R[x]/(x^{n/2} + 1).
Cost (17/9) n lg n + O(n) for R[x]/(x^n − 1) → R^2 × C^{n/2−1},
so cost (17/3) n lg n + O(n) to compute fg ∈ R[x]/(x^n − 1) given f, g ∈ R[x]/(x^n − 1).
Cost (17/3) n lg n + O(n) for size-n convolution.
Open: Can 17/3 be improved?