Frobenius Additive Fast Fourier Transform Wen-Ding Li Research - - PowerPoint PPT Presentation

frobenius additive fast fourier transform
SMART_READER_LITE
LIVE PREVIEW

Frobenius Additive Fast Fourier Transform Wen-Ding Li Research - - PowerPoint PPT Presentation

Frobenius Additive Fast Fourier Transform Wen-Ding Li Research Center for Information Technology Innovation, Academia Sinica, Taiwan July 19, 2018 ISSAC 2018, New York, USA Joint work with Ming-Shing Chen, Po-Chun Kuo, Chen-Mou Cheng, Bo-Yin


slide-1
SLIDE 1

Frobenius Additive Fast Fourier Transform

Wen-Ding Li

Research Center for Information Technology Innovation, Academia Sinica, Taiwan

July 19, 2018 ISSAC 2018, New York, USA Joint work with Ming-Shing Chen, Po-Chun Kuo, Chen-Mou Cheng, Bo-Yin Yang

slide-2
SLIDE 2

Polynomial Multiplication over F2

  • Schoolbook : O(n2)
  • Karatsuba or Toom-Cook : O(nω) , 1 < ω < 2
  • Fast Fourier Transform (FFT) :

O(n)

1

slide-3
SLIDE 3

Multiplication with FFT

Fourier transform of f ∈ F[x] : Evaluation of f in some zero set Z ⊂ F.

2

slide-4
SLIDE 4

Multiplication with FFT

Fourier transform of f ∈ F[x] : Evaluation of f in some zero set Z ⊂ F. How do we multiply h = f · g in F[x] ?

2

slide-5
SLIDE 5

Multiplication with FFT

Fourier transform of f ∈ F[x] : Evaluation of f in some zero set Z ⊂ F. How do we multiply h = f · g in F[x] ?

  • Evaluate f and g at points of some zero set Z ⊂ F
  • Multiply pointwise to obtain {f(α) · g(α), α ∈ Z}
  • Interpolate: recover h from {f(α) · g(α), α ∈ Z}

2

slide-6
SLIDE 6

Multiplication with FFT

Fourier transform of f ∈ F[x] : Evaluation of f in some zero set Z ⊂ F. How do we multiply h = f · g in F[x] ?

  • Evaluate f and g at points of some zero set Z ⊂ F
  • Multiply pointwise to obtain {f(α) · g(α), α ∈ Z}
  • Interpolate: recover h from {f(α) · g(α), α ∈ Z}

Multiplication in F2[x]

  • Not many evaluation points in F2 ⇒ work in an extension field
  • Naive method: F2[x] F2d[x]

2

slide-7
SLIDE 7

Multiplication with FFT

Fourier transform of f ∈ F[x] : Evaluation of f in some zero set Z ⊂ F. How do we multiply h = f · g in F[x] ?

  • Evaluate f and g at points of some zero set Z ⊂ F
  • Multiply pointwise to obtain {f(α) · g(α), α ∈ Z}
  • Interpolate: recover h from {f(α) · g(α), α ∈ Z}

Multiplication in F2[x]

  • Not many evaluation points in F2 ⇒ work in an extension field
  • Naive method: F2[x] F2d[x] ⇒ incurs d-times penalty.

2

slide-8
SLIDE 8

The Kronecker segmentation

  • Sch¨
  • nhage’s ternary FFT (GF2x: Brent, Gaudry, Thome,

Zimmermann) F2[x] F2[x]<M[y] F2[x]/(x2L + xL + 1)[y], y = xM, L >= M

  • Mixed Radix FFT over F260 (ISSAC 2016: Harvey, van der Hoeven,

Lecerf) F2[x] F2[x]<30[y] F260[y], y = x30

  • Additive FFT over F2256 (Chen, Cheng, Kuo, Li, Yang - 2017)

F2[x] F2[x]<128[y] F2256[y], y = x128 Pack half as many bits in each coefficients as the extension field

6

slide-9
SLIDE 9

The Kronecker segmentation

  • Sch¨
  • nhage’s ternary FFT (GF2x: Brent, Gaudry, Thome,

Zimmermann) F2[x] F2[x]<M[y] F2[x]/(x2L + xL + 1)[y], y = xM, L >= M

  • Mixed Radix FFT over F260 (ISSAC 2016: Harvey, van der Hoeven,

Lecerf) F2[x] F2[x]<30[y] F260[y], y = x30

  • Additive FFT over F2256 (Chen, Cheng, Kuo, Li, Yang - 2017)

F2[x] F2[x]<128[y] F2256[y], y = x128 Pack half as many bits in each coefficients as the extension field Factor-of-two loss!

6

slide-10
SLIDE 10

The Frobenius Fourier transform - ISSAC 2017

Directly compute Fourier transform of a polynomial f in F2[x]<n : {f(1), f(ω), f(ω2), . . . , f(ωn−1)} where ω ∈ F2d primitive root of unity.

7

slide-11
SLIDE 11

The Frobenius Fourier transform - ISSAC 2017

Directly compute Fourier transform of a polynomial f in F2[x]<n : {f(1), f(ω), f(ω2), . . . , f(ωn−1)} where ω ∈ F2d primitive root of unity. Save some computation by using the Frobenius automorphism: f(w2) = f(φ(w)) = φ(f(w)) = (f(w))2 ⇒ For each orbit w, φ(w), φ◦2(w), φ◦3(w), . . ., we only need to compute at one point: f(w) and all other values φ◦2(f(w)), φ◦3(f(w)), . . . are determined.

7

slide-12
SLIDE 12

The Frobenius Fourier transform - ISSAC 2017

Directly compute Fourier transform of a polynomial f in F2[x]<n : {f(1), f(ω), f(ω2), . . . , f(ωn−1)} where ω ∈ F2d primitive root of unity. Save some computation by using the Frobenius automorphism: f(w2) = f(φ(w)) = φ(f(w)) = (f(w))2 ⇒ For each orbit w, φ(w), φ◦2(w), φ◦3(w), . . ., we only need to compute at one point: f(w) and all other values φ◦2(f(w)), φ◦3(f(w)), . . . are determined. Result: d-times faster than naive method.

7

slide-13
SLIDE 13

Cantor’s FFT and its derivatives

  • Cantor give an “analogue of the fast Fourier transform” which

efficiently evaluates a polynomial on some additive subgroup Z of Fppk in O(n(log n)2) time for n = |Z|.

  • Based on a tower Fp, Fpp, Fpp2 , . . . of Artin-Schreier extensions of Fp
  • Gao and Mateer improved it to O(n log n log(log n)) when p = 2

and f ∈ F22k [x]

  • We showed that van der Hoeven and Larrieu’s idea of using

Frobenius automorphism to accelerate polynomial multiplication beautifully generalizes to Cantor-Gao-Mateer-FFT

9

slide-14
SLIDE 14

Additive FFT

Let s(x) = x2 + x, s0(x) = x and si(x) := s(s(· · · s

  • i times

(x) · · · )) = s◦i(x)

  • Let Wi be the zero set of si(x) =

ω∈Wi(x − ω), then

F2 = W1 ⊂ W2 ⊂ · · · ⊂ F2

  • Since si’s are linear, Wi’s are vector spaces over F2
  • Since s2k = x22k

+ x, W2k is a field F22k . e.g. W1 = F2, W2 = F22, W4 = F24, W8 = F28,...

  • Cantor showed that there is a basis (v0, v1, v2, . . . , ) such that

Wi = span{v0, v1, . . . , vi−1} and s(vi) = v2

i + vi = vi−1

  • We’ll denote a0v0 + a1v1 + . . . + ad−1vd as ad−1ad−1 . . . a0.

e.g. 1101 is v3 + v2 + v0.

11

slide-15
SLIDE 15

Additive FFT - Subproduct Tree

sk(x) + α can be written as the product of sk−1(x) + β and sk−1(x) + β + 1, where β2 + β = α. sk(x) + α sk−1(x) + β sk−1(x) + β + 1 right child = left child +1

12

slide-16
SLIDE 16

Additive FFT

s3(x) s2(x) s1(x) x x + 1 s1(x) + 1 x + v1 x + v1 + 1 s2(x) + 1 s1(x) + v1 x + v2 x + v2 + 1 s1(x) + v1 + 1 x + v2 + v1 x + v2 + v1 + 1

13

slide-17
SLIDE 17

Additive FFT

The roots of polynomial in subproduct tree. The “X” means it could take 0 or 1. XXX 0XX 00X 000 001 01X 010 011 1XX 10X 100 101 11X 110 111

14

slide-18
SLIDE 18

Additive FFT

f(x) mod s3(x) f(x) mod s2(x) f(x) mod s1(x) f(0) f(1) f(x) mod s1(x) + 1 f(v1) f(v1 + 1) f(x) mod s2(x) + 1 f(x) mod . . . f(v2) f(v2 + 1) f(x) mod . . . f(v2 + v1)f(v2 + v1 + 1)

15

slide-19
SLIDE 19

Additive FFT

Let

  • f(x) mod sn(x) + α
  • = P(x)sn−1(x) + Q(x) [Gao-Mateer], then

f(x) mod sn(x) + α f(x) mod sn−1(x) + β = Q(x) + βP(x) f(x) mod sn−1(x) + β + 1 = Q(x) + βP(x) + P(x) Let the left child be f0(x) and the right child be f1(x), then f0(x) = Q(x) + βP(x) f1(x) = P(x) + f0(x) By applying this recursively, we get {f(x) mod x + ω|sn(ω) = α} = {f(ω)|ω ∈ Wi + γ} where sn(γ) = α

16

slide-20
SLIDE 20

Frobenius Additive FFT

Question: Given d a power of two, when computing additive FFT of f in F2d[x], can we achieve d-times speedup if f actually admits only coefficients in F2?

17

slide-21
SLIDE 21

Frobenius Additive FFT

Question: Given d a power of two, when computing additive FFT of f in F2d[x], can we achieve d-times speedup if f actually admits only coefficients in F2? Save some computation by using the Frobenius automorphism: f(w2) = (f(w))2

17

slide-22
SLIDE 22

Frobenius Additive FFT

Question: Given d a power of two, when computing additive FFT of f in F2d[x], can we achieve d-times speedup if f actually admits only coefficients in F2? Save some computation by using the Frobenius automorphism: f(w2) = (f(w))2 ⇒ If we have f(w), f(w2) can be obtained efficiently. Only need to evaluate a subset of the original points

17

slide-23
SLIDE 23

Orbits under the action of φ : x → x2

Denote the Orbit of w under the action φ be Orbw = {w, φ(w), φ◦2(w), φ◦3(w), φ◦4(w), . . .} = {w, w2, w4, w8, w16, . . .}

  • For w ∈ Wi+1 \ Wi, |Orbw| = 2⌊lg i⌋+1
  • How the action affect the points:

φ◦2k(x) = s2k(x) + x Change the position whose distance is 2k from most significant bits

18

slide-24
SLIDE 24

Main Result: the Cross section of the orbit

Let Σ0 = {0}, and ∀k > 0, let Σk =

  • vk−1 + j1vk−2 + · · · + jk−1v0 : ji = 0 if i is a power of 2,

ji ∈ {0, 1} otherwise.

  • = 100X0XXX0XXXXXXX0XX . . .

Theorem

Σk is a cross section of Wk \ Wk−1. That is, ∀k > 0, ∀w ∈ Wk \ Wk−1, there exists exactly one σ ∈ Σk such that φ◦j(σ) = w for some j.

19

slide-25
SLIDE 25

Main Result: the Cross section of the orbit

Let Σ0 = {0}, and ∀k > 0, let Σk =

  • vk−1 + j1vk−2 + · · · + jk−1v0 : ji = 0 if i is a power of 2,

ji ∈ {0, 1} otherwise.

  • = 100X0XXX0XXXXXXX0XX . . .

Theorem

Σk is a cross section of Wk \ Wk−1. That is, ∀k > 0, ∀w ∈ Wk \ Wk−1, there exists exactly one σ ∈ Σk such that φ◦j(σ) = w for some j. A cross section of Wm is Σ0 ∪ Σ1 ∪ Σ2 ∪ . . . ∪ Σm . .

19

slide-26
SLIDE 26

Truncated additive FFT

XXXX 0XXX 00XX 000X 0000 0001 001X 0010 0011 01XX 010X 0100 0101 011X 0110 0111 1XXX 10XX 100X 1000 1001 101X 1010 1011 11XX 110X 1100 1101 111X 1110 1111 20

slide-27
SLIDE 27

Truncated additive FFT

XXXX 0XXX 00XX 000X 0000 0001 001X 0010 01XX 010X 0100 1XXX 10XX 100X 1000 1001 21

slide-28
SLIDE 28

Truncated additive FFT

F2[x] F2[x] F2[x] F2[x] F2 F2 F2[x] F22 F2[x] F22[x] F24 F2[x] F22[x] F24[x] F24 F24 22

slide-29
SLIDE 29

New speed records in terms of bit-operation count

  • Consider subfield properties to accelerate the constant

multiplications in FFT

  • Use tower field representations
  • Use common subexpression elimination techniques to further reduce

the number of bit operations

23

slide-30
SLIDE 30

New speed records in terms of bit-operation count

200 400 600 800 1,000 0.5 1 1.5 2 2.5 3 ·105 (709,159267) (414,68484) (231,29124) (1024,158226) (512,68446) (512,98018) (256,29005) Polynomial size Bit-operation count Bernstein 2009 Cenk-Hasan 2017 This work

24

slide-31
SLIDE 31

New speed records on modern CPUs

We need to use the PCLMULQDQ instruction to multiply in F2128[x]: Cross-section of size 2m−7 we use to enable truncated additive FFT: Σ = {vk + j64vk−64 + j65vk−65 + · · · + j2m−7−1vk−63−2m−7 : ji} = {1

64

00 · · · 0

m−7

  • XX · · · X}

Table: Timing of multiplications in F2[x]<n on Intel Skylake Xeon E3-1275 v5 @ 3.60GHz (10−3 sec.)

log2(n/64) 16 17 18 19 20 21 22 23 This work, F2128 9 20 41 88 192 418 889 1865 FDFT c 11 24 56 127 239 574 958 2465 ADFT 16 34 74 175 382 817 1734 3666 FFT over F260b 22 51 116 217 533 885 2286 5301 gf2x a 23 51 111 250 507 1182 2614 6195

a Version 1.2. Available from http://gf2x.gforge.inria.fr/ b SVN r10663. Available from svn://scm.gforge.inria.fr/svn/mmx c SVN r10681. Available from svn://scm.gforge.inria.fr/svn/mmx

25