Faster arithmetic for number-theoretic transforms David Harvey - PowerPoint PPT Presentation

Faster arithmetic for number-theoretic transforms David Harvey University of New South Wales 7th October 2011, Macquarie University David Harvey Faster arithmetic for number-theoretic transforms

Plan for talk 1. Review number-theoretic transform (NTT) 2. Discuss typical butterfly algorithm 3. Improvements to butterfly algorithm 4. Performance data David Harvey Faster arithmetic for number-theoretic transforms

The number-theoretic transform (NTT) NTT = discrete Fourier transform (DFT) over a finite field. We will assume: ◮ transform length is N = 2 n . ◮ base field is F p where p is prime and p = 1 (mod N ), so that F p contains N -th roots of unity. ◮ p fits into a single machine word, i.e. p < β , where β describes the word size, for example β = 2 64 . David Harvey Faster arithmetic for number-theoretic transforms

The number-theoretic transform (NTT) Definition of NTT: Input vector ( a 0 , . . . , a N − 1 ) ∈ F N p . Let ω be an N -th root of unity in F p . Output is the vector ( b 0 , . . . , b N − 1 ) where � ω ij a i . b j = 0 ≤ i < N Computing NTT is equivalent to evaluating the polynomial f ( x ) = a 0 + a 1 x + · · · + a N − 1 x N − 1 simultaneously at the points 1 , ω, ω 2 , . . . , ω N − 1 . David Harvey Faster arithmetic for number-theoretic transforms

The number-theoretic transform (NTT) Naive algorithm for NTT has complexity O ( N 2 ). (Complexity = number of ring operations in F p .) Fast Fourier transform (FFT) has complexity O ( N log N ). Applications: ◮ Fast polynomial multiplication in F p [ X ]. ◮ Fast polynomial multiplication in Z [ X ] (via Chinese remainder theorem), ( Z / n Z )[ X ], F q [ X ], etc. ◮ Other polynomial operations: reciprocal, division, GCD, square root, composition, factoring, etc. Real-life example: Victor Shoup’s NTL library, very popular in computational number theory and cryptography, uses the fast NTT as the building block for all of these operations. David Harvey Faster arithmetic for number-theoretic transforms

The number-theoretic transform (NTT) Algorithm 1: Simple FFT pseudocode Input : N = 2 n , ( a 0 , . . . , a N − 1 ) ∈ F N p 1 for i ← 0 , 1 , . . . , n − 1 do for 0 ≤ j < 2 i do 2 for 0 ≤ k < 2 n − i − 1 do 3 s ← j 2 n − i + k 4 t ← j 2 n − i + k + 2 n − i − 1 5 w ← ( ω 2 i ) k 6 � a s + a t � a s � � ← (butterfly) 7 a t w ( a s − a t ) Output is in-place, in bit-reversed order. David Harvey Faster arithmetic for number-theoretic transforms

Butterflies Consider the butterfly operation � X � � X + Y � �→ . Y W ( X − Y ) Algorithm performs O ( N log N ) of these. O ( N ) of them have W = 1; we will concentrate on W � = 1 case. We will assume indexing and locality is taken care of, and assume W comes from table lookup. Our focus is the following problem: given X , Y and W , how to most efficiently compute X + Y and W ( X − Y )? David Harvey Faster arithmetic for number-theoretic transforms

Butterflies X Y ∼ = X ′ Y ′ X ′ = X + Y Y ′ = W ( X − Y ) David Harvey Faster arithmetic for number-theoretic transforms

Butterflies Primitive operations allowed (modelled on typical modern instruction sets): ◮ addition/subtraction of single words (modulo β ) ◮ comparison of single words ◮ multiplication of single words (modulo β ) ◮ wide multiplication, i.e. given U , V ∈ [0 , β ), compute UV in the form UV = P 1 β + P 0 where P 0 , P 1 ∈ [0 , β ). For expository purposes I’ll also temporarily assume we have: ◮ double-word division, i.e. given U ∈ [0 , β 2 ) and M ∈ [0 , β ), compute Q = ⌊ U / M ⌋ and R = U − QM David Harvey Faster arithmetic for number-theoretic transforms

Butterflies Algorithm 2: Simple butterfly routine Input : X , Y , W ∈ [0 , p ), assume p < β/ 2 Output : X ′ = X + Y mod p Y ′ = W ( X − Y ) mod p 1 X ′ ← X + Y 2 if X ′ ≥ p then X ′ ← X ′ − p (now X ′ = X + Y mod p ) 3 T ← X − Y 4 if T < 0 then T ← T + p (now T = X − Y mod p ) 5 U = U 1 β + U 0 ← TW (wide multiplication) 6 Y ′ ← U mod p (double-word division) This is inefficient because hardware division is slow . David Harvey Faster arithmetic for number-theoretic transforms

Modular multiplication How can we compute TW mod p more efficiently? There are several well-known methods that replace the division by multiplication(s). Basic strategy: ◮ Estimate a ‘quotient’ Q . ◮ Multiply Q by p . ◮ Subtract Qp from TW to obtain candidate remainder R . ◮ Add/subtract small multiple of p to adjust remainder into standard interval [0 , p ). David Harvey Faster arithmetic for number-theoretic transforms

Modular multiplication Algorithm 3: Shoup’s modular multiplication algorithm Input : T , W ∈ [0 , p ), assume p < β/ 2 precomputed W ′ = ⌊ W β/ p ⌋ Output : R = TW mod p 1 Q ← ⌊ W ′ T /β ⌋ (high part of product W ′ T ) 2 R ← ( WT − Qp ) mod β (two low products) 3 if R ≥ p then R ← R − p Note: W is invariant. This is reasonable for the NTT, since each transform uses the same roots of unity. W ′ is a scaled approximation to W / p . Q is an approximation to WT / p . Claim: WT − Qp ∈ [0 , 2 p ). Thus the R computed in line 2 is exactly WT − Qp . The last line adjusts the remainder into [0 , p ). David Harvey Faster arithmetic for number-theoretic transforms

Modular multiplication Proof of claim: we have 0 ≤ W ′ T 0 ≤ W β − W ′ < 1 and − Q < 1 . p β Multiply by Tp /β and p respectively, and add: 0 ≤ WT − Qp < 2 p . In other words, Q is either the correct quotient or too large by 1. David Harvey Faster arithmetic for number-theoretic transforms

Modular multiplication Algorithm 4: Shoup butterfly Input : X , Y , W ∈ [0 , p ), assume p < β/ 2 precomputed W ′ = ⌊ W β/ p ⌋ Output : X ′ = X + Y mod p , Y ′ = W ( X − Y ) mod p 1 X ′ ← X + Y 2 if X ′ ≥ p then X ′ ← X ′ − p 3 T ← X − Y 4 if T < 0 then T ← T + p 5 Q ← ⌊ W ′ T /β ⌋ 6 Y ′ ← ( WT − Qp ) mod β 7 if Y ′ ≥ p then Y ′ ← Y ′ − p This is essentially the algorithm used in NTL. David Harvey Faster arithmetic for number-theoretic transforms

Removing adjustment steps Our goal: to remove as many adjustment steps “ if (some condition) then Z ← Z ± p ” as possible. Each adjustment requires several machine instructions: a conditional move, and several other instructions to set it up. These adjustments can account for a significant proportion of the total execution time. Especially on modern processors with very fast multipliers! David Harvey Faster arithmetic for number-theoretic transforms

Removing adjustment steps One adjustment is easy to remove. In Shoup’s algorithm for computing TW mod p , we assumed that T ∈ [0 , p ). But in fact the algorithm works perfectly well for any T ∈ [0 , β ). So we can simply skip the adjustment for T . (I don’t know where this was first noticed, but certainly Fabrice Bellard knew this in 2009 when he computed π to 2.7 trillion decimal places using a souped-up desktop machine. NTL does not use this trick.) David Harvey Faster arithmetic for number-theoretic transforms

Removing adjustment steps Algorithm 5: Shoup butterfly, one adjustment removed Input : X , Y , W ∈ [0 , p ), assume p < β/ 2 precomputed W ′ = ⌊ W β/ p ⌋ Output : X ′ = X + Y mod p , Y ′ = W ( X − Y ) mod p 1 X ′ ← X + Y 2 if X ′ ≥ p then X ′ ← X ′ − p 3 T ← X − Y + p (now T ≡ X − Y mod p , and T ∈ [0 , 2 p )) 4 Q ← ⌊ W ′ T /β ⌋ 5 Y ′ ← ( WT − Qp ) mod β 6 if Y ′ ≥ p then Y ′ ← Y ′ − p David Harvey Faster arithmetic for number-theoretic transforms

Removing adjustment steps What about the last adjustment for Y ′ ? Apparently the only way to avoid it is to somehow get the quotient ⌊ TW / p ⌋ correct the first time. But I don’t know of any way to get the correct quotient efficiently. (This is part of the reason that hardware division is so slow!) David Harvey Faster arithmetic for number-theoretic transforms

Removing adjustment steps But there is another way... don’t perform the adjustment ! Then the butterfly outputs lie in [0 , 2 p ). Relax the algorithm to allow the inputs to lie [0 , 2 p ). In other words, the entire FFT algorithm operates on ‘non-canonical residues’. Each element of F p has two possible representatives, one in [0 , p ) and one in [ p , 2 p ). If desired, a final pass at the end reduces the output into [0 , p ). We need p < β/ 4 for this scheme to work. David Harvey Faster arithmetic for number-theoretic transforms

Removing adjustment steps Algorithm 6: Shoup butterfly, two adjustments removed Input : X , Y ∈ [0 , 2 p ), W ∈ [0 , p ), assume p < β/ 4 precomputed W ′ = ⌊ W β/ p ⌋ Output : X ′ ≡ X + Y mod p , Y ′ ≡ W ( X − Y ) mod p X ′ , Y ′ ∈ [0 , 2 p ) 1 X ′ ← X + Y 2 if X ′ ≥ 2 p then X ′ ← X ′ − 2 p 3 T ← X − Y + 2 p 4 Q ← ⌊ W ′ T /β ⌋ 5 Y ′ ← ( WT − Qp ) mod β I don’t know how to remove the adjustment in line 2. David Harvey Faster arithmetic for number-theoretic transforms

Faster arithmetic for number-theoretic transforms David Harvey - PowerPoint PPT Presentation

Faster arithmetic for number-theoretic transforms David Harvey University of New South Wales 7th October 2011, Macquarie University David Harvey Faster arithmetic for number-theoretic transforms Plan for talk 1. Review number-theoretic

By Shervin Daneshpajouh Computer Arithmetic Computer Arithmetic p Computer Computer Arithmetic

Motion Estimation by Affine Transforms Motion Estimation by Affine Transforms Motion Estimation

Digital Design Discussion: Arithmetic Binary Arithmetic Floating-Point Arithmetic Binary

JUST THE MATHS SLIDES NUMBER 16.7 LAPLACE TRANSFORMS 7 (An appendix) by A.J.Hobson One

FASTER TRANSFORMER Bo Yang Hsueh, 2019/12/18 AGENDA What is Faster Transformer Introduce the

Drawing on the Web CSS CSCI-UA 380 Transforms, Transitions, and Animation Drawing on the Web

JUST THE MATHS SLIDES NUMBER 16.8 Z-TRANSFORMS 1 (Definition and rules) by A.J.Hobson

JUST THE MATHS SLIDES NUMBER 16.2 LAPLACE TRANSFORMS 2 (Inverse Laplace Transforms) by

PRODUCT NUMBER: RT01 REAL TENDER Product Number: RT01 FASTER Crank handle for faster

Lecture 4 Arithmetic-Logic Unit 1 Arithmetic - Logic Unit ALU Handles integers Does the

Arithmetic for Computers October 31, 2008 Arithmetic for Computers ALU Arithmetic Logic Unit

Section 4 Section 4 Arithmetic Units a 4-1 1 ALU ALU a 4-2 2 Arithmetic Logic Unit (ALU)

Research that Transforms Healthcare and Transforms Lives Dianne Morrison-Beedy, PhD, RN, WHNP-BC,

Week 5 -Wednesday What did we talk about last time? Transforms Translation Rotation

F F Fast Transforms using the Cell/B.E. Processor Fast Transforms using the Cell/B.E. Processor

M- -Channel Filter Banks: Channel Filter Banks: M Block and Lapped Transforms Block and Lapped

Productathon: Final Presentation MENTOR: APARNA MOHAN BY: ANIRUDHA BHARADWAJ Table of Contents

NOTHING ABOUT US WITHOUT US Boston Users Union Mass Ave IS a Consumption Space, but its far

Innovation, Leadership & Learning Nottingham University Business School An Introduction to

Innovation, Leadership & Learning Nottingham University Business School An introduction to

PRESENTATION ON PROVIDER NETWORKS DIANE RHODES SENIOR POLICY MANAGER The Texas Dental

Strengthening Medicaid and CHIP for Children & Families Anne Marie Costello Acting Director,

Helping Populations and Communities Improve Their Health 6/1/2016 Helping Populations and

2015 CHIP Progress 2015 CHIP Overview In May-August 2015, Ottawa County developed its first

Faster arithmetic for number-theoretic transforms David Harvey - PowerPoint PPT Presentation

Faster arithmetic for number-theoretic transforms David Harvey University of New South Wales 7th October 2011, Macquarie University David Harvey Faster arithmetic for number-theoretic transforms Plan for talk 1. Review number-theoretic

By Shervin Daneshpajouh Computer Arithmetic Computer Arithmetic p Computer Computer Arithmetic

Motion Estimation by Affine Transforms Motion Estimation by Affine Transforms Motion Estimation

Digital Design Discussion: Arithmetic Binary Arithmetic Floating-Point Arithmetic Binary

JUST THE MATHS SLIDES NUMBER 16.7 LAPLACE TRANSFORMS 7 (An appendix) by A.J.Hobson One

FASTER TRANSFORMER Bo Yang Hsueh, 2019/12/18 AGENDA What is Faster Transformer Introduce the

Drawing on the Web CSS CSCI-UA 380 Transforms, Transitions, and Animation Drawing on the Web

JUST THE MATHS SLIDES NUMBER 16.8 Z-TRANSFORMS 1 (Definition and rules) by A.J.Hobson

JUST THE MATHS SLIDES NUMBER 16.2 LAPLACE TRANSFORMS 2 (Inverse Laplace Transforms) by

PRODUCT NUMBER: RT01 REAL TENDER Product Number: RT01 FASTER Crank handle for faster

Lecture 4 Arithmetic-Logic Unit 1 Arithmetic - Logic Unit ALU Handles integers Does the

Arithmetic for Computers October 31, 2008 Arithmetic for Computers ALU Arithmetic Logic Unit

Section 4 Section 4 Arithmetic Units a 4-1 1 ALU ALU a 4-2 2 Arithmetic Logic Unit (ALU)

Research that Transforms Healthcare and Transforms Lives Dianne Morrison-Beedy, PhD, RN, WHNP-BC,

Week 5 -Wednesday What did we talk about last time? Transforms Translation Rotation

F F Fast Transforms using the Cell/B.E. Processor Fast Transforms using the Cell/B.E. Processor

M- -Channel Filter Banks: Channel Filter Banks: M Block and Lapped Transforms Block and Lapped

Productathon: Final Presentation MENTOR: APARNA MOHAN BY: ANIRUDHA BHARADWAJ Table of Contents

NOTHING ABOUT US WITHOUT US Boston Users Union Mass Ave IS a Consumption Space, but its far

Innovation, Leadership &amp; Learning Nottingham University Business School An Introduction to

Innovation, Leadership &amp; Learning Nottingham University Business School An introduction to

PRESENTATION ON PROVIDER NETWORKS DIANE RHODES SENIOR POLICY MANAGER The Texas Dental

Strengthening Medicaid and CHIP for Children &amp; Families Anne Marie Costello Acting Director,

Helping Populations and Communities Improve Their Health 6/1/2016 Helping Populations and

2015 CHIP Progress 2015 CHIP Overview In May-August 2015, Ottawa County developed its first

Innovation, Leadership & Learning Nottingham University Business School An Introduction to

Innovation, Leadership & Learning Nottingham University Business School An introduction to

Strengthening Medicaid and CHIP for Children & Families Anne Marie Costello Acting Director,