
FFT libraries on Cray XT: CRay Adaptive FFT (CRAFFT), Jonathan Bentz



  1. FFT libraries on Cray XT: CRay Adaptive FFT (CRAFFT) Jonathan Bentz Cray Inc.

  2. Outline
     - Background
     - Current FFT libraries on XT
     - CRAFFT design
     - Example interfaces
     - Performance results
     - Future plans
     - Questions?

     May 05 Cray Inc. Proprietary

  3. Fourier Transform Background
     - Discrete Fourier Transform (DFT)
       - Transforms an array x(0:N-1) into X(0:N-1)
       - Calculation by the definition is an O(N^2) algorithm

         $$X_k = \sum_{j=0}^{N-1} x_j \exp\left(-\frac{2\pi i\, jk}{N}\right), \qquad i = \sqrt{-1}$$

     - Fast Fourier Transform (FFT)
       - Algorithm to calculate the DFT using O(N log N) operations
       - Algorithm is dependent on N
     - Applications (among many)
       - Signal processing
       - Solving PDEs
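The gap between the O(N^2) definition and an O(N log N) algorithm can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not code from CRAFFT or FFTW; the function names `dft_naive` and `fft_radix2` are made up for this example, and the radix-2 variant assumes N is a power of two.

```python
import cmath


def dft_naive(x, isign=-1):
    """O(N^2) DFT evaluated directly from the definition."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(complex(0, isign * 2 * cmath.pi * j * k / n))
            for j in range(n))
        for k in range(n)
    ]


def fft_radix2(x, isign=-1):
    """O(N log N) recursive Cooley-Tukey FFT; N must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("this sketch only handles power-of-two sizes")
    even = fft_radix2(x[0::2], isign)   # transform of even-indexed samples
    odd = fft_radix2(x[1::2], isign)    # transform of odd-indexed samples
    half = n // 2
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(complex(0, isign * 2 * cmath.pi * k / n)) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


if __name__ == "__main__":
    data = [complex(v, 0) for v in (1, 2, 3, 4, 0, 0, 0, 0)]
    slow = dft_naive(data)
    fast = fft_radix2(data)
    # Largest disagreement between the two methods (rounding error only).
    print(max(abs(a - b) for a, b in zip(slow, fast)))
```

Both functions compute the same transform. The recursive version does so by splitting the problem in half at each level, which is why its structure depends on how N factors.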

  4. Current FFT libraries on XT
     - FFTW (MIT, Frigo & Johnson, fftw.org)
       - Serial performance is very competitive
         - SIMD code for x86
         - Sophisticated run-time tuning mechanisms
       - Extremely flexible interface
         - FFT for almost any data distribution you can imagine
         - Complicated and tedious interface
       - Substantive differences between versions 2 and 3
         - Interfaces are incompatible
         - Parallel transforms in version 2 only
         - Superior serial performance in version 3
     - ACML (AMD, amd.com)
       - Performance is not spectacular
         - Especially on non-powers of 2

  5. FFT libraries common practice
     Execution of an FFT in application code generally has two steps:
     1. PLANNING stage
        - Initialize the FFT library based on the FFT size
          - Some libraries pre-compute a table of trigonometric values
          - FFTW is able to try out various FFTs of that size and choose the fastest one
        - Often this can take orders of magnitude longer than the actual execution of the FFT
        - An FFTW_PATIENT plan for a size 512^3 FFT takes 2758 sec to plan and 9.7 sec to execute!
     2. EXECUTION stage
        - Execute the FFT using the information from the planning stage
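The two-stage pattern can be sketched as follows. This is an illustration in standard-library Python, not the FFTW or CRAFFT implementation; `make_plan` and `execute` are hypothetical names. The plan here only precomputes a trigonometric table and a bit-reversal permutation for power-of-two sizes, and does none of the timing-based search that FFTW planning performs.

```python
import cmath


def make_plan(n, isign=-1):
    """Planning stage: precompute everything that depends only on n."""
    if n < 1 or (n & (n - 1)) != 0:
        raise ValueError("this sketch only handles power-of-two sizes")
    bits = n.bit_length() - 1
    # Bit-reversal permutation used to reorder the input.
    rev = [0] * n
    for i in range(1, n):
        rev[i] = (rev[i >> 1] >> 1) | ((i & 1) << (bits - 1))
    # Table of trigonometric values (twiddle factors).
    twiddles = [cmath.exp(complex(0, isign * 2 * cmath.pi * k / n))
                for k in range(n // 2)]
    return {"n": n, "rev": rev, "twiddles": twiddles}


def execute(plan, x):
    """Execution stage: iterative radix-2 FFT driven by the plan's tables."""
    n = plan["n"]
    if len(x) != n:
        raise ValueError("input length does not match the plan")
    a = [complex(x[r]) for r in plan["rev"]]
    size = 2
    while size <= n:
        half, step = size // 2, n // size
        for start in range(0, n, size):
            for k in range(half):
                t = plan["twiddles"][k * step] * a[start + k + half]
                u = a[start + k]
                a[start + k] = u + t
                a[start + k + half] = u - t
        size *= 2
    return a


if __name__ == "__main__":
    plan = make_plan(8)            # pay the setup cost once
    for _ in range(3):             # reuse the plan for many transforms
        spectrum = execute(plan, [1, 0, 0, 0, 0, 0, 0, 0])
    print(spectrum)
```

Because the plan depends only on the size and direction, it can be built once and reused for every transform of that shape, which is what makes an expensive planning stage tolerable.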

  6. Major problem with FFT libs
     - Which library to choose?
       - We want the best possible FFT performance
       - To date, we have seen excellent performance from FFTW
       - FFTW also has a rich set of options for different data distributions
       - Do NOT want to change application code frequently
     - How to use the complicated interfaces?
       - FFTW can be really difficult to use
       - E.g., a 2D FFT with LDA > size takes 14 arguments:

             call dfftw_plan_many_dft(plan,rank,n,howmany, &
                                      input,inembed,       &
                                      istride,idist,       &
                                      output,onembed,      &
                                      ostride,odist,       &
                                      expon,FFTW_flags)

  7. CRAFFT library solves this problem
     - CRAFFT is designed with simple-to-use interfaces
       - Planning and execution stages can be combined into one subroutine call
       - Underneath the interfaces, CRAFFT calls the appropriate FFT kernel
     - CRAFFT provides both offline and online tuning
       - Offline tuning
         - Which FFT kernel to use
         - Pre-computed PLANs for common-sized FFTs
       - Online tuning is performed as necessary at runtime as well
     - At runtime, CRAFFT adaptively selects the best FFT kernel to use based on both offline and online testing (e.g., ACML, FFTW, custom FFT)

  8. User Interface Choices
     - Cray-style interface (mostly for legacy compatibility)
       - ZZFFT(…): 1D complex-to-complex double precision FFT
     - Simple interface
       - CRAFFT_z2z1d(size,array,isign)
       - Just the basics: size and array locations
       - All internals, including possible temporary memory allocation and tuning, are taken care of
       - The easiest choice for users
     - Advanced interface
       - CRAFFT_z2z1d(size,array,isign,workspace,PLANNING)
       - In addition to size and array, the user also provides workspace and planning parameters
       - In 2D and 3D, the leading-dimension type arguments can be used

  9. Interfaces (cont.)
     - All subroutine names have the form crafft_α2βθd
       - α, β = S, D, C or Z as in netlib, i.e., D = double precision real, C = single precision complex
       - θ = 1, 2 or 3, i.e., the dimension of the transform
       - E.g., crafft_d2z1d is a double real to double complex transform in 1D
     - Interface makes use of F90 modules to overload the names
       - Users must put "use crafft" in their Fortran source code
     - 1D complex-to-complex examples:
       - crafft_z2z1d(size,array,isign): in-place
       - crafft_z2z1d(size,input,output,isign): out-of-place
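The naming convention can be decoded mechanically. The helper below is a hypothetical illustration in Python, not part of CRAFFT. It only parses names of the stated form using the netlib letter meanings (S and D for single and double real, C and Z for single and double complex), and it does not check whether a given combination is actually provided by the library.

```python
import re

# netlib-style type letters: (precision, kind)
_TYPES = {
    "s": ("single", "real"),
    "d": ("double", "real"),
    "c": ("single", "complex"),
    "z": ("double", "complex"),
}

# crafft_<alpha>2<beta><theta>d, e.g. crafft_d2z1d
_PATTERN = re.compile(r"^crafft_([sdcz])2([sdcz])([123])d$")


def describe(name):
    """Split a crafft_* routine name into input type, output type and rank."""
    match = _PATTERN.match(name.lower())
    if match is None:
        raise ValueError(f"not of the form crafft_a2bNd: {name!r}")
    alpha, beta, theta = match.groups()
    return {
        "input": _TYPES[alpha],
        "output": _TYPES[beta],
        "rank": int(theta),
    }


if __name__ == "__main__":
    print(describe("crafft_d2z1d"))
    print(describe("crafft_z2z2d"))
```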

  10. Simple 1D CRAFFT call resolves to…
      - crafft_z2z1d(n,input,isign)
        - z2z1d_simple1_inplace(n,input,isign)
          - z2z1d_simple_internal(n,input,input,isign,1,1)
            - dfftw_plan_dft_1d(plan,n,input,output,isign,FFTW_FLAG)
            - dfftw_execute(plan)

  11. Advanced 2D CRAFFT call resolves to…
      - crafft_z2z2d(n1,n2,input,ld_in,output,ld_out,isign,work)
        - z2z2d_adv1(n1,n2,input,ld_in,output,ld_out,isign,work)
          - z2z2d_adv_internal(n1,n2,input,ld_in,output,ld_out,isign,1,1,work)
            - dfftw_plan_many_dft(plan,rank,n,howmany,input,inembed,istride,idist,output,onembed,ostride,odist,isign,FFTW_FLAG)
            - dfftw_execute(plan)
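One plausible way a wrapper could fill in the long argument list from the short one is sketched below. The slides do not show the values CRAFFT passes, so the mapping is an assumption. It takes a single transform (howmany = 1) on column-major arrays, with dimension arrays given in Fortran order as in FFTW's legacy Fortran interface. The function names are hypothetical, and the result is a plain dictionary rather than a real FFTW call.

```python
def many_dft_args_2d(n1, n2, ld_in, ld_out, isign):
    """Assumed mapping from (n1, n2, ld_in, ld_out) to plan_many_dft-style args."""
    if ld_in < n1 or ld_out < n1:
        raise ValueError("leading dimension must be at least n1")
    return {
        "rank": 2,
        "n": (n1, n2),               # logical transform size
        "howmany": 1,                # a single 2D transform
        "inembed": (ld_in, n2),      # physical shape of the input array
        "istride": 1,                # contiguous elements within a column
        "idist": 0,                  # not used when howmany == 1
        "onembed": (ld_out, n2),     # physical shape of the output array
        "ostride": 1,
        "odist": 0,
        "sign": isign,
    }


def column_major_offset(i, j, ld):
    """Zero-based memory offset of element (i, j) when the leading dimension is ld."""
    return i + j * ld


if __name__ == "__main__":
    # A 4 x 3 transform stored inside an array whose columns are 6 elements long.
    args = many_dft_args_2d(4, 3, ld_in=6, ld_out=4, isign=-1)
    print(args["n"], args["inembed"], args["onembed"])
    print(column_major_offset(1, 2, ld=6))
```

The leading dimension only changes how far apart consecutive columns sit in memory, which is why the simple interface can omit it whenever the array is exactly n1 by n2.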

  12. CRAFFT user code calling sequence
      - call crafft_init()
        - Initialize the library
        - Set up the offline wisdom
      - call crafft_z2z1d(n,input,-1)
        - Perform online tuning
        - Execute the forward FFT
      - Do work
      - call crafft_z2z1d(n,input,+1)
        - Execute the backward FFT
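The forward, work, backward pattern can be mimicked with a stand-in transform. In this sketch `z2z1d_inplace` is a hypothetical pure-Python substitute for an in-place complex-to-complex call, not the CRAFFT routine. It is assumed to be unnormalized, as FFTW's transforms are, so the round trip is divided by n; the slides do not say whether CRAFFT applies any scaling itself.

```python
import cmath


def z2z1d_inplace(n, array, isign):
    """Stand-in in-place transform: isign=-1 forward, isign=+1 backward."""
    result = [
        sum(array[j] * cmath.exp(complex(0, isign * 2 * cmath.pi * j * k / n))
            for j in range(n))
        for k in range(n)
    ]
    array[:] = result   # overwrite the caller's buffer, as an in-place call would


def round_trip(signal):
    """Forward transform, placeholder for work, backward transform, rescale."""
    n = len(signal)
    work = [complex(v) for v in signal]
    z2z1d_inplace(n, work, -1)      # forward FFT
    # ... work in frequency space would go here ...
    z2z1d_inplace(n, work, +1)      # backward FFT
    return [v / n for v in work]    # undo the factor of n (assumed convention)


if __name__ == "__main__":
    original = [1.0, 2.0, 3.0, 4.0, 5.0]
    recovered = round_trip(original)
    print(max(abs(a - b) for a, b in zip(original, recovered)))
```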

  13. CRAFFT 1.0alpha (current status)
      - Largely FFTW-centric
        - Includes FFTW offline wisdom to minimize expensive online planning
        - Allows a simple interface into advanced FFTW functionality
      - Proposed release in summer 2008
      - Performance?

  14. Walltime vs. size, 1D C2C FFT planner
      [Plot: Time (s) vs. Size (1 to 262144), with CRAFFT_PLANNER=0 and FFTW_ESTIMATE. Series: C2C CRAFFT planner+exe; C2C FFTW planner.]

  15. Walltime vs. size, 1D C2C FFT execute
      [Plot: Time (s) vs. Size (1 to 262144), with CRAFFT_PLANNER=0 and FFTW_ESTIMATE. Series: C2C CRAFFT exe; C2C FFTW exe.]

  16. Walltime vs. size, 1D C2C FFT planner
      [Plot: Time (s) vs. Size (1 to 262144), with CRAFFT_PLANNER=2 and FFTW_PATIENT. Series: C2C CRAFFT plan+exe; C2C FFTW plan.]

  17. Walltime vs. size, 1D C2C FFT execute
      [Plot: Time (s) vs. Size (1 to 262144), with CRAFFT_PLANNER=2 and FFTW_PATIENT. Series: C2C CRAFFT exe; C2C FFTW exe.]

  18. Summary
      - CRAFFT provides a simple interface into FFT for XT
        - Avoid those nasty 14-argument FFTW calls!
      - CRAFFT overhead is minimal
      - CRAFFT performance is excellent when using common-sized FFTs
        - CRAFFT avoids the expensive planning stage

  19. Future Work
      - Additional libraries "under the covers"
        - Complete libraries, e.g., SPIRAL (CMU, Franchetti et al., spiral.net)
        - Targeted tuning of kernels for specific sizes
      - Parallel FFT
        - Again, provide a simple, intuitive interface and handle the details transparently
        - Provide multiple data distributions

  20. Questions?
      Email: jnbntz@cray.com
