SLIDE 1

GPU vs Xeon Phi: Performance of Bandwidth Bound Applications with a Lattice QCD Case Study

Mathias Wagner

SLIDE 2

Lattice Quantum ChromoDynamics

and Deep Learning … sorry, not (yet?) here.

SLIDE 3

Lattice QCD: Some Basics

  • QCD partition function
  • 4 dimensional grid (=Lattice)
  • quarks live on lattice sites
  • gluons live on the links
  • typical sizes: 24^3 x 6 to 256^4
  • parallelization over lattice sites (10^5 to 10^9)

$Z_{\mathrm{QCD}}(T,\mu) = \int \mathcal{D}A\,\mathcal{D}\bar{\Psi}\,\mathcal{D}\Psi\; e^{-S_E(T,\mu)}$

the Euclidean action S_E(T, µ) includes an integral over space and time

SLIDE 4

Staggered Fermion Matrix (Dslash)

  • Krylov-space inversion of the fermion matrix dominates the runtime
  • within the inversion, application of the sparse matrix (Dslash) dominates (>80%)
  • Highly Improved Staggered Quarks (HISQ) use a nearest- and 3rd-neighbor stencil



 
 
 


  • each site (x) loads 1024 bytes for links and 384 bytes for vectors, stores 24 bytes: total 1432 bytes / site
  • performs 1146 flop: arithmetic intensity: 0.8 flop/byte

sensitive to memory bandwidth

$w_x = D_{x,x'}\, v_{x'} = \sum_{\mu=0}^{3} \Big[\, U_{x,\mu}\, v_{x+\hat\mu} - U^{\dagger}_{x-\hat\mu,\mu}\, v_{x-\hat\mu} + N_{x,\mu}\, v_{x+3\hat\mu} - N^{\dagger}_{x-3\hat\mu,\mu}\, v_{x-3\hat\mu} \,\Big]$

complex 3x3 matrix: 72 bytes in fp32; with U(3) symmetry: 56 bytes in fp32
complex 3-dim vector (input and output): 24 bytes in fp32
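Putting the byte counts listed above together (a worked check of the numbers on this slide, not additional data from the talk):

$8 \cdot 72 + 8 \cdot 56\ (\text{links}) \;+\; 16 \cdot 24\ (\text{input vectors}) \;+\; 24\ (\text{store}) \;=\; 1024 + 384 + 24 \;=\; 1432\ \text{bytes/site}$

$\text{arithmetic intensity} \;=\; 1146\ \text{flop} \,/\, 1432\ \text{byte} \;\approx\; 0.80\ \text{flop/byte}$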

SLIDE 5

Accelerators

Sorry, not the ones with liquid helium cooling and TDP > 300W.

SLIDE 6

Intel Xeon Phi and Nvidia Tesla

                          5110            7120             K20      K20X     K40
Cores / SMX               60              61               13       14       15
Vector instructions       512 bit (16 fp32)                -        -        -
CUDA cores / SMX          -               -                192      192      192
Clock Speed [MHz]         1053            1238-1333        705      732      745-875
peak fp32 [TFlop/s]       2.02            2.42             3.52     3.91     4.29
peak fp64 [TFlop/s]       1.01            1.21             1.27     1.31     1.43
Memory [GB]               8               8                5        6        12
Memory Bandwidth [GB/s]   320             352              208      250      288
L1 Cache per core/SMX     32 kB           32 kB            16-48 kB + 48 kB (Texture)
L2 Cache [MB]             30 (60 x 0.5)   30.5 (61 x 0.5)  1.5      1.5      1.5
TDP [W]                   225             300              225      235      235
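As a consistency check on the peak fp32 numbers (assuming one FMA, i.e. 2 flop, per core or CUDA core per cycle, an assumption not stated on the slide):

$60 \cdot 16 \cdot 2 \cdot 1.053\ \text{GHz} \approx 2.02\ \text{TFlop/s}$ (5110), $\qquad 15 \cdot 192 \cdot 2 \cdot 0.745\ \text{GHz} \approx 4.29\ \text{TFlop/s}$ (K40)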

How can we achieve this performance? How can we saturate the available bandwidth? How much energy does that require?

SLIDE 7

Setting the bar

What performance can we expect on the different accelerators? Is our code optimized?

SLIDE 8

Estimated Dslash Performance

  • naive model:


bandwidth times arithmetic intensity

[Figure: Dslash performance with ECC (GFlop/s) on 5110, 7120, K20, K40: estimate (peak bw), estimate (triad bw), measured]

SLIDE 9

Estimated Dslash Performance

  • naive model:


bandwidth times arithmetic intensity

  • better: use STREAM triad bandwidth

[Figure: Dslash performance with ECC (GFlop/s) on 5110, 7120, K20, K40: estimate (peak bw), estimate (triad bw), measured]
[Figure: memory bandwidth (GB/s) on 5110, 7120, K20, K40: theoretical, triad, triad with ECC]
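To illustrate the model with the peak numbers from the table (the plotted estimates use the measured triad bandwidth instead, which is lower, and ECC reduces it further):

$\text{K40:}\quad 288\ \text{GB/s} \times 0.80\ \text{flop/byte} \approx 230\ \text{GFlop/s}$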

SLIDE 10

Estimated Dslash Performance

  • naive model:


bandwidth times arithmetic intensity

  • better: use STREAM triad bandwidth
  • measured performance is faster than the estimate from triad bandwidth

[Figure: Dslash performance with ECC (GFlop/s) on 5110, 7120, K20, K40: estimate (peak bw), estimate (triad bw), measured]

→ account for the existence of caches in the performance estimate

SLIDE 11

Caching for vectors

  • for upper limit: assume cache hits are free 


bytes / site: 1024 + (1 - hit rate) x 384 + 24
 
 


[Figure: Dslash performance with ECC (GFlop/s) on 5110, 7120, K20, K40: est. no cache, est. perfect cache, measured. Sketch of the per-site traffic: gauge field links, 16 input vectors (24 bytes each), 1 output vector]

SLIDE 12

Caching for vectors

  • for upper limit: assume cache hits are free 


bytes / site: 1024 + (1 - hit rate) x 384 + 24
 
 


  • Perfect caching scenario: hit for 15 out of 16 input vectors


→ arithmetic intensity 1.07 (w/o cache 0.80)
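The 1.07 follows directly from the traffic model above: with 15 of 16 input vectors hitting in cache, only one vector load per site goes to memory,

$\text{bytes/site} = 1024 + 384/16 + 24 = 1072, \qquad \text{AI} = 1146 / 1072 \approx 1.07\ \text{flop/byte}$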

[Figure: Dslash performance with ECC (GFlop/s) on 5110, 7120, K20, K40: est. no cache, est. perfect cache, measured. Sketch of the per-site traffic: gauge field links, 16 input vectors (24 bytes each), 1 output vector]

SLIDE 13

Caching for vectors

  • for upper limit: assume cache hits are free 


bytes / site: 1024 + (1 - hit rate) x 384 + 24
 
 


  • Perfect caching scenario: hit for 15 out of 16 input vectors


→ arithmetic intensity 1.07 (w/o cache 0.80)

  • typical size of a vector: 32^3 x 8 → 3 MB, 64^3 x 16 → 24 MB
  • KNC: ~30 MB L2 (512 kB / core) + 32 kB L1 / core [60 cores]
  • Kepler: 1.5 MB L2 + (16-48) kB L1 / SMX [15 SMX]

[Figure: Dslash performance with ECC (GFlop/s) on 5110, 7120, K20, K40: est. no cache, est. perfect cache, measured. Sketch of the per-site traffic: gauge field links, 16 input vectors (24 bytes each), 1 output vector]

SLIDE 14

[Diagram: Kepler memory hierarchy: DRAM → L2 → SM; within the SM, loads go through L1, the read-only (texture) cache, or the constant cache; the path is the programmer's choice, with L1 as the default]

Try to get a better estimate (GPU focussed)

  • Empirical: vectors through L1, links through the texture cache
  • ignore L2: it also has to serve the gauge-field loads (128 MB - 1024 MB)

SLIDE 15

Try to get a better estimate (GPU focussed)

  • Empirical: vectors through L1, links through the texture cache
  • ignore L2: it also has to serve the gauge-field loads (128 MB - 1024 MB)
  • 48 kB L1 can hold 2048 24-byte vector elements
  • for 64^3 x 16: one xy-plane fits (even-odd preconditioning)

→ hit 7 out of 16 input vectors (43% hit rate)

  • for 32^3 x 8: an xy-plane has 512 elements → 4 xy-planes fit

→ in the z-direction we can hit 2 of 4 elements: 9/16 (56% hit rate)
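The capacity argument behind these hit rates: $48\ \text{kB} / 24\ \text{B} = 2048$ vector elements; for $64^3 \times 16$ an even-odd xy-plane has $64 \cdot 64 / 2 = 2048$ sites, so exactly one plane fits, while for $32^3 \times 8$ a plane has $32 \cdot 32 / 2 = 512$ sites, so four planes fit.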

SLIDE 16


Try to get a better estimate (GPU focussed)

  • Empirical: vectors through L1, links through the texture cache
  • ignore L2: it also has to serve the gauge-field loads (128 MB - 1024 MB)
  • 48 kB L1 can hold 2048 24-byte vector elements
  • for 64^3 x 16: one xy-plane fits (even-odd preconditioning)

→ hit 7 out of 16 input vectors (43% hit rate)

  • for 32^3 x 8: an xy-plane has 512 elements → 4 xy-planes fit

→ in the z-direction we can hit 2 of 4 elements: 9/16 (56% hit rate)

hit rate                0/16   3/16   5/16   7/16   9/16   15/16
arithmetic intensity    0.80   0.84   0.87   0.91   0.94   1.07
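The table follows from the same traffic model as before, with the link and store traffic fixed and only the cached fraction of the vector loads removed:

$\text{AI}(h) \;=\; \frac{1146}{1024 + (1-h)\cdot 384 + 24}, \qquad \text{e.g. } h = 9/16:\ 1146/1216 \approx 0.94$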

SLIDE 17

Try to get a better estimate (GPU focussed)

  • Empirical: vectors through L1, links through the texture cache
  • ignore L2: it also has to serve the gauge-field loads (128 MB - 1024 MB)
  • 48 kB L1 can hold 2048 24-byte vector elements
  • for 64^3 x 16: one xy-plane fits (even-odd preconditioning)

→ hit 7 out of 16 input vectors (43% hit rate)

  • for 32^3 x 8: an xy-plane has 512 elements → 4 xy-planes fit

→ in the z-direction we can hit 2 of 4 elements: 9/16 (56% hit rate)

[Figure: Dslash performance, K40, ECC, 32^3 x 8 (GFlop/s): estimates for hit rates 3/16, 5/16, 7/16, 9/16, 15/16 vs. measured]

hit rate                0/16   3/16   5/16   7/16   9/16   15/16
arithmetic intensity    0.80   0.84   0.87   0.91   0.94   1.07

profiler: L1 hit rate 44% (L2 7%)

SLIDE 18

Increasing the Intensity

Focus on the arithmetic intensity now … push-ups later. Caching helps for the vectors, but remember they are only ~25% of the memory traffic. What can we do about the gauge links?

SLIDE 19

HISQ Inverter for multiple right hand sides (rhs)

  • combine multiple inversions with constant gauge field (constant sparse matrix)


  • reuse links (input for the sparse matrix) in the matrix-vector multiplication (Dslash)


$\left( w^{(1)}_x,\, w^{(2)}_x,\, \ldots,\, w^{(n)}_x \right) = D_{x,x'} \left( v^{(1)}_{x'},\, v^{(2)}_{x'},\, \ldots,\, v^{(n)}_{x'} \right)$

SLIDE 20

HISQ Inverter for multiple right hand sides (rhs)

  • combine multiple inversions with constant gauge field (constant sparse matrix)


  • reuse links (input for the sparse matrix) in the matrix-vector multiplication (Dslash)


$\left( w^{(1)}_x,\, w^{(2)}_x,\, \ldots,\, w^{(n)}_x \right) = D_{x,x'} \left( v^{(1)}_{x'},\, v^{(2)}_{x'},\, \ldots,\, v^{(n)}_{x'} \right)$

# rhs        1      2      3      4      5
Flop/byte    0.80   1.25   1.53   1.73   1.87

[Figure: arithmetic intensity vs. # rhs (1-5)]
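The table is reproduced by amortizing the 1024 bytes of link data over the n right-hand sides while the vector traffic scales with n (no vector caching assumed):

$\text{AI}(n) \;=\; \frac{1146\,n}{1024 + 408\,n}, \qquad \text{e.g. } \text{AI}(4) = 4584/2656 \approx 1.73\ \text{flop/byte}$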

SLIDE 21

HISQ Inverter for multiple right hand sides (rhs)

  • combine multiple inversions with constant gauge field (constant sparse matrix)


  • reuse links (input for the sparse matrix) in the matrix-vector multiplication (Dslash)


  • cache effects for the vectors are ignored here
  • caching becomes much harder as the cache now has to be shared by the vectors of all rhs
  • memory traffic from gauge links decreases from 70% (1 rhs) to 30% (4 rhs)

$\left( w^{(1)}_x,\, w^{(2)}_x,\, \ldots,\, w^{(n)}_x \right) = D_{x,x'} \left( v^{(1)}_{x'},\, v^{(2)}_{x'},\, \ldots,\, v^{(n)}_{x'} \right)$

# rhs        1      2      3      4      5
Flop/byte    0.80   1.25   1.53   1.73   1.87

SLIDE 22

GPU Implementation: Texture Cache and Registers

  • obvious solution: store matrix in registers
  • possible issue: more registers / thread


→ occupancy / spilling

__global__ Dslashreg(w1, w2, w3, v1, v2, v3) {
  ...
  for (xp = ...) {
    w1(x) = D(x,xp) * v1(xp);
    w2(x) = D(x,xp) * v2(xp);
    w3(x) = D(x,xp) * v3(xp);
  }
}

SLIDE 23

GPU Implementation: Texture Cache and Registers

  • obvious solution: store matrix in registers
  • possible issue: more registers / thread


→ occupancy / spilling

  • exploit texture cache


→ reduce register pressure

  • links should hit in texture cache


→ only one global load

  • one block is executed by one SMX

__global__ Dslashcache(w, v) {
  ...
  offset = threadIdx.y;
  for (xp = ...)
    w(x, offset) += D(x,xp) * v(xp, offset);
}

[Diagram: block data layout: vectors v1, v2, v3 for sites x = 0 … BS-1]

SLIDE 24

GPU Implementation: Texture Cache and Registers

  • obvious solution: store matrix in registers
  • possible issue: more registers / thread


→ occupancy / spilling

  • exploit texture cache


→ reduce register pressure

  • links should hit in texture cache


→ only one global load

  • one block is executed by one SMX
  • combine both and explore best possible combinations

__global__ Dslashregcache(w1, w2, w3, v1, v2, v3) {
  ...
  offset = threadIdx.y;
  for (xp = ...) {
    w1(x, offset) = D(x,xp) * v1(xp, offset);
    w2(x, offset) = D(x,xp) * v2(xp, offset);
    w3(x, offset) = D(x,xp) * v3(xp, offset);
  }
}

[Diagram: block data layout: vectors v1, v2, v3 for sites x = 0 … BS-1]
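To make the idea concrete, here is a minimal, self-contained CUDA sketch of the register + read-only-cache combination. It is not the production kernel from the talk: the kernel name, the NRHS constant, and the flat data layout are invented for illustration, and it applies only a single link matrix per site instead of the full stencil.

#include <cuComplex.h>

#define NRHS 4  // number of right-hand sides handled per site (illustrative)

// Hypothetical sketch: apply one 3x3 complex link matrix U(x) to NRHS input
// vectors. The matrix is loaded once per thread through the read-only
// (texture) cache with __ldg (sm_35+) and then reused from registers.
__global__ void link_times_multi_rhs(cuFloatComplex*       w,  // out: 3*NRHS per site
                                     const cuFloatComplex* U,  // in : 9 per site
                                     const cuFloatComplex* v,  // in : 3*NRHS per site
                                     int nsites)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= nsites) return;

    // one cached load of the link matrix, shared (via registers) by all rhs
    cuFloatComplex u[3][3];
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            u[i][j] = __ldg(&U[9 * x + 3 * i + j]);

    // amortize the matrix load over NRHS matrix-vector products
    for (int r = 0; r < NRHS; ++r) {
        const cuFloatComplex* vin  = v + 3 * (x * NRHS + r);
        cuFloatComplex*       wout = w + 3 * (x * NRHS + r);
        for (int i = 0; i < 3; ++i) {
            cuFloatComplex acc = make_cuFloatComplex(0.f, 0.f);
            for (int j = 0; j < 3; ++j)
                acc = cuCaddf(acc, cuCmulf(u[i][j], __ldg(&vin[j])));
            wout[i] = acc;
        }
    }
}

The structural point is the same as on the slide: the link matrix is fetched once through the read-only path and then reused from registers for every right-hand side, raising the arithmetic intensity without extra memory traffic.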

SLIDE 25

Does it work ?

  • use only memory bandwidth and arithmetic intensity
  • estimate with bandwidth from the triad benchmark
  • works even better than expected
  • expected speedup for 4 rhs vs. 1 rhs: 1.73 / 0.8 ≈ 2.16
  • observed speedup: ~2.5
  • makes more efficient use of the GPU (why?)
  • pure loading through the texture cache always wins
  • but the 48 kB texture cache can only hold links for 48 sites

(each site needs 8 x 72 bytes + 8 x 56 bytes)
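The 48-site figure is simply the cache capacity divided by the per-site link volume: $8 \cdot 72 + 8 \cdot 56 = 1024$ bytes of links per site, and $48\ \text{kB} / 1024\ \text{B} = 48$ sites.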

[Figure: Dslash performance (GFlop/s) vs. # rhs (1-4): K20 estimate, K40 estimate, K20 measured, K40 measured]

SLIDE 26

Ask the profiler

  • profile for 4 rhs to see whether caching strategy works:



 
 
 
 
 


  • each gauge link loaded once / rhs → best case 75% texture cache hit
  • better speedup than expected for 4 rhs compared to 1 rhs:
  • better utilization of GPU and better use of L2 cache

Block             [16,4]   [128,4]   [256,4]   [1024,1]
regs              63       63        63        62
occup.            0.49     0.47      0.48      0.48
eligibl. warps    2.45     2.92      3.08      0.87
IPC               1.92     1.92      1.87      0.77
TC Hits %         51.9     74.3      75.9      3.8
L2 (TC) Hits %    50.0     5.6       0.0       0.0
L1 Hits %         18.2     31.2      33.9      44.3
L2 (L1) Hits %    48.4     37.1      28.9      7.1
Tex+L2 Hits %     75.9     75.7      75.9      3.8
L1+L2 Hits %      57.8     56.7      53.0      48.3


SLIDE 27

Can we understand why it works ?

  • focus on pure texture cache solution [1,4]
  • each thread needs (8 x 72 + 8 x 56)=1024 bytes
  • warps (32 threads) assigned to one scheduler

[Diagram: SMX with 4 schedulers, each associated with a 12 kB slice of the 48 kB texture / read-only cache (4 x 12 kB)]

SLIDE 28

Can we understand why it works ?

  • focus on pure texture cache solution [1,4]
  • each thread needs (8 x 72 + 8 x 56)=1024 bytes
  • warps (32 threads) assigned to one scheduler
  • switching between threads: need only some of the data
  • block sizes and warps
  • [16,4] → 2 warps
  • [128,4] → 16 warps

[Diagram: mapping of thread groups (0…15, 0…3) and (16…31, 0…3) to the four texture cache slices for block size [16,4]]

Block     TC Hits %   L2 (TC) Hits %   Tex+L2 Hits %
[16,4]    51.9        50.0             75.9
[128,4]   74.3        5.6              75.7

SLIDE 30

Can we understand why it works ?

  • focus on pure texture cache solution [1,4]
  • each thread needs (8 x 72 + 8 x 56)=1024 bytes
  • warps (32 threads) assigned to one scheduler
  • switching between threads: need only some of the data
  • block sizes and warps
  • [16,4] → 2 warps
  • [128,4] → 16 warps

[Diagram: mapping of thread groups (0…31, y), (32…63, y), (64…95, y), (96…127, y), y = 0…3, to the four texture cache slices for block size [128,4]]

Block     TC Hits %   L2 (TC) Hits %   Tex+L2 Hits %
[16,4]    51.9        50.0             75.9
[128,4]   74.3        5.6              75.7

SLIDE 35

Some Details of the Phi Implementation

  • effort led by Patrick Steinbrecher (Universität Bielefeld → Brookhaven National Lab)
  • single accelerator
  • optimized for performance with multiple rhs
  • parallelized using OpenMP
  • vectorized using intrinsics:
  • fuse lattice sites into 512-bit vectors
  • 16 sites with SoA layout

[Diagram: naive layout vs. 16-fold site fusion: 16 matrices times 16 vectors, with the real and imaginary parts of each matrix and vector element stored separately across the 16 fused sites]

SLIDE 36

Impact of Memory Layout and Prefetch

  • register pressure limits scaling with #rhs
  • software prefetching improves by about 2x
  • hardware prefetching not effective for access pattern
  • 8-fold site fusion
  • reduces register pressure
  • harder to implement
  • small gain for 1 rhs

[Figure: GFlop/s vs. # rhs (1-5): 16-fold, 16-fold + prefetch, 8-fold, 8-fold + prefetch]

SLIDE 37

Let’s get ready to rumble

Results for the full conjugate gradient inverter on Xeon Phi and Tesla

SLIDE 38

Solver performance on KNC and Kepler

[Figure: solver performance (GFlop/s), ECC, 4 rhs, vs. lattice size (16,4 / 32,8 / 48,12 / 32,64 / 64,16) for 5110, 7120, K20, K20X, K40]

SLIDE 39

Solver performance on KNC and Kepler

[Figure: solver performance (GFlop/s), ECC, 4 rhs, vs. lattice size (16,4 / 32,8 / 48,12 / 32,64 / 64,16) for 5110, 7120, K20, K20X, K40]
[Figure: solver performance (GFlop/s), 64^3 x 16, ECC, vs. # rhs (1-5) for 5110, 7120, K20, K20X, K40]

SLIDE 40

Solver performance on KNC and Kepler

[Figure: solver performance (GFlop/s), ECC, 4 rhs, vs. lattice size (16,4 / 32,8 / 48,12 / 32,64 / 64,16) for 5110, 7120, K20, K20X, K40]
[Figure: solver performance (GFlop/s), 64^3 x 16, ECC, vs. # rhs (1-5) for 5110, 7120, K20, K20X, K40]
[Figure: performance relative to K20 (4 rhs) for 5110, 7120, K20, K40: peak bw, triad bw, 32^3x8 CG, 64^3x16 CG]

SLIDE 41

Green or blue computing

How energy efficient are the two architectures? Oh, does anyone wonder about Maxwell in this respect?

SLIDE 42

Energy consumption

  • bandwidth-bound applications are unlikely to hit TDP
  • What is the relevant observable?
  • energy consumed by the node?
  • energy consumed by the accelerator?
  • include infrastructure (cooling, …) ?
  • Take what we can get
  • software reported power consumption (nvprof)
  • Xeon Phi is a bit more tricky: estimate only

Solver, 4 rhs, 32^3 x 8

[Figure: average solver power (W) for 5110 (est), K20, K40, M6000: TDP vs. CG with ECC vs. CG without ECC]

SLIDE 43

Performance per Watt

[Figure: solver performance (GFlop/s) for 5110 (est), K20, K40, M6000: CG with ECC vs. CG without ECC]
[Figure: energy efficiency (GFlop/s per W) for 5110 (est), K20, K40, M6000: CG with ECC vs. CG without ECC]

  • Solver: 4 rhs, 32^3 x 8

preliminary: code only optimized for Kepler

SLIDE 44

Finish

SLIDE 45

Summary

  • Lattice QCD application performance reflects the triad bandwidth
  • equally well-performing implementations exist for GPU / Phi
  • multiple rhs can easily speed up the solver by ~2.5x
  • Xeon Phi requires vectorization and software prefetches
  • GPU uses the texture cache
  • caching of vectors can likely be improved further for multiple rhs

[Figure: performance relative to K20 (4 rhs) for 5110, 7120, K20, K40: peak bw, triad bw, 32^3x8 CG, 64^3x16 CG]
[Figure: energy efficiency (GFlop/s per W) for 5110 (est), K20, K40, M6000]

  • GK110 about 1.5 times more efficient than KNC
  • Maxwell promises another factor 1.5
  • multiple rhs about twice as energy efficient

SLIDE 46

GPU vs Xeon Phi: Performance of Bandwidth Bound Applications with a Lattice QCD Case Study

Contact: mathwagn@indiana.edu | http://linked.in/mathwagn | @mathwagn
Collaborators: P. Steinbrecher (Bielefeld U → Brookhaven National Lab)

  • C. Schmidt (Bielefeld U)
  • O. Kaczmarek (Bielefeld U)

References: arXiv:1411.4439 [physics.comp-ph], arXiv:1409.1510 [cs.DC]
Thanks to: Jeongnim Kim (Intel), Mike Clark (Nvidia)