Introduction to the Cray compiler: examples from GTC, Overflow, and PARQUET
Cray has a long tradition of high performance compilers: vectorization, parallelization, code transformation, and more.
Cray began an internal investigation leveraging the open source LLVM compiler infrastructure.
Initial results and progress were better than expected, so Cray decided to move forward with a Cray X86 compiler.
Version 7.0 was released in December 2008; 7.1 will be released in Q2 2009.
Cray Inc. Compiler Technology
[Diagram: Fortran source and C/C++ source enter the Fortran front end and the C & C++ front end; both feed a shared interprocedural analysis, optimization, and parallelization stage; the X86 code generator or the Cray X2 code generator then produces the object file.]
The C and C++ front end is supplied by Edison Design Group, with Cray-developed code for extensions and interface support.
X86 code generation comes from open source LLVM, with additional Cray-developed optimizations and interface support.
Make sure the compiler is available: module avail PrgEnv-cray
To access the Cray compiler: module load PrgEnv-cray
To target the Barcelona chip: module load xtpe-quadcore
Once the module is loaded, "cc" and "ftn" are the Cray compilers.
Recommend just using the default options.
Use -rm (Fortran) and -hlist=m (C) to find out what the compiler did; see man crayftn for details.
Excellent vectorization: vectorizes more loops than other compilers.
OpenMP 2.0 standard, including nesting.
PGAS: functional UPC and CAF available today.
Excellent cache optimizations: automatic blocking, automatic management of what stays in cache, prefetching, interchange, fusion, and much more. An illustrative loop nest is sketched below.
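As an illustration of the kind of loop nest these cache optimizations target (this example is not from the original slides; the names and the matrix multiply itself are just a generic stand-in):

   ! A plain triple-nested matrix multiply: CCE can block, interchange,
   ! and vectorize this nest automatically, keeping tiles of a, b, and c
   ! resident in cache, without any source directives.
   do j = 1, n
      do k = 1, n
         do i = 1, n
            c(i,j) = c(i,j) + a(i,k) * b(k,j)
         enddo
      enddo
   enddo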
Planned for future releases:
C++ support.
Automatic parallelization: a modernized version of the Cray X1 streaming capability that interacts with OMP directives.
OpenMP 3.0.
Optimized PGAS (will require the Gemini network to really go fast).
Improved vectorization.
Improved cache optimizations.
GTC: plasma fusion simulation
A 3D particle-in-cell (PIC) code in toroidal geometry, developed by Prof. Zhihong Lin (now at UC Irvine).
The code has several different characteristics: stride-1 copies, strided memory operations, computationally intensive kernels, gather/scatter, and sorting and packing.
The main routine is known as the "pusher".
The main pusher kernel consists of 2 main loop nests.
The first loop nest contains groups of 4 statements that include significant indirect addressing:

   e1 = e1 + wp0*wt00*(wz0*gradphi(1,0,ij) + wz1*gradphi(1,1,ij))
   e2 = e2 + wp0*wt00*(wz0*gradphi(2,0,ij) + wz1*gradphi(2,1,ij))
   e3 = e3 + wp0*wt00*(wz0*gradphi(3,0,ij) + wz1*gradphi(3,1,ij))
   e4 = e4 + wp0*wt00*(wz0*phit(0,ij) + wz1*phit(1,ij))

Each group of 4 statements is turned into a single vector shortloop:

   ev(1:4) = ev(1:4) + wp0*wt00*(wz0*tempphi(1:4,0,ij) + wz1*tempphi(1:4,1,ij))
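The slides do not show how tempphi is built; presumably the three gradphi components and phit are packed into one contiguous 4-wide array ahead of the loop. The sketch below is only an assumed illustration (the bound nij and the packing loop itself are not from the source):

   ! Hypothetical packing step (assumed, not shown on the original slide):
   ! gather the three gradphi components and phit into one contiguous
   ! array so the 4-wide shortloop above operates on stride-1 data.
   do ij = 1, nij                        ! nij is a placeholder grid bound
      tempphi(1:3,0:1,ij) = gradphi(1:3,0:1,ij)
      tempphi(4,0:1,ij)   = phit(0:1,ij)
   enddo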
The second loop is large and computationally intensive, but contains strided loads and a computed gather; an illustrative pattern is sketched below.
CCE automatically vectorizes this loop.
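For illustration only (this is not the actual GTC loop and the names are invented), a computed-gather pattern of the kind CCE can still vectorize looks like:

   do i = 1, n
      k    = base(i) + offset(i)       ! index computed at run time
      s(i) = s(i) + w(i) * field(k)    ! gather load from a non-contiguous location
   enddo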
[Chart: GTC pusher performance at 3200 MPI ranks and 4 OMP threads, in billion particles pushed per second, comparing CCE against the previous best compiler.]
[Chart: overall GTC performance at 3200 MPI ranks and 4 OMP threads, in billion particles pushed per second, comparing CCE against the previous best compiler.]
Overflow is a NASA-developed Navier-Stokes flow solver for overset grids.
Subroutines consist of two or three simply-nested loops; the inner loops tend to be highly vectorized and contain 20-50 Fortran statements.
MPI is used for parallel processing, and the solver automatically splits grid blocks for load balancing; scaling is limited by load balancing at more than 1024 ranks.
The code is threaded at a high level via OpenMP, as sketched below.
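As a rough sketch of what high-level OpenMP threading over grid blocks can look like (this is not taken from Overflow; the routine and variable names are invented):

   ! Hypothetical high-level threading: each thread advances whole grid
   ! blocks, so the parallel region sits far above the vectorized inner loops.
   !$omp parallel do schedule(dynamic)
   do ib = 1, nblocks
      call solve_block(grid(ib))
   enddo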
[Chart: Overflow scaling, time in seconds versus number of cores (256 to 8192), comparing Previous-MPI, CCE-MPI, CCE-OMP with 2 threads, and CCE-OMP with 4 threads.]
PARQUET is a materials science code. It scales to 1000s of MPI ranks before it runs out of parallelism.
The goal is to use shared memory parallelism across the entire node.
The main kernel consists of 4 independent zgemms, so multi-level OMP is used to scale across the node.
   ! High-level parallelism: one thread per independent zgemm
   !$omp parallel do …
   do i = 1, 4
      call complex_matmul(…)
   enddo

   subroutine complex_matmul(…)
   ! Low-level parallelism: threads split each zgemm into column blocks
   !$omp parallel do private(j,jend,jsize) num_threads(p2)
   do j = 1, n, nb
      jend  = min(n, j+nb-1)
      jsize = jend - j + 1
      call zgemm( transA, transB, m, jsize, k, &
                  alpha, A, ldA, B(j,1), ldB, beta, C(1,j), ldC )
   enddo
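One practical detail not shown on the slide: for threads to be created at both levels, nested parallelism must be enabled at run time (for example via the OMP_NESTED environment variable), and the thread count at each level controlled, here with the num_threads(p2) clause on the inner parallel region.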
[Chart: ZGEMM 1000x1000 performance in GFlops by parallel method and number of threads at each level: serial ZGEMM, high-level OMP ZGEMM 4x1, nested OMP ZGEMM 3x3, 4x2, and 2x4, and low-level OMP ZGEMM 1x8.]