  1. Tuned and Wildly Asynchronous Stencil Kernels for Heterogeneous CPU/GPU Systems
     Sundaresan Venkatasubramanian, Prof. Richard Vuduc
     Georgia Tech
     Int'l. Conference on Supercomputing, June 10, 2009

  2. Motivation and Goals
     - Regular, memory bandwidth-bound kernel
     - GPU with high memory bandwidth
     - Seems simple, but...
       - Tuning
       - Constraints on parallelism and data movement
       - Real systems mix CPU-like and GPU-like processors
     - Goal of paper: solve the mapping problem

  3. Key ideas and results
     - Hand-tuned to understand how to map
     - 98% of empirical streaming bandwidth
     - Model-driven hybrid CPU/GPU implementation
     - Asynchronous algorithms to avoid global syncs [Chazan & Miranker '69]
     - 1.2x to 2.5x speedup, even while performing up to 1.7x the flops!

  4. Problem
     Solve Poisson's equation in 2-D on a square grid, using a centered finite-difference approximation on an (N+2) x (N+2) grid with step size h.
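The equations on this slide did not survive the slide-to-text conversion. As a hedged reconstruction, consistent with the update rule on the next slide, the continuous problem and its standard five-point discretization are (in LaTeX; the sign convention is assumed, not recoverable from the slide):

    % 2-D Poisson problem
    -\nabla^2 u \;=\; -\left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right) \;=\; f(x, y)

    % Centered finite differences on an (N+2) x (N+2) grid with step size h
    \frac{4\,U_{i,j} - U_{i+1,j} - U_{i-1,j} - U_{i,j+1} - U_{i,j-1}}{h^2} \;=\; f_{i,j},
    \qquad 1 \le i, j \le N .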

  5. Algorithm
     - Memory bandwidth bound
     - Used as a subroutine in other, more complex algorithms, e.g., multigrid
     - The sweep:
       for t = 1, 2, 3, ..., T do
         for all unknowns in the grid do    (embarrassingly parallel)
           U^{t+1}_{i,j} = 1/4 * (U^t_{i+1,j} + U^t_{i-1,j} + U^t_{i,j+1} + U^t_{i,j-1})
         end for
       end for
     - 2 copies of the grid
     - Implicit global sync required between sweeps
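As a reference point, here is a minimal sequential C version of this loop; the names (jacobi_reference, u0, u1) are illustrative, not the authors' code. The two arrays play the role of the "2 copies of the grid", and the end of each sweep is the implicit global synchronization.

    // Minimal sequential reference for the sweep above (illustrative sketch).
    // u0 and u1 are the two (N+2) x (N+2) copies of the grid, stored row-major
    // with a one-cell boundary halo around the N x N interior.
    void jacobi_reference(float *u0, float *u1, int N, int T)
    {
        int w = N + 2;                               // row width including halo
        for (int t = 0; t < T; ++t) {                // T sweeps
            for (int i = 1; i <= N; ++i)             // all interior unknowns
                for (int j = 1; j <= N; ++j)
                    u1[i * w + j] = 0.25f * (u0[(i + 1) * w + j] + u0[(i - 1) * w + j] +
                                             u0[i * w + j + 1] + u0[i * w + j - 1]);
            // End of sweep = the implicit global synchronization; swap the copies.
            float *tmp = u0; u0 = u1; u1 = tmp;
        }
    }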

  6. CPU and GPU Baselines

  7. Tuned CPU Baseline*
     - SIMD vectorization
     - Parallelized using pthreads
     - Binding on NUMA architectures
     *See paper for details
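A minimal sketch of what such a CPU baseline might look like: SSE intrinsics for the inner loop plus one pthread per horizontal strip of rows. All names (jacobi_strip, cpu_jacobi_sweep, strip_t) are illustrative, and NUMA binding (e.g., via pthread_setaffinity_np or numactl) is omitted for brevity; this is not the authors' implementation.

    #include <pthread.h>
    #include <xmmintrin.h>   // SSE intrinsics

    typedef struct { const float *u_old; float *u_new; int N, row_begin, row_end; } strip_t;

    static void *jacobi_strip(void *arg)             // one thread per strip of rows
    {
        strip_t *s = (strip_t *)arg;
        int w = s->N + 2;
        for (int i = s->row_begin; i < s->row_end; ++i) {
            int j = 1;
            for (; j + 4 <= s->N + 1; j += 4) {      // 4-wide SSE, unaligned loads
                __m128 n  = _mm_loadu_ps(&s->u_old[(i - 1) * w + j]);
                __m128 so = _mm_loadu_ps(&s->u_old[(i + 1) * w + j]);
                __m128 e  = _mm_loadu_ps(&s->u_old[i * w + j + 1]);
                __m128 wv = _mm_loadu_ps(&s->u_old[i * w + j - 1]);
                __m128 sum = _mm_add_ps(_mm_add_ps(n, so), _mm_add_ps(e, wv));
                _mm_storeu_ps(&s->u_new[i * w + j], _mm_mul_ps(sum, _mm_set1_ps(0.25f)));
            }
            for (; j <= s->N; ++j)                   // scalar remainder
                s->u_new[i * w + j] = 0.25f * (s->u_old[(i - 1) * w + j] + s->u_old[(i + 1) * w + j] +
                                               s->u_old[i * w + j - 1] + s->u_old[i * w + j + 1]);
        }
        return 0;
    }

    // One Jacobi sweep over the interior, split row-wise across nthreads (<= 64).
    void cpu_jacobi_sweep(const float *u_old, float *u_new, int N, int nthreads)
    {
        pthread_t tid[64]; strip_t args[64];
        int rows = (N + nthreads - 1) / nthreads;
        for (int t = 0; t < nthreads; ++t) {
            args[t].u_old = u_old; args[t].u_new = u_new; args[t].N = N;
            args[t].row_begin = 1 + t * rows;
            args[t].row_end = (args[t].row_begin + rows > N + 1) ? N + 1 : args[t].row_begin + rows;
            pthread_create(&tid[t], 0, jacobi_strip, &args[t]);
        }
        for (int t = 0; t < nthreads; ++t) pthread_join(tid[t], 0);
    }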

  8. Tuned GPU Baseline*
     - Exploit shared memory
     - Bank conflicts: access pattern
     - Non-coalesced accesses: padding
     - Loop unrolling
     - Occupancy: proper block size
     *See paper for details
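A hedged sketch of the kind of shared-memory tiling these bullets describe, not the authors' kernel: each block stages a padded tile in shared memory so that global loads are coalesced row by row and column-neighbor reads avoid bank conflicts. TILE and the kernel name are illustrative; launch with 16x16 blocks over an (N/TILE) x (N/TILE) grid, and the sketch assumes N is a multiple of TILE.

    #define TILE 16

    __global__ void jacobi_tiled(const float* __restrict__ u_old,
                                 float* __restrict__ u_new, int N)
    {
        // +1 column of padding keeps column-neighbor accesses conflict-free.
        __shared__ float tile[TILE + 2][TILE + 2 + 1];

        int gi = blockIdx.y * TILE + threadIdx.y + 1;   // global row (interior)
        int gj = blockIdx.x * TILE + threadIdx.x + 1;   // global column (interior)
        int li = threadIdx.y + 1, lj = threadIdx.x + 1; // local index inside tile
        int w  = N + 2;

        tile[li][lj] = u_old[gi * w + gj];              // coalesced tile load
        if (li == 1)    tile[0][lj]        = u_old[(gi - 1) * w + gj];   // halo rows
        if (li == TILE) tile[TILE + 1][lj] = u_old[(gi + 1) * w + gj];
        if (lj == 1)    tile[li][0]        = u_old[gi * w + gj - 1];     // halo columns
        if (lj == TILE) tile[li][TILE + 1] = u_old[gi * w + gj + 1];
        __syncthreads();

        // One Jacobi update per interior point, read entirely from shared memory.
        u_new[gi * w + gj] = 0.25f * (tile[li + 1][lj] + tile[li - 1][lj] +
                                      tile[li][lj + 1] + tile[li][lj - 1]);
    }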

  9. Experimental set-up

  10. Results
      - Device: NVIDIA Tesla C870
      - Grid size: 4096
      - No. of iterations: 32

  11. Results

  12. 98% of empirical streaming bandwidth
      - Approx. 37 GFLOPS
      - 66% of true peak

  13. Half the work of single
      - Approx. 17 GFLOPS

  14. CPU/GPU and Multi-GPU Methods

  15. CPU-GPU implementation

  16. CPU-GPU implementation: need to exchange ghost cells between the CPU and GPU partitions

  17. Algorithm
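The algorithm slide itself is a figure. As a rough sketch of the hybrid scheme described on the surrounding slides, with purely illustrative names (jacobi_step_rows, cpu_jacobi_strip, hybrid_sweeps): the GPU owns interior rows [1, rows_gpu], the CPU owns rows [rows_gpu+1, N], both sides hold a full copy of the grid initialized identically, and the ghost row on each side of the split is exchanged after every sweep.

    #include <cuda_runtime.h>

    __global__ void jacobi_step_rows(const float* __restrict__ u_old,
                                     float* __restrict__ u_new,
                                     int N, int row_begin, int row_end)
    {
        int i = blockIdx.y * blockDim.y + threadIdx.y + row_begin;
        int j = blockIdx.x * blockDim.x + threadIdx.x + 1;
        int w = N + 2;
        if (i <= row_end && j <= N)
            u_new[i * w + j] = 0.25f * (u_old[(i + 1) * w + j] + u_old[(i - 1) * w + j] +
                                        u_old[i * w + j + 1] + u_old[i * w + j - 1]);
    }

    void cpu_jacobi_strip(const float* u_old, float* u_new,
                          int N, int row_begin, int row_end)
    {
        int w = N + 2;
        for (int i = row_begin; i <= row_end; ++i)
            for (int j = 1; j <= N; ++j)
                u_new[i * w + j] = 0.25f * (u_old[(i + 1) * w + j] + u_old[(i - 1) * w + j] +
                                            u_old[i * w + j + 1] + u_old[i * w + j - 1]);
    }

    void hybrid_sweeps(float* h_u0, float* h_u1,   // full grids in host memory
                       float* d_u0, float* d_u1,   // full grids in device memory
                       int N, int rows_gpu, int T)
    {
        int w = N + 2;
        dim3 block(16, 16);
        dim3 grid((N + 15) / 16, (rows_gpu + 15) / 16);
        for (int t = 0; t < T; ++t) {
            jacobi_step_rows<<<grid, block>>>(d_u0, d_u1, N, 1, rows_gpu); // GPU strip
            cpu_jacobi_strip(h_u0, h_u1, N, rows_gpu + 1, N);              // CPU strip
            cudaDeviceSynchronize();

            // Exchange the ghost rows across the CPU/GPU split boundary.
            cudaMemcpy(h_u1 + rows_gpu * w, d_u1 + rows_gpu * w,
                       w * sizeof(float), cudaMemcpyDeviceToHost);
            cudaMemcpy(d_u1 + (rows_gpu + 1) * w, h_u1 + (rows_gpu + 1) * w,
                       w * sizeof(float), cudaMemcpyHostToDevice);

            float* t0 = h_u0; h_u0 = h_u1; h_u1 = t0;  // swap double buffers
            float* t1 = d_u0; d_u0 = d_u1; d_u1 = t1;
        }
        cudaDeviceSynchronize();
    }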

  18. CPU-GPU model graph
      (Figure: hybrid execution time, as a fraction of the baseline CPU-only time, plotted against the fraction of the grid assigned to the CPU. Curves show the GPU part, the CPU part, and the exchange time; horizontal lines mark the CPU-only and GPU-only baselines; the optimal fraction is where the hybrid curve is lowest.)
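A minimal sketch of the kind of model such a graph encodes, assuming illustrative names and made-up timing inputs (t_cpu_full, t_gpu_full, t_exchange); this is not the paper's exact model. The CPU and GPU strips run concurrently, so the hybrid time is the maximum of the two parts plus the ghost-cell exchange cost, and the optimal fraction minimizes that sum.

    #include <cstdio>
    #include <algorithm>

    // f = fraction of the grid assigned to the CPU (0..1).
    double hybrid_time(double f, double t_cpu_full, double t_gpu_full, double t_exchange)
    {
        return std::max(f * t_cpu_full, (1.0 - f) * t_gpu_full) + t_exchange;
    }

    int main()
    {
        const double t_cpu_full = 10.0, t_gpu_full = 2.0, t_exchange = 0.3; // made-up units
        double best_f = 0.0, best_t = hybrid_time(0.0, t_cpu_full, t_gpu_full, t_exchange);
        for (int k = 1; k <= 100; ++k) {                 // scan candidate CPU fractions
            double f = k / 100.0;
            double t = hybrid_time(f, t_cpu_full, t_gpu_full, t_exchange);
            if (t < best_t) { best_t = t; best_f = f; }
        }
        std::printf("optimal CPU fraction ~ %.2f, hybrid time ~ %.2f\n", best_f, best_t);
        return 0;
    }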

  19. CPU-GPU: slower CPU
      (Figure: the same model graph for a slower CPU; the hybrid never beats the pure GPU version.)

  20. ~1.1x speedup

  21.

  22.

  23. ~1.8x speedup

  24. Asynchronous algorithms

  25. TunedSync - review
      Moving data between global and shared memory, with a global synchronization each iteration.
      (Diagram: for each of the T iterations, fetch from one global grid into shared memory, compute, and write to the other global grid.)

  26. TunedSync - review
      (Diagram: one iteration in detail - fetch from global grid 1 into shared memory, compute, and write to global grid 2.)

  27. Async0
      (Diagram: fetch from global grid 1 into shared memory, then compute and write locally, repeating α/2 times - α iterations in total - before writing back to global grid 2; T' outer iterations.)
      - Effective number of iterations = T' * α
      - Greater than T? More iterations -> more FLOPS

  28. How is Async0 different from TunedSync?
      - Reduces the number of global memory accesses
      - Fewer global synchronizations
      - Expect T_eff ≥ T, by a little (but it can't be less!)
      - Uses 2 shared-memory grids
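A rough CUDA sketch of the Async0 idea, with illustrative names (async0_step, ALPHA, TILE) rather than the authors' code: each block stages its tile plus a one-cell halo in shared memory once, performs ALPHA local Jacobi iterations using only block-level syncs, and writes back once. The halo is never refreshed between the local iterations, so tile boundaries use stale values; that staleness is the asynchrony the talk describes. Launch with 16x16 blocks; the sketch assumes N is a multiple of TILE.

    #define TILE  16
    #define ALPHA 4

    __global__ void async0_step(const float* __restrict__ u_in,
                                float* __restrict__ u_out, int N)
    {
        __shared__ float s[2][TILE + 2][TILE + 2];      // two shared-memory grids

        int gi = blockIdx.y * TILE + threadIdx.y + 1;   // global interior index
        int gj = blockIdx.x * TILE + threadIdx.x + 1;
        int li = threadIdx.y + 1, lj = threadIdx.x + 1; // local index inside tile
        int w  = N + 2;

        // Load the tile; edge threads also load the halo row/column they border.
        s[0][li][lj] = u_in[gi * w + gj];
        if (li == 1)    s[0][0][lj]        = u_in[(gi - 1) * w + gj];
        if (li == TILE) s[0][TILE + 1][lj] = u_in[(gi + 1) * w + gj];
        if (lj == 1)    s[0][li][0]        = u_in[gi * w + gj - 1];
        if (lj == TILE) s[0][li][TILE + 1] = u_in[gi * w + gj + 1];
        // The halo is copied into the second buffer too, since it is never recomputed.
        s[1][li][lj] = s[0][li][lj];
        if (li == 1)    s[1][0][lj]        = s[0][0][lj];
        if (li == TILE) s[1][TILE + 1][lj] = s[0][TILE + 1][lj];
        if (lj == 1)    s[1][li][0]        = s[0][li][0];
        if (lj == TILE) s[1][li][TILE + 1] = s[0][li][TILE + 1];
        __syncthreads();

        // ALPHA local iterations, ping-ponging between the two shared grids.
        for (int k = 0; k < ALPHA; ++k) {
            int src = k & 1, dst = 1 - src;
            s[dst][li][lj] = 0.25f * (s[src][li + 1][lj] + s[src][li - 1][lj] +
                                      s[src][li][lj + 1] + s[src][li][lj - 1]);
            __syncthreads();   // block-level sync only; no global sync in between
        }

        u_out[gi * w + gj] = s[ALPHA & 1][li][lj];   // single write back to global memory
    }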

  29. Motivation for Async1
      (Diagram: same as Async0.)
      Do you need so many local syncs?

  30. Async1
      (Diagram: as Async0, but without the local synchronizations questioned on the previous slide.)

  31. Motivation for Async2
      (Diagram: same as Async1.)
      The exchange of ghost cells is not assured anyway; why not get rid of it?

  32. Async2
      (Diagram: as Async1, but without the ghost-cell exchange.)

  33. Motivation for Async3
      (Diagram: same as Async2.)
      Why not have a single shared-memory grid instead of two?

  34. Async3
      (Diagram: fetch from global grid 1 into a single shared-memory grid, compute and write in place - α iterations in total - then write back to global grid 2; T' outer iterations.)
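To make the Async3 variant concrete, a hedged sketch of its inner loop, again with illustrative names: a single shared-memory grid updated in place, with no second buffer, no ghost-cell refresh, and no synchronization between the local iterations. Threads may therefore read a mix of old and new neighbor values, in the spirit of chaotic relaxation [Chazan & Miranker '69]; the race on shared memory is deliberate and is exactly what this scheme tolerates.

    #define TILE  16
    #define ALPHA 4

    __global__ void async3_step(const float* __restrict__ u_in,
                                float* __restrict__ u_out, int N)
    {
        __shared__ float s[TILE + 2][TILE + 2];      // single shared-memory grid

        int gi = blockIdx.y * TILE + threadIdx.y + 1;
        int gj = blockIdx.x * TILE + threadIdx.x + 1;
        int li = threadIdx.y + 1, lj = threadIdx.x + 1;
        int w  = N + 2;

        s[li][lj] = u_in[gi * w + gj];               // stage the tile
        if (li == 1)    s[0][lj]        = u_in[(gi - 1) * w + gj];
        if (li == TILE) s[TILE + 1][lj] = u_in[(gi + 1) * w + gj];
        if (lj == 1)    s[li][0]        = u_in[gi * w + gj - 1];
        if (lj == TILE) s[li][TILE + 1] = u_in[gi * w + gj + 1];
        __syncthreads();                             // one sync after the load

        for (int k = 0; k < ALPHA; ++k)              // in-place, unsynchronized updates
            s[li][lj] = 0.25f * (s[li + 1][lj] + s[li - 1][lj] +
                                 s[li][lj + 1] + s[li][lj - 1]);

        u_out[gi * w + gj] = s[li][lj];              // single write back per point
    }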

  35.

  36.

  37. Conclusion
      - Need extensive (automatable) tuning to achieve near-peak performance, even for a simple kernel
      - Simple performance models can guide CPU-GPU and multi-GPU designs
      - "Fast and loose" asynchronous algorithms yield non-trivial speedups on the GPU

  38. Future work
      - Extend the chaotic relaxation technique to other domains
      - Extend the multi-GPU study to GPU clusters
      - Systems that decide "on-the-go" whether CPU-GPU or multi-GPU execution will pay off, based on performance models
      - Automatic "asynchronous" code generator for "arbitrary" iterative methods

  39. Thank you!
