
A Productive Framework for Generating High Performance, Portable, Scalable Applications for Heterogeneous Computing
Wen-mei W. Hwu
with Tom Jablin, Chris Rodrigues, Liwen Chang, Steven ShengZhou Wu, Abdul Dakkak
CCoE, University of Illinois at Urbana-Champaign

4,224 Kepler GPUs in Blue Waters
• NAMD
  – 100 million atom benchmark with Langevin dynamics and PME once every 4 steps, from launch to finish, all I/O included
  – 768 nodes: Kepler+Interlagos is 3.9X faster than Interlagos-only
  – 768 nodes: XK7 is 1.8X XE6
• Chroma
  – Lattice QCD parameters: grid size of 48^3 x 512, running at the physical values of the quark masses
  – 768 nodes: Kepler+Interlagos is 4.9X faster than Interlagos-only
  – 768 nodes: XK7 is 2.4X XE6
• QMCPACK
  – Full run Graphite 4x4x1 (256 electrons), QMC followed by VMC
  – 700 nodes: Kepler+Interlagos is 4.9X faster than Interlagos-only
  – 700 nodes: XK7 is 2.7X XE6
IWCSE 2013

Two Current Challenges
• At-scale use of GPUs
  – Communication costs dominate beyond 2048 nodes
  – E.g., NAMD limited by PME
  – Insufficient computation work
• Programming efforts
  – This talk
[Figure: NAMD strong scaling on Blue Waters XK7 nodes, 100M atoms; CPU vs. CPU+GPU at 512, 1024, 2048, and 4096 nodes]
SC13

Writing efficient parallel code is complicated. Tools can provide focused help or broad help.
Planning how to execute an algorithm
• Choose data structures
• Map work/data into tasks
• Schedule tasks to threads
Implementing the plan
• Memory allocation
• Data movement
• Pointer operations
• Index arithmetic
• Kernel dimensions
• Thread ID arithmetic
• Synchronization
• Temporary data structures
Tools named on the slide: GMAC; DL; Triolet, X10, Chapel, Nesl, DeLite, Par4All; OpenACC/C++AMP/Thrust; Tangram

Levels of GPU Programming Languages
• Prototype & in development: X10, Chapel, Nesl, Delite, Par4all, Triolet...
  – Implementation manages GPU threading and synchronization invisibly to user
• Next generation: OpenACC, C++AMP, Thrust, Bolt
  – Simplifies data movement, kernel details and kernel launch
  – Same GPU execution model (but less boilerplate)
• Current generation: CUDA, OpenCL, DirectCompute

Where should the smarts be for Parallelization and Optimization?
• General-purpose language + parallelizing compiler
  – Requires a very intelligent compiler
  – Limited success outside of regular, static array algorithms
• Domain-specific language + domain-specific compiler
  – Simplify compiler's job with language restrictions and extensions
  – Requires customizing a compiler for each domain
• Parallel meta-library + general-purpose compiler
  – Library embodies parallelization decisions
  – Uses a general-purpose compiler infrastructure
  – Extensible: just add library functions
  – Historically, libraries have been the area with the most success in parallel computing

Triolet – Composable Library-Driven Parallelization
• EDSL-style library: build, then interpret program packages
• Allows library to collect multiple parallel operations and create an optimized arrangement
  – Lazy evaluation and aggressive inlining
  – Loop fusion to reduce communication and memory traffic
  – Array partitioning to reduce communication overhead
  – Library source-guided parallelism optimization of sequential, shared-memory, and/or distributed algorithms
• Loop-building decisions use information that is often known at compile time
  – By adding typing to Python
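The loop fusion bullet above can be illustrated in plain Python (not Triolet). This is a minimal sketch with hypothetical function names: the unfused version writes a temporary list between two loops, while the fused version streams each element through both stages in a single pass, which is the kind of memory traffic that fusion is meant to avoid.

```python
def unfused_sum_of_squares(xs):
    # Two separate loops with a temporary list between them.
    squares = [x * x for x in xs]   # loop 1 writes a temporary
    return sum(squares)             # loop 2 reads it back


def fused_sum_of_squares(xs):
    # One pass: the generator hands each square straight to sum(),
    # so no intermediate list is materialized.
    return sum(x * x for x in xs)
```

Both functions return the same value; only the amount of intermediate storage differs.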

Example: Correlation Code

    def correlation(xs, ys):
        # Compute f(x,y) for every x in xs and for every y in ys (doubly nested loop)
        scores = (f(x,y) for x in xs for y in ys)
        # par: compute it in parallel
        # histogram: put scores into a 100-element histogram
        return histogram(100, par(scores))
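To make the correlation example runnable outside Triolet, here is a plain-Python sketch. Everything except the body of `correlation` is an assumption made for illustration: `f` is a made-up score function, `histogram` is a simple fixed-range binning over [0, 1), and `par` is a sequential stand-in for Triolet's parallelism hint.

```python
def f(x, y):
    # Hypothetical score function (not from the slides).
    return abs(x - y) % 1.0


def histogram(nbins, values, lo=0.0, hi=1.0):
    # Assumed semantics: count values into nbins equal-width bins over [lo, hi),
    # clamping out-of-range values into the first or last bin.
    counts = [0] * nbins
    for v in values:
        b = int((v - lo) * nbins / (hi - lo))
        counts[min(max(b, 0), nbins - 1)] += 1
    return counts


def par(iterable):
    # Stand-in for Triolet's par hint: here it changes nothing and
    # the loop simply runs sequentially.
    return iterable


def correlation(xs, ys):
    scores = (f(x, y) for x in xs for y in ys)
    return histogram(100, par(scores))
```

Calling `correlation([0.0, 0.25], [0.0, 0.5])` evaluates four scores and returns a 100-element list of bin counts.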

Triolet Compiler Intermediate Representation
• List comprehension and par build a package containing
  1. Desired parallelism
  2. Input data structures
  3. Loop body for each loop level
• Loop structure and parallelism annotations are statically known

    correlation xs ys =
      let i = IdxNest HintPar (arraySlice xs)            -- outer loop
                (λx. IdxFlat HintSeq (arraySlice ys)     -- inner loop
                       (λy. f x y))                      -- body
      in histogram 100 i

Triolet Meta-Library
• Compiler inlines histogram
• histogram has code paths for handling different loop structures
• Loop structure is known, so compiler can remove unused code paths

    correlation xs ys =
      case IdxNest HintPar (arraySlice xs)
             (λx. IdxFlat HintSeq (arraySlice ys)
                    (λy. f x y)) of
        IdxNest parhint input body.
          case parhint of
            HintSeq. code for sequential nested histogram
            HintPar. parReduce input (λchunk. seqHistogram 100 body chunk)
        IdxFlat parhint input body. code for flat histogram
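The dispatch that `histogram` performs on the loop package can be mimicked at run time in Python. This is only a sketch under stated assumptions: the class and field names mirror the slide's `IdxNest`/`IdxFlat` constructors, but the Python layout, `bin_of`, and the two-way split standing in for the parallel path are all hypothetical. Triolet resolves this case analysis at compile time; the sketch does it dynamically.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

SEQ, PAR = "HintSeq", "HintPar"


@dataclass
class IdxFlat:
    hint: str
    data: Sequence[Any]
    body: Callable[[Any], float]      # element -> score


@dataclass
class IdxNest:
    hint: str
    data: Sequence[Any]
    body: Callable[[Any], IdxFlat]    # element -> inner loop package


def bin_of(v, nbins):
    # Assumed binning: scores in [0, 1), clamped to the valid range.
    return min(max(int(v * nbins), 0), nbins - 1)


def seq_histogram(nbins, loop):
    counts = [0] * nbins
    if isinstance(loop, IdxNest):
        for x in loop.data:
            inner = loop.body(x)
            for y in inner.data:
                counts[bin_of(inner.body(y), nbins)] += 1
    else:
        for y in loop.data:
            counts[bin_of(loop.body(y), nbins)] += 1
    return counts


def histogram(nbins, loop):
    # One code path per loop structure, selected by inspecting the package.
    if isinstance(loop, IdxNest) and loop.hint == PAR:
        # Placeholder for the parallel path: split the outer loop in two,
        # build one histogram per half, then add them element-wise.
        mid = len(loop.data) // 2
        halves = (loop.data[:mid], loop.data[mid:])
        partials = [seq_histogram(nbins, IdxNest(SEQ, h, loop.body)) for h in halves]
        return [a + b for a, b in zip(*partials)]
    return seq_histogram(nbins, loop)


def correlation(xs, ys, f):
    loop = IdxNest(PAR, xs, lambda x: IdxFlat(SEQ, ys, lambda y: f(x, y)))
    return histogram(100, loop)
```

Because the package built in `correlation` is always an `IdxNest` with `HintPar`, only one branch of `histogram` is ever taken for this program, which is the property the compiler exploits to delete the other branches.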

Example: Correlation Code
• Result is an outer loop specialized for this application
• Process continues for inner loop

    correlation xs ys =
      parReduce (arraySlice xs)                        -- parallel reduction; each task processes a chunk of xs
        (λchunk. seqHistogram 100                      -- task computes a sequential histogram
           (λx. IdxFlat HintSeq (arraySlice ys)        -- inner loop
                  (λy. f x y))                         -- body
           chunk)
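The specialized outer loop can be sketched in Python as a chunked reduction. The names `par_reduce` and `seq_histogram` echo the slide, but their signatures here, the `identity` and `nchunks` parameters, and the binning over [0, 1) are assumptions. A thread pool is used only to show the task structure; in CPython, threads do not speed up CPU-bound work like this.

```python
from concurrent.futures import ThreadPoolExecutor


def seq_histogram(nbins, body, chunk):
    # Sequential histogram over one chunk of the outer loop.
    # body(x) yields the scores of the inner loop for element x.
    counts = [0] * nbins
    for x in chunk:
        for score in body(x):
            counts[min(max(int(score * nbins), 0), nbins - 1)] += 1
    return counts


def par_reduce(data, task, combine, identity, nchunks=4):
    # Split data into at most nchunks contiguous chunks, run task on each
    # chunk concurrently, then fold the partial results with combine.
    size = max(1, -(-len(data) // nchunks))   # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(task, chunks))
    result = identity
    for partial in partials:
        result = combine(result, partial)
    return result


def correlation(xs, ys, f, nbins=100):
    return par_reduce(
        xs,
        task=lambda chunk: seq_histogram(nbins, lambda x: (f(x, y) for y in ys), chunk),
        combine=lambda a, b: [p + q for p, q in zip(a, b)],
        identity=[0] * nbins,
    )
```

Each task builds a private histogram, so no synchronization is needed until the partial histograms are added together at the end.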

Cluster-Parallel Performance and Scalability
• Triolet delivers large speedup over sequential C
• On par with manually parallelized C for computation-bound code (left)
• Beats similar high-level interfaces on communication-intensive code (right)
Chris Rodrigues, et al., PPoPP 2014

Tangram
• A parallel algorithm framework for solving linear recurrence problems
  – Scan, tridiagonal matrix solvers, bidiagonal matrix solvers, recursive filters, …
  – Many specialized algorithms in literature
• Linear recurrence: very important for converting sequential algorithms into parallel algorithms
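Why a linear recurrence can be parallelized at all is easiest to see on the first-order case x_i = a_i * x_{i-1} + b_i. Each step is an affine map, and composing affine maps is associative, so the whole recurrence becomes a prefix scan over (a, b) pairs. The sketch below is a standard textbook formulation, not Tangram's implementation; the function names are hypothetical, and `itertools.accumulate` is a sequential scan standing in for a parallel one.

```python
from itertools import accumulate


def combine(first, second):
    # Compose two affine maps x -> a*x + b, applying `first` and then `second`:
    # a2*(a1*x + b1) + b2 = (a2*a1)*x + (a2*b1 + b2)
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def recurrence_sequential(a, b, x0):
    # Direct evaluation: each step depends on the previous one.
    xs, x = [], x0
    for ai, bi in zip(a, b):
        x = ai * x + bi
        xs.append(x)
    return xs


def recurrence_via_scan(a, b, x0):
    # Inclusive scan of the (a_i, b_i) pairs under `combine`. Because
    # `combine` is associative, any parallel scan could replace accumulate.
    prefixes = accumulate(zip(a, b), combine)
    return [A * x0 + B for A, B in prefixes]
```

With every a_i equal to 1, this reduces to an ordinary prefix sum, which is the scan case listed above.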

Tangram's Linear Optimizations
• Library operations to simplify application tiling and communication
  – Auto-tuning for each target architecture
• Unified Tiling Space
  – Simple interface for register tiling, scratchpad tiling, and cache tiling
  – Automatic thread fusion as enabler
• Communication Optimization
  – Choice/hybrid of three major types of algorithms
  – Computation vs. communication tradeoff

Linear Recurrence Algorithms and Communication
[Figure: three communication structures: Brent-Kung circuit, Kogge-Stone circuit, group structured]
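For readers unfamiliar with the two named circuits, the sketch below gives textbook inclusive-scan versions of each, simulated sequentially in Python. These are not Tangram's implementations, and the function names are hypothetical. Kogge-Stone finishes in about log2(n) steps but applies the operator O(n log n) times; Brent-Kung takes roughly twice as many steps but applies it only O(n) times, which is one instance of the computation vs. communication tradeoff.

```python
def kogge_stone_scan(values, op):
    # At step d, every element i >= d combines with the element d positions
    # to its left. Reads come from a snapshot to model lock-step threads.
    x = list(values)
    n = len(x)
    d = 1
    while d < n:
        prev = x[:]
        for i in range(d, n):
            x[i] = op(prev[i - d], prev[i])
        d *= 2
    return x


def brent_kung_scan(values, op):
    x = list(values)
    n = len(x)
    # Up-sweep: build a reduction tree over strides 1, 2, 4, ...
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            x[i] = op(x[i - d], x[i])
        d *= 2
    # Down-sweep: push completed prefixes to the positions the tree skipped.
    d //= 2
    while d >= 1:
        for i in range(3 * d - 1, n, 2 * d):
            x[i] = op(x[i - d], x[i])
        d //= 2
    return x
```

Both functions accept any associative `op`, including non-commutative ones such as the affine-map composition used for first-order recurrences, because the left operand always covers the earlier elements.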

Tangram Initial Results
[Figure, four panels]
• Prefix scan on Fermi (C2050): throughput (billions of samples per second) vs. problem size (millions of samples, data type): 1, 8, and 64 million samples at 32-bit and 64-bit
  – Series: StreamScan-Reported; Proposed-Tuned; StreamScan-Tuned; SDK-5.0; Thrust-1.5
• Prefix scan on Kepler (Titan): throughput (billions of samples per second) vs. problem size (millions of samples): 1, 8, 64
  – Series: Tuned-for-Kepler, Run-on-Kepler; Tuned-for-Kepler-no-Shuffle, Run-on-Kepler; StreamScan, Run-on-Kepler; Tuned-for-Fermi, Run-on-Kepler; SDK-5.0, Run-on-Kepler
• IIR filter on both GPUs: throughput (billions of samples per second) vs. order of IIR filter: 1, 2, 4
  – Series: Proposed, Tuned-for-Fermi, Run-on-Fermi; Proposed, Tuned-for-Fermi, Run-on-Kepler; Proposed, Tuned-for-Kepler, Run-on-Kepler
• Tridiagonal solver on both GPUs: throughput (millions of equations per second) vs. problem size (millions of equations): 1, 16
  – Series: Ours, Kepler-Kepler; Ours, Fermi-Fermi; Ours, Fermi-Kepler; SC12, Kepler-Kepler; SC12, Fermi-Fermi; NVIDIA, Kepler-Kepler; NVIDIA, Fermi-Fermi

Next Steps
• Triolet released as an open source project
  – Develop additional Triolet library functions and their implementations for important application domains
  – Develop Triolet library functions for GPU clusters
• Publish and release Tangram
  – Current tridiagonal solver in CUSPARSE is from UIUC based on the Tangram work
  – Integration with Triolet

THANK YOU!
