Two Roads to Parallelism: Compilers and Libraries
Lawrence Rauchwerger
Parallel Computing
- It’s back (again) and ubiquitous
- We have the hardware (multicore … petascale)
- Parallel software + productivity: not yet…
- And now ML needs it…
Our Road towards a productive parallel software development environment
For Existing Serial Programs
Previous Approaches
- Use Instruction Level Parallelism (ILP): HW + SW
  - compiler (automatic) BUT not scalable
- Thread (Loop) Level (Data) Parallelism: HW+SW
  - compiler (automatic) BUT insufficient coverage
  - manual annotations: more scalable but labor intensive
Our Approach
- Hybrid Analysis: a seamless bridge of static and dynamic program analysis for loop-level parallelization
  - USR: a powerful IR for irregular applications
  - Speculation as needed for dynamic analysis
For New Programs
Previous Approaches
- Write parallel programs from scratch
- Use a parallel language, library, or annotations
- Hard work!
Our Approach
- STAPL: Parallel Programming Environment
  - Library of parallel algorithms, distributed containers, patterns, and a run-time system
  - Used in PDT, an important app for DOE & nuclear engineers; influenced Intel's TBB
  - …and perhaps similar to TensorFlow
Parallelizing Compilers
Auto-Parallelization of Sequential Programs
- Around for 30+ years: UIUC, Rice, Stanford, KAI, etc.
- Requires complex static analysis + other technology
- Not widely adopted
Our Approach
- Initially: speculative parallelization
- Better: Hybrid Analysis is the best of both: static + dynamic
- Aspects of these techniques are used in mainstream compilers and STM-based systems.
- Excellent results – Major Effort – Don’t try at home
Static Data Dependence Analysis: An Essential Tool for Parallelization
Linear Reference Patterns
- Solutions restricted to linear addressing and control (mostly small kernels)
  - Geometric view: polytope model
    - Some convex body contains no integral points
  - Existential solutions: GCD Test, Banerjee Test, etc.
    - Potentially overly conservative
  - General solution: Presburger formula decidability
    - Omega Test: precise, potentially slow
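As a concrete illustration of an existential test (a standard textbook example, not taken from the slides): for a write a(2j) and a read a(2j+1), the GCD test asks whether the gcd of the coefficients divides the constant term:

```latex
% Dependence requires an integer solution of 2 j_w = 2 j_r + 1.
% GCD test: \gcd(2, 2) = 2 does not divide 1, so no integer
% solution exists and the accesses are independent. The test is
% conservative in general: divisibility can hold even when the
% loop bounds rule out any solution.
2 j_w - 2 j_r = 1 \;\Longrightarrow\; \gcd(2,2) = 2 \nmid 1 \;\Longrightarrow\; \text{independent}
```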
DO j = 1, 10
  a(j) = a(j+40)
ENDDO

Dependence system: 1 ≤ j_w ≤ 10, 1 ≤ j_r ≤ 10, j_w ≠ j_r, j_w = j_r + 40
The Question: are there cross-iteration dependences?
- Equivalent to determining whether the system of constraints has an integer solution
- In general undecidable, until symbols become numbers (at run time)
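For the example loop above, the system is infeasible, so no cross-iteration dependence exists; a one-step check:

```latex
\exists\, j_w, j_r \in \mathbb{Z}:\quad
  1 \le j_w \le 10, \qquad 1 \le j_r \le 10, \qquad j_w = j_r + 40
% Infeasible: j_w = j_r + 40 \ge 1 + 40 = 41 > 10,
% hence the loop can be executed in parallel.
```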
Nonlinear Reference Patterns
- Common cases: indirect access, recurrences without closed form
- Approaches: linear approximation, symbolic analysis, interactive

DO j = 1, 10
  IF (x(j) > 0) THEN
    A(f(j)) = …
  ENDIF
ENDDO
Run-time Dependence Analysis: Speculative Parallelization
Main Idea:
- Speculatively execute the loop in parallel and record references in private shadow data structures
- Afterwards, check the shadow data structures for data dependences
  - If no dependences, the loop was parallel
  - Else re-execute safely (the loop was not parallel)
Cost:
- Worst case: proportional to data size
[Flowchart: speculative parallel execution + tracing → analysis of the shadow structures → success? Yes: done; No: restore from checkpoint and re-execute sequentially.]

Problem: the access pattern is unknown at compile time, e.g.

FOR i = …
  A[W[i]] = A[R[i]] + C[i]
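A minimal C++ sketch of the shadow-structure check (illustrative and deliberately conservative: the real LRPD test also handles privatization and reduction patterns, which this omits):

```cpp
#include <cstddef>
#include <vector>

// Record reads/writes of A[W[i]] = A[R[i]] + C[i] in shadow arrays,
// then look for cross-iteration dependences. Conservative sketch:
// it flags any element both read and written, or written twice.
bool loop_was_parallel(const std::vector<std::size_t>& W,
                       const std::vector<std::size_t>& R,
                       std::size_t a_size) {
  std::vector<char> written(a_size, 0), read(a_size, 0);
  for (std::size_t i = 0; i < W.size(); ++i) {
    if (written[W[i]]) return false;   // output dependence
    written[W[i]] = 1;                 // shadow mark: write to A[W[i]]
    read[R[i]] = 1;                    // shadow mark: read of A[R[i]]
  }
  for (std::size_t e = 0; e < a_size; ++e)
    if (written[e] && read[e]) return false;  // flow/anti dependence
  return true;  // no dependences recorded: speculation succeeded
}
```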
Hybrid Analysis
Static (compile-time) analysis: symbolic, full compile-time analysis
- PROs: no run-time overhead
- CONs: conservative with input/computed values, indirection, and control flow; weak symbolic analysis; complex recurrences; impractical (combinatorial explosion)

Dynamic (run-time) analysis: full reference-by-reference analysis at run time
- PROs: always finds answers
- CONs: run-time overhead; ignores compile-time analysis

Hybrid Analysis: symbolic analysis extracts conditions at compile time, which are evaluated at run time
- PROs: always finds answers; minimizes run-time overhead
- CONs: more complex static analysis
Hybrid Analysis: Compile-time Phase

DO j = 1, N
  a(j) = a(j+40)
ENDDO

Under what conditions can the loop be executed in parallel?

1. Collect and classify memory references: WRITE a(j) and READ a(j+40), for j = 1, N.
2. Aggregate them symbolically: WRITE [1:N], READ [41:40+N].
3. Formulate the independence test: [1:N] ∩ [41:40+N] = ∅ ?
4.a) If we can prove 1 ≤ N ≤ 40, declare the loop parallel.
4.b) If N is unknown, extract a run-time test.
Hybrid Analysis: Run-time Phase

Execute the loop in parallel if possible. No run-time tests are performed if not necessary!

4.a) If 1 ≤ N ≤ 40 was proven at compile time, the loop runs parallel outright:

DO PARALLEL j = 1, N
  a(j) = a(j+40)
ENDDO

4.b) If N is unknown, the extracted run-time test selects between the parallel and sequential versions at run time:

IF (N ≤ 40) THEN
  DO PARALLEL j = 1, N
    a(j) = a(j+40)
  ENDDO
ELSE
  DO j = 1, N
    a(j) = a(j+40)
  ENDDO
ENDIF
Hybrid Analysis: a slightly deeper dive

DO j = 1, n
  A(j) = A(j+40)
  IF (x > 0) THEN
    A(j) = A(j) + A(j+20)
  ENDIF
ENDDO
Per-statement READ/WRITE reference sets are aggregated into a program-level representation of references, the USR. For the loop above, independence reduces to the set question

[1:n] ∩ ([41:40+n] ∪ [21:20+n] # (x>0)) = ∅ ?

where # gates a reference set by the predicate under which it occurs.
Set expression to logic expression

1. Distribute the intersection: ([1:n] ∩ [41:40+n]) ∪ ([1:n] ∩ [21:20+n] # (x>0)) = ∅ ?
2. Each term must be empty: [1:n] ∩ [41:40+n] = ∅ holds iff n ≤ 40.
3. [1:n] ∩ [21:20+n] # (x>0) = ∅ holds iff n ≤ 20 or x ≤ 0.
4. Resulting run-time condition: (n ≤ 20 or x ≤ 0) and n ≤ 40.
DO j = 1, n
  A(j) = A(j+40)          ! WRITE A(j), READ A(j+40)
  IF (x > 0) THEN
    A(j) = A(j) + A(j+20) ! READ A(j+20), gated by x > 0
  ENDIF
ENDDO
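At run time the extracted condition is a trivially cheap guard; a minimal C++ sketch (the function name is illustrative; the variable names follow the slide):

```cpp
// Parallel execution is safe exactly when the extracted sufficient
// condition holds: the WRITE range [1:n] is disjoint from the READ
// range [41:40+n], and from [21:20+n] whenever that read executes.
bool independent(int n, int x) {
  return (n <= 20 || x <= 0) && n <= 40;
}
```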
Representation is Key!
Independence conditions are factored into a series of sufficient conditions, tested at run time in order of increasing complexity.
Hybrid Analysis Strategy
[Flowchart: cascade of tests. O(1) scalar operations (the previous example) → on fail, O(n/k) comparisons on aggregated references → on fail, reference-based LRPD. Any pass: execute in parallel (independent); all fail: execute sequentially (dependent).]
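In code, the strategy amounts to a short-circuit cascade; a minimal sketch (the three test parameters stand in for compiler-generated code and are assumptions, not the actual generated tests):

```cpp
#include <functional>

// Cascade of sufficient independence conditions, cheapest first.
// Any passing test proves independence; only if all fail does the
// loop run sequentially (or speculation is rolled back).
bool run_in_parallel(const std::function<bool()>& scalar_test,     // O(1), e.g. n <= 40
                     const std::function<bool()>& aggregate_test,  // O(n/k) interval comparisons
                     const std::function<bool()>& lrpd_test) {     // reference-by-reference
  if (scalar_test())    return true;
  if (aggregate_test()) return true;
  return lrpd_test();
}
```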
Hybrid Analysis Parallelization Coverage
[Bar chart: parallelization coverage (20–100%) for the PERFECT, SPEC2000/06, and earlier SPEC benchmarks (adm, arc2d, bdna, dyfesm, flo52, mdg, ocean, spec77, track, trfd, applu, apsi, mgrid, swim, wupwise, hydro2d, matrix300, mdljdp2, nasa7, ora, swm256, tomcatv), broken down by technique: compile-time, RT: simple checks, RT: aggregated refs, RT: individual refs.]
- Parallelized 380 of the 2,100 loops analyzed: 92% sequential-execution coverage
Speedups: Hybrid Analysis vs. Intel ifort
- Older benchmarks with smaller datasets were run on 4 cores only
- Better performance on 14/18 benchmarks on 4 cores
- Better performance on 10/11 benchmarks on 8 cores
So…
- What did we accomplish?
- Full parallelization of C-tran codes (28 benchmarks at >90% coverage)
- An IR (the USR) & a technique (Hybrid Analysis)
- We cannot declare victory because:
- Required Heroic Efforts
- Commercial compilers adopt slowly
- Compilers cannot create parallelism; only programmers can!
How else?
First
- Think Parallel!
Then
- Develop parallel algorithms
- Raise the level of abstraction
- Use algorithm-level (not only loop-level) abstraction
- Expressivity + Productivity
- Optimization can be compiler generated
STAPL: Standard Template Adaptive Parallel Library
- STL
  - Iterators provide abstract access to data stored in Containers.
  - Algorithms are sequences of instructions that transform the data.
- STAPL
  - Views provide abstracted access to distributed data stored in Distributed Containers.
  - Parallel Algorithms are specified by Skeletons.
  - The run-time representation is a Task Graph.
A library of parallel components that adopts the generic programming philosophy of the C++ Standard Template Library (STL).
[Diagram: STL couples Algorithms and Containers through Iterators; STAPL couples Algorithms and Distributed Containers through Views and executes them as Task Graphs.]
STAPL Components
[Layer diagram: User Application Code → Algorithms, Views, Containers (with an Adaptive Framework) → Skeleton Framework and Task Graph → Run-time System (ARMI Communication Library, Scheduler, Performance Monitor) → MPI, OpenMP, Pthreads.]
High level of abstraction, similar to C++ STL. Task & data parallelism: asynchronous.
- Parallelism (SPMD) is implicit; serialization is explicit
- Imperative + functional: data flow + containers

SPMD programs are defined by:
- Data dependence patterns → Skeletons
- Composition: parallel, serial, nested, …
- Tasks: work function & data
  - Fine-grain tasks (coarsened)
  - Data in distributed containers

Execution is defined by:
- Data flow graphs (task graphs)
- Execution policies: scheduling, asynchrony, …
- Distributed memory model (PGAS)
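A usage sketch in the style of STAPL's published examples; the header paths and signatures are taken from papers and tutorials and may differ across releases:

```cpp
#include <cstdio>
#include <cstdlib>
#include <stapl/containers/array/array.hpp>
#include <stapl/views/array_view.hpp>
#include <stapl/algorithms/algorithm.hpp>
#include <stapl/algorithms/numeric.hpp>

// STAPL programs begin at stapl_main and run SPMD on every location.
stapl::exit_code stapl_main(int argc, char* argv[])
{
  stapl::array<int> a(1000);                     // distributed container
  stapl::array_view<stapl::array<int>> view(a);  // view: abstract access to it

  stapl::iota(view, 0);                          // parallel algorithm over the view
  int sum = stapl::accumulate(view, 0);          // parallel reduction

  stapl::do_once([sum] { std::printf("sum = %d\n", sum); });
  return EXIT_SUCCESS;
}
```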
The STAPL Graph Library (SGL)
- Many problems are modeled using graphs:
- Web search, data mining (Google, YouTube)
- Social networks (Facebook, Google+, Twitter)
- Geospatial graphs (Maps)
- Scientific applications
- Many important graph algorithms:
- Breadth-first search, single-source shortest path, strongly connected components, k-core decomposition, centralities
SGL Programming Model
[Layer diagram: user code supplies a Vertex Operator and a Neighbor Operator; SGL library code provides the graph runtime with KLA, hierarchical, and out-of-core support; below sit the STAPL Runtime System and OpenMP / MPI / C++11 threads.]
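The model is vertex-centric: a vertex operator runs on active vertices, and a neighbor operator is applied (possibly asynchronously) at their neighbors. A hedged sketch of BFS in this style; the operator shapes follow the SGL/KLA papers, but the member and visitor names here are assumptions, not SGL's published API:

```cpp
// Neighbor operator: applied at the target of each visited edge;
// returns true iff it changed the target's property.
struct bfs_neighbor_op {
  int new_level;
  template <typename Vertex>
  bool operator()(Vertex&& u) const {
    if (new_level < u.property().level()) {
      u.property().level(new_level);  // relax the BFS level
      u.property().set_active(true);  // wake the vertex for the next step
      return true;
    }
    return false;
  }
};

// Vertex operator: runs on each (active) vertex and pushes work
// to its neighbors through the framework-provided visitor.
struct bfs_vertex_op {
  template <typename Vertex, typename Visitor>
  bool operator()(Vertex&& v, Visitor&& visit) const {
    if (!v.property().is_active())
      return false;                   // nothing to do this step
    v.property().set_active(false);
    visit.visit_all_edges(v, bfs_neighbor_op{v.property().level() + 1});
    return true;                      // did work: traversal continues
  }
};
```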
Parallel Graph Algorithms May Use
- Asynchronous Model
- Asynchronous task execution
- Point-to-point synchronizations; possible redundant work
- Level-Synchronous Model
- BSP-style iterative computation
- Global synchronization after each level; no redundant work

[Diagram: level-synchronous execution alternates local computation, communication, and barrier synchronization across processors; asynchronous execution interleaves communication and computation tasks.]
Having Your Cake and Eating it Too
k-Level Asynchronous Model
- k defines depth of superstep (KLA-SS)
- Unifies existing models
- k=1: Level-synchronous
- k=d: Asynchronous
[Plot: cost vs. asynchrony (1, 4, 16, 64, 256). Synchronization cost falls with asynchrony while redundant-work cost rises; total cost is minimized at an optimal asynchrony between the level-sync and async extremes.]
Asynchrony = levels / supersteps
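Since a KLA superstep processes up to k levels, a traversal of depth d needs ceil(d/k) supersteps; a one-line model, consistent with the BFS example on the next slide:

```cpp
// Number of KLA supersteps for a traversal of depth d with
// asynchrony parameter k: k = 1 is level-synchronous (d supersteps),
// k = d is fully asynchronous (1 superstep).
int kla_supersteps(int d, int k) { return (d + k - 1) / k; }
// kla_supersteps(3218, 9) == 358, matching the BFS example below.
```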
k-Level Asynchronous (KLA) BFS
- Other strategies stop scaling after 32,768 cores
- The KLA strategy is faster and scales better
- Adaptively change the asynchrony to balance global-synchronization costs against the asynchronous penalty
- Example: diameter = 3218, k = 9, KLA-SS = 358
PDT: Application Development with STAPL
- Computes the flow of subatomic particles across a spatial domain
- The discretized spatial domain is represented using a pGraph
- An iterative algorithm (e.g., GMRES) runs until the particle flow in space, direction, and energy level stabilizes
- Matrix-vector multiplication is 90% of execution time and is implemented as a sweep of the spatial domain in all directions
- Each sweep is a task graph
[Figure: one sweep vs. eight simultaneous sweeps of the spatial domain.]
Particle Transport in STAPL
The experiment keeps the number of unknowns per processor constant. PARAGRAPH (STAPL's task graph) size and communication increase with processor count. The model assumes immediate processing of messages.
Conclusions: What did we accomplish? What did we learn?
- Auto-parallelization: Major Effort
- 28 benchmarks parallelized with good coverage
- Possible but very hard
- Autopar: Extracts but does not create parallelism
- Technology can be (re)used in other areas (TF compilation)
- STAPL for new parallel programs (e.g., TF)
- New(ish) asynchronous algorithms (Data Flow, ..)
- Distributed environment (containers, Data Flow Graph)
- Adaptive environment & polymorphism
The two avenues are complementary
- Legacy Code: Parallelization may be a good idea
- Always: Think Parallel & Write Clean Code
STAPL on https://gitlab.com/parasol-lab/stapl and several National Labs repos
Why is this relevant?
- From obsolescence to point technology – just wait 10 years
- Static & dynamic array-reference analysis: the basis for ML-optimizing transformations (Tensors ~ n-dimensional arrays)
- STAPL design facilitates: Compose and Conquer
- Programs = Skeleton Composition
- Global properties = Component Property Composition
  - Correctness, performance models, approximation, fault tolerance, energy
- Compile the composition (fuse TF components)
Questions?