

1. Two Roads to Parallelism: Compilers and Libraries
Lawrence Rauchwerger

2. Parallel Computing
• It's back (again) and ubiquitous
• We have the hardware (multicore ... petascale)
• Parallel software + productivity: not yet...
• And now ML needs it...
Our road towards a productive parallel software development environment

3. For Existing Serial Programs
Previous Approaches
- Use Instruction Level Parallelism (ILP): HW + SW compiler (automatic), BUT not scalable
- Thread (Loop) Level (Data) Parallelism: HW + SW compiler (automatic), BUT insufficient coverage; manual annotations are more scalable but labor intensive
Our Approach
- Hybrid Analysis: a seamless bridge of static and dynamic program analysis for loop-level parallelization
  - USR: a powerful IR for irregular applications
  - Speculation as needed for dynamic analysis

4. For New Programs
Previous Approaches
- Write parallel programs from scratch
- Use a parallel language, library, or annotations
- Hard work!
Our Approach
- STAPL: a parallel programming environment
  - Library of parallel algorithms, distributed containers, patterns, and a run-time system
  - Used in PDT, an important application for DOE & nuclear engineers; influenced Intel's TBB
  - ...and perhaps similar to TensorFlow

5. Parallelizing Compilers
Auto-Parallelization of Sequential Programs
- Around for 30+ years: UIUC, Rice, Stanford, KAI, etc.
- Requires complex static analysis + other technology
- Not widely adopted
Our Approach
- Initially: speculative parallelization
- Better: Hybrid Analysis, the best of both: static + dynamic
- Aspects of these techniques are used in mainstream compilers and STM-based systems
- Excellent results; a major effort; don't try this at home

6. Static Data Dependence Analysis: An Essential Tool for Parallelization
The Question: Are there cross-iteration dependences?
• Equivalent to determining whether a system of equations has integer solutions
• In general, undecidable, until symbols become numbers (at runtime)

Example:
  DO j = 1, 10
    a(j) = a(j+40)
  ENDDO
yields the system: 1 ≤ j_w ≤ 10, 1 ≤ j_r ≤ 10, j_w ≠ j_r, j_w = j_r + 40

Linear Reference Patterns
- Solutions restricted to linear addressing and control (mostly small kernels)
- Geometric view: polytope model
  • Show that some convex body contains no integral points
- Existential solutions: GCD Test, Banerjee Test, etc.
  • Potentially overly conservative
- General solution: Presburger formula decidability
  • Omega Test: precise, potentially slow

Nonlinear Reference Patterns
- Common cases: indirect access, recurrences without closed form
- Approaches: linear approximation, symbolic analysis, interactive

Example:
  DO j = 1, 10
    IF (x(j)>0) THEN
      A(f(j)) = ...
    ENDIF
  ENDDO
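To make the linear case concrete, here is a minimal C++ sketch of the classic GCD test for a pair of affine accesses in one loop; the function name and interface are illustrative, not from the talk.

```cpp
#include <numeric>   // std::gcd
#include <cstdio>

// GCD dependence test for two linear accesses in one loop:
//   write: a(p*jw + c1),  read: a(q*jr + c2).
// A cross-iteration dependence requires an integer solution of
//   p*jw - q*jr = c2 - c1,
// which exists only if gcd(p, q) divides (c2 - c1).  The test ignores
// the loop bounds, so a "maybe" here can still be refuted by a range
// (Banerjee-style) test.
bool gcd_test_may_depend(long p, long q, long c1, long c2) {
    long g = std::gcd(p, q);
    if (g == 0) return c1 == c2;    // both accesses are loop-invariant
    return (c2 - c1) % g == 0;      // divisibility: a solution may exist
}

int main() {
    // The slide's loop: DO j=1,10  a(j)=a(j+40)  => p=q=1, c1=0, c2=40.
    // gcd(1,1)=1 divides 40, so the GCD test conservatively reports a
    // possible dependence; only the bounds (j <= 10) prove independence.
    std::printf("may depend: %d\n", gcd_test_may_depend(1, 1, 0, 40));
}
```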

7. Run-time Dependence Analysis: Speculative Parallelization
Problem:
  FOR i = ...
    A[W[i]] = A[R[i]] + C[i]

Main Idea:
• Checkpoint, then speculatively execute the loop in parallel while recording each reference in private shadow data structures (speculative parallel execution + tracing)
• Afterwards, check the shadow data structures for data dependences (analysis)
• Success: if there are no dependences, the loop was parallel
• Failure: restore the checkpoint and re-execute safely, i.e. sequentially (the loop was not parallel)

Cost:
• Worst case: proportional to data size
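A minimal sketch of the shadow-structure check described above, with the marking phase sequentialized for clarity. All identifiers are illustrative, and the full LRPD test handles more patterns (privatization, reductions) than this simplification.

```cpp
#include <vector>
#include <cstdio>

// LRPD-style run-time check: during speculative execution each iteration
// marks the elements it reads/writes in shadow arrays; afterwards the
// shadows are inspected for cross-iteration dependences.
struct Shadow { bool written = false, read = false, multi_write = false; };

// Conservative check for the slide's loop "A[W[i]] = A[R[i]] + C[i]":
// fail if any element is written twice or is both written and read.
// (The full LRPD test also recognizes write-first/privatizable patterns.)
bool lrpd_was_parallel(const std::vector<int>& W, const std::vector<int>& R,
                       std::size_t a_size) {
    std::vector<Shadow> shadow(a_size);
    for (std::size_t i = 0; i < W.size(); ++i) {   // marking phase
        if (shadow[W[i]].written) shadow[W[i]].multi_write = true;
        shadow[W[i]].written = true;
        shadow[R[i]].read = true;
    }
    for (const Shadow& s : shadow)                 // analysis phase
        if (s.multi_write || (s.written && s.read)) return false;
    return true;                                   // speculation succeeded
}

int main() {
    std::printf("%d\n", lrpd_was_parallel({0, 1, 2}, {3, 4, 5}, 6)); // 1: parallel
    std::printf("%d\n", lrpd_was_parallel({0, 1, 2}, {1, 4, 5}, 6)); // 0: dependence
}
```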

8. Hybrid Analysis

Compile-time Analysis (compiler)
- Symbolic analysis
- PROs: no run-time overhead
- CONs: conservative when facing input/computed values, indirection, and control flow; weak symbolic analysis; complex recurrences

Hybrid Analysis
- STATIC: symbolic analysis, extract conditions
- DYNAMIC (run-time): evaluate conditions
- PROs: always finds answers; minimizes run-time overhead
- CONs: more complex static analysis

Run-time Analysis
- Full reference-by-reference analysis
- PROs: always finds answers
- CONs: run-time overhead; ignores compile-time analysis; combinatorial explosion makes it impractical

9. Hybrid Analysis: Compile-time Phase
  DO j=1,N
    a(j)=a(j+40)
  ENDDO

Under what conditions can the loop be executed in parallel?
1. Collect and classify memory references: WRITE a(j) for j=1,N; READ a(j+40) for j=1,N.
2. Aggregate them symbolically: WRITE = [1:N], READ = [41:40+N].
3. Formulate the independence test: is [1:N] ∩ [41:40+N] empty?
4.a) If we can prove N ≤ 40, declare the loop parallel.
4.b) If N is unknown, extract the run-time test N ≤ 40.
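Steps 2 and 3 amount to an interval-overlap check; the sketch below specializes it to this loop. The interval type and function names are illustrative.

```cpp
#include <algorithm>
#include <cstdio>

// Symbolic aggregation from the slide: for "DO j=1,N a(j)=a(j+40)" the
// writes cover [1, N] and the reads cover [41, 40+N].  The loop is
// independent iff the two intervals do not intersect.
struct Interval { long lo, hi; };                  // closed interval [lo, hi]

bool intersect_empty(Interval a, Interval b) {
    return std::max(a.lo, b.lo) > std::min(a.hi, b.hi);
}

// The independence condition, specialized to this loop.  Simplifying
// max(1, 41) > min(N, 40+N) gives exactly the slide's test: N <= 40.
bool loop_is_parallel(long N) {
    Interval write{1, N}, read{41, 40 + N};
    return intersect_empty(write, read);
}

int main() {
    std::printf("N=30:  parallel=%d\n", loop_is_parallel(30));   // 1: N <= 40
    std::printf("N=100: parallel=%d\n", loop_is_parallel(100));  // 0: overlap
}
```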

10. Hybrid Analysis: Run-time Phase
  DO j=1,N
    a(j)=a(j+40)
  ENDDO
Execute the loop in parallel if possible.

4.a) If we could prove N ≤ 40 at compile time, the loop is emitted as parallel, and no run-time tests are performed if not necessary!
  DO PARALLEL j=1,N     ! parallel loop
    a(j)=a(j+40)
  ENDDO

4.b) If N was unknown, the extracted run-time test guards two versions:
  IF (N ≤ 40) THEN
    DO PARALLEL j=1,N   ! parallel loop
      a(j)=a(j+40)
    ENDDO
  ELSE
    DO j=1,N            ! sequential loop
      a(j)=a(j+40)
    ENDDO
  ENDIF
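For concreteness, here is one possible C++/OpenMP rendition of the two-version code a compiler could emit; it is a sketch, assuming 1-based indexing as in the Fortran original and an array sized so that a[j+40] is always valid.

```cpp
#include <vector>

// Two-version loop: the O(1) run-time test selects the parallel version
// when it is safe (N <= 40 means the reads a[j+40] never touch the
// written elements a[1..N]).  Assumes a.size() > N + 40.
void two_version_loop(std::vector<double>& a, long N) {
    if (N <= 40) {
        #pragma omp parallel for       // proven independent at run time
        for (long j = 1; j <= N; ++j)
            a[j] = a[j + 40];
    } else {
        for (long j = 1; j <= N; ++j)  // possible overlap: run sequentially
            a[j] = a[j + 40];
    }
}
```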

11. Hybrid Analysis: A Slightly Deeper Dive
  DO j = 1, n
    A(j) = A(j+40)
    IF (x>0) THEN
      A(j) = A(j) + A(j+20)
    ENDIF
  ENDDO

Independence test: is READ ∩ WRITE empty?
Program-level representation of references (USR):
- WRITE = [1:n]
- READ = [41:40+n] ∪ ([21:20+n] # x>0), where "#" gates a reference set with the predicate under which it occurs
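The following toy rendition shows the USR idea as a symbolic set-expression tree: reference sets stay as intervals, set operations, and predicate gates rather than being forced into closed form. The node names and helpers are invented for illustration; the real USR is considerably richer.

```cpp
#include <memory>
#include <string>
#include <variant>

// A toy USR: a tree over symbolic reference sets.
struct USR;
using USRPtr = std::shared_ptr<USR>;

struct Range { std::string lo, hi; };              // symbolic bounds, e.g. "41", "40+n"
struct Union { USRPtr lhs, rhs; };                 // set union
struct Inter { USRPtr lhs, rhs; };                 // set intersection
struct Gate  { std::string pred; USRPtr body; };   // body contributes iff pred holds ("#")

struct USR { std::variant<Range, Union, Inter, Gate> node; };

USRPtr range(std::string lo, std::string hi) {
    return std::make_shared<USR>(USR{Range{std::move(lo), std::move(hi)}});
}
USRPtr gated(std::string p, USRPtr b) {
    return std::make_shared<USR>(USR{Gate{std::move(p), std::move(b)}});
}
USRPtr unite(USRPtr a, USRPtr b) {
    return std::make_shared<USR>(USR{Union{std::move(a), std::move(b)}});
}

int main() {
    // The slide's READ set:  [41:40+n]  U  ([21:20+n] # x>0)
    USRPtr read = unite(range("41", "40+n"),
                        gated("x>0", range("21", "20+n")));
    (void)read;  // a compiler would now test READ n WRITE = empty symbolically
}
```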

12. From Set Expression to Logic Expression
  DO j = 1, n
    A(j) = A(j+40)
    IF (x>0) THEN
      A(j) = A(j) + A(j+20)
    ENDIF
  ENDDO

1. Distribute the intersection:
   ([41:40+n] ∪ ([21:20+n] # x>0)) ∩ [1:n] = Empty?
   becomes
   ([41:40+n] ∩ [1:n] = Empty?) ∧ (([21:20+n] # x>0) ∩ [1:n] = Empty?)
2. Translate each emptiness question into a predicate:
   [41:40+n] ∩ [1:n] = Empty?  ⇔  n ≤ 40
   ([21:20+n] # x>0) ∩ [1:n] = Empty?  ⇔  n ≤ 20 ∨ x ≤ 0
3. Combine: the loop is parallel iff n ≤ 40 ∧ (n ≤ 20 ∨ x ≤ 0).
Representation is key!

13. Hybrid Analysis Strategy
Independence conditions are factored into a series of sufficient conditions, tested at runtime in the order of their complexity (see the sketch after this list):
- O(1) scalar operations (e.g., the predicate from the previous example): pass → execute in parallel (independent); fail → try the next test
- O(n/k) aggregated reference comparisons: pass → execute in parallel; fail → try the next test
- Reference-based LRPD test: pass → execute in parallel; fail → execute sequentially (dependent)
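A sketch of that cascade as a driver function: cheaper sufficient conditions first, the full reference-based test only as a last resort. The three predicates are hypothetical stand-ins for compiler-extracted tests such as those shown earlier.

```cpp
#include <functional>

enum class Outcome { Parallel, Sequential };

// Try sufficient independence conditions in order of increasing cost;
// any "pass" proves independence, only total failure forces sequential
// execution.
Outcome run_cascade(const std::function<bool()>& o1_scalar_test,   // O(1)
                    const std::function<bool()>& aggregate_test,   // O(n/k)
                    const std::function<bool()>& lrpd_test) {      // O(n)
    if (o1_scalar_test()) return Outcome::Parallel;  // cheap scalar predicate
    if (aggregate_test()) return Outcome::Parallel;  // interval comparisons
    if (lrpd_test())      return Outcome::Parallel;  // speculative shadow check
    return Outcome::Sequential;                      // proven dependent
}

int main() {
    long n = 30; double x = -1.0;
    Outcome o = run_cascade(
        [&] { return n <= 40 && (n <= 20 || x <= 0); },  // slide 12's predicate
        []  { return false; },                           // placeholder
        []  { return false; });                          // placeholder
    return o == Outcome::Parallel ? 0 : 1;
}
```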

14. Hybrid Analysis Parallelization Coverage
[Bar chart: fraction of loops parallelized at compile time vs. by run-time tests (simple checks, aggregated references, individual references), per benchmark. PERFECT: adm, arc2d, bdna, dyfesm, flo52, mdg, ocean, spec77, track, trfd. SPEC2000/06: applu, apsi, mgrid, swim, wupwise. Previous SPEC: hydro2d, matrix300, mdljdp2, nasa7, ora, swm256, tomcatv.]
• Parallelized 380 of 2100 analyzed loops: 92% sequential coverage

15. Speedups: Hybrid Analysis vs. Intel ifort
• Older benchmarks with smaller datasets were run on 4 cores only
• Better performance on 14/18 benchmarks on 4 cores
• Better performance on 10/11 benchmarks on 8 cores

16. So... What did we accomplish?
• Full parallelization of C-tran codes (28 benchmarks at >90% coverage)
• An IR representation & a technique
We cannot declare victory because:
• It required heroic efforts
• Commercial compilers adopt slowly
• Compilers cannot create parallelism; only programmers can!

17. How Else?
First:
• Think parallel!
Then:
• Develop parallel algorithms
• Raise the level of abstraction: use algorithm-level abstraction (not only loop-level)
• Expressivity + productivity
• Optimization can be compiler generated

18. STAPL: Standard Template Adaptive Parallel Library
A library of parallel components that adopts the generic programming philosophy of the C++ Standard Template Library (STL).

STL
- Iterators provide abstract access to data stored in Containers.
- Algorithms are sequences of instructions that transform the data.
- Components: Containers, Iterators, Algorithms

STAPL
- Views provide abstracted access to distributed data stored in Distributed Containers.
- Parallel Algorithms are specified by Skeletons; the run-time representation is a Task Graph.
- Components: Containers, Views, Algorithms, Task Graphs
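Standard C++ already illustrates the philosophy on shared memory: the same algorithm call, expressed over an abstraction rather than a loop, can be run sequentially or in parallel. This is plain C++17, not STAPL's API; STAPL pushes the idea further to views over distributed containers and skeleton-built task graphs.

```cpp
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> v(1'000'000, 1.0);

    // STL style: the algorithm is expressed over iterators; how it
    // executes is the library's business.
    double s1 = std::reduce(v.begin(), v.end());

    // Same abstraction level with a C++17 parallel execution policy:
    // the library is now free to parallelize the reduction.
    double s2 = std::reduce(std::execution::par, v.begin(), v.end());

    return (s1 == s2) ? 0 : 1;
}
```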

19. STAPL Components
User Application Code
- High level of abstraction, similar to C++ STL
Views, Algorithms, Containers
- Task & data parallelism; asynchronous
- Parallelism (SPMD) implicit; serialization explicit
- Imperative + functional: data flow + containers
Skeleton Framework / Adaptive Framework
- SPMD programs defined by a Task Graph
- Data dependence patterns → skeletons
- Composition: parallel, serial, nested, ...
- Tasks: work function & data; fine-grain tasks (coarsened); data in distributed containers
Run-time System
- ARMI communication library, scheduler, performance monitor
- Built on MPI, OpenMP, Pthreads; distributed memory model (PGAS)
Execution
- Defined by data-flow graphs (task graphs)
- Execution policies: scheduling, asynchrony, ...
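To make "execution defined by data-flow graphs" concrete, here is a toy two-level task graph: each task is a work function plus its data, and a task fires once its inputs are ready. std::async is only a stand-in for illustration; a real STAPL task graph is scheduled by its run-time system (ARMI) over distributed containers.

```cpp
#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> data(100);
    std::iota(data.begin(), data.end(), 1);        // 1..100

    // Two independent "map" tasks over halves of the container...
    auto left  = std::async(std::launch::async, [&] {
        return std::accumulate(data.begin(), data.begin() + 50, 0); });
    auto right = std::async(std::launch::async, [&] {
        return std::accumulate(data.begin() + 50, data.end(), 0); });

    // ...and a "reduce" task whose edges in the task graph are the two
    // futures: it runs when both inputs are available.
    int total = left.get() + right.get();
    std::printf("sum = %d\n", total);              // 5050
}
```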

20. The STAPL Graph Library (SGL)
• Many problems are modeled using graphs:
- Web search, data mining (Google, YouTube)
- Social networks (Facebook, Google+, Twitter)
- Geospatial graphs (maps)
- Scientific applications
• Many important graph algorithms:
- Breadth-first search, single-source shortest path, strongly connected components, k-core decomposition, centralities

21. SGL Programming Model
A layered stack:
- User code: Vertex Operator, Neighbor Operator
- Library code: Graph
- Execution layer: KLA (k-level asynchronous), Hierarchical, Out-of-Core
- STAPL Runtime System
- OpenMP, MPI, C++11 threads

22. Parallel Graph Algorithms May Use
Asynchronous Model
- Asynchronous task execution
- Point-to-point synchronizations; possible redundant work
- [Figure: per-processor computation tasks with interleaved computation and communication]
Level-Synchronous Model
- BSP-style iterative computation
- Global synchronization after each level; no redundant work
- [Figure: processors alternate local computation, communication, and barrier synchronization]
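The sketch below ties the two slides together: level-synchronous BFS written in the vertex-operator/neighbor-operator style, where the end of each level plays the role of the global barrier. It is a sequential, illustrative rendition; SGL distributes the vertices and overlaps communication, and its actual operator interfaces differ.

```cpp
#include <cstdio>
#include <vector>

// Level-synchronous BFS: per superstep, a vertex operator runs on every
// frontier vertex and a neighbor operator is applied along its edges; a
// barrier (here simply the end of the loop body) separates levels, so
// there is no redundant work.
void bfs_level_sync(const std::vector<std::vector<int>>& adj, int source) {
    std::vector<int> dist(adj.size(), -1);
    std::vector<int> frontier{source};
    dist[source] = 0;

    for (int level = 0; !frontier.empty(); ++level) {
        std::vector<int> next;              // the next level's frontier
        for (int v : frontier)              // vertex operator
            for (int w : adj[v])            // neighbor operator
                if (dist[w] == -1) {        // first visit wins
                    dist[w] = level + 1;
                    next.push_back(w);
                }
        frontier.swap(next);                // "barrier": level finished
    }
    for (std::size_t v = 0; v < dist.size(); ++v)
        std::printf("dist[%zu] = %d\n", v, dist[v]);
}

int main() {
    // Edges 0-1, 0-2, 1-3, 2-3: distances 0, 1, 1, 2 from vertex 0.
    bfs_level_sync({{1, 2}, {0, 3}, {0, 3}, {1, 2}}, 0);
}
```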
