
X10: a High-Productivity Approach to High Performance Programming - PowerPoint PPT Presentation



  1. X10: a High-Productivity Approach to High Performance Programming
  Rajkishore Barik, Christopher Donawa, Matteo Frigo, Allan Kielstra, Vivek Sarkar
  HPC Challenge Class 2 Award Submission
  This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004.

  2. Motivation: Productivity Challenges Caused by Future Hardware Trends
  • Challenge: develop new language, compiler, and tools technologies to support productive, portable parallel abstractions for future hardware
  • [Diagram: target platforms range from clusters with a global address space (SMP nodes containing PEs and memory, joined by an interconnect) to homogeneous multi-core chips and heterogeneous accelerators (a PPE based on the 64-bit Power Architecture with VMX plus SPEs with local stores, connected by an EIB of up to 96B/cycle)]

  3. X10 Programming Model
  • Dynamic parallelism with a Partitioned Global Address Space
  • Storage classes: activity-local, place-local, partitioned global, immutable
  • Places encapsulate the binding of activities and globally addressable data
  • All concurrency is expressed as asynchronous activities – subsumes threads, structured parallelism, messaging, and DMA transfers (beyond SPMD)
  • Atomic sections enforce mutual exclusion over co-located data
    − No place-remote accesses are permitted in an atomic section
  • Immutable data offers an opportunity for single-assignment parallelism
  • Deadlock safety: any X10 program written with async, atomic, finish, foreach, ateach, and clocks can never deadlock (a minimal sketch of these constructs follows below)
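  The constructs above compose as in the following minimal sketch, written in the early (Java-based) X10 syntax used elsewhere in this deck; the method names (leftWork, rightWork, rootWork) are illustrative placeholders, not part of any X10 library.

      // Hedged sketch: finish waits for all activities spawned (transitively)
      // within its scope; async creates a new activity at the current place.
      finish {
          async { leftWork(); }     // child activity runs asynchronously
          async { rightWork(); }    // a second child activity
          rootWork();               // the parent activity keeps computing
          // atomic { counter++; }  // an atomic section could guard updates
                                    // to co-located, place-local mutable data
      }
      // Join point: because only async/finish/atomic are used, this fragment
      // cannot deadlock.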

  4. X10 Deployment
  • The X10 language defines the mapping from X10 data structures (objects and activities) to X10 places
  • An X10 deployment defines the mapping from virtual X10 places to physical processing elements (PEs)
  • [Diagram: the same place abstraction maps onto clusters of SMP nodes, homogeneous multi-core chips, and heterogeneous accelerators (a PPE based on the 64-bit Power Architecture with VMX plus SPEs with local stores, connected by an EIB)]

  5. Current Status: Multi-core SMP Implementation for X10
  • Compiler front end: X10 source → X10 parser (grammar + static analyzer) → annotated AST → analysis passes → Java code emitter → Java compiler → X10 classfiles (Java classfiles with special annotations carrying X10 analysis information); DOMO code generation; common components shared with SAFARI
  • Runtime: each X10 place manages inbound and outbound activities and replies, with queues of ready, executing, blocked, and completed activities, plus clocks and futures
    − Atomic sections do not have blocking semantics
    − An activity can only access its own stack, place-local mutable data, or global immutable data
  • X10 libraries build on the Java Concurrency Utilities (JCU, with a thread pool modified for X10), an STM library, and a Java extern interface to Fortran/C/C++ DLLs
  • Runs on a high-performance JRE (IBM J9 VM + Testarossa JIT compiler on PPC/AIX) or on a portable standard Java 5 Runtime Environment (multiple platforms)

  6. System Configuration Used for Performance Results
  • Hardware
    − STREAM (C/OpenMP & X10), RandomAccess (C/OpenMP & X10), FFT (X10): 64-core POWER5+ p595, 2.3 GHz, 512 GB (r28n01.pbm.ihost.com)
    − FFT (Cilk version): 16-core POWER5+ p570, 1.9 GHz
    − All runs performed with page size = 4 KB and SMT turned off
  • Operating system: AIX v5.3
  • Compiler: xlc v7.0.0.5 with -O3 (plus -qsmp=omp for OpenMP compilation)
  • X10
    − Dynamic compilation options: -J-Xjit:count=0,optLevel=veryHot
    − X10 activities use serial libraries written in C and linked with the X10 runtime
    − Data size limitation: the current X10 runtime is limited to a maximum heap size of 2 GB
  • All results reported are for runs that passed validation
    − Caveat: these results should not be treated as official benchmark measurements of the above systems

  7. STREAM
  OpenMP / C version:
      #pragma omp parallel for
      for (j=0; j<N; j++) {
          b[j] = scalar*c[j];
      }
  Hybrid X10 + Serial C version:
      finish ateach(point p : dist.factory.unique()) {
          final region myR = (D | here).region;
          scale(b, scalar, c, myR.rank(0).low(), myR.rank(0).high()+1);
      }

  8. STREAM
  OpenMP / C version:
      #pragma omp parallel for
      for (j=0; j<N; j++) {
          b[j] = scalar*c[j];
      }
  • Traversing the array region by hand can be error-prone
  • Implicitly assumes a Uniform Memory Access model (no distributed arrays)
  Hybrid X10 + Serial C version:
      finish ateach(point p : dist.factory.unique()) {
          final region myR = (D | here).region;
          scale(b, scalar, c, myR.rank(0).low(), myR.rank(0).high()+1);
      }
  • SLOC counts are comparable
  • The multi-place version is designed to run unchanged on an SMP or a cluster
  • The restriction operator (D | here) simplifies computation of the local region
  • scale( ) is a sequential C function (a pure-X10 variant of the loop is sketched below)
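  For readers new to the distribution idiom, the following hedged sketch spells out a pure-X10 variant of the same kernel with no serial C helper. It reuses the names from the slide (D, b, c, scalar, dist.factory.unique()); the explicit inner loop is an assumption about how scale() traverses its slice, not code from the submission.

      // Hedged sketch: one activity per place; each activity touches only the
      // indices of distribution D that are mapped to the place it runs in.
      finish ateach (point p : dist.factory.unique()) {
          final region myR = (D | here).region;   // indices local to 'here'
          for (point [j] : myR)                   // sequential local traversal
              b[j] = scalar * c[j];               // same STREAM scale kernel
      }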
