
X10: a High-Productivity Approach to High Performance Programming - PowerPoint PPT Presentation



  1. X10: a High-Productivity Approach to High Performance Programming
  Rajkishore Barik, Christopher Donawa, Matteo Frigo, Allan Kielstra, Vivek Sarkar
  HPC Challenge Class 2 Award Submission
  This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004.

  2. Motivation: Productivity Challenges Caused by Future Hardware Trends
  • Challenge: develop new language, compiler, and tools technologies to support productive, portable parallel abstractions for future hardware
  • [Diagram: target platforms range from clusters with a global address space (SMP nodes containing PEs and memory, joined by an interconnect) to homogeneous multi-core chips and heterogeneous accelerators (a PPE based on the 64-bit Power Architecture with VMX plus SPEs with local stores, connected by an EIB of up to 96B/cycle)]

  3. X10 Programming Model
  • Dynamic parallelism with a Partitioned Global Address Space
  • Storage classes: activity-local, place-local, partitioned global, immutable
  • Places encapsulate the binding of activities and globally addressable data
  • All concurrency is expressed as asynchronous activities – subsumes threads, structured parallelism, messaging, and DMA transfers (beyond SPMD)
  • Atomic sections enforce mutual exclusion over co-located data
    − No place-remote accesses are permitted in an atomic section
  • Immutable data offers an opportunity for single-assignment parallelism
  • Deadlock safety: any X10 program written with async, atomic, finish, foreach, ateach, and clocks can never deadlock (a minimal sketch of these constructs follows below)
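  The constructs above compose as in the following minimal sketch, written in the early (Java-based) X10 syntax used elsewhere in this deck; the method names (leftWork, rightWork, rootWork) are illustrative placeholders, not part of any X10 library.

      // Hedged sketch: finish waits for all activities spawned (transitively)
      // within its scope; async creates a new activity at the current place.
      finish {
          async { leftWork(); }     // child activity runs asynchronously
          async { rightWork(); }    // a second child activity
          rootWork();               // the parent activity keeps computing
          // atomic { counter++; }  // an atomic section could guard updates
                                    // to co-located, place-local mutable data
      }
      // Join point: because only async/finish/atomic are used, this fragment
      // cannot deadlock.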

  4. X10 Deployment
  • The X10 language defines the mapping from X10 data structures (objects and activities) to X10 places
  • An X10 deployment defines the mapping from virtual X10 places to physical processing elements (PEs)
  • [Diagram: the same place abstraction maps onto clusters of SMP nodes, homogeneous multi-core chips, and heterogeneous accelerators (a PPE based on the 64-bit Power Architecture with VMX plus SPEs with local stores, connected by an EIB)]

  5. Current Status: Multi-core SMP Implementation for X10
  • Compiler front end: X10 source → X10 parser (grammar + static analyzer) → annotated AST → analysis passes → Java code emitter → Java compiler → X10 classfiles (Java classfiles with special annotations carrying X10 analysis information); DOMO code generation; common components shared with SAFARI
  • Runtime: each X10 place manages inbound and outbound activities and replies, with queues of ready, executing, blocked, and completed activities, plus clocks and futures
    − Atomic sections do not have blocking semantics
    − An activity can only access its own stack, place-local mutable data, or global immutable data
  • X10 libraries build on the Java Concurrency Utilities (JCU, with a thread pool modified for X10), an STM library, and a Java extern interface to Fortran/C/C++ DLLs
  • Runs on a high-performance JRE (IBM J9 VM + Testarossa JIT compiler on PPC/AIX) or on a portable standard Java 5 Runtime Environment (multiple platforms)

  6. System Configuration Used for Performance Results
  • Hardware
    − STREAM (C/OpenMP & X10), RandomAccess (C/OpenMP & X10), FFT (X10): 64-core POWER5+ p595, 2.3 GHz, 512 GB (r28n01.pbm.ihost.com)
    − FFT (Cilk version): 16-core POWER5+ p570, 1.9 GHz
    − All runs performed with page size = 4 KB and SMT turned off
  • Operating system: AIX v5.3
  • Compiler: xlc v7.0.0.5 with -O3 (plus -qsmp=omp for OpenMP compilation)
  • X10
    − Dynamic compilation options: -J-Xjit:count=0,optLevel=veryHot
    − X10 activities use serial libraries written in C and linked with the X10 runtime
    − Data size limitation: the current X10 runtime is limited to a maximum heap size of 2 GB
  • All results reported are for runs that passed validation
    − Caveat: these results should not be treated as official benchmark measurements of the above systems

  7. STREAM
  OpenMP / C version:
      #pragma omp parallel for
      for (j=0; j<N; j++) {
          b[j] = scalar*c[j];
      }
  Hybrid X10 + Serial C version:
      finish ateach(point p : dist.factory.unique()) {
          final region myR = (D | here).region;
          scale(b, scalar, c, myR.rank(0).low(), myR.rank(0).high()+1);
      }

  8. STREAM
  OpenMP / C version:
      #pragma omp parallel for
      for (j=0; j<N; j++) {
          b[j] = scalar*c[j];
      }
  • Traversing the array region by hand can be error-prone
  • Implicitly assumes a Uniform Memory Access model (no distributed arrays)
  Hybrid X10 + Serial C version:
      finish ateach(point p : dist.factory.unique()) {
          final region myR = (D | here).region;
          scale(b, scalar, c, myR.rank(0).low(), myR.rank(0).high()+1);
      }
  • SLOC counts are comparable
  • The multi-place version is designed to run unchanged on an SMP or a cluster
  • The restriction operator (D | here) simplifies computation of the local region
  • scale( ) is a sequential C function (a pure-X10 variant of the loop is sketched below)
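  For readers new to the distribution idiom, the following hedged sketch spells out a pure-X10 variant of the same kernel with no serial C helper. It reuses the names from the slide (D, b, c, scalar, dist.factory.unique()); the explicit inner loop is an assumption about how scale() traverses its slice, not code from the submission.

      // Hedged sketch: one activity per place; each activity touches only the
      // indices of distribution D that are mapped to the place it runs in.
      finish ateach (point p : dist.factory.unique()) {
          final region myR = (D | here).region;   // indices local to 'here'
          for (point [j] : myR)                   // sequential local traversal
              b[j] = scalar * c[j];               // same STREAM scale kernel
      }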
