X10: New opportunities for Compiler-Driven Performance via a new Programming Model



SLIDE 1

X10: New opportunities for Compiler-Driven Performance via a new Programming Model

Kemal Ebcioglu, Vijay Saraswat, Vivek Sarkar
IBM T.J. Watson Research Center
{kemal,vsaraswat,vsarkar}@us.ibm.com
Compiler-Driven Performance Workshop --- CASCON 2004, Oct 6, 2004

This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004.


SLIDE 2

Acknowledgments

  • Contributors to X10 design & implementation ideas:
    − David Bacon
    − Bob Blainey
    − Philippe Charles
    − Perry Cheng
    − Julian Dolby
    − Kemal Ebcioglu
    − Guang Gao (U Delaware)
    − Allan Kielstra
    − Robert O'Callahan
    − Filip Pizlo (Purdue)
    − Christoph von Praun
    − V.T. Rajan
    − Lawrence Rauchwerger (Texas A&M)
    − Vijay Saraswat (contact for lang. spec.)
    − Vivek Sarkar
    − Mandana Vaziri
    − Jan Vitek (Purdue)

  • IBM PERCS Team members
    − Research
    − Systems & Technology Group
    − Software Group
    − PI: Mootaz Elnozahy

  • University partners:
    − Cornell
    − LANL
    − MIT
    − Purdue University
    − RPI
    − UC Berkeley
    − U. Delaware
    − U. Illinois
    − U. New Mexico
    − U. Pittsburgh
    − UT Austin
    − Vanderbilt University

SLIDE 3

Performance and Productivity Challenges facing Future Large-Scale Systems

1) Memory wall: severe non-uniformities in bandwidth & latency in the memory hierarchy

[Figure: memory/parallelism hierarchy --- processor clusters of PEs with L1 caches, shared L2 caches, L3 cache, and memory; levels of parallelism range over clusters (scale-out), SMP, multiple cores on a chip, coprocessors (SPUs), SMTs, SIMD, and ILP]

2) Frequency wall: multiple layers of hierarchical heterogeneous parallelism to compensate for the slowdown in frequency scaling

3) Scalability wall: software will need to deliver ~10^5-way parallelism to utilize large-scale parallel systems

SLIDE 4

IBM PERCS Project (Productive Easy-to-use Reliable Computing Systems)

[Figure: PERCS productivity goals and software stack]

  • Increase overall productivity
  • Increase development productivity and the number of applications written
    − PERCS Programming Tools: performance-guided parallelization and transformation, static & dynamic checking, separation of concerns --- all integrated into a single development environment (Eclipse)
  • Increase performance of applications
    − PERCS Programming Model (alongside MPI and OpenMP): static and dynamic compilers for base languages with programming-model extensions
    − Mature languages: C/C++, Fortran, Java; experimental languages: X10, UPC, StreamIt, HTA/Matlab
  • Increase execution productivity
    − Language runtime + dynamic compilation + continuous optimization
    − PERCS System Software (K42)
    − PERCS System Hardware

SLIDE 5

Limitations in exploiting Compiler-Driven Performance in Current Parallel Programming Models

  • MPI: local memories + message-passing
    − Parallelism, locality, and "global view" are completely managed by the programmer
    − Communication, synchronization, and consistency operations are specified at a low level of abstraction
    ⇒ Limited opportunities for compiler optimizations

  • Java threads, OpenMP: shared-memory parallel programming model
    − Uniform symmetric view of all shared data
    − Non-transparent performance --- programmer cannot manage data locality and thread affinity at different hierarchy levels (cluster, SMT, …)
    ⇒ Limited effectiveness of compiler optimizations

  • HPF, UPC: partitioned global address space + SPMD execution model
    − User specifies data distribution & parallelism; compiler generates communication using the owner-computes rule
    − Large overheads in accessing shared data; compiler optimizations can help applications with simple data access patterns
    ⇒ Limited applicability of compiler optimizations

SLIDE 6

X10 Design Guidelines: Design for Productivity & Compiler/Runtime-driven Performance

  • Start with state-of-the-art OO language primitives as foundation
    − No gratuitous changes
    − Build on existing skills
  • Raise the level of abstraction for constructs that should be amenable to optimized implementation
    − Monitors → atomic sections
    − Threads → async activities
    − Barriers → clocks
  • Introduce new constructs to model hierarchical parallelism and non-uniform data access
    − Places
    − Distributions
  • Support common parallel programming idioms
    − Data parallelism
    − Control parallelism
    − Divide-and-conquer
    − Producer-consumer / streaming
    − Message-passing
  • Ensure that every program has a well-defined semantics
    − Independent of implementation
    − Simple concurrency model & memory model
  • Defer fault tolerance and reliability issues to lower levels of the system
    − Assume tightly-coupled system with dedicated interconnect

SLIDE 7

Logical View of X10 Programming Model (Work in progress)

[Figure: two places, each containing activities with activity-local storage (heap, stack, control), a place-local heap, and a partition of the global heap; inbound and outbound async requests and replies flow between places]

  • Place = collection of resident activities and data
    − Maps to a data-coherent unit in a large-scale system
    − Granularity of a place can range from a single h/w thread to an entire scale-up system
  • Four storage classes:
    − Partitioned global
    − Place-local
    − Activity-local
    − Value class instances (can be copied/migrated freely)
  • Activities can be created by
    − async statements (one-way msgs)
    − future expressions
    − foreach & ateach constructs
  • Activities are coordinated by
    − Unconditional atomic sections
    − Conditional atomic sections
    − Clocks (generalization of barriers)
    − Force (for the result of a future)

SLIDE 8

Async activities: abstraction of threads

  • Async statement
    − async(P){S}: run S at place P
    − async(D){S}: run S at the place containing datum D
    − S may contain local atomic operations or additional async activities for the same/different places
    − Example: percolate process to data
  • Async expression (future)
    − F = future(P){E}, or F = future(D){E}: return the value of expression E, evaluated in place P (or in the place containing datum D)
    − force F, or !F: suspend until the value is known
    − Example: percolate data to process

public void put(K key, V value) {
  int hash = key.hashCode() % D.size;
  async (D[hash]) {
    for (_ b = buckets[hash]; b != null; b = b.next) {
      if (b.k.equals(key)) { b.v = value; return; }
    }
    buckets[hash] = new Bucket<K,V>(key, value, buckets[hash]);
  };
}

public ^V get(K key) {
  int hash = key.hashCode() % D.size;
  return future (D[hash]) {
    for (_ b = buckets[hash]; b != null; b = b.next) {
      if (b.k.equals(key)) { return b.v; }
    }
    return new V();
  };
}

Distributed hash-table example

SLIDE 9

RandomAccess (GUPS) example

public void run(int a[] blocked, int seed[] cyclic, int value smallTable[]) {
  ateach (start : seed clocked c) {
    int ran = start;
    for (int count : 1..N_UPDATES/place.MAX_PLACES) {
      ran = Math.random(ran);
      int j = F(ran);              // function F() can be in C/Fortran
      int k = smallTable[g(ran)];
      async (a[j]) atomic { a[j] ^= k; }
    } // for
  } // ateach
  next c;
}

SLIDE 10

Regions and Distributions

  • Regions
    − The domain of some array; a collection of array indices
    − region R = [0..99];
    − region R2 = [0..99, 0..199];
  • Region operators
    − region Intersect = R3 && R4;
    − region Union = R3 || R4;
    − etc.
  • Distributions
    − Map region elements to places
    − distribution D = cyclic(R);
    − Domain and range restriction:
      • distribution D2 = D | R;
      • distribution D3 = D | P;
  • Regions/Distributions can be used like type and place parameters
    − <region R, distribution D> void m(...)
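
To make the cyclic distribution above concrete, here is a minimal plain-Java sketch (not X10; the Region record and cyclicPlace method are made-up illustrative names) of the index-to-place mapping that cyclic(R) describes: indices of a one-dimensional region are dealt out to places in round-robin order.

// Minimal plain-Java sketch (not X10) of the mapping performed by cyclic(R):
// each index of a 1-D region is assigned to a place in round-robin order.
// The class, record, and method names are illustrative only.
final class CyclicDistributionSketch {
    // A 1-D region [lo..hi], analogous to "region R = [0..99];"
    record Region(int lo, int hi) {}

    // Place owning index i under a cyclic distribution over nPlaces places.
    static int cyclicPlace(int i, int nPlaces) {
        return i % nPlaces;
    }

    public static void main(String[] args) {
        Region r = new Region(0, 9);
        int nPlaces = 4;
        for (int i = r.lo(); i <= r.hi(); i++) {
            System.out.println("index " + i + " -> place " + cyclicPlace(i, nPlaces));
        }
        // index 0 -> place 0, index 1 -> place 1, ..., index 4 -> place 0, ...
    }
}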

SLIDE 11

ArrayCopy example: high-level optimizations of async activities

// Version 1 (original):
<value T, D, E> public static void arrayCopy(T[D] a, T[E] b) {
  // Spawn an activity for each index to fetch and copy the value
  ateach (i : D.region)
    a[i] = async b[i];
  next c;  // Advance clock
}

// Version 2 (optimized):
<value T, D, E> public static void arrayCopy(T[D] a, T[E] b) {
  // Spawn one activity per place
  ateach ( D.places )
    for ( j : D | here )
      a[j] = async b[j];
  next c;  // Advance clock
}

// Version 3 (further optimized):
<value T, D, E> public static void arrayCopy(T[D] a, T[E] b) {
  // Spawn one activity per D-place and one future per place p
  // to which E maps an index in (D | here).
  ateach ( D.places ) {
    region LocalD = (D | here).region;
    ateach ( p : E[LocalD] ) {
      region RemoteE = (E | p).region;
      region Common = LocalD && RemoteE;
      a[Common] = async b[Common];
    }
  }
  next c;  // Advance clock
}

SLIDE 12

Uniform treatment of Arrays & Loops and Collections & Iterators

  • Distributed Collections
    − Map collection elements to places
    − Collection<D,E> identifies a collection with distribution D and element type E
  • Parallel iterators
    − foreach (e : C) { … }
    − ateach ( C ) { … here … }
  • Sequential iterator
    − for (e : C)
  • Arrays
    − Map region elements to values (therefore multidimensional)
    − Declared with a given distribution
    − int[D] array;
  • Loops
    − ateach (D[R]) { ... }
    − ateach (array) { ... }
    − foreach (i : R) { ... }
    − foreach (i : D) { ... }
    − foreach (i : array) { ... }
    − Sequential variants of foreach are available as for loops
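
For comparison with the loop forms above, the following plain-Java sketch (not X10) contrasts a sequential iterator with an element-parallel iteration over the same collection using a parallel stream; it mirrors only the foreach idea of iterating elements in parallel and has no notion of places, so it is not an analogue of ateach.

import java.util.List;

// Plain-Java sketch (not X10): sequential vs. parallel iteration over a
// collection. This mirrors only the foreach idea of element-parallel
// iteration; it has no notion of places or distributions.
public class IterationSketch {
    public static void main(String[] args) {
        List<Integer> c = List.of(1, 2, 3, 4, 5);

        // Sequential iterator, loosely analogous to "for (e : C)"
        for (int e : c) {
            System.out.println("seq " + e);
        }

        // Parallel iteration, loosely analogous to "foreach (e : C) { ... }"
        c.parallelStream().forEach(e -> System.out.println("par " + e));
    }
}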

SLIDE 13

Clocks: abstraction of barriers

  • Operations:
    − clock c = new clock();
    − now(c){S} : requires S to terminate before the clock can progress
    − continue c; : signals completion of this activity's work in the current clock phase
    − next c1,…,cn; : suspends until the clocks can advance; implicitly continues all clocks (c1,…,cn names all clocks for the activity)
    − drop c; : no further operations on c
  • Semantics
    − Clock c can advance only when all activities registered with the clock have executed continue c
  • Clocked final
    − clocked(c) final int l = r;
    − Variable is "final" (immutable) until the next phase
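
Since clocks are described above as a generalization of barriers, the following minimal plain-Java sketch (java.util.concurrent, not X10) shows the basic barrier pattern that next generalizes: each worker finishes its phase and waits until all workers have arrived before any of them starts the next phase.

import java.util.concurrent.CyclicBarrier;

// Plain-Java barrier sketch (not X10): each of N workers completes its work
// for the current phase, then waits at the barrier; no worker enters the
// next phase until all have arrived. X10's "next" on a clock plays a similar
// role, but clocks also allow dynamic registration and multiple clocks.
public class BarrierPhasesSketch {
    public static void main(String[] args) throws InterruptedException {
        final int workers = 4;
        final int phases = 3;
        CyclicBarrier barrier = new CyclicBarrier(workers,
                () -> System.out.println("--- phase complete ---"));

        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int id = w;
            threads[w] = new Thread(() -> {
                try {
                    for (int p = 0; p < phases; p++) {
                        System.out.println("worker " + id + " did phase " + p);
                        barrier.await();   // loosely analogous to "next c;"
                    }
                } catch (Exception e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) t.join();
    }
}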

SLIDE 14

Unstructured Mesh Transport Example (UMT2K)

  • 3D, deterministic, multi-group photon transport code
  • Solves the 1st-order form of the steady-state Boltzmann equation
  • Represented by an unstructured mesh
    − Partitioning strives to maintain load balance and reduce the communicate/compute ratio

Figure source: Modified from Mathis and Kerbyson, IPDPS 2004

SLIDE 15

Communication Structure

  • Nearest-neighbor communication in the graph domain
  • Communication can be minimized via judicious mapping of the graph to system nodes

Figure source: Modified from Mathis and Kerbyson, IPDPS 2004

SLIDE 16

UMT2k in X10: example of hierarchical heterogeneous parallelism

do {
  now ( c ) {
    ateach ( n : nodes ) {              // Cluster-level parallelism
      foreach ( s : Sweeps ) {          // SMP parallelism
        // receive inputs
        flows = new Flux[R] (k) {       // SMT parallelism
          async (…) inputs[s][k].receive();
        }
        // Choice of using clock or force to synchronize on flows[*]
        // Thread-local with vector & co-processor parallelism
        flux = compute(s, flows);
        // send outputs
        . . .
      } // foreach
    } // ateach
  } // now
  // use clock c to wait for all sweeps to complete
  next c;
  . . .
} while ( err > MAX_ERROR );

[Figure: hierarchy of parallelism levels --- clusters (scale-out), SMP, multiple cores on a chip, coprocessors (SPUs), SMTs, vector (VMX), ILP]

SLIDE 17

C+MPI FixedPoint iteration (Simpler example than UMT2K)

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>

int n;
double *A, *Tmp;
const double epsilon = 0.000001;

int main(int argc, char* argv[]) {
  int i, iters;
  double delta;
  int numprocs, rank, mysize;
  double sum;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (argc != 2) { printf("usage: fixedpt n\n"); exit(1); }
  n = atoi(argv[1]);
  mysize = n * (rank+1) / numprocs - n * rank / numprocs;
  A = malloc((mysize+2) * sizeof(double));
  for (i = 0; i <= mysize; i++) A[i] = 0.0;
  if (rank == numprocs - 1) A[mysize+1] = n + 1.0;
  Tmp = malloc((mysize+2) * sizeof(double));
  iters = 0;

  do {
    iters++;
    if (rank < numprocs-1)
      MPI_Send(&(A[mysize]), 1, MPI_DOUBLE, rank+1, 1, MPI_COMM_WORLD);
    if (rank > 0)
      MPI_Recv(&(A[0]), 1, MPI_DOUBLE, rank-1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    if (rank > 0)
      MPI_Send(&(A[1]), 1, MPI_DOUBLE, rank-1, 1, MPI_COMM_WORLD);
    if (rank < numprocs-1)
      MPI_Recv(&(A[mysize+1]), 1, MPI_DOUBLE, rank+1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    for (i = 1; i <= mysize; i++) Tmp[i] = (A[i-1] + A[i+1]) / 2.0;
    delta = 0.0;
    for (i = 1; i <= mysize; i++) delta += fabs(A[i] - Tmp[i]);
    MPI_Allreduce(&delta, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    delta = sum;
    for (i = 1; i <= mysize; i++) A[i] = Tmp[i];
  } while (delta > epsilon);

  if (rank == 0) printf("Iterations: %d\n", iters);
  MPI_Finalize();
}

Courtesy: Larry Snyder et al.

API-based control flow; the distribution is hard-coded in the program

SLIDE 18

Reduction and Scan Operators

  • Reduction operator over type T
    − Static method with signature: T(T,T)
    − Virtual method in class T with signature T(T)
    − Operator is expected to be associative and commutative
  • Reduction operation: A >> foo() returns a value of type T, where
    − A is an array over base type T
    − A >> foo() performs reductions over all elements of A to obtain a single result of type T
  • Scan operation: A || foo() returns an array B of base type T, where
    − B[i] = A[0..i] >> foo()
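
As a concrete reading of these definitions, the plain-Java sketch below (not X10 syntax; the operator name sum and the helper methods are made up for illustration) computes a reduction and the corresponding inclusive scan, B[i] = reduce(A[0..i]), using an associative and commutative operator.

import java.util.Arrays;

// Plain-Java sketch (not X10) of the reduce/scan semantics described above,
// using an associative & commutative operator (here: addition).
public class ReduceScanSketch {
    // Reduction operator with the "static method of signature T(T,T)" shape.
    static int sum(int x, int y) { return x + y; }

    // A >> sum(): reduce all elements of A to a single value.
    static int reduce(int[] a) {
        int result = a[0];
        for (int i = 1; i < a.length; i++) result = sum(result, a[i]);
        return result;
    }

    // A || sum(): inclusive scan, B[i] = reduce(A[0..i]).
    static int[] scan(int[] a) {
        int[] b = new int[a.length];
        b[0] = a[0];
        for (int i = 1; i < a.length; i++) b[i] = sum(b[i-1], a[i]);
        return b;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        System.out.println(reduce(a));                 // 10
        System.out.println(Arrays.toString(scan(a)));  // [1, 3, 6, 10]
    }
}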

SLIDE 19

Example of Unconditional Atomic Sections (SPECjbb2000: Java vs. X10 versions)

Java version:

public class Stock extends Entity {
  ...
  private float ytd;
  private short orderCount;
  ...
  public synchronized void incrementYTD(short ol_quantity) { ... ytd += ol_quantity; ... }
  ...
  public synchronized void incrementOrderCount() { ... ++orderCount; ... }
  ...
}

X10 version (w/ atomic section):

public class Stock extends Entity {
  ...
  private float ytd;
  private short orderCount;
  ...
  public atomic void incrementYTD(short ol_quantity) { ... ytd += ol_quantity; ... }
  ...
  public atomic void incrementOrderCount() { ... ++orderCount; ... }
  ...
}

In the Java version, these two methods cannot be executed simultaneously because they use the same lock. With atomic sections, the X10 implementation can choose to execute these two methods in parallel.

[Figure: layout of a "Stock" object --- Java version: a single lock guards both ytd and orderCount; X10 version: ytd and orderCount may be guarded by separate locks (lock1, lock2)]

Atomic Sections are deadlock-free!
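
The point of the figure can be restated in plain Java (not X10): guarding ytd and orderCount with separate lock objects, one possible strategy an X10 implementation could choose for these atomic sections, lets the two increment methods run concurrently, whereas synchronizing both on this serializes them. A minimal sketch:

// Plain-Java sketch (not X10) of the per-field locking an implementation
// could choose for atomic sections: separate locks for independent fields
// let incrementYTD and incrementOrderCount run concurrently.
public class StockFineGrained {
    private float ytd;
    private short orderCount;

    private final Object ytdLock = new Object();         // "lock1" in the figure
    private final Object orderCountLock = new Object();  // "lock2" in the figure

    public void incrementYTD(short olQuantity) {
        synchronized (ytdLock) {
            ytd += olQuantity;
        }
    }

    public void incrementOrderCount() {
        synchronized (orderCountLock) {
            ++orderCount;
        }
    }
}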

SLIDE 20

Example of Conditional Atomic Section

  • Conditional Atomic Sections are similar to Conditional Critical Regions (CCRs)
    − Powerful construct; misuse can lead to deadlock
    − Need to identify special cases that are most useful in practice

class OneBuffer<value T> {
  ?Box<T> datum = null;

  public void send(T v) {
    when (this.datum == null) {
      this.datum := new Box<T>(v);
    }
  }

  public T receive() {
    when (this.datum != null) {
      T v = datum.datum;
      this.datum := null;
      return v;
    }
  }
}
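
For comparison, the guarded wait that when expresses declaratively corresponds to the classic wait/notifyAll pattern in plain Java; the sketch below is an analogy only, not how an X10 implementation must realize conditional atomic sections.

// Plain-Java sketch (not X10) of the one-slot buffer above, written with the
// guarded-wait pattern that X10's "when" expresses declaratively.
public class OneBufferJava<T> {
    private T datum = null;  // null means "empty"

    public synchronized void send(T v) throws InterruptedException {
        while (datum != null) {   // guard: wait until the buffer is empty
            wait();
        }
        datum = v;
        notifyAll();
    }

    public synchronized T receive() throws InterruptedException {
        while (datum == null) {   // guard: wait until the buffer is full
            wait();
        }
        T v = datum;
        datum = null;
        notifyAll();
        return v;
    }
}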

SLIDE 21

Memory Model

  • X10 focus is on data-race-free applications
  • Programmer uses atomic / clock / force operations to avoid data races
    − The X10 programming environment also includes a data race detection tool
  • Weak memory model for defining the consistency of unsynchronized accesses
    − Based on the Location Consistency memory model
    − Akin to the weak ordering guarantees of messages in MPI

SLIDE 22

X10 Type System: Features relevant to Compiler Optimization

  • Unified type system
    − All data items are objects
  • Value classes and clocked final
    − Immutable --- no updatable fields
    − However, the target of an object reference in a field can be mutable (if it's not itself a value class instance)
  • Type parameters
    − Places, distributions, …
  • Nullable
    − All types are non-null by default; a variable must be explicitly declared nullable
    − For any type T, the type ?T (read: "nullable T") contains all the values of type T, plus a special null value (unless T already contains null)
  • Support for both rectangular multidimensional arrays (matrices) and nested arrays
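
A rough plain-Java analogue (not X10) of "non-null by default, nullable is opt-in": ordinary reference fields are validated non-null at construction time, while a value that may be absent is modeled explicitly, here with Optional; the class and field names below are made up for illustration.

import java.util.Objects;
import java.util.Optional;

// Plain-Java sketch (not X10): ordinary references are checked non-null at
// construction, and a possibly-absent value is modeled explicitly.
public class NullableSketch {
    static final class Node {
        final String label;         // plays the role of a non-null type T
        final Optional<Node> next;  // plays the role of a nullable type ?T

        Node(String label, Optional<Node> next) {
            this.label = Objects.requireNonNull(label, "label must be non-null");
            this.next = Objects.requireNonNull(next);
        }
    }

    public static void main(String[] args) {
        Node tail = new Node("tail", Optional.empty());
        Node head = new Node("head", Optional.of(tail));
        System.out.println(head.label + " -> " + head.next.map(n -> n.label).orElse("null"));
    }
}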

SLIDE 23

X10 Compilation and Runtime Environment

[Figure: X10 compilation and runtime environment --- X10 source code → X10 front end → X10 classfiles → X10 static high-level optimizer → X10 Virtual Machine w/ PERCS CPO, running on the OS and target hardware (clusters (scale-out), SMP, multiple cores on a chip, coprocessors (SPUs), SMTs, vector (VMX), ILP); hardware parameters and profile feedback flow back into the static optimizer and the virtual machine]

SLIDE 24

Relating optimizations for past programming paradigms to X10 optimizations

  • Message-passing (e.g., MPI)
    − Activities: single activity per place
    − Storage classes: place-local
    − Important optimizations: message aggregation, optimization of barriers & reductions
  • Data parallel (e.g., HPF)
    − Activities: single global program
    − Storage classes: partitioned global
    − Important optimizations: SPMDization, synchronization & communication optimizations
  • PGAS (e.g., Titanium, UPC)
    − Activities: single activity per place
    − Storage classes: partitioned global, place-local
    − Important optimizations: localization, SPMDization, synchronization & communication optimizations
  • DSM (e.g., TreadMarks)
    − Activities: multiple
    − Storage classes: partitioned global, activity-local
    − Important optimizations: data layout optimizations, page locality optimizations
  • NUMA
    − Activities: single activity per place
    − Storage classes: partitioned global, activity-local
    − Important optimizations: data distribution, synchronization & communication optimizations
  • Futures / active messages
    − Activities: multiple
    − Storage classes: place-local, activity-local
    − Important optimizations: message aggregation, synchronization optimization
  • Co-processor (e.g., STI Cell)
    − Activities: single activity per place
    − Storage classes: partitioned global, place-local
    − Important optimizations: data communication, consistency, & synchronization optimizations
  • Full X10
    − Activities: multiple activities in multiple places
    − Storage classes: partitioned global, place-local, activity-local
    − Important optimizations: all of the above

SLIDE 25

Some Challenges in Optimization of X10 programs

  • Analysis and optimization of explicitly parallel programs
    − Proposed approach: use the Parallel Program Graph (PPG) representation
  • Analysis and optimization of remote data accesses
    − Proposed approach: perform data access aggregation and elimination using the Array SSA framework
  • Optimized implementation of atomic sections
    − Simple cases that can be supported by hardware, e.g., reductions
    − Analyzable atomic sections
    − General case
  • Load-balancing
    − Dynamic, adaptive migration of places
  • Continuous optimization
    − Efficient implementation of scan/reduce
  • Efficient invocation of components in foreign languages
    − C, Fortran
  • Garbage collection across multiple places
SLIDE 26

X10 Status and Plans

  • Draft Language Design Report available internally, with a set of sample programs
  • Implementation begun on Prototype #1, targeted for 1/2005
    − Functional reference implementation of a language subset, not optimized for performance
    − Support for calls to single-threaded native code (C, Fortran)
  • Productivity experiments planned for 7/2005
    − Use Prototype #1 to compare X10 with MPI and UPC
    − Revise language based on feedback from productivity experiments
  • Prototype #2 planned for 12/2005
    − Includes design & prototype implementation of selected optimizations for parallelism, synchronization, and locality in X10 programs
    − Revise language based on feedback from design evaluation
  • Next phase of PERCS project planned for the 7/2006 – 6/2010 timeframe
SLIDE 27

Conclusions and Future Work

  • Future large-scale parallel systems will be accompanied by severe productivity and performance challenges
  • Summarized the X10 language approach in the PERCS project, with a focus on next steps:
    − Use applications and productivity studies to refine design decisions in X10
    − Prototype solutions to address implementation challenges
  • Future work (beyond 2005)
    − Explore integration of X10 with other language efforts in IBM: XML (XJ), BPEL, …
    − Community effort to build consensus on standardized "high productivity" languages for HPC systems in the 2010 timeframe