Introduction to X10 Olivier Tardieu IBM Research This material is - PowerPoint PPT Presentation

Introduction to X10 Olivier Tardieu IBM Research This material is based upon work supported by the Defense Advanced Research Projects Agency, by the Air Force Office of Scientific Research, and by the Department of Energy.

Take Away § X10 is § a programming language § derived from and interoperable with Java § an open source tool chain § compilers, runtime, IDE § developed at IBM Research since 2004 with support from DARPA, DoE, and AFOSR § >100 contributors (IBM and Academia) § a growing community § >100 papers § workshops, tutorials, courses § X10 tackles the challenge of programming at scale § first HPC, then clusters, now cloud § scale out: run across many distributed nodes § scale up: exploit multi-core and accelerators § elasticity and resilience § double goal: productivity and performance 2

Links § Main X10 website http://x10-lang.org § X10 Language Specification http://x10.sourceforge.net/documentation/languagespec/x10-latest.pdf § A Brief Introduction to X10 (for the HPC Programmer) http://x10.sourceforge.net/documentation/intro/intro-223.pdf § X10 2.5.3 release (command line tools only) http://sourceforge.net/projects/x10/files/x10/2.5.3/ § X10DT 2.5.3 release (Eclipse-based IDE) http://sourceforge.net/projects/x10/files/x10dt/2.5.3/ 3

Current IBM X10 Team 4

Agenda § X10 overview § APGAS programming model § X10 programming language § Tool chain § Implementation § Applications § 2014/2015 Highlights § Grid X10 5

X10 Overview 6

Asynchronous Partitioned Global Address Space (APGAS) Memory abstraction § Message passing § each task lives in its own address space; example: MPI § Shared memory § shared address space for all the tasks; example: OpenMP § PGAS § global address space: single address space across all tasks § partitioned address space: each partition must fit within a shared-memory node § examples: UPC, Co-array Fortran, X10, Chapel Execution model § SPMD § symmetric tasks progressing in lockstep; examples: MPI, OpenMP 3, UPC, CUDA § APGAS § asynchronous tasks; examples: Cilk, X10, OpenMP 4 tasks 7

Places and Tasks Global ¡Reference ¡ Local ¡ ¡ Local ¡ ¡ Heap ¡ Heap ¡ … ¡ … ¡ … … ¡ … … … … Tasks ¡ Tasks ¡ Place ¡0 ¡ Place ¡N ¡ Task parallelism Concurrency control § async § when async S when(c) S § finish § atomic finish S atomic S Place-shifting operations Distributed heap § at § GlobalRef at(p) S GlobalRef[T] § at § PlaceLocalHandle at(p) e PlaceLocalHandle[T] 8

Idioms § Remote procedure call § SPMD finish for(p in Place.places()) { finish v = at at(p) evalThere(arg1, arg2); at at(p) async async runEverywhere(); } § Active message at(p) async at async runThere(arg1, arg2); § Atomic remote update at at(ref) async async atomic atomic ref() += v; § Divide-and-conquer parallelism def fib(n:Long):Long { § Computation/communication overlap if(n < 2) return n; val acc = new Accumulator(); val f1:Long; while(cond) { val f2:Long; finish finish { finish finish { val v = acc.currentValue(); async async f1 = fib(n-1); at at(ref) async async ref() = v; f2 = fib(n-2); acc.updateValue(); } } return f1 + f2; } } 9

BlockDistRail.x10 public class BlockDistRail[T] { protected val sz:Long; // block size protected val raw:PlaceLocalHandle[Rail[T]]; public def this(sz:Long, places:Long){T haszero} { this.sz = sz; raw = PlaceLocalHandle.make[Rail[T]](PlaceGroup.make(places), ()=>new Rail[T](sz)); } public operator this(i:Long) = (v:T) { at(Place(i/sz)) raw()(i%sz) = v; } public operator this(i:Long) = at(Place(i/sz)) raw()(i%sz); public static def main(Rail[String]) { val rail = new BlockDistRail[Long](5, 4); rail(7) = 8; Console.OUT.println(rail(7)); } } 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 Place 0 Place 1 Place 2 Place 3 10

Like Java § Objects § classes and interfaces § single-class inheritance, multiple interfaces § fields, methods, constructors § virtual dispatch, overriding, overloading, static methods § Packages and files § Garbage collector § Variables and values (final variables, but final is the default) § definite assignment (extended to tasks) § Expressions and statements § control statements: if, switch, for, while, do-while, break, continue, return § Exceptions § try-catch-finally, throw § Comprehension loops and iterators 11

Beyond Java § Syntax § types “ x:Long ” rather than “ Long x ” § declarations val , var , def § function literals (a:Long, b:Long) => a < b ? a : b § ranges 0..(size-1) § operators user-defined behavior for standard operators § Types § local type inference val b = false; § function types (Long, Long) => Long § typedefs type BinOp[T] = (T, T) => T; § structs headerless inline objects; extensible primitive types § arrays multi-dimensional, distributed; implemented in X10 § properties and constraints extended static checking gradual typing § reified generics templates; constrained kinds 12

Tool Chain 13

Tool Chain § Eclipse Public License § “Native” X10 implementation § C++ based; CUDA support § distributed multi-process (one place per process + one place per GPU) § C/POSIX network abstraction layer (X10RT) § x86, x86_64, Power; Linux, AIX, OS X, Windows/Cygwin, BG/Q; TCP/IP, PAMI, MPI § “Managed” X10 implementation § Java 6/7 based; no CUDA support § distributed multi-JVM (one place per JVM) § pure Java implementation over TCP/IP or using X10RT via JNI (Linux & OS X) § X10DT (Eclipse-based IDE) available for Windows, Linux, OS X § supports many core development tasks including remote build & execute facilities 14

Compilation and Execution X10 Compiler Front-End AST Optimizations X10 Parsing / X10 AST AST Lowering Source Type Check Managed X10 Native X10 X10 AST Java C++ Back-End Back-End Java Interop Java Code C++ Code Generation Generation Java Source C++ Source Cuda Source XRJ Java Compiler XRX Platform Compilers XRC Existing Native (C/C++/ Java Byteode Native executable Existing Java Application etc) Application JNI Native Environment X10RT Java VMs (CPU, GPU, etc) 15

X10DT Building Source navigation, syntax Browsing highlighting, parsing errors, folding, hyperlinking, outline and quick outline, hover help, content assist, type - Java/C++ support hierarchy, format, search, - Local and remote call graph, quick fixes Launching Editing Debug 16

Implementation 17

Runtime § X10RT (X10 runtime transport) § core API: active messages § extended API: collectives & RDMAs X10 Application § emulation layer § two versions: C (+JNI bindings) or pure Java X10 Core Class Libraries § Native runtime XRX § processes, threads, atomic ops § object model (layout, RTTI, serialization) § two versions: C++ and Java Native Runtime § XRX (X10 runtime in X10) X10RT § async, finish, at, when, atomic § X10 code compiled to C++ or Java PAMI TCP/IP MPI SHM CUDA § Core X10 libraries § x10.array, io, util, util.concurrent 18

APGAS Constructs § One process per place § Local tasks: async & finish § thread pool; cooperative work-stealing scheduler § Remote tasks: at(p) async § source side: synthetize active message § async id + serialized heap + control state (finish, clocks) § compiler identifies captured variables (roots); runtime serializes heap reachable from roots § destination side: decode active message § polling (when idle + on runtime entry) § Distributed finish § complex and potentially costly due to message reordering § pattern-based specialization; program analysis 19

Applications 20

HPC Challenge 2012 – X10 at Petascale – Power 775 G-FFT G-HPL EP Stream (Triad) 26958 500000 30000 1.20 800000 24.00 15 396614 22.4 589231 25000 1.00 Gflops/place Gflops/place 400000 GB/s/place 600000 22.00 0.88 Gflops 20000 0.80 Gflops 10 GB/s 300000 7.23 15000 0.60 400000 20.00 0.82 18.0 200000 10000 0.40 5 7.12 200000 18.00 100000 5000 0.20 0 0.00 0 16.00 0 0 0 16384 32768 0 16384 32768 0 27840 55680 Places Places Places G-RandomAccess UTS 844 900 0.9 700000 14.00 596451 800 0.8 10.93 10.87 600000 12.00 0.82 Million nodes/s/place 0.82 700 0.7 Million nodes/s 500000 10.00 600 0.6 Gups/place 10.71 400000 8.00 Gups 500 0.5 400 0.4 300000 6.00 356344 300 0.3 200000 4.00 200 0.2 100000 2.00 100 0.1 0 0 0 0.00 0 8192 16384 24576 32768 0 13920 27840 41760 55680 Places Places 21

Introduction to X10 Olivier Tardieu IBM Research This material is - PowerPoint PPT Presentation

Introduction to X10 Olivier Tardieu IBM Research This material is based upon work supported by the Defense Advanced Research Projects Agency, by the Air Force Office of Scientific Research, and by the Department of Energy. Take Away X10

X10 X10 Jonathan Lee Jonathan Lee Daniel Lee Daniel Lee What is X10? What is X10?

X10 Cluster SSH access X10 on your PC Eclipse for X10: x10dt From Eclipse to

APGAS Programming in X10 http://x10-lang.org This tutorial was originally given by Olivier

X10: a High-Productivity Approach to X10: a High-Productivity Approach to High Performance

APGAS Programming in X10 http://x10-lang.org This tutorial was originally given by Olivier

X10: New opportunities for X10: New opportunities for Compiler-Driven Performance

Local Parallel Iteration in X10 Josh Milthorpe IBM Research This material is based upon work

A Resilient Framework for Iterative Linear Algebra Applications in X10 Sara S. Hamouda

Cutting Out the Middleman: OS-Level Support for X10 Activities Manuel Mohr, Sebastian Buchwald,

Dynamic X10 Resource-Aware Programming for Higher Efficiency Manuel Mohr, Andreas Zwinkau,

Conflict Driven Learning and Non-chronological Backtracking x1 + x4 x1 + x3 + x8 x1 + x8

8-Puzzle State: x00 x01 x02 x10 x11 x12 x20 x21 x22 xij in {0, , 8}, i, j in {0, 1, 2} xij

Mac OS 10.12 Sierra Introduction: ! Sierra 10.12 is the latest Macintosh operating system from

FIRST FLOOR PLAN ELEVATION A PORCH 18'-0"x5'-0" FIRST FLOOR PLAN ELEVS. B, D PORCH

How is python used in biomolecular sciences? Antonia Mey antonia.mey@ed.ac.uk @ppxasjsm L.

Operational amplifiers Types of operational amplifiers (bioelectric amplifiers have different

Changelog Changes made in this version not seen in fjrst lecture: sz is one past the end of the

2.2 Transformations Hao Li http://cs420.hao-li.com 1 OpenGL Transformations Matrices

Completeness of Resolution 5ai We will show by construction: AUTOMATED REASONING If clauses S

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Physics Department 8.044 Statistical Physics I Spring Term

ESQPT in systems with long-range interactions Lea F. Santos Yeshiva University, New York, NY, USA

Valuing ISR Resources Tod S. Levi6 * , Kellen G. Leister * ,

Head First into GlobalISel Or: How to delete SelectionDAG in 100* easy commits 1 LLVM Dev Meeting

Pool::count Pool::grow() Pool::alloc() Pool_element_header Pool_element_header::next

Introduction to X10 Olivier Tardieu IBM Research This material is - PowerPoint PPT Presentation

Introduction to X10 Olivier Tardieu IBM Research This material is based upon work supported by the Defense Advanced Research Projects Agency, by the Air Force Office of Scientific Research, and by the Department of Energy. Take Away X10

X10 X10 Jonathan Lee Jonathan Lee Daniel Lee Daniel Lee What is X10? What is X10?

X10 Cluster SSH access X10 on your PC Eclipse for X10: x10dt From Eclipse to

APGAS Programming in X10 http://x10-lang.org This tutorial was originally given by Olivier

X10: a High-Productivity Approach to X10: a High-Productivity Approach to High Performance

APGAS Programming in X10 http://x10-lang.org This tutorial was originally given by Olivier

X10: New opportunities for X10: New opportunities for Compiler-Driven Performance

Local Parallel Iteration in X10 Josh Milthorpe IBM Research This material is based upon work

A Resilient Framework for Iterative Linear Algebra Applications in X10 Sara S. Hamouda

Cutting Out the Middleman: OS-Level Support for X10 Activities Manuel Mohr, Sebastian Buchwald,

Dynamic X10 Resource-Aware Programming for Higher Efficiency Manuel Mohr, Andreas Zwinkau,

Conflict Driven Learning and Non-chronological Backtracking x1 + x4 x1 + x3 + x8 x1 + x8

8-Puzzle State: x00 x01 x02 x10 x11 x12 x20 x21 x22 xij in {0, , 8}, i, j in {0, 1, 2} xij

Mac OS 10.12 Sierra Introduction: ! Sierra 10.12 is the latest Macintosh operating system from

FIRST FLOOR PLAN ELEVATION A PORCH 18'-0&quot;x5'-0&quot; FIRST FLOOR PLAN ELEVS. B, D PORCH

How is python used in biomolecular sciences? Antonia Mey antonia.mey@ed.ac.uk @ppxasjsm L.

Operational amplifiers Types of operational amplifiers (bioelectric amplifiers have different

Changelog Changes made in this version not seen in fjrst lecture: sz is one past the end of the

2.2 Transformations Hao Li http://cs420.hao-li.com 1 OpenGL Transformations Matrices

Completeness of Resolution 5ai We will show by construction: AUTOMATED REASONING If clauses S

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Physics Department 8.044 Statistical Physics I Spring Term

ESQPT in systems with long-range interactions Lea F. Santos Yeshiva University, New York, NY, USA

Valuing ISR Resources Tod S. Levi6 * , Kellen G. Leister * ,

Head First into GlobalISel Or: How to delete SelectionDAG in 100* easy commits 1 LLVM Dev Meeting

Pool::count Pool::grow() Pool::alloc() Pool_element_header Pool_element_header::next

FIRST FLOOR PLAN ELEVATION A PORCH 18'-0"x5'-0" FIRST FLOOR PLAN ELEVS. B, D PORCH