parakeet: A Just-in-Time Parallel Accelerator for Numerical Python (PowerPoint PPT Presentation)




SLIDE 1

parakeet

A Just-in-Time Parallel Accelerator for Numerical Python

Alex Rubinsteyn, Eric Hielscher, Nathaniel Weinman, Dennis Shasha (New York University)

Friday, June 8, 2012

SLIDE 2

Naive Python Code (is slow)

Count the number of times a value occurs within an array:

    def count(big_array, target):
        c = 0
        for x in big_array:
            if x == target:
                c += 1
        return c

Takes ~10 minutes on a billion integers.


SLIDE 3

NumPy exists for a reason

    def count(big_array, target):
        return np.sum(big_array == target)

Runs in 6.62 seconds, an 88X improvement!

However:
➡ Creates a large temporary array
➡ Only uses a single core

Can we do better without leaving Python?
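The "large temporary array" drawback is easy to see directly: `big_array == target` materializes a full-size boolean array before `np.sum` ever looks at it. A small illustration (array size chosen arbitrarily here):

```python
import numpy as np

# The "large temporary" in action: big_array == 42 allocates a
# full-size boolean intermediate before np.sum consumes it.
big_array = np.random.randint(0, 100, size=1_000_000)
mask = big_array == 42        # temporary boolean array, one byte per element
temp_bytes = mask.nbytes      # ~1 MB spent on the intermediate alone
count = int(np.sum(mask))
```

On a billion-element input, that intermediate alone is ~1 GB of allocation and memory traffic, on top of running on a single core.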


SLIDE 4

Parakeet to the Rescue (Sequential version)

  • @PAR decorator marks the boundary between Parakeet and Python
  • Dynamically compiled to (sequential) LLVM

    from parakeet import PAR

    @PAR
    def count(big_array, target):
        c = 0
        for x in big_array:
            if x == target:
                c += 1
        return c

Runs in 1.4 seconds!


SLIDE 5

Let’s Get Parallel

    @PAR
    def count(big_array, t):
        return parakeet.sum(big_array == t)

Runs in 0.2 seconds across 8 cores!
  • ~3000X faster than naive Python
  • ~33X faster than NumPy

...but where did the parallelism come from?


SLIDE 6

meet the adverbs

Adverbs are higher-order array operators:

  • map : transform each element or subarray
  • reduce : sum, min, etc...
  • scan : reduction which keeps intermediate values (e.g. prefix sum)
  • allpairs : transform all pairs of elements or subarrays (e.g. matrix multiply)

Adverbs are abstract enough to admit many implementations: sequential, multicore, GPU kernel, loop within a kernel.
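The sequential semantics of the four adverbs can be sketched in plain Python (hypothetical reference implementations for clarity only; these are not Parakeet's internals, and the `par_` names are made up):

```python
import numpy as np

def par_map(f, xs):
    # transform each element independently (trivially parallelizable)
    return np.array([f(x) for x in xs])

def par_reduce(f, xs):
    # combine elements with an associative binary operator
    acc = xs[0]
    for x in xs[1:]:
        acc = f(acc, x)
    return acc

def par_scan(f, xs):
    # a reduction that keeps all intermediate values (e.g. prefix sum)
    acc = xs[0]
    out = [acc]
    for x in xs[1:]:
        acc = f(acc, x)
        out.append(acc)
    return np.array(out)

def par_allpairs(f, xs, ys):
    # apply f to every pair drawn from xs and ys
    return np.array([[f(x, y) for y in ys] for x in xs])
```

Each of these loops can be replaced by a multicore or GPU schedule without changing the result, which is exactly what makes adverbs a useful compilation target.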


SLIDE 7

Adverbs in disguise

    parakeet.sum(big_array == t)

Array broadcasting will get rewritten as:

    map(eq, big_array, t)

sum is a library function, defined in Python as:

    def sum(x):
        return reduce(add, x)

No parallelism without adverbs, but they don't always have to be explicit.
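The rewrite can be mimicked with the stdlib (illustrative only: this uses `operator.eq`/`operator.add` and `functools.reduce`, not Parakeet's own adverbs, and runs sequentially):

```python
from operator import add, eq
from functools import reduce

big_array = [3, 1, 3, 3, 2]
t = 3

# broadcasting (big_array == t) becomes a map of the scalar eq function
mask = list(map(lambda x: eq(x, t), big_array))

# the library sum becomes a reduce with scalar add;
# True + True + ... counts the matches
count = reduce(add, mask)
```

Once the program is phrased this way, the `map` and the `reduce` are both adverbs, and both are fair game for parallel execution.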


SLIDE 8

Python Subset

Most Python won’t run in Parakeet:

  • Need source (nothing pre-compiled)
  • No non-uniform data structures: lists, sets, dictionaries, etc...
  • No support for user-defined objects, exceptions, generators, etc...
  • Restrictions recursively apply to every called function


SLIDE 9

Is anything left?

scalars + control flow + arrays + adverbs

  • numbers, booleans, tuples, None
  • math & logic operators, NumPy ufuncs
  • loops, if statements
  • array literals & functions like arange
  • array attributes (e.g. shape, T)
  • Parakeet’s adverbs (e.g. map, reduce, ...)

If it’s not supported, leave it in Python
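For concreteness, here is a made-up pair of functions: the first stays within the subset just described (scalars, loops, ifs, arrays), while the second builds a dictionary (a non-uniform data structure) and so would stay in regular Python:

```python
import numpy as np

# Fits the subset: scalars, control flow, array iteration.
def clamped_sum(xs, lo, hi):
    total = 0.0
    for x in xs:
        if x < lo:
            total += lo
        elif x > hi:
            total += hi
        else:
            total += x
    return total

# Does NOT fit the subset: builds a dict, so Parakeet would
# leave it to the ordinary Python interpreter.
def histogram(xs):
    counts = {}
    for x in xs:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
```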


SLIDE 10

How does it work?

    @PAR
    def f(x):
        return x + 1

1. wrap — the decorator parses the function source and translates it to an untyped intermediate language

2. specialize — each call site produces a type-specialized version:

    f(673.6)         ⇒  f(x : float) { return x +float 1.0 }
    f(np.arange(5))  ⇒  f(x : array1<int>) { return map(+int, x, 1) }

3. schedule & compile — decide where each adverb should run, synthesize native code

4. execute — add tasks to a work queue (multicore), or transfer data & launch a kernel (GPU)
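Steps 1-2 can be sketched as a toy decorator that caches one "specialization" per argument-type signature. This is an assumption-laden mock-up: real Parakeet compiles a typed IL to LLVM, whereas here a specialization is just the original Python closure, and the `signature` scheme (dtype + rank for arrays) is invented for illustration:

```python
import numpy as np

def PAR(f):
    cache = {}  # one entry per argument-type signature

    def signature(args):
        # type key: (dtype, rank) for arrays, plain type name otherwise
        return tuple(
            ('array', a.dtype.name, a.ndim) if isinstance(a, np.ndarray)
            else type(a).__name__
            for a in args)

    def wrapper(*args):
        key = signature(args)
        if key not in cache:
            cache[key] = f  # stand-in for "specialize & compile for key"
        return cache[key](*args)

    wrapper.cache = cache
    return wrapper

@PAR
def f(x):
    return x + 1
```

Calling `f(673.6)` and `f(np.arange(5))` populates two distinct cache entries, mirroring the two specializations shown above.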


SLIDE 11

Details: Typed IL

  • Every value annotated with a type
  • Rewrite polymorphism into coercions (e.g. addition becomes +int32, +float64, ...)
  • Array broadcasting & indexing ⇒ maps
  • Optimized aggressively (adverb fusion)

ScalarType = i8 | ... | i64 | f32 | f64
Type = scalar | tuple | array {ScalarType, rank}


SLIDE 12

Parallelizing Adverbs is (conceptually) easy

map(f, concat(x, y)) = concat(map(f, x), map(f, y))
reduce(f, concat(x, y)) = f(reduce(f, x), reduce(f, y))

In practice, the split/recombine logic is more complicated and the implementations are messy.
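Both identities are easy to check with NumPy: an elementwise map and a reduce with an associative operator each distribute over concatenation (names below are local to this snippet):

```python
import numpy as np

f = lambda v: v * v            # elementwise function for the map
x = np.arange(1, 6)            # [1..5]
y = np.arange(6, 11)           # [6..10]

# map(f, concat(x, y)) == concat(map(f, x), map(f, y))
whole_map = f(np.concatenate([x, y]))
split_map = np.concatenate([f(x), f(y)])

# reduce(add, concat(x, y)) == add(reduce(add, x), reduce(add, y))
whole_red = np.add.reduce(np.concatenate([x, y]))
split_red = np.add.reduce(x) + np.add.reduce(y)
```

The reduce identity is what licenses giving each core its own chunk and combining partial results; it only holds because the operator is associative.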


SLIDE 13

Adverb Parallelization

GPU:
  • Kernel templates for each adverb (splice in the user-defined function)
  • Adverb-specific launching logic

CPU:
  • Threaded work queue
  • Adverbs implemented as loops (same as single-core)
  • Adverb-specific logic for combining the output of each worker
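A minimal sketch of the CPU scheme, assuming the split/recombine identity from the previous slide: chunk the input, run the loop implementation on each worker via a thread pool, then combine partials with the adverb-specific logic (for reduce, just reduce again). `ThreadPoolExecutor` stands in for Parakeet's own work queue:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def threaded_reduce(f, xs, n_workers=4):
    # split the input into one chunk per worker
    chunks = [c for c in np.array_split(xs, n_workers) if len(c)]

    def worker(chunk):
        # the adverb's loop implementation (same as single-core)
        acc = chunk[0]
        for v in chunk[1:]:
            acc = f(acc, v)
        return acc

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, chunks))

    # adverb-specific combine step: reduce the partial results
    acc = partials[0]
    for p in partials[1:]:
        acc = f(acc, p)
    return acc
```

(In CPython the GIL limits the speedup for pure-Python workers; Parakeet avoids this by running compiled native loops on each thread.)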


SLIDE 14

Scheduling

Different locations where an adverb can run:
  • Multicore backend: interpreter, multicore, sequential
  • GPU backend: interpreter, kernel, thread

Choose locations which minimize a (very naive) cost:
  • Scalar operations all have the same constant cost
  • Loops are assumed to execute only once
  • Sequential adverbs: cost(nested fn) * number of elements
  • Parallel adverbs: divide by number of processors

Special considerations for the GPU:
  • memory transfer cost
  • tree-structured scans and reductions
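The cost rules above can be written as a tiny recursive function. The node shapes (dicts with a `kind` field) are invented for this sketch and are not Parakeet's real IL:

```python
def cost(node, n_procs=8):
    # Toy version of the naive cost model: scalars are unit cost,
    # loops are assumed to run once, sequential adverbs multiply the
    # nested function's cost by the element count, and parallel
    # adverbs divide that by the processor count.
    kind = node['kind']
    if kind == 'scalar':
        return 1.0
    if kind == 'loop':
        return cost(node['body'], n_procs)       # "executes only once"
    if kind == 'adverb':
        total = cost(node['fn'], n_procs) * node['n_elts']
        if node['parallel']:
            total /= n_procs
        return total
    raise ValueError('unknown node kind: %r' % kind)

# a scalar-bodied adverb over 800 elements, scheduled both ways
seq = {'kind': 'adverb', 'parallel': False, 'n_elts': 800,
       'fn': {'kind': 'scalar'}}
par = dict(seq, parallel=True)
```

Under this model the parallel schedule is 8X cheaper (800 vs 100), so the scheduler picks it whenever the division by P outweighs launch and transfer overheads.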


SLIDE 15

Runtime Odds & Ends

Lots of plumbing!

  • Shape inference
  • Keep track of multiple function specializations
  • Code caches for CPU & GPU implementations of adverb instances
  • What data is already on the GPU?
  • What data is no longer used?


SLIDE 16

It’s Not Magic

Matrix multiplication, Parakeet style:

    parakeet.allpairs(parakeet.dot, X, Y.T)

With 1000x1000 inputs:
  • Parakeet: 310 ms (8 CPU cores)
  • NumPy: 90 ms (single-core BLAS)

We’re ignoring data layout and cache locality.
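The semantics of that one-liner: dot every row of X with every row of Y.T (i.e. every column of Y), which is exactly matrix multiplication. A plain-Python `allpairs` (written here for clarity, not speed) makes the correspondence checkable; like the Parakeet version, it does nothing about data layout or cache locality:

```python
import numpy as np

def allpairs(f, xs, ys):
    # apply f to every (row of xs, row of ys) pair
    return np.array([[f(x, y) for y in ys] for x in xs])

X = np.arange(6.0).reshape(2, 3)
Y = np.arange(12.0).reshape(3, 4)

# allpairs(dot, X, Y.T): row i of X dotted with column j of Y
product = allpairs(np.dot, X, Y.T)
```

A tuned BLAS wins precisely because it blocks and transposes for the cache, which is what the next slide's data-layout work is about.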


SLIDE 17

What’s Next?

  • Dynamically choose better data layout, transposed copy to local buffer (huge performance gains on both CPU and GPU)
  • Fix our busted GPU backend (moving to LLVM for saner PTX generation)
  • Heterogeneity! (if we have multiple backends, why can’t they split the work?)
  • A less naive cost model (need to know how much work to give each backend)


SLIDE 18

Summary

  • Restricting the programmer liberates the compiler
  • Higher-order array operators (“adverbs”) admit diverse (parallel) implementations
  • Many adverbs are hiding in array-oriented code
  • Python can be as “fast as C”, for a sufficiently small definition of Python


SLIDE 19

Thanks For Listening!
