Purely Functional GPU Programming with Futhark Troels Henriksen

Purely Functional GPU Programming with Futhark Troels Henriksen (athas@sigkill.dk) Computer Science University of Copenhagen February 4th 2017

Agenda

The Problem
Modern hardware can handle (and requires) tens to hundreds of thousands of parallel threads. The human mind cannot handle this.

The Solution
Functional array programming is a restricted programming paradigm that performs well in practice and is easy to reason about (for some problems).

Agenda:
1. Array programming with Futhark
2. Inter-operability (with Python)
3. GPU performance compared to hand-written code

Two Kinds of Parallelism

Task parallelism is the simultaneous execution of different functions across the same or different datasets:

    spawn_thread(f, x)
    spawn_thread(g, y)

Data parallelism is the simultaneous execution of the same function across the elements of a dataset:

    map f [v_0, v_1, ..., v_{n-1}] = [f v_0, f v_1, ..., f v_{n-1}]

Array programming is in the latter category.
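The two kinds of parallelism can be sketched in Python with the standard library's thread pool. This is only an illustration of the shape of each style: the functions f and g are placeholders invented here, and CPython threads do not give real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

def f(x):
    # Placeholder function (an assumption for this sketch).
    return x * x

def g(y):
    # A second, different placeholder function.
    return y + 1

with ThreadPoolExecutor() as pool:
    # Task parallelism: different functions submitted as separate tasks.
    task_f = pool.submit(f, 3)
    task_g = pool.submit(g, 10)
    task_results = (task_f.result(), task_g.result())

    # Data parallelism: the same function applied to every element.
    data_results = list(pool.map(f, [0, 1, 2, 3]))
```

In the data-parallel case the programmer only names the function and the dataset; how the elements are divided among workers is left to the runtime.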

Array Programming

Programs are expressed as bulk operations on arrays. In Python with Numpy:

    >>> import numpy as np
    >>> a = np.arange(10)
    >>> b = a * 2
    >>> sum(a*b)
    570

Popular because it resembles mathematics.

Old: first seen in APL from 1964:

    a ← ⍳10
    b ← a × 2
    +/a × b

Less popular.

Futhark at a Glance

Small eagerly evaluated pure functional language with data-parallel looping constructs. Syntax is a combination of C, SML, and Haskell.

Data-parallel loops

    fun add_two (a: [n]i32): [n]i32 = map (+2) a
    fun sum (a: [n]i32): i32 = reduce (+) 0 a
    fun sumrows (as: [n][m]i32): [n]i32 = map sum as

Sequential loops

    fun main (n: i32): i32 =
      loop (x = 1) = for i < n do
        x * (i + 1)
      in x

Array Construction

    iota 5 = [0,1,2,3,4]
    replicate 3 1337 = [1337, 1337, 1337]
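For readers more at home in Python, the constructs above can be mirrored roughly as follows. This is a sequential sketch of what each definition computes, not of how Futhark executes it. The name array_sum is chosen here to avoid shadowing Python's built-in sum, and Python integers do not wrap around the way i32 does.

```python
from functools import reduce
import operator

def add_two(a):
    # map (+2) a
    return [x + 2 for x in a]

def array_sum(a):
    # reduce (+) 0 a
    return reduce(operator.add, a, 0)

def sumrows(rows):
    # map sum as
    return [array_sum(row) for row in rows]

def main(n):
    # loop (x = 1) = for i < n do x * (i + 1) in x
    x = 1
    for i in range(n):
        x = x * (i + 1)
    return x

def iota(n):
    # iota n = [0, 1, ..., n-1]
    return list(range(n))

def replicate(n, v):
    # replicate n v = n copies of v
    return [v] * n
```

Under this reading, main n computes the factorial of n, for example main(5) gives 120.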

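Computing the Mandelbrot Set

The root of those pretty visuals is calling this function (here in Python) with a bunch of complex numbers:

    def dot(z):
        # Squared magnitude of z, as used in the Numpy version.
        return z.real * z.real + z.imag * z.imag

    def divergence(c, d):
        i = 0
        z = c
        # Stop after d iterations, or as soon as z escapes.
        while i < d and dot(z) < 4.0:
            z = c + z * z
            i = i + 1
        return i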

Mandelbrot in Numpy [1]

    def mandelbrot_numpy(c, d):
        output = np.zeros(c.shape)
        z = np.zeros(c.shape, np.complex64)
        for it in range(d):
            notdone = np.less(z.real * z.real + z.imag * z.imag, 4.0)
            output[notdone] = it
            z[notdone] = z[notdone] ** 2 + c[notdone]
        return output

Problems
Control flow obscured.
Always runs for all d (maxiter) iterations.
Lots of memory traffic: three arrays written for every iteration of the loop.

[1] https://www.ibm.com/developerworks/community/blogs/jfp/entry/How_To_Compute_Mandelbrodt_Set_Quickly

Mandelbrot in Futhark

    fun divergence (c: complex) (d: i32): i32 =
      loop ((z, i) = (c, 0)) = while i < d && dot(z) < 4.0 do
        (addComplex(c, multComplex(z, z)), i + 1)
      in i

    fun mandelbrot (css: [n][m]complex) (d: i32): [n][m]i32 =
      map (\cs -> map (\c -> divergence c d) cs) css

Only one array written, at the end.
The while loop terminates when the element diverges.
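A sequential Python sketch of the same nested-map structure makes the early-exit point concrete. The four sample points and the step counts are illustrative assumptions, not figures from the talk. The sketch compares the loop steps actually taken per element with the steps a fixed-iteration, masked formulation would spend.

```python
def dot(z):
    # Squared magnitude of a complex number.
    return z.real * z.real + z.imag * z.imag

def divergence(c, d):
    # Iterate z = c + z*z starting from z = c; stop after d steps
    # or as soon as the element diverges.
    z, i = c, 0
    while i < d and dot(z) < 4.0:
        z = c + z * z
        i += 1
    return i

def mandelbrot(css, d):
    # Nested map: outer comprehension over rows, inner over elements.
    return [[divergence(c, d) for c in cs] for cs in css]

# Hypothetical 2x2 input chosen for this sketch.
css = [[0j, 2 + 2j], [1 + 0j, 0j]]
escapes = mandelbrot(css, 100)

# Steps taken with per-element early exit versus a fixed d steps per element.
early_exit_steps = sum(sum(row) for row in escapes)
fixed_steps = 100 * sum(len(row) for row in css)
```

Here the two points that never diverge use all 100 steps, while the other two stop after 0 and 1 steps, so the early-exit total is about half of the fixed-iteration total for this input.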

Mandelbrot speedup on GPU compared to sequential implementation in C

[Figure: two panels, Numpy-style and Futhark-style, each plotting speedup (y axis) against width and height (x axis, 100 to 1000).]

Moral: The vectorised style can sacrifice a lot of potential performance.

Running a Futhark Program

Define contrived entry point

    fun main (n: i32) (m: i32) (d: i32): i32 =
      let css = make_complex_numbers n m
      let escapes = mandelbrot css d
      in reduce (+) 0 (reshape (n*m) escapes)

Creates some arbitrary complex numbers, computes their divergence, and sums the results.

Futhark is a pure language and cannot read input or write results itself. When launching a Futhark program, we must indicate an entry point and input data.
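The last line of the entry point flattens the two-dimensional result and then reduces it with addition. A small Python sketch of just that step, using a made-up escapes array rather than real Mandelbrot output, might look like this:

```python
from functools import reduce
import operator

def flatten(rows):
    # Plays the role of reshape (n*m): turn an n-by-m array into n*m elements.
    return [x for row in rows for x in row]

def sum_escapes(escapes):
    # reduce (+) 0 over the flattened array.
    return reduce(operator.add, flatten(escapes), 0)

# Hypothetical 2x2 result, assumed only for illustration.
sample_escapes = [[100, 0], [1, 100]]
total = sum_escapes(sample_escapes)
```

Returning a single number keeps the entry point easy to run from the command line, which is why the slide calls it contrived.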

Compile to sequential code

    $ futhark-c mandelbrot.fut -o mandelbrot-c
    $ echo 10000 10000 100 | ./mandelbrot-c -t /dev/stdout
    611240
    999901i32

Compile to parallel (GPU) code

    $ futhark-opencl mandelbrot.fut -o mandelbrot-opencl
    $ echo 10000 10000 100 | ./mandelbrot-opencl -t /dev/stdout
    7550
    999901i32

Advantage
Roughly 80× speedup of parallel over sequential execution (runtimes of 611240 versus 7550, as reported by -t).
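The speedup figure follows directly from the two runtimes printed above. As a quick check, with the helper name speedup being an invention of this sketch:

```python
def speedup(sequential_time, parallel_time):
    # Ratio of runtimes; the unit cancels as long as both use the same one.
    return sequential_time / parallel_time

# Runtimes as printed by the two runs above.
ratio = speedup(611240, 7550)
```

The ratio comes out just under 81, consistent with the rounded 80× on the slide.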

How OpenCL Works

The CPU uploads code and data to the GPU, queues program execution, and copies back results.

[Figure: a sequential CPU program driving a parallel GPU program.]

Observation: the CPU code is all management and bookkeeping and does not need to be particularly fast.

How Futhark Becomes Useful

We can generate the CPU code in whichever language the rest of the user's application is written in. This presents a convenient and conventional API, hiding the fact that GPU calls are happening underneath.

Compiling Futhark to Python+PyOpenCL

    $ futhark-pyopencl --library mandelbrot.fut

This creates a Python module mandelbrot.py which we can use as follows:

    $ python
    >>> import mandelbrot
    >>> m = mandelbrot.mandelbrot()
    >>> m.main(100, 100, 255)
    25246
    >>> m.main(1000, 1000, 300)
    299701

Good for all your mandelbrot summing needs.

Or, we could have our Futhark program return an array containing pixel colour values, and use Pygame to blit it to the screen...

Performance

This is where you should stop trusting me!

There is no good objective criterion for whether a language is “fast”.

Best practice is to take benchmark programs written in other languages, port or re-implement them, and see how they behave.

Speedup Over Hand-Written Rodinia OpenCL Code

[Figure: two bar charts of speedup on the GTX 780 and W8100 GPUs. First chart: Backprop, HotSpot, CFD, K-means. Second chart: LavaMD, NN, SRAD, Myocyte, Pathfinder.]

Summary

Futhark is a small high-level functional data-parallel language with a GPU-targeting optimising compiler.
Can be integrated with existing languages and applications.
Performance is okay.

Questions?

Website: https://futhark-lang.org
Code: https://github.com/HIPERFIT/futhark
Benchmarks: https://github.com/HIPERFIT/futhark-benchmarks
