An embedded language for data-parallel programming

SLIDE 1

An embedded language for data-parallel programming

Master of Science Thesis in Computer Science
By Joel Svensson
Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
GÖTEBORGS UNIVERSITY
Göteborg, Sweden

SLIDE 2

Obsidian: an embedded language for data-parallel programming

Data-parallel programming
General-Purpose computations on the GPU (GPGPU)
Lava

NVIDIA 8800 GPU

SLIDE 3

Project Outline

An embedded language for data-parallel programming
Lava programming style using combinators
Generate C code for NVIDIA GPU

SLIDE 4

Data-parallel programming

Single sequential program
Executed by a number of processing elements
Operating on different data

for j := 1 to log(n) do
  for all k in parallel do
    if ((k+1) mod 2^j) = 0 then
      x[k] := x[k - 2^(j-1)] + x[k]
    fi
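
A minimal sequential simulation of this summation network, written as a plain Haskell sketch (the name sumNetwork and the list-based representation are mine, not from the thesis): after ceil(log2 n) steps the total ends up in the last cell.

-- Sequential simulation of the parallel summation loop above.
sumNetwork :: [Int] -> [Int]
sumNetwork xs = foldl step xs [1 .. lg]
  where
    n  = length xs
    lg = ceiling (logBase 2 (fromIntegral n) :: Double) :: Int
    -- one parallel step: every "processing element" k updates its cell
    step x j =
      [ if (k + 1) `mod` (2 ^ j) == 0
          then x !! (k - 2 ^ (j - 1)) + x !! k
          else x !! k
      | k <- [0 .. n - 1] ]

-- e.g. last (sumNetwork [1..8]) == 36, the sum of 1..8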

SLIDE 5

GPGPU

GPUs are relatively cheap
High performance (hundreds of GFLOPS)

Applications:
Physics simulation
Bioinformatics
Sorting

www.gpgpu.org

SLIDE 6

GPU vs CPU GFLOPS chart

SLIDE 7

NVIDIA 8800 GPUs

A set of SIMD multiprocessors
8 SIMD processing elements per multiprocessor
Up to 16 multiprocessors in one GPU
Giving 128 processing elements in total

www.nvidia.com

SLIDE 8

NVIDIA 8800 GPUs

SLIDE 9

NVIDIA Compute Unified Device Architecture (CUDA)

C compiler and libraries for the GPU
GPU as a highly parallel co-processor
For use with NVIDIA's 8800 series GPUs

www.nvidia.com/cuda

SLIDE 10

CUDA Programming model

High number of threads
  Divided into blocks

Thread block
  512 threads
  Divided into warps
  Executed on one multiprocessor

SLIDE 11

CUDA Synchronisation

CUDA supplies a synchronisation primitive, __syncthreads()
  Barrier synchronisation
  Across all the threads of a block

Coordinate communication

SLIDE 12

Obsidian

Embedded in Haskell
Presents a high-level programmer's interface
Parallel computations described using combinators
CUDA C code is generated

SLIDE 13

Obsidian

Describes computations on arrays:
  Length homogeneous

Sorting algorithms
  Integer values

Limitations:
  Currently limited to iterative sorting algorithms

SLIDE 14

Obsidian Programming

Basics
  Sequential composition of programs: ->-
  Parallel composition of programs: parl
  Index operations: rev, riffle, unriffle
  Array operations: halve, conc
  Apply or map: fun
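
A rough plain-list reading of these basics, as a sketch only: the real combinators work on Obsidian's Arr type inside the W monad, and the exact behaviour of parl is an assumption here (taken, Lava-style, to run the two programs on the two halves of the input).

-- Sequential composition, modelled as plain function composition.
(->-) :: (a -> b) -> (b -> c) -> (a -> c)
f ->- g = g . f

-- Apply or map: lift an element-wise function over the whole array.
fun :: (a -> b) -> [a] -> [b]
fun = map

-- Reverse the array.
rev :: [a] -> [a]
rev = reverse

-- Parallel composition (assumed semantics): run one program on the
-- first half of the input and the other on the second half.
parl :: ([a] -> [b]) -> ([a] -> [b]) -> [a] -> [b]
parl f g xs = f top ++ g bot
  where (top, bot) = splitAt (length xs `div` 2) xs

-- e.g. (rev ->- fun (+1)) [1,2,3] == [4,3,2], matching rev_incr on slide 19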

SLIDE 15

Obsidian Programming

Array Operations
  halve
  conc
  oeSplit
  shuffle

SLIDE 16

Obsidian Programming

Index Operations
  rev
  riffle
  unriffle

riffle = halve ->- shuffle

SLIDE 17

unriffle

unriffle = oeSplit ->- conc
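
The equations riffle = halve ->- shuffle and unriffle = oeSplit ->- conc can be mirrored on plain Haskell lists. This is a sketch under the assumption that conc concatenates the two halves and oeSplit separates the even- and odd-indexed elements.

-- Split an array into its two halves.
halve :: [a] -> ([a], [a])
halve xs = splitAt (length xs `div` 2) xs

-- Interleave the two halves element by element.
shuffle :: ([a], [a]) -> [a]
shuffle (xs, ys) = concat (zipWith (\x y -> [x, y]) xs ys)

-- Split into the even-indexed and odd-indexed elements.
oeSplit :: [a] -> ([a], [a])
oeSplit xs = ( [x | (i, x) <- zip [0 :: Int ..] xs, even i]
             , [x | (i, x) <- zip [0 :: Int ..] xs, odd  i] )

-- Concatenate the two halves again.
conc :: ([a], [a]) -> [a]
conc (xs, ys) = xs ++ ys

riffle, unriffle :: [a] -> [a]
riffle   = shuffle . halve      -- riffle   = halve   ->- shuffle
unriffle = conc . oeSplit       -- unriffle = oeSplit ->- conc

-- e.g. riffle [1..8] == [1,5,2,6,3,7,4,8], and unriffle (riffle [1..8]) == [1..8]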

SLIDE 18

Obsidian Programming

Apply or map: fun
Sequential composition of programs: ->-
Parallel composition of programs: parl

SLIDE 19

Obsidian Programming: an example

rev_incr :: Arr (Exp Int) -> W (Arr (Exp Int))
rev_incr = rev ->- fun (+1) ->- sync

*Obsidian> execute rev_incr [1,2,3]
[4,3,2]

SLIDE 20

Obsidian Synchronisation

Synchronisation primitive: sync
  All array elements are updated after a sync
  Only applicable at top-level

Inherits its behaviour from CUDA's __syncthreads()

SLIDE 21

Generating C Code

Generate CUDA C Code for NVIDIA GPU
  Executed as one block of threads

Pros
  Communication and synchronisation possible

Cons
  Upper limit of 512 threads per block
  Does not use entire GPU

SLIDE 22

Generating C Code

Each thread is in charge of calculating one array element
  Limits array size to 512 elements
  Leads to some redundancy

Swap operation performed by two threads in cooperation

SLIDE 23

Generating C Code

__global__ static void reverse(int *values, int n) {
    extern __shared__ int shared[];
    const int tid = threadIdx.x;
    int tmp;
    shared[tid] = values[tid];
    __syncthreads();
    tmp = shared[((n - 1) - tid)];
    __syncthreads();
    shared[tid] = tmp;
    __syncthreads();
    values[tid] = shared[tid];
}

reverse = rev ->- sync

SLIDE 24

Generating C Code

__global__ static void example(int *values, int n) {
    extern __shared__ int shared[];
    const int tid = threadIdx.x;
    int tmp;
    shared[tid] = values[tid];
    __syncthreads();
    tmp = f(shared[i1], ..., shared[in]);
    __syncthreads();
    shared[tid] = tmp;
    __syncthreads();
    values[tid] = shared[tid];
}

SLIDE 25

Generating C Code

__global__ static void example(int *values, int n) {
    extern __shared__ int shared[];
    const int tid = threadIdx.x;
    int tmp;
    shared[tid] = values[tid];
    __syncthreads();
    tmp = f(shared[i1], ..., shared[in]);
    __syncthreads();
    shared[tid] = tmp;
    __syncthreads();
    values[tid] = shared[tid];
}


SLIDE 26

Implementing a sorter

A two-sorter sorts a pair of values:

cmpSwap op (a,b) = ifThenElse (op a b) (a,b) (b,a)

Sort each pair of elements in an array:

sort2 = (pair ->- fun (cmpSwap (<*)) ->- unpair ->- sync)

*Obsidian> execute sort2 [2,3,5,1,6,7]
[2,3,1,5,6,7]
*Obsidian> execute sort2 [2,1,2,1,2,1]
[1,2,1,2,1,2]
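
A plain-list sketch of what sort2 computes, assuming pair groups neighbouring elements and unpair flattens them back; (<) stands in for the expression-level comparison <*.

-- Compare-and-swap on a pair, keeping the smaller element first.
cmpSwap :: (a -> a -> Bool) -> (a, a) -> (a, a)
cmpSwap op (a, b) = if op a b then (a, b) else (b, a)

-- Group neighbouring elements into pairs, and flatten back.
pair :: [a] -> [(a, a)]
pair (x:y:rest) = (x, y) : pair rest
pair _          = []

unpair :: [(a, a)] -> [a]
unpair = concatMap (\(x, y) -> [x, y])

-- List-level counterpart of sort2 (without sync):
sort2 :: Ord a => [a] -> [a]
sort2 = unpair . map (cmpSwap (<)) . pair

-- e.g. sort2 [2,3,5,1,6,7] == [2,3,1,5,6,7], matching the run above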

SLIDE 27

Implementing a sorter

A more efficient pairwise sort:

sortEvens = evens (cmpSwap (<*)) ->- sync

*Obsidian> execute sortEvens [2,3,5,1,6,7]
[2,3,1,5,6,7]
*Obsidian> execute sortEvens [2,1,2,1,2,1]
[1,2,1,2,1,2]

SLIDE 28

Implementing a sorter

evens

SLIDE 29

Implementing a sorter

A close relative of evens is odds:

sortOdds = odds (cmpSwap (<*)) ->- sync

*Obsidian> execute sortOdds [5,3,2,1,4,6]
[5,2,3,1,4,6]
*Obsidian> execute sortOdds [1,2,1,2,1,2]
[1,1,2,1,2,2]
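
Continuing the list-level sketch, the assumed semantics of evens and odds is to apply the two-input operation to adjacent pairs starting at an even index or at an odd index, leaving any leftover element untouched (cmpSwap as in the sketch under slide 26).

-- Apply f to the pairs starting at even indices: (x0,x1), (x2,x3), ...
evens :: ((a, a) -> (a, a)) -> [a] -> [a]
evens f (x:y:rest) = let (x', y') = f (x, y) in x' : y' : evens f rest
evens _ xs         = xs

-- Apply f to the pairs starting at odd indices: (x1,x2), (x3,x4), ...
odds :: ((a, a) -> (a, a)) -> [a] -> [a]
odds f (x:rest) = x : evens f rest
odds _ xs       = xs

-- e.g. evens (cmpSwap (<)) [2,3,5,1,6,7] == [2,3,1,5,6,7]
--      odds  (cmpSwap (<)) [5,3,2,1,4,6] == [5,2,3,1,4,6]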

SLIDE 30

Implementing a sorter

odds
SLIDE 31

Odd Even Transposition Sort

Sorter implemented using odds and evens:

sortOETCore = sortEvens ->- sortOdds
sortOET arr =
  let n = len arr
  in (repE (idiv (n+1) 2) sortOETCore) arr
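
Using the list-level evens, odds and cmpSwap from the sketches above, the whole sort can be mirrored sequentially; repE and idiv are modelled here simply by repeated application and integer division (a sketch, not the thesis code).

-- ceil(n/2) rounds of an even phase followed by an odd phase
-- are enough to sort n elements.
sortOET :: Ord a => [a] -> [a]
sortOET xs = iterate core xs !! ((n + 1) `div` 2)
  where
    n    = length xs
    core = odds (cmpSwap (<)) . evens (cmpSwap (<))

-- e.g. sortOET [5,3,2,1,4,6] == [1,2,3,4,5,6]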

SLIDE 32

Odd Even Transposition Sort

SLIDE 33

VSort

Another iterative sorter
log²(n) depth
Built around a shuffle exchange network:

shex f n = rep n (riffle ->- evens f ->- sync)

SLIDE 34

VSort

Merger implemented using shex:

bmergeIt n = shex (cmpSwap (<*)) n

*Obsidian> execute (shex (cmpSwap (<*)) 3) [2,4,6,8,7,5,3,1]
[1,2,3,4,5,6,7,8]
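
With the earlier list models of riffle, evens and cmpSwap, shex becomes n passes of a riffle followed by a compare-exchange phase; rep and sync are modelled by plain iteration. This sketch reproduces the run shown above.

-- n passes of: riffle, then compare-exchange adjacent pairs.
shex :: ((a, a) -> (a, a)) -> Int -> [a] -> [a]
shex f n xs = iterate (evens f . riffle) xs !! n

-- e.g. shex (cmpSwap (<)) 3 [2,4,6,8,7,5,3,1] == [1,2,3,4,5,6,7,8]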

SLIDE 35

VSort

Sorter implemented using bmergeIt:

vmergeIt n = tblLook tautab ->- sync ->- bmergeIt n
vsortIt n = rep n (vmergeIt n)

SLIDE 36

Comparison of sorters

Six different sorters
  Bitonic sort on CPU
  Odd Even Transposition sort
  Three versions of VSort
  CUDA Bitonic sort on GPU

Data and hardware
  288 MB of random data
  CPU: 2.4 GHz Intel Core 2
  GPU: 1.2 GHz NVIDIA 8800 GTS (shader clock)

SLIDE 37

Comparison of sorters

SLIDE 38

Related work

Pan
  Embedded in Haskell
  Image synthesis
  Generates C code

Vertigo
  Embedded in Haskell
  Describes shaders
  Generates GPU programs

SLIDE 39

Related work

PyGPU
  Embedded in Python
  Uses Python's introspective abilities
  Graphics applications

SLIDE 40

Related work

NESL
  Functional language
  Nested data-parallelism
  Compiles into VCode

Data Parallel Haskell
  Nested data-parallelism in Haskell

SLIDE 41

Future work

Solve the recursion dilemma
  Enable the description of recursive sorters
  Bitonic Sort

Make use of entire GPU
Optimise the generated code
More generality
  Not just sorters

Other target platforms

SLIDE 42

Future work

More generality
  Arr a -> Arr b (not just Arr Int -> Arr Int)
  Matrices
  Pairs of arrays to arrays
  Arrays of pairs to arrays
  Drop the length-homogeneity requirement