Design and Analysis of Parallel Programs

TDDD56 Lecture 4, Christoph Kessler

PELAB / IDA, Linköping University, Sweden, 2012

Outline

Lecture 1: Multicore Architecture Concepts
Lecture 2: Parallel programming with threads and tasks
Lecture 3: Shared memory architecture concepts
Lecture 4: Design and analysis of parallel algorithms
  • Parallel cost models
  • Work, time, cost, speedup
  • Amdahl’s Law
  • Work-time scheduling and Brent’s Theorem
Lecture 5: Parallel Sorting Algorithms
…

Parallel Computation Model

= Programming Model + Cost Model


Parallel Computation Models

Shared-Memory Models

  • PRAM (Parallel Random Access Machine) [Fortune, Wyllie ’78], including variants such as Asynchronous PRAM, QRQW PRAM
  • Data-parallel computing
  • Task graphs (circuit model; delay model)
  • Functional parallel programming
  • …

Message-Passing Models

  • BSP (Bulk-Synchronous Parallel) computing [Valiant ’90], including variants such as Multi-BSP [Valiant ’08]
  • LogP
  • Synchronous reactive (event-based) programming, e.g. Erlang
  • …

Cost Model


Flashback to DALG, Lecture 1:

The RAM (von Neumann) model for sequential computing

Basic operations (instructions):
  • Arithmetic (add, mul, …) on registers
  • Load
  • Store
  • Branch

Simplifying assumptions for time analysis:
  • All of these take 1 time unit
  • Serial composition adds time costs: T(op1; op2) = T(op1) + T(op2)
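As a small illustration of these assumptions (my example, not from the slides), summing an array performs one load and one add per element, so under the unit-cost RAM model T(n) = Θ(n):

```c
#include <stddef.h>

/* Sequential sum: about n loads and n-1 additions, so T(n) = Theta(n)
   under the unit-cost RAM assumptions above. */
int seq_sum(const int *d, size_t n) {
    int s = d[0];
    for (size_t i = 1; i < n; i++)
        s += d[i];              /* one load + one add per iteration */
    return s;
}
```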

Analysis of sequential algorithms: RAM model (Random Access Machine)


The PRAM Model – a Parallel RAM


PRAM Variants


Divide&Conquer Parallel Sum Algorithm in the PRAM / Circuit (DAG) cost model

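The cost-model figures for this algorithm are not in the extracted text; the standard analysis in the circuit/DAG model (my summary) is that the combining tree has logarithmic depth and linear work:

```latex
T_\infty(n) = O(\log n) \quad \text{(depth of the binary combining tree)}, \qquad
W(n) = n - 1 \ \text{additions},
```

so the algorithm is work-optimal with respect to the sequential sum.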

Recursive formulation of DC parallel sum algorithm in EREW-PRAM model

Fork-Join execution style: a single thread starts; threads spawn child threads for independent subtasks and synchronize with them.

Implementation in Cilk:

cilk int parsum( int *d, int from, int to )
{
    int mid, sumleft, sumright;
    if (from == to)
        return d[from];                        // base case
    else {
        mid = (from + to) / 2;
        sumleft  = spawn parsum( d, from, mid );
        sumright = parsum( d, mid+1, to );
        sync;
        return sumleft + sumright;
    }
}

Recursive formulation of DC parallel sum algorithm in EREW-PRAM model

SPMD (single-program-multiple-data) execution style: the same code is executed by all threads (PRAM processors) in parallel; threads are distinguished by their thread ID $.



Iterative formulation of DC parallel sum in EREW-PRAM model

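The slide's iterative algorithm is only shown graphically; a minimal sequential simulation of the log2(n) EREW combining rounds (my sketch, not the slide's code) could look like this:

```c
#include <stddef.h>

/* Iterative tree-based sum: in round k, each element whose index is a
   multiple of 2^(k+1) adds in its partner at distance 2^k.  After
   ceil(log2 n) rounds the total is in d[0].  A PRAM would execute each
   round's additions in parallel; the inner loop simulates them. */
int tree_sum(int *d, size_t n) {
    for (size_t stride = 1; stride < n; stride *= 2)   /* one PRAM round */
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            d[i] += d[i + stride];                     /* done by proc i */
    return d[0];
}
```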

Circuit / DAG model

  • Independent of how the parallel computation is expressed, the resulting (unfolded) task graph looks the same.
  • The task graph is a directed acyclic graph (DAG) G = (V, E)
      • Set V of vertices: elementary tasks (taking time 1 resp. O(1))
      • Set E of directed edges: dependences (a partial order on the tasks); (v1, v2) in E means v1 must be finished before v2 can start
  • Critical path = longest path from an entry node to an exit node
      • The length of the critical path is a lower bound for the parallel time complexity
  • Parallel time can be longer if the number of processors is limited
      • Schedule tasks to processors such that dependences are preserved (by the programmer (SPMD execution) or by the run-time system (fork-join execution))
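As an illustration (my example, not from the slides), the critical-path length of a unit-time task DAG can be computed by dynamic programming over a topological order:

```c
#define N 5   /* tasks, assumed already numbered in topological order */

/* edge[i][j] = 1 if task i must finish before task j starts.
   Example DAG (hypothetical): 0->2, 1->2, 2->3, 2->4. */
static const int edge[N][N] = {
    {0, 0, 1, 0, 0},
    {0, 0, 1, 0, 0},
    {0, 0, 0, 1, 1},
    {0, 0, 0, 0, 0},
    {0, 0, 0, 0, 0},
};

/* level[j] = number of unit-time tasks on the longest path ending in j;
   the maximum over all tasks is the critical-path length, a lower bound
   on parallel time even with unlimited processors. */
int critical_path_length(void) {
    int level[N], best = 0;
    for (int j = 0; j < N; j++) {
        level[j] = 1;                     /* the task itself */
        for (int i = 0; i < j; i++)
            if (edge[i][j] && level[i] + 1 > level[j])
                level[j] = level[i] + 1;
        if (level[j] > best) best = level[j];
    }
    return best;
}
```

Here the longest chain is 0 -> 2 -> 3 (or 0 -> 2 -> 4), three unit-time tasks.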

Parallel Time, Work, Cost



Work-optimal and cost-optimal

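The definitions behind these slide titles are only shown graphically; the standard ones (my summary, consistent with the PRAM literature) are:

```latex
T_p(n) = \text{parallel time with } p \text{ processors}, \qquad
W(n) = \text{total number of performed operations}, \qquad
C(n) = p \cdot T_p(n) \ \text{(cost)}.
```

A parallel algorithm is work-optimal if $W(n) = O(T_{\mathrm{seq}}(n))$ and cost-optimal if $C(n) = O(T_{\mathrm{seq}}(n))$, where $T_{\mathrm{seq}}$ is the time of the best sequential algorithm.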

Some simple task scheduling techniques

Greedy scheduling (also known as ASAP, as soon as possible)
Dispatch each task as soon as
  • it is data-ready (its predecessors have finished)
  • and a free processor is available

Critical-path scheduling
Schedule tasks on the critical path first, then insert the remaining tasks where dependences allow, inserting new time steps if no appropriate free slot is available.

Layer-wise scheduling
Decompose the task graph into layers of independent tasks.
Schedule all tasks in a layer before proceeding to the next.


Work-Time (Re)scheduling

(Figure: layer-wise scheduling of the same task graph on 8 processors vs. 4 processors)

Brent’s Theorem [Brent 1974]

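The statement itself is only on the graphical slide; Brent's Theorem in its usual form (my wording) says that a computation with work $W$ and critical-path length (depth) $T_\infty$ can be executed on $p$ processors in time

```latex
T_p \;\le\; \frac{W}{p} + T_\infty ,
```

i.e. a greedy schedule is within a factor of two of the lower bound $\max(W/p,\, T_\infty)$.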

Speedup


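The definition is only on the graphical slide; the usual one (my summary) is

```latex
S_p(n) \;=\; \frac{T_{\mathrm{seq}}(n)}{T_p(n)},
```

where absolute speedup measures against the best sequential algorithm, so $S_p(n) \le p$ in the ideal model; $S_p(n) = \Theta(p)$ is called linear speedup.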

Amdahl’s Law: Upper bound on Speedup


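The formula is only on the graphical slide; Amdahl's Law in its standard form (my summary): if a fraction $s$, $0 < s \le 1$, of the total work is inherently sequential, then

```latex
S_p \;\le\; \frac{1}{\,s + (1-s)/p\,} \;<\; \frac{1}{s}.
```

For example, with $s = 0.1$ the speedup is bounded by 10 no matter how many processors are used.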


Proof of Amdahl’s Law

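A one-line version of the argument (my reconstruction, since the slide is graphical): normalize $T_1 = 1$; the sequential fraction $s$ cannot be sped up, while the parallelizable part $1-s$ takes at least $(1-s)/p$, so

```latex
T_p \;\ge\; s + \frac{1-s}{p}
\quad\Longrightarrow\quad
S_p = \frac{T_1}{T_p} \;\le\; \frac{1}{\,s + (1-s)/p\,}.
```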

Remarks on Amdahl’s Law


Speedup Anomalies


Search Anomaly Example: Simple string search

Given: Large unknown string of length n, pattern of constant length m << n.
Search for any occurrence of the pattern in the string.
Simple sequential algorithm: linear search.

(Figure: string positions 0 … n-1, first occurrence at position t)

Pattern found at the first occurrence at position t in the string after t time steps, or not found after n steps.

Parallel simple string search

Given: Large unknown shared string of length n, pattern of constant length m << n.
Search for any occurrence of the pattern in the string.
Simple parallel algorithm: contiguous partitions, linear search.

(Figure: string split into p contiguous partitions with boundaries at n/p-1, 2n/p-1, 3n/p-1, …, (p-1)n/p-1, n-1)

Case 1: Pattern not found in the string
  • measured parallel time: n/p steps
  • speedup = n / (n/p) = p

Parallel simple string search (cont.)

Case 2: Pattern found in the first position scanned by the last processor
  • measured parallel time: 1 step; sequential time: n - n/p steps
  • observed speedup n - n/p: ”superlinear” speedup?!?
But …
… this is not the worst case (but the best case) for the parallel algorithm;
… and we could have achieved the same effect in the sequential algorithm, too, by altering the string traversal order.
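A tiny sequential simulation of the two cases (my sketch, with pattern length m = 1 for simplicity): each of the p processors scans its own partition position by position, and "time" counts the synchronous rounds until some processor finds the pattern.

```c
#include <stddef.h>

/* Contiguous-partition parallel search for a single character c.
   Returns the number of synchronous rounds until some processor finds
   c, or n/p rounds if c does not occur.  Assumes p divides n. */
size_t parallel_search_rounds(const char *s, size_t n, size_t p, char c) {
    size_t chunk = n / p;                         /* partition length */
    for (size_t step = 0; step < chunk; step++)   /* one round */
        for (size_t proc = 0; proc < p; proc++)   /* done in parallel */
            if (s[proc * chunk + step] == c)
                return step + 1;
    return chunk;
}
```

With n = 8 and p = 4, a pattern character at the first position of the last partition (index 6) is found in 1 round (Case 2), while an absent character costs the full n/p = 2 rounds (Case 1).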

Further fundamental parallel algorithms

  • Parallel prefix sums
  • Parallel list ranking

Data-Parallel Algorithms

Read the article by Hillis and Steele (see Further Reading)

The Prefix-Sums Problem

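The problem statement is only on the graphical slide; the standard formulation (my wording): given $x_1, \dots, x_n$ and an associative binary operation $\oplus$, compute all prefixes

```latex
y_i \;=\; x_1 \oplus x_2 \oplus \cdots \oplus x_i , \qquad i = 1, \dots, n .
```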

Sequential prefix sums algorithm

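The sequential algorithm is only on the graphical slide; for addition it is a single left-to-right pass (my sketch), with n-1 additions, i.e. T(n) = Θ(n) on the RAM:

```c
#include <stddef.h>

/* In-place sequential prefix sums: afterwards x[i] = x[0] + ... + x[i].
   n-1 additions, so T(n) = Theta(n) in the unit-cost RAM model. */
void seq_prefix_sums(int *x, size_t n) {
    for (size_t i = 1; i < n; i++)
        x[i] += x[i - 1];
}
```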

Parallel prefix sums algorithm 1

A first attempt…


Parallel Prefix Sums Algorithm 2:

Upper-Lower Parallel Prefix



Parallel Prefix Sums Algorithm 3:

Recursive Doubling (for EREW PRAM)

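The algorithm itself is only on the graphical slide; a sequential simulation of the recursive-doubling scan in the style of Hillis and Steele (my sketch, using a copy per round to emulate the synchronous read phase of an EREW PRAM):

```c
#include <stddef.h>
#include <string.h>

#define N 8

/* Recursive doubling: in round k, every element with index i >= 2^k
   adds in the value at distance 2^k.  After ceil(log2 n) rounds x[i]
   holds the prefix sum of x[0..i].  O(log n) rounds but O(n log n)
   work: fast, yet not work-optimal. */
void scan_recursive_doubling(int x[N]) {
    int tmp[N];
    for (size_t d = 1; d < N; d *= 2) {       /* one PRAM round */
        memcpy(tmp, x, sizeof tmp);           /* synchronous read phase */
        for (size_t i = d; i < N; i++)        /* done in parallel */
            x[i] = tmp[i] + tmp[i - d];       /* write phase */
    }
}
```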

Parallel Prefix Sums Algorithms

Concluding Remarks


Parallel List Ranking (1)


Parallel List Ranking (2)


Parallel List Ranking (3)

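These slides are graphical; a sequential simulation of pointer jumping for list ranking (my sketch) follows. Each node repeatedly doubles its successor pointer while accumulating its distance to the end of the list; each round corresponds to one parallel PRAM step, so O(log n) rounds suffice.

```c
#include <stddef.h>

#define N 5

/* List ranking by pointer jumping: rank[i] = distance of node i to the
   end of the linked list; next[i] == i marks the last node.  The copies
   nnext/nrank emulate the synchronous reads of a PRAM round. */
void list_rank(size_t next[N], int rank[N]) {
    for (size_t i = 0; i < N; i++)
        rank[i] = (next[i] == i) ? 0 : 1;
    for (size_t round = 0; round < N; round++) {   /* ceil(log2 N) would do */
        size_t nnext[N]; int nrank[N];
        for (size_t i = 0; i < N; i++) {           /* done in parallel */
            nrank[i] = rank[i] + rank[next[i]];    /* add partner's rank */
            nnext[i] = next[next[i]];              /* jump the pointer */
        }
        for (size_t i = 0; i < N; i++) {
            rank[i] = nrank[i];
            next[i] = nnext[i];
        }
    }
}
```

For the list 0 -> 1 -> 2 -> 3 -> 4 (node 4 last), the ranks come out as 4, 3, 2, 1, 0.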

Questions?


Further Reading

On the PRAM model and the design and analysis of parallel algorithms:

  • J. Keller, C. Kessler, J. Träff: Practical PRAM Programming. Wiley Interscience, New York, 2001.
  • J. JaJa: An Introduction to Parallel Algorithms. Addison-Wesley, 1992.
  • T. Cormen, C. Leiserson, R. Rivest: Introduction to Algorithms, Chapter 30. MIT Press, 1989.
  • H. Jordan, G. Alaghband: Fundamentals of Parallel Processing. Prentice Hall, 2003.
  • W. Hillis, G. Steele: Data Parallel Algorithms. Comm. ACM 29(12), Dec. 1986. (Link on course homepage.)
  • Fork compiler with PRAM simulator and system tools: http://www.ida.liu.se/chrke/fork (for Solaris and Linux)