CS 240A: Parallel Prefix Algorithms, or Tricks with Trees

Some slides from Jim Demmel, Kathy Yelick, Alan Edelman, and a cast of thousands …

PRAM model of parallel computation

[Figure: processors P1, P2, …, Pn]

Parallel Random Access Memory Machine
• Very simple theoretical model, used in the 1970s and 1980s for lots of “paper designs” of parallel algorithms.
• Processors have unit-time access to any location in shared memory.
• Number of processors is allowed to grow with problem size.
• Goal is (usually) an algorithm with span $O(\log n)$ or $O(\log^2 n)$.
• E.g.: Can you sort n numbers with $T_1 = O(n \log n)$ and $T_n = O(\log n)$?
• This was a big open question until Cole solved it in 1988.
• Very unrealistic model, but sometimes useful for thinking about a problem.

Parallel Vector Operations
• Vector add: z = x + y
  • Embarrassingly parallel if vectors are aligned; span = 1
• DAXPY: v = α*v + β*w (vectors v, w; scalars α, β)
  • Broadcast α and β, then pointwise vector +; span = log n
• DDOT: $\alpha = v^T w$ (vectors v, w; scalar α)
  • Pointwise vector *, then sum reduction; span = log n

Broadcast and reduction
• Broadcast of 1 value to p processors with log p span
• Reduction of p values to 1 with log p span
• Uses associativity of +, *, min, max, etc.

[Figure: broadcast tree for α; add-reduction tree combining 1, 3, 1, 0, 4, -6, 3, 2 into 8]
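The log p span of a reduction can be illustrated with a small sketch. This is not code from the slides: `tree_reduce` is a hypothetical helper that simulates the rounds serially, and each round's pairwise combinations are the ones a PRAM could perform in a single step.

```python
from operator import add

def tree_reduce(values, op=add):
    """Reduce a non-empty list with an associative op.

    Returns (result, rounds), where rounds is the number of
    pairwise-combining levels, i.e. ceil(log2(p)) for p values.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Every pair in this round is independent of the others,
        # so a parallel machine could combine them all at once.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element is carried up unchanged
        level = nxt
        rounds += 1
    return level[0], rounds

# The add-reduction example from the slide: 8 values, 3 rounds.
print(tree_reduce([1, 3, 1, 0, 4, -6, 3, 2]))  # (8, 3)
```

Operands are always combined in left-to-right order, so the sketch relies only on associativity, not commutativity.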

Parallel Prefix Algorithms
• A theoretical secret for turning serial into parallel
• Surprising parallel algorithms: if “there is no way to parallelize this algorithm!” …
• … it’s probably a variation on parallel prefix!

Example of a prefix (also called a scan)

Sum prefix
Input: $x = (x_1, x_2, \ldots, x_n)$
Output: $y = (y_1, y_2, \ldots, y_n)$, where $y_i = \sum_{j=1}^{i} x_j$

Example
x = (1, 2, 3, 4, 5, 6, 7, 8)
y = (1, 3, 6, 10, 15, 21, 28, 36)

Prefix functions: outputs depend upon an initial string.

What do you think?
• Can we really parallelize this?
• It looks like this kind of code:
    y(0) = 0;
    for i = 1:n
        y(i) = y(i-1) + x(i);
    end
• The ith iteration of the loop depends completely on the (i-1)st iteration.
• Work = n, span = n, parallelism = 1.
• Impossible to parallelize, right?

A clue?
x = (1, 2, 3, 4, 5, 6, 7, 8)
y = (1, 3, 6, 10, 15, 21, 28, 36)
Is there any value in adding, say, 4+5+6+7?
If we separately have 1+2+3, what can we do?
Suppose we added 1+2, 3+4, etc. pairwise. What could we do?

Prefix sum in parallel
Algorithm: 1. Pairwise sum 2. Recursive prefix 3. Pairwise sum

Input:            1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Pairwise sums:    3 7 11 15 19 23 27 31
Recursive prefix: 3 10 21 36 55 78 105 136
Output:           1 3 6 10 15 21 28 36 45 55 66 78 91 105 120 136
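The three steps can be sketched in Python as follows. This is a serial simulation under stated assumptions, not the slides' own code: `prefix` is a hypothetical name, the input length is assumed to be a power of two, and the list comprehension and loop stand in for steps whose iterations are independent and could run in parallel.

```python
from operator import add

def prefix(x, op=add):
    """Inclusive scan of x; len(x) is assumed to be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    # Step 1: pairwise sums (all independent).
    pairs = [op(x[i], x[i + 1]) for i in range(0, n, 2)]
    # Step 2: recursive prefix on the half-length vector.
    p = prefix(pairs, op)
    # Step 3: the recursive result fills every second output slot directly;
    # each remaining slot needs one more combination (all independent).
    y = [None] * n
    y[0] = x[0]
    for k in range(n // 2):
        y[2 * k + 1] = p[k]
        if 2 * k + 2 < n:
            y[2 * k + 2] = op(p[k], x[2 * k + 2])
    return y

print(prefix(list(range(1, 17))))
# [1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120, 136]
```

Operands are always combined left to right, so the same sketch works for non-commutative operators such as string concatenation.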

Parallel prefix cost: Work and Span
• What’s the total work?

Input:            1 2 3 4 5 6 7 8
Pairwise sums:    3 7 11 15
Recursive prefix: 3 10 21 36
Update “odds”:    1 3 6 10 15 21 28 36

• $T_1(n) = n/2 + n/2 + T_1(n/2) = n + T_1(n/2) = 2n - 1$
• $T_\infty(n) = 2 \log n$

Parallelism at the cost of twice the work!
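For readers who want the intermediate steps, the two recurrences can be unrolled as below. The base cases are assumptions, since the slide does not state them: $T_1(1) = 1$ is the value that makes the closed form come out to exactly $2n - 1$, and n is taken to be a power of two.

```latex
\begin{align*}
T_1(n) &= n + T_1(n/2) \\
       &= n + \frac{n}{2} + \frac{n}{4} + \cdots + 2 + T_1(1) \\
       &= (2n - 2) + T_1(1) = 2n - 1 \quad \text{(assuming } T_1(1) = 1\text{)} \\[6pt]
T_\infty(n) &= 2 + T_\infty(n/2)
  \quad \text{(one parallel step for the pairwise sums, one for the update)} \\
            &= 2\log_2 n + T_\infty(1)
\end{align*}
```

The serial loop needs about n additions, so the parallel algorithm does roughly twice the work in exchange for logarithmic span.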

Non-recursive view of parallel prefix scan
• Tree summation: two phases
  • Up sweep
    • get values L and R from left and right child
    • save L in local variable Mine
    • compute Tmp = L + R and pass to parent
  • Down sweep
    • get value Tmp from parent
    • send Tmp to left child
    • send Tmp+Mine to right child

Up sweep: mine = left; tmp = left + right
Down sweep: tmp = parent (root is 0); right = tmp + mine

[Figure: up-sweep and down-sweep trees for X = (3, 1, 2, 0, 4, 1, 1, 3). The down sweep leaves the exclusive prefix (0, 3, 4, 6, 6, 10, 11, 12) at the leaves; adding X gives (3, 4, 6, 6, 10, 11, 12, 15).]
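The two sweeps can be sketched as follows, following the up-sweep and down-sweep rules stated on the slide. The representation is an assumption made for illustration: tree nodes are identified by index ranges `(lo, hi)`, `mine` is a dictionary keyed by node, and the recursion runs serially where a parallel machine would handle sibling subtrees at the same time. The input is assumed to be non-empty.

```python
def exclusive_scan(x):
    """Exclusive add scan via an up sweep and a down sweep over a binary tree."""
    n = len(x)
    mine = {}  # node (lo, hi) -> sum of its left subtree, saved on the way up

    def up(lo, hi):
        if hi - lo == 1:
            return x[lo]                     # leaf passes its value up
        mid = (lo + hi) // 2
        left, right = up(lo, mid), up(mid, hi)
        mine[(lo, hi)] = left                # save L in Mine
        return left + right                  # Tmp = L + R goes to the parent

    out = [0] * n

    def down(lo, hi, tmp):
        if hi - lo == 1:
            out[lo] = tmp                    # leaf keeps the value it receives
            return
        mid = (lo + hi) // 2
        down(lo, mid, tmp)                   # send Tmp to left child
        down(mid, hi, tmp + mine[(lo, hi)])  # send Tmp + Mine to right child

    up(0, n)
    down(0, n, 0)                            # the root receives 0
    return out

X = [3, 1, 2, 0, 4, 1, 1, 3]
ex = exclusive_scan(X)
print(ex)                                # [0, 3, 4, 6, 6, 10, 11, 12]
print([e + v for e, v in zip(ex, X)])    # [3, 4, 6, 6, 10, 11, 12, 15]
```

The second line printed is the inclusive scan, obtained by adding X elementwise to the exclusive result.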

Any associative operation works
Associative: (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)

Input: Reals      Input: Bits (Booleans)      Input: Matrices
Sum (+)           All (and)                   MatMul (not commutative!)
Product (*)       Any (or)
Max
Min

Scan (Parallel Prefix) Operations
• Definition: the parallel prefix operation takes a binary associative operator ⊕ and an array of n elements
    $[a_0, a_1, a_2, \ldots, a_{n-1}]$
  and produces the array
    $[a_0, (a_0 \oplus a_1), \ldots, (a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1})]$
• Example: add scan of [1, 2, 0, 4, 2, 1, 1, 3] is [1, 3, 3, 7, 9, 10, 11, 14]
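As a serial reference for this definition, Python's standard `itertools.accumulate` computes the same output array for any binary operator. It is used here only to show what a scan produces for different operators; it does not run in parallel.

```python
from itertools import accumulate
import operator

a = [1, 2, 0, 4, 2, 1, 1, 3]

add_scan = list(accumulate(a, operator.add))
max_scan = list(accumulate(a, max))
any_scan = list(accumulate([False, False, True, False], operator.or_))

print(add_scan)  # [1, 3, 3, 7, 9, 10, 11, 14]
print(max_scan)  # [1, 2, 2, 4, 4, 4, 4, 4]
print(any_scan)  # [False, False, True, True]
```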

Applications of scans
• Many applications, some more obvious than others
  • lexically compare strings of characters
  • add multi-precision numbers
  • add binary numbers fast in hardware
  • graph algorithms
  • evaluate polynomials
  • implement bucket sort, radix sort, and even quicksort
  • solve tridiagonal linear systems
  • solve recurrence relations
  • dynamically allocate processors
  • search for regular expressions (grep)
  • image processing primitives

Using Scans for Array Compression
• Given an array of n elements $[a_0, a_1, a_2, \ldots, a_{n-1}]$ and an array of flags [1,0,1,1,0,0,1,…], compress the flagged elements into $[a_0, a_2, a_3, a_6, \ldots]$
• Compute an add scan of [0, flags]: [0,1,1,2,3,3,3,4,…]
  • Entry i gives the index of the i-th element in the compressed array
  • If the flag for this element is 1, write it into the result array at the given position
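A minimal Python sketch of this compression, using 0-based indexing as in the slide. The function name `compress` and the string stand-ins for the array elements are illustrative assumptions, and `itertools.accumulate` is a serial stand-in for the parallel add scan. Scanning `[0] + flags[:-1]` yields one position per input element.

```python
from itertools import accumulate

def compress(a, flags):
    """Keep a[i] where flags[i] == 1, placing it at the position given by the scan."""
    # Add scan of [0, flags] truncated to len(a): pos[i] counts the flags before i.
    pos = list(accumulate([0] + flags[:-1]))
    out = [None] * sum(flags)
    # Flagged elements have distinct positions, so these writes never collide
    # and the loop iterations are independent.
    for i, f in enumerate(flags):
        if f:
            out[pos[i]] = a[i]
    return pos, out

a = ["a0", "a1", "a2", "a3", "a4", "a5", "a6"]
flags = [1, 0, 1, 1, 0, 0, 1]
pos, out = compress(a, flags)
print(pos)  # [0, 1, 1, 2, 3, 3, 3]
print(out)  # ['a0', 'a2', 'a3', 'a6']
```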

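Array compression: Keep only positives
Matlab code

    % Start with a vector of n random #s
    % normally distributed around 0.
    n = 16;                        % example size (not specified on the slide)
    A = randn(1,n);
    flag = (A > 0);
    addscan = cumsum(flag);        % inclusive add scan: 1-based output positions
    B = zeros(1, addscan(end));    % one slot per positive entry
    % Each iteration writes a distinct slot of B, so the loop is parallel in principle.
    % (If parfor rejects the indexed write to B, a plain for loop behaves the same.)
    parfor i = 1:n
        if flag(i)
            B(addscan(i)) = A(i);
        end
    end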

Fibonacci via Matrix Multiply Prefix

$F_{n+1} = F_n + F_{n-1}$

$$\begin{pmatrix} F_{n+1} \\ F_n \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} F_n \\ F_{n-1} \end{pmatrix}$$

Can compute all $F_n$ by matmul_prefix on a list of nine copies of the matrix

$$\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}$$

then select the upper left entry.
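A small sketch of the idea in Python. The helper `matmul2` is a hypothetical name, and `itertools.accumulate` again stands in serially for the parallel matmul prefix. The k-th prefix product is the k-th power of the matrix, whose upper left entry is $F_{k+1}$ under the convention $F_1 = F_2 = 1$.

```python
from itertools import accumulate

def matmul2(A, B):
    """Product of two 2x2 matrices stored as nested tuples."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    return ((a * e + b * g, a * f + b * h),
            (c * e + d * g, c * f + d * h))

M = ((1, 1), (1, 0))

# Prefix products M, M^2, ..., M^9 (serial stand-in for matmul_prefix).
powers = list(accumulate([M] * 9, matmul2))

# Select the upper left entry of each prefix product.
fib = [P[0][0] for P in powers]
print(fib)  # [1, 2, 3, 5, 8, 13, 21, 34, 55]
```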

Carry-Look Ahead Addition (Babbage 1800’s)
Goal: Add two n-bit integers

Example                        Notation
    1 0 1 1 1      Carry       c_2 c_1 c_0
      1 0 1 1 1    First Int   a_3 a_2 a_1 a_0
      1 0 1 0 1    Second Int  b_3 b_2 b_1 b_0
    1 0 1 1 0 0    Sum         s_3 s_2 s_1 s_0

Serial algorithm (addition mod 2):

    c_{-1} = 0
    for i = 0 : n-1
        s_i = a_i + b_i + c_{i-1}
        c_i = a_i b_i + c_{i-1} (a_i + b_i)
    end
    s_n = c_{n-1}

The carry recurrence in matrix form:

$$\begin{pmatrix} c_i \\ 1 \end{pmatrix} = \begin{pmatrix} a_i + b_i & a_i b_i \\ 0 & 1 \end{pmatrix} \begin{pmatrix} c_{i-1} \\ 1 \end{pmatrix}$$

1. compute $c_i$ by binary matmul prefix
2. compute $s_i = a_i + b_i + c_{i-1}$ in parallel
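The two steps can be sketched in Python under a few assumptions that are not on the slide: bits are stored least significant first, each 2x2 matrix with rows (a_i + b_i, a_i b_i) and (0, 1) is stored compactly as the pair of its top-row entries, and `itertools.accumulate` is a serial stand-in for the parallel prefix. The names `add_bits` and `compose` are illustrative.

```python
from itertools import accumulate

def add_bits(a, b):
    """Add two equal-length, non-empty bit lists (least significant bit first).

    Returns (sum_bits, carries), where sum_bits has one extra bit.
    """
    # Top row of each carry matrix: (a_i + b_i mod 2, a_i * b_i).
    mats = [(x ^ y, x & y) for x, y in zip(a, b)]

    def compose(lo, hi):
        # Matrix product hi * lo, keeping only the top row.
        p_lo, g_lo = lo
        p_hi, g_hi = hi
        return (p_hi & p_lo, g_hi | (p_hi & g_lo))

    # Step 1: with c_{-1} = 0, carry c_i is the second entry of the i-th prefix.
    carries = [g for _, g in accumulate(mats, compose)]
    # Step 2: every sum bit is independent once the carries are known.
    carry_in = [0] + carries[:-1]
    s = [x ^ y ^ c for x, y, c in zip(a, b, carry_in)]
    return s + [carries[-1]], carries

# 10111 + 10101 = 101100, written least significant bit first.
bits, carries = add_bits([1, 1, 1, 0, 1], [1, 0, 1, 0, 1])
print(bits)     # [0, 0, 1, 1, 0, 1]
print(carries)  # [1, 1, 1, 0, 1]
```

Read most significant bit first, the outputs are 101100 for the sum and 10111 for the carries, matching the example rows on the slide.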
