Multithreaded Algorithms


SLIDE 1

Multithreaded Algorithms

Multicore Challenges - Architecture Evolution

We’ve come a long way since we blamed von Neumann for putting that bottleneck in our computers. Memory holds both data and programs; the computer fetches instructions from memory sequentially and executes them.


SLIDE 2

Multicore Challenges - Architecture Evolution

On one hand, processor power has improved; on the other, the way the processor interacts with memory has changed.

Multicore Challenges - Data & Instruction Streams


Single Instruction, Single Data (SISD)
Multiple Instruction, Single Data (MISD)
Single Instruction, Multiple Data (SIMD)
Multiple Instruction, Multiple Data (MIMD)

SLIDE 3

Multicore Challenges - Processor Evolution

It has been quite a ride of improving processor speeds, doubling every 18 months or so, while memory speed has been doubling only every six years. However, that trend has reached its limits due to various engineering constraints, mostly the difficulty of containing heat at high packing densities. Right around 2003 we started seeing multiple cores, and this is the direction we’re headed now.

Multicore Challenges - Moore’s Law

Moore predicted that the capacity of chips would double every 2 years. The trend of making a single processor faster has flattened; now we’re on a trend to add more cores, and this trend will accelerate in the next several years.


SLIDE 4

Multicore Challenges - Not Your Father’s Environment

Multicore processors pose some real challenges for programmers. On a single processor, multithreading is really multitasking; on a multicore processor, multithreading is on steroids. Each core pipelines instructions through multiple threads, threads get to memory more rapidly, and there is an increased possibility of contention.

Multicore Challenges - You’ve Been Drawn In

You’ve been drawn into this war. You may argue that your application does not need threading, but you have no choice when you are on multicore. Remember how a sequential program can break when run with multiple threads; programs will likely break when run on multiple cores. Because memory is much slower than the CPU, cores tend to rely more on cache to improve performance.


SLIDE 5

Multicore Challenges - Cache Reliance

Cache generally makes sense due to the data locality of programs. However, on multicore you have multiple layers of cache, and not all caches are visible to all cores! So when multiple threads access data that may live in different caches, what happens to data correctness?

Multicore Challenges - A Change in Paradigm

As the memory-to-CPU gap widens, this problem gets acute. The compiler will take care of quite a few things, and the JVM will take care of quite a few things, but you have to pull a bigger share of the load for correctness.


SLIDE 6

Multicore Challenges - Programming Got Complex

You have to be more vigilant: you need to synchronize more often, and you have to know what’s visible and what’s not. If you don’t, you’ll see odd, unpredictable results. Not much fun when that happens.

Multicore Challenges - Rethink Your Programming

We are used to mutable shared state. Mutable state by itself is not too bad, and sharing data (for reads) is not bad either. But if you try to share mutable data, real trouble lies ahead.


SLIDE 7

Multicore Challenges - Challenges

Breaking a problem into threads is hard. What is an optimal partitioning? How do you schedule these threads? How do you communicate between them?

Multicore Challenges - Dynamic Multithreaded Programming

You rely on a platform that takes care of details such as load balancing and scheduling. You expect two features to be available. Nested parallelism: you can spawn subroutines, and the caller and the spawned subroutines can proceed independently. Parallel loops: iterations of the loop can execute concurrently.


SLIDE 8

Multicore Challenges - Benefits

A simple extension to the serial model with the parallel, spawn, & sync keywords. It is easy to convert a parallel algorithm to a sequential one by removing these keywords, and easy to quantify parallelism based on work and span. Divide and conquer lends itself well to this model.

Multicore Challenges - Dynamic MT Basics

Serialization of a multithreaded algorithm is achieved by deleting the keywords spawn, sync, & parallel. Spawn indicates only that the scheduler may (not must) run the subroutine concurrently. A sync indicates that the algorithm can’t proceed until the spawned subroutines have completed and their results have been received. Every procedure executes an implicit sync before it returns, ensuring that all spawned subroutines terminate before the procedure does.


SLIDE 9

Multicore Challenges - Example: Fibonacci Number

FIB(n)
  if n <= 1
    return n
  else
    x = spawn FIB(n-1)
    y = FIB(n-2)
    sync
    return x + y


Serialized version (spawn and sync deleted):

FIB(n)
  if n <= 1
    return n
  else
    x = FIB(n-1)
    y = FIB(n-2)
    return x + y
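The spawn/sync pattern above can be mirrored with explicit threads. A minimal sketch in Python (an illustrative language choice; the slides use pseudocode), where starting a thread plays the role of spawn and join plays the role of sync:

```python
import threading

def fib(n):
    if n <= 1:
        return n
    result = {}
    # "spawn": run FIB(n-1) concurrently in its own thread
    t = threading.Thread(target=lambda: result.update(x=fib(n - 1)))
    t.start()
    y = fib(n - 2)   # the caller proceeds independently
    t.join()         # "sync": wait for the spawned subroutine's result
    return result["x"] + y

print(fib(10))  # 55
```

A real dynamic-multithreading platform (e.g. Cilk or Java’s fork/join framework) uses a work-stealing scheduler rather than one OS thread per spawn; this sketch only mirrors the semantics.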

Multicore Challenges - Example: Count the Number of Primes


SLIDE 10

Multicore Challenges - A Model for MT Execution

Strand: a chain of instructions with no parallel control. A multithreaded computation can be represented as a computation DAG: vertices represent instructions or strands, and edges represent dependencies between them. If the DAG has a directed path from one strand to another, the two are logically in series; otherwise they are logically in parallel.

Multicore Challenges - Performance Measures

Tp is the runtime of an algorithm on p processors. Work and span are useful for calculating theoretical efficiencies. Work: the total time to execute the entire computation on one processor, i.e., the sum of the times taken by all strands. Span: the longest time to execute the strands along any path in the DAG. The number of processors comes into the picture as well.
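These definitions can be checked on a tiny example. A sketch with a made-up four-strand DAG (strand names and times are invented for illustration): work is the sum of all strand times, span is the longest weighted path:

```python
# A small, made-up computation DAG: strand times and dependencies.
time_of = {"a": 1, "b": 3, "c": 2, "d": 1}
deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}

# Work (T1): total time over all strands, as if run on one processor.
work = sum(time_of.values())

# Span (T_inf): longest weighted path through the DAG.
memo = {}
def finish(v):
    if v not in memo:
        memo[v] = time_of[v] + max((finish(u) for u in deps[v]), default=0)
    return memo[v]

span = max(finish(v) for v in time_of)
print(work, span)  # 7 5
```

Here b and c are logically in parallel (no path between them), so only the longer one, b, contributes to the critical path a -> b -> d.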


SLIDE 11

Multicore Challenges - Performance Measures

Work and span provide lower bounds on the runtime. In one step, an ideal parallel computer with p processors can do at most p units of work, so in time Tp it can perform at most pTp units of work. The total work to do is T1 (the work done on one processor), so pTp >= T1, which gives the work law: Tp >= T1/p.

Multicore Challenges - Performance Measures

A machine with an unlimited number of processors can emulate a p-processor machine by using just p of its processors, so the span law is Tp >= T∞. Adding more processors than needed does not help.


SLIDE 12

Multicore Challenges - Speedup

Speedup is T1 / Tp. From the work law, Tp >= T1/p, so T1/Tp <= p: the speedup on p processors can be at most p. When T1/Tp = Θ(p), you have linear speedup; when T1/Tp = p, you have perfect linear speedup.

Multicore Challenges - Limits on Speedup

You simply can’t throw processors at a problem and expect speedup. Amdahl’s law: the speedup that can be realized is limited by the sequential fraction of the computation. If P is the fraction of the computation that can enjoy a speedup of S, the overall speedup will be

24

1 / ((1 - P) + P/S)
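Plugging numbers in shows how the sequential fraction dominates. A minimal sketch (the example fractions are invented for illustration):

```python
def amdahl_speedup(P, S):
    """Overall speedup when fraction P of the work is sped up by factor S."""
    return 1.0 / ((1.0 - P) + P / S)

# 90% of the work parallelized across 10 processors: ~5.3x, not 10x.
print(amdahl_speedup(0.9, 10))    # ~5.26

# Even with S = 1000, the 10% sequential part caps us near 10x.
print(amdahl_speedup(0.9, 1000))  # ~9.91
```

This is exactly the "limits on speedup" point: no matter how large S grows, the speedup can never exceed 1 / (1 - P).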

SLIDE 13

Multicore Challenges - Parallelism

Parallelism is T1 / T∞. It represents the average amount of work that can be done in parallel for each step along the critical path (the span), and it gives an upper bound on the maximum possible speedup. It also limits the possibility of attaining linear speedup: beyond this point, you can’t throw more processors at the problem to improve speedup.

Multicore Challenges - Analyzing MT Algorithms

First compute T1(n). To compute T∞(n), analyze the span: if two subcomputations are done in sequence, their spans add to form the span of the composition; if they are joined in parallel, take the maximum of the two spans.
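Applied to FIB, these rules give the recurrences T1(n) = T1(n-1) + T1(n-2) + Θ(1) for work and T∞(n) = max(T∞(n-1), T∞(n-2)) + Θ(1) for span. A sketch that counts both in unit-cost calls (treating every strand as one time unit is a simplifying assumption):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def work(n):
    # T1: serial rule -- the costs of both recursive calls add.
    if n <= 1:
        return 1
    return work(n - 1) + work(n - 2) + 1

@lru_cache(maxsize=None)
def span(n):
    # T_inf: parallel rule -- spawned FIB(n-1) overlaps FIB(n-2),
    # so take the max; the +1 is the strand doing spawn/sync/add.
    if n <= 1:
        return 1
    return max(span(n - 1), span(n - 2)) + 1

n = 20
print(work(n), span(n), work(n) / span(n))
```

Work grows exponentially while span grows only linearly, so the parallelism T1/T∞ of FIB is enormous.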


SLIDE 14

Multicore Challenges - Parallel Loops

Algorithms may benefit from executing loop iterations in parallel; the parallel keyword before a for loop conveys this.
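The earlier prime-counting example is a natural parallel loop: iterations are independent, so the range can be split into chunks that run concurrently. A sketch (Python and the chunking scheme are illustrative; under CPython’s GIL a thread pool will not actually speed up CPU-bound work, but the structure is the same as with a process pool):

```python
from concurrent.futures import ThreadPoolExecutor

def is_prime(k):
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True

def count_primes(lo, hi):
    # Serial loop body over one chunk [lo, hi).
    return sum(1 for k in range(lo, hi) if is_prime(k))

def parallel_count_primes(n, chunks=8):
    # "parallel for": each chunk's iterations run independently,
    # and the per-chunk counts are combined at the end.
    bounds = [(i * n // chunks, (i + 1) * n // chunks) for i in range(chunks)]
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(lambda b: count_primes(*b), bounds))

print(parallel_count_primes(100))  # 25 primes below 100
```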

Multicore Challenges - Matrix Vector Multiplication

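A sketch of the standard MAT-VEC computation y = Ax in this model, with the outer row loop as the parallel for (Python is an illustrative choice): rows are independent, so each can be computed concurrently, while the inner dot-product loop stays serial.

```python
from concurrent.futures import ThreadPoolExecutor

def mat_vec(A, x):
    n = len(A)
    y = [0] * n

    def row(i):
        # Serial inner loop: Theta(n) work per row, and it is
        # what dominates the span.
        s = 0
        for j in range(n):
            s += A[i][j] * x[j]
        y[i] = s  # each i writes only its own slot: no shared-mutable conflict

    with ThreadPoolExecutor() as pool:  # plays the role of "parallel for i"
        list(pool.map(row, range(n)))
    return y

A = [[1, 2], [3, 4]]
x = [5, 6]
print(mat_vec(A, x))  # [17, 39]
```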

SLIDE 15

Multicore Challenges - Matrix Vector Mult: Div/Conq

Multicore Challenges - Computational Efficiency

T1(n) = Θ(n^2), from the serialization of MAT-VEC. T∞(n) = Θ(lg n) + max over 1<=i<=n of iter∞(i); the span is dominated by the Θ(n) serial inner loop, so T∞(n) = Θ(n). Parallelism is Θ(n^2)/Θ(n) = Θ(n).


SLIDE 16

Multicore Challenges - Race Conditions

Deterministic behavior is critical; non-deterministic means unpredictable and unreliable. Shared state is OK. Mutable state is OK. Shared-mutable state is not OK.

Multicore Challenges - What Leads to a Race Condition?

Partly it is the timing; it is also about visibility.
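The timing part can be seen as a lost update: a thread reads the shared counter, is delayed, and writes back a stale value, overwriting a peer’s update. A contrived sketch (the sleep artificially widens the read-modify-write window so the race shows up reliably):

```python
import threading
import time

counter = 0

def unsafe_increment(times):
    global counter
    for _ in range(times):
        tmp = counter          # read
        time.sleep(0.0001)     # another thread can sneak in here
        counter = tmp + 1      # write back: may overwrite a peer's update

threads = [threading.Thread(target=unsafe_increment, args=(50,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# 100 increments were requested, but some updates were likely lost.
print(counter)
```

Which updates are lost depends entirely on how the threads happen to interleave, which is exactly the non-determinism the slide warns about.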


SLIDE 17

Multicore Challenges - Avoid Race Conditions

Ensure your algorithm does not have race conditions. It is better to avoid them at the root by using immutability.
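Both remedies can be sketched: synchronize access to the shared mutable state, or avoid shared-mutable state entirely by having each worker produce its own result that is combined only after the threads finish (names and numbers here are illustrative):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Remedy 1: guard the shared mutable state with a lock.
counter = 0
lock = threading.Lock()

def safe_increment(times):
    global counter
    for _ in range(times):
        with lock:          # the read-modify-write is now atomic
            counter += 1

threads = [threading.Thread(target=safe_increment, args=(50,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 100, every time

# Remedy 2: no shared-mutable state at all -- each worker computes a
# partial sum over its own slice; only the finished values cross threads.
with ThreadPoolExecutor() as pool:
    parts = pool.map(sum, [range(0, 50), range(50, 100)])
    total = sum(parts)
print(total)  # 4950, i.e. sum(range(100))
```

Remedy 2 is the immutability approach the slide recommends: with nothing shared and mutable, there is nothing to race on.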
