Outline 0024 Spring 2010 24 :: 2 Parallel application - - PowerPoint PPT Presentation



SLIDE 1

SLIDE 2

– 24 :: 2 0024 Spring 2010

Outline

SLIDE 3

Parallel application development

SLIDE 4

SLIDE 5

Lock data, not code

SLIDE 6

Do you really need locks?

No shared data => no need for locks

Recall that CSP gives you a model to avoid locks

No free lunch

Lock-free data structures

Mutex-free by design
Growing number of class/data structures

SLIDE 7

Detour: no shared data

What if we could write programs so that there are no side effects?

Think about the simple finite impulse response filter for N inputs

Think of computing an expensive function for N numbers

Think of searching for a string in N documents
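
The "expensive function for N numbers" case can be sketched in Java as follows. This is a minimal illustration assuming Java 8 or later (parallel streams); the class `NoSharedData` and the function `expensive` are hypothetical names, not part of the course material. Each result depends only on its own input and nothing shared is written, so the work can be split across threads without a single lock.

```java
import java.util.Arrays;

final class NoSharedData {
    // Hypothetical stand-in for an expensive, side-effect-free function:
    // the sum of squares 1^2 + 2^2 + ... + n^2.
    static long expensive(long n) {
        long acc = 0;
        for (long i = 1; i <= n; i++) {
            acc += i * i;
        }
        return acc;
    }

    // Applies expensive() to every input. No call reads or writes state
    // touched by another call, so the stream is free to run them in parallel.
    static long[] mapAll(long[] inputs) {
        return Arrays.stream(inputs)
                     .parallel()
                     .map(NoSharedData::expensive)
                     .toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(mapAll(new long[] {1, 2, 3, 4})));
    }
}
```

The result array keeps the order of the inputs even though the individual calls may run on different threads.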

SLIDE 8

MapReduce

Basic idea: a parallel computing framework for a restricted parallel programming model

Useful to distribute work to a farm (cluster) of compute nodes

User specifies what needs to be done for each data item (“map”) and how results are to be combined (“reduce”)

Libraries take care of everything else

Parallelization
Fault tolerance
Data distribution
Load balancing

SLIDE 9

MapReduce

Map()

Process a key/value pair to generate intermediate key/value pairs

Reduce()

Merge all intermediate values associated with the same key

Names originated in the functional programming world … but slightly different semantics

SLIDE 10

Example: Counting Words

Map()

Input <filename, file text>
Parses file and emits <word, count> pairs
E.g. <“hello”, 1>

Reduce()

Sums all values for the same key
Emits <word, TotalCount>
E.g. <“hello”, 5> <“hello”, 1> <“hello”, 2> <“hello”, 7> => <“hello”, 15>

SLIDE 11

Example Use of MapReduce

Counting words in a large set of documents

map(string key, string value)
  // key: document name
  // value: document contents
  for each word w in value
    EmitIntermediate(w, "1");

reduce(string key, iterator values)
  // key: word
  // values: list of counts
  int result = 0;
  for each v in values
    result += ParseInt(v);
  Emit(AsString(result));
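
The pseudocode above can be tried out on a single machine with a small Java sketch. It is not a MapReduce library: the class `WordCountSketch` and its `run` method are hypothetical stand-ins for the framework, which in a real deployment would also handle distribution, grouping, and fault tolerance. Two further assumptions: counts are plain integers rather than the strings used on the slide, and words are lower-cased before counting.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {
    // "map": one document in, a list of <word, 1> pairs out.
    static List<Map.Entry<String, Integer>> map(String documentName, String contents) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : contents.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // "reduce": all counts emitted for one word in, their sum out.
    static int reduce(String word, List<Integer> counts) {
        int result = 0;
        for (int count : counts) {
            result += count;
        }
        return result;
    }

    // Stand-in for the framework: run map on every document, group the
    // intermediate pairs by key, then run reduce once per key.
    static Map<String, Integer> run(Map<String, String> documents) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (Map.Entry<String, Integer> pair : map(doc.getKey(), doc.getValue())) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            totals.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return totals;
    }
}
```

Because `map` looks at one document and `reduce` at one key, neither touches shared state, which is what lets a framework run many of them at once.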

SLIDE 12

Data Distribution

Input files are split into pieces

distributed file system

Intermediate files created from map tasks are written to local disk

Output files are written to distributed file system

SLIDE 13

Assigning Tasks

Many copies of user program are started
Tries to utilize data localization by running map tasks on machines with data
One instance becomes the master
Master finds idle machines and assigns tasks

SLIDE 14

SLIDE 15

MapReduce

SLIDE 16

Do you really need locks?

No shared data => no need for locks

Recall that CSP gives you a model to avoid locks

No free lunch

Lock-free data structures

Mutex-free by design
Growing number of class/data structures
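
As a small illustration of "mutex-free by design", here is a sketch of a counter built on `java.util.concurrent.atomic.AtomicInteger`. The class `LockFreeCounter` and its `demo` helper are hypothetical. No thread ever holds a lock: an update is published with compare-and-set and simply retried if another thread got there first.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class LockFreeCounter {
    private final AtomicInteger value = new AtomicInteger();

    // Read the current value, compute the next one, and publish it with
    // compare-and-set. If another thread changed the value in between,
    // the CAS fails and the loop tries again with the fresh value.
    int increment() {
        while (true) {
            int current = value.get();
            int next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    int get() {
        return value.get();
    }

    // Hypothetical helper: hammers one counter from several threads and
    // returns the final value.
    static int demo(int threads, int incrementsPerThread) {
        LockFreeCounter counter = new LockFreeCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get();
    }
}
```

A delayed thread cannot block the others here; at worst its own compare-and-set fails and it retries. For queues, `java.util.concurrent.ConcurrentLinkedQueue` is a ready-made non-blocking implementation.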

SLIDE 17

Why Locking Doesn't Scale

Not Robust

Relies on conventions

Hard to Use

Conservative
Deadlocks
Lost wake-ups

Not Composable

SLIDE 18

Locks are not Robust

If a thread holding a lock is delayed …

No one else can make progress

SLIDE 19

Why Locking Doesn't Scale

Not Robust

Relies on conventions

Hard to Use

Conservative
Deadlocks
Lost wake-ups

Not Composable

SLIDE 20

Locking Relies on Conventions

Relation between lock bit and object bits exists only in the programmer's mind

/*
 * When a locked buffer is visible to the I/O layer
 * BH_Launder is set. This means before unlocking
 * we must clear BH_Launder,mb() on alpha and then
 * clear BH_Lock, so no reader can see BH_Launder set
 * on an unlocked buffer and then risk to deadlock.
 */

Actual comment from Linux Kernel

(hat tip: Bradley Kuszmaul)

SLIDE 21

Why Locking Doesn't Scale

Not Robust

Relies on conventions

Hard to Use

Conservative
Deadlocks
Lost wake-ups

Not Composable

SLIDE 22

Sadistic Homework

[Figure: FIFO queue with concurrent enq(x) and enq(y)]

No interference if ends “far enough” apart

SLIDE 23

Sadistic Homework

[Figure: double-ended queue with two concurrent deq() calls]

Interference OK if ends “close enough” together

SLIDE 24

Sadistic Homework

[Figure: double-ended queue with two concurrent deq() calls]

Make sure suspended dequeuers awake as needed

SLIDE 25

In Search of the Lost Wake-Up

Waiting thread doesn't realize when to wake up

It's a real problem in big systems

“Calling pthread_cond_signal() or pthread_cond_broadcast() when the thread does not hold the mutex lock associated with the condition can lead to lost wake-up bugs.”

from Google™ search for “lost wake-up”
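
The same rule applies to Java monitors. The sketch below, with the hypothetical class `GuardedQueue`, shows the usual defence: the state change and the notification happen while holding the same monitor the waiter uses, and the waiter re-checks its condition in a loop.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class GuardedQueue<T> {
    private final Deque<T> items = new ArrayDeque<>();

    // The item is added and the waiters are notified under the same monitor.
    // A dequeuer is therefore either not yet inside deq(), and will see the
    // item, or already in wait(), and will receive the notification.
    synchronized void enq(T item) {
        items.addLast(item);
        notifyAll();
    }

    // The condition is tested in a loop while holding the monitor, so a
    // wake-up that arrives for another reason is re-checked, not trusted.
    synchronized T deq() throws InterruptedException {
        while (items.isEmpty()) {
            wait();
        }
        return items.removeFirst();
    }

    // Hypothetical helper: one consumer blocks in deq(), the caller enqueues,
    // and the value the consumer received is returned.
    static String demo() {
        GuardedQueue<String> queue = new GuardedQueue<>();
        String[] received = new String[1];
        Thread consumer = new Thread(() -> {
            try {
                received[0] = queue.deq();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        queue.enq("hello");
        try {
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return received[0];
    }
}
```

If `enq` changed the queue outside the monitor, the notification could fire between the dequeuer's emptiness test and its call to `wait()`, and the dequeuer would sleep with an item already available.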

SLIDE 26

You Try It …

One lock?

Too Conservative

Locks at each end?

Deadlock, too complicated, etc

Waking blocked dequeuers?

Harder than it looks

SLIDE 27

Actual Solution

Clean solution would be a publishable result [Michael & Scott, PODC 96]

High performance fine-grained lock-based solutions are good for libraries… not general consumption by programmers

SLIDE 28

Why Locking Doesn't Scale

Not Robust

Relies on conventions

Hard to Use

Conservative
Deadlocks
Lost wake-ups

Not Composable

SLIDE 29

Locks do not compose

[Figure: hash tables T1 and T2, each guarded by its own lock, with the operations add(T1, item), delete(T1, item), and add(T2, item)]

Move from T1 to T2

Must lock T1 before adding item

Must lock T2 before deleting from T1

Exposing lock internals breaks abstraction
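
A sketch of what the move looks like in Java makes the abstraction leak visible. The class `LockedTable` is hypothetical, and the `id` field exists only so that callers can agree on a global lock order; that convention is an assumption of this sketch, not something from the slides.

```java
import java.util.HashSet;
import java.util.Set;

final class LockedTable {
    final int id;                      // assumed unique per table; fixes a global lock order
    final Object lock = new Object();  // exposed, because callers need it to compose operations
    private final Set<String> items = new HashSet<>();

    LockedTable(int id) {
        this.id = id;
    }

    boolean add(String item) {
        synchronized (lock) { return items.add(item); }
    }

    boolean delete(String item) {
        synchronized (lock) { return items.remove(item); }
    }

    boolean contains(String item) {
        synchronized (lock) { return items.contains(item); }
    }

    // add() and delete() are each thread-safe, but calling one after the other
    // is not a move: another thread could run in between. Both locks are held
    // across the delete and the add so that no single-table operation can run
    // in between, and they are always taken in id order to avoid deadlock.
    static boolean move(LockedTable from, LockedTable to, String item) {
        LockedTable first = from.id < to.id ? from : to;
        LockedTable second = (first == from) ? to : from;
        synchronized (first.lock) {
            synchronized (second.lock) {
                if (!from.delete(item)) {
                    return false;      // nothing to move
                }
                to.add(item);
                return true;
            }
        }
    }
}
```

The table can no longer hide how it synchronizes: `move` only works because `lock` and the ordering rule are part of the public contract.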

SLIDE 30

Monitor Wait and Signal

[Figure: a thread sleeping (zzz) on an empty buffer, with the callout “Yes!”]

If buffer is empty, wait for item to show up

SLIDE 31

Wait and Signal do not Compose

[Figure: a thread sleeping (zzz…) in front of two empty buffers]

Wait for either?

SLIDE 32

The Transactional Manifesto

What we do now is inadequate to meet the multi-core challenge

Research Agenda

Replace locking with a transactional API
Design languages to support this model
Implement the run-time to be fast enough

SLIDE 33

Transactions

Atomic

Commit: takes effect
Abort: effects rolled back

Usually retried

Linearizable

Appear to happen in one-at-a-time order

SLIDE 34

Atomic Blocks

atomic {
  x.remove(3);
  y.add(3);
}

atomic {
  y = null;
}

SLIDE 35

Atomic Blocks

atomic {
  x.remove(3);
  y.add(3);
}

atomic {
  y = null;
}

No data race

SLIDE 36

Sadistic Homework Revisited

public void LeftEnq(item x) {
  Qnode q = new Qnode(x);
  q.left = this.left;
  this.left.right = q;
  this.left = q;
}

(1) Write sequential code

SLIDE 37

Sadistic Homework Revisited

public void LeftEnq(item x) {
  atomic {
    Qnode q = new Qnode(x);
    q.left = this.left;
    this.left.right = q;
    this.left = q;
  }
}

SLIDE 38

Sadistic Homework Revisited

public void LeftEnq(item x) {
  atomic {
    Qnode q = new Qnode(x);
    q.left = this.left;
    this.left.right = q;
    this.left = q;
  }
}

Enclose in atomic block

SLIDE 39

Warning

Not always this simple

Conditional waits
Enhanced concurrency
Overlapping locks

But often it is

Works for sadistic homework

SLIDE 40

Composition

public void Transfer(Queue q1, Queue q2) {
  atomic {
    Object x = q1.deq();
    q2.enq(x);
  }
}

Trivial or what?

SLIDE 41

Wake-ups: lost and found

public Object LeftDeq() {
  atomic {
    if (this.left == null)
      retry;
    …
  }
}

Roll back transaction and restart when something changes

SLIDE 42

OrElse Composition

atomic {
  x = q1.deq();
} orElse {
  x = q2.deq();
}

Run 1st method. If it retries …

Run 2nd method. If it retries …

Entire statement retries
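
Java has no `atomic`, `retry`, or `orElse`, but the control flow can be approximated to see what the construct buys. The sketch below is not transactional memory: the hypothetical class `EitherQueue` puts both queues behind one monitor, which is exactly the kind of coarse lock the transactional version is meant to avoid. It only shows the order in which the alternatives are tried.

```java
import java.util.ArrayDeque;
import java.util.Queue;

final class EitherQueue<T> {
    private final Queue<T> q1 = new ArrayDeque<>();
    private final Queue<T> q2 = new ArrayDeque<>();

    synchronized void enq1(T item) {
        q1.add(item);
        notifyAll();   // something changed: let blocked dequeuers re-run
    }

    synchronized void enq2(T item) {
        q2.add(item);
        notifyAll();
    }

    // Mirrors: atomic { x = q1.deq(); } orElse { x = q2.deq(); }
    synchronized T deqEither() throws InterruptedException {
        while (true) {
            T x = q1.poll();   // run the first alternative
            if (x != null) {
                return x;
            }
            x = q2.poll();     // the first would "retry": run the second
            if (x != null) {
                return x;
            }
            wait();            // both would "retry": block until something changes
        }
    }
}
```

With two independently locked queues there is no way to wait for either one; here it works only because a single monitor covers both, whereas the transactional `orElse` composes two existing operations without merging their synchronization.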

SLIDE 43

Transactional Memory

Software transactional memory (STM)
Hardware transactional memory (HTM)
Hybrid transactional memory (HyTM, try in hardware and default to software if unsuccessful)

SLIDE 44

Design Issues

Implementation choices
Language design issues
Semantic issues

SLIDE 45

Granularity

Object

Managed languages: Java, C#, …
Easy to control interactions between transactional & non-transactional threads

Word

C, C++, …
Hard to control interactions between transactional & non-transactional threads

SLIDE 46

Direct/Deferred Update

Deferred

Modify private copies & install on commit
Commit requires work
Consistency easier

Direct

Modify in place, roll back on abort
Makes commit efficient
Consistency harder
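
The deferred variant can be illustrated with a toy store in Java. This is a teaching sketch, not how any particular STM is implemented: the class `DeferredUpdateStore`, its nested `Tx`, and the single global version counter are all assumptions made for brevity. Conflicts are detected only at commit time, and any commit since the transaction began counts as a conflict, which is far more conservative than a real system would be.

```java
import java.util.HashMap;
import java.util.Map;

final class DeferredUpdateStore {
    private final Map<String, Integer> committed = new HashMap<>();
    private long version = 0;   // bumped by every commit that installs writes

    final class Tx {
        private final long startVersion;
        private final Map<String, Integer> writes = new HashMap<>();  // private copies

        private Tx(long startVersion) {
            this.startVersion = startVersion;
        }

        Integer read(String key) {
            if (writes.containsKey(key)) {
                return writes.get(key);          // the transaction sees its own writes
            }
            synchronized (DeferredUpdateStore.this) {
                return committed.get(key);
            }
        }

        void write(String key, int value) {
            writes.put(key, value);              // nothing shared changes yet
        }

        // Returns true if the writes were installed, false if the transaction aborted.
        boolean commit() {
            synchronized (DeferredUpdateStore.this) {
                if (version != startVersion) {
                    writes.clear();              // abort: dropping the copies is the whole rollback
                    return false;
                }
                if (!writes.isEmpty()) {
                    committed.putAll(writes);    // commit does the real work
                    version++;
                }
                return true;
            }
        }
    }

    synchronized Tx begin() {
        return new Tx(version);
    }

    synchronized Integer get(String key) {
        return committed.get(key);
    }
}
```

This shows the trade-off on the slide: abort is trivial because shared data was never touched, while commit has to validate and copy. A direct-update design would write to `committed` immediately and keep an undo log instead.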

SLIDE 47

Conflict Detection

Eager

Detect before conflict arises
“Contention manager” module resolves

Lazy

Detect on commit/abort

Mixed

Eager write/write, lazy read/write …

SLIDE 48

Conflict Detection

Eager detection may abort a transaction that could have committed.

Lazy detection discards more computation.

SLIDE 49

Contention Managers

Oracle that resolves conflicts

For eager conflict detection

CM decides

Whether to abort other transaction
Or give it a chance to finish …

SLIDE 50

Contention Manager Strategies

Exponential backoff
Oldest gets priority
Most work done gets priority
Non-waiting has priority over waiting

Lots of alternatives

None seems to dominate
But choice can have big impact
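
Of the strategies listed, exponential backoff is the easiest to sketch. The Java below is a generic retry helper, not a contention manager from any STM: the class `BackoffSketch`, the `tryCommit` callback, and the millisecond parameters are assumptions chosen for illustration. After each failed attempt the thread sleeps for a random time drawn from a window that doubles up to a cap.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.BooleanSupplier;

final class BackoffSketch {
    // Size of the delay window for a given attempt (0-based): baseMillis
    // doubled once per earlier failure, never exceeding capMillis.
    static long windowMillis(int attempt, long baseMillis, long capMillis) {
        long window = baseMillis;
        for (int i = 0; i < attempt && window < capMillis; i++) {
            window *= 2;
        }
        return Math.min(window, capMillis);
    }

    // Calls tryCommit until it reports success or maxAttempts is used up.
    // Returns the number of attempts made on success, or -1 on giving up.
    static int runWithBackoff(BooleanSupplier tryCommit, int maxAttempts,
                              long baseMillis, long capMillis)
            throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (tryCommit.getAsBoolean()) {
                return attempt + 1;
            }
            if (attempt + 1 < maxAttempts) {
                long window = windowMillis(attempt, baseMillis, capMillis);
                // A random delay in [0, window] keeps colliding threads from
                // retrying in lock step.
                Thread.sleep(ThreadLocalRandom.current().nextLong(window + 1));
            }
        }
        return -1;
    }
}
```

The randomization matters as much as the doubling: two transactions that back off by the same fixed amount would simply collide again.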

SLIDE 51

I/O & System Calls?

Some I/O revocable

Provide transaction-safe libraries
Undoable file system/DB calls

Some not

Opening cash drawer
Firing missile

SLIDE 52

I/O & System Calls

One solution: make transaction irrevocable

If transaction tries I/O, switch to irrevocable mode.

There can be only one …

Requires serial execution

No explicit aborts

In irrevocable transactions

SLIDE 53

More problems ahead

Maybe we can revisit those issues in a future class on Advanced Parallel Programming, after you have taken classes on System Programming, Computer Architecture, Compilers, and Performance Evaluation.

SLIDE 54

Parallel programming is difficult

Stand on the shoulders of giants

Don't roll your own {lock, event queue, …} if there exists a solution provided by others
But know how to build your own if necessary
Know if you can trust the solution
E.g., Java 6 (and Java 7) provide many thread-safe utility classes

Program at the highest level possible

OpenMP may not be the most elegant language but if your problem fits the model, it's a big win.

Let the system handle the details

See above
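
As an example of leaning on the library, the sketch below counts events from several threads using only classes from `java.util.concurrent`. The class `SharedTally` and the key `"hits"` are hypothetical. It uses lambdas and `computeIfAbsent`, so it assumes Java 8 or later rather than the Java 6 mentioned on the slide; the utility classes themselves are older.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedTally {
    // Starts `threads` tasks that each bump the same counter
    // `incrementsPerThread` times, and returns the final count.
    // No explicit lock, wait, or notify appears anywhere.
    static int tally(int threads, int incrementsPerThread) {
        ConcurrentHashMap<String, AtomicInteger> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    // The map creates the counter atomically on first use;
                    // the counter handles concurrent increments itself.
                    counts.computeIfAbsent("hits", key -> new AtomicInteger())
                          .incrementAndGet();
                }
            });
        }
        pool.shutdown();   // accept no new tasks; already submitted ones still run
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        AtomicInteger total = counts.get("hits");
        return total == null ? 0 : total.get();
    }
}
```

Thread creation, task hand-off, and the synchronization inside the map and the counter are all delegated to code that was written and reviewed by specialists.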

SLIDE 55

… but if you must do it, do it right

Document your design & implementation
Prove essential properties of the solution

SLIDE 56

What we did not teach

We (almost) avoided the issue of “performance”

Why?

Enough material
Course objective is to advance your understanding of (imperative) programming
… and to prepare for several subsequent classes

Must understand more aspects of computer architecture

Caches
Cache management
Instruction sets of modern processors

Implementation of JCSP not “performance-sensitive”

Implementation of JOMP not “performance-sensitive”

But good implementations for C/C++ and Fortran exist

SLIDE 57

Future classes

System programming and computer architecture
Operating systems and networks
? Practice of Parallel Programming
? Advanced Parallel Programming
Concurrent object-oriented programming I
Concurrent object-oriented programming II
Advanced Parallel Computing for Scientific Applications

SLIDE 58

Until then … have a nice summer