slide-1
SLIDE 1

A Minicourse on Multithreaded Programming

Charles E. Leiserson and Harald Prokop
MIT Laboratory for Computer Science
545 Technology Square
Cambridge, Massachusetts 02139
{cel,prokop}@lcs.mit.edu

July 1998

Abstract. These notes contain two lectures that teach multithreaded algorithms using a Cilk-like model. These lectures were designed for the latter part of the MIT undergraduate class 6.046 Introduction to Algorithms. The style of the lecture notes follows that of the textbook by Cormen, Leiserson, and Rivest, but the pseudocode from that textbook has been "Cilkified" to allow it to describe multithreaded algorithms. The first lecture teaches the basics behind multithreading, including defining the measures of work and critical-path length. It culminates in the greedy scheduling theorem due to Graham and Brent. The second lecture shows how parallel applications, including matrix multiplication and sorting, can be analyzed using divide-and-conquer recurrences.

Multithreaded programming

As multiprocessor systems have become increasingly available, interest has grown in parallel programming. Multithreaded programming is a programming paradigm in which a single program is broken into multiple threads of control which interact to solve a single problem. These notes provide an introduction to the analysis of multithreaded algorithms.

This research was supported in part by the Defense Advanced Research Projects Agency (DARPA) under Grant F30602-97-1-0270.
slide-2
SLIDE 2

Model

Our model of multithreaded computation is based on the procedure abstraction found in virtually any programming language. As an example, the procedure Fib gives a multithreaded algorithm for computing the Fibonacci numbers:

Fib(n)
1  if n < 2
2    then return n
3    else x ← spawn Fib(n − 1)
4         y ← spawn Fib(n − 2)
5         sync
6         return (x + y)

A spawn is the parallel analog of an ordinary subroutine call. The keyword spawn before the subroutine call in line 3 indicates that the subprocedure Fib(n − 1) can execute in parallel with the procedure Fib(n) itself. Unlike an ordinary function call, however, where the parent is not resumed until after its child returns, in the case of a spawn, the parent can continue to execute in parallel with the child. In this case, the parent goes on to spawn Fib(n − 2). In general, the parent can continue to spawn off children, producing a high degree of parallelism.

A procedure cannot safely use the return values of the children it has spawned until it executes a sync statement. If any of its children have not completed when it executes a sync, the procedure suspends and does not resume until all of its children have completed. When all of its children return, execution of the procedure resumes at the point immediately following the sync statement. In the Fibonacci example, the sync statement in line 5 is required before the return statement in line 6 to avoid the anomaly that would occur if x and y were summed before each had been computed.

The spawn and sync keywords specify logical parallelism, not actual parallelism. That is, these keywords indicate which code may possibly execute in parallel, but what actually runs in parallel is determined by a scheduler, which maps the dynamically unfolding computation onto the available processors.

We can view a multithreaded computation in graph-theoretic terms as a dynamically unfolding dag G = (V, E), as is shown in Figure 1 for Fib(4). We define a thread to be a maximal sequence of instructions not containing the parallel control statements spawn, sync, and return. Threads make up the set V of vertices of the multithreaded computation dag G. Each procedure execution is a linear chain of threads, each of which is connected to its successor in the chain by a continuation edge. When a thread u spawns a thread v, the dag contains a spawn edge (u, v) ∈ E, as well as a continuation edge from u to u's successor in the procedure. When a thread u returns, the dag contains an edge (u, v), where v is the thread that immediately follows the next sync in the parent procedure. Every computation starts with a single initial thread and (assuming that the computation terminates) ends with a single final thread. Since the procedures are organized in a tree hierarchy, we can view the computation as a dag of threads embedded in the tree of procedures.

(This algorithm is a terrible way to compute Fibonacci numbers, since it runs in exponential time when logarithmic methods are known, but it serves as a good didactic example.)
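The spawn/sync semantics can be imitated in ordinary Python, with one thread per spawn and join playing the role of sync. This is a didactic sketch, not Cilk: CPython threads express the logical parallelism without speeding anything up, and the output-slot parameter is our own device, since Python threads cannot return values directly.

```python
import threading

# A sketch of the Fib procedure: Thread.start() plays the role of spawn,
# Thread.join() plays the role of sync. The (out, i) slot is a stand-in
# for the pseudocode's return value.
def fib(n, out, i):
    if n < 2:
        out[i] = n
        return
    res = [0, 0]
    child = threading.Thread(target=fib, args=(n - 1, res, 0))
    child.start()          # spawn Fib(n-1): child runs in parallel with parent
    fib(n - 2, res, 1)     # parent continues on to Fib(n-2)
    child.join()           # sync: wait for the spawned child to complete
    out[i] = res[0] + res[1]

result = [0]
fib(10, result, 0)
print(result[0])  # 55
```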
slide-3
SLIDE 3

[Figure 1: the dag of fib(4), which spawns fib(3) and fib(2); fib(3) spawns fib(2) and fib(1); each fib(2) spawns fib(1) and fib(0).]

Figure 1: A dag representing the multithreaded computation of Fib(4). Threads are shown as circles, and each group of threads belonging to the same procedure is surrounded by a rounded rectangle. Downward edges are spawn dependencies, horizontal edges represent continuation dependencies within a procedure, and upward edges are return dependencies.

Performance measures

Two performance measures suffice to gauge the theoretical efficiency of multithreaded algorithms. We define the work of a multithreaded computation to be the total time to execute all the operations in the computation on one processor. We define the critical-path length of a computation to be the longest time to execute the threads along any path of dependencies in the dag. Consider, for example, the computation in Figure 1. Suppose that every thread can be executed in unit time. Then, the work of the computation is 17, and the critical-path length is 8.

When a multithreaded computation is executed on a given number P of processors, its running time depends on how efficiently the underlying scheduler can execute it. Denote by T_P the running time of a given computation on P processors. Then, the work of the computation can be viewed as T_1, and the critical-path length can be viewed as T_∞, since a single processor must execute every operation, while even infinitely many processors must respect the dependencies along the critical path.

The work and critical-path length can be used to provide lower bounds on the running time on P processors. We have

    T_P ≥ T_1 / P ,    (1)

since in one step, a P-processor computer can do at most P work. We also have

    T_P ≥ T_∞ ,    (2)

since a P-processor computer can do no more work in one step than an infinite-processor computer.
slide-4
SLIDE 4

The speedup of a computation on P processors is the ratio T_1/T_P, which indicates how many times faster the P-processor execution is than a one-processor execution. If T_1/T_P = Θ(P), then we say that the P-processor execution exhibits linear speedup. The maximum possible speedup is T_1/T_∞, which is also called the parallelism of the computation, because it represents the average amount of work that can be done in parallel for each step along the critical path. We denote the parallelism of a computation by P̄.

Greedy scheduling

The programmer of a multithreaded application has the ability to control the work and critical-path length of his application, but he has no direct control over the scheduling of his application on a given number of processors. It is up to the runtime scheduler to map the dynamically unfolding computation onto the available processors so that the computation executes efficiently. Good online schedulers are known, but their analysis is complicated. For simplicity, we'll illustrate the principles behind these schedulers using an offline greedy scheduler.

A greedy scheduler schedules as much as it can at every time step. On a P-processor computer, time steps can be classified into two types. If there are P or more threads ready to execute, the step is a complete step, and the scheduler executes any P of those threads ready to execute. If there are fewer than P threads ready to execute, the step is an incomplete step, and the scheduler executes all of them. This greedy strategy is provably good.

Theorem 1 (Graham, Brent) A greedy scheduler executes any multithreaded computation G with work T_1 and critical-path length T_∞ in time

    T_P ≤ T_1/P + T_∞    (3)

on a computer with P processors.

Proof. For each complete step, P work is done by the P processors. Thus, the number of complete steps is at most ⌊T_1/P⌋, because after ⌊T_1/P⌋ such steps, all the work in the computation has been performed. Now, consider an incomplete step, and consider the subdag G′ of G that remains to be executed. Without loss of generality, we can view each of the threads as executing in unit time, since we can replace a longer thread with a chain of unit-time threads. Every thread with in-degree 0 is ready to be executed, since all of its predecessors have already executed. By the greedy scheduling policy, all such threads are executed, since there are strictly fewer than P such threads. Thus, the critical-path length of G′ is reduced by 1. Since the critical-path length of the subdag remaining to be executed decreases by 1 for each incomplete step, the number of incomplete steps is at most T_∞. Each step is either complete or incomplete, and hence Inequality (3) follows.

Corollary 2 A greedy scheduler achieves linear speedup when P = O(P̄).

Proof. Since P̄ = T_1/T_∞, we have P = O(T_1/T_∞), or equivalently, that T_∞ = O(T_1/P). Thus, we have T_P ≤ T_1/P + T_∞ = O(T_1/P).
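The complete-step/incomplete-step argument can be watched in action on a toy dag. The sketch below (a hypothetical 6-thread dag of our own, not the Fib example) runs a greedy scheduler step by step and reports T_1, T_∞, and T_2, which obey the bound of Theorem 1.

```python
from collections import defaultdict

# A 6-thread dag of unit-time threads: edges map a thread to the
# threads that depend on it.
edges = {1: [2, 3], 2: [4, 5], 3: [5], 4: [6], 5: [6], 6: []}

def greedy_schedule(edges, P):
    indeg = defaultdict(int)
    for u, vs in edges.items():
        indeg[u] += 0
        for v in vs:
            indeg[v] += 1
    ready = sorted(u for u in edges if indeg[u] == 0)
    steps = 0
    while ready:
        # run up to P ready threads: a complete or incomplete step
        running, ready = ready[:P], ready[P:]
        steps += 1
        for u in running:
            for v in edges[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
    return steps

T1 = greedy_schedule(edges, 1)            # work: 6 unit threads
Tinf = greedy_schedule(edges, len(edges)) # critical-path length: 4
T2 = greedy_schedule(edges, 2)
print(T1, Tinf, T2)  # 6 4 4, and indeed T2 <= T1/2 + Tinf
```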
slide-5
SLIDE 5

Cilk and ★Socrates

Cilk is a parallel, multithreaded language based on the serial programming language C. Instrumentation in the Cilk scheduler provides an accurate measure of work and critical path. Cilk's randomized scheduler provably executes a multithreaded computation on a P-processor computer in T_P = T_1/P + O(T_∞) expected time. Empirically, the scheduler achieves T_P ≈ T_1/P + T_∞ time, yielding near-perfect linear speedup if P ≪ P̄. You can read more about Cilk on the Web at http://theory.lcs.mit.edu/~cilk.

Among the applications that have been programmed in Cilk are the ★Socrates and Cilkchess chess-playing programs. These programs have won numerous prizes in international competition and are considered to be among the strongest in the world. An interesting anomaly occurred during the development of ★Socrates which was resolved by understanding the measures of work and critical-path length.

The ★Socrates program was initially developed on a 32-processor computer at MIT, but it was intended to run on a 512-processor computer at the National Center for Supercomputing Applications (NCSA) at the University of Illinois. A clever optimization was proposed which, during testing at MIT, caused the program to run much faster than the original program. Nevertheless, the optimization was abandoned, because an analysis of work and critical-path length indicated that the program would actually be slower on the NCSA machine.

Let us examine this anomaly in more detail. For simplicity, the actual timing numbers have been simplified. The original program ran in T_32 = 65 seconds at MIT on 32 processors. The optimized program ran in T′_32 = 40 seconds, also on 32 processors. The original program had work T_1 = 2048 seconds and critical-path length T_∞ = 1 second. Using the formula T_P = T_1/P + T_∞ as a good approximation of runtime, we discover that indeed T_32 = 2048/32 + 1 = 65. The optimized program had work T′_1 = 1024 seconds and critical-path length T′_∞ = 8 seconds, yielding T′_32 = 1024/32 + 8 = 40. But now, let us determine the runtimes on 512 processors. We have T_512 = 2048/512 + 1 = 5 and T′_512 = 1024/512 + 8 = 10, which is twice as slow. Thus, by using work and critical-path length, we can predict the performance of a multithreaded computation.

Exercise 1. Sketch the multithreaded computation that results from executing Fib(5). Assume that all threads in the computation execute in unit time. What is the work of the computation? What is the critical-path length? Show how to schedule the dag on 2 processors in a greedy fashion by labeling each thread with the time step in which it executes.

Exercise 2. Write a multithreaded procedure Sum(A, n), where A[1..n] is an array, which uses divide-and-conquer to sum the elements of the array A in parallel.

Exercise 3. Prove that a greedy scheduler achieves the stronger bound

    T_P ≤ (T_1 − T_∞)/P + T_∞ .

Exercise 4. Prove that the time for a greedy scheduler to execute any multithreaded computation is within a factor of 2 of the time required by an optimal scheduler.
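The ★Socrates prediction is just the approximation T_P ≈ T_1/P + T_∞ evaluated four times; a few lines of Python reproduce the arithmetic above.

```python
# Predict runtime from work and critical-path length using T_P = T_1/P + T_inf.
def predicted_time(work, span, P):
    return work / P + span

orig = (2048, 1)   # (T_1, T_inf) of the original program, in seconds
opt  = (1024, 8)   # after the proposed "optimization"

# At MIT (32 processors) the optimized program looks better...
print(predicted_time(*orig, 32), predicted_time(*opt, 32))    # 65.0 40.0
# ...but at NCSA (512 processors) it is twice as slow.
print(predicted_time(*orig, 512), predicted_time(*opt, 512))  # 5.0 10.0
```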
slide-6
SLIDE 6

Exercise 5. For what number P of processors do the two chess programs described in this section run equally fast?

Exercise 6. Professor Tweed takes some measurements of his (deterministic) multithreaded program, which is scheduled using a greedy scheduler, and finds that T_4 = 80 seconds and T_64 = 10 seconds. What is the fastest that the professor's computation could possibly run on 10 processors? Use Inequality (3) and the two lower bounds from Inequalities (1) and (2) to derive your answer.

Analysis of multithreaded algorithms

We now turn to the design and analysis of multithreaded algorithms. Because of the divide-and-conquer nature of the multithreaded model, recurrences are a natural way to express the work and critical-path length of a multithreaded algorithm. We shall investigate algorithms for matrix multiplication and sorting and analyze their performance.

Parallel matrix multiplication

To multiply two n × n matrices A and B in parallel to produce a matrix C, we can recursively formulate the problem as follows:

    ( C11  C12 )   ( A11  A12 )   ( B11  B12 )
    (          ) = (          ) · (          )
    ( C21  C22 )   ( A21  A22 )   ( B21  B22 )

                   ( A11·B11 + A12·B21    A11·B12 + A12·B22 )
                 = (                                        )
                   ( A21·B11 + A22·B21    A21·B12 + A22·B22 )

Thus, each n × n matrix multiplication can be expressed as 8 multiplications and 4 additions of (n/2) × (n/2) submatrices. The multithreaded procedure Mult multiplies two n × n matrices, where n is a power of 2, using an auxiliary procedure Add to add n × n matrices. This algorithm is not in-place.

Add(C, T, n)
1  if n = 1
2    then C[1, 1] ← C[1, 1] + T[1, 1]
3    else partition C and T into (n/2) × (n/2) submatrices
4         spawn Add(C11, T11, n/2)
5         spawn Add(C12, T12, n/2)
6         spawn Add(C21, T21, n/2)
7         spawn Add(C22, T22, n/2)
8         sync
slide-7
SLIDE 7

Mult(C, A, B, n)
1  if n = 1
2    then C[1, 1] ← A[1, 1] · B[1, 1]
3    else allocate a temporary matrix T[1..n, 1..n]
4         partition A, B, C, and T into (n/2) × (n/2) submatrices
5         spawn Mult(C11, A11, B11, n/2)
6         spawn Mult(C12, A11, B12, n/2)
7         spawn Mult(C21, A21, B11, n/2)
8         spawn Mult(C22, A21, B12, n/2)
9         spawn Mult(T11, A12, B21, n/2)
10        spawn Mult(T12, A12, B22, n/2)
11        spawn Mult(T21, A22, B21, n/2)
12        spawn Mult(T22, A22, B22, n/2)
13        sync
14        spawn Add(C, T, n)
15        sync

The matrix partitionings in line 4 of Mult and line 3 of Add take O(1) time, since only a constant number of indexing operations are required.

To analyze this algorithm, let A_P(n) be the P-processor running time of Add on n × n matrices, and let M_P(n) be the P-processor running time of Mult on n × n matrices. The work (running time on one processor) for Add can be expressed by the recurrence

    A_1(n) = 4 A_1(n/2) + Θ(1)
           = Θ(n²) ,

which is the same as for the ordinary double-nested-loop serial algorithm. Since the spawned procedures can be executed in parallel, the critical-path length for Add is

    A_∞(n) = A_∞(n/2) + Θ(1)
           = Θ(lg n) .

The work for Mult can be expressed by the recurrence

    M_1(n) = 8 M_1(n/2) + A_1(n) + Θ(1)
           = 8 M_1(n/2) + Θ(n²)
           = Θ(n³) ,

which is the same as for the ordinary triple-nested-loop serial algorithm. The critical-path length for Mult is

    M_∞(n) = M_∞(n/2) + Θ(lg n)
           = Θ(lg² n) .
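The Mult/Add pseudocode can be transcribed into plain serial Python to check the decomposition; the spawns are simply executed one after another, so this exhibits the work, not the parallelism. Matrices are nested lists, n is a power of 2, and the helper quad (our own device) copies out an (n/2) × (n/2) submatrix rather than partitioning in place.

```python
# A serial sketch of the divide-and-conquer Mult/Add procedures.
def add(C, T):
    # C <- C + T, elementwise: the recursion of Add flattened into loops,
    # with the same Theta(n^2) work as the recurrence.
    n = len(C)
    for i in range(n):
        for j in range(n):
            C[i][j] += T[i][j]

def mult(A, B):
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2

    def quad(M, r, c):
        # copy the (n/2)-by-(n/2) submatrix with top-left corner (r, c)
        return [row[c:c + h] for row in M[r:r + h]]

    # the 8 recursive submatrix multiplications of Mult (lines 5-12)
    C11 = mult(quad(A, 0, 0), quad(B, 0, 0)); T11 = mult(quad(A, 0, h), quad(B, h, 0))
    C12 = mult(quad(A, 0, 0), quad(B, 0, h)); T12 = mult(quad(A, 0, h), quad(B, h, h))
    C21 = mult(quad(A, h, 0), quad(B, 0, 0)); T21 = mult(quad(A, h, h), quad(B, h, 0))
    C22 = mult(quad(A, h, 0), quad(B, 0, h)); T22 = mult(quad(A, h, h), quad(B, h, h))
    # the Add call (line 14): C <- C + T, quadrant by quadrant
    add(C11, T11); add(C12, T12); add(C21, T21); add(C22, T22)
    top = [a + b for a, b in zip(C11, C12)]
    bot = [a + b for a, b in zip(C21, C22)]
    return top + bot

print(mult([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```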
slide-8
SLIDE 8

Thus, the parallelism for Mult is M_1(n)/M_∞(n) = Θ(n³/lg² n), which is quite high. To multiply 1000 × 1000 matrices, for example, the parallelism is (ignoring constants) about 10⁹/10² = 10⁷. Most parallel computers have far fewer processors.

To achieve high performance, it is often advantageous for an algorithm to use less space, because more space usually means more time. For the matrix-multiplication problem, we can eliminate the temporary matrix T in exchange for reducing the parallelism. Our new algorithm MultAdd performs C ← C + A · B using a similar divide-and-conquer strategy to Mult.

MultAdd(C, A, B, n)
1  if n = 1
2    then C[1, 1] ← C[1, 1] + A[1, 1] · B[1, 1]
3    else partition A, B, and C into (n/2) × (n/2) submatrices
4         spawn MultAdd(C11, A11, B11, n/2)
5         spawn MultAdd(C12, A11, B12, n/2)
6         spawn MultAdd(C21, A21, B11, n/2)
7         spawn MultAdd(C22, A21, B12, n/2)
8         sync
9         spawn MultAdd(C11, A12, B21, n/2)
10        spawn MultAdd(C12, A12, B22, n/2)
11        spawn MultAdd(C21, A22, B21, n/2)
12        spawn MultAdd(C22, A22, B22, n/2)
13        sync

Let MA_P(n) be the P-processor running time of MultAdd on n × n matrices. The work for MultAdd is MA_1(n) = Θ(n³), following the same analysis as for Mult, but the critical-path length is now

    MA_∞(n) = 2 MA_∞(n/2) + Θ(1)
            = Θ(n) ,

since only 4 recursive calls can be executed in parallel. Thus, the parallelism is MA_1(n)/MA_∞(n) = Θ(n²). On 1000 × 1000 matrices, for example, the parallelism is (ignoring constants) still quite high: about 10⁶. In practice, this algorithm often runs somewhat faster than the first, since saving space often saves time due to hierarchical memory.
slide-9
SLIDE 9

[Figure 2 shows arrays A and B, the median element of A, and the split position j found in B by binary search.]

Figure 2: Illustration of P-Merge. The median of array A is used to partition array B, and then the lower portions of the two arrays are recursively merged, as in parallel are the upper portions.

Parallel merge sort

This section shows how to parallelize merge sort. We shall see that the parallelism of the algorithm depends on how well the merge subroutine can be parallelized. The most straightforward way to parallelize merge sort is to run the recursion in parallel, as is done in the following pseudocode:

MergeSort(A, p, r)
1  if p < r
2    then q ← ⌊(p + r)/2⌋
3         spawn MergeSort(A, p, q)
4         spawn MergeSort(A, q + 1, r)
5         sync
6         Merge(A, p, q, r)

The work of MergeSort on an array of n elements is

    T_1(n) = 2 T_1(n/2) + Θ(n)
           = Θ(n lg n) ,

since the running time of Merge is Θ(n). Since the two recursive spawns operate in parallel, the critical-path length of MergeSort is

    T_∞(n) = T_∞(n/2) + Θ(n)
           = Θ(n) .

Consequently, the parallelism of the algorithm is T_1(n)/T_∞(n) = Θ(lg n), which is puny. The obvious bottleneck is Merge. The following pseudocode, which is illustrated in Figure 2, performs the merge in parallel.
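The MergeSort pseudocode maps directly onto Python threads (a sketch: Thread.start for spawn, join for sync, and an ordinary serial two-finger Merge):

```python
import threading

def merge(A, p, q, r):
    # serial Merge: combine the sorted runs A[p..q] and A[q+1..r]
    left, right = A[p:q + 1], A[q + 1:r + 1]
    i = j = 0
    for k in range(p, r + 1):
        if j >= len(right) or (i < len(left) and left[i] <= right[j]):
            A[k] = left[i]; i += 1
        else:
            A[k] = right[j]; j += 1

def merge_sort(A, p, r):
    if p < r:
        q = (p + r) // 2
        t = threading.Thread(target=merge_sort, args=(A, p, q))
        t.start()                  # spawn MergeSort(A, p, q)
        merge_sort(A, q + 1, r)    # the second spawn, run in the parent
        t.join()                   # sync
        merge(A, p, q, r)

A = [5, 2, 4, 7, 1, 3, 2, 6]
merge_sort(A, 0, len(A) - 1)
print(A)  # [1, 2, 2, 3, 4, 5, 6, 7]
```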
slide-10
SLIDE 10

P-Merge(A[1..l], B[1..m], C[1..n])
1  if m > l                     ▷ without loss of generality, larger array should be first
2    then spawn P-Merge(B[1..m], A[1..l], C[1..n])
3  elseif n = 1
4    then C[1] ← A[1]
5  elseif l = 1                 ▷ and hence m = 1
6    then if A[1] ≤ B[1]
7           then C[1] ← A[1]; C[2] ← B[1]
8           else C[1] ← B[1]; C[2] ← A[1]
9  else find j such that B[j] ≤ A[⌈l/2⌉] ≤ B[j + 1] using binary search
10       spawn P-Merge(A[1..⌈l/2⌉], B[1..j], C[1..⌈l/2⌉ + j])
11       spawn P-Merge(A[⌈l/2⌉ + 1..l], B[j + 1..m], C[⌈l/2⌉ + j + 1..n])
12       sync

This merging algorithm finds the median of the larger array and uses it to partition the smaller array. Then, the lower portions of the two arrays are recursively merged, and in parallel, so are the upper portions.

To analyze P-Merge, let PM_P(n) be the P-processor time to merge two arrays A and B having n = l + m elements in total. Without loss of generality, let A be the larger of the two arrays, that is, assume l ≥ m.

We'll analyze the critical-path length first. The binary search of B takes Θ(lg m) time, which in the worst case is Θ(lg n). Since the two recursive spawns in lines 10 and 11 operate in parallel, the worst-case critical-path length is Θ(lg n) plus the worst-case critical-path length of the spawn operating on the larger subarrays. In the worst case, we must merge half of A with all of B, in which case the recursive spawn operates on at most 3n/4 elements. Thus, we have

    PM_∞(n) ≤ PM_∞(3n/4) + Θ(lg n)
            = Θ(lg² n) .

To analyze the work of P-Merge, observe that although the two recursive spawns may operate on different numbers of elements, they always operate on n elements between them. Let αn be the number of elements operated on by the first spawn, where α is a constant in the range 1/4 ≤ α ≤ 3/4. Thus, the second spawn operates on (1 − α)n elements, and the worst-case work satisfies the recurrence

    PM_1(n) = PM_1(αn) + PM_1((1 − α)n) + Θ(lg n) .

We shall show that PM_1(n) = Θ(n) using the substitution method. (Actually, the Akra-Bazzi method, if you know it, is simpler.) We assume inductively that PM_1(n) ≤ an − b lg n for some constants a, b > 0. We have

    PM_1(n) ≤ aαn − b lg(αn) + a(1 − α)n − b lg((1 − α)n) + Θ(lg n)
            = an − b(lg(αn) + lg((1 − α)n)) + Θ(lg n)
            = an − b(lg α + lg n + lg(1 − α) + lg n) + Θ(lg n)
slide-11
SLIDE 11

            = an − b lg n − (b(lg n + lg(α(1 − α))) − Θ(lg n))
            ≤ an − b lg n ,

since we can choose b large enough so that b(lg n + lg(α(1 − α))) dominates the Θ(lg n) term. Moreover, we can pick a large enough to satisfy the base conditions. Thus, PM_1(n) = Θ(n), which is the same work asymptotically as the ordinary, serial merging algorithm.

We can now reanalyze the MergeSort using the P-Merge subroutine. The work T_1(n) remains the same, but the worst-case critical-path length now satisfies

    T_∞(n) = T_∞(n/2) + Θ(lg² n)
           = Θ(lg³ n) .

The parallelism is now Θ(n lg n)/Θ(lg³ n) = Θ(n/lg² n).

Exercise 7. Give an efficient and highly parallel multithreaded algorithm for multiplying an n × n matrix A by a length-n vector x that achieves work Θ(n²) and critical path Θ(lg n). Analyze the work and critical-path length of your implementation, and give the parallelism.

Exercise 8. Describe a multithreaded algorithm for matrix multiplication that achieves work Θ(n³) and critical path Θ(lg n). Comment informally on the locality displayed by your algorithm in the ideal-cache model as compared with the two algorithms from this section.

Exercise 9. Write a Cilk program to multiply an n1 × n2 matrix by an n2 × n3 matrix in parallel. Analyze the work, critical-path length, and parallelism of your implementation. Your algorithm should be efficient even if any of n1, n2, and n3 are 1.

Exercise 10. Write a Cilk program to implement Strassen's matrix multiplication algorithm in parallel as efficiently as you can. Analyze the work, critical-path length, and parallelism of your implementation.

Exercise 11. Write a Cilk program to invert a symmetric and positive-definite matrix in parallel. (Hint: Use a divide-and-conquer approach based on the ideas of the corresponding theorem in Cormen, Leiserson, and Rivest.)

Exercise 12. Akl and Santoro have proposed a merging algorithm in which the first step is to find the median of all the elements in the two sorted input arrays, as opposed to the median of the elements in the larger subarray, as is done in P-Merge. Show that if the total number of elements in the two arrays is n, this median can be found using Θ(lg n) time on one processor in the worst case. Describe a linear-work multithreaded merging algorithm based on this subroutine that has a parallelism of Θ(n/lg² n). Give and solve the recurrences for work and critical-path length, and determine the parallelism. Implement your algorithm as a Cilk program.
slide-12
SLIDE 12

Exercise 13. Generalize the algorithm from Exercise 12 to find arbitrary order statistics. Describe a merge-sorting algorithm with Θ(n lg n) work that achieves a parallelism of Θ(n/lg n). (Hint: Merge many subarrays in parallel.)

Exercise 14. The length of a longest common subsequence of two length-n sequences x and y can be computed in parallel using a divide-and-conquer multithreaded algorithm. Denote by c[i, j] the length of a longest common subsequence of x[1..i] and y[1..j]. First, the multithreaded algorithm recursively computes c[i, j] for all i in the range 1 ≤ i ≤ n/2 and all j in the range 1 ≤ j ≤ n/2. Then, it recursively computes c[i, j] for 1 ≤ i ≤ n/2 and n/2 < j ≤ n, while in parallel recursively computing c[i, j] for n/2 < i ≤ n and 1 ≤ j ≤ n/2. Finally, it recursively computes c[i, j] for n/2 < i ≤ n and n/2 < j ≤ n. For the base case, the algorithm computes c[i, j] in terms of c[i − 1, j − 1], c[i − 1, j], and c[i, j − 1] in the ordinary way, since the logic of the algorithm guarantees that these three values have already been computed. That is, if the dynamic-programming tableau is broken into four pieces

    ( I    II )
    ( III  IV )

then the recursive multithreaded code would look something like this:

    spawn I
    sync
    spawn II
    spawn III
    sync
    spawn IV
    sync

Analyze the work, critical-path length, and parallelism of this algorithm. Describe and analyze an algorithm that is asymptotically as efficient (same work) but more parallel. Make whatever interesting observations you can. Write an efficient Cilk program for the problem.

References

[1] Selim G. Akl and Nicola Santoro. Optimal parallel merging and sorting without memory conflicts. IEEE Transactions on Computers, C-36(11), November 1987.

[2] M. Akra and L. Bazzi. On the solution of linear recurrence equations. Computational Optimization and Application, 10:195-210, 1998.

[3] Robert D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.
slide-13
SLIDE 13

[4] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 207-216, Santa Barbara, California, July 1995.

[5] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), pages 356-368, Santa Fe, New Mexico, November 1994.

[6] Richard P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the ACM, 21(2):201-206, April 1974.

[7] Cilk-5.1 (Beta 1) Reference Manual. Available on the Internet from http://theory.lcs.mit.edu/~cilk.

[8] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. MIT Press and McGraw-Hill, 1990.

[9] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 212-223, Montreal, Canada, June 1998.

[10] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416-429, March 1969.

[11] Keith H. Randall. Cilk: Efficient Multithreaded Computing. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1998.