slide-1
SLIDE 1

A Minicourse on Multithreaded Programming

Charles E. Leiserson and Harald Prokop
MIT Laboratory for Computer Science
545 Technology Square
Cambridge, Massachusetts 02139
{cel,prokop}@lcs.mit.edu

July 1998

Abstract. These notes contain two lectures that teach multithreaded algorithms using a Cilk-like model. These lectures were designed for the latter part of the MIT undergraduate class 6.046 Introduction to Algorithms. The style of the lecture notes follows that of the textbook by Cormen, Leiserson, and Rivest, but the pseudocode from that textbook has been "Cilkified" to allow it to describe multithreaded algorithms. The first lecture teaches the basics behind multithreading, including defining the measures of work and critical-path length. It culminates in the greedy scheduling theorem due to Graham and Brent. The second lecture shows how parallel applications, including matrix multiplication and sorting, can be analyzed using divide-and-conquer recurrences.

Multithreaded programming

As multiprocessor systems have become increasingly available, interest has grown in parallel programming. Multithreaded programming is a programming paradigm in which a single program is broken into multiple threads of control which interact to solve a single problem. These notes provide an introduction to the analysis of multithreaded algorithms.

This research was supported in part by the Defense Advanced Research Projects Agency (DARPA) under Grant F30602-97-1-0270.
slide-2
SLIDE 2

Model

Our model of multithreaded computation is based on the procedure abstraction found in virtually any programming language. As an example, the procedure Fib gives a multithreaded algorithm for computing the Fibonacci numbers:

Fib(n)
1  if n < 2
2    then return n
3    else x ← spawn Fib(n − 1)
4         y ← spawn Fib(n − 2)
5         sync
6         return (x + y)

A spawn is the parallel analog of an ordinary subroutine call. The keyword spawn before the subroutine call in line 3 indicates that the subprocedure Fib(n − 1) can execute in parallel with the procedure Fib(n) itself. Unlike an ordinary function call, however, where the parent is not resumed until after its child returns, in the case of a spawn, the parent can continue to execute in parallel with the child. In this case, the parent goes on to spawn Fib(n − 2). In general, the parent can continue to spawn off children, producing a high degree of parallelism.

A procedure cannot safely use the return values of the children it has spawned until it executes a sync statement. If any of its children have not completed when it executes a sync, the procedure suspends and does not resume until all of its children have completed. When all of its children return, execution of the procedure resumes at the point immediately following the sync statement. In the Fibonacci example, the sync statement in line 5 is required before the return statement in line 6 to avoid the anomaly that would occur if x and y were summed before each had been computed.

The spawn and sync keywords specify logical parallelism, not actual parallelism. That is, these keywords indicate which code may possibly execute in parallel, but what actually runs in parallel is determined by a scheduler, which maps the dynamically unfolding computation onto the available processors.

We can view a multithreaded computation in graph-theoretic terms as a dynamically unfolding dag G = (V, E), as is shown in Figure 1 for Fib(4). We define a thread to be a maximal sequence of instructions not containing the parallel control statements spawn, sync, and return. Threads make up the set V of vertices of the multithreaded computation dag G. Each procedure execution is a linear chain of threads, each of which is connected to its successor in the chain by a continuation edge. When a thread u spawns a thread v, the dag contains a spawn edge (u, v) ∈ E, as well as a continuation edge from u to u's successor in the procedure. When a thread u returns, the dag contains an edge (u, v), where v is the thread that immediately follows the next sync in the parent procedure. Every computation starts with a single initial thread and (assuming that the computation terminates) ends with a single final thread. Since the procedures are organized in a tree hierarchy, we can view the computation as a dag of threads embedded in the tree of procedures.

(This algorithm is a terrible way to compute Fibonacci numbers, since it runs in exponential time when logarithmic methods are known, but it serves as a good didactic example.)
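The spawn/sync semantics can be imitated in ordinary Python, with one thread per spawn and join playing the role of sync. This is a didactic sketch, not Cilk: CPython threads express the logical parallelism without speeding anything up, and the output-slot parameter is our own device, since Python threads cannot return values directly.

```python
import threading

# A sketch of the Fib procedure: Thread.start() plays the role of spawn,
# Thread.join() plays the role of sync. The (out, i) slot is a stand-in
# for the pseudocode's return value.
def fib(n, out, i):
    if n < 2:
        out[i] = n
        return
    res = [0, 0]
    child = threading.Thread(target=fib, args=(n - 1, res, 0))
    child.start()          # spawn Fib(n-1): child runs in parallel with parent
    fib(n - 2, res, 1)     # parent continues on to Fib(n-2)
    child.join()           # sync: wait for the spawned child to complete
    out[i] = res[0] + res[1]

result = [0]
fib(10, result, 0)
print(result[0])  # 55
```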
slide-3
SLIDE 3

[Figure 1: the dag of fib(4), which spawns fib(3) and fib(2); fib(3) spawns fib(2) and fib(1); each fib(2) spawns fib(1) and fib(0).]

Figure 1: A dag representing the multithreaded computation of Fib(4). Threads are shown as circles, and each group of threads belonging to the same procedure is surrounded by a rounded rectangle. Downward edges are spawn dependencies, horizontal edges represent continuation dependencies within a procedure, and upward edges are return dependencies.

Performance measures

Two performance measures suffice to gauge the theoretical efficiency of multithreaded algorithms. We define the work of a multithreaded computation to be the total time to execute all the operations in the computation on one processor. We define the critical-path length of a computation to be the longest time to execute the threads along any path of dependencies in the dag. Consider, for example, the computation in Figure 1. Suppose that every thread can be executed in unit time. Then, the work of the computation is 17, and the critical-path length is 8.

When a multithreaded computation is executed on a given number P of processors, its running time depends on how efficiently the underlying scheduler can execute it. Denote by T_P the running time of a given computation on P processors. Then, the work of the computation can be viewed as T_1, and the critical-path length can be viewed as T_∞, since a single processor must execute every operation, while even infinitely many processors must respect the dependencies along the critical path.

The work and critical-path length can be used to provide lower bounds on the running time on P processors. We have

    T_P ≥ T_1 / P ,    (1)

since in one step, a P-processor computer can do at most P work. We also have

    T_P ≥ T_∞ ,    (2)

since a P-processor computer can do no more work in one step than an infinite-processor computer.
slide-4
SLIDE 4

The speedup of a computation on P processors is the ratio T_1/T_P, which indicates how many times faster the P-processor execution is than a one-processor execution. If T_1/T_P = Θ(P), then we say that the P-processor execution exhibits linear speedup. The maximum possible speedup is T_1/T_∞, which is also called the parallelism of the computation, because it represents the average amount of work that can be done in parallel for each step along the critical path. We denote the parallelism of a computation by P̄.

Greedy scheduling

The programmer of a multithreaded application has the ability to control the work and critical-path length of his application, but he has no direct control over the scheduling of his application on a given number of processors. It is up to the runtime scheduler to map the dynamically unfolding computation onto the available processors so that the computation executes efficiently. Good online schedulers are known, but their analysis is complicated. For simplicity, we'll illustrate the principles behind these schedulers using an offline greedy scheduler.

A greedy scheduler schedules as much as it can at every time step. On a P-processor computer, time steps can be classified into two types. If there are P or more threads ready to execute, the step is a complete step, and the scheduler executes any P of those threads ready to execute. If there are fewer than P threads ready to execute, the step is an incomplete step, and the scheduler executes all of them. This greedy strategy is provably good.

Theorem 1 (Graham, Brent) A greedy scheduler executes any multithreaded computation G with work T_1 and critical-path length T_∞ in time

    T_P ≤ T_1/P + T_∞    (3)

on a computer with P processors.

Proof. For each complete step, P work is done by the P processors. Thus, the number of complete steps is at most ⌊T_1/P⌋, because after ⌊T_1/P⌋ such steps, all the work in the computation has been performed. Now, consider an incomplete step, and consider the subdag G′ of G that remains to be executed. Without loss of generality, we can view each of the threads as executing in unit time, since we can replace a longer thread with a chain of unit-time threads. Every thread with in-degree 0 is ready to be executed, since all of its predecessors have already executed. By the greedy scheduling policy, all such threads are executed, since there are strictly fewer than P such threads. Thus, the critical-path length of G′ is reduced by 1. Since the critical-path length of the subdag remaining to be executed decreases by 1 for each incomplete step, the number of incomplete steps is at most T_∞. Each step is either complete or incomplete, and hence Inequality (3) follows.

Corollary 2 A greedy scheduler achieves linear speedup when P = O(P̄).

Proof. Since P̄ = T_1/T_∞, we have P = O(T_1/T_∞), or equivalently, that T_∞ = O(T_1/P). Thus, we have T_P ≤ T_1/P + T_∞ = O(T_1/P).
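The complete-step/incomplete-step argument can be watched in action on a toy dag. The sketch below (a hypothetical 6-thread dag of our own, not the Fib example) runs a greedy scheduler step by step and reports T_1, T_∞, and T_2, which obey the bound of Theorem 1.

```python
from collections import defaultdict

# A 6-thread dag of unit-time threads: edges map a thread to the
# threads that depend on it.
edges = {1: [2, 3], 2: [4, 5], 3: [5], 4: [6], 5: [6], 6: []}

def greedy_schedule(edges, P):
    indeg = defaultdict(int)
    for u, vs in edges.items():
        indeg[u] += 0
        for v in vs:
            indeg[v] += 1
    ready = sorted(u for u in edges if indeg[u] == 0)
    steps = 0
    while ready:
        # run up to P ready threads: a complete or incomplete step
        running, ready = ready[:P], ready[P:]
        steps += 1
        for u in running:
            for v in edges[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
    return steps

T1 = greedy_schedule(edges, 1)            # work: 6 unit threads
Tinf = greedy_schedule(edges, len(edges)) # critical-path length: 4
T2 = greedy_schedule(edges, 2)
print(T1, Tinf, T2)  # 6 4 4, and indeed T2 <= T1/2 + Tinf
```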
slide-5
SLIDE 5

Cilk and ★Socrates

Cilk is a parallel, multithreaded language based on the serial programming language C. Instrumentation in the Cilk scheduler provides an accurate measure of work and critical path. Cilk's randomized scheduler provably executes a multithreaded computation on a P-processor computer in T_P = T_1/P + O(T_∞) expected time. Empirically, the scheduler achieves T_P ≈ T_1/P + T_∞ time, yielding near-perfect linear speedup if P ≪ P̄. You can read more about Cilk on the Web at http://theory.lcs.mit.edu/~cilk.

Among the applications that have been programmed in Cilk are the ★Socrates and Cilkchess chess-playing programs. These programs have won numerous prizes in international competition and are considered to be among the strongest in the world. An interesting anomaly occurred during the development of ★Socrates which was resolved by understanding the measures of work and critical-path length.

The ★Socrates program was initially developed on a 32-processor computer at MIT, but it was intended to run on a 512-processor computer at the National Center for Supercomputing Applications (NCSA) at the University of Illinois. A clever optimization was proposed which, during testing at MIT, caused the program to run much faster than the original program. Nevertheless, the optimization was abandoned, because an analysis of work and critical-path length indicated that the program would actually be slower on the NCSA machine.

Let us examine this anomaly in more detail. For simplicity, the actual timing numbers have been simplified. The original program ran in T_32 = 65 seconds at MIT on 32 processors. The optimized program ran in T′_32 = 40 seconds, also on 32 processors. The original program had work T_1 = 2048 seconds and critical-path length T_∞ = 1 second. Using the formula T_P = T_1/P + T_∞ as a good approximation of runtime, we discover that indeed T_32 = 2048/32 + 1 = 65. The optimized program had work T′_1 = 1024 seconds and critical-path length T′_∞ = 8 seconds, yielding T′_32 = 1024/32 + 8 = 40. But now, let us determine the runtimes on 512 processors. We have T_512 = 2048/512 + 1 = 5 and T′_512 = 1024/512 + 8 = 10, which is twice as slow. Thus, by using work and critical-path length, we can predict the performance of a multithreaded computation.

Exercise 1. Sketch the multithreaded computation that results from executing Fib(5). Assume that all threads in the computation execute in unit time. What is the work of the computation? What is the critical-path length? Show how to schedule the dag on 2 processors in a greedy fashion by labeling each thread with the time step in which it executes.

Exercise 2. Write a multithreaded procedure Sum(A, n), where A[1..n] is an array, which uses divide-and-conquer to sum the elements of the array A in parallel.

Exercise 3. Prove that a greedy scheduler achieves the stronger bound

    T_P ≤ (T_1 − T_∞)/P + T_∞ .

Exercise 4. Prove that the time for a greedy scheduler to execute any multithreaded computation is within a factor of 2 of the time required by an optimal scheduler.
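The ★Socrates prediction is just the approximation T_P ≈ T_1/P + T_∞ evaluated four times; a few lines of Python reproduce the arithmetic above.

```python
# Predict runtime from work and critical-path length using T_P = T_1/P + T_inf.
def predicted_time(work, span, P):
    return work / P + span

orig = (2048, 1)   # (T_1, T_inf) of the original program, in seconds
opt  = (1024, 8)   # after the proposed "optimization"

# At MIT (32 processors) the optimized program looks better...
print(predicted_time(*orig, 32), predicted_time(*opt, 32))    # 65.0 40.0
# ...but at NCSA (512 processors) it is twice as slow.
print(predicted_time(*orig, 512), predicted_time(*opt, 512))  # 5.0 10.0
```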
slide-6
SLIDE 6

Exercise 5. For what number P of processors do the two chess programs described in this section run equally fast?

Exercise 6. Professor Tweed takes some measurements of his (deterministic) multithreaded program, which is scheduled using a greedy scheduler, and finds that T_4 = 80 seconds and T_64 = 10 seconds. What is the fastest that the professor's computation could possibly run on 10 processors? Use Inequality (3) and the two lower bounds from Inequalities (1) and (2) to derive your answer.

Analysis of multithreaded algorithms

We now turn to the design and analysis of multithreaded algorithms. Because of the divide-and-conquer nature of the multithreaded model, recurrences are a natural way to express the work and critical-path length of a multithreaded algorithm. We shall investigate algorithms for matrix multiplication and sorting and analyze their performance.

Parallel matrix multiplication

To multiply two n × n matrices A and B in parallel to produce a matrix C, we can recursively formulate the problem as follows:

    ( C11  C12 )   ( A11  A12 )   ( B11  B12 )
    (          ) = (          ) · (          )
    ( C21  C22 )   ( A21  A22 )   ( B21  B22 )

                   ( A11·B11 + A12·B21    A11·B12 + A12·B22 )
                 = (                                        )
                   ( A21·B11 + A22·B21    A21·B12 + A22·B22 )

Thus, each n × n matrix multiplication can be expressed as 8 multiplications and 4 additions of (n/2) × (n/2) submatrices. The multithreaded procedure Mult multiplies two n × n matrices, where n is a power of 2, using an auxiliary procedure Add to add n × n matrices. This algorithm is not in-place.

Add(C, T, n)
1  if n = 1
2    then C[1, 1] ← C[1, 1] + T[1, 1]
3    else partition C and T into (n/2) × (n/2) submatrices
4         spawn Add(C11, T11, n/2)
5         spawn Add(C12, T12, n/2)
6         spawn Add(C21, T21, n/2)
7         spawn Add(C22, T22, n/2)
8         sync
slide-7
SLIDE 7

Mult(C, A, B, n)
1  if n = 1
2    then C[1, 1] ← A[1, 1] · B[1, 1]
3    else allocate a temporary matrix T[1..n, 1..n]
4         partition A, B, C, and T into (n/2) × (n/2) submatrices
5         spawn Mult(C11, A11, B11, n/2)
6         spawn Mult(C12, A11, B12, n/2)
7         spawn Mult(C21, A21, B11, n/2)
8         spawn Mult(C22, A21, B12, n/2)
9         spawn Mult(T11, A12, B21, n/2)
10        spawn Mult(T12, A12, B22, n/2)
11        spawn Mult(T21, A22, B21, n/2)
12        spawn Mult(T22, A22, B22, n/2)
13        sync
14        spawn Add(C, T, n)
15        sync

The matrix partitionings in line 4 of Mult and line 3 of Add take O(1) time, since only a constant number of indexing operations are required.

To analyze this algorithm, let A_P(n) be the P-processor running time of Add on n × n matrices, and let M_P(n) be the P-processor running time of Mult on n × n matrices. The work (running time on one processor) for Add can be expressed by the recurrence

    A_1(n) = 4 A_1(n/2) + Θ(1)
           = Θ(n²) ,

which is the same as for the ordinary double-nested-loop serial algorithm. Since the spawned procedures can be executed in parallel, the critical-path length for Add is

    A_∞(n) = A_∞(n/2) + Θ(1)
           = Θ(lg n) .

The work for Mult can be expressed by the recurrence

    M_1(n) = 8 M_1(n/2) + A_1(n) + Θ(1)
           = 8 M_1(n/2) + Θ(n²)
           = Θ(n³) ,

which is the same as for the ordinary triple-nested-loop serial algorithm. The critical-path length for Mult is

    M_∞(n) = M_∞(n/2) + Θ(lg n)
           = Θ(lg² n) .
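The Mult/Add pseudocode can be transcribed into plain serial Python to check the decomposition; the spawns are simply executed one after another, so this exhibits the work, not the parallelism. Matrices are nested lists, n is a power of 2, and the helper quad (our own device) copies out an (n/2) × (n/2) submatrix rather than partitioning in place.

```python
# A serial sketch of the divide-and-conquer Mult/Add procedures.
def add(C, T):
    # C <- C + T, elementwise: the recursion of Add flattened into loops,
    # with the same Theta(n^2) work as the recurrence.
    n = len(C)
    for i in range(n):
        for j in range(n):
            C[i][j] += T[i][j]

def mult(A, B):
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2

    def quad(M, r, c):
        # copy the (n/2)-by-(n/2) submatrix with top-left corner (r, c)
        return [row[c:c + h] for row in M[r:r + h]]

    # the 8 recursive submatrix multiplications of Mult (lines 5-12)
    C11 = mult(quad(A, 0, 0), quad(B, 0, 0)); T11 = mult(quad(A, 0, h), quad(B, h, 0))
    C12 = mult(quad(A, 0, 0), quad(B, 0, h)); T12 = mult(quad(A, 0, h), quad(B, h, h))
    C21 = mult(quad(A, h, 0), quad(B, 0, 0)); T21 = mult(quad(A, h, h), quad(B, h, 0))
    C22 = mult(quad(A, h, 0), quad(B, 0, h)); T22 = mult(quad(A, h, h), quad(B, h, h))
    # the Add call (line 14): C <- C + T, quadrant by quadrant
    add(C11, T11); add(C12, T12); add(C21, T21); add(C22, T22)
    top = [a + b for a, b in zip(C11, C12)]
    bot = [a + b for a, b in zip(C21, C22)]
    return top + bot

print(mult([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```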
slide-8
SLIDE 8

Thus, the parallelism for Mult is M_1(n)/M_∞(n) = Θ(n³/lg² n), which is quite high. To multiply 1000 × 1000 matrices, for example, the parallelism is (ignoring constants) about 10⁹/10² = 10⁷. Most parallel computers have far fewer processors.

To achieve high performance, it is often advantageous for an algorithm to use less space, because more space usually means more time. For the matrix-multiplication problem, we can eliminate the temporary matrix T in exchange for reducing the parallelism. Our new algorithm MultAdd performs C ← C + A · B using a similar divide-and-conquer strategy to Mult.

MultAdd(C, A, B, n)
1  if n = 1
2    then C[1, 1] ← C[1, 1] + A[1, 1] · B[1, 1]
3    else partition A, B, and C into (n/2) × (n/2) submatrices
4         spawn MultAdd(C11, A11, B11, n/2)
5         spawn MultAdd(C12, A11, B12, n/2)
6         spawn MultAdd(C21, A21, B11, n/2)
7         spawn MultAdd(C22, A21, B12, n/2)
8         sync
9         spawn MultAdd(C11, A12, B21, n/2)
10        spawn MultAdd(C12, A12, B22, n/2)
11        spawn MultAdd(C21, A22, B21, n/2)
12        spawn MultAdd(C22, A22, B22, n/2)
13        sync

Let MA_P(n) be the P-processor running time of MultAdd on n × n matrices. The work for MultAdd is MA_1(n) = Θ(n³), following the same analysis as for Mult, but the critical-path length is now

    MA_∞(n) = 2 MA_∞(n/2) + Θ(1)
            = Θ(n) ,

since only 4 recursive calls can be executed in parallel. Thus, the parallelism is MA_1(n)/MA_∞(n) = Θ(n²). On 1000 × 1000 matrices, for example, the parallelism is (ignoring constants) still quite high: about 10⁶. In practice, this algorithm often runs somewhat faster than the first, since saving space often saves time due to hierarchical memory.
slide-9
SLIDE 9

[Figure 2 shows arrays A and B, the median element of A, and the split position j found in B by binary search.]

Figure 2: Illustration of P-Merge. The median of array A is used to partition array B, and then the lower portions of the two arrays are recursively merged, as in parallel are the upper portions.

Parallel merge sort

This section shows how to parallelize merge sort. We shall see that the parallelism of the algorithm depends on how well the merge subroutine can be parallelized. The most straightforward way to parallelize merge sort is to run the recursion in parallel, as is done in the following pseudocode:

MergeSort(A, p, r)
1  if p < r
2    then q ← ⌊(p + r)/2⌋
3         spawn MergeSort(A, p, q)
4         spawn MergeSort(A, q + 1, r)
5         sync
6         Merge(A, p, q, r)

The work of MergeSort on an array of n elements is

    T_1(n) = 2 T_1(n/2) + Θ(n)
           = Θ(n lg n) ,

since the running time of Merge is Θ(n). Since the two recursive spawns operate in parallel, the critical-path length of MergeSort is

    T_∞(n) = T_∞(n/2) + Θ(n)
           = Θ(n) .

Consequently, the parallelism of the algorithm is T_1(n)/T_∞(n) = Θ(lg n), which is puny. The obvious bottleneck is Merge. The following pseudocode, which is illustrated in Figure 2, performs the merge in parallel.
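The MergeSort pseudocode maps directly onto Python threads (a sketch: Thread.start for spawn, join for sync, and an ordinary serial two-finger Merge):

```python
import threading

def merge(A, p, q, r):
    # serial Merge: combine the sorted runs A[p..q] and A[q+1..r]
    left, right = A[p:q + 1], A[q + 1:r + 1]
    i = j = 0
    for k in range(p, r + 1):
        if j >= len(right) or (i < len(left) and left[i] <= right[j]):
            A[k] = left[i]; i += 1
        else:
            A[k] = right[j]; j += 1

def merge_sort(A, p, r):
    if p < r:
        q = (p + r) // 2
        t = threading.Thread(target=merge_sort, args=(A, p, q))
        t.start()                  # spawn MergeSort(A, p, q)
        merge_sort(A, q + 1, r)    # the second spawn, run in the parent
        t.join()                   # sync
        merge(A, p, q, r)

A = [5, 2, 4, 7, 1, 3, 2, 6]
merge_sort(A, 0, len(A) - 1)
print(A)  # [1, 2, 2, 3, 4, 5, 6, 7]
```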
slide-10
SLIDE 10

P-Merge(A[1..l], B[1..m], C[1..n])
1  if m > l                     ▷ without loss of generality, larger array should be first
2    then spawn P-Merge(B[1..m], A[1..l], C[1..n])
3  elseif n = 1
4    then C[1] ← A[1]
5  elseif l = 1                 ▷ and hence m = 1
6    then if A[1] ≤ B[1]
7           then C[1] ← A[1]; C[2] ← B[1]
8           else C[1] ← B[1]; C[2] ← A[1]
9  else find j such that B[j] ≤ A[⌈l/2⌉] ≤ B[j + 1] using binary search
10       spawn P-Merge(A[1..⌈l/2⌉], B[1..j], C[1..⌈l/2⌉ + j])
11       spawn P-Merge(A[⌈l/2⌉ + 1..l], B[j + 1..m], C[⌈l/2⌉ + j + 1..n])
12       sync

This merging algorithm finds the median of the larger array and uses it to partition the smaller array. Then, the lower portions of the two arrays are recursively merged, and in parallel, so are the upper portions.

To analyze P-Merge, let PM_P(n) be the P-processor time to merge two arrays A and B having n = l + m elements in total. Without loss of generality, let A be the larger of the two arrays, that is, assume l ≥ m.

We'll analyze the critical-path length first. The binary search of B takes Θ(lg m) time, which in the worst case is Θ(lg n). Since the two recursive spawns in lines 10 and 11 operate in parallel, the worst-case critical-path length is Θ(lg n) plus the worst-case critical-path length of the spawn operating on the larger subarrays. In the worst case, we must merge half of A with all of B, in which case the recursive spawn operates on at most 3n/4 elements. Thus, we have

    PM_∞(n) ≤ PM_∞(3n/4) + Θ(lg n)
            = Θ(lg² n) .

To analyze the work of P-Merge, observe that although the two recursive spawns may operate on different numbers of elements, they always operate on n elements between them. Let αn be the number of elements operated on by the first spawn, where α is a constant in the range 1/4 ≤ α ≤ 3/4. Thus, the second spawn operates on (1 − α)n elements, and the worst-case work satisfies the recurrence

    PM_1(n) = PM_1(αn) + PM_1((1 − α)n) + Θ(lg n) .

We shall show that PM_1(n) = Θ(n) using the substitution method. (Actually, the Akra-Bazzi method, if you know it, is simpler.) We assume inductively that PM_1(n) ≤ an − b lg n for some constants a, b > 0. We have

    PM_1(n) ≤ aαn − b lg(αn) + a(1 − α)n − b lg((1 − α)n) + Θ(lg n)
            = an − b(lg(αn) + lg((1 − α)n)) + Θ(lg n)
            = an − b(lg α + lg n + lg(1 − α) + lg n) + Θ(lg n)
slide-11
SLIDE 11

            = an − b lg n − (b(lg n + lg(α(1 − α))) − Θ(lg n))
            ≤ an − b lg n ,

since we can choose b large enough so that b(lg n + lg(α(1 − α))) dominates the Θ(lg n) term. Moreover, we can pick a large enough to satisfy the base conditions. Thus, PM_1(n) = Θ(n), which is the same work asymptotically as the ordinary, serial merging algorithm.

We can now reanalyze the MergeSort using the P-Merge subroutine. The work T_1(n) remains the same, but the worst-case critical-path length now satisfies

    T_∞(n) = T_∞(n/2) + Θ(lg² n)
           = Θ(lg³ n) .

The parallelism is now Θ(n lg n)/Θ(lg³ n) = Θ(n/lg² n).

Exercise 7. Give an efficient and highly parallel multithreaded algorithm for multiplying an n × n matrix A by a length-n vector x that achieves work Θ(n²) and critical path Θ(lg n). Analyze the work and critical-path length of your implementation, and give the parallelism.

Exercise 8. Describe a multithreaded algorithm for matrix multiplication that achieves work Θ(n³) and critical path Θ(lg n). Comment informally on the locality displayed by your algorithm in the ideal-cache model as compared with the two algorithms from this section.

Exercise 9. Write a Cilk program to multiply an n1 × n2 matrix by an n2 × n3 matrix in parallel. Analyze the work, critical-path length, and parallelism of your implementation. Your algorithm should be efficient even if any of n1, n2, and n3 are 1.

Exercise 10. Write a Cilk program to implement Strassen's matrix multiplication algorithm in parallel as efficiently as you can. Analyze the work, critical-path length, and parallelism of your implementation.

Exercise 11. Write a Cilk program to invert a symmetric and positive-definite matrix in parallel. (Hint: Use a divide-and-conquer approach based on the ideas of the corresponding theorem in Cormen, Leiserson, and Rivest.)

Exercise 12. Akl and Santoro have proposed a merging algorithm in which the first step is to find the median of all the elements in the two sorted input arrays, as opposed to the median of the elements in the larger subarray, as is done in P-Merge. Show that if the total number of elements in the two arrays is n, this median can be found using Θ(lg n) time on one processor in the worst case. Describe a linear-work multithreaded merging algorithm based on this subroutine that has a parallelism of Θ(n/lg² n). Give and solve the recurrences for work and critical-path length, and determine the parallelism. Implement your algorithm as a Cilk program.
slide-12
SLIDE 12

Exercise 13. Generalize the algorithm from Exercise 12 to find arbitrary order statistics. Describe a merge-sorting algorithm with Θ(n lg n) work that achieves a parallelism of Θ(n/lg n). (Hint: Merge many subarrays in parallel.)

Exercise 14. The length of a longest common subsequence of two length-n sequences x and y can be computed in parallel using a divide-and-conquer multithreaded algorithm. Denote by c[i, j] the length of a longest common subsequence of x[1..i] and y[1..j]. First, the multithreaded algorithm recursively computes c[i, j] for all i in the range 1 ≤ i ≤ n/2 and all j in the range 1 ≤ j ≤ n/2. Then, it recursively computes c[i, j] for 1 ≤ i ≤ n/2 and n/2 < j ≤ n, while in parallel recursively computing c[i, j] for n/2 < i ≤ n and 1 ≤ j ≤ n/2. Finally, it recursively computes c[i, j] for n/2 < i ≤ n and n/2 < j ≤ n. For the base case, the algorithm computes c[i, j] in terms of c[i − 1, j − 1], c[i − 1, j], and c[i, j − 1] in the ordinary way, since the logic of the algorithm guarantees that these three values have already been computed. That is, if the dynamic-programming tableau is broken into four pieces

    ( I    II )
    ( III  IV )

then the recursive multithreaded code would look something like this:

    spawn I
    sync
    spawn II
    spawn III
    sync
    spawn IV
    sync

Analyze the work, critical-path length, and parallelism of this algorithm. Describe and analyze an algorithm that is asymptotically as efficient (same work) but more parallel. Make whatever interesting observations you can. Write an efficient Cilk program for the problem.

References

[1] Selim G. Akl and Nicola Santoro. Optimal parallel merging and sorting without memory conflicts. IEEE Transactions on Computers, C-36(11), November 1987.

[2] M. Akra and L. Bazzi. On the solution of linear recurrence equations. Computational Optimization and Application, 10:195-210, 1998.

[3] Robert D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.
slide-13
SLIDE 13

[4] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 207-216, Santa Barbara, California, July 1995.

[5] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), pages 356-368, Santa Fe, New Mexico, November 1994.

[6] Richard P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the ACM, 21(2):201-206, April 1974.

[7] Cilk-5.1 (Beta 1) Reference Manual. Available on the Internet from http://theory.lcs.mit.edu/~cilk.

[8] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. MIT Press and McGraw-Hill, 1990.

[9] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 212-223, Montreal, Canada, June 1998.

[10] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416-429, March 1969.

[11] Keith H. Randall. Cilk: Efficient Multithreaded Computing. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1998.