CS 744: SPLIT ANNOTATIONS
Shivaram Venkataraman Fall 2020
ADMINISTRIVIA
- Course project check-ins due tomorrow on HotCRP!
- In-class project presentations Dec 8th and Dec 10th; sign-up sheet on Piazza
- Each slot is 5 min: a 4 min presentation + 1 min for questions; upload slides beforehand
MOTIVATION
- As in the cloud computing systems we have seen this semester, applications compose multiple functions and libraries
- Goal: compose libraries while maintaining efficiency on multi-core machines
// inputs are double arrays with `len` elems
vdLog1p(len, d1, d1);          // d1 = log(1 + d1)
vdAdd(len, d1, tmp, d1);       // d1 = d1 + tmp
vdDiv(len, d1, vol_sqrt, d1);  // d1 = d1 / vol_sqrt
Example: an option pricing workload built on Intel MKL. MKL optimizes each individual function, but data movement across the calls is expensive, even within a single machine.
When arrays are larger than the layers of the CPU cache, each call streams its reads and writes all the way to DRAM, even though the data fits in memory. Compilers such as TVM can fuse these loops, but require rewriting the computation.
One approach: replace every library call with one that emits an intermediate representation (IR), then compile all the IR together. This enables loop fusion and pipelining, but requires lots of code change to existing, rich libraries such as NumPy and Pandas. We want the optimization benefits without being that intrusive.
GOALS
- Provide data movement optimizations across libraries
- Require minimal or no changes to existing libraries (i.e., not very intrusive)
- Leverage existing hand-tuned code (e.g., matrix multiply, FFT) for speedups
d1 = price * strike
d1 = np.log2(d1) + strike
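Run as written, each NumPy line above makes a full pass over the arrays. A manual sketch of what split annotations automate (the function name and batch size here are illustrative, not part of the system): process cache-sized batches so the intermediate `d1` stays in cache between the two operations.

```python
import numpy as np

def pipelined(price, strike, batch=4096):
    # Manual version of what split annotations automate: process
    # cache-sized batches so d1 stays in cache between the two ops.
    out = np.empty_like(price)
    for i in range(0, len(price), batch):
        d1 = price[i:i + batch] * strike[i:i + batch]
        out[i:i + batch] = np.log2(d1) + strike[i:i + batch]
    return out
```

The result is identical to the unbatched version; only the order of memory traffic changes.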
KEY IDEA
- Build an execution graph of library calls
- Push cache-sized splits of the data through every function in one pass: split, pipeline, merge
@splittable(size: SizeSplit(size), a: ArraySplit(size), mut out: ArraySplit(size))
void vdLog1p(long size, double *a, double *out)
Split types: N⟨V0...Vn⟩, e.g., ArraySplit⟨10, 2⟩ for a 10-element array in 2 pieces. A split annotation gives a name and split type to each argument and return value.
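A split type can be sketched in plain Python (a hypothetical analogue, not the system's actual API): it knows how to cut a value into contiguous pieces.

```python
class ArraySplit:
    """Hypothetical Python analogue of an ArraySplit<n, pieces> split type."""
    def __init__(self, n, pieces):
        self.n, self.pieces = n, pieces

    def split(self, arr):
        # Yield `pieces` contiguous slices covering all n elements.
        step = (self.n + self.pieces - 1) // self.pieces
        for start in range(0, self.n, step):
            yield arr[start:start + step]
```

For example, `ArraySplit(10, 2)` yields two 5-element slices of a 10-element array.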
Annotations are easier to provide for a given library than changing its code: a library has far fewer data types than functions. For example, vdScale(long size, int scalar, double *a) is annotated with a: ArraySplit(size), so `a` is split in the same fashion as in vdLog1p and the two functions can be pipelined. An expert who knows the library writes the annotation once.
@splittable(m: MatrixSplit(m, axis), axis: _)
vector sumReduceToVector(matrix m, int axis);
Pipelining rules:
- If two functions' data shares the same split type, they can safely be pipelined; the splitter produces pieces from start/end parameters (e.g., split(double *a, int start, int end, ...))
- If a function cannot be pipelined, merge the prior partial results before calling the next function
Example: log and multiply can be pipelined, but a reduction like sum ends the pipeline; a merger class (e.g., implemented inside a ReduceSplit-style split type) combines the partial results. The runtime lazily evaluates the execution graph to maximize pipelining.
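The merge step for a reduction can be sketched as follows (class and names are illustrative, not the paper's actual implementation): each batch produces a partial result, and one merge at the stage boundary combines them.

```python
class ReduceMerger:
    # Hypothetical merger: combines per-batch partial results
    # of a reduction into the final value.
    def __init__(self, op, identity):
        self.op, self.identity = op, identity

    def merge(self, partials):
        out = self.identity
        for p in partials:
            out = self.op(out, p)
        return out

# Per-batch partial sums, then one merge at the stage boundary:
partials = [sum(batch) for batch in ([1, 2], [3, 4], [5])]
total = ReduceMerger(lambda a, b: a + b, 0).merge(partials)
```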
Writing Annotations: Function decorators
@sa((DataFrameSplit(), DataFrameSplit()), {}, DataFrameSplit())
def divide(series, value):
Capturing the graph
- Wraps the original Python function and registers it in the graph
- Returns a Future object
Evaluation points
- Lazily evaluate by overriding __getattribute__
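The lazy-evaluation mechanism can be sketched in plain Python (a simplified, hypothetical stand-in for Mozart's client library, not its actual code): attribute access on the future is the evaluation point.

```python
class Future:
    """Minimal lazy future; attribute access triggers evaluation."""
    def __init__(self, fn, *args):
        object.__setattr__(self, "_state",
                           {"fn": fn, "args": args, "done": False, "val": None})

    def _force(self):
        st = object.__getattribute__(self, "_state")
        if not st["done"]:
            # Force any argument futures first, then run the wrapped function.
            args = [a._force() if isinstance(a, Future) else a
                    for a in st["args"]]
            st["val"] = st["fn"](*args)
            st["done"] = True
        return st["val"]

    def __getattribute__(self, name):
        if name in ("_force", "_state"):
            return object.__getattribute__(self, name)
        # Any other attribute access is an evaluation point:
        # force the graph, then delegate to the real value.
        return getattr(object.__getattribute__(self, "_force")(), name)
```

Chained futures force their arguments recursively, so a whole graph of registered calls evaluates on the first attribute access.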
The divide function already exists in the Pandas library; if somebody calls it, the call is intercepted by the decorator (similar in spirit to Ray or PyWren). A Future[DataFrame] is constructed internally. At an evaluation point such as print(df), the runtime internally executes the graph and returns the result.
EXECUTION
- Turn the dataflow graph into an execution plan: a series of stages, where each stage splits, pipelines, and merges
- Choosing a batch size: set the number of elements per batch using the L2 cache size, i.e., compute how many elements will fit in the L2 cache
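The batch-size calculation is simple arithmetic (numbers below are illustrative, not measured from the paper):

```python
def batch_size(l2_bytes, n_arrays, elem_bytes=8):
    # Number of elements per batch so that all live arrays of a
    # stage fit in the L2 cache simultaneously.
    return l2_bytes // (n_arrays * elem_bytes)

# e.g., a 1 MiB L2 cache with 3 double arrays live in the stage
elems = batch_size(1 << 20, 3)
```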
SUMMARY
- Applications compose data processing libraries
- Data movement is the bottleneck on multi-core machines
- Key idea: split and pipeline data across functions
- Split annotations reduce programmer effort
- Mozart: client library and runtime for lazy evaluation
Discussion: for an iterative workload, each iteration will add stages to the graph. Can we pipeline across iterations?
https://forms.gle/F2LJ21qFkBGWyypB7
How does the dataflow graph that is executed by Mozart compare to dataflow graphs we have seen in other systems like Spark/PyTorch?
Similarities
- Lazy execution
- Narrow dependencies are pipelined
Differences
- Fault tolerance is not the goal in Mozart: no checkpointing
- Functions are black boxes to Mozart
- Merging (Mozart) vs. shuffling (Spark)
,
→
increase men bandwidth
" two?e'
"
comp
. expensiveMhienednhfthreads
, mid 7
speedier
exp
→
n Ix
I
more
threads
can
compute intensive
leed
E
mem
functions
⇒ not how
bottleneck
much
speed up
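This scaling observation can be made concrete with a back-of-the-envelope model (all numbers illustrative): per-element time is the maximum of compute time and memory time, and only the compute term shrinks with more threads.

```python
def speedup(threads, flops_per_elem, flops_per_core=1e9, bw_elems=1e9):
    # Time per element is the max of compute time and memory time;
    # memory time does not shrink with more threads.
    t1 = max(flops_per_elem / flops_per_core, 1.0 / bw_elems)
    tn = max(flops_per_elem / (threads * flops_per_core), 1.0 / bw_elems)
    return t1 / tn
```

With these made-up constants, a compute-heavy function (100 flops/element) scales to 8x on 8 threads, while a function doing 1 flop/element is already bandwidth-bound and gains nothing.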
Next class: TPUs
Project check-ins on HotCRP!