CS 744: SPLIT ANNOTATIONS Shivaram Venkataraman Fall 2020



SLIDE 1

CS 744: SPLIT ANNOTATIONS

Shivaram Venkataraman Fall 2020

Welcome!

SLIDE 2

ADMINISTRIVIA

Course Project Check-ins: due tomorrow!
In-class project presentations: Dec 8th and Dec 10th
Sign-up sheet on Piazza

Check-ins go on HotCRP
5 min slot: a 4 min presentation + Q&A
Upload slides

SLIDE 3

NEW HARDWARE AND DATA MODELS

Cloud computing
Compose and maintain
Efficiency

SLIDE 4

SETTING

Multi-core machines
Multiple functions and libraries

// inputs are double arrays with `len` elems
vdLog1p(len, d1, d1);          // d1 = log(1 + d1)
vdAdd(len, d1, tmp, d1);       // d1 = d1 + tmp
vdDiv(len, d1, vol_sqrt, d1);  // d1 = d1 / vol_sqrt

Intel MKL: options pricing workload
Data movement is expensive even within a machine
Arrays / data are larger than the CPU cache
↳ streaming reads & writes to DRAM
TVM: scope ↳ optimizes across all operators / layers of a DNN
Spark ↳ caches data if it fits in memory
SLIDE 5

COMPILER-BASED APPROACHES

Replace every library call to emit intermediate representation (IR)
Compile all the IR together
Lots of code change required!

Loop fusion, pipelining
Existing rich libraries: NumPy, Pandas
→ we want to be here!
SLIDE 6

GOALS

Provide data movement optimizations across libraries
Require minimal or no changes to existing libraries
Leverage existing hand-tuned code for speedups

Should not be very intrusive
Hand-tuned code, e.g., matrix multiply, FFT

SLIDE 7

APPROACH

d1 = price * strike
d1 = np.log2(d1) + strike

(1) Build execution graph
(2) Pass cache-sized splits to every function
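
The two steps can be sketched in plain Python. This only illustrates the idea and is not Mozart's implementation: `multiply` and `log2_add` are hypothetical stand-ins for black-box library functions (the slide uses NumPy), and `piece` stands in for a cache-sized split.

```python
import math


def multiply(a, b):
    """Stand-in for a black-box library call (elementwise a * b)."""
    return [x * y for x, y in zip(a, b)]


def log2_add(a, b):
    """Stand-in for a black-box library call (elementwise log2(a) + b)."""
    return [math.log2(x) + y for x, y in zip(a, b)]


def run_pipelined(price, strike, piece):
    """Pass matching pieces of the inputs through both calls, then merge."""
    out = []
    for start in range(0, len(price), piece):
        end = start + piece
        p, s = price[start:end], strike[start:end]  # split
        d1 = multiply(p, s)                         # d1 = price * strike
        d1 = log2_add(d1, s)                        # d1 = log2(d1) + strike
        out.extend(d1)                              # merge
    return out


price = [1.0, 2.0, 4.0, 8.0, 16.0]
strike = [2.0, 2.0, 2.0, 2.0, 2.0]
whole = log2_add(multiply(price, strike), strike)   # unsplit reference
assert run_pipelined(price, strike, piece=2) == whole
```

The library functions themselves are unchanged; only the size of the data handed to each call differs.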

SLIDE 8

SPLIT ANNOTATIONS

@splittable(size: SizeSplit(size), a: ArraySplit(size), mut out: ArraySplit(size))
void vdLog1p(long size, double *a, double *out)

Split types: N⟨V0...Vn⟩
e.g., ArraySplit⟨10, 2⟩ for a 10-element array split into 2 pieces
Split annotation: assigns a name and a split type to each argument and the return value

Given a library, annotations are easier to provide than changing code
↳ fewer data types than operators
If the split types match (T = T'), you can pipeline these functions
e.g., vdScale(long size, int scalar, double *a)
Output is split in the same fashion as the input ("mut out")
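
A minimal sketch of the pipelining rule from the notes above, assuming a split type is just a name plus parameter values and that two functions can be pipelined over a value when the split types are equal. `SplitType`, `array_split` and `can_pipeline` are hypothetical names for illustration, not Mozart's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitType:
    """A split type N<V0...Vn>: a name plus its parameter values."""
    name: str
    params: tuple


def array_split(length, pieces):
    """Build ArraySplit<length, pieces>."""
    return SplitType("ArraySplit", (length, pieces))


def can_pipeline(producer_out, consumer_in):
    """Assumed rule: pipelining is safe only when the split types are equal."""
    return producer_out == consumer_in


a = array_split(10, 2)  # ArraySplit<10, 2>: 10-element array, 2 pieces
b = array_split(10, 2)
c = array_split(10, 5)  # same array length, split differently
```

Here `can_pipeline(a, b)` holds while `can_pipeline(a, c)` does not, so a value with split type `c` would have to be merged before being passed on.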
SLIDE 9

IMPLEMENTING SPLIT API

@splittable(m: MatrixSplit(m, axis), axis: _) -> ReduceSplit(axis)
vector sumReduceToVector(matrix m, int axis);

ArraySplit: split(double *a, start, end, parameters)
If data shares the same split type → you can safely pipeline (e.g., log, multiply)
If you cannot pipeline → merge prior results, then call the next function
E.g., merge operation implemented inside the ReduceSplit class to combine partial outputs
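
A sketch of how a split type might supply split and merge for the sumReduceToVector annotation. The method names and signatures are assumptions for illustration rather than the actual split API, the matrix is a plain list of rows, and only a reduction over rows (taken here to be axis 0) is handled.

```python
def sum_reduce_to_vector(m, axis=0):
    """Stand-in library function: column sums of a list-of-rows matrix."""
    assert axis == 0, "this sketch only reduces over rows"
    return [sum(col) for col in zip(*m)]


class MatrixSplit:
    """Splits a matrix into ranges of rows."""

    def split(self, m, start, end):
        return m[start:end]


class ReduceSplit:
    """Combines partial reduction outputs by elementwise addition."""

    def merge(self, partials):
        return [sum(vals) for vals in zip(*partials)]


def run_split(m, rows_per_piece):
    """Split the input, call the unmodified function per piece, merge the outputs."""
    splitter, merger = MatrixSplit(), ReduceSplit()
    partials = []
    for start in range(0, len(m), rows_per_piece):
        piece = splitter.split(m, start, start + rows_per_piece)
        partials.append(sum_reduce_to_vector(piece, axis=0))
    return merger.merge(partials)


m = [[1, 2], [3, 4], [5, 6], [7, 8]]
assert run_split(m, 2) == sum_reduce_to_vector(m)
```

The partial outputs are not pieces of the final vector, which is why the return value needs its own split type with a merge that adds them up.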
SLIDE 10

MOZART DESIGN

→ Capture this execution graph
→ Lazily evaluate this graph: maximum opportunity to pipeline

SLIDE 11

PYTHON CLIENT LIBRARY

Writing Annotations: Function decorators

@sa((DataFrameSplit(), DataFrameSplit()), {}, DataFrameSplit())
def divide(series, value):

Capturing the graph
Wraps original Python function and registers in graph
Returns a Future object

Evaluation Points
Lazily evaluate by overriding __getattribute__

divide already exists in the Pandas library
If somebody calls "divide", the call can be intercepted by the decorator (cf. Ray, PyWren)
Graph is constructed internally; the call returns a Future[DataFrame]
E.g., print on a Future: internally do the eval and call print on the result
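
A minimal sketch of the decorator-plus-Future mechanism described above, not Mozart's actual client library. The names `GRAPH`, `Future._force` and the string split types are made up for illustration, and a list-based `divide` stands in for the Pandas function. Python looks up `__str__` on the type, bypassing an instance's `__getattribute__`, so the sketch defines `__str__` explicitly to make `print` an evaluation point.

```python
import functools

GRAPH = []  # registered calls, in order: (function name, arg splits, return split)


class Future:
    """Placeholder for a result that is computed on first use."""

    _OWN = {"_func", "_args", "_value", "_done", "_force"}

    def __init__(self, func, args):
        self._func, self._args = func, args
        self._value, self._done = None, False

    def _force(self):
        if not self._done:
            args = [a._force() if isinstance(a, Future) else a for a in self._args]
            self._value, self._done = self._func(*args), True
        return self._value

    def __getattribute__(self, name):
        # Own fields and dunder lookups are served directly; any other
        # attribute access is an evaluation point.
        if name in Future._OWN or name.startswith("__"):
            return object.__getattribute__(self, name)
        return getattr(self._force(), name)

    def __str__(self):
        return str(self._force())


def sa(arg_splits, kwarg_splits, ret_split):
    """Decorator: register the call in the graph and return a Future."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args):
            GRAPH.append((func.__name__, arg_splits, ret_split))
            return Future(func, args)
        return wrapper
    return decorator


CALLS = []  # records when the wrapped function really runs


@sa(("ListSplit", "Broadcast"), {}, "ListSplit")
def divide(values, d):
    CALLS.append("divide")
    return [v / d for v in values]


g = divide(divide([2.0, 4.0], 2.0), 2.0)  # nothing has run yet
print(g)                                  # evaluation point: both calls run here
```

A real runtime would plan and pipeline the captured graph at the evaluation point instead of calling the functions one after another as `_force` does here.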

SLIDE 12

MOZART RUNTIME

Take dataflow graph → execution plan
Series of stages; each stage: split, pipeline and merge

Choosing a batch size
Set number of elements per batch using L2 cache size

split → pipeline → merge
Compute the number of elements that will fit in L2 cache
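
One way the batch size could be computed, as a sketch. The slide only says the L2 cache size is used, so the formula below (one batch of every live value in the stage must fit in L2) is an assumption, and the cache size and function names are hypothetical.

```python
def batch_elements(l2_cache_bytes, bytes_per_element_per_value):
    """Elements per batch so that one batch of every live value fits in L2."""
    per_element = sum(bytes_per_element_per_value)
    return max(1, l2_cache_bytes // per_element)


def batches(n_elems, batch):
    """(start, end) ranges that cover n_elems in steps of `batch`."""
    return [(s, min(s + batch, n_elems)) for s in range(0, n_elems, batch)]


L2_BYTES = 256 * 1024                        # hypothetical 256 KiB L2 cache
batch = batch_elements(L2_BYTES, [8, 8, 8])  # three double arrays live in the stage
ranges = batches(1_000_000, batch)           # pieces handed to split/pipeline/merge
```

Each `(start, end)` range would then be passed through every function of the stage before the next range is touched.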
SLIDE 13

SUMMARY

Applications compose data processing libraries
Data movement is bottleneck on multi-core machines
Key idea: Split and pipeline data across functions
Split Annotations to reduce programmer effort
Mozart: Client library and runtime for lazy evaluation

Iterative workload
↳ will add stages to graph
↳ pipeline across iterations?

SLIDE 14

DISCUSSION

https://forms.gle/F2LJ21qFkBGWyypB7

SLIDE 15

How does the dataflow graph that is executed by Mozart compare to dataflow graphs we have seen in other systems like Spark/PyTorch etc.?

Similarities
Lazy execution
Narrow dependencies = pipelined by Mozart

Differences
Fault tolerance is not the objective → no checkpointing
Functions are black boxes: can't pick the optimal join operator
Merging vs. shuffling

SLIDE 16

Having more threads can increase memory bandwidth use
Compute-intensive (expensive) functions, e.g., exp ⇒ memory is not the bottleneck
How much speedup with more threads, for add, mul, exp?

SLIDE 17

NEXT STEPS

Next class: TPU
Project check-ins on HotCRP!