Weld: A Common Runtime for Data Analytics



slide-1
SLIDE 1

Weld: A Common Runtime for Data Analytics

Shoumik Palkar, James Thomas, Anil Shanbhag*, Deepak Narayanan, Malte Schwarzkopf*, Holger Pirk*, Saman Amarasinghe*, Matei Zaharia

Stanford InfoLab, *MIT CSAIL

slide-2
SLIDE 2

Motivation

Modern data apps combine many disjoint processing libraries & functions

» Relational, statistics, machine learning, …
» E.g. PyData stack

+ Great results leveraging work of 1000s of authors
– No optimization across these functions

slide-3
SLIDE 3

How Bad is This Problem?

The growing gap between memory and processing speeds makes the traditional way of combining functions worse

data = pandas.parse_csv(string)
filtered = pandas.dropna(data)
avg = numpy.mean(filtered)

[Diagram: parse_csv → dropna → mean]

5-30x slowdowns in NumPy, Pandas, TensorFlow, etc.

slide-4
SLIDE 4

How We Solve This

[Diagram: machine learning, SQL, and graph algorithm libraries; a Common Runtime; CPU and GPU hardware]

slide-5
SLIDE 5

How We Solve This

[Diagram: machine learning, SQL, graph algorithms, … libraries; the Weld runtime, consisting of the Runtime API, Weld IR, Optimizer, and Backends; CPU, GPU, … hardware]

slide-6
SLIDE 6

Runtime API

Uses lazy evaluation to collect work across libraries

data = lib1.f1()
lib2.map(data, item => lib3.f2(item))

[Diagram: User Application and Weld Runtime. The functions f1, map, and f2 each submit IR fragments through the Runtime API; the runtime builds a combined IR program and compiles it to optimized machine code, which operates on the data in the application]
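The lazy-evaluation idea on this slide can be illustrated with a minimal Python sketch. This is not the Weld Runtime API: the `LazyValue` class, its `map` and `evaluate` methods, and the toy `f1` and `f2` functions are hypothetical names introduced only for illustration. The sketch shows library calls recording work rather than running it, so that all recorded steps can later run together in one pass.

```python
class LazyValue:
    """Hypothetical stand-in for a lazily evaluated result (not the Weld API)."""

    def __init__(self, source, ops=()):
        self.source = source    # data owned by the application
        self.ops = tuple(ops)   # recorded per-element functions, in call order

    def map(self, func):
        # Record the operation instead of executing it.
        return LazyValue(self.source, self.ops + (func,))

    def evaluate(self):
        # Execute every recorded operation in a single pass over the data.
        out = []
        for item in self.source:
            for func in self.ops:
                item = func(item)
            out.append(item)
        return out


calls = []  # records when f2 actually runs


def f1():
    # Toy "library 1" function: returns lazily wrapped data.
    return LazyValue([1, 2, 3])


def f2(item):
    # Toy "library 3" function applied per element.
    calls.append(item)
    return item * 10


pending = f1().map(lambda item: item + 1).map(f2)
calls_before_evaluate = list(calls)  # still empty: no work has been done
result = pending.evaluate()          # all recorded work runs here
```

In this sketch the work is only triggered by `evaluate()`; a real runtime would hand the collected fragments to an optimizer and code generator instead of interpreting them directly.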

slide-7
SLIDE 7

Weld IR

Designed to meet three goals:

  1. Library composition: support complete workloads such as nested parallel calls
  2. Ability to express optimizations: e.g. loop fusion, vectorization, loop tiling
  3. Explicit parallelism

slide-8
SLIDE 8

Weld IR

Small, powerful design inspired by “monad comprehensions”

Parallel loops: iterate over a dataset

Builders: declarative objects for producing results

» E.g. append items to a list, compute a sum
» Can be implemented differently on different hardware

Captures relational algebra, functional APIs like Spark, linear algebra, and composition thereof

slide-9
SLIDE 9

Examples

Implement functional operators using builders

def map(data, f):
    builder = new vecbuilder[int]
    for x in data:
        merge(builder, f(x))
    result(builder)

def reduce(data, zero, func):
    builder = new merger[zero, func]
    for x in data:
        merge(builder, x)
    result(builder)
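The builder pseudocode above can be approximated in plain Python. This is only a sketch under stated assumptions: `VecBuilder`, `Merger`, `weld_map`, and `weld_reduce` are hypothetical names mirroring the slide's `vecbuilder`, `merger`, `map`, and `reduce`, and the loops run sequentially, whereas the slides describe parallel loops.

```python
import operator


class VecBuilder:
    """Collects merged values into a list (mirrors the slide's vecbuilder)."""

    def __init__(self):
        self._items = []

    def merge(self, value):
        self._items.append(value)

    def result(self):
        return list(self._items)


class Merger:
    """Folds merged values with a binary function (mirrors the slide's merger)."""

    def __init__(self, zero, func):
        self._acc = zero
        self._func = func

    def merge(self, value):
        self._acc = self._func(self._acc, value)

    def result(self):
        return self._acc


def weld_map(data, f):
    # map: merge f(x) into a vector builder for every element.
    builder = VecBuilder()
    for x in data:
        builder.merge(f(x))
    return builder.result()


def weld_reduce(data, zero, func):
    # reduce: merge every element into a merger seeded with `zero`.
    builder = Merger(zero, func)
    for x in data:
        builder.merge(x)
    return builder.result()


squares = weld_map([1, 2, 3], lambda x: x * x)
total = weld_reduce([1, 2, 3, 4], 0, operator.add)
```

Because callers only use `merge` and `result`, the builder's internals could be swapped out (for example, per-thread partial results) without changing the operators written on top of it.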

slide-10
SLIDE 10

Example Optimization: Fusion

squares = map(data, x => x * x)
sum = reduce(data, 0, +)

bld1 = new vecbuilder[int]
bld2 = new merger[0, +]
for x in data:
    merge(bld1, x * x)
    merge(bld2, x)

Loops can be merged into one pass over data
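The effect of fusion can be sketched in plain Python by counting passes over the input. The `CountingData` helper and the `unfused` and `fused` functions are hypothetical names used only for this illustration; they are not part of Weld.

```python
class CountingData:
    """Iterable that counts how many times it is traversed."""

    def __init__(self, items):
        self.items = list(items)
        self.passes = 0

    def __iter__(self):
        self.passes += 1
        return iter(self.items)


def unfused(data):
    # Two separate operators, each making its own pass over the data.
    squares = [x * x for x in data]
    total = sum(data)
    return squares, total


def fused(data):
    # One loop feeding both results, as in the fused code on the slide.
    squares = []
    total = 0
    for x in data:
        squares.append(x * x)
        total += x
    return squares, total


a = CountingData([1, 2, 3, 4])
b = CountingData([1, 2, 3, 4])
unfused_result = unfused(a)
fused_result = fused(b)
```

Both versions compute the same values; the fused version reads the input once instead of twice, which matters most when the data does not fit in cache.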

slide-11
SLIDE 11

Implementation

Prototype with APIs in Scala and Python

» LLVM and Voodoo for code gen

Integrations: TensorFlow, NumPy, Pandas, Spark

slide-12
SLIDE 12

Results: Individual Workloads

[Figure: SQL (TPC-H Q1, Q3, Q6, Q12): runtime [secs] vs. number of threads (1, 4, 12) for HyPer, H.o. (hand-optimized), and Weld]

[Figure: PageRank: runtime [secs] vs. number of threads (1, 2, 4, 8, 12) for GraphMat, Hand-opt, and Weld]

[Figure: Word2Vec: runtime [secs] for TF, TF-Op, and Weld; TF-Op = C++ operator]

slide-13
SLIDE 13

Results: Existing Frameworks

[Figure: TPC-H: runtime [secs] for TPC-H Q1 and TPC-H Q6, SparkSQL vs. Weld]

[Figure: Vector Sum: runtime [secs] for NP, NExpr, and Weld]

[Figure: Logistic Regression: runtime [secs; log10] for LR (1T) and LR (12T), i.e. 1 core and 12 cores, for TF, Hand-opt, and Weld]

Integration effort: 500 lines glue, 30 lines/operator

slide-14
SLIDE 14

Results: Cross-Library Optimization

[Figure: Pandas + NumPy: runtime (sec, log10) for Current; Weld, no CLO; Weld, CLO; and Weld, 12 core. Annotated speedups: 290x, 31x]

[Figure: Spark SQL UDF: runtime (sec) for Scala UDF vs. Weld. Annotated speedup: 14x]

slide-15
SLIDE 15

Conclusion

The way we compose software will have to change to efficiently use modern hardware

Weld is our first attempt at such a design – lots of open questions!

» Optimization, specialized hardware, domain info, …

Open source: this spring

We’re hiring! (postdocs)