Weld: Accelerating Data Science by 100x Shoumik Palkar , James - - PowerPoint PPT Presentation



SLIDE 1

Weld: Accelerating Data Science by 100x

Shoumik Palkar, James Thomas, Deepak Narayanan, Pratiksha Thaker, Parimajan Negi, Rahul Palamuttam, Anil Shanbhag*, Holger Pirk**, Malte Schwarzkopf*, Saman Amarasinghe*, Sam Madden*, Matei Zaharia

Stanford DAWN, *MIT CSAIL, **Imperial College London

www.weld.rs

SLIDES 2–3

Motivation

Modern data applications combine many disjoint processing libraries & functions.

+ Great results leveraging the work of 1000s of authors
– No optimization across functions

SLIDES 4–9

How Bad is This Problem?

The growing gap between memory and processing speed makes the traditional way of combining functions increasingly costly.

data = pandas.parse_csv(string)
filtered = pandas.dropna(data)
avg = numpy.mean(filtered)

Each step (parse_csv, dropna, mean) materializes its full result in memory before the next one begins.

Up to 30x slowdowns in NumPy, Pandas, TensorFlow, etc. compared to an optimized C implementation.

SLIDES 10–13

Data Science Today

1. Data scientists "pip install" the libraries they need to prototype and get the job done
2. They observe performance issues in pipelines composed of individually fast data science tools
3. They hire engineers to optimize the pipeline, leverage new hardware, etc.

Weld's vision: bare-metal performance for data science, out of the box!

SLIDES 14–19

Weld: An Optimizing Runtime

Workload: filter a dataset → compute a linear model → aggregate indices, using NumPy and Pandas (both backed by C).

[Bar chart: runtime in secs, log10 scale, for Native (1T), No Fusion (1T), No CLO (1T), Weld (1T), Weld (12T)]

  • Native NumPy and Pandas: baseline
  • ~3x speedup from code generation (SIMD instructions + other standard compiler optimizations)
  • ~8x speedup from fusion within each library (eliminates within-library memory movement)
  • ~29x speedup from fusion across libraries (eliminates cross-library memory movement, co-optimizes library calls)
  • ~180x speedup with automatic parallelization

SLIDES 20–23

Weld Architecture

Frontends (machine learning, SQL, graph algorithms, …) target a common runtime, which in turn targets multiple hardware backends (CPU, GPU, …).

[Diagram: frontends on top; the Weld runtime (Weld IR, Optimizer, Runtime API, Backends) in the middle; hardware below]

SLIDE 24

Rest of this Talk

  • Runtime API: how applications "speak" with Weld
  • Weld IR: how applications express computation
  • Results
  • Demo

www.weld.rs

SLIDE 25

Runtime API

Uses lazy evaluation to collect work across libraries.

data = lib1.f1()
lib2.map(data, item => lib3.f2(item))

[Diagram: the user application registers IR fragments for each function (f1, map, f2) through the Runtime API; the Weld runtime combines them into a single IR program, compiles it to optimized machine code, and reads data directly from the application.]
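The lazy-evaluation idea can be sketched in plain Python. This is not Weld's actual API: `LazyObject` and its methods are hypothetical names standing in for the WeldObject's role of recording work and evaluating it in one combined pass.

```python
class LazyObject:
    """Hypothetical stand-in for a WeldObject: records operations
    instead of running them, then evaluates the pipeline once."""

    def __init__(self, data):
        self.data = data
        self.maps = []            # recorded element-wise "IR fragments"

    def map(self, f):
        self.maps.append(f)       # nothing executes yet
        return self

    def reduce(self, zero, func):
        # Evaluation point: a real runtime would fuse the recorded
        # fragments into one optimized program. Here we simply apply
        # every recorded map and the reduction in a single pass.
        acc = zero
        for x in self.data:
            for f in self.maps:
                x = f(x)
            acc = func(acc, x)
        return acc

obj = LazyObject([1.0, 2.0, 3.0])
total = obj.map(lambda x: x * x).reduce(0.0, lambda a, b: a + b)
# total == 14.0: one pass over the data, no intermediate array
```

The key design point mirrors the slide: calls only build up a description of the work, so by the time anything executes, the runtime can see (and co-optimize) operations from several libraries at once.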

SLIDES 26–28

Without Weld

import itertools as it
squares = it.map(data, |x| x * x)
sum = sqrt(it.reduce(squares, 0, +))

Each call reads and writes memory: data → squares → sum.
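For contrast, here is the eager pipeline as a plain-Python sketch (not the library code on the slide): every operator materializes its full result before the next one starts, so the data is traversed once per operator.

```python
import math

data = [1.0, 2.0, 3.0, 4.0]

# Pass 1: materialize the full 'squares' intermediate in memory.
squares = [x * x for x in data]

# Pass 2: read 'squares' back from memory to reduce it.
total = 0.0
for s in squares:
    total += s

result = math.sqrt(total)   # sqrt(1 + 4 + 9 + 16) = sqrt(30)
```

With large arrays, each of these passes is bounded by memory bandwidth, which is exactly the cost Weld's fusion removes.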

SLIDES 29–32

With Weld

import itertools as it
squares = it.map(data, |x| x * x)
sum = sqrt(it.reduce(squares, 0, +))

The map, reduce, and sqrt calls are recorded on a WeldObject instead of executing eagerly. The optimized program sqrt(reduce(…)) is then evaluated once to produce sum.

SLIDE 33

Weld IR: Expressing Computations

Designed to meet three goals:

  • 1. Generality: support diverse workloads and nested calls
  • 2. Ability to express optimizations: e.g., loop fusion, vectorization, and loop tiling
  • 3. Explicit parallelism and targeting parallel hardware

SLIDES 34–35

Weld IR: Internals

Small IR with only two main constructs:

Parallel loops: iterate over a dataset
Builders: declarative objects for producing results

» E.g., append items to a list, compute a sum
» Can be implemented differently on different hardware

Captures relational algebra, functional APIs like Spark, linear algebra, and compositions thereof.

SLIDES 36–38

Examples: Functional Ops

Functional operators using builders:

def map(data, f):
    builder = new appender[i32]
    for x in data:
        merge(builder, f(x))
    result(builder)

def reduce(data, zero, func):
    builder = new merger[zero, func]
    for x in data:
        merge(builder, x)
    result(builder)
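The contract the two builders expose (merge values in, extract a result once) can be mimicked in Python. `Appender` and `Merger` below are illustrative sketches, not Weld's implementation, which compiles builders down to hardware-specific code.

```python
class Appender:
    """Sketch of Weld's appender builder: collects merged items into a list."""
    def __init__(self):
        self._items = []
    def merge(self, x):
        self._items.append(x)
    def result(self):
        return self._items

class Merger:
    """Sketch of Weld's merger builder: folds merged items with a binary op."""
    def __init__(self, zero, func):
        self._acc = zero
        self._func = func
    def merge(self, x):
        self._acc = self._func(self._acc, x)
    def result(self):
        return self._acc

def weld_map(data, f):
    builder = Appender()
    for x in data:
        builder.merge(f(x))
    return builder.result()

def weld_reduce(data, zero, func):
    builder = Merger(zero, func)
    for x in data:
        builder.merge(x)
    return builder.result()

print(weld_map([1, 2, 3], lambda x: x * x))           # [1, 4, 9]
print(weld_reduce([1, 2, 3], 0, lambda a, b: a + b))  # 6
```

Because both operators are just "loop + merge into a builder", the optimizer can treat them uniformly, which is what makes the fusion on the next slide mechanical.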

SLIDE 39

Example Optimizations

squares = map(data, |x| x * x)
sum = reduce(data, 0, +)

becomes:

bld1 = new appender[i32]
bld2 = new merger[0, +]
for x: simd[i32] in data:
    merge(bld1, x * x)
    merge(bld2, x)

Loops can be merged into one pass over data and vectorized.
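In plain Python, the effect of this fusion is a single loop feeding both builders. This is a sketch; the variable names mirror the Weld program above, and the real optimization also vectorizes the loop.

```python
def unfused(data):
    # Two separate passes over memory, one per operator.
    squares = [x * x for x in data]
    total = sum(data)
    return squares, total

def fused(data):
    # One pass: both "builders" are merged inside a single loop.
    squares = []   # plays the role of bld1 (appender)
    total = 0      # plays the role of bld2 (merger[0, +])
    for x in data:
        squares.append(x * x)   # merge(bld1, x * x)
        total += x              # merge(bld2, x)
    return squares, total

assert unfused([1, 2, 3]) == fused([1, 2, 3]) == ([1, 4, 9], 6)
```

The two versions compute identical results; the fused one simply touches each element once, halving memory traffic.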

SLIDE 40

Other Features

  • Interactive REPL for debugging Weld programs
  • Serialization/deserialization operators for Weld data
  • Configurable memory and thread limits
  • Trace mode for tracing execution at runtime to catch bugs
  • Rich logging for easy debugging
  • Utilities for generating C bindings to pass data into Weld
  • C UDF support for calling arbitrary C functions
  • Ability to dump generated code for debugging
  • Syntax highlighting support for Vim
  • Type inference in the Weld IR to simplify writing code manually for testing

SLIDES 41–43

Implementation

Partial prototypes of Pandas, NumPy, TensorFlow, and Apache Spark

APIs in C and Python (with Java coming soon)

  • Full LLVM-based CPU backend with SIMD support

Written in ~30K lines of Rust, LLVM, C++

  • Rust: a fast, safe native language with no runtime
slide-44
SLIDE 44

Grizzly

A subset of Pandas integrated with Weld

Operators include unique, filter, mask, group_by,

pivot_table

Transparent single-core and multi-core speedups Interoperates with Pandas with same API

SLIDES 45–47

Grizzly in Action

Original Pandas version:

import pandas as pd
import numpy as np
# Read dataframe from file
requests = pd.read_csv('filename.csv')
# Fix requests with extra digits
requests['Incident Zip'] = requests['Incident Zip'].str.slice(0, 5)
# Fix requests with 00000 zipcodes
zero_zips = requests['Incident Zip'] == '00000'
requests['Incident Zip'][zero_zips] = np.nan
# Display unique incident zips
print requests['Incident Zip'].unique()

Grizzly version (Pandas is still used for I/O):

import pandas as pd
import numpy as np
import grizzly as gr
# Read dataframe from file
requests = gr.DataFrameWeld(pd.read_csv('filename.csv'))
# Fix requests with extra digits
requests['Incident Zip'] = requests['Incident Zip'].str.slice(0, 5)
# Fix requests with 00000 zipcodes
zero_zips = requests['Incident Zip'] == '00000'
requests['Incident Zip'][zero_zips] = np.nan
# Display unique incident zips
print requests['Incident Zip'].unique()

Adapted from http://pandas.pydata.org/pandas-docs/stable/tutorials.html (chapter 7)

SLIDES 48–51

Integration Effort

Small up-front cost to enable Weld integration

  • 500 LoC for each library we prototyped

Easy to port over each operator

  • 30 LoC each

Incrementally deployable

  • Weld-enabled ops work with native ops
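One way such interoperation can work, sketched in Python with hypothetical names (`LazyResult`, `weld_enabled_square`, and `native_sum` are illustrative, not Grizzly's API): a Weld-enabled operator returns a lazy handle, and a native operator forces evaluation whenever it needs concrete data.

```python
class LazyResult:
    """Illustrative lazy handle returned by a Weld-enabled operator."""
    def __init__(self, thunk):
        self._thunk = thunk
    def evaluate(self):
        return self._thunk()

def weld_enabled_square(data):
    # Weld-enabled op: defers work instead of computing immediately,
    # so downstream Weld-enabled ops could be fused with it.
    return LazyResult(lambda: [x * x for x in data])

def native_sum(values):
    # Native op: needs materialized data, so force any pending work.
    if isinstance(values, LazyResult):
        values = values.evaluate()
    return sum(values)

print(native_sum(weld_enabled_square([1, 2, 3])))  # prints 14
```

This is what makes the integration incremental: ops ported to Weld gain fusion among themselves, while unported ops keep working by triggering evaluation at the boundary.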
slide-52
SLIDE 52

Weld Accelerates Existing Libraries

slide-53
SLIDE 53

Weld Accelerates Existing Libraries

5 10 15 20 25 30 35 40 45 TPC-H Q1 TPC-H Q6 Runtime [secs]

  • Unmod. SparkSQL

Weld

TPC-H: 3.5x speedup

slide-54
SLIDE 54

Weld Accelerates Existing Libraries

5 10 15 20 25 30 35 40 45 TPC-H Q1 TPC-H Q6 Runtime [secs]

  • Unmod. SparkSQL

Weld

TPC-H: 3.5x speedup

0.1 1 10 100 Runtime [secs; log10] NumPy Weld 1T Weld 12T

Black Scholes: 4.5x speedup

slide-55
SLIDE 55

Weld Accelerates Existing Libraries

5 10 15 20 25 30 35 40 45 TPC-H Q1 TPC-H Q6 Runtime [secs]

  • Unmod. SparkSQL

Weld

TPC-H: 3.5x speedup

0.1 1 10 100 Runtime [secs; log10] NumPy Weld 1T Weld 12T

Black Scholes: 4.5x speedup

1 10 100 1000 1T 12T Runtime [secs; log10] Number of threads TF TF + XLA Weld

Logistic Regression: Competitive with XLA

SLIDES 56–59

Weld Accelerates Multi-Library Workflows

[Bar chart: runtime in secs, log10, for Native (1T), No Fusion (1T), No CLO (1T), Weld (1T), Weld (12T)]
Data cleaning + linear algebra with Pandas + NumPy: 180x speedup

[Bar chart: runtime in secs at 1T and 12T, TF vs. TF + XLA vs. NumPy vs. Weld]
Image whitening + linear regression with TensorFlow + NumPy: 8.9x speedup

[Bar chart: runtime in secs, log10, Python UDF vs. Scala UDF vs. Weld]
Linear model evaluation with a Spark SQL UDF: 6x speedup

SLIDES 60–62

Incremental Integration

[Line charts: runtime in secs vs. number of operators implemented (1–8), NumPy vs. Weld]

NumPy Black Scholes workload: incremental benefits with incremental integration; speedups grow as more operators are ported to Weld.

SLIDE 63

Demo.

SLIDE 64

Conclusion

Changing the interface between libraries can speed up data analytics applications by 10-100x on modern hardware.

Try out Weld for yourself, or contribute!
https://www.github.com/weld-project
https://www.weld.rs

$ pip install pyweld
$ pip install pygrizzly
$ pip install weldnumpy