SLIDE 1 Weld: Accelerating Data Science by 100x
Shoumik Palkar, James Thomas, Deepak Narayanan, Pratiksha Thaker, Parimarjan Negi, Rahul Palamuttam, Anil Shanbhag*, Holger Pirk**, Malte Schwarzkopf*, Saman Amarasinghe*, Sam Madden*, Matei Zaharia
Stanford DAWN, *MIT CSAIL, **Imperial College London
www.weld.rs
SLIDE 2
Motivation
Modern data applications combine many disjoint processing libraries & functions
+ Great results leveraging work of 1000s of authors
SLIDE 3
Motivation
Modern data applications combine many disjoint processing libraries & functions
+ Great results leveraging work of 1000s of authors
- No optimization across functions
SLIDE 4 How Bad is This Problem?
Growing gap between memory/processing makes traditional way of combining functions worse
data = pandas.parse_csv(string)
filtered = pandas.dropna(data)
avg = numpy.mean(filtered)
SLIDE 5 How Bad is This Problem?
Growing gap between memory/processing makes traditional way of combining functions worse
data = pandas.parse_csv(string)
filtered = pandas.dropna(data)
avg = numpy.mean(filtered)
SLIDE 6 How Bad is This Problem?
Growing gap between memory/processing makes traditional way of combining functions worse
data = pandas.parse_csv(string)
filtered = pandas.dropna(data)
avg = numpy.mean(filtered)
parse_csv
SLIDE 7 How Bad is This Problem?
Growing gap between memory/processing makes traditional way of combining functions worse
data = pandas.parse_csv(string)
filtered = pandas.dropna(data)
avg = numpy.mean(filtered)
parse_csv dropna
SLIDE 8 How Bad is This Problem?
Growing gap between memory/processing makes traditional way of combining functions worse
data = pandas.parse_csv(string)
filtered = pandas.dropna(data)
avg = numpy.mean(filtered)
parse_csv dropna mean
SLIDE 9 How Bad is This Problem?
Growing gap between memory/processing makes traditional way of combining functions worse
data = pandas.parse_csv(string)
filtered = pandas.dropna(data)
avg = numpy.mean(filtered)
parse_csv dropna mean
Up to 30x slowdowns in NumPy, Pandas, TensorFlow, etc. compared to an optimized C implementation
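To make the cost concrete, here is a minimal pure-Python sketch (not Weld, pandas, or NumPy code) that contrasts the call-by-call pipeline with a single fused pass. The helper names `parse_csv`, `dropna`, `mean`, and `fused_mean` are hypothetical stand-ins, and the input is assumed to be one comma-separated row in which empty fields are missing values.

```python
import math

def parse_csv(text):
    """Pass 1: materialize every field as a float (NaN for empty fields)."""
    return [float(tok) if tok.strip() else math.nan for tok in text.split(",")]

def dropna(values):
    """Pass 2: materialize a second list without the NaNs."""
    return [v for v in values if not math.isnan(v)]

def mean(values):
    """Pass 3: read the filtered list once more to average it."""
    return sum(values) / len(values)

def fused_mean(text):
    """One pass: parse, filter, and accumulate without the intermediate float lists."""
    total, count = 0.0, 0
    for tok in text.split(","):
        if tok.strip():
            total += float(tok)
            count += 1
    return total / count

raw = "1.0,,2.0,3.0,,6.0"
unfused = mean(dropna(parse_csv(raw)))  # three traversals, two temporary lists
fused = fused_mean(raw)                 # one traversal, no temporary float lists
```

Both versions compute the same average; the difference is how many times the data travels through memory, which is the effect the slide attributes to combining functions the traditional way.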
SLIDE 10
Data Science Today
Data scientists “pip install” libraries needed for prototype/get the job done
SLIDE 11
Data Science Today
Data scientists “pip install” libraries needed for prototype/get the job done
Observe performance issues in pipelines composed of fast data science tools
SLIDE 12
Data Science Today
Data scientists “pip install” libraries needed for prototype/get the job done
Observe performance issues in pipelines composed of fast data science tools
Hire engineers to optimize your pipeline, leverage new hardware, etc.
SLIDE 13
Data Science Today
Weld’s vision: bare metal performance for data science out of the box!
Data scientists “pip install” libraries needed for prototype/get the job done
Observe performance issues in pipelines composed of fast data science tools
Hire engineers to optimize your pipeline, leverage new hardware, etc.
SLIDE 14 Weld: An Optimizing Runtime
[Figure: runtime in seconds (log10 axis, 0.1 to 100) for Native (1T), No Fusion (1T), No CLO (1T), Weld (1T), Weld (12T)]
Filter Dataset → Compute a Linear Model → Aggregate Indices
Uses NumPy and Pandas (both backed by C)
SLIDE 15 Weld: An Optimizing Runtime
[Figure: runtime in seconds (log10 axis, 0.1 to 100) for Native (1T), No Fusion (1T), No CLO (1T), Weld (1T), Weld (12T)]
Filter Dataset → Compute a Linear Model → Aggregate Indices
Uses NumPy and Pandas (both backed by C)
Native NumPy and Pandas
SLIDE 16 Weld: An Optimizing Runtime
[Figure: runtime in seconds (log10 axis, 0.1 to 100) for Native (1T), No Fusion (1T), No CLO (1T), Weld (1T), Weld (12T)]
Filter Dataset → Compute a Linear Model → Aggregate Indices
Uses NumPy and Pandas (both backed by C)
~3x speedup from code generation (SIMD instructions + other standard compiler optimizations)
SLIDE 17 Weld: An Optimizing Runtime
[Figure: runtime in seconds (log10 axis, 0.1 to 100) for Native (1T), No Fusion (1T), No CLO (1T), Weld (1T), Weld (12T)]
Filter Dataset → Compute a Linear Model → Aggregate Indices
Uses NumPy and Pandas (both backed by C)
~8x speedup from fusion within each library (eliminates within-library memory movement)
SLIDE 18 Weld: An Optimizing Runtime
[Figure: runtime in seconds (log10 axis, 0.1 to 100) for Native (1T), No Fusion (1T), No CLO (1T), Weld (1T), Weld (12T)]
Filter Dataset → Compute a Linear Model → Aggregate Indices
Uses NumPy and Pandas (both backed by C)
~29x speedup from fusion across libraries (eliminates cross-library memory movement, co-optimizes library calls)
SLIDE 19 Weld: An Optimizing Runtime
[Figure: runtime in seconds (log10 axis, 0.1 to 100) for Native (1T), No Fusion (1T), No CLO (1T), Weld (1T), Weld (12T)]
Filter Dataset → Compute a Linear Model → Aggregate Indices
Uses NumPy and Pandas (both backed by C)
~180x speedup with automatic parallelization (eliminates cross-library memory movement, co-optimizes library calls)
SLIDE 20
Weld Architecture
SLIDE 21
Weld Architecture
machine learning SQL graph algorithms
Common Runtime
…
SLIDE 22 Weld Architecture
machine learning SQL graph algorithms
CPU GPU
…
Common Runtime
…
SLIDE 23 Weld Architecture
machine learning SQL graph algorithms
CPU GPU
… …
Weld IR Backends Runtime API Optimizer Weld runtime
SLIDE 24
Rest of this Talk
Runtime API: How applications “speak” with Weld
Weld IR: How applications express computation
Results
Demo
SLIDE 25 Runtime API
Uses lazy evaluation to collect work across libraries
data = lib1.f1()
lib2.map(data, item => lib3.f2(item))
[Diagram: the user application hands IR fragments for each function (f1, map, f2) to the Weld runtime through the Runtime API; the runtime combines them into one IR program, compiles it to optimized machine code, and runs it on data in the application]
SLIDE 26 Without Weld
import itertools as it
squares = it.map(data, |x| x * x)
sum = sqrt(it.reduce(squares, 0, +))
data
SLIDE 27 Without Weld
import itertools as it
squares = it.map(data, |x| x * x)
sum = sqrt(it.reduce(squares, 0, +))
data squares
SLIDE 28 Without Weld
import itertools as it
squares = it.map(data, |x| x * x)
sum = sqrt(it.reduce(squares, 0, +))
data squares sum
Each call reads/writes memory
SLIDE 29
With Weld
import itertools as it
squares = it.map(data, |x| x * x)
sum = sqrt(it.reduce(squares, 0, +))
map WeldObject
SLIDE 30
With Weld
import itertools as it
squares = it.map(data, |x| x * x)
sum = sqrt(it.reduce(squares, 0, +))
map reduce WeldObject
SLIDE 31
With Weld
import itertools as it
squares = it.map(data, |x| x * x)
sum = sqrt(it.reduce(squares, 0, +))
map reduce WeldObject sqrt
SLIDE 32 With Weld
import itertools as it
squares = it.map(data, |x| x * x)
sum = sqrt(it.reduce(squares, 0, +))
map reduce WeldObject sqrt
Optimized Program: sqrt(reduce(…))
sum
Evaluate the optimized program once
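The lazy-evaluation idea on these slides can be sketched in a few lines of plain Python. This is an illustration only, not Weld's Runtime API: the `LazyExpr` class, the `lazy_*` helpers, and `evaluate` are hypothetical names. Each call records an operation instead of running it, and the whole expression is evaluated in one streaming pass when the result is finally requested.

```python
import math

class LazyExpr:
    """Hypothetical stand-in for a WeldObject: records an operation instead of running it."""

    def __init__(self, op, *args):
        self.op = op
        self.args = args

    def describe(self):
        """Render the recorded expression tree as text."""
        inner = ", ".join(a.describe() for a in self.args if isinstance(a, LazyExpr))
        return f"{self.op}({inner})" if inner else self.op

def lazy_map(src, fn):
    return LazyExpr("map", src, fn)

def lazy_reduce(src, zero, fn):
    return LazyExpr("reduce", src, zero, fn)

def lazy_sqrt(src):
    return LazyExpr("sqrt", src)

def _stream(expr):
    """Yield the elements of a data or map node without building a list."""
    if expr.op == "data":
        yield from expr.args[0]
    elif expr.op == "map":
        src, fn = expr.args
        for x in _stream(src):
            yield fn(x)
    else:
        raise ValueError(f"cannot stream {expr.op}")

def evaluate(expr):
    """Run the whole recorded program once, streaming through map nodes."""
    if expr.op == "sqrt":
        return math.sqrt(evaluate(expr.args[0]))
    if expr.op == "reduce":
        src, zero, fn = expr.args
        acc = zero
        for x in _stream(src):
            acc = fn(acc, x)
        return acc
    raise ValueError(f"cannot evaluate {expr.op}")

data = LazyExpr("data", [3.0, 4.0])
squares = lazy_map(data, lambda x: x * x)                         # nothing runs yet
total = lazy_sqrt(lazy_reduce(squares, 0.0, lambda a, b: a + b))  # still nothing runs
result = evaluate(total)                                          # one pass over data
```

Here `total.describe()` shows the combined program that was collected across the three calls, and `squares` is never materialized as a list.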
SLIDE 33 Weld IR: Expressing Computations
Designed to meet three goals:
1. Support diverse workloads and nested calls
2. Ability to express optimizations, e.g., loop fusion, vectorization, and loop tiling
3. Explicit parallelism and targeting parallel hardware
SLIDE 34
Weld IR: Internals
Small IR* with only two main constructs.
Parallel loops: iterate over a dataset
Builders: declarative objects for producing results
» E.g., append items to a list, compute a sum
» Can be implemented differently on different hardware
SLIDE 35
Weld IR: Internals
Small IR* with only two main constructs.
Parallel loops: iterate over a dataset
Builders: declarative objects for producing results
» E.g., append items to a list, compute a sum
» Can be implemented differently on different hardware
Captures relational algebra, functional APIs like Spark, linear algebra, and composition thereof
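As one illustration of the relational-algebra claim, a filter followed by a grouped sum can be written as a single loop feeding a builder. The sketch below is plain Python, not Weld IR; the `DictMerger` class and the sample rows are assumptions made for this example.

```python
class DictMerger:
    """Hypothetical Python analogue of a builder that combines values per key."""

    def __init__(self, combine):
        self._combine = combine
        self._groups = {}

    def merge(self, key, value):
        """Fold one (key, value) pair into the running per-key result."""
        if key in self._groups:
            self._groups[key] = self._combine(self._groups[key], value)
        else:
            self._groups[key] = value

    def result(self):
        """Return the finished mapping of key to combined value."""
        return dict(self._groups)

# Roughly: SELECT city, SUM(amount) FROM rows WHERE amount > 0 GROUP BY city
rows = [("nyc", 3), ("sf", -1), ("nyc", 4), ("sf", 5)]
builder = DictMerger(lambda a, b: a + b)
for city, amount in rows:      # the loop iterates over the dataset
    if amount > 0:             # the filter is just a condition inside the loop
        builder.merge(city, amount)
totals = builder.result()
```

The loop says what to iterate over and the builder says how results are combined, so the same two constructs cover both the selection and the aggregation.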
SLIDE 36
Examples: Functional Ops
SLIDE 37 Examples: Functional Ops
Functional operators using builders
def map(data, f):
    builder = new appender[i32]
    for x in data:
        merge(builder, f(x))
    result(builder)
SLIDE 38 Examples: Functional Ops
Functional operators using builders
def map(data, f):
    builder = new appender[i32]
    for x in data:
        merge(builder, f(x))
    result(builder)

def reduce(data, zero, func):
    builder = new merger[zero, func]
    for x in data:
        merge(builder, x)
    result(builder)
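The pseudocode above can be mirrored in runnable Python to show the builder pattern end to end. This is a sketch under assumptions: the `Appender` and `Merger` classes and the `weld_style_*` function names are hypothetical Python analogues, not part of Weld.

```python
import operator

class Appender:
    """Builder that collects merged values into a list."""

    def __init__(self):
        self._items = []

    def merge(self, value):
        self._items.append(value)

    def result(self):
        return list(self._items)

class Merger:
    """Builder that folds merged values into one result with a binary function."""

    def __init__(self, zero, func):
        self._acc = zero
        self._func = func

    def merge(self, value):
        self._acc = self._func(self._acc, value)

    def result(self):
        return self._acc

def weld_style_map(data, f):
    builder = Appender()
    for x in data:
        builder.merge(f(x))
    return builder.result()

def weld_style_reduce(data, zero, func):
    builder = Merger(zero, func)
    for x in data:
        builder.merge(x)
    return builder.result()

squares = weld_style_map([1, 2, 3], lambda x: x * x)
total = weld_style_reduce(squares, 0, operator.add)
```

Both operators share the same shape (create a builder, loop, merge, take the result), which is what lets an optimizer reason about them uniformly.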
SLIDE 39 Example Optimizations
squares = map(data, |x| x * x)
sum = reduce(data, 0, +)

bld1 = new appender[i32]
bld2 = new merger[0, +]
for x: simd[i32] in data:
    merge(bld1, x * x)
    merge(bld2, x)
Loops can be merged into one pass over data and vectorized
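A before-and-after sketch of that fusion in plain Python, with a list standing in for the appender and an integer accumulator standing in for the merger. The function names `unfused` and `fused` are made up for this example, and vectorization is not modeled here.

```python
def unfused(data):
    """Two independent loops: each one traverses data from start to end."""
    squares = []
    for x in data:          # pass 1 feeds the appender
        squares.append(x * x)
    total = 0
    for x in data:          # pass 2 feeds the merger
        total += x
    return squares, total

def fused(data):
    """One loop feeding both builders: data is traversed a single time."""
    squares, total = [], 0
    for x in data:
        squares.append(x * x)
        total += x
    return squares, total

before = unfused([1, 2, 3])
after = fused([1, 2, 3])
```

The fusion is legal because neither loop depends on the other's output; both only read `data`.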
SLIDE 40 Other Features
Interactive REPL for debugging Weld programs
Serialization/Deserialization operators for Weld data
Configurable memory limit and thread limit
Trace Mode for tracing execution at runtime to catch bugs
Rich logging for easy debugging
Utilities for generating C bindings to pass data into Weld
C UDF Support for calling arbitrary C functions
Ability to Dump Code for debugging
Syntax Highlighting support for Vim
Type Inference in Weld IR to simplify writing code manually for testing
SLIDE 41
Implementation
SLIDE 42 Implementation
APIs in C and Python (with Java coming soon)
Full LLVM-based CPU backend
- SIMD support
Written in ~30K lines of Rust, LLVM, C++
- Fast, safe native language with no runtime
SLIDE 43 Implementation
Partial Prototypes of Pandas, NumPy, TensorFlow and Apache Spark
APIs in C and Python (with Java coming soon)
Full LLVM-based CPU backend
- SIMD support
Written in ~30K lines of Rust, LLVM, C++
- Fast, safe native language with no runtime
SLIDE 44 Grizzly
A subset of Pandas integrated with Weld
Operators include unique, filter, mask, group_by, pivot_table
Transparent single-core and multi-core speedups
Interoperates with Pandas with same API
SLIDE 45 Grizzly in Action
Adapted from http://pandas.pydata.org/pandas-docs/stable/tutorials.html (chapter 7)
SLIDE 46 Grizzly in Action
import numpy as np
import pandas as pd

# Read dataframe from file
requests = pd.read_csv('filename.csv')

# Fix requests with extra digits
requests['Incident Zip'] = requests['Incident Zip'].str.slice(0, 5)

# Fix requests with 00000 zipcodes
zero_zips = requests['Incident Zip'] == '00000'
requests['Incident Zip'][zero_zips] = np.nan

# Display unique incident zips
print(requests['Incident Zip'].unique())
Adapted from http://pandas.pydata.org/pandas-docs/stable/tutorials.html (chapter 7)
SLIDE 47 Grizzly in Action
import numpy as np
import pandas as pd
import grizzly as gr

# Read dataframe from file
requests = gr.DataFrameWeld(pd.read_csv('filename.csv'))

# Fix requests with extra digits
requests['Incident Zip'] = requests['Incident Zip'].str.slice(0, 5)

# Fix requests with 00000 zipcodes
zero_zips = requests['Incident Zip'] == '00000'
requests['Incident Zip'][zero_zips] = np.nan

# Display unique incident zips
print(requests['Incident Zip'].unique())
Adapted from http://pandas.pydata.org/pandas-docs/stable/tutorials.html (chapter 7)
Pandas for I/O
SLIDE 48
Integration Effort
SLIDE 49 Integration Effort
Small up front cost to enable Weld integration
- 500 LoC for each library we prototyped
SLIDE 50 Integration Effort
Small up front cost to enable Weld integration
- 500 LoC for each library we prototyped
Easy to port over each operator
SLIDE 51 Integration Effort
Small up front cost to enable Weld integration
- 500 LoC for each library we prototyped
Easy to port over each operator
Incrementally Deployable
- Weld-enabled ops work with native ops
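One way such incremental deployment could work is sketched below in plain Python: operators that have been ported stay lazy, and any operator that has not been ported forces evaluation and falls through to the native implementation. The `LazyArray` class is hypothetical and only illustrates the idea; it is not how any Weld-enabled library is actually structured.

```python
class LazyArray:
    """Hypothetical wrapper: ported ops stay lazy, everything else uses the native list."""

    def __init__(self, source, pending=()):
        self._source = source
        self._pending = tuple(pending)

    def map(self, fn):
        """A 'ported' operator: record the function and do no work yet."""
        return LazyArray(self._source, self._pending + (fn,))

    def evaluate(self):
        """Apply all recorded functions in a single pass over the source."""
        out = []
        for x in self._source:
            for fn in self._pending:
                x = fn(x)
            out.append(x)
        return out

    def __getattr__(self, name):
        """An operator that was not ported: materialize, then use the native list method."""
        return getattr(self.evaluate(), name)

arr = LazyArray([3, 1, 2]).map(lambda x: x * 10).map(lambda x: x + 1)
values = arr.evaluate()       # both maps run in one pass
hits = arr.count(11)          # list.count was never ported, so it falls back
position = arr.index(21)      # same fallback path
```

Under this design every additional ported operator extends the lazy region, which matches the slide's point that ported and native operators can coexist.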
SLIDE 52
Weld Accelerates Existing Libraries
SLIDE 53 Weld Accelerates Existing Libraries
[Figure: runtime in seconds (axis 5 to 45) for TPC-H Q1 and TPC-H Q6; series shown include Weld]
TPC-H: 3.5x speedup
SLIDE 54 Weld Accelerates Existing Libraries
[Figure: runtime in seconds (axis 5 to 45) for TPC-H Q1 and TPC-H Q6; series shown include Weld]
TPC-H: 3.5x speedup
[Figure: runtime in seconds (log10 axis, 0.1 to 100) for NumPy, Weld 1T, Weld 12T]
Black Scholes: 4.5x speedup
SLIDE 55 Weld Accelerates Existing Libraries
[Figure: runtime in seconds (axis 5 to 45) for TPC-H Q1 and TPC-H Q6; series shown include Weld]
TPC-H: 3.5x speedup
[Figure: runtime in seconds (log10 axis, 0.1 to 100) for NumPy, Weld 1T, Weld 12T]
Black Scholes: 4.5x speedup
[Figure: runtime in seconds (log10 axis, 1 to 1000) at 1 and 12 threads for TF, TF + XLA, Weld]
Logistic Regression: Competitive with XLA
SLIDE 56
Weld Accelerates Multi-Library Workflows
SLIDE 57 Weld Accelerates Multi-Library Workflows
[Figure: runtime in seconds (log10 axis, 0.1 to 100) for Native (1T), No Fusion (1T), No CLO (1T), Weld (1T), Weld (12T)]
Data cleaning + lin.
NumPy: 180x speedup
SLIDE 58 Weld Accelerates Multi-Library Workflows
[Figure: runtime in seconds (axis 10 to 90) at 1 and 12 threads for TF, TF + XLA, NumPy, Weld]
Image whitening + linear regression with TensorFlow + NumPy: 8.9x speedup
[Figure: runtime in seconds (log10 axis, 0.1 to 100) for Native (1T), No Fusion (1T), No CLO (1T), Weld (1T), Weld (12T)]
Data cleaning + lin.
NumPy: 180x speedup
SLIDE 59 Weld Accelerates Multi-Library Workflows
[Figure: runtime in seconds (axis 10 to 90) at 1 and 12 threads for TF, TF + XLA, NumPy, Weld]
Image whitening + linear regression with TensorFlow + NumPy: 8.9x speedup
[Figure: runtime in seconds (log10 axis, 0.1 to 10000) for Python UDF, Scala UDF, Weld]
Linear model eval. with Spark SQL UDF: 6x speedup
[Figure: runtime in seconds (log10 axis, 0.1 to 100) for Native (1T), No Fusion (1T), No CLO (1T), Weld (1T), Weld (12T)]
Data cleaning + lin.
NumPy: 180x speedup
SLIDE 60 Incremental Integration
[Figure: two panels, each plotting runtime in seconds (axis 2 to 20) against number of operators (1 to 8) for NumPy and Weld]
SLIDE 61 Incremental Integration
[Figure: two panels, each plotting runtime in seconds (axis 2 to 20) against number of operators (1 to 8) for NumPy and Weld]
Implementing more operators
SLIDE 62 Incremental Integration
[Figure: two panels, each plotting runtime in seconds (axis 2 to 20) against number of operators (1 to 8) for NumPy and Weld]
NumPy Black Scholes workload: Incremental benefits with incremental integration.
Implementing more operators
SLIDE 63
Demo.
SLIDE 64 Conclusion
Changing the interface between libraries can speed up data analytics applications by 10-100x on modern hardware
Try out Weld for yourself, or contribute!
https://www.github.com/weld-project
https://www.weld.rs
$ pip install pyweld
$ pip install pygrizzly
$ pip install weldnumpy