Weld: A Common Runtime for Data Analytics Shoumik Palkar, James - PowerPoint PPT Presentation

Weld: A Common Runtime for Data Analytics
Shoumik Palkar, James Thomas, Anil Shanbhag*, Deepak Narayanan, Malte Schwarzkopf*, Holger Pirk*, Saman Amarasinghe*, Matei Zaharia
Stanford InfoLab, *MIT CSAIL

Motivation
Modern data apps combine many disjoint processing libraries & functions
» Relational, statistics, machine learning, …
» E.g. PyData stack
+ Great results leveraging work of 1000s of authors
– No optimization across these functions

How Bad is This Problem?
Growing gap between memory and processing speeds makes the traditional way of combining functions worse
  data = pandas.parse_csv(string)
  filtered = pandas.dropna(data)
  avg = numpy.mean(filtered)
[Diagram: parse_csv → dropna → mean]
5-30x slowdowns in NumPy, Pandas, TensorFlow, etc.

How We Solve This
[Diagram: libraries (SQL, machine learning, graph algorithms, …) → Common Runtime → hardware (CPU, GPU, …)]

How We Solve This
[Diagram: libraries (SQL, machine learning, graph algorithms, …) → Runtime API → Weld runtime (Weld IR, Optimizer, Backends) → hardware (CPU, GPU, …)]

Runtime API
Uses lazy evaluation to collect work across libraries
  data = lib1.f1()
  lib2.map(data, item => lib3.f2(item))
[Diagram: the user application hands IR fragments for each function (f1, map, f2) to the Weld runtime through the Runtime API; the runtime builds a combined IR program and produces optimized machine code that operates on the data in the application]
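The lazy-evaluation idea on this slide can be illustrated with a small sketch. This is not the Weld Runtime API: the `Lazy` class, `source`, `plan`, and `evaluate` are hypothetical names invented for illustration, and a Python list of recorded functions stands in for IR fragments. It only shows the general pattern of recording work from several calls and running it together on demand.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Lazy:
    """A deferred computation (hypothetical, not part of Weld)."""
    op: str                          # name of the recorded operation
    fn: Callable                     # producer (for a source) or per-item function (for a map)
    parent: Optional["Lazy"] = None  # the node this one depends on

    def map(self, f: Callable) -> "Lazy":
        # Record the work instead of doing it; nothing runs here.
        return Lazy("map", f, self)

    def plan(self) -> list:
        # Walk back to the source to recover the combined program.
        node, ops = self, []
        while node is not None:
            ops.append(node.op)
            node = node.parent
        return ops[::-1]

    def evaluate(self) -> list:
        # Gather every recorded function, then apply them in one pass over the source data.
        node, fns = self, []
        while node.parent is not None:
            fns.append(node.fn)
            node = node.parent
        fns.reverse()
        out = []
        for item in node.fn():  # node is now the source
            for f in fns:
                item = f(item)
            out.append(item)
        return out


def source(producer: Callable[[], list]) -> Lazy:
    """Wrap a data producer as the root of a lazy computation."""
    return Lazy("source", producer)


data = source(lambda: [1, 2, 3])                          # stands in for lib1.f1()
result = data.map(lambda x: x + 1).map(lambda x: x * 10)  # stands in for calls into other libraries
print(result.plan())      # the recorded operations, in order
print(result.evaluate())  # work happens only here
```

Because nothing executes until `evaluate` is called, the component that finally runs the program sees every recorded step at once, which is what makes optimization across library boundaries possible.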

Weld IR
Designed to meet three goals:
1. Library composition: support complete workloads such as nested parallel calls
2. Ability to express optimizations: e.g. loop fusion, vectorization, loop tiling
3. Explicit parallelism

Weld IR
Small, powerful design inspired by “monad comprehensions”
Parallel loops: iterate over a dataset
Builders: declarative objects for producing results
» E.g. append items to a list, compute a sum
» Can be implemented differently on different hardware
Captures relational algebra, functional APIs like Spark, linear algebra, and composition thereof

Examples
Implement functional operators using builders
  def map(data, f):
    builder = new vecbuilder[int]
    for x in data:
      merge(builder, f(x))
    result(builder)
  def reduce(data, zero, func):
    builder = new merger[zero, func]
    for x in data:
      merge(builder, x)
    result(builder)
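The builder pseudocode above can be mirrored in plain Python to make the merge/result protocol concrete. This is a sequential sketch under assumptions: `VecBuilder`, `Merger`, `builder_map`, and `builder_reduce` are hypothetical names, and the classes only imitate the interface shown on the slide, not how Weld implements builders or runs loops in parallel.

```python
from typing import Callable, Generic, Iterable, List, TypeVar

T = TypeVar("T")


class VecBuilder(Generic[T]):
    """Collects merged values into a list (imitates the slide's vecbuilder)."""

    def __init__(self) -> None:
        self._items: List[T] = []

    def merge(self, value: T) -> None:
        self._items.append(value)

    def result(self) -> List[T]:
        return list(self._items)


class Merger(Generic[T]):
    """Combines merged values with a binary function (imitates merger[zero, func])."""

    def __init__(self, zero: T, func: Callable[[T, T], T]) -> None:
        self._acc = zero
        self._func = func

    def merge(self, value: T) -> None:
        self._acc = self._func(self._acc, value)

    def result(self) -> T:
        return self._acc


def builder_map(data: Iterable, f: Callable) -> list:
    # Same shape as the slide's map: merge f(x) for every element.
    builder = VecBuilder()
    for x in data:
        builder.merge(f(x))
    return builder.result()


def builder_reduce(data: Iterable, zero, func: Callable):
    # Same shape as the slide's reduce: merge every element into one value.
    builder = Merger(zero, func)
    for x in data:
        builder.merge(x)
    return builder.result()


print(builder_map([1, 2, 3], lambda x: x * x))
print(builder_reduce([1, 2, 3], 0, lambda a, b: a + b))
```

Both operators share the same loop and differ only in the builder they write to, which is why the loop body is the natural place to combine several operators.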

Example Optimization: Fusion
  squares = map(data, x => x * x)
  sum = reduce(data, 0, +)
becomes
  bld1 = new vecbuilder[int]
  bld2 = new merger[0, +]
  for x in data:
    merge(bld1, x * x)
    merge(bld2, x)
Loops can be merged into one pass over data
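A small Python sketch can show what fusion changes in terms of passes over the data. It assumes a hypothetical `CountingData` wrapper that counts how often the dataset is traversed; a plain list and an integer accumulator stand in for the two builders. It illustrates the transformation by hand and does not show how Weld's optimizer performs it.

```python
class CountingData:
    """Wraps a list and counts how many passes are made over it (hypothetical helper)."""

    def __init__(self, items):
        self._items = list(items)
        self.passes = 0

    def __iter__(self):
        self.passes += 1
        return iter(self._items)


def unfused(data):
    # Two separate operators, each with its own loop over the data.
    squares = [x * x for x in data]  # pass 1: map
    total = 0
    for x in data:                   # pass 2: reduce
        total += x
    return squares, total


def fused(data):
    # One loop writes to both accumulators, as in the fused form on the slide.
    bld1 = []  # stands in for vecbuilder[int]
    bld2 = 0   # stands in for merger[0, +]
    for x in data:
        bld1.append(x * x)
        bld2 += x
    return bld1, bld2


d1 = CountingData([1, 2, 3, 4])
d2 = CountingData([1, 2, 3, 4])
print(unfused(d1), d1.passes)
print(fused(d2), d2.passes)
```

Both versions compute the same results; the fused one reads each element once, which matters when the dataset does not fit in cache.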

Implementation
Prototype with APIs in Scala and Python
» LLVM and Voodoo for code gen
Integrations: TensorFlow, NumPy, Pandas, Spark

Results: Individual Workloads
[Figure: runtime in seconds against number of threads. SQL (TPC-H Q1, Q3, Q6, Q12): HyPer, Weld, hand-optimized (H.o.). PageRank: GraphMat, hand-optimized, Weld. Word2Vec: TF, TF-Op, Weld. TF-Op = C++ operator.]

Results: Existing Frameworks
[Figure: three panels of runtime in seconds, one on a log10 scale. TPC-H (Q1, Q6): SparkSQL, Weld. Vector Sum (1 core, 12 cores): NP, NExpr, Weld. Logistic Regression (LR (1T), LR (12T)): TF, hand-optimized, Weld.]
Integration effort: 500 lines glue, 30 lines/operator

Results: Cross-Library Optimization
[Figure: two panels. Pandas + NumPy, runtime in seconds (log10): Current; Weld, no CLO; Weld, CLO; Weld, 12 core. Spark SQL UDF, runtime in seconds: Scala UDF, Weld. Speedup annotations on the plots: 31x, 290x, 14x.]

Conclusion
The way we compose software will have to change to efficiently use modern hardware
Weld is our first attempt at such a design – lots of open questions!
» Optimization, specialized hardware, domain info, …
Open source: this spring
We’re hiring! (postdocs)