Interfaces for Efficient Software Composition on Modern Hardware - PowerPoint PPT Presentation

Interfaces for Efficient Software Composition on Modern Hardware Shoumik Palkar Dissertation Defense April 2, 2020

Software composition: A mainstay for decades!

The result? An ecosystem of libraries + users

Example: ML pipeline in Python

Example: ML pipeline in Python + Users can leverage 1000s of expertly-developed libraries across many different domains - On modern hardware, composition is no longer a “zero-cost” abstraction

Example: the function call interface Used to pass data between functionality via pointers to in-memory values. (1) Pass args void vdLog(float* a, float* out, size_t n) { through stack for (size_t i = 0; i + 8 < n; i += 8) { (2) Load data __m256 v = _mm256_loadu_ps(a + i); from memory Performance gap ... between these is (3) Process growing! _mm256_log2_ps(v, ...); loaded values ...

Example: composition with function calls Growing gap between memory/processing speed makes function call interface worse! // From Black Scholes multiply // all inputs are vectors d1 = price * strike log2 d1 = np.log2(d1) + strike add Data movement is often dominant bottleneck in composing existing functions 7

Hardware Trends are Shifting Bottlenecks CPU 1960-1994 CPU 1995- GPU Ratio of FLOPS to words 100 Memory becomes slower New hardware 80 relative to compute loaded/sec accelerators 60 make this 40 worse! 20 0 1960 1980 2000 2020 Year 1. Kagi et al. 1996. Memory Bandwidth Limitations of Future Microprocessors. ISCA 1996 2. McCalpin. 1995. Memory Bandwidth and Machine Balance in Current High Performance Computers. TCCA 1995.

Do we need a new way to combine software? • Strawman: use a monolithic system - “Legacy" applications: thousands of users of existing APIs - Example: Community of data scientists who use optimized Python libraries • Strawman: always use low-level languages (e.g., C++) or optimize manually - Optimizations [still] require lots of manual work - Example: Manual optimizations in MKL-DNN

Challenges for software composition today Moving data is increasingly expensive Research vision : make software composition a Hardware accelerators complicate performance zero-cost abstraction again! further (e.g., memory management) Devs sacrifice programmability for performance

My Research: new interfaces to compose software on modern hardware Key idea: Use algebraic properties of software APIs in new interfaces to enable new optimizations Examples of algebraic properties: • F() ’s loops can be fused with G() ’s loops • F() ’s args can be split + pipelined with G() • F() is parallelizable after externally splitting its args

My Approach: Three interfaces with new systems to leverage their properties Name Interface/Properties System Weld Focus: Data movement optimization and automatic parallelization over existing library APIs Split annotations Focus: I/O optimization via data loading Raw filtering

Preview: What a new interface can achieve Spark Spark+RFs MKL Weld MKL + SAs 600 30 Runtime (s) Runtime (s) 400 20 200 10 0 0 Disk Q1 Q2 Q3 Q4 16 Threads Black Scholes model with Intel Querying 650GB of Censys JSON MKL: 3-5x speedup with Weld data in Spark: 4x speedup with and SAs raw filtering

Rest of this Talk • Weld • Split annotations • Raw filtering • Impact, open source, and concluding remarks

Weld: A Common CIDR ’17 PVLDB ’18 Runtime for Data Shoumik Palkar , James Thomas, Deepak Narayanan, Pratiksha Thaker, Rahul Analytics Palamuttam, Parimarjan Negi, Anil Shanbhag, Malte Schwarzkopf, Holger Pirk, Saman Amarasinghe, Samuel Madden, Matei Zaharia

Motivation for Weld + Ecosystem of 100s of existing libraries and APIs - Combining these libraries is no longer efficient! Example : Normalizing images in NumPy + classifying them in with log. reg. in TensorFlow: 13x difference compared to an end-to-end optimized implementation Can we enable existing APIs to compose efficiently on modern hardware?

Weld: A Common Runtime for Data Analytics machine graph … SQL learning algorithms Common Runtime … CPU GPU

Weld: A Common Runtime for Data Analytics machine graph … SQL learning algorithms Runtime API Weld Weld IR runtime Optimizer Focus on data Backends movement + parallelization … CPU GPU

Weld’s Runtime API

Runtime API uses lazy evaluation User Application data = lib1.f1() lib2.map(data, Data in item => lib3.f3(item)) application Runtime API Weld Weld f2 managed 11011100111 01011011110 map parallel f1 10010101010 10101000111 runtime Optimized IR fragments Combined Machine IR program for each function IR program code 20

Weld’s IR

Weld IR: Expressing Computations Designed to meet three goals: 1. Generality support diverse workloads and nested calls 2. Ability to express optimizations e.g., loop fusion, vectorization, and loop tiling 3. Explicit parallelism

Weld IR: Internals Small “functional” IR with two main constructs. loops: iterate over a dataset Pa Paralle llel l loop : declarative objects to produce results Bu Builders: • E.g., append items to a list, compute a sum • Different implementations on different hardware • Read after writes: enables mutable state Captures relational algebra, functional APIs like Spark, linear algebra, and composition thereof

Weld’s Loops and Builders Example: Functional Operators Builder that def map(data, f): builder = new appender[T] appends items for x in data: to a list. merge (builder, f(x)) result (builder) def reduce(data, zero, func): builder = new merger[zero, func] Builder that for x in data: aggregates a value. merge (builder, x) result (builder)

Weld’s Optimizer

Optimizer Goal Remove redundancy caused by composing independent libraries and functions. Optimizer CodeGen Runtime API IR Combine IR Rule-Based Adaptive LLVM Fragments Program Optimizer Optimizer Codegen

Removing Redundancy Rule-based optimizations for removing redundancy in generated Weld code. Before: tmp = map (data, |x| x * x) res1 = reduce (tmp, 0, +) // res1 = data.square().sum() res2 = map (data, |x| sqrt(x))// res2 = np.sqrt(data) Each line generated by separate function. • Unnecessary materialization of tmp • Two traversals of data • Vectorization? Output size inference?

Removing Redundancy Rule-based optimizations for removing redundancy in generated Weld code. Before: After: bld1 = new merger[0, +] tmp = map (data, |x| x * x) bld2 = new appender[i32] res1 = reduce (tmp, 0, +) ( len (data)) res2 = map (data, |x| sqrt(x)) for x: simd[i32] in data: merge (bld1, x * x) merge (bld2, sqrt(x))

Removing Redundancy Rule-based optimizations for removing redundancy in generated Weld code. Before: After: bld1 = new merger[0, +] tmp = map (data, |x| x * x) bld2 = new appender[i32] res1 = reduce (tmp, 0, +) ( len (data)) res2 = map (data, |x| sqrt(x)) for x: simd[i32] in data: merge (bld1, x * x) merge (bld2, sqrt(x)) Example: Loop Fusion Rule to Pipeline Loops

Removing Redundancy Rule-based optimizations for removing redundancy in generated Weld code. Before: After: bld1 = new merger[0, +] tmp = map (data, |x| x * x) bld2 = new appender[i32] res1 = reduce (tmp, 0, +) ( len (data)) res2 = map (data, |x| sqrt(x)) for x: simd[i32] in data: merge (bld1, x * x) merge (bld2, sqrt(x)) Example: Vectorization to leverage SIMD in CPUs

Results

Partial Integrations with Several Libraries Libraries: NumPy, Pandas, TensorFlow, Spark SQL Evaluated on 10 data science workloads + microbenchmarks vs. specialized systems 32

Weld Enables Cross-Library Optimization 80 Runtime (seconds) TensorFlow NumPy Weld 60 40 20 0 TF + NumPy Weld TF + NumPy Weld 1T 8T Image whitening + logistic regression classification with NumPy + TensorFlow: 13x speedup

Weld can be integrated incrementally Time spent in NumPy Time spent in Weld Runtime (seconds) 150 100 50 0 0 1 2 3 4 5 6 7 8 # Operators from Black Scholes ported to Weld Benefits with incremental integration.

Weld enables high quality code generation HyPer (SOTA database) C++ baseline Weld 1 Normalized Runtime 0.5 0 Q1 Q3 Q6 Q12 Q14 Q19 SQL: Competitive with state-of-the-art and handwritten baseline (other benchmarks open source!)

Impact of Optimizations: 8 Threads Experiment All -Fuse -Unrl -Pre -Vec -Pred -Grp -ADS -CLO DataClean 1.00 2.44 0.97 0.99 0.98 0.95 CrimeIndex 1.00 195 2.04 1.00 1.02 0.96 3.23 More Less BlackSch 1.00 6.68 1.44 1.95 1.64 Haversine 1.00 3.97 1.20 1.02 Impactful Impactful Nbody 1.00 1.78 2.22 1.01 BirthAn 1.00 1.02 0.97 0.98 1.00 MovieLens 1.00 1.07 1.02 0.98 1.09 LogReg 1.00 20.18 1.00 2.20 NYCFilter 1.00 9.99 1.20 1.23 2.79 FlightDel 1.00 1.27 1.01 0.96 0.96 5.50 1.47 NYC-Sel 1.00 32.43 1.29 0.96 0.93 NYC-NoSel 1.00 6.16 1.02 1.26 1.17 Q1-Few 1.00 2.60 3.75 All optimizations Q1-Many 1.00 1.13 1.12 Q3-Few 1.00 1.86 2.56 enabled. Q3-Many 1.00 1.10 0.97 Q6-Sel 1.00 1.45 1.00 1.00 0.99 0.98 Q6-NoSel 1.00 10.04 0.99 0.99 2.44 2.66

Interfaces for Efficient Software Composition on Modern Hardware - PowerPoint PPT Presentation

Interfaces for Efficient Software Composition on Modern Hardware Shoumik Palkar Dissertation Defense April 2, 2020 Software composition: A mainstay for decades! The result? An ecosystem of libraries + users Example: ML pipeline in Python

MODERN 1 MODERN 2 MODERN 3 MODERN 4 MODERN A peep at some distant orb has power to raise

T Topic 7 i 7 Interfaces and Abstract Interfaces and Abstract Classes Interfaces Interfaces

CSSE 220 Interfaces and Polymorphism Check out Interfaces from SVN Interfaces What, When,

The History of Interaction Batch Interfaces Command-Line Interfaces Graphical User

Virtual xfrm interfaces Steffen Klassert secunet Security Networks AG Dresden Linux IPsec

CISM MODERN PENTATHLON COMMITTEE CISM Modern Pentathlon Committee Composition of the CISM Modern

Framework for Metric Composition + Spatial Composition of Spatial Composition of Metrics Al

Modern Risk Modern Risk Modern Risk Management Modern Risk Management anagement Concepts:

Touch Interfaces Multi-touch displays Input & interaction Mobile design 1 CS349 -- Touch

Anaphe Developer Developer Anaphe Interfaces Interfaces Lorenzo Moneta Lorenzo Moneta CERN

Interfaces Interfaces interface : A list of methods that a class promises to implement.

The History of Interaction Batch Interfaces Command-Line Interfaces Graphical

Interfaces Interfaces n interface : A list of methods that a class promises to implement. q

SVG for Automotive User SVG for Automotive User Interfaces Interfaces S. Boisgrault, Mines

Marion County Waste Composition 2016 February 27, 2018 Peter Spendelow Oregon Department of

Real Composition Algebras Steven Clanton Harriet L. Wilkes Honors College Florida Atlantic

RF Measurements and Documentation for the Industrial Fabrication of SC Cavities. Alexey Sulimov,

Product Formalisms for Measures, Multiplicative Noise, and Geometric Signal Representation Peter

Alex Suciu Northeastern University (joint work with He Wang) Workshop on Braids in Algebra,

Optimal Dispatching of Welding Robots Cornelius Schwarz and Jrg Rambau Lehrstuhl fr

Conduction Welding Conduction joining describes a family of processes in which the laser beam

Wentworth Institute of Technology College of Engineering and Technology COMP4050 Machine

IAB Report IETF 85 Atlanta, Georgia November 7, 2012 Architectural Issues in the News Slide 2

COMMITTAL FOR CONTEMPT: Procedure, Defences, Funding and Costs 14 July 2020 Paper produced by

Interfaces for Efficient Software Composition on Modern Hardware - PowerPoint PPT Presentation

Interfaces for Efficient Software Composition on Modern Hardware Shoumik Palkar Dissertation Defense April 2, 2020 Software composition: A mainstay for decades! The result? An ecosystem of libraries + users Example: ML pipeline in Python

MODERN 1 MODERN 2 MODERN 3 MODERN 4 MODERN A peep at some distant orb has power to raise

T Topic 7 i 7 Interfaces and Abstract Interfaces and Abstract Classes Interfaces Interfaces

CSSE 220 Interfaces and Polymorphism Check out Interfaces from SVN Interfaces What, When,

The History of Interaction Batch Interfaces Command-Line Interfaces Graphical User

Virtual xfrm interfaces Steffen Klassert secunet Security Networks AG Dresden Linux IPsec

CISM MODERN PENTATHLON COMMITTEE CISM Modern Pentathlon Committee Composition of the CISM Modern

Framework for Metric Composition + Spatial Composition of Spatial Composition of Metrics Al

Modern Risk Modern Risk Modern Risk Management Modern Risk Management anagement Concepts:

Touch Interfaces Multi-touch displays Input &amp; interaction Mobile design 1 CS349 -- Touch

Anaphe Developer Developer Anaphe Interfaces Interfaces Lorenzo Moneta Lorenzo Moneta CERN

Interfaces Interfaces interface : A list of methods that a class promises to implement.

The History of Interaction Batch Interfaces Command-Line Interfaces Graphical

Interfaces Interfaces n interface : A list of methods that a class promises to implement. q

SVG for Automotive User SVG for Automotive User Interfaces Interfaces S. Boisgrault, Mines

Marion County Waste Composition 2016 February 27, 2018 Peter Spendelow Oregon Department of

Real Composition Algebras Steven Clanton Harriet L. Wilkes Honors College Florida Atlantic

RF Measurements and Documentation for the Industrial Fabrication of SC Cavities. Alexey Sulimov,

Product Formalisms for Measures, Multiplicative Noise, and Geometric Signal Representation Peter

Alex Suciu Northeastern University (joint work with He Wang) Workshop on Braids in Algebra,

Optimal Dispatching of Welding Robots Cornelius Schwarz and Jrg Rambau Lehrstuhl fr

Conduction Welding Conduction joining describes a family of processes in which the laser beam

Wentworth Institute of Technology College of Engineering and Technology COMP4050 Machine

IAB Report IETF 85 Atlanta, Georgia November 7, 2012 Architectural Issues in the News Slide 2

COMMITTAL FOR CONTEMPT: Procedure, Defences, Funding and Costs 14 July 2020 Paper produced by

Touch Interfaces Multi-touch displays Input & interaction Mobile design 1 CS349 -- Touch