

  1. Interfaces for Efficient Software Composition on Modern Hardware Shoumik Palkar Dissertation Defense April 2, 2020

  2. Software composition: A mainstay for decades!

  3. The result? An ecosystem of libraries + users

  4. Example: ML pipeline in Python

  5. Example: ML pipeline in Python
     + Users can leverage 1000s of expertly-developed libraries across many different domains
     - On modern hardware, composition is no longer a “zero-cost” abstraction

  6. Example: the function call interface
     Used to pass data between functions via pointers to in-memory values.

         void vdLog(float* a, float* out, size_t n) {   // (1) Pass args through stack
           for (size_t i = 0; i + 8 < n; i += 8) {
             __m256 v = _mm256_loadu_ps(a + i);          // (2) Load data from memory
             ...
             _mm256_log2_ps(v, ...);                     // (3) Process loaded values
             ...

     The performance gap between (2) and (3) is growing!

  7. Example: composition with function calls
     The growing gap between memory and processing speed makes the function call interface worse!

         // From Black Scholes; all inputs are vectors
         d1 = price * strike        # multiply
         d1 = np.log2(d1) + strike  # log2, add

     Data movement is often the dominant bottleneck in composing existing functions.
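A small NumPy sketch of this effect (the input vectors here are hypothetical stand-ins for the Black Scholes data): each composed library call materializes a full temporary array in memory, while a hand-fused loop touches each element once.

```python
import numpy as np

price = np.arange(1.0, 5.0)   # assumed example inputs
strike = np.arange(2.0, 6.0)

# Composed library calls: each operator allocates and writes a full
# temporary array before the next operator reads it back from memory.
d1 = price * strike            # temporary 1
d1 = np.log2(d1) + strike      # temporaries 2 and 3

# A fused loop computes the same result one element at a time,
# keeping intermediate values in registers instead of memory.
fused = np.empty_like(price)
for i in range(len(price)):
    fused[i] = np.log2(price[i] * strike[i]) + strike[i]

assert np.allclose(d1, fused)
```

The results match; the difference is purely in how much data moves through memory, which is exactly the cost the slide highlights.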

  8. Hardware Trends are Shifting Bottlenecks
     [Chart: ratio of FLOPS to words loaded/sec, 1960-2020, for CPUs (1960-1994), CPUs (1995-), and GPUs.]
     Memory becomes slower relative to compute; new hardware accelerators make this worse!
     1. Kagi et al. 1996. Memory Bandwidth Limitations of Future Microprocessors. ISCA 1996.
     2. McCalpin. 1995. Memory Bandwidth and Machine Balance in Current High Performance Computers. TCCA 1995.

  9. Do we need a new way to combine software?
     • Strawman: use a monolithic system
       - “Legacy” applications: thousands of users of existing APIs
       - Example: community of data scientists who use optimized Python libraries
     • Strawman: always use low-level languages (e.g., C++) or optimize manually
       - Optimizations [still] require lots of manual work
       - Example: manual optimizations in MKL-DNN

  10. Challenges for software composition today
      • Moving data is increasingly expensive
      • Hardware accelerators complicate performance further (e.g., memory management)
      • Devs sacrifice programmability for performance
      Research vision: make software composition a zero-cost abstraction again!

  11. My Research: new interfaces to compose software on modern hardware
      Key idea: use algebraic properties of software APIs in new interfaces to enable new optimizations.
      Examples of algebraic properties:
      • F()’s loops can be fused with G()’s loops
      • F()’s args can be split + pipelined with G()
      • F() is parallelizable after externally splitting its args
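The last property can be sketched concretely: if a black-box function is element-wise, a runtime can split its arguments externally and run the pieces in parallel without changing the function. The function `f` and the data below are hypothetical illustrations, not APIs from the talk.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def f(chunk):
    # Stand-in for an existing black-box, element-wise library API.
    return chunk * 2.0

data = np.arange(8.0)

# Property: f is parallelizable after externally splitting its args.
chunks = np.array_split(data, 4)
with ThreadPoolExecutor() as pool:
    parts = list(pool.map(f, chunks))
out = np.concatenate(parts)

# Splitting + recombining produces the same result as one big call.
assert np.array_equal(out, f(data))
```

A system that knows this property about an API can apply the split automatically, which is the idea behind the interfaces that follow.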

  12. My Approach: Three interfaces with new systems to leverage their properties
      Name / Focus:
      • Weld: data movement optimization and automatic parallelization over existing library APIs
      • Split annotations
      • Raw filtering: I/O optimization via data loading

  13. Preview: What a new interface can achieve
      [Charts: Black Scholes model with Intel MKL (16 threads), runtime (s) for MKL, Weld, and MKL + SAs — 3-5x speedup with Weld and SAs; querying 650GB of Censys JSON data in Spark (Q1-Q4), runtime (s) for Spark vs. Spark+RFs — 4x speedup with raw filtering.]

  14. Rest of this Talk • Weld • Split annotations • Raw filtering • Impact, open source, and concluding remarks

  15. Weld: A Common Runtime for Data Analytics (CIDR ’17, PVLDB ’18)
      Shoumik Palkar, James Thomas, Deepak Narayanan, Pratiksha Thaker, Rahul Palamuttam, Parimarjan Negi, Anil Shanbhag, Malte Schwarzkopf, Holger Pirk, Saman Amarasinghe, Samuel Madden, Matei Zaharia

  16. Motivation for Weld
      + Ecosystem of 100s of existing libraries and APIs
      - Combining these libraries is no longer efficient!
      Example: normalizing images in NumPy + classifying them with logistic regression in TensorFlow: 13x difference compared to an end-to-end optimized implementation.
      Can we enable existing APIs to compose efficiently on modern hardware?

  17. Weld: A Common Runtime for Data Analytics
      [Diagram: SQL, machine learning, graph algorithms, … → Common Runtime → CPU, GPU, …]

  18. Weld: A Common Runtime for Data Analytics
      [Diagram: SQL, machine learning, graph algorithms, … → Runtime API → Weld IR + Optimizer (the Weld runtime; focus on data movement + parallelization) → Backends → CPU, GPU, …]

  19. Weld’s Runtime API

  20. Runtime API uses lazy evaluation
      User application:

          data = lib1.f1()
          lib2.map(data, item => lib3.f3(item))

      Data stays in the application; the Weld-managed parallel runtime executes the work.
      [Diagram: IR fragments for each function → combined IR program → optimized IR program → machine code.]
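The lazy-evaluation pattern behind this API can be sketched in plain Python (all classes and functions below are hypothetical stand-ins, not the actual Weld API): library calls return unevaluated expression nodes, so the runtime can see and combine the whole program before computing anything.

```python
class Lazy:
    """A deferred computation: a function plus its (possibly lazy) inputs."""
    def __init__(self, fn, *deps):
        self.fn, self.deps = fn, deps

    def evaluate(self):
        # Recursively force dependencies, then apply this node's function.
        args = [d.evaluate() if isinstance(d, Lazy) else d for d in self.deps]
        return self.fn(*args)

def lib_square(x):
    # Stand-in for a library function (e.g., lib1.f1) that returns a
    # lazy node instead of computing immediately.
    return Lazy(lambda v: [e * e for e in v], x)

def lib_sum(x):
    return Lazy(lambda v: sum(v), x)

expr = lib_sum(lib_square([1, 2, 3]))   # nothing computed yet
print(expr.evaluate())                  # forces evaluation: prints 14
```

Because `expr` is just a graph until `evaluate()` is called, an optimizer could rewrite it (e.g., fuse the square and sum into one pass) before any data is touched, which is what Weld's runtime does with its IR fragments.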

  21. Weld’s IR

  22. Weld IR: Expressing Computations
      Designed to meet three goals:
      1. Generality: support diverse workloads and nested calls
      2. Ability to express optimizations: e.g., loop fusion, vectorization, and loop tiling
      3. Explicit parallelism

  23. Weld IR: Internals
      Small “functional” IR with two main constructs:
      • Parallel loops: iterate over a dataset
      • Builders: declarative objects to produce results
        - E.g., append items to a list, compute a sum
        - Different implementations on different hardware
        - Read after writes: enables mutable state
      Captures relational algebra, functional APIs like Spark, linear algebra, and compositions thereof.

  24. Weld’s Loops and Builders
      Example: Functional Operators

          def map(data, f):
              builder = new appender[T]       # builder that appends items to a list
              for x in data:
                  merge(builder, f(x))
              result(builder)

          def reduce(data, zero, func):
              builder = new merger[zero, func]  # builder that aggregates a value
              for x in data:
                  merge(builder, x)
              result(builder)
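The builder semantics above can be mimicked in plain Python to make them concrete (the `Appender`/`Merger` classes are assumed stand-ins for Weld's builders, not its implementation): `merge` adds a value to a builder and `result` finalizes it.

```python
class Appender:
    """Builder that appends merged items to a list."""
    def __init__(self):
        self.items = []
    def merge(self, x):
        self.items.append(x)
    def result(self):
        return self.items

class Merger:
    """Builder that folds merged items into one value."""
    def __init__(self, zero, func):
        self.acc, self.func = zero, func
    def merge(self, x):
        self.acc = self.func(self.acc, x)
    def result(self):
        return self.acc

def map_op(data, f):
    b = Appender()
    for x in data:
        b.merge(f(x))
    return b.result()

def reduce_op(data, zero, func):
    b = Merger(zero, func)
    for x in data:
        b.merge(x)
    return b.result()

print(map_op([1, 2, 3], lambda x: x * x))           # [1, 4, 9]
print(reduce_op([1, 2, 3], 0, lambda a, b: a + b))  # 6
```

The point of the abstraction is that the loop only ever calls `merge`, so a backend is free to give each builder a different implementation (e.g., per-thread partial lists or sums) on different hardware.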

  25. Weld’s Optimizer

  26. Optimizer Goal
      Remove redundancy caused by composing independent libraries and functions.
      [Pipeline: Runtime API → combine IR fragments → IR program → rule-based optimizer → adaptive optimizer → LLVM codegen.]

  27. Removing Redundancy
      Rule-based optimizations for removing redundancy in generated Weld code.
      Before:

          tmp = map(data, |x| x * x)
          res1 = reduce(tmp, 0, +)       // res1 = data.square().sum()
          res2 = map(data, |x| sqrt(x))  // res2 = np.sqrt(data)

      Each line generated by a separate function:
      • Unnecessary materialization of tmp
      • Two traversals of data
      • Vectorization? Output size inference?

  28. Removing Redundancy
      Rule-based optimizations for removing redundancy in generated Weld code.
      Before:

          tmp = map(data, |x| x * x)
          res1 = reduce(tmp, 0, +)
          res2 = map(data, |x| sqrt(x))

      After:

          bld1 = new merger[0, +]
          bld2 = new appender[i32](len(data))
          for x: simd[i32] in data:
              merge(bld1, x * x)
              merge(bld2, sqrt(x))

  29. Removing Redundancy
      Example: loop fusion rule to pipeline loops (same before/after code as the previous slide).

  30. Removing Redundancy
      Example: vectorization to leverage SIMD in CPUs (same before/after code as the previous slide).
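The fusion transformation in these slides can be sketched in plain Python (scalar, without the SIMD part; the data values are assumed): the two independent traversals and the materialized temporary collapse into one loop that feeds both builders.

```python
import math

data = [1.0, 4.0, 9.0]  # assumed example input

# Before: two traversals of data and a materialized temporary.
tmp = [x * x for x in data]           # tmp materialized in memory
res1 = sum(tmp)                       # reduce(tmp, 0, +)
res2 = [math.sqrt(x) for x in data]   # map(data, sqrt)

# After: one fused traversal feeding both builders
# (a running sum plays the merger; a list plays the appender).
bld1, bld2 = 0.0, []
for x in data:
    bld1 += x * x
    bld2.append(math.sqrt(x))

assert bld1 == res1 and bld2 == res2
```

The fused version reads `data` once and never allocates `tmp`, which is precisely the redundancy the rule-based optimizer removes.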

  31. Results

  32. Partial Integrations with Several Libraries
      Libraries: NumPy, Pandas, TensorFlow, Spark SQL.
      Evaluated on 10 data science workloads + microbenchmarks vs. specialized systems.

  33. Weld Enables Cross-Library Optimization
      [Chart: runtime (seconds) for TF + NumPy vs. Weld at 1 and 8 threads.]
      Image whitening + logistic regression classification with NumPy + TensorFlow: 13x speedup.

  34. Weld can be integrated incrementally
      [Chart: runtime (seconds), split between time spent in NumPy and time spent in Weld, as 0-8 operators from Black Scholes are ported to Weld.]
      Benefits with incremental integration.

  35. Weld enables high quality code generation
      [Chart: normalized runtime of HyPer (SOTA database), a C++ baseline, and Weld on queries Q1, Q3, Q6, Q12, Q14, Q19.]
      SQL: competitive with state-of-the-art and handwritten baselines (other benchmarks open source!).

  36. Impact of Optimizations: 8 Threads
      Runtimes relative to all optimizations enabled (1.00); each remaining column disables one optimization, ordered more impactful (left) to less impactful (right). Not every optimization applies to every workload, so rows have different numbers of entries.
      Experiment: All -Fuse -Unrl -Pre -Vec -Pred -Grp -ADS -CLO
      DataClean 1.00 2.44 0.97 0.99 0.98 0.95
      CrimeIndex 1.00 195 2.04 1.00 1.02 0.96 3.23
      BlackSch 1.00 6.68 1.44 1.95 1.64
      Haversine 1.00 3.97 1.20 1.02
      Nbody 1.00 1.78 2.22 1.01
      BirthAn 1.00 1.02 0.97 0.98 1.00
      MovieLens 1.00 1.07 1.02 0.98 1.09
      LogReg 1.00 20.18 1.00 2.20
      NYCFilter 1.00 9.99 1.20 1.23 2.79
      FlightDel 1.00 1.27 1.01 0.96 0.96 5.50 1.47
      NYC-Sel 1.00 32.43 1.29 0.96 0.93
      NYC-NoSel 1.00 6.16 1.02 1.26 1.17
      Q1-Few 1.00 2.60 3.75
      Q1-Many 1.00 1.13 1.12
      Q3-Few 1.00 1.86 2.56
      Q3-Many 1.00 1.10 0.97
      Q6-Sel 1.00 1.45 1.00 1.00 0.99 0.98
      Q6-NoSel 1.00 10.04 0.99 0.99 2.44 2.66
