Computation Reuse in Analytics Job Service at Microsoft


  1. Computation Reuse in Analytics Job Service at Microsoft Alekh Jindal, Shi Qiao, Hiren Patel, Zhicheng Yin, Jieming Di, Malay Bag, Marc Friedman, Yifung Lin, Konstantinos Karanasos, Sriram Rao Microsoft

  2. Computation Reuse in Analytics Job Service at Microsoft Alekh Jindal, Shi Qiao, Hiren Patel, Zhicheng Yin, Jieming Di, Malay Bag, Marc Friedman, Yifung Lin, Konstantinos Karanasos, Sriram Rao Microsoft

  3. A brief history of Views. Typical Materialized View Assumptions: • Tuning few databases • Relatively static data with some updates • Views materialized a priori and offline • Accurate estimates of utility/cost of view materialization (Timeline figure labels: First VLDB, First SIGMOD, Logical Database Design, Optimizing Queries, Materialized Views, View Maintenance, Incremental, Partial, Dynamic, View Selection, Knowledge Bases, XML, XQuery)

  4. What’s new: Analytics-as-a-Service! Also, Job Service or Serverless Analytics: • Not require users to manage h/w or s/w • Only provide SQL queries over stored data • Service provider takes care of the execution • Users only pay for the processing cost. Typical Materialized View Assumptions: • Tuning few databases • Relatively static data with some updates • Views materialized a priori and offline • Accurate estimates of utility/cost of view materialization. SCOPE Job Service at Microsoft: • ~10^5 machines • ~10^5 analytical jobs • ~10^3 developers across Microsoft • ~EBs data processed per day. Experience from SCOPE Job Service: • Cluster-wide computation overlaps • Recurring jobs with new inputs • Always online with SLA requirements • Cost estimations very challenging

  5. Reassigning Passengers to Planes in Mid-Air (figure: routes Boston -> Paris -> Tokyo, Boston -> Paris, Boston -> Tokyo)

  6. Reassigning Passengers to Planes in Mid-Air (figure: routes Boston -> Paris -> Tokyo, Boston -> Tokyo, Boston -> Paris)

  7. CloudViews Overview • Assumption: Recurring Workloads • Assumption: Exact Subexpression Matches

  8. CloudViews Overview (architecture diagram; components: User Interfaces & Tooling, Recurring Workload, Feedback Loop, View Sel., Phy. Design, Expiry, Metadata Service, Online Materialization, Rewrite queries using Views, Synchronization, Job Coordination)

  9. Recurring Workloads (CloudViews architecture diagram, as on slide 8)

  10. Recurring Workloads • Periodic queries with different inputs and parameters • Structured/unstructured data; custom user code (Figure: recurring instances Q1/Q2, Q1’/Q2’, Q1’’/Q2’’ with signatures sig, sig’, sig’’, shown at hourly intervals (8:00 am, 9:00 am, 10:00 am) and daily intervals (June 5, June 6, June 7, 2018); labels: Analysis, Reuse)

  11. Reuse over Recurring Workloads • Problem: detect/reuse common subexpressions when new data arrives in each recurring interval • Solution: precise/normalized query signatures
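
The slide names precise and normalized signatures without showing how they are computed. The following is a minimal sketch of one way to read that idea, not the CloudViews implementation: the `Node` type, the date-based normalization rule, and the example paths are all assumptions made for illustration. The precise signature hashes a subexpression as written; the normalized signature first masks the part of the input that changes between recurring instances, so the same template matches across days.

```python
import hashlib
import re
from dataclasses import dataclass

# Hypothetical plan node: an operator name, string arguments, and child nodes.
@dataclass(frozen=True)
class Node:
    op: str
    args: tuple = ()
    children: tuple = ()

# Assumption: recurring inputs differ only in an embedded date such as 2018-06-05.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def signature(node: Node, normalized: bool) -> str:
    """Hash an operator subtree bottom-up; optionally mask recurring dates."""
    args = [DATE.sub("<date>", a) if normalized else a for a in node.args]
    child_sigs = [signature(c, normalized) for c in node.children]
    payload = "|".join([node.op, *args, *child_sigs])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

def plan(day: str) -> Node:
    # Illustrative subexpression: filter over a per-day input file.
    scan = Node("Scan", (f"/logs/{day}/clicks.tsv",))
    return Node("Filter", ("country == 'US'",), (scan,))

june5, june6 = plan("2018-06-05"), plan("2018-06-06")
same_template = signature(june5, normalized=True) == signature(june6, normalized=True)
same_instance = signature(june5, normalized=False) == signature(june6, normalized=False)
print(same_template, same_instance)  # True False
```

In this reading, the normalized signature is what lets analysis of an earlier instance apply to a later one, while the precise signature identifies a concrete result that is safe to reuse.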

  12. Metadata Service (CloudViews architecture diagram, as on slide 8)

  13. Metadata Service • Materialized view lookup • Consistent view materialization • Quick view discovery
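
The three responsibilities listed on this slide can be sketched as a small in-memory service. This is a hypothetical stand-in, assuming views are keyed by a signature string and that "consistent view materialization" means at most one job builds a given view at a time; the class and method names are not from the talk.

```python
import threading
from typing import Dict, Optional

class MetadataService:
    """Hypothetical in-memory stand-in for a view metadata store."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._views: Dict[str, str] = {}     # signature -> materialized view path
        self._builders: Dict[str, str] = {}  # signature -> job currently building it

    def lookup(self, sig: str) -> Optional[str]:
        """Materialized view lookup: return the view path, or None if absent."""
        with self._lock:
            return self._views.get(sig)

    def try_acquire(self, sig: str, job_id: str) -> bool:
        """Grant at most one job the right to materialize a signature."""
        with self._lock:
            if sig in self._views or sig in self._builders:
                return False
            self._builders[sig] = job_id
            return True

    def publish(self, sig: str, job_id: str, path: str) -> None:
        """Make a finished view discoverable; only the builder may publish."""
        with self._lock:
            if self._builders.get(sig) != job_id:
                raise PermissionError(f"{job_id} does not hold the build lock for {sig}")
            del self._builders[sig]
            self._views[sig] = path

svc = MetadataService()
first = svc.try_acquire("sig-a", "job-1")    # job-1 wins the build lock
second = svc.try_acquire("sig-a", "job-2")   # job-2 is refused
svc.publish("sig-a", "job-1", "/views/sig-a")
found = svc.lookup("sig-a")
```

A production service would also need lock expiry for failed builders and persistent storage; both are left out of this sketch.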

  14. Query Rewriting / Online Materialization (CloudViews architecture diagram, as on slide 8)

  15. Query Rewriting / Online Materialization • Query Rewriting using Views • Online View Materialization
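
As a rough illustration of how the two steps on this slide could fit together under the exact-match assumption, the sketch below walks a plan top-down. A subtree whose signature already has a view is replaced by a scan of that view; a subtree that was selected but not yet built is wrapped in a materialize operator, so the job that computes it also writes it out. The `Node` type, the string-based `sig` function, and the operator names `ViewScan` and `Materialize` are assumptions for this example.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

@dataclass(frozen=True)
class Node:
    op: str
    arg: str = ""
    children: Tuple["Node", ...] = ()

def sig(node: Node) -> str:
    # Stand-in for a real subexpression signature: a canonical string of the subtree.
    inner = ",".join(sig(c) for c in node.children)
    return f"{node.op}[{node.arg}]({inner})"

def rewrite(node: Node, views: Dict[str, str], selected: Set[str]) -> Node:
    """Top-down rewrite: reuse existing views, mark selected subtrees for materialization."""
    s = sig(node)
    if s in views:
        # Exact match: read the view instead of recomputing the subtree.
        return Node("ViewScan", views[s])
    children = tuple(rewrite(c, views, selected) for c in node.children)
    rebuilt = Node(node.op, node.arg, children)
    if s in selected:
        # Online materialization: compute as usual and write the result as a side effect.
        return Node("Materialize", s, (rebuilt,))
    return rebuilt

scan = Node("Scan", "clicks")
filt = Node("Filter", "country=US", (scan,))
agg = Node("Agg", "count by user", (filt,))

# First job: nothing is materialized yet, but the filter was selected as a view.
first = rewrite(agg, views={}, selected={sig(filt)})
# Later job: the view exists, so the filter subtree becomes a view scan.
second = rewrite(agg, views={sig(filt): "/views/v1"}, selected={sig(filt)})
```

Matching top-down means the largest available view wins when views nest, which is one reasonable policy among several.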

  16. Analyzing Production Workloads • Cluster-wide overlaps: 45% jobs, 65% users, 80% subgraphs • Operator-wise overlaps: up to 1000s of overlaps (chart categories: Shuffle, Sort, Joins, Filters)
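
To make the notion of overlap concrete, here is a small sketch that counts how often each subexpression signature appears across jobs. It assumes each job has already been reduced to a set of signature strings; the metric names and the toy data are illustrative, and the definitions are not necessarily the ones behind the percentages on the slide.

```python
from collections import Counter
from typing import Dict, Set

def overlap_stats(jobs: Dict[str, Set[str]]) -> Dict[str, float]:
    """Summarize how much subexpression overlap exists across jobs."""
    if not jobs:
        raise ValueError("need at least one job")
    # Each job holds a set, so a signature is counted once per job it appears in.
    freq = Counter(s for sigs in jobs.values() for s in sigs)
    shared = {s for s, n in freq.items() if n > 1}
    overlapping_jobs = [j for j, sigs in jobs.items() if sigs & shared]
    return {
        "job_fraction": len(overlapping_jobs) / len(jobs),
        "shared_signature_fraction": len(shared) / len(freq) if freq else 0.0,
        "max_frequency": float(max(freq.values(), default=0)),
    }

jobs = {
    "j1": {"a", "b"},
    "j2": {"a", "c"},
    "j3": {"d"},
    "j4": {"a"},
}
stats = overlap_stats(jobs)
print(stats)
```

Here three of the four jobs share signature `a`, so the job-level overlap is 0.75 even though only one of the four distinct signatures is shared.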

  17. Performance Impact • Workload: 32 queries • Latency (Avg. Speedup: 43%): improvements depend on the critical path; some queries slower due to materialization • Processing time (Avg. Speedup: 36%): additional processing time for read/write; savings in general • Overheads: workload analysis in an hour; ~10ms metadata service lookup; optimization time higher/lower when creating/using views
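
The remark that latency improvements depend on the critical path can be illustrated with a toy stage DAG. In this sketch, latency is the longest dependency chain and processing time is the sum of all stage costs; the stage names and costs are invented for the example and are not measurements from the talk.

```python
from functools import lru_cache
from typing import Dict, List, Tuple

def latency_and_work(cost: Dict[str, float], deps: Dict[str, List[str]]) -> Tuple[float, float]:
    """Return (critical-path length, total processing time) for a stage DAG."""
    @lru_cache(maxsize=None)
    def finish(stage: str) -> float:
        # A stage finishes after its slowest dependency plus its own cost.
        return cost[stage] + max((finish(d) for d in deps.get(stage, [])), default=0.0)
    return max(finish(s) for s in cost), sum(cost.values())

deps = {"join": ["left", "right"], "out": ["join"]}
base = {"left": 10.0, "right": 4.0, "join": 3.0, "out": 1.0}

# Reusing a view for the short branch saves work but not latency.
reuse_right = {**base, "right": 1.0}
# Reusing a view for the long branch shortens the critical path as well.
reuse_left = {**base, "left": 1.0}

print(latency_and_work(base, deps))         # (14.0, 18.0)
print(latency_and_work(reuse_right, deps))  # (14.0, 15.0)
print(latency_and_work(reuse_left, deps))   # (8.0, 9.0)
```

The same view therefore can reduce processing time in every job that uses it while improving latency only in jobs where the replaced subexpression sits on the critical path.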

  18. Lessons Learned • Discovering hidden redundancies, static computations • Important to get the view physical design right in big data systems • Interesting side effects: failure recovery, cost estimates • User expectations: automatic, debuggability, privacy regulations • Even classic database concepts take a lot of time to bake in industry • Challenge: some of the assumptions may not hold • Industrial research is fun! ☺

  19. Thanks! See you at: Poster Session 1, Wednesday 16:00-18:00, Houston 567. Coming up: "Selecting Subexpressions to Materialize at Datacenter Scale", Alekh Jindal, Konstantinos Karanasos, Sriram Rao, Hiren Patel, VLDB 2018/PVLDB, Rio de Janeiro, Brazil

  20. Industry 1, Tue, 11-12:30: Computation Reuse in Analytics Job Service at Microsoft. Alekh Jindal (Microsoft), Shi Qiao (Microsoft), Hiren Patel (Microsoft), Zhicheng Yin (Microsoft), Jieming Di (Microsoft), Malay Bag (Microsoft), Marc Friedman (Microsoft), Yifung Lin (Microsoft), Konstantinos Karanasos (Microsoft), Sriram Rao (Microsoft). Questions: • What do we mean by computation reuse? • What is a “job service”? How is it different from “databases”? • What does a job service look like at Microsoft? • Why is computation reuse challenging in a job service? • What is our solution, key insights, and takeaways? Architecture (diagram). Key Ingredients: ✓ Materialized views over recurring workloads ✓ CloudViews Analyzer ✓ Feedback Loop ✓ View Selection ✓ Physical Design ✓ View Expiry ✓ CloudViews Runtime ✓ Metadata Service ✓ Online Materialization ✓ Query Rewriting ✓ Synchronization ✓ Job Coordination
