
Sol: Fast Distributed Computation Over Slow Networks

  1. Sol: Fast Distributed Computation Over Slow Networks • Fan Lai, Jie You, Xiangfeng Zhu, Harsha V. Madhyastha, Mosharaf Chowdhury

  3. Distributed Data Processing is Ubiquitous • Distributed computation in Local-Area Networks (LAN): to accelerate executions within a single cluster • Computation over Wide-Area Networks (WAN): to reduce data transfers and mitigate privacy risks • [Logos: efforts for computation in LAN] • Efforts for computation over WAN: Tetrium, Iridium, CLARINEt, Azure Cosmos DB, Google Spanner

  6. Execution Engine: Core of Big Data Stack • Workloads: SQL Queries (Select * FROM …;), Stream Processing (WordCount, TopKCount), AI/ML (K-means, SVM), … • Execution Planner → typical job execution plans (Job 1, Job 2)

  11. Execution Engine: Core of Big Data Stack • Workloads: SQL Queries (Select * FROM …;), Stream Processing (WordCount, TopKCount), AI/ML (K-means, SVM), … • Stack layers: Execution Planner; Execution Engine (Coordinator, Worker 1, Worker 2, …, Worker N); Resource Scheduler; Storage System • [Logos: efforts for computation in LAN] • Efforts for computation over WAN: Iridium and CLARINEt (execution planner), Tetrium (resource scheduler), Azure Cosmos DB and Google Spanner (storage system) • While network conditions are diverse in practice, execution engines remain the same

  12. Outline • Today’s Execution Engines • Sol Architecture • Control Plane Design • Data Plane Design • Evaluation

  16. Impact of Networks on Latency-sensitive Jobs • [Figure: CDF across queries of query completion time (s), x-axis 0 to 150 s, for queries from 100 GB TPC benchmarks; four network settings: 10 Gbps with O(1) ms latency, 1 Gbps with O(1) ms, 10 Gbps with O(100) ms, 1 Gbps with O(100) ms; annotated gap: 4.9X] • Problem #1: Slow job execution in high-latency networks

  18. Control Plane Inefficiency Due to High Latency • [Timeline, O(1) ms latency between Coordinator and Worker: Launch(■) → Worker busy → Complete(■) → Launch(■) → Worker busy → Complete(■)] • Problem #1: Slow job execution in high-latency networks

  19. Control Plane Inefficiency Due to High Latency • Late-binding of tasks postpones scheduling • [Timeline, O(100) ms latency between Coordinator and Worker: Launch(■) → Worker busy → Complete(■) → Worker idle while the next Launch(■) is in flight → Worker busy → Complete(■)] • Problem #1: Slow job execution in high-latency networks
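
The cost of late-binding shown in slides 18 and 19 can be approximated with a back-of-the-envelope model. The sketch below is illustrative only: the function names, the 100 ms task duration, and the concrete 1 ms and 100 ms round-trip times are assumptions chosen to match the O(1) ms and O(100) ms labels, not figures from the talk.

```python
def late_binding_makespan(num_tasks: int, task_s: float, rtt_s: float) -> float:
    """Seconds for one worker to run num_tasks back to back under late-binding.

    Assumed model (not from the slides): the coordinator sends the next
    Launch() only after it receives the previous Complete(), so every task
    after the first waits one full round trip while the worker sits idle.
    """
    first_launch = rtt_s / 2  # one-way delay for the first Launch()
    return first_launch + num_tasks * task_s + (num_tasks - 1) * rtt_s


def worker_utilization(num_tasks: int, task_s: float, rtt_s: float) -> float:
    """Fraction of the makespan during which the worker is busy."""
    return num_tasks * task_s / late_binding_makespan(num_tasks, task_s, rtt_s)


if __name__ == "__main__":
    # Hypothetical workload: 10 tasks of 100 ms each.
    for label, rtt in (("O(1) ms", 0.001), ("O(100) ms", 0.100)):
        makespan = late_binding_makespan(10, 0.100, rtt)
        util = worker_utilization(10, 0.100, rtt)
        print(f"{label}: makespan={makespan:.4f}s utilization={util:.1%}")
```

Under these assumed numbers, the round trip is as long as a task at O(100) ms, so the worker is idle for roughly half of the job, while at O(1) ms the overhead is negligible.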

  22. Impact of Networks on Bandwidth-intensive Jobs • [Figure: execution plan of Query25 on the 1 TB TPC benchmark; Stages 1 to 3 connected by data transfers over networks] • [Figure: resource utilization throughout the job; occupied CPUs, CPU utilization, and bandwidth utilization as a percentage of the total (0 to 100%) over time (0 to 250 s) across Stages 1 to 3; annotation: low CPU utilization]

  23. Data Plane Inefficiency Due to Low Bandwidth • Tasks hog CPUs throughout their lifespan • [Figure: execution plan of Query25 on the 1 TB TPC benchmark; Stages 1 to 3 connected by data transfers over networks] • [Figure: resource utilization throughout the job; occupied CPUs, CPU utilization, and bandwidth utilization as a percentage of the total (0 to 100%) over time (0 to 250 s) across Stages 1 to 3] • Problem #2: CPU underutilization in low-bandwidth networks

  24. Outline • Today’s Execution Engines • Sol Architecture • Control Plane Design • Data Plane Design • Evaluation • Problem #1: High latency → idleness of workers • Problem #2: Low b/w → CPU underutilization

  25. Outline • Today’s Execution Engines • Sol Architecture • Control Plane Design • Data Plane Design • Evaluation • Sol: a federated execution engine for diverse network conditions, with faster job execution and higher resource utilization

  29. Sol: A Federated Execution Engine • Central Coordinator: coordinates inter-site executions • Site Manager: coordinates local workers and manages queued tasks • Task Manager: manages worker resources • [Figure: Sol architecture; task arrivals reach the Sol Coordinator, which connects over the WAN (O(100) ms) to Sites 1 to 3; within a site, the Site Manager connects over the LAN to workers, each running a Task Manager]
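
The three roles on slide 29 can be pictured as a small set of data structures. This is a minimal sketch under stated assumptions, not Sol's actual code: every class, method, and field name here is hypothetical, and the dispatch policy (fill workers in order, one CPU per task) is chosen only to keep the example short.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Tuple


@dataclass
class TaskManager:
    """Runs on a worker and manages that worker's resources."""
    worker_id: str
    free_cpus: int

    def try_start(self, task: str) -> bool:
        # Assumption: each task occupies exactly one CPU.
        if self.free_cpus == 0:
            return False
        self.free_cpus -= 1
        return True


@dataclass
class SiteManager:
    """Coordinates local workers and manages the site's queued tasks."""
    site: str
    workers: List[TaskManager]
    queue: Deque[str] = field(default_factory=deque)

    def enqueue(self, task: str) -> None:
        self.queue.append(task)

    def dispatch(self) -> List[Tuple[str, str]]:
        """Place queued tasks on local workers; involves only the LAN."""
        placed = []
        for worker in self.workers:
            while self.queue and worker.try_start(self.queue[0]):
                placed.append((self.queue.popleft(), worker.worker_id))
        return placed


@dataclass
class Coordinator:
    """Central coordinator for inter-site executions."""
    sites: Dict[str, SiteManager]

    def submit(self, task: str, site: str) -> None:
        # Crosses the WAN: the task is queued at the remote site manager.
        self.sites[site].enqueue(task)
```

In this sketch, a task that the coordinator submits sits in the site manager's queue until a local worker has a free CPU, so placement decisions inside a site never need a WAN round trip.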

  30. Outline • Today’s Execution Engines • Sol Architecture • Control Plane Design • Data Plane Design • Evaluation • Problem #1: High latency → idleness of workers • Push tasks proactively to reduce worker idle time

  31. Task Early-binding in Control Plane • [Timeline, existing designs, O(100) ms latency between Coordinator and Worker: Launch(■) → Worker busy → Complete(■) → Worker idle until the next Launch(■) arrives]

  35. Task Early-binding in Control Plane • Coordinator ⟷ Site Manager: inter-site operations are early-binding → guarantee high utilization • [Timeline: Coordinator → Site Manager over O(100) ms: Launch(■ ■); Site Manager → Worker over O(1) ms: Launch(■), Complete(■), Launch(■); worker states labeled Busy and Idle; Complete(■) and the next Launch(■) cross the link between Site Manager and Coordinator]
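
The effect of early-binding between the coordinator and the site manager can be illustrated with a tiny timing simulation. It is a sketch under assumptions that do not come from the slides: a single worker, a fixed task duration, a coordinator that keeps `pushed_ahead` tasks outstanding at the site manager and refills one task per reported completion, and the hypothetical name `simulate_makespan`.

```python
def simulate_makespan(num_tasks: int, task_s: float, wan_rtt_s: float,
                      lan_rtt_s: float, pushed_ahead: int) -> float:
    """Finish time (s) of the last task for one worker behind a site manager.

    pushed_ahead=1 behaves like late-binding across the WAN; larger values
    let the site manager hold queued tasks that it can launch over the LAN.
    """
    if pushed_ahead < 1:
        raise ValueError("pushed_ahead must be at least 1")
    # Times at which tasks arrive in the site manager's queue.
    arrivals = [wan_rtt_s / 2] * min(pushed_ahead, num_tasks)
    prev_finish = None
    finish = 0.0
    for i in range(num_tasks):
        # The task must reach the worker over the LAN (one-way delay).
        ready = arrivals[i] + lan_rtt_s / 2
        if prev_finish is not None:
            # Complete() to the site manager plus Launch() back: one LAN RTT.
            ready = max(ready, prev_finish + lan_rtt_s)
        finish = ready + task_s
        if i + pushed_ahead < num_tasks:
            # Completion is reported upstream and a refill crosses the WAN.
            arrivals.append(finish + lan_rtt_s / 2 + wan_rtt_s)
        prev_finish = finish
    return finish


if __name__ == "__main__":
    # Hypothetical numbers: 10 tasks of 100 ms, 100 ms WAN RTT, 1 ms LAN RTT.
    for k in (1, 2):
        print(k, round(simulate_makespan(10, 0.100, 0.100, 0.001, k), 4))
```

With these assumed numbers, keeping two tasks outstanding is already enough in this model to hide the WAN delay: only the LAN round trip remains between consecutive tasks.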
