
Towards An Application Objective-Aware Network Interface

Sangeetha Abdu Jyothi, Sayed Hadi Hashemi, Roy Campbell, Brighten Godfrey. HotCloud’20.


  1. Towards An Application Objective-Aware Network Interface
     Sangeetha Abdu Jyothi, Sayed Hadi Hashemi, Roy Campbell, Brighten Godfrey. HotCloud’20

  2-4. Evolution of Application Network Interface (ANI)
     ANI abstractions and their metrics, layered over the network fabric:
     • Packet: delay, jitter
     • Flow: flow completion time
     • Coflow: coflow completion time

  5. What is the ultimate goal of an ANI?
     Translating application requirements to actionable network requirements.
     Are current ANIs sufficient?

  6. Understanding an Application’s Objective
     • Applications have complex interdependencies between computation and communication
     • Prioritizing flows based on computations in the succeeding stage is critical
     • Current abstractions fail to capture the application objective effectively
     [Figure: nodes A, B and C with flows f1, f2 and computations c1, c2, c3; network and compute timelines (in seconds) for a coflow-optimized and a performance-optimized schedule.]
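
The point of slide 6 can be illustrated with a toy model. This is only a sketch: the flow sizes, compute times, and the single-link, single-compute-unit setup below are hypothetical and are not taken from the slide's figure. It shows two schedules with the same coflow completion time but different application finish times.

```python
# Toy model (all numbers hypothetical): two flows share one link, and each
# computation can start only after its input flow has fully arrived.

def app_finish_time(flow_done, stages):
    """Return when the last computation ends.

    flow_done: dict mapping flow name -> time the flow finishes.
    stages: list of (name, input_flow, compute_time), executed in order
            on a single compute unit.
    """
    clock = 0.0
    for _name, flow, duration in stages:
        # A stage waits for both the previous stage and its input flow.
        clock = max(clock, flow_done[flow]) + duration
    return clock

# c1 consumes f1 and takes 1.0 s; c2 consumes f2 and takes 0.5 s.
stages = [("c1", "f1", 1.0), ("c2", "f2", 0.5)]

# Each flow needs 1.0 s of the full link rate.
fair_share = {"f1": 2.0, "f2": 2.0}  # link split equally: both finish at 2.0 s
f1_first = {"f1": 1.0, "f2": 2.0}    # f1 gets the whole link first

for label, schedule in [("fair share", fair_share), ("f1 first", f1_first)]:
    cct = max(schedule.values())
    print(label, "CCT:", cct, "app finish:", app_finish_time(schedule, stages))
```

Both schedules complete the coflow at 2.0 s, so a coflow-level metric cannot tell them apart, yet sending f1 first lets c1 overlap with the transfer of f2.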

  7. An Example Application: Distributed Deep Learning
     • Gigabytes of data are transferred in each iteration, which lasts milliseconds (e.g., VGG-16 sends ~1GB of data every 200ms)
     • Parameters are consumed in a particular order
     • Sending parameter updates from the PS to the workers in the best order can accelerate training
     [Figure: a parameter server (PS) and three workers. Sample TensorFlow Model: One Iteration, with Input Data, ops op1 to op4 reading parameters (Read A to Read D) and ops op1’ to op4’ updating them (Update A to Update D).]
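
As a back-of-the-envelope check of the VGG-16 figure above, the average bandwidth it implies can be computed directly. The sketch assumes decimal gigabytes; the slide does not say whether the ~1GB is per worker or aggregate, so the result is only an order-of-magnitude estimate.

```python
# Average bandwidth implied by "~1GB every 200ms" (figures from the slide).
bytes_per_iteration = 1e9   # ~1 GB, assuming decimal gigabytes
iteration_seconds = 0.200   # one iteration every 200 ms

bytes_per_second = bytes_per_iteration / iteration_seconds
gbits_per_second = bytes_per_second * 8 / 1e9

print(f"{bytes_per_second / 1e9:.1f} GB/s, about {gbits_per_second:.0f} Gbit/s")
```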

  8. Other Applications
     • User-facing partition-aggregation workloads (remote dependency resolution at a Web proxy)
     • Graph processing systems
     • Iterative analytics with deadlines (e.g., Naiad)
     • and so on …
     [Figure: a client sending requests Req 1 to Req n through a proxy; a Gather, Scatter, Update cycle.]

  9-13. Towards A Novel Application Network Interface
     • Computation is completely represented by a DAG. What is the network equivalent?
     • The goal is to capture an application’s network objective
     • CadentFlow:
       • A set of flows with metrics AND
       • An application-level objective
       • Metrics may be priority, deadline, weight, etc.
     $CF = \{(f_1, T_1), (f_2, T_2), \ldots, (f_n, T_n), \Gamma\}$ where $T_i = (t_{i1}, m_{i1}), (t_{i2}, m_{i2}), \ldots$
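
One way to picture the CadentFlow tuple is as a small data structure. The sketch below is illustrative only: the class and field names are hypothetical, and it assumes that each pair (t_ij, m_ij) is a metric type and its value and that Γ is the application-level objective, which the slide suggests but does not spell out.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FlowSpec:
    """One (f_i, T_i) entry: a flow and its list of metrics."""
    flow_id: str
    # T_i, read here as (metric type, metric value) pairs (an assumption).
    metrics: List[Tuple[str, float]]

@dataclass
class CadentFlow:
    """CF = {(f_1, T_1), ..., (f_n, T_n), Γ}."""
    flows: List[FlowSpec]
    objective: str  # Γ, kept as a plain description in this sketch

# Priority-based example using the parameter reads from the deck's
# TensorFlow figure (Read A: p1, Read B and Read C: p2, Read D: p3).
cf = CadentFlow(
    flows=[
        FlowSpec("Read A", [("priority", 1.0)]),
        FlowSpec("Read B", [("priority", 2.0)]),
        FlowSpec("Read C", [("priority", 2.0)]),
        FlowSpec("Read D", [("priority", 3.0)]),
    ],
    objective="minimize completion time subject to priorities",
)

for spec in cf.flows:
    print(spec.flow_id, spec.metrics)
```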

  14. Defining CCT flexibility ratio
     • When computation is the bottleneck, CadentFlow with deadlines provides flexibility for delaying some flows without affecting application performance
     • In the example, the best Coflow Completion Time (CCT) is 1s, but up to 1.5s is tolerable without any impact
     • $\text{CCT flexibility ratio} = \dfrac{\text{max tolerable CCT}}{\text{min CCT}}$
     [Figure: two performance-optimized network and compute timelines (in seconds) for flows f1, f2 and computations c1, c2, c3; in one c1 takes 0.5s, in the other c1 takes 1s.]
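
Applied to the numbers in this example (minimum CCT of 1s, maximum tolerable CCT of 1.5s), the ratio works out as follows. The function name is hypothetical and used only for illustration.

```python
def cct_flexibility_ratio(max_tolerable_cct, min_cct):
    """Max tolerable CCT divided by min CCT; 1.0 means no slack at all."""
    if min_cct <= 0:
        raise ValueError("min_cct must be positive")
    return max_tolerable_cct / min_cct

# Numbers from the slide's example: best CCT is 1 s, up to 1.5 s is tolerable.
ratio = cct_flexibility_ratio(max_tolerable_cct=1.5, min_cct=1.0)
print(ratio)
```

A ratio above 1 means the network may stretch the coflow by that factor without delaying the application.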

  15-17. Distributed DNN Training CadentFlow
     • Priority-based
       • Assign priorities based on DAG structure
       • Objective: minimize completion time subject to priorities
     • Deadline-based
       • Assign deadlines based on per-op computation time
       • Objective: minimize $\max_i (\mathit{endTime}_i - \mathit{deadline}_i)$, where $\mathit{endTime}_i - \mathit{deadline}_i$ is the delay of flow $i$
     [Figure: Sample TensorFlow Model: One Iteration, annotated with priorities (Read A: p1; Read B and Read C: p2; Read D: p3) and with deadlines d and times t (op1, Read A: d=0ms, t=3ms; op2 and op3, Read B and Read C: d=3ms each, t=5ms and t=4ms; op4, Read D: d=12ms, t=2ms).]
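
The two variants can be sketched in a few lines. The DAG edges below (op1 feeds op2 and op3, which both feed op4) are an assumption read off the figure layout, and ranking a flow by the depth of its consuming op is one plausible rule, not necessarily the authors' method. The deadlines come from the figure annotations; the flow end times are hypothetical.

```python
# Assumed DAG: op1 -> op2, op1 -> op3, op2 -> op4, op3 -> op4.
op_parents = {"op1": [], "op2": ["op1"], "op3": ["op1"], "op4": ["op2", "op3"]}
# Which op consumes each parameter read (from the figure).
consumer = {"Read A": "op1", "Read B": "op2", "Read C": "op3", "Read D": "op4"}

def depth(op):
    """Length of the longest dependency chain ending at op (1 for a root)."""
    return 1 + max((depth(p) for p in op_parents[op]), default=0)

# Priority-based: a flow needed earlier in the DAG gets a smaller number.
priorities = {flow: depth(op) for flow, op in consumer.items()}
print(priorities)

# Deadline-based: deadlines (ms) as annotated in the figure.
deadlines_ms = {"Read A": 0, "Read B": 3, "Read C": 3, "Read D": 12}

def max_lateness(end_times_ms, deadlines_ms):
    """Objective value: max over flows of endTime_i - deadline_i."""
    return max(end_times_ms[f] - deadlines_ms[f] for f in deadlines_ms)

# Hypothetical flow end times (ms) produced by some schedule.
end_times_ms = {"Read A": 1, "Read B": 3, "Read C": 5, "Read D": 9}
worst = max_lateness(end_times_ms, deadlines_ms)
print(worst)
```

Under these assumptions the depth rule reproduces the priorities shown in the figure (p1, p2, p2, p3), and the schedule's objective value is set by Read C, the flow that misses its deadline by the most.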

  18. Quantifying benefits achievable with a better network abstraction
     • Representative application: distributed deep learning
     • Methodology
       • Trace distributed deep learning workloads to obtain dependencies and computation/communication times
       • Simulate various network control schemes:
         1. TCP (max-min fairness across flows sharing a link)
         2. Minimum Allocation for Desired Duration (MADD) [coflow control in Varys]
         3. CadentFlow-optimized scheme
     [Figure: Sample TensorFlow Model: One Iteration]
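
For reference, the first two baseline schemes can be sketched on a single shared link. This is a simplification made for illustration, not the simulator used in the talk: the function names are hypothetical, max-min fairness is shown as the standard water-filling procedure, and MADD is reduced to its single-link case (Varys applies it across all ingress and egress ports of a coflow).

```python
def max_min_fair(capacity, demands):
    """Water-filling: satisfy the smallest demands first, then split
    whatever is left equally among the remaining flows."""
    alloc = [0.0] * len(demands)
    remaining = capacity
    active = sorted(range(len(demands)), key=lambda i: demands[i])
    while active:
        share = remaining / len(active)
        i = active[0]
        if demands[i] <= share:
            # The smallest demand fits within an equal share: grant it fully.
            alloc[i] = demands[i]
            remaining -= demands[i]
            active.pop(0)
        else:
            # No remaining flow can be fully satisfied: split equally.
            for j in active:
                alloc[j] = share
            break
    return alloc

def madd_rates_single_link(capacity, sizes):
    """MADD on one link: give each flow of the coflow just enough rate
    that all flows finish together at the bottleneck completion time."""
    gamma = sum(sizes) / capacity  # earliest possible coflow completion time
    return [size / gamma for size in sizes], gamma

print(max_min_fair(10, [2, 8, 8]))
print(madd_rates_single_link(4, [2, 6]))
```

Neither scheme looks at the order in which the application consumes the data, which is the gap the CadentFlow-optimized scheme targets.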

  19-23. Performance Improvement
     • Up to 25% improvement in iteration time with CadentFlow
     • Coflow optimization may delay completion time because smaller parameters are delayed
     • When the gain in iteration time is lower, the CCT flexibility ratio is higher
     [Figure: bar charts comparing the coflow-optimized and CadentFlow-optimized schemes for AlexNet-v2, CifarNet, Inception-v1, Inception-v3, MobileNet-v2, ResNet-v1-50, ResNet-v1-152, ResNet-v1-200, ResNet-v2-101, ResNet-v2-152 and VGG-19, with 8 workers and 8 PS and with 16 workers and 16 PS. Panels: iteration time (relative to TCP) and CCT flexibility ratio (max feasible CCT / min CCT).]
