
Diagnosing Distributed CPS with Timing Provenance

NSF/Intel PI Meeting, July 13, 2018. Yang Wu, Andreas Haeberlen, Linh Thi Xuan Phan


  1. NSF/Intel PI Meeting – July 13, 2018
     Diagnosing Distributed CPS with Timing Provenance
     Yang Wu, Andreas Haeberlen, Linh Thi Xuan Phan

  2. Problem: Timing faults
     - Many CPS are time dependent
       - The “right thing” must happen at the “right time”!
     - What if this goes wrong?
       - Reasons: attack, bug, misconfiguration, …
     - Goal: A powerful diagnostic capability
       - Can we find the root cause of both functional and timing issues, such as low throughput, oscillations, high latencies, …?
     @2018 Linh T. X. Phan – Timing Provenance

  3. Why is the request taking so long to complete?
     - Scenario (figure): a Computing Request arrives at the Computing Service (C), which issues a Storage RPC to the Storage Backend (B). Also shown: the Maintenance Service (M), storage requests S1 to S4, and the labels “Victim”, “Misbehaving Storage RPCs”, “Bottleneck!”, and “Root cause?”.
     - Distributed trace (figure): span “Computing Request” (span id: 1, parent id: none) and span “Storage RPC” (span id: 2, parent id: 1), with timestamps t0 to t6 and event labels including Client Send, Server Recv, Server Send, and Client Recv.
     - Provenance tree (figure), answering “(Q) How was the computing response generated?”:
       - Computing Rsp generated at C at 95s
       - Storage RPC Rsp received by C at 93s
       - Storage RPC Rsp sent by B at 93s
       - Storage Block was ... during [0s, 93s)
       - Storage RPC Req received by B at 81s
       - Storage RPC Req sent by C at 81s
       - Computing Req received by C at 80s
       - Storage Type was remote during [0s, ∞)
     - State of the art:
       - Distributed tracing: explains what was computed when, but not why
       - Network provenance: only reasons about functional causality; cannot reason about timing

  4. Approach: Timing provenance
     - A generalization of provenance that tracks both functional causality and temporal causality
       - i.e., causes that affect the timing of the observed symptom
       - may involve requests that are functionally independent
     - Result: Can explain both the ‘what’ and the ‘when’
     - (Figure: the scenario with Computing Service (C), Storage Backend (B), Maintenance Service (M), and storage requests S1 to S4, asking “Why is the request taking so long to complete?”, with the labels “Misbehaving Storage RPCs”, “Root cause captured!”, and “Queuing delay!”.)

  5. How to capture temporal causality?
     - Intuition: Represent the ordering relationship between executions
       - We need to know not just what the system did, but also in what order (queuing and scheduling semantics)
     - Extend critical-path analysis in a novel way for the analysis
     - Example (figure): requests A, B, C, D pass through a queue (Enqueue, Dequeue). Request D can only be dequeued after C is dequeued and has finished processing, so the timing provenance of D must include C.
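The queue example on this slide can be sketched in a few lines. This is a minimal illustration, assuming a single FIFO server; the function name `fifo_waits`, the arrival times, and the service times are hypothetical and do not come from the slides. It records, for each request, which earlier request it had to wait behind.

```python
def fifo_waits(requests):
    """Replay a single FIFO server.

    requests: list of (name, enqueue_time, service_time), already in queue order.
    Returns {name: (dequeue_time, finish_time, waited_behind)}, where
    waited_behind is the previous request if this one had to wait for it,
    and None otherwise.
    """
    result = {}
    free_at = 0.0   # time at which the server finishes its current request
    prev = None     # name of the request immediately ahead in the queue
    for name, enqueue, service in requests:
        dequeue = max(enqueue, free_at)
        blocker = prev if free_at > enqueue else None
        finish = dequeue + service
        result[name] = (dequeue, finish, blocker)
        free_at, prev = finish, name
    return result


# Hypothetical workload: C is slow, and D is queued right behind it.
queue = [("A", 0.0, 1.0), ("B", 0.0, 1.0), ("C", 0.0, 10.0), ("D", 0.0, 1.0)]
timeline = fifo_waits(queue)

for name, (dequeue, finish, blocker) in timeline.items():
    print(f"{name}: dequeued at {dequeue}, finished at {finish}, waited behind {blocker}")
```

In this sketch D is functionally independent of C, yet D's dequeue time is determined entirely by when C finishes, which is the kind of dependency that purely functional provenance would miss.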

  6. Insight #1: Sequencing edges
     - Add a sequencing edge from execution X to execution Y if X immediately precedes Y in the queue
     - (Figure: provenance graph leading from the “Root cause!” (Maintenance Jobs) to the “Symptom” (Slow Compute Response). Each vertex carries a time, a location, a message type, and the fields Seq, Enq, Acq, Deq. Groups of vertices are labeled Compute Request, Compute RPC, Network RPC, Billing RPC, Storage RPC, and Maintenance Jobs; four requests are marked A, B, C, D.)
     - Symptom vertex: t=114, @C, VMRsp (Seq: 7, Enq: 113, Acq: 113, Deq: 114)
     - Storage responses at S:
       - t=102, @S, StorageRsp (Seq: 0, Enq: 84, Acq: 99, Deq: 102)
       - t=105, @S, StorageRsp (Seq: 1, Enq: 86, Acq: 102, Deq: 105)
       - t=108, @S, StorageRsp (Seq: 2, Enq: 86, Acq: 105, Deq: 108)
       - t=111, @S, StorageRsp (Seq: 3, Enq: 91, Acq: 108, Deq: 111)
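The sequencing-edge rule can be sketched as follows. The sketch assumes that the `Seq` field shown in the figure is an execution's position in its node's queue; that reading is an assumption, as are the `id` labels and the function name `sequencing_edges`. The four executions are the StorageRsp vertices at S from the figure, with their Seq values.

```python
from collections import defaultdict

# Four executions at the storage backend (@S) with the Seq values from the slide.
# The "id" strings are hypothetical labels invented for this sketch.
executions = [
    {"id": "S@111", "node": "S", "seq": 3},
    {"id": "S@102", "node": "S", "seq": 0},
    {"id": "S@108", "node": "S", "seq": 2},
    {"id": "S@105", "node": "S", "seq": 1},
]


def sequencing_edges(executions):
    """Return (X, Y) pairs where X immediately precedes Y in the same node's queue."""
    by_node = defaultdict(list)
    for ex in executions:
        by_node[ex["node"]].append(ex)

    edges = []
    for group in by_node.values():
        ordered = sorted(group, key=lambda ex: ex["seq"])
        # Link each execution to its immediate successor in queue order.
        for x, y in zip(ordered, ordered[1:]):
            edges.append((x["id"], y["id"]))
    return edges


edges = sequencing_edges(executions)
print(edges)
```

Following these edges backwards from the last storage response reaches every earlier execution in the same queue, including ones that belong to unrelated requests.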

  7. Challenge: Usability
     - Not all executions are equally important
     - How to isolate executions that contribute substantially to the overall delay?
     - (Figure: the provenance graph from slide 6, unchanged.)

  8. Insight #2: Delay annotations
     - Annotate vertices with the delays that they contribute
     - (Figure: the provenance graph from slide 6, with delay annotations +90.009s, +90s, +0.004s, +0.002s, and +0.003s added to vertices.)

  9. Insight #2: Delay annotations
     - Annotate vertices with the delays that they contribute
     - Goal: Delay annotations should correspond to “potential speedup”
     - (Figure: the provenance graph from slide 6, with delay annotations +0s, +90.009s, +0.009s, +90s, +0.004s, +0.002s, and +0.003s added to vertices.)
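One way to read “potential speedup” is as a what-if question: how much earlier would the symptom complete if a given execution took no time at all? The sketch below illustrates that reading on a single FIFO queue. It is not the authors' algorithm; the function names `replay` and `potential_speedup` and the workload are hypothetical.

```python
def replay(jobs, skip=None):
    """Replay a single FIFO server and return the finish time of the last job.

    jobs: list of (name, enqueue_time, service_time) in queue order.
    skip: name of a job whose service time is treated as zero.
    """
    free_at = 0
    for name, enqueue, service in jobs:
        if name == skip:
            service = 0
        start = max(enqueue, free_at)
        free_at = start + service
    return free_at


def potential_speedup(jobs):
    """For each job, how much earlier the last job would finish without it."""
    baseline = replay(jobs)
    return {name: baseline - replay(jobs, skip=name) for name, _, _ in jobs}


# Hypothetical workload: C is the symptom; A and B are queued ahead of it.
jobs = [("A", 0, 1), ("B", 3, 4), ("C", 5, 1)]
speedups = potential_speedup(jobs)
print(speedups)
```

In this example B runs for 4 time units, but removing it would speed up C by only 2, because C does not arrive until time 5. A finishes before anything else is waiting, so its annotation is 0. Annotating vertices with raw durations would therefore overstate what fixing them could gain.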
