Diagnosing Distributed CPS with Timing Provenance
Andreas Haeberlen Yang Wu Linh Thi Xuan Phan
NSF/Intel PI Meeting – July 13, 2018
Diagnosing Distributed CPS with Timing Provenance Yang Andreas - - PowerPoint PPT Presentation
NSF/Intel PI Meeting July 13, 2018 Diagnosing Distributed CPS with Timing Provenance Yang Andreas Linh Thi Xuan Wu Haeberlen Phan Problem: Timing faults n Many CPS are time dependent n The right thing must happen at the right
Andreas Haeberlen Yang Wu Linh Thi Xuan Phan
NSF/Intel PI Meeting – July 13, 2018
@2018 Linh T. X. Phan – Timing Provenance
n Many CPS are time dependent
n The “right thing” must happen at the “right time”!
n What if this goes wrong?
n Reasons: attack, bug, misconfiguration, …
n Goal: A powerful diagnostic capability
n Can we find the root cause of both functional and timing
issues, such as low throughput, oscillations, high latencies, …?
2
@2018 Linh T. X. Phan – Timing Provenance
n State of the art
n
Distributed tracing: explain what was computed when, but not why
n
Network provenance: only reason about functional causality
n Cannot reason about timing 3
S3 S4 S2 Storage Backend (B) Computing Service (C) Maintenance Service (M) S1 Why is the request taking so long to complete? Computing Request Victim Storage RPC Misbehaving Storage RPCs
Computing Request span id: 1, parent id: none Storage RPC span id: 2, parent id: 1 t1 t2 t3 t4 t5 t0 t6 Start End (time) Server Recv Server Send Start Job Client Recv Client Send
Bottleneck! Root cause?
Root cause!
Bottleneck! Root cause?
Computing Rsp generated at C at 95s Computing Req received by C at 80s Storage RPC Req sent by C at 81s Storage RPC Req received by B at 81s Storage RPC Rsp sent by B at 93s Storage RPC Rsp received by C at 93s Storage Type was remote during [0s,∞) V1 V2 Storage Block was ... during [0s,93s) (Q) How was the computing response generated?
@2018 Linh T. X. Phan – Timing Provenance
4 S3 S4 S2 Storage Backend (B) Computing Service (C) Maintenance Service (M) S1 Why is the request taking so long to complete? Computing Request Victim Storage RPC Misbehaving Storage RPCs
Root cause captured!
n A generalization of provenance that tracks both
n
i.e., causes that affect the timing of the observed symptom
n
may involve requests that are functionally independent
n Result: Can explain both the ‘what’ and the ‘when’
…
Queuing delay!
@2018 Linh T. X. Phan – Timing Provenance
n Intuition: Represent ordering relationship between exec.
n
We need to know not just what the system did, but also in what order (queuing and scheduling semantics)
n Extend critical-path analysis in a novel way for the
5
Dequeue Enqueue Request D can only be dequeued after C is dequeued and finished processing
@2018 Linh T. X. Phan – Timing Provenance
n Add a sequencing edge from execution X to execution Y
6
t=114, @C, VMRsp Seq: 7, Enq: 113, Acq: 113, Deq: 114
Slow Compute Response
t=84, @M, StorageReq Seq: 1, Enq: 83, Acq: 83, Deq: 84 t=86, @M-S, StorageReq Seq: 1, Enq: 84, Acq: 84, Deq: 86 t=83, @M, MaintainReq t=86, @M, StorageReq Seq: 2, Enq: 85, Acq: 85, Deq: 86 t=88, @M-S, StorageReq Seq: 2, Enq: 86, Acq: 86, Deq: 88 t=85, @M, MaintainReq t=83, @M, StorageReq Seq: 0, Enq: 82, Acq: 82, Deq: 83 t=84, @C-S, StorageReq Seq: 0, Enq: 83, Acq: 83, Deq: 84 t=82, @M, MaintainReq
Maintenance Jobs
t=105, @S, StorageRsp Seq: 1, Enq: 86, Acq: 102, Deq: 105 t=108, @S, StorageRsp Seq: 2, Enq: 86, Acq: 105, Deq: 108 t=102, @S, StorageRsp Seq: 0, Enq: 84, Acq: 99, Deq: 102 A
t=113, @S-C, StorageRsp Seq: 1, Enq: 111, Acq: 111, Deq: 113 t=111, @S, StorageRsp Seq: 3, Enq: 91, Acq: 108, Deq: 111
Root cause! Symptom A B C D
t=87, @C, VMReq
Compute Request
t=90, @C, StorageReq Seq: 3, Enq: 87, Acq: 89, Deq: 90 t=91, @C, BillingReq Seq: 4, Enq: 87, Acq: 90, Deq: 91 t=88, @C, NetworkReq Seq: 1, Enq: 87, Acq: 87, Deq: 88 t=89, @C, ComputeReq Seq: 2, Enq: 87, Acq: 88, Deq: 89
Network RPC Compute RPC Billing RPC Storage RPC
t=92, @C, NetworkRsp Seq: 5, Enq: 87, Acq: 91, Deq: 92 t=97, @C, ComputeRsp, Seq: 6, Enq: 89, Acq: 92, Deq: 97 t=100, @B-C, BillingRsp Seq: 1, Enq: 98, Acq: 98, Deq: 100 t=93, @C-B, BillingReq Seq: 1, Enq: 91, Acq: 91, Deq: 93 t=98, @B, BillingRsp, Seq: 2, Enq: 93, Acq: 93, Deq: 98
t=91, @C-S, StorageReq Seq: 1, Enq: 90, Acq: 90, Deq: 91
@2018 Linh T. X. Phan – Timing Provenance
n Not all executions are equally important n How to isolate executions that contribute substantially to
7
t=87, @C, VMReq t=90, @C, StorageReq Seq: 3, Enq: 87, Acq: 89, Deq: 90 t=91, @C, BillingReq Seq: 4, Enq: 87, Acq: 90, Deq: 91 t=92, @C, NetworkRsp Seq: 5, Enq: 87, Acq: 91, Deq: 92 t=97, @C, ComputeRsp, Seq: 6, Enq: 89, Acq: 92, Deq: 97 t=113, @S-C, StorageRsp Seq: 1, Enq: 111, Acq: 111, Deq: 113 t=100, @B-C, BillingRsp Seq: 1, Enq: 98, Acq: 98, Deq: 100 t=114, @C, VMRsp Seq: 7, Enq: 113, Acq: 113, Deq: 114 t=91, @C-S, StorageReq Seq: 1, Enq: 90, Acq: 90, Deq: 91 t=93, @C-B, BillingReq Seq: 1, Enq: 91, Acq: 91, Deq: 93 t=111, @S, StorageRsp Seq: 3, Enq: 91, Acq: 108, Deq: 111 t=98, @B, BillingRsp, Seq: 2, Enq: 93, Acq: 93, Deq: 98 t=84, @M, StorageReq Seq: 1, Enq: 83, Acq: 83, Deq: 84 t=86, @M-S, StorageReq Seq: 1, Enq: 84, Acq: 84, Deq: 86 t=105, @S, StorageRsp Seq: 1, Enq: 86, Acq: 102, Deq: 105 t=83, @M, MaintainReq t=86, @M, StorageReq Seq: 2, Enq: 85, Acq: 85, Deq: 86 t=88, @M-S, StorageReq Seq: 2, Enq: 86, Acq: 86, Deq: 88 t=108, @S, StorageRsp Seq: 2, Enq: 86, Acq: 105, Deq: 108 t=85, @M, MaintainReq t=88, @C, NetworkReq Seq: 1, Enq: 87, Acq: 87, Deq: 88 t=89, @C, ComputeReq Seq: 2, Enq: 87, Acq: 88, Deq: 89 t=83, @M, StorageReq Seq: 0, Enq: 82, Acq: 82, Deq: 83 t=84, @C-S, StorageReq Seq: 0, Enq: 83, Acq: 83, Deq: 84 t=82, @M, MaintainReq t=102, @S, StorageRsp Seq: 0, Enq: 84, Acq: 99, Deq: 102
Network RPC Compute RPC Compute Request Billing RPC Storage RPC Slow Compute Response Maintenance Jobs
@2018 Linh T. X. Phan – Timing Provenance
n Annotate vertexes with the delays that they contribute 8
t=87, @C, VMReq t=90, @C, StorageReq Seq: 3, Enq: 87, Acq: 89, Deq: 90 t=91, @C, BillingReq Seq: 4, Enq: 87, Acq: 90, Deq: 91 t=92, @C, NetworkRsp Seq: 5, Enq: 87, Acq: 91, Deq: 92 t=97, @C, ComputeRsp, Seq: 6, Enq: 89, Acq: 92, Deq: 97 t=113, @S-C, StorageRsp Seq: 1, Enq: 111, Acq: 111, Deq: 113 t=100, @B-C, BillingRsp Seq: 1, Enq: 98, Acq: 98, Deq: 100 t=114, @C, VMRsp Seq: 7, Enq: 113, Acq: 113, Deq: 114 t=91, @C-S, StorageReq Seq: 1, Enq: 90, Acq: 90, Deq: 91 t=93, @C-B, BillingReq Seq: 1, Enq: 91, Acq: 91, Deq: 93 t=111, @S, StorageRsp Seq: 3, Enq: 91, Acq: 108, Deq: 111 t=98, @B, BillingRsp, Seq: 2, Enq: 93, Acq: 93, Deq: 98 t=84, @M, StorageReq Seq: 1, Enq: 83, Acq: 83, Deq: 84 t=86, @M-S, StorageReq Seq: 1, Enq: 84, Acq: 84, Deq: 86 t=105, @S, StorageRsp Seq: 1, Enq: 86, Acq: 102, Deq: 105 t=83, @M, MaintainReq t=86, @M, StorageReq Seq: 2, Enq: 85, Acq: 85, Deq: 86 t=88, @M-S, StorageReq Seq: 2, Enq: 86, Acq: 86, Deq: 88 t=108, @S, StorageRsp Seq: 2, Enq: 86, Acq: 105, Deq: 108 t=85, @M, MaintainReq t=88, @C, NetworkReq Seq: 1, Enq: 87, Acq: 87, Deq: 88 t=89, @C, ComputeReq Seq: 2, Enq: 87, Acq: 88, Deq: 89 t=83, @M, StorageReq Seq: 0, Enq: 82, Acq: 82, Deq: 83 t=84, @C-S, StorageReq Seq: 0, Enq: 83, Acq: 83, Deq: 84 t=82, @M, MaintainReq t=102, @S, StorageRsp Seq: 0, Enq: 84, Acq: 99, Deq: 102
Network RPC Compute RPC Compute Request Billing RPC Storage RPC Slow Compute Response Maintenance Jobs
+0.003s +0.002s +0.004s +90s +90.009s
@2018 Linh T. X. Phan – Timing Provenance
t=87, @C, VMReq t=90, @C, StorageReq Seq: 3, Enq: 87, Acq: 89, Deq: 90 t=91, @C, BillingReq Seq: 4, Enq: 87, Acq: 90, Deq: 91 t=92, @C, NetworkRsp Seq: 5, Enq: 87, Acq: 91, Deq: 92 t=97, @C, ComputeRsp, Seq: 6, Enq: 89, Acq: 92, Deq: 97 t=113, @S-C, StorageRsp Seq: 1, Enq: 111, Acq: 111, Deq: 113 t=100, @B-C, BillingRsp Seq: 1, Enq: 98, Acq: 98, Deq: 100 t=114, @C, VMRsp Seq: 7, Enq: 113, Acq: 113, Deq: 114 t=91, @C-S, StorageReq Seq: 1, Enq: 90, Acq: 90, Deq: 91 t=93, @C-B, BillingReq Seq: 1, Enq: 91, Acq: 91, Deq: 93 t=111, @S, StorageRsp Seq: 3, Enq: 91, Acq: 108, Deq: 111 t=98, @B, BillingRsp, Seq: 2, Enq: 93, Acq: 93, Deq: 98 t=84, @M, StorageReq Seq: 1, Enq: 83, Acq: 83, Deq: 84 t=86, @M-S, StorageReq Seq: 1, Enq: 84, Acq: 84, Deq: 86 t=105, @S, StorageRsp Seq: 1, Enq: 86, Acq: 102, Deq: 105 t=83, @M, MaintainReq t=86, @M, StorageReq Seq: 2, Enq: 85, Acq: 85, Deq: 86 t=88, @M-S, StorageReq Seq: 2, Enq: 86, Acq: 86, Deq: 88 t=108, @S, StorageRsp Seq: 2, Enq: 86, Acq: 105, Deq: 108 t=85, @M, MaintainReq t=88, @C, NetworkReq Seq: 1, Enq: 87, Acq: 87, Deq: 88 t=89, @C, ComputeReq Seq: 2, Enq: 87, Acq: 88, Deq: 89 t=83, @M, StorageReq Seq: 0, Enq: 82, Acq: 82, Deq: 83 t=84, @C-S, StorageReq Seq: 0, Enq: 83, Acq: 83, Deq: 84 t=82, @M, MaintainReq t=102, @S, StorageRsp Seq: 0, Enq: 84, Acq: 99, Deq: 102
Network RPC Compute RPC Compute Request Billing RPC Storage RPC Slow Compute Response Maintenance Jobs
+0.003s +0.002s +0.004s +90s +90.009s +0.009s +0s
n Annotate vertexes with the delays that they contribute n Goal: Delay annotations should correspond to “potential
9
@2018 Linh T. X. Phan – Timing Provenance
10
n
Rule 1: Subdivide delay among the preconditions in the order in which they are satisfied
n
Rule 2: Attribute the remaining delay to predecessors along the sequencing edge
A@X :- B@X,E@Y B@X :- C@X E@Y :- C@X C@X :- Z@X G@X :- F@X X Y
1 2 3 4 5 6 7 8
C B A E
INS(Z)
INS(Z) DRV(C) [0s,8s] [0s,4s] [0s,2s]
G
INS(F)
C E
SND(+C) RCV(+C) DRV(E) SND(+E) INS(F) DRV(B) RCV(+E) DRV(G) DRV(A) [6s,7s] [4s,6s] [4s,5s] [4s,5s]
DRV(B) RCV(+E) DRV(G) DRV(A)
@2018 Linh T. X. Phan – Timing Provenance
t=87, @C, VMReq t=90, @C, StorageReq Seq: 3, Enq: 87, Acq: 89, Deq: 90 t=91, @C, BillingReq Seq: 4, Enq: 87, Acq: 90, Deq: 91 t=92, @C, NetworkRsp Seq: 5, Enq: 87, Acq: 91, Deq: 92 t=97, @C, ComputeRsp, Seq: 6, Enq: 89, Acq: 92, Deq: 97 t=113, @S-C, StorageRsp Seq: 1, Enq: 111, Acq: 111, Deq: 113 t=100, @B-C, BillingRsp Seq: 1, Enq: 98, Acq: 98, Deq: 100 t=114, @C, VMRsp Seq: 7, Enq: 113, Acq: 113, Deq: 114 t=91, @C-S, StorageReq Seq: 1, Enq: 90, Acq: 90, Deq: 91 t=93, @C-B, BillingReq Seq: 1, Enq: 91, Acq: 91, Deq: 93 t=111, @S, StorageRsp Seq: 3, Enq: 91, Acq: 108, Deq: 111 t=98, @B, BillingRsp, Seq: 2, Enq: 93, Acq: 93, Deq: 98 t=84, @M, StorageReq Seq: 1, Enq: 83, Acq: 83, Deq: 84 t=86, @M-S, StorageReq Seq: 1, Enq: 84, Acq: 84, Deq: 86 t=105, @S, StorageRsp Seq: 1, Enq: 86, Acq: 102, Deq: 105 t=83, @M, MaintainReq t=86, @M, StorageReq Seq: 2, Enq: 85, Acq: 85, Deq: 86 t=88, @M-S, StorageReq Seq: 2, Enq: 86, Acq: 86, Deq: 88 t=108, @S, StorageRsp Seq: 2, Enq: 86, Acq: 105, Deq: 108 t=85, @M, MaintainReq t=88, @C, NetworkReq Seq: 1, Enq: 87, Acq: 87, Deq: 88 t=89, @C, ComputeReq Seq: 2, Enq: 87, Acq: 88, Deq: 89 t=83, @M, StorageReq Seq: 0, Enq: 82, Acq: 82, Deq: 83 t=84, @C-S, StorageReq Seq: 0, Enq: 83, Acq: 83, Deq: 84 t=82, @M, MaintainReq t=102, @S, StorageRsp Seq: 0, Enq: 84, Acq: 99, Deq: 102
Network RPC Compute RPC Compute Request Billing RPC Storage RPC Slow Compute Response Maintenance Jobs
+0.003s +0.002s +0.004s +90s +90.009s
t=87, @C, VMReq t=90, @C, StorageReq Seq: 3, Enq: 87, Acq: 89, Deq: 90 t=91, @C, BillingReq Seq: 4, Enq: 87, Acq: 90, Deq: 91 t=92, @C, NetworkRsp Seq: 5, Enq: 87, Acq: 91, Deq: 92 t=97, @C, ComputeRsp, Seq: 6, Enq: 89, Acq: 92, Deq: 97 t=113, @S-C, StorageRsp Seq: 1, Enq: 111, Acq: 111, Deq: 113 t=100, @B-C, BillingRsp Seq: 1, Enq: 98, Acq: 98, Deq: 100 t=114, @C, VMRsp Seq: 7, Enq: 113, Acq: 113, Deq: 114 t=91, @C-S, StorageReq Seq: 1, Enq: 90, Acq: 90, Deq: 91 t=93, @C-B, BillingReq Seq: 1, Enq: 91, Acq: 91, Deq: 93 t=111, @S, StorageRsp Seq: 3, Enq: 91, Acq: 108, Deq: 111 t=98, @B, BillingRsp, Seq: 2, Enq: 93, Acq: 93, Deq: 98 t=86, @M, StorageReq Seq: 2, Enq: 85, Acq: 85, Deq: 86 t=88, @M-S, StorageReq Seq: 2, Enq: 86, Acq: 86, Deq: 88 t=108, @S, StorageRsp Seq: 2, Enq: 86, Acq: 105, Deq: 108 t=85, @M, MaintainReq t=88, @C, NetworkReq Seq: 1, Enq: 87, Acq: 87, Deq: 88 t=89, @C, ComputeReq Seq: 2, Enq: 87, Acq: 88, Deq: 89
Network RPC Compute RPC Compute Request Billing RPC Storage RPC Slow Compute Response Maintenance Jobs
11 n Aggregating subgraphs that are structurally similar n Pruning zero-delay subgraphs
@2018 Linh T. X. Phan – Timing Provenance
12
S3 S4 S2 Storage Backend (B) Computing Service (C) Maintenance Service (M) S1 Why is the request taking so long to complete? Computing Request Victim Storage RPC Misbehaving Storage RPCs
(Q) Why did the computing request take 14 seconds? Maintenance Req received by M at 74s,...,79s Storage RPC Req sent by M at 75s,...,80s Storage RPC Req received by B at 75s,...,80s (B) 13 seconds spent on queuing for other RPCs.
Root cause!
Computing Req received by C at 80s Storage RPC Req sent by C at 81s Storage RPC Req received by B at 81s
... ...
Queue Rsp dequeued on B at 81s,...,93s
...
(A) 1 second spent on issuing the RPC.
...
Queue Req enqueued on B at 81s Queue Req enqueued on B at 75s,...,80s
...
n Detailed and weighted causal explanation of the delay n Can find off-path root causes!
@2018 Linh T. X. Phan – Timing Provenance
n Zeno, a debugger for timing-related faults n Support for declarative + imperative systems
n Interfaces with NDlog and Zipkin n Gathers data from switches w/P4
n Evaluation
n Evaluated with 9 realistic bugs from Google Cloud platform* n Used networks that contained 8-700 nodes n Results are promising
13
@2018 Linh T. X. Phan – Timing Provenance
n Correctly identifies 11-28 relevant events
14
500 1000 1500 2000 2500 3000 R1 R2 R3 R4 Z1 Z2 Vertices in response Query Raw w/ annotation (w>0) w/ annotation (w=0) w/ pruning (w>0) w/ pruning (w=0) w/ aggregation (w>0) w/ aggregation (w=0)
35
Original All steps applied
Size of the provenance for different example scenarios
11-35 vertexes contributing delay
Turnaround time for provenance queries
Diagnosis in less than 10s
@2018 Linh T. X. Phan – Timing Provenance
n A generalization of provenance to explicit represent
n The provenance tracks both functional and temporal causality
through sequencing edges
n Delay annotations + provenance aggregation improves usability n Applied to RapidNet and Zipkin: Can find off-path root causes
n Benefit: Precise reasoning of both functional and
n This will be useful for CPS diagnostics where time matters!
n On-going work
n
Generalize to more complex scheduling policies
15