Diagnosing Distributed CPS with Timing Provenance Yang Andreas - - PowerPoint PPT Presentation

diagnosing distributed cps with timing provenance
SMART_READER_LITE
LIVE PREVIEW

Diagnosing Distributed CPS with Timing Provenance Yang Andreas - - PowerPoint PPT Presentation

NSF/Intel PI Meeting July 13, 2018 Diagnosing Distributed CPS with Timing Provenance Yang Andreas Linh Thi Xuan Wu Haeberlen Phan Problem: Timing faults n Many CPS are time dependent n The right thing must happen at the right


slide-1
SLIDE 1

Diagnosing Distributed CPS with Timing Provenance

Andreas Haeberlen Yang Wu Linh Thi Xuan Phan

NSF/Intel PI Meeting – July 13, 2018

slide-2
SLIDE 2

@2018 Linh T. X. Phan – Timing Provenance

Problem: Timing faults

n Many CPS are time dependent

n The “right thing” must happen at the “right time”!

n What if this goes wrong?

n Reasons: attack, bug, misconfiguration, …

n Goal: A powerful diagnostic capability

n Can we find the root cause of both functional and timing

issues, such as low throughput, oscillations, high latencies, …?

2

slide-3
SLIDE 3

@2018 Linh T. X. Phan – Timing Provenance

Challenge

n State of the art

n

Distributed tracing: explain what was computed when, but not why

n

Network provenance: only reason about functional causality

n Cannot reason about timing 3

S3 S4 S2 Storage Backend (B) Computing Service (C) Maintenance Service (M) S1 Why is the request taking so long to complete? Computing Request Victim Storage RPC Misbehaving Storage RPCs

Computing Request span id: 1, parent id: none Storage RPC span id: 2, parent id: 1 t1 t2 t3 t4 t5 t0 t6 Start End (time) Server Recv Server Send Start Job Client Recv Client Send

Bottleneck! Root cause?

Root cause!

Bottleneck! Root cause?

Computing Rsp generated at C at 95s Computing Req received by C at 80s Storage RPC Req sent by C at 81s Storage RPC Req received by B at 81s Storage RPC Rsp sent by B at 93s Storage RPC Rsp received by C at 93s Storage Type was remote during [0s,∞) V1 V2 Storage Block was ... during [0s,93s) (Q) How was the computing response generated?

slide-4
SLIDE 4

@2018 Linh T. X. Phan – Timing Provenance

Approach: Timing provenance

4 S3 S4 S2 Storage Backend (B) Computing Service (C) Maintenance Service (M) S1 Why is the request taking so long to complete? Computing Request Victim Storage RPC Misbehaving Storage RPCs

Root cause captured!

n A generalization of provenance that tracks both

functional causality and temporal causality

n

i.e., causes that affect the timing of the observed symptom

n

may involve requests that are functionally independent

n Result: Can explain both the ‘what’ and the ‘when’

Queuing delay!

slide-5
SLIDE 5

@2018 Linh T. X. Phan – Timing Provenance

How to capture temporal causality?

n Intuition: Represent ordering relationship between exec.

n

We need to know not just what the system did, but also in what order (queuing and scheduling semantics)

n Extend critical-path analysis in a novel way for the

analysis

5

A B C D

Dequeue Enqueue Request D can only be dequeued after C is dequeued and finished processing

Timing provenance of D must include C

C D

slide-6
SLIDE 6

@2018 Linh T. X. Phan – Timing Provenance

Insight #1: Sequencing edges

n Add a sequencing edge from execution X to execution Y

if X immediately precedes Y in the queue

6

t=114, @C, VMRsp Seq: 7, Enq: 113, Acq: 113, Deq: 114

Slow Compute Response

C

t=84, @M, StorageReq Seq: 1, Enq: 83, Acq: 83, Deq: 84 t=86, @M-S, StorageReq Seq: 1, Enq: 84, Acq: 84, Deq: 86 t=83, @M, MaintainReq t=86, @M, StorageReq Seq: 2, Enq: 85, Acq: 85, Deq: 86 t=88, @M-S, StorageReq Seq: 2, Enq: 86, Acq: 86, Deq: 88 t=85, @M, MaintainReq t=83, @M, StorageReq Seq: 0, Enq: 82, Acq: 82, Deq: 83 t=84, @C-S, StorageReq Seq: 0, Enq: 83, Acq: 83, Deq: 84 t=82, @M, MaintainReq

Maintenance Jobs

t=105, @S, StorageRsp Seq: 1, Enq: 86, Acq: 102, Deq: 105 t=108, @S, StorageRsp Seq: 2, Enq: 86, Acq: 105, Deq: 108 t=102, @S, StorageRsp Seq: 0, Enq: 84, Acq: 99, Deq: 102 A

B

t=113, @S-C, StorageRsp Seq: 1, Enq: 111, Acq: 111, Deq: 113 t=111, @S, StorageRsp Seq: 3, Enq: 91, Acq: 108, Deq: 111

Root cause! Symptom A B C D

t=87, @C, VMReq

Compute Request

t=90, @C, StorageReq Seq: 3, Enq: 87, Acq: 89, Deq: 90 t=91, @C, BillingReq Seq: 4, Enq: 87, Acq: 90, Deq: 91 t=88, @C, NetworkReq Seq: 1, Enq: 87, Acq: 87, Deq: 88 t=89, @C, ComputeReq Seq: 2, Enq: 87, Acq: 88, Deq: 89

Network RPC Compute RPC Billing RPC Storage RPC

t=92, @C, NetworkRsp Seq: 5, Enq: 87, Acq: 91, Deq: 92 t=97, @C, ComputeRsp, Seq: 6, Enq: 89, Acq: 92, Deq: 97 t=100, @B-C, BillingRsp Seq: 1, Enq: 98, Acq: 98, Deq: 100 t=93, @C-B, BillingReq Seq: 1, Enq: 91, Acq: 91, Deq: 93 t=98, @B, BillingRsp, Seq: 2, Enq: 93, Acq: 93, Deq: 98

D

t=91, @C-S, StorageReq Seq: 1, Enq: 90, Acq: 90, Deq: 91

slide-7
SLIDE 7

@2018 Linh T. X. Phan – Timing Provenance

n Not all executions are equally important n How to isolate executions that contribute substantially to

the overall delay?

7

t=87, @C, VMReq t=90, @C, StorageReq Seq: 3, Enq: 87, Acq: 89, Deq: 90 t=91, @C, BillingReq Seq: 4, Enq: 87, Acq: 90, Deq: 91 t=92, @C, NetworkRsp Seq: 5, Enq: 87, Acq: 91, Deq: 92 t=97, @C, ComputeRsp, Seq: 6, Enq: 89, Acq: 92, Deq: 97 t=113, @S-C, StorageRsp Seq: 1, Enq: 111, Acq: 111, Deq: 113 t=100, @B-C, BillingRsp Seq: 1, Enq: 98, Acq: 98, Deq: 100 t=114, @C, VMRsp Seq: 7, Enq: 113, Acq: 113, Deq: 114 t=91, @C-S, StorageReq Seq: 1, Enq: 90, Acq: 90, Deq: 91 t=93, @C-B, BillingReq Seq: 1, Enq: 91, Acq: 91, Deq: 93 t=111, @S, StorageRsp Seq: 3, Enq: 91, Acq: 108, Deq: 111 t=98, @B, BillingRsp, Seq: 2, Enq: 93, Acq: 93, Deq: 98 t=84, @M, StorageReq Seq: 1, Enq: 83, Acq: 83, Deq: 84 t=86, @M-S, StorageReq Seq: 1, Enq: 84, Acq: 84, Deq: 86 t=105, @S, StorageRsp Seq: 1, Enq: 86, Acq: 102, Deq: 105 t=83, @M, MaintainReq t=86, @M, StorageReq Seq: 2, Enq: 85, Acq: 85, Deq: 86 t=88, @M-S, StorageReq Seq: 2, Enq: 86, Acq: 86, Deq: 88 t=108, @S, StorageRsp Seq: 2, Enq: 86, Acq: 105, Deq: 108 t=85, @M, MaintainReq t=88, @C, NetworkReq Seq: 1, Enq: 87, Acq: 87, Deq: 88 t=89, @C, ComputeReq Seq: 2, Enq: 87, Acq: 88, Deq: 89 t=83, @M, StorageReq Seq: 0, Enq: 82, Acq: 82, Deq: 83 t=84, @C-S, StorageReq Seq: 0, Enq: 83, Acq: 83, Deq: 84 t=82, @M, MaintainReq t=102, @S, StorageRsp Seq: 0, Enq: 84, Acq: 99, Deq: 102

Network RPC Compute RPC Compute Request Billing RPC Storage RPC Slow Compute Response Maintenance Jobs

Challenge: Usability

slide-8
SLIDE 8

@2018 Linh T. X. Phan – Timing Provenance

n Annotate vertexes with the delays that they contribute 8

t=87, @C, VMReq t=90, @C, StorageReq Seq: 3, Enq: 87, Acq: 89, Deq: 90 t=91, @C, BillingReq Seq: 4, Enq: 87, Acq: 90, Deq: 91 t=92, @C, NetworkRsp Seq: 5, Enq: 87, Acq: 91, Deq: 92 t=97, @C, ComputeRsp, Seq: 6, Enq: 89, Acq: 92, Deq: 97 t=113, @S-C, StorageRsp Seq: 1, Enq: 111, Acq: 111, Deq: 113 t=100, @B-C, BillingRsp Seq: 1, Enq: 98, Acq: 98, Deq: 100 t=114, @C, VMRsp Seq: 7, Enq: 113, Acq: 113, Deq: 114 t=91, @C-S, StorageReq Seq: 1, Enq: 90, Acq: 90, Deq: 91 t=93, @C-B, BillingReq Seq: 1, Enq: 91, Acq: 91, Deq: 93 t=111, @S, StorageRsp Seq: 3, Enq: 91, Acq: 108, Deq: 111 t=98, @B, BillingRsp, Seq: 2, Enq: 93, Acq: 93, Deq: 98 t=84, @M, StorageReq Seq: 1, Enq: 83, Acq: 83, Deq: 84 t=86, @M-S, StorageReq Seq: 1, Enq: 84, Acq: 84, Deq: 86 t=105, @S, StorageRsp Seq: 1, Enq: 86, Acq: 102, Deq: 105 t=83, @M, MaintainReq t=86, @M, StorageReq Seq: 2, Enq: 85, Acq: 85, Deq: 86 t=88, @M-S, StorageReq Seq: 2, Enq: 86, Acq: 86, Deq: 88 t=108, @S, StorageRsp Seq: 2, Enq: 86, Acq: 105, Deq: 108 t=85, @M, MaintainReq t=88, @C, NetworkReq Seq: 1, Enq: 87, Acq: 87, Deq: 88 t=89, @C, ComputeReq Seq: 2, Enq: 87, Acq: 88, Deq: 89 t=83, @M, StorageReq Seq: 0, Enq: 82, Acq: 82, Deq: 83 t=84, @C-S, StorageReq Seq: 0, Enq: 83, Acq: 83, Deq: 84 t=82, @M, MaintainReq t=102, @S, StorageRsp Seq: 0, Enq: 84, Acq: 99, Deq: 102

Network RPC Compute RPC Compute Request Billing RPC Storage RPC Slow Compute Response Maintenance Jobs

+0.003s +0.002s +0.004s +90s +90.009s

Insight #2: Delay annotations

slide-9
SLIDE 9

@2018 Linh T. X. Phan – Timing Provenance

t=87, @C, VMReq t=90, @C, StorageReq Seq: 3, Enq: 87, Acq: 89, Deq: 90 t=91, @C, BillingReq Seq: 4, Enq: 87, Acq: 90, Deq: 91 t=92, @C, NetworkRsp Seq: 5, Enq: 87, Acq: 91, Deq: 92 t=97, @C, ComputeRsp, Seq: 6, Enq: 89, Acq: 92, Deq: 97 t=113, @S-C, StorageRsp Seq: 1, Enq: 111, Acq: 111, Deq: 113 t=100, @B-C, BillingRsp Seq: 1, Enq: 98, Acq: 98, Deq: 100 t=114, @C, VMRsp Seq: 7, Enq: 113, Acq: 113, Deq: 114 t=91, @C-S, StorageReq Seq: 1, Enq: 90, Acq: 90, Deq: 91 t=93, @C-B, BillingReq Seq: 1, Enq: 91, Acq: 91, Deq: 93 t=111, @S, StorageRsp Seq: 3, Enq: 91, Acq: 108, Deq: 111 t=98, @B, BillingRsp, Seq: 2, Enq: 93, Acq: 93, Deq: 98 t=84, @M, StorageReq Seq: 1, Enq: 83, Acq: 83, Deq: 84 t=86, @M-S, StorageReq Seq: 1, Enq: 84, Acq: 84, Deq: 86 t=105, @S, StorageRsp Seq: 1, Enq: 86, Acq: 102, Deq: 105 t=83, @M, MaintainReq t=86, @M, StorageReq Seq: 2, Enq: 85, Acq: 85, Deq: 86 t=88, @M-S, StorageReq Seq: 2, Enq: 86, Acq: 86, Deq: 88 t=108, @S, StorageRsp Seq: 2, Enq: 86, Acq: 105, Deq: 108 t=85, @M, MaintainReq t=88, @C, NetworkReq Seq: 1, Enq: 87, Acq: 87, Deq: 88 t=89, @C, ComputeReq Seq: 2, Enq: 87, Acq: 88, Deq: 89 t=83, @M, StorageReq Seq: 0, Enq: 82, Acq: 82, Deq: 83 t=84, @C-S, StorageReq Seq: 0, Enq: 83, Acq: 83, Deq: 84 t=82, @M, MaintainReq t=102, @S, StorageRsp Seq: 0, Enq: 84, Acq: 99, Deq: 102

Network RPC Compute RPC Compute Request Billing RPC Storage RPC Slow Compute Response Maintenance Jobs

+0.003s +0.002s +0.004s +90s +90.009s +0.009s +0s

n Annotate vertexes with the delays that they contribute n Goal: Delay annotations should correspond to “potential

speedup”

9

Insight #2: Delay annotations

slide-10
SLIDE 10

@2018 Linh T. X. Phan – Timing Provenance

10

Delay annotations: How to compute?

n

Rule 1: Subdivide delay among the preconditions in the order in which they are satisfied

n

Rule 2: Attribute the remaining delay to predecessors along the sequencing edge

A@X :- B@X,E@Y B@X :- C@X E@Y :- C@X C@X :- Z@X G@X :- F@X X Y

1 2 3 4 5 6 7 8

C B A E

INS(Z)

INS(Z) DRV(C) [0s,8s] [0s,4s] [0s,2s]

G

INS(F)

C E

SND(+C) RCV(+C) DRV(E) SND(+E) INS(F) DRV(B) RCV(+E) DRV(G) DRV(A) [6s,7s] [4s,6s] [4s,5s] [4s,5s]

DRV(B) RCV(+E) DRV(G) DRV(A)

slide-11
SLIDE 11

@2018 Linh T. X. Phan – Timing Provenance

t=87, @C, VMReq t=90, @C, StorageReq Seq: 3, Enq: 87, Acq: 89, Deq: 90 t=91, @C, BillingReq Seq: 4, Enq: 87, Acq: 90, Deq: 91 t=92, @C, NetworkRsp Seq: 5, Enq: 87, Acq: 91, Deq: 92 t=97, @C, ComputeRsp, Seq: 6, Enq: 89, Acq: 92, Deq: 97 t=113, @S-C, StorageRsp Seq: 1, Enq: 111, Acq: 111, Deq: 113 t=100, @B-C, BillingRsp Seq: 1, Enq: 98, Acq: 98, Deq: 100 t=114, @C, VMRsp Seq: 7, Enq: 113, Acq: 113, Deq: 114 t=91, @C-S, StorageReq Seq: 1, Enq: 90, Acq: 90, Deq: 91 t=93, @C-B, BillingReq Seq: 1, Enq: 91, Acq: 91, Deq: 93 t=111, @S, StorageRsp Seq: 3, Enq: 91, Acq: 108, Deq: 111 t=98, @B, BillingRsp, Seq: 2, Enq: 93, Acq: 93, Deq: 98 t=84, @M, StorageReq Seq: 1, Enq: 83, Acq: 83, Deq: 84 t=86, @M-S, StorageReq Seq: 1, Enq: 84, Acq: 84, Deq: 86 t=105, @S, StorageRsp Seq: 1, Enq: 86, Acq: 102, Deq: 105 t=83, @M, MaintainReq t=86, @M, StorageReq Seq: 2, Enq: 85, Acq: 85, Deq: 86 t=88, @M-S, StorageReq Seq: 2, Enq: 86, Acq: 86, Deq: 88 t=108, @S, StorageRsp Seq: 2, Enq: 86, Acq: 105, Deq: 108 t=85, @M, MaintainReq t=88, @C, NetworkReq Seq: 1, Enq: 87, Acq: 87, Deq: 88 t=89, @C, ComputeReq Seq: 2, Enq: 87, Acq: 88, Deq: 89 t=83, @M, StorageReq Seq: 0, Enq: 82, Acq: 82, Deq: 83 t=84, @C-S, StorageReq Seq: 0, Enq: 83, Acq: 83, Deq: 84 t=82, @M, MaintainReq t=102, @S, StorageRsp Seq: 0, Enq: 84, Acq: 99, Deq: 102

Network RPC Compute RPC Compute Request Billing RPC Storage RPC Slow Compute Response Maintenance Jobs

+0.003s +0.002s +0.004s +90s +90.009s

t=87, @C, VMReq t=90, @C, StorageReq Seq: 3, Enq: 87, Acq: 89, Deq: 90 t=91, @C, BillingReq Seq: 4, Enq: 87, Acq: 90, Deq: 91 t=92, @C, NetworkRsp Seq: 5, Enq: 87, Acq: 91, Deq: 92 t=97, @C, ComputeRsp, Seq: 6, Enq: 89, Acq: 92, Deq: 97 t=113, @S-C, StorageRsp Seq: 1, Enq: 111, Acq: 111, Deq: 113 t=100, @B-C, BillingRsp Seq: 1, Enq: 98, Acq: 98, Deq: 100 t=114, @C, VMRsp Seq: 7, Enq: 113, Acq: 113, Deq: 114 t=91, @C-S, StorageReq Seq: 1, Enq: 90, Acq: 90, Deq: 91 t=93, @C-B, BillingReq Seq: 1, Enq: 91, Acq: 91, Deq: 93 t=111, @S, StorageRsp Seq: 3, Enq: 91, Acq: 108, Deq: 111 t=98, @B, BillingRsp, Seq: 2, Enq: 93, Acq: 93, Deq: 98 t=86, @M, StorageReq Seq: 2, Enq: 85, Acq: 85, Deq: 86 t=88, @M-S, StorageReq Seq: 2, Enq: 86, Acq: 86, Deq: 88 t=108, @S, StorageRsp Seq: 2, Enq: 86, Acq: 105, Deq: 108 t=85, @M, MaintainReq t=88, @C, NetworkReq Seq: 1, Enq: 87, Acq: 87, Deq: 88 t=89, @C, ComputeReq Seq: 2, Enq: 87, Acq: 88, Deq: 89

Network RPC Compute RPC Compute Request Billing RPC Storage RPC Slow Compute Response Maintenance Jobs

Insight #3: Provenance aggregation

11 n Aggregating subgraphs that are structurally similar n Pruning zero-delay subgraphs

slide-12
SLIDE 12

@2018 Linh T. X. Phan – Timing Provenance

Putting everything together

12

S3 S4 S2 Storage Backend (B) Computing Service (C) Maintenance Service (M) S1 Why is the request taking so long to complete? Computing Request Victim Storage RPC Misbehaving Storage RPCs

(Q) Why did the computing request take 14 seconds? Maintenance Req received by M at 74s,...,79s Storage RPC Req sent by M at 75s,...,80s Storage RPC Req received by B at 75s,...,80s (B) 13 seconds spent on queuing for other RPCs.

Root cause!

Computing Req received by C at 80s Storage RPC Req sent by C at 81s Storage RPC Req received by B at 81s

... ...

Queue Rsp dequeued on B at 81s,...,93s

...

(A) 1 second spent on issuing the RPC.

...

Queue Req enqueued on B at 81s Queue Req enqueued on B at 75s,...,80s

...

n Detailed and weighted causal explanation of the delay n Can find off-path root causes!

slide-13
SLIDE 13

@2018 Linh T. X. Phan – Timing Provenance

Implementation, experimental setup

n Zeno, a debugger for timing-related faults n Support for declarative + imperative systems

n Interfaces with NDlog and Zipkin n Gathers data from switches w/P4

n Evaluation

n Evaluated with 9 realistic bugs from Google Cloud platform* n Used networks that contained 8-700 nodes n Results are promising

13

slide-14
SLIDE 14

@2018 Linh T. X. Phan – Timing Provenance

Evaluation results

n Correctly identifies 11-28 relevant events

14

500 1000 1500 2000 2500 3000 R1 R2 R3 R4 Z1 Z2 Vertices in response Query Raw w/ annotation (w>0) w/ annotation (w=0) w/ pruning (w>0) w/ pruning (w=0) w/ aggregation (w>0) w/ aggregation (w=0)

35

  • 89%

Original All steps applied

Size of the provenance for different example scenarios

Produces readable explanations

11-35 vertexes contributing delay

Turnaround time for provenance queries

Low run-time overhead

Diagnosis in less than 10s

Timing provenance is useful, compact and efficient!

slide-15
SLIDE 15

@2018 Linh T. X. Phan – Timing Provenance

Summary: Timing Provenance

n A generalization of provenance to explicit represent

temporal causality

n The provenance tracks both functional and temporal causality

through sequencing edges

n Delay annotations + provenance aggregation improves usability n Applied to RapidNet and Zipkin: Can find off-path root causes

n Benefit: Precise reasoning of both functional and

timing faults

n This will be useful for CPS diagnostics where time matters!

n On-going work

n

Generalize to more complex scheduling policies

15