NetPoirot: Taking The Blame Game Out of Data Center Operations
Behnaz Arzani, Selim Ciraci, Boon Thau Loo, Assaf Schuster, Geoff Outhred
Datacenters can fail
Failures are disruptive
Why is debugging hard?
[Diagram labels: Penn researcher, Azure VM, Azure Network, Service X, Network]
Someone accepts responsibility
Each blames the other
Sherlock, SIGCOMM-07
NetMedic, SIGCOMM-09
NSDI-11
T-RAT, SIGCOMM-02
NetProfiler, P2Psys-05
His uncertainty is X
His uncertainty is X - Y
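The drop from X to X - Y can be read as the information gained from one observation. The sketch below is only an illustration: it assumes uncertainty is measured as Shannon entropy over the candidate culprits, which the slides do not state, and the suspect names and probabilities are made up.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical beliefs over three suspects before any evidence (assumption).
before = {"client": 1 / 3, "network": 1 / 3, "server": 1 / 3}
# Hypothetical beliefs after one observation rules out the server (assumption).
after = {"client": 0.5, "network": 0.5, "server": 0.0}

x = entropy(before.values())         # uncertainty X
x_minus_y = entropy(after.values())  # uncertainty X - Y
y = x - x_minus_y                    # reduction in uncertainty, Y
```

Under this reading, an observation is more useful the larger Y is.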
[Figure: panels plotted on Feature 1 vs. Feature 2, labeled "Easiest to classify" and "Hardest to classify"]
Mean of max congestion window
Min of the last congestion window
50th percentile of number of triple duplicate ACKs
50th percentile of connection duration
Max of the number of triple duplicate ACKs
95th percentile of the max congestion window
50th percentile of the max RTT
Number of flows
50th percentile of amount of data received
95th percentile of the number of timeouts
Mean time spent in zero window probing
95th percentile of the ratio to received
Number of flows
95th percentile of connection durations
Minimum of the number of bytes received
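Features like the ones listed on these slides are summary statistics taken over per-connection TCP metrics. Here is a minimal sketch of how a few of them could be computed. The record fields (`max_cwnd`, `duration_s`), the nearest-rank percentile definition, and the sample values are all assumptions made for illustration, not details from the slides.

```python
import math
from statistics import mean

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(connections):
    """Aggregate hypothetical per-connection records into a feature dict."""
    max_cwnds = [c["max_cwnd"] for c in connections]
    durations = [c["duration_s"] for c in connections]
    return {
        "num_flows": len(connections),
        "mean_max_cwnd": mean(max_cwnds),
        "p95_max_cwnd": percentile(max_cwnds, 95),
        "p50_duration_s": percentile(durations, 50),
    }

# Made-up connection records for one measurement interval.
connections = [
    {"max_cwnd": 10, "duration_s": 1.0},
    {"max_cwnd": 20, "duration_s": 2.0},
    {"max_cwnd": 30, "duration_s": 3.0},
    {"max_cwnd": 40, "duration_s": 4.0},
]
features = summarize(connections)
```

The resulting dict is the kind of fixed-length feature vector a classifier could take as input.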
Is it a network failure?
Is it a server problem?
Is it a client side problem?
If throughput < x: open more connections
If throughput < x: send more data on the same connection
General label   Normal   Client   Network
Precision       97.78%   99.7%    100%
Recall          99.68%   98.25%   99.37%
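Per-class precision and recall of the kind reported here can be derived from a confusion matrix. The sketch below shows the arithmetic only: the counts are made up and are not NetPoirot's results, and the label order simply follows the table.

```python
LABELS = ["Normal", "Client", "Network"]

# Hypothetical counts: confusion[i][j] = samples of true class i predicted as class j.
confusion = [
    [90, 10, 0],
    [5, 95, 0],
    [0, 5, 95],
]

def precision_recall(confusion, k):
    """Return (precision, recall) for class index k."""
    true_pos = confusion[k][k]
    predicted_k = sum(row[k] for row in confusion)  # column sum: everything predicted as k
    actual_k = sum(confusion[k])                    # row sum: everything truly k
    precision = true_pos / predicted_k if predicted_k else 0.0
    recall = true_pos / actual_k if actual_k else 0.0
    return precision, recall

scores = {label: precision_recall(confusion, i) for i, label in enumerate(LABELS)}
```

Precision answers "when the classifier blames this class, how often is it right?", while recall answers "of the failures truly in this class, how many were caught?".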
YouTube Event X