Detecting outages with telemetry - Alessio Placitelli - @dexterp37



SLIDE 1

June 16th - Internet Measurement Village 2020

Detecting outages with telemetry

Alessio Placitelli - @dexterp37

SLIDE 2

Italy, March 11th - 2020

Tales from a mid-pandemic network outage

Public

SLIDE 3

“...failure on a foreign network...”

Source: “Sharing data on Italy’s mid-pandemic internet outage” - https://mzl.la/italy-outage

SLIDE 4

Network outage in Italy

How many Firefox desktop users were affected by the mid-pandemic outage?

SLIDE 5

NOPE.

These were for something completely different!

SLIDE 6

Mozilla Manifesto

Principle 2 - https://www.mozilla.org/about/manifesto/

“The internet is a global public resource that must remain open and accessible.”

SLIDE 7

Key takeaways

1. Our methodology is open
2. What happened in Italy on March 11th, 2020?
3. What showed up in Jammu & Kashmir in 2019?

SLIDE 8

Telemetry

A quick overview

1. Performance metrics for our products
2. Packaged in pings sent at controlled schedules
3. Following our Lean Data Practices (www.leandatapractices.com)

SLIDE 9

Firefox telemetry

How does it work?

1. Relevant metrics travel in the main and health pings.
2. Documentation for metrics and pings is publicly available.
3. probes.telemetry.mozilla.org

SLIDE 10

The “main” ping

Schedule and properties

1. Ideally sent once per day around local midnight.
2. Is the main transport for Firefox telemetry.
3. Includes DNS, SSL and TLS metrics...

SLIDE 11

The “main” ping

Interesting metrics

1. dns_failed_lookup_time
2. dns_lookup_time
3. ssl_cert_verification_errors
4. http_page_tls_handshake
5. ...

SLIDE 12

The “health” ping

Schedule and properties

1. Telemetry health about... telemetry.
2. Extremely small (~800 bytes).
3. Collected at most once per hour in case of problems.
4. Includes the reason why the HTTPS upload failed.

SLIDE 13

Our open methodology

From raw data to pretty graphs

SLIDE 14

Throw away that IP address!

Right after matching the IP with a country lookup, at ingestion!

https://github.com/mozilla/gcp-ingestion/blob/fbfb5d28490a17d43329b44a1a8259bbcc0d7b20/ingestion-beam/src/main/java/com/mozilla/telemetry/Decoder.java#L64-L69
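
The idea on this slide can be illustrated with a minimal Python sketch. It is not the code from the linked Decoder.java: the prefix table, `lookup_country`, `scrub_ip` and the ping field names are all hypothetical, and a real pipeline would use a proper GeoIP database. It only shows the ordering: resolve the country first, then discard the address.

```python
import ipaddress

# Hypothetical prefix-to-country table (documentation-only address ranges).
# A real ingestion pipeline would query a GeoIP database instead.
COUNTRY_BY_PREFIX = {
    ipaddress.ip_network("192.0.2.0/24"): "IT",
    ipaddress.ip_network("198.51.100.0/24"): "IN",
}


def lookup_country(ip: str) -> str:
    """Return a country code for the address, or "??" if it is unknown."""
    addr = ipaddress.ip_address(ip)
    for network, country in COUNTRY_BY_PREFIX.items():
        if addr in network:
            return country
    return "??"


def scrub_ip(ping: dict) -> dict:
    """Return a copy of the ping that carries a country code and no IP."""
    cleaned = {key: value for key, value in ping.items() if key != "ip"}
    cleaned["country"] = lookup_country(ping["ip"])
    return cleaned
```

After `scrub_ip`, only the coarse country code travels further down the pipeline; the address itself is never stored with the record.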

SLIDE 15

Cleanup: remove “inactive” sessions

Not all the “main” pings are representative. “Who can even open 100 websites in 1 second?”
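
A cleanup step of this kind might look like the sketch below. The field names (`active_seconds`, `uri_count`) and the threshold are assumptions made for illustration; the talk does not state the actual filtering criteria.

```python
# Assumed plausibility threshold, chosen for illustration only.
MAX_URIS_PER_SECOND = 10.0


def is_representative(session: dict) -> bool:
    """Keep sessions that show some activity at a humanly plausible rate."""
    active = session.get("active_seconds", 0)
    uris = session.get("uri_count", 0)
    if active <= 0 or uris <= 0:
        return False  # nothing happened: an "inactive" session
    return uris / active <= MAX_URIS_PER_SECOND


def remove_unrepresentative(sessions: list) -> list:
    """Drop inactive sessions and sessions with implausible activity."""
    return [s for s in sessions if is_representative(s)]
```

With these assumed values, a session reporting 100 page loads in 1 second is dropped, while one reporting 30 page loads in 10 minutes is kept.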

SLIDE 16

Aggregation: step 1 - geographical

Group the data by country. Drop the data for countries with too few samples.
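
As a rough sketch of this step, assuming each session is a dictionary with a `country` key, the grouping and the minimum-sample cut could be written as follows. The threshold value is hypothetical; the talk does not give the real one.

```python
from collections import defaultdict

# Assumed minimum number of sessions per country, for illustration only.
MIN_SAMPLES = 3


def group_by_country(sessions: list, min_samples: int = MIN_SAMPLES) -> dict:
    """Group sessions by country and drop countries with too few samples."""
    groups = defaultdict(list)
    for session in sessions:
        groups[session["country"]].append(session)
    return {
        country: rows
        for country, rows in groups.items()
        if len(rows) >= min_samples
    }
```

Dropping small groups keeps the later aggregates from being dominated by a handful of sessions.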

SLIDE 17

Aggregation: step 2 - counting things!

Count how many sessions reported a metric, within the given timeframe. Example: how many sessions had DNS_LOOKUP_TIME?
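
The counting step can be sketched like this, under the assumption that each session carries `country`, `day` and a `metrics` dictionary; this record layout is hypothetical and is not the actual ping schema.

```python
from collections import Counter


def count_sessions_with_metric(sessions: list, metric: str) -> Counter:
    """Count, per (country, day), the sessions that reported the metric."""
    counts = Counter()
    for session in sessions:
        if metric in session.get("metrics", {}):
            counts[(session["country"], session["day"])] += 1
    return counts
```

For example, `count_sessions_with_metric(sessions, "DNS_LOOKUP_TIME")` gives one count per country and day, which can then be plotted as a time series.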

SLIDE 18

Aggregation: step 3 - create timing profiles

Combine the user-reported time distributions in a single distribution, for a given timeframe. Example: what’s the shape of DNS_LOOKUP_TIME in Italy, today?
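
One simple way to combine per-session distributions is to sum their bucket counts. The sketch below assumes each session reports a histogram as a mapping from bucket lower bound (in ms) to count; that representation and the function names are assumptions, not the project's actual code.

```python
from collections import Counter


def merge_histograms(histograms: list) -> dict:
    """Sum bucket counts across sessions into a single histogram."""
    total = Counter()
    for histogram in histograms:
        total.update(histogram)  # adds counts bucket by bucket
    return dict(sorted(total.items()))


def normalize(histogram: dict) -> dict:
    """Turn bucket counts into proportions, to compare shapes across days."""
    n = sum(histogram.values())
    return {bucket: count / n for bucket, count in histogram.items()} if n else {}
```

Normalizing the merged histogram makes the shape comparable between days with different numbers of sessions.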

SLIDE 19

Investigation: look for anomalies in the data

How do certain measures compare against a baseline? Were there anomalous spikes, surges or holes in the time series?
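
The talk does not say how the baseline is computed. As one possible illustration, the sketch below compares each day with the median of the preceding days and flags large relative deviations; the window length and tolerance are arbitrary assumptions.

```python
from statistics import median


def find_anomalies(series: list, window: int = 7, tolerance: float = 0.5) -> list:
    """Return indexes whose value deviates from the trailing-median baseline.

    A point is flagged when it differs from the median of the previous
    `window` points by more than `tolerance` (as a fraction of the baseline).
    """
    anomalies = []
    for i in range(window, len(series)):
        baseline = median(series[i - window:i])
        if baseline and abs(series[i] - baseline) / baseline > tolerance:
            anomalies.append(i)
    return anomalies
```

A trailing median is used here because a single bad day barely moves it, so a one-day hole in the series does not distort the baseline for the days that follow.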

SLIDE 20

Jammu & Kashmir - 2019

Network interferences starting from August 5th

SLIDE 21

Jammu & Kashmir

How many Firefox desktop users were affected (normalized count)?

[Figure: Scaled Daily Active Users by telemetry creation date; series: Jammu & Kashmir, Outside of Jammu & Kashmir]

SLIDE 22

Jammu & Kashmir

The average time it takes for an unsuccessful DNS resolution, in milliseconds

[Figure: Time (ms, log scale) by telemetry creation date; series: Jammu & Kashmir, Outside of Jammu & Kashmir]

SLIDE 23

Jammu & Kashmir

The proportion of active sessions with no DNS resolved

[Figure: Proportion of Daily Active Users by telemetry creation date; series: Jammu & Kashmir, Outside of Jammu & Kashmir]

SLIDE 24

What’s next?

How are we moving this project forward?

SLIDE 25

01. Productionize our datasets

SLIDE 26

02. Validate the data

SLIDE 27

03. Community collaboration

SLIDE 28

Our team

Solana Larsen - Editor, Internet Health Report
Saptarshi Guha - Data Scientist
Jochai Ben-Avie - Head of International Public Policy
Alessio Placitelli - Telemetry Engineer, Project Lead

Special thanks to Rebecca Weiss for advising on the project, and to Hamilton Ulmer for the graphics on the Italian focus.

SLIDE 29

Thank you!

Reach out to: outages@mozilla.com