Introduction: Harsh realities of network analytics - netbeam (PowerPoint PPT Presentation)



SLIDE 1

SLIDE 2

Introduction

  • Harsh realities of network analytics
  • netbeam
  • Demo
  • Technology Stack
  • Alternative Approaches
  • Lessons Learned

SLIDE 3

ESnet Data, Analytics and Visualization Architecture

SLIDE 4

The Harsh Realities of Network Analytics

  • 1. It’s a mess
  • 2. Things change
  • 3. There’s always more
  • 4. It’s never really done
  • Your data isn’t neat and tidy
  • Time and money are limited
  • More devices & more telemetry
  • What you need today may not be what you need tomorrow.

SLIDE 5

Coping strategies

  • 1. It’s a mess
  • 2. Things change
  • 3. There’s always more
  • 4. It’s never really done
  • Design knowing things won’t be tidy
  • “What” not “How”
  • Rely on the cloud for scaling
  • Keep raw data to keep your options open

SLIDE 6

netbeam

Network Analytics in Google Cloud

Three Pillars:

1. Real time analytics
   ○ Low latency, incomplete
2. Offline analytics
   ○ High latency, complete
3. Flexible data model
   ○ Changing needs? Recompute from raw data!

Secret sauce: Apache Beam

SLIDE 7

What is Apache Beam?

1. The Beam Programming Model
2. SDKs for writing Beam pipelines
3. Runners for existing distributed processing backends
   ○ Apache Apex
   ○ Apache Flink
   ○ Apache Spark
   ○ Google Cloud Dataflow
   ○ Local runner for testing

Slide courtesy of the Apache Beam Project

SLIDE 8

The Evolution of Apache Beam

MapReduce → BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel → Google Cloud Dataflow → Apache Beam

Slide courtesy of the Apache Beam Project

SLIDE 9

Architecture Diagram

[Architecture diagram: the SNMP collection system feeds Apache Beam (stream processing), which writes to BigQuery (immutable) and Bigtable (realtime); Apache Beam (batch processing) connects BigQuery (historical) and Bigtable; the old SNMP system is imported via avro files; an API serves the client from Bigtable]

SLIDE 10

Architecture Diagram

[Architecture diagram as on slide 9, also showing batch rollups (5m, 1h, 1d avg), align/rates and percentiles]

  • Google Pubsub
  • Uses Python outside of Google Cloud to poll devices and write to Pubsub topic
  • Code within Google Cloud subscribes to topic to process data
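The polling side can be sketched in Python. The payload field names, project, and topic below are illustrative assumptions; the slides don't show netbeam's actual message schema.

```python
import json
import time

def encode_sample(device, interface, in_octets, out_octets, ts=None):
    # Encode one SNMP counter sample as a JSON Pub/Sub payload.
    # Field names here are made up for illustration.
    return json.dumps({
        "device": device,
        "interface": interface,
        "inOctets": in_octets,
        "outOctets": out_octets,
        "timestamp": ts if ts is not None else int(time.time()),
    }, sort_keys=True).encode("utf-8")

# Publishing (outside Google Cloud) would then use the google-cloud-pubsub
# client; project and topic names here are hypothetical:
#
#   from google.cloud import pubsub_v1
#   publisher = pubsub_v1.PublisherClient()
#   topic = publisher.topic_path("my-project", "snmp-samples")
#   publisher.publish(topic, data=encode_sample("rtr1", "xe-0/0/0", 123, 456))
```

The stream-processing code inside Google Cloud then subscribes to the same topic and decodes these payloads.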

SLIDE 11

Architecture Diagram

[Architecture diagram as on slide 9, also showing batch rollups (5m, 1h, 1d avg), align/rates and percentiles]

  • Apache Beam / Google Dataflow
  • Stream processing
  • Subscribes to Pubsub topic

SLIDE 12

Architecture Diagram

[Architecture diagram as on slide 9, also showing batch rollups (5m, 1h, 1d avg), align/rates and percentiles]

  • Apache Beam / Google Dataflow
  • Stream processing
  • Subscribes to Pubsub topic
  • Raw data is written to BigQuery

SLIDE 13

Architecture Diagram

[Architecture diagram as on slide 9, also showing batch rollups (5m, 1h, 1d avg), align/rates and percentiles]

  • Apache Beam / Google Dataflow
  • Stream processing
  • Subscribes to Pubsub topic
  • Raw data is written to BigQuery
  • Real time transformed data (e.g. aligned data rates) written to Bigtable
  • Writes and makes use of metadata in BigTable (not shown)

SLIDE 14

Architecture Diagram

[Architecture diagram as on slide 9, also showing batch rollups (5m, 1h, 1d avg), align/rates and percentiles]

  • Cloud Bigtable
  • Like HBase
  • Write to cells in rows, indexed by keys
  • We write 1 day of data to a single row (columns are the time of day, key is metric and day)
  • Fast access to row by key, can serve data from here
  • Store one year
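The row layout can be sketched as below. The "::"-separated key format follows the key examples later in the deck, but the exact production key scheme and the use of UTC column qualifiers are assumptions.

```python
from datetime import datetime, timezone

def row_key(metric, day):
    # One Bigtable row holds a full day of data: key is metric and day.
    return "{}::{}".format(metric, day)

def column_qualifier(ts):
    # Columns within the row are the time of day of each sample (UTC assumed).
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%H:%M:%S")

# Example: all samples for a (hypothetical) metric on 2017-08-15 share
# row_key("snmp-30s::rtr1::xe-0/0/0::in", "2017-08-15"), one column each.
```

One day per row keeps a timeseries read to a single row lookup, which is why the API can serve data straight from Bigtable.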

SLIDE 15

Architecture Diagram

[Architecture diagram as on slide 9, also showing batch rollups (5m, 1h, 1d avg), align/rates and percentiles]

  • BigQuery
  • Data warehousing solution
  • Cheap storage, SQL access, but not suitable for real-time access
  • Allows SQL queries for ad hoc investigation
  • We store our source of truth here

SLIDE 16

Architecture Diagram

[Architecture diagram as on slide 9, also showing batch rollups (5m, 1h, 1d avg), align/rates and percentiles]

  • BigQuery
  • Data warehousing solution
  • Cheap storage, SQL access, but not suitable for real-time access
  • Allows SQL queries for ad hoc investigation
  • We store our source of truth here
  • Also store historical data (7 years), imported via avro files

SLIDE 17

Architecture Diagram

[Architecture diagram as on slide 9, also showing batch rollups (5m, 1h, 1d avg), align/rates and percentiles]

  • Apache Beam / Google Dataflow
  • Batch processing
  • Run with cron job

SLIDE 18

Architecture Diagram

[Architecture diagram as on slide 9, also showing batch rollups (5m, 1h, 1d avg), align/rates and percentiles]

  • Apache Beam / Google Dataflow
  • Batch processing
  • Run with cron job
  • Recalculate Bigtable data each night from source of truth in BigQuery

SLIDE 19

Architecture Diagram

[Architecture diagram as on slide 9, also showing batch rollups (5m, 1h, 1d avg), align/rates and percentiles]

  • Apache Beam / Google Dataflow
  • Batch processing
  • Run with cron job
  • Recalculate Bigtable data each night from source of truth in BigQuery
  • Process Bigtable rows into new rows of 5 min, 1 hr and 1 day aggregations
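The aggregation logic of that rollup step can be illustrated in plain Python (the production job does this in Beam; this sketch only shows the windowed averaging):

```python
from collections import defaultdict

def rollup(samples, window_s=300):
    # Average (timestamp, value) samples into fixed windows: 300 s gives the
    # 5 min rollup; pass 3600 or 86400 for the 1 hr and 1 day versions.
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % window_s].append(value)
    return {start: sum(vals) / len(vals) for start, vals in buckets.items()}
```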

SLIDE 20

Architecture Diagram

[Architecture diagram as on slide 9, also showing batch rollups (5m, 1h, 1d avg), align/rates and percentiles]

  • Apache Beam / Google Dataflow
  • Batch processing
  • Run with cron job
  • Recalculate Bigtable data each night from source of truth in BigQuery
  • Process Bigtable rows into new rows of 5 min, 1 hr and 1 day aggregations
  • Additional pre-computed views, e.g. percentiles for traffic distribution over a month

SLIDE 21

Architecture Diagram

[Architecture diagram as on slide 9, with the API shown as a Dataserver API (node.js) between Bigtable and the client]

  • API
  • Currently runs on App Engine
  • Node.js
  • Serves data out of Bigtable
  • Timeseries data is served as ‘tiles’, each tile is one row
  • Would like to use Cloud Endpoints and provide a gRPC service
  • Looking forward to grpc-web solution

SLIDE 22

Use case example: Historical Trends

SLIDE 23

Use case example: Historical Trends

[Diagram: SNMP collection system → stream to BigQuery (historical); per-day interface totals and per-month totals are computed into Bigtable; Dataserver API (node.js) serves the client; old SNMP system imported via avro]

Example Bigtable rows:

snmp-daily::2017-08::$interface
  Jan 1: 1.8 Pb | Jan 2: 1.9 Pb | ... | Dec 31: 3.1 Pb

snmp-monthly-totals
  Jan 1991: 28 Gb | Feb 1991: 29 Gb | ... | Sep 2017: 56 Pb
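The per-day to per-month rollup behind those rows can be sketched in plain Python (the real job runs in Beam against BigQuery and Bigtable):

```python
def monthly_totals(daily_totals):
    # Sum per-day interface totals ('YYYY-MM-DD' -> bytes) into
    # per-month totals ('YYYY-MM' -> bytes).
    months = {}
    for day, total in daily_totals.items():
        month = day[:7]
        months[month] = months.get(month, 0) + total
    return months
```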

SLIDE 24

Use case: real time anomaly detection

[Diagram: SNMP collection system → stream to BigQuery; baseline generation and anomaly detection jobs write to Bigtable; Dataserver API (node.js) serves the client]

Baseline generation: generates an avg for each interface over the past 3 months for that hour/day.
Anomaly detection: compares the baseline to real time values to generate the current deviation from normal.

Example Bigtable rows:

baseline::5m::avg::$interface
  Mon 12am: 2.1 | Mon 1am: 1.9 | Mon 2am: 0.3 | ... | Sun 11pm: 0.5

anomaly::5m::avg
  iface-1: +0.1 | iface-2: +2.0 | ... | iface-n: -1.5
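The comparison step can be sketched as follows; how netbeam normalizes or thresholds the deviation isn't shown, so this only computes the raw difference per interface:

```python
def deviation(baseline, current):
    # Current deviation from normal: real-time 5m average minus the
    # 3-month baseline for the same hour/day slot, per interface.
    return {iface: round(current[iface] - baseline[iface], 3)
            for iface in current if iface in baseline}
```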

SLIDE 25

Use case example: Percentiles

SLIDE 26

Use case example: Percentiles

[Diagram: SNMP collection system → stream to Bigtable; daily rollups (5m avg) and percentiles are computed into Bigtable; Dataserver API (node.js) serves the client]

Example Bigtable rows:

rollup-month-5m::2017-08::$interface::in
  1: 6 Gbps | 2: 5 Gbps | ... | 8640: 2 Gbps

percentiles::2017-08::$interface::in
  1 pct: 0.1 Gbps | 2 pct: 0.3 Gbps | ... | 99 pct: 22.1 Gbps
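Computing the percentile row from a month of 5m averages can be sketched with nearest-rank percentiles (the deck doesn't state which percentile method is used):

```python
def percentile_row(rates, pcts=range(1, 100)):
    # Nearest-rank percentiles over a month of 5 min average rates:
    # rank = ceil(p/100 * n), and the value at that rank is the percentile.
    ordered = sorted(rates)
    n = len(ordered)
    return {p: ordered[-(-p * n // 100) - 1] for p in pcts}
```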

SLIDE 27

Example: Computing Total Traffic

# Python Beam SDK
pipeline = beam.Pipeline('DirectRunner')
(pipeline
 | 'read' >> ReadFromText('./example.csv')
 | 'csv' >> beam.ParDo(FormatCSVDoFn())
 | 'ifName key' >> beam.Map(group_by_device_interface)
 | 'group by iface' >> beam.GroupByKey()
 | 'compute rate' >> beam.FlatMap(compute_rate)
 | 'timestamp key' >> beam.Map(lambda row: (row['timestamp'], row['rateIn']))
 | 'group by timestamp' >> beam.GroupByKey()
 | 'sum by timestamp' >> beam.Map(lambda rates: (rates[0], sum(rates[1])))
 | 'format' >> beam.Map(lambda row: '{},{}'.format(row[0], row[1]))
 | 'save' >> beam.io.WriteToText('./total_by_timestamp'))
pipeline.run()

Full code available at: http://x1024.net/blog/2017/05/chinog-flexible-network-analytics-in-the-cloud/

SLIDE 28

Our Stack

  • Apache Beam using Scio
  • Google Cloud Platform
    ○ Dataflow
    ○ Bigtable
    ○ BigQuery
    ○ Pub/Sub
    ○ App Engine
  • Languages
    ○ Scala
    ○ Javascript / Typescript
    ○ Python

SLIDE 29

Current Status & Future Plans

Current

Release candidate for SNMP data:

  • Ingest to BigQuery is working
  • Migration of historical data is complete
  • Streaming ingest to Bigtable
  • Early version of utilization visualization
  • Simple data server can provide data to clients, but gRPC API coming
  • Interface time series charts functional

Future

More types of data:

  • Flow data
  • perfSONAR

Machine Learning
Anomaly Detection
“Mash up” various data sources

SLIDE 30

Why not InfluxDB, Elastic or ${FAVORITE_DB}

  • We have a data processing problem, not a data storage problem per se.
    ○ Beam and the ecosystem around it give a huge amount of flexibility -- can try new ideas as they occur to us
    ○ Ability to move to different platform components
    ○ Machine learning (TensorFlow and others)
  • InfluxDB & Elastic
    ○ Require care and feeding -- have to think about disks and machines, etc.
    ○ At our last evaluation (a while ago now) InfluxDB wasn’t able to keep up with our load -- this may have changed, but other benefits outweigh that.
    ○ Elastic doesn’t seem to be a good fit for long term storage -- everything is in the “hot” tier

SLIDE 31

Why the cloud? Why Google Cloud Platform?

Why the cloud?

  • Focus on our problems not on infrastructure
  • Scalability without needing to own lots of systems
  • Managed services for databases and compute

Why Google Cloud?

  • Apache Beam was Google Dataflow when we first encountered it
  • More cohesive ecosystem than AWS in our experience
  • Although we have used Google Cloud specific services, the approach is portable to other environments

SLIDE 32

Lessons learned / Life in the cloud / Good & Bad

The Good

  • Not a silver bullet, but makes many things easier
  • Scaling! We processed 9,902,585,175 data points in 3.5 hours
  • Focus on your services, not on infrastructure
  • Scio and Scala allow working at a high level of abstraction

The Not So Good

  • GCP tech support is pretty bad
  • Python is a second class citizen in Beam for now
  • Scala is powerful but challenging at times
  • Learning curve is pretty steep in places

SLIDE 33

Thank you!

Jon Dugan <jdugan@es.net>

  • MyESnet: https://my.es.net
  • ESnet Open Source: http://software.es.net/

    ○ http://software.es.net/react-timeseries-charts/
    ○ http://software.es.net/pond/
    ○ http://software.es.net/react-network-diagrams/

  • Scio: https://github.com/spotify/scio
  • Beam: https://beam.apache.org


The ESnet netbeam team:

  • Peter Murphy
  • Monte Goode
  • Sowmya Balasubramanian
  • Scott Richmond