Introduction
- Harsh realities of network analytics
- netbeam
- Demo
- Technology Stack
- Alternative Approaches
- Lessons Learned
ESnet Data, Analytics and Visualization Architecture
The Harsh Realities of Network Analytics
- 1. It’s a mess
- 2. Things change
- 3. There’s always more
- 4. It’s never really done
- Your data isn’t neat and tidy
- Time and money are limited
- More devices & more telemetry
- What you need today may not be what you need tomorrow.
Coping strategies
- 1. It’s a mess
- 2. Things change
- 3. There’s always more
- 4. It’s never really done
- Design knowing things won’t be tidy
- “What” not “How”
- Rely on the cloud for scaling
- Keep raw data to keep your options open
netbeam
Network Analytics in Google Cloud

Three Pillars
1. Real time analytics
○ Low latency, incomplete
2. Offline analytics
○ High latency, complete
3. Flexible data model
○ Changing needs? Recompute from raw data!
Secret sauce: Apache Beam
What is Apache Beam?
1. The Beam Programming Model
2. SDKs for writing Beam pipelines
3. Runners for existing distributed processing backends
○ Apache Apex
○ Apache Flink
○ Apache Spark
○ Google Cloud Dataflow
○ Local runner for testing
Slide courtesy of the Apache Beam Project
The Evolution of Apache Beam
(Lineage diagram: MapReduce → BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel → Google Cloud Dataflow → Apache Beam)
Slide courtesy of the Apache Beam Project
Architecture Diagram
(Diagram components: SNMP collection system, Apache Beam (Stream Processing), BigQuery (immutable), Bigtable (realtime), Apache Beam (Batch Processing), BigQuery (historical), API, Client, old SNMP system importing via avro)
Architecture Diagram: Google Pub/Sub
- Uses Python outside of Google Cloud to poll devices and write to a Pub/Sub topic
- Code within Google Cloud subscribes to the topic to process data
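As a minimal sketch of the poller side: Pub/Sub messages carry opaque bytes, so the external Python poller just needs a stable serialization the Beam pipeline can decode. The field names below are hypothetical, not netbeam's actual schema.

```python
import json

def encode_sample(device, interface, timestamp, in_octets, out_octets):
    """Serialize one SNMP counter sample as JSON bytes for a Pub/Sub message.

    Field names are illustrative; any stable schema the downstream
    Beam pipeline can decode will do.
    """
    record = {
        "device": device,
        "interface": interface,
        "timestamp": timestamp,   # epoch seconds of the poll
        "inOctets": in_octets,    # raw counter values, not rates
        "outOctets": out_octets,
    }
    return json.dumps(record, sort_keys=True).encode("utf-8")

# The poller would hand these bytes to the Pub/Sub publisher client.
payload = encode_sample("router1", "xe-0/0/0", 1504000000, 123456789, 987654321)
```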
Architecture Diagram: Apache Beam / Google Dataflow
- Stream processing
- Subscribes to Pub/Sub topic
Architecture Diagram: Apache Beam / Google Dataflow
- Stream processing
- Subscribes to Pub/Sub topic
- Raw data is written to BigQuery
Architecture Diagram: Apache Beam / Google Dataflow
- Stream processing
- Subscribes to Pub/Sub topic
- Raw data is written to BigQuery
- Real time transformed data (e.g. aligned data rates) written to Bigtable
- Writes and makes use of metadata in Bigtable (not shown)
Architecture Diagram: Cloud Bigtable
- Like HBase
- Write to cells in rows, indexed by keys
- We write 1 day of data to a single row (columns are the time of day, key is metric and day)
- Fast access to row by key, can serve data from here
- Store one year
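The one-row-per-day layout can be sketched in plain Python. The key and qualifier formats here are hypothetical stand-ins for netbeam's actual scheme; the point is that a (metric, day) pair names one row and the time of day names the cell within it.

```python
from datetime import datetime, timezone

def row_key(metric, ts):
    """Hypothetical row key: one Bigtable row per metric per UTC day."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"{metric}::{day}"

def column_qualifier(ts):
    """Column is the offset into the day, so a 30s poll cadence
    yields at most 2880 cells per row."""
    return str(ts % 86400)

ts = 1504000000  # an example poll time (2017-08-29 UTC)
key = row_key("snmp::router1::xe-0/0/0::in", ts)
```

A whole day of a metric is then a single key lookup, which is what makes serving real-time charts from Bigtable fast.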
Architecture Diagram: BigQuery
- Data warehousing solution
- Cheap storage, SQL access, but not suitable for real-time access
- Allows SQL queries for ad hoc investigation
- We store our source of truth here
Architecture Diagram: BigQuery
- Data warehousing solution
- Cheap storage, SQL access, but not suitable for real-time access
- Allows SQL queries for ad hoc investigation
- We store our source of truth here
- Also store historical data (7 years), imported via avro files
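The kind of ad hoc investigation BigQuery enables can be sketched as building a standard SQL query over the raw table. The table and column names below are hypothetical, not netbeam's actual schema.

```python
def daily_total_query(table, interface, day):
    """Build an ad hoc BigQuery standard SQL query summing raw octet
    counters for one interface on one day.

    `table`, `interface`, `ts` and `in_octets` are illustrative names.
    """
    return (
        "SELECT TIMESTAMP_TRUNC(ts, DAY) AS day, SUM(in_octets) AS total_in "
        f"FROM `{table}` "
        f"WHERE interface = '{interface}' AND DATE(ts) = '{day}' "
        "GROUP BY day"
    )

sql = daily_total_query("project.dataset.snmp_raw", "xe-0/0/0", "2017-08-01")
```

Because the raw data stays in BigQuery, a question like this can be answered years later without having planned for it at ingest time.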
Architecture Diagram: Apache Beam / Google Dataflow
- Batch processing
- Run with cron job
Architecture Diagram: Apache Beam / Google Dataflow
- Batch processing
- Run with cron job
- Recalculate Bigtable data each night from source of truth in BigQuery
Architecture Diagram: Apache Beam / Google Dataflow
- Batch processing
- Run with cron job
- Recalculate Bigtable data each night from source of truth in BigQuery
- Process Bigtable rows into new rows of 5 min, 1 hr and 1 day aggregations
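The aggregation step can be illustrated in plain Python: bucket timestamped samples into fixed windows and average each bucket. This is a sketch of the idea only; the real job does this inside a Beam pipeline over Bigtable rows.

```python
def rollup(samples, window_seconds):
    """Average (timestamp, value) samples into fixed-width windows.

    E.g. window_seconds=300 produces the 5 min rollup; 3600 and 86400
    produce the 1 hr and 1 day rollups the same way.
    """
    buckets = {}
    for ts, value in samples:
        bucket = (ts // window_seconds) * window_seconds
        buckets.setdefault(bucket, []).append(value)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}

samples = [(0, 10.0), (30, 20.0), (300, 40.0)]
five_min = rollup(samples, 300)   # {0: 15.0, 300: 40.0}
```

Recomputing these nightly from the BigQuery source of truth means a bug in the rollup logic can always be fixed retroactively.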
Architecture Diagram: Apache Beam / Google Dataflow
- Batch processing
- Run with cron job
- Recalculate Bigtable data each night from source of truth in BigQuery
- Process Bigtable rows into new rows of 5 min, 1 hr and 1 day aggregations
- Additional pre-computed views, e.g. percentiles for traffic distribution over a month
Architecture Diagram: API
- Currently runs on App Engine
- Node.js
- Serves data out of Bigtable
- Timeseries data is served as ‘tiles’; each tile is one row
- Would like to use Cloud Endpoints and provide a gRPC service
- Looking forward to grpc-web solution
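The tile idea can be sketched simply: since each tile is one Bigtable row, serving a multi-day chart reduces to one key lookup per day. The key format below is hypothetical.

```python
def tile_key(metric, day):
    """Hypothetical tile address: one tile per metric per day,
    each tile backed by a single Bigtable row."""
    return f"tile::{metric}::{day}"

def tiles_for_range(metric, days):
    """A multi-day chart request becomes a small list of row keys
    the API can fetch directly from Bigtable."""
    return [tile_key(metric, d) for d in days]

keys = tiles_for_range("snmp::xe-0/0/0::in", ["2017-08-01", "2017-08-02"])
```

Tiling also makes responses cacheable: a completed day's tile never changes until the nightly recompute rewrites it.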
Use case example: Historical Trends
(Diagram: SNMP collection system streams to BigQuery (historical); per-day interface totals are rolled up into per-month totals in Bigtable; the Dataserver API (node.js) serves clients; old SNMP system imports via avro)

Bigtable rows:

snmp-daily::2017-08::$interface
| Jan 1  | Jan 2  | ... | Dec 31 |
| 1.8 Pb | 1.9 Pb | ... | 3.1 Pb |

snmp-monthly-totals
| Jan 1991 | Feb 1991 | ... | Sep 2017 |
| 28 Gb    | 29 Gb    | ... | 56 Pb    |
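The per-day to per-month rollup shown above can be sketched as a simple fold. This is illustrative only; the real computation runs in the Beam batch pipeline.

```python
def monthly_totals(daily_totals):
    """Fold per-day interface totals (date string -> total) into
    per-month sums, as fed to a row like snmp-monthly-totals."""
    months = {}
    for day, total in daily_totals.items():
        month = day[:7]           # "2017-08-29" -> "2017-08"
        months[month] = months.get(month, 0) + total
    return months

totals = monthly_totals({"2017-08-01": 5, "2017-08-02": 7, "2017-09-01": 3})
# {"2017-08": 12, "2017-09": 3}
```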
Use case: real time anomaly detection
(Diagram: SNMP collection system streams to BigQuery; baseline generation and anomaly detection jobs write Bigtable rows; the Dataserver API (node.js) serves clients)

Bigtable rows:

baseline::5m::avg::$interface
| Mon 12am | Mon 1am | Mon 2am | ... | Sun 11pm |
| 2.1      | 1.9     | 0.3     | ... | 0.5      |

anomaly::5m::avg
| iface-1 | iface-2 | ... | iface-n |
| +0.1    | +2.0    | ... | -1.5    |

- Baseline generation: generates the average for each interface over the past 3 months for that hour/day
- Anomaly detection: compares the baseline to real time values to generate the current deviation from normal
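The comparison step can be sketched as a one-liner: look up the baseline cell for the current hour/day-of-week slot and subtract. This is an illustration of the idea, not the production scoring logic.

```python
def deviation(baseline, current):
    """Current 5m average minus the 3-month baseline for the same
    hour/day-of-week slot; positive means traffic above normal."""
    return round(current - baseline, 2)

# Baseline cell for Mon 1am is 1.9; a current 5m average of 3.9
# gives a deviation of +2.0, like iface-2 in the example row above.
dev = deviation(1.9, 3.9)
```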
Use case example: Percentiles
(Diagram: SNMP collection system streams to Bigtable; daily rollups of 5m averages feed percentile views; the Dataserver API (node.js) serves clients)

Bigtable rows:

rollup-month-5m::2017-08::$interface::in
| 1      | 2      | ... | 8640   |
| 6 Gbps | 5 Gbps | ... | 2 Gbps |

percentiles::2017-08::$interface::in
| 1 pct    | 2 pct    | ... | 99 pct    |
| 0.1 Gbps | 0.3 Gbps | ... | 22.1 Gbps |
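A month of 5m averages (8640 samples) reduces to a percentiles row like the one above. As a sketch, using the nearest-rank method (the production job computes its views in the Beam batch pipeline and may use a different method):

```python
def percentile(values, pct):
    """Nearest-rank percentile: sort, then take the value at the
    ceiling of pct% of the sample count."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling
    return ordered[rank - 1]

month = list(range(1, 101))   # stand-in for 8640 5m averages
p99 = percentile(month, 99)
```

Precomputing these once per month means a client asking "what does a typical vs. peak day look like?" never triggers a scan of the raw samples.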
Example: Computing Total Traffic
# Python Beam SDK
import apache_beam as beam
from apache_beam.io import ReadFromText

# FormatCSVDoFn, group_by_device_interface and compute_rate are
# defined in the full code linked below.
pipeline = beam.Pipeline('DirectRunner')
(pipeline
 | 'read' >> ReadFromText('./example.csv')
 | 'csv' >> beam.ParDo(FormatCSVDoFn())
 | 'ifName key' >> beam.Map(group_by_device_interface)
 | 'group by iface' >> beam.GroupByKey()
 | 'compute rate' >> beam.FlatMap(compute_rate)
 | 'timestamp key' >> beam.Map(lambda row: (row['timestamp'], row['rateIn']))
 | 'group by timestamp' >> beam.GroupByKey()
 | 'sum by timestamp' >> beam.Map(lambda rates: (rates[0], sum(rates[1])))
 | 'format' >> beam.Map(lambda row: '{},{}'.format(row[0], row[1]))
 | 'save' >> beam.io.WriteToText('./total_by_timestamp'))
pipeline.run()
Full code available at: http://x1024.net/blog/2017/05/chinog-flexible-network-analytics-in-the-cloud/
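The 'compute rate' step above is where raw counters become rates. As a rough pure-Python sketch of what such a helper does (the function name and wire format here are hypothetical; the linked full code has the real one):

```python
def compute_rate_pairs(samples):
    """Turn consecutive (timestamp, counter) samples into (timestamp, rate).

    SNMP octet counters only ever increase, so the rate is the counter
    delta divided by the time delta, times 8 bits per octet.
    Counter-wrap handling is omitted in this sketch.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rates.append((t1, (c1 - c0) * 8 / (t1 - t0)))
    return rates

rates = compute_rate_pairs([(0, 0), (30, 300), (60, 900)])
# [(30, 80.0), (60, 160.0)]
```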
Our Stack
- Apache Beam using Scio
- Google Cloud Platform
○ Dataflow
○ Bigtable
○ BigQuery
○ Pub/Sub
○ App Engine
- Languages
○ Scala
○ Javascript / Typescript
○ Python
Current Status & Future Plans
Current
Release candidate for SNMP data:
- Ingest to BigQuery is working
- Migration of historical data is complete
- Streaming ingest to Bigtable
- Early version of utilization visualization
- Simple data server can provide data to
clients, but gRPC API coming
- Interface time series charts functional
Future
More types of data:
- Flow data
- perfSONAR
- Machine Learning
- Anomaly Detection
- “Mash up” various data sources
Why not InfluxDB, Elastic or ${FAVORITE_DB}
- We have a data processing problem, not a data storage problem per se.
○ Beam and the ecosystem around it give a huge amount of flexibility -- can try new ideas as they occur to us
○ Ability to move to different platform components
○ Machine learning (TensorFlow and others)
- InfluxDB & Elastic
○ Require care and feeding -- have to think about disks and machines, etc.
○ At our last evaluation (a while ago now) InfluxDB wasn’t able to keep up with our load -- this may have changed but other benefits outweigh that.
○ Elastic doesn’t seem to be a good fit for long term storage -- everything is in the “hot” tier
Why the cloud? Why Google Cloud Platform?
Why the cloud?
- Focus on our problems not on infrastructure
- Scalability without needing to own lots of systems
- Managed services for databases and compute
Why Google Cloud?
- Apache Beam was Google Dataflow when we first encountered it
- More cohesive ecosystem than AWS in our experience
- Although we have used Google Cloud specific services, the approach is
portable to other environments
Lessons learned / Life in the cloud / Good & Bad
The Good
- Not a silver bullet, but makes many things easier
- Scaling! We processed 9,902,585,175 data points in 3.5 hours
- Focus on your services, not on infrastructure
- Scio and Scala allow working at a high level of abstraction
The Not So Good
- GCP tech support is pretty bad
- Python is a second class citizen in Beam for now
- Scala is powerful but challenging at times
- Learning curve is pretty steep in places
Thank you!
Jon Dugan <jdugan@es.net>
- MyESnet: https://my.es.net
- ESnet Open Source: http://software.es.net/
○ http://software.es.net/react-timeseries-charts/
○ http://software.es.net/pond/
○ http://software.es.net/react-network-diagrams/
- Scio: https://github.com/spotify/scio
- Beam: https://beam.apache.org
The ESnet netbeam team:
- Peter Murphy
- Monte Goode
- Sowmya Balasubramanian
- Scott Richmond