NETWORK TELEMETRY AND ANALYTICS IN THE AGE OF BIG DATA RUTURAJ - - PowerPoint PPT Presentation

SLIDE 1

NETWORK TELEMETRY AND ANALYTICS IN THE AGE OF BIG DATA

RUTURAJ PATHAK/Senior Product Manager/INVENTEC

SLIDE 2
Open Compute Project (OCP) Participation

  • NIDC Submissions for OCP Certification

– Specifications Accepted by OCP as of January 2016

  • 10/40G: D6254QSBP and D6254QSBX

– http://www.opencompute.org/wiki/Networking/SpecsAndDesigns#Inventec_DCS6072QS

  • 100G: D7032Q28BP and D7032Q28BX

– http://www.opencompute.org/wiki/Networking/SpecsAndDesigns#Inventec_DCS7032Q28

OCP Accepted: D6254QSBP and D6254QSBX; D7032Q28BP and D7032Q28BX

Inventec is a Platinum Member!

Inventec Confidential

SLIDE 3

Open Architecture

Diagram labels:

  • NETWORK APPS, MONITORING APPS, MANAGEMENT APPS
  • REST APIs
  • KPI, SLA, Capacity
  • Realtime Security
  • Resource discovery
  • Reconciliation, Real time Provisioning
  • SAI

SLIDE 4

RUDIMENTARY TELEMETRY

root> show chassis alarms
1 alarms currently active
Alarm time               Class  Description
2014-07-29 07:27:12 UTC  Minor  Host 0 Temperature Warm

  • <Syslog Messages>

Jul 29 07:26:47 chassisd[1387]: CHASSISD_SNMP_TRAP6: SNMP trap generated: Over Temperature!

Red Alarm
2014-07-29 08:07:50 UTC  Major  Host 0 Temperature Hot

  • <Syslog Messages>

CHASSISD_RE_OVER_TEMP_WARNING: Routing Engine 0 temperature (73 C) over 72 degrees C, platform will shut down in 240 seconds if condition persists

Debugging based on such information is difficult
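One reason such messages are hard to debug from is that the useful numbers are buried in free text. A minimal sketch, assuming only the message format shown on the slide, of turning the over-temperature warning into a structured record. The function name, field names, and regular expression are illustrative, and other platforms or software releases may word the message differently.

```python
import re

# Pattern written for the single message shown on the slide (an assumption:
# other platforms or software versions may format this message differently).
OVER_TEMP = re.compile(
    r"CHASSISD_RE_OVER_TEMP_WARNING: Routing Engine (?P<re_slot>\d+) "
    r"temperature \((?P<temp_c>\d+) C\) over (?P<limit_c>\d+) degrees C, "
    r"platform will shut down in (?P<shutdown_s>\d+) seconds"
)


def parse_over_temp(message):
    """Return the numeric fields of an over-temperature warning, or None."""
    match = OVER_TEMP.search(message)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


sample = (
    "CHASSISD_RE_OVER_TEMP_WARNING: Routing Engine 0 temperature (73 C) "
    "over 72 degrees C, platform will shut down in 240 seconds if condition persists"
)
record = parse_over_temp(sample)
print(record)
```

A record like this can be compared against thresholds or stored as a time series, but it still carries no hint about the root cause.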

SLIDE 5

TELEMETRY today

Checking for CPU events and setting up notifications may not work

  • CLI sessions are not closed gracefully on the router. In this case, one would see mgd running high on CPU, starving the kernel of CPU cycles.

CPU utilization:

    1059 root 1 132 0 24344K 18936K RUN 405.0H 43.75% mgd
    26275 root 1 132 0 24344K 18936K RUN 353.5H 43.75% mgd

  • One way to address this issue is to kill the mgd processes eating up the CPU.
  • 'Sampling' is enabled on the router. This sometimes leads to high kernel CPU; to address this, reduce the rate at which you are sampling on the router.
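The kind of CLI-output scripting this slide describes can be sketched as follows. This is a minimal illustration, assuming the process lines have eleven whitespace-separated fields with the PID first, the CPU percentage tenth, and the command last, as in the two lines on the slide. The function name and the 40% threshold are hypothetical.

```python
def busy_processes(top_lines, threshold_pct=40.0):
    """Return (pid, command, cpu_pct) for processes at or above the threshold.

    Assumes the column layout of the slide's sample lines; lines that do not
    fit that layout (headers, blanks) are skipped.
    """
    hits = []
    for line in top_lines:
        fields = line.split()
        if len(fields) < 11 or not fields[9].endswith("%"):
            continue
        pid = int(fields[0])
        cpu_pct = float(fields[9].rstrip("%"))
        command = fields[10]
        if cpu_pct >= threshold_pct:
            hits.append((pid, command, cpu_pct))
    return hits


sample_lines = [
    "1059 root 1 132 0 24344K 18936K RUN 405.0H 43.75% mgd",
    "26275 root 1 132 0 24344K 18936K RUN 353.5H 43.75% mgd",
]
result = busy_processes(sample_lines)
print(result)
```

A parser like this depends on the exact column layout, so any change in the CLI output format silently breaks it.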

SLIDE 6

TELEMETRY issues

▪ Quite coarse data granularity
▪ SNMP polling puts a lot of load on the CPU and has severe scaling issues
▪ CLI scripts break and need frequent changes
▪ Even IPFIX flow sampling misses important information
▪ No data correlation
▪ Reactive, yet no information or hint given on root cause

SNMP SYSLOG CLI Scripts

A shift in the way we optimize and diagnose the networks is required

SLIDE 7

SDN TELEMETRY & ANALYTICS- Agent Based

Diagram labels:

  • PROGRAM (AUTOMATIC/MANUAL), TELEMETRY, VISUALIZE
  • Controllability, Observability
  • Closed loop feedback
  • SDN Controller, GUI
  • AGENT, NOS, SAI, SDK
  • OPEN API, HOST CPU

SLIDE 8

Inband Network Telemetry with P4

insert or modify packet headers with custom metadata
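The idea of each switch adding its own metadata to a packet in flight can be illustrated with a small host-side sketch. This is not P4 and not the wire format of the In-band Network Telemetry specification: the one-byte hop count, the three 32-bit fields, and both function names are assumptions made only for illustration.

```python
import struct

# Hypothetical per-hop record: switch ID, queue depth, hop latency in ns,
# each an unsigned 32-bit big-endian integer. This is NOT the INT spec layout.
HOP_FORMAT = "!III"
HOP_SIZE = struct.calcsize(HOP_FORMAT)  # 12 bytes


def push_hop_metadata(packet, switch_id, queue_depth, hop_latency_ns):
    """Return a copy of the packet with one hop record pushed onto its stack.

    Assumed layout: 1-byte hop count, then hop records (newest first),
    then the original payload.
    """
    count = packet[0]
    record = struct.pack(HOP_FORMAT, switch_id, queue_depth, hop_latency_ns)
    return bytes([count + 1]) + record + packet[1:]


def read_hop_metadata(packet):
    """Return (hop records newest first, original payload)."""
    count = packet[0]
    hops = [
        struct.unpack_from(HOP_FORMAT, packet, 1 + i * HOP_SIZE)
        for i in range(count)
    ]
    return hops, packet[1 + count * HOP_SIZE:]


packet = bytes([0]) + b"payload"
packet = push_hop_metadata(packet, switch_id=1, queue_depth=12, hop_latency_ns=850)
packet = push_hop_metadata(packet, switch_id=2, queue_depth=3, hop_latency_ns=400)
hops, payload = read_hop_metadata(packet)
print(hops, payload)
```

In a P4 target the equivalent header manipulation would run in the data plane at line rate; the sketch only shows how per-hop state accumulates along the path and is recovered at the end.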

SLIDE 9

Discovering information

▪ This fan speed increase is in response to abnormal behavior in chip…
▪ This switch A is seeing more congestion @ 3PM because of …
▪ Bit error rate is increasing on interface x/y/z due to …

Packet loss in 3 min.

Do we need a Crystal Ball to answer the above questions and ACT on it?

Optimize Network for Data Center SLAs

▪ Latency
▪ Network Jitter
▪ Packet Loss
▪ Bandwidth Guarantees

SLIDE 10

Why Deep learning now?

SDN is the key enabler!

▪ Plethora of data available
▪ A lot of cheap compute power and storage is available now
▪ More layers of NN are required to solve complex problems
▪ Introduction of GPUs: perfect for matrix multiplication
▪ NN algorithms have matured and can be scaled
▪ A lot of research is being conducted in this field

NNs can be used to control nonlinear dynamic systems

SLIDE 11

Deep Reinforcement Learning

  • Learning Behaviors and skills
  • No modeling
  • Sequence of decisions is necessary
  • Actions have consequences
  • Environment Stateful

Diagram labels:

  • RL Agent
  • Environment (Network)
  • State Observation (link bandwidth)
  • Action (spine-leaf link weight)
  • Reward (network delay)

Sequence of states and actions: s_0, a_0, r_0, ..., s_{T-1}, a_{T-1}, r_{T-1}, s_T, r_T

Used for nonlinear, complex, multidimensional systems

Transition function: P(s_{t+1}, r_t | s_t, a_t)
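The agent, state, action, and reward loop on this slide can be sketched with tabular Q-learning, which is a much simpler stand-in for the deep reinforcement learning the slide refers to. Everything in the toy environment is an assumption made for illustration: two states (link uncongested or congested), two actions (lower or raise a spine-leaf link weight), deterministic transitions, and a reward equal to the negative of a made-up delay.

```python
import random

STATES = (0, 1)   # hypothetical: 0 = link uncongested, 1 = link congested
ACTIONS = (0, 1)  # hypothetical: 0 = lower the link weight, 1 = raise it


def step(state, action):
    """Toy environment: return (next_state, reward).

    The toy dynamics depend only on the action: lowering the weight attracts
    traffic and congests the link. Reward is the negative delay in ms.
    """
    next_state = 1 if action == 0 else 0
    delay_ms = 10.0 if next_state == 1 else 1.0
    return next_state, -delay_ms


def train(steps=20000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Run epsilon-greedy Q-learning and return the learned Q-table."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    state = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
        next_state, reward = step(state, action)
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        # Standard Q-learning update toward reward + discounted best next value.
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state
    return q


q_table = train()
policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in STATES}
print(policy)
```

In this toy setting the learned policy prefers raising the link weight in both states, because that action always leads to the low-delay state. A real network has a far larger state space, which is where the neural-network function approximation on the slide comes in.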

SLIDE 12

SDN TELEMETRY & ANALYTICS- Deep Learning

Diagram labels:

  • PROGRAM (AUTOMATIC/MANUAL), TELEMETRY, VISUALIZE
  • Controllability, Observability
  • Closed loop feedback
  • SDN Controller, GUI
  • AGENT, NOS, SAI, SDK
  • REST API, HOST CPU, DEEP LEARNING

SLIDE 13

Conclusion

  • Can we design proactive networks?
  • Can we get predictive insights?
  • Can we do risk mitigation?
  • Can we do anomaly detection?
  • Can we make networks more efficient?

AI/Robots will be Omnipresent by 2025


Monitoring Topology Management Maintenance Performance Location Functional Data

Data Center

Costing

Opinion: Yes

LEARN ANALYZE PATTERNS ANTICIPATE PROBLEMS SUGGEST

SLIDE 14