
NETWORK TELEMETRY AND ANALYTICS IN THE AGE OF BIG DATA



  1. NETWORK TELEMETRY AND ANALYTICS IN THE AGE OF BIG DATA. RUTURAJ PATHAK / Senior Product Manager / INVENTEC

  2. Open Compute Project (OCP) Participation
     OCP Accepted! D7032Q28BP and D7032Q28BX; D6254QSBP and D6254QSBX. Inventec is a Platinum Member!
     • NIDC Submissions for OCP Certification – Specifications Accepted by OCP as of January 2016
     • 10/40G: D6254QSBP and D6254QSBX – http://www.opencompute.org/wiki/Networking/SpecsAndDesigns#Inventec_DCS6072QS
     • 100G: D7032Q28BP and D7032Q28BX – http://www.opencompute.org/wiki/Networking/SpecsAndDesigns#Inventec_DCS7032Q28
     Inventec Confidential

  3. Open Architecture
     [Diagram. Labels: MANAGEMENT APPS, NETWORK APPS, MONITORING APPS; Resource discovery; Realtime KPI, SLA, Reconciliation; Real time Security; Capacity Provisioning; REST APIs; SAI]

  4. RUDIMENTARY TELEMETRY
     root> show chassis alarms
     1 alarms currently active
     Alarm time               Class  Description
     2014-07-29 07:27:12 UTC  Minor  Host 0 Temperature Warm
     -------------------------------------------------------
     <Syslog Messages>
     Jul 29 07:26:47 chassisd[1387]: CHASSISD_SNMP_TRAP6: SNMP trap generated: Over Temperature!
     Red Alarm
     2014-07-29 08:07:50 UTC  Major  Host 0 Temperature Hot
     <Syslog Messages>
     CHASSISD_RE_OVER_TEMP_WARNING: Routing Engine 0 temperature (73 C) over 72 degrees C, platform will shut down in 240 seconds if condition persists
     Debugging based on such information is difficult.
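One reason debugging from such output is hard is that every value of interest has to be scraped out of free text. The following is a minimal Python sketch of that scraping for the over-temperature message shown on the slide. The regular expression and the field names (`engine`, `temp_c`, `limit_c`, `shutdown_s`) are assumptions made for illustration; they cover only this one message wording and are not a vendor-defined schema.

```python
import re

# Message text copied from the slide's syslog example.
MESSAGE = (
    "CHASSISD_RE_OVER_TEMP_WARNING: Routing Engine 0 temperature (73 C) "
    "over 72 degrees C, platform will shut down in 240 seconds if condition persists"
)

# Hypothetical pattern written for this single message format only.
PATTERN = re.compile(
    r"^(?P<tag>[A-Z0-9_]+): Routing Engine (?P<engine>\d+) temperature "
    r"\((?P<temp_c>\d+) C\) over (?P<limit_c>\d+) degrees C, "
    r"platform will shut down in (?P<shutdown_s>\d+) seconds"
)


def parse_over_temp(message):
    """Return the numeric fields of an over-temperature warning, or None."""
    match = PATTERN.match(message)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "tag": fields["tag"],
        "engine": int(fields["engine"]),
        "temp_c": int(fields["temp_c"]),
        "limit_c": int(fields["limit_c"]),
        "shutdown_s": int(fields["shutdown_s"]),
    }


if __name__ == "__main__":
    print(parse_over_temp(MESSAGE))
```

Any change in the message wording silently breaks a parser like this, which is the fragility the slide points at.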

  5. TELEMETRY today
     • CLI sessions are not closed gracefully on the router. In this case, one would see mgd running high on CPU, starving the kernel of CPU cycles.
         1059 root 1 132 0 24344K 18936K RUN 405.0H 43.75% mgd
         26275 root 1 132 0 24344K 18936K RUN 353.5H 43.75% mgd
         (CPU utilization)
     • One way to address this issue is to kill the mgd processes eating up the CPU.
     • 'Sampling' is enabled on the router. This sometimes leads to high kernel CPU; to address this, reduce the rate at which you are sampling on the router.
     Checking for CPU events and setting up notifications may not work.
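The manual check described above (spot the mgd processes that are consuming CPU) can be sketched in Python against the two process lines from the slide. The column layout is an assumption: the sketch takes the first field as the PID, the last as the command name, and the second-to-last as the CPU percentage, which matches the lines shown but is not guaranteed for other output. The function names and the threshold are hypothetical.

```python
# Process lines copied from the slide.
TOP_LINES = [
    "1059 root 1 132 0 24344K 18936K RUN 405.0H 43.75% mgd",
    "26275 root 1 132 0 24344K 18936K RUN 353.5H 43.75% mgd",
]


def parse_top_line(line):
    """Split one process line into the three fields this sketch needs.

    Assumed layout: PID first, command last, CPU percentage second to last.
    """
    fields = line.split()
    return {
        "pid": int(fields[0]),
        "cpu_percent": float(fields[-2].rstrip("%")),
        "command": fields[-1],
    }


def cpu_hogs(lines, command, threshold_percent):
    """Return PIDs of `command` processes at or above the CPU threshold."""
    rows = [parse_top_line(line) for line in lines]
    return [
        row["pid"]
        for row in rows
        if row["command"] == command and row["cpu_percent"] >= threshold_percent
    ]


if __name__ == "__main__":
    # Report mgd processes using at least 40% CPU (threshold is arbitrary).
    print(cpu_hogs(TOP_LINES, "mgd", 40.0))
```

The sketch only reports candidates. Whether to kill them, as the slide suggests, remains an operator decision.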

  6. TELEMETRY issues (SNMP, SYSLOG, CLI Scripts)
     ▪ Quite coarse data granularity
     ▪ SNMP polling puts a lot of load on the CPU and has severe scaling issues
     ▪ CLI scripts break and need frequent changes
     ▪ Even IPFIX flow sampling misses important information
     ▪ No data correlation
     ▪ Reactive, yet no information or hint given on root cause
     A shift in the way we optimize and diagnose networks is required.

  7. SDN TELEMETRY & ANALYTICS - Agent Based
     [Diagram. Labels: GUI, VISUALIZE, AUTOMATIC/MANUAL, SDN Controller, OPEN API, HOST CPU, Observability, Controllability, AGENT, NOS, SAI, SDK, PROGRAM, TELEMETRY, Closed loop feedback]

  8. In-band Network Telemetry with P4: insert or modify packet headers with custom metadata
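The idea on this slide, each device adding its own metadata to the packet as it passes, can be illustrated with a small Python simulation. This is not P4 and does not follow any real header layout: the `HopMetadata` fields (`switch_id`, `queue_depth`, `hop_latency_ns`) and the `transit` and `sink` helpers are hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class HopMetadata:
    """Hypothetical per-hop record; real deployments define their own fields."""
    switch_id: int
    queue_depth: int
    hop_latency_ns: int


@dataclass
class Packet:
    payload: bytes
    telemetry: List[HopMetadata] = field(default_factory=list)


def transit(packet: Packet, switch_id: int, queue_depth: int, hop_latency_ns: int) -> Packet:
    """Simulate one switch appending its metadata to the packet in flight."""
    packet.telemetry.append(HopMetadata(switch_id, queue_depth, hop_latency_ns))
    return packet


def sink(packet: Packet) -> Tuple[Packet, List[HopMetadata]]:
    """Simulate the last hop: strip the metadata and hand it to a collector."""
    report = list(packet.telemetry)
    packet.telemetry.clear()
    return packet, report


if __name__ == "__main__":
    pkt = Packet(payload=b"application data")
    transit(pkt, switch_id=1, queue_depth=12, hop_latency_ns=800)
    transit(pkt, switch_id=2, queue_depth=3, hop_latency_ns=450)
    pkt, report = sink(pkt)
    print([hop.switch_id for hop in report])
    print(sum(hop.hop_latency_ns for hop in report))
```

Because the measurements travel with the packet itself, the collector sees the path and per-hop conditions that this specific packet experienced, rather than a sampled or polled approximation.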

  9. Discovering information
     ▪ This fan speed increase is in response to abnormal behavior in chip …
     ▪ This switch A is seeing more congestion @ 3PM because of …
     ▪ Bit error rate is increasing on interface x/y/z due to … Packet loss in 3 min.
     Optimize Network for Data Center SLAs
     ▪ Latency
     ▪ Network Jitter
     ▪ Packet Loss
     ▪ Bandwidth Guarantees
     Do we need a crystal ball to answer the above questions and act on them?

  10. Why Deep learning now?
     ▪ Plethora of data available
     ▪ A lot of cheap compute power and storage is available now
     ▪ More layers of NN are required to solve complex problems
     ▪ Introduction of GPUs: perfect for matrix multiplication
     ▪ NN algorithms have matured and can be scaled
     ▪ A lot of research is being conducted in this field
     NN can be used to control nonlinear dynamic systems. SDN is the key enabler!

  11. Deep Reinforcement Learning
     • Learning behaviors and skills
     • No modeling
     • Sequence of decisions is necessary
     • Actions have consequences
     • Environment: stateful
     [Diagram. RL Agent and Environment (Network). Action: spine-leaf link weight. State observation: link bandwidth. Reward: network delay.]
     Sequence of states and actions: $s_0, a_0, r_0, \ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T, r_T$
     Transition function: $P(s_{t+1}, r_t \mid s_t, a_t)$
     Used for nonlinear, complex, multidimensional systems.
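The agent and environment interaction on this slide can be sketched as a plain Python loop that records the sequence of states, actions, and rewards. Everything about the environment here is invented for illustration: `ToyLinkEnv`, its capacities and load, and the use of the highest link utilization as a stand-in for network delay. The policy is a fixed hand-written rule, not a learned one, so the sketch shows only the interaction loop and the trajectory, not a deep reinforcement learning algorithm.

```python
class ToyLinkEnv:
    """Hypothetical two-link network; all numbers are made up for the sketch."""

    CAPACITY_A = 4.0  # capacity of spine-leaf link A (arbitrary units)
    CAPACITY_B = 6.0  # capacity of spine-leaf link B
    LOAD = 5.0        # total offered traffic

    def reset(self):
        self.weight = 0.5  # share of traffic sent over link A
        return self._observe()

    def _observe(self):
        # State observation: utilization of each link.
        util_a = self.LOAD * self.weight / self.CAPACITY_A
        util_b = self.LOAD * (1.0 - self.weight) / self.CAPACITY_B
        return (util_a, util_b)

    def step(self, action):
        # Action: change in the weight of link A, clamped to [0, 1].
        self.weight = min(1.0, max(0.0, self.weight + action))
        state = self._observe()
        reward = -max(state)  # higher utilization stands in for higher delay
        return state, reward


def balance_policy(state):
    """Fixed rule: shift traffic away from the more utilized link."""
    util_a, util_b = state
    if util_a - util_b > 0.05:
        return -0.1
    if util_b - util_a > 0.05:
        return 0.1
    return 0.0


def rollout(env, policy, steps):
    """Collect (s_t, a_t, r_t) for t = 0..steps-1 and return the final state."""
    state = env.reset()
    trajectory = []
    for _ in range(steps):
        action = policy(state)
        next_state, reward = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory, state


if __name__ == "__main__":
    trajectory, final_state = rollout(ToyLinkEnv(), balance_policy, steps=5)
    for state, action, reward in trajectory:
        print(state, action, reward)
    print("final state:", final_state)
```

In a learning setup, `balance_policy` would be replaced by a model whose parameters are updated from the recorded rewards. The loop and the trajectory structure would stay the same.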

  12. SDN TELEMETRY & ANALYTICS - Deep Learning
     [Diagram. Labels: GUI, VISUALIZE, AUTOMATIC/MANUAL, REST API, DEEP LEARNING, SDN Controller, HOST CPU, Observability, Controllability, AGENT, NOS, SAI, SDK, PROGRAM, TELEMETRY, Closed loop feedback]

  13. Conclusion
     AI/Robots will be omnipresent by 2025
     • Can we design proactive networks?
     • Can we get predictive insights?
     • Can we do risk mitigation?
     • Can we do anomaly detection?
     • Can we make networks more efficient?
     Opinion: Yes
     [Diagram. Labels: LEARN, SUGGEST, ANALYZE, ANTICIPATE PROBLEMS, PATTERNS; Costing, Topology, Monitoring, Performance Management, Data Center Maintenance, Location, Functional Data]
