Using Machine Learning with Wide Area Networks (WANs)
- Dr. Mariam Kiran
ANTG Research Group
Energy Sciences Network (ESnet) Lawrence Berkeley National Lab Summer Student 2017 Talk
– Is it the Next Hype or substantial?
– Project 1: ‘Talking’ to Networks
– Project 2: Self-healing Networks
ML at the peak
interactions
– Playing chess with your computer
– Making games difficult for players
multiple areas of
– Speech recognition
– Image recognition
– Robotics
– Philosophy of mind
data, need training to work)
Nvidia blog
AI optimization techniques: expert systems; fuzzy systems; neural networks; evolutionary algorithms (genetic algorithms, evolutionary strategies, etc.); swarm intelligence (ant colony, particle swarm, more); deep belief networks; deep Boltzmann networks; convolutional networks; stacked autoencoders; many more...
Networks: graph algorithms (routing – shortest path)
Wherever learning is involved (training): ML
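The shortest-path routing case named above can be illustrated with Dijkstra's algorithm. This is only a minimal sketch, not routing code from the talk; the node names and link costs in TOPOLOGY are made up for the example.

```python
import heapq

def shortest_path(graph, source, target):
    """Return (cost, path) for the cheapest route, or (inf, []) if unreachable."""
    # Each queue entry carries the cost so far, the current node and the path taken.
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []

# Hypothetical topology: node names and link costs are illustrative only.
TOPOLOGY = {
    "A": {"B": 4, "C": 1},
    "B": {"D": 1},
    "C": {"B": 2, "D": 5},
    "D": {},
}

if __name__ == "__main__":
    print(shortest_path(TOPOLOGY, "A", "D"))
```

No training is involved here: the result follows directly from the link costs, which is why this sits on the graph-algorithm side rather than the ML side.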
recognize images with 70% accuracy and predict Gmail replies
Toolkit | Language | Use | Processing capability
Caffe | C++ | Images and video | Distributed (HPC, GPU)
TensorFlow | Python | Images, regression, video, text, speech | Distributed (HPC, GPU)
Theano | Python | Images | Distributed (HPC, GPU)
Torch | Lua | Images and speech | Distributed (HPC, GPU)
Deep neural network | Input data | Applied for | Variants
Feed forward neural network | Hierarchical data representations
(uses restricted Boltzmann machine for activation function)
networks
Recurrent neural network | Sequential data representation (i.e., time series data) | Sequential learning, especially useful when a time relationship exists | Long short-term memory (LSTM), used for speech translation
Each algorithm is chosen based on the data and the problem being explored (some reach 50% accuracy, others 80%)
Machine learning; knowledge-based reasoning; probabilistic reasoning; data-driven reasoning; ontology-based reasoning; Kalman filter; hidden Markov models; unsupervised; semi-supervised; supervised; decision trees; rule-based systems (expert systems); Bayesian networks; support vector machines; K-means; deep Boltzmann
Using statistical classification (need training)
Range of possibilities of ML algorithms
HipGISAXS & RMC
GISAXS
Slot-die printing of Organic photovoltaics
Borrowed from E Dart
– Multi-domain provisioning
– Ease of use, Reliable
– Dedicated bandwidth on demand, loss free
– Isolation
– Monitoring (perfSONAR)
– Network virtualization
– Virtualization, SDN, switches, routers, etc.
Designing for
Teams: Science Engagement, Software, Infrastructure, SDN Testbed, etc
2030 2017 1990
utilized
and network use
Problems!
– Normal and outlier behaviors in traffic
– This <QoS value> will cause this <event Y> with probability <P>
– Software or hardware faults
– Anticipate congestion
– Divert traffic to alternate paths
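A statement of the form "this <QoS value> will cause this <event Y> with probability <P>" can, in its simplest form, be estimated as a conditional frequency over past observations. The sketch below assumes a history of (QoS value, event occurred) pairs; the metric (packet loss), the event (a stalled transfer) and the numbers in HISTORY are invented for illustration and are not measurements from the talk.

```python
def event_probability(records, threshold):
    """Estimate P(event | qos_value >= threshold) from (qos_value, event) pairs."""
    matching = [event for value, event in records if value >= threshold]
    if not matching:
        return None  # no observations at or above the threshold
    # True counts as 1 and False as 0, so the sum is the number of events.
    return sum(matching) / len(matching)

# Made-up observations: (packet loss in percent, whether a transfer stalled).
HISTORY = [
    (0.1, False), (0.2, False), (1.5, True),
    (2.0, True), (1.8, False), (0.3, False),
]

if __name__ == "__main__":
    print(event_probability(HISTORY, 1.0))
```

A learned model would replace the fixed threshold with something fitted to data, but the output has the same shape: a probability attached to a QoS condition.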
multiple layers
User traffic data User traffic (directed flows)
WAN topology (traffic engineering) (flow-level, traffic prediction, adaptation, path optimization, link failure)
Infrastructure traffic data (packet-level, queues, TCP, UDP)
Infrastructure-level modifications (switches, deployment, etc.)
– B4, Jupiter, BwE, etc. (data center to user-based provisioning)
awareness for:
– Forming topologies, optimum path finding
– Improve path utilization depending on traffic
[Bar chart, 2010-2017: ML vs. non-ML, by category: user traffic, traffic engineering, packet-level improvements, optimizing infrastructure]
– We need to develop network techniques catering to science workflows
Global
Local
Intent-driven networks: INDIRA
Self-healing networks
– Solution: development of the INDIRA tool
– Use of semantic ontology
– Network provisioning and rendering of commands
– Current state: interaction with multiple tools (NSI, Globus, and more)
[Figure labels: R&E networks, ESnet, DoE, universities, instruments, facilities; scientists; network engineers]
I want to watch a movie tonight on Netflix
scientist
I want to see my real-time, high-resolution big data visualization
I want to stream the big data directly into the cache
– Example:
Scientist> Can you set up a connection between Berkeley and Argonne?
Network> Do you want guaranteed bandwidth?
Scientist> Sure!
Network> OK! I’ll get this set up for you... You’re all set!
./onsa reserveprovision -g urn:uuid:6e1f288a-5a26-4ad8-a9bc-eb91785cee15
v2/ConnectionServiceProvider -p es.net:2013:nsa:nsi-aggr-west -r canada.eh:2016:nsa:requester -h 198.128.151.17 -o 8443 -l /etc/hostcert/muclient.crt -k /etc/hostcert/muclient.key -i /etc/ssl/certs/ -y -x -z -v -q;
Language processing to take intent input
like bandwidth, time schedule, topology
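To make the idea of taking intent input concrete, here is a rough sketch that pulls endpoints, a deadline and a bandwidth figure out of a free-text request. It uses plain regular expressions as a stand-in; the slides describe INDIRA as using NLP and OWL ontologies, which this does not reproduce. The function name, field names and accepted phrasings are all hypothetical.

```python
import re

def parse_intent(sentence):
    """Extract endpoints, deadline and bandwidth from a free-text request."""
    intent = {"source": None, "destination": None, "deadline": None, "bandwidth": None}

    # Topology: "from <site> to <site>"
    route = re.search(r"from (\w+) to (\w+)", sentence, re.IGNORECASE)
    if route:
        intent["source"], intent["destination"] = route.group(1), route.group(2)

    # Time schedule: "by 5:00pm" or "by 17:30"
    deadline = re.search(r"by (\d{1,2}:\d{2}\s?(?:am|pm)?)", sentence, re.IGNORECASE)
    if deadline:
        intent["deadline"] = deadline.group(1).strip()

    # Bandwidth: a number followed by Mb/s or Gb/s
    bandwidth = re.search(r"(\d+(?:\.\d+)?)\s?([MG]b/s)", sentence)
    if bandwidth:
        intent["bandwidth"] = (float(bandwidth.group(1)), bandwidth.group(2))

    return intent

if __name__ == "__main__":
    print(parse_intent("I want to transfer files from LBL to ANL by 5:00pm at 5 Gb/s"))
```

Fields that the sentence does not mention stay None, which is where a conversational tool would ask a follow-up question.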
NLP, OWL, “AI” Network engineering Renderer translates intent intent Network state
(Intelligent Network Deployment Intent Renderer Application)
“I want to send data to my supercomputer at NERSC by 5:00pm today”
“OK, I’ll reconfigure the network to make this possible!”
– Project profiles
– Semantics
– Network services and conditions
– Multi-layer provisioning
– Multi-tool connectivity
– No platform dependence (multiple programming languages)
User interface
Physical layer Virtual layer
INDIRA> Hello and welcome! Tell me your project name?
Scientist> Bob
INDIRA> Is ‘Bob’ your project username? Y/N
Scientist> N
INDIRA> What is your project username?
Scientist> Project1
INDIRA> Tell me what you want the network to do. Hint: “I want to [connect] [disconnect] services”
Scientist> I want to transfer files from LBL to ANL
INDIRA> Do you have any time constraints with the transfer from LBL to ANL?
Scientist> No
INDIRA> Ok. Give me the link to the file to be moved.
Scientist> project_test@LBL/datasource1 to project2@ANL/dataset
INDIRA> Do you have any bandwidth requirements?
Scientist> Yes Maximum
INDIRA> Sorry, you are only allowed to use 5 Mb/s. Do you want to use all of this?
Scientist> No
INDIRA> Thank you. Please contact the network administrator to allow you more bandwidth, as you don’t have permission to go above 5 Mb/s.
INDIRA> Hello and welcome! Tell me your project name?
Scientist> Bob
INDIRA> Is ‘Bob’ your project username? Y/N
Scientist> N
INDIRA> What is your project username?
Scientist> Project1
INDIRA> Tell me what you want the network to do. Hint: “I want to [connect] [disconnect] services”
Scientist> I want to transfer files from LBL to ANL
INDIRA> Do you have any time constraints with the transfer from LBL to ANL?
Scientist> No
INDIRA> Ok. Give me the link to the file to be moved.
Scientist> project_test@LBL/datasource1 to project2@ANL/dataset
INDIRA> Do you have any bandwidth requirements?
Scientist> No
INDIRA> Thank you. I am configuring your transfer to start ‘now’ at 5GB/s... Congratulations, all done!
– Rule-based reasoning and pattern matching from ontologies (RDF/OWL)
– Learn usage patterns so iNDIRA can make better suggestions (pattern recognition)
– Find available links
– Check user requirements are met
– Optimize paths provisioned using ML (graph optimization)
– Learn optimal network behaviors (pattern recognition)
– iNDIRA can ‘converse’ with users and translate their needs to NSI/Globus
– Room to explore other ML techniques to improve iNDIRA’s performance
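The provisioning steps listed above (find available links, check that the user's requirement is met, pick a path) can be sketched without any ML as a filter followed by a breadth-first search. This is an assumption-laden illustration, not iNDIRA's implementation: the function name is hypothetical, the site names are borrowed from the slides, and the spare capacities in LINKS are invented.

```python
from collections import deque

def provision_path(links, source, target, required_gbps):
    """Return the fewest-hop path using only links with enough spare bandwidth."""
    # Step 1: keep only the links that can satisfy the user's requirement.
    usable = {}
    for (a, b), spare_gbps in links.items():
        if spare_gbps >= required_gbps:
            usable.setdefault(a, []).append(b)
            usable.setdefault(b, []).append(a)

    # Step 2: breadth-first search returns a path with the fewest hops.
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbor in usable.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # the requirement cannot be met on any path

# Hypothetical spare capacity per bidirectional link, in Gb/s.
LINKS = {("LBL", "ANL"): 2, ("LBL", "FNL"): 10, ("FNL", "ANL"): 10}

if __name__ == "__main__":
    print(provision_path(LINKS, "LBL", "ANL", 5))
```

A request that fits on the direct link takes it; a larger request is routed around it, and a request no link can carry returns None, which corresponds to the "contact the network administrator" branch of the dialog.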
Project 2: Initial phases
across multiple layers
– Behavior forecasting and anomaly detection using DL
– Simple classification for traffic patterns
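As one possible reading of "simple classification for traffic patterns", the sketch below trains a nearest-centroid classifier on labelled flows. The talk does not say which classifier or features were used; the two features (gigabytes moved, duration in minutes), the labels and the training values are all made up.

```python
import math

def train_centroids(samples):
    """Average the feature vectors of each label and return {label: centroid}."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}

def classify(centroids, features):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))

# Made-up training flows: (gigabytes moved, duration in minutes) -> label.
TRAINING = [
    ((0.1, 1.0), "small"), ((0.2, 2.0), "small"),
    ((50.0, 30.0), "bulk"), ((80.0, 45.0), "bulk"),
]

if __name__ == "__main__":
    model = train_centroids(TRAINING)
    print(classify(model, (60.0, 40.0)))
```

In practice the features would need scaling so that one large-valued column does not dominate the distance.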
Reactive or rule-based systems (if ... then)
solve one problem
network as ‘stable’ as possible
LBL FNL ANL CRN
Telemetry
Learning phase (training data sets, classification, etc.)
Real-time monitoring
Recovery phase (action plan)
Anomaly detection
Behavior forecasting
– Real-time anomaly detection: need quick response
– Data collection issues: 1s versus 30s intervals
– Networks are extremely dynamic, we want to limit processing for
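One lightweight way to get a quick response while limiting processing is a sliding-window z-score: keep only the most recent samples and flag a new value that lies far from their mean. This is a generic technique offered as a sketch, not the method used in the project; the class name, window size, warm-up count and threshold are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev

class RollingAnomalyDetector:
    """Flag samples that deviate strongly from a sliding window of recent values."""

    def __init__(self, window=30, threshold=3.0):
        self.samples = deque(maxlen=window)  # old samples drop off automatically
        self.threshold = threshold

    def observe(self, value):
        """Return True if value is an outlier relative to the current window."""
        is_anomaly = False
        if len(self.samples) >= 5:  # wait for a few samples before judging
            mu = mean(self.samples)
            sigma = pstdev(self.samples)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        self.samples.append(value)
        return is_anomaly

if __name__ == "__main__":
    detector = RollingAnomalyDetector(window=10)
    # Made-up utilisation readings with one spike.
    for reading in [10, 11, 9, 10, 12, 10, 11, 95, 10]:
        print(reading, detector.observe(reading))
```

The collection interval determines what the window means: 30 samples cover 30 seconds at 1s intervals but 15 minutes at 30s intervals, so the same settings react on very different time scales.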
Complex Problem that needs to be broken down and solved as individual parts!
– Network behavior
– Network-user interactions or science use cases
etc)
LBL FNL ANL CRN
Detecting performance anomalies in real time
Cannot predict traffic patterns yet...
solve one problem like a puzzle
– HAL 9000 says “There is a 70% chance a link will fail in the next 48 hours”
– Humans reply “No, our results don’t show that”
– HAL 9000 says “Human error!”
– Our focus is on automation and optimization
– New areas in network and perhaps even more