Using Machine Learning with Wide Area Networks (WANs) Dr. Mariam - - PowerPoint PPT Presentation




slide-1
SLIDE 1

Using Machine Learning with Wide Area Networks (WANs)

  • Dr. Mariam Kiran

ANTG Research Group

Energy Sciences Network (ESnet) Lawrence Berkeley National Lab Summer Student 2017 Talk


slide-2
SLIDE 2

Agenda

  • Machine learning (ML) is everywhere:

– Is it the Next Hype or substantial?

  • But what exactly is ML versus AI (versus deep learning)?
  • Some thoughts on Networks with additional ML support
  • The future? Current projects:

– Project 1: ‘Talking’ to Networks
– Project 2: Self-healing Networks

slide-3
SLIDE 3

ML Riding the Wave


ML at the peak

slide-4
SLIDE 4

Is Machine Learning New?

slide-5
SLIDE 5

Is it New?

  • NOPE!
  • Science Fiction has been exploring it for ages
  • Brought the main ideas around AI and human interactions

slide-6
SLIDE 6

ML in the 90s…

  • Computers becoming smarter:

– Playing chess with your computer
– Making games difficult for players

  • A number of innovations in multiple areas:

– Speech recognition
– Image recognition
– Robotics
– Philosophy of mind

slide-7
SLIDE 7

What is the Difference between AI, ML and DL

  • Turing’s paper “Computing Machinery and Intelligence” asks “Can machines think?” – Turing Test: exhibit human-like intelligence
  • Recently seen in movies
  • Machine learning is an approach to achieve AI – spam filters, HR
  • Deep learning is one of the techniques for ML:
  • Recent advances due to GPU and HPC processing (previously very slow, too much data, need training to work)

  • Mainly for image and speech recognition – commercial apps

Nvidia blog

slide-8
SLIDE 8

AI Tree (subset is defined as ML techniques)

AI

  • Optimization techniques
  • Expert systems
  • Fuzzy systems
  • Neural networks: deep belief networks, deep Boltzmann networks, convolutional networks, stacked autoencoders
  • Evolutionary algorithms (genetic algorithms, evolutionary strategies, etc.)
  • Swarm intelligence (ant colony, particle swarm, more)
  • Many more…

Networks: graph algorithms (routing – shortest path)

Wherever learning (training) is involved: ML
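
As a point of reference for the non-ML side of this tree, shortest-path routing can be sketched with Dijkstra's algorithm. This is a minimal illustration only: the topology, the link costs and the function name `shortest_path` are assumptions made for the example, not ESnet data or tooling.

```python
import heapq

def shortest_path(graph, source, target):
    """Dijkstra's algorithm over a dict-of-dicts graph with non-negative link costs."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for nbr, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None, float("inf")
    # Walk the predecessor chain back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]

# Hypothetical three-site topology; costs are illustrative only.
TOPOLOGY = {
    "LBL": {"ANL": 10, "FNL": 4},
    "FNL": {"LBL": 4, "ANL": 3},
    "ANL": {"LBL": 10, "FNL": 3},
}

path, cost = shortest_path(TOPOLOGY, "LBL", "ANL")
```

No training is involved here, which is why this kind of graph algorithm sits outside the ML subset.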

slide-9
SLIDE 9

Trying out Deep Learning Libraries

  • Google’s DNN platform TensorFlow is used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail replies
  • Scikit-learn: a Python library, good for learning
  • NERSC for data analysis in simulations, e.g. climate image analysis
  • HPC innovation: analyze massive data sets, quick training
  • Model and data parallelism

Toolkit | Language | Use | Processing capability
Caffe | C++ | Images and video | Distributed (HPC, GPU)
TensorFlow | Python | Images, regression, video, text, speech | Distributed (HPC, GPU)
Theano | Python | Images | Distributed (HPC, GPU)
Torch | Lua | Images and speech | Distributed (HPC, GPU)

slide-10
SLIDE 10

Choosing Algorithms for Specific Problems

Deep neural network | Input data | Applied for | Variants
Feed-forward neural network | Hierarchical data representations | General classification, clustering, anomaly finding, feature extraction | Deep belief networks (use restricted Boltzmann machines for activation function); convolutional neural networks
Recurrent neural network | Sequential data representation (i.e. time series data) | Sequential learning, especially useful when a time relationship exists | Long short-term memory (LSTM), used for speech translation

  • There are many variants of DNNs, with papers and researchers dedicated to each specific DNN
  • DeepMind used deep Q-learning for Atari and Go
  • Action pairs based on learned data.
slide-11
SLIDE 11

Each algorithm is chosen depending on the data and the problem being explored (some reach 50% accuracy, others 80%)

slide-12
SLIDE 12

Also Choose Techniques based on ‘Learning’

Machine learning

  • Reasoning approaches: knowledge-based reasoning, probabilistic reasoning, data-driven reasoning
  • Learning modes: unsupervised, semi-supervised, supervised
  • Techniques: ontology-based reasoning, Kalman filter, hidden Markov models, decision trees, rule-based systems (expert systems), Bayesian networks, support vector machines, K-means, deep Boltzmann

Using statistical classification (need training)
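
To make the unsupervised end of this list concrete, here is a minimal one-dimensional K-means sketch. It is an illustration under assumptions: the flow sizes and starting centroids are made up, and the helper name `kmeans_1d` is hypothetical. A real study would more likely use a library such as scikit-learn.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain K-means on scalar values, starting from the given centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: each value joins its nearest centroid.
        for v in values:
            idx = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[idx].append(v)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical flow sizes in GB: a group of small flows and a group of large ones.
flow_sizes = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, groups = kmeans_1d(flow_sizes, centroids=[0.0, 5.0])
```

No labels are supplied; the two groups emerge from the data alone, which is what separates this from the supervised techniques that need training sets.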

slide-13
SLIDE 13

What would Networks do with Machine learning?

Range of possibilities of ML algorithms

  • Problem
  • Data
  • Processing/Experience
slide-14
SLIDE 14

ESnet Background

  • R&E networks for science (CERN, LHC, and more)
  • Provide reliable robust network connections to enable science workflows
  • Investigate research and techniques to help build better networks
  • Guarantees for our scientists for network needs (users)


slide-15
SLIDE 15

For example: Many Actors, software, data, etc.

[Figure: example science workflow involving GISAXS, HipGISAXS & RMC, and slot-die printing of organic photovoltaics. Borrowed from E. Dart.]

slide-16
SLIDE 16

ESnet Team continually engaged

  • Science workflows (using NSI or OSCARS)

– Multi-domain provisioning

  • Transfer tools and protocols (using Globus) (TCP research)

– Ease of use, Reliable

  • R&E Networks support big data oriented services (using ScienceDMZ)

– Dedicated bandwidth on demand, loss free
– Isolation
– Monitoring (perfSONAR)
– Network virtualization

  • Network research

– Virtualization, SDN, switches, routers, etc


Designing for

  • Specific science cases
  • End-users
  • Network engineers

Teams: Science Engagement, Software, Infrastructure, SDN Testbed, etc

slide-17
SLIDE 17

A Day in the Life of a Packet


slide-18
SLIDE 18

Traffic Volume Growing Exponentially

[Figure: traffic volume over time, with 1990, 2017 and 2030 marked]

slide-19
SLIDE 19

Managing Multiple Sites


  • Different traffic requirements
  • Quality of service, bandwidth, speed, time-based deliveries, etc
  • Reliability and heterogeneity
slide-20
SLIDE 20

Automating Networks using ML

  • Predict traffic peak times
  • Find anomalies for security threats
  • Optimize how paths are currently utilized
  • Predict link failures
  • Understand or predict user behavior and network use
  • These are Core Network Research Problems!
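
As a baseline for the first item, predicting traffic peak times, the hour-of-day averages of past measurements already give a usable first guess. The sketch below is a historical-average baseline rather than a learned model; the sample values and the function name `peak_hour` are assumptions for illustration.

```python
from collections import defaultdict

def peak_hour(samples):
    """Return (hour, per-hour means) for the hour of day with the highest mean traffic.

    samples: iterable of (hour_of_day, gbps) pairs from past measurements.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for hour, gbps in samples:
        totals[hour] += gbps
        counts[hour] += 1
    means = {h: totals[h] / counts[h] for h in totals}
    return max(means, key=means.get), means

# Hypothetical measurements: (hour of day, observed Gb/s).
history = [(9, 40.0), (9, 60.0), (14, 80.0), (14, 90.0), (22, 20.0)]
hour, hourly_means = peak_hour(history)
```

An ML model earns its place when it beats this kind of baseline, for example by also accounting for day of week or scheduled experiments.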

slide-21
SLIDE 21

Most uses of Machine Learning in Networks (IETF forums)

  • Network Security

– Normal and outlier behaviors in traffic

  • Change or predict possible behavior

– This <QoS value> will cause this <event Y> with probability <P>

  • Bug detection

– Software or hardware faults

  • WAN path optimization

– Anticipate congestion
– Divert traffic to alternate paths
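
The statement "this QoS value will cause this event with probability P" can be estimated, in its simplest form, by counting over past observations. The sketch below assumes a hypothetical log of records with `loss_pct` and `congested` fields; both field names and the data are invented for the example.

```python
def conditional_probability(records, condition, event):
    """Estimate P(event | condition) as a ratio of counts; None if no record matches."""
    matching = [r for r in records if condition(r)]
    if not matching:
        return None
    return sum(1 for r in matching if event(r)) / len(matching)

# Hypothetical observations; field names are assumptions, not a real log format.
observations = [
    {"loss_pct": 2.0, "congested": True},
    {"loss_pct": 3.0, "congested": True},
    {"loss_pct": 2.5, "congested": False},
    {"loss_pct": 0.1, "congested": False},
]

p = conditional_probability(
    observations,
    condition=lambda r: r["loss_pct"] > 1.0,
    event=lambda r: r["congested"],
)
```

Here two of the three high-loss records were followed by congestion, so the estimate is 2/3. Probabilistic models such as Bayesian networks generalize this counting to many interacting variables.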

slide-22
SLIDE 22

Problem becomes quite Complex

  • WANs are complete systems with multiple layers

  • Focus on specific case studies
  • Over 2000 papers in the area
slide-23
SLIDE 23

WANs are complex!

slide-24
SLIDE 24

Depends on what we use ML for…

  • User traffic data: user traffic (directed flows)
  • WAN topology (traffic engineering): flow-level, traffic prediction, adaptation, path optimization, link failure
  • Infrastructure traffic data: packet-level, queues, TCP, UDP
  • Infrastructure-level modifications: switches, deployment, etc.

slide-25
SLIDE 25

What is Most Published?

  • Most ML techniques used for classification (of traffic) and prediction (failures)
  • Recent Google papers have been most influential:

– B4, Jupiter, BwE, etc. (data center to user-based provisioning)

  • Network tools enhanced by embedding informed decisions such as traffic awareness for:

– Forming topologies, optimum path finding
– Improving path utilization depending on traffic

[Chart: number of papers (2010-2017), ML vs non-ML, by category: user traffic, traffic engineering, packet-level improvements, optimizing infrastructure]

slide-26
SLIDE 26

Contribute to Research

(tools, software, papers)

  • Simple graph optimization algorithms still largely used (non-ML)
  • Room for unexplored ML techniques
  • Game theory also seems a promising area (local vs global)
  • Our applications are more complex, more actors and requirements

– We need to develop network techniques catering to science workflows

[Figure: local optimum vs global optimum]
slide-27
SLIDE 27

Current Solutions being explored: Two projects using ML

  • Intent-driven networks: INDIRA
  • Self-healing networks

  • Studying ML to understand how we can apply to our problems
  • Start with simple algorithms and progress to more complex tasks
  • Depends on the Goals we want to achieve
slide-28
SLIDE 28

Intent-driven networks (user-level)

  • Focuses on improving user network interaction
  • Intent-based network

– Solution: development of the INDIRA tool
– Use of semantic ontology
– Network provisioning and rendering of commands
– Current state: interaction with multiple tools (NSI, Globus, and more)

slide-29
SLIDE 29

Setting the Stage

[Figure: R&E networks (ESnet) connecting DoE facilities, universities and instruments, with scientists and network engineers on either side]

Example requests from scientists:

– “I want to watch a movie tonight on Netflix”
– “I want to see my real time high resolution big data visualization”
– “I want to stream the big data directly into the cache of my supercomputer”

  • Applications have complex workloads
  • Network behavior tailored for my application ‘intent’
  • Difficult to fulfill this diverse set of needs
  • Learning curve is huge and complex
  • Difficult to specify needs in ‘English’
  • Specify in high-level language, portable, multi-domain
slide-30
SLIDE 30

Setting Up Paths for Individual: Intent

  • Traffic paths are provisioned with basic QoS values; what if this is optimized for ‘end-users’?
  • Rather than running the following command (setting up a link with QoS):

./onsa reserveprovision -g urn:uuid:6e1f288a-5a26-4ad8-a9bc-eb91785cee15 -d es.net:2013::bnl-mr2:xe-1_2_0:+#1000 -s es.net:2013::lbl-mr2:xe-9_3_0:+#1000 -b 5096 -a 2016-11-13T09:00:00 -e 2017-04-04T17:00:00 -u https://nsi-aggr-west.es.net:443/nsi-v2/ConnectionServiceProvider -p es.net:2013:nsa:nsi-aggr-west -r canada.eh:2016:nsa:requester -h 198.128.151.17 -o 8443 -l /etc/hostcert/muclient.crt -k /etc/hostcert/muclient.key -i /etc/ssl/certs/ -y -x -z -v -q;

  • Networks can understand users: “Tell me what you want!”

– Example:

Scientist> Can you set up a connection between Berkeley and Argonne?
Network> Do you want guaranteed bandwidth?
Scientist> Sure!
Network> OK! I’ll get this set up for you................................. You’re all set!

slide-31
SLIDE 31

Introducing iNDIRA... “Hello! I’m iNDIRA”

Language processing to take intent input

  • Automate rendering into network commands like bandwidth, time schedule, topology

  • Optimize the network
  • Return success or failure to user
  • Understand English (e.g. transfer, connect)
  • Check conditions
  • Ask for any further details
  • Check conflicts and permissions

[Diagram labels: NLP, OWL, “AI”; network engineering; renderer translates intent; network state]

iNDIRA

(Intelligent Network Deployment Intent Renderer Application)

“I want to send data to my supercomputer at NERSC by 5:00pm today”

“OK, I’ll reconfigure the network to make this possible!”
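
The first steps of this pipeline (understand English, check conditions, ask for further details) can be pictured with a much-simplified keyword matcher. This is a stand-in for illustration, not how iNDIRA works: the slides describe NLP with RDF/OWL ontologies, and the site list, action table and function name `parse_intent` below are assumptions.

```python
import re

# Hypothetical vocabulary; the real tool draws on a semantic ontology instead.
KNOWN_SITES = {"LBL", "ANL", "NERSC", "BNL"}
ACTIONS = {
    "transfer": "transfer",
    "send": "transfer",
    "connect": "connect",
    "disconnect": "disconnect",
}

def parse_intent(text):
    """Extract an action and endpoints from a sentence; list what is still missing."""
    words = re.findall(r"[A-Za-z]+", text)
    action = next((ACTIONS[w.lower()] for w in words if w.lower() in ACTIONS), None)
    sites = [w.upper() for w in words if w.upper() in KNOWN_SITES]
    missing = []
    if action is None:
        missing.append("action")
    if len(sites) < 2:
        missing.append("endpoints")
    return {"action": action, "sites": sites, "missing": missing}

complete = parse_intent("I want to transfer files from LBL to ANL")
partial = parse_intent("I want to send data to my SuperComputer at NERSC by 5:00pm today")
```

A non-empty `missing` list is the cue to ask the user a follow-up question, which is the negotiation behavior the later dialogue slides show.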

slide-32
SLIDE 32

iNDIRA’s workings: top-level view

  • Intent parser engine
  • ML to optimize decisions
  • Knowledge base

– Project profiles
– Semantics
– Network services and conditions

  • Backend details:

– Multi-layer provisioning
– Multi-tool connectivity
– No platform dependence (multiple programming languages)

[Diagram: user interface, virtual layer, physical layer]

slide-33
SLIDE 33

Negotiate with Users

INDIRA> Tell me what you want the network to do. Hint: “I want to [connect] [disconnect] services”
Scientist> I want to transfer files from LBL to ANL
INDIRA> Do you have any time constraints with the transfer from LBL to ANL?
Scientist> No
INDIRA> Hello and welcome! Tell me your project name?
Scientist> Bob
INDIRA> Is ‘Bob’ your project username? Y/N
Scientist> N
INDIRA> What is your project username?
Scientist> Project1

INDIRA> Ok. Give me the link to the file to be moved.
Scientist> project_test@LBL/datasource1 to project2@ANL/dataset
INDIRA> Do you have any bandwidth requirements?
Scientist> Yes, maximum
INDIRA> Sorry, you are only allowed to use 5 Mb/s. Do you want to use all of this?
Scientist> No
INDIRA> Thank you. Please contact the network administrator to allow you more bandwidth, as you don’t have permission to go above 5 Mb/s

slide-34
SLIDE 34

Negotiate with Users

INDIRA> Tell me what you want the network to do. Hint: “I want to [connect] [disconnect] services”
Scientist> I want to transfer files from LBL to ANL
INDIRA> Do you have any time constraints with the transfer from LBL to ANL?
Scientist> No
INDIRA> Hello and welcome! Tell me your project name?
Scientist> Bob
INDIRA> Is ‘Bob’ your project username? Y/N
Scientist> N
INDIRA> What is your project username?
Scientist> Project1

INDIRA> Ok. Give me the link to the file to be moved.
Scientist> project_test@LBL/datasource1 to project2@ANL/dataset
INDIRA> Do you have any bandwidth requirements?
Scientist> No
INDIRA> Thank you. I am configuring your transfer to start ‘now’ at 5 GB/s… …..... Congratulations....All Done!

slide-35
SLIDE 35

Exploiting ML

  • Frontend AI:

– Rule-based reasoning and pattern matching from ontologies (RDF/OWL)
– Learn usage patterns so iNDIRA can make better suggestions (pattern recognition)

  • Backend AI:

– Find available links
– Check user requirements are met
– Optimize paths provisioned using ML (graph optimization)
– Learn optimal network behaviors (pattern recognition)

  • Work across multiple network layers
  • No changes done to the multiple tools attached:

– iNDIRA can ‘converse’ with users and translate their needs to NSI/Globus
– Room to explore other ML techniques to improve iNDIRA’s performance
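
The backend steps "find available links" and "check user requirements are met" can be sketched as a constrained path search: drop every link that cannot carry the requested bandwidth, then search what remains. The link table, the bandwidth figures and the function name `find_path` are assumptions for illustration; this plain breadth-first search stands in for whatever optimization the tool actually applies.

```python
from collections import deque

def find_path(links, src, dst, min_gbps):
    """Fewest-hop path that uses only links with at least min_gbps available."""
    # Step 1: keep only links that satisfy the bandwidth requirement.
    adj = {}
    for (a, b), available in links.items():
        if available >= min_gbps:
            adj.setdefault(a, []).append(b)
            adj.setdefault(b, []).append(a)
    # Step 2: breadth-first search over the remaining links.
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nbr in adj.get(node, []):
            if nbr not in prev:
                prev[nbr] = node
                queue.append(nbr)
    return None  # no path meets the requirement

# Hypothetical available bandwidth per link, in Gb/s.
LINKS = {("LBL", "ANL"): 2, ("LBL", "FNL"): 10, ("FNL", "ANL"): 10}

direct = find_path(LINKS, "LBL", "ANL", min_gbps=1)
detour = find_path(LINKS, "LBL", "ANL", min_gbps=5)
refused = find_path(LINKS, "LBL", "ANL", min_gbps=20)
```

A `None` result corresponds to the case where the tool has to go back to the user and renegotiate the request.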


slide-36
SLIDE 36

Towards a Self-healing Network

Project 2: Initial phases

slide-37
SLIDE 37

Building Self-healing Networks (infrastructure)

  • ML algorithms explored across multiple layers

– Behavior forecast and anomaly detection using DL
– Simple classification for traffic patterns

  • Recovery phase: reactive or rule-based systems (if.. then)
  • Bring it all together to solve one problem
  • Objective: keeping the network as ‘stable’ as possible

[Diagram: sites LBL, FNL, ANL, CRN; telemetry; learning phase (training data sets, classification, etc.); real-time monitoring; anomaly detection; behavior forecasting; recovery phase (action plan)]
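
The three phases (learn, monitor, recover) can be pictured end to end with a deliberately simple stand-in. The slides propose deep learning for forecasting and anomaly detection; the sketch below swaps in a z-score test against a learned baseline, plus hypothetical if..then recovery rules. All names, thresholds and values are assumptions.

```python
from statistics import mean, stdev

def learn_baseline(history):
    """Learning phase: summarize normal behavior as (mean, standard deviation)."""
    return mean(history), stdev(history)

def is_anomaly(value, baseline, threshold=3.0):
    """Monitoring phase: flag values more than `threshold` deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

def recovery_action(metric, anomalous):
    """Recovery phase: rule-based (if..then) action plan; the rules are hypothetical."""
    if not anomalous:
        return "none"
    if metric == "loss":
        return "reroute"
    return "alert-engineer"

# Hypothetical packet-loss history, in packets per second.
baseline = learn_baseline([10.0, 11.0, 9.0, 10.0, 10.0])
normal = is_anomaly(10.5, baseline)
spike = is_anomaly(20.0, baseline)
action = recovery_action("loss", spike)
```

The structure matters more than the detector: any model that maps telemetry to a yes/no anomaly flag can be dropped in where `is_anomaly` sits.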

slide-38
SLIDE 38

High-level Research Questions

  • What does it mean for a network to be ‘Stable’?
  • What tools or devices do we have control over to help automate recovery?
  • What are the suitable machine learning algorithms for us?

– Real-time anomaly detection: need quick response
– Data collection issues: 1s versus 30s intervals

  • How often do we do training?

– Networks are extremely dynamic; we want to limit processing for optimum results

  • Cost of processing data: can we speed up processing so that we can react quickly?
  • This is a long term goal!

Complex Problem that needs to be broken down and solved as individual parts!
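
The 1s versus 30s data-collection question can be made concrete: averaging over a longer interval smooths away short spikes that an anomaly detector would need to see. The series below is invented for illustration and `downsample` is a hypothetical helper.

```python
def downsample(series, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    out = []
    for i in range(0, len(series), window):
        chunk = series[i:i + window]
        out.append(sum(chunk) / len(chunk))
    return out

# Hypothetical 1-second utilization samples (Gb/s): steady traffic with a one-second burst.
one_second = [1.0] * 29 + [31.0]
thirty_second = downsample(one_second, 30)

peak_at_1s = max(one_second)
peak_at_30s = max(thirty_second)
```

At 1 s resolution the burst reads 31 Gb/s; at 30 s it shows up only as a 2 Gb/s average. Finer sampling keeps the signal but multiplies the data to store and process, which is the trade-off raised above.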

slide-39
SLIDE 39

Leveraging Existing Research

  • Classification and Anomaly detection have extensively been studied.
  • Exploit these, or couple them with more techniques, to understand:

– Network behavior
– Network-user interactions or science use cases

  • Working with multiple data sets to find relationships
  • Understanding what can be automated or not?
  • Computational aspect: Cloud, HPC and GPU processing times…


slide-40
SLIDE 40

Many Data Sets

  • perfSONAR logs: throughput, loss, utilization
  • SNMP data: ingress, egress
  • Netflow data: more packet data (TCP, UDP, etc.)

  • And much more…
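
Before any of these data sets reach an ML model they usually need converting into comparable features. As one example, two successive SNMP octet-counter readings can be turned into link utilization. This sketch assumes 64-bit counters and a known link speed; the function name and the sample numbers are illustrative.

```python
def utilization(octets_prev, octets_now, interval_s, link_bps, counter_bits=64):
    """Fraction of link capacity used between two octet-counter readings.

    The modulo handles a counter that wrapped around between the two polls.
    """
    delta_octets = (octets_now - octets_prev) % (2 ** counter_bits)
    return delta_octets * 8 / (interval_s * link_bps)

# Hypothetical readings 30 s apart on a 100 Gb/s link.
u = utilization(0, 37_500_000_000, interval_s=30, link_bps=100_000_000_000)

# Same calculation across a counter wrap.
wrapped = utilization(2 ** 64 - 1000, 1000, interval_s=1, link_bps=16_000)
```

The first call gives 0.1 (10% utilization). The second shows the wrap case: 2000 octets were transferred even though the raw counter value went down.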

[Figure: A. Chabbra results for sites LBL, FNL, ANL and CRN]

Detecting performance anomalies in real-time

Cannot predict traffic patterns yet….
slide-41
SLIDE 41

Building Self-healing Networks

  • Bring it all together to solve one problem, like a puzzle
  • Objective: keeping the network as ‘stable’ as possible

[Diagram: sites LBL, FNL, ANL, CRN; telemetry; learning phase (training data sets, classification, etc.); real-time monitoring; anomaly detection; behavior forecasting; recovery phase (action plan)]

slide-42
SLIDE 42

Summary: The Future?

  • If you are really interested: “2001: A Space Odyssey”
  • Abstract:

– HAL 9000 says “There is a 70% chance a link will fail in the next 48 hours”
– Humans reply “No, our results don’t show that”
– HAL 9000 says “Human error!”

  • So who was correct?
  • AI has advanced and shown some promise: Used in a lot of applications in various fields
  • Solve some current problems:

– Our focus is on automation and optimization

  • ESnet is leading efforts in network (research, data, tools, expertise and complex apps)
  • NERSC are leading machine learning with HPC, software, scaling methods, more
  • Combining techniques (and algorithms) to advance research:

– New areas in network and perhaps even more


slide-43
SLIDE 43

Contact

  • Thank you!
  • Summer internships/students
  • Feel free to reach out for more information/collaboration/ideas:
  • <Mkiran@es.net>
