PICKS AND SHOVELS: AI DATA PIPELINES IN THE REAL WORLD



SLIDE 1

PICKS AND SHOVELS: AI DATA PIPELINES IN THE REAL WORLD

ML4HPC Workshop - March 2020

Paolo Faraboschi, VP and HPE Fellow Artificial Intelligence Lab Hewlett Packard Labs

SLIDE 2

PICKS AND SHOVELS?

During the California Gold Rush (160 years ago),

  • Only a few miners struck it rich.

The most consistent business came from providing the picks and shovels to the miners.

  • Data prep tools (good old ETL, clean up the mess)
  • Auto-ML tools (help pick the right model)
  • Workflow including complex edge-to-cloud pipelines
  • Post-deployment ML/Ops tools (model verification, profiling, drift detection, explainability, etc.)
  • Underlying data layers (file systems, burst buffers, ...)

Source: hitsorynotes.info

SLIDE 3

TALK OUTLINE

  • Edge-to-cloud AI example: Autonomous Driving
  • AI for IT operations: AI/Ops
  • I/O challenges for next-generation accelerators

SLIDE 4

  • 2021 delivery
  • More than 1 EF/s
  • Future Intel Xeon CPU and Intel Xe architecture
  • Slingshot interconnect
  • Mixed AI and HPC workload

  • 2021 delivery
  • More than 1.5 EF/s
  • Future AMD EPYC CPU and Radeon GPU
  • Slingshot interconnect
  • Mixed AI and HPC workload

  • Early 2023 delivery
  • More than 2.0 EF/s
  • Future AMD EPYC CPU and Radeon GPU
  • Slingshot interconnect
  • AI-assisted mission HPC workload
SLIDE 5

BEYOND EXASCALE: AI FOR SCIENCE

Source: AI for science DoE TownHall

SLIDE 6

SCIENCE CHALLENGES EVERY ASPECT OF AI

Source: AI for science DoE TownHall

SLIDE 7

AI AT HEWLETT PACKARD LABS

  • How to navigate the AI ecosystem today? Enterprise customers' AI journey: AI models, platforms, data pipelines
  • How to enable AI at the edge? Edge-to-core AI computing, AI for IT operations, Swarm Learning (federated)
  • How to approach data-driven science? AI for Science (US DoE), combining AI + simulation, exascale computing
  • Can compute keep up with AI? Unconventional accelerators (DPE PUMA analog computing)

SLIDE 8

THE ENTERPRISE AI JOURNEY

SLIDE 9

THE ENTERPRISE AI JOURNEY

Where enterprises are on the AI journey (for image and voice applications):

  • Early (50%): how, data, people
  • Proof of Concept (30%): decide, infrastructure, other use cases
  • Production (20%): scale, ethics, strategy

SLIDE 10

FROM POC TO PRODUCTION

From PoC to production: weeks, months, years... or never.

The ML code is surrounded by everything else in the pipeline: Model Optimization, Data Ingestion, Data Federation, Data Transformation, Data Verification, Monitoring, Analysis Tools, Pipeline Buildup, Inference Optimization, Model Management, Model Testing, Model Retraining, Scalability, Container Orchestration, Tools and Libraries, Accelerators needed, Storage Performance, Compute Performance, Management Software, Security, Users Access, Machine Resource Management, Data movement, Pipeline deployment, Edge Infra integration.

SLIDE 11

BARRIERS TO MOVING AT AI SPEED

  • Unprecedented volumes of data
  • Operationalizing AI is complex
  • Lack of AI talent and skilled resources
  • Ever-changing, expanding open source ecosystem
  • Unpredictable costs and capacity needs
  • Scalable infrastructure optimized for AI and ML
  • Data protection, security, governance and privacy
  • Data across multiple silos

SLIDE 12

A TYPICAL AI JOURNEY

Layers: AI Models and Applications (integrators, ISVs, users); Orchestration; Infrastructure (software, compute, storage, networking; container or bare metal); Support and Services; on-premise and multi-cloud.

CONTENT DELIVERY

1. The user selects AI models/applications
  • System integrator, ISV, or customer-built
2. Requests the Orchestration plane to implement:
  • Cluster and pipeline deployment
  • Model/container/resource management
  • Infrastructure requirements summary
3. Compute/networking/storage can be on-premise or cloud (multi-cloud) based
4. Infrastructure SW implements changes and deployment of HW
5. Orchestration plane leverages Infrastructure SW for monitoring
6. Application is ready to use
7. User work can begin
A. The Orchestration plane interacts with the Infrastructure plane to determine if HW can be optimized
SLIDE 13

DELIVERING ON AI

Solving the Business Problem / Enabling the Business Solution

Four planes: App/Model Plane, Orchestration Plane, Data Plane, Infrastructure Plane. Elements: Optimization Tools, Application Pipeline, Deployment automation, Bare-Metal and Containerized Environments, Vertical Use-case Toolchain, Content Delivery (ISV, integrator, users), Compute/Storage/Fabric, Software Intelligent Service.

The same concerns as the PoC-to-production map surround the machine learning code: Model Optimization, Data Ingestion, Data Federation, Data Transformation, Data Verification, Pipeline Buildup, Inference Optimization, Model Management, Model Testing, Model Retraining, Scalability, Container Orchestration, Security, Users Access, Data movement, Pipeline deployment, Accelerators needed, Storage Performance, Compute Performance, Management Software, Resource Management, Edge Infra integration.

Data to solve a business problem; monitoring and analysis tools; tools and libraries.

SLIDE 14

EDGE-TO-CLOUD AI PIPELINES: A HIGHLY AUTOMATED DRIVING (HAD) EXAMPLE

SLIDE 15

AUTONOMOUS DRIVING

SLIDE 16

THE GENERIC AD VEHICLE AND ENVIRONMENT

Levels of Autonomy

  • Level 1: Acceleration/deceleration; driver handles most functions, hands on steering wheel
  • Level 2: Steering + acceleration/deceleration; hands on steering wheel; lane changes
  • Level 2+: Hands off + eyes on; steering, acceleration/deceleration + lane changes
  • Level 3: Hands off + eyes off (available to take over); steering, acceleration/deceleration + lane changes
  • Level 4: Hands and eyes off + mind off; fully automated driving under limited conditions and places
  • Level 5: No steering wheel; fully automated driving under ALL conditions

Black-box autonomous vehicle operation: input data processing (radar, cameras, GPS) feeds surroundings detection, road and lane detection, and route planning, which drive vehicle action through the ABS brake, steering, and powertrain controllers. Vehicle control and communication (RF uplink, Bluetooth, Wi-Fi) covers safety, climate control, destination, infotainment, convenience, vehicle logistics, and in-vehicle monitoring, backed by storage, OOB management, and security. Vehicle-manufacturer-specific systems sit alongside the data system.

SLIDE 17

AUTONOMOUS DRIVING: AI TRAINING DATA PIPELINE

Pipeline, edge to cloud: collecting data (vehicle) → moving data → ingestion station → data center (ground truth, simulation, single events).

Vehicle Data Rate (Gbps)   Data per 8h Shift (TB)   Fleet Size   Data per Shift (PB)
 5                          18                      80           1.4
10                          36                      80           2.8
20                          72                      80           5.7
30                         108                      80           8.6
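The table's figures follow directly from the data rate. A quick sketch (helper names are mine; shift length and fleet size are the table's) reproduces the table to within rounding:

```python
def shift_data_tb(rate_gbps: float, shift_hours: float = 8) -> float:
    """Data captured per vehicle per shift, in TB (1 TB = 1000 GB)."""
    gigabits = rate_gbps * shift_hours * 3600
    return gigabits / 8 / 1000  # bits -> bytes -> TB

def fleet_data_pb(rate_gbps: float, fleet_size: int = 80) -> float:
    """Data captured by the whole fleet per shift, in PB."""
    return shift_data_tb(rate_gbps) * fleet_size / 1000

for rate in (5, 10, 20, 30):
    print(f"{rate:2d} Gbps -> {shift_data_tb(rate):5.0f} TB/vehicle, "
          f"{fleet_data_pb(rate):4.2f} PB/fleet shift")
```

At 5 Gbps this gives 18 TB per vehicle and 1.44 PB for an 80-vehicle fleet, matching the first table row.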

SLIDE 18

DATACENTER OPERATIONAL INTELLIGENCE: AI/OPS

SLIDE 19

AI-OPS: DATACENTER OPERATIONAL INTELLIGENCE

  • Large number of metrics (thousands): do not know where to look
  • Large number of threshold-based rules is not manageable; many false positives
  • Some anomalies are identifiable only in high-dimensional space (multiple metrics)
  • Broad range of problems (beyond anomaly detection): anomaly detection (single metric / multi-metric), preventive maintenance, performance prediction, optimization (digital twin)

SLIDE 20

PROBLEMS WE ARE TACKLING

  • #Metrics: thousands; #Data points: millions per minute
  • Metric diversity: stationarity, modality, irregularities, sparseness
  • Pre-processing: trend removal, normalization
  • Algorithm selection
  • Post-processing: information fusion, correlation, root-cause analysis
  • Optimization: Power Usage Effectiveness

Solution: build automated end-to-end anomaly detection at scale.
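As a minimal sketch of the single-metric case (function names, window size, and the injected event are illustrative assumptions, not HPE's actual pipeline): remove the trend, compute a rolling z-score, and flag points whose score stands out.

```python
import numpy as np

def anomaly_scores(x: np.ndarray, window: int = 60) -> np.ndarray:
    """Rolling z-score after simple linear trend removal (illustrative only)."""
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, 1)  # pre-processing: detrend
    resid = x - (slope * t + intercept)
    scores = np.zeros(len(x))
    for i in range(window, len(x)):
        w = resid[i - window:i]
        std = w.std() or 1e-9  # guard against a flat window
        scores[i] = abs(resid[i] - w.mean()) / std
    return scores

# Synthetic metric: stationary noise with one injected spike (an "event")
rng = np.random.default_rng(0)
x = 19.0 + 0.1 * rng.standard_normal(500)
x[400] += 1.0  # injected anomaly
s = anomaly_scores(x)
print("max score at index", int(s.argmax()))
```

The real system layers information fusion and root-cause analysis on top of per-metric scores like these.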

SLIDE 21

EXAMPLE: NREL COOLING TOWER VALVE FAILURE

Chart: facility inlet temperature (C) and anomaly score vs. number of measurements, with the anomaly decision threshold at 19.8 C.

Event detected 2015-05-27 10:06:00 (anomaly score 12.42, temperature 19.80 C); event reported 2015-05-27 10:11:00.

SLIDE 22
AIOPS SUMMARY

  • Improve data center resiliency and efficiency
  • Collaboration with NREL
  • Today: anomaly detection for single/multiple metrics
  • Tomorrow: predictive analytics, autonomous control
  • Even with multiple dashboards and monitoring, events can still be missed
  • Advanced AI/ML data analytics: no need for thresholds or fixed temporal resolutions
  • An event was predicted 5 minutes before NREL identified it

SLIDE 23

FEEDING THE BEAST: I/O CHALLENGES IN ACCELERATED AI

SLIDE 24

FEEDING THE DATA TO AI TRAINING WILL BE A BOTTLENECK

  • Economically infeasible to keep the whole dataset in DRAM during the training process
  • Traversing PBs of data by training over billions of samples requires a lot of I/O bandwidth
  • As computation keeps increasing, the challenge is to provide enough bandwidth without overprovisioning capacity

The life-cycle of a dataset for supervised training; internal data flow for a training task.
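To see why bandwidth, not capacity, is the constraint: a training node consumes read bandwidth proportional to its sample throughput, regardless of dataset size. A back-of-the-envelope sketch (all numbers are illustrative assumptions):

```python
def required_read_gbs(samples_per_sec: float, sample_mb: float) -> float:
    """Sustained read bandwidth (GB/s) needed to feed one training node."""
    return samples_per_sec * sample_mb / 1000

# Illustrative: 2,000 images/s at 0.5 MB per decoded sample
bw = required_read_gbs(2000, 0.5)
print(f"{bw:.1f} GB/s per node")  # 1.0 GB/s
```

Faster accelerators raise samples/s, and the required bandwidth scales with them, even if the dataset never grows.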

SLIDE 25

SHARED STORAGE

  • Inspired by the burst buffer architecture in HPC
  • Compute nodes (CNs) and data nodes (IONs) connected via a high-speed network (such as Slingshot)
  • Fast SSDs in the IONs serve as a performance tier
  • Size the ratio of CNs/IONs to optimize cost/performance

Flexibility in configuration, allocation of resources, provisioning of capacity and bandwidth, and ease of data management and sharing across nodes.

Interference on the network caused by:

  • the I/O traffic between CNs and IONs, and
  • communication traffic between CNs when synchronizing at the end of each training iteration

Long tail effect: smaller achievable I/O bandwidth on some CNs can slow down overall training.

SLIDE 26

PRIVATE STORAGE

  • Customized server for local caching of datasets
  • Abundant PCIe CPU lanes with GPU-direct storage match the bandwidth needs of accelerators using NVMe disks, reducing costs compared to DRAM
  • Enables caching of tens-of-TB datasets in local flash per node

Accelerators have exclusive and fair access to the NVMe devices, so linear scaling is easier to achieve.

Minimal network traffic due to the dataset during the training phase. The exclusive use of SSDs provides more consistent and predictable performance, so a slow node is less likely.

The cost is additive, but can be controlled by adding limited storage devices to the compute node.

Fixed ratio of I/O capacity/bandwidth to accelerators.

Datasets have to be striped across SSDs to achieve target bandwidth.

SLIDE 27

The next generation of AI accelerators is packing enormous compute:

Graphcore
  • 16 cores per Colossus IPU, 1,216 IPUs/chip + dataflow
  • 100 GF16/IPU: ~120 TF16/s

Groq
  • 144-way VLIW with 320-wide vector ops @ 900 MHz
  • 1,000 8b TOPS per chip (~500 TF16/s)

Habana (to be acquired by Intel)
  • 8 Tensor Cores + 4 HBM + RoCE
  • Trains 1,650 ResNet-50 images/s @ 140 W (~800 TF16/s)

Cerebras
  • 400,000 cores per wafer
  • No performance numbers yet, but certainly >> 2,000 TF16/s

Source: Cerebras

SLIDE 28

Chart: performance (Flop16/s) vs. arithmetic intensity (Flop16/byte), with performance scaling up to 100 PF16/s.
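The roofline relation behind this chart: sustained performance is capped at bandwidth times arithmetic intensity, so the I/O bandwidth needed to keep an accelerator busy is its performance divided by the workload's intensity. A small sketch (the accelerator and intensity figures below are illustrative, not from the chart):

```python
def io_bandwidth_gbs(perf_tf16: float, intensity_flop_per_byte: float) -> float:
    """Data bandwidth (GB/s) needed to sustain perf at a given arithmetic intensity."""
    return perf_tf16 * 1e12 / intensity_flop_per_byte / 1e9

# A hypothetical 500 TF16/s accelerator at an intensity of 10,000 FLOP/byte
print(f"{io_bandwidth_gbs(500, 10_000):.0f} GB/s")  # -> 50 GB/s
```

At the multi-PF16/s scale of the chart, the same arithmetic pushes required bandwidth well beyond what today's storage tiers deliver per node.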

SLIDE 29

WRAPPING UP: THREE KEY TAKEAWAYS

1. Data pipelines are the biggest impediment to real-world AI deployments in the enterprise.
2. We have learned how to help customers with true edge-to-cloud AI pipelines (for example, HAD). The edge deployment component (and corresponding data management) is the toughest challenge.
3. With the upcoming breed of AI accelerators, the data pipeline problem is going to get a lot worse: be prepared for a new data architecture.

SLIDE 30

Thank you - questions? paolo.faraboschi@hpe.com