PICKS AND SHOVELS: AI DATA PIPELINES IN THE REAL WORLD
ML4HPC Workshop - March 2020
Paolo Faraboschi, VP and HPE Fellow, Artificial Intelligence Lab, Hewlett Packard Labs

PICKS AND SHOVELS?
During the California Gold Rush (160 years ago), the most consistent business came from providing the picks and shovels to the miners.
- Data prep tools (good old ETL, clean up the mess)
- Auto-ML tools (help pick the right model)
- Workflow, including complex edge-to-cloud pipelines
- Post-deployment ML/Ops tools (model verification, profiling, drift detection, explainability, etc.)
- Underlying data layers (file systems, burst buffers, …)

Source: hitsorynotes.info
- Edge-to-cloud AI example: Autonomous Driving
- AI for IT operations: AI/ops
- I/O challenges for next-generation accelerators
[Figure: system architecture with Radeon GPUs. Source: AI for Science DoE Town Hall]
- How to navigate the AI ecosystem today?
- How to enable AI at the edge?
- How to approach data-driven science?
- Can compute keep up with AI?
- Enterprise customers' AI journey: AI models, platforms, data pipelines
- Edge-to-core: AI computing, AI for IT operations, Swarm Learning (federated)
- AI for Science (US DoE): combining AI + simulation, exascale computing
- Unconventional accelerators: DPE/PUMA (analog computing)
[Figure: facets of the enterprise AI decision: strategy, use cases, how to decide, data, people, infrastructure, scale, ethics]
ML code sits at the center of a much larger pipeline:
- Data: ingestion, federation, transformation, verification
- Models: optimization, management, testing, retraining, inference optimization
- Pipeline: buildup, deployment, monitoring and analysis tools
- Platform: scalability, container orchestration, tools and libraries, accelerators needed, storage performance, compute performance, management software, security, user access, machine resource management, data movement, edge infrastructure integration
- Unprecedented volumes of data
- Operationalizing AI is complex
- Lack of AI talent and skilled resources
- Ever-changing, expanding …
- Unpredictable costs and capacity needs; need for scalable infrastructure
- Data protection, security, governance and privacy
- Data spread across multiple silos
The solution stack, top to bottom:
- AI models and applications
- Content delivery: integrators, ISVs, users
- Orchestration
- Infrastructure: software, compute, storage, networking; container or bare metal
- Support and services
- On-premise and multi-cloud
CONTENT DELIVERY
1. The user selects AI models/applications.
2. The user requests the orchestration plane to implement them.
3. The orchestration plane interacts with the infrastructure plane to determine if HW can be allocated.
4. Compute/networking/storage can be on-premise.
5. Infrastructure SW implements the changes and the deployment of HW.
6. The orchestration plane leverages infrastructure SW for monitoring.
7. The application is ready to use; user work can begin.
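The request flow above can be sketched as a toy orchestration loop. All class and method names here are illustrative assumptions, not an actual product API:

```python
# Toy sketch of the content-delivery flow: the orchestration plane checks
# hardware availability with the infrastructure plane, triggers deployment,
# and monitors it before handing the application to the user.
# All names are hypothetical; this is not a real HPE interface.

class InfrastructurePlane:
    def can_allocate(self, request):
        # Determine whether the requested HW fits (illustrative limit)
        return request.get("gpus", 0) <= 8

    def deploy(self, request):
        # Infrastructure SW implements the changes and HW deployment
        return {"status": "deployed", **request}

    def monitor(self, deployment):
        # Expose deployment health back to the orchestration plane
        return deployment["status"] == "deployed"


class OrchestrationPlane:
    def __init__(self, infra):
        self.infra = infra

    def provision(self, request):
        if not self.infra.can_allocate(request):
            raise RuntimeError("insufficient hardware")
        deployment = self.infra.deploy(request)
        if not self.infra.monitor(deployment):
            raise RuntimeError("deployment unhealthy")
        return "application ready"  # user work can begin


orchestrator = OrchestrationPlane(InfrastructurePlane())
print(orchestrator.provision({"model": "image-classifier", "gpus": 4}))
```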
Solving the Business Problem / Enabling the Business Solution
The solution spans four planes:
- App/model plane: vertical use cases, application toolchain; content delivery by ISVs, integrators, and users
- Data plane: the data pipeline
- Orchestration plane: optimization tools, deployment automation, intelligent services
- Infrastructure plane: compute, storage, and fabric software; bare-metal and containerized environments, environment tools
Data to Solve a Business Problem

The same elements recur: machine learning code surrounded by data ingestion, federation, transformation, and verification; model optimization, management, testing, and retraining; inference optimization; pipeline buildup and deployment; monitoring and analysis tools, tools and libraries; scalability, container orchestration, security, user access, data movement, accelerators needed, storage and compute performance, management software, resource management, and edge infrastructure integration.
Level 1: Most functions with the driver; the system assists with acceleration/deceleration.
Level 2: Hands on the steering wheel; lane changes, steering, acceleration/deceleration.
Level 2+: Hands off + eyes on; steering, acceleration/deceleration + lane changes.
Level 3: Hands off + eyes off (driver available to take over); steering, acceleration/deceleration + lane changes.
Level 4: Hands and eyes off + mind off; fully automated driving under limited conditions and places.
Level 5: No steering wheel; fully automated driving under ALL conditions.
[Figure: black-box autonomous vehicle architecture. Input data processing from radar, cameras, and GPS feeds road and lane detection, surroundings detection, and route planning; vehicle action goes through steering, powertrain, and ABS/brake controllers. Vehicle control and communication (RF uplink, Bluetooth, Wi-Fi) covers destination, infotainment, climate control, convenience, vehicle logistics, and in-vehicle monitoring, alongside operation safety, storage, OOB management, and security. Manufacturer-specific systems connect to the data system.]
Edge-to-cloud pipeline: collecting data and moving data in the vehicle, the ingestion station, then ground truth, simulation, and single-event analysis in the data center.
Vehicle Data Rate (Gbps) | Data per 8h Shift (TB) | Fleet Size | Data per Shift (PB)
5                        | 18                     | 80         | 1.4
10                       | 36                     | 80         | 2.8
20                       | 72                     | 80         | 5.7
30                       | 108                    | 80         | 8.6
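The rows of this table follow from a straightforward unit conversion; a short check (assuming exactly 8-hour shifts and the 80-vehicle fleet from the table; the slide's PB column truncates to one decimal):

```python
def data_per_shift(rate_gbps, shift_hours=8, fleet_size=80):
    """Return (TB per vehicle per shift, PB per fleet per shift)."""
    # Gbps -> GB/s (divide by 8), times seconds per shift, then GB -> TB
    tb_per_vehicle = rate_gbps / 8 * shift_hours * 3600 / 1000
    pb_per_fleet = tb_per_vehicle * fleet_size / 1000
    return tb_per_vehicle, pb_per_fleet

for rate in (5, 10, 20, 30):
    tb, pb = data_per_shift(rate)
    print(f"{rate:>2} Gbps -> {tb:5.0f} TB/vehicle, {pb:.2f} PB/fleet")
```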
AI for IT operations:
- Large number of metrics (thousands): you do not know where to look.
- A large number of threshold-based rules is not manageable and yields many false positives.
- Some anomalies are identifiable only in high-dimensional space (multiple metrics).
- Broad range of problems beyond anomaly detection: anomaly detection (single-metric / multi-metric), preventive maintenance, performance prediction, optimization (digital twin).
[Figure: facility inlet temperature (°C) vs. number of measurements, with anomaly score and anomaly decision threshold. Event detected 2015-05-27 10:06:00 (anomaly score 12.42, temperature 19.80 °C); event reported 2015-05-27 10:11:00. A fixed threshold of 19.8 °C was learned during the training process.]
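A single-metric anomaly score like the one in the temperature example above can be sketched as a rolling z-score; the window size and decision threshold below are illustrative assumptions, not values from the slides:

```python
from collections import deque

class RollingAnomalyScorer:
    """Score each new sample by its distance from a rolling baseline.

    A minimal sketch of single-metric anomaly scoring; window size and
    decision threshold are illustrative, not taken from the slides.
    """

    def __init__(self, window=60, threshold=6.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def score(self, x):
        if len(self.window) < 2:
            self.window.append(x)
            return 0.0
        mean = sum(self.window) / len(self.window)
        var = sum((v - mean) ** 2 for v in self.window) / len(self.window)
        std = var ** 0.5 or 1e-9      # guard against a zero-variance window
        self.window.append(x)
        return abs(x - mean) / std    # anomaly score in standard deviations

    def is_anomaly(self, x):
        return self.score(x) > self.threshold

scorer = RollingAnomalyScorer()
stream = [18.0] * 50 + [19.8]          # stable readings, then a spike
flags = [scorer.is_anomaly(v) for v in stream]
print(flags[-1])                       # the spike is flagged
```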
The life-cycle of a dataset for supervised training; internal data flow for a training task.
Option 1: shared storage accessed via a high-speed network (such as Slingshot).
- Pros: flexibility in configuration, allocation of resources, provisioning of capacity and bandwidth, and ease of data management and sharing across nodes.
- Cons: interference on the network caused by …; long-tail effect: smaller achievable I/O bandwidth on some compute nodes can slow down overall training.

Option 2: node-local NVMe, matching the bandwidth needs of accelerators with NVMe disks and reducing costs compared to DRAM.
- Pros: accelerators have exclusive and fair access to the NVMe devices, so linear scaling is easier to achieve; minimal network traffic due to the dataset during the training phase; the exclusive use of SSDs provides more consistent and predictable performance, so a straggler node is less likely.
- Cons: the cost is additive, but can be controlled by adding a limited number of storage devices per compute node; fixed ratio of I/O capacity/bandwidth to accelerators; datasets have to be striped across SSDs to achieve the target bandwidth.
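The long-tail effect can be made concrete with a toy model: in synchronous data-parallel training, every step waits for the slowest node's I/O, so a single node with degraded bandwidth gates the whole job. All numbers below are made up for illustration:

```python
def steps_per_second(node_bandwidths_gbps, gb_read_per_node_per_step):
    """Synchronous data-parallel job: each step waits on the slowest node."""
    step_time = max(gb_read_per_node_per_step / bw
                    for bw in node_bandwidths_gbps)
    return 1.0 / step_time

healthy = [4.0] * 16            # 16 nodes, 4 GB/s local read each
degraded = [4.0] * 15 + [1.0]   # one node down to 1 GB/s
per_step = 2.0                  # GB of input read per node per step

print(steps_per_second(healthy, per_step))   # 2.0 steps/s
print(steps_per_second(degraded, per_step))  # 0.5 steps/s: a 4x job-wide slowdown
```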
The next generation of AI accelerators is packing enormous compute density:

Graphcore
- 16 cores per Colossus IPU, 1,216 IPUs/chip + dataflow; 100 GF16/IPU: ~120 TF16/s

Groq
- 144-way VLIW with 320-wide vector ops @ 900 MHz; 1,000 8-bit TOPS per chip (~500 TF16/s)

Habana (to be acquired by Intel)
- 8 tensor cores + 4 HBM + RoCE; trains 1,650 ResNet-50 images/s @ 140 W (~800 TF16/s)

Cerebras
- 400,000 cores per wafer; no performance figures yet, but certainly >> 2,000 TF16/s (Source: Cerebras)
[Figure: roofline plot of the accelerators: performance (FLOP16/s) vs. arithmetic intensity (FLOP16/byte)]
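The performance vs. arithmetic-intensity axes above correspond to the roofline model, where attainable performance is the minimum of peak compute and arithmetic intensity times memory bandwidth. The peak and bandwidth figures below are illustrative, not taken from the slides:

```python
def roofline_tflops(peak_tflops, mem_bw_tbps, intensity_flop_per_byte):
    """Attainable TFLOP/s: min(peak compute, intensity * memory bandwidth)."""
    # Below the ridge point the chip is memory-bound; above it, compute-bound
    return min(peak_tflops, intensity_flop_per_byte * mem_bw_tbps)

PEAK = 500.0   # TF16/s peak, illustrative accelerator
BW = 1.0       # TB/s of memory bandwidth, illustrative

for ai in (10, 100, 500, 1000):
    print(f"AI={ai:4d} flop/byte -> {roofline_tflops(PEAK, BW, ai):.0f} TF16/s")
```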