PICKS AND SHOVELS: AI DATA PIPELINES IN THE REAL WORLD



SLIDE 1

PICKS AND SHOVELS: AI DATA PIPELINES IN THE REAL WORLD

ML4HPC Workshop - March 2020

Paolo Faraboschi, VP and HPE Fellow Artificial Intelligence Lab Hewlett Packard Labs

SLIDE 2

PICKS AND SHOVELS?

During the California Gold Rush (160 years ago),

  • Only a few miners struck it rich.

The most consistent business came from providing the picks and shovels to the miners.

  • Data prep tools (good old ETL, clean up the mess)
  • Auto-ML tools (help pick the right model)
  • Workflow including complex edge-to-cloud pipelines
  • Post-deployment ML/Ops tools (model verification, profiling, drift detection, explainability, etc.)
  • Underlying data layers (file systems, burst buffers, ...)

Source: hitsorynotes.info

SLIDE 3

TALK OUTLINE

  • Edge-to-cloud AI example: Autonomous Driving
  • AI for IT operations: AI/Ops
  • I/O challenges for next-generation accelerators

SLIDE 4

  • 2021 delivery
  • More than 1 EF/s
  • Future Intel Xeon CPU and Intel Xe architecture
  • Slingshot interconnect
  • Mixed AI and HPC workload

  • 2021 delivery
  • More than 1.5 EF/s
  • Future AMD EPYC CPU and Radeon GPU
  • Slingshot interconnect
  • Mixed AI and HPC workload

  • Early 2023 delivery
  • More than 2.0 EF/s
  • Future AMD EPYC CPU and Radeon GPU
  • Slingshot interconnect
  • AI-assisted mission HPC workload
SLIDE 5

BEYOND EXASCALE: AI FOR SCIENCE

Source: AI for science DoE TownHall

SLIDE 6

SCIENCE CHALLENGES EVERY ASPECT OF AI

Source: AI for science DoE TownHall

SLIDE 7

AI AT HEWLETT PACKARD LABS

  • How to navigate the AI ecosystem today? Enterprise customers' AI journey: AI models, platforms, data pipelines
  • How to enable AI at the edge? Edge-to-core AI computing, AI for IT operations, Swarm Learning (federated)
  • How to approach data-driven science? AI for Science (US DoE), combining AI + simulation, exascale computing
  • Can compute keep up with AI? Unconventional accelerators (DPE PUMA analog computing)

SLIDE 8

THE ENTERPRISE AI JOURNEY

SLIDE 9

THE ENTERPRISE AI JOURNEY

Where enterprises are on the AI journey (for image and voice applications):

  • Early (50%): how, data, people
  • Proof of Concept (30%): decide, infrastructure, other use cases
  • Production (20%): scale, ethics, strategy

SLIDE 10

FROM POC TO PRODUCTION

From PoC to production: weeks, months, years... or never.

The ML code is surrounded by everything else in the pipeline: Model Optimization, Data Ingestion, Data Federation, Data Transformation, Data Verification, Monitoring, Analysis Tools, Pipeline Buildup, Inference Optimization, Model Management, Model Testing, Model Retraining, Scalability, Container Orchestration, Tools and Libraries, Accelerators needed, Storage Performance, Compute Performance, Management Software, Security, Users Access, Machine Resource Management, Data movement, Pipeline deployment, Edge Infra integration.

SLIDE 11

BARRIERS TO MOVING AT AI SPEED

  • Unprecedented volumes of data
  • Operationalizing AI is complex
  • Lack of AI talent and skilled resources
  • Ever-changing, expanding open source ecosystem
  • Unpredictable costs and capacity needs
  • Scalable infrastructure optimized for AI and ML
  • Data protection, security, governance and privacy
  • Data across multiple silos

SLIDE 12

A TYPICAL AI JOURNEY

Layers: AI Models and Applications (integrators, ISVs, users); Orchestration; Infrastructure (software, compute, storage, networking; container or bare metal); Support and Services; on-premise and multi-cloud.

CONTENT DELIVERY

1. The user selects AI models/applications
  • System integrator, ISV, or customer-built
2. Requests the Orchestration plane to implement:
  • Cluster and pipeline deployment
  • Model/container/resource management
  • Infrastructure requirements summary
3. Compute/networking/storage can be on-premise or cloud (multi-cloud) based
4. Infrastructure SW implements changes and deployment of HW
5. Orchestration plane leverages Infrastructure SW for monitoring
6. Application is ready to use
7. User work can begin
A. The Orchestration plane interacts with the Infrastructure plane to determine if HW can be optimized
SLIDE 13

DELIVERING ON AI

Solving the Business Problem / Enabling the Business Solution

Four planes: App/Model Plane, Orchestration Plane, Data Plane, Infrastructure Plane. Elements: Optimization Tools, Application Pipeline, Deployment automation, Bare-Metal and Containerized Environments, Vertical Use-case Toolchain, Content Delivery (ISV, integrator, users), Compute/Storage/Fabric, Software Intelligent Service.

The same concerns as the PoC-to-production map surround the machine learning code: Model Optimization, Data Ingestion, Data Federation, Data Transformation, Data Verification, Pipeline Buildup, Inference Optimization, Model Management, Model Testing, Model Retraining, Scalability, Container Orchestration, Security, Users Access, Data movement, Pipeline deployment, Accelerators needed, Storage Performance, Compute Performance, Management Software, Resource Management, Edge Infra integration.

Data to solve a business problem; monitoring and analysis tools; tools and libraries.

SLIDE 14

EDGE-TO-CLOUD AI PIPELINES: A HIGHLY AUTOMATED DRIVING (HAD) EXAMPLE

SLIDE 15

AUTONOMOUS DRIVING

SLIDE 16

THE GENERIC AD VEHICLE AND ENVIRONMENT

Levels of Autonomy

  • Level 1: Acceleration/deceleration; driver handles most functions, hands on steering wheel
  • Level 2: Steering + acceleration/deceleration; hands on steering wheel; lane changes
  • Level 2+: Hands off + eyes on; steering, acceleration/deceleration + lane changes
  • Level 3: Hands off + eyes off (available to take over); steering, acceleration/deceleration + lane changes
  • Level 4: Hands and eyes off + mind off; fully automated driving under limited conditions and places
  • Level 5: No steering wheel; fully automated driving under ALL conditions

Black-box autonomous vehicle operation: input data processing (radar, cameras, GPS) feeds surroundings detection, road and lane detection, and route planning, which drive vehicle action through the ABS brake, steering, and powertrain controllers. Vehicle control and communication (RF uplink, Bluetooth, Wi-Fi) covers safety, climate control, destination, infotainment, convenience, vehicle logistics, and in-vehicle monitoring, backed by storage, OOB management, and security. Vehicle-manufacturer-specific systems sit alongside the data system.

SLIDE 17

AUTONOMOUS DRIVING: AI TRAINING DATA PIPELINE

Pipeline, edge to cloud: collecting data (vehicle) → moving data → ingestion station → data center (ground truth, simulation, single events).

Vehicle Data Rate (Gbps)   Data per 8h Shift (TB)   Fleet Size   Data per Shift (PB)
 5                          18                      80           1.4
10                          36                      80           2.8
20                          72                      80           5.7
30                         108                      80           8.6
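The table's figures follow directly from the data rate. A quick sketch (helper names are mine; shift length and fleet size are the table's) reproduces the table to within rounding:

```python
def shift_data_tb(rate_gbps: float, shift_hours: float = 8) -> float:
    """Data captured per vehicle per shift, in TB (1 TB = 1000 GB)."""
    gigabits = rate_gbps * shift_hours * 3600
    return gigabits / 8 / 1000  # bits -> bytes -> TB

def fleet_data_pb(rate_gbps: float, fleet_size: int = 80) -> float:
    """Data captured by the whole fleet per shift, in PB."""
    return shift_data_tb(rate_gbps) * fleet_size / 1000

for rate in (5, 10, 20, 30):
    print(f"{rate:2d} Gbps -> {shift_data_tb(rate):5.0f} TB/vehicle, "
          f"{fleet_data_pb(rate):4.2f} PB/fleet shift")
```

At 5 Gbps this gives 18 TB per vehicle and 1.44 PB for an 80-vehicle fleet, matching the first table row.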

SLIDE 18

DATACENTER OPERATIONAL INTELLIGENCE: AI/OPS

SLIDE 19

AI-OPS: DATACENTER OPERATIONAL INTELLIGENCE

  • Large number of metrics (thousands): do not know where to look
  • Large number of threshold-based rules is not manageable; many false positives
  • Some anomalies are identifiable only in high-dimensional space (multiple metrics)
  • Broad range of problems (beyond anomaly detection): anomaly detection (single metric / multi-metric), preventive maintenance, performance prediction, optimization (digital twin)

SLIDE 20

PROBLEMS WE ARE TACKLING

  • #Metrics: thousands; #Data points: millions per minute
  • Metric diversity: stationarity, modality, irregularities, sparseness
  • Pre-processing: trend removal, normalization
  • Algorithm selection
  • Post-processing: information fusion, correlation, root-cause analysis
  • Optimization: Power Usage Effectiveness

Solution: build automated end-to-end anomaly detection at scale.
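As a minimal sketch of the single-metric case (function names, window size, and the injected event are illustrative assumptions, not HPE's actual pipeline): remove the trend, compute a rolling z-score, and flag points whose score stands out.

```python
import numpy as np

def anomaly_scores(x: np.ndarray, window: int = 60) -> np.ndarray:
    """Rolling z-score after simple linear trend removal (illustrative only)."""
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, 1)  # pre-processing: detrend
    resid = x - (slope * t + intercept)
    scores = np.zeros(len(x))
    for i in range(window, len(x)):
        w = resid[i - window:i]
        std = w.std() or 1e-9  # guard against a flat window
        scores[i] = abs(resid[i] - w.mean()) / std
    return scores

# Synthetic metric: stationary noise with one injected spike (an "event")
rng = np.random.default_rng(0)
x = 19.0 + 0.1 * rng.standard_normal(500)
x[400] += 1.0  # injected anomaly
s = anomaly_scores(x)
print("max score at index", int(s.argmax()))
```

The real system layers information fusion and root-cause analysis on top of per-metric scores like these.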

SLIDE 21

EXAMPLE: NREL COOLING TOWER VALVE FAILURE

Chart: facility inlet temperature (C) and anomaly score vs. number of measurements, with the anomaly decision threshold at 19.8 C.

Event detected 2015-05-27 10:06:00 (anomaly score 12.42, temperature 19.80 C); event reported 2015-05-27 10:11:00.

SLIDE 22
AIOPS SUMMARY

  • Improve data center resiliency and efficiency
  • Collaboration with NREL
  • Today: anomaly detection for single/multiple metrics
  • Tomorrow: predictive analytics, autonomous control
  • Even with multiple dashboards and monitoring, events can still be missed
  • Advanced AI/ML data analytics: no need for thresholds or fixed temporal resolutions
  • An event was predicted 5 minutes before NREL identified it

SLIDE 23

FEEDING THE BEAST: I/O CHALLENGES IN ACCELERATED AI

SLIDE 24

FEEDING THE DATA TO AI TRAINING WILL BE A BOTTLENECK

  • Economically infeasible to keep the whole dataset in DRAM during the training process
  • Traversing PBs of data by training over billions of samples requires a lot of I/O bandwidth
  • As computation keeps increasing, the challenge is to provide enough bandwidth without overprovisioning capacity

The life-cycle of a dataset for supervised training; internal data flow for a training task.
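To see why bandwidth, not capacity, is the constraint: a training node consumes read bandwidth proportional to its sample throughput, regardless of dataset size. A back-of-the-envelope sketch (all numbers are illustrative assumptions):

```python
def required_read_gbs(samples_per_sec: float, sample_mb: float) -> float:
    """Sustained read bandwidth (GB/s) needed to feed one training node."""
    return samples_per_sec * sample_mb / 1000

# Illustrative: 2,000 images/s at 0.5 MB per decoded sample
bw = required_read_gbs(2000, 0.5)
print(f"{bw:.1f} GB/s per node")  # 1.0 GB/s
```

Faster accelerators raise samples/s, and the required bandwidth scales with them, even if the dataset never grows.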

SLIDE 25

SHARED STORAGE

  • Inspired by the burst buffer architecture in HPC
  • Compute nodes (CNs) and data nodes (IONs) connected via a high-speed network (such as Slingshot)
  • Fast SSDs in the IONs serve as a performance tier
  • Size the ratio of CNs/IONs to optimize cost/performance

Flexibility in configuration, allocation of resources, provisioning of capacity and bandwidth, and ease of data management and sharing across nodes.

Interference on the network caused by:

  • the I/O traffic between CNs and IONs, and
  • communication traffic between CNs when synchronizing at the end of each training iteration

Long tail effect: smaller achievable I/O bandwidth on some CNs can slow down overall training.

SLIDE 26

PRIVATE STORAGE

  • Customized server for local caching of datasets
  • Abundant PCIe CPU lanes with GPU-direct storage match the bandwidth needs of accelerators using NVMe disks, reducing costs compared to DRAM
  • Enables caching of tens-of-TB datasets in local flash per node

Accelerators have exclusive and fair access to the NVMe devices, so linear scaling is easier to achieve.

Minimal network traffic due to the dataset during the training phase. The exclusive use of SSDs provides more consistent and predictable performance, so a slow node is less likely.

The cost is additive, but can be controlled by adding limited storage devices to the compute node.

Fixed ratio of I/O capacity/bandwidth to accelerators.

Datasets have to be striped across SSDs to achieve target bandwidth.

SLIDE 27

The next generation of AI accelerators is packing enormous compute:

Graphcore
  • 16 cores per Colossus IPU, 1,216 IPUs/chip + dataflow
  • 100 GF16/IPU: ~120 TF16/s

Groq
  • 144-way VLIW with 320-wide vector ops @ 900 MHz
  • 1,000 8b TOPS per chip (~500 TF16/s)

Habana (to be acquired by Intel)
  • 8 Tensor Cores + 4 HBM + RoCE
  • Trains 1,650 ResNet-50 images/s @ 140 W (~800 TF16/s)

Cerebras
  • 400,000 cores per wafer
  • No performance numbers yet, but certainly >> 2,000 TF16/s

Source: Cerebras

SLIDE 28

Chart: performance (Flop16/s) vs. arithmetic intensity (Flop16/byte), with performance scaling up to 100 PF16/s.
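The roofline relation behind this chart: sustained performance is capped at bandwidth times arithmetic intensity, so the I/O bandwidth needed to keep an accelerator busy is its performance divided by the workload's intensity. A small sketch (the accelerator and intensity figures below are illustrative, not from the chart):

```python
def io_bandwidth_gbs(perf_tf16: float, intensity_flop_per_byte: float) -> float:
    """Data bandwidth (GB/s) needed to sustain perf at a given arithmetic intensity."""
    return perf_tf16 * 1e12 / intensity_flop_per_byte / 1e9

# A hypothetical 500 TF16/s accelerator at an intensity of 10,000 FLOP/byte
print(f"{io_bandwidth_gbs(500, 10_000):.0f} GB/s")  # -> 50 GB/s
```

At the multi-PF16/s scale of the chart, the same arithmetic pushes required bandwidth well beyond what today's storage tiers deliver per node.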

SLIDE 29

WRAPPING UP: THREE KEY TAKEAWAYS

1. Data pipelines are the biggest impediment to real-world AI deployments in the enterprise.
2. We have learned how to help customers with true edge-to-cloud AI pipelines (for example, HAD). The edge deployment component (and corresponding data management) is the toughest challenge.
3. With the upcoming breed of AI accelerators, the data pipeline problem is going to get a lot worse: be prepared for a new data architecture.

SLIDE 30

Thank you - questions? paolo.faraboschi@hpe.com