Project MagLev: NVIDIA’s production-grade AI Platform

SLIDE 1

Project MagLev: NVIDIA’s production-grade AI Platform

Divya Vavili, Yehia Khoja - Mar 21 2019

SLIDE 2

Agenda

  • AI inside of NVIDIA
  • Constraints and scale
  • AI Platform needs
  • Technical solutions
  • Scenario walkthrough
  • MagLev architecture evolution

SLIDE 3

AI inside of NVIDIA

Deep Learning is fueling all areas of business

  • Self-Driving Cars
  • Robotics
  • Healthcare
  • AI Cities
  • Retail
  • AI for Public Good

SLIDE 4

SLIDE 5

Constraints and scale

SDC (self-driving car) scale today at NVIDIA

SLIDE 6

Constraints and scale

What are our requirements?

  • Safety
  • Tons of data!
  • Inference on edge
  • Reproducibility

SLIDE 7

What testing scale are we talking about?

We’re on our way to 100s of PB of real test data = millions of real miles + 1,000s of DRIVE Constellation nodes for offline testing alone & billions of simulated miles

Data and inference to get there?

[Chart: NVIDIA’s data collection (miles), active testing to date (miles), target robustness (miles); data volumes marked at 15PB, 30PB, 60PB, 120PB, 180PB]

  • Real-time test runs in 24h on 400 nodes*
  • 24h test on 1,600 nodes*
  • 24h test on 3,200 nodes*

* DRIVE PEGASUS nodes
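To get a feel for these numbers, here is a back-of-the-envelope sketch of the average throughput each node would need to sustain to replay a given data volume within 24 hours. The slide does not say which data volume goes with which node count, so the pairings below are assumptions made purely for illustration, as is the use of decimal petabytes.

```python
PB = 10**15                      # decimal petabyte in bytes (assumption)
SECONDS_PER_DAY = 24 * 60 * 60


def per_node_throughput(total_bytes, nodes, seconds=SECONDS_PER_DAY):
    """Average bytes per second each node must process to finish in `seconds`."""
    return total_bytes / nodes / seconds


# Hypothetical (data volume, node count) pairings; not stated on the slide.
for petabytes, nodes in [(30, 400), (120, 1600)]:
    rate = per_node_throughput(petabytes * PB, nodes)
    print(f"{petabytes} PB on {nodes} nodes in 24h: {rate / 1e6:.0f} MB/s per node")
```

Under these assumptions, the per-node rate stays constant when data volume and node count grow by the same factor, which is why the node counts have to scale with the data.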

SLIDE 8

The need for an AI platform

An end-to-end solution for industry-grade AI development

  • Scalable AI Training
  • Seamless PB-Scale Data Access
  • AI-based Data Selection/Mining
  • Traceability: model => code + data
  • Workflow Automation
  • PB-Scale AI Testing

Enable the development of AV Perception, fully tested across 1000s of conditions, and yielding failure rates < 1 in N miles, for large N

SLIDE 9

1PB per month

SLIDE 10

The need for an AI platform

Enabling automation of training and testing workflows

Diagram labels:

  • Data Factory
  • TransferPilot
  • Automated Workflows
  • Data Indexing
  • Data Selection
  • Data Labeling
  • Model Training
  • Model Testing
  • Dataset Store [Training & Testing Datasets]
  • Training Workflows [Data preproc, DNN training, pruning, export, fine-tuning]
  • LabelStore [Labels, tags, etc.]
  • NVOrigin [On-demand transcoding]
  • Labeled Datasets
  • SaturnV Storage
  • Model Store [Trained Models]
  • Trained Models
  • Testing Workflows [Nightly tests, re-simulation, etc.]
  • Tested Models
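One way to picture workflow automation is as a dependency graph of stages that a scheduler runs in order. The sketch below is not MagLev's workflow API. It reuses the stage names from the slide, and the dependency edges between them are an assumption chosen for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Stage -> set of stages it depends on.
# Stage names come from the slide; the edges are assumed for illustration.
PIPELINE = {
    "data_indexing": set(),
    "data_selection": {"data_indexing"},
    "data_labeling": {"data_selection"},
    "model_training": {"data_labeling"},
    "model_testing": {"model_training"},
}


def execution_order(pipeline):
    """Return an order in which every stage runs after its dependencies."""
    return list(TopologicalSorter(pipeline).static_order())


order = execution_order(PIPELINE)
print(" -> ".join(order))
```

Declaring stages as data rather than as a hand-written script lets a scheduler resume, parallelize, or re-run parts of the graph without changing the stage code.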

SLIDE 11

So how did we solve for this?

SLIDE 12

Technical solution(s): Safety

  • Non-negotiable primary objective for the passengers

All other engineering requirements stem from this:

  • Models tested on huge datasets so that we can be confident in them
  • Faster iteration that aids in producing extremely good and well-tested models
  • Reproducibility/Traceability

SLIDE 13

Technical solution(s): Tons of data!

  • Collecting enormous amounts of data under innumerable scenarios is key to building good AV models
  • Now that we have data, what next?
  • How do engineers access this data?
  • How do you make sure that the data:
      • can be preprocessed for each team’s need?
      • is not corrupted by other members of the team or across teams?
  • Lifecycle management of data

SLIDE 14

Technical solution(s): Tons of data!

What is the solution? vdisk

  • Virtualized, immutable file system
  • Offers broad platform support
  • Structured to support data deduplication
  • Inherently supports caching
  • Provides Kubernetes integration, making it cloud-native
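The slide names immutability and deduplication as vdisk properties but does not describe how vdisk implements them. A common way to get both is content addressing, where a blob is stored under the hash of its bytes. The toy store below is a minimal sketch of that general technique, with hypothetical class and file names. It is not vdisk's actual design.

```python
import hashlib


class ContentStore:
    """Toy content-addressed blob store: identical bytes are stored once."""

    def __init__(self):
        self._blobs = {}  # sha256 hex digest -> bytes

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # setdefault never overwrites, so a stored blob cannot change.
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

    def __len__(self):
        return len(self._blobs)


store = ContentStore()
# A "volume" here is just a mapping of path -> digest (hypothetical paths).
volume = {
    "frames/0001.bin": store.put(b"same-bytes"),
    "frames/0002.bin": store.put(b"same-bytes"),  # duplicate content
    "labels/0001.json": store.put(b'{"car": 1}'),
}
print(len(volume), "files,", len(store), "unique blobs")
```

Because a digest always refers to the same bytes, blobs can also be cached on any node without invalidation logic.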

SLIDE 15

Technical solution(s): Inference on edge

  • AV model inference is limited by the hardware capabilities available at the edge
  • So, finding a lighter model without losing performance is prudent, and it takes multiple, faster iterations
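The training workflows in this deck list pruning as one step toward a lighter model. The deck does not say which pruning method is used, so the sketch below shows plain magnitude pruning as one well-known option. The function name and weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction zeroed."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_drop = int(len(weights) * sparsity)
    # Indices ordered from the weight closest to zero to the largest magnitude.
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:n_drop])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


layer = [0.9, -0.05, 0.3, 0.01]  # hypothetical layer weights
pruned = magnitude_prune(layer, sparsity=0.5)
print(pruned)
```

In practice each pruning pass is followed by re-evaluation, and often fine-tuning, which is one reason fast iteration matters here.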

SLIDE 16

Technical solution(s): Reproducibility

  • … and traceability go hand in hand

Why?

  • Being able to run a 10-year-old workflow and get the same results
  • Faster iteration of model development
  • Understand why a model behaved a certain way

Requires:

  • Proper version control of datasets, models and the experiments
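A minimal sketch of what "model => code + data" traceability can look like: each run is recorded as an immutable tuple of code version, dataset version and hyperparameters, and fingerprinted. The record fields, helper name and version strings are hypothetical and are not MagLev's schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExperimentRecord:
    code_commit: str        # e.g. a source-control commit id
    dataset_version: str    # e.g. an immutable dataset version id
    hyperparameters: tuple  # sorted (name, value) pairs

    def fingerprint(self) -> str:
        """Stable hash over everything that determines the run."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def make_record(code_commit, dataset_version, **hyperparameters):
    # Sorting makes the record independent of keyword order.
    return ExperimentRecord(
        code_commit, dataset_version, tuple(sorted(hyperparameters.items()))
    )


run_a = make_record("abc123", "dataset-v1", lr=0.01, epochs=5)
run_b = make_record("abc123", "dataset-v1", epochs=5, lr=0.01)
run_c = make_record("abc123", "dataset-v2", lr=0.01, epochs=5)
print(run_a.fingerprint() == run_b.fingerprint())
print(run_a.fingerprint() == run_c.fingerprint())
```

Storing the fingerprint alongside a trained model lets anyone look up exactly which code and data produced it, provided the referenced dataset version itself never changes.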

SLIDE 17

MagLev: Scenario walkthrough

  • Predicting 12-month mortgage delinquency using Fannie Mae single-family home loan data

Key points:

  • Immutable dataset creation
  • Specifying workflows and launching them
  • End-to-end traceability

SLIDE 18

MagLev: Scenario walkthrough

  • Creating an immutable dataset
      • Creates an ISO image
      • The ISO image contains only the metadata for the dataset, while the actual dataset resides in S3

>> maglev volumes create --name <my-volume> --path </some/local/directory/path> [--resume-version <version>]

Creating volume: Volume(name = my-volume, version = 449c8efa-eaef-4d9b-81b9-3a59fe269e9b)
Uploading '<local-file>'...
…
Successfully created new volume.
Volume(name = my-volume, version = 449c8efa-eaef-4d9b-81b9-3a59fe269e9b)
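The idea of a small metadata image that points at data held elsewhere can be sketched as a manifest: one entry per file recording its path, size, hash and the object-store key where the bytes would live. This is an illustration only, not what `maglev volumes create` does internally. The `build_manifest` helper, the `blobs/` key layout and the sample file are all hypothetical.

```python
import hashlib
import json
import tempfile
import uuid
from pathlib import Path


def build_manifest(root: Path, name: str) -> dict:
    """Describe every file under `root` without embedding its contents."""
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        entries.append({
            "path": path.relative_to(root).as_posix(),
            "size": path.stat().st_size,
            "sha256": digest,
            "object_key": f"blobs/{digest}",  # hypothetical key layout
        })
    # A fresh version id per manifest, so an existing version is never mutated.
    return {"name": name, "version": str(uuid.uuid4()), "files": entries}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "loans").mkdir()
    (root / "loans" / "2018Q1.csv").write_bytes(b"loan_id,delinquent\n1,0\n")
    manifest = build_manifest(root, "my-volume")

print(json.dumps(manifest, indent=2))
```

The manifest stays small regardless of dataset size, which is what makes it cheap to ship to compute nodes while the bulk data is fetched on demand.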

SLIDE 19

MagLev: Scenario walkthrough

SLIDE 20

MagLev: Scenario walkthrough

SLIDE 21

MagLev: Scenario walkthrough

SLIDE 22

MagLev: Scenario walkthrough

SLIDE 23

MagLev: Scenario walkthrough

SLIDE 24

SLIDE 25

SLIDE 26

SLIDE 27

SLIDE 28

MagLev Architecture Evolution

Version 1 - Technical viability

Compute and data on public cloud

  • Mostly for technical evaluation
  • Costs skyrocketing
  • Poor performance
  • Clash between functionality and efficiency

Early decisions

  • Cloud-native platform
  • General-purpose services/ETL pipelines hosted on public cloud allow us to scale elastically based on requirements

Image source: shutterstock.com

SLIDE 29

MagLev Architecture Evolution

Version 2 - Minimize costs

Compute on internal data center for GPU workloads

  • Minimize costs
  • Take advantage of innovation on GPUs before it hits the market
  • Huge compute cluster that is always kept busy by the training/testing workflows

What needed to improve:

  • Performance, which suffered from a lack of data locality

SLIDE 30

MagLev Architecture Evolution

Version 3 - High performance

Internal data center specialized for both compute and data performance

  • High performance due to data locality
  • Better UX for data scientists
  • Programmatically create workflows

SLIDE 31

MagLev Data Center Architecture

SLIDE 32

MagLev Service Architecture

  • General service cluster on public cloud
      • Authentication
      • Volume management
      • Workflow traceability
      • Experiment/Model management
  • Compute cluster on internal NGC cloud
  • Both clusters are cloud-native, built on top of Kubernetes

SLIDE 33

Questions