

  1. Project MagLev: NVIDIA’s production-grade AI Platform Divya Vavili, Yehia Khoja - Mar 21 2019

  2. Agenda ● AI inside of NVIDIA ● Constraints and scale ● AI Platform needs ● Technical solutions ● Scenario walkthrough ● MagLev architecture evolution

  3. AI inside of NVIDIA: Deep Learning is fueling all areas of business ● Self-Driving Cars ● Robotics ● Healthcare ● AI Cities ● Retail ● AI for Public Good

  4. [Image-only slide]

  5. Constraints and scale: SDC scale today at NVIDIA

  6. Constraints and scale: what are our requirements? ● Safety ● Tons of data! ● Inference on edge ● Reproducibility

  7. What testing scale are we talking about? Data and inference to get there: we are on our way to 100s of PB of real test data (millions of real miles), plus 1,000s of DRIVE Constellation nodes for offline testing alone and billions of simulated miles. [Chart: test data volumes (15, 30, 60, 120, 180 PB) against the number of DRIVE Pegasus nodes (400, 1,600, 3,200) needed to complete real-time test runs in 24 hours; 180 PB is NVIDIA's data collection to date, 120 PB of it actively tested.]

  8. The need for an AI platform: an end-to-end solution for industry-grade AI development. Enable the development of AV Perception, fully tested across 1000s of conditions, and yielding failure rates < 1 in N miles, for large N. Pillars: ● Scalable AI Training ● PB-Scale AI Testing ● AI-based Data Selection/Mining ● Traceability: model => code + data ● Seamless PB-Scale Data Access ● Workflow Automation

  9. 1 PB per month

  10. The need for an AI platform: enabling automation of training and testing workflows. [Diagram: a pipeline running Data Transfer, Data Indexing, Data Selection, Data Labeling, Model Training, Model Testing, and Pilot. A Data Factory (NVOrigin, on-demand transcoding) feeds a Dataset Store and a LabelStore (labels, tags, etc.); automated workflows assemble training and testing datasets; Training Workflows (data preprocessing, DNN training, pruning, export, fine-tuning) write Trained Models to a Model Store; Testing Workflows (nightly tests, re-simulation, etc.) produce Tested DNN Models; all of it backed by SaturnV Storage.]

  11. So how did we solve this?

  12. Technical solution(s): Safety ● The non-compromisable primary objective for the passengers; all other engineering requirements stem from this ● Models are tested on huge datasets so we can be confident in them ● Faster iteration aids in producing extremely good and well-tested models ● Reproducibility/Traceability

  13. Technical solution(s): Tons of data! ● Collecting enormous amounts of data under innumerable scenarios is key to building good AV models ● Now that we have the data, what next? ● How do engineers access this data? ● How do you make sure that the data can be preprocessed for each team's needs, and is not corrupted by other members of the team or across teams? ● Lifecycle management of data

  14. Technical solution(s): Tons of data! What is the solution? vdisk, a virtualized immutable file system that ● Offers broad platform support ● Is structured to support data deduplication ● Inherently supports caching ● Provides Kubernetes integration, making it cloud-native
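  The slides don't detail vdisk's internals, so purely as a loose illustration, here is a minimal Python sketch of how an immutable, content-addressed volume can deduplicate data: files are stored under the hash of their contents, so identical files shared across dataset versions occupy storage once. ImmutableVolume and its methods are hypothetical names, not MagLev's actual API.

    import hashlib
    from pathlib import Path

    class ImmutableVolume:
        """Toy content-addressed store (hypothetical, not NVIDIA's vdisk)."""

        def __init__(self, root: Path):
            self.blobs = root / "blobs"
            self.blobs.mkdir(parents=True, exist_ok=True)

        def put(self, path: Path) -> str:
            # Key each file by the SHA-256 of its bytes: writing the same
            # content twice is a no-op, which is the deduplication property.
            data = path.read_bytes()
            digest = hashlib.sha256(data).hexdigest()
            blob = self.blobs / digest
            if not blob.exists():
                blob.write_bytes(data)
            return digest

        def snapshot(self, files: list[Path]) -> dict[str, str]:
            # A published manifest maps names to content hashes; since hashes
            # never change, the snapshot is immutable by construction.
            return {f.name: self.put(f) for f in files}

  Content addressing is also what makes caching safe: a blob fetched by hash can be cached indefinitely, since its contents can never change underneath a consumer.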

  15. Technical solution(s): Inference on edge ● AV model inference is constrained by the hardware capabilities available at the edge ● So finding a lighter model without losing predictive performance is prudent, and takes many fast iterations
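  The deck doesn't say which compression techniques MagLev applies; as one illustrative example of producing a lighter model, here is a minimal magnitude-pruning sketch in NumPy. This is a toy transform, not NVIDIA's pruning pipeline, which would more plausibly involve structured pruning, quantization, and TensorRT export.

    import numpy as np

    def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
        """Zero the fraction `sparsity` of weights with smallest magnitude."""
        k = int(weights.size * sparsity)
        if k == 0:
            return weights.copy()
        # The k-th smallest absolute value becomes the pruning threshold.
        threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
        pruned = weights.copy()
        pruned[np.abs(pruned) <= threshold] = 0.0
        return pruned

    # e.g., drop the smallest 80% of a layer's weights, then fine-tune
    layer = np.random.randn(256, 256)
    sparse = magnitude_prune(layer, sparsity=0.8)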

  16. Technical solution(s): Reproducibility. Why? ● Being able to run a 10-year-old workflow and get the same results ● Faster iteration of model development ● Understanding why a model behaved a certain way. Requires: ● Proper version control of datasets, models, and experiments ● Reproducibility and traceability go hand in hand (a sketch of such a record follows)
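  A minimal sketch of the kind of record that makes a run reproducible and traceable: pin the exact code commit, the immutable dataset version, and the hyperparameters alongside a hash of the produced model. The schema and function below are hypothetical, not MagLev's experiment API.

    import hashlib
    import subprocess
    import time

    def experiment_record(dataset_version: str, params: dict,
                          model_path: str) -> dict:
        """Bundle everything needed to re-run and audit a training job."""
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
        with open(model_path, "rb") as f:
            model_hash = hashlib.sha256(f.read()).hexdigest()
        return {
            "timestamp": time.time(),
            "code_commit": commit,                # exact code
            "dataset_version": dataset_version,   # exact (immutable) data
            "hyperparameters": params,            # exact configuration
            "model_sha256": model_hash,           # artifact this run produced
        }

    # record = experiment_record("449c8efa-...", {"lr": 1e-3}, "model.onnx")

  This is why reproducibility and traceability go hand in hand: the same record that lets you re-run a workflow also answers, for any model, exactly which code and data produced it.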

  17. MagLev Scenario walkthrough: predicting 12-month mortgage delinquency using Fannie Mae Single-Family Home Loan data. Key points: ● Immutable dataset creation ● Specifying workflows and launching them ● End-to-end traceability

  18. MagLev Scenario walkthrough: creating an immutable dataset

    >> maglev volumes create --name <my-volume> --path </some/local/directory/path> [--resume-version <version>]
    Creating volume: Volume(name = my-volume, version = 449c8efa-eaef-4d9b-81b9-3a59fe269e9b)
    Uploading '<local-file>'...
    …
    Successfully created new volume. Volume(name = my-volume, version = 449c8efa-eaef-4d9b-81b9-3a59fe269e9b)

  ● Creates an ISO image ● The ISO image contains only the metadata for the dataset, while the actual dataset resides in S3
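  A rough sketch of that metadata/data split, using boto3: the file bytes go to object storage keyed by content hash, and only the returned metadata entries would be burned into the ISO image. The bucket name and key layout are assumptions for illustration, not MagLev's actual storage scheme.

    import hashlib
    from pathlib import Path

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-dataset-blobs"  # hypothetical bucket

    def publish(path: Path) -> dict:
        """Upload a file's bytes to S3; return metadata only."""
        data = path.read_bytes()
        digest = hashlib.sha256(data).hexdigest()
        s3.put_object(Bucket=BUCKET, Key=f"blobs/{digest}", Body=data)
        # Enough to locate and verify the file without carrying its bytes:
        return {"name": path.name, "sha256": digest, "size": len(data)}

    # manifest = [publish(p) for p in Path("loans/").glob("*.csv")]
    # ...the manifest, not the data, goes into the volume's ISO image.

  Keeping the image metadata-only keeps the volume artifact tiny regardless of dataset size; the blobs themselves can be fetched from S3 and cached on demand.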

  19. MagLev Scenario walkthrough [screenshot]

  20. MagLev Scenario walkthrough [screenshot]

  21. MagLev Scenario walkthrough [screenshot]

  22. MagLev Scenario walkthrough [screenshot]

  23. MagLev Scenario walkthrough [screenshot]

  24. [Image-only slide]

  25. [Image-only slide]

  26. [Image-only slide]

  27. [Image-only slide]

  28. MagLev Architecture Evolution, Version 1: technical viability ● Compute and data on public cloud, mostly for technical evaluation ● Costs skyrocketing ● Poor performance: a clash between functionality and efficiency ● Early decisions: a cloud-native platform ● General-purpose services/ETL pipelines hosted on public cloud allow us to elastically scale based on requirements (image source: shutterstock.com)

  29. MagLev Architecture Evolution, Version 2: minimize costs ● Compute on an internal data center for GPU workloads to minimize costs ● Take advantage of innovation on GPUs before it hits the market ● A huge compute cluster that is always kept busy by the training/testing workflows ● What needed to improve: performance, due to lack of data locality

  30. MagLev Architecture Evolution, Version 3: high performance ● Internal data center specialized for both compute and data performance ● High performance due to data locality ● Better UX for data scientists ● Programmatically create workflows (see the sketch below)
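  The deck doesn't show the workflow API, so purely as an illustration of what "programmatically create workflows" can look like, here is a hypothetical Python spec that chains preprocessing, training, and testing steps into a DAG; every name here is invented.

    from dataclasses import dataclass, field

    @dataclass
    class Step:
        """One node in a workflow DAG (hypothetical schema)."""
        name: str
        image: str                           # container that runs the step
        command: list[str]
        depends_on: list[str] = field(default_factory=list)

    # A toy pipeline: each step runs only after its dependencies finish.
    workflow = [
        Step("preprocess", "example/etl:latest", ["python", "prep.py"]),
        Step("train", "example/trainer:latest", ["python", "train.py"],
             depends_on=["preprocess"]),
        Step("test", "example/tester:latest", ["python", "test.py"],
             depends_on=["train"]),
    ]

  Declaring workflows as data rather than clicking through a UI also serves the traceability story: the spec itself can be versioned alongside code and datasets.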

  31. MagLev Data Center Architecture [diagram]

  32. MagLev Service Architecture ● General service cluster on public cloud: authentication, volume management, workflow traceability, experiment/model management ● Compute cluster on internal NGC cloud ● Both clusters are cloud-native, built on top of Kubernetes (a job-submission sketch follows)
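  Since both clusters are Kubernetes-based, submitting work to the compute cluster can be pictured with the standard Kubernetes Python client. The image name, namespace, and resource request below are assumptions for illustration, not MagLev's actual service API.

    from kubernetes import client, config

    config.load_kube_config()  # talk to whichever cluster kubeconfig points at

    job = client.V1Job(
        metadata=client.V1ObjectMeta(name="train-delinquency-model"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(
                        name="trainer",
                        image="example/trainer:latest",   # hypothetical image
                        resources=client.V1ResourceRequirements(
                            # request one GPU so the pod lands on a GPU node
                            limits={"nvidia.com/gpu": "1"}),
                    )],
                )
            )
        ),
    )

    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)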

  33. Questions
