Pradeep Gupta | Solutions Architecture, Autonomous Driving Poonam Chitale | AI Infra Product Manager
DEEP LEARNING INFRASTRUCTURE FOR AUTONOMOUS VEHICLES
Deep Learning has changed the way we think about developing software
NVIDIA DRIVE
COLLECT DATA, TRAIN MODELS, SIMULATE, DRIVE
Pedestrians, Cars, Lanes, Path, Lights, Signs
➢ Data Scale and Management
➢ How to Build Compute, Storage and other Infra to enable Training
➢ DL Deployment Infrastructure
[Diagram labels: Manually selected data, Labeling, Labels, Datasets (POST /datasets/{id}), Train/test data, Deep Learning, Inference optimized DNN (TensorRT), Simulation, verification results, Metrics]
PBs of data, large-scale labeling, large-scale training, etc.
Active learning strategies to meet business needs
Mine highly confused / most informative data
[Diagram labels: Intelligently selected data, Labeling, Labels, Datasets (POST /datasets/{id}), Train/test data, Deep Learning, Trained Models, Inference optimized DNN (TensorRT)]
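One common way to "mine highly confused" data is uncertainty sampling: rank unlabeled frames by the entropy of the model's predicted class probabilities and send the top few to labeling. The sketch below is only an illustration under that assumption; the deck does not say which strategy is used, and the frame IDs, probabilities, and function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of one predicted class distribution (higher = more confused)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_most_confused(predictions, budget):
    """Return the `budget` frame IDs whose predictions have the highest entropy."""
    ranked = sorted(predictions, key=lambda fid: entropy(predictions[fid]), reverse=True)
    return ranked[:budget]

# Hypothetical softmax outputs for three unlabeled frames.
predictions = {
    "frame_001": [0.98, 0.01, 0.01],  # confident prediction
    "frame_002": [0.40, 0.35, 0.25],  # highly confused
    "frame_003": [0.70, 0.20, 0.10],  # moderately confused
}
selected = select_most_confused(predictions, budget=2)
```

With these made-up numbers, the two least confident frames are picked for labeling and the confident one is skipped.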
Rand Corporation, Driving to Safety
➢ Data collection fleet = 100 cars
➢ 2000h of data collected per car, per year
➢ Assuming 5 2MP cameras per car, radar data, etc. => 1 TB / h / car
➢ Grand total of 200 PB collected per year!
➢ Only 1/1000 likely to be used for training (curated, labeled data)
➢ 12.1 years training a ResNet50-like network on Pascal, 1.5 years on DGX1 w/ Volta
➢ Today, with 8 DGX1s, and 1/10th of that training data, can train in 1 week
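The arithmetic behind these figures can be checked in a few lines. This is a back-of-the-envelope sketch: it assumes decimal units (1 PB = 1000 TB) and perfectly linear scaling across DGX1 systems and with dataset size, neither of which the slide states explicitly.

```python
CARS = 100
HOURS_PER_CAR_PER_YEAR = 2000
TB_PER_HOUR_PER_CAR = 1
TB_PER_PB = 1000  # assumption: decimal units

# Raw data collected by the whole fleet in one year.
raw_tb_per_year = CARS * HOURS_PER_CAR_PER_YEAR * TB_PER_HOUR_PER_CAR
raw_pb_per_year = raw_tb_per_year / TB_PER_PB

# Only 1/1000 of the raw data is expected to end up as training data.
training_tb = raw_tb_per_year / 1000

# Training-time estimate, assuming ideal linear scaling (an idealization).
single_dgx1_days = 1.5 * 365   # 1.5 years on one DGX1 w/ Volta
num_dgx1 = 8
data_divisor = 10              # 1/10th of the training data
estimated_days = single_dgx1_days / data_divisor / num_dgx1

print(f"raw: {raw_pb_per_year:.0f} PB/year, training set: {training_tb:.0f} TB")
print(f"estimated training time: {estimated_days:.1f} days")
```

Under these assumptions the estimate comes out just under 7 days, consistent with the "1 week" on the slide.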
Best Practices: Collaborating on datasets, workflows and experiments
Managing Datasets: Tracking large, continuously evolving datasets
Tracking Experiments: Reproducible research, performance tracking
Scaling Workflows: Optimal scheduling and automation of AI workflows
Application Platform
➢ Build training workflows
➢ Discover best model
➢ Validate with re-simulation
➢ Deploy to TensorRT and run with NVIDIA DRIVE
Data Platform
➢ Transcode and index raw data
➢ Ingest petabytes of recorded data
➢ Label data and export for training
➢ Guide selection
Continuous Optimization
➢ Inspect recorded workflows
➢ Generate metrics
[Diagram labels: Collect, Data Ingestion, Process, Curate, Annotate / Label, Export, Analyze, Metrics, Dashboard; Continuously Validate, Repeat; Storage Cluster; Data Management and Services]
➢ Continuously ingest data, at roughly 1 TB/hour/car
➢ Data ingestion increases linearly with the number of cars
➢ Diverse datasets yield better DNNs
➢ Dedicated systems for ingestion
➢ Transcoding of raw data to consumable formats
➢ Data compression and caching
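To get a feel for what "1 TB/hour/car" means for the ingestion systems, the rate can be converted to sustained network bandwidth. The sketch assumes decimal terabytes and that ingestion keeps pace with recording for every car at once; real fleets may batch uploads, so treat the result as a rough upper-level estimate. The function name is hypothetical.

```python
BYTES_PER_TB = 10**12      # assumption: decimal terabytes
SECONDS_PER_HOUR = 3600

def sustained_ingest_gbps(num_cars, tb_per_hour_per_car=1.0):
    """Sustained ingest bandwidth in gigabits per second for a recording fleet."""
    bytes_per_second = num_cars * tb_per_hour_per_car * BYTES_PER_TB / SECONDS_PER_HOUR
    return bytes_per_second * 8 / 1e9

# Ingest bandwidth grows linearly with the number of cars.
for cars in (1, 10, 100):
    print(f"{cars:>3} cars: {sustained_ingest_gbps(cars):.1f} Gb/s")
```

A single car recording at this rate already needs a little over 2 Gb/s of sustained bandwidth, which is one reason dedicated ingestion systems are listed above.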
NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.
A couple of factors
➢ Data compression: in the car and/or in the cloud
➢ Data environment: day/night, urban vs. highway
➢ Lossless vs. lossy compression
➢ NVIDIA's experience
➢ DW exposes lossless compression today (LRAW), ~2x compression
➢ Lossy compression is an active area of R&D: how does AI work on compressed data?
Data from test fleets of 10, 30, 50 and 100 cars
➢ Raw Data: 100s of PBs of data
➢ Compressed Data: 10s of PBs of data
➢ Useful Data: 20% to 50% of data may not be useful
➢ Labeled Data: labelling throughput
➢ DNNs
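The funnel from raw to useful data can be sketched numerically for each fleet size. The inputs are taken from elsewhere in this deck (2000 h per car per year, 1 TB/h/car, ~2x lossless compression, 20% to 50% of data not useful); combining them this way, and the decimal PB conversion, are assumptions for illustration only.

```python
HOURS_PER_CAR_PER_YEAR = 2000
TB_PER_HOUR_PER_CAR = 1.0
LOSSLESS_RATIO = 2.0   # ~2x lossless compression
TB_PER_PB = 1000       # assumption: decimal units

def yearly_funnel_pb(num_cars, not_useful_fraction):
    """Return (raw, compressed, useful) data volumes in PB per year."""
    raw = num_cars * HOURS_PER_CAR_PER_YEAR * TB_PER_HOUR_PER_CAR / TB_PER_PB
    compressed = raw / LOSSLESS_RATIO
    useful = compressed * (1.0 - not_useful_fraction)
    return raw, compressed, useful

for cars in (10, 30, 50, 100):
    raw, compressed, useful_low = yearly_funnel_pb(cars, 0.5)
    _, _, useful_high = yearly_funnel_pb(cars, 0.2)
    print(f"{cars:>3} cars: raw {raw:.0f} PB, compressed {compressed:.0f} PB, "
          f"useful {useful_low:.0f}-{useful_high:.0f} PB")
```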
➢ Search from recorded sessions
➢ Frame selection
➢ Standard guidelines and processes are required to annotate frames correctly
➢ Produce high-quality labeled data, exported for model training
➢ QA and double labeling are important
[Diagram labels: Unlabeled frame, Labeled frame, Dataset Export]
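One way double labeling can feed QA is to compare two annotators' bounding boxes for the same object and flag disagreements for review. The sketch below uses intersection over union (IoU) for that comparison; the 0.8 threshold, the label dictionary layout, and the function names are hypothetical, not something the deck specifies.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def needs_review(label_a, label_b, min_iou=0.8):
    """Flag a doubly labeled object when class or box placement disagrees."""
    if label_a["cls"] != label_b["cls"]:
        return True
    return iou(label_a["box"], label_b["box"]) < min_iou

# Two annotators labeled the same pedestrian; the boxes differ slightly.
first = {"cls": "pedestrian", "box": (100, 50, 140, 150)}
second = {"cls": "pedestrian", "box": (102, 52, 141, 151)}
flagged = needs_review(first, second)
```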
Steps
Step | Goal | Iteration Time | # of Machines | GPU
Build the Model | Build a promising model | Hours | 1 | 2-4 TitanX / Tesla P/V100
Continuous Integration | Make sure that the code base remains bug free | Hours | 10s | 4-8 Tesla P/V100
Train the model on real data (hyperparameter tuning) | Make the model work with real data and | Days - Weeks | 10s – 100s | 4-8 Tesla P/V100
Optimize and Validate the Model | Prepare the model for serving and validate it | Hours - Weeks | 10s | 4-8 Tesla P/V100
Deploy the Model | Provide functionality using the model | Milliseconds | Hundreds (test fleet), Millions (live fleet) | Xavier
[Diagram labels: Workflow Manager, Model Store, Dataset Service, Experiment Service; Build Experiments, Use Datasets, Run Training, Analyze Results; Training Cluster (tens of thousands of GPUs); Test, Validate, Repeat]
[Diagram labels: Level0 Storage (7TB SSD in DGX-1); Level1 Storage (high-bandwidth storage); Hundreds]
➢ Cluster using NVIDIA DGX-1 with Volta
➢ Every DGX1 connected via Infiniband for multi-node training
➢ Hierarchical storage: local SSD in DGX-1 and high-bandwidth storage for training data cache
➢ Multiple levels of storage hierarchy
➢ Dedicated connection between on-premises and cloud infra for dedicated bandwidth
➢ Level0 Storage: 7TB SSD in DGX-1 (per DGX1)
➢ Level1 Storage: High-bandwidth storage
➢ Level2 Storage: Highly available replicated storage, 10s of PBs
➢ Level3 Storage: Cold storage for archival, maybe 50s of PBs
[Diagram labels: On Premises Infrastructure, CLOUD, Dedicated connection]
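The storage hierarchy behaves much like a read-through cache: training jobs look for data in the fastest tier first and fall back to slower tiers, copying what they find into the faster ones. The following is a minimal in-memory sketch of that idea only. The class and tier names are hypothetical, and it ignores capacity limits and eviction, which any real system would need.

```python
class TieredStore:
    """Read-through lookup across storage tiers, ordered fastest first."""

    def __init__(self, tier_names):
        self.tier_names = list(tier_names)
        self.tiers = [dict() for _ in self.tier_names]

    def put(self, tier_index, key, value):
        """Place an item directly into one tier."""
        self.tiers[tier_index][key] = value

    def get(self, key):
        """Return (value, tier_name_hit) and copy the item into all faster tiers."""
        for index, tier in enumerate(self.tiers):
            if key in tier:
                value = tier[key]
                for faster in self.tiers[:index]:
                    faster[key] = value  # promote toward the local SSD
                return value, self.tier_names[index]
        raise KeyError(key)

store = TieredStore(
    ["level0_local_ssd", "level1_high_bandwidth", "level2_replicated", "level3_cold"]
)
store.put(2, "shard-0001", b"training records")
first_read = store.get("shard-0001")   # served from the replicated tier
second_read = store.get("shard-0001")  # now served from the local SSD tier
```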
Infrastructure
➢ 960 TFLOPs per DGX1 (FP16)
➢ 7TB SSD per DGX1
➢ High-speed external storage (multi-PB)
➢ Infiniband as interconnect
➢ NCCL 2.0
➢ Data+model management
Workflow Automation
▪ Traceability of data
▪ Models
▪ Experiment sets
▪ Datasets
▪ Versioning
▪ Automated Scheduling
▪ Optimal GPU selection
▪ Best practices
▪ Modular flexible extensible APIs
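Traceability and versioning can be approached by deriving an experiment identifier from everything that defines the run, so the same dataset version, model, and hyperparameters always map to the same ID. This is a minimal sketch of that idea using a content hash; the record fields, the 12-character ID length, and the names are hypothetical rather than taken from the deck.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

def fingerprint(obj):
    """Short, stable hash of a JSON-serializable object (key order does not matter)."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

@dataclass
class ExperimentRecord:
    dataset_version: str
    model_name: str
    hyperparameters: dict

    @property
    def experiment_id(self):
        """ID derived from the full record, so any change yields a new ID."""
        return fingerprint(asdict(self))

run_a = ExperimentRecord("lanes-v3", "resnet50", {"lr": 0.01, "batch_size": 64})
run_b = ExperimentRecord("lanes-v3", "resnet50", {"batch_size": 64, "lr": 0.01})
run_c = ExperimentRecord("lanes-v4", "resnet50", {"lr": 0.01, "batch_size": 64})
```

Here `run_a` and `run_b` describe the same experiment and share an ID, while `run_c` uses a newer dataset version and gets a different one.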
Continuous Optimization
new data
▪ Integration with Data Platform
▪ Rigorous Testing
▪ Simulation
▪ Data diversity
▪ KPIs tracking
▪ Accuracy
▪ Performance
Hyper Parameters
▪ Learning rate, batch size, optimizer, weight decay, regularization strength
▪ Batch-norm, activation functions, convolution stride, filter size
▪ Max translation, color augmentations, potentially shearing, flips, crops
▪ Clustering
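A simple way to explore hyperparameters like those listed above is random search: draw each value independently from a predefined range and launch one training run per draw. The sketch below shows only the sampling step. The ranges, the choice of parameters, and the function names are illustrative assumptions, and random search is named here as one option, not as the method the deck uses.

```python
import random

# Hypothetical search space: each entry draws one value from a given RNG.
SEARCH_SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-5, -1),  # log-uniform
    "batch_size": lambda rng: rng.choice([16, 32, 64, 128]),
    "optimizer": lambda rng: rng.choice(["sgd", "adam"]),
    "weight_decay": lambda rng: 10 ** rng.uniform(-6, -3),   # log-uniform
    "horizontal_flip": lambda rng: rng.choice([True, False]),
}

def sample_trials(num_trials, seed=0):
    """Draw `num_trials` independent configurations, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [
        {name: draw(rng) for name, draw in SEARCH_SPACE.items()}
        for _ in range(num_trials)
    ]

trials = sample_trials(4, seed=42)
for trial in trials:
    print(trial)
```

Fixing the seed keeps the set of trials reproducible, which fits the emphasis on reproducible research earlier in the deck.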
[Diagram labels: Get Data, Train & Test, Adjust, Export; Test & Validate and Repeat]
[Artifacts shown: Dataset exported from Labeling Software, Trained Model, Fine Tuned Model, Exported Model, At the Edge]
Step 1: Optimize trained model
Trained Neural Network, TensorRT Optimizer (platform, batch size, precision), Optimized Plans (Plan 1, Plan 2, Plan 3), Serialize to disk
Step 2: Deploy optimized plans with runtime
Optimized Plans (Plan 1, Plan 2, Plan 3), TensorRT Runtime Engine; targets: Embedded, Automotive, Data center
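The two-step pattern (build one plan per target configuration and serialize it, then load the matching plan at deployment time) can be illustrated without any TensorRT calls. In the sketch below, `build_plan` is a hypothetical stand-in that returns a plain dictionary, not a real TensorRT plan, and the target names, precisions, and file naming are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

def build_plan(model_name, platform, batch_size, precision):
    """Hypothetical stand-in for the optimizer step; returns a plain dict."""
    return {
        "model": model_name,
        "platform": platform,
        "batch_size": batch_size,
        "precision": precision,
    }

def serialize_plans(model_name, targets, out_dir):
    """Step 1: build one plan per (platform, batch size, precision) and write it to disk."""
    paths = []
    for platform, batch_size, precision in targets:
        plan = build_plan(model_name, platform, batch_size, precision)
        path = Path(out_dir) / f"{model_name}_{platform}_b{batch_size}_{precision}.plan.json"
        path.write_text(json.dumps(plan))
        paths.append(path)
    return paths

def load_plan(path):
    """Step 2: the runtime side reads a previously serialized plan."""
    return json.loads(Path(path).read_text())

# Hypothetical deployment targets.
targets = [("embedded", 1, "fp16"), ("automotive", 1, "int8"), ("datacenter", 32, "fp16")]
with tempfile.TemporaryDirectory() as tmp:
    plan_paths = serialize_plans("detector", targets, tmp)
    loaded_plans = [load_plan(p) for p in plan_paths]
```

The point of the split is that optimization happens once per target ahead of time, so the deployed runtime only has to load a finished plan.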
TRAINING
➢ Tesla P100, Tesla V100
➢ Fully Integrated DL Supercomputer: DGX Station (Desk Side), DGX-1 (Data Center)
INFERENCE
➢ Embedded: Jetson TX1
➢ Data Center: Tesla P4, Tesla V100/P40
➢ Automotive: Drive PX2
➢ Deep dive on your current and future state use of AI for Self-Driving
➢ Understand and discuss your goals + objectives, frame approach and size scale
➢ Develop phased roadmap for AI computational scale
➢ Leverage NVIDIA Deep Learning Institute to train and develop your team