THE GREAT SYNERGY OF BIG DATA TECHNOLOGIES
Louis Capps, NVIDIA Solutions Architect, lcapps@nvidia.com
2
AGENDA
Background
Big Data vs Fast Data
HPC and Hyperscale
Evolving data technologies
Future research
5
ACCELERATED COMPUTING REVOLUTION
Tesla Accelerated Computing Platform
GPU Accelerators CUDA Servers & Interconnects Developer Tools
Accelerated Data Center Accelerated Algorithms
AMBER · GROMACS · NAMD
HPC
MRI · Tomography · Ultrasound
Medical Imaging
Signal · Image · Video
Defense
RTM · FWI · Elastic
Oil & Gas
person dog chair
Image & Voice Recognition
Deep Learning
Optimized Applications in the Data Center
Founded in 1993
Jen-Hsun Huang is co-founder/CEO
Joined NASDAQ as NVDA in 1999
FY14: $4.13 billion in revenue
>9,000 employees worldwide Headquarters: Santa Clara, CA
Created a revolution with the first GPU in 1999; has shipped > 1 billion
Leader in parallel simulation, visualization, and deep learning
Innovation with > 7,000 patents, research investment, bleeding-edge technologies
Who is NVIDIA?
6
THE WORLD LEADER IN VISUAL COMPUTING
GAMING ENTERPRISE OEM & IP HPC & CLOUD AUTO
7
https://research.nvidia.com/publication/online-detection-and-classification-dynamic-hand-gestures-recurrent-3d-convolutional
https://research.nvidia.com/publication/parallel-spectral-graph-partitioning
https://research.nvidia.com/publication/robust-model-based-3d-head-pose-estimation
8
BIG DATA VS FAST DATA
9
FAST DATA IS THE NEW NORM
“...4,300 percent increase in annual data production by 2020.”
– Forbes Magazine and CSC, April 2016
“'Big Data' Is No Longer Enough: It's Now All About 'Fast Data'”
– Entrepreneur Media, June 2016
10
OPEN DATA SCIENCE
“Open source software is fundamental to big data, says Roman Shaposhnik, who runs the Apache Incubator project ... 'In a way, open source has won in the enterprise,' says Shaposhnik, whose day job is director of open source at Pivotal.” – Datanami, Feb 2016
And just in the past month:
1. The embrace of stream processing and real-time data access is driving enterprise adoption of Apache Kafka
2. Google open-sources SyntaxNet, a natural-language understanding library for TensorFlow
3. IBM is now letting anyone play with its quantum computer
4. Amazon open-sources its own deep learning software, DSSTNE
5. Facebook details its company-wide machine learning platform, FBLearner Flow
6. Google gives TensorFlow distributed computing support
7. OpenAI launches Gym, a toolkit for testing and comparing reinforcement learning algorithms
11
SYNERGY OF BIG DATA TECHNOLOGIES
Growing Research Enterprise Embrace of Open Tech
Data Storage
- SSD, FLASH
- Huge DRAM
Database Scalability / Velocity
- Hadoop
- Spark
- Kafka
- Multi-system interconnect
Cloud
- Broad acceptance
- Production
- Large shared storage
Machine Intelligence
- Deep Learning
- Image, text, speech, sensor
Data Capabilities
- Unstructured
- Graphs
- Frameworks
Compute Engines
- Large clusters
- Acceleration
- Extreme bandwidth
Visualization
- Real-time point clouds
- Interactive
- Precise
- Remote
Insatiable desire for insight · Geometric data growth · Reducing insight latency · Rise of the data scientist · Intelligent insight
12
REBIRTH OF THE DATA SCIENTIST
- Computerworld 2016
13
HPC AND HYPERSCALE
14
BIG DATA – FROM HPC TO HYPERSCALE TO ...
HPC: discovery; huge compute, huge storage; small data in, huge data out; creates data
Hyperscale: insight and autonomy; big compute, big storage; big data in, small data out; prediction; processes data
Similar engine?
15
REMOTE VISUALIZATION ON BLUE WATERS
Faster Time to Results
*Mark D. Klein and John E. Stone, “Unlocking the full potential of the Cray XK7 accelerator”, Cray Users Group, Lugano, Switzerland, May 2014
Local viz cluster (6 GPUs): 48 days, dominated by data transfer
HPC center (128 GPUs): rendering done in 1 day
Stellar combustion visualized on Blue Waters (26 TB dataset): 48x acceleration with Tesla GPUs in the HPC center
Local viz cluster limitations:
- Limited GPUs and other hardware resources
- Long data transfer times
HPC cluster advantages:
- Scales to 100s of GPUs in the cluster
- Eliminates data transfers
Paul Woodward, U. Minnesota: HVR w/ OpenGL on Blue Waters*
16
PHOTO REALISTIC VR RENDERING
17
DATA TECHNOLOGY
18
AI RACE IS ON
IBM Watson Achieves Breakthrough in Natural Language Processing · Facebook Launches Big Sur · Baidu Deep Speech 2 Beats Humans · Google Launches TensorFlow · Microsoft & U. Science & Tech, China Beat Humans on IQ · Toyota Invests $1B in AI Labs
[Chart: ImageNet accuracy rate by year, 2009–2016, traditional CV vs deep learning]
19
20
DEEP LEARNING FOR IMAGE ANALYTICS
[Detected object labels: person, car, helmet, motorcycle, bird, frog, dog, chair, hammer, flower pot, power drill]
21
2016: AN AMAZING YEAR FOR SELF-DRIVING CARS
Uber Enters the Race · Toyota Invests $1B in AI Lab · Volvo Drive Me on Public Roads in 2017 · NHTSA: Computer Counts as Driver · Tesla Model 3: 300K Pre-orders
Audi, BMW, Daimler Buy HERE · Tesla Model S Autopilot · Baidu Enters the Race · Honda, Nissan, Toyota Team Up · GM Buys Cruise
22
DEEP LEARNING AND KITTI
23
ACCELERATING SIGNAL & VIDEO ANALYTICS
- Real-time HD video enhancement and analytics: made possible only with GPUs
- Video surveillance with faster real-time analytics: 12x faster with GPUs
- Unmanned submarine with accelerated sonar processing: 50-100x speedup over CPU
- Faster satellite image processing for actionable intelligence: 12x faster with GPUs
24
MISSION PLANNING WITH REAL-TIME LINE OF SIGHT
http://www.luciad.com/
Video, image, and signal data
CPU: 1 computation/second, delayed response
GPU: 100 computations/second, real-time response
World Leader in Geospatial Situational Awareness
25
DEEP LEARNING REVOLUTIONIZING MEDICAL RESEARCH
Detecting mitosis in breast cancer cells – IDSIA
Molecular activity prediction for drug discovery – Merck
Predicting the toxicity of new drugs – Johannes Kepler University
Understanding gene mutation to prevent disease – University of Toronto
26
ACCELERATED DATABASE TECHNOLOGY
Big data ISVs moving to the accelerated model
SQL No SQL Graph
Trend toward accelerated computing: a variety of firms offer accelerated databases for big data today, from well-funded start-ups to large, well-known players.
27
MAPD
Lightning-fast analytic SQL database and visualization
- MapD processing: > 40k cores vs. traditional processing: 20 cores
- 100-1000x faster queries
- Visualization of billions of data points
- http://www.mapd.com/demos/tweetmap/
28
NETFLOW GRAPH ANALYTICS USE CASE
Initial graph filtered by PageRank / SecureRank using DASL.
- 140M netflows in real time
- Analytics in Scala from Spark
- Blazegraph DASL
- Run PageRank
- Quantity shown by color
29
NETFLOW GRAPH ANALYTICS USE CASE
Interactive visual query session.
A suspicious node is communicating with our internal network, but also with one outside. Looking at the internal nodes, many are communicating with a single outside node.
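The pattern on this slide (many internal nodes all talking to one outside node) can be flagged with a simple fan-in count even before heavyweight graph analytics. A minimal Python sketch, with hypothetical flow records and a hypothetical 10.x internal prefix, not the DASL/Blazegraph pipeline:

```python
from collections import defaultdict

def external_fan_in(flows, internal_prefix="10."):
    """Count distinct internal sources talking to each external destination.

    flows: iterable of (src_ip, dst_ip) pairs.
    """
    fan_in = defaultdict(set)
    for src, dst in flows:
        if src.startswith(internal_prefix) and not dst.startswith(internal_prefix):
            fan_in[dst].add(src)
    return {dst: len(srcs) for dst, srcs in fan_in.items()}

# Toy flow records: three internal hosts all contact the same external IP.
flows = [
    ("10.0.0.1", "203.0.113.9"),
    ("10.0.0.2", "203.0.113.9"),
    ("10.0.0.3", "203.0.113.9"),
    ("10.0.0.1", "198.51.100.7"),
]
scores = external_fan_in(flows)
# 203.0.113.9 is contacted by 3 distinct internal hosts, a candidate for the
# interactive visual query session described above
```

An analyst would sort these scores and drill into the top external destinations, which is exactly the visual exploration the slide describes.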
30
NETFLOW GRAPH ANALYTICS USE CASE
Identification of exfiltrated data traffic.
- Clicking and exploring attributes
- The node appears to be scanning the network from outside
- Possibly pulling out data
31
NVIDIA Research
32
33
GPU GRAPH RESEARCH
NVGRAPH
- Turn graph analytics into linear algebra
- What Blazegraph and DASL do today
- PageRank, SSSP; some acceleration on GPU
- Standardizing with GraphBLAS
Working on glue for GraphX: Spark + GPUs
- Offload core ops to GPU
- Compiler evaluates data flows and fuses ops together into one kernel (Project Tungsten)
References:
- Project Tungsten: https://spark-summit.org/2015/events/deep-dive-into-project-tungsten-bringing-spark-closer-to-bare-metal/
- GraphBLAS: www.graphblas.org
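The "analytics as linear algebra" idea is easiest to see with PageRank: one iteration is just a matrix-vector multiply, which is what GraphBLAS-style libraries accelerate. A small NumPy sketch (dense matrix for clarity; a real nvGRAPH/GraphBLAS implementation would use sparse formats on the GPU):

```python
import numpy as np

def pagerank(adj, damping=0.85, iters=50):
    """Power-iteration PageRank: each step is a matrix-vector product.

    adj[i, j] = 1 means an edge from node j to node i.
    """
    n = adj.shape[0]
    # Column-stochastic transition matrix: divide each column by its out-degree.
    out_deg = adj.sum(axis=0)
    M = adj / np.where(out_deg == 0, 1, out_deg)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = damping * (M @ r) + (1 - damping) / n
    return r

# Tiny graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
adj = np.array([[0, 0, 1],
                [1, 0, 0],
                [1, 1, 0]], dtype=float)
ranks = pagerank(adj)
# Node 2 collects links from both 0 and 1, so it ranks highest
```

On a GPU the `M @ r` step becomes a sparse matrix-vector product (SpMV), which is exactly the operation GraphBLAS standardizes.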
34
NVIDIA DIGITS
Interactive Deep Learning GPU Training System
Process Data · Configure DNN · Monitor Progress · Visualize Layers · Test Image
developer.nvidia.com/digits github.com/NVIDIA/DIGITS
35
DIGITS FUTURE
Object detection workflows for automotive and defense, targeted at autonomous vehicles and remote sensing
developer.nvidia.com/digits github.com/NVIDIA/DIGITS
36
NEED FOR SPEED
Progress in DNN research depends on compute
Lavin & Gray, “Fast Algorithms for Convolutional Neural Networks”, 2015
37
GOALS OF ACCELERATION
Progress in DNN research depends on compute
Faster: performance (inferences/sec)
More efficient: cost (inferences/$), energy (inferences/J)
Inference: run an example forward through the network
Training: run the network forward, back-propagate the gradient, update the parameters
38
THREE KINDS OF NETWORKS
DNN – all fully connected layers (filters, recommendation)
CNN – some convolutional layers (image, vision, text)
RNN – recurrent neural network, LSTM (semantics, intent)
http://scikit-learn.org/stable/tutorial/machine_learning_map/
39
DATA PARALLEL EXAMPLE (CPU)
Linear speed-up.
NVIDIA Whitepaper “GPU based deep learning inference: A performance and power analysis.”
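Data parallelism gives each worker a full model copy and a shard of the batch; averaging the shard gradients reproduces the full-batch gradient, which is where the linear speed-up comes from. A toy NumPy sketch (linear regression standing in for a network):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)

def grad(w, Xb, yb):
    """Mean-squared-error gradient for a linear model."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Data parallel: each of 4 workers gets an equal shard of the batch,
# computes its local gradient, then the gradients are averaged (all-reduce).
shards = np.array_split(np.arange(8), 4)
local_grads = [grad(w, X[s], y[s]) for s in shards]
g_parallel = np.mean(local_grads, axis=0)

# Identical to the single-worker gradient over the full batch.
g_serial = grad(w, X, y)
```

Because the shards are equal-sized, the mean of per-shard mean gradients equals the full-batch mean gradient exactly; unequal shards would need a weighted average.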
40
MODEL PARALLEL EXAMPLE (CPU)
Results vary.
Dean et al, Google, “Large scale distributed deep networks.” NIPS 2012
41
NVIDIA DGX-1
WORLD’S FIRST DEEP LEARNING SUPERCOMPUTER
170 TFLOPS FP16
8x Tesla P100 16GB, NVLink Hybrid Cube Mesh
Accelerates major AI frameworks
Dual Xeon, 7 TB SSD deep learning cache
Dual 10GbE, quad IB 100Gb
3RU – 3200W
42
P100 NVLINK SYSTEM
nvidia.com/nvlink github.com/NVIDIA/nccl images.nvidia.com/events/sc15/pdfs/NCCL-Woolley.pdf
NVLINK – 40GB/s
- low overhead
- Global memory
43
44
REDUCED PRECISION FOR DNN
Weight updates and sums are the main operations.
The menu of options.
45
MIXED PRECISION MODEL
Batch normalization important to “center” dynamic range.
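Batch normalization re-centers each feature to roughly zero mean and unit variance, which keeps activations inside the narrow dynamic range that FP16 represents well. A NumPy sketch of the inference-time computation:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature (column) over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(1)
# Activations with a large offset and spread: awkward for FP16's dynamic range.
x = rng.normal(loc=1000.0, scale=50.0, size=(256, 8))
y = batch_norm(x)
# After normalization, values are centered near 0 with unit variance.
```

In FP16 terms: raw values near 1000 leave few mantissa bits for the ±50 signal, while the normalized values sit where half precision is densest.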
46
REDUCED PRECISION FOR INFERENCE
16 bits is a sweet spot
47
REDUCED PRECISION FOR TRAINING
Training diverges without stochastic rounding
- S. Gupta et al “Deep Learning with Limited Numerical Precision” ICML 2015
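The divergence Gupta et al. describe has a simple cause: updates smaller than half a precision step always round to zero deterministically, but rounding up with probability equal to the fractional remainder preserves the update in expectation. A sketch on a coarse fixed-point grid:

```python
import numpy as np

rng = np.random.default_rng(42)

def round_nearest(x, step):
    return np.round(x / step) * step

def round_stochastic(x, step):
    """Round to a multiple of `step`, up with probability = fractional part."""
    scaled = np.asarray(x) / step
    floor = np.floor(scaled)
    return (floor + (rng.random(np.shape(scaled)) < (scaled - floor))) * step

step = 0.01            # coarse precision grid
update = 0.004         # gradient update smaller than half a step
w_nearest = 0.0
w_stoch = np.zeros(10000)  # many replicas, to show the average behavior
for _ in range(100):
    w_nearest = round_nearest(w_nearest + update, step)
    w_stoch = round_stochastic(w_stoch + update, step)
# Nearest rounding loses every update: w_nearest stays 0.0.
# Stochastic rounding accumulates ~0.4 on average, the true sum of updates.
```

With nearest rounding the weight never moves, which is exactly how low-precision training stalls or diverges; stochastic rounding keeps the expected update unbiased.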
48
DNN RESEARCH
Compressing deep neural nets with pruning
Deep compression has 3 main stages:
1. Learn important connections (prune): 9x to 13x reduction
2. Quantize weights for weight sharing (32 bits down to 5), then retrain
3. Apply Huffman coding
Total compression: 35x (AlexNet) to 49x (VGG-16)
Ability to fit the model into on-chip SRAM; greatly reduced size for use in mobile
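Stages 1 and 2 can be sketched in a few lines: magnitude pruning zeroes the smallest weights, and weight sharing replaces the survivors with a 32-entry (5-bit) codebook. This is a toy illustration of the idea in Han et al. (using quantile bins rather than their k-means clustering, and no retraining):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))

# Stage 1: magnitude pruning - drop the 90% of weights with smallest |w|.
threshold = np.quantile(np.abs(w), 0.9)
mask = np.abs(w) >= threshold
w_pruned = w * mask

# Stage 2: weight sharing - map surviving weights onto 32 shared values
# (5 bits), so each weight is stored as a small codebook index.
survivors = w_pruned[mask]
codebook = np.quantile(survivors, np.linspace(0, 1, 32))
indices = np.abs(survivors[:, None] - codebook[None, :]).argmin(axis=1)
w_shared = w_pruned.copy()
w_shared[mask] = codebook[indices]

sparsity = 1 - mask.mean()   # ~0.9: roughly a 10x reduction in stored weights
```

Storage drops twice: only ~10% of positions are kept, and each kept weight needs a 5-bit index instead of a 32-bit float; Huffman coding (stage 3) then squeezes the index stream further.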
49
PRUNING DNNS
Reducing size of network reduces work and storage
Han et al “Learning both Weights and Connections for Efficient Neural Networks” NIPS 2015
50
PRUNING DNNS
Retrain to recover accuracy
Han et al “Learning both Weights and Connections for Efficient Neural Networks” NIPS 2015
51
PRUNING ALEXNET
52
PRUNING SPEEDUP ON CPU/GPU
Inference
53
GIE (GPU Inference Engine)
Workflow: MANAGE → TRAIN → DEPLOY
DIGITS: manage/augment, train, test
GPU Inference Engine: deploy to data center, automotive, embedded
developer.nvidia.com/gpu-inference-engine
54
GPU INFERENCE ENGINE
Optimizations
- Fuse network layers
- Eliminate concatenation layers
- Kernel specialization
- Auto-tuning for target platform
- Select optimal tensor layout
- Batch size tuning
TRAINED NEURAL NETWORK
OPTIMIZED INFERENCE RUNTIME
developer.nvidia.com/gpu-inference-engine
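One concrete instance of "fuse network layers": a batch-norm scale and shift following a linear or convolution layer can be folded into that layer's weights and bias ahead of time, eliminating a whole kernel at inference. A sketch of the math with a linear layer (illustrating the technique, not GIE's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))      # layer weights
b = rng.normal(size=4)           # layer bias
gamma = rng.normal(size=4)       # batch-norm scale (frozen at inference)
beta = rng.normal(size=4)        # batch-norm shift
mean = rng.normal(size=4)        # running mean
var = rng.random(4) + 0.5        # running variance
eps = 1e-5

def unfused(x):
    """Two kernels: linear layer, then batch-norm over its output."""
    y = W @ x + b
    return gamma * (y - mean) / np.sqrt(var + eps) + beta

# Fused: pre-scale W and b once; inference is a single linear kernel.
s = gamma / np.sqrt(var + eps)
W_f = s[:, None] * W
b_f = s * (b - mean) + beta

x = rng.normal(size=8)
# unfused(x) and W_f @ x + b_f agree to floating-point precision
```

The same algebra applies per-channel to convolutions, which is why inference runtimes routinely fold batch norm away before deployment.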
55
GPU INFERENCE ENGINE
Performance
Platform | Batch size | Performance | Power efficiency
Tesla M4 | 128 | 1153 images/s | 20 images/s/W
Jetson TX1 | 2 | 133 images/s | 24 images/s/W
developer.nvidia.com/gpu-inference-engine
56
GPU REST ENGINE (GRE)
GRE is a REST server for low-latency image classification (inference) using NVIDIA GPUs. It lets you build your own accelerated microservices, and makes use of several technologies:
- Docker: for bundling all the dependencies of our program and for easier deployment.
- Go: for its efficient builtin HTTP server.
- Caffe: because it has good performance and a simple C++ API.
- cuDNN: for accelerating common deep learning primitives on the GPU.
- OpenCV: to have a simple C++ API for GPU image processing.
GPU REST Engine
github.com/NVIDIA/gpu-rest-engine
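The same microservice shape can be sketched in pure Python: an HTTP endpoint that accepts an image payload and returns a label. The classifier below is a stub standing in for the Caffe/cuDNN model, and the `/api/classify` route is illustrative, not GRE's actual API:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def classify(image_bytes):
    """Classifier stub: a real service runs the bytes through a GPU-backed model."""
    label = "cat" if len(image_bytes) % 2 == 0 else "dog"
    return {"class": label, "confidence": 0.9}

class ClassifyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/api/classify":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        payload = json.dumps(classify(body)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

# To serve: HTTPServer(("localhost", 8000), ClassifyHandler).serve_forever()
```

GRE gets its performance from Go's HTTP server and per-GPU worker pools; the point here is only the service shape: stateless handler in front of a shared inference backend.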
57
DESIGN FOR PRODUCTION DATA
Drinking data from a firehose:
- Scalable stream rate, generator -> analytic
- Extended runs: resource limitations, memory fragmentation, windowed vs. continuous processing
Firehose Research
http://firehose.sandia.gov http://wsga.sandia.gov/docs/Anderson.firehose_benchmark.pdf
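The windowed-vs-continuous distinction above comes down to bounding state over an unbounded stream. A minimal sliding-window key counter whose memory never grows beyond the window (a sketch of the pattern, unrelated to the Firehose benchmark's actual analytics):

```python
from collections import Counter, deque

class WindowedKeyCounter:
    """Count key frequencies over the last `window` events, with bounded memory."""
    def __init__(self, window):
        self.window = window
        self.events = deque()
        self.counts = Counter()

    def add(self, key):
        self.events.append(key)
        self.counts[key] += 1
        if len(self.events) > self.window:      # evict the oldest event
            old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]            # keep state from fragmenting

    def hot_keys(self, threshold):
        return {k for k, c in self.counts.items() if c >= threshold}

stream = ["a", "b", "a", "c", "a", "a", "b", "a"]
w = WindowedKeyCounter(window=4)
for k in stream:
    w.add(k)
# Only the last 4 events ("a", "a", "b", "a") are counted.
```

A continuous (unwindowed) counter would instead grow with the number of distinct keys, which is exactly the long-run memory pressure the Firehose benchmarks stress.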
58
NVIDIA END-TO-END AUTONOMOUS DRIVING PLATFORM
NVIDIA DRIVE PX 2 NVIDIA DGX-1 NVIDIA DRIVENET
Localization Planning Visualization Perception
DRIVEWORKS
59
DAVE NET
End to End Learning for Self Driving Cars
60
PX-2 INTERFACES
Sensor Fusion Interfaces:
GMSL Camera, CAN, GbE, BroadR-Reach, FlexRay, LIN, GPIO
Displays and Cockpit Computer Interfaces
HDMI, FPDLink III and GMSL
Development and Debug Interfaces
HDMI, GbE, 10GbE, USB3, USB 2 (UART/debug), JTAG
70 Gigabits per second of I/O
Auto-grade connectors · Debug/lab interfaces
61
CALL TO ACTION
Review research.nvidia.com and developer.nvidia.com to see emerging technologies
Join the LinkedIn big data page to follow real-time trends
Try tutorials on Spark, Kafka, MapD, GPUdb, and Blazegraph
Visit the OpenAI Gym playground for ideas
Try Kaggle competitions for ideas
- Louis Capps, lcapps@nvidia.com
62