GPU OPEN ANALYTICS INITIATIVE
END-TO-END ACCELERATED ANALYTICS
Brad Rees, Ph.D. - Senior Solution Architect - NVIDIA GTC DC, November 2017
AGENDA: TWO PARTS
Discuss Analysis from the Perspective of Data Science
Better Exploration ∝ Better Science
Fail Fast Needs to be Embraced
"I have not failed. I've just found 10,000 ways that won't work." (attributed to Thomas Edison)
“Data science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data …”
Faster Analytics yield better Exploration
mission critical workloads is prohibitively expensive
Commercial Government HPC
CPU
Avro, XML, JSON, GML, ProtoBuf, HDFS, Pickle, CSV, Parquet, Pandas, Numpy *
CSR, COO, CSC
Plain Text vs Binary
Compressed vs Uncompressed
* Not a complete list
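CSR, COO, and CSC in the list above are sparse-matrix layouts that hold the same nonzero values in different arrangements, which is one reason moving data between tools costs time. As a minimal sketch in pure Python (the function name and the tiny example matrix are illustrative assumptions, not taken from the slides), converting COO triplets to CSR looks roughly like this:

```python
def coo_to_csr(n_rows, rows, cols, vals):
    """Convert COO triplets (row, col, value) to CSR arrays.

    Returns (indptr, indices, data), where the entries of row r are
    stored at positions indptr[r] .. indptr[r + 1] - 1.
    """
    # Count the nonzero entries in each row.
    indptr = [0] * (n_rows + 1)
    for r in rows:
        indptr[r + 1] += 1

    # A running sum turns the counts into row start offsets.
    for i in range(n_rows):
        indptr[i + 1] += indptr[i]

    # Scatter every entry into the next free slot of its row.
    next_slot = indptr[:-1]  # slicing copies, so indptr is not modified
    indices = [0] * len(vals)
    data = [0] * len(vals)
    for r, c, v in zip(rows, cols, vals):
        pos = next_slot[r]
        indices[pos] = c
        data[pos] = v
        next_slot[r] += 1
    return indptr, indices, data


# Hypothetical 3x3 matrix with four nonzeros, given in arbitrary COO order.
indptr, indices, data = coo_to_csr(
    3, rows=[0, 0, 2, 1], cols=[0, 2, 1, 1], vals=[1, 2, 3, 4]
)
```

The conversion has to touch every nonzero entry, so a pipeline that converts layouts between each pair of tools pays this cost repeatedly.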
Data Manipulation
MapD (GPU RAM)
BlazingDB (Disk)
STORAGE | IN-GPU-MEMORY DATA STRUCTURE | PROCESSING AND ANALYTICS | INTERACTION
MapD, BlazingDB (“SQL”)
Many Columnar Data Frames (everyone has their own makeshift data frame)
Anaconda* (Dask, “Python”)
Fast Data (Streaming)
NV Graph
Graphistry
MapD Immerse
Jupyter NB
Key: Open Source, Free to Use, Closed Source
* Primarily x86 w/ some GPU acceleration
many holes
moving data between applications
accelerated analytics, not deep learning yet
Data Manipulation
MapD (GPU RAM)
BlazingDB (Disk)
STORAGE | IN-GPU-MEMORY DATA STRUCTURE | PROCESSING AND ANALYTICS | INTERACTION
MapD, BlazingDB (“SQL”)
Standard Columnar Data Frame (Open Sourced/Free to Use from MapD)
H2O (data.table, “R”)
Anaconda (Dask, “Python”)
Fast Data (Streaming)
H2O.ai (GPU MLlib)
NV Graph
MapD + BlazingDB System Memory
Graphistry
MapD Immerse
Jupyter NB
Key: Open Source, Free to Use, Closed Source
Big Data ecosystem facing similar issues
Major push in the big data world to remove bottlenecks
Apache Arrow™
latest SIMD (Single Instruction, Multiple Data) operations
better performance on modern hardware like CPUs and GPUs.
reads for lightning-fast data access without serialization overhead.
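As a rough illustration of the columnar, no-serialization idea behind Arrow, the sketch below uses only Python's standard library (the `array` module and `memoryview`), not Arrow's own API. The field names and values are made up for the example. Each column lives in one contiguous typed buffer, and a consumer views that buffer in place instead of receiving a serialized copy.

```python
from array import array

# Row-oriented: one Python tuple per record (hypothetical data).
records = [(1, 10.0), (2, 20.5), (3, 30.25)]

# Column-oriented: one contiguous, typed buffer per field.
ids = array("q", (r[0] for r in records))     # signed 64-bit integers
prices = array("d", (r[1] for r in records))  # 64-bit floats

# A consumer can look at the column's memory directly.
# No copy is made and nothing is serialized.
view = memoryview(prices)

# Producer and consumer share the same memory, so an update made
# through the array is immediately visible through the view.
prices[0] = 11.0
shared = view[0] == prices[0]

# Scanning a single column reads one dense buffer rather than
# hopping across every record.
total = sum(view.tolist())
```

Arrow applies the same principle across processes and languages by fixing the in-memory column format, so every participating tool can read the buffers without converting them.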
CPU
So... what does this get me?
Demos available on goai github
pygdf: Python library for manipulating GDFs (GPU Data Frames)
Convert from Pandas and Numpy
Hadoop Processing, Reading from disk
HDFS Read > SQL Query > HDFS Write > HDFS Read > ETL > HDFS Write > HDFS Read > Train

Spark In-Memory Processing
HDFS Read > SQL Query > ETL > ML Train
25-100x Improvement, Less code, Language flexible, Primarily In-Memory
Large TCO benefit, Large Adoption

GPU + Spark In-Memory Processing
HDFS Read > GPU Read > SQL Query > CPU Write > GPU Read > ETL > CPU Write > GPU Read > ML Train
5-10x Improvement, More code, Language rigid, Substantially on GPU
Small TCO benefit, Small Adoption

End-to-End GPU Processing (GOAI)
Arrow Read > SQL Query > ETL > ML Train
25-100x Improvement, Same code, Language flexible, Primarily on GPU
Large TCO benefit, Large Adoption?
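One way to read the four pipelines on this slide is to count how many steps only move data rather than compute on it. The sketch below does that count in Python. The step names come from the slide, but the exact interleaving of reads, writes, and compute steps is an assumption reconstructed from the flattened diagram, and the helper names are hypothetical.

```python
# Step order per pipeline, as read from the slide.
# The interleaving of steps is an assumption.
PIPELINES = {
    "Hadoop, reading from disk": [
        "HDFS Read", "SQL Query", "HDFS Write",
        "HDFS Read", "ETL", "HDFS Write",
        "HDFS Read", "Train",
    ],
    "Spark in-memory": ["HDFS Read", "SQL Query", "ETL", "ML Train"],
    "GPU + Spark in-memory": [
        "HDFS Read", "GPU Read", "SQL Query", "CPU Write",
        "GPU Read", "ETL", "CPU Write", "GPU Read", "ML Train",
    ],
    "End-to-end GPU (GOAI)": ["Arrow Read", "SQL Query", "ETL", "ML Train"],
}


def data_movement_steps(steps):
    """Return the steps that only move data (reads and writes)."""
    return [s for s in steps if s.endswith(("Read", "Write"))]


def movement_counts(pipelines=PIPELINES):
    """Map each pipeline name to its number of data-movement steps."""
    return {
        name: len(data_movement_steps(steps))
        for name, steps in pipelines.items()
    }


if __name__ == "__main__":
    for name, count in movement_counts().items():
        print(f"{name}: {count} data-movement steps")
```

Under these assumptions the GPU + Spark pipeline moves data more often than any other, because each stage copies between CPU and GPU memory, while the end-to-end pipeline reads the data once and keeps it on the GPU.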
to a host-side struct
DataFrames
using Numba
chunked onto different GPUs and different servers.
~100x speedup using MapD on half a DGX to analyze census data vs a 20-node Spark cluster
>50x speedup in performing PageRank on a graph on half a DGX vs an 8-node Spark cluster
~8.5x speedup on half a DGX to produce a robust GLM via 10-fold cross-validation vs an 8-node Spark cluster
~100x more cyber security data interactively visualized using an intuitive layout algorithm on a single GPU as a connected graph
~5x faster than Redshift to utilize full disk storage and system memory
Python on GPU... Numba and Pandas
MapD Core database is an in-GPU-memory, columnar, open-source, GPU-accelerated, SQL database. MapD Enterprise brings distributed and high availability modes, GPU-accelerated backend rendering, Kerberos/LDAP security, and ODBC/JDBC. MapD Immerse is a visual analytics platform on top
scientists and analysts to interactively explore large datasets.
Time in Milliseconds

System             Query 1   Query 2   Query 3   Query 4
MapD DGX-1              21        80       150       372
Kinetica DGX-1         596       518       795      1209
Redshift 6-node       1560      1250      2250      2970
Spark 11-node        10190      8134     19624     85942

Source: MapD Benchmarks on DGX-1 from internal NVIDIA testing following guidelines of Mark Litwintschik’s blogs (@marklit82): Redshift, 6-node ds2.8xlarge cluster & Spark 2.1, 11 x m3.xlarge cluster w/ HDFS
GPU-memory-based databases are 8x to 15x faster than CPU in-memory databases such as Redshift, and 100x to 485x faster than Spark.
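The speedup ranges can be checked against the benchmark chart with a few lines of Python. This is a sketch that assumes the bar values map to systems as grouped below (one value per system for each of the four queries). That mapping is inferred from the chart's flattened labels, and the function name is hypothetical.

```python
# Query times in milliseconds for Query 1..4.
# The assignment of values to systems is an assumption.
TIMES_MS = {
    "MapD DGX-1": [21, 80, 150, 372],
    "Kinetica DGX-1": [596, 518, 795, 1209],
    "Redshift 6-node": [1560, 1250, 2250, 2970],
    "Spark 11-node": [10190, 8134, 19624, 85942],
}


def speedup_over(baseline, reference="MapD DGX-1"):
    """Per-query ratio of the baseline's time to the reference's time."""
    return [
        slow / fast
        for slow, fast in zip(TIMES_MS[baseline], TIMES_MS[reference])
    ]


spark = speedup_over("Spark 11-node")        # about 485, 102, 131, 231
redshift = speedup_over("Redshift 6-node")   # about 74, 16, 15, 8

if __name__ == "__main__":
    for name in ("Redshift 6-node", "Spark 11-node"):
        ratios = ", ".join(f"{r:.0f}x" for r in speedup_over(name))
        print(f"MapD vs {name}: {ratios}")
```

Under this mapping the Spark ratios span roughly 100x to 485x. The Redshift ratios for Queries 2 through 4 fall between about 8x and 16x, with Query 1 noticeably higher.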
Open Source core DBMS Free Community Edition
BlazingDB database is a disk-based, columnar, GPU-accelerated SQL database. BlazingDB has distributed and high availability modes, JDBC, and Python/C# APIs. BlazingDB offers a Community Edition that can be downloaded for free and has an Enterprise Edition that you can launch today on AWS.
Blazing speedup
Anaconda Accelerate provides access to libraries
CUDA Sorting and cuBLAS. Numba is a compiler for Python functions that generates native code for GPU hardware. Dask is a parallel computing library for analytic computing in Python. It enables distributed computing in Pure Python and integrates with Anaconda Accelerate and Numba.
Deep learning researcher & educator.
Founder: fast.ai
Faculty: USF & Singularity University
Previously - CEO: Enlitic, President: Kaggle, CEO: Fastmail
Rewrote the PolynomialFeatures from scikit-learn in Numba. Got a 40x speedup in only 12 lines of code.
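The original 12 lines are not shown in the deck, so the following is only a sketch of what a Numba version of degree-2 polynomial features could look like, not the code referred to in the quote. It assumes NumPy is installed, uses Numba's `njit` decorator on the CPU when Numba is available, and falls back to plain Python otherwise. The function name `poly2_features` is hypothetical.

```python
import numpy as np

try:
    from numba import njit  # compiles the loops below to native code
except ImportError:
    def njit(func):
        """Fallback so the sketch still runs without Numba installed."""
        return func


@njit
def poly2_features(X):
    """Degree-2 polynomial features: [1, x_j ..., x_j * x_l for j <= l]."""
    n, d = X.shape
    n_out = 1 + d + d * (d + 1) // 2
    out = np.empty((n, n_out), dtype=X.dtype)
    for i in range(n):
        out[i, 0] = 1.0              # bias term
        for j in range(d):
            out[i, 1 + j] = X[i, j]  # linear terms
        k = 1 + d
        for j in range(d):           # squares and pairwise products
            for l in range(j, d):
                out[i, k] = X[i, j] * X[i, l]
                k += 1
    return out


# For a row [a, b] the columns are [1, a, b, a*a, a*b, b*b].
example = poly2_features(np.array([[2.0, 3.0]]))
```

Explicit loops like these are slow in interpreted Python but are the kind of code Numba compiles well, which is the general idea behind the reported speedup.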
H2O.ai has a working implementation of GPU-accelerated generalized linear modeling. H2O.ai is working to GPU-accelerate additional machine learning algorithms such as random forests, gradient boosting machines, and clustering. H2O.ai is working on porting data.table, a columnar data frame library, along with the world's fastest implementation of the sort algorithm to NVIDIA GPUs.
Graphistry uses GPUs in the backend for layout calculation and machine learning. Graphistry uses GPUs in the frontend for rendering the visualization in a web browser. Graphistry allows a user to interactively visualize orders of magnitude more data than traditional solutions in an intuitive way.
Gunrock has multi-GPU implementations of graph algorithms such as PageRank, Breadth First Search, Single Source Shortest Path, etc. Gunrock has a high-level API in C that is accessible from Python.
https://arrow.apache.org/ @ApacheArrow https://parquet.apache.org/ @ApacheParquet
http://gpuopenanalytics.com/ @Gpuoai Integrations, feedback, documentation support, pull requests, new issues, or donations welcomed!
Sessions

Wednesday 11/1, 2:00pm, Hemisphere A
Session DC7213: World's Fastest Machine Learning With GPUs
Jon Mckinney - Senior Developer, H2O.ai

Wednesday 11/1, 2:30pm, Hemisphere A
Session DC7212: Interpretable AI: Not Just For Regulators
Patrick Hall - Director of Data Science, H2O.ai

Wednesday 11/1, 5:00pm, Polaris
Session DC7189: The Impact of GPUs in Geovisualization for Government
Todd Mostak - CEO & Founder, MapD

Thursday 11/2, 2:00pm, Hemisphere B
Session DC7133: Scaling Event Data Investigations with GPU Visual Graph Analytics
Leo Meyerovich - CEO, Graphistry, Inc

Thursday 11/2, 4:30pm, Atrium Hall
Session DC7111: Accelerating Cyber Threat Detection with GPUs
Josh Patterson - NVIDIA
Fundamentals, Autonomous Vehicles, Media & Entertainment, Finance, Machine Vision - IVA, Healthcare …and more
Training available as online self-paced labs and instructor-led workshops
Take self-paced labs at www.nvidia.com/dlilabs
Find or request an instructor-led workshop at www.nvidia.com/dli
Educators: download the Teaching Kit at developer.nvidia.com/teaching-kit and contact nvdli@nvidia.com for info on the University Ambassador Program
http://gpuopenanalytics.com/