GOAI ONE YEAR LATER
Joshua Patterson, Director AI Infrastructure
3/27/18 | @datametrician
THE WORLD WE ANALYZE
Realities of Data
IN A FINITE CRISIS
CPU Performance Has Plateaued
[Chart: transistor count (thousands) and single-threaded performance, 1980-2020, log scale; single-threaded performance growth has slowed from 1.5X per year to 1.1X per year.]
IN A FINITE CRISIS
GPU Performance Grows
[Chart: single-threaded CPU performance (1.1X per year) vs. GPU-computing performance (1.5X per year), 1980-2020, with a projected 1000X gap by 2025.]
IN A FINITE CRISIS
CPU Performance Has Plateaued
[Chart: peak double-precision TFLOPS, NVIDIA GPU vs. x86 CPU, 2008-2017.]
PRE-52 WEEKS LATER
Fast Was Made Slow
[Diagram: APP A and APP B each read and load data into their own GPU data; every hand-off between H2O.ai, Continuum, Gunrock, Graphistry, BlazingDB, and MapD goes through a copy & convert back on the CPU.]
GPU LEADERS UNITE
Could We Do Better Than Big Data?
TRADITIONAL DATA SCIENCE ON GPUS
Lots of glue code and plagued by copy and converts
[Diagram: three pipelines compared]
- Hadoop Processing, Reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
- Spark In-Memory Processing: HDFS Read → Query → ETL → ML Train (25-100x improvement, less code, language flexible, primarily in-memory)
- GPU/Spark In-Memory Processing: HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train (5-10x improvement, more code, language rigid, substantially on GPU)
- Each system has a different internal memory format with mostly overlapping functionality
- Depending on the workflow, 80+% of time and computation is wasted on the serialization, deserialization, and copying of data
PRE-52 WEEKS LATER
What We Want
[Diagram: data is read and loaded once into a shared GPU memory buffer that H2O.ai, Continuum, Gunrock, Graphistry, BlazingDB, and MapD all use directly.]
GOAI AND THE BIG DATA ECOSYSTEM
No copy and converts on the GPU, compatible with Apache Arrow
[Diagram: H2O.ai, Continuum, Gunrock, Graphistry, BlazingDB, MapD, nvGRAPH, and Simantex all share one GPU Data Frame: no copy & converts, full interoperability.]
github.com/gpuopenanalytics | github.com/apache/arrow
- All systems utilize the same memory format, so overhead for cross-system communication is minimized and projects can share features and functionality
- Most, if not all, of the GPU Data Frame functionality is going back into Apache Arrow
- There are currently three GPU Data Frame libraries: libgdf (C library), pygdf (Python library), and dask_gdf (multi-GPU, multi-node Python library)
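To make the shared format concrete, here is a minimal pygdf sketch, assuming a release with from_pandas/to_pandas; the column names and values are illustrative only:

```python
# Minimal pygdf sketch (column names and data are illustrative; API details vary by release).
import pandas as pd
import pygdf

pdf = pd.DataFrame({'key': [0, 1, 2, 3], 'value': [0.1, 0.2, 0.3, 0.4]})

# Load once into the Arrow-compatible GPU Data Frame...
gdf = pygdf.DataFrame.from_pandas(pdf)

# ...compute on it in device memory (a new column, no host round trip)...
gdf['value2'] = gdf['value'] + gdf['value']

# ...and the same device buffers can be handed to any other GOAI library, or pulled back to pandas.
print(gdf.to_pandas())
```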
DATA SCIENCE ON GPUS WITH GOAI + GDF
Faster Data Access, Less Data Movement
[Diagram: four pipelines compared]
- Hadoop Processing, Reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
- Spark In-Memory Processing: HDFS Read → Query → ETL → ML Train (25-100x improvement, less code, language flexible, primarily in-memory)
- GPU/Spark In-Memory Processing: HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train (5-10x improvement, more code, language rigid, substantially on GPU)
- End-to-End GPU Processing (GOAI): Arrow Read → Query → ETL → ML Train (10-25x improvement, same code, language flexible, primarily on GPU)
ANACONDA – PyGDF & DaskGDF
Moving From Traditional Flows
Traditional Workflows
[Diagram: Database → ETL → Data Frame → Arrays / Sparse Matrix → Model]
- Data originates from a database
- Nearly all data curation happens within the database (joins, group bys, unions, etc.)
- The database has already dealt with providing structure to the data and contains nearly all usable data for ML
- The output of the database and its functionality is a data frame, where additional ETL is minimal
- Manipulation of columns occurs here (encoding, transformations, training-variable creation)
- The data structure is converted from a data frame to a matrix or arrays
- Training runs with many algorithms to find the most accurate method
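For concreteness, the traditional flow sketched above looks roughly like the following on the CPU; the database file, table, and column names here are hypothetical placeholders, not anything from the deck:

```python
# Hypothetical CPU-side sketch of the traditional workflow: database -> data frame -> arrays -> model.
import sqlite3                                           # stand-in for the production database
import pandas as pd
from sklearn.linear_model import LogisticRegression

con = sqlite3.connect('warehouse.db')                    # placeholder database
df = pd.read_sql('SELECT * FROM training_view', con)    # joins/group bys already done in the DB

# Column manipulation on the data frame: encoding, transformations, training-variable creation.
df['is_large'] = (df['amount'] > 100).astype(int)
X = pd.get_dummies(df.drop(columns=['label'])).values    # data frame -> matrix/arrays
y = df['label'].values

model = LogisticRegression().fit(X, y)                   # one of many algorithms tried to find the best
```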
PYGDF & DaskGDF UDFs
Python -> GPU Acceleration
- Write a custom function in Python that gets JIT-compiled into a GPU kernel by Numba
- Functions can be applied by row, by column, or per groupby group
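A minimal sketch of such a UDF with pygdf's apply_rows; the column names are made up and the exact signature may differ between pygdf versions:

```python
# Row-wise UDF sketch: Numba JIT-compiles the kernel body into a CUDA kernel.
import numpy as np
import pygdf

df = pygdf.DataFrame()
df['price'] = np.random.rand(1000)
df['qty'] = np.random.randint(1, 10, size=1000).astype(np.float64)

def revenue(price, qty, out):
    # Plain Python loop; each GPU thread processes its slice of rows.
    for i, (p, q) in enumerate(zip(price, qty)):
        out[i] = p * q

result = df.apply_rows(revenue,
                       incols=['price', 'qty'],
                       outcols=dict(out=np.float64),
                       kwargs={})
print(result.to_pandas().head())
```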
ANACONDA – PyGDF & DaskGDF
To Complex Flows
Complex Workflows
[Diagram: databases, streams, and a data lake feed many data types into multiple data frames; each data frame goes through its own ETL into arrays or sparse matrices and its own model.]
- Data originates from wherever developers can find it, and it is stored in many formats
- With many groups using data in different ways, data is stored in formats for maximum usability, which pushes more manipulation into the ETL functions
- The output of all these sources is a set of varying data frames with varying structure
- The ETL process is in charge of moving data into one usable format (from CSV, XML, JSON, database formats, Hadoop formats, etc.)
- Data curation is performed on all the disparate data
- Subsets of the data are created for different modeling targets
- The rest of traditional ETL occurs, training with many algorithms occurs, and then a feedback loop back to ETL occurs
- Back in the ETL process: if a subset of the data is the root cause of accuracy issues, new subsets are formed for new algorithmic approaches
PYGDF NEW JOINS
Faster Join Support Coming
TPCH Query 21 – End-to-End Results Using 32-bit Keys*
TIME (MS)               SF1      SF10     SF100
CPU (single-threaded)   1329     31731    465064
V100 (PCIe3)            22       164      1521
V100 (3xNVLINK2)        12       45       466
(SF100: V100 over PCIe3 is ~300x faster than CPU; 3xNVLINK2 is ~3.2x faster than PCIe3)

TPCH Query 4 – End-to-End Results Using 32-bit Keys*
TIME (MS)               SF1      SF10     SF100
CPU (single-threaded)   150      2041     24960
V100 (PCIe3)            13       105      946
V100 (3xNVLINK2)        7        23       308
(SF100: V100 over PCIe3 is ~26x faster than CPU; 3xNVLINK2 is ~3.1x faster than PCIe3)

*Assuming the input tables are loaded and pinned in system memory
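As a sketch of what these joins look like from Python (the method name and keyword support vary between pygdf releases; the data is illustrative):

```python
# Illustrative GPU join on pygdf DataFrames; the operation stays in device memory.
import numpy as np
import pygdf

left = pygdf.DataFrame()
left['key'] = np.array([0, 1, 2, 3], dtype=np.int32)
left['lval'] = np.array([10, 11, 12, 13], dtype=np.int32)

right = pygdf.DataFrame()
right['key'] = np.array([1, 2, 2, 4], dtype=np.int32)
right['rval'] = np.array([100, 200, 300, 400], dtype=np.int32)

# Depending on the release this is spelled merge() or join(); semantics mirror pandas.
joined = left.merge(right, on=['key'], how='inner')
print(joined.to_pandas())
```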
PYGDF NEW JOINS
Takeaways
- GPU memory capacity is not a limiting factor
- GPU query performance is up to 2-3 orders of magnitude better than CPU
- GPU query performance is dominated by the CPU-GPU interconnect throughput
- NVLINK systems show 3x better end-to-end query performance compared to PCIe
Thanks, Nikolay Sakharnykh! See S8417 - Breaking the Speed of Interconnect with Compression for Database Applications, Tuesday, Mar 27, 2:00pm, Room 210F
BLAZINGDB JOINS GOAI
Scale Out Data Warehousing
BLAZINGDB
Scale Out Data Warehousing
[Diagram: a data lake (HDD/SSD) with RAM and disk caches feeds schema, metadata, and data into BlazingDB.]
- Compression/Decompression
- Filtering (Predicate Pushdown)
- Aggregations
- Transformations
- Joins
- Sorting/Ordering
Same system, more interoperability: Parquet in, Arrow/GDF out, with compression/decompression on the GPU to improve throughput.
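The Parquet-in/Arrow-out idea can be illustrated on the CPU with pyarrow; this is not BlazingDB's internal GPU path, just the same two formats read with a commodity library, and the file name is a placeholder:

```python
# CPU-side illustration of the formats only: read Parquet into Arrow columnar memory.
import pyarrow.parquet as pq

table = pq.read_table('events.parquet')   # placeholder file
print(table.schema)
print(table.num_rows)
# BlazingDB does the equivalent decompression and decoding on the GPU,
# emitting Arrow/GDF columns that downstream GOAI tools consume directly.
```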
BLAZINGDB
The Future
Coming soon
[Diagram: ingest and storage (data lake) built around GDF/Arrow as the common data layer.]
H2O.AI
ML On GPU
First 3 algorithms:
- GLM
- K-Means
- GBM (XGBoost)
H2O.AI
ML On GPU
2 new algorithms:
- tSVD
- PCA
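A minimal sketch using H2O.ai's h2o4gpu package, which exposes these solvers behind a scikit-learn-style API; class names and parameters here are assumptions about that API, so check the release you have:

```python
# Sketch: GPU K-Means and PCA via h2o4gpu's scikit-learn-style estimators (synthetic data).
import numpy as np
import h2o4gpu

X = np.random.rand(10000, 20).astype(np.float32)

kmeans = h2o4gpu.KMeans(n_clusters=8).fit(X)   # runs on the GPU when one is available
pca = h2o4gpu.PCA(n_components=5).fit(X)

print(kmeans.cluster_centers_.shape)           # (8, 20)
print(pca.transform(X).shape)                  # (10000, 5)
```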
XGBOOST
Faster, More Scalable, & Better Inferencing
- Scalability increase from 16GB to 100GB on DGX-1
- Performance improvement not only on single GPU, but also in multi-GPU scaling
- GBDT inference library
Thanks, Andrey Adinets, Vinay Deshpande, and Thejaswi Nanditale!
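For reference, GPU training in XGBoost is switched on through the tree_method parameter; a minimal sketch on synthetic data (parameters are illustrative):

```python
# GPU-accelerated gradient boosting with XGBoost's gpu_hist tree method.
import numpy as np
import xgboost as xgb

X = np.random.rand(100000, 50)
y = (np.random.rand(100000) > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'objective': 'binary:logistic',
    'tree_method': 'gpu_hist',   # histogram-based training on the GPU
    'max_depth': 6,
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```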
NVGRAPH
Arrow to Graph
- Ported nvGRAPH to run natively on the GPU DataFrame, so two columns can be used as sources and destinations to define an unweighted graph
- Breadth-First Search, Jaccard Similarity, and PageRank with Python bindings
- Developing Hornet integration for GoAi, as well as Gunrock
- A single P100 is 2-3 orders of magnitude faster than an i7-3930K running the NetworkX Python library
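To make the two-columns-define-a-graph idea concrete, here is an illustrative sketch; the module and method names (gdf_graph, Graph, add_edge_list, pagerank) are hypothetical stand-ins for the nvGRAPH-on-GDF Python bindings, not a documented API:

```python
# Hypothetical sketch only: gdf_graph and its methods are stand-in names, not the real binding.
import numpy as np
import pygdf
import gdf_graph  # hypothetical Python binding around nvGRAPH

edges = pygdf.DataFrame()
edges['src'] = np.array([0, 1, 2, 2, 3], dtype=np.int32)
edges['dst'] = np.array([1, 2, 0, 3, 0], dtype=np.int32)

# Two GDF columns (source, destination) define an unweighted graph directly in GPU memory.
g = gdf_graph.Graph()
g.add_edge_list(edges['src'], edges['dst'])
ranks = g.pagerank(alpha=0.85)   # BFS and Jaccard similarity are exposed the same way
print(ranks)
```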
IBM SNAPML
More Proof of ML on GPU
MAPD
In Memory GPU Database
MapD Core + MapD Immerse
- MapD Core: a fast, relational, column store database powered by GPUs (100x faster queries)
- MapD Immerse: a visual analytics engine that leverages the speed + rendering capabilities of MapD Core (speed-of-thought visualization)
MAPD
Improvements Since GTC17
MapD Immerse:
- Multi-source dashboards
- Multi-layer GeoCharts
- Auto-refresh for streaming data
- Charting: combo chart, multi-measure line chart, stacked bar
MapD Core:
- Performance: joins, string literal comparisons
- Data ingestion: read from Kafka, compressed files, S3
- Major rendering performance improvements: O(1-10MM) polygons in ~ms
- Arrow: improved GPU memory management, pymapd with bi-directional Arrow-based ingest
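The Arrow work shows up most directly in pymapd; a minimal sketch, assuming a local MapD instance, default credentials, and a flights table, all of which are placeholders:

```python
# Sketch of bi-directional Arrow/GDF exchange with pymapd; connection details and tables are placeholders.
from pymapd import connect

con = connect(user='mapd', password='HyperInteractive',
              host='localhost', dbname='mapd')

# Query results return as a GPU DataFrame over CUDA IPC, without a copy through host memory.
gdf = con.select_ipc_gpu("SELECT dest_city, AVG(arrdelay) AS avg_delay "
                         "FROM flights GROUP BY dest_city")

# Arrow-based ingest in the other direction.
con.load_table_arrow("delay_summary", gdf.to_pandas())
```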
MAPD
MapD Presto
https://github.com/NVIDIA/presto-mapd-connector
- 8-GPU MapD alone is up to 40x faster than a dual 20-core CPU on inferencing streaming data
- Faux multi-node MapD Presto is being developed
GPU Database Performance (by window of streaming data):
Window               10 mins   30 mins   60 mins
PRESTO ON JSON       20        25        30
PRESTO ON PARQUET    4         6         8
MAPD                 0.1       0.1       0.1
PRESTO + MAPD        1.2       1.2       1.2
MAPD
Dashboards Comparison vs Kibana
MAPD
MapD Immerse vs Elastic Kibana
[Chart: time to fully load dashboards (seconds) vs. days of data (1-31); MapD Immerse (DGX) loads in < 9s and MapD Immerse (P2) in < 12s, while Elastic Kibana load times climb into the hundreds of seconds.]
GRAPHISTRY
Accelerated visual graph analytics and investigation platform
GRAPHISTRY
Improvements since GTC17
[Diagram: CSV, JSON, etc. are converted into the GPU Data Frame via Arrow.js]
https://www.npmjs.com/package/arrow
TESLA V100 32GB
WORLD’S MOST ADVANCED DATA CENTER GPU WITH 2X THE MEMORY
- 5,120 CUDA cores
- 640 new Tensor Cores
- 7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS
- 20MB SM RF | 16MB cache
- 32GB HBM2 @ 900 GB/s | 300 GB/s NVLink
A NEW COMPUTE PLATFORM
Building It Together
[Stack: APPLICATIONS → SYSTEMS → ALGORITHMS → CUDA → ARCHITECTURE]
- Learn what the domain requires
- Use best practices and standards
- Build scalable systems and algorithms
- Test Applications
- Iterate
JOIN THE REVOLUTION
Everyone Can Help!
- Apache Arrow: https://arrow.apache.org/ | @ApacheArrow
- Apache Parquet: https://parquet.apache.org/ | @ApacheParquet
- GPU Open Analytics Initiative: http://gpuopenanalytics.com/ | @Gpuoai
Integrations, feedback, documentation support, pull requests, new issues, or donations welcomed!