GOAI ONE YEAR LATER - Joshua Patterson, Director AI Infrastructure

SLIDE 1

Joshua Patterson, Director AI Infrastructure 3/27/18 @datametrician

GOAI ONE YEAR LATER

SLIDE 2

THE WORLD WE ANALYZE

Realities of Data

SLIDE 3

[Chart: transistors (thousands) and single-threaded performance, 1980-2020; single-threaded perf growth slowed from 1.5x per year to 1.1x per year]

IN A FINITE CRISIS

CPU Performance Has Plateaued

SLIDE 4

[Chart: GPU-computing performance keeps growing at 1.5x per year while single-threaded CPU performance grows only 1.1x per year, projecting a 1000x gap by 2025]

IN A FINITE CRISIS

GPU Performance Grows

SLIDE 5

[Chart: peak double-precision throughput (TFLOPS), 2008-2017, NVIDIA GPU vs. x86 CPU]

IN A FINITE CRISIS

CPU Performance Has Plateaued

SLIDE 6

PRE / 52 WEEKS LATER

Fast Was Made Slow

[Diagram: App A and App B each read and load data separately, with repeated copy & convert steps between CPU and GPU as data moves among H2O.ai, Continuum, Gunrock, Graphistry, BlazingDB, and MapD; each app ends up holding its own GPU data]

SLIDE 7

GPU LEADERS UNITE

Could We Do Better Than Big Data?

SLIDE 8

TRADITIONAL DATA SCIENCE ON GPUS

Lots of glue code, plagued by copy-and-converts

[Diagram: three pipelines compared]
  • Hadoop processing, reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
  • Spark in-memory processing: HDFS Read → Query → ETL → ML Train (25-100x improvement, less code, language flexible, primarily in-memory)
  • GPU/Spark in-memory processing: HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train (5-10x improvement, more code, language rigid, substantially on GPU)

  • Each system has a different internal memory format with mostly overlapping functionality
  • Depending on the workflow, 80+% of time and computation is wasted on the serialization, deserialization, and copying of data
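The copy-and-convert tax is easy to see even without a GPU. The following is a minimal pure-Python sketch (illustrative only, not GOAI code): two systems with different internal layouts must re-copy every value at each hand-off, which is exactly the overhead a shared memory format removes.

```python
# Illustrative sketch: why differing internal formats force
# copy-and-convert steps between systems.

def rows_to_columns(rows):
    """Convert a row-oriented table (list of dicts) to a
    column-oriented table (dict of lists) -- a full copy."""
    return {key: [row[key] for row in rows] for key in rows[0]}

def columns_to_rows(cols):
    """Convert back: another full copy of every value."""
    n = len(next(iter(cols.values())))
    return [{key: vals[i] for key, vals in cols.items()} for i in range(n)]

# "System A" stores rows; "System B" wants columns.
system_a = [{"id": 1, "v": 10.0}, {"id": 2, "v": 20.0}]
system_b = rows_to_columns(system_a)   # copy & convert #1
back_in_a = columns_to_rows(system_b)  # copy & convert #2

# Every hand-off re-copies the data; with a shared columnar format
# (the GPU Data Frame / Arrow idea), both systems would read the
# same buffer and neither conversion would run at all.
assert system_b == {"id": [1, 2], "v": [10.0, 20.0]}
assert back_in_a == system_a
```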

SLIDE 9

PRE / 52 WEEKS LATER

What We Want

[Diagram: data is read once on the CPU and loaded into a single GPU memory buffer shared by H2O.ai, Continuum, Gunrock, Graphistry, BlazingDB, and MapD]

SLIDE 10

GOAI AND THE BIG DATA ECOSYSTEM

No copy and converts on the GPU, compatible with Apache Arrow

No Copy & Converts - Full Interoperability

GPU Data Frame

[Diagram: H2O.ai, Continuum, Gunrock, Graphistry, BlazingDB, MapD, nvGRAPH, and Simantex all interoperating through the shared GPU Data Frame]

github.com/gpuopenanalytics
github.com/apache/arrow

  • All systems utilize the same memory format, so overhead for cross-system communication is minimized and projects can share features and functionality
  • Most, if not all, of the GPU Data Frame functionality is going back into Apache Arrow
  • Currently three GPU Data Frame libraries: libgdf (C library), pygdf (Python library), and dask_gdf (multi-GPU, multi-node Python library)
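The zero-copy sharing behind the GPU Data Frame can be sketched CPU-side with Python's `memoryview` (an analogy only: the real GDF shares device memory, but the principle of many consumers viewing one allocation is the same).

```python
# Sketch (plain Python, CPU memory) of the shared-buffer idea:
# several consumers view one allocation, zero copies.
import struct

# One contiguous buffer holding a column of three float64 values,
# standing in for a column in GPU device memory.
buf = bytearray(struct.pack("3d", 1.0, 2.0, 3.0))

# Two "systems" wrap the same memory; no bytes are duplicated.
view_a = memoryview(buf).cast("d")
view_b = memoryview(buf).cast("d")

view_a[1] = 42.0                 # a write through one view...
assert view_b[1] == 42.0         # ...is visible through the other
assert view_a.obj is view_b.obj  # both views share the same buffer
```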

SLIDE 11

DATA SCIENCE ON GPUS WITH GOAI + GDF

Faster Data Access, Less Data Movement

[Diagram: four pipelines compared]
  • Hadoop processing, reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
  • Spark in-memory processing: HDFS Read → Query → ETL → ML Train (25-100x improvement, less code, language flexible, primarily in-memory)
  • GPU/Spark in-memory processing: HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train (5-10x improvement, more code, language rigid, substantially on GPU)
  • End-to-end GPU processing (GOAI): Arrow Read → Query → ETL → ML Train (10-25x improvement, same code, language flexible, primarily on GPU)

SLIDE 12

ANACONDA – PyGDF & DaskGDF

Moving From Traditional Flows

[Diagram: Database → ETL → Data Frame → Arrays / Sparse Matrix → Model]

Traditional Workflows

  • Data originates from a database
  • Nearly all data curation happens within the database (joins, group-bys, unions, etc.)
  • The database has already dealt with providing structure to the data, and contains nearly all usable data for ML
  • The output of the database is a data frame, where additional ETL is minimal
  • Manipulation of columns occurs here (encoding, transformations, training-variable creation)
  • The data structure is converted from a data frame to a matrix or arrays
  • Training runs many algorithms to find the most accurate method

SLIDE 13

PYGDF & DaskGDF UDFs

Python -> GPU Acceleration

  • Write a custom function in Python that gets JIT-compiled into a GPU kernel by Numba
  • Functions can be applied by row, column, or groupby group
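As a rough CPU-only illustration of this apply-by-row pattern (no Numba or GPU required; `apply_rows` and the UDF below are made-up stand-ins, not the pygdf API): the user supplies a plain Python function over column values, and the framework maps it element-wise. In pygdf, the same user function would instead be JIT-compiled by Numba into a GPU kernel.

```python
# CPU-only stand-in for the apply-by-row UDF pattern over a
# column-oriented table (dict of equal-length lists).

def apply_rows(columns, func, out_name):
    """Apply func element-wise across all input columns, storing
    the result as a new column named out_name."""
    names = list(columns)
    n = len(columns[names[0]])
    out = [func(*(columns[name][i] for name in names)) for i in range(n)]
    return {**columns, out_name: out}

# The user-defined function: plain Python arithmetic (this is the
# part Numba would compile into a GPU kernel in pygdf).
def scaled_sum(x, y):
    return 2.0 * x + y

table = {"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]}
result = apply_rows(table, scaled_sum, "z")
assert result["z"] == [12.0, 24.0, 36.0]
```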

SLIDE 14

ANACONDA – PyGDF & DaskGDF

To Complex Flows

[Diagram: streams, data lakes, and databases holding many data types feed multiple data frames, each flowing through its own ETL → arrays / sparse matrix → model]

Complex Workflows

  • Data originates from wherever developers can find it, and it's stored in many formats
  • With many groups using data in different ways, data is stored in formats for maximum usability, which pushes more manipulation into the ETL functions
  • The output of all these sources is varying data frames with varying structure
  • The ETL process is in charge of moving data into one usable format (from CSV, XML, JSON, database formats, Hadoop formats, etc.)
  • Data curation is performed on all the disparate data
  • Subsets of the data are created for different modeling targets
  • The rest of traditional ETL occurs, training with many algorithms occurs, and then a feedback loop back to ETL occurs
  • Back in the ETL process: if a subset of the data is the root cause of accuracy issues, new subsets are formed for new algorithmic approaches

SLIDE 15

PYGDF NEW JOINS

Faster Join Support Coming

TPCH Query 21 – End-to-End Results Using 32-bit Keys*

TIME (MS)               SF1     SF10    SF100
CPU (single-threaded)   1329    31731   465064
V100 (PCIe3)            22      164     1521    (~300x vs. CPU at SF100)
V100 (3xNVLINK2)        12      45      466     (~3.2x vs. PCIe3 at SF100)

TPCH Query 4 – End-to-End Results Using 32-bit Keys*

TIME (MS)               SF1     SF10    SF100
CPU (single-threaded)   150     2041    24960
V100 (PCIe3)            13      105     946     (~26x vs. CPU at SF100)
V100 (3xNVLINK2)        7       23      308     (~3.1x vs. PCIe3 at SF100)

*Assuming the input tables are loaded and pinned in system memory
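Behind these numbers is a relational join. A toy CPU hash join (an illustrative sketch with made-up column names, not libgdf code) shows the two phases being accelerated: build a hash table on the smaller table's key, then probe it with the larger table.

```python
# Minimal CPU sketch of a hash join: build on the smaller table,
# probe with the larger one. The GPU version runs the same two
# phases with massive parallelism.

def hash_join(build_rows, probe_rows, key):
    """Inner-join two lists of dicts on `key` (build side smaller)."""
    table = {}
    for row in build_rows:                  # build phase
        table.setdefault(row[key], []).append(row)
    out = []
    for row in probe_rows:                  # probe phase
        for match in table.get(row[key], []):
            out.append({**match, **row})
    return out

# Hypothetical mini-tables loosely shaped like TPC-H orders/lineitem.
orders = [{"o_key": 1, "status": "F"}, {"o_key": 2, "status": "O"}]
lineitem = [{"o_key": 1, "qty": 17}, {"o_key": 1, "qty": 36},
            {"o_key": 3, "qty": 8}]

joined = hash_join(orders, lineitem, "o_key")
assert len(joined) == 2                     # only o_key 1 matches
assert all(row["status"] == "F" for row in joined)
```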

SLIDE 16

PYGDF NEW JOINS

  • GPU memory capacity is not a limiting factor
  • GPU query performance is up to 2-3 orders of magnitude better than CPU
  • GPU query performance is dominated by CPU-GPU interconnect throughput
  • NVLINK systems show 3x better end-to-end query performance compared to PCIe

Thanks Nikolay Sakharnykh! S8417 – Breaking the Speed of Interconnect with Compression for Database Applications – Tuesday, Mar 27, 2:00pm, Room 210F

Takeaways

SLIDE 17

BLAZINGDB JOINS GOAI

Scale Out Data Warehousing

SLIDE 18

BLAZINGDB

Scale Out Data Warehousing

[Diagram: data lake files (schema, metadata, data) staged through RAM cache, disk cache, HDD, and SSD tiers into the GPU engine]

  • Compression/Decompression
  • Filtering (Predicate Pushdown)
  • Aggregations
  • Transformations
  • Joins
  • Sorting/Ordering

Same system, more interoperability: Parquet in, Arrow/GDF out. Compression/decompression on the GPU to improve throughput.
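Predicate pushdown, one of the bullets above, can be sketched in a few lines of plain Python (a conceptual stand-in, not BlazingDB code): the filter runs inside the scan, so rows that fail the predicate are never materialized downstream.

```python
# Illustrative sketch of predicate pushdown: apply the filter while
# scanning the source instead of after a full read.

def scan(rows, predicate=None):
    """Scan a source, optionally filtering at read time (pushdown)."""
    for row in rows:
        if predicate is None or predicate(row):
            yield row

source = [{"amount": a} for a in (5, 50, 500, 5000)]

# Without pushdown: materialize everything, then filter a full copy.
everything = list(scan(source))
filtered_late = [r for r in everything if r["amount"] > 100]

# With pushdown: the predicate runs inside the scan; only 2 of the
# 4 rows ever leave the reader.
filtered_early = list(scan(source, lambda r: r["amount"] > 100))

assert filtered_early == filtered_late == [{"amount": 500}, {"amount": 5000}]
```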

SLIDE 19

BLAZINGDB

The Future

Coming Soon

[Diagram: ingest → BlazingDB → storage (data lake), with GDF/Arrow as the common data layer]

SLIDE 20

H2O.AI

ML On GPU: First 3 Algorithms

  • GLM
  • K-Means
  • GBM (XGBoost)
SLIDE 21

H2O.AI

ML On GPU: 2 New Algorithms

  • tSVD
  • PCA
SLIDE 22

XGBOOST

Faster, More Scalable, & Better Inferencing

Thanks Andrey Adinets, Vinay Deshpande, and Thejaswi Nanditale!

  • Scalability increase from 16GB to 100GB on DGX-1
  • Performance improvement not only on single-GPU, but in multi-GPU scaling
  • GBDT inference library

SLIDE 23

NVGRAPH

Arrow to Graph

  • Ported nvGRAPH to run natively on the GPU Data Frame, so two columns can be used as source and destination to define an unweighted graph
  • Breadth-First Search, Jaccard Similarity, and PageRank, with Python bindings
  • Developing Hornet integration for GoAi, as well as Gunrock
  • 1x P100 is 2-3 orders of magnitude faster than an i7-3930K running the NetworkX Python library
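The two-columns-define-a-graph idea can be shown with a small CPU breadth-first search (a plain-Python stand-in, not nvGRAPH's API): equal-length source and destination columns are the entire edge list, and BFS runs straight off them.

```python
# Sketch: BFS over a graph defined by two equal-length "columns"
# (source, destination), the same shape nvGRAPH consumes from a GDF.
from collections import deque

def bfs(src_col, dst_col, start):
    """Return hop distance from `start` for every reachable vertex."""
    adj = {}
    for s, d in zip(src_col, dst_col):   # build adjacency from columns
        adj.setdefault(s, []).append(d)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, []):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

# Edge list as two columns: edges 0->1, 0->2, 1->3, 2->3
src = [0, 0, 1, 2]
dst = [1, 2, 3, 3]
assert bfs(src, dst, 0) == {0: 0, 1: 1, 2: 1, 3: 2}
```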

SLIDE 24

IBM SNAPML

More Proof of ML on GPU

SLIDE 25

MAPD

In Memory GPU Database

MapD Core + MapD Immerse

  • MapD Core: a fast, relational, column-store database powered by GPUs (100x faster queries)
  • MapD Immerse: a visual analytics engine that leverages the speed and rendering capabilities of MapD Core (speed-of-thought visualization)

SLIDE 26

MAPD

Improvements Since GTC17

MapD Immerse
  • Multi-source dashboards
  • Multi-layer GeoCharts
  • Auto-refresh for streaming data
  • Charting: combo chart, multi-measure line chart, stacked bar

MapD Core
  • Performance: joins, string-literal comparisons
  • Data ingestion: read from Kafka, compressed files, S3
  • Major rendering performance improvements: O(1-10MM) polygons in ~ms
  • Arrow: improved GPU memory management; pymapd with bi-directional Arrow-based ingest

SLIDE 27

MAPD

MapD Presto

https://github.com/NVIDIA/presto-mapd-connector

  • 8-GPU MapD alone is up to 40x faster than a dual 20-core CPU on inferencing streaming data
  • Faux multi-node MapD Presto being developed

GPU Database Performance

[Chart: query time over 10/30/60 minutes of streaming data – Presto on JSON: 20/25/30; Presto on Parquet: 4/6/8; MapD: 0.1; Presto + MapD: 1.2]

SLIDE 28

MAPD

Dashboard Comparison vs. Kibana

SLIDE 29

MAPD

MapD Immerse vs Elastic Kibana

[Chart: time to fully load (seconds) vs. days of data – MapD Immerse (DGX) stays under 9s and MapD Immerse (P2) under 12s as the number of days grows, while Elastic Kibana's load time climbs steeply]

SLIDE 30

GRAPHISTRY

Accelerated visual graph analytics and investigation platform

SLIDE 31

GRAPHISTRY

Improvements since GTC17

[Diagram: CSV, JSON, etc. → GPU Data Frame → Arrow.js in the browser]

https://www.npmjs.com/package/arrow

SLIDE 32

TESLA V100 32GB

WORLD’S MOST ADVANCED DATA CENTER GPU, WITH 2X THE MEMORY

  • 5,120 CUDA cores
  • 640 NEW Tensor Cores
  • 7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS
  • 20MB SM RF | 16MB cache
  • 32GB HBM2 @ 900 GB/s | 300 GB/s NVLink

SLIDE 33

SLIDE 34

SLIDE 35

SLIDE 36

A NEW COMPUTE PLATFORM

Building It Together

[Stack diagram: APPLICATIONS → SYSTEMS → ALGORITHMS → CUDA → ARCHITECTURE]

  • Learn what the domain requires
  • Use best practices and standards
  • Build scalable systems and algorithms
  • Test Applications
  • Iterate
SLIDE 37

JOIN THE REVOLUTION

Everyone Can Help!

APACHE ARROW: https://arrow.apache.org/ @ApacheArrow
APACHE PARQUET: https://parquet.apache.org/ @ApacheParquet
GPU Open Analytics Initiative: http://gpuopenanalytics.com/ @Gpuoai

Integrations, feedback, documentation support, pull requests, new issues, or donations welcomed!

SLIDE 38