Big Data, Little Cluster: Using a Small Footprint of GPU Servers to Interactively Query and Visualize Massive Datasets
May 9, 2017 Todd Mostak | Co-founder + CEO, MapD @toddmostak | @mapd
The data explosion is just beginning
[Chart: global data volume in exabytes, 2014-2020, doubling in less than 3 years to roughly 40,000 exabytes. Segments: Enterprise Data, VOIP, Social Media & Web, Sensors + Devices. Source: IDC and EMC Digital Universe Report]
[Chart: amount of SSD storage $1 buys (terabytes), 2015-2020, rising toward roughly 0.12 TB by 2020. Four-year cost per TB of SSD including packaging, power, cooling, maintenance + space. Source: Wikibon 2015]
Data Growth: 40% per year
CPU Processing Power: 20% per year
GPU Processing Power: 50% per year
Data Growth: 40% per year
CPU Processing Power: 20% per year
Ability to Read Data
[Charts, 2007-2016: memory bandwidth (GB/sec) and compute throughput (teraflops, floating point operations/sec), both climbing steeply]
[Chart: GPU vs. CPU compute power and ability to read data]
MapD Core: an in-memory, relational, column store database powered by GPUs. 100x faster queries.
MapD Immerse: a visual analytics engine that leverages the speed + rendering capabilities of MapD Core. Speed-of-thought visualization.
Tableau or 3rd-party viz and non-viz output connect via JDBC/Hadoop.
Platform architecture: streaming data flows in via Kafka; bulk data loads from a data lake, data warehouse, or system of record. The GPU-accelerated MapD Core database serves MapD Immerse and external clients over JDBC, ODBC, and Thrift.

The world's fastest in-memory GPU database powers the world's most immersive data exploration experience.
Memory hierarchy (compute layer over storage layer):

Hot data (L1, GPU RAM): 24 GB to 384 GB at 3,000-5,000 GB/sec; 1,500x to 5,000x speedup over cold data
Warm data (L2, CPU RAM): 32 GB to 3 TB at 70-120 GB/sec; 35x to 120x speedup over cold data
Cold data (L3, SSD or NVRAM storage): 250 GB to 20 TB at 1-2 GB/sec
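The speedup figures for each tier follow directly from the bandwidth ranges. A minimal back-of-envelope sketch (illustrative function names; midpoint bandwidths assumed from the slide's ranges):

```python
# Back-of-envelope model of the three-tier memory hierarchy described
# above. Bandwidths are midpoints of the slide's ranges; the function
# names are illustrative, not part of any MapD API.

BANDWIDTH_GBS = {
    "gpu_ram": 4000,  # L1: 3,000-5,000 GB/sec
    "cpu_ram": 95,    # L2: 70-120 GB/sec
    "ssd":     1.5,   # L3: 1-2 GB/sec
}

def scan_seconds(column_gb, tier):
    """Time to stream a column of `column_gb` gigabytes from a tier."""
    return column_gb / BANDWIDTH_GBS[tier]

def speedup(fast_tier, slow_tier):
    """Relative speedup of one tier over another for a pure scan."""
    return BANDWIDTH_GBS[fast_tier] / BANDWIDTH_GBS[slow_tier]
```

At these midpoints, GPU RAM over SSD lands inside the 1,500x-5,000x band quoted above, and CPU RAM over SSD inside the 35x-120x band.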
Traditional databases can be highly inefficient: they interpret a query operator by operator. MapD instead compiles each query with LLVM into one custom machine-code function.
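The idea behind compiling a whole query into one function can be sketched without LLVM: instead of evaluating the plan operator by operator and materializing intermediates, a single fused loop does all the work in one pass. A toy Python illustration (not MapD's actual codegen; the query and data are invented):

```python
# Toy sketch of query fusion for something like:
#   SELECT sum(fare) FROM trips WHERE distance > 2
# Operator-at-a-time evaluation materializes each intermediate result;
# the fused version does everything in one pass over the data, which is
# the effect that compiling the query to one function achieves.

rows = [(1.0, 4.50), (3.2, 12.75), (0.5, 3.25), (5.1, 18.00)]  # (distance, fare)

def operator_at_a_time(rows):
    filtered = [r for r in rows if r[0] > 2]   # materialize filter output
    projected = [r[1] for r in filtered]       # materialize projection
    return sum(projected)                      # then aggregate

def fused(rows):
    total = 0.0
    for distance, fare in rows:                # one pass, no intermediates
        if distance > 2:
            total += fare
    return total
```

Both return the same answer; the fused form avoids the intermediate allocations and extra passes that make interpreted execution inefficient.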
Noted DB blogger Mark Litwintschik has benchmarked MapD against major CPU-based systems and found it to be between 74x and 3,500x faster than CPU databases.
[Diagram: the GPU Acceleration Zone, in which result sets pass between the database, ML frameworks, and custom functions before the output result set is returned]
Lightning fast visual analytics for the MapD Core database
Basic charts are frontend-rendered using D3 and related toolkits.
Scatterplots, pointmaps + polygons are backend-rendered using the Iris Rendering Engine on GPUs.
Geo-viz is composited over a frontend-rendered basemap.
Backend query-to-render pipeline: the frontend sends SQL plus a Vega spec (a visualization grammar describing visualization designs, which can be driven by data columns and mapped by scales). A shader compilation framework handles multiple types (floats, colors, etc.) and multiple continuities (discrete, continuous). Data goes from the compute (CUDA) pipeline to the graphics (OpenGL) pipeline without a copy, and comes back to the frontend as a compressed PNG (~100 KB) rather than raw data (>1 GB).
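A sketch of what an SQL-driven Vega-style spec for this pipeline might look like, built as a Python dict. The field names here are illustrative assumptions, not the exact MapD render API:

```python
import json

# Hypothetical shape of a "SQL + Vega in, PNG out" request of the kind
# the slide describes: a data source driven by a SQL query, a scale
# mapping a column to color, and a mark drawing the points.
# All names here are illustrative, not the actual MapD render API.

spec = {
    "width": 800,
    "height": 600,
    "data": [{
        "name": "points",
        "sql": "SELECT lon, lat, fare_amount FROM trips",  # column-driven
    }],
    "scales": [{
        "name": "fare_color",
        "type": "linear",               # continuous scale over a column
        "domain": [0, 50],
        "range": ["blue", "red"],
    }],
    "marks": [{
        "type": "points",
        "from": {"data": "points"},
        "properties": {
            "x": {"field": "lon"},
            "y": {"field": "lat"},
            "fillColor": {"scale": "fare_color", "field": "fare_amount"},
        },
    }],
}

payload = json.dumps(spec)  # what a client would send alongside the query
```

The server would execute the embedded SQL, compile shaders from the scales and mark properties, render on the GPU, and return a PNG.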
Scaling the MapD Analytics Platform to analyze big data on small clusters: more RAM for caching bigger datasets in memory, and better throughput.
Distributed architecture: a MapD Aggregator holding the cluster metadata fronts multiple MapD Leaf nodes, each storing a shard of the data across its GPUs (GPU1 through GPUN). The leaves share a dictionary.

Query flow: the MapD Handler accepts the query; the aggregator parses and validates the SQL, generates an algebraic sequence, and prepares execution. Each leaf (Leaf 1, Leaf 2, ..., Leaf N) identifies and loads the needed data if it is not already resident, then executes the query on its GPUs. The aggregator reduces the partial results; if the query is not yet done, execution continues with another round, otherwise the result is returned.
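The aggregator/leaf flow amounts to scatter-gather: each leaf runs the query on its shard, and the aggregator reduces the partial aggregates. A minimal Python sketch of that reduction for a cab_type GROUP BY query (toy shard data; in reality the leaves run GPU kernels):

```python
# Scatter-gather sketch of the aggregator/leaf flow: each leaf computes
# a partial GROUP BY on its shard, the aggregator merges the partials.
# Shard contents are invented for illustration.

from collections import Counter

shards = [
    ["yellow", "yellow", "green"],           # Leaf 1's cab_type column
    ["green", "yellow"],                     # Leaf 2
    ["yellow", "green", "green", "yellow"],  # Leaf N
]

def leaf_execute(shard):
    """SELECT cab_type, count(*) ... GROUP BY cab_type, on one shard."""
    return Counter(shard)

def aggregator_reduce(partials):
    """Merge per-leaf partial aggregates into the final result."""
    total = Counter()
    for partial in partials:
        total += partial
    return dict(total)

result = aggregator_reduce(leaf_execute(s) for s in shards)
```

Because count (like sum, min, and max) is decomposable, the leaves never exchange raw rows with each other: only small partial results cross the network, which is what keeps the cluster's network overhead low.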
1.1B record NYC Taxi Dataset benchmark (conducted by Mark Litwintschik)
Timings, single AWS p2.8xlarge vs. a 2 x p2.8xlarge cluster:

Query 1: SELECT cab_type, count(*) FROM trips GROUP BY cab_type;
  single node 0.022 s, cluster 0.034 s

Query 2: SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count;
  single node 0.156 s, cluster 0.061 s

Query 3: SELECT passenger_count, extract(year from pickup_datetime) AS pickup_year, count(*) FROM trips GROUP BY passenger_count, pickup_year;
  single node 0.309 s, cluster 0.178 s

Query 4: SELECT passenger_count, extract(year from pickup_datetime) AS pickup_year, cast(trip_distance as int) AS distance, count(*) AS the_count FROM trips GROUP BY passenger_count, pickup_year, distance ORDER BY pickup_year, the_count desc;
  single node 0.771 s, cluster 0.499 s

Load time: single node 48 minutes, cluster 26 minutes
Polling smartphones on demand to assess network health; previously had to respond in 24+ hours.
Running complex queries in real time for customers to drive insights and ad buys; previously took hours on Oracle.
npm looks at over 8B records at a given moment to identify trends, segments + anomalies in the JavaScript world; Splunk couldn't scale economically.
We are at an inflection point in compute, and GPUs are set to dominate the coming decade.
GPUs allow users to scale up before needing to scale out, lowering performance-killing network overheads and decreasing hardware and administration costs.
Integrated analytics on GPUs, comprising querying, visualization, and ML, provides critical efficiencies and capabilities not found in siloed systems.