Big Data, Little Cluster: Using a Small Footprint of GPU Servers to Interactively Query and Visualize Massive Datasets

  1. Big Data, Little Cluster: Using a Small Footprint of GPU Servers to Interactively Query and Visualize Massive Datasets May 9, 2017 Todd Mostak | Co-founder + CEO, MapD @toddmostak | @mapd

  2. The data explosion is just beginning. Doubling in less than 3 years. [Chart: data volume in exabytes (0k to 40k), 2014 to 2020, by category: Sensor + Devices, Social Media & Web, VOIP, Enterprise Data.] Source: IDC and EMC Digital Universe Report

  3. But storage is not the problem. [Chart: amount of storage $1 buys (in terabytes), 2015 to 2020.] Source: Wikibon 2015, 4-year costs/TB SSD including packaging, power, cooling, maintenance + space

  4. A compute inflection point: Data Growth 40% per year vs. CPU Processing Power 20% per year

  5. GPUs offer a way forward: GPU Processing Power 50% per year, Data Growth 40% per year, CPU Processing Power 20% per year
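The annual rates on slides 4 and 5 compound, so the gap is easier to see as doubling times and as compute available per unit of data. A minimal sketch, assuming the quoted rates stay constant year over year (the slides give only the percentages; the function names are illustrative):

```python
import math

# Annual growth rates quoted on slides 4 and 5.
RATES = {"data": 0.40, "cpu": 0.20, "gpu": 0.50}

def doubling_time_years(rate):
    """Years for a quantity growing at `rate` per year to double."""
    return math.log(2) / math.log(1 + rate)

def relative_capacity(compute_rate, data_rate, years):
    """Compute capacity per unit of data after `years`, normalized to 1.0 at year 0."""
    return ((1 + compute_rate) / (1 + data_rate)) ** years

if __name__ == "__main__":
    for name, rate in RATES.items():
        print(f"{name}: doubles every {doubling_time_years(rate):.2f} years")
    for years in (5, 10):
        cpu = relative_capacity(RATES["cpu"], RATES["data"], years)
        gpu = relative_capacity(RATES["gpu"], RATES["data"], years)
        print(f"after {years} years: CPU per unit data {cpu:.2f}, GPU per unit data {gpu:.2f}")
```

At 40% a year, data doubles in roughly two years, which is consistent with slide 2's "doubling in less than 3 years". Under the same assumption, CPU capacity per unit of data shrinks over time while GPU capacity per unit of data grows.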

  6. GPUs outperform CPUs in data-critical areas. [Two charts, GPU vs. CPU, 2007 to 2016. Compute Power: teraflops (floating point operations/sec). Ability to Read Data: memory bandwidth (GB/sec).]

  7. MapD: software optimized for the fastest hardware. 100x Faster Queries + Speed of Thought Visualization.
      • MapD Core: An in-memory, relational, column store database powered by GPUs
      • MapD Immerse: A visual analytics engine that leverages the speed + rendering capabilities of MapD Core

  8. Where MapD sits. [Diagram labels: Tableau or 3rd party viz; Thrift, ODBC, JDBC; MapD Immerse; MapD Core (GPU Acceleration); Non-Viz Output; JDBC/Hadoop; Kafka, Streaming Data.]

  9. MapD Core: The world's fastest in-memory GPU database powers the world's most immersive data exploration experience

  10. Performance starts with memory management. Speed increases moving up the tiers:
      • Hot Data: GPU RAM (L1), compute layer. 24GB to 384GB, 3000-5000 GB/sec. Speedup = 1500x to 5000x over cold data
      • Warm Data: CPU RAM (L2), compute layer. 32GB to 3TB, 70-120 GB/sec. Speedup = 35x to 120x over cold data
      • Cold Data: SSD or NVRAM storage (L3), storage layer. 250GB to 20TB, 1-2 GB/sec
      • Data Lake/Data Warehouse/SOR
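The speedup figures on slide 10 appear to follow from the bandwidth ratios between tiers: for example, 3000 GB/sec over 2 GB/sec is 1500x. The sketch below reproduces them and estimates the time to stream a dataset once from each tier. The 1 TB working-set size is an arbitrary assumption for illustration, and real scan times would also depend on compression and query shape.

```python
# Bandwidth ranges (GB/sec) as quoted on slide 10: (low, high).
TIERS = {
    "GPU RAM (hot)": (3000, 5000),
    "CPU RAM (warm)": (70, 120),
    "SSD/NVRAM (cold)": (1, 2),
}

def speedup_range(tier, baseline="SSD/NVRAM (cold)"):
    """Min and max bandwidth ratio of `tier` over `baseline`."""
    lo, hi = TIERS[tier]
    base_lo, base_hi = TIERS[baseline]
    # Minimum: slowest tier figure vs. fastest baseline; maximum is the reverse.
    return lo / base_hi, hi / base_lo

def scan_seconds(dataset_gb, tier):
    """(best, worst) seconds to stream `dataset_gb` once at the tier's bandwidth."""
    lo, hi = TIERS[tier]
    return dataset_gb / hi, dataset_gb / lo

if __name__ == "__main__":
    dataset_gb = 1000  # assumed 1 TB working set, illustrative only
    for tier in TIERS:
        s_lo, s_hi = speedup_range(tier)
        best, worst = scan_seconds(dataset_gb, tier)
        print(f"{tier}: {s_lo:.0f}x-{s_hi:.0f}x over cold, scan {best:.2f}s-{worst:.2f}s")
```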

  11. Query Compilation with LLVM
      • Traditional DBs can be highly inefficient: each operator in SQL is treated as a separate function, which incurs tremendous overhead and prevents vectorization
      • MapD compiles queries w/LLVM to create one custom function
      • Queries run at speeds approaching hand-written functions
      • LLVM enables generic targeting of different architectures (GPUs, x86, ARM, etc.)
      • Code can be generated to run a query on CPU and GPU simultaneously
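The contrast on slide 11 between per-operator evaluation and one compiled function can be sketched without LLVM. The example below is not MapD's code generator: it swaps in Python source generation via `compile`/`exec` as a stand-in for emitting LLVM IR, and the tiny expression-tree format and column names are hypothetical.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, ">": operator.gt}

# Hypothetical expression tree: ("col", name) | ("const", value) | (op, left, right)
# Represents the predicate: fare * 1.1 + tip > 20.0
EXPR = (">", ("+", ("*", ("col", "fare"), ("const", 1.1)), ("col", "tip")), ("const", 20.0))

def interpret(expr, row):
    """Operator-at-a-time evaluation: one function call per tree node, per row."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "const":
        return expr[1]
    return OPS[kind](interpret(expr[1], row), interpret(expr[2], row))

def to_source(expr):
    """Flatten the tree into a single Python expression string."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "const":
        return repr(expr[1])
    return f"({to_source(expr[1])} {kind} {to_source(expr[2])})"

def compile_query(expr):
    """Generate one custom function for the whole expression."""
    source = f"def query(row):\n    return {to_source(expr)}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["query"]

if __name__ == "__main__":
    rows = [{"fare": 10.0, "tip": 2.0}, {"fare": 20.0, "tip": 5.0}]
    query = compile_query(EXPR)
    for row in rows:
        # Both paths must agree; the generated function avoids per-node dispatch.
        assert interpret(EXPR, row) == query(row)
        print(row, query(row))
```

The generated function evaluates the whole predicate in one call, which is the property the slide attributes to compiled queries; the actual speedups come from native code generation, which this sketch does not attempt.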

  12. These innovations drive exceptional speed + scale. Noted DB blogger Mark Litwintschik has benchmarked MapD vs. major CPU systems and found it to be between 74x and 3,500x faster than CPU DBs.

  13. The GPU Open Analytics Initiative (GOAI) and the GPU Data Frame (GDF)

  14. End-to-end on the GPU: Supporting ML with MapD Compute Engine (Roadmap). [Diagram labels: Custom functions; Result set; Output result set; ML frameworks; GPU Acceleration Zone.]

  15. MapD Immerse Lightning fast visual analytics for the MapD Core database

  16. MapD Immerse: our hybrid approach
      • Basic charts are frontend rendered using D3 and other related toolkits
      • Scatterplots, pointmaps + polygons are backend rendered using the Iris Rendering Engine on GPUs
      • Geo-Viz is composited over a frontend rendered basemap

  17. Server side rendering. Data goes from compute (CUDA) to graphics (OpenGL) pipeline without copy and comes back as compressed PNG (~100 KB) rather than raw data (> 1GB). [Diagram labels: Frontend; SQL + Vega; Backend; Query-to-Render; PNG; The X-Factor.]
      Vega Spec (a visualization grammar)
      • A declarative JSON format for creating visualization designs
      • Used to describe backend visualizations
      • Defines attributes of render primitives which can be driven by data columns and mapped by scales
      Shader Compilation Framework
      • Templatized: supports multiple types (ints, floats, colors, etc.) and multiple continuities (discrete, continuous)
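To make slide 17's "render primitives driven by data columns and mapped by scales" concrete, here is a minimal sketch of a linear scale plus a declarative spec held as JSON. The `data`/`scales`/`marks` layout follows the general shape of Vega; the specific field names, the `points` mark type, the table and column names, and the SQL text are assumptions for illustration, not a confirmed MapD schema.

```python
import json

def linear_scale(domain, pixel_range):
    """Return a function mapping a data value in `domain` to a position in `pixel_range`."""
    d0, d1 = domain
    r0, r1 = pixel_range
    def scale(value):
        return r0 + (value - d0) / (d1 - d0) * (r1 - r0)
    return scale

# Hypothetical spec: every name below is illustrative, not a documented schema.
SPEC = {
    "width": 800,
    "height": 600,
    "data": [{"name": "trips", "sql": "SELECT lon AS x, lat AS y FROM trips"}],
    "scales": [
        {"name": "x", "type": "linear", "domain": [-74.3, -73.7], "range": "width"},
        {"name": "y", "type": "linear", "domain": [40.5, 40.9], "range": "height"},
    ],
    "marks": [{
        "type": "points",
        "from": {"data": "trips"},
        "properties": {
            "x": {"scale": "x", "field": "x"},
            "y": {"scale": "y", "field": "y"},
        },
    }],
}

if __name__ == "__main__":
    payload = json.dumps(SPEC)  # the declarative description a frontend would send
    x_spec = SPEC["scales"][0]
    to_px = linear_scale(x_spec["domain"], (0, SPEC[x_spec["range"]]))
    print(len(payload), to_px(-74.0))
```

The point of the declarative form is that the frontend ships only a small description and a query, while the mapping from column values to pixels is applied where the data already lives.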

  18. Scale Up then Out Performantly scaling the MapD Analytics Platform to analyze big data on small clusters

  19. Benefits of a Distributed System
      • Better Ability to Scale: multiple servers means ability to support more GPU RAM and CPU RAM for caching bigger datasets in memory
      • Better Write Performance: the MapD 3.0 distributed capability supports distributed loading for better throughput
      • Better Read Performance: multiple servers can support more GPUs

  20. “MapD was already the fastest analytics database I had ever tested, even when comparing a single server of MapD against large clusters of CPU-based solutions. With the new distributed architecture, MapD offers users record-beating performance over even more massive data sets.” Mark Litwintschik, tech.marksblogg.com

  21. Distributed MapD Core Database Architecture. [Diagram: a MapD Aggregator with Cluster Metadata above three MapD Leaf nodes, each with its own Data blocks.]

  22. MapD Core Database (Single Node). [Flowchart: Accept (MapD Query Handler) -> Prepare Execution -> on each of GPU1, GPU2, ... GPU N: Identify and Load Data (If Needed), then Execute Query -> Reduce Result -> Query Done? If No, loop back; if Yes, Return Result.]

  23. Distributed Scale Up and Scale Out. [Flowchart: Parse and Validate SQL -> Generate Algebraic Sequence -> Prepare Execution -> Leaf 1, Leaf 2, ... Leaf N (with a Shared Dictionary) -> Reduce Leaf Result (Aggregator) -> Query Done? If No, loop back; if Yes, Return Result.]
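The leaf-then-aggregator flow on slides 21 to 23 can be sketched for a grouped average such as `avg(total_amount) ... GROUP BY passenger_count`. This is a simplified illustration, not MapD's implementation: each leaf returns partial (sum, count) pairs per group, because per-leaf averages cannot simply be averaged, and the aggregator merges them. The function names and row layout are hypothetical.

```python
from collections import defaultdict

def leaf_partial(rows):
    """Run on each leaf: per-group [sum, count] over the leaf's local rows."""
    partial = defaultdict(lambda: [0.0, 0])
    for passenger_count, total_amount in rows:
        acc = partial[passenger_count]
        acc[0] += total_amount
        acc[1] += 1
    return dict(partial)

def aggregator_reduce(partials):
    """Run on the aggregator: merge leaf partials, then finalize the average."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (total, count) in partial.items():
            merged[key][0] += total
            merged[key][1] += count
    return {key: total / count for key, (total, count) in merged.items()}

if __name__ == "__main__":
    # Rows are (passenger_count, total_amount); each list is one leaf's share.
    leaf_1 = [(1, 10.0), (1, 20.0), (2, 8.0)]
    leaf_2 = [(1, 30.0)]
    result = aggregator_reduce([leaf_partial(leaf_1), leaf_partial(leaf_2)])
    print(result)
```

For group 1, averaging the two leaf averages (15.0 and 30.0) would give 22.5, while merging sums and counts gives the correct 20.0, which is why the partial results carry both values.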

  24. Distributed Benchmark: 1.1B record NYC Taxi Dataset benchmark (conducted by Mark Litwintschik). Timings in seconds on a single AWS P2.8xlarge vs. a 2 x P2.8xlarge cluster:
      • SELECT cab_type, count(*) FROM trips GROUP BY cab_type;
        single: 0.022, cluster: 0.034
      • SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count;
        single: 0.156, cluster: 0.061
      • SELECT passenger_count, extract(year from pickup_datetime) AS pickup_year, count(*) FROM trips GROUP BY passenger_count, pickup_year;
        single: 0.309, cluster: 0.178
      • SELECT passenger_count, extract(year from pickup_datetime) AS pickup_year, cast(trip_distance as int) AS distance, count(*) AS the_count FROM trips GROUP BY passenger_count, pickup_year, distance ORDER BY pickup_year, the_count desc;
        single: 0.771, cluster: 0.499
      • Load Time
        single: 48 minutes, cluster: 26 minutes
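The slide 24 timings are easier to compare as ratios of single-node time to two-node time. A short sketch using only the figures reported on the slide; the Q1 to Q4 labels are shorthand introduced here for the four queries in the order shown.

```python
# (single-node, two-node cluster) timings as reported on slide 24.
TIMINGS = {
    "Q1 count by cab_type": (0.022, 0.034),
    "Q2 avg amount by passenger_count": (0.156, 0.061),
    "Q3 count by passenger_count, year": (0.309, 0.178),
    "Q4 count by passenger_count, year, distance": (0.771, 0.499),
    "Load time (minutes)": (48, 26),
}

def speedup(single, cluster):
    """Ratio above 1 means the two-node cluster was faster."""
    return single / cluster

if __name__ == "__main__":
    for label, (single, cluster) in TIMINGS.items():
        print(f"{label}: {speedup(single, cluster):.2f}x")
```

The simplest query has a ratio below 1, meaning it ran slower on two nodes, plausibly because fixed coordination cost dominates a query that already finishes in about 20 ms. The heavier queries and the load gain roughly 1.5x to 2.6x.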

  25. Demo

  26. MapD, Now Open Source

  27. Delivering significant customer value across industries
      • Polling smartphones on demand to assess network health. Took hours on Oracle previously.
      • Running complex queries in real-time for customers to drive insights and ad-buys. Previously had to respond in 24+ hours.
      • npm looks at over 8B records at a given moment to identify trends, segments + anomalies in the JavaScript world. Splunk couldn’t scale economically or performance-wise.

  28. Closing thoughts: We are at an inflection point in compute and GPUs are set to dominate the coming decade.

  29. Closing thoughts: GPUs allow users to scale up before needing to scale out, lowering performance-killing network overheads and decreasing hardware and administration costs.

  30. Closing thoughts: Integrated Analytics on GPUs comprising querying, viz and ML provide critical efficiencies and capabilities not found in siloed systems.
