efficient data ingestion March 27th 2018 Data Processing at the - PowerPoint PPT Presentation

efficient data ingestion March 27th 2018 Data Processing at the Speed of Thought fastdata.io inc. | Santa Monica | Seattle

Performance Goals
● Performance must be limited only by hardware constraints
● Disk, network, and PCI bus are the most critical
● Speeds up to 10 GB/s per GPU are theoretically possible
● Conservative goal: 2-4 GB/s per GPU card for complex queries
● With a CPU-only approach we get 10 MB/s per core
● GPU executor speedup of 200-400 times (2-4 GB/s per card versus 10 MB/s per core)

Problem Statement
● GPU-accelerated large-scale structured data processing
● Similar programming model for stream and batch
● Treat micro batches as temporary database tables
● Existing GPU databases have demonstrated that queries run very fast
● They do not solve the problem of very fast data loading, because of the CPU bottleneck
● This is the data ingestion problem

GPU Streaming Architecture
[Diagram labels: micro batch, data preparation, partition, query, shuffle, convert results, CPU, GPU, CUDA stream]

Problem Constraint Mitigation Strategies Constraints Speed Strategies CPU - Data preparation on GPU 2 GB/s * Compression, partitions Disk new NVMe disks 3 GB/s Compression, partitions Network PCI Bus 10 GB/s NVLINK (only on Power) up to 60 GB/s GpuDirect RDMA RAM up to 1 TB/s * Columnar format, shared VRAM memory Assumes optimal memory access by threads 5

Data Preparation Tasks
● Decompression
● Micro batch parsing (Kafka)
● Input data conversion to columnar format (CSV)
● Metadata and dictionary computation
● Output data conversion/compression

Decompression
● LZ4 decompression is very fast
● The Gompresso algorithm showed that decompression on the GPU may be even faster
● We pay the cost of loading data over the PCI bus only once, because all further operations run on the GPU
● Moving compressed data saves disk, network, and CPU bandwidth

Micro Batch Parsing
1) The Kafka interface gives random access to messages by message index
2) Each message may have an arbitrary format, e.g. a CSV line
3) The interface allows batch reads of multiple messages
4) The Spark Kafka client parses the batch and generates a Java object for each message
5) We still need to pass the batch to the GPU
Solution: skip Java object generation and pass the raw message batch to the GPU
Result: 2.5-3x speedup
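One way to picture "pass the raw message batch" is a single contiguous byte buffer plus an offsets array, so that no per-message object is created before the data is handed off. The sketch below is only an illustration on the CPU: the names `pack_batch` and `message_at` and the buffer-plus-offsets layout are assumptions, not the FDIO implementation.

```python
from array import array


def pack_batch(messages):
    """Concatenate raw messages into one buffer plus an offsets array.

    Hypothetical layout: message i occupies buffer[offsets[i]:offsets[i + 1]].
    """
    offsets = array("Q", [0])  # unsigned 64-bit offsets, starting at 0
    buffer = bytearray()
    for msg in messages:
        buffer += msg                 # append raw bytes, no parsing
        offsets.append(len(buffer))   # end of this message
    return bytes(buffer), offsets


def message_at(buffer, offsets, i):
    """Random access to message i by index, without per-message objects."""
    return buffer[offsets[i]:offsets[i + 1]]


if __name__ == "__main__":
    raw = [b"1,00:01,click", b"2,00:02,view", b"3,00:03,click"]
    buf, offs = pack_batch(raw)
    print(len(buf), list(offs))
    print(message_at(buf, offs, 1))
```

A flat buffer like this can be copied to a device in one transfer, and the offsets let each thread locate its own message.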

Input Data Conversion to Columnar Format
[Diagram: input formats (CSV, AVRO, JSON, etc., e.g. Syslog) are converted to a columnar (Arrow) GPU Dataframe]
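As a rough CPU-side illustration of row-to-column conversion, the sketch below turns CSV text into one list per column. Plain Python lists stand in for Arrow buffers here, and the function name `csv_to_columns` and the column names are assumptions made for the example.

```python
import csv
import io


def csv_to_columns(text, names):
    """Convert row-oriented CSV text into a dict of column lists."""
    columns = {name: [] for name in names}
    for row in csv.reader(io.StringIO(text)):
        if not row:          # skip blank lines
            continue
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


if __name__ == "__main__":
    data = "1,00:01,click\n2,00:02,view\n3,00:03,click\n"
    cols = csv_to_columns(data, ["id", "time", "name"])
    print(cols["name"])
```

Storing each column contiguously is what lets a query touch only the columns it needs.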

Metadata and Dictionaries
1) A dictionary is an abstract data type composed of a collection of (key, value) pairs, such that each possible key appears at most once in the collection.
2) We need two vectorized operations:
   a) Insert string
   b) Lookup string
3) The two major solutions to the dictionary problem are a hash table and a search tree. Neither is optimal on a GPU.
4) We aim to process up to 4 GB/s on a single GPU card, and dictionary construction should take no more than ⅓ of the processing time.
5) Dictionary construction therefore has to run at 12 GB/s or faster (3 x 4 GB/s).
6) Dictionaries are required only for string columns and specific queries.

Practical Use Case

Input table
id  time   name
1   00:01  click
2   00:02  view
3   00:03  click

Dictionary
code  name
1     click
2     view

Output table
name       count
1 (click)  2
2 (view)   1

SELECT NAME, COUNT(NAME) FROM INPUT_TABLE GROUP BY NAME;
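The use case can be traced in a few lines: encode the string column with a dictionary, count by integer code, then decode the result. This is a minimal sketch, assuming codes are handed out in first-seen order starting at 1; the function names are hypothetical.

```python
def dictionary_encode(values):
    """Map each distinct string to an integer code (1, 2, ...) in first-seen order."""
    dictionary = {}
    codes = []
    for value in values:
        # setdefault returns the existing code, or stores and returns a new one
        codes.append(dictionary.setdefault(value, len(dictionary) + 1))
    return dictionary, codes


def count_by_code(codes):
    """GROUP BY on the encoded column: count occurrences of each code."""
    counts = {}
    for code in codes:
        counts[code] = counts.get(code, 0) + 1
    return counts


if __name__ == "__main__":
    names = ["click", "view", "click"]       # the "name" column of the input table
    dictionary, codes = dictionary_encode(names)
    counts = count_by_code(codes)
    # Decode the integer codes back to strings for the final result
    result = {name: counts[code] for name, code in dictionary.items()}
    print(dictionary, codes, counts, result)
```

The grouping step only compares integers, which is why the dictionary matters for string columns.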

Solutions for Dictionary Structure

Algo                     Pros               Cons
Hash table, search tree  Classic solution   Not optimal on GPU
String sort              No collisions      Slow on GPU
Hashes sort              Fast speed on GPU  Collisions

Hash Collisions
● Birthday paradox
● The probability that at least 2 people share a birthday is approximately
  ○ p = n^2/(2m), where n is the number of people and m is the number of days in the year
● Solving for n: n = sqrt(2*m*p)
● For a 64-bit hash, m = 2^64
● To get a collision with 99.99% probability it would take:
  ○ n = sqrt(2*2^64*0.9999) ≈ 6 billion strings
● Assuming our engine processes 100 million events per second
● It would hit a collision about every 60 seconds!
  ○ About 260 billion seconds for a 128-bit hash
● Conclusion: even with a 64-bit hash, we must handle collisions
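The estimate can be checked numerically. The sketch below evaluates the slide's approximation n = sqrt(2*m*p) and, as an addition that is not in the slides, the standard exponential form n = sqrt(2*m*ln(1/(1-p))), which is more accurate when p is close to 1. Function names are hypothetical, and every string is assumed to be distinct.

```python
import math

EVENTS_PER_SECOND = 100_000_000  # throughput assumed on the slide


def strings_needed_approx(hash_bits, p):
    """Slide formula: n = sqrt(2 * m * p), with m = 2**hash_bits."""
    m = 2.0 ** hash_bits
    return math.sqrt(2.0 * m * p)


def strings_needed_exp(hash_bits, p):
    """Exponential form (not from the slides): n = sqrt(2 * m * ln(1 / (1 - p)))."""
    m = 2.0 ** hash_bits
    return math.sqrt(2.0 * m * math.log(1.0 / (1.0 - p)))


if __name__ == "__main__":
    for bits in (64, 128):
        n = strings_needed_approx(bits, 0.9999)
        seconds = n / EVENTS_PER_SECOND
        print(f"{bits}-bit hash: {n:.3e} strings, {seconds:.3e} seconds")
    print(f"64-bit, exponential form: {strings_needed_exp(64, 0.9999):.3e} strings")
```

Either form puts the 64-bit case at minutes of runtime at this event rate, which supports the conclusion that collisions have to be handled.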

Dictionary Construction Algorithm
Insert strings
1) Compute hashes for all strings
2) Sort on hash + string key
3) “Unique” operation on hash + string key
Lookup strings
1) Binary search using hash + string key
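The steps above can be sketched sequentially on the CPU. This is an illustration of the sort-and-unique idea only, not the GPU implementation: the choice of BLAKE2b as the 64-bit hash and the function names are assumptions. Because the string itself is part of the key, two different strings with the same hash remain separate entries.

```python
import bisect
import hashlib


def hash64(s):
    """64-bit hash of a string (BLAKE2b truncated to 8 bytes, an assumed choice)."""
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def build_dictionary(strings):
    """Insert: hash every string, sort on (hash, string), drop adjacent duplicates."""
    keyed = [(hash64(s), s) for s in strings]   # 1) compute hashes
    keyed.sort()                                # 2) sort on hash + string key
    unique = []
    for key in keyed:                           # 3) "unique" on hash + string key
        if not unique or unique[-1] != key:
            unique.append(key)
    return unique


def lookup(dictionary, s):
    """Lookup: binary search on (hash, string); returns the entry index or -1."""
    key = (hash64(s), s)
    i = bisect.bisect_left(dictionary, key)
    if i < len(dictionary) and dictionary[i] == key:
        return i
    return -1


if __name__ == "__main__":
    d = build_dictionary(["click", "view", "click"])
    print(len(d), lookup(d, "click"), lookup(d, "view"), lookup(d, "scroll"))
```

Tuples compare by hash first, so the string comparison is only reached when two hashes are equal.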

Comparison of Dictionary Implementations
* 1 million 40-byte UUIDs
* Tested on V100

Latest Results
FDIO (one GPU) speed: 1 GB/s (as of now; still improving)
Original Spark (8 CPU cores) speed: 60 MB/s
Query on a Kafka stream for CSV data aggregation:

    telecomStream
      .withWatermark("call_time", "60 seconds")
      .groupBy(window($"call_time", "60 seconds"), $"cell_from")
      .agg(count("*"))
      .join(cellsStaticDf, telecomStream.col("cell_from") === cellsStaticDf.col("cell"))

FDIO Engine™ Architecture Diagram

Conclusion
● Streaming at gigabytes per second is a very hard problem
● The whole system is only as fast as its slowest link
● What we have is still a target-rich environment
● Plenty of optimization opportunities remain
● We are still at the beginning of a long journey
● The possibilities are very exciting
● We plan the GA release of the FDIO engine in April 2018

Links
● Can GPUs Sort Strings Efficiently? http://web.engr.illinois.edu/~ardeshp2/papers/Aditya13StringSort.pdf
● Massively-Parallel Lossless Data Decompression https://arxiv.org/pdf/1606.00519

Sign up for a Test Flight POC at fastdata.io
Visit us in the Inception Startup Pavilion at booth 935

Data Processing at the Speed of Thought Vassili Gorshkov, CTO vassili@fastdata.io 1-888-707-3346 fastdata.io inc. | Santa Monica | Seattle
