S8443: Feeding the Big Data Engine: How to Import Data in Parallel - PowerPoint PPT Presentation



SLIDE 1

S8443: Feeding the Big Data Engine

How to Import Data in Parallel

Presented By: Brian Kennedy, CTO

Providence – Atlanta Email: bkennedy@simantex.com

SLIDE 2

Introduction to Simantex, Inc.

  • Simantex Leadership

    – Experts in diverse public gaming, artificial intelligence applications, e-commerce, and software development
    – Gaming industry experience in lottery, casino, horse racing, sports betting, and eSports
    – Large business/enterprise pedigree complemented by start-up experience and the ability to scale up
    – Track record of creating partnerships, ecosystems, and collaboration
    – Global B2B and B2G experience

  • Helios General Purpose AI/Simulation Platform

    – Helios is a revolutionary new approach to Enterprise software, forming a marriage of Wisdom and Artificial Intelligence to provide real-world solutions
    – Leveraging a proprietary simulation approach, Helios incorporates human learning, reasoning, and perceptual processes into an AI platform
    – Simantex is looking to apply it to the emerging eSports industry to combat fraud, detect software weaknesses, and improve player performance

SLIDE 3

Motivation for High Speed Data Importing

This module is a part of the Helios Platform’s High Performance Data Querying & Access Layer. When we began work on this module, we intended to achieve these objectives:

  • Efficient utilization of server resources (multitenant / cost savings)
  • Scalability to handle clients with massive data needs
  • Develop a complete enterprise solution that is 100% GPU based, proving that just about any problem, no matter how serial in nature it appears, can be mapped to the GPU and achieve significant performance gains

SLIDE 4

Complexities of the CSV format

  • The first line of data could be a column name header record

First Name,Last Name,Notes,Age,Applying From State,GPA
John,Smith,Honnor Student,18,Nevada,3.77
Marybeth,Parkins,Always "early" to class,17,Colorado,3.42
Sam,Peterson,"Hobbies include: tennis, football, swimming",19,Kansas,2.85
"Sarah",Baxter,17,New Jersey,2.90
Fred,Roberts,"Focuses on ""Extreme"" Sports like: sky diving, base jumping, etc",19,Texas,3.05
白,常,专注于研究和运动: hockey,17,California,3.65

(Sample College Applicant CSV File)

SLIDE 5

Complexities of the CSV format

  • Column widths are inconsistent from one record to the next

(Sample College Applicant CSV File, as shown on Slide 4.)

SLIDE 6

Complexities of the CSV format

  • Columns may be quoted (meaning they start and end with a quote)
  • This means that the delimiter could be part of the data
  • The quotes surrounding a column should not be treated as part of the column data

(Sample College Applicant CSV File, as shown on Slide 4.)

SLIDE 7

Complexities of the CSV format

  • Quotes may exist in column data where the column is not quoted
  • Quoted columns may have quotes in the data, which are then double quoted

(Sample College Applicant CSV File, as shown on Slide 4.)

SLIDE 8

Complexities of the CSV format

  • Columns may exceed the target data size
  • Let’s say in this example the Notes column is an nvarchar(50)

Notice that we counted each doubled quote character as 1, and we made sure not to count the outer quotes. Even so, this column exceeds our size constraint, so this record is an error.

(Sample College Applicant CSV File, as shown on Slide 4.)

SLIDE 9

Complexities of the CSV format

  • Number of columns may differ from one record to the next
  • Possible error situation

(Sample College Applicant CSV File, as shown on Slide 4; the "Sarah" row is missing the Notes column.)

SLIDE 10

Complexities of the CSV format

  • UTF-8 text support for multi-language support means:
  • A character may be 1–3 bytes long, affecting how we “count” characters to determine max size constraints
  • Columns can have a mixture of 1-, 2-, and 3-byte characters

(Sample College Applicant CSV File, as shown on Slide 4.)

SLIDE 11

Complexities of the CSV format

  • Not all columns may need to be retrieved from the text
  • Maybe in this run we only want to import:
  • Last Name, Age, and Applying From State
  • So the Importer needs to be able to skip columns without writing out the data

(Sample College Applicant CSV File, as shown on Slide 4; the "Sarah" row is an error row – missing columns – and is not imported.)

SLIDE 12

Thinking Differently: Adapting to Massively Parallel Approaches

This type of problem is traditionally handled by reading data sequentially and managing a variety of “states”. Our approach will compute the “states” for each byte in the CSV file in parallel and store them in a series of arrays. Let’s take a look at the general algorithm flow…


SLIDE 13

CSV Reader Program Flow

1. Read the CSV file from disk into CPU memory in chunks.
2. cudaMalloc and cudaMemcpy each CSV file chunk into GPU memory.
3. The CSV Reader processes the CSV file chunk in GPU memory and outputs results to arrays for each column/field. The output arrays are in GPU memory.
4. GPU processing and calculations (queries, data consolidation, math operations, etc.) run on the output arrays; the results return to the CPU via cudaMemcpy.

SLIDE 14

A Simplified Example

To simplify the problem for now, let’s assume:

  • 1. Field delimiters only appear at field boundaries. No commas within quotes, or double quotes to escape a quote.
  • 2. All data fit within their defined output array widths. There are no overruns.
  • 3. All data are ASCII text characters, so we are always dealing with 1 byte per character.
  • 4. All records or rows have the correct number of fields or columns. No column-count errors.

SLIDE 15

Finding the Individual Record Boundaries

Objectives:

1. Locate the record boundaries
2. Assign that record number to all the characters in that record.

Tasks:

  • Allocate two 32-bit integer arrays of the same dimension as the CSV byte array. One array is the Header, the other is the Prefix Sum (or Scan) array.

Performance Tip: All arrays we use that hold state information are made of 32-bit integers (4 bytes) to ensure optimal alignment when all 32 threads in a warp write or read data to/from the array.

  • Run a kernel where each thread maps to a byte in the CSV byte array.

If the byte is a Line Feed write a 1 to the Header Array, otherwise write a 0.

  • Run the Prefix Sum (in this case an Exclusive Scan) on the Headers.

Now the Scan Array will have the 0-based record number that corresponds to every byte in the CSV Array.
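The two kernel steps above can be sketched sequentially on the CPU (an illustrative Python reference, not the actual GPU code; on the GPU each byte is handled by its own thread and the scan is a parallel prefix-sum kernel):

```python
def record_scan(csv_bytes: bytes):
    # Step 1 (one GPU thread per byte): the Header array gets a 1 wherever
    # the byte is a Line Feed, and a 0 everywhere else.
    headers = [1 if b == 0x0A else 0 for b in csv_bytes]
    # Step 2: Exclusive Scan (prefix sum that excludes the current element).
    scan, running = [], 0
    for h in headers:
        scan.append(running)
        running += h
    # scan[i] is now the 0-based record number of byte i.
    return headers, scan

data = b"Hello,12.2,World\r\nAmy,3,Able\r\n"
headers, scan = record_scan(data)
# Every byte of "Hello,12.2,World" maps to record 0, "Amy,3,Able" to record 1.
```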

SLIDE 16

Finding the Individual Record Boundaries

  • The following figure illustrates the start of a simple 3-column CSV file.
  • The first row is the array index for the next three rows.
  • The second row is the array of bytes at the start of the CSV.
  • The third row is the Linefeed Header array created by the GPU kernel. The Linefeeds separate the records in the CSV.
  • The fourth row is the Exclusive Scan. The value is the 0-based record number that the CSV byte is in.

(Figure: the first 52 bytes of a sample CSV, "Hello,12.2,World CR LF Margarine,15,Butter CR LF Amy,3,Able CR LF Na…", shown with the array index row, the byte row, the Linefeed Header row, and the Exclusive Scan row of record numbers.)

SLIDE 17

Finding Column Boundaries

Objectives:

1. Locate the delimiter characters
2. Assign the relative column number to all the characters in that column.

Tasks:

  • Run a kernel that populates a Header Array for the field delimiters.
  • Run a Segmented Scan.

The Segmented Scan works like a regular Scan except that the count is reset at various points, creating separate segments. In this case our delimiters are commas, and the Segment Boundaries are the Record Boundaries (Linefeeds).
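The Segmented Exclusive Scan can likewise be sketched on the CPU (Python, illustrative names; the real implementation is a parallel GPU kernel):

```python
def column_scan(csv_bytes: bytes):
    # Header array for the field delimiters (commas).
    col_headers = [1 if b == 0x2C else 0 for b in csv_bytes]
    # Segmented Exclusive Scan: the running delimiter count resets at each
    # record boundary (Linefeed), so seg_scan[i] is the 0-based column
    # number of byte i within its own record.
    seg_scan, count = [], 0
    for b, h in zip(csv_bytes, col_headers):
        seg_scan.append(count)
        count += h
        if b == 0x0A:          # Linefeed closes the segment
            count = 0
    return col_headers, seg_scan

data = b"Amy,3,Able\r\nBob,7,Busy\r\n"
_, seg = column_scan(data)
# "Amy" is column 0, "3" column 1, "Able" column 2; the count then resets.
```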

SLIDE 18

Finding Column Boundaries

  • The figure below shows the Segmented Exclusive Scan.
  • The fifth row is the Columns (Delimiters) Header Array.
  • The sixth row is the Segmented Exclusive Scan, which resets to 0 after every Linefeed. The value is the 0-based column number within the record (row).

(Figure: the sample CSV bytes again, with the record-number row and the Segmented Exclusive Scan row of column numbers beneath them.)

SLIDE 19

The Records Table

Objectives:

1. Build an array mapping the end of each Record in the CSV

Tasks:

  • Run Exclusive Scan and Stream Compact kernels on the Header array.

The Index of the array is the 0-based Record Number, and the Value in the Array is the Index of the Linefeed at the end of the record in the CSV Array.
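On the CPU, the scan-plus-compact pair reduces to keeping the flagged indices (an illustrative Python reference sketch, not the GPU kernels themselves):

```python
def build_records_table(csv_bytes: bytes):
    # Header array: 1 at every Linefeed, as in the earlier step.
    headers = [1 if b == 0x0A else 0 for b in csv_bytes]
    # On the GPU, an Exclusive Scan of the headers gives each flagged byte
    # its destination slot and Stream Compact writes the flagged indices
    # there. Sequentially, the combined effect is simply:
    return [i for i, h in enumerate(headers) if h == 1]

data = b"Hello,12.2,World\r\nAmy,3,Able\r\n"
records = build_records_table(data)
# records[r] is the index of the Linefeed ending record r: [17, 29]
```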

(Figure: the sample CSV bytes with the record-number row, and the resulting Records Table: 17, 38, 50, 66, ….)

SLIDE 20

The Columns Table

Objectives:

1. Build an array mapping the end of each Column in the CSV

Tasks:

  • Run Exclusive Scan and Stream Compact kernels on the Columns Header array.

The third row identifies the column delimiters in blue and the linefeeds in pink. The fourth row is the Exclusive Scan, mapping index offsets into the Columns Table.

(Figure: the sample CSV bytes with the delimiter/linefeed header row and its Exclusive Scan, and the resulting Columns Table: 5, 10, 17, 27, 30, 38, 42, 44, 50, ….)

SLIDE 21

Record to Columns Mapping Table

Objectives:

1. Build an array mapping the Column Table array index value that corresponds to the last column in the record

Tasks:

  • Run a specialized Stream Compact kernel.
  • With this we can catch and filter out records with column-count errors, and it serves as an optimization for threads to compute where in the Columns Table their information starts.

(Figure: the three tables side by side. Columns Table: 5, 10, 17, 27, 30, 38, 42, 44, 50, …. Records Table: 17, 38, 50, 66, …. Records-To-Columns Table: 2, 5, 8, 11, ….)
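A CPU sketch of the two column-related tables for a small two-record input (Python, illustrative; the real version is the specialized Stream Compact kernel described above):

```python
def build_tables(csv_bytes: bytes):
    # Columns Table: index of every column terminator (comma or Linefeed).
    cols = [i for i, b in enumerate(csv_bytes) if b in (0x2C, 0x0A)]
    # Records-To-Columns Table: for each record, the index into the Columns
    # Table of its final terminator (the Linefeed). Comparing consecutive
    # entries yields each record's column count, exposing count errors.
    rec_to_cols = [j for j, i in enumerate(cols) if csv_bytes[i] == 0x0A]
    return cols, rec_to_cols

data = b"Amy,3,Able\r\nBob,7,Busy\r\n"
cols, r2c = build_tables(data)
# cols -> [3, 5, 11, 15, 17, 23]; r2c -> [2, 5] (three columns per record)
```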

SLIDE 22

Printable Bytes Array

Objectives:

1. Build an array flagging the Printable Bytes in the CSV

Tasks:

  • Run a set of kernels that identifies Printable Bytes.

In our simple example, it means all bytes except commas, carriage returns, or linefeeds.
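In the simplified example this kernel reduces to a per-byte test (CPU sketch in Python, illustrative name):

```python
def printable_headers(csv_bytes: bytes):
    # 1 for column data, 0 for structural bytes (comma, CR, LF).
    return [0 if b in (0x2C, 0x0D, 0x0A) else 1 for b in csv_bytes]

print(printable_headers(b"Amy,3\r\n"))   # [1, 1, 1, 0, 1, 0, 0]
```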

(Figure: the sample CSV bytes with the record-number, column-number, and Printable Bytes rows; every byte is flagged except the commas, carriage returns, and linefeeds.)

SLIDE 23

Character Position Array

Objectives:

1. Build an array indicating the byte position of each character in a column

Tasks:

  • Run a Segmented Exclusive Scan of the Printable Bytes, with the Segment resetting on both the Column Headers and the Record Headers.

Row 8 shows the values giving the byte or character position of each input character.
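A CPU reference of this scan (Python, illustrative; the positions here are 0-based, whereas the slide's figure appears to show a 1-based variant of the same idea):

```python
def char_positions(csv_bytes: bytes):
    # Segmented Exclusive Scan of the printable-byte flags: the running
    # count resets at each column boundary (comma) and record boundary
    # (Linefeed), and the non-printing CR is not counted.
    positions, count = [], 0
    for b in csv_bytes:
        positions.append(count)
        if b in (0x2C, 0x0A):      # comma or LF closes the segment
            count = 0
        elif b != 0x0D:            # CR is non-printable: skip it
            count += 1
    return positions

print(char_positions(b"Amy,30\r\n"))   # [0, 1, 2, 3, 0, 1, 2, 2]
```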

(Figure: the sample CSV bytes with the Record, Column, Printable, and Position rows stacked beneath them, and the resulting per-column character data.)
SLIDE 24

Putting it all together

  • A final kernel creates the Output Arrays, with each thread checking the Printing Headers to see if its byte should be written.
  • If so, it checks the Scans and Segmented Scans to identify exactly where this byte goes in the Output:
  • Which Record.
  • Which Column (or Output Array in this case).
  • Which byte or character position within the Record within the Output Array.
  • Here are the first few records in our sample CSV in the Output Arrays.

(Figure: the first few records of the sample CSV laid out in the per-column Output Arrays.)
SLIDE 25

Let’s Walk Through a Single Thread

(Figure: the Output Arrays and the full set of state rows for the sample CSV (Record, Column, Printable, Position), with Thread 8 highlighted as it routes its byte to the Output Arrays.)

SLIDE 26

Handling the More Advanced CSV Features

In the above example, we greatly simplified the features of the CSV format that we dealt with. However, the remaining features can be supported in much the same way, building custom arrays to indicate the state of each feature. The following is a list of additional kernels that we developed to handle these features:

SLIDE 27

More Buffers, Scans, and Compacts – Oh My

  • DoubleQuotes: Identify double quotes in column content
  • Merge2ndQuotesAndNonPrinting: Remove 2nd quotes as printable bytes
  • QuoteBoundaryHeaders: Identify if a character is inside quotations
  • FixColumnHeaderCommas: Include delimiters in quotes as printable bytes
  • PrintingCharacters: Filters out quotes around columns and 2nd quotes from the printable bytes
  • BufferPrinting: Stream compacts to remove non-printable bytes
  • IdentifyColumnCountErrors: Identifies records with an incorrect number of columns
  • BuildCharsHeadersOnly: Creates a scan mapping for multi-byte characters so they can be identified as a single character when counting
  • CharacterCountErrors: Counts the number of characters (not bytes) a column contains
SLIDE 28

Optimizations of the Core Parsing Engine

Unlike our simplified example, where each thread writes one byte to the output array, the Core is based on 4 bytes per thread.

Memory alignment is critical, and reads and writes should always be configured on even 4-byte boundaries relative to their allocations. Never access a 32-bit integer at byte offsets 1, 2, or 3, but only at 0 or 4 (or multiples of 4).

Take advantage of casting byte arrays to 32-bit or 64-bit integer arrays to move 4 or 8 bytes in single instructions. When dealing with objects that can be many bytes long (such as strings), sometimes it's better to think of Warps, not Threads, as your logical worker unit.

We will look into that in detail next...

SLIDE 29

A Warp-Centric Approach

  • Each Warp handles one full record or row, regardless of the length of that row.
  • The CSV byte array starts on an even 128-byte boundary, so the first record will start on an even 128-byte boundary. However, subsequent records are most likely not to do so.
  • Each Warp will calculate its Warp Index, which is the Index of Thread 0 in the Warp divided by 32.
  • The Warp Index will correspond to the 0-based Record Number which it will process.
  • Special additional code determines if the target record was marked invalid (having an error), and if so, the warp is re-assigned to the next record.

SLIDE 30

Core CSV Buffer Read Alignment

1. The Warp computes the 128-byte aligned starting address for its Record from information in the Records Table based on its Warp Index.
2. It then grabs the next value in the Records Table to find where its record ends.
3. The Warp will load one or more 128-byte chunks of memory until it encompasses one entire record.

Below shows how 128-byte chunks may map unevenly to records, whose lengths can vary. In this situation, we read the first chunk for Record 0, the first and second for Record 1, the second through fourth for Record 2, and so on.

(Figure: seven consecutive 128-byte chunks spanning Records 0 through 5, whose lengths vary.)

This may seem inefficient, as often more bytes will be read than needed; however, we gain several advantages: 1) All reads are properly aligned on a multiple-of-4 boundary. 2) Several warps may need the same 128-byte chunk, and therefore we increase the chance that it is available in cache, eliminating the slow fetch from Global memory.
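The alignment arithmetic in steps 1-3 can be sketched as (Python, illustrative; the kernel performs the same integer math per warp):

```python
CHUNK = 128

def chunks_for_record(start: int, end: int):
    # Round the record's start down to a 128-byte boundary, then take every
    # chunk up to and including the one holding the record's last byte.
    first = (start // CHUNK) * CHUNK
    last = (end // CHUNK) * CHUNK
    return list(range(first, last + CHUNK, CHUNK))

# A record spanning bytes 120..260 straddles three chunks: 0, 128, and 256.
print(chunks_for_record(120, 260))   # [0, 128, 256]
```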

SLIDE 31

Memory Types and Usage

  • Shared Memory
  • When a record is longer than one 128-byte chunk, the next chunk is pre-fetched into Shared Memory.
  • This is required to support look-ahead logic within the Core algorithm for shuffling and other calculations.
  • As the algorithm moves on to chunk 2->N, it is loaded from Shared Memory, and if needed pre-fetches the next chunk 3->N into Shared Memory, continuing the cycle.

  • Constant Memory
  • Constant memory is used when all threads within a Warp need the same value at once (“broadcasting”).
  • The following values used by all threads are copied into Constant Memory:
  • Field Character Widths
  • Field Byte Widths
  • The Pointers to each of the Output Arrays
  • These values are used by all warps for all records.

IMPORTANT: There is one Output Array for each column in the CSV to be retrieved. To ensure memory alignment, we require that the widths of all output arrays be defined in multiples of 4 or (preferably) even 8. If you used array widths such as 5 or 6, this would be problematic for the Core’s logic.
SLIDE 32

Shuffling for Aligned Writes

  • At this point we assume that all non-printing characters, except column delimiters and record terminators, have been removed from the CSV buffer by the previous kernels.
  • The printing bytes of each column are “shuffled down” so that the column aligns with the start of a 4-byte boundary.
  • As the process completes, threads that have 4 bytes ready write out to an Output Array.
  • If the column ends before the end of a 4-byte boundary, the unused bytes are masked off.

The figure below shows the first part of a sample chunk. The tall vertical bars simply demarcate the 4-byte boundaries representing individual threads. You see the printing characters of each column in colors in the middle row. The bottom row represents the positions of the printing characters after the shuffling is complete.

(Figure: a sample chunk containing the record abc,Hello"My"World,Plenums,Exponent,e23.1,ABC…, shown before and after the shuffle.)

SLIDE 33

Shuffling for Aligned Writes

  • Each column now starts at the beginning of a 4-byte boundary.

The dark gray bytes represent the masking that is done to allow full 4-byte writes each time, but eliminate extraneous bytes from the shuffle operations.

(Figure: the same chunk after shuffling; each column now starts on a 4-byte boundary, with dark gray masked bytes padding out each column's final 4-byte write.)

*The actual shuffle algorithm is fairly complex; this has been an oversimplification for this presentation.

SLIDE 34

Performance Results

Resulting Metric    CPU*         GPU
Total Time          02:57.917    00:04.460
Rows/Second         249,148      9,938,983
Speed Increase      1x           39.9x

Test Platform: Intel Core i7 X990 @ 3.47GHz; NVIDIA GeForce GTX 1080 (Pascal Architecture)

Test File: Size: 1 Gigabyte; Rows: 44,327,867

* None of the CPU-based CSV importers tested supported all the complexity our engine did; thus, if they added the missing functionality, their Rows/Second would drop even lower.
** We used CsvHelper for our CPU benchmark. It has a large community with over 4 million downloads and is the closest to our functionality that we tested.

SLIDE 35

Future Enhancements

We will be looking to add the following future enhancements to our library:

  • Multi-GPU support – sending each file “chunk” to a separate GPU for processing.
  • Apache Arrow support – adapting our already columnar approach to be aligned with Apache Arrow’s format.
  • Further performance optimizations like potential kernel fusion.
  • Possible experimentation with Unified Memory performance.
SLIDE 36

Simantex is pleased to announce that it has joined GOAi and contributed this CSV Importer as open source to the project.

You can find the code and a whitepaper explaining this algorithm in detail here:

https://github.com/gpuopenanalytics

We are also available for consulting and implementation services; please contact me at:

bkennedy@simantex.com