S8443: Feeding the Big Data Engine: How to Import Data in Parallel - PowerPoint PPT Presentation



SLIDE 1

S8443: Feeding the Big Data Engine

How to Import Data in Parallel

Presented By: Brian Kennedy, CTO

Providence – Atlanta Email: bkennedy@simantex.com

SLIDE 2

Introduction to Simantex, Inc.

  • Simantex Leadership

    – Experts in diverse public gaming, artificial intelligence applications, e-commerce, and software development
    – Gaming industry experience in lottery, casino, horse racing, sports betting, and eSports
    – Large business/enterprise pedigree complemented by start-up experience and the ability to scale up
    – Track record of creating partnerships, ecosystems, and collaboration
    – Global B2B and B2G experience

  • Helios General Purpose AI/Simulation Platform

    – Helios is a revolutionary new approach to Enterprise software, forming a marriage of Wisdom and Artificial Intelligence to provide real-world solutions
    – Leveraging a proprietary simulation approach, Helios incorporates human learning, reasoning, and perceptual processes into an AI platform
    – Simantex is looking to apply it to the emerging eSports industry to combat fraud, detect software weaknesses, and improve player performance

SLIDE 3

Motivation for High Speed Data Importing

This module is a part of the Helios Platform’s High Performance Data Querying & Access Layer. When we began work on this module, we intended to achieve these objectives:

  • Efficient utilization of server resources (multitenant / cost savings)
  • Scalability to handle clients with massive data needs
  • Develop a complete enterprise solution that is 100% GPU based, proving that just about any problem, no matter how serial in nature it appears, can be mapped to the GPU and achieve significant performance gains

SLIDE 4

Complexities of the CSV format

  • The first line of data could be a column name header record

First Name,Last Name,Notes,Age,Applying From State,GPA
John,Smith,Honnor Student,18,Nevada,3.77
Marybeth,Parkins,Always "early" to class,17,Colorado,3.42
Sam,Peterson,"Hobbies include: tennis, football, swimming",19,Kansas,2.85
"Sarah",Baxter,17,New Jersey,2.90
Fred,Roberts,"Focuses on ""Extreme"" Sports like: sky diving, base jumping, etc",19,Texas,3.05
白,常,专注于研究和运动: hockey,17,California,3.65

(Sample College Applicant CSV File)

SLIDE 5

Complexities of the CSV format

  • Column widths are inconsistent from one record to the next

(Sample College Applicant CSV File, as shown on Slide 4.)

SLIDE 6

Complexities of the CSV format

  • Columns may be quoted (meaning they start and end with a quote)
  • This means that the delimiter could be part of the data
  • The quotes surrounding a column should not be treated as part of the column data

(Sample College Applicant CSV File, as shown on Slide 4.)

SLIDE 7

Complexities of the CSV format

  • Quotes may exist in column data where the column is not quoted
  • Quoted columns may have quotes in the data, which are then double quoted

(Sample College Applicant CSV File, as shown on Slide 4.)

SLIDE 8

Complexities of the CSV format

  • Columns may exceed the target data size
  • Let’s say in this example the Notes column is an nvarchar(50)

Notice that we counted each doubled quote character as 1, and we made sure not to count the outer quotes. Even so, this column exceeds our size constraint, so this record is an error.

(Sample College Applicant CSV File, as shown on Slide 4.)

SLIDE 9

Complexities of the CSV format

  • Number of columns may differ from one record to the next
  • Possible error situation

(Sample College Applicant CSV File, as shown on Slide 4; the "Sarah" row is missing the Notes column.)

SLIDE 10

Complexities of the CSV format

  • UTF-8 text support for multi-language support means:
  • A character may be 1–3 bytes long, affecting how we “count” characters to determine max size constraints
  • Columns can have a mixture of 1-, 2-, and 3-byte characters

(Sample College Applicant CSV File, as shown on Slide 4.)

SLIDE 11

Complexities of the CSV format

  • Not all columns may need to be retrieved from the text
  • Maybe in this run we only want to import:
  • Last Name, Age, and Applying From State
  • So the Importer needs to be able to skip columns without writing out the data

(Sample College Applicant CSV File, as shown on Slide 4; the "Sarah" row is an error row – missing columns – and is not imported.)

SLIDE 12

Thinking Differently: Adapting to Massively Parallel Approaches

This type of problem is traditionally handled by reading data sequentially and managing a variety of “states”. Our approach will compute the “states” for each byte in the CSV file in parallel and store them in a series of arrays. Let’s take a look at the general algorithm flow…


SLIDE 13

CSV Reader Program Flow

1. Read the CSV file from disk into CPU memory in chunks.
2. cudaMalloc and cudaMemcpy each CSV file chunk into GPU memory.
3. The CSV Reader processes the CSV file chunk in GPU memory and outputs results to arrays for each column/field. The output arrays are in GPU memory.
4. GPU processing and calculations (queries, data consolidation, math operations, etc.) run on the output arrays; the results return to the CPU via cudaMemcpy.

SLIDE 14

A Simplified Example

To simplify the problem for now, let’s assume:

  • 1. Field delimiters only appear at field boundaries. No commas within quotes, or double quotes to escape a quote.
  • 2. All data fit within their defined output array widths. There are no overruns.
  • 3. All data are ASCII text characters, so we are always dealing with 1 byte per character.
  • 4. All records or rows have the correct number of fields or columns. No column-count errors.

SLIDE 15

Finding the Individual Record Boundaries

Objectives:

1. Locate the record boundaries
2. Assign that record number to all the characters in that record.

Tasks:

  • Allocate two 32-bit integer arrays of the same dimension as the CSV byte array. One array is the Header, the other is the Prefix Sum (or Scan) array.

Performance Tip: All arrays we use that hold state information are made of 32-bit integers (4 bytes) to ensure optimal alignment when all 32 threads in a warp write or read data to/from the array.

  • Run a kernel where each thread maps to a byte in the CSV byte array.

If the byte is a Line Feed write a 1 to the Header Array, otherwise write a 0.

  • Run the Prefix Sum (in this case an Exclusive Scan) on the Headers.

Now the Scan Array will have the 0-based record number that corresponds to every byte in the CSV Array.
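The two kernel steps above can be sketched sequentially on the CPU (an illustrative Python reference, not the actual GPU code; on the GPU each byte is handled by its own thread and the scan is a parallel prefix-sum kernel):

```python
def record_scan(csv_bytes: bytes):
    # Step 1 (one GPU thread per byte): the Header array gets a 1 wherever
    # the byte is a Line Feed, and a 0 everywhere else.
    headers = [1 if b == 0x0A else 0 for b in csv_bytes]
    # Step 2: Exclusive Scan (prefix sum that excludes the current element).
    scan, running = [], 0
    for h in headers:
        scan.append(running)
        running += h
    # scan[i] is now the 0-based record number of byte i.
    return headers, scan

data = b"Hello,12.2,World\r\nAmy,3,Able\r\n"
headers, scan = record_scan(data)
# Every byte of "Hello,12.2,World" maps to record 0, "Amy,3,Able" to record 1.
```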

SLIDE 16

Finding the Individual Record Boundaries

  • The following figure illustrates the start of a simple 3-column CSV file.
  • The first row is the array index for the next three rows.
  • The second row is the array of bytes at the start of the CSV.
  • The third row is the Linefeed Header array created by the GPU kernel. The Linefeeds separate the records in the CSV.
  • The fourth row is the Exclusive Scan. The value is the 0-based record number that the CSV byte is in.

(Figure: the first 52 bytes of a sample CSV, "Hello,12.2,World CR LF Margarine,15,Butter CR LF Amy,3,Able CR LF Na…", shown with the array index row, the byte row, the Linefeed Header row, and the Exclusive Scan row of record numbers.)

SLIDE 17

Finding Column Boundaries

Objectives:

1. Locate the delimiter characters
2. Assign the relative column number to all the characters in that column.

Tasks:

  • Run a kernel that populates a Header Array for the field delimiters.
  • Run a Segmented Scan.

The Segmented Scan works like a regular Scan except that the count is reset at various points, creating separate segments. In this case our delimiters are commas, and the Segment Boundaries are the Record Boundaries (Linefeeds).
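The Segmented Exclusive Scan can likewise be sketched on the CPU (Python, illustrative names; the real implementation is a parallel GPU kernel):

```python
def column_scan(csv_bytes: bytes):
    # Header array for the field delimiters (commas).
    col_headers = [1 if b == 0x2C else 0 for b in csv_bytes]
    # Segmented Exclusive Scan: the running delimiter count resets at each
    # record boundary (Linefeed), so seg_scan[i] is the 0-based column
    # number of byte i within its own record.
    seg_scan, count = [], 0
    for b, h in zip(csv_bytes, col_headers):
        seg_scan.append(count)
        count += h
        if b == 0x0A:          # Linefeed closes the segment
            count = 0
    return col_headers, seg_scan

data = b"Amy,3,Able\r\nBob,7,Busy\r\n"
_, seg = column_scan(data)
# "Amy" is column 0, "3" column 1, "Able" column 2; the count then resets.
```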

SLIDE 18

Finding Column Boundaries

  • The figure below shows the Segmented Exclusive Scan.
  • The fifth row is the Columns (Delimiters) Header Array.
  • The sixth row is the Segmented Exclusive Scan, which resets to 0 after every Linefeed. The value is the 0-based column number within the record (row).

(Figure: the sample CSV bytes again, with the record-number row and the Segmented Exclusive Scan row of column numbers beneath them.)

SLIDE 19

The Records Table

Objectives:

1. Build an array mapping the end of each Record in the CSV

Tasks:

  • Run Exclusive Scan and Stream Compact kernels on the Header array.

The Index of the array is the 0-based Record Number, and the Value in the Array is the Index of the Linefeed at the end of the record in the CSV Array.
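On the CPU, the scan-plus-compact pair reduces to keeping the flagged indices (an illustrative Python reference sketch, not the GPU kernels themselves):

```python
def build_records_table(csv_bytes: bytes):
    # Header array: 1 at every Linefeed, as in the earlier step.
    headers = [1 if b == 0x0A else 0 for b in csv_bytes]
    # On the GPU, an Exclusive Scan of the headers gives each flagged byte
    # its destination slot and Stream Compact writes the flagged indices
    # there. Sequentially, the combined effect is simply:
    return [i for i, h in enumerate(headers) if h == 1]

data = b"Hello,12.2,World\r\nAmy,3,Able\r\n"
records = build_records_table(data)
# records[r] is the index of the Linefeed ending record r: [17, 29]
```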

(Figure: the sample CSV bytes with the record-number row, and the resulting Records Table: 17, 38, 50, 66, ….)

SLIDE 20

The Columns Table

Objectives:

1. Build an array mapping the end of each Column in the CSV

Tasks:

  • Run Exclusive Scan and Stream Compact kernels on the Columns Header array.

The third row identifies the column delimiters in blue and the linefeeds in pink. The fourth row is the Exclusive Scan, mapping index offsets into the Columns Table.

(Figure: the sample CSV bytes with the delimiter/linefeed header row and its Exclusive Scan, and the resulting Columns Table: 5, 10, 17, 27, 30, 38, 42, 44, 50, ….)

SLIDE 21

Record to Columns Mapping Table

Objectives:

1. Build an array mapping the Column Table array index value that corresponds to the last column in the record

Tasks:

  • Run a specialized Stream Compact kernel.
  • With this we can catch and filter out records with column-count errors, and it serves as an optimization for threads to compute where in the Columns Table their information starts.

(Figure: the three tables side by side. Columns Table: 5, 10, 17, 27, 30, 38, 42, 44, 50, …. Records Table: 17, 38, 50, 66, …. Records-To-Columns Table: 2, 5, 8, 11, ….)
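A CPU sketch of the two column-related tables for a small two-record input (Python, illustrative; the real version is the specialized Stream Compact kernel described above):

```python
def build_tables(csv_bytes: bytes):
    # Columns Table: index of every column terminator (comma or Linefeed).
    cols = [i for i, b in enumerate(csv_bytes) if b in (0x2C, 0x0A)]
    # Records-To-Columns Table: for each record, the index into the Columns
    # Table of its final terminator (the Linefeed). Comparing consecutive
    # entries yields each record's column count, exposing count errors.
    rec_to_cols = [j for j, i in enumerate(cols) if csv_bytes[i] == 0x0A]
    return cols, rec_to_cols

data = b"Amy,3,Able\r\nBob,7,Busy\r\n"
cols, r2c = build_tables(data)
# cols -> [3, 5, 11, 15, 17, 23]; r2c -> [2, 5] (three columns per record)
```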

SLIDE 22

Printable Bytes Array

Objectives:

1. Build an array flagging the Printable Bytes in the CSV

Tasks:

  • Run a set of kernels that identifies Printable Bytes.

In our simple example, it means all bytes except commas, carriage returns, or linefeeds.
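In the simplified example this kernel reduces to a per-byte test (CPU sketch in Python, illustrative name):

```python
def printable_headers(csv_bytes: bytes):
    # 1 for column data, 0 for structural bytes (comma, CR, LF).
    return [0 if b in (0x2C, 0x0D, 0x0A) else 1 for b in csv_bytes]

print(printable_headers(b"Amy,3\r\n"))   # [1, 1, 1, 0, 1, 0, 0]
```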

(Figure: the sample CSV bytes with the record-number, column-number, and Printable Bytes rows; every byte is flagged except the commas, carriage returns, and linefeeds.)

SLIDE 23

Character Position Array

Objectives:

1. Build an array indicating the byte position of each character in a column

Tasks:

  • Run a Segmented Exclusive Scan of the Printable Bytes, with the Segment resetting on both the Column Headers and the Record Headers.

Row 8 shows the values giving the byte or character position of each input character.
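A CPU reference of this scan (Python, illustrative; the positions here are 0-based, whereas the slide's figure appears to show a 1-based variant of the same idea):

```python
def char_positions(csv_bytes: bytes):
    # Segmented Exclusive Scan of the printable-byte flags: the running
    # count resets at each column boundary (comma) and record boundary
    # (Linefeed), and the non-printing CR is not counted.
    positions, count = [], 0
    for b in csv_bytes:
        positions.append(count)
        if b in (0x2C, 0x0A):      # comma or LF closes the segment
            count = 0
        elif b != 0x0D:            # CR is non-printable: skip it
            count += 1
    return positions

print(char_positions(b"Amy,30\r\n"))   # [0, 1, 2, 3, 0, 1, 2, 2]
```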

(Figure: the sample CSV bytes with the Record, Column, Printable, and Position rows stacked beneath them, and the resulting per-column character data.)
SLIDE 24

Putting it all together

  • A final kernel creates the Output Arrays, with each thread checking the Printing Headers to see if its byte should be written.
  • If so, it checks the Scans and Segmented Scans to identify exactly where this byte goes in the Output:
  • Which Record.
  • Which Column (or Output Array in this case).
  • Which byte or character position within the Record within the Output Array.
  • Here are the first few records in our sample CSV in the Output Arrays.

(Figure: the first few records of the sample CSV laid out in the per-column Output Arrays.)
SLIDE 25

Let’s Walk Through a Single Thread

(Figure: the Output Arrays and the full set of state rows for the sample CSV (Record, Column, Printable, Position), with Thread 8 highlighted as it routes its byte to the Output Arrays.)

SLIDE 26

Handling the More Advanced CSV Features

In the above example, we greatly simplified the features of the CSV format that we dealt with. However, the remaining features can be supported in much the same way, building custom arrays to indicate the state of each feature. The following is a list of additional kernels that we developed to handle these features:

SLIDE 27

More Buffers, Scans, and Compacts – Oh My

  • DoubleQuotes: Identify double quotes in column content
  • Merge2ndQuotesAndNonPrinting: Remove 2nd quotes as printable bytes
  • QuoteBoundaryHeaders: Identify if a character is inside quotations
  • FixColumnHeaderCommas: Include delimiters in quotes as printable bytes
  • PrintingCharacters: Filters out quotes around columns and 2nd quotes from the printable bytes
  • BufferPrinting: Stream compacts to remove non-printable bytes
  • IdentifyColumnCountErrors: Identifies records with an incorrect number of columns
  • BuildCharsHeadersOnly: Creates a scan mapping for multi-byte characters so they can be identified as a single character when counting
  • CharacterCountErrors: Counts the number of characters (not bytes) a column contains
SLIDE 28

Optimizations of the Core Parsing Engine

Unlike our simplified example, where each thread writes one byte to the output array, the Core is based on 4 bytes per thread.

Memory alignment is critical, and reads and writes should always be configured on even 4-byte boundaries relative to their allocations. Never access a 32-bit integer at byte offsets 1, 2, or 3, but only at 0 or 4 (or multiples of 4).

Take advantage of casting byte arrays to 32-bit or 64-bit integer arrays to move 4 or 8 bytes in single instructions. When dealing with objects that can be many bytes long (such as strings), sometimes it's better to think of Warps, not Threads, as your logical worker unit.

We will look into that in detail next...

SLIDE 29

A Warp-Centric Approach

  • Each Warp handles one full record or row, regardless of the length of that row.
  • The CSV byte array starts on an even 128-byte boundary, so the first record will start on an even 128-byte boundary. However, subsequent records are most likely not to do so.
  • Each Warp will calculate its Warp Index, which is the Index of Thread 0 in the Warp divided by 32.
  • The Warp Index will correspond to the 0-based Record Number which it will process.
  • Special additional code determines if the target record was marked invalid (having an error), and if so, the warp is re-assigned to the next record.

SLIDE 30

Core CSV Buffer Read Alignment

1. The Warp computes the 128-byte aligned starting address for its Record from information in the Records Table based on its Warp Index.
2. It then grabs the next value in the Records Table to find where its record ends.
3. The Warp will load one or more 128-byte chunks of memory until it encompasses one entire record.

Below shows how 128-byte chunks may map unevenly to records, whose lengths can vary. In this situation, we read the first chunk for Record 0, the first and second for Record 1, the second through fourth for Record 2, and so on.

(Figure: seven consecutive 128-byte chunks spanning Records 0 through 5, whose lengths vary.)

This may seem inefficient, as often more bytes will be read than needed; however, we gain several advantages: 1) All reads are properly aligned on a multiple-of-4 boundary. 2) Several warps may need the same 128-byte chunk, and therefore we increase the chance that it is available in cache, eliminating the slow fetch from Global memory.
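The alignment arithmetic in steps 1-3 can be sketched as (Python, illustrative; the kernel performs the same integer math per warp):

```python
CHUNK = 128

def chunks_for_record(start: int, end: int):
    # Round the record's start down to a 128-byte boundary, then take every
    # chunk up to and including the one holding the record's last byte.
    first = (start // CHUNK) * CHUNK
    last = (end // CHUNK) * CHUNK
    return list(range(first, last + CHUNK, CHUNK))

# A record spanning bytes 120..260 straddles three chunks: 0, 128, and 256.
print(chunks_for_record(120, 260))   # [0, 128, 256]
```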

SLIDE 31

Memory Types and Usage

  • Shared Memory
  • When a record is longer than one 128-byte chunk, the next chunk is pre-fetched into Shared Memory.
  • This is required to support look-ahead logic within the Core algorithm for shuffling and other calculations.
  • As the algorithm moves on to chunk 2->N, it is loaded from Shared Memory, and if needed pre-fetches the next chunk 3->N into Shared Memory, continuing the cycle.

  • Constant Memory
  • Constant memory is used when all threads within a Warp need the same value at once (“broadcasting”).
  • The following values used by all threads are copied into Constant Memory:
  • Field Character Widths
  • Field Byte Widths
  • The Pointers to each of the Output Arrays
  • These values are used by all warps for all records.

IMPORTANT: There is one Output Array for each column in the CSV to be retrieved. To ensure memory alignment, we require that the widths of all output arrays be defined in multiples of 4 or (preferably) even 8. If you used array widths such as 5 or 6, this would be problematic for the Core’s logic.
SLIDE 32

Shuffling for Aligned Writes

  • At this point we assume that all non-printing characters, except column delimiters and record terminators, have been removed from the CSV buffer by the previous kernels.
  • The printing bytes of each column are “shuffled down” so that the column aligns with the start of a 4-byte boundary.
  • As the process completes, threads that have 4 bytes ready write out to an Output Array.
  • If the column ends before the end of a 4-byte boundary, the unused bytes are masked off.

The figure below shows the first part of a sample chunk. The tall vertical bars simply demarcate the 4-byte boundaries representing individual threads. You see the printing characters of each column in colors in the middle row. The bottom row represents the positions of the printing characters after the shuffling is complete.

(Figure: a sample chunk containing the record abc,Hello"My"World,Plenums,Exponent,e23.1,ABC…, shown before and after the shuffle.)

SLIDE 33

Shuffling for Aligned Writes

  • Each column now starts at the beginning of a 4-byte boundary.

The dark gray bytes represent the masking that is done to allow full 4-byte writes each time, but eliminate extraneous bytes from the shuffle operations.

(Figure: the same chunk after shuffling; each column now starts on a 4-byte boundary, with dark gray masked bytes padding out each column's final 4-byte write.)

*The actual shuffle algorithm is fairly complex; this has been an oversimplification for this presentation.

SLIDE 34

Performance Results

Resulting Metric    CPU*         GPU
Total Time          02:57.917    00:04.460
Rows/Second         249,148      9,938,983
Speed Increase      1x           39.9x

Test Platform: Intel Core i7 X990 @ 3.47GHz; NVIDIA GeForce GTX 1080 (Pascal Architecture)

Test File: Size: 1 Gigabyte; Rows: 44,327,867

* None of the CPU-based CSV importers tested supported all the complexity our engine did; thus, if they added the missing functionality, their Rows/Second would drop even lower.
** We used CsvHelper for our CPU benchmark. It has a large community with over 4 million downloads and is the closest to our functionality that we tested.

SLIDE 35

Future Enhancements

We will be looking to add the following future enhancements to our library:

  • Multi-GPU support – sending each file “chunk” to a separate GPU for processing.
  • Apache Arrow support – adapting our already columnar approach to be aligned with Apache Arrow’s format.
  • Further performance optimizations like potential kernel fusion.
  • Possible experimentation with Unified Memory performance.
SLIDE 36

Simantex is pleased to announce that it has joined GOAi and contributed this CSV Importer as open source to the project.

You can find the code and a whitepaper explaining this algorithm in detail here:

https://github.com/gpuopenanalytics

We are also available for consulting and implementation services; please contact me at:

bkennedy@simantex.com