SLIDE 1

SQLite – Running entirely on the GPU

Sky Morey
Chief Architect @DEG
degdigital.com

Library: GpuEx

SLIDE 2

How did we get here?

The team was working on an n-body simulator with a requirement to dynamically add, remove, back up, restore, and query the elements in the simulator. A SQL solution was the natural fit.

GTC :: April 2016

DEG/15/5/ATS

SLIDE 3

Agenda

  • Part 1
    – The Application
    – Strategies
    – Testing the Stack
  • Part 2
    – Development Process
    – Runtime Layer
    – System Layer
    – Data Layer
  • Goals
    – Try the NuGet packages
    – Think about ways to use them in your own projects

2TLK/50M/Q

SLIDE 4

Part 1

SLIDE 5

The Application

see it in action

SLIDE 6

Start with why

  • Working on an n-body simulation
  • Had an engine running with some nodes
  • Next needed to
    – Insert, update and delete nodes
    – Back up and restore the nodes
    – Query the current state
  • Good fit for SQL
  • SQLite is used as an embedded database
  • SQLite has a small code base
  • SQLite is free to use (its source code is in the public domain)
  • SQLite was the best match

ISTD/C/C+/REF/STA

SLIDE 7

Block Diagram

[Block diagram: CPU and GPU columns. Front ends: dSql, JimTcl, TinyTcl. Each side stacks Runtime (Sentinel, Runtime-JimTcl, Runtime-TinyTcl), System (SysEx) and System.Data (DataEx: Pager, VDBE, B-Tree). The CPU side also shows Runtime Host and the Win and Unix systems; the GPU side shows Map.]

SLIDE 8

NuGet Simplicity

  • GpuEx-TinyTcl
  • GpuEx-JimTcl
  • GpuEx-dSql
  • GpuEx-Runtime
  • GpuEx-Runtime.TinyTcl
  • GpuEx-Runtime.JimTcl
  • GpuEx-System
  • GpuEx-System.Data

Compute    Microarch  Library            x86 / x64   Exceptions
None       CPU        {library}_cpu.lib  Yes / Yes
11,12,13   Tesla      {library}_11.lib   Yes         Runtime only
20,21      Fermi      {library}_20.lib   Yes / Yes
30,32      Kepler     {library}_30.lib   Yes / Yes
35,37      Kepler     {library}_35.lib   Yes / Yes
50         Maxwell    {library}_50.lib   Yes / Yes
52,53      Maxwell    {library}_52.lib   Yes / Yes
60         Pascal     {library}_60.lib   Yes / Yes
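
The table reads as a lookup from compute capability to library file name. A minimal C++ sketch of that lookup follows; `librarySuffix` and `libraryFile` are hypothetical helper names, 0 is used here to stand for the CPU build, and the packages themselves may pick libraries differently.

```cpp
#include <string>

// Hypothetical helper: maps a compute capability listed in the table
// (e.g. 35 for 3.5) to the suffix used in the library file name.
// 0 stands for the CPU build. Unlisted values return an empty string.
std::string librarySuffix(int compute) {
    switch (compute) {
        case 0:                     return "cpu";  // no GPU
        case 11: case 12: case 13:  return "11";   // Tesla
        case 20: case 21:           return "20";   // Fermi
        case 30: case 32:           return "30";   // Kepler
        case 35: case 37:           return "35";   // Kepler
        case 50:                    return "50";   // Maxwell
        case 52: case 53:           return "52";   // Maxwell
        case 60:                    return "60";   // Pascal
        default:                    return "";
    }
}

// Builds "{library}_{suffix}.lib", following the naming pattern in the table.
std::string libraryFile(const std::string& library, int compute) {
    const std::string suffix = librarySuffix(compute);
    return suffix.empty() ? std::string() : library + "_" + suffix + ".lib";
}
```

For a placeholder library name `Example`, `libraryFile("Example", 37)` gives `Example_35.lib`.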

PKG/LIB35/ENV/PSAKE

SLIDE 9

Video :: NuGet Runtime

SLIDE 10

Video :: NuGet TinyTcl

SLIDE 11

Video :: NuGet DataEx

SLIDE 12

Video :: NuGet dSql

SLIDE 13

dSql Examples

  • Memory database
  • x86 or x64
  • DDL, DML
    – Create table
    – Insert
    – Select
    – Delete
    – Drop table

  • File database
  • x64 only
  • DML
    – Limit
    – Join
    – Aggregate

SLIDE 14

Video :: Memory Database

SLIDE 15

Video :: File Based Database

SLIDE 16

Strategies

the single-threaded limitation and using SQLite with CUDA

SLIDE 17

Strategies

  • Intermixing – single/multi kernel calls
  • Execution plan – plan single, exec multi
  • Ganging – warp execution as single thread

SLIDE 18

Strategy :: Intermixing

  • Intermix data probes with application kernels

Kernel<<<N, 32>>>()
DataProbe<<<1, 1>>>()
Kernel<<<N, 32>>>()
Kernel<<<N, 32>>>()

Intermixing GPU-threaded application kernel calls with single-threaded SQLite kernel calls.
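
The launch order can be modelled on the host to see what the probe observes. The sketch below is plain C++, not CUDA: `launch` is a CPU stand-in for a kernel launch, and `kernelStep`, `dataProbe` and `runIntermixed` are hypothetical names that are not part of GpuEx.

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// CPU stand-in for a CUDA launch (an assumption, so the sketch runs without
// a GPU): calls body(block, thread) for every pair, in order.
void launch(int blocks, int threads, const std::function<void(int, int)>& body) {
    for (int b = 0; b < blocks; ++b)
        for (int t = 0; t < threads; ++t)
            body(b, t);
}

// Hypothetical application step: one thread per node, each adds 1 to its node.
void kernelStep(std::vector<int>& nodes, int blocks) {
    launch(blocks, 32, [&](int b, int t) {
        const std::size_t i = static_cast<std::size_t>(b) * 32 + t;
        if (i < nodes.size()) nodes[i] += 1;
    });
}

// Hypothetical data probe: a single thread reads the whole state, standing in
// for a single-threaded SQLite query between application steps.
long dataProbe(const std::vector<int>& nodes) {
    long sum = 0;
    launch(1, 1, [&](int, int) {
        sum = std::accumulate(nodes.begin(), nodes.end(), 0L);
    });
    return sum;
}

// Same order as the slide: Kernel, DataProbe, Kernel, Kernel.
// Returns what the probe saw after the first step.
long runIntermixed(std::vector<int>& nodes) {
    const int blocks = static_cast<int>((nodes.size() + 31) / 32);
    kernelStep(nodes, blocks);
    const long probed = dataProbe(nodes);
    kernelStep(nodes, blocks);
    kernelStep(nodes, blocks);
    return probed;
}
```

The probe sees a consistent snapshot because it runs between application steps rather than alongside them.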

SLIDE 19

Strategy :: Execution Plan

  • Build plan single-threaded
  • Execute plan multi-threaded

plan = BuildPlan<<<1, 1>>>()
ExecutePlan<<<N, 32>>>(plan)
ExecutePlan<<<N, 32>>>(plan)

SQLite execution plan generation is single-threaded, while plan execution is GPU-threaded.
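
A host-side C++ model of the build-once, execute-many pattern, under stated assumptions: `launch` is a CPU stand-in for a kernel launch, and the `Plan` struct (a simple "count rows at or above a threshold" query) is invented for illustration. A real SQLite plan is a VDBE program, which this sketch does not attempt to reproduce.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// CPU stand-in for a CUDA launch: calls body(block, thread) for every pair.
void launch(int blocks, int threads, const std::function<void(int, int)>& body) {
    for (int b = 0; b < blocks; ++b)
        for (int t = 0; t < threads; ++t)
            body(b, t);
}

// Hypothetical plan: "count rows with value >= threshold".
struct Plan { int threshold; };

// Built once by a single thread, like BuildPlan<<<1, 1>>>().
Plan buildPlan(int threshold) {
    Plan plan{0};
    launch(1, 1, [&](int, int) { plan.threshold = threshold; });
    return plan;
}

// Executed by blocks * 32 threads, like ExecutePlan<<<N, 32>>>(plan).
// Each thread scans a strided slice of the rows and keeps its own count;
// the per-thread counts are summed after the launch.
int executePlan(const Plan& plan, const std::vector<int>& rows, int blocks) {
    const std::size_t total = static_cast<std::size_t>(blocks) * 32;
    std::vector<int> partial(total, 0);
    launch(blocks, 32, [&](int b, int t) {
        const std::size_t id = static_cast<std::size_t>(b) * 32 + t;
        for (std::size_t i = id; i < rows.size(); i += total)
            if (rows[i] >= plan.threshold) ++partial[id];
    });
    int count = 0;
    for (int c : partial) count += c;
    return count;
}
```

The same `Plan` value can be passed to several `executePlan` calls, which mirrors the two `ExecutePlan` launches on the slide.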

SLIDE 20

Strategy :: Ganging

  • Singular warp default
  • Primary warp method
  • Per warp pattern

[Diagram: threads T1 to T7 shown against the stages "single instruction", "single malloc", "single instruction" and "return"; cells are marked c or m.]

Introduce “Ganging” for single-thread kernel acceleration. Ganging executes in 32-thread form, with a primary thread and 31 supporting threads for localized search or computation acceleration.
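
One way to picture a ganged localized search, as a host-side C++ sketch. The lanes run in a loop here, whereas on a GPU they would run together as a warp. `gangFind` and `kGangSize` are hypothetical names, and the strided split of the data is an assumption, not the library's documented scheme.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr int kGangSize = 32;  // one warp: 1 primary + 31 supporting threads

// Hypothetical ganged search. Each of the 32 lanes checks a strided slice of
// the data and records the first match it sees. The primary lane then picks
// the lowest matching index and returns it, or -1 if nothing matched.
long gangFind(const std::vector<int>& data, int key) {
    std::array<long, kGangSize> hit;
    hit.fill(-1);

    // Search phase: all 32 lanes, including the primary, take part.
    for (int lane = 0; lane < kGangSize; ++lane) {
        for (std::size_t i = static_cast<std::size_t>(lane); i < data.size();
             i += kGangSize) {
            if (data[i] == key) {
                hit[lane] = static_cast<long>(i);
                break;
            }
        }
    }

    // Primary lane: combine the per-lane results into a single answer.
    long best = -1;
    for (long h : hit)
        if (h >= 0 && (best < 0 || h < best)) best = h;
    return best;
}
```

From the caller's point of view this is still one single-threaded call that returns one result; the 31 supporting lanes only shorten the search.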

SLIDE 21

Testing the Stack

how do we know it works?

SLIDE 22

Testing Codebase

  • In-place and xUnit tests for unit testing
  • TCL scripts for unit and integration testing
  • Other patterns for automated testing

SLIDE 23

End of Part 1

  • Try the NuGet packages
  • Think about ways to use them in your own projects

TRNS/Q

SLIDE 24

Part 2

SLIDE 25

The Development Process

how did this get built?

SLIDE 26

Porting

  • SQLite is a single-threaded application, so it is implemented as such
  • Converted from C to C++ and CUDA, and segmented into three additive parts
    – Runtime – represents lower-level operations with stdio/stdlib functions
    – System – OS layer abstraction and Sentinel message bus
    – System.Data – SQLite core data engine

SLIDE 27

Build and Package

  • Packaging for multiple architectures, and compilation issues
    – Project file changes and build scripts for multiple build targets and NuGet packaging, with variations of release\debug, win32\x64, and cpu\gpu20-35
    – Multiple libraries, and hard-coded lib building for CUDA
    – Multiple cu files combined to hide context switching costs for faster builds

LIB

SLIDE 28

Video :: Packaging

SLIDE 29

Video :: Nvcc Context Switching Costs

SLIDE 30

Video :: Ellipse

SLIDE 31

File System

  • Native file system access with Sentinel
    – Sentinel is a host-to-device message bus that solves GPU access to host resources
  • Future: GPU-only device file system
    – Host file system calls through an IPC to GPU files
    – dcat, dcmp, dcp, dgrep, dls, dmkdir, dmore, dmv, drm, drmdir
    – dchgrp, dchmod, dchown

IPC/CTX/INPROC

SLIDE 32

Sentinel

FILE *f = _fopen("fopen.txt", "w");
_fprintfR(f, "The quick brown fox jumps over the lazy dog");
_fcloseR(f);

[Diagram: CPU, GPU and MEMORY. The fopen, fprintf and fclose calls pass between GPU and CPU as messages in memory: fopen (fopen.txt, w) with result 0x1234; fprintf (0x1234, The quick brow…) with result 43; fclose (0x1234).]
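
A host-side C++ model of such a message bus, as a sketch only. The `Message` layout, the `Host` and `Sentinel` structs and the `devOpen`, `devPrint` and `devClose` wrappers are all hypothetical; the real Sentinel message format is not shown on the slide. Files are kept in memory so the example has no side effects.

```cpp
#include <deque>
#include <map>
#include <string>

// Hypothetical message layout (an assumption for illustration).
enum class Op { Open, Print, Close };
struct Message {
    Op op;
    int handle;        // file handle (unused for Open)
    std::string text;  // file name for Open, data for Print
    int result;        // filled in by the host
};

// Host side: owns the resources and services one message at a time.
struct Host {
    std::map<int, std::string> files;  // handle -> contents
    int nextHandle = 1;

    void service(Message& m) {
        switch (m.op) {
            case Op::Open:
                m.result = nextHandle++;
                files[m.result] = "";
                break;
            case Op::Print:
                files[m.handle] += m.text;
                m.result = static_cast<int>(m.text.size());
                break;
            case Op::Close:
                m.result = 0;
                break;
        }
    }
};

// The bus: device code posts a message, then reads back the host's result.
// In this single-threaded model the host drains the queue immediately.
struct Sentinel {
    Host host;
    std::deque<Message> queue;

    int send(Message m) {
        queue.push_back(m);
        host.service(queue.front());
        const int result = queue.front().result;
        queue.pop_front();
        return result;
    }
};

// Device-side wrappers (hypothetical names) for the three calls on the slide.
int devOpen(Sentinel& s, const std::string& name) { return s.send({Op::Open, 0, name, 0}); }
int devPrint(Sentinel& s, int h, const std::string& text) { return s.send({Op::Print, h, text, 0}); }
int devClose(Sentinel& s, int h) { return s.send({Op::Close, h, "", 0}); }
```

Writing the 43-character sentence from the slide through `devPrint` returns 43, matching the value shown in the diagram.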

SLIDE 33

NuGet Packaging

SLIDE 34
Tools ported

  • Lemon generator: the Lemon parser generator, converted to a .NET tool with CUDA output
  • TCL: ported to CUDA to run unit tests on hardware

PKG/LIB35

SLIDE 35

The Runtime Layer

Layers

SLIDE 36

Block Diagram

[Block diagram: Runtime layer. CPU side: Runtime Host, Runtime, Sentinel. GPU side: Runtime, Sentinel.]

VCRT/HEAP/SENT

SLIDE 37

Runtime

  • Separate heap for _printf, _throw, _assert data shuffling to host
  • C runtime replacement in GPU, like stdio/stdlib
    – Memory management: alloc, realloc, free and debugging
    – atof, atoi64, atoi, itoa methods
    – toupper, isupper, isspace, isalnum, isalpha, isdigit, isxdigit, isidchar, tolower methods
    – strcpy, strncpy, strcat, strchr, strstr, strcmp, strncmp, memcpy, memstr, memchr, memcmp, memmove, strlen, hextobyte methods
    – snprintf methods
  • Methods prefixed with “_” to avoid naming collisions
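
To illustrate the “_” prefix convention, here is a sketch of three such replacements in portable C++. The signatures and the `gpurt` namespace are assumptions made to keep the example self-contained; in a CUDA build these would be device functions, and the library's actual declarations may differ.

```cpp
#include <cstddef>

namespace gpurt {  // hypothetical namespace, not taken from the library

// Length of a NUL-terminated string, like strlen.
inline std::size_t _strlen(const char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') ++n;
    return n;
}

// True for '0'..'9', like isdigit.
inline bool _isdigit(char c) { return c >= '0' && c <= '9'; }

// Parses an optional sign followed by decimal digits, like atoi.
// Leading whitespace and overflow are not handled in this sketch.
inline int _atoi(const char* s) {
    int sign = 1;
    if (*s == '-' || *s == '+') {
        if (*s == '-') sign = -1;
        ++s;
    }
    int value = 0;
    while (_isdigit(*s)) {
        value = value * 10 + (*s - '0');
        ++s;
    }
    return sign * value;
}

}  // namespace gpurt
```

Because the names differ from `strlen`, `isdigit` and `atoi`, they can sit next to the host C runtime without clashing.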

SQLITE/DIY

SLIDE 38

The System Layer

Layers

SLIDE 39

Block Diagram

[Block diagram: System layer added on top of Runtime. CPU side: Runtime Host, Runtime, Sentinel, System (SysEx; Win, Unix). GPU side: Runtime, Sentinel, System (SysEx; Map).]

ISYS/GPU

SLIDE 40

System

  • Host-only implementation of the Unix and Windows systems
  • Device-only version of a “map system”, which sits in the GPU and shuffles messages back to the host version using Sentinel

ABSTRACT/SENTINEL

SLIDE 41

The Data Layer

Layers

SLIDE 42

Block Diagram

[Block diagram: full stack. CPU side: Runtime Host, Runtime, Sentinel, System (SysEx; Win, Unix), System.Data (DataEx; Pager, VDBE, B-Tree). GPU side: Runtime, Sentinel, System (SysEx; Map), System.Data (DataEx; Pager, VDBE, B-Tree).]

SLIDE 43

System.Data

  • B-Tree
  • Pager
  • VDBE
  • Stack had to be big; it would crash if not larger than 5 MB

CODEVOLUME/STACK

SLIDE 44

Conclusion :: NVIDIA Ask

Ask: Add a native SATA / NVMe interface to the GeForce line.

Thank you for your time

GPUASSET/DB