SQLite Running entirely on the GPU
Sky Morey, Chief Architect @DEG, degdigital.com
Library: GpuEx
How did we get here?
The team was working on an n-body simulator with a requirement to dynamically add, remove, back up, restore, and query the elements in the simulator. A SQL solution was the natural fit.
GTC :: April 2016
Agenda
Part 1
– The Application
– Strategies
– Testing the Stack
Goals
– Try the NuGet packages
– Think about ways to use them in your own projects
Part 2
– Development Process
– Runtime Layer
– System Layer
– Data Layer
Part 1
The Application
see it in action
Start with why
Working on an n-body simulation
Had an engine running with some nodes
Next needed to
– Insert, update and delete nodes
– Back up and restore the nodes
– Query the current state
Good fit for SQL
SQLite used as an embedded database
SQLite has a small code base
SQLite was under the MIT license
SQLite was the best match
Block Diagram
[Diagram: CPU and GPU stacks. Applications on top: dSql, JimTcl, TinyTcl. Layers: Runtime (Sentinel, Runtime-JimTcl, Runtime-TinyTcl, plus Runtime Host on the CPU), System (SysEx, with Win and Unix on the CPU and Map on the GPU), System.Data (DataEx: Pager, VDBE, B-Tree).]
NuGet Simplicity
Packages:
– GpuEx-TinyTcl
– GpuEx-JimTcl
– GpuEx-dSql
– GpuEx-Runtime
– GpuEx-Runtime.TinyTcl
– GpuEx-Runtime.JimTcl
– GpuEx-System
– GpuEx-System.Data

Compute   Microarch  Library             x86  x64  Exceptions
None      CPU        {library}_cpu.lib   Yes  Yes
11,12,13  Tesla      {library}_11.lib    Yes  -    Runtime only
20,21     Fermi      {library}_20.lib    Yes  Yes
30,32     Kepler     {library}_30.lib    Yes  Yes
35,37                {library}_35.lib    Yes  Yes
50        Maxwell    {library}_50.lib    Yes  Yes
52,53                {library}_52.lib    Yes  Yes
60        Pascal     {library}_60.lib    Yes  Yes
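The library naming rule in the NuGet Simplicity table can be sketched as a small lookup. This is only an illustration: the function name `library_file`, and the convention of passing 0 for a CPU-only build, are assumptions and not part of GpuEx. Only the compute-to-suffix mapping comes from the table.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of GpuEx): builds the library file name for a
// compute capability, following the table. Pass 0 for a CPU-only build.
// Returns an empty string for a compute capability the table does not list.
std::string library_file(const std::string& library, int compute) {
    assert(compute >= 0);
    const char* suffix = nullptr;
    switch (compute) {
        case 0:                    suffix = "cpu"; break;  // Compute "None"
        case 11: case 12: case 13: suffix = "11";  break;  // Tesla
        case 20: case 21:          suffix = "20";  break;  // Fermi
        case 30: case 32:          suffix = "30";  break;  // Kepler
        case 35: case 37:          suffix = "35";  break;  // 35 and 37 share one library
        case 50:                   suffix = "50";  break;  // Maxwell
        case 52: case 53:          suffix = "52";  break;  // 52 and 53 share one library
        case 60:                   suffix = "60";  break;  // Pascal
        default: return std::string();
    }
    return library + "_" + suffix + ".lib";
}
```

With a placeholder library name such as `mylib`, a call like `library_file("mylib", 37)` yields `mylib_35.lib`, since the table groups 35 and 37 under one build.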
Video :: NuGet Runtime
Video :: NuGet TinyTcl
Video :: NuGet DataEx
Video :: NuGet dSql
dSql Examples
Memory database, x86 or x64: DDL, DML
– Create table
– Insert
– Select
– Delete
– Drop table
File database, x64 only: DML
– Limit
– Join
– Aggregate
Video :: Memory Database
Video :: File Based Database
Strategies
single threaded limitation and using with CUDA
Intermixing – single/multi kernel calls
Execution plan – plan single, exec multi
Ganging – warp execution as single thread
Strategy :: Intermixing
Intermix data probes with application kernels
Kernel<<<N, 32>>>()
DataProbe<<<1, 1>>>()
Kernel<<<N, 32>>>()
Kernel<<<N, 32>>>()
Multi-threaded application kernel calls are intermixed with single-threaded SQLite kernel calls.
Strategy :: Execution Plan
Build plan single-threaded
Execute plan multi-threaded
plan = BuildPlan<<<1, 1>>>()
ExecutePlan<<<N, 32>>>(plan)
ExecutePlan<<<N, 32>>>(plan)
SQLite generates the execution plan single-threaded, while the plan itself executes multi-threaded on the GPU.
Strategy :: Ganging
Singular warp default
Primary warp method
Per warp pattern
[Diagram: threads T1 to T7 stepping through a sequence of single instructions, a malloc, and a return.]
“Ganging” accelerates single-threaded kernels. A gang executes as a 32-thread warp, with one primary thread and 31 supporting threads that accelerate localized search or computation.
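The ganging idea, one primary thread helped by 31 supporting threads on a localized search, can be sketched on the host. This is an assumption-laden simulation, not GpuEx code: the name `gang_find` is hypothetical, and the loop over `lane` stands in for the 32 threads of a warp, which on a GPU would run in lockstep and combine results with warp-level primitives.

```cpp
#include <cassert>
#include <vector>

constexpr int kGangSize = 32;  // one warp: 1 primary + 31 supporting lanes

// Host-side simulation of a ganged linear search (hypothetical helper).
// Returns the index of the first element equal to `key`, or -1 if absent.
long gang_find(const std::vector<int>& data, int key) {
    const long n = static_cast<long>(data.size());
    long found[kGangSize];

    // Supporting work: lane L inspects elements L, L+32, L+64, ...
    // and remembers the first match it sees.
    for (int lane = 0; lane < kGangSize; ++lane) {
        found[lane] = -1;
        for (long i = lane; i < n; i += kGangSize) {
            if (data[i] == key) { found[lane] = i; break; }
        }
    }

    // Primary lane: reduce the per-lane results to the lowest index, then
    // carry on as a single thread with the answer.
    long best = -1;
    for (int lane = 0; lane < kGangSize; ++lane) {
        if (found[lane] >= 0 && (best < 0 || found[lane] < best)) {
            best = found[lane];
        }
    }
    assert(best < n);  // the result is a valid index or -1
    return best;
}
```

The caller still sees a single-threaded function with one return value, which is the point of the pattern: the 32-way split stays local to the search.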
Testing the Stack
how do we know it works?
Testing Codebase
In-place and xUnit tests for unit testing
TCL scripts for unit and integration testing
Other patterns for automated testing
End of Part 1
Try the NuGet packages
Think about ways to use them in your own projects
Part 2
The Development Process
how did this get built?
Porting
SQLite is a single-threaded application, so it is implemented as such
Converted from C to C++ and CUDA, and segmented into three additive parts
– Runtime – represents lower level operations with stdio/stdlib functions
– System – OS layer abstraction and sentinel message bus
– System.Data – SQLite core data engine
Build and Package
Packaging for multiple architectures, and compilation issues
– Project file changes and build scripts for multiple build targets and NuGet packaging, with variations of release\debug, win32\x64, and cpu\gpu20-35
– Multiple libraries, and hard-coded lib building for CUDA
– Multiple cu files combined to hide context switching costs for a faster build
Video :: Packaging
Video :: Nvcc Context Switching Costs
Video :: Ellipse
File System
Native file system access with Sentinel
– Sentinel is a host-to-device message bus that gives the GPU access to host resources
Future: GPU-only device file system
– Host file system calls through an IPC to GPU files
– dcat, dcmp, dcp, dgrep, dls, dmkdir, dmore, dmv, drm, drmdir
– dchgrp, dchmod, dchown
Sentinel
FILE *f = _fopen("fopen.txt", "w");
_fprintfR(f, "The quick brown fox jumps over the lazy dog");
_fcloseR(f);
[Diagram: the GPU-side fopen, fprintf, and fclose calls pass through shared memory to the matching CPU-side calls as messages. Message labels: fopen "fopen.txt, w" with 0x1234; fprintf "0x1234, The quick brow…" with 43; fclose "0x1234".]
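The fopen/fprintf/fclose round trip on the Sentinel slide can be modeled in host-only C++. Everything named here (`Message`, `HostSentinel`, `dev_fopen`, and so on) is hypothetical: the slides do not show the real Sentinel message layout or API, and in this sketch "waiting for the host" is just a direct function call rather than a poll on shared memory.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

// Hypothetical message layout; the real Sentinel format is not in the slides.
enum class Op { Fopen, Fprintf, Fclose };
struct Message {
    Op op;
    std::uint64_t handle;  // file handle, for calls that need one
    std::string a, b;      // string arguments (name and mode, or text)
    long result;           // filled in by the host
};

// Host side: owns the real FILE* objects and executes each request.
class HostSentinel {
public:
    void serve(Message& m) {
        switch (m.op) {
        case Op::Fopen: {
            std::FILE* f = std::fopen(m.a.c_str(), m.b.c_str());
            if (!f) { m.result = 0; break; }
            files_[next_] = f;
            m.result = static_cast<long>(next_++);
            break;
        }
        case Op::Fprintf: {
            auto it = files_.find(m.handle);
            m.result = (it == files_.end())
                           ? -1
                           : std::fprintf(it->second, "%s", m.a.c_str());
            break;
        }
        case Op::Fclose: {
            auto it = files_.find(m.handle);
            if (it == files_.end()) { m.result = EOF; break; }
            m.result = std::fclose(it->second);
            files_.erase(it);
            break;
        }
        }
    }
private:
    std::map<std::uint64_t, std::FILE*> files_;
    std::uint64_t next_ = 0x1234;  // arbitrary first handle, echoing the slide
};

// "Device" side stubs: package the call as a message, hand it to the host,
// and return the host's reply.
std::uint64_t dev_fopen(HostSentinel& bus, const char* name, const char* mode) {
    assert(name && mode);
    Message m{Op::Fopen, 0, name, mode, 0};
    bus.serve(m);
    return static_cast<std::uint64_t>(m.result);
}
long dev_fprintf(HostSentinel& bus, std::uint64_t h, const char* text) {
    assert(text);
    Message m{Op::Fprintf, h, text, "", 0};
    bus.serve(m);
    return m.result;
}
long dev_fclose(HostSentinel& bus, std::uint64_t h) {
    Message m{Op::Fclose, h, "", "", 0};
    bus.serve(m);
    return m.result;
}
```

The device side only ever holds an opaque handle, never a host pointer it could dereference, which is why the messages on the slide carry a value like 0x1234 instead of the file itself.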
NuGet Packaging
Tools ported
- Lemon parser generator, converted to a .NET tool with CUDA output
- TCL ported to CUDA to run unit tests on hardware
The Runtime Layer
Layers
Block Diagram
[Diagram: the Runtime layer on the CPU and GPU stacks. Each side has Runtime and Sentinel; the CPU side also has Runtime Host.]
Runtime
Separate heap for _printf, _throw, _assert data shuffling to host
C runtime replacement in GPU, like stdio/stdlib
– Memory management: alloc, realloc, free and debugging
– atof, atoi64, atoi, itoa methods
– toupper, isupper, isspace, isalnum, isalpha, isdigit, isxdigit, isidchar, tolower methods
– strcpy, strncpy, strcat, strchr, strstr, strcmp, strncmp, memcpy, memstr, memchr, memcmp, memmove, strlen, hextobyte methods
– snprintf methods
Methods prefixed with “_” to avoid naming collisions
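A few of the “_”-prefixed replacements might look like the following. This is a minimal sketch under stated assumptions, not the GpuEx source: the `RT_FN` macro and the `rt_sketch` namespace are invented here (the namespace also keeps the underscore-prefixed names out of the global namespace, where C++ reserves them), and the bodies are straightforward reimplementations that compile both as plain C++ and under nvcc.

```cpp
#include <cstddef>

// Expands to CUDA qualifiers under nvcc and to nothing in a plain C++ build,
// so the same source can serve host and device.
#ifdef __CUDACC__
#define RT_FN __host__ __device__
#else
#define RT_FN
#endif

namespace rt_sketch {  // hypothetical namespace, not part of GpuEx

// Length of a NUL-terminated string, like strlen.
RT_FN inline std::size_t _strlen(const char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') ++n;
    return n;
}

// True for ASCII decimal digits, like isdigit in the "C" locale.
RT_FN inline bool _isdigit(char c) { return c >= '0' && c <= '9'; }

// Parses an optionally signed decimal integer after leading spaces or tabs,
// like a simplified atoi; stops at the first non-digit. No overflow checks.
RT_FN inline int _atoi(const char* s) {
    while (*s == ' ' || *s == '\t') ++s;
    bool neg = false;
    if (*s == '+' || *s == '-') { neg = (*s == '-'); ++s; }
    int v = 0;
    while (_isdigit(*s)) { v = v * 10 + (*s - '0'); ++s; }
    return neg ? -v : v;
}

}  // namespace rt_sketch
```

Because the names differ from the standard ones, code ported from C can call `_strlen` in device functions without clashing with the host toolchain's `strlen`.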
The System Layer
Layers
Block Diagram
[Diagram: the Runtime and System layers on the CPU and GPU stacks. Runtime: Runtime and Sentinel on each side, plus Runtime Host on the CPU. System: SysEx on each side, with Win and Unix on the CPU and Map on the GPU.]
System
Host-only implementation of the Unix and Windows systems
Device-only version of a “map system”, which sits in the GPU and shuffles messages back to the host version using Sentinel
The Data Layer
Layers
Block Diagram
[Diagram: all three layers on the CPU and GPU stacks. Runtime: Runtime and Sentinel on each side, plus Runtime Host on the CPU. System: SysEx on each side, with Win and Unix on the CPU and Map on the GPU. System.Data: DataEx with Pager, VDBE, and B-Tree on each side.]
System.Data
B-Tree
Pager
VDBE
The stack had to be big: it would crash if not larger than 5 MB
Conclusion :: NVIDIA Ask
Thank you for your time
Ask: Add a native SATA / NVMe interface to the GeForce line.