GPU-Centric Thinking: Use Case Acceleration of a DNA Sequencer Pipeline
Chuck Seberino Principal Software Engineer
Chuck's Three Guiding Principles

Hardware Architecture - Understanding the limitations (and strengths) of SIMD and compute-heavy ASIC.
doesn’t have to be a bad thing.
just a single kernel.
be geared towards compute!
// Scale integer value and store as floating point.
__global__ void kernel(const int* input, float scale, float* output)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    output[index] = input[index] * scale;
}

kernel<<<1000, 1000>>>(input, scale, output);  // 1,000,000 threads (1000x1000).
// Scale integer value and store as floating point.
__global__ void kernel(const int* input, int length, float scale, float* output)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    if (index >= length) return;
    output[index] = input[index] * scale;
}

//       gridsize blocksize
kernel<<< 977,    1024 >>>(input, 1000000, scale, output);
<<<1000, 1000>>> - "wasting" 24,000 thread slots: (31 full warps + 1 warp at 25% active) x 1000 blocks.
<<<977, 1024>>> - 976 blocks of 32 full warps + a last block with 18 full warps + 14 early-return warps. Could have also used <<<15625, 64>>>.
advantage of it, especially if porting new code.
– Alleviates the user from managing both host and device memory and handling data transfers.
– When NVLink comes on the scene, UM can make immediate use of it.
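A minimal Unified Memory sketch (my illustration, not code from the talk): a single managed pointer is usable from both host and device, with no explicit cudaMemcpy.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

int main()
{
    const int n = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));  // visible to host AND device
    for (int i = 0; i < n; ++i) data[i] = 1.0f;   // host writes directly
    scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);
    cudaDeviceSynchronize();                      // required before host reads
    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```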
"… batching many small transfers into one larger transfer performs significantly better than making each transfer separately, even if doing so requires packing non-contiguous regions of memory into a contiguous buffer and then unpacking after the transfer." (CUDA C Best Practices Guide)
host memory ...”
memory?
[Charts: "Host to Device Transfer Time for Small Sizes", "Host to Device Transfer Time", and "Host to Device Transfer Speed" - transfer time (µs) and speed (MB/s) vs. transfer size (KB), measured on a MBP (2013) GeForce GT 650M (PCIe 2.0), GeForce Titan X (PCIe 3.0), and Quadro M6000.]

Cost in time for < 200 KB transfers is fixed.
– Copy operation gets serialized on the GPU along with kernel launches - no copy-engine overlap with kernels.
– Host doesn't block on the call though (silently pins).
– Can examine in Visual Profiler.
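A hedged sketch (mine, not the talk's code) of explicitly pinned host memory, which is what makes cudaMemcpyAsync genuinely asynchronous and overlappable:

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;
    float* hostBuf = nullptr;
    float* devBuf  = nullptr;
    cudaHostAlloc(&hostBuf, bytes, cudaHostAllocDefault);  // pinned host memory
    cudaMalloc(&devBuf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Truly asynchronous only because hostBuf is pinned; a pageable
    // buffer would be staged (silently pinned) by the driver instead.
    cudaMemcpyAsync(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    return 0;
}
```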
[Visual Profiler screenshot: "Not Pinned!"]
it may be advantageous to parallelize work in multiple ways.
different areas of memory
same time.
– Stalls in one kernel allow other kernels to become active, keeping the GPU busy.
– NOTE: Does not work for branching kernels!
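A minimal multi-stream launch sketch (my illustration, not the talk's code) showing how independent kernels are given the chance to run concurrently:

```cuda
#include <cuda_runtime.h>

__global__ void work(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

int main()
{
    const int numStreams = 4;
    const int n = 1 << 18;
    cudaStream_t streams[numStreams];
    float* bufs[numStreams];

    for (int s = 0; s < numStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&bufs[s], n * sizeof(float));
    }
    // Kernels on different streams may execute concurrently; a stall in
    // one lets the scheduler pull in blocks from another.
    for (int s = 0; s < numStreams; ++s)
        work<<<(n + 255) / 256, 256, 0, streams[s]>>>(bufs[s], n);

    cudaDeviceSynchronize();
    for (int s = 0; s < numStreams; ++s) {
        cudaFree(bufs[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```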
careful to do it properly.
– Makes it easier to debug default-stream problems.
– Able to verify correct behavior in NVVP - thrust, accidental non-stream commands, etc.
for (int n = 0; n < numIterations; ++n)
{
    // Perform initial work as separate streams
    for (int ii = 0; ii < numStreams; ++ii)
    {
        // Wait for previous loop main stream
        gpuPtr->streamWait(ii, EventStream);
        // "Compute"
        gpuPtr->sleep(10 + ii*50, ii);
        // Create event record for stream ii
        gpuPtr->timerStop(ii);
        // Break synchronization on last stream
        if (syncStreams || ii != 3)
        {
            // Tell main stream to wait for stream ii stop record.
            gpuPtr->streamWait(EventStream, ii);
        }
    }
    // Main stream "Compute"
    gpuPtr->sleep(100, EventStream);
    // Synchronization point for other streams
    gpuPtr->timerStop(EventStream);

    // Perform additional work as individual streams
    for (int ii = 0; ii < numStreams; ++ii)
    {
        // Wait for main stream to be complete.
        gpuPtr->streamWait(ii, EventStream);
        // "Compute"
        gpuPtr->sleep(30 + 10*ii, ii);
        // Create event record for stream ii
        gpuPtr->timerStop(ii);
        // Tell main stream to wait for stream ii stop record.
        gpuPtr->streamWait(EventStream, ii);
    }
    // Again, consolidate and run on a single stream
    gpuPtr->sleep(100, EventStream);
    // Synchronization point for other streams
    gpuPtr->timerStop(EventStream);
}
Sync: Streams 0-3 => 4
Sync: Stream 4 => 0-3
[NVVP screenshot: Compute & Checkpoint Stream - Stream 17 not in sync with Stream 16]
restructuring to provide adequate separation.
– This lets you plug in a replacement more easily or support multiple configurations.
– Can also allow asynchronous CPU processing during GPU sections.
Take advantage of it.
– Spending lots of time optimizing a particular section might be better spent elsewhere. Make sure you have the basics covered first.
computations, in addition to Image and Video specific algorithms (convolutions). OpenCV is another good source for GPU-accelerated algorithms.
– Note - As of CUDA 7.0, I didn't see any real performance gain from performing NPP … component RGBA format. YMMV, so test!
cuSPARSE provides sparse matrix manipulation.
transient allocations! More on that later.
In some cases allows lower level access to implementation.
automatically allocate and free. Makes for a great STL-like API, but at a price.
Thrust is calling cudaMalloc and cudaFree every time, which serializes the streams!
Custom Allocator calls cudaMalloc once the first time, then reuses on subsequent calls.
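A sketch of that pattern, modeled on Thrust's custom_temporary_allocation example. The allocator name and details are my assumption, not the talk's actual code (its per-stream allocator is in the linked GitHub repo):

```cuda
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>
#include <cstddef>

// Caches the first cudaMalloc and hands the same block back on later
// calls, so Thrust's temporary allocations stop serializing streams.
struct cached_allocator
{
    typedef char value_type;
    char*  block = nullptr;
    size_t capacity = 0;

    char* allocate(std::ptrdiff_t numBytes)
    {
        if (static_cast<size_t>(numBytes) > capacity) {
            if (block) cudaFree(block);
            cudaMalloc(&block, numBytes);  // only on first use or growth
            capacity = numBytes;
        }
        return block;
    }
    void deallocate(char*, size_t) {}      // keep the block for reuse

    ~cached_allocator() { if (block) cudaFree(block); }
};

int main()
{
    cached_allocator alloc;
    thrust::device_vector<int> data(1 << 20, 42);

    // Route Thrust's scratch allocations through the cache, on our stream.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    thrust::sort(thrust::cuda::par(alloc).on(stream), data.begin(), data.end());
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```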
[Pipeline diagram: Registration → Normalization (x4) → Matrix Store → Matrix Solve → Scale → Call, Score → Metrics]
[Pipeline diagram: Registration (x4) → Normalization (x4) → Matrix Store → Matrix Solve → Scale → Call, Score → Metrics; highlighted stages = Good for GPU]
– Image Registration involves classical image processing techniques (convolution, background subtraction).
– Regression stage requires lots of matrix computations.
– Other components deal with data at an individual pixel level, which is inherently parallel.
– Metrics computations
– Quantiles, mean, binning
– Example emission spectra of commercial dyes typically used.
– Ideally filters would limit undesirable wavelengths, but due to overlap, other signals can't be completely removed.
Bleed
– When samples are physically close to each other, their light emission is captured in neighboring pixels.
Image courtesy of Thermo Fisher DyLight Fluorophores
means that there are more “equations” than there are unknowns.
the equation for the line that best describes the measured data.
–Vertical distance between point and the line is the error or noise.
Image courtesy of Wikipedia - Linear Regression
y1 = a1·x1 + b1
y2 = a2·x2 + b2
...
yn = an·xn + bn
color channels and neighbor information.
corresponds to a particular feature (coefficient).
General Equation: b = Ax, solved via the normal equations x = (AᵀA)⁻¹Aᵀb
foreach pixel in image {
    // Pack column 1
    if pixel == "edge pixel" || criteria == "not good":
        matrix[row][column1] = 0  // Remove contribution
    else:
        matrix[row][column1] = column1_data[pixel]
    // Pack column 2
    if ...
    // Pack column N
    if ...
}
BAD case for porting to GPU
the data on the GPU.
be better than copying it back to CPU to process.
due to physical characteristics). This made copying data into matrix form more difficult.
know what was desired.
[Diagram: Mask + Kernel = Result - a mask of 1s applied to elements 1-15 selects elements 1, 3, 4, 6, 9, 11, 12, 14.]
properties of the image. Therefore they could be computed up front.
–Broke down into groups of operations:
1. Contiguous memory copies
2. Individual element copies
3. Contiguous data set to 0
4. Individual data set to 0
GOOD - No conditionals, just operations
many 1D memory copies, used a single 2D memory copy.
copies to ~50 2D copies.
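cudaMemcpy2D can express many equally-strided 1D copies as a single call. A hedged sketch (sizes and names are my illustration, not the pipeline's actual layout):

```cuda
#include <cuda_runtime.h>

int main()
{
    // 1000 rows of 256 floats each, spaced srcStride floats apart in
    // host memory - previously 1000 separate 1D cudaMemcpy calls.
    const size_t rows = 1000, cols = 256, srcStride = 1024;
    float* hostData = new float[rows * srcStride]();

    float* devData = nullptr;
    size_t devPitch = 0;
    cudaMallocPitch(&devData, &devPitch, cols * sizeof(float), rows);

    // One strided 2D copy replaces the per-row loop of 1D copies.
    cudaMemcpy2D(devData, devPitch,
                 hostData, srcStride * sizeof(float),
                 cols * sizeof(float), rows,
                 cudaMemcpyHostToDevice);

    cudaFree(devData);
    delete[] hostData;
    return 0;
}
```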
[Pipeline diagram: Registration (x4) → Normalization (x4) → Matrix Store → Matrix Solve → Scale → Call, Score → Metrics; highlighted stages = Good for GPU]
processing block and then farmed out thread to handle.
and I/O being done async.
[Timeline diagram: Registration, BG, Mean, and Metrics blocks overlapping across streams over time.]
immediately afterwards (asynchronously). I thought, “Use a callback!”
– The amount of work I was trying to perform greatly exceeded its current design.
– Ended up spinning up a separate CPU thread and implementing a dispatch method to utilize.
Callbacks execute serially!
– Best used to perform small work, or to set condition variable(s) to trigger follow-on work.
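A sketch of that pattern (my illustration): the callback only flips a flag under a mutex, and a separate worker thread does the heavy CPU follow-up.

```cuda
#include <cuda_runtime.h>
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool gpuDone = false;

// Runs on a CUDA driver thread - keep it tiny and never call CUDA here.
void CUDART_CB onComplete(cudaStream_t, cudaError_t, void*)
{
    { std::lock_guard<std::mutex> lock(m); gpuDone = true; }
    cv.notify_one();
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Heavy CPU work happens on our own thread, not in the callback.
    std::thread worker([] {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return gpuDone; });
        // ... expensive host-side processing goes here ...
    });

    // ... enqueue kernels / copies on `stream` here ...
    cudaStreamAddCallback(stream, onComplete, nullptr, 0);

    worker.join();
    cudaStreamDestroy(stream);
    return 0;
}
```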
https://github.com/chuckseberino/GTC16
– GPU wrapper
– Thrust allocator (per stream)
– Examples used in this presentation