

SLIDE 1

GPU-Centric Thinking: Use Case Acceleration of a DNA Sequencer Pipeline

Chuck Seberino Principal Software Engineer

SLIDE 2

Chuck’s Three Guiding Principles

  • Hardware Architecture - Understanding limitations (strengths) of SIMD and compute-heavy ASIC.
  • Memory - Shipping data back and forth between CPU and GPU doesn’t have to be a bad thing.
  • Multi-level Parallelism - Look at optimizing overlapping kernels, not just a single kernel.

SLIDE 3

Hardware

SLIDE 4

Hardware Architecture

  • There are definite areas where you can fall off the fast track.
  • GPU hardware is intentionally simplistic, so that silicon real estate can be geared towards compute!

SLIDE 5

ASIC Comparison

  • Intel Haswell i7-5960X, 8 cores: ~2B transistors
  • NVIDIA GM200 Maxwell, 3072 cores: ~8B transistors

SLIDE 6

ASIC Comparison

  • Intel Haswell i7-5960X, 8 cores: ~2B transistors
  • NVIDIA GM200 Maxwell, 3072 cores: ~8B transistors

SLIDE 7

Understand the Hardware

// Scale integer value and store as floating point.
__global__ void kernel(const int* input, float scale, float* output)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    output[index] = scale * input[index];
}

kernel<<<1000, 1000>>>(input, scale, output);

  • A naive implementation would set grid and block size to launch exactly 1M threads (1000x1000).

SLIDE 8

Understand the Hardware

// Scale integer value and store as floating point.
__global__ void kernel(const int* input, int length, float scale, float* output)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    if (index >= length) return;
    output[index] = scale * input[index];
}

kernel<<<977, 1024>>>(input, 1000000, scale, output);

  • Pass in the input length
  • Perform a range check
  • Adjust block size and grid size to align with warp size

SLIDE 9

Understand the Hardware

// Scale integer value and store as floating point.
__global__ void kernel(const int* input, int length, float scale, float* output)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    if (index >= length) return;
    output[index] = scale * input[index];
}

kernel<<<977, 1024>>>(input, 1000000, scale, output);

  • Making block size a multiple of warp size (32) makes code more efficient.
  • The 1000x1000 launch sparsely populates its 1000 thread blocks: each block ends with a warp holding only 8 active threads, “wasting” 24000 threads overall. (31 full warps + 1 warp at 25%) x 1000.
  • The 977x1024 launch issues 1,000,448 total threads, with 448 returning early: 976 blocks of 32 full warps, plus one block of 18 full warps and 14 early-return warps. Could have also used 15625x64.
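The launch-configuration arithmetic above can be captured in a small host-side helper (gridSize is a hypothetical name, not from the talk): round the grid up so a warp-aligned block size covers every element, and let the kernel's range check absorb the overshoot.

```cpp
#include <cassert>

// Round the grid size up so that blockSize (ideally a multiple of the
// 32-thread warp size) covers every element; the kernel's range check
// handles the overshoot threads, which return early.
int gridSize(int length, int blockSize) {
    return (length + blockSize - 1) / blockSize;
}
```

For 1M elements this reproduces the launches on the slide: gridSize(1000000, 1024) is 977 and gridSize(1000000, 64) is 15625.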

SLIDE 10

Details of Comparison

SLIDE 11

Memory

SLIDE 12

Unified Memory Model

  • Unified Memory - Right now might be a good time to start taking advantage of it, especially if porting new code.
    – Alleviates the user from managing both host and device memory and handling data transfers.
    – When NVLink comes on the scene, UM can make immediate use of it.
  • CON: Doesn’t support explicit data movement.

SLIDE 13

Memory Copy Ambiguity

  • From the CUDA C Best Practices Guide, Chapter 9.1:
    – “Also, because of the overhead associated with each transfer, batching many small transfers into one larger transfer performs significantly better than making each transfer separately, even if doing so requires packing non-contiguous regions of memory into a contiguous buffer and then unpacking after the transfer.”
    – “In contrast with cudaMemcpy(), the asynchronous transfer version requires pinned host memory ...”
  • What is considered a small transfer?
  • What happens if I try to use cudaMemcpyAsync() with pageable memory?
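The batching advice can be sketched host-side: gather scattered regions into one contiguous staging buffer so a single large transfer replaces many small ones (packRegions is a hypothetical helper; the eventual cudaMemcpy of the staging buffer is omitted here).

```cpp
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Pack non-contiguous host regions into one contiguous staging buffer.
// The buffer can then be moved with a single large cudaMemcpy instead of
// paying the fixed per-transfer overhead once per region.
std::vector<char> packRegions(
    const std::vector<std::pair<const void*, std::size_t>>& regions) {
    std::size_t total = 0;
    for (const auto& r : regions) total += r.second;
    std::vector<char> staging(total);
    std::size_t offset = 0;
    for (const auto& r : regions) {
        std::memcpy(staging.data() + offset, r.first, r.second);
        offset += r.second;
    }
    return staging;
}
```

The receiving side unpacks with the same offsets, mirroring the Best Practices Guide's pack/transfer/unpack pattern.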

SLIDE 14

Small Transfers For a Fixed Price

[Charts: “Host to Device Transfer Time for Small Sizes”, “Host to Device Transfer Time”, and “Host to Device Transfer Speed” - transfer time (us) and throughput (MB/s) vs. transfer size (KB), measured on a MBP (2013) GeForce GT 650M (PCIe 2.0), a GeForce Titan X (PCIe 3.0), and a Quadro M6000]

Cost in time for < 200KB transfers is fixed

SLIDE 15

cudaMemcpyAsync() with pageable memory

  • Calling cudaMemcpyAsync() with pageable memory works, but ...
    – The copy operation gets serialized on the GPU along with kernel launches - no copy engine overlap with kernels
    – The host doesn’t block on the call though (it silently pins)
    – Can examine in Visual Profiler

SLIDE 16

cudaMemcpyAsync Pinned

SLIDE 17

... vs. cudaMemcpyAsync Paged

Not Pinned!

SLIDE 18

Multi-Level Parallelism

SLIDE 19

Multi-Level Parallelism

  • In the case where there is more than a single task to be completed, it may be advantageous to parallelize work in multiple ways.
  • Use several sets of CPU threads to dispatch identical sets of work to different areas of memory.
  • Use CUDA streams to split out work into independent chunks.

SLIDE 20

Keeping the GPU Busy

  • Increase efficiency by running non-optimal kernels at the same time.
    – Stalls in one kernel allow other kernels to become active, keeping the GPU busy.
    – NOTE: Does not work for branching kernels!

SLIDE 21

Multiple Streams

  • Streams are a great way to parallelize, but you need to be careful to synchronize them properly.
  • When using more than one stream, never use the default stream.
    – Makes it easier to debug default-stream problems
    – Able to verify correct behavior in NVVP - thrust, accidental non-stream commands, etc.

SLIDE 22

Example Stream Sync Code

for (int n = 0; n < numIterations; ++n) {
    // Perform initial work as separate streams
    for (int ii = 0; ii < numStreams; ++ii) {
        // Wait for previous loop main stream
        gpuPtr->streamWait(ii, EventStream);
        // "Compute"
        gpuPtr->sleep(10 + ii*50, ii);
        // Create event record for stream ii
        gpuPtr->timerStop(ii);
        // Break synchronization on last stream
        if (syncStreams || ii != 3) {
            // Tell main stream to wait for stream ii stop record.
            gpuPtr->streamWait(EventStream, ii);
        }
    }
    // Main stream "Compute"
    gpuPtr->sleep(100, EventStream);
    // Synchronization point for other streams
    gpuPtr->timerStop(EventStream);
    // Perform additional work as individual streams
    for (int ii = 0; ii < numStreams; ++ii) {
        // Wait for main stream to be complete.
        gpuPtr->streamWait(ii, EventStream);
        // "Compute"
        gpuPtr->sleep(30 + 10*ii, ii);
        // Create event record for stream ii
        gpuPtr->timerStop(ii);
        // Tell main stream to wait for stream ii stop record.
        gpuPtr->streamWait(EventStream, ii);
    }
    // Again, consolidate and run on a single stream
    gpuPtr->sleep(100, EventStream);
    // Synchronization point for other streams
    gpuPtr->timerStop(EventStream);
}

Sync: Streams 0-3 => 4; Sync: Stream 4 => 0-3

SLIDE 23

Example Stream Sync Code

(Same stream synchronization code as the previous slide.)


Compute & Checkpoint Stream

SLIDE 24

Stream Synchronization

Stream 17 not in sync with Stream 16

SLIDE 25

Porting Recommendations

SLIDE 26

Pick a Good Wave

  • Select a well-defined region. This might mean reorganizing or restructuring to provide adequate separation.
    – This lets you plug in a replacement more easily or support multiple configurations.
    – Can also allow asynchronous CPU processing during GPU sections.
  • GPU hardware and software is advancing at a phenomenal rate. Take advantage of it.
    – Lots of time spent optimizing a particular section might be better spent elsewhere. Make sure you have the basics covered first.

SLIDE 27

Don’t Reinvent The Wheel!

  • Take advantage of available 3rd-party libraries to perform various computations. Might need to massage data into a particular format though.
  • NPP is excellent for format-specific data types (RGB) and subset (masked, ROI) computations, in addition to image- and video-specific algorithms (convolutions). OpenCV is another good source for GPU-accelerated algorithms.
    – Note - As of CUDA 7.0, I didn’t see any real performance gain from performing NPP operations on 4 separate 16-bit values vs. packing them into an interleaved, 4-component RGBA format. YMMV, so test!
  • cuBLAS is used for linear algebra routines (matrix-vector or matrix-matrix). cuSPARSE provides sparse matrix manipulation.
  • Thrust provides general-purpose algorithms that support CUDA. Be careful of transient allocations! More on that later.
  • CUB (CUDA Unbound) is similar to Thrust, but it provides block-level parallelism. In some cases it allows lower-level access to the implementation.

SLIDE 28

Thrust is Great, But Use a Custom Allocator!

  • By default, if a thrust method requires temporary memory, it will automatically allocate and free it. Makes for a great STL-like API, but at a price.
  • By using a custom allocator, you can control creation and deletion.
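A minimal sketch of the caching idea, with plain malloc/free standing in for cudaMalloc/cudaFree (a real Thrust allocator would additionally expose the interface Thrust expects; CachingAllocator is a hypothetical name):

```cpp
#include <cstdlib>
#include <map>

// Cache freed blocks by size and hand them back on the next request, so the
// expensive underlying allocation (cudaMalloc in the real version) happens
// only the first time a given size is seen.
class CachingAllocator {
public:
    void* allocate(std::size_t bytes) {
        auto it = free_.find(bytes);
        if (it != free_.end()) {
            void* p = it->second;
            free_.erase(it);
            return p;            // Reuse a cached block: no new allocation.
        }
        return std::malloc(bytes);
    }
    void deallocate(void* p, std::size_t bytes) {
        free_.emplace(bytes, p); // Cache instead of freeing immediately.
    }
    ~CachingAllocator() {
        for (auto& kv : free_) std::free(kv.second);
    }
private:
    std::multimap<std::size_t, void*> free_;
};
```

Because repeat requests are served from the cache, the cudaMalloc/cudaFree calls that would otherwise serialize the streams disappear after the first iteration.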

SLIDE 29

Be Careful of Thrust Allocations!

  • Default Allocator: Thrust is calling cudaMalloc and cudaFree every time, which serializes the streams!

SLIDE 30

Be Careful of Thrust Allocations!

  • Custom Allocator: calls cudaMalloc once the first time, then reuses the allocation on subsequent calls.

SLIDE 31

DNA Sequencer Pipeline

SLIDE 32

Pipeline Overview

[Pipeline diagram] Registration -> Normalization (x4) -> Matrix Store -> Matrix Solve -> Scale -> Call, Score -> Metrics

SLIDE 33

Pipeline Overview

[Pipeline diagram] Registration (x4) -> Normalization (x4) -> Matrix Store -> Matrix Solve -> Scale -> Call, Score -> Metrics

= Good for GPU

SLIDE 34

Initial Impression and Findings

  • Lots of places to leverage GPU acceleration!
    – Image Registration involves classical image processing techniques (convolution, background subtraction).
    – The Regression stage requires lots of matrix computations.
    – Other components deal with data at an individual pixel level, which is inherently parallel.
  • Some areas are not so GPU friendly
    – Metrics computations
    – Quantiles, mean, binning

SLIDE 35

Sources of Optical Noise

  • Spectral Crosstalk
    – Example emission spectra of commercial dyes typically used.
    – Ideally filters would limit undesirable wavelengths, but due to overlap, other signals can’t be completely removed.
  • Spatial (Neighbor) Bleed
    – When samples are physically close to each other, their light emission is captured in neighboring pixels.

Image courtesy of Thermo Fisher DyLight Fluorophores

SLIDE 36

Linear Regression Fundamentals

  • Linear regression on over-determined data means that there are more “equations” than there are unknowns.
  • In this example, we are trying to find the equation for the line that best describes the measured data.
    – The vertical distance between a point and the line is the error, or noise.

Image courtesy of Wikipedia - Linear Regression

y1 = a*x1 + b
y2 = a*x2 + b
...
yn = a*xn + b

SLIDE 37

Linear Regression for Sequencing

  • Build a linear combination of variables which are related to camera color channels and neighbor information:

    b = a1*x1 + a2*x2 + a3*x3 + a4*x4 + ... + an*xn

  • Use input image data and store it in matrix form, where each column corresponds to a particular feature (coefficient).

General Equation: b = Ax, solved in the least-squares sense by x = (A^T A)^-1 A^T b

SLIDE 38

Linear Regression Input Generation - Problem

  • (pseudo-)CPU code looked something like this:

foreach pixel in image {
    // Pack column 1
    if pixel == “edge pixel” || criteria == “not good”:
        matrix[row][column1] = 0                       // Remove contribution
    else:
        matrix[row][column1] = column1_data[pixel]
    // Pack column 2
    if ...
    // Pack column N
    if ...
}

  • Had lots of conditionals
  • Conversion to matrix form needed to happen several times for each set of images

BAD case for porting to GPU

SLIDE 39

Sometimes Slow Is OK

  • I actually implemented that conditional logic in a kernel!
  • Even though the performance was abysmal, it did allow me to keep the data on the GPU.
  • As a first pass, even if you can’t make efficient GPU code, it can still be better than copying the data back to the CPU to process.

SLIDE 40

Pack Data To Achieve Contiguous Access

  • The source data wasn’t completely contiguous in memory (gaps due to physical characteristics). This made copying data into matrix form more difficult.
  • Pre-processed to pack the data into a more usable form. Had a mask to know what was desired.
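This pre-processing step is classic stream compaction: an exclusive scan over the mask gives each surviving element its packed destination index, which is exactly what a GPU pack kernel computes per thread. A serial host-side sketch (packByMask is a hypothetical helper):

```cpp
#include <vector>

// Stream-compaction sketch. The exclusive scan of the mask yields each
// surviving element's output slot; on the GPU every thread would read its
// own slot from the scan result and write independently.
std::vector<int> packByMask(const std::vector<int>& data,
                            const std::vector<int>& mask) {
    // Assumes mask.size() == data.size(), mask entries are 0 or 1.
    std::vector<int> dest(mask.size());
    int running = 0;
    for (std::size_t i = 0; i < mask.size(); ++i) {
        dest[i] = running;      // Exclusive scan: slot before adding mask[i].
        running += mask[i];
    }
    std::vector<int> out(running);
    for (std::size_t i = 0; i < data.size(); ++i)
        if (mask[i]) out[dest[i]] = data[i];
    return out;
}
```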

SLIDE 41

Pack Data To Achieve Contiguous Access

Mask + Kernel => Result
[Diagram: indices 1-15 masked down to the packed sequence 1, 3, 4, 6, 9, 11, 12, 14]

  • The source data wasn’t completely contiguous in memory (gaps due to physical characteristics). This made copying data into matrix form more difficult.
  • Pre-processed to pack the data into a more usable form. Had a mask to know what was desired.

SLIDE 42

Pack Data To Achieve Contiguous Access

Mask + Kernel => Result
[Diagram: indices 1-15 masked down to the packed sequence 1, 3, 4, 6, 9, 11, 12, 14]

  • This improvement gave me an idea on how to clean up the previous mess...
SLIDE 43

Linear Regression Input Generation - Solution

  • The conditional criteria for data selection were based upon static properties of the image. Therefore they could be computed up front.
    – Broke the work down into groups of operations:
      1. Contiguous memory copies
      2. Individual element copies
      3. Contiguous data set to 0
      4. Individual data set to 0
  • By having a list of operations and their arguments (src, dest, size, offset, etc.), they can now be issued efficiently in CUDA.

GOOD - No conditionals, just operations
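The operation-list idea can be sketched as a table of fully resolved copy/zero operations executed without per-element branching (Op and runOps are hypothetical names; the real version would issue cudaMemcpyAsync and cudaMemsetAsync per entry):

```cpp
#include <cstring>
#include <vector>

// A precomputed operation: either zero-fill `count` elements at `dst`, or
// copy `count` elements from `src` to `dst`. All arguments were resolved
// up front from the static image properties, so execution is branch-light.
struct Op {
    bool zero;               // true => memset-to-0, false => copy
    std::size_t src, dst, count;
};

void runOps(const std::vector<Op>& ops, const float* src, float* dst) {
    for (const Op& op : ops) {
        if (op.zero)
            std::memset(dst + op.dst, 0, op.count * sizeof(float));
        else
            std::memcpy(dst + op.dst, src + op.src, op.count * sizeof(float));
    }
}
```

The one remaining branch is per operation, not per pixel, which is what makes the list cheap to replay for every new set of images.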

SLIDE 44

Take Advantage of 2D Memory

  • When performing copies on regular, non-contiguous data, instead of many 1D memory copies, use a single 2D memory copy.
  • Saw a huge decrease in time spent when going from ~3500 smaller 1D copies to ~50 2D copies.

SLIDE 45

Pipeline Overview

[Pipeline diagram] Registration (x4) -> Normalization (x4) -> Matrix Store -> Matrix Solve -> Scale -> Call, Score -> Metrics

= Good for GPU

SLIDE 46

Consolidation of CPU Processing

  • Code was reorganized to move CPU-based (serial) work to the end of the processing block, and a thread was then farmed out to handle it.
  • Try to keep the GPU busy starting a new task, while the rest of the metrics and I/O are done asynchronously.

[Timeline diagram: Registration, BG, Mean, and Metrics stages from successive tasks overlapping across threads over time]

SLIDE 47

cudaStreamAddCallback Notes

  • After finishing GPU processing, I wanted to run CPU work immediately afterwards (asynchronously). I thought, “Use a callback!”
    – The amount of work I was trying to perform greatly exceeded its current design.
    – Ended up spinning up a separate CPU thread and implementing a dispatch method to utilize it.
  • Note: As of CUDA 7.5, ALL callbacks share the same thread and are run serially!
    – Best used to perform small work, or to set condition variable(s) to trigger follow-on work.

SLIDE 48

Putting It All Together

  • Single GPU (Titan X) with 7 CPU threads, each running identical tasks
  • Inefficiencies in a single kernel (due to memory access) can be overcome with coarse-grain parallelism.
  • Production system used 2 K80 cards - 4 total GPUs with 28 CPU threads

SLIDE 49

Thank You

  • Source code is available: https://github.com/chuckseberino/GTC16
    – GPU wrapper
    – Thrust allocator (per stream)
    – Examples used in this presentation
  • Please fill out the evaluation!
