

SLIDE 1

GPU-Centric Thinking: Use Case Acceleration of a DNA Sequencer Pipeline

Chuck Seberino Principal Software Engineer

SLIDE 2

Chuck’s Three Guiding Principles

  • Hardware Architecture - Understanding limitations (strengths) of SIMD and compute-heavy ASIC.
  • Memory - Shipping data back and forth between CPU and GPU doesn’t have to be a bad thing.
  • Multi-level Parallelism - Look at optimizing overlapping kernels, not just a single kernel.

SLIDE 3

Hardware

SLIDE 4

Hardware Architecture

  • There are definite areas where you can fall off the fast track.
  • GPU hardware is intentionally simplistic, so that silicon real estate can be geared towards compute!

SLIDE 5

ASIC Comparison

  • Intel Haswell i7-5960X, 8 cores: ~2B transistors
  • NVIDIA GM200 Maxwell, 3072 cores: ~8B transistors

SLIDE 6

ASIC Comparison

  • Intel Haswell i7-5960X, 8 cores: ~2B transistors
  • NVIDIA GM200 Maxwell, 3072 cores: ~8B transistors

SLIDE 7

Understand the Hardware

// Scale integer value and store as floating point.
__global__ void kernel(const int* input, float scale, float* output)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    output[index] = scale * input[index];
}

kernel<<<1000, 1000>>>(input, scale, output);

  • A naive implementation would set grid and block size to launch exactly 1M threads (1000x1000).

SLIDE 8

Understand the Hardware

// Scale integer value and store as floating point.
__global__ void kernel(const int* input, int length, float scale, float* output)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    if (index >= length) return;
    output[index] = scale * input[index];
}

kernel<<<977, 1024>>>(input, 1000000, scale, output);

  • Pass in the input length
  • Perform a range check
  • Adjust block size and grid size to align with warp size

SLIDE 9

Understand the Hardware

// Scale integer value and store as floating point.
__global__ void kernel(const int* input, int length, float scale, float* output)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    if (index >= length) return;
    output[index] = scale * input[index];
}

kernel<<<977, 1024>>>(input, 1000000, scale, output);

  • Making block size a multiple of warp size (32) makes code more efficient.
  • The 1000x1000 launch sparsely populates its 1000 thread blocks: each block ends with a warp holding only 8 active threads, “wasting” 24000 threads overall. (31 full warps + 1 warp at 25%) x 1000.
  • The 977x1024 launch issues 1,000,448 total threads, with 448 returning early: 976 blocks of 32 full warps, plus one block of 18 full warps and 14 early-return warps. Could have also used 15625x64.
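The launch-configuration arithmetic above can be captured in a small host-side helper (gridSize is a hypothetical name, not from the talk): round the grid up so a warp-aligned block size covers every element, and let the kernel's range check absorb the overshoot.

```cpp
#include <cassert>

// Round the grid size up so that blockSize (ideally a multiple of the
// 32-thread warp size) covers every element; the kernel's range check
// handles the overshoot threads, which return early.
int gridSize(int length, int blockSize) {
    return (length + blockSize - 1) / blockSize;
}
```

For 1M elements this reproduces the launches on the slide: gridSize(1000000, 1024) is 977 and gridSize(1000000, 64) is 15625.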

SLIDE 10

Details of Comparison

SLIDE 11

Memory

SLIDE 12

Unified Memory Model

  • Unified Memory - Right now might be a good time to start taking advantage of it, especially if porting new code.
    – Alleviates the user from managing both host and device memory and handling data transfers.
    – When NVLink comes on the scene, UM can make immediate use of it.
  • CON: Doesn’t support explicit data movement.

SLIDE 13

Memory Copy Ambiguity

  • From the CUDA C Best Practices Guide, Chapter 9.1:
    – “Also, because of the overhead associated with each transfer, batching many small transfers into one larger transfer performs significantly better than making each transfer separately, even if doing so requires packing non-contiguous regions of memory into a contiguous buffer and then unpacking after the transfer.”
    – “In contrast with cudaMemcpy(), the asynchronous transfer version requires pinned host memory ...”
  • What is considered a small transfer?
  • What happens if I try to use cudaMemcpyAsync() with pageable memory?
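The batching advice can be sketched host-side: gather scattered regions into one contiguous staging buffer so a single large transfer replaces many small ones (packRegions is a hypothetical helper; the eventual cudaMemcpy of the staging buffer is omitted here).

```cpp
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Pack non-contiguous host regions into one contiguous staging buffer.
// The buffer can then be moved with a single large cudaMemcpy instead of
// paying the fixed per-transfer overhead once per region.
std::vector<char> packRegions(
    const std::vector<std::pair<const void*, std::size_t>>& regions) {
    std::size_t total = 0;
    for (const auto& r : regions) total += r.second;
    std::vector<char> staging(total);
    std::size_t offset = 0;
    for (const auto& r : regions) {
        std::memcpy(staging.data() + offset, r.first, r.second);
        offset += r.second;
    }
    return staging;
}
```

The receiving side unpacks with the same offsets, mirroring the Best Practices Guide's pack/transfer/unpack pattern.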

SLIDE 14

Small Transfers For a Fixed Price

[Charts: “Host to Device Transfer Time for Small Sizes”, “Host to Device Transfer Time”, and “Host to Device Transfer Speed” - transfer time (us) and throughput (MB/s) vs. transfer size (KB), measured on a MBP (2013) GeForce GT 650M (PCIe 2.0), a GeForce Titan X (PCIe 3.0), and a Quadro M6000]

Cost in time for < 200KB transfers is fixed

SLIDE 15

cudaMemcpyAsync() with pageable memory

  • Calling cudaMemcpyAsync() with pageable memory works, but ...
    – The copy operation gets serialized on the GPU along with kernel launches - no copy engine overlap with kernels
    – The host doesn’t block on the call though (it silently pins)
    – Can examine in Visual Profiler

SLIDE 16

cudaMemcpyAsync Pinned

SLIDE 17

... vs. cudaMemcpyAsync Paged

Not Pinned!

SLIDE 18

Multi-Level Parallelism

SLIDE 19

Multi-Level Parallelism

  • In the case where there is more than a single task to be completed, it may be advantageous to parallelize work in multiple ways.
  • Use several sets of CPU threads to dispatch identical sets of work to different areas of memory.
  • Use CUDA streams to split out work into independent chunks.

SLIDE 20

Keeping the GPU Busy

  • Increase efficiency by running non-optimal kernels at the same time.
    – Stalls in one kernel allow other kernels to become active, keeping the GPU busy.
    – NOTE: Does not work for branching kernels!

SLIDE 21

Multiple Streams

  • Streams are a great way to parallelize, but you need to be careful to synchronize them properly.
  • When using more than one stream, never use the default stream.
    – Makes it easier to debug default-stream problems
    – Able to verify correct behavior in NVVP - thrust, accidental non-stream commands, etc.

SLIDE 22

Example Stream Sync Code

for (int n = 0; n < numIterations; ++n) {
    // Perform initial work as separate streams
    for (int ii = 0; ii < numStreams; ++ii) {
        // Wait for previous loop main stream
        gpuPtr->streamWait(ii, EventStream);
        // "Compute"
        gpuPtr->sleep(10 + ii*50, ii);
        // Create event record for stream ii
        gpuPtr->timerStop(ii);
        // Break synchronization on last stream
        if (syncStreams || ii != 3) {
            // Tell main stream to wait for stream ii stop record.
            gpuPtr->streamWait(EventStream, ii);
        }
    }
    // Main stream "Compute"
    gpuPtr->sleep(100, EventStream);
    // Synchronization point for other streams
    gpuPtr->timerStop(EventStream);
    // Perform additional work as individual streams
    for (int ii = 0; ii < numStreams; ++ii) {
        // Wait for main stream to be complete.
        gpuPtr->streamWait(ii, EventStream);
        // "Compute"
        gpuPtr->sleep(30 + 10*ii, ii);
        // Create event record for stream ii
        gpuPtr->timerStop(ii);
        // Tell main stream to wait for stream ii stop record.
        gpuPtr->streamWait(EventStream, ii);
    }
    // Again, consolidate and run on a single stream
    gpuPtr->sleep(100, EventStream);
    // Synchronization point for other streams
    gpuPtr->timerStop(EventStream);
}

Sync: Streams 0-3 => 4; Sync: Stream 4 => 0-3

SLIDE 23

Example Stream Sync Code

(Same stream synchronization code as the previous slide.)


Compute & Checkpoint Stream

SLIDE 24

Stream Synchronization

Stream 17 not in sync with Stream 16

SLIDE 25

Porting Recommendations

SLIDE 26

Pick a Good Wave

  • Select a well-defined region. This might mean reorganizing or restructuring to provide adequate separation.
    – This lets you plug in a replacement more easily or support multiple configurations.
    – Can also allow asynchronous CPU processing during GPU sections.
  • GPU hardware and software is advancing at a phenomenal rate. Take advantage of it.
    – Lots of time spent optimizing a particular section might be better spent elsewhere. Make sure you have the basics covered first.

SLIDE 27

Don’t Reinvent The Wheel!

  • Take advantage of available 3rd-party libraries to perform various computations. Might need to massage data into a particular format though.
  • NPP is excellent for format-specific data types (RGB) and subset (masked, ROI) computations, in addition to image- and video-specific algorithms (convolutions). OpenCV is another good source for GPU-accelerated algorithms.
    – Note - As of CUDA 7.0, I didn’t see any real performance gain from performing NPP operations on 4 separate 16-bit values vs. packing them into an interleaved, 4-component RGBA format. YMMV, so test!
  • cuBLAS is used for linear algebra routines (matrix-vector or matrix-matrix). cuSPARSE provides sparse matrix manipulation.
  • Thrust provides general-purpose algorithms that support CUDA. Be careful of transient allocations! More on that later.
  • CUB (CUDA Unbound) is similar to Thrust, but it provides block-level parallelism. In some cases it allows lower-level access to the implementation.

SLIDE 28

Thrust is Great, But Use a Custom Allocator!

  • By default, if a thrust method requires temporary memory, it will automatically allocate and free it. Makes for a great STL-like API, but at a price.
  • By using a custom allocator, you can control creation and deletion.
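A minimal sketch of the caching idea, with plain malloc/free standing in for cudaMalloc/cudaFree (a real Thrust allocator would additionally expose the interface Thrust expects; CachingAllocator is a hypothetical name):

```cpp
#include <cstdlib>
#include <map>

// Cache freed blocks by size and hand them back on the next request, so the
// expensive underlying allocation (cudaMalloc in the real version) happens
// only the first time a given size is seen.
class CachingAllocator {
public:
    void* allocate(std::size_t bytes) {
        auto it = free_.find(bytes);
        if (it != free_.end()) {
            void* p = it->second;
            free_.erase(it);
            return p;            // Reuse a cached block: no new allocation.
        }
        return std::malloc(bytes);
    }
    void deallocate(void* p, std::size_t bytes) {
        free_.emplace(bytes, p); // Cache instead of freeing immediately.
    }
    ~CachingAllocator() {
        for (auto& kv : free_) std::free(kv.second);
    }
private:
    std::multimap<std::size_t, void*> free_;
};
```

Because repeat requests are served from the cache, the cudaMalloc/cudaFree calls that would otherwise serialize the streams disappear after the first iteration.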

SLIDE 29

Be Careful of Thrust Allocations!

  • Default Allocator: Thrust is calling cudaMalloc and cudaFree every time, which serializes the streams!

SLIDE 30

Be Careful of Thrust Allocations!

  • Custom Allocator: calls cudaMalloc once the first time, then reuses the allocation on subsequent calls.

SLIDE 31

DNA Sequencer Pipeline

SLIDE 32

Pipeline Overview

[Pipeline diagram] Registration -> Normalization (x4) -> Matrix Store -> Matrix Solve -> Scale -> Call, Score -> Metrics

SLIDE 33

Pipeline Overview

[Pipeline diagram] Registration (x4) -> Normalization (x4) -> Matrix Store -> Matrix Solve -> Scale -> Call, Score -> Metrics

= Good for GPU

SLIDE 34

Initial Impression and Findings

  • Lots of places to leverage GPU acceleration!
    – Image Registration involves classical image processing techniques (convolution, background subtraction).
    – The Regression stage requires lots of matrix computations.
    – Other components deal with data at an individual pixel level, which is inherently parallel.
  • Some areas are not so GPU friendly
    – Metrics computations
    – Quantiles, mean, binning

SLIDE 35

Sources of Optical Noise

  • Spectral Crosstalk
    – Example emission spectra of commercial dyes typically used.
    – Ideally filters would limit undesirable wavelengths, but due to overlap, other signals can’t be completely removed.
  • Spatial (Neighbor) Bleed
    – When samples are physically close to each other, their light emission is captured in neighboring pixels.

Image courtesy of Thermo Fisher DyLight Fluorophores

SLIDE 36

Linear Regression Fundamentals

  • Linear regression on over-determined data means that there are more “equations” than there are unknowns.
  • In this example, we are trying to find the equation for the line that best describes the measured data.
    – The vertical distance between a point and the line is the error, or noise.

Image courtesy of Wikipedia - Linear Regression

y1 = a*x1 + b
y2 = a*x2 + b
...
yn = a*xn + b

SLIDE 37

Linear Regression for Sequencing

  • Build a linear combination of variables which are related to camera color channels and neighbor information:

    b = a1*x1 + a2*x2 + a3*x3 + a4*x4 + ... + an*xn

  • Use input image data and store it in matrix form, where each column corresponds to a particular feature (coefficient).

General Equation: b = Ax, solved in the least-squares sense by x = (A^T A)^-1 A^T b

SLIDE 38

Linear Regression Input Generation - Problem

  • (pseudo-)CPU code looked something like this:

foreach pixel in image {
    // Pack column 1
    if pixel == “edge pixel” || criteria == “not good”:
        matrix[row][column1] = 0                       // Remove contribution
    else:
        matrix[row][column1] = column1_data[pixel]
    // Pack column 2
    if ...
    // Pack column N
    if ...
}

  • Had lots of conditionals
  • Conversion to matrix form needed to happen several times for each set of images

BAD case for porting to GPU

SLIDE 39

Sometimes Slow Is OK

  • I actually implemented that conditional logic in a kernel!
  • Even though the performance was abysmal, it did allow me to keep the data on the GPU.
  • As a first pass, even if you can’t make efficient GPU code, it can still be better than copying the data back to the CPU to process.

SLIDE 40

Pack Data To Achieve Contiguous Access

  • The source data wasn’t completely contiguous in memory (gaps due to physical characteristics). This made copying data into matrix form more difficult.
  • Pre-processed to pack the data into a more usable form. Had a mask to know what was desired.
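This pre-processing step is classic stream compaction: an exclusive scan over the mask gives each surviving element its packed destination index, which is exactly what a GPU pack kernel computes per thread. A serial host-side sketch (packByMask is a hypothetical helper):

```cpp
#include <vector>

// Stream-compaction sketch. The exclusive scan of the mask yields each
// surviving element's output slot; on the GPU every thread would read its
// own slot from the scan result and write independently.
std::vector<int> packByMask(const std::vector<int>& data,
                            const std::vector<int>& mask) {
    // Assumes mask.size() == data.size(), mask entries are 0 or 1.
    std::vector<int> dest(mask.size());
    int running = 0;
    for (std::size_t i = 0; i < mask.size(); ++i) {
        dest[i] = running;      // Exclusive scan: slot before adding mask[i].
        running += mask[i];
    }
    std::vector<int> out(running);
    for (std::size_t i = 0; i < data.size(); ++i)
        if (mask[i]) out[dest[i]] = data[i];
    return out;
}
```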

SLIDE 41

Pack Data To Achieve Contiguous Access

Mask + Kernel => Result
[Diagram: indices 1-15 masked down to the packed sequence 1, 3, 4, 6, 9, 11, 12, 14]

  • The source data wasn’t completely contiguous in memory (gaps due to physical characteristics). This made copying data into matrix form more difficult.
  • Pre-processed to pack the data into a more usable form. Had a mask to know what was desired.

SLIDE 42

Pack Data To Achieve Contiguous Access

Mask + Kernel => Result
[Diagram: indices 1-15 masked down to the packed sequence 1, 3, 4, 6, 9, 11, 12, 14]

  • This improvement gave me an idea on how to clean up the previous mess...
SLIDE 43

Linear Regression Input Generation - Solution

  • The conditional criteria for data selection were based upon static properties of the image. Therefore they could be computed up front.
    – Broke the work down into groups of operations:
      1. Contiguous memory copies
      2. Individual element copies
      3. Contiguous data set to 0
      4. Individual data set to 0
  • By having a list of operations and their arguments (src, dest, size, offset, etc.), they can now be issued efficiently in CUDA.

GOOD - No conditionals, just operations
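The operation-list idea can be sketched as a table of fully resolved copy/zero operations executed without per-element branching (Op and runOps are hypothetical names; the real version would issue cudaMemcpyAsync and cudaMemsetAsync per entry):

```cpp
#include <cstring>
#include <vector>

// A precomputed operation: either zero-fill `count` elements at `dst`, or
// copy `count` elements from `src` to `dst`. All arguments were resolved
// up front from the static image properties, so execution is branch-light.
struct Op {
    bool zero;               // true => memset-to-0, false => copy
    std::size_t src, dst, count;
};

void runOps(const std::vector<Op>& ops, const float* src, float* dst) {
    for (const Op& op : ops) {
        if (op.zero)
            std::memset(dst + op.dst, 0, op.count * sizeof(float));
        else
            std::memcpy(dst + op.dst, src + op.src, op.count * sizeof(float));
    }
}
```

The one remaining branch is per operation, not per pixel, which is what makes the list cheap to replay for every new set of images.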

SLIDE 44

Take Advantage of 2D Memory

  • When performing copies on regular, non-contiguous data, instead of many 1D memory copies, use a single 2D memory copy.
  • Saw a huge decrease in time spent when going from ~3500 smaller 1D copies to ~50 2D copies.

SLIDE 45

Pipeline Overview

[Pipeline diagram] Registration (x4) -> Normalization (x4) -> Matrix Store -> Matrix Solve -> Scale -> Call, Score -> Metrics

= Good for GPU

SLIDE 46

Consolidation of CPU Processing

  • Code was reorganized to move CPU-based (serial) work to the end of the processing block, and a thread was then farmed out to handle it.
  • Try to keep the GPU busy starting a new task, while the rest of the metrics and I/O are done asynchronously.

[Timeline diagram: Registration, BG, Mean, and Metrics stages from successive tasks overlapping across threads over time]

SLIDE 47

cudaStreamAddCallback Notes

  • After finishing GPU processing, I wanted to run CPU work immediately afterwards (asynchronously). I thought, “Use a callback!”
    – The amount of work I was trying to perform greatly exceeded its current design.
    – Ended up spinning up a separate CPU thread and implementing a dispatch method to utilize it.
  • Note: As of CUDA 7.5, ALL callbacks share the same thread and are run serially!
    – Best used to perform small work, or to set condition variable(s) to trigger follow-on work.

SLIDE 48

Putting It All Together

  • Single GPU (Titan X) with 7 CPU threads, each running identical tasks
  • Inefficiencies in a single kernel (due to memory access) can be overcome with coarse-grain parallelism.
  • Production system used 2 K80 cards - 4 total GPUs with 28 CPU threads

SLIDE 49

Thank You

  • Source code is available: https://github.com/chuckseberino/GTC16
    – GPU wrapper
    – Thrust allocator (per stream)
    – Examples used in this presentation
  • Please fill out the evaluation!
