DUANE MERRILL, PH.D.
NVIDIA RESEARCH
CUB:
A pattern of “collective” software design, abstraction, and reuse for kernel-level programming
What is CUB?
1. A design model for collective kernel-level primitives
How to make reusable software components for SIMT groups (warps, blocks, etc.)
2. A library of collective primitives
Block-reduce, block-sort, block-histogram, warp-scan, warp-reduce, etc.
3. A library of global primitives (built from collectives)
Device-reduce, device-sort, device-scan, etc.
Demonstrates collective composition, performance, and performance-portability
Outline
1. Software reuse
2. SIMT collectives: the “missing” CUDA abstraction layer
3. The soul of collective component design
4. Using CUB’s collective primitives
5. Making your own collective primitives
6. Other Very Useful Things in CUB
7. Final thoughts
Abstraction & composability are fundamental design principles
- Reduce redundant programmer effort
  - Save time, energy, money
  - Reduce buggy software
- Encapsulate complexity
  - Empower productivity-oriented programmers
  - Insulate from the changing capabilities of the underlying hardware (NVIDIA has produced nine different CUDA GPU architectures since 2008!)

Software reuse empowers a durable programming model
2. SIMT collectives: the “missing” CUDA abstraction layer
- Parallel decomposition and grain sizing
- Synchronization
- Deadlock, livelock, and data races
- Plurality of state
- Plurality of flow control (divergence, etc.)
- Bookkeeping control structures
- Memory access conflicts, coalescing, etc.
- Occupancy constraints from SMEM, RF, etc.
- Algorithm selection and instruction scheduling
- Special hardware functionality, instructions, etc.
(Diagram: an application invokes a CUDA kernel function stub, which runs as many threadblocks)
PROBLEM: virtually every CUDA kernel written today is cobbled from scratch
A tunability, portability, and maintenance concern
Collective software components reduce development cost, hide complexity, bugs, etc.
(Diagram: the kernel function stub becomes a sequence of collective functions: BlockLoad, BlockSort, BlockStore, each exposing a block-wide collective interface atop per-thread scalar interfaces)
Examples of applications built on the same collectives: parallel sparse graph traversal, parallel radix sort, parallel BWT compression, parallel SpMV.

Block-wide prefix-scan recurs throughout:
- Scan for enqueueing
- Scan for segmented reduction
- Scan for solving recurrences (move-to-front)
- Scan for partitioning
(Figure: block-wide prefix sum across 16 threads contributing 4 items each, comparing two scan constructions)
- Brent-Kung hybrid: work-efficient (~130 binary ops, depth 15)
- Kogge-Stone hybrid: depth-efficient (~170 binary ops, depth 12)
Kernel programming is complicated
(Diagram repeated: hand-assembled kernels versus kernel function stubs composed from collective components such as BlockLoad, BlockSort, and BlockStore. Collective software components reduce development cost, hide complexity, bugs, etc.)
3. The soul of collective component design
CUB primitives are easily nested & sequenced
(Diagram: an application launches a kernel of threadblocks, each running a BlockSort; BlockSort is composed of BlockRadixRank and BlockExchange; BlockRadixRank builds on BlockScan, which in turn builds on WarpScan)
Parallel width and grain size are flexible (the “shape” of the composition remains the same)
(Diagram: the same BlockSort composition of BlockRadixRank, BlockScan, WarpScan, and BlockExchange instantiated at different widths and grain sizes)

Algorithmic-variant selection
(Diagram: the same composition with alternative algorithmic variants selected for individual components)
Performance of CUB global primitives (Tesla C1060 / C2050 / K20C):

Global radix sort (billions of 32b keys / sec):     CUB 0.50 / 1.05 / 1.40   vs.  Thrust v1.7.1 0.51 / 0.71 / 0.66
Global prefix scan (billions of 32b items / sec):   CUB 8 / 14 / 21          vs.  Thrust v1.7.1 4 / 6 / 6
Global histogram (billions of 8b items / sec):      CUB 2.7 / 16.2 / 19.3    vs.  NPP 2 / 2
Global partition-if (billions of 32b inputs / sec): CUB 4.2 / 8.6 / 16.4     vs.  Thrust v1.7.1 1.7 / 2.2 / 2.4
4. Using CUB’s collective primitives
3 parameter fields (specialization, construction, function call) + resource reflection:

__global__ void ExampleKernel(...)
{
    // (1) Specialize cub::BlockScan for 128 threads
    typedef cub::BlockScan<int, 128> BlockScanT;

    // (2) Allocate temporary storage in shared memory
    __shared__ typename BlockScanT::TempStorage scan_storage;

    // Obtain a tile of 512 items blocked across 128 threads
    int items[4];
    ...

    // (3) Construct the primitive and (4) compute a block-wide prefix sum
    BlockScanT(scan_storage).ExclusiveSum(items, items);
    ...
}
// A kernel for computing tiled prefix sums
__global__ void ExampleKernel(int* d_in, int* d_out)
{
    // Specialize for 128 threads owning 4 integers each (primitive types)
    typedef cub::BlockLoad<int*, 128, 4>  BlockLoadT;
    typedef cub::BlockScan<int, 128>      BlockScanT;
    typedef cub::BlockStore<int*, 128, 4> BlockStoreT;

    // Allocate temporary storage in shared memory
    // (a union of the TempStorage structured-layout types)
    __shared__ union
    {
        typename BlockLoadT::TempStorage  load;
        typename BlockScanT::TempStorage  scan;
        typename BlockStoreT::TempStorage store;
    } temp_storage;

    // Cooperatively load a tile of 512 items across 128 threads
    int items[4];
    BlockLoadT(temp_storage.load).Load(d_in, items);

    __syncthreads();    // Barrier for smem reuse

    // Compute a block-wide exclusive prefix sum
    BlockScanT(temp_storage.scan).ExclusiveSum(items, items);

    __syncthreads();    // Barrier for smem reuse

    // Cooperatively store a tile of 512 items across 128 threads
    BlockStoreT(temp_storage.store).Store(d_out, items);
}

int* d_in;  // = ...
int* d_out; // = ...

// Invoke kernel (GF110 Fermi)
ExampleKernel<<<1, 128>>>(d_in, d_out);
The same kernel generalizes by hoisting each choice into a template parameter: the data type, block size, grain size, load and scan algorithmic variants, and cache modifier can all be tuned per target architecture without changing the kernel body.

template <
    int                 BLOCK_THREADS,
    int                 ITEMS_PER_THREAD,
    BlockLoadAlgorithm  LOAD_ALGO,
    CacheLoadModifier   LOAD_MODIFIER,
    BlockScanAlgorithm  SCAN_ALGO,
    typename            T>
__global__ void ExampleKernel(T* d_in, T* d_out)
{
    // Specialize for BLOCK_THREADS threads owning ITEMS_PER_THREAD items each
    typedef cub::BlockLoad<T*, BLOCK_THREADS, ITEMS_PER_THREAD, LOAD_ALGO> BlockLoadT;
    typedef cub::BlockScan<T, BLOCK_THREADS, SCAN_ALGO>                    BlockScanT;
    typedef cub::BlockStore<T*, BLOCK_THREADS, ITEMS_PER_THREAD>           BlockStoreT;

    // Allocate temporary storage in shared memory
    __shared__ union
    {
        typename BlockLoadT::TempStorage  load;
        typename BlockScanT::TempStorage  scan;
        typename BlockStoreT::TempStorage store;
    } temp_storage;

    // Cooperatively load a tile of items (with the requested cache modifier)
    T items[ITEMS_PER_THREAD];
    typedef cub::CacheModifiedInputIterator<LOAD_MODIFIER, T> InputItr;
    BlockLoadT(temp_storage.load).Load(InputItr(d_in), items);

    __syncthreads();    // Barrier for smem reuse

    // Compute a block-wide exclusive prefix sum
    BlockScanT(temp_storage.scan).ExclusiveSum(items, items);

    __syncthreads();    // Barrier for smem reuse

    // Cooperatively store a tile of items
    BlockStoreT(temp_storage.store).Store(d_out, items);
}

int* d_in;  // = ...
int* d_out; // = ...

// Invoke kernel (GF110 Fermi)
ExampleKernel<128, 4, BLOCK_LOAD_WARP_TRANSPOSE, LOAD_DEFAULT, BLOCK_SCAN_RAKING>
    <<<1, 128>>>(d_in, d_out);

// Invoke kernel (GK110 Kepler)
ExampleKernel<128, 21, BLOCK_LOAD_DIRECT, LOAD_LDG, BLOCK_SCAN_WARP_SCANS>
    <<<1, 128>>>(d_in, d_out);
5. Making your own collective primitives
(Figure: inclusive prefix sum: inputs x0..x7 produce outputs x0:x0, x0:x1, ..., x0:x7)

// Simple collective primitive for block-wide prefix sum
template <typename T, int BLOCK_THREADS>
class BlockScan
{
    // Type of shared memory needed by BlockScan
    typedef T TempStorage[BLOCK_THREADS];

    // Per-thread data (reference to shared storage)
    TempStorage &temp_storage;

    // Constructor
    BlockScan(TempStorage &storage) : temp_storage(storage) {}

    // Inclusive prefix sum operation (each thread contributes its own data item)
    T InclusiveSum(T thread_data)
    {
        int tid = threadIdx.x;

        #pragma unroll
        for (int i = 1; i < BLOCK_THREADS; i *= 2)
        {
            temp_storage[tid] = thread_data;
            __syncthreads();
            if (tid - i >= 0)
                thread_data += temp_storage[tid - i];
            __syncthreads();
        }
        return thread_data;
    }
};
(Figure: reduce-by-key over keys {a, a, b, b, c, c, c, c} with values all 1.0: head-flags mark segment starts; scanning zipped (value, flag) pairs yields per-segment sums 2.0 for a, 2.0 for b, and 4.0 for c)

// Reduce-by-segment scan data type
struct ValueOffsetPair
{
    ValueT value;
    int offset;

    // Sum operation: the running value restarts at segment boundaries
    ValueOffsetPair operator+(ValueOffsetPair &other)
    {
        ValueOffsetPair retval;
        retval.offset = offset + other.offset;
        retval.value = (other.offset) ? other.value : value + other.value;
        return retval;
    }
};
// Block-wide reduce-by-key
template <typename KeyT, typename ValueT, int BLOCK_THREADS, int ITEMS_PER_THREAD>
struct BlockReduceByKey
{
    // Parameterized BlockDiscontinuity type for keys
    typedef BlockDiscontinuity<KeyT, BLOCK_THREADS> BlockDiscontinuityT;

    // Parameterized BlockScan type
    typedef BlockScan<ValueOffsetPair, BLOCK_THREADS> BlockScanT;

    // Temporary storage type
    union TempStorage
    {
        typename BlockDiscontinuityT::TempStorage discontinuity;
        typename BlockScanT::TempStorage scan;
    };

    // Reduce segments using the addition operator.
    // Returns the "carry-out" of the last segment
    ValueT Sum(
        TempStorage& temp_storage,                // shared storage reference
        KeyT keys[ITEMS_PER_THREAD],              // [in|out] keys
        ValueT values[ITEMS_PER_THREAD],          // [in|out] values
        int segment_indices[ITEMS_PER_THREAD])    // [out] segment indices (-1 if invalid)
    {
        KeyT prev_keys[ITEMS_PER_THREAD];
        ValueOffsetPair scan_items[ITEMS_PER_THREAD];

        // Set head segment_flags
        BlockDiscontinuityT(temp_storage.discontinuity).FlagHeads(
            segment_indices, keys, prev_keys);

        __syncthreads();

        // Unset the flag for the first item
        if (threadIdx.x == 0)
            segment_indices[0] = 0;

        // Zip values and segment_flags
        for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
        {
            scan_items[ITEM].offset = segment_indices[ITEM];
            scan_items[ITEM].value  = values[ITEM];
        }

        // Exclusive scan of values and segment_flags
        ValueOffsetPair tile_aggregate;
        BlockScanT(temp_storage.scan).ExclusiveSum(
            scan_items, scan_items, tile_aggregate);

        // Unzip values and segment indices
        for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
        {
            segment_indices[ITEM] = segment_indices[ITEM] ? scan_items[ITEM].offset : -1;
            keys[ITEM]   = prev_keys[ITEM];
            values[ITEM] = scan_items[ITEM].value;
        }

        // Return "carry-out"
        return tile_aggregate.value;
    }
};
6. Other Very Useful Things in CUB
#include <cub/cub.cuh>

// Standard layout type
struct Foo
{
    double x;
    char y;
};

__global__ void Kernel(Foo* d_in, Foo* d_out)
{
    // In host or device code: create cache-modified wrappers
    cub::CacheModifiedInputIterator<cub::LOAD_LDG, Foo>  ldg_itr(d_in);
    cub::CacheModifiedOutputIterator<cub::STORE_WT, Foo> volatile_itr(d_out);

    volatile_itr[threadIdx.x] = ldg_itr[threadIdx.x];
}

Generated SASS (sm_35), Function: _Z6KernelPdS_:

    MOV R1, c[0x0][0x44];
    S2R R0, SR_TID.X;
    ISCADD R2, R0, c[0x0][0x140], 0x3;
    LDG.64 R4, [R2];
    LDG.64 R2, [R6];
    ISCADD R0, R0, c[0x0][0x144], 0x4;
    TEXDEPBAR 0x1;
    ST.WT.64 [R0], R4;
    TEXDEPBAR 0x0;
    ST.WT.64 [R0+0x8], R2;
    EXIT;

Available cache load modifiers:

    LOAD_DEFAULT,   ///< Default (no modifier)
    LOAD_CA,        ///< Cache at all levels
    LOAD_CG,        ///< Cache at global level
    LOAD_CS,        ///< Cache streaming (likely to be accessed once)
    LOAD_CV,        ///< Cache as volatile (including cached system lines)
    LOAD_LDG,       ///< Cache as texture
    LOAD_VOLATILE,  ///< Volatile (any memory space)
#include <cub/cub.cuh>

// Standard layout type
struct Foo
{
    int y;
    double x;
};

template <typename InputIteratorT, typename OutputIteratorT>
__global__ void Kernel(InputIteratorT d_in, OutputIteratorT d_out)
{
    d_out[threadIdx.x] = d_in[threadIdx.x];
}

// Create a texture object input iterator
Foo* d_foo;
cub::TexObjInputIterator<Foo> d_foo_tex;
d_foo_tex.BindTexture(d_foo);

Kernel<<<1, 32>>>(d_foo_tex, d_foo);

d_foo_tex.UnbindTexture();

Generated SASS (sm_35), Function: _Z6KernelIN3cub19TexObjInputIteratorI3FooiEEPS2_EvT_T0_:

    MOV R1, c[0x0][0x44];
    S2R R0, SR_TID.X;
    IADD R2, R0, c[0x0][0x144];
    SHF.L R2, RZ, 0x1, R2;
    IADD R3, R2, 0x1;
    TLD.LZ.T R2, R2, 0x52, 1D, 0x1;
    TLD.LZ.P R4, R3, 0x52, 1D, 0x3;
    ISCADD R0, R0, c[0x0][0x150], 0x4;
    TEXDEPBAR 0x1;
    ST [R0], R2;
    TEXDEPBAR 0x0;
    ST.64 [R0+0x8], R4;
    EXIT;
Warp-wide and block-wide collective primitives:
- WarpReduce: reduction & segmented reduction
- WarpScan
- BlockDiscontinuity
- BlockExchange
- BlockHistogram
- BlockLoad & BlockStore
- BlockRadixSort
- BlockReduce
- BlockScan
Device-wide primitives (usable with CDP, streams, and your own memory allocator):
- DeviceHistogram: histogram-even, histogram-range
- DevicePartition: partition-if, partition-flagged
- DeviceRadixSort: ascending / descending
- DeviceReduce: reduction, arg-min, arg-max, reduce-by-key
- DeviceRunLengthEncode: RLE, non-trivial segments
- DeviceScan: inclusive / exclusive
- DeviceSelect: select-flagged, select-if, keep-unique
- DeviceSpmv
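All of the Device* primitives share CUB's two-phase temporary-storage idiom: a first call with a NULL workspace pointer only reports the required allocation size, so the caller, not CUB, decides where and how the workspace is allocated. A hedged CUDA sketch (the function name SumExample and its arguments are illustrative, not from the slides):

```cuda
#include <cub/cub.cuh>

// Sketch of the two-phase pattern for cub::DeviceReduce::Sum;
// d_in, d_out, and num_items are assumed to be set up elsewhere.
void SumExample(int* d_in, int* d_out, int num_items)
{
    void*  d_temp_storage     = NULL;
    size_t temp_storage_bytes = 0;

    // Phase 1: query temporary storage requirements (no work is done)
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                           d_in, d_out, num_items);

    // Allocate the workspace (substitute your own allocator here)
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Phase 2: run the reduction
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                           d_in, d_out, num_items);

    cudaFree(d_temp_storage);
}
```

Because the caller owns the allocation, the same primitives compose with streams, CUDA Dynamic Parallelism, and pooled allocators.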
Simple intra-thread RLE provides a uniform performance response regardless of input sample distribution
(Chart: average elapsed time (us) on GeForce GTX980, 1-channel 1920x1080 uchar1 pixels, comparing RLE, SMEM-atomic, and GMEM-atomic histogram strategies)

// RLE pixel counts within the thread's pixels
int accumulator = 1;
for (int PIXEL = 0; PIXEL < PIXELS_PER_THREAD - 1; ++PIXEL)
{
    if (bins[PIXEL] == bins[PIXEL + 1])
    {
        accumulator++;
    }
    else
    {
        atomicAdd(privatized_histogram + bins[PIXEL], accumulator);
        accumulator = 1;
    }
}

// Flush the final run
atomicAdd(privatized_histogram + bins[PIXELS_PER_THREAD - 1], accumulator);
Merge-based parallel decomposition for load balance
(Figure: merge-based decomposition over CSR row-offsets {2, 2, 4, 8} and the indices of non-zeros 1..8; with all non-zero values 1.0, the result is y = {2.0, 0.0, 2.0, 4.0})
(Chart: fp32 SpMV gigaflops on Tesla K40, CUB vs. cuSPARSE)
7. Final thoughts
Simplicity of composition
    Kernels are simply sequences of primitives
High performance
    CUB uses the best known algorithms, abstractions, strategies, and techniques
Performance portability
    CUB is specialized for the target hardware (e.g., memory conflict rules, special instructions, etc.)
Simplicity of tuning
    CUB adapts to various grain sizes (threads per block, items per thread, etc.)
    CUB provides alternative algorithms
Robustness and durability
    CUB supports arbitrary data types and block sizes
Please visit the CUB project on GitHub
http://nvlabs.github.com/cub
Duane Merrill (dumerrill@nvidia.com)