Where is the Data? Why You Cannot Debate CPU vs. GPU Performance Without the Answer

Chris Gregg and Kim Hazelwood
University of Virginia
Computer Engineering Labs

GPUs and Data Transfer
• GPU computing offers tremendous speedup in computation. GPUs are fast, but data transfer is a bottleneck.
• All data that goes to the GPU must get there through the PCI Express bus.
• Data generated on the GPU only sometimes has to come back to the CPU.
• Some algorithms either generate or consume data, and only have one-way transfers.

GPUs and Data Transfer
• “GPU Speedup” is misleading without describing data transfer necessities.
• Example: Sorting can be extremely fast on the GPU, but
  • How did the original data get to the device?
  • What happens after it is sorted?
• GPUs have one I/O path: the PCI Express (PCIe) bus.

Why Data Transfer Overhead Matters
• GPU kernels are fast. Data transfer time can overwhelm kernel runtime.
• SAXPY (128MB): Kernel Runtime: 4ms, Data Transfer: 198ms
[Pie chart of SAXPY run time: the two transfers (to GPU, from GPU) account for 48% and 50%, Kernel Time for 2%]
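
The 2% kernel share follows directly from the two SAXPY timings on this slide. A minimal Python sketch of that arithmetic is shown here; the function name `time_breakdown` is an illustrative choice, not something from the presentation.

```python
def time_breakdown(kernel_ms: float, transfer_ms: float) -> dict:
    """Return total run time and each component's share as a percentage."""
    total_ms = kernel_ms + transfer_ms
    return {
        "total_ms": total_ms,
        "kernel_pct": 100.0 * kernel_ms / total_ms,
        "transfer_pct": 100.0 * transfer_ms / total_ms,
    }


# SAXPY (128MB) figures from the slide: 4 ms kernel, 198 ms data transfer.
saxpy = time_breakdown(kernel_ms=4.0, transfer_ms=198.0)
print(f"kernel: {saxpy['kernel_pct']:.0f}%, transfer: {saxpy['transfer_pct']:.0f}%")
# prints "kernel: 2%, transfer: 98%"
```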

Faster GPUs Exacerbate the Problem
Radix Sort: NVIDIA 330M (Slow) vs. C2050 (Fast)
[Pie charts: Transfer from GPU vs. Kernel Time for each GPU]
• Transfer (256MB): 97ms, Kernel: 125ms, Total Time: 222ms
• Transfer (32MB): 13ms, Kernel: 215ms, Total Time: 228ms

PCI Express (PCIe) Overview
• CPUs and discrete GPUs contain separate memory systems, and memory transfer between the two happens across the PCIe bus.
• PCIe 2.1 with 16 lanes has a maximum throughput of 8GB/s, but in practice it is lower. Dependencies:
  • Paged/Pinned
  • Lanes available in slot
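
The 8GB/s ceiling gives a quick lower bound on transfer time, and a measured transfer gives the bandwidth actually achieved. The Python sketch below assumes decimal units (1 GB = 1000 MB), and it assumes the 256MB / 97ms radix sort figure from the previous slide is a single one-way transfer; both function names are illustrative.

```python
def min_transfer_ms(size_mb: float, bandwidth_gb_per_s: float = 8.0) -> float:
    """Lower bound on one-way transfer time in ms.

    With 1 GB = 1000 MB, MB divided by GB/s is numerically equal to ms.
    """
    return size_mb / bandwidth_gb_per_s


def effective_bandwidth_gb_per_s(size_mb: float, measured_ms: float) -> float:
    """Bandwidth actually achieved by a measured transfer (MB/ms equals GB/s)."""
    return size_mb / measured_ms


# Best case for moving 128 MB across a 16-lane PCIe 2.1 link.
print(min_transfer_ms(128.0))  # prints 16.0

# Assumption: the 256MB / 97ms figure is one one-way transfer.
print(f"{effective_bandwidth_gb_per_s(256.0, 97.0):.2f} GB/s")  # prints "2.64 GB/s"
```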

PCI Express (PCIe) Overview
• GPUs can move data via pinned or paged transfers, synchronously or asynchronously.
• Pinned transfers provide direct mapped access to CPU main memory from the GPU; however, the data still has to travel across the PCIe bus.
• Asynchronous transfers can hide memory transfer overhead
  • Not all algorithms support starting on partial data
  • Not as straightforward to program
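
How much overhead asynchronous transfers can hide can be estimated with an idealized three-stage pipeline model (copy in, compute, copy out). The Python sketch below is a rough model, not a measurement: it assumes the work splits into equal chunks, that all three stages can run concurrently, and that chunking adds no overhead. The 100 ms timings are hypothetical.

```python
def serial_ms(h2d_ms: float, kernel_ms: float, d2h_ms: float) -> float:
    """Synchronous case: copy in, run the kernel, copy out, one after another."""
    return h2d_ms + kernel_ms + d2h_ms


def pipelined_ms(h2d_ms: float, kernel_ms: float, d2h_ms: float, chunks: int) -> float:
    """Idealized asynchronous case with `chunks` equal pieces.

    The first chunk passes through all three stages; every later chunk
    finishes one slowest-stage interval after the previous one.
    """
    stages = [h2d_ms / chunks, kernel_ms / chunks, d2h_ms / chunks]
    return sum(stages) + (chunks - 1) * max(stages)


# Hypothetical timings: 100 ms for each stage.
print(serial_ms(100.0, 100.0, 100.0))        # prints 300.0
print(pipelined_ms(100.0, 100.0, 100.0, 4))  # prints 150.0
```

With a single chunk the model reduces to the serial case, which matches the slide's caveat: an algorithm that cannot start on partial data gains nothing from overlap.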

Categorizing Transfer Requirements
All kernels can be placed into one of the following categories, although a kernel’s category could change depending on the application:
1. Non-Dependent (ND)
2. Dependent-Streaming (DS)
3. Single-Dependent-Host-to-Device (SDH2D)
4. Single-Dependent-Device-to-Host (SDD2H)
5. Dual-Dependent (DD)
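
The five categories can be read as a small decision rule over which transfers a kernel needs. The Python sketch below is one possible encoding; the `categorize` function and its three boolean parameters are assumptions made for illustration, not a classification procedure given in the slides.

```python
from enum import Enum


class TransferCategory(Enum):
    ND = "Non-Dependent"
    DS = "Dependent-Streaming"
    SDH2D = "Single-Dependent-Host-to-Device"
    SDD2H = "Single-Dependent-Device-to-Host"
    DD = "Dual-Dependent"


def categorize(needs_h2d: bool, needs_d2h: bool, streams: bool = False) -> TransferCategory:
    """Map a kernel's transfer needs to a category (hypothetical rule)."""
    if streams:
        # Data is processed while it is streamed over the bus.
        return TransferCategory.DS
    if needs_h2d and needs_d2h:
        return TransferCategory.DD
    if needs_h2d:
        return TransferCategory.SDH2D
    if needs_d2h:
        return TransferCategory.SDD2H
    return TransferCategory.ND


print(categorize(needs_h2d=True, needs_d2h=True).value)    # prints "Dual-Dependent"
print(categorize(needs_h2d=False, needs_d2h=False).value)  # prints "Non-Dependent"
```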

Categorizing Transfer Requirements
Why is categorization important?
• CPU / GPU comparisons
[Figure labels: 1.2x Speedup, 62x Speedup]
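
A reported speedup depends heavily on whether transfer time is counted. The Python sketch below reuses the SAXPY timings from the earlier slide (4 ms kernel, 198 ms transfer). The CPU time of 248 ms is a hypothetical value chosen for illustration so that the kernel-only ratio comes out at 62x; it is not a measurement from the slides.

```python
def speedup(cpu_ms: float, gpu_kernel_ms: float, gpu_transfer_ms: float = 0.0) -> float:
    """CPU time divided by GPU time, optionally counting data transfer."""
    return cpu_ms / (gpu_kernel_ms + gpu_transfer_ms)


CPU_MS = 248.0  # hypothetical CPU run time, not taken from the slides

kernel_only = speedup(CPU_MS, gpu_kernel_ms=4.0)
end_to_end = speedup(CPU_MS, gpu_kernel_ms=4.0, gpu_transfer_ms=198.0)
print(f"kernel-only: {kernel_only:.0f}x, end-to-end: {end_to_end:.1f}x")
# prints "kernel-only: 62x, end-to-end: 1.2x"
```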

Data Transfer Categories
• Non-Dependent (ND)
  • Data is present on the GPU at kernel launch, and little or no data needs to be returned to the CPU
  • Examples: Monte Carlo, Sort (sometimes)

Data Transfer Categories
• Dependent-Streaming (DS)
  • Information is processed on the GPU and concurrently streamed over the PCI Express bus.
  • Excellent for hiding data transfer overhead
  • More difficult to program
  • Example: JPEG Conversion

Data Transfer Categories
• Single-Dependent-Host-to-Device (SDH2D)
  • Data is sent to the GPU and consumed; little or no data is returned.
  • “First Kernel” Syndrome
  • Examples: Search, Histogram

Data Transfer Categories
• Single-Dependent-Device-to-Host (SDD2H)
  • Data is created (or exists already) on the GPU and is returned to the CPU
  • “Last Kernel” Syndrome
  • Example: Mersenne Twister

Data Transfer Categories
• Dual-Dependent (DD)
  • Data is sent to the GPU, is processed, and is sent back to the CPU.
  • Common for single-kernel applications
  • Examples: Matrix Multiply, Convolution, FFT.

Categorizing Transfer Requirements
Kernels can change categorization depending on their role in an application.
Example 1:
• Kernel pipelines: data is moved onto the GPU for the first kernel, manipulated in-place for other kernels, and moved back after the last kernel.
[Diagram: Data Transfer to GPU → Kernel 1 (SDH2D) → Kernel 2 (ND) → Kernel 3 (ND) → Kernel 4 (SDD2H) → Data Transfer from GPU]
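
The pipeline rule in Example 1 can be written down as a short Python sketch: the first kernel pays for the host-to-device transfer, the last kernel pays for the device-to-host transfer, and the kernels in between work on data already on the GPU. The function name is illustrative, and treating a one-kernel pipeline as Dual-Dependent follows the earlier note that DD is common for single-kernel applications.

```python
from typing import List


def pipeline_categories(num_kernels: int) -> List[str]:
    """Assign a transfer category to each kernel in a linear GPU pipeline."""
    if num_kernels < 1:
        raise ValueError("a pipeline needs at least one kernel")
    if num_kernels == 1:
        # A lone kernel needs both transfers.
        return ["DD"]
    # First kernel receives the data, last kernel returns it,
    # and every kernel in between operates in place on the GPU.
    return ["SDH2D"] + ["ND"] * (num_kernels - 2) + ["SDD2H"]


print(pipeline_categories(4))  # prints ['SDH2D', 'ND', 'ND', 'SDD2H']
```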

Changing Categories
Example 2:
• Some kernels, such as CUJ2K, stream (DS category) when more memory is available, but don’t stream in limited-memory cases (DD category).
When reporting on speedup, “common-use case” categories should be described.
• E.g., FFT is Dual-Dependent in general, but a common-use case would be as the last kernel in a pipeline of kernels, which would make the kernel SDD2H.

Scheduling and Runtime Prediction
• Heterogeneous scheduling depends on overall time to complete work, not simply kernel time.
• Data location is important for future scheduling decisions.
• We want to minimize data transfer, but maximize device utilization; this is a difficult problem when considering the data transfer overhead.
• Categorizing data transfer requirements for kernels helps to inform the scheduling decision.
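
A scheduler that accounts for data location can be sketched as choosing the device with the lowest total time, where a transfer is charged only when the data is not already where it is needed. The Python sketch below is a simplified illustration under stated assumptions: the `DeviceEstimate` fields, both function names, and all timings are hypothetical, and device utilization is ignored.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeviceEstimate:
    name: str           # "cpu" or "gpu"
    kernel_ms: float    # predicted compute time on this device
    move_in_ms: float   # cost of bringing the input to this device
    move_out_ms: float  # cost of sending the result elsewhere


def total_ms(dev: DeviceEstimate, data_on: str, result_on: str) -> float:
    """Compute time plus any transfers required by the data's location."""
    total = dev.kernel_ms
    if data_on != dev.name:
        total += dev.move_in_ms
    if result_on != dev.name:
        total += dev.move_out_ms
    return total


def pick_device(devices: List[DeviceEstimate], data_on: str, result_on: str) -> str:
    """Return the name of the device with the lowest overall time."""
    return min(devices, key=lambda d: total_ms(d, data_on, result_on)).name


# Hypothetical estimates: the GPU kernel is far faster, but transfers are costly.
devices = [
    DeviceEstimate("cpu", kernel_ms=150.0, move_in_ms=99.0, move_out_ms=99.0),
    DeviceEstimate("gpu", kernel_ms=4.0, move_in_ms=99.0, move_out_ms=99.0),
]
print(pick_device(devices, data_on="cpu", result_on="cpu"))  # prints "cpu"
print(pick_device(devices, data_on="gpu", result_on="gpu"))  # prints "gpu"
```

The same kernel lands on different devices depending only on where its data lives, which is why a kernel's transfer category is useful input to the scheduling decision.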

Conclusions
• The CPU-vs-GPU performance question is not simply “this algorithm runs X times faster on the GPU than the CPU.”
• Categorizing kernel behavior by including data transfer requirements provides necessary information to the discussion.
• Heterogeneous scheduling can benefit greatly from the increased information.

Questions?
Computer Engineering Labs
