Performance Target Tesla M2090 512 CUDA processors @ 1.3 GHz each - PowerPoint PPT Presentation

CUDA Deep Dive – Performance   CSE 6230 Jee Choi September 10, 2013

Performance Target
• Tesla M2090
  – 512 CUDA processors @ 1.3 GHz
  – each processor is capable of issuing 1 FMA (2 flops) instruction per cycle
  – throughput = 512 × 1.3 GHz × 2 flops = 1.33 TFLOP/s
  – 384-bit memory interface @ 1.85 GHz
  – GDDR5 memory transfers data on both the rising and falling edges of the clock signal
  – bandwidth = (384 / 8) bytes × 1.85 GHz × 2 = 177 GB/s
• How can we achieve this level of performance in our code?

GPU Architecture: Streaming Multiprocessor (SM)
• Dual warp scheduler
  – in each cycle, the dual warp scheduler selects two warps and issues one instruction from each warp to a group of sixteen cores
• Register file (32,768 × 32-bit)
  – each thread currently residing on the multiprocessor gets its own register set
• 32 cores
  – an instruction that has been scheduled from a warp is assigned to a group of 16 cores
  – it takes 2 cycles to issue 1 warp (first half-warp on the first cycle, second half-warp on the second cycle)
• Shared memory / L1 cache (64 KB)
  – the 64 KB can be configured to give 48 KB or 16 KB to shared memory, and the rest to L1 cache
  – shared memory is “shared” amongst the thread blocks currently residing on the SM

Achieving performance
• Does this mean that as long as we have 2 warps in each SM, we can achieve peak performance?
  – theoretically yes, but realistically no
• Latency
  – from beginning to end, instructions take X cycles to finish
  – arithmetic instructions: tens of cycles
  – memory reads: hundreds of cycles
• Data dependency
  – dependent instructions can’t be issued back-to-back

Example
• On the G80 architecture, arithmetic instruction latency is 24 cycles
    a = a * b + c; // 1) takes 24 cycles to complete
    a = a * b + c; // 2) stalled because of dependency on 1)
    x = x * y + z; // 3) issued immediately after 2) because no dependency
    a = a * b + c; // 4) stalled because of dependency on 2)
    x = x * y + z; // 5) issued immediately after 4) because by the time 4)
                   //    was issued, 3) was on its last cycle
• Timeline: instructions 1 through 5 take 73 cycles in total
  – only 5 instructions have completed, vs. the 73 instructions needed to achieve peak

Achieving performance
• Thus, we need lots of independent threads/warps to hide latency and achieve peak performance
• Exactly how many do we need?
  – Little’s Law can help us figure this out
  – recall that we used Little’s Law to figure out the number of in-flight memory requests needed to maximize bandwidth utilization

Performance Notes: Bandwidth Utilization II
• Little’s Law
  – L = λW
    • L = average number of customers in a store
    • λ = arrival rate
    • W = average time spent
• Memory bandwidth
  – λ = bandwidth, W = latency
  – tens of thousands of in-flight requests!!!

Thread level parallelism
• For the G80 architecture GPU
  – arithmetic instruction latency ~24 cycles (W)
  – 8 cores per multiprocessor (λ)
  – parallelism (number of in-flight instructions) = 24 × 8 = 192 (6 warps)
  – we need 6 warps’ worth of independent FMA instructions in flight in order to achieve peak performance (equivalently, to issue an FMA instruction every cycle)

Thread level parallelism
• What if we have no FMA instructions?
  – achieving “peak” is impossible
• What if we don’t have enough thread-level parallelism?
  – is it impossible to achieve peak?
  – not if you take advantage of instruction-level parallelism (ILP)

Occupancy
• Recall that occupancy is
  – (# of warps) / (maximum # of warps)
• Higher is generally better, but not always
  – higher occupancy means more warps with which to hide latency
  – higher occupancy comes at the cost of fewer registers per thread
  – registers are the fastest memory and are necessary for achieving high performance

Occupancy on G80
• G80
  – maximum of 24 warps per SM
  – 6 warps are required to hide instruction latency using only TLP
  – occupancy = 6 / 24 = 0.25
• Let’s try to achieve peak with occupancy < 0.25 by using ILP

Example
• On the G80 architecture
  – arithmetic instruction latency is 24 cycles
  – there are 8 processors per SM
  – 192 threads / 6 warps are required to completely hide instruction latency
    for (i = 0; i < N; i++) {
        a = a * b + c; // no ILP; requires 6 warps to achieve peak
    }

Example
• On the G80 architecture
  – arithmetic instruction latency is 24 cycles
  – there are 8 processors per SM
  – with ILP of 2, each warp has 2 instructions that can be issued back-to-back
  – this is similar to having 2 warps with ILP of 1
  – you can achieve peak using half the warps
    for (i = 0; i < N; i++) {
        a = a * b + c; // 2 independent instructions; ILP is 2
        d = d * e + f;
    }

Example
• On the G80 architecture
  – arithmetic instruction latency is 24 cycles
  – there are 8 processors per SM
  – what is the minimum amount of ILP needed to achieve peak performance with 2 warps?
  – since we need 6 warps with ILP of 1, we can get away with 2 warps with ILP of 3
  – equivalent to an occupancy of 2 / 24 = 0.083
    for (i = 0; i < N; i++) {
        a = a * b + c; // 3 independent instructions; ILP is 3
        d = d * e + f;
        g = g * h + j;
    }

Impact of using fewer warps
• Typically there is a limit on the amount of ILP that the GPU can take advantage of
  – you need to mix ILP and TLP to more easily achieve peak or target performance
• Having lower occupancy means more registers per thread with which to increase performance
  – more work per thread may also reduce thread block scheduling overheads
