  1. CUDA Deep Dive – Performance 
 CSE 6230 Jee Choi September 10, 2013

  2. Performance Target
 • Tesla M2090
   – 512 CUDA processors @ 1.3 GHz
   – each processor is capable of issuing 1 FMA (2 flops) instruction per cycle
   – throughput = 512 × 1.3 GHz × 2 = 1.33 TFLOP/s
   – 384-bit memory interface @ 1.85 GHz
   – GDDR5 memory transfers data on both the rising and falling edges of the clock signal
   – bandwidth = 384/8 bytes × 1.85 GHz × 2 = 177 GB/s
 • How can we achieve this level of performance in our code?
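
 As a quick sanity check, the same arithmetic as a small self-contained C program (the constants are simply the M2090 figures quoted on this slide):

    #include <stdio.h>

    int main(void) {
        /* Compute throughput: cores x core clock x 2 flops per FMA */
        const double cores    = 512.0;
        const double core_ghz = 1.3;                        /* GHz */
        const double gflops   = cores * core_ghz * 2.0;     /* ~1331 GFLOP/s = 1.33 TFLOP/s */

        /* Memory bandwidth: bus width in bytes x memory clock x 2 (both clock edges) */
        const double bus_bytes = 384.0 / 8.0;               /* 48 bytes per transfer */
        const double mem_ghz   = 1.85;                      /* GHz */
        const double gb_per_s  = bus_bytes * mem_ghz * 2.0; /* ~177.6 GB/s */

        printf("peak compute:   %.1f GFLOP/s\n", gflops);
        printf("peak bandwidth: %.1f GB/s\n", gb_per_s);
        return 0;
    }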

  3. GPU Architecture
 [Figure: Fermi streaming multiprocessor (SM) — dual warp scheduler, register file (32,768 × 32-bit), 32 cores, shared memory / L1 cache (64 KB)]
 • Dual warp scheduler
   – in each cycle, the dual warp scheduler selects two warps and issues one instruction from each warp to a group of sixteen cores
 • Register file (32,768 × 32-bit)
   – each thread currently residing on the multiprocessor gets its own register set
 • 32 cores
   – an instruction that has been scheduled from a warp is assigned to a group of 16 cores
   – it takes 2 cycles to issue 1 warp (first half-warp on the first cycle, second half-warp on the second cycle)
 • Shared memory / L1 cache (64 KB)
   – the 64 KB can be configured to give 48 KB or 16 KB to shared memory, with the rest going to L1 cache
   – shared memory is “shared” amongst the thread blocks currently residing on the SM
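
 These numbers are specific to the M2090 (Fermi). As a small sketch — not part of the original slides — the standard CUDA runtime call cudaGetDeviceProperties reports the corresponding parameters for whatever GPU you are running on, and cudaDeviceSetCacheConfig requests the 48 KB / 16 KB split described above:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                 /* query device 0 */

        printf("SMs:                  %d\n", prop.multiProcessorCount);
        printf("warp size:            %d\n", prop.warpSize);
        printf("registers per block:  %d (32-bit)\n", prop.regsPerBlock);
        printf("shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("core clock:           %.2f GHz\n", prop.clockRate * 1e-6);
        printf("memory clock:         %.2f GHz\n", prop.memoryClockRate * 1e-6);
        printf("memory bus width:     %d bits\n", prop.memoryBusWidth);

        /* Request the 48 KB shared memory / 16 KB L1 configuration. */
        cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
        return 0;
    }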

  4. Achieving performance • Does this mean as long as we have 2 warps in each SM, we can achieve peak performance?

  5. Achieving performance • Does this mean as long as we have 2 warps in each SM, we can achieve peak performance? – theoretically yes, but realistically no

  6. Achieving performance
 • Does this mean as long as we have 2 warps in each SM, we can achieve peak performance?
   – theoretically yes, but realistically no
 • Latency
   – from beginning to end, an instruction takes X cycles to finish
   – arithmetic instructions: ~10s of cycles
   – memory reads: ~100s of cycles
 • Data dependency
   – dependent instructions can’t be issued back-to-back

  7. Example
 • On the G80 architecture, arithmetic instruction latency is 24 cycles
     a = a * b + c; // 1) takes 24 cycles to complete
     a = a * b + c; // 2) stalled because of dependency on 1)
     x = x * y + z; // 3) issued immediately after 2) because there is no dependency
     a = a * b + c; // 4) stalled because of dependency on 2)
     x = x * y + z; // 5) issued immediately after 4) because by the time 4)
                    //    was issued, 3) was on its last cycle

  8. Example
 (code repeated from slide 7)
 [Timeline: instruction 1 issued]

  9. Example
 (code repeated from slide 7)
 [Timeline: instruction 2 issued after instruction 1 completes — 24 cycles]

  10. Example
 (code repeated from slide 7)
 [Timeline: instruction 3 issued immediately after — 26 cycles]

  11. Example
 (code repeated from slide 7)
 [Timeline: instruction 4 issued (stalled until instruction 2 completes)]

  12. Example
 (code repeated from slide 7)
 [Timeline: instruction 5 issued; over the 73 cycles, only 5 instructions have completed vs. the 73 instructions needed to achieve peak]
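
 The stall pattern in this walkthrough is easy to reproduce empirically. Below is a small CUDA sketch — not from the lecture; the kernel name and loop-trip count n are illustrative — that builds a chain of dependent FMAs, so each iteration must wait out the arithmetic latency before the next can issue:

    // Every FMA reads the result of the previous one, so the warp stalls
    // for the full arithmetic latency between consecutive issues.
    __global__ void dependent_fma(float *out, float b, float c, int n)
    {
        float a = threadIdx.x;
        for (int i = 0; i < n; ++i)
            a = a * b + c;                                 // serial dependency chain
        out[blockIdx.x * blockDim.x + threadIdx.x] = a;    // keep the result live
    }

 Timing this (for example with cudaEventRecord / cudaEventElapsedTime) with only a warp or two per SM makes the latency visible; adding more warps, or more independent chains per thread as in the later slides, hides it.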

  13. Thread level parallelism
 • Thus, we need lots of independent threads/warps to hide latency and achieve peak performance
 • Exactly how many do we need?

  14. Achieving performance
 • Thus, we need lots of independent threads/warps to hide latency and achieve peak performance
 • Exactly how many do we need?
   – Little’s Law can help us figure this out
   – recall that we used Little’s Law to figure out the number of in-flight memory requests needed to maximize bandwidth utilization

  15. Performance Notes – Bandwidth Utilization II
 • Little’s Law: L = λ × W
   – L = average number of customers in a store
   – λ = arrival rate
   – W = average time spent
 • Applied to memory bandwidth: keeping the bus busy requires tens of thousands of in-flight requests
 [Figure: in-flight requests (L) as the product of bandwidth (λ) and latency (W)]
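
 To make “tens of thousands of in-flight requests” concrete, here is a back-of-the-envelope calculation in C. The memory latency value is an assumption chosen for illustration (the slide does not give one); the bandwidth is the M2090 figure from slide 2:

    #include <stdio.h>

    int main(void) {
        const double bandwidth = 177e9;    /* lambda: bytes/s (slide 2) */
        const double latency   = 600e-9;   /* W: assumed ~600 ns memory latency */
        const double in_flight_bytes = bandwidth * latency;   /* L = lambda * W */

        /* Expressed as 4-byte (one float per thread) requests */
        printf("bytes in flight:  %.0f\n", in_flight_bytes);
        printf("4-byte requests:  %.0f\n", in_flight_bytes / 4.0);
        return 0;
    }

 With these assumed numbers, L is roughly 100 KB in flight, i.e. on the order of 25,000 outstanding 4-byte requests — consistent with the “tens of thousands” on the slide.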

  16. Thread level parallelism
 • For the G80 architecture GPU
   – arithmetic instruction latency ~24 cycles (W)
   – 8 cores per multiprocessor (λ)
   – parallelism (number of in-flight instructions) = 24 × 8 = 192 = 6 warps
   – we need 6 warps of independent FMA instructions to achieve peak performance (equivalently, to issue an FMA instruction every cycle)

  17. Thread level parallelism • What if we have no FMA instructions? – achieving “peak” is impossible • If we don’t have enough thread-level parallelism? – is it impossible to achieve peak?

  18. Thread level parallelism • What if we have no FMA instructions? – achieving “peak” is impossible • If we don’t have enough thread-level parallelism? – is it impossible to achieve peak? – not if you take advantage of instruction-level parallelism (ILP)

  19. Occupancy
 • Recall that occupancy is
   – (# of warps) / (maximum # of warps)
 • Higher is generally better, but not always
   – higher occupancy means more warps with which to hide latency
   – higher occupancy comes at the cost of fewer registers per thread
   – registers are the fastest memory and are necessary for achieving high performance
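
 For reference, CUDA runtimes newer than this 2013 lecture expose an occupancy calculator that accounts for a kernel's register and shared-memory usage. A minimal sketch, where my_kernel and the block size of 256 are just placeholders:

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float *x) { x[threadIdx.x] *= 2.0f; }   /* placeholder kernel */

    int main(void) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blockSize   = 256;
        int blocksPerSM = 0;
        /* How many blocks of my_kernel fit on one SM, given its
           register and shared-memory usage? */
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel, blockSize, 0);

        int activeWarps = blocksPerSM * blockSize / prop.warpSize;
        int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
        printf("occupancy = %d / %d = %.2f\n",
               activeWarps, maxWarps, (double)activeWarps / maxWarps);
        return 0;
    }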

  20. Occupancy on G80
 • G80
   – maximum of 24 warps per SM
   – 6 warps are required to hide instruction latency using only TLP
   – occupancy = 6 / 24 = 0.25
 • Let’s try to achieve peak with occupancy < 0.25 by using ILP

  21. Example
 • On the G80 architecture
   – arithmetic instruction latency is 24 cycles
   – there are 8 processors per SM
   – 192 threads / 6 warps are required to completely hide instruction latency
     for (i = 0; i < N; i++) {
         a = a * b + c; // no ILP; requires 6 warps to achieve peak
     }

  22. Example
 • On the G80 architecture
   – arithmetic instruction latency is 24 cycles
   – there are 8 processors per SM
   – with ILP of 2, each warp has 2 instructions that can be issued back-to-back
   – this is similar to having 2 warps with ILP of 1
   – you can achieve peak using half the warps
     for (i = 0; i < N; i++) {
         a = a * b + c; // 2 independent instructions; ILP is 2
         d = d * e + f;
     }

  23. Example
 • On the G80 architecture
   – arithmetic instruction latency is 24 cycles
   – there are 8 processors per SM
   – what is the minimum amount of ILP needed to achieve peak performance with 2 warps?

  24. Example
 • On the G80 architecture
   – arithmetic instruction latency is 24 cycles
   – there are 8 processors per SM
   – what is the minimum amount of ILP needed to achieve peak performance with 2 warps?
   – since we need 6 warps with an ILP of 1, we can get away with 2 warps with an ILP of 3
   – equivalent to an occupancy of 2 / 24 = 0.083
     for (i = 0; i < N; i++) {
         a = a * b + c; // 3 independent instructions; ILP is 3
         d = d * e + f;
         g = g * h + I;
     }
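
 A small CUDA sketch of the same idea — not code from the lecture; the kernel name, constants, and accumulator seeds are made up for illustration. Each thread carries three independent accumulators, so three FMAs per thread can be in flight at once:

    // ILP of 3: the three accumulators have no dependencies on each other,
    // so the scheduler can issue their FMAs back-to-back while earlier ones
    // are still completing.
    __global__ void ilp3_fma(float *out, float b, float c, int n)
    {
        float a = threadIdx.x;
        float d = threadIdx.x + 1.0f;
        float g = threadIdx.x + 2.0f;
        for (int i = 0; i < n; ++i) {
            a = a * b + c;   // independent chain 1
            d = d * b + c;   // independent chain 2
            g = g * b + c;   // independent chain 3
        }
        // Combine the accumulators so the compiler cannot eliminate them.
        out[blockIdx.x * blockDim.x + threadIdx.x] = a + d + g;
    }

 Comparing its runtime against the dependent-chain kernel sketched earlier, at the same low occupancy, shows the latency being hidden by ILP rather than TLP.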

  25. Impact of using fewer warps
 • Typically there is a limit on the amount of ILP the GPU can take advantage of
   – you need to mix ILP and TLP to more easily achieve peak or target performance
 • Having lower occupancy means more registers per thread with which to increase performance
   – more work per thread may also reduce thread-block scheduling overheads
