Adi Fuchs, Noam Shalev and Avi Mendelson – Technion , Israel Institute of Technology
This work was supported in part by the Metro450 consortium
Understanding of GPGPU Performance: Towards a New Optimization Tool
Bandwidth (in MB/s) for memory copy on two CPU, two GPU, and two 64-bit systems.
Existing tools and work – Industry + Academia:
Goals:
4 systems tested: C2070, Quadro2000, GTX680, K20
Micro-benchmark #1: Locality
[Figure: Shared Memory. Kernel latency (us) vs. small jump size (bytes). Series: C2070, Quadro2000, GTX680, K20.]
[Figure: Texture Memory. Kernel latency (us) vs. small jump size (bytes). Series: C2070, Quadro2000, GTX680, K20.]
[Figure: Constant Memory. Kernel latency (us) vs. small jump size (bytes). Series: C2070, Quadro2000, GTX680, K20.]
[Figure: Global Memory. Kernel latency (us) vs. small jump size (bytes). Series: C2070, Quadro2000, GTX680, K20.]
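The locality plots sweep a "small jump size" between consecutive accesses. As a rough illustration of why the jump size could matter, the sketch below counts how many distinct cache lines a fixed number of accesses touches at each jump size. The 128-byte line size, the access count, and the function name `lines_touched` are assumptions made for illustration only; none of them come from the slides, and the model ignores everything except spatial locality.

```python
def lines_touched(jump_bytes, num_accesses=64, line_bytes=128):
    """Count distinct cache lines hit by accesses at 0, jump, 2*jump, ...

    line_bytes=128 and num_accesses=64 are assumed values, not taken
    from the slides.
    """
    return len({(i * jump_bytes) // line_bytes for i in range(num_accesses)})


if __name__ == "__main__":
    # Jump sizes shown on the x-axis of the locality plots.
    for jump in (4, 16, 64, 256):
        print(f"jump={jump:>3} bytes -> {lines_touched(jump)} lines touched")
```

Under these assumptions, a larger jump spreads the same number of accesses over more lines, which is one plausible reason latency would rise with jump size; the measured curves depend on the actual cache hierarchy of each memory type.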
Micro-benchmark #2: Synchronization
[Figure: Fermi Quadro 2000. Kernel latency (us) vs. number of sync instructions. Series: 1, 4, 32, 64, 128, 192 threads.]
[Figure: K20. Kernel latency (us) vs. number of sync instructions. Series: 1, 4, 32, 64, 128, 192 threads.]
Micro-benchmark #3: Memory Coalescing
[Figure: Fermi Quadro2000. Average read latency (us) vs. number of threads. Series: 4, 8, 16, 32, 64, 128, 256, 512, 1024 bytes.]
[Figure: Tesla K20. Average read latency (us) vs. number of threads. Series: 4, 8, 16, 32, 64, 128, 256, 512, 1024 bytes.]
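To make the idea of coalescing concrete, here is a minimal sketch of a simplified model: it counts how many aligned memory segments one warp touches when neighbouring threads read words a fixed number of bytes apart. It assumes the byte values in the legend are the distance between addresses read by adjacent threads, which the slides do not confirm. The 128-byte segment size, the 4-byte word, and the function name `transactions_per_warp` are likewise illustrative assumptions; real hardware behaviour varies by architecture.

```python
def transactions_per_warp(stride_bytes, word_bytes=4, warp_size=32,
                          segment_bytes=128):
    """Count aligned segments touched when lane i reads word_bytes
    starting at address i * stride_bytes.

    segment_bytes=128 and word_bytes=4 are assumed values, not taken
    from the slides.
    """
    segments = set()
    for lane in range(warp_size):
        first = lane * stride_bytes
        last = first + word_bytes - 1
        # A word may straddle a segment boundary, so add every segment it covers.
        segments.update(range(first // segment_bytes,
                              last // segment_bytes + 1))
    return len(segments)


if __name__ == "__main__":
    # Byte values taken from the plot legend.
    for stride in (4, 8, 16, 32, 64, 128, 256, 512, 1024):
        print(f"stride={stride:>4} bytes -> "
              f"{transactions_per_warp(stride)} segments")
```

In this model a 4-byte stride lets the whole warp share one segment, while any stride of 128 bytes or more gives every thread its own segment, so the count saturates at the warp size.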
Other benchmarks...
with HW characterizations