Understanding of GPGPU Performance: Towards a New Optimization Tool
Adi Fuchs, Noam Shalev and Avi Mendelson – Technion, Israel Institute of Technology
This work was supported in part by the Metro450 consortium

SLIDE 1

Understanding of GPGPU Performance: Towards a New Optimization Tool

Adi Fuchs, Noam Shalev and Avi Mendelson – Technion, Israel Institute of Technology

This work was supported in part by the Metro450 consortium

SLIDE 2

Adi Fuchs, Noam Shalev and Avi Mendelson – Technion, Israel Institute of Technology

This work was supported in part by the Metro450 consortium

SLIDE 3

[Figure: bandwidth (in MB/s) for memory copy on two CPU, two GPU, and two 64-bit systems]

  • GPUs provide significant performance or power-efficiency benefits for parallel workloads
  • However, even simple workloads are microarchitecture- and platform-sensitive
  • Why do applications behave the way they do?
SLIDE 4
Existing tools and work – Industry + Academia:

  • GPGPU profiling tools:
    • complex and not conclusive
    • mainly based on the vendors' own work (do not expose undocumented behavior)
  • Academic work:
    • some works suggest the use of targeted benchmarks
    • some target specific structures or aspects
    • many are based on "common knowledge"

SLIDE 5

Goals:

  • Unveil GPU microarchitecture characteristics
  • …including undocumented behavior!
  • Automatically match applications to HW specs and to HW/SW optimizations
SLIDE 6
SLIDE 7

Current work

  • We have a series of CUDA micro-benchmarks that explore different NVIDIA cards
  • Each micro-benchmark pinpoints a different phenomenon
  • We focus on the memory system, which has a huge impact on performance and power
  • The benchmarks were executed on four different NVIDIA systems (a latency-measurement sketch follows below)
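The slides do not show the measurement code; as a rough illustration of how the per-kernel latencies on the following slides could be obtained, here is a minimal CUDA timing harness based on CUDA events. The kernel body, launch configuration and warm-up step are illustrative assumptions, not the authors' actual benchmark driver.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for one of the micro-benchmark kernels described below.
__global__ void microbenchmark_kernel() {}

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time driver overhead is excluded from the timing.
    microbenchmark_kernel<<<1, 32>>>();
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    microbenchmark_kernel<<<1, 32>>>();       // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    printf("kernel latency: %.1f us\n", ms * 1000.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```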
SLIDE 8

Long term vision…

  • We wish to construct a database of application + HW characteristics
  • Based on this database we would like to construct a matching tool:
    1. Given a workload – what type of hardware should be used?
    2. Given a workload + hardware – which optimizations should be applied?
SLIDE 9
SLIDE 10
  • Common micro-benchmarks often target the hierarchy (e.g. cache levels)
  • Targeting the hierarchy adds to the code's complexity
  • Targeting the hierarchy harms portability! (machine-dependent code)
  • Our micro-benchmarks target behavior, not hierarchy
SLIDE 11

Four systems tested: C2070, Quadro 2000, GTX 680 and K20

SLIDE 12

Micro-benchmark #1: Locality

  • Explore cache-line/prefetch sizes using small jumps of varying size (see the sketch below)
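A minimal sketch of the access pattern such a probe might use (the kernel name, buffer size and sink trick are illustrative assumptions, not the paper's exact code); the same pattern is aimed at shared, texture, constant and global memory in the measurements below, with texture accesses going through texture fetches rather than raw loads.

```cuda
// Strided-read kernel: a single thread walks a buffer in "small jumps" of
// jump_bytes. When the jump is smaller than the cache-line/prefetch unit,
// consecutive reads hit in the cache and the kernel latency stays flat;
// once the jump exceeds it, every read pays the full miss latency.
__global__ void locality_probe(const char *buf, size_t len,
                               size_t jump_bytes, int *sink)
{
    int sum = 0;
    for (size_t off = 0; off < len; off += jump_bytes)
        sum += buf[off];
    *sink = sum;   // write the result so the loop is not optimized away
}
```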
SLIDE 13

Micro-benchmark #1: Locality

[Figure: kernel latency (us) vs. small-jump size (bytes) for shared memory on the C2070, Quadro 2000, GTX 680 and K20]

  • In all systems tested, shared memory latency is fixed → no caching/prefetching
SLIDE 14

Micro-benchmark #1: Locality

[Figure: kernel latency (us) vs. small-jump size (bytes) for texture memory on the C2070, Quadro 2000, GTX 680 and K20]

  • Texture memory caching granularity is 32 bytes = 4 double-precision coordinates
SLIDE 15

Micro-benchmark #1: Locality

[Figure: kernel latency (us) vs. small-jump size (bytes) for constant memory on the C2070, Quadro 2000, GTX 680 and K20]

  • Constant memory has a two-level hierarchy, with 64-byte and 256-byte segments
SLIDE 16

Micro-benchmark #1: Locality

[Figure: kernel latency (us) vs. small-jump size (bytes) for global memory on the C2070, Quadro 2000, GTX 680 and K20]

  • Global memory – systems with CUDA compute capability 2.x support caching/prefetching
SLIDE 17

Micro-benchmark #2: Synchronization

  • Examine the effects of varying synchronization granularity on memory writes
  • The number of threads changes as well – each thread executes the same kernel (sketched below):
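A minimal sketch of what such a kernel might look like (names and constants are illustrative assumptions, not the paper's exact code): the number of memory writes stays fixed while the number of __syncthreads() barriers is swept.

```cuda
// Every thread performs the same fixed number of writes; a block-wide
// barrier is issued every writes_per_sync writes, so sweeping
// writes_per_sync varies the number of sync instructions (the x-axis on
// the next slides) while the memory work stays constant.
__global__ void sync_probe(int *out, int total_writes, int writes_per_sync)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < total_writes; ++i) {
        out[tid] = i;                     // memory write
        if ((i + 1) % writes_per_sync == 0)
            __syncthreads();              // condition is uniform across the block
    }
}
```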
SLIDE 18

Micro-benchmark #2: Synchronization

[Figure: kernel latency (us) vs. number of sync instructions on the Fermi Quadro 2000, for 1, 4, 32, 64, 128 and 192 threads]

  • Fine-grained synchronization increases latency by 163%; 192 threads increase latency by 13%
SLIDE 19

Micro-benchmark #2: Synchronization

[Figure: kernel latency (us) vs. number of sync instructions on the K20, for 1, 4, 32, 64, 128 and 192 threads]

  • Fine-grained synchronization increases latency by 281%; 192 threads increase latency by 38%
SLIDE 20

Micro-benchmark #3: Memory Coalescing

  • Target: the ability to group memory accesses from different threads…
  • …and what happens when grouping is impossible.
  • Each thread reads 1K lines, starting from a different offset (see the sketch below).
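A minimal sketch of the per-thread access pattern (the kernel name, buffer layout and read count are illustrative assumptions): with small offsets a warp's loads can be grouped into a few memory transactions, while large offsets defeat coalescing.

```cuda
// Each thread issues 1K reads, starting offset_bytes * tid into the buffer.
// With small offsets, neighbouring threads in a warp touch nearby addresses
// and their loads are coalesced; with large offsets, every thread's stream
// of reads maps to separate memory transactions.
__global__ void coalescing_probe(const int *buf, size_t offset_bytes, int *sink)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const int *p = buf + tid * (offset_bytes / sizeof(int));
    int sum = 0;
    for (int i = 0; i < 1024; ++i)        // 1K reads per thread
        sum += p[i];
    sink[tid] = sum;                      // prevent dead-code elimination
}
```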
SLIDE 21

Micro-benchmark #3: Memory Coalescing

[Figure: average read latency (us) vs. number of threads on the Fermi Quadro 2000, for offsets of 4 to 1024 bytes]

  • Large offsets cause a loss of locality; 192 or more threads combined with large offsets cause scheduler competition!
SLIDE 22

Micro-benchmark #3: Memory Coalescing

[Figure: average read latency (us) vs. number of threads on the Tesla K20, for offsets of 4 to 1024 bytes]

  • No scheduler competition – however, the overall latency is higher.
SLIDE 23

Other benchmarks...

SLIDE 24
SLIDE 25
  • Understanding GPU performance + power = understanding the microarchitecture!
  • ... However, the microarchitecture is usually kept secret.
  • Memory access patterns must be taken into consideration
  • Loss of locality, resource competition and synchronization have significant side-effects
  • Side-effects differ between GPU platforms (newer is not always better!)
SLIDE 26
  • Extend the focused benchmarks to other aspects of GPUs
  • Extend the work to analyze programs' behavior and correlate it with HW characterizations
  • Extend the work to other platforms, such as Xeon Phi
SLIDE 27