PMaC
Performance Modeling and Characterization
1/35 AsHES 2012
Modeling and Predicting Application Performance on Hardware Accelerators
Presented by: Alexander Breslow
Authors: Mitesh Meswani*, Laura Carrington*, Didem Unat, Allan Snavely, Scott Baden, and Steve Poole
*San Diego Supercomputer
[Figure: system diagram. Labels: Commodity Intel Server; Convey FPGA-based Co-processor.]
[Figure: GPU system diagram. Labels: x86 Host; Host Memory; SM0, SM1, SM15 (each: 32 Cores, 64KB L1 cache, Shared memory); L2 Cache; Device Memory.]
[Figure: Memory Bandwidth (GB/s) vs. Data Size (Bytes) for Stream: A[i]=B[i]. Series: BW_CPU, BW_FPGA, BW_GPU.]
[Figure: Memory Bandwidth (GB/s) vs. Data Size (Bytes) for Stencil: A[i]=B[i-1]+B[i+1]. Series: BW_CPU, BW_FPGA, BW_GPU.]
[Figure: Memory Bandwidth (GB/s) vs. Data Size (Bytes) for Transpose: A[i,j]=A[j,i]. Series: BW_CPU, BW_FPGA, BW_GPU.]
[Figure: Memory Bandwidth (GB/s) vs. Data Size (Bytes) for Reduction: sum+=A[i]. Series: BW_CPU, BW_FPGA, BW_GPU.]
[Figure: Memory Bandwidth (GB/s) vs. Data Size (Bytes) for Gather: A[i] = B[C[j]]. Series: BW_CPU, BW_FPGA, BW_GPU.]
[Figure: Memory Bandwidth (GB/s) vs. Data Size (Bytes) for Scatter: A[B[i]] = C[j]. Series: BW_CPU, BW_FPGA, BW_GPU.]
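The six idioms named in the plots above (stream, stencil, transpose, reduction, gather, scatter) can be sketched as plain loops. This is a minimal illustration of the access patterns only, not the benchmark code behind the plots. The function names, the out-of-place transpose, the single loop index in gather and scatter, and the untouched stencil boundaries are all assumptions made for the sketch.

```python
def stream(b):
    # Stream: A[i] = B[i] (unit-stride copy)
    return [b[i] for i in range(len(b))]


def stencil(b):
    # Stencil: A[i] = B[i-1] + B[i+1]
    # Assumption: boundary elements are left at 0.
    a = [0] * len(b)
    for i in range(1, len(b) - 1):
        a[i] = b[i - 1] + b[i + 1]
    return a


def transpose(m):
    # Transpose: out[i][j] = m[j][i]
    # Assumption: written out of place for clarity.
    rows, cols = len(m), len(m[0])
    return [[m[j][i] for j in range(rows)] for i in range(cols)]


def reduction(a):
    # Reduction: sum += A[i]
    total = 0
    for x in a:
        total += x
    return total


def gather(b, index):
    # Gather: A[i] = B[C[i]] (indirect read through an index array)
    return [b[j] for j in index]


def scatter(c, index, size):
    # Scatter: A[B[i]] = C[i] (indirect write through an index array)
    a = [0] * size
    for i, j in enumerate(index):
        a[j] = c[i]
    return a
```

Gather and scatter differ from stream only in the indirection through an index array, which is what makes their memory behavior data dependent.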
[Figure: Data Migration. Memory Bandwidth (GB/s) vs. Data Size (Bytes). Series: BW_FPGA, BW_GPU.]
Combining the idiom plots with the data migration costs shows how hard it is to determine the best achievable GPU or FPGA performance for a given data size. There is no clear winner among CPU, FPGA, and GPU: the best choice depends on the idiom and the dataset size.
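To make the trade-off concrete, here is a rough sketch that adds a one-time data migration cost to the time spent streaming data through an idiom, then picks the fastest device. It is not the model from the talk. The bandwidth figures in `HYPOTHETICAL_DEVICES` are invented placeholders, the `passes` parameter (how often the migrated data is reused) is an assumption, and the sketch ignores the fact that measured bandwidth varies with data size.

```python
def effective_time(nbytes, compute_bw_gbs, migration_bw_gbs=None, passes=1):
    """Seconds to run `passes` sweeps over nbytes, plus one optional migration."""
    gigabytes = nbytes / 1e9
    seconds = passes * gigabytes / compute_bw_gbs
    if migration_bw_gbs is not None:
        # Accelerators pay once to move the data from host memory.
        seconds += gigabytes / migration_bw_gbs
    return seconds


def best_device(nbytes, devices, passes=1):
    """Return (fastest device name, per-device times in seconds)."""
    times = {
        name: effective_time(nbytes, compute_bw, migration_bw, passes)
        for name, (compute_bw, migration_bw) in devices.items()
    }
    return min(times, key=times.get), times


# Hypothetical (idiom bandwidth GB/s, migration bandwidth GB/s) pairs.
# These are NOT values read from the plots.
HYPOTHETICAL_DEVICES = {
    "CPU": (10.0, None),   # data is already in host memory
    "GPU": (50.0, 2.0),
    "FPGA": (30.0, 1.0),
}

if __name__ == "__main__":
    for passes in (1, 100):
        winner, _ = best_device(1e9, HYPOTHETICAL_DEVICES, passes)
        print(passes, winner)
```

With these made-up numbers a single pass over 1 GB favors the CPU, because migration dominates, while 100 passes over the same data favor the GPU.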
File    Line#  Function  Idiom   Code
foo.c   623    Func1     gather  a[i] = b[d[j]]
tmp.c   992    Func2     stream  x[j] = c[i]
Application  Idiom                   Measured  Predicted  % Error
Hmmer        Stream (FPGA)           384.7     337.0      12.3%
Hmmer        Stream (GPU)            18.4      18.5       0.3%
Hmmer        Gather/Scatter (GPU)    0.074     0.087      17.3%
Flash        Gather/Scatter (FPGA)   69        68         1.4%
           G500 (CPU)  G500 (Ported)   Bfs speedup
Actual     5980        4686 (21.64%)   98x
Predicted  5847        4757 (18.65%)   96x
Application     Measured  Predicted  % Error
Milc (8cpu)     278       277        0.4%
Milc (256cpu)   1,345     1,350      0.4%
HYCOM (8cpu)    262       246        6.1%
HYCOM (256cpu)  809       663        18.1%
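The % Error column can be approximately recomputed from the Measured and Predicted columns. The sketch below assumes the error is the absolute difference relative to the measured value; the slides do not state the definition. Because the table entries are rounded, a recomputed value can differ from the printed one in the last digit.

```python
def percent_error(measured, predicted):
    """Absolute prediction error as a percentage of the measured value.

    Assumption: the slides do not define % Error; this is the common
    |measured - predicted| / measured form.
    """
    return abs(measured - predicted) / measured * 100.0


# (application, measured, predicted, % error as printed in the table)
ROWS = [
    ("Milc (8cpu)", 278, 277, 0.4),
    ("Milc (256cpu)", 1345, 1350, 0.4),
    ("HYCOM (8cpu)", 262, 246, 6.1),
    ("HYCOM (256cpu)", 809, 663, 18.1),
]

if __name__ == "__main__":
    for name, measured, predicted, reported in ROWS:
        recomputed = percent_error(measured, predicted)
        print(f"{name}: recomputed {recomputed:.2f}%, reported {reported}%")
```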
Contribution of idioms to runtime
Idiom           HYCOM (8cpu)  HYCOM (256cpu)  Milc (8cpu)  Milc (256cpu)
Gather/scatter  14.2%         4.6%            1.2%         0.7%
stream          21.1%         16.9%           5.6%         3.0%

Idiom instances in source code
Idiom           HYCOM  Milc
Gather/scatter  1,797  156
stream          1,300  105
HYCOM 256cpu    CPU     FPGA   GPU
Gather/Scatter  7,768   495    638
Stream          28,459  2,302  44,166
Total           36,556  2,798  44,803

MILC 256cpu     CPU     FPGA   GPU
Gather/Scatter  2,376   334    399
Stream          10,452  771    1,087
Total           12,827  1,104  1,487
HYCOM 256cpu    CPU vs. FPGA  CPU vs. GPU  Optimal of CPU, GPU, FPGA  CPU     FPGA   GPU
Gather/Scatter  495           638          448                        7,768   495    638
Stream          2,297         6,096        2,149                      28,459  2,302  44,166
Total           2,792         6,734        2,596                      36,556  2,798  44,803

MILC 256cpu     CPU vs. FPGA  CPU vs. GPU  Optimal of CPU, GPU, FPGA  CPU     FPGA   GPU
Gather/Scatter  334           399          334                        2,376   334    399
Stream          770           1,087        765                        10,452  771    1,087
Total           1,104         1,486        1,099                      12,827  1,104  1,487
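In the table above, the Optimal entries are lower than the corresponding single-device entries (for HYCOM gather/scatter, 448 against 495 on the FPGA alone). One plausible reading, which the slide text does not confirm, is that the best device is chosen per loop instance rather than once for the whole application. The sketch below compares the two strategies on invented per-loop costs; the loop data and function names are hypothetical.

```python
def best_single_device(loops):
    """Run every loop on one device; return (device, total cost)."""
    devices = loops[0].keys()
    totals = {d: sum(loop[d] for loop in loops) for d in devices}
    best = min(totals, key=totals.get)
    return best, totals[best]


def best_per_loop(loops):
    """Pick the cheapest device separately for each loop; return total cost."""
    return sum(min(loop.values()) for loop in loops)


# Hypothetical per-loop costs (arbitrary units), NOT taken from the slides.
HYPOTHETICAL_LOOPS = [
    {"CPU": 5, "FPGA": 9, "GPU": 7},
    {"CPU": 40, "FPGA": 4, "GPU": 6},
    {"CPU": 30, "FPGA": 8, "GPU": 3},
]

if __name__ == "__main__":
    print(best_single_device(HYPOTHETICAL_LOOPS))
    print(best_per_loop(HYPOTHETICAL_LOOPS))
```

Per-loop selection can never cost more than the best single device, since every loop may still pick that device. The sketch ignores any cost of moving data between devices from one loop to the next.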
Idiom                      HYCOM   Milc
Gather/scatter             1797    156
reduction                  110     22
stream                     1300    105
stencil                    132
transpose                  3986    286
Mat-Mat Mult               2161    6
Mat-Vec Mult               115     2
Fraction of Loops Covered  67.45%  37.44%
Idiom           HYCOM (8cpu)  HYCOM (256cpu)  Milc (8cpu)  Milc (256cpu)
Gather/scatter  14.2%         4.6%            1.2%         0.7%
reduction       0.0%          0.1%            15.7%        13.9%
stream          21.1%         16.9%           5.6%         3.0%
stencil         4.7%          11.1%           0.0%         0.0%
transpose       0.9%          2.0%            0.0%         0.0%
Mat-Mat Mult    23.7%         8.6%            61.2%        58.6%
Mat-Vec Mult    0.0%          0.1%            10.5%        16.7%
All Idioms      64.6%         43.4%           94.2%        93.2%
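The All Idioms row gives the fraction of runtime that idiom-level acceleration can touch. As a rough illustration that is not part of the talk's model, Amdahl's law bounds the whole-application speedup when only that fraction is accelerated. The function name and the choice of an infinitely fast accelerator as the limiting case are assumptions of this sketch.

```python
def amdahl_speedup(fraction, accel):
    """Overall speedup when `fraction` of runtime is sped up by `accel`.

    Amdahl's law: 1 / ((1 - f) + f / s). Passing accel=float("inf")
    gives the upper bound 1 / (1 - f).
    """
    return 1.0 / ((1.0 - fraction) + fraction / accel)


# "All Idioms" runtime fractions copied from the table above.
COVERAGE = {
    "HYCOM (8cpu)": 0.646,
    "HYCOM (256cpu)": 0.434,
    "Milc (8cpu)": 0.942,
    "Milc (256cpu)": 0.932,
}

if __name__ == "__main__":
    for name, fraction in COVERAGE.items():
        bound = amdahl_speedup(fraction, float("inf"))
        print(f"{name}: at most {bound:.1f}x from accelerating idioms alone")
```

The lower the idiom coverage, the more the remaining non-idiom code limits the overall gain, no matter how fast the accelerator runs the idioms.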