
Modeling and Predicting Application Performance on Hardware Accelerators

Presented by: Alexander Breslow
Authors: Mitesh Meswani*, Laura Carrington*, Didem Unat, Allan Snavely, Scott Baden, and Steve Poole
*San Diego Supercomputer Center, Performance Modeling and Characterization Lab (PMaC)

AsHES 2012


PMaC Lab

• Goal: understand the factors that affect the runtime and, more recently, energy performance of HPC applications on current and future HPC systems
• The PMaC framework provides fast and accurate predictions
  – Input: software characteristics, input data, hardware parameters
  – Output: a model that predicts expected performance
• Tools: PEBIL, PMaCInst, PSiNSTracer, Etracer, IOTracer, ShmemTracer, PIR
• Simulation: PSiNS, PSaPP


Prediction framework


Outline
• Introduction
• Methodology: developing models for FPGAs and GPUs
• Results: workload predictions on accelerators
• References


Why Accelerators?
• Traditional processors
  – Solve the common case
  – Limited performance for specialized functions
• Solution: use special-purpose co-processors, i.e., hardware accelerators
  – Examples: FPGA, GPU


Application porting is time consuming
• HPC apps can exceed 100,000 lines of code
• The best choice of accelerator is not apparent up front
• It is prudent to evaluate the benefit prior to porting
• Solution: performance prediction models
  – Allow fast evaluation without porting or running
  – Accuracy has to be high to be valuable


Methodology
• First identify code sections that may benefit from accelerators
• HPC applications can be expressed by a small set of commonly occurring compute and data-access patterns, also called idioms (e.g., transpose, reduction)
• Predict the performance of idiom instances on the accelerators
• Port only the instances that are predicted to run faster


Our Study
• Accelerators: Convey HC-1 FPGA system and NVIDIA Fermi GPU (Tesla C2070)
• Characterize the accelerators for 8 common HPC idioms
• Develop and validate idiom models on two real-world benchmarks
• Present a case study of a hypothetical supercomputer with FPGAs and GPUs; for two popular HPC apps we predict speedups of up to 20%


What are Idioms?
• An idiom is a pattern of computation and memory access
• Example: stream copy

    for (int i = 0; i < n; i++)
        A[i] = B[i];


Idioms Used
• Stream: A[i] = B[i] + C[i]
• Gather: A[i] = B[C[i]]
• Scatter: A[C[i]] = B[i]
• Transpose: A[i][j] = B[j][i]
• Reduction: s = s + A[i]
• Stencil: A[i] = A[i-1] + A[i+1]
• Matrix-Vector Multiply: C[i] += A[i][j]*B[j]
• Matrix-Matrix Multiply: C[i][j] += A[i][k]*B[k][j]
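For concreteness, here is a minimal C sketch of a few of these idioms written out as loop nests. It is illustrative only; the function names, array arguments, and double-precision element type are assumptions, not taken from the benchmarks used in the study.

    /* Illustrative idiom kernels; n is the problem size. */
    void stream(double *A, const double *B, const double *C, int n) {
        for (int i = 0; i < n; i++)
            A[i] = B[i] + C[i];            /* stream: unit-stride read/write */
    }

    void gather(double *A, const double *B, const int *C, int n) {
        for (int i = 0; i < n; i++)
            A[i] = B[C[i]];                /* gather: indirect read */
    }

    double reduction(const double *A, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += A[i];                     /* reduction to a scalar */
        return s;
    }

    void stencil(double *A, const double *B, int n) {
        for (int i = 1; i < n - 1; i++)
            A[i] = B[i - 1] + B[i + 1];    /* 3-point stencil */
    }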


Hardware Accelerator #1 – Convey HC-1 FPGA

(Diagram: a commodity Intel server host paired with the Convey FPGA-based co-processor.)


Hardware Accelerator #2 – NVIDIA Tesla C2070 (Fermi) GPU

(Diagram: x86 host with host memory, connected to the GPU; 16 streaming multiprocessors SM0–SM15, each with 32 cores and 64 KB of L1 cache/shared memory; shared L2 cache; device memory.)


Accelerator Characterizations
• Simple benchmarks profile the capability of the GPU, FPGA, and CPU to perform idiom operations
• Each benchmark ranges over different memory sizes


Stream, Stencil

(Plots: measured memory bandwidth (GB/s) vs. data size (bytes) for BW_CPU, BW_FPGA, and BW_GPU, for the Stream idiom A[i]=B[i] and the Stencil idiom A[i]=B[i-1]+B[i+1].)


Transpose, Reduction

(Plots: measured memory bandwidth (GB/s) vs. data size (bytes) for BW_CPU, BW_FPGA, and BW_GPU, for the Transpose idiom A[i,j]=A[j,i] and the Reduction idiom sum+=A[i].)


Gather, Scatter

(Plots: measured memory bandwidth (GB/s) vs. data size (bytes) for BW_CPU, BW_FPGA, and BW_GPU, for the Gather idiom A[i]=B[C[j]] and the Scatter idiom A[B[i]]=C[j].)


Cost of Data Migration

(Plot: effective bandwidth (GB/s) vs. data size (bytes) for host-to-device data migration, BW_FPGA and BW_GPU.)

Combining the idiom plots with the data-migration costs shows how complex it is to determine the best achievable performance on the GPU or FPGA for a given data size. There is no clear winner among CPU, FPGA, and GPU; the best device depends on the idiom and on the data-set size.
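To make the selection logic concrete, here is a minimal C sketch of the kind of decision these measurements imply: estimate the time on each device from the idiom bandwidth at the given data size, add the data-migration cost for the accelerators, and pick the minimum. The function names, lookup interface, and all numeric constants are hypothetical placeholders, not PMaC's actual model code or measured values.

    #include <float.h>

    enum device { DEV_CPU, DEV_FPGA, DEV_GPU, NUM_DEV };

    /* Placeholder: a real model would interpolate the measured idiom
     * bandwidth curves (GB/s vs. data size) shown above. */
    static double idiom_bandwidth(enum device d, int idiom, double bytes)
    {
        (void)idiom; (void)bytes;
        static const double peak[NUM_DEV] = { 20.0, 40.0, 60.0 }; /* illustrative only */
        return peak[d];
    }

    /* Placeholder: host<->device migration bandwidth in GB/s. */
    static double migration_bandwidth(enum device d, double bytes)
    {
        (void)d; (void)bytes;
        return 2.0; /* illustrative only, not a measurement */
    }

    /* Pick the device with the lowest predicted time for one idiom instance. */
    enum device best_device(int idiom, double bytes)
    {
        enum device best = DEV_CPU;
        double best_t = DBL_MAX;
        for (int d = DEV_CPU; d < NUM_DEV; d++) {
            double t = bytes / (idiom_bandwidth((enum device)d, idiom, bytes) * 1e9);
            if (d != DEV_CPU)   /* accelerators also pay the data-migration cost */
                t += bytes / (migration_bandwidth((enum device)d, bytes) * 1e9);
            if (t < best_t) { best_t = t; best = (enum device)d; }
        }
        return best;
    }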


Application Characterizations – finding idioms
• PMaC Idiom Recognizer (PIR):
  – A GCC plugin that recognizes idioms during compilation using IR tree analysis
  – Users can specify additional idioms using PIR's idiom expression syntax

Example PIR output:

File    Line#  Function  Idiom   Code
foo.c   623    Func1     gather  a[i] = b[d[j]]
tmp.c   992    Func2     stream  x[j] = c[i]


Application Characterizations – finding the data size per idiom
• PEBIL – binary instrumentation tool
  – To find the data size for an idiom:
    • Determine the basic blocks belonging to the idiom
    • Instrument those basic blocks to capture the address range they touch
  – Run the instrumented binary and generate traces

A minimal sketch of the address-range idea follows this list.
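As a rough illustration of what such instrumentation collects, the sketch below tracks the minimum and maximum addresses referenced by a code region and reports their span as that region's data size. This is a hand-written analogue under stated assumptions; it is not PEBIL's API or output format.

    #include <stdint.h>
    #include <stdio.h>

    /* Running address range for one instrumented code region. */
    static uintptr_t range_min = UINTPTR_MAX;
    static uintptr_t range_max = 0;

    /* Conceptually called by the inserted instrumentation on every
     * load/store issued by the idiom's basic blocks. */
    static void record_access(const void *addr, size_t bytes)
    {
        uintptr_t lo = (uintptr_t)addr;
        uintptr_t hi = lo + bytes;
        if (lo < range_min) range_min = lo;
        if (hi > range_max) range_max = hi;
    }

    int main(void)
    {
        double A[1000], B[1000] = {0};
        for (int i = 0; i < 1000; i++) {        /* stream-copy idiom */
            record_access(&B[i], sizeof B[i]);
            A[i] = B[i];
            record_access(&A[i], sizeof A[i]);
        }
        printf("data size: %zu bytes\n", (size_t)(range_max - range_min));
        return 0;
    }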


Prediction Models


Model Validation – Fine-grained

• Hmmer: protein sequence code, run with 8 tasks on the GPU and FPGA systems
• Flash: astrophysics code, sequential version run on the FPGA system

Application  Idiom                  Measured  Predicted  % Error
Hmmer        Stream (FPGA)          384.7     337.0      12.3%
Hmmer        Stream (GPU)           18.4      18.5       0.3%
Hmmer        Gather/Scatter (GPU)   0.074     0.087      17.3%
Flash        Gather/Scatter (FPGA)  69        68         1.4%


Model Validation – Graph500
• FPGA validation:
  – We ran a scale-24 problem: 13 MTEPS
  – PIR analysis identifies the scatter and stream idioms in make_bfs
  – make_bfs was ported by Convey to the FPGA; the rest runs on the CPU
  – We use the CPU and FPGA models to predict speedups

             G500 (CPU)  G500 (Ported)   BFS speedup
Actual       5980        4686 (21.64%)   98x
Predicted    5847        4757 (18.65%)   96x


Projection Study
• Study a production HPC system – Jaguar, a Cray XT5
  – 224,256 AMD cores, 300 TB of memory
• Applications:
  – HYCOM – 8- and 256-CPU runs
  – MILC – 8- and 256-CPU runs
• Q: What would be the projected speedup for an application running on a machine like Jaguar, but with an FPGA and a GPU on each node?


Results – CPU predictions

Application      Measured  Predicted  % Error
MILC (8 CPU)     278       277        0.4%
MILC (256 CPU)   1,345     1,350      0.4%
HYCOM (8 CPU)    262       246        6.1%
HYCOM (256 CPU)  809       663        18.1%


Idiom instances and runtime %

Contribution of idioms to runtime:

Idiom           HYCOM (8 CPU)  HYCOM (256 CPU)  MILC (8 CPU)  MILC (256 CPU)
Gather/scatter  14.2%          4.6%             1.2%          0.7%
Stream          21.1%          16.9%            5.6%          3.0%

Idiom instances in source code:

Idiom           HYCOM  MILC
Gather/scatter  1,797  156
Stream          1,300  105


Run times of all idiom instances on one device

HYCOM (256 CPU)  CPU     FPGA   GPU
Gather/Scatter   7,768   495    638
Stream           28,459  2,302  44,166
Total            36,556  2,798  44,803

MILC (256 CPU)   CPU     FPGA   GPU
Gather/Scatter   2,376   334    399
Stream           10,452  771    1,087
Total            12,827  1,104  1,487


Optimal mapping of idioms to devices

HYCOM (256 CPU)  CPU vs. FPGA  CPU vs. GPU  Optimal of CPU, GPU, FPGA  CPU     FPGA   GPU
Gather/Scatter   495           638          448                        7,768   495    638
Stream           2,297         6,096        2,149                      28,459  2,302  44,166
Total            2,792         6,734        2,596                      36,556  2,798  44,803

MILC (256 CPU)   CPU vs. FPGA  CPU vs. GPU  Optimal of CPU, GPU, FPGA  CPU     FPGA   GPU
Gather/Scatter   334           399          334                        2,376   334    399
Stream           770           1,087        765                        10,452  771    1,087
Total            1,104         1,486        1,099                      12,827  1,104  1,487

(The "CPU vs. FPGA" and "CPU vs. GPU" columns give the best achievable time when each idiom instance may run on either of those two devices; the "Optimal" column allows all three. The last three columns repeat the single-device times from the previous slide for reference.)


Summary of Porting Results
Porting the idioms to the best device yields overall application improvements of 3.4% for MILC and 20% for HYCOM.


Conclusions and Future Work
• The best device choice (CPU, GPU, FPGA) depends on the idiom and the data size
• Extend the approach to other idioms and model data transfer costs
• Adapt the approach to model energy consumption


References

• M. R. Meswani, L. Carrington, et al., "Modeling and Predicting Performance on Hardware Accelerators," to appear as a poster at the IEEE International Symposium on Workload Characterization (IISWC), 2011.
• C. Olschanowsky, M. R. Meswani, et al., "PIR: PMaC's Idiom Recognizer," in Proceedings of Parallel Software Tools and Tool Infrastructures (PSTI 2010), held in conjunction with ICPP 2010, San Diego, CA, September 2010.
• L. Carrington, M. Tikir, et al., "An Idiom-finding Tool for Increasing Productivity of Accelerators," in Proceedings of the International Conference on Supercomputing (ICS), 2011.


Questions?
• Thank you for your attention!
• PMaC URL: www.sdsc.edu/pmac
• Email: mitesh@sdsc.edu


Extra Slides


Model Validation – Graph500
• GPU validation underway:
  – Our CPU predictions are within 13%
  – A CUDA version has been implemented; the measured GPU version speeds up make_bfs by 3x-5x
  – Optimization is ongoing work


Model Validation – Entire Application Run
• Graph500: a data-intensive benchmark
  – Two main kernels: make_bfs and verify_bfs
  – Briefly, the benchmark proceeds in two steps:
    – First, it generates the graph based on the scale factor
    – Second, it samples 64 random keys and, for each key:
      • Executes make_bfs to generate the parent array, and then
      • Executes the validate routine on the parent array
  – The performance metric, TEPS (traversed edges per second), is calculated from the time spent in make_bfs

A pseudocode-style sketch of this flow appears after this list.
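The C sketch below outlines the benchmark flow just described. It illustrates only the two-step structure; the types and function names are hypothetical placeholders, not the actual Graph500 reference code.

    /* Hypothetical types and routines standing in for the Graph500 reference code. */
    typedef struct graph graph_t;
    graph_t *generate_graph(int scale);
    long     sample_random_key(const graph_t *g);
    long    *make_bfs(const graph_t *g, long root);       /* returns the parent array */
    int      verify_bfs(const graph_t *g, long root, const long *parent);
    double   edges_traversed(const graph_t *g, const long *parent);
    double   now(void);                                   /* wall-clock seconds */

    /* Outline of the benchmark: build the graph, then time make_bfs for
     * 64 sampled roots and validate each result. */
    double run_graph500(int scale)
    {
        graph_t *g = generate_graph(scale);        /* step 1: generate the graph */
        double teps_sum = 0.0;
        for (int k = 0; k < 64; k++) {             /* step 2: 64 sampled keys */
            long root = sample_random_key(g);
            double t0 = now();
            long *parent = make_bfs(g, root);      /* only this kernel counts toward TEPS */
            double t1 = now();
            verify_bfs(g, root, parent);           /* validation is not timed */
            teps_sum += edges_traversed(g, parent) / (t1 - t0);
        }
        return teps_sum / 64.0;                    /* mean TEPS: traversed edges per second */
    }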


Idiom Code Coverage

Idiom                      HYCOM   MILC
Gather/scatter             1,797   156
Reduction                  110     22
Stream                     1,300   105
Stencil                    132
Transpose                  3,986   286
Mat-Mat Mult               2,161   6
Mat-Vec Mult               115     2
Fraction of loops covered  67.45%  37.44%


Idiom runtime coverage

Idiom           HYCOM (8 CPU)  HYCOM (256 CPU)  MILC (8 CPU)  MILC (256 CPU)
Gather/scatter  14.2%          4.6%             1.2%          0.7%
Reduction       0.0%           0.1%             15.7%         13.9%
Stream          21.1%          16.9%            5.6%          3.0%
Stencil         4.7%           11.1%            0.0%          0.0%
Transpose       0.9%           2.0%             0.0%          0.0%
Mat-Mat Mult    23.7%          8.6%             61.2%         58.6%
Mat-Vec Mult    0.0%           0.1%             10.5%         16.7%
All idioms      64.6%          43.4%            94.2%         93.2%


Memory Time