Polyhedral-Based Data Reuse Optimization for Configurable Computing
Louis-Noël Pouchet¹, Peng Zhang¹, P. Sadayappan², Jason Cong¹
¹ University of California, Los Angeles; ² The Ohio State University
February 12, 2013, ACM/SIGDA International
Overview: FPGA’13
UCLA / OSU 2
The Polyhedral Model: FPGA’13
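Iteration domain of S1:

$$
\mathcal{D}_{S1} =
\begin{bmatrix}
 1 &  0 & 0 & -1 \\
-1 &  0 & 1 &  0 \\
 0 &  1 & 0 & -1 \\
 0 & -1 & 1 &  0 \\
-1 & -1 & 1 &  2
\end{bmatrix}
\cdot
\begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix}
\ge \vec{0}
$$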
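To make the matrix form of an iteration domain concrete, the following is a minimal sketch in C. It assumes the constraint matrix above encodes 1 ≤ i ≤ n, 1 ≤ j ≤ n and i + j ≤ n + 2; the placement of the zero entries is an assumption, since only the nonzero coefficients are legible. The loop nest in `count_by_loops` and all function names are hypothetical. The sketch checks that scanning points against the matrix finds exactly the iterations the loop nest executes.

```c
#include <assert.h>

/* Assumed constraint matrix; columns multiply (i, j, n, 1). */
static const int D_S1[5][4] = {
    { 1,  0, 0, -1},  /* i >= 1         */
    {-1,  0, 1,  0},  /* i <= n         */
    { 0,  1, 0, -1},  /* j >= 1         */
    { 0, -1, 1,  0},  /* j <= n         */
    {-1, -1, 1,  2},  /* i + j <= n + 2 */
};

/* Returns 1 when every row of D_S1 . (i, j, n, 1)^T is non-negative. */
static int in_domain(int i, int j, int n) {
    const int x[4] = {i, j, n, 1};
    for (int r = 0; r < 5; r++) {
        int dot = 0;
        for (int c = 0; c < 4; c++)
            dot += D_S1[r][c] * x[c];
        if (dot < 0)
            return 0;
    }
    return 1;
}

/* Counts domain points by testing a bounding box against the matrix. */
static int count_by_constraints(int n) {
    int count = 0;
    for (int i = 0; i <= n + 1; i++)
        for (int j = 0; j <= n + 1; j++)
            count += in_domain(i, j, n);
    return count;
}

/* Counts executions of S1 in a hypothetical loop nest with the same bounds. */
static int count_by_loops(int n) {
    int count = 0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            if (i <= n - j + 2)
                count++;  /* one execution of S1 */
    return count;
}

/* Both views of the domain must agree for a given parameter value n. */
static void check_equivalence(int n) {
    assert(count_by_constraints(n) == count_by_loops(n));
}
```

Calling `check_equivalence` for a few values of n is a quick way to confirm that the matrix and the loop bounds describe the same set of iterations.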
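Dependence polyhedron from S1 to S2 (first row: = 0; remaining rows: ≥ 0):

$$
\mathcal{D}_{S1 \delta S2} :
\begin{bmatrix}
 1 & -1 &  0 &  0 \\
 1 &  0 &  0 & -1 \\
-1 &  0 &  0 &  3 \\
 0 &  1 &  0 & -1 \\
 0 & -1 &  0 &  3 \\
 0 &  0 &  1 & -1 \\
 0 &  0 & -1 &  3
\end{bmatrix}
\cdot
\begin{pmatrix} i_{S1} \\ i_{S2} \\ j_{S2} \\ 1 \end{pmatrix}
\begin{matrix} = 0 \\ \ge \vec{0} \end{matrix}
$$

[Figure: S1 iterations and S2 iterations]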
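A dependence polyhedron can be read as a set of pairs of iterations that must stay ordered. The following is a minimal sketch in C, assuming the matrix above encodes i_S1 = i_S2 together with 1 ≤ i_S1 ≤ 3, 1 ≤ i_S2 ≤ 3 and 1 ≤ j_S2 ≤ 3; the zero entries are an assumption, and the function names are hypothetical. It enumerates the integer points of the polyhedron by brute force, which is only practical for such a tiny example.

```c
#include <assert.h>

#define NROWS 7

/* Assumed dependence constraints; columns multiply (iS1, iS2, jS2, 1).
 * Row 0 is an equality (= 0); all other rows are inequalities (>= 0). */
static const int DEP[NROWS][4] = {
    { 1, -1,  0,  0},  /* iS1 == iS2 */
    { 1,  0,  0, -1},  /* iS1 >= 1   */
    {-1,  0,  0,  3},  /* iS1 <= 3   */
    { 0,  1,  0, -1},  /* iS2 >= 1   */
    { 0, -1,  0,  3},  /* iS2 <= 3   */
    { 0,  0,  1, -1},  /* jS2 >= 1   */
    { 0,  0, -1,  3},  /* jS2 <= 3   */
};

/* Returns 1 when iteration (iS2, jS2) of S2 depends on iteration iS1 of S1. */
static int depends(int iS1, int iS2, int jS2) {
    const int x[4] = {iS1, iS2, jS2, 1};
    for (int r = 0; r < NROWS; r++) {
        int dot = 0;
        for (int c = 0; c < 4; c++)
            dot += DEP[r][c] * x[c];
        if (r == 0 ? dot != 0 : dot < 0)
            return 0;
    }
    return 1;
}

/* Counts the dependent pairs by scanning a box that encloses the polyhedron. */
static int count_dependences(void) {
    int count = 0;
    for (int iS1 = 0; iS1 <= 4; iS1++)
        for (int iS2 = 0; iS2 <= 4; iS2++)
            for (int jS2 = 0; jS2 <= 4; jS2++)
                if (depends(iS1, iS2, jS2)) {
                    assert(iS1 == iS2);  /* enforced by the equality row */
                    count++;
                }
    return count;
}
```

Under these assumptions each of the three S1 iterations is linked to the three S2 iterations that share its value of i.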
Data Reuse Optimization: FPGA’13
• Example: for a buffer A_l[bs1][bs2], A[i][j] = A[i][j+1]; becomes A_l[i % bs1][j % bs2] = A_l[i % bs1][(j+1) % bs2];
• Example of copy-in statement: A_l[i % bs1][j % bs2] = A[i][j];
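The buffer rewriting above can be sketched end to end in C. This is a minimal illustration, not the code PolyOpt generates: the array extent `N`, the tile sizes `TI` and `TJ`, the copy-out loop, and the function names are hypothetical, and the rewritten compute statement is inferred from the modulo mapping of the copy-in statement. The buffer gets one extra column because the statement reads column j+1.

```c
#include <assert.h>

#define N   8          /* hypothetical array extent */
#define TI  4          /* hypothetical tile height */
#define TJ  4          /* hypothetical tile width */
#define BS1 TI         /* buffer rows: one per tile row */
#define BS2 (TJ + 1)   /* buffer columns: tile width plus one for the j+1 read */

/* Reference version operating directly on A (standing in for off-chip memory). */
static void shift_reference(int A[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N - 1; j++)
            A[i][j] = A[i][j + 1];
}

/* Tiled version: copy-in, compute on the buffer A_l, then copy-out. */
static void shift_buffered(int A[N][N]) {
    int A_l[BS1][BS2];  /* standing in for an on-chip buffer */

    /* The modulo mapping is collision-free only if the buffer covers the tile footprint. */
    assert(BS1 >= TI && BS2 >= TJ + 1);

    for (int ii = 0; ii < N; ii += TI)
        for (int jj = 0; jj < N - 1; jj += TJ) {
            int iend = (ii + TI < N) ? ii + TI : N;          /* exclusive bound */
            int jend = (jj + TJ < N - 1) ? jj + TJ : N - 1;  /* exclusive bound */

            /* Copy-in: every element the tile reads, including column jend. */
            for (int i = ii; i < iend; i++)
                for (int j = jj; j <= jend; j++)
                    A_l[i % BS1][j % BS2] = A[i][j];

            /* Compute: the original statement with its accesses redirected to A_l. */
            for (int i = ii; i < iend; i++)
                for (int j = jj; j < jend; j++)
                    A_l[i % BS1][j % BS2] = A_l[i % BS1][(j + 1) % BS2];

            /* Copy-out: every element the tile wrote. */
            for (int i = ii; i < iend; i++)
                for (int j = jj; j < jend; j++)
                    A[i][j] = A_l[i % BS1][j % BS2];
        }
}
```

Running both functions on identical inputs should leave identical arrays. Shrinking `BS2` below `TJ + 1` would make two columns of a tile alias the same buffer cell, which is why the sketch guards the sizes with an assertion.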
High-Level Synthesis: FPGA’13
Experimental Results: FPGA’13
[Figure: Pareto-optimal design points for Denoise, Segmentation, and DGEMM. Axes: total BRAMs (in 16kB blocks) and total execution time (in cycles).]
Benchmark      Description                               basic off-chip   PolyOpt      hand-tuned [17]
denoise        3D Jacobi+Seidel-like 7-point stencils    0.02 GF/s        4.58 GF/s    52.0 GF/s
segmentation   3D Jacobi-like 7-point stencils           0.05 GF/s        24.91 GF/s   23.39 GF/s
DGEMM          matrix-multiplication                     0.04 GF/s        22.72 GF/s   N/A
GEMVER         sequence of matrix-vector                 0.10 GF/s        1.07 GF/s    N/A
Software Infrastructure: FPGA’13
Components:
• Parser: C-to-AST
• Unparser: AST-to-C
• PolyParser: AST-to-polyhedra
• PolyUnparser: PAST-to-AST
• Outliner: restructure code for HLS
• Candl: dependence analysis
• Pluto: tilability
• vectorizer: inner-parallel
• LMP: buffer and comm. generation
• CLooG: Polyhedra-to-PAST
• PIPLib, ISL

Representations: C code, Sage AST (ROSE), SCoP (polyhedral rep.), PAST (Polyhedral AST)
Flow: input full C program to HLS-friendly C program

Frameworks:
• PoCC, the Polyhedral Compiler Collection
• PolyOpt, a Polyhedral Optimizer for the ROSE compiler
• ROSE compiler infrastructure (LLNL)

More at http://www.cs.ucla.edu/~pouchet/software/polyopthls
Conclusion: FPGA’13