This project and the research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 288777.
http://www.montblanc-project.eu
Jean-François Méhaut
The New Killer Processors
Overview of the Mont-Blanc projects
BOAST DSL for computing kernels
Corse: Compiler Optimization and Run-time SystEms ∗
Fabrice Rastello
∗Inria Joint Project Team (proposal)
June 9, 2015
Fabrice Rastello (Inria) Corse June 9, 2015 1 / 26
Project-team composition / Institutional context
Joint Project-Team (Inria, Grenoble INP , UJF) in the LIG laboratory @ Giant/Minatec
Fabrice Rastello, Florent Bouchez Tichadou, François Broquedis, Frédéric Desprez, Yliès Falcone, Jean-François Méhaut
8 PhD, 3 Post-doc, 1 Engineer
Permanent member curriculum vitae
Florent Bouchez Tichadou, MdC UJF (PhD Lyon 2009, 1Y Bangalore, 3Y Kalray, Nanosim): compiler optimization, compiler back-end
François Broquedis, MdC INP (PhD Bordeaux 2010, 1Y Mescal, 3Y Moais): runtime systems, OpenMP, memory management
Frédéric Desprez (DR1 Inria: Graal, Avalon): parallel algorithmic, numerical libraries
Yliès Falcone, MdC UJF (PhD Grenoble 2009, 2Y Rennes, Vasco): validation, enforcement, debugging, runtime
Jean-François Méhaut, Pr UJF (Mescal, Nanosim): runtime, debugging, memory management, scientific applications
Fabrice Rastello, CR1 Inria (PhD Lyon 2000, 2Y STMicro, Compsys, GCG): compiler optimization, graph theory, compiler back-end, automatic parallelization
Overall Objectives
Domain : Compiler optimization and runtime systems for performance and energy consumption (not reliability, nor WCET)
Issues : Scalability and heterogeneity/complexity ≡ trade-off between specific optimizations and programmability/portability
Target architectures : VLIW / SIMD / embedded / many-cores / heterogeneity
Applications : dynamic-systems / loop-nests / graph-algorithmic / signal-processing
Approach : combine static/dynamic & compiler/run-time
1999, 3.1 TFLOPS
[Figure: MFLOPS (log scale, 10 to 10,000) versus year (1974 to 1999) for Cray-1, Cray-C90, NEC SX4, SX5, Alpha EV4, EV5, Intel Pentium, IBM P2SC, HP PA8200]
[Figure: MFLOPS (log scale, 100 to 1,000,000) versus year (1990 to 2015) for Alpha, Intel, AMD, NVIDIA Tegra, Samsung Exynos, 4-core ARMv8 1.5 GHz]
1. Leaked Tegra3 price from the Nexus 7 Bill of Materials
2. Non-discounted list price for the 8-core Intel E5 SandyBridge
Tag     Full name                          Properties
vecop   Vector operation                   Common operation in numerical codes
dmmm    Dense matrix-matrix multiply       Data reuse and compute performance
3dstc   3D volume stencil                  Strided memory accesses (7-point 3D stencil)
2dcon   2D convolution                     Spatial locality
fft     1D FFT transform                   Peak floating-point, variable stride accesses
red     Reduction operation                Varying levels of parallelism
hist    Histogram calculation              Local privatization and reduction stage
msort   Generic merge sort                 Barrier synchronization
nbody   N-body calculation                 Irregular memory accesses
amcd    Markov chain Monte-Carlo method    Embarrassingly parallel
spvm    Sparse matrix-vector multiply      Load imbalance

Programming models : pthreads, OpenMP, OmpSs, CUDA, OpenCL
Q7 carrier board : 2 x Cortex-A9, 2 GFLOPS, 1 GbE + 100 MbE, 7 Watts, 0.3 GFLOPS / W
Q7 Tegra 2 : 2 x Cortex-A9 @ 1GHz, 2 GFLOPS, 5 Watts (?), 0.4 GFLOPS / W
1U Rackable blade : 8 nodes, 16 GFLOPS, 65 Watts, 0.25 GFLOPS / W
2 Racks : 32 blade containers, 256 nodes, 512 cores, 9x 48-port 1GbE switch, 512 GFLOPS, 3.4 Kwatt, 0.15 GFLOPS / W
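The efficiency figures above follow from dividing peak GFLOPS by power. The short Ruby sketch below redoes that arithmetic with the numbers quoted on the slide; the table layout and the helper name `gflops_per_watt` are illustrative choices, not part of the original material.

```ruby
# Energy efficiency (GFLOPS per watt) for each level of the prototype,
# using the peak GFLOPS and power figures quoted on the slide.
LEVELS = {
  "Q7 carrier board"  => { gflops: 2.0,   watts: 7.0 },
  "Q7 Tegra 2"        => { gflops: 2.0,   watts: 5.0 },    # power is marked "(?)" in the source
  "1U Rackable blade" => { gflops: 16.0,  watts: 65.0 },
  "2 Racks"           => { gflops: 512.0, watts: 3400.0 }  # 3.4 kW
}

# Peak GFLOPS divided by power draw in watts.
def gflops_per_watt(level)
  level[:gflops] / level[:watts]
end

LEVELS.each do |name, level|
  puts format("%-18s %.2f GFLOPS/W", name, gflops_per_watt(level))
end
```

The computed values round to the figures on the slide; efficiency drops at each level of integration because networking and packaging add power without adding floating-point capacity.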
[Figure: Mont-Blanc software stack]
Source files (C, C++, FORTRAN, …)
Compiler(s) : gcc, gfortran, OmpSs, …
Executable(s)
Scientific libraries : ATLAS, FFTW, HDF5, …
Developer tools : Paraver, Scalasca, …
Cluster management (Slurm)
Runtime : OmpSs runtime library (NANOS++), CUDA, OpenCL, MPI, GASNet
Operating system : Linux
Hardware : CPU, GPU
Thanks to Gabor Dozsa and Chris Adeniyi-Jones for their OpenMX results
Column groups : Server chips, Mobile chips
Per-node figure : Intel SandyBridge (E5-2670) | AppliedMicro X-Gene | Calxeda EnergyCore (“Midway”) | TI Keystone II | Nvidia Tegra4 | Samsung Exynos 5 Octa
#cores : 8 | 16-32 | 4 | 4 | 4 | 4+4
CPU : Sandy Bridge | Custom ARMv8 | Cortex-A15 | Cortex-A15 | Cortex-A15 | Cortex-A15 + Cortex-A7
Technology : 32nm | 40nm | 28nm | 28nm | 28nm
Clock speed : 2.6GHz | 3GHz | 2GHz | 1.9GHz | 1.8GHz
Memory size : 750GB | ? | 4GB | 4GB | 4GB | 4GB
Memory bandwidth : 51.2GB/s | 80 GB/s | 12.8 GB/s | 12.8 GB/s | 12.8 GB/s
ECC in DRAM : Yes | Yes | Yes | Yes | No | No
I/O bandwidth : 80GB/s | ? | 4 x 10 Gb/s | 10 Gb/s | 6 Gb/s * | 6 Gb/s *
I/O interface : PCIe | Integrated | Integrated | Integrated | USB 3.0 | USB 3.0
Protocol offload (in the NIC) : Yes | Yes | Yes | No | No
Introduction Case Studies A Parametrized Generator Evaluation Conclusions and Future Work
The Mont-Blanc European Projects
Mont-Blanc 1 (2011-2015) Mont-Blanc 2 (2013-2016) :
Develop prototypes of HPC clusters using low power commercially available embedded technology (ARM CPUs, low power GPUs...).
Design the next generation in HPC systems based on embedded technologies and experiments on the prototypes.
Develop a portfolio of existing applications to test these systems and optimize their efficiency, using BSC’s OmpSs programming model (11 existing applications were selected for this portfolio).
Build Software Stack (OS, runtime, performance tools,...)
Prototype : based on Exynos 5250 : ARM dual core Cortex A15 with T604 Mali GPU (OpenCL)
BigDFT a Tool for Nanotechnologies
Ab initio simulation :
Simulates the properties of crystals and molecules,
Computes the electronic density, based on Daubechies wavelets.
This formalism was chosen because it is fit for HPC computations :
Each orbital can be treated independently most of the time,
Operators on orbitals are simple and straightforward.
Mainly developed in Europe :
CEA-DSM/INAC (Grenoble)
Basel, Louvain la Neuve,...
Electronic density around a methane molecule.
BigDFT as an HPC application
Implementation details :
200,000 lines of Fortran 90 and C
Supports MPI, OpenMP, CUDA and OpenCL
Uses BLAS
Scalability up to 16000 cores of Curie and 288 GPUs
Operators can be expressed as 3D convolutions :
Wavelet Transform
Potential Energy
Kinetic Energy
These convolutions are separable and the filters are short (16 elements).
They can take up to 90% of the computation time on some systems.
SPECFEM3D a tool for wave propagation research
Wave propagation simulation :
Used for geophysics and material research,
Accurately simulates earthquakes,
Based on spectral finite elements.
Developed all around the world :
France (CNRS Marseille),
Switzerland (ETH Zurich) CUDA,
United States (Princeton) Networking,
Grenoble (LIG/CNRS) OpenCL.
Sichuan earthquake.
SPECFEM3D as an HPC application
Implementation details :
80,000 lines of Fortran 90
Supports MPI, CUDA, OpenCL and an OmpSs + MPI miniapp
Scalability up to 693,600 cores on IBM BlueWaters
Case Study 1 : BigDFT’s MagicFilter
The simplest convolution found in BigDFT; it corresponds to the potential operator.
Characteristics
Separable,
Filter length 16,
Transposition,
Periodic,
Only 32 operations per element.
Pseudo code
    double filt[16] = {F0, F1, ..., F15};
    void magicfilter(int n, int ndat,
                     double *in, double *out) {
      double temp;
      int i, j, k;
      for (j = 0; j < ndat; j++) {
        for (i = 0; i < n; i++) {
          temp = 0;
          for (k = 0; k < 16; k++) {
            /* periodic boundary; "+ n" keeps the index non-negative in C (assumes n >= 7) */
            temp += in[((i - 7 + k + n) % n) + j * n]
                    * filt[k];
          }
          out[j + i * ndat] = temp; /* transposed store */
        }
      }
    }
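For readers who want to experiment, the pseudo code above can be transcribed to plain Ruby. This is a minimal sketch, not BigDFT's implementation: the real coefficients F0..F15 are not given in the slides, so the example uses a hypothetical unit-impulse filter (`IMPULSE`), which reduces the convolution to a transposition and makes the result easy to check by hand.

```ruby
# Periodic 1D convolution along the first dimension, writing the result
# transposed, following the pseudo code above. `filt` must have 16 entries.
def magicfilter(n, ndat, input, filt)
  out = Array.new(n * ndat, 0.0)
  ndat.times do |j|
    n.times do |i|
      temp = 0.0
      16.times do |k|
        # Ruby's % returns a non-negative result for a positive divisor,
        # so the periodic wrap-around needs no extra offset here.
        temp += input[((i - 7 + k) % n) + j * n] * filt[k]
      end
      out[j + i * ndat] = temp # transposed store
    end
  end
  out
end

# Hypothetical filter: a unit impulse at tap k = 7, so that each output
# element is simply the matching input element, stored transposed.
IMPULSE = Array.new(16) { |k| k == 7 ? 1.0 : 0.0 }

n, ndat = 4, 3
input = (0...n * ndat).map(&:to_f) # line j holds n*j .. n*j + n - 1
p magicfilter(n, ndat, input, IMPULSE)
```

Each output element costs 16 multiplications and 16 additions, which is where the "32 operations per element" figure comes from.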
Case study 2 : SPECFEM3D port to OpenCL
Existing CUDA code :
42 kernels and 15000 lines of code
Kernels with 80+ parameters
∼ 7500 lines of CUDA code
∼ 7500 lines of wrapper code
Objectives :
Factorize the existing code,
Single OpenCL and CUDA description for the kernels,
Validate without unit tests, comparing native CUDA to generated CUDA executions,
Keep similar performance.
A Parametrized Generator
Classical Software Development Loop
[Figure: development loop linking Developer, Source Code, Binary and Performance data through four steps: Development, Compilation, Performance Analysis, Optimization]
Kernel optimization workflow.
Usually performed by a knowledgeable developer.
Compilers (Gcc, Mercurium, OpenCL) perform optimizations.
Architecture-specific or generic optimizations.
Performance data (MAQAO, HW counters, proprietary tools) hint at source transformations.
Architecture-specific or generic hints.
Multiplication of kernel versions or loss of versions.
Difficulty to benchmark versions against each other.
BOAST Development Loop
[Figure: BOAST development loop. The classical loop (Source Code, Binary, Performance data; Development, Compilation, Performance Analysis, Optimization) is extended with Generative Source Code, written by the developer, and a Transformation step]
Meta-programming of optimizations in BOAST.
High-level object-oriented language.
BOAST generates combinations of optimizations.
C, OpenCL, FORTRAN and CUDA are supported.
Compilation (Gcc, Mercurium, OpenCL) and analysis (MAQAO, HW counters, proprietary tools) are automated.
Selection of the best version can also be automated.
BOAST
[Figure: BOAST workflow]
Application kernel (SPECFEM3D, BigDFT, ...)
Kernel written in BOAST DSL
BOAST code generation (select target language) : C kernel, Fortran kernel, OpenCL kernel, CUDA kernel, C with vector intrinsics kernel
Compilation with gcc, ... (select compiler and options) : binary kernel
BOAST runtime (select input data, select performance metrics) : performance measurements
Best performing version
Supporting tools : optimization space pruner (ASK, Collective Mind), binary analysis tool like MAQAO
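The last stages of this workflow (run each candidate, check it, keep the fastest) can be sketched in plain Ruby. This is an illustration of the idea only and does not use BOAST's own runtime: the candidates here are three hypothetical Ruby lambdas standing in for generated, compiled kernels, and `time_once` and `best_variant` are names invented for the sketch.

```ruby
# Hypothetical kernel variants: three ways of summing an integer array.
# In the workflow above these would be generated, compiled kernels.
VARIANTS = {
  "each"   => ->(a) { s = 0; a.each { |v| s += v }; s },
  "inject" => ->(a) { a.inject(0) { |s, v| s + v } },
  "sum"    => ->(a) { a.sum }
}

# Run a kernel once and return [elapsed_seconds, result].
def time_once(kernel, data)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = kernel.call(data)
  [Process.clock_gettime(Process::CLOCK_MONOTONIC) - start, result]
end

# Run every variant `repeats` times, discard variants whose result differs
# from the reference (a basic non-regression check), and return
# [name, best_time] for the fastest remaining one (nil if none is valid).
def best_variant(variants, data, reference, repeats: 5)
  timings = {}
  variants.each do |name, kernel|
    runs = Array.new(repeats) { time_once(kernel, data) }
    next unless runs.all? { |_, result| result == reference }
    timings[name] = runs.map(&:first).min
  end
  timings.min_by { |_, t| t }
end

data = (1..100_000).to_a
name, seconds = best_variant(VARIANTS, data, data.sum)
puts "best variant: #{name} (#{format('%.6f', seconds)} s)"
```

Taking the minimum over several runs is one common way to reduce timing noise; which variant wins depends on the machine and the interpreter, which is the reason for automating the comparison.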
Use Case Driven
Parameters arising in a convolution :
Filter : length, values, center.
Direction : forward or inverse convolution.
Boundary conditions : free or periodic.
Unroll factor : arbitrary.
How do these parameters constrain our tool ?
Features required
Unroll factor :
Create and manipulate an unknown number of variables,
Create loops with variable steps.
Boundary conditions :
Manage arrays with parametrized size.
Filter and convolution direction :
Transform arrays.
And of course be able to describe convolutions and output them in different languages.
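To make the unroll-factor requirements concrete, here is a small Ruby sketch that emits a C loop for an arbitrary unroll factor. It is plain string generation rather than BOAST code; the names `tt1`, `tt2`, ..., `x`, `fil`, `k`, `l`, `n` and `ndat` are borrowed from the samples in this talk, while the function `unrolled_loop` and the zero-based indexing are assumptions made for the example.

```ruby
# Emit C source for a loop over j unrolled `unroll` times. The number of
# accumulators (tt1, tt2, ...) and the loop step both depend on `unroll`,
# which is only known when the generator runs.
def unrolled_loop(unroll)
  temps = (1..unroll).map { |u| "tt#{u}" } # one accumulator per unrolled iteration
  body = temps.each_with_index.map do |tt, u|
    "  #{tt} = #{tt} + x[k + (j + #{u}) * n] * fil[l];"
  end
  ["for (j = 0; j <= ndat - #{unroll}; j += #{unroll}) {", *body, "}"].join("\n")
end

puts unrolled_loop(2)
```

The sketch ignores the remainder iterations that are needed when `ndat` is not a multiple of the unroll factor; a real generator has to emit a clean-up loop or constrain the sizes.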
Proposed Generator
Idea : use a high level language with support for operator
to transform a decorated tree.
Define several abstractions :
Variables : type (array, float, integer), size...
Operators : affect, multiply...
Procedure and functions : parameters, variables...
Constructs : for, while...
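One way to picture these abstractions is a small expression tree built through Ruby operator overloading and printed in two target languages. This is a sketch of the general technique under that assumption, not BOAST's actual classes: `Expr`, `Var`, `Index` and `BinOp` are hypothetical names, and only `+`, `*` and single-index array access are covered.

```ruby
# Base class: overloaded operators build tree nodes instead of computing values.
class Expr
  def +(other)
    BinOp.new("+", self, other)
  end

  def *(other)
    BinOp.new("*", self, other)
  end
end

# Leaf node: a named variable; v[i] builds an array access node.
class Var < Expr
  def initialize(name)
    @name = name
  end

  def [](index)
    Index.new(@name, index)
  end

  def to_code(_lang)
    @name
  end
end

# Array access: printed with brackets in C and parentheses in Fortran.
class Index < Expr
  def initialize(name, index)
    @name = name
    @index = index
  end

  def to_code(lang)
    inner = @index.to_code(lang)
    lang == :fortran ? "#{@name}(#{inner})" : "#{@name}[#{inner}]"
  end
end

# Binary operation; nested operations are parenthesized so the printed
# code keeps the grouping of the tree.
class BinOp < Expr
  def initialize(op, lhs, rhs)
    @op = op
    @lhs = lhs
    @rhs = rhs
  end

  def to_code(lang)
    [@lhs, @rhs].map do |e|
      e.is_a?(BinOp) ? "(#{e.to_code(lang)})" : e.to_code(lang)
    end.join(" #{@op} ")
  end
end

tt, x, fil, k, l = %w[tt x fil k l].map { |name| Var.new(name) }
expr = tt + x[k] * fil[l]
puts expr.to_code(:c)       # C syntax
puts expr.to_code(:fortran) # Fortran syntax
```

Because the host language parses the expression, operator precedence comes for free: Ruby binds `*` tighter than `+`, so the tree already has the right shape before any code is printed.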
Sample Code : Variables and Parameters
    # simple variable
    i = Int "i"
    # simple constant
    lowfil = Int( "lowfil", :const => 1-center )
    # simple constant array
    fil = Real("fil", :const => arr, :dim => [ Dim(lowfil,upfil) ])
    # simple parameter
    ndat = Int("ndat", :dir => :in)
    # multidimensional array, an output parameter
    y = Real("y", :dir => :out, :dim => [ Dim(ndat), Dim(dim_out_min, dim_out_max) ] )
Variables and Parameters are objects with a name, a type, and a set of named properties.
Sample Code : Procedure Declaration
The following declaration :
    p = Procedure("magic_filter", [n,ndat,x,y], [lowfil,upfil])
Outputs Fortran :
    subroutine magicfilter(n, ndat, x, y)
      integer(kind=4), parameter :: lowfil = -8
      integer(kind=4), parameter :: upfil = 7
      integer(kind=4), intent(in) :: n
      integer(kind=4), intent(in) :: ndat
      real(kind=8), intent(in), dimension(0:n-1, ndat) :: x
      real(kind=8), intent(out), dimension(ndat, 0:n-1) :: y
Or C :
    void magicfilter(const int32_t n, const int32_t ndat, const double * x, double * y){
      const int32_t lowfil = -8;
      const int32_t upfil = 7;
Sample Code : Constructs and Arrays
The following declaration :
    unroll = 5
    pr For(j,1,ndat-(unroll-1), unroll) {
      #.....
      pr tt2 === tt2 + x[k,j+1]*fil[l]
      #.....
    }
Outputs Fortran :
    do j=1, ndat-4, 5
      !......
      tt2=tt2+x(k,j+1)*fil(l)
      !......
    enddo
Or C :
    for(j=1; j<=ndat-4; j+=5){
      /* ........... */
      tt2=tt2+x[k-0+(j+1-1)*(n-1-0+1)]*fil[l-lowfil];
      /* ........... */
    }
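The index expression in the generated C code comes from flattening a Fortran-style array, declared with explicit lower bounds, into a zero-based, column-major C array. The Ruby sketch below reproduces that computation numerically; `Bounds` and `flat_index` are hypothetical helper names for this example and are not BOAST's `Dim` class.

```ruby
# One array dimension with inclusive bounds, as in a Fortran declaration.
Bounds = Struct.new(:lower, :upper) do
  def extent
    upper - lower + 1
  end
end

# Column-major (Fortran order) flat offset of `indices` in an array with
# dimensions `dims`: the first index varies fastest.
def flat_index(dims, indices)
  offset = 0
  stride = 1
  dims.zip(indices).each do |dim, index|
    unless (dim.lower..dim.upper).cover?(index)
      raise IndexError, "index #{index} outside #{dim.lower}..#{dim.upper}"
    end
    offset += (index - dim.lower) * stride
    stride *= dim.extent
  end
  offset
end

n = 8
ndat = 3
dims = [Bounds.new(0, n - 1), Bounds.new(1, ndat)] # x(0:n-1, ndat)
k = 2
j = 1
# Same formula as the generated C: (k - 0) + (j + 1 - 1) * (n - 1 - 0 + 1)
puts flat_index(dims, [k, j + 1]) # prints 10
```

The same rule explains `fil[l-lowfil]`: a one-dimensional array whose lower bound is `lowfil` is shifted so that its first element lands at offset 0.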
Generator Evaluation
Back to the test cases :
The generator was used to unroll the MagicFilter and evaluate its performance on an ARM processor and an Intel processor.
The generator was used to describe the SPECFEM3D kernels.
Performance Results
[Figure: performance results on Tegra2 and Intel T7500]
BigDFT Synthesis Kernel
Improvement for BigDFT
Most of the convolutions have been ported to BOAST. Results are encouraging : on the hardware BigDFT was hand
MagicFilter OpenCL versions tailored for problem size by BOAST gain 10 to 20% of performance.
SPECFEM3D OpenCL port
Fully ported to OpenCL with comparable performances (using the global_s362ani_small test case) :
On a 2*6 cores (E5-2630) machine with 2 K40, using 12 MPI processes :
OpenCL : 4m15s
CUDA : 3m10s
On a 2*4 cores (E5620) machine with a K20, using 6 MPI processes :
OpenCL : 12m47s
CUDA : 11m23s
The difference comes from CUDA's ability to specify the minimum number of blocks to launch on a multiprocessor.
Fewer than 4000 lines of BOAST code (7500 lines of CUDA originally).
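To put a number on "comparable", the timings quoted above can be converted to a relative slowdown. The Ruby sketch below only redoes that arithmetic; the helper names `seconds` and `slowdown_percent` are invented for the example.

```ruby
# Convert a wall-clock string such as "4m15s" to seconds.
def seconds(text)
  minutes, secs = text.match(/\A(\d+)m(\d+)s\z/).captures.map(&:to_i)
  minutes * 60 + secs
end

# Relative slowdown of the OpenCL run with respect to the CUDA run, in percent.
def slowdown_percent(opencl, cuda)
  (seconds(opencl).to_f / seconds(cuda) - 1.0) * 100.0
end

puts format("2 x K40, 12 MPI processes: %.1f%%", slowdown_percent("4m15s", "3m10s"))
puts format("1 x K20,  6 MPI processes: %.1f%%", slowdown_percent("12m47s", "11m23s"))
```

With these figures the OpenCL port is about 34% slower than CUDA on the K40 machine and about 12% slower on the K20 machine.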
Conclusions and Future Work
Conclusions
The generator has been used to test several loop unrolling strategies in BigDFT.
Highlights :
Several output languages.
All constraints have been met.
Automatic benchmarking framework allows us to test several
Automatic non-regression testing.
Several algorithmically different versions can be generated (changing the filter, boundary conditions...).
Future Works and Considerations
Future work :
Produce an autotuning convolution library.
Implement a parametric space explorer or use an existing one (ASK : Adaptive Sampling Kit, Collective Mind...).
Vector code is supported, but needs improvements.
Test the OpenCL version of SPECFEM3D on the Mont-Blanc prototype.
Questions raised :
Is this approach extensible enough ?
Can we improve the language used further ?