

  1. http://www.montblanc-project.eu Jean-François Méhaut This project and the research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 288777.

  2. http://www.montblanc-project.eu
  • The New Killer Processors
  • Overview of the Mont-Blanc projects
  • BOAST DSL for computing kernels

  3. CORSE (Compiler Optimization and Run-time SystEms). Fabrice Rastello, Inria Joint Project Team (proposal), June 9, 2015

  4. Project-team composition / Institutional context. Joint Project-Team (Inria, Grenoble INP, UJF) in the LIG laboratory @ Giant/Minatec. Members: Fabrice Rastello, Florent Bouchez Tichadou, François Broquedis, Frédéric Desprez, Yliès Falcone, Jean-François Méhaut. 8 PhD, 3 Post-doc, 1 Engineer.

  5. Permanent member curriculum vitae
  • Florent Bouchez Tichadou, MdC UJF (PhD Lyon 2009, 1Y Bangalore, 3Y Kalray, Nanosim): compiler optimization, compiler back-end
  • François Broquedis, MdC INP (PhD Bordeaux 2010, 1Y Mescal, 3Y Moais): runtime systems, OpenMP, memory management
  • Frédéric Desprez, DR1 Inria (Graal, Avalon): parallel algorithms, numerical libraries
  • Yliès Falcone, MdC UJF (PhD Grenoble 2009, 2Y Rennes, Vasco): validation, enforcement, debugging, runtime
  • Jean-François Méhaut, Pr UJF (Mescal, Nanosim): runtime, debugging, memory management, scientific applications
  • Fabrice Rastello, CR1 Inria (PhD Lyon 2000, 2Y STMicro, Compsys, GCG): compiler optimization, graph theory, compiler back-end, automatic parallelization

  6. Overall Objectives. Domain: compiler optimization and runtime systems for performance and energy consumption (not reliability, nor WCET). Issues: scalability and heterogeneity/complexity ≡ trade-off between specific optimizations and programmability/portability. Target architectures: VLIW / SIMD / embedded / many-cores / heterogeneous. Applications: dynamic systems / loop nests / graph algorithms / signal processing. Approach: combine static/dynamic & compiler/run-time.

  7. First, vector processors dominated HPC • 1st Top500 list (June 1993) dominated by DLP architectures • Cray vector, 41% • MasPar SIMD, 11% • Convex/HP vector, 5% • Fujitsu Wind Tunnel is #1 1993-1996, with 170 GFLOPS

  8. Then, commodity took over special purpose
  • ASCI Red, Sandia: 1997, 1 TFLOPS; 9,298 cores @ 200 MHz; Intel Pentium Pro; upgraded to Pentium II Xeon in 1999, 3.1 TFLOPS
  • ASCI White, LLNL: 2001, 7.3 TFLOPS; 8,192 proc. @ 375 MHz; IBM Power 3
  • Transition from vector parallelism to message-passing programming models

  9. Commodity components drive HPC • RISC processors replaced vectors • x86 processors replaced RISC • Vector processors survive as (widening) SIMD extensions

  10. The killer microprocessors
  [Chart: MFLOPS vs. year, 1974-1999, for vector machines (Cray-1, Cray C90, NEC SX4, SX5) and microprocessors (Alpha EV4, EV5, Intel Pentium, IBM P2SC, HP PA8200)]
  • Microprocessors killed the vector supercomputers
  • They were not faster ...
  • ... but they were significantly cheaper and greener
  • Need 10 microprocessors to achieve the performance of 1 vector CPU
  • SIMD vs. MIMD programming paradigms

  11. The killer mobile processors
  [Chart: MFLOPS vs. year, 1990-2015, for Alpha, Intel, AMD, NVIDIA Tegra, and Samsung Exynos processors; a 4-core ARMv8 @ 1.5 GHz marks the 2015 data point]
  • Microprocessors killed the vector supercomputers: they were not faster, but they were significantly cheaper and greener
  • History may be about to repeat itself ...
  • Mobile processors are not faster ...
  • ... but they are significantly cheaper

  12. Mobile SoC vs. Server processor
  • Mobile SoC (NVIDIA Tegra 3): 5.2 GFLOPS at $21 (1)
  • Server processor (8-core Intel E5 Sandy Bridge): 153 GFLOPS at $1500 (2), i.e. roughly 30x the performance for roughly 70x the cost
  • A mobile SoC delivering 15.2 GFLOPS at $21 (?) would narrow the gap to roughly 10x the performance for 70x the cost
  1. Leaked Tegra 3 price from the Nexus 7 Bill of Materials
  2. Non-discounted list price for the 8-core Intel E5 Sandy Bridge

  13. SoC under study: CPU and Memory
  • NVIDIA Tegra 2: 2 x ARM Cortex-A9 @ 1 GHz, 1 x 32-bit DDR2-333 channel, 32 KB L1 + 1 MB L2
  • NVIDIA Tegra 3: 4 x ARM Cortex-A9 @ 1.3 GHz, 2 x 32-bit DDR3-750 channels, 32 KB L1 + 1 MB L2
  • Samsung Exynos 5 Dual: 2 x ARM Cortex-A15 @ 1.7 GHz, 2 x 32-bit DDR3-800 channels, 32 KB L1 + 1 MB L2
  • Intel Core i7-2760QM: 4 x Intel Sandy Bridge @ 2.4 GHz, 2 x 64-bit DDR3-800 channels, 32 KB L1 + 1 MB L2 + 6 MB L3

  14. Evaluated kernels (each implemented in pthreads, OpenMP, OpenCL, OmpSs, and CUDA)
  • vecop, Vector operation: common operation in numerical codes
  • dmmm, Dense matrix-matrix multiply: data reuse and compute performance
  • 3dstc, 3D volume stencil: strided memory accesses (7-point 3D stencil)
  • 2dcon, 2D convolution: spatial locality
  • fft, 1D FFT transform: peak floating-point, variable stride accesses
  • red, Reduction operation: varying levels of parallelism
  • hist, Histogram calculation: local privatization and reduction stage
  • msort, Generic merge sort: barrier synchronization
  • nbody, N-body calculation: irregular memory accesses
  • amcd, Markov chain Monte-Carlo method: embarrassingly parallel
  • spvm, Sparse matrix-vector multiply: load imbalance
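
To make the shape of these kernels concrete, here is a minimal sketch of the vecop and red patterns in C with OpenMP. It is illustrative only, not the benchmark source used in the study; the problem size N and the scaling factor are assumptions.

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative sketch of two kernel patterns from the list above:
     * an element-wise vector operation (vecop) and a reduction (red). */

    #define N (1L << 20)   /* assumed problem size */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double sum = 0.0;
        if (!a || !b) return 1;

        for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        /* vecop: independent element-wise work, a single parallel loop */
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = 0.5 * a[i] + b[i];

        /* red: same traversal, but threads combine partial sums */
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f\n", sum);
        free(a);
        free(b);
        return 0;
    }

Compile with, e.g., gcc -O2 -fopenmp.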

  15. Single core performance and energy • Tegra3 is 1.4x faster than Tegra2 • Higher clock frequency • Exynos 5 is 1.7x faster than Tegra3 • Better frequency, memory bandwidth, and core microarchitecture • Intel Core i7 is ~3x better than ARM Cortex-A15 at maximum frequency • ARM platforms more energy-efficient than Intel platform

  16. Multicore performance and energy • Tegra3 is as fast as Exynos 5, a bit more energy efficient • 4-core vs. 2-core • ARM multicores as efficient as Intel at the same frequency • Intel still more energy efficient at highest performance • ARM CPU is not the major power sink in the platform

  17. Memory bandwidth (STREAM) • Exynos 5 improves dramatically over Tegra (4.5x) • Dual-channel DDR3 • ARM Cortex-A15 sustains more in-flight cache misses
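
The STREAM figure behind this slide comes from simple bandwidth loops over large arrays. Below is a minimal triad-style sketch in C with OpenMP, not the official STREAM benchmark; the array size and the constant q are assumptions chosen to fit comfortably in a mobile SoC's memory.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    /* Minimal STREAM-triad-style sketch: a[i] = b[i] + q * c[i] over large
     * arrays, timed to estimate sustained memory bandwidth. Illustrative only. */

    #define N (1L << 24)   /* ~16M doubles per array, ~384 MB total (assumed) */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        const double q = 3.0;
        if (!a || !b || !c) return 1;

        #pragma omp parallel for
        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + q * c[i];
        double t1 = omp_get_wtime();

        /* three arrays touched per element: one store and two loads */
        double gbytes = 3.0 * N * sizeof(double) / 1e9;
        printf("triad: %.2f GB/s\n", gbytes / (t1 - t0));
        free(a); free(b); free(c);
        return 0;
    }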

  18. Tibidabo: The first ARM HPC multicore cluster
  • Q7 Tegra 2 module: 2 x Cortex-A9 @ 1 GHz, 2 GFLOPS, 5 Watts (?), 0.4 GFLOPS/W
  • Q7 carrier board: 2 x Cortex-A9, 2 GFLOPS, 1 GbE + 100 MbE, 7 Watts, 0.3 GFLOPS/W
  • 1U rackable blade: 8 nodes, 16 GFLOPS, 65 Watts, 0.25 GFLOPS/W
  • Complete system: 2 racks, 32 blade containers, 256 nodes, 512 cores, 9 x 48-port 1 GbE switches, 512 GFLOPS, 3.4 kW, 0.15 GFLOPS/W
  • Proof of concept: it is possible to deploy a cluster of smartphone processors
  • Enables software stack development

  19. HPC System software stack on ARM
  • Open source system software stack
  • Ubuntu Linux OS
  • GNU compilers: gcc, g++, gfortran
  • Scientific libraries: ATLAS, FFTW, HDF5, ...
  • Slurm cluster management
  • Runtime libraries: MPICH2, OpenMPI
  • OmpSs toolchain
  • Performance analysis tools: Paraver, Scalasca
  • Allinea DDT 3.1 debugger, ported to ARM
  [Stack diagram: source files (C, C++, FORTRAN, ...) go through the compilers (gcc, gfortran, ..., OmpSs) to produce executables; scientific libraries (ATLAS, FFTW, HDF5, ...), developer tools (Paraver, Scalasca, ...), and cluster management (Slurm) sit alongside; the OmpSs runtime library (NANOS++) runs over MPI, GASNet, CUDA, and OpenCL, on Linux across the CPU/GPU nodes]
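
As a concrete illustration of the task-based model that the OmpSs toolchain in this stack provides, here is a minimal sketch of dataflow tasking in C. It uses OpenMP 4.0 depend() clauses, which express the same producer/consumer idea that OmpSs writes with in()/out() clauses on its tasks; the array size and helper functions are assumptions for the example.

    #include <stdio.h>

    #define N 4

    void init(double *x, int n)  { for (int i = 0; i < n; i++) x[i] = i; }
    void scale(double *x, int n) { for (int i = 0; i < n; i++) x[i] *= 2.0; }

    int main(void)
    {
        double a[N], b[N];

        #pragma omp parallel
        #pragma omp single
        {
            /* producer task: writes a */
            #pragma omp task depend(out: a)
            init(a, N);

            /* independent task: writes b, may run concurrently with init(a) */
            #pragma omp task depend(out: b)
            init(b, N);

            /* consumer task: scheduled only after a has been written */
            #pragma omp task depend(in: a)
            scale(a, N);

            #pragma omp taskwait
        }
        printf("a[%d] = %.1f, b[%d] = %.1f\n", N - 1, a[N - 1], N - 1, b[N - 1]);
        return 0;
    }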

  20. Parallel scalability • HPC applications scale well on Tegra2 cluster • Capable of exploiting enough nodes to compensate for lower node performance

  21. SoC under study: Interconnection
  • NVIDIA Tegra 2: 1 GbE (PCIe), 100 Mbit (USB 2.0)
  • NVIDIA Tegra 3: 1 GbE (PCIe), 100 Mbit (USB 2.0)
  • Samsung Exynos 5 Dual: 1 GbE (USB 3.0), 100 Mbit (USB 2.0)
  • Intel Core i7-2760QM: 1 GbE (PCIe), QDR InfiniBand (PCIe)

  22. Interconnection network: Latency • TCP/IP adds a lot of CPU overhead • OpenMX driver interfaces directly to the Ethernet NIC • USB stack adds extra latency on top of network stack Thanks to Gabor Dozsa and Chris Adeniyi-Jones for their OpenMX results

  23. Interconnection network: Bandwidth • TCP/IP overhead prevents Cortex-A9 CPU from achieving full bandwidth • USB stack overheads prevent Exynos 5 from achieving full bandwidth, even on OpenMX Thanks to Gabor Dozsa and Chris Adeniyi-Jones for their OpenMX results
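
The latency and bandwidth figures on this slide and the previous one are the kind of numbers a simple MPI ping-pong microbenchmark produces. The sketch below, in C with MPI, is illustrative only and not the measurement code used in the project; the message sizes and repetition count are assumptions. Run it with two ranks, e.g. mpirun -np 2.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Ping-pong between ranks 0 and 1: small messages estimate latency,
     * large messages estimate bandwidth. Illustrative sketch only. */

    #define REPS 1000

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int size = 1; size <= (1 << 20); size *= 1024) {
            char *buf = malloc(size);
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < REPS; i++) {
                if (rank == 0) {
                    MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double t1 = MPI_Wtime();
            if (rank == 0) {
                double one_way_us = (t1 - t0) / REPS / 2.0 * 1e6;
                double gbps = 8.0 * size / (one_way_us * 1e-6) / 1e9;
                printf("%8d bytes: %8.2f us one-way, %6.2f Gb/s\n",
                       size, one_way_us, gbps);
            }
            free(buf);
        }
        MPI_Finalize();
        return 0;
    }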

  24. Interconnect vs. Performance ratio (peak interconnect bytes per FLOP)
                 1 Gb/s   6 Gb/s   40 Gb/s
  Tegra 2         0.06     0.38     2.50
  Tegra 3         0.02     0.14     0.96
  Exynos 5250     0.02     0.11     0.74
  Intel i7        0.00     0.01     0.07
  • Mobile SoCs have a low-bandwidth interconnect (1 GbE, or 6 Gb/s over USB 3.0) ...
  • ... but the ratio to compute performance is similar to a high-end node with 40 Gb/s InfiniBand
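
As a quick check of how these ratios are obtained, the sketch below divides each link's peak rate in bytes/s by each platform's peak FLOP/s. The Tegra 2 and Tegra 3 peak values (2 and 5.2 GFLOPS) come from earlier slides; the Exynos 5250 and Intel i7 peaks used here are assumptions chosen only so the printed ratios match the table.

    #include <stdio.h>

    /* ratio = (link rate in bytes/s) / (peak FLOP/s) */
    int main(void)
    {
        const double gbits[]  = { 1.0, 6.0, 40.0 };     /* link rates, Gb/s */
        const char  *soc[]    = { "Tegra 2", "Tegra 3", "Exynos 5250", "Intel i7" };
        const double gflops[] = { 2.0, 5.2,             /* from slides 18 and 12 */
                                  6.8, 76.8 };          /* assumed peaks */

        for (int s = 0; s < 4; s++) {
            printf("%-12s", soc[s]);
            for (int l = 0; l < 3; l++) {
                double bytes_per_s = gbits[l] * 1e9 / 8.0;   /* Gb/s -> bytes/s */
                printf("  %5.2f", bytes_per_s / (gflops[s] * 1e9));
            }
            printf("\n");
        }
        return 0;
    }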
