Benefits of the ARM architecture on the even Berkeley Dwarfs
Patric Mai, Pierre Schoonbrood
RWTH Aachen University
patric.mai@gmx.de, pierre.schoonbrood@rwth-aachen.de
February 12, 2015

Overview
1 The road to exaflopic computing
  ARM architecture
  ARM HPC clusters
2 Performance of the ARM architecture
  Sparse linear algebra
  Unstructured grids
  Combinational logic
  Optimizing OpenCL for the ARM architecture
  Graphical models
3 Developing for ARM
4 Conclusions
5 References
6 Who did what?

(1) - Current trends
Source: www.top500.org

(1) - Limitations
- A supercomputer should not exceed a power budget of 20 MW
- Currently: 2 GFlops/Watt
- Required: 50 GFlops/Watt
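
The 50 GFlops/Watt requirement can be checked with a short calculation. This is a sketch that assumes the target is a sustained 1 exaflop/s machine (the slide itself states only the 20 MW budget and the two efficiency figures); the constant and function names are illustrative.

```python
# Assumption: the performance target is 1 exaflop/s = 10**18 Flops.
EXAFLOPS = 10**18
# From the slide: power budget of 20 MW, current efficiency of 2 GFlops/Watt.
POWER_BUDGET_WATTS = 20 * 10**6
CURRENT_GFLOPS_PER_WATT = 2


def required_gflops_per_watt(flops=EXAFLOPS, watts=POWER_BUDGET_WATTS):
    """Efficiency needed to reach `flops` within a budget of `watts`."""
    return flops / watts / 10**9


def improvement_factor(current=CURRENT_GFLOPS_PER_WATT):
    """How much more efficient the machine must be than today's figure."""
    return required_gflops_per_watt() / current


if __name__ == "__main__":
    print(required_gflops_per_watt(), "GFlops/Watt required")
    print(improvement_factor(), "x improvement over the current figure")
```

Under that assumption the required efficiency matches the slide, and the gap to the current figure is a factor of 25.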

(1) - ARM architecture
Source: www.arm.com

(1) - ARM HPC clusters
- HPC clusters built from low-power SoCs (ARM)
- Tibidabo (Rajovic et al. 2013), 256 nodes, each with:
  - A9 dual-core @ 1 GHz
  - 1 GB DDR2 SDRAM
- Mont-Blanc project (http://www.montblanc-project.eu)
- Pedraforca, 70 nodes, each with:
  - A9 quad-core @ 1.4 GHz
  - 4 GB DDR3 SDRAM

(2) - Benchmarking the ARM architecture
- How suitable is the ARM architecture for HPC applications?
- Performance per Watt for several dwarfs
- Comparison with other architectures
- Optimizing the OpenCL framework for ARM

(2) - (FEAST) Finite element analysis (1)
- Mesh of points representing an object (for example, a fluid)
- Points have properties (for example, viscosity)
- Differential equations describe the behavior (laminar or turbulent?)
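
To make the mesh-plus-differential-equation idea concrete, here is a minimal one-dimensional sketch. It is not FEAST and not the benchmark from the talk: it assumes the model problem -u'' = load on [0, 1] with u(0) = u(1) = 0, linear elements on a uniform mesh, and hypothetical function names. The resulting matrix is tridiagonal, a small instance of the sparse linear systems such codes spend their time solving.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[0] and upper[-1] are unused placeholders."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def solve_poisson_1d(num_elements, load=1.0):
    """Values of u at the interior mesh points for -u'' = load, u(0) = u(1) = 0."""
    h = 1.0 / num_elements          # element size of the uniform mesh
    n = num_elements - 1            # number of interior points (unknowns)
    diag = [2.0 / h] * n            # stiffness matrix: main diagonal
    lower = [0.0] + [-1.0 / h] * (n - 1)
    upper = [-1.0 / h] * (n - 1) + [0.0]
    rhs = [load * h] * n            # load vector for a constant load
    return solve_tridiagonal(lower, diag, upper, rhs)


if __name__ == "__main__":
    print(solve_poisson_1d(4))
```

For this model problem the exact solution is u(x) = x(1 - x)/2, so the value at the midpoint x = 0.5 should come out as 0.125.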

(2) - (FEAST) Finite element analysis (2)
- Tibidabo benchmarked against a Xeon cluster (LiDOng)
- Tested with four configurations:
  1. LiDOng with as many cores as Tibidabo
  2. As (1), but with all cores of each LiDOng node activated
  3. As few LiDOng nodes as possible with respect to memory
  4. Twice the number of nodes used in (3)

(2) - (FEAST) Finite element analysis (3)
Source: Göddeke et al. 2013

(2) - (FEAST) Finite element analysis (4)
Source: Göddeke et al. 2013

(2) - (HONEI LBM) Computational fluid dynamics (1)
- Mesh constructed for fluids
- Definition of physical bounds
- Definition of the physical model
- Simulation of the behavior of liquids and gases

(2) - (HONEI LBM) Computational fluid dynamics (2)
Source: Göddeke et al. 2013

(2) - (HONEI LBM) Computational fluid dynamics (3)
Source: Göddeke et al. 2013

(2) - Rijndael and Bitcount
- Bitcount: counts the set bits in an array
  - Uneven workload
- Rijndael: derives round keys, for example for AES
  - Many XOR operations
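
The two kernels can be illustrated with a short sketch. It is an assumption-laden illustration, not the benchmark code used by Maghazeh et al.: the bit counter uses Kernighan's method, whose loop runs once per set bit, which is one plausible reason the workload per array element is uneven. The XOR helper shows only the basic byte-wise operation that a Rijndael key schedule applies repeatedly; it is not a key expansion. All function names are hypothetical.

```python
def count_set_bits(value):
    """Kernighan's method: each iteration clears the lowest set bit."""
    count = 0
    while value:
        value &= value - 1
        count += 1
    return count


def bitcount_array(values):
    """Per-element set-bit counts; work per element depends on the data."""
    return [count_set_bits(v) for v in values]


def xor_bytes(a, b):
    """Byte-wise XOR of two equally long byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


if __name__ == "__main__":
    print(bitcount_array([0, 1, 0xFF, 0xFFFFFFFF]))
    print(xor_bytes(b"\x0f\xf0", b"\xff\xff").hex())
```

With this method, an element of 0 costs no loop iterations while 0xFFFFFFFF costs 32, so work items that receive different data finish at different times.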

(2) - Performance on combinational logic
Source: Maghazeh et al. 2013

(2) - Optimizing OpenCL runtime for ARM (1)
- ARM processor: both host processor and OpenCL device
- Every core is one compute unit
- Every core is a single processing element

(2) - Optimizing OpenCL runtime for ARM (2)
Source: Jo et al. 2014

(2) - Optimizing OpenCL runtime for ARM (2)
Source: Jo et al. 2014

(2) - Optimizing OpenCL compilation for ARM
Source: Jo et al. 2014

(2) - Optimizing OpenCL compilation (NEON)
- Auto-vectorization by the GCC compiler
- Vector operations are converted into NEON intrinsic functions
- Binaries contain NEON functions

(2) - Optimizing OpenCL results
- Improvement over PGCL
Source: Jo et al. 2014

(3) - Deep convolutional neural networks
- Feed-forward neural network
- Collections of neurons are each responsible for part of an image
- Several layers (filters)
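
A single filter of one such layer can be sketched in a few lines. This is a generic illustration under stated assumptions (a "valid" sliding-window filter without padding or stride, followed by a ReLU activation), not the implementation evaluated by Jin et al.; the function names are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image`; each output value depends only on one
    small patch of the image (no padding, stride 1)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(iw - kw + 1)
        ]
        for r in range(ih - kh + 1)
    ]


def relu(feature_map):
    """Element-wise max(0, x), a common activation between layers."""
    return [[max(0, v) for v in row] for row in feature_map]


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernel = [[-1, 0], [0, 1]]   # responds to increase along the diagonal
    print(relu(conv2d_valid(image, kernel)))
```

Stacking several such filters, each followed by an activation, gives the layered feed-forward structure named on the slide; the feature maps produced between layers are the intermediate results whose size matters for cache behavior.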

(3) - Performance on neural networks (1)
Source: Jin et al. 2014

(3) - Performance on neural networks (2)
Source: Jin et al. 2014

(3) - Performance on neural networks (3)
- The SoC used has 2 cores with 512 KB L2 cache
- Intermediate result: 256 KB in size
- Lots of cache misses

(3) - Performance on neural networks (4)
Source: Jin et al. 2014

(3) - Developing for ARM

(4) - Conclusions
- ARM is more energy efficient most of the time
- Mostly used for embedded applications
- Usable for HPC applications
- Limiting factor: resources (memory, caches)
- Frameworks should be optimized for ARM

(5) - References (1)
- Göddeke, Dominik, et al. Energy efficiency vs. performance of the numerical solution of PDEs: an application study on a low-power ARM-based cluster. Journal of Computational Physics, 2013, vol. 237, pp. 132-150.
- Jo, Gangwon, et al. OpenCL framework for ARM processors with NEON support. In: Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing. ACM, 2014, pp. 33-40.
- Maghazeh, Arian, et al. General purpose computing on low-power embedded GPUs: Has it come of age? In: Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIII), 2013 International Conference on. IEEE, 2013, pp. 1-10.

(5) - References (2)
- Jin, Jonghoon, et al. An efficient implementation of deep convolutional neural networks on a mobile coprocessor. In: Circuits and Systems (MWSCAS), 2014 IEEE 57th International Midwest Symposium on. IEEE, 2014, pp. 133-136.

(6) - Who did what?
- Shared Introduction: Patric Mai
- Road to exaflopic computing: Pierre Schoonbrood
- Sparse linear algebra: Pierre Schoonbrood
- Unstructured grids: Pierre Schoonbrood
- Combinational logic: Patric Mai
- Optimizing OpenCL for ARM: Patric Mai
- Graphical models: Pierre Schoonbrood
- Developing for ARM: Patric Mai
- Conclusions: Both
