ARM-based systems at BSC, PRACE Spring School 2013: New and Emerging Technologies - Programming for Accelerators



SLIDE 1

www.bsc.es

ARM-based systems at BSC

PRACE Spring School 2013 New and Emerging Technologies - Programming for Accelerators Nikola Rajovic, Gabriele Carteni Barcelona Supercomputing Center

SLIDE 2

Outline

A little bit of history

– From vector CPUs to commodity components

“Killer mobile” processors

– Overview of current trends for mobile CPUs

Our experiences

– Tibidabo: ARM multicore prototype
– Pedraforca: ARM + GPU prototype

Looking ahead – Mont-Blanc project

Disclaimer: All references to unavailable products are speculative, taken from web sources. There is no commitment from ARM, Samsung, Intel, or others implied.

SLIDE 3

In the beginning ... there were only supercomputers


Built to order

– Very few of them

Special purpose hardware

– Very expensive

Cray-1

– 1975, 160 MFLOPS

  • 80 units, 5-8 M$

Cray X-MP

– 1982, 800 MFLOPS

Cray-2

– 1985, 1.9 GFLOPS

Cray Y-MP

– 1988, 2.6 GFLOPS

...Fortran+ Vectorizing Compilers

SLIDE 4

Then, commodity took over special purpose


ASCI Red, Sandia

– 1997, 1 Tflops (Linpack), 9298 processors at 200 MHz, 1.2 Tbytes, 850 kWatts
– Intel Pentium Pro

  • Upgraded to Pentium II Xeon, 1999, 3.1 Tflops

ASCI White, Lawrence Livermore Lab.

– 2001, 7.3 TFLOPS, 8192 proc. RS6000 at 375 MHz, 6 Terabytes
– (3+3) MWatts: cooling + everything else
– IBM Power 3

Message-Passing Programming Models

SLIDE 5


“Killer microprocessors”

Microprocessors killed the Vector supercomputers

– They were not faster ...
– ... but they were significantly cheaper and greener

~10 microprocessors ≈ 1 vector CPU

– SIMD vs. MIMD programming paradigms

[Chart: MFLOPS, 1974-1999 (log scale): vector machines (Cray-1, Cray-C90, NEC SX4, SX5) vs. microprocessors (Alpha EV4, EV5, Intel Pentium, IBM P2SC, HP PA8200)]

SLIDE 6


Finally, commodity hardware + commodity software

MareNostrum

– Nov 2004, #4 Top500

  • 20 Tflops, Linpack

– IBM PowerPC 970 FX

  • Blade enclosure

– Myrinet + 1 GbE network
– SuSE Linux

SLIDE 7

2008 – 1 PFLOPS – IBM RoadRunner

Los Alamos National Laboratory (USA)

Hybrid architecture

– 1x AMD dual-core Master blade
– 2x PowerXCell 8i Worker blades

Hybrid MPI + task off-load model

296 racks

– 6.480 Opteron processors
– 12.960 Cell processors

  • 128-bit SIMD

Infiniband interconnect

– 288-port switches

2.35 MWatt (425 MFLOPS / W)

SLIDE 8

2009 - Cray Jaguar (1.8 PFLOPS)

Oak Ridge National Laboratory (USA)

Multi-core architecture

– Hybrid MPI + OpenMP programming

230 racks

224.256 AMD Opteron cores

– 6 cores / chip

Cray Seastar2+ interconnect

– 3D-mesh using AMD Hypertransport

7 MWatt (257 MFLOPS / W)

SLIDE 9

2012 – Cray Titan (17.6 PFLOPS)

DOE/SC/Oak Ridge National Laboratory

– Jaguar GPU upgrade

200 racks 224.256 Cray XK7 nodes

– 16-core AMD Opteron
– Nvidia Tesla K20X GPU

8.2 MWatts (2.142 MFLOPS/W)
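The MFLOPS/W figures quoted for these systems are simply delivered FLOPS divided by power draw. A quick sanity check in Python, using the numbers from the slides (rounding differs from the slides by a digit in places):

```python
# Efficiency = FLOPS / Watts, expressed in MFLOPS/W.
# System figures taken from the slides above.
def mflops_per_watt(flops, watts):
    return flops / watts / 1e6

systems = {
    "ASCI Red (1997)":   (1e12,    850e3),
    "RoadRunner (2008)": (1e15,    2.35e6),
    "Jaguar (2009)":     (1.8e15,  7e6),
    "Titan (2012)":      (17.6e15, 8.2e6),
}
for name, (flops, watts) in systems.items():
    print(f"{name}: {mflops_per_watt(flops, watts):.1f} MFLOPS/W")
```

The three-orders-of-magnitude jump from ASCI Red to Titan is the efficiency trend the rest of the talk builds on.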

SLIDE 10

Outline

A little bit of history

– From vector CPUs to commodity components

“Killer mobile” processors

– Overview of current trends for mobile CPUs

Our experiences

– Tibidabo: ARM multicore prototype
– Pedraforca: ARM + GPU prototype

Looking ahead – Mont-Blanc project

Disclaimer: All references to unavailable products are speculative, taken from web sources. There is no commitment from ARM, Samsung, Intel, or others implied.

SLIDE 11


The next step in the commodity chain


Total cores in Nov‘12 Top500

– 14.9M Cores

Tablets sold 2012

– > 100M Tablets

Smartphones sold 2012

– > 712M Phones

[Figure: market volume pyramid: HPC, servers, desktop, mobile]

SLIDE 12


ARM Processor improvements in DP FLOPS

IBM BG/Q and Intel AVX implement DP in 256-bit SIMD

– 8 DP ops / cycle

ARM quickly moved from optional floating-point to state-of-the-art

– ARMv8 ISA introduces DP in the NEON instruction set (128-bit SIMD)

[Chart: DP ops/cycle (1, 2, 4, 8, 16; log scale) by year, 1999-2015: Intel SSE2, IBM BG/P, ARM Cortex-A9, ARM Cortex-A15, ARMv8, IBM BG/Q, Intel AVX]

SLIDE 13

Integrated ARM GPU performance

GPU compute performance increases faster than Moore’s Law

Mali-T604 (2012)

– First Midgard architecture product
– Scalable to 4 cores
– 68 GFLOPS*

Mali-T658 (2013)

– High-end solution + compute capability
– Scalable to 8 cores, ARMv8 compatible
– 272 GFLOPS*

Skrymir (2014)

– Next step up in performance

* Data from web sources, not an ARM commitment

SLIDE 14


Are the “Killer Mobiles™” coming?

Where is the sweet spot? Maybe in the low-end ...

– Today: ~1:8 ratio in performance, 1:50 ratio in cost
– Tomorrow: ~1:2 ratio in performance, still 1:50 in cost?

The same reason why microprocessors killed supercomputers

– Not so much performance ... but much lower cost, and power
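The sweet-spot argument is really a performance-per-dollar ratio; a one-line check of the slide's numbers:

```python
# Mobile at 1/8 the performance and 1/50 the cost of a server part
# (today's ratios from the slide) vs. a possible 1/2 performance tomorrow.
today = (1 / 8) / (1 / 50)     # ≈ 6.25x better performance per dollar
tomorrow = (1 / 2) / (1 / 50)  # ≈ 25x, if the performance gap closes
print(today, tomorrow)
```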

[Chart: performance (log2) vs. cost (log10), nowadays and near future: Mobile ($20), Desktop ($150), Server ($1500), and a possible HPC-Mobile ($40)?]
SLIDE 15

History may be about to repeat itself …

– Mobile processors are not faster …
– … but they are significantly cheaper and greener

The “Killer Mobile™” processors

[Chart: MFLOPS (log scale), 1990-2015: Alpha, Intel, AMD, Nvidia Tegra, Samsung Exynos 4-core ARMv8 @ 1.5 GHz]
SLIDE 16

Then and now

Today’s situation looks very familiar

– “Mobile vs. server” looks like “server vs. vector” did
– Significantly lower cost of mobile CPUs (hundreds vs. thousands of $)
– Same programming model, larger scale

  • Will need more parallelism (probably less than one order of magnitude)

Of course, this does not prove anything

– Mobile CPUs will become a viable alternative, but there’s no guarantee that they will make it to mainstream HPC systems

Then: vector vs. commodity. Now: commodity vs. mobile.

SLIDE 17

BSC ARM-based prototype roadmap

Prototypes are critical to accelerate software development

– System software stack + applications

[Roadmap: GFLOPS/W over 2011-2014: Tibidabo (ARM multicore), Pedraforca (ARM + GPU), integrated ARM + GPU]

SLIDE 18

Outline

A little bit of history

– From vector CPUs to commodity components

“Killer mobile” processors

– Overview of current trends for mobile CPUs

Our experiences

– Tibidabo: ARM multicore prototype
– Pedraforca: ARM + GPU prototype

Looking ahead – Mont-Blanc project

Disclaimer: All references to unavailable products are speculative, taken from web sources. There is no commitment from ARM, Samsung, Intel, or others implied.

SLIDE 19

ARM Cortex-A9

Smartphone CPU: out-of-order (OoO) superscalar processor

– Issue width of 4

VFP for 64-bit Floating Point

– DP: 1 FMA each 2 cycles

The first ARM CPU worth testing for HPC workloads

SLIDE 20

Dual-core Cortex-A9 @ 1GHz

– VFP for 64-bit Floating Point

  • 2 GFLOPS (1 FMA / 2 cycles)

Low-power Nvidia GPU

– OpenGL only, CUDA not supported

Several accelerators (not useful for HPC)

– Video encoder-decoder
– Audio processor
– Image processor

2 GFLOPS ~ 0.5 Watt
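These peak figures follow directly from cores × clock × DP flops/cycle. A small sketch using the FMA throughputs stated in the slides (Cortex-A9 VFP: 1 FMA per 2 cycles; Cortex-A15, shown later in the deck: 1 FMA per cycle; an FMA counts as 2 flops):

```python
# Peak double-precision GFLOPS = cores x clock (GHz) x DP flops/cycle.
# A9: 1 FMA (2 flops) every 2 cycles -> 1 flop/cycle;
# A15: 1 FMA per cycle -> 2 flops/cycle.
def peak_gflops(cores, ghz, flops_per_cycle):
    return cores * ghz * flops_per_cycle

print(peak_gflops(2, 1.0, 1))   # Tegra2: dual A9 @ 1 GHz   -> 2.0
print(peak_gflops(4, 1.3, 1))   # Tegra3: quad A9 @ 1.3 GHz -> 5.2
print(peak_gflops(2, 1.7, 2))   # Exynos 5: dual A15 @ 1.7 GHz -> 6.8
```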

NVIDIA Tegra2

SECO Q7 board

SLIDE 21

Q7 Module

– 1x Tegra2 SoC

  • 2x ARM Cortex-A9, 1 GHz

– 1 GB DDR2 DRAM
– 100 Mbit Ethernet (USB)
– PCIe

  • 1 GbE
  • MXM connector for mobile GPU

– 4" x 4"

Q7 + MXM board

– 2 Ethernet ports
– 2 USB ports
– 2 HDMI

  • 1 from Tegra
  • 1 from GPU

– uSD slot
– 8" x 5.6"

2 GFLOPS ~ 7 Watt

SECO Q7 Tegra2 + Carrier board

SLIDE 22

Standard 19" rack dimensions

– 1.75" (1U) x 19" x 32" deep

8x Q7-MXM Carrier boards

– 8x Tegra2 SoC
– 16x ARM Cortex-A9
– 8 GB DRAM

1 Power Supply Unit (PSU)

– Daisy-chaining of boards
– ~7 Watts PSU waste

16 GFLOPS ~ 65 Watts

1U multi-board container

SLIDE 23

Tibidabo: The first ARM multicore cluster


Proof of concept

– It is possible to deploy a cluster of smartphone processors

Enable software stack development

Q7 Tegra2 module: 2x Cortex-A9 @ 1 GHz, 2 GFLOPS, 5 Watts (?), 0.4 GFLOPS/W
Q7 carrier board: 2x Cortex-A9, 2 GFLOPS, 1 GbE + 100 MbE, 7 Watts, 0.3 GFLOPS/W
1U rackable blade: 8 nodes, 16 GFLOPS, 65 Watts, 0.25 GFLOPS/W
2 racks: 32 blade containers, 256 nodes, 512 cores, 9x 48-port 1GbE switches, 512 GFLOPS, 3.4 kWatts, 0.15 GFLOPS/W
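The Tibidabo GFLOPS/W figures at each packaging level can be reproduced directly; note how efficiency falls as non-compute components (carrier board, PSU, switches) are added:

```python
# GFLOPS and Watts at each Tibidabo packaging level, from the slide.
levels = [
    ("Q7 module",     2,   5),
    ("carrier board", 2,   7),
    ("1U blade",      16,  65),
    ("full cluster",  512, 3400),
]
for name, gflops, watts in levels:
    print(f"{name}: {gflops / watts:.2f} GFLOPS/W")
```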

SLIDE 24

Network, storage and management

SLIDE 25

Tibidabo: scalability and energy efficiency


HPC applications scale out of the box on Tibidabo

– Strong scaling depends on the size of input set

HPL – good weak scaling

– 120 MFLOPS/Watt

Specfem3D

– Improvements over x86 cluster in energy efficiency (up to 3x)

  • D. Göddeke et al., “Energy efficiency vs. performance of the numerical solution of PDEs: an application study on a low-power ARM-based cluster”, Journal of Computational Physics

SLIDE 26

Tibidabo: Power consumption breakdown

Single node power consumption breakdown

(power consumption while running HP Linpack)

– Core 1: 0.26 W
– Core 2: 0.26 W
– L2 cache: 0.10 W
– Memory: 0.70 W
– Eth1: 0.90 W
– Eth2: 0.50 W
– Other: 5.68 W

SLIDE 27

Current status of operations

Tibidabo is a prototype, that is:

– *It is not a production system*
– Limited user support (experienced users are expected)
– Basic stack of production services
– Frequent maintenance (often like time bombs)

Nodes inventory:

– 1 Head Node, also acting as the single I/O Node
– 4 Login Nodes
– 242 Compute Nodes (each providing 2x ARM Cortex-A9 CPUs)
– 2 Development Nodes (software development and testing)

SLIDE 28


Lessons learned

First attempt at ARM HPC cluster

– Not competitive with state of the art 

Unbalanced system design

– Power consumption is dominated by useless components

  • Components not contributing to performance

Next generation ARM CPU increases performance

– Still low power
– Still leads to an unbalanced system

Need to increase performance density

– Increase performance, even if it increases power

SLIDE 29

Outline

A little bit of history

– From vector CPUs to commodity components

Killer mobile processors

– Overview of current trends for mobile CPUs

Our experiences

– Tibidabo: ARM multicore prototype
– Pedraforca: ARM + GPU prototype

Looking ahead – Mont-Blanc project

Disclaimer: All references to unavailable products are speculative, taken from web sources. There is no commitment from ARM, Samsung, Intel, or others implied.

SLIDE 30

NVIDIA Tegra3

Quad-core Cortex-A9 @ 1.3GHz

– VFP for 64-bit Floating Point

  • 5.2 GFLOPS

– NEON for 32-bit floating-point SIMD

Low-power NVIDIA GPU

– 3x faster than Tegra2's GPU
– CUDA not supported

slide-31
SLIDE 31

CARMA Kit: ARM + GPU developer kit

Tegra3 SoC

– Quad-core ARM Cortex-A9
– 6 PCIe lanes (Gen1)

Quadro 1000M

– CUDA supported

1 GbE

First hybrid ARM + CUDA platform

slide-32
SLIDE 32

Pedraforca: ARM+GPU cluster

Stage One

– Test cluster of CARMA kits
– 1 GbE interconnect

Stage Two

– ARM multicore SoC (NVIDIA)
– NVIDIA GPU

In progress…

SLIDE 33

Development cluster of 16 CARMA kits @ BSC

First hybrid ARM + CUDA platform

– Limited usability for real applications

  • Low PCIe bandwidth, only 2GB of DRAM

– Enable runtime software development

slide-34
SLIDE 34

CARMA Kit: Energy Efficiency

CARMA platform is much more energy-efficient than Tegra3 alone

slide-35
SLIDE 35

CARMA cluster scalability

slide-36
SLIDE 36

Guess what …

… sometimes you get it right!

[Roadmap: Nvidia Tegra, 2011-2015 (log scale): Tegra2 (first dual A9), Tegra3 (first quad A9, first power-saver core), Tegra4 (first LTE SDR modem, computational camera), Logan (Kepler GPU, CUDA), Parker (Maxwell GPU, FinFET)]

But meanwhile …

slide-37
SLIDE 37

Pedraforca: Next generation ARM + GPU platform

– Tegra3 Q7 module: 4x ARM Cortex-A9 @ 1.3 GHz, 2 GB DDR2
– Mini-ITX carrier: 4x PCIe Gen1, SATA 2.0, 1 GbE
– 2.5" SSD: 250 GB, SATA 3, MLC
– NVIDIA Tesla K20: 16x PCIe Gen3, 1170 GFLOPS (peak)
– Mellanox ConnectX-3: 8x PCIe Gen3, 40 Gb/s
– Networks: Ethernet 1 Gb/s (service + storage), InfiniBand 40 Gb/s (MPI)

slide-38
SLIDE 38

GPU-accelerated cluster vs. GPU-accelerator cluster

Current GPU clusters

– Fixed ratio of CPUs to GPUs
– Unused GPUs in non-accelerated apps
– Unused CPUs in heavily accelerated apps

Decouple CPU from GPU

– Off-load kernels to a remote GPU
– Direct GPU-to-GPU data transfers

  • Orchestrated by a lightweight ARM CPU

[Diagram: per-node CPU+GPU pairs vs. decoupled pools of CPUs and GPUs sharing the interconnection network]

slide-39
SLIDE 39

Pedraforca: Rack enclosure

– 2x GbE switches
– 4x IB switches
– Login nodes (Intel SandyBridge E5)
– 64x compute nodes (4x ARM Cortex-A9 + 1x Nvidia Tesla K20)
– NFS storage

slide-40
SLIDE 40

Pedraforca: Interconnect

GbE network for service and storage

IB network for MPI

– With extra ports to connect to the other clusters


slide-41
SLIDE 41

Open source system software stack

– Ubuntu/Debian Linux OS
– GNU compilers

  • gcc, g++, gfortran

– Scientific libraries

  • ATLAS, FFTW, HDF5,...

– Slurm cluster management

Runtime libraries

– MPICH2, CUDA, …
– OmpSs toolchain

Developer tools

– Paraver, Scalasca
– Allinea DDT debugger

System software stack ready.

[Diagram: source files (C, C++, Fortran) compiled with gcc/gfortran and the OmpSs compiler into executables running on the OmpSs runtime library (NANOS++) over CUDA, OpenCL, MPI, GASNet and Linux, alongside scientific libraries (ATLAS, FFTW, HDF5), developer tools (Paraver, Scalasca) and Slurm cluster management]

slide-42
SLIDE 42

Porting applications to ARM

Application | Domain | Institution | Prog. model | Scalability
YALES2 | Combustion | CNRS/CORIA | MPI | >32K
EUTERPE | Fusion | BSC | MPI, OpenMP | >60K
SPECFEM3D | Wave propagation | CNRS | MPI, CUDA, SMPSs | >150K, >1K GPU
MP2C | Multi-particle collision | JSC | MPI | >65K
BigDFT | Elect. structure | CEA | MPI, OpenMP, CUDA, OpenCL | >2K, >300 GPU
Quantum Espresso | Elect. structure | CINECA | MPI, OpenMP, CUDA | Good
PEPC | Coulomb + gravitational forces | JSC | MPI, Pthreads, SMPSs | >300K
SMMP | Protein folding | JSC | MPI, OpenCL | 16K
ProFASI | Protein folding | JSC | MPI | Good
COSMO | Weather forecast | CINECA | MPI, OpenMP |
BQCD | Particle physics | LRZ | MPI, OpenMP | ~300K

(All of the above have ARM ports.)

Porting full-scale HPC applications to ARM cluster requires minimal effort

slide-43
SLIDE 43

Conclusions

CARMA is not an HPC solution …

… but it already enables software development

Pedraforca is the second-generation ARM + GPU prototype

– GPU-accelerator cluster, instead of GPU-accelerated cluster

  • ARM CPU used to orchestrate direct GPU to GPU communication

CPU + GPU integration is happening already

– Embedded mobile platforms with OpenCL capable GPU

Get ready for your next generation CPU + GPU platforms

slide-44
SLIDE 44

Outline

A little bit of history

– From vector CPUs to commodity components

Killer mobile processors

– Overview of current trends for mobile CPUs

Our experiences

– Tibidabo: ARM multicore prototype
– Pedraforca: ARM + GPU prototype

Looking ahead – Mont-Blanc project

Disclaimer: All references to unavailable products are speculative, taken from web sources. There is no commitment from ARM, Samsung, Intel, or others implied.

SLIDE 45

Project goals

To develop a European Exascale approach based on embedded, power-efficient technology

Objectives

– Develop a first prototype system, limited by available technology
– Design a next-generation system to overcome those limitations
– Develop a set of Exascale applications targeting the new system

slide-46
SLIDE 46

ARM MPSoC selection criteria (I)

Quantitative metrics

– Energy efficiency: GFLOPS / W
– Absolute performance: GFLOPS
– Cost efficiency: GFLOPS / $
– Performance density: GFLOPS / cm2 (or cm3)
– Memory bandwidth: Bytes / FLOP
– Interconnect bandwidth: Bytes / FLOP

Notes

– These metrics do not depend on the MPSoC exclusively
– Best performance and best efficiency may not be achieved at the same frequency

slide-47
SLIDE 47

ARM MPSoC selection criteria (II)

Must have features

– ARM Cortex-A15
– Integrated accelerator

  • 64-bit floating point
  • Programmable (OpenCL, CUDA, OpenMP, …)

– 4 GB DRAM

  • Maximize per-node problem size

– HPC compatible packaging

  • Package-on-Package (PoP) solutions not valid for HPC

– Availability

  • Samples in Q1 2013, Mass production in Q2 2013
  • Direct support from vendor

– Ethernet interface (1 GbE or +)

  • USB 3.0 to GbE bridge

– Local storage interface

  • MMC or uSD
SLIDE 48

ARM MPSoC selection criteria (III)

Nice-to-have features, not strictly required

– Early evaluation / developer board
– ECC protection on DRAM
– DRAM available in DIMM format
– Advanced monitoring, control, and debug capabilities
– Extended involvement of the provider

  • Support for prototype development (hardware, firmware)
  • Support for use of the prototype (compiler, runtime)
  • Plans for ARMv8 MPSoC in the future
  • Great motivation and reactivity

Clear message to be sent out

– European provider, or European technologies
– Technology from the mobile / consumer space used in HPC

SLIDE 49

Exynos 5 Dual: Hybrid ARM + GPU platform

Dual-core ARM Cortex-A15 @ 1.7 GHz

– VFP for 64-bit Floating Point

  • 6.8 GFLOPS (1 FMA / cycle)

– NEON for 32-bit floating-point SIMD

Quad-core ARM Mali T604

– Compute capable

  • OpenCL 1.1
  • 68 GFLOPS (SP)

Shared memory between CPU and GPU

SLIDE 50

Arndale developer kit

Exynos 5 Dual SoC

– Full-profile OpenCL 1.1
– 2x ARM Cortex-A15, ARM Mali-T604, 2 GB DDR3

100 Mbit Ethernet, NFC, GPS, HDMI, SATA 3, 9-axis sensor, uSD, …

USB 3.0

– 1 GbE adaptor

SLIDE 51

High density packaging architecture

Standard BullX blade enclosure

Multiple compute nodes per blade

– Additional level of interconnect, on-blade network

SLIDE 52

Interconnection network (I)

[Diagram: network link bandwidths: 15x 1 Gb/s, 2x 10 Gb/s, 9x 2x 10 Gb/s, 8x 40 Gb/s]

SLIDE 53

Interconnection network (II)

2D Torus network, 80 Gb/s per dimension


SLIDE 54

Prototype projections

Final prototype limited by SoC timing + availability

Exynos 5 Octa offers 2-4x higher performance …

– … but was 3 months too late for us

Exynos 5 compute card: 2x Cortex-A15 @ 1.7 GHz + 1x Mali-T604 GPU, 6.8 + 25.5 GFLOPS (peak), 6-10 Watts (?), 3-5 GFLOPS/W
Carrier blade: 15x compute cards, 485 GFLOPS, 1 GbE to 10 GbE, 175 Watts (?), 2.8 GFLOPS/W
7U blade chassis: 9x carrier blades, 135 compute cards, 4.3 TFLOPS, 1.7 kWatts, 2.5 GFLOPS/W
1 rack: 4x blade chassis, 36 blades, 540 compute cards, 2x 36-port 10GbE switches + 8-port 40GbE uplink, 17.2 TFLOPS, 7.1 kWatts, 2.4 GFLOPS/W
6 racks (full prototype): 24x blade chassis, 216 blades, 3.240 compute cards, 12x 36-port 10GbE switches + 8-port 40GbE uplink, 103.2 TFLOPS (peak), 42.6 kWatts, 2.4 GFLOPS/W (peak)
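As with Tibidabo, the projected GFLOPS/W at each Mont-Blanc packaging level is just GFLOPS over Watts (slide figures; the "(?)" power numbers are estimates):

```python
# Projected GFLOPS and Watts per packaging level, from the slide.
levels = [
    ("carrier blade", 485,    175),
    ("7U chassis",    4300,   1700),
    ("1 rack",        17200,  7100),
    ("6 racks",       103200, 42600),
]
for name, gflops, watts in levels:
    print(f"{name}: {gflops / watts:.1f} GFLOPS/W")
```

Unlike Tibidabo's 0.4 → 0.15 GFLOPS/W collapse, the projection only degrades from 2.8 to 2.4 GFLOPS/W at full scale.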

SLIDE 55

Are we building BlueGene again?

Yes ...

– Exploit Pollack's Rule in the presence of abundant parallelism

  • Many small cores vs. a single fast core

... and No

– Heterogeneous computing

  • On-chip GPU

– Commodity vs. Special purpose

  • Higher volume
  • Many vendors
  • Lower cost

– Lots of room for improvement

  • No SIMD / vectors yet ...

– Build on Europe's embedded strengths
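Pollack's Rule (single-core performance grows roughly with the square root of core complexity/area) is what the "many small cores" bet exploits. A toy sketch, with the fixed silicon budget and the sqrt model as assumptions:

```python
import math

# Pollack's Rule: performance ~ sqrt(core area), normalized so a
# 1-unit core has performance 1.
def core_perf(area):
    return math.sqrt(area)

budget = 16  # silicon area units to spend
one_big_core = core_perf(budget)           # one core of area 16 -> 4.0
many_small_cores = budget * core_perf(1)   # 16 cores of area 1  -> 16.0
print(one_big_core, many_small_cores)      # assumes perfect parallelism
```

With abundant parallelism the small-core design wins by budget/sqrt(budget) (4x here); serial sections tilt the balance back toward the big core (Amdahl's law).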

SLIDE 56

There is no free lunch

– More nodes for the same performance
– 15x more address spaces
– ½ on-chip memory / core
– 1 GbE inter-chip communication

SLIDE 57

OmpSs runtime layer manages architecture complexity

The programmer is exposed to a simple architecture

Task graph provides lookahead

– Exploit knowledge about the future

Automatically handle all of the architecture challenges

– Strong scalability – Multiple address spaces – Low cache size – Low interconnect bandwidth

Enjoy the positive aspects

– Energy efficiency
– Low cost

[Diagram: execution timelines for processes P0-P2, without vs. with communication/computation overlap]
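This is not the OmpSs/NANOS++ scheduler itself, just a toy illustration of why a task graph gives the runtime lookahead: tasks whose inputs are already produced can run together as a "wave", overlapping work that strict program order would serialize. All task names below are made up:

```python
# Toy dependency-driven scheduler: group tasks into waves such that
# every task in a wave depends only on tasks from earlier waves.
def schedule_levels(deps):
    levels, done, pending = [], set(), set(deps)
    while pending:
        wave = sorted(t for t in pending if set(deps[t]) <= done)
        levels.append(wave)
        done.update(wave)
        pending.difference_update(wave)
    return levels

# Hypothetical pipeline: two independent computes, two sends, a reduction.
deps = {
    "compute0": [], "compute1": [],
    "send0": ["compute0"], "send1": ["compute1"],
    "reduce": ["send0", "send1"],
}
print(schedule_levels(deps))
```

Both computes land in the first wave and both sends in the second, so communication for one half can overlap computation for the other once the runtime sees the graph ahead of time.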

SLIDE 58

Very high expectations ...

High media impact of ARM-based HPC Scientific, HPC, general press quote Mont- Blanc objectives

– Highlighted by Eric Schmidt, Google Executive Chairman, at the EC's Innovation Convention

SLIDE 59

The hype curve

We'll see how deep it gets on the way down ...

[Figure: Gartner hype cycle (visibility vs. time): Technology Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity]

SLIDE 60

Conclusions

Mont-Blanc architecture is shaping up

– ARM multicore + integrated OpenCL accelerator – Ethernet NIC – High density packaging

OmpSs programming model ported to OpenCL

Applications being ported to the tasking model

Stay tuned!

MontBlancEU @MontBlanc_EU www.montblanc-project.eu