Are Low-Power SoCs Feasible for Heterogenous HPC Workloads?

SLIDE 1

Are Low-Power SoCs Feasible for Heterogenous HPC Workloads?

Max Plauth and Andreas Polze
Operating Systems and Middleware Group
Hasso Plattner Institute, University of Potsdam, Germany

SLIDE 2

Introduction

■ Operating Systems and Middleware Group
  □ Prof. Dr. Andreas Polze
  □ 7 PhD students, 15 Master's theses in progress
  □ "Extending the reach of Middleware"

Max Plauth, UCHPC@EuroPar'16, 23.08.2016

SLIDE 3

Introduction

■ SSICLOPS: Scalable and Secure Infrastructures for Cloud OPerationS

SLIDE 4

Motivation

■ Power efficiency has become an additional concern for the HPC sector
  □ Running electricity costs exceed initial acquisition costs
■ The HPC community is taking an interest in low-power system-on-chip (SoC) designs
■ This work focuses on the heterogeneous aspects
  □ CPU: heterogeneous multiprocessing / big.LITTLE paradigm
  □ CPU: improvements of the ARMv8-A ISA
  □ GPU: SoC-grade GPUs have become OpenCL capable

SLIDE 5

Motivation (continued)

■ We provide the following contributions:
  □ We investigate the heterogeneous capabilities of state-of-the-art SoCs, elaborating on both the heterogeneous multiprocessing feature of big.LITTLE CPUs and the compute capabilities of SoC-grade GPUs.
  □ We compare the characteristics of ARMv8-A versus ARMv7-A SoCs.
  □ Based on the narrowing gap between ARM and x86_64 based SoCs, we anticipate the potential of forthcoming ARM designs in the HPC domain.

SLIDE 6

Related Work

■ GreenDestiny (2002) / MegaProto (2005)
  □ First attempts at using low-power hardware in HPC scenarios
  □ GreenDestiny: 240x TM5600 667-MHz CPUs → 13.5 MFLOPS/Watt
  □ MegaProto: 512x TM8820 1-GHz CPUs → 100 MFLOPS/Watt
■ Rajovic et al. (2013) / Mont-Blanc project: "Supercomputing with commodity CPUs: Are mobile SoCs ready for HPC?"
  □ NVIDIA Tegra 2 & 3 (Cortex-A9); Samsung Exynos 5250 (Cortex-A15)
  □ No GPU compute support at the time
  □ Outlook is promising, but the SoCs available at the time had many issues
    – No ECC, unstable PCIe implementation, etc.

SLIDE 7

Hardware Targets

■ Raspberry Pi 3
  □ SoC: Broadcom BCM2837
  □ CPU: 4x ARM Cortex-A53 CPU, ARMv8-A
    – 1.2 GHz, in-order execution
    – L1$ (I/D): 32KB/32KB
    – L2$: 512KB
  □ Memory: 1GB LPDDR2 (900 MHz)
  □ GPU: BCM VideoCore IV (no compute capabilities)
  □ OS: Ubuntu MATE 15.10 / Linux 4.1.18-v7+ (armv7l)
  □ Compiler: GCC v5.2.1

SLIDE 8

Hardware Targets (continued)

■ Odroid-C2
  □ SoC: Amlogic S905
  □ CPU: 4x ARM Cortex-A53 CPU, ARMv8-A
    – 2.0 GHz, in-order execution
    – L1$ (I/D): 32KB/32KB
    – L2$: 512KB
  □ Memory: 2GB DDR3 (32 bit / 912 MHz)
  □ GPU: ARM Mali-450 (no compute capabilities)
  □ OS: Ubuntu MATE 16.04 / Linux 3.14.29-29 (aarch64)
  □ Compiler: GCC v5.3.1

SLIDE 9

Hardware Targets (continued)

■ Odroid-XU4
  □ SoC: Samsung Exynos 5422
  □ CPU: big.LITTLE octa-core, ARMv7-A
    – 4x Cortex-A7, 1.5 GHz, in-order execution
    – 4x Cortex-A15, 2.0 GHz, out-of-order execution
    – L1$ (I/D): 32KB/32KB
    – L2$: 512KB (A7) / 2MB (A15)
  □ Memory: 2GB LPDDR3 (32 bit / 933 MHz, PoP)
  □ GPU: ARM Mali-T628 MP6 (OpenCL v1.1)
  □ OS: Ubuntu MATE 15.10 / Linux 3.10.96-78 (armv7l, HMP)
  □ Compiler: GCC v5.2.1
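
The later result slides report the Cortex-A7 and Cortex-A15 clusters separately. The slides do not say how runs were bound to one cluster; one common way on Linux is CPU affinity. The following is a minimal sketch under stated assumptions: the core numbering (0-3 for the A7 cluster, 4-7 for the A15 cluster) is an assumption about how the kernel enumerates the Exynos 5422, and `CLUSTERS` and `pin_to_cluster` are hypothetical names, not part of the original work.

```python
import os

# Assumed core numbering for the Exynos 5422 (not stated on the slides):
# cores 0-3 form the Cortex-A7 cluster, cores 4-7 the Cortex-A15 cluster.
CLUSTERS = {
    "A7": frozenset(range(0, 4)),
    "A15": frozenset(range(4, 8)),
}


def pin_to_cluster(name, pid=0):
    """Restrict process `pid` (0 = calling process) to one cluster.

    Only cores the process is currently allowed to use are kept, so the
    call fails loudly instead of silently running on the wrong cluster.
    Returns the set of cores that was applied.
    """
    wanted = CLUSTERS[name]
    allowed = os.sched_getaffinity(pid)  # Linux-only API
    usable = wanted & allowed
    if not usable:
        raise RuntimeError(f"no core of cluster {name} is available")
    os.sched_setaffinity(pid, usable)
    return usable
```

Calling `pin_to_cluster("A15")` before launching a benchmark would keep the process and its children on the big cores, provided the assumed numbering matches the board.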

SLIDE 10

Hardware Targets (continued)

■ HPE ProLiant m710p Server Cartridge
  □ CPU: Intel Xeon E3-1284L v4, 4C/8T, x86_64
    – 2.90 GHz, out-of-order execution
    – L1$ (I/D): 32KB/32KB (per core)
    – L2$: 256KB (per core)
    – L3$: 6MB (shared)
    – L4$: 128MB eDRAM
  □ Memory: 32GB DDR3-1600 SODIMM
  □ GPU: Iris Pro P6300, Broadwell GT3 (OpenCL v1.2)
  □ OS: Ubuntu 16.04 LTS / Linux 4.4.0-21 (x86_64)
  □ Compiler: GCC v5.3.1

SLIDE 11

Benchmark procedure

■ Rodinia Suite: picked 4 tests to cover major Berkeley Dwarfs
    – Structured Grid (Leukocyte Tracking)
    – Unstructured Grid (CFD Solver)
    – Dense Linear Algebra (k-Nearest Neighbours)
    – Graph Traversal (Breadth-First Search)
  □ Warm-up run + 10 repeated measurements
  □ Energy consumption measured (Off/Idle/Load)
■ STREAM Benchmark (Memory Bandwidth)
■ TinyMemBench (Memory Latency)
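
The "warm-up run + 10 repeated measurements" scheme can be sketched as a small timing harness. This is an illustration only: the slides do not state how the ten samples were aggregated, so reporting mean and standard deviation is an assumption, and `measure` is a hypothetical helper name.

```python
import statistics
import time


def measure(workload, repetitions=10):
    """Run `workload` once as warm-up, then time `repetitions` runs.

    Returns (mean, standard deviation) of the wall-clock times in seconds.
    Aggregating by mean and stdev is an assumption, not taken from the slides.
    """
    workload()  # warm-up run, its timing is discarded
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)
```

In practice `workload` would launch one of the Rodinia binaries (for example through `subprocess.run`); any callable without arguments works for trying out the harness.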

SLIDE 12

Power Consumption

Power draw per system and state [W]:

      RPI 3     C2  XU4 (A7)  XU4 (A15)  XU4 (GPU)  m710p (CPU)  m710p (GPU)
Off    0.50   1.00      0.70       0.70       0.70         9.85         9.85
Idle   1.70   2.30      3.80       3.80       3.80        20.65        20.65
Load   2.70   4.10      5.10      11.70       6.60        79.45        67.93

■ SBCs: power consumption was measured using a power outlet meter
■ m710p: power consumption was retrieved through the HPE iLO management interface
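
To relate these power readings to the Energy-to-Computation (EtC) figures on the result slides, a simple model multiplies load power by run time. The sketch below uses the readings from this slide, taken to be in watts. The constant-load-power model is an assumption, as the slides do not describe how EtC was derived, and the function names are hypothetical.

```python
# Power readings from this slide as (off, idle, load), taken to be in watts.
POWER = {
    "RPI 3": (0.50, 1.70, 2.70),
    "C2": (1.00, 2.30, 4.10),
    "XU4 (A7)": (0.70, 3.80, 5.10),
    "XU4 (A15)": (0.70, 3.80, 11.70),
    "XU4 (GPU)": (0.70, 3.80, 6.60),
    "m710p (CPU)": (9.85, 20.65, 79.45),
    "m710p (GPU)": (9.85, 20.65, 67.93),
}


def energy_to_computation(system, ttc_seconds):
    """Energy in joules, assuming constant load power for the whole run."""
    _off, _idle, load = POWER[system]
    return load * ttc_seconds


def dynamic_power(system):
    """Load power minus idle power: the share attributable to the workload."""
    _off, idle, load = POWER[system]
    return load - idle
```

Under this model a hypothetical 10 s run on the XU4's A15 cluster would cost about 117 J, and the workload-induced share of the m710p's CPU power draw is about 58.8 W.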

SLIDE 13

STREAM memory bandwidth

[Figure: bar chart of memory bandwidth in MB/s for the Copy, Scale, Add, and Triad kernels on RPI 3, C2 (32), C2 (64), XU4 (A7), XU4 (A15), XU4 (Both), and m710p]
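
For readers unfamiliar with the four series in the chart, the sketch below spells out the Copy, Scale, Add, and Triad kernels and how a MB/s figure follows from bytes moved divided by elapsed time. It is written in pure Python for readability, so interpreter overhead dominates and the resulting numbers say nothing about hardware bandwidth; the real benchmark is a compiled C or Fortran program. `run_stream` and `KERNELS` are hypothetical names.

```python
import time
from array import array


def copy(a, b, c, s):
    for i in range(len(a)):
        c[i] = a[i]


def scale(a, b, c, s):
    for i in range(len(a)):
        b[i] = s * c[i]


def add(a, b, c, s):
    for i in range(len(a)):
        c[i] = a[i] + b[i]


def triad(a, b, c, s):
    for i in range(len(a)):
        a[i] = b[i] + s * c[i]


# Kernel function and number of arrays it reads or writes per element.
KERNELS = {"Copy": (copy, 2), "Scale": (scale, 2), "Add": (add, 3), "Triad": (triad, 3)}


def run_stream(n=100_000, scalar=3.0):
    """Run each kernel once; return ({kernel: MB/s}, (a, b, c))."""
    a = array("d", [1.0]) * n
    b = array("d", [2.0]) * n
    c = array("d", [0.0]) * n
    results = {}
    for name, (kernel, arrays_touched) in KERNELS.items():
        start = time.perf_counter()
        kernel(a, b, c, scalar)
        elapsed = time.perf_counter() - start
        bytes_moved = arrays_touched * n * a.itemsize  # 8 bytes per double
        results[name] = bytes_moved / elapsed / 1e6    # MB/s with 1 MB = 10^6 bytes
    return results, (a, b, c)
```

Add and Triad touch three arrays per element instead of two, which is why their byte count is higher for the same array length.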

SLIDE 14

TinyMemBench memory latency

[Figure: memory latency in ns over block size in KiB (64 to 65536) for m710p, RPI 3, C2, XU4 (A7), and XU4 (A15)]

SLIDE 15

Structured Grid (Leukocyte Tracking)

[Figure: bar charts of Time-to-Computation (TtC) in s and Energy-to-Computation (EtC) in J for RPI 3, C2 (32), C2 (64), XU4 (A7), XU4 (A15), XU4 (GPU), m710p (CPU), and m710p (GPU)]

■ Heterogeneity: EtC(XU4 GPU) <<< EtC(A15) < EtC(A7)
■ ARMv8-A: +105% EtC (C2/64 compared to A7), +24% TtC (C2/64 vs. C2/32)
■ ARM vs. x86_64: XU4 GPU delivers competitive EtC and TtC performance
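
The percentages on this and the following result slides are not defined in the deck. Since values above 100% cannot be a fractional reduction, one plausible reading is a ratio-based improvement of a lower-is-better metric. The sketch below implements that reading only; the formula and the name `improvement_percent` are assumptions, not taken from the slides.

```python
def improvement_percent(baseline, candidate):
    """Relative improvement of a lower-is-better metric, in percent.

    A positive result means `candidate` needs less time or energy than
    `baseline`; +100 corresponds to half the baseline value.
    """
    if baseline <= 0 or candidate <= 0:
        raise ValueError("metric values must be positive")
    return (baseline / candidate - 1.0) * 100.0
```

With this definition, a system that needs half the energy of the baseline scores +100%, and one that needs twice the energy scores -50%.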

SLIDE 16

Unstructured Grid (CFD Solver)

[Figure: bar charts of Time-to-Computation (TtC) in s and Energy-to-Computation (EtC) in J for RPI 3, C2 (32), C2 (64), XU4 (A7), XU4 (A15), XU4 (GPU), m710p (CPU), and m710p (GPU)]

■ Heterogeneity: TtC(XU4 GPU) < TtC(A15) <<< TtC(A7)
■ ARMv8-A: +72% EtC (C2/64 compared to A7), no autovectorization → C2/64 > C2/32
■ ARM vs. x86_64: XU4 GPU is competitive for EtC, but nothing else

SLIDE 17

Dense Linear Algebra (k-Nearest Neighbours)

[Figure: bar charts of Time-to-Computation (TtC) in s and Energy-to-Computation (EtC) in J for RPI 3, C2 (32), C2 (64), XU4 (A7), XU4 (A15), XU4 (GPU), m710p (CPU), and m710p (GPU)]

■ Heterogeneity: EtC(A7) < EtC(A15); TtC(XU4 GPU) <<< TtC(A15) < TtC(A7)
■ ARMv8-A: +186% EtC (C2/64 compared to A7), +73% TtC (C2/64 vs. C2/32)
■ ARM vs. x86_64: XU4 GPU delivers competitive EtC and TtC performance

SLIDE 18

Graph Traversal (Breadth-First Search)

[Figure: bar charts of Time-to-Computation (TtC) in s and Energy-to-Computation (EtC) in J for RPI 3, C2 (32), C2 (64), XU4 (A7), XU4 (A15), XU4 (GPU), m710p (CPU), and m710p (GPU)]

■ Heterogeneity: EtC(XU4 GPU) < EtC(A7/A15); TtC(A15) < TtC(XU4 GPU) < TtC(A7)
■ ARMv8-A: no autovectorization → C2/64 > C2/32
■ ARM vs. x86_64: superior EtC performance for ARM-based hardware

SLIDE 19

Conclusion – So, are Low-Power SoCs Feasible for Heterogenous HPC Workloads?

■ Yes, but it depends:
  □ Heterogeneity
    – Sometimes, using the little cores provides better EtC results
    – SoC GPU compute performance is impressive
    – Benchmarks were not even optimized for specifics of Mali GPUs
  □ ARMv8-A
    – A53 provided much better EtC and TtC performance compared to A7
  □ Competitiveness with x86_64
    – In many scenarios, SoCs provide better EtC performance
    – TtC performance is getting better (A53 vs. A7)
    – Gap in GPU performance is small

SLIDE 20

Thank you for your attention!

Max Plauth and Andreas Polze
Operating Systems and Middleware Group
Hasso Plattner Institute, University of Potsdam, Germany