SLIDE 1

Benefits of the ARM architecture on the even Berkeley Dwarfs

Patric Mai, Pierre Schoonbrood

RWTH Aachen University patric.mai@gmx.de, pierre.schoonbrood@rwth-aachen.de

February 12, 2015

Patric Mai, Pierre Schoonbrood (RWTH) ARM even Berkeley Dwarfs February 12, 2015 1 / 35

SLIDE 2

Overview

1. The road to exaflopic computing
     ARM architecture
     ARM HPC clusters
2. Performance of the ARM architecture
     Sparse linear algebra
     Unstructured grids
     Combinational logic
     Optimizing OpenCL for the ARM architecture
     Graphical models
3. Developing for ARM
4. Conclusions
5. References
6. Who did what?

SLIDE 3

(1) - Current trends

www.top500.org

SLIDE 4

(1) - Limitations

A supercomputer should not exceed a power budget of 20 MW
Currently: 2 GFlops/Watt
Required: 50 GFlops/Watt
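
The required efficiency follows from dividing the performance target by the power budget. A minimal sketch of that arithmetic, assuming the target is 1 EFlop/s (inferred from the "exaflopic computing" section title; the function and constant names are illustrative):

```python
def required_gflops_per_watt(target_flops, power_budget_watts):
    """Energy efficiency (GFlops/Watt) needed to reach a target within a power budget."""
    return target_flops / power_budget_watts / 1e9


EXAFLOP = 1e18      # assumed target: 1 EFlop/s
BUDGET_W = 20e6     # 20 MW power budget from the slide

required = required_gflops_per_watt(EXAFLOP, BUDGET_W)
current = 2.0       # GFlops/Watt, as stated on the slide

print(f"required: {required:.0f} GFlops/Watt")
print(f"improvement factor over current systems: {required / current:.0f}x")
```

Under these assumptions the gap between current and required efficiency is a factor of 25.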

SLIDE 5

(1) - ARM architecture

www.arm.com

SLIDE 6

(1) - ARM HPC Clusters

HPC clusters built from low-power SoCs (ARM)

Tibidabo (Rajovic et al. 2013)
  256 nodes, each with:
    Cortex-A9 dual-core @ 1 GHz
    1 GB DDR2 SDRAM

Mont-Blanc project (http://www.montblanc-project.eu)
  Pedraforca: 70 nodes, each with:
    Cortex-A9 quad-core @ 1.4 GHz
    4 GB DDR3 SDRAM

SLIDE 8

(2) - Benchmarking the ARM architecture

How suitable is the ARM architecture for HPC applications?
Performance per Watt for several dwarfs
Comparison with other architectures
Optimizing the OpenCL framework for ARM

SLIDE 9

(2) - (FEAST) Finite element analysis (1)

Mesh of points representing an object (for example, a fluid)
Points have properties (for example, viscosity)
Differential equations describe the behavior (laminar or turbulent?)
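
To make the mesh-plus-differential-equation idea concrete, here is a minimal sketch of a finite element solve. It is not FEAST: it assumes a one-dimensional model problem, -u'' = 1 on (0, 1) with u(0) = u(1) = 0, linear elements on a uniform mesh, and a direct tridiagonal solver. All function names are illustrative. The assembled matrix is sparse (tridiagonal), which is why this workload falls under the sparse linear algebra dwarf.

```python
def assemble_poisson_1d(n_elements):
    """Assemble the tridiagonal system for -u'' = 1 on (0, 1), u(0) = u(1) = 0."""
    h = 1.0 / n_elements
    n = n_elements - 1                  # number of interior mesh points
    diag = [0.0] * n                    # main diagonal of the stiffness matrix
    off = [0.0] * (n - 1)               # off-diagonal (matrix is symmetric)
    rhs = [0.0] * n                     # load vector
    for e in range(n_elements):         # element e spans mesh nodes e and e + 1
        for node in (e, e + 1):
            i = node - 1                # index among the interior nodes
            if 0 <= i < n:              # boundary nodes are fixed to zero
                diag[i] += 1.0 / h      # element stiffness contribution
                rhs[i] += h / 2.0       # element load contribution for f = 1
        if 1 <= e <= n - 1:             # both nodes interior: add the coupling
            off[e - 1] += -1.0 / h
    return diag, off, rhs


def solve_tridiagonal(diag, off, rhs):
    """Solve a symmetric tridiagonal system with the Thomas algorithm."""
    n = len(diag)
    d, b = diag[:], rhs[:]
    for i in range(1, n):               # forward elimination
        m = off[i - 1] / d[i - 1]
        d[i] -= m * off[i - 1]
        b[i] -= m * b[i - 1]
    x = [0.0] * n
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):      # back substitution
        x[i] = (b[i] - off[i] * x[i + 1]) / d[i]
    return x


diag, off, rhs = assemble_poisson_1d(8)
u = solve_tridiagonal(diag, off, rhs)
print(u[3])   # value at the mesh point x = 0.5
```

For this model problem the exact solution is u(x) = x(1 - x)/2, so the value at x = 0.5 can be checked against 0.125.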

SLIDE 10

(2) - (FEAST) Finite element analysis (2)

Tibidabo benchmarked against a Xeon cluster (LiDOng)
Tested with four configurations:

1. LiDOng with as many cores as Tibidabo
2. As (1), but with all cores of each LiDOng node active
3. As few LiDOng nodes as possible with respect to memory
4. Twice the number of nodes of (3)

SLIDE 11

(2) - (FEAST) Finite element analysis (3)

Göddeke et al. 2012

SLIDE 12

(2) - (FEAST) Finite element analysis (4)

Göddeke et al. 2012

SLIDE 13

(2) - (HONEI LBM) Computational fluid dynamics (1)

Mesh constructed for the fluid
Definition of physical bounds
Definition of the physical model
Simulation of the behavior of liquids and gases
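
The lattice Boltzmann method (LBM) named in the slide title can be sketched in a few lines. This is not the HONEI implementation: it assumes a D2Q9 lattice, a BGK collision operator with relaxation time `tau`, and periodic boundaries instead of real physical bounds. All names are illustrative.

```python
# D2Q9 lattice: nine discrete velocities and their weights.
E = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1), (1, 1), (-1, 1), (-1, -1), (1, -1)]
W = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4


def equilibrium(rho, ux, uy):
    """Equilibrium populations for density rho and velocity (ux, uy)."""
    usq = ux * ux + uy * uy
    feq = []
    for (ex, ey), w in zip(E, W):
        eu = ex * ux + ey * uy
        feq.append(w * rho * (1 + 3 * eu + 4.5 * eu * eu - 1.5 * usq))
    return feq


def lbm_step(f, tau):
    """One collide-and-stream step; f[y][x] holds the nine populations of a cell."""
    ny, nx = len(f), len(f[0])
    new = [[[0.0] * 9 for _ in range(nx)] for _ in range(ny)]
    for y in range(ny):
        for x in range(nx):
            cell = f[y][x]
            rho = sum(cell)
            ux = sum(c * e[0] for c, e in zip(cell, E)) / rho
            uy = sum(c * e[1] for c, e in zip(cell, E)) / rho
            feq = equilibrium(rho, ux, uy)
            for i, (ex, ey) in enumerate(E):
                post = cell[i] - (cell[i] - feq[i]) / tau    # BGK collision
                new[(y + ey) % ny][(x + ex) % nx][i] = post  # periodic streaming
    return new


def total_mass(f):
    return sum(sum(cell) for row in f for cell in row)


# Fluid at rest on a 4 x 4 mesh with one denser, slowly moving cell.
grid = [[equilibrium(1.0, 0.0, 0.0) for _ in range(4)] for _ in range(4)]
grid[1][1] = equilibrium(1.2, 0.05, 0.0)
before = total_mass(grid)
for _ in range(5):
    grid = lbm_step(grid, tau=0.8)
print(before, total_mass(grid))   # mass should be conserved up to rounding
```

Each cell only reads its own populations and writes to its direct neighbors, which is what makes the method attractive for parallel hardware.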

SLIDE 14

(2) - (HONEI LBM) Computational fluid dynamics (2)

Göddeke et al. 2012

SLIDE 15

(2) - (HONEI LBM) Computational fluid dynamics (3)

Göddeke et al. 2012

SLIDE 16

(2) - Rijndael and Bitcount

Bitcount - count set bits in an array

Uneven workload

Rijndael - derive round keys, for example for AES

Many XOR operations
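
Both kernels can be sketched briefly. The bit counter below uses Kernighan's loop, which runs once per set bit; that data dependence is one plausible source of the uneven workload, although the benchmark itself may use other counting methods. The key schedule follows the AES-128 key expansion from FIPS 197, with the S-box computed rather than hard-coded to keep the sketch short. All function names are illustrative.

```python
def bitcount(word):
    """Count set bits; the loop runs once per set bit, so cost depends on the data."""
    count = 0
    while word:
        word &= word - 1        # clear the lowest set bit
        count += 1
    return count


def bitcount_array(words):
    return sum(bitcount(w) for w in words)


def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return p


def build_sbox():
    """S-box: multiplicative inverse in GF(2^8) followed by the affine transform."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def sub_word(word):
    """Apply the S-box to each of the four bytes of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in word.to_bytes(4, "big")), "big")


def expand_key_128(key):
    """AES-128 key schedule: 16-byte key -> 44 words (11 round keys)."""
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        t = w[i - 1]
        if i % 4 == 0:
            t = ((t << 8) | (t >> 24)) & 0xFFFFFFFF    # RotWord
            t = sub_word(t) ^ (rcon << 24)             # SubWord, then XOR with Rcon
            rcon = gf_mul(rcon, 2)
        w.append(w[i - 4] ^ t)                         # one XOR per generated word
    return w


round_words = expand_key_128(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(bitcount_array([0b1011, 0, 0xFF]), hex(round_words[4]))
```

Every generated word costs at least one XOR, which matches the slide's remark about the XOR-heavy character of the kernel.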

SLIDE 17

(2) - Performance on Combinational logic

Maghazeh et al. 2013

SLIDE 18

(2) - Optimizing OpenCL runtime for ARM (1)

ARM processor: both host processor and OpenCL device
Every core is one compute unit
Every core is a single processing element
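
The mapping above can be modeled in a few lines. This is a sketch of the execution model only, not of the runtime from the cited paper: it assumes a one-dimensional NDRange, a round-robin assignment of work-groups to cores, and serial execution of the work-items of a group, since a compute unit with a single processing element cannot run them side by side. `run_ndrange` and its parameters are hypothetical names.

```python
def run_ndrange(kernel, global_size, local_size, num_cores):
    """Run kernel(global_id, local_id, group_id) over a 1-D NDRange.

    Returns which work-groups each core (compute unit) was assigned.
    """
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    num_groups = global_size // local_size
    schedule = {core: [] for core in range(num_cores)}
    for group in range(num_groups):
        core = group % num_cores                 # assumed round-robin placement
        schedule[core].append(group)
        # One processing element per core: work-items of a group run one by one.
        for local_id in range(local_size):
            kernel(group * local_size + local_id, local_id, group)
    return schedule


result = [0] * 8


def square(global_id, local_id, group_id):
    result[global_id] = global_id * global_id


assignment = run_ndrange(square, global_size=8, local_size=2, num_cores=2)
print(result)
print(assignment)
```

The model runs everything in one thread; in a real runtime each core would process its list of work-groups concurrently with the others.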

SLIDE 19

(2) - Optimizing OpenCL runtime for ARM (2)

Jo et al. 2014

SLIDE 20

(2) - Optimizing OpenCL runtime for ARM (3)

Jo et al. 2014

SLIDE 21

(2) - Optimizing OpenCL compilation for ARM

Jo et al. 2014

SLIDE 22

(2) - Optimizing OpenCL compilation (NEON)

Auto-vectorization by the GCC compiler
Vector operations are converted into NEON intrinsic functions
Binaries contain NEON functions
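
What vectorization changes can be illustrated with a lane-wise addition. The sketch below only emulates the idea in Python: it assumes 128-bit NEON registers holding four 32-bit floats, processes full groups of four elements as one "vector operation", and handles the remainder with scalar operations. It does not call any NEON intrinsic; `vector_add` and `LANES` are illustrative names.

```python
LANES = 4   # a 128-bit NEON register holds four 32-bit floats


def vector_add(a, b):
    """Add two equally long lists, counting vector and scalar operations."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out = [0.0] * len(a)
    main = len(a) - len(a) % LANES          # part that fills whole registers
    vector_ops = 0
    for i in range(0, main, LANES):
        # Emulates a single four-lane add on one register pair.
        out[i:i + LANES] = [x + y for x, y in zip(a[i:i + LANES], b[i:i + LANES])]
        vector_ops += 1
    scalar_ops = 0
    for i in range(main, len(a)):           # scalar tail for leftover elements
        out[i] = a[i] + b[i]
        scalar_ops += 1
    return out, vector_ops, scalar_ops


values, vec, scal = vector_add([float(i) for i in range(10)], [1.0] * 10)
print(values, vec, scal)
```

Ten additions shrink to two vector operations plus two scalar ones in this model, which is the kind of saving the NEON code path aims for.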

SLIDE 23

(2) - Optimizing OpenCL results

Improvement over PGCL

Jo et al. 2014

SLIDE 24

(2) - Deep convolutional neural networks

Feed-forward neural network
Collections of neurons are each responsible for part of an image
Several layers (filters)
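
A single filter layer can be sketched as a small sliding-window computation. The code below is a minimal illustration, not the implementation from the cited work: it assumes a single-channel image, "valid" borders, no stride or padding, and a ReLU activation. As in most neural network libraries, the filter is applied without flipping (cross-correlation). Function names are illustrative.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image; each output value sees one image patch."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            out[y][x] = sum(
                image[y + i][x + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
    return out


def relu(feature_map):
    """Element-wise activation: negative responses are clamped to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
box_filter = [[1.0, 1.0],
              [1.0, 1.0]]
print(relu(conv2d_valid(image, box_filter)))
```

Stacking several such layers, each with its own filters, gives the feed-forward structure described above.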

SLIDE 25

(2) - Performance on neural networks (1)

Jin et al. 2014

SLIDE 26

(2) - Performance on neural networks (2)

Jin et al. 2014

SLIDE 27

(2) - Performance on neural networks (3)

SoC used has 2 cores with 512 KB L2 cache
Intermediate result: 256 KB in size
Lots of cache misses
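
One possible reading of these numbers is a working-set argument: a layer reads one intermediate result and writes another, so two 256 KB buffers alone fill a 512 KB cache before any filter weights are counted. The sketch below illustrates that estimate. The layer dimensions and filter sizes are hypothetical values chosen only so that one feature map comes out at 256 KB; they are not taken from the cited work.

```python
KIB = 1024


def feature_map_bytes(channels, height, width, bytes_per_value=4):
    """Memory footprint of one intermediate result stored as 32-bit values."""
    return channels * height * width * bytes_per_value


def fits_in_cache(buffer_sizes, cache_bytes):
    """True if all buffers touched by a layer fit into the cache together."""
    return sum(buffer_sizes) <= cache_bytes


L2_BYTES = 512 * KIB                               # L2 size from the slide
intermediate = feature_map_bytes(64, 32, 32)       # hypothetical 64 x 32 x 32 map
weights = 3 * 3 * 64 * 64 * 4                      # hypothetical 3 x 3 filters

print(intermediate // KIB, "KiB per intermediate result")
print(fits_in_cache([intermediate], L2_BYTES))
print(fits_in_cache([intermediate, intermediate, weights], L2_BYTES))
```

Under these assumptions the input, output, and weights of one layer no longer fit together, which would be consistent with the cache misses reported on the slide.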

SLIDE 28

(2) - Performance on neural networks (4)

Jin et al. 2014

SLIDE 30

(3) - Developing for ARM

SLIDE 32

(4) - Conclusions

ARM is more energy efficient most of the time
Mostly used for embedded applications
Usable for HPC applications

Limiting factor: resources (memory, caches)
Frameworks should be optimized for ARM

SLIDE 33

(5) - References (1)

Göddeke, Dominik, et al. Energy efficiency vs. performance of the numerical solution of PDEs: an application study on a low-power ARM-based cluster. Journal of Computational Physics, 2013, vol. 237, pp. 132-150.

Jo, Gangwon, et al. OpenCL framework for ARM processors with NEON support. In: Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing. ACM, 2014, pp. 33-40.

Maghazeh, Arian, et al. General purpose computing on low-power embedded GPUs: Has it come of age? In: Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIII), 2013 International Conference on. IEEE, 2013, pp. 1-10.

SLIDE 34

(5) - References (2)

Jin, Jonghoon, et al. An efficient implementation of deep convolutional neural networks on a mobile coprocessor. In: Circuits and Systems (MWSCAS), 2014 IEEE 57th International Midwest Symposium on. IEEE, 2014, pp. 133-136.

SLIDE 35

(6) - Who did what?

Shared Introduction - Patric Mai
Road to exaflopic computing - Pierre Schoonbrood
Sparse linear algebra - Pierre Schoonbrood
Unstructured grids - Pierre Schoonbrood
Combinational logic - Patric Mai
Optimizing OpenCL for ARM - Patric Mai
Graphical models - Pierre Schoonbrood
Developing for ARM - Patric Mai
Conclusions - Both
