Architecture and software for when there's no longer plenty of room at the bottom – PowerPoint PPT Presentation



SLIDE 1

Architecture and software for when there’s no longer plenty of room at the bottom

Paul H J Kelly Group Leader, Software Performance Optimisation Co-Director, Centre for Computational Methods in Science and Engineering Department of Computing, Imperial College London

Joint work with: David Ham (Imperial Computing/Maths/Grantham Inst for Climate Change); Gerard Gorman, Michael Lange (Imperial Earth Science Engineering – Applied Modelling and Computation Group); Mike Giles, Gihan Mudalige, Istvan Reguly (Mathematical Inst, Oxford); Doru Bercea, Fabio Luporini, Graham Markall, Lawrence Mitchell, Florian Rathgeber, Francis Russell, George Rokos, Paul Colea (Software Perf Opt Group, Imperial Computing); Spencer Sherwin (Aeronautics, Imperial); Chris Cantwell (Cardio-mathematics group, Mathematics, Imperial); Michelle Mills Strout, Chris Krieger, Cathie Olschanowsky (Colorado State University); Carlo Bertolli (IBM Research); Ram Ramanujam (Louisiana State University); Doru Thom Popovici, Franz Franchetti (CMU); Karl Wilkinson (Cape Town); Chris-Kriton Skylaris (Southampton)

SLIDE 2

What we are doing: domain-specific optimisation

Projects: PyOP2/OP2 (unstructured-mesh stencils), GiMMiK (small-matrix multiplication), Firedrake (finite-element assembly), SLAMBench (dense SLAM – 3D vision), PRAgMaTIc (dynamic mesh adaptation), TINTL (Fourier interpolation)

Contexts: unsteady CFD – higher-order flux reconstruction, finite difference, finite-volume, real-time 3D scene understanding, adaptive-mesh CFD, ab-initio computational chemistry (ONETEP), finite-element

Applications: Formula-1, UAVs, aeroengine turbomachinery, domestic robotics, augmented reality, tidal turbines, solar energy, drug design, weather and climate

Technologies: massive common sub-expressions; vectorisation, parametric polyhedral tiling; lazy, data-driven compute–communicate; multicore graph worklists; optimisation of composite transforms; tiling for unstructured-mesh stencils

Targeting MPI, OpenMP, OpenCL, dataflow/FPGA – from HPC to mobile, embedded and wearable. Runtime code generation.

SLIDE 3

This talk

  • Algorithmics at the limits of Moore's Law
  • Navigating the algorithmic design space
  • Dataflow as a strategy for controlling data movement
  • Domain-specific optimisations
  • Getting the abstraction right
  • Delivering

SLIDE 4

t = state.scalar_fields["Tracer"]       # Extract fields
u = state.vector_fields["Velocity"]     # from Fluidity
p = TrialFunction(t)                    # Set up trial and
q = TestFunction(t)                     # test functions
M = p*q*dx                              # Mass matrix
d = -dt*dfsvty*dot(grad(q),grad(p))*dx  # Diffusion term
D = M - 0.5*d                           # Diffusion matrix
adv = (q*t + dt*dot(grad(q),u)*t)*dx    # Advection RHS
diff = action(M + 0.5*d, t)             # Diffusion RHS
solve(M == adv, t)                      # Solve advection
solve(D == diff, t)                     # Solve diffusion

This is the entire specification of a solver for an advection-diffusion test problem. The same model is implemented in FEniCS/DOLFIN, and also hand-coded in Fortran in Fluidity. [The slide also shows the advection-diffusion problem and its weak form.]
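The split structure of the solver above – an advection solve followed by a diffusion solve each time step – can be mimicked in plain Python. This is not Firedrake/UFL code: it is a minimal 1D periodic finite-difference sketch (first-order upwind advection, explicit diffusion), with all parameter values invented for illustration.

```python
# Minimal 1D periodic advection-diffusion sketch (NOT the Firedrake code):
# each step advects with an upwind difference, then diffuses explicitly,
# mirroring the split "solve advection, then solve diffusion" above.

N, L = 64, 1.0          # grid points, domain length (illustrative values)
dx = L / N
u, kappa = 1.0, 1e-3    # advection velocity, diffusivity (assumed)
dt = 0.4 * dx / u       # CFL-limited time step

t = [1.0 if N // 4 <= i < N // 2 else 0.0 for i in range(N)]  # tracer blob

def step(t):
    # Advection (first-order upwind, periodic wrap via t[-1])
    adv = [t[i] - u * dt / dx * (t[i] - t[i - 1]) for i in range(N)]
    # Diffusion (explicit central difference, periodic)
    return [adv[i] + kappa * dt / dx**2 *
            (adv[(i + 1) % N] - 2 * adv[i] + adv[i - 1]) for i in range(N)]

for _ in range(100):
    t = step(t)

# Both sub-steps conserve the tracer integral on a periodic domain
print(round(sum(t) * dx, 6))
```

Both sub-steps telescope under the periodic boundary, so the discrete integral of the tracer is preserved exactly, which is a quick sanity check on the splitting.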

SLIDE 5

Firedrake: a finite-element framework

  • An alternative implementation of the FEniCS language
  • The FEniCS project's UFL: a DSL for finite-element discretisation
  • The compiler generates PyOP2 kernels and access descriptors
  • PyOP2: a stencil DSL for unstructured meshes, used as an intermediate representation of parallel loops; explicit access descriptors characterise the access footprint of kernels
  • Distributed MPI-parallel PyOP2 implementation
  • All embedded in Python, using runtime code generation

[Toolchain diagram: Unified Form Language (UFL) → "two-stage" form compiler → COFFEE kernel optimiser/vectoriser → PyOP2 (also used directly for non-FE loops over the mesh) → multicore, manycore/GPU, and future/other backends.]

Rathgeber, Ham, Mitchell et al, ACM TOMS 2016

COFFEE: a domain-specific loop optimiser for finite-element assembly and similar loop nests, performing vectorisation and flop minimisation.
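PyOP2's key idea – kernels applied over mesh entities with explicit access descriptors – can be sketched in a few lines of plain Python. The names here (`par_loop`, `READ`, `INC`) follow the spirit of the PyOP2 API but are simplified stand-ins, not the real interface, which also handles WRITE/RW modes, maps between sets, and runtime code generation.

```python
# Sketch of a PyOP2-style parallel loop with explicit access descriptors.
# READ/INC are simplified stand-ins for PyOP2's access modes.

READ, INC = "read", "inc"

def par_loop(kernel, n_edges, *args):
    """Apply `kernel` to every edge; each arg is a (data, map, access) triple."""
    for e in range(n_edges):
        kernel(e, *[(data, emap[e], mode) for data, emap, mode in args])

# Example: sum edge lengths into the two endpoint vertices.
coords = [0.0, 1.0, 3.0, 6.0]             # vertex coordinates
edge_to_verts = [(0, 1), (1, 2), (2, 3)]  # map: edge -> its two vertices
accum = [0.0, 0.0, 0.0, 0.0]              # per-vertex accumulator

def edge_kernel(e, xs, acc):
    (x, xmap, _), (a, amap, _) = xs, acc
    length = abs(x[xmap[1]] - x[xmap[0]])
    for v in amap:                        # INC: indirect increment
        a[v] += length

par_loop(edge_kernel, len(edge_to_verts),
         (coords, edge_to_verts, READ),
         (accum, edge_to_verts, INC))

print(accum)  # each vertex accumulates the lengths of its incident edges
```

The access descriptors are what let the runtime reason about the kernel's footprint – e.g. which increments conflict under parallel execution – without inspecting the kernel body.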

SLIDE 6

Firedrake – single-node performance

Here we compare performance against two production codes solving the same problem on the same mesh:
  • Fluidity: Fortran/C++
  • DOLFIN: the FEniCS project's implementation of UFL

The graph shows speedup over Fluidity on one core of a 12-core Westmere node; GPU results are on an NVIDIA Fermi M2050.

Markall, Rathgeber et al., ICS'13. These results are preliminary and are presented for discussion purposes – see Rathgeber, Ham, Mitchell et al., http://arxiv.org/abs/1501.01809, for a more systematic and up-to-date evaluation.

SLIDE 7

End-to-end accuracy drives algorithm selection

A Helmholtz problem using tetrahedral elements. What is the best combination of mesh spacing h and polynomial order p? That depends on the solution accuracy required – which, in turn, determines whether to choose local or global assembly.

[Plots: optimum discretisation for 10% accuracy and for 0.1% accuracy. Blue dotted lines show runtime; red solid lines show L2 error.]

(C. D. Cantwell, S. J. Sherwin, R. M. Kirby, P. H. J. Kelly, "From h to p efficiently")
SLIDE 8

SLAM: “Simultaneous Location and Mapping”

  • Build coherent world representation and localise camera in real-time
  • “Dense SLAM”: use all the sensor data to build a full surface map
  • Applications in robotics, augmented reality, telepresence

Shahram Izadi et al: KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera. (UIST '11)

SLIDE 9
  • Paul Kelly - Imperial College London

SLAMBench framework

  • Implementation languages: C++, OpenMP, OpenCL, CUDA, SYCL, PENCIL
  • SLAM benchmarks: dense SLAM (KinectFusion, ElasticFusion, …); semi-dense SLAM (LSD-SLAM); sparse SLAM (ORB-SLAM)
  • Desktop to embedded platforms: ARM, Intel, NVIDIA, …
  • Datasets: ICL-NUIM, TUM RGB-D, …
  • Performance evaluation: frame rate, energy, accuracy

SLIDE 10
  • Luigi Nardi - Imperial College London

SLAMBench's optional GUI

[GUI panels: RGB camera, depth camera, tracked points, 3D model, performance.]

SLIDE 11

Co-design space (three nested spaces: algorithmic, compilation, architecture)

Warning: huge spaces – impossible to run exhaustively.

What is the optimisation space?

  • Algorithmic: application-specific parameters; minimisation methods; early-exit condition values
  • Compilation: OpenCL params (-cl-mad-enable, -cl-fast-relaxed-math, etc.); LLVM flags (O1, O2, O3, vectorize-slp-aggressive, etc.); local work-group size (16/32/64/96/112/128/256); vectorisation width (1/2/4/8) and direction (x/y); thread-coarsening factor (1/2/4/8/16/32), stride (1/2/4/8/16/32) and dimension (x/y)
  • Architecture: GPU frequency (177/266/350/420/480/543/600 MHz, or DVFS); number of active big cores (0–4); number of active LITTLE cores (1–4)
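Even before any model-based method, the simplest way to navigate such a space is random sampling. The sketch below draws random points from a toy version of this co-design space and keeps the best under a synthetic cost function – the parameter lists echo the slide, but the cost function is invented purely for illustration.

```python
import random

# Toy co-design space echoing the slide (parameter values from the slide;
# the cost function below is synthetic, purely for illustration).
space = {
    "workgroup": [16, 32, 64, 96, 112, 128, 256],
    "vec_width": [1, 2, 4, 8],
    "big_cores": [0, 1, 2, 3, 4],
    "gpu_mhz":   [177, 266, 350, 420, 480, 543, 600],
}

def cost(cfg):
    # Made-up runtime model: wider vectors, higher frequency and more
    # cores help, with an invented penalty for oversized workgroups.
    penalty = 1.5 if cfg["workgroup"] > 128 else 1.0
    return penalty * 1e6 / (cfg["vec_width"] * cfg["gpu_mhz"]
                            * (1 + cfg["big_cores"]))

random.seed(0)
best, best_cost = None, float("inf")
for _ in range(200):  # 200 random samples out of 7*4*5*7 = 980 points
    cfg = {k: random.choice(v) for k, v in space.items()}
    c = cost(cfg)
    if c < best_cost:
        best, best_cost = cfg, c

print(best, best_cost)
```

Random search scales to huge spaces but wastes samples; the model-based, active-learning approach on the next slides spends each measurement where the model is most uncertain or most promising.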

SLIDE 12

Model-based, active-learning design-space exploration


SLIDE 13

How is the model represented?

Decision tree; random forest (an ensemble of decision trees).

SLIDE 14

DSE on algorithmic parameters: error vs. runtime

Machine: Hardkernel ODROID-XU3
  • CPU: ARM A15 + A7 (Exynos 5422), 80 GFLOPS, 4 + 4 cores
  • GPU: ARM Mali-T628, 60 + 30 GFLOPS
  • TDP: 10 W

SLIDE 15

Feynman: plenty of room at the bottom

"There's Plenty of Room at the Bottom" – December 1959, talk at the American Physical Society.

https://en.wikipedia.org/wiki/There's_Plenty_of_Room_at_the_Bottom

SLIDE 16

Feynman: plenty of room at the bottom

  • 58 years of exponential progress since then
  • Much debate about where the fundamental limits really lie – but what is clear is that we're a lot closer to them
  • We are confronted more and more with fundamental physical concerns, particularly with respect to communication latency, bandwidth and energy

SLIDE 17

Algorithmic complexity and scheduling

We teach that access to a hash table is O(1), i.e. independent of the size of the hash table – and that it doesn't matter how you want to access your hash table, it's still O(1).

Bilardi et al., J. ACM 2009

SLIDE 18

Algorithmic complexity and scheduling

We teach that access to a hash table is O(1), ie independent of the size of the hash table

But the hash table is implemented using a RAM distributed in 3D space. So wire length increases with RAM size – and caching doesn't help, since access is randomised.

Bilardi et al., J. ACM 2009

[RAM diagram: row and column address decoders.]

SLIDE 19

Algorithmic complexity and scheduling

We teach that access to a hash table is O(1), ie independent of the size of the hash table

But the hash table is implemented using a RAM distributed in 3D space. So wire length increases with RAM size – and caching doesn't help, since access is randomised.

But this is a latency perspective. If instead we're interested in throughput, we might be able to pipeline the accesses – completing accesses at a rate of one per O(1) time.

In general, pipelining can hide memory access latency provided we have enough parallelism, and the program has “bounded address depth”

Gianfranco Bilardi, Kattamuri Ekanadham, and Pratap Pattnaik. On approximating the ideal random access machine by physical machines. J. ACM 56(5), Article 27, August 2009.

SLIDE 20

Algorithmic complexity and scheduling

We know that matrix-matrix multiply is O(n³)

But in a deep memory hierarchy, access time depends on reuse distance. So the naïve "for i, for j, for k" loop nest suffers reuse access latency that grows with N – anecdotally, execution time is ~O(n⁵).

[Diagram: C[i][j] += A[i][k] × B[k][j], with loop indices i, j, k over matrices C, A and B.]

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];

Each row of A is reused for a series of dot-products But if the cache is too small, it doesn’t fit

SLIDE 21

Algorithmic complexity and scheduling

  • Tiling for cache bounds the reuse distance so that the reused submatrix fits in cache
  • With a deep hierarchy we have to do this at every level of the cache, recursively
  • Doing this leads to a big-O performance improvement
  • Finding schedules with good locality is really an algorithmic challenge

Alpern, B., Carter, L., Feig, E., et al. The uniform memory hierarchy model of computation. Algorithmica 12, 72 (1994).

for (kk = 0; kk < N; kk += S)
  for (jj = 0; jj < N; jj += S)
    for (i = 0; i < N; i++)
      for (k = kk; k < min(kk+S, N); k++)
        for (j = jj; j < min(jj+S, N); j++)
          C[i][j] += A[i][k] * B[k][j];
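The tiled loop nest computes exactly the same C as the naïve schedule – only the order of the iterations (and hence the reuse distance) changes. A quick cross-check in Python, with a small N and tile size S chosen arbitrarily so that min() exercises the ragged last tile:

```python
# Check that the tiled schedule computes the same product as the naive one.
# N and S are small illustrative values; min() handles the ragged last tile.
N, S = 7, 3
A = [[(i * N + k) % 5 for k in range(N)] for i in range(N)]
B = [[(k + 2 * j) % 7 for j in range(N)] for k in range(N)]

# Naive i-j-k schedule
C1 = [[0] * N for _ in range(N)]
for i in range(N):
    for j in range(N):
        for k in range(N):
            C1[i][j] += A[i][k] * B[k][j]

# Tiled kk-jj-i-k-j schedule (as on the slide)
C2 = [[0] * N for _ in range(N)]
for kk in range(0, N, S):
    for jj in range(0, N, S):
        for i in range(N):
            for k in range(kk, min(kk + S, N)):
                for j in range(jj, min(jj + S, N)):
                    C2[i][j] += A[i][k] * B[k][j]

print(C1 == C2)  # → True
```

The reordering is legal because each C[i][j] is a sum, and addition order does not change the (exact integer) result; the payoff is that each S×S submatrix of B stays in cache while it is reused.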

SLIDE 22

Turing tax

Alan Turing realised we could use digital technology to implement any computable function. He then proposed the idea of a "universal" computing device: a single device which, with the right program, can implement any computable function without further configuration. The "Turing Tax" is a term for the overhead (in performance, cost, or energy) of universality in this sense – that is, the performance difference between a special-purpose device and a general-purpose one. One of the fundamental questions of computer architecture is how to reduce the Turing Tax.

SLIDE 23

Turing tax

  • FPGAs are Turing Tax
  • Fetch-execute is Turing Tax – but since it doesn't involve communication, it's not the important thing
  • Registers are Turing Tax – because if we know the program's dataflow, we can use wires and latches to pass data from functional unit to functional unit
  • Memory – if we can stream data from where it's produced to where it's used, maybe we don't need so much RAM?
  • Low-latency memory – if we can manage reuse, we can place data in nearby memory
  • Cache – if we know exactly when the reuse will occur, we can program movement to and from local fast memory explicitly

SLIDE 24

Dataflow computing

Maxeler dataflow:
  • Novel dataflow programming model, implemented using FPGAs
  • Max3 workstation for development; rack-based clusters for large-scale application

[Diagram: a multicore x86 PC connected over PCI Express to a Max3 card carrying 24–48 GB of DRAM and a configurable dataflow engine, implemented using a large FPGA.]

SLIDE 25

Dataflow computing

Solving PDEs for fluid dynamics, on an unstructured mesh.

SLIDE 26

For this talk we concentrate on an unstructured 2D quad mesh

Dataflow computing


SLIDE 27

The flow of data

Basically, what we have to do is sweep over the mesh; each time we visit a cell, we compute values at the interfaces.

Why is unstructured useful?

[Diagram: x86 CPU, DFE and DRAM.]

SLIDE 28

The flux reconstruction datapath

Two streams from DRAM – cell data and interface data – both contiguous, stride-1, fully streaming.

[DFE datapath diagram: gather interface data; propagate values from and to interfaces; compute updated cell data; combine interface data; cell and interface data stream to and from DRAM.]

SLIDE 29

Instructions

The machine is a "computer with just one instruction". Instructions are streamed from the x86 CPU, in step with the cell data. Each instruction has four fields, one for each interface of the cell; each field specifies an offset (into the on-chip BRAM circular buffer) and whether this cell owns that interface.

[Diagram: cell instruction stream from the x86 CPU (offset0..offset3, each with an owner flag) feeding the cell-and-interface computation kernel; on-chip BRAM circular buffers hold operand and result sets (cell plus four interfaces); the owner visit writes interface data, the non-owner visit combines interfaces; cell data streams from and to DRAM.]
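The per-cell "instruction" described above – four interface fields, each carrying a buffer offset and an owner flag – can be sketched as a small record type. The field names and the toy interpreter below are hypothetical: the slide does not give the actual DFE encoding.

```python
from collections import namedtuple

# Hypothetical encoding of the per-cell instruction: four interface fields,
# each an offset into the on-chip circular buffer plus an "owner" flag.
# (The real DFE encoding is not given on the slide.)
Iface = namedtuple("Iface", "offset owner")
CellInstr = namedtuple("CellInstr", "ifaces")  # exactly four Iface entries

def visit_cell(instr, buffer, contribution):
    """Toy interpreter: the owner visit initialises an interface slot;
    the non-owner visit combines into it."""
    for f in instr.ifaces:
        if f.owner:
            buffer[f.offset] = contribution   # owner visit: write
        else:
            buffer[f.offset] += contribution  # non-owner visit: combine

buf = [0.0] * 8  # circular buffer (toy size)
# Cell 0 owns the interfaces at offsets 0-3; cell 1 revisits offset 1
# as non-owner, so the shared interface combines both contributions.
visit_cell(CellInstr([Iface(0, True), Iface(1, True),
                      Iface(2, True), Iface(3, True)]), buf, 2.0)
visit_cell(CellInstr([Iface(1, False), Iface(4, True),
                      Iface(5, True), Iface(6, True)]), buf, 3.0)
print(buf[1])  # interface shared by both cells holds the combined value
```

The owner flag is what lets a fully streaming machine handle each interface being visited twice, without any random-access lookup: the first (owner) visit claims the buffer slot, the second combines and retires it.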

SLIDE 30

Partitioning

What happens if the circular buffer isn't big enough to hold the interfaces?

[Same datapath diagram as the previous slide.]

SLIDE 31

Partitions and haloes

The mesh is partitioned so that offsets do not exceed the capacity of the BRAM buffers. Halo interfaces (in red) have contributions from cells in two different partitions.

[Partitioned-mesh diagram.]

SLIDE 32

Haloes

Halo interfaces are streamed from and to the x86 host CPU, and the CPU combines halo interfaces.

[Datapath diagram as before, now with halo-interface input and output data streams between the DFE and the x86 CPU.]

SLIDE 33

What does this mean?

What we have so far: a fully-streaming dataflow architecture for unstructured-mesh CFD, whose performance is limited by DRAM bandwidth for cell and interface data – provided that we can implement the datapath to match the DRAM bandwidth, and that CPU/PCIe bandwidth is not the bottleneck.
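The two "provided" conditions amount to a back-of-envelope bandwidth calculation: streamed bytes per cell update against sustained DRAM bandwidth, and halo traffic against PCIe bandwidth. Every number in the sketch below is an invented placeholder, not a measured figure from the talk.

```python
# Back-of-envelope check of the two "provided" conditions; every number
# here is an invented placeholder, not a measured figure from the talk.
dram_bw = 38e9        # sustained DRAM bandwidth, bytes/s (assumed)
pcie_bw = 4e9         # sustained PCIe bandwidth, bytes/s (assumed)
bytes_per_cell = 256  # cell + interface bytes streamed per update (assumed)
halo_fraction = 0.02  # fraction of traffic crossing partitions (assumed)

cell_rate = dram_bw / bytes_per_cell              # DRAM-limited updates/s
halo_traffic = cell_rate * bytes_per_cell * halo_fraction  # bytes/s to host

print("updates/s (DRAM-limited): %.2e" % cell_rate)
print("PCIe is the bottleneck:", halo_traffic > pcie_bw)
```

Under these assumed numbers the halo traffic stays well below the PCIe budget, so DRAM bandwidth remains the limit; a worse partitioning (larger halo fraction) shifts the bottleneck to the host link.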
SLIDE 34

Computer architecture – the book

Computer Architecture: A Quantitative Approach – five editions since 1990. A revolutionary, landmark book that brought experimental discipline to processor design – and yet almost entirely devoid of theory.

SLIDE 35

Computer architecture – the future?

Computer Architecture: An Asymptotic Approach

SLIDE 36

Conclusions

  • At the application level, the design space is enormous
  • Once you have code, many of the interesting points are unreachable
  • We need to get the applications people to navigate their design space – and automate the pathway to efficient implementation
  • We need to rewrite the textbooks to account for the physical realities of computing

SLIDE 37

Acknowledgements

Partly funded by

  • NERC Doctoral Training Grant (NE/G523512/1)
  • EPSRC "MAPDES" project (EP/I00677X/1)
  • EPSRC "PSL" project (EP/I006761/1)
  • Rolls-Royce and the TSB through the SILOET programme
  • EPSRC "PAMELA" Programme Grant (EP/K008730/1)
  • EPSRC "PRISM" Platform Grant (EP/I006761/1)
  • EPSRC "Custom Computing" Platform Grant (EP/I012036/1)
  • AMD, Codeplay, Maxeler Technologies

Code: http://www.firedrakeproject.org/ and http://op2.github.io/PyOP2/