
Architecture and software for when there's no longer plenty of room at the bottom

Architecture and software for when there's no longer plenty of room at the bottom. Paul H J Kelly, Group Leader, Software Performance Optimisation; Co-Director, Centre for Computational Methods in Science and Engineering; Department of Computing, Imperial College London


  1. Architecture and software for when there’s no longer plenty of room at the bottom
     Paul H J Kelly, Group Leader, Software Performance Optimisation; Co-Director, Centre for Computational Methods in Science and Engineering; Department of Computing, Imperial College London
     Joint work with:
     • David Ham (Imperial Computing/Maths/Grantham Inst for Climate Change)
     • Gerard Gorman, Michael Lange (Imperial Earth Science Engineering – Applied Modelling and Computation Group)
     • Mike Giles, Gihan Mudalige, Istvan Reguly (Mathematical Inst, Oxford)
     • Doru Bercea, Fabio Luporini, Graham Markall, Lawrence Mitchell, Florian Rathgeber, Francis Russell, George Rokos, Paul Colea (Software Perf Opt Group, Imperial Computing)
     • Spencer Sherwin (Aeronautics, Imperial), Chris Cantwell (Cardio-mathematics group, Mathematics, Imperial)
     • Michelle Mills Strout, Chris Krieger, Cathie Olschanowsky (Colorado State University)
     • Carlo Bertolli (IBM Research), Ram Ramanujam (Louisiana State University)
     • Doru Thom Popovici, Franz Franchetti (CMU), Karl Wilkinson (Cape Town), Chris-Kriton Skylaris (Southampton)

  2. What we are doing…. [Diagram linking technologies, contexts, projects and applications]
     Technologies and contexts: finite difference; finite-volume; finite-element assembly; unstructured-mesh stencils; vectorisation, parametric polyhedral tiling; tiling for unstructured-mesh stencils; domain-specific optimisation; lazy, data-driven compute-communicate; real-time 3D scene understanding; dense SLAM – 3D vision; runtime code generation; adaptive-mesh CFD; dynamic mesh adaptation; multicore graph worklists; unsteady CFD – higher-order flux-reconstruction; small-matrix multiplication; massive common sub-expressions; ab-initio computational chemistry (ONETEP); Fourier interpolation; optimisation of composite transforms
     Targeting: MPI, OpenMP, OpenCL, Dataflow/FPGA; from HPC to mobile, embedded and wearable
     Projects: PyOP2/OP2, Firedrake, SLAMBench, PRAgMaTIc, GiMMiK, TINTL
     Applications: aeroengine turbo-machinery; weather and climate; domestic robotics, augmented reality; tidal turbines; Formula-1, UAVs; solar energy, drug design

  3. This talk
     • Algorithmics at the limits of Moore’s Law
     • Navigating the algorithmic design space
     • Dataflow as a strategy for controlling data movement
     • Domain-specific optimisations
     • Getting the abstraction right
     • Delivering

  4. The advection-diffusion problem: Weak form:
     This is the entire specification for a solver for an advection-diffusion test problem:

         t=state.scalar_fields["Tracer"]         # Extract fields
         u=state.vector_fields["Velocity"]       # from Fluidity
         p=TrialFunction(t)                      # Setup test and
         q=TestFunction(t)                       # trial functions
         M=p*q*dx                                # Mass matrix
         d=-dt*dfsvty*dot(grad(q),grad(p))*dx    # Diffusion term
         D=M-0.5*d                               # Diffusion matrix
         adv = (q*t+dt*dot(grad(q),u)*t)*dx      # Advection RHS
         diff = action(M+0.5*d,t)                # Diffusion RHS
         solve(M == adv, t)                      # Solve advection
         solve(D == diff, t)                     # Solve diffusion

     Same model implemented in FEniCS/Dolfin, and also in Fluidity (hand-coded Fortran)
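
The two `solve` calls in the listing can be read as the following pair of weak-form problems. This is a reading of the code rather than a reproduction of the equations on the original slide: it assumes `c` stands for the tracer field `t`, `\kappa` for `dfsvty`, `\Delta t` for `dt`, that `c^{*}` is the intermediate value held in `t` after the advection solve, and that boundary terms from integrating the advective term by parts have been dropped.

```latex
\begin{align*}
% Advection step: forward Euler, advective term integrated by parts
\int_\Omega q\, c^{*} \,\mathrm{d}x
  &= \int_\Omega \left( q\, c^{n}
     + \Delta t\, \nabla q \cdot \mathbf{u}\, c^{n} \right) \mathrm{d}x \\
% Diffusion step: Crank-Nicolson, i.e. (M - d/2) c^{n+1} = (M + d/2) c^{*}
\int_\Omega \left( q\, c^{n+1}
     + \tfrac{1}{2}\,\Delta t\, \kappa\, \nabla q \cdot \nabla c^{n+1} \right) \mathrm{d}x
  &= \int_\Omega \left( q\, c^{*}
     - \tfrac{1}{2}\,\Delta t\, \kappa\, \nabla q \cdot \nabla c^{*} \right) \mathrm{d}x
\end{align*}
```

The sign pattern follows from `d` being defined with a leading minus sign, so `M-0.5*d` adds the stiffness term on the left and `M+0.5*d` subtracts it on the right.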

  5. Firedrake: a finite-element framework
     • An alternative implementation of the FEniCS language
     • Using PyOP2 as an intermediate representation of parallel loops
     • All embedded in Python using runtime code generation
     (Rathgeber, Ham, Mitchell et al, ACM TOMS 2016)
     [Diagram: software layers, top to bottom]
     • UFL (Unified Form Language): the FEniCS project’s DSL for finite element discretisation
     • “Two-stage” form compiler: generates PyOP2 kernels and access descriptors
     • PyOP2: stencil DSL for unstructured meshes; explicit access descriptors characterise the access footprint of kernels; distributed MPI-parallel implementation; also accepts non-FE loops over the mesh
     • COFFEE: domain-specific loop optimizer (kernel optimiser/vectoriser) for finite-element assembly and similar loop nests; vectorisation and flop-minimisation
     • Targets: multicore, manycore/GPU, future/other
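
To make the idea of "kernels plus access descriptors" concrete, here is a minimal sketch in plain Python. It is not the PyOP2 API: the names `READ`, `INC`, `par_loop` and `midpoint_to_nodes` are hypothetical, and the loop runs sequentially. It only illustrates how declaring the access mode of each argument lets the runtime, rather than the kernel, handle the gather and scatter through the mesh's indirection map.

```python
# Hypothetical illustration only: this is NOT the PyOP2 API.
READ, INC = "READ", "INC"


def par_loop(kernel, n_elements, *args):
    """Apply `kernel` to every element of the iteration set.

    Each arg is (data, mapping, access): mapping[e] lists the indices of
    `data` that element e touches, and `access` declares how it touches them.
    """
    for e in range(n_elements):
        # Gather: stage exactly the data the descriptors say the kernel uses.
        local = []
        for data, mapping, access in args:
            if access == READ:
                local.append([data[i] for i in mapping[e]])
            else:  # INC: the kernel accumulates into a zeroed local buffer
                local.append([0.0] * len(mapping[e]))
        kernel(*local)
        # Scatter: only INC arguments are written back, by accumulation.
        for buf, (data, mapping, access) in zip(local, args):
            if access == INC:
                for i, v in zip(mapping[e], buf):
                    data[i] += v


def midpoint_to_nodes(x, out):
    """Kernel: share each edge's midpoint coordinate equally between its nodes."""
    mid = 0.5 * (x[0] + x[1])
    out[0] += 0.5 * mid
    out[1] += 0.5 * mid


# A 1D mesh with 4 nodes and 3 edges; edge e joins nodes e and e+1.
edge_to_node = [[0, 1], [1, 2], [2, 3]]
coords = [0.0, 1.0, 2.0, 3.0]
accum = [0.0] * 4

par_loop(midpoint_to_nodes, 3,
         (coords, edge_to_node, READ),
         (accum, edge_to_node, INC))
print(accum)
```

Because the descriptors state that `coords` is only read and `accum` is only incremented, a real runtime would be free to reorder, colour or distribute the iterations; the kernel itself never needs to know.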

  6. Firedrake – single-node performance
     Here we compare performance against two production codes solving the same problem on the same mesh (Markall, Rathgeber et al, ICS ’13):
     • Fluidity: Fortran/C++
     • DOLFIN: the FEniCS project’s implementation of UFL
     Graph shows speedup over Fluidity on one core of a 12-core Westmere node (graph label: Fermi M2050)
     These results are preliminary and are presented for discussion purposes – see Rathgeber, Ham, Mitchell et al, http://arxiv.org/abs/1501.01809 for more systematic and up to date evaluation

  7. End-to-end accuracy drives algorithm selection
     • Helmholtz problem using tetrahedral elements
     • What is the best combination of h and p?
     • Depends on the solution accuracy required
     • Which, in turn, determines whether to choose local vs global assembly
     [Figure: optimum discretisation for 10% accuracy and for 0.1% accuracy. Blue dotted lines show runtime; red solid lines show L2 error]
     (C.D.Cantwell, S.J.Sherwin, R.M.Kirby, P.H.J.Kelly, From h to p efficiently)

  8. SLAM: “Simultaneous Localisation and Mapping”
     • Build coherent world representation and localise camera in real-time
     • “Dense SLAM”: use all the sensor data to build a full surface map
     • Applications in robotics, augmented reality, telepresence
     Shahram Izadi et al: KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera. (UIST '11)

  9. SLAMBench framework (Paul Kelly - Imperial College London)
     • SLAM benchmarks: KinectFusion, ElasticFusion (dense SLAM); LSD-SLAM (semi-dense SLAM); ORB-SLAM (sparse SLAM); …
     • Implementation languages: C++, OpenMP, OpenCL, CUDA, SYCL, PENCIL, …
     • Desktop to embedded platforms: ARM, Intel, NVIDIA, …
     • Datasets: ICL-NUIM, TUM RGB-D, …
     • Performance evaluation: frame rate, energy, accuracy

  10. SLAMBench's optional GUI (Luigi Nardi - Imperial College London)
      [Screenshot panels: RGB camera, depth camera, 3D model, tracked points, performance]

  11. What is the optimisation space?
      Co-design space, in three parts:
      • Space 1, algorithmic: application-specific parameters; minimisation methods; early exit condition values
      • Space 2, compilation:
        opencl-params: -cl-mad-enable, -cl-fast-relaxed-math, etc.
        LLVM flags: O1, O2, O3, vectorize-slp-aggressive, etc.
        Local work group size: 16/32/64/96/112/128/256
        Vectorisation: width (1/2/4/8), direction (x/y)
        Thread coarsening: factor (1/2/4/8/16/32), stride (1/2/4/8/16/32), dimension (x/y)
      • Space 3, architecture:
        GPU frequency: 177/266/350/420/480/543/600/DVFS
        # of active big cores: 0/1/2/3/4
        # of active LITTLE cores: 1/2/3/4
      Warning: huge spaces, impossible to run exhaustively
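
A rough count shows why exhaustive search is ruled out. The sketch below multiplies only the knobs whose options the slide enumerates explicitly; the compiler flags, OpenCL parameters and all of the algorithmic parameters are left out, so the result is a lower bound. The dictionary keys are names chosen here for illustration.

```python
from math import prod

# Option lists transcribed from the slide; open-ended knobs ("etc.") are omitted.
compilation_space = {
    "local_work_group_size": [16, 32, 64, 96, 112, 128, 256],
    "vector_width": [1, 2, 4, 8],
    "vector_direction": ["x", "y"],
    "coarsening_factor": [1, 2, 4, 8, 16, 32],
    "coarsening_stride": [1, 2, 4, 8, 16, 32],
    "coarsening_dimension": ["x", "y"],
}
architecture_space = {
    "gpu_frequency": [177, 266, 350, 420, 480, 543, 600, "DVFS"],
    "active_big_cores": [0, 1, 2, 3, 4],
    "active_little_cores": [1, 2, 3, 4],
}


def size(space):
    """Number of distinct configurations in a space of independent knobs."""
    return prod(len(options) for options in space.values())


n_compile = size(compilation_space)
n_arch = size(architecture_space)
total = n_compile * n_arch
print(n_compile, n_arch, total)  # 4032 160 645120
```

At a purely hypothetical one minute per benchmark run, those 645,120 configurations alone would take 448 days, before any algorithmic parameter is varied.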

  12. Model-based, active-learning design-space exploration

  13. How is the model represented?
      • Decision Tree
      • Random Forest
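
The following is a simplified sketch of what a model-guided exploration loop with a tree-ensemble surrogate can look like. It is not the method used in the talk: the runtime function `measure` is made up, the ensemble uses bootstrap resampling only (no per-split feature subsampling, so it is a stand-in for a full random forest), and the selection rule (evaluate the configuration the model predicts to be fastest) is an assumption.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(rows, depth=0, max_depth=3):
    """Regression tree over rows of (features, target); leaves hold a mean."""
    targets = [y for _, y in rows]
    if depth == max_depth or len(rows) < 2:
        return mean(targets)
    best = None
    for f in range(len(rows[0][0])):
        for thr in sorted({x[f] for x, _ in rows})[1:]:
            left = [r for r in rows if r[0][f] < thr]
            right = [r for r in rows if r[0][f] >= thr]
            cost = sse([y for _, y in left]) + sse([y for _, y in right])
            if best is None or cost < best[0]:
                best = (cost, f, thr, left, right)
    if best is None:  # all feature vectors identical: nothing to split on
        return mean(targets)
    _, f, thr, left, right = best
    return (f, thr, build_tree(left, depth + 1, max_depth),
            build_tree(right, depth + 1, max_depth))


def predict_tree(tree, x):
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if x[f] < thr else right
    return tree


def fit_forest(rows, n_trees, rng):
    # Bagging: each tree is trained on a bootstrap resample of the measurements.
    return [build_tree([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]


def predict_forest(forest, x):
    return mean(predict_tree(t, x) for t in forest)


def measure(config):
    """Stand-in for running the benchmark: a made-up runtime model."""
    work_group, coarsening = config
    return abs(work_group - 96) / 32 + abs(coarsening - 4) / 4 + 1.0


space = [(wg, c) for wg in (16, 32, 64, 96, 112, 128, 256)
         for c in (1, 2, 4, 8, 16, 32)]


def explore(space, budget, n_initial=6, seed=0):
    rng = random.Random(seed)
    measured = {c: measure(c) for c in rng.sample(space, n_initial)}
    while len(measured) < budget:
        forest = fit_forest(list(measured.items()), n_trees=20, rng=rng)
        candidates = [c for c in space if c not in measured]
        # Evaluate the unmeasured configuration the model expects to be fastest.
        chosen = min(candidates, key=lambda c: predict_forest(forest, c))
        measured[chosen] = measure(chosen)
    return measured


measured = explore(space, budget=15)
best = min(measured, key=measured.get)
print(best, measured[best])
```

The point of the loop is that only 15 of the 42 configurations are ever run; the surrogate decides where each new measurement is spent.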

  14. DSE on algorithmic parameters: error/runtime
      Machine: Hardkernel ODROID-XU3
      CPU: ARM A15 + A7
      CPU name: Exynos 5422
      CPU GFLOPS: 80
      CPU cores: 4 + 4
      GPU: ARM
      GPU name: Mali-T628
      GPU GFLOPS: 60 + 30
      TDP (Watts): 10

  15. Feynman: plenty of room at the bottom …..
      (talk at the American Physical Society, December 1959)
      https://en.wikipedia.org/wiki/There's_Plenty_of_Room_at_the_Bottom

  16. Feynman: plenty of room at the bottom …..
      (talk at the American Physical Society, December 1959)
      https://en.wikipedia.org/wiki/There's_Plenty_of_Room_at_the_Bottom
      • 58 years of exponential progress since then
      • We’re much closer to such limits
      • Much debate about where they really lie
      • What is clear is that we’re a lot closer
      • We are confronted more and more with fundamental physical concerns
      • Particularly with respect to communication latency, bandwidth and energy

  17. Algorithmic complexity and scheduling
      • We teach that access to a hash table is O(1), i.e. independent of the size of the hash table
      • And that it doesn’t matter how you want to access your hash table, it’s still O(1)
      (Bilardi et al)

  18. Algorithmic complexity and scheduling
      • We teach that access to a hash table is O(1), i.e. independent of the size of the hash table
      • But the hash table is implemented using a RAM distributed in 3D space
      • So wire length increases with RAM size
      • And caching doesn’t help since access is randomised
      [Figure labels: column address decoder, row address decoder]
      (Bilardi et al)

  19. Algorithmic complexity and scheduling
      • We teach that access to a hash table is O(1), i.e. independent of the size of the hash table
      • But the hash table is implemented using a RAM distributed in 3D space
      • So wire length increases with RAM size
      • And caching doesn’t help since access is randomised
      • But this is a latency perspective
      • If instead we’re interested in throughput, we might be able to pipeline the accesses
      • We complete accesses at a rate of 1 per O(1) time
      • In general, pipelining can hide memory access latency provided we have enough parallelism, and the program has “bounded address depth”
      Gianfranco Bilardi, Kattamuri Ekanadham, and Pratap Pattnaik. 2009. On approximating the ideal random access machine by physical machines. J. ACM 56, 5, Article 27 (August 2009).
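
The latency-versus-throughput argument can be put into a small back-of-envelope model. Everything in it is an assumption made for illustration: latency is taken to grow with the cube root of memory size (wire length proportional to the radius of a 3D memory), one access can be issued per unit time, and "bounded address depth" is read loosely as a limit on chains of accesses whose addresses depend on earlier results.

```python
def access_latency(n_words, unit_latency=1.0, dims=3):
    """Latency of one access when wire length scales with the physical
    radius of a memory of n_words laid out in `dims` dimensions."""
    return unit_latency * n_words ** (1.0 / dims)


def time_independent(n_accesses, latency, issue_interval=1.0):
    """Independent accesses pipeline: pay the latency once, then one
    result completes per issue interval."""
    return latency + (n_accesses - 1) * issue_interval


def time_dependent(n_accesses, latency):
    """A chain in which each address depends on the previous result
    (e.g. pointer chasing) cannot overlap, so latencies add up."""
    return n_accesses * latency


lat = access_latency(n_words=2 ** 30)  # roughly 1024 time units
n = 1_000_000
print(lat, time_independent(n, lat), time_dependent(n, lat))
```

Under these assumptions a million independent lookups cost about a million time units, essentially O(1) each, while a million dependent lookups cost about a thousand times more. That is the sense in which a hash table can stay O(1) per access in throughput terms even though its latency grows with size.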
