Architecture and software for when there's no longer plenty of room at the bottom – PowerPoint PPT Presentation



SLIDE 1

Architecture and software for when there’s no longer plenty of room at the bottom

Paul H J Kelly Group Leader, Software Performance Optimisation Co-Director, Centre for Computational Methods in Science and Engineering Department of Computing, Imperial College London

Joint work with: David Ham (Imperial Computing/Maths/Grantham Inst for Climate Change); Gerard Gorman, Michael Lange (Imperial Earth Science Engineering – Applied Modelling and Computation Group); Mike Giles, Gihan Mudalige, Istvan Reguly (Mathematical Inst, Oxford); Doru Bercea, Fabio Luporini, Graham Markall, Lawrence Mitchell, Florian Rathgeber, Francis Russell, George Rokos, Paul Colea (Software Perf Opt Group, Imperial Computing); Spencer Sherwin (Aeronautics, Imperial); Chris Cantwell (Cardio-mathematics group, Mathematics, Imperial); Michelle Mills Strout, Chris Krieger, Cathie Olschanowsky (Colorado State University); Carlo Bertolli (IBM Research); Ram Ramanujam (Louisiana State University); Doru Thom Popovici, Franz Franchetti (CMU); Karl Wilkinson (Cape Town); Chris-Kriton Skylaris (Southampton)

SLIDE 2

What we are doing: domain-specific optimisation

Projects: PyOP2/OP2 (unstructured-mesh stencils), GiMMiK (small-matrix multiplication), Firedrake (finite-element assembly), SLAMBench (dense SLAM – 3D vision), PRAgMaTIc (dynamic mesh adaptation), TINTL (Fourier interpolation)

Contexts: unsteady CFD – higher-order flux reconstruction, finite difference, finite-volume, real-time 3D scene understanding, adaptive-mesh CFD, ab-initio computational chemistry (ONETEP), finite-element

Applications: Formula-1, UAVs, aeroengine turbomachinery, domestic robotics, augmented reality, tidal turbines, solar energy, drug design, weather and climate

Technologies: massive common sub-expressions; vectorisation, parametric polyhedral tiling; lazy, data-driven compute–communicate; multicore graph worklists; optimisation of composite transforms; tiling for unstructured-mesh stencils

Targeting MPI, OpenMP, OpenCL, dataflow/FPGA – from HPC to mobile, embedded and wearable. Runtime code generation.

SLIDE 3

This talk

  • Algorithmics at the limits of Moore's Law
  • Navigating the algorithmic design space
  • Dataflow as a strategy for controlling data movement
  • Domain-specific optimisations
  • Getting the abstraction right
  • Delivering

SLIDE 4

t = state.scalar_fields["Tracer"]       # Extract fields
u = state.vector_fields["Velocity"]     # from Fluidity
p = TrialFunction(t)                    # Set up trial and
q = TestFunction(t)                     # test functions
M = p*q*dx                              # Mass matrix
d = -dt*dfsvty*dot(grad(q),grad(p))*dx  # Diffusion term
D = M - 0.5*d                           # Diffusion matrix
adv = (q*t + dt*dot(grad(q),u)*t)*dx    # Advection RHS
diff = action(M + 0.5*d, t)             # Diffusion RHS
solve(M == adv, t)                      # Solve advection
solve(D == diff, t)                     # Solve diffusion

This is the entire specification of a solver for an advection-diffusion test problem. The same model is implemented in FEniCS/DOLFIN, and also hand-coded in Fortran in Fluidity. [The slide also shows the advection-diffusion problem and its weak form.]
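The split structure of the solver above – an advection solve followed by a diffusion solve each time step – can be mimicked in plain Python. This is not Firedrake/UFL code: it is a minimal 1D periodic finite-difference sketch (first-order upwind advection, explicit diffusion), with all parameter values invented for illustration.

```python
# Minimal 1D periodic advection-diffusion sketch (NOT the Firedrake code):
# each step advects with an upwind difference, then diffuses explicitly,
# mirroring the split "solve advection, then solve diffusion" above.

N, L = 64, 1.0          # grid points, domain length (illustrative values)
dx = L / N
u, kappa = 1.0, 1e-3    # advection velocity, diffusivity (assumed)
dt = 0.4 * dx / u       # CFL-limited time step

t = [1.0 if N // 4 <= i < N // 2 else 0.0 for i in range(N)]  # tracer blob

def step(t):
    # Advection (first-order upwind, periodic wrap via t[-1])
    adv = [t[i] - u * dt / dx * (t[i] - t[i - 1]) for i in range(N)]
    # Diffusion (explicit central difference, periodic)
    return [adv[i] + kappa * dt / dx**2 *
            (adv[(i + 1) % N] - 2 * adv[i] + adv[i - 1]) for i in range(N)]

for _ in range(100):
    t = step(t)

# Both sub-steps conserve the tracer integral on a periodic domain
print(round(sum(t) * dx, 6))
```

Both sub-steps telescope under the periodic boundary, so the discrete integral of the tracer is preserved exactly, which is a quick sanity check on the splitting.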

SLIDE 5

Firedrake: a finite-element framework

  • An alternative implementation of the FEniCS language
  • The FEniCS project's UFL: a DSL for finite-element discretisation
  • The compiler generates PyOP2 kernels and access descriptors
  • PyOP2: a stencil DSL for unstructured meshes, used as an intermediate representation of parallel loops; explicit access descriptors characterise the access footprint of kernels
  • Distributed MPI-parallel PyOP2 implementation
  • All embedded in Python, using runtime code generation

[Toolchain diagram: Unified Form Language (UFL) → "two-stage" form compiler → COFFEE kernel optimiser/vectoriser → PyOP2 (also used directly for non-FE loops over the mesh) → multicore, manycore/GPU, and future/other backends.]

Rathgeber, Ham, Mitchell et al, ACM TOMS 2016

COFFEE: a domain-specific loop optimiser for finite-element assembly and similar loop nests, performing vectorisation and flop minimisation.
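PyOP2's key idea – kernels applied over mesh entities with explicit access descriptors – can be sketched in a few lines of plain Python. The names here (`par_loop`, `READ`, `INC`) follow the spirit of the PyOP2 API but are simplified stand-ins, not the real interface, which also handles WRITE/RW modes, maps between sets, and runtime code generation.

```python
# Sketch of a PyOP2-style parallel loop with explicit access descriptors.
# READ/INC are simplified stand-ins for PyOP2's access modes.

READ, INC = "read", "inc"

def par_loop(kernel, n_edges, *args):
    """Apply `kernel` to every edge; each arg is a (data, map, access) triple."""
    for e in range(n_edges):
        kernel(e, *[(data, emap[e], mode) for data, emap, mode in args])

# Example: sum edge lengths into the two endpoint vertices.
coords = [0.0, 1.0, 3.0, 6.0]             # vertex coordinates
edge_to_verts = [(0, 1), (1, 2), (2, 3)]  # map: edge -> its two vertices
accum = [0.0, 0.0, 0.0, 0.0]              # per-vertex accumulator

def edge_kernel(e, xs, acc):
    (x, xmap, _), (a, amap, _) = xs, acc
    length = abs(x[xmap[1]] - x[xmap[0]])
    for v in amap:                        # INC: indirect increment
        a[v] += length

par_loop(edge_kernel, len(edge_to_verts),
         (coords, edge_to_verts, READ),
         (accum, edge_to_verts, INC))

print(accum)  # each vertex accumulates the lengths of its incident edges
```

The access descriptors are what let the runtime reason about the kernel's footprint – e.g. which increments conflict under parallel execution – without inspecting the kernel body.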

SLIDE 6

Firedrake – single-node performance

Here we compare performance against two production codes solving the same problem on the same mesh:
  • Fluidity: Fortran/C++
  • DOLFIN: the FEniCS project's implementation of UFL

The graph shows speedup over Fluidity on one core of a 12-core Westmere node; GPU results are on an NVIDIA Fermi M2050.

Markall, Rathgeber et al., ICS'13. These results are preliminary and are presented for discussion purposes – see Rathgeber, Ham, Mitchell et al., http://arxiv.org/abs/1501.01809, for a more systematic and up-to-date evaluation.

SLIDE 7

End-to-end accuracy drives algorithm selection

A Helmholtz problem using tetrahedral elements. What is the best combination of mesh spacing h and polynomial order p? That depends on the solution accuracy required – which, in turn, determines whether to choose local or global assembly.

[Plots: optimum discretisation for 10% accuracy and for 0.1% accuracy. Blue dotted lines show runtime; red solid lines show L2 error.]

(C. D. Cantwell, S. J. Sherwin, R. M. Kirby, P. H. J. Kelly, "From h to p efficiently")
SLIDE 8

SLAM: “Simultaneous Location and Mapping”

  • Build coherent world representation and localise camera in real-time
  • “Dense SLAM”: use all the sensor data to build a full surface map
  • Applications in robotics, augmented reality, telepresence

Shahram Izadi et al: KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera. (UIST '11)

SLIDE 9
  • Paul Kelly - Imperial College London

SLAMBench framework

  • Implementation languages: C++, OpenMP, OpenCL, CUDA, SYCL, PENCIL
  • SLAM benchmarks: dense SLAM (KinectFusion, ElasticFusion, …); semi-dense SLAM (LSD-SLAM); sparse SLAM (ORB-SLAM)
  • Desktop to embedded platforms: ARM, Intel, NVIDIA, …
  • Datasets: ICL-NUIM, TUM RGB-D, …
  • Performance evaluation: frame rate, energy, accuracy

SLIDE 10
  • Luigi Nardi - Imperial College London

SLAMBench's optional GUI

[GUI panels: RGB camera, depth camera, tracked points, 3D model, performance.]

SLIDE 11

Co-design space (three nested spaces: algorithmic, compilation, architecture)

Warning: huge spaces – impossible to run exhaustively.

What is the optimisation space?

  • Algorithmic: application-specific parameters; minimisation methods; early-exit condition values
  • Compilation: OpenCL params (-cl-mad-enable, -cl-fast-relaxed-math, etc.); LLVM flags (O1, O2, O3, vectorize-slp-aggressive, etc.); local work-group size (16/32/64/96/112/128/256); vectorisation width (1/2/4/8) and direction (x/y); thread-coarsening factor (1/2/4/8/16/32), stride (1/2/4/8/16/32) and dimension (x/y)
  • Architecture: GPU frequency (177/266/350/420/480/543/600 MHz, or DVFS); number of active big cores (0–4); number of active LITTLE cores (1–4)
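Even before any model-based method, the simplest way to navigate such a space is random sampling. The sketch below draws random points from a toy version of this co-design space and keeps the best under a synthetic cost function – the parameter lists echo the slide, but the cost function is invented purely for illustration.

```python
import random

# Toy co-design space echoing the slide (parameter values from the slide;
# the cost function below is synthetic, purely for illustration).
space = {
    "workgroup": [16, 32, 64, 96, 112, 128, 256],
    "vec_width": [1, 2, 4, 8],
    "big_cores": [0, 1, 2, 3, 4],
    "gpu_mhz":   [177, 266, 350, 420, 480, 543, 600],
}

def cost(cfg):
    # Made-up runtime model: wider vectors, higher frequency and more
    # cores help, with an invented penalty for oversized workgroups.
    penalty = 1.5 if cfg["workgroup"] > 128 else 1.0
    return penalty * 1e6 / (cfg["vec_width"] * cfg["gpu_mhz"]
                            * (1 + cfg["big_cores"]))

random.seed(0)
best, best_cost = None, float("inf")
for _ in range(200):  # 200 random samples out of 7*4*5*7 = 980 points
    cfg = {k: random.choice(v) for k, v in space.items()}
    c = cost(cfg)
    if c < best_cost:
        best, best_cost = cfg, c

print(best, best_cost)
```

Random search scales to huge spaces but wastes samples; the model-based, active-learning approach on the next slides spends each measurement where the model is most uncertain or most promising.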

SLIDE 12

Model-based, active-learning design-space exploration


SLIDE 13

How is the model represented?

Decision tree; random forest (an ensemble of decision trees).

SLIDE 14

DSE on algorithmic parameters: error vs. runtime

Machine: Hardkernel ODROID-XU3
  • CPU: ARM A15 + A7 (Exynos 5422), 80 GFLOPS, 4 + 4 cores
  • GPU: ARM Mali-T628, 60 + 30 GFLOPS
  • TDP: 10 W

SLIDE 15

Feynman: plenty of room at the bottom

"There's Plenty of Room at the Bottom" – December 1959, talk at the American Physical Society.

https://en.wikipedia.org/wiki/There's_Plenty_of_Room_at_the_Bottom

SLIDE 16

Feynman: plenty of room at the bottom

  • 58 years of exponential progress since then
  • Much debate about where the fundamental limits really lie – but what is clear is that we're a lot closer to them
  • We are confronted more and more with fundamental physical concerns, particularly with respect to communication latency, bandwidth and energy

SLIDE 17

Algorithmic complexity and scheduling

We teach that access to a hash table is O(1), i.e. independent of the size of the hash table – and that it doesn't matter how you want to access your hash table, it's still O(1).

Bilardi et al., J. ACM 2009

SLIDE 18

Algorithmic complexity and scheduling

We teach that access to a hash table is O(1), ie independent of the size of the hash table

But the hash table is implemented using a RAM distributed in 3D space. So wire length increases with RAM size – and caching doesn't help, since access is randomised.

Bilardi et al., J. ACM 2009

[RAM diagram: row and column address decoders.]

SLIDE 19

Algorithmic complexity and scheduling

We teach that access to a hash table is O(1), ie independent of the size of the hash table

But the hash table is implemented using a RAM distributed in 3D space. So wire length increases with RAM size – and caching doesn't help, since access is randomised.

But this is a latency perspective. If instead we're interested in throughput, we might be able to pipeline the accesses – completing accesses at a rate of one per O(1) time.

In general, pipelining can hide memory access latency provided we have enough parallelism, and the program has “bounded address depth”

Gianfranco Bilardi, Kattamuri Ekanadham, and Pratap Pattnaik. On approximating the ideal random access machine by physical machines. J. ACM 56(5), Article 27, August 2009.

SLIDE 20

Algorithmic complexity and scheduling

We know that matrix-matrix multiply is O(n³)

But in a deep memory hierarchy, access time depends on reuse distance. So the naïve "for i, for j, for k" loop nest suffers reuse access latency that grows with N – anecdotally, execution time is ~O(n⁵).

[Diagram: C[i][j] += A[i][k] × B[k][j], with loop indices i, j, k over matrices C, A and B.]

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];

Each row of A is reused for a series of dot-products But if the cache is too small, it doesn’t fit

SLIDE 21

Algorithmic complexity and scheduling

  • Tiling for cache bounds the reuse distance so that the reused submatrix fits in cache
  • With a deep hierarchy we have to do this at every level of the cache, recursively
  • Doing this leads to a big-O performance improvement
  • Finding schedules with good locality is really an algorithmic challenge

Alpern, B., Carter, L., Feig, E., et al. The uniform memory hierarchy model of computation. Algorithmica 12, 72 (1994).

for (kk = 0; kk < N; kk += S)
  for (jj = 0; jj < N; jj += S)
    for (i = 0; i < N; i++)
      for (k = kk; k < min(kk+S, N); k++)
        for (j = jj; j < min(jj+S, N); j++)
          C[i][j] += A[i][k] * B[k][j];
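The tiled loop nest computes exactly the same C as the naïve schedule – only the order of the iterations (and hence the reuse distance) changes. A quick cross-check in Python, with a small N and tile size S chosen arbitrarily so that min() exercises the ragged last tile:

```python
# Check that the tiled schedule computes the same product as the naive one.
# N and S are small illustrative values; min() handles the ragged last tile.
N, S = 7, 3
A = [[(i * N + k) % 5 for k in range(N)] for i in range(N)]
B = [[(k + 2 * j) % 7 for j in range(N)] for k in range(N)]

# Naive i-j-k schedule
C1 = [[0] * N for _ in range(N)]
for i in range(N):
    for j in range(N):
        for k in range(N):
            C1[i][j] += A[i][k] * B[k][j]

# Tiled kk-jj-i-k-j schedule (as on the slide)
C2 = [[0] * N for _ in range(N)]
for kk in range(0, N, S):
    for jj in range(0, N, S):
        for i in range(N):
            for k in range(kk, min(kk + S, N)):
                for j in range(jj, min(jj + S, N)):
                    C2[i][j] += A[i][k] * B[k][j]

print(C1 == C2)  # → True
```

The reordering is legal because each C[i][j] is a sum, and addition order does not change the (exact integer) result; the payoff is that each S×S submatrix of B stays in cache while it is reused.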

SLIDE 22

Turing tax

Alan Turing realised we could use digital technology to implement any computable function. He then proposed the idea of a "universal" computing device: a single device which, with the right program, can implement any computable function without further configuration. The "Turing Tax" is a term for the overhead (in performance, cost, or energy) of universality in this sense – that is, the performance difference between a special-purpose device and a general-purpose one. One of the fundamental questions of computer architecture is how to reduce the Turing Tax.

SLIDE 23

Turing tax

  • FPGAs are Turing Tax
  • Fetch-execute is Turing Tax – but since it doesn't involve communication, it's not the important thing
  • Registers are Turing Tax – because if we know the program's dataflow, we can use wires and latches to pass data from functional unit to functional unit
  • Memory – if we can stream data from where it's produced to where it's used, maybe we don't need so much RAM?
  • Low-latency memory – if we can manage reuse, we can place data in nearby memory
  • Cache – if we know exactly when the reuse will occur, we can program movement to and from local fast memory explicitly

SLIDE 24

Dataflow computing

Maxeler dataflow:
  • Novel dataflow programming model, implemented using FPGAs
  • Max3 workstation for development; rack-based clusters for large-scale application

[Diagram: a multicore x86 PC connected over PCI Express to a Max3 card carrying 24–48 GB of DRAM and a configurable dataflow engine, implemented using a large FPGA.]

SLIDE 25

Dataflow computing

Solving PDEs for fluid dynamics, on an unstructured mesh.

SLIDE 26

For this talk we concentrate on an unstructured 2D quad mesh

Dataflow computing


SLIDE 27

The flow of data

Basically, what we have to do is sweep over the mesh; each time we visit a cell, we compute values at the interfaces.

Why is unstructured useful?

[Diagram: x86 CPU, DFE and DRAM.]

SLIDE 28

The flux reconstruction datapath

Two streams from DRAM – cell data and interface data – both contiguous, stride-1, fully streaming.

[DFE datapath diagram: gather interface data; propagate values from and to interfaces; compute updated cell data; combine interface data; cell and interface data stream to and from DRAM.]

SLIDE 29

Instructions

The machine is a "computer with just one instruction". Instructions are streamed from the x86 CPU, in step with the cell data. Each instruction has four fields, one for each interface of the cell; each field specifies an offset (into the on-chip BRAM circular buffer) and whether this cell owns that interface.

[Diagram: cell instruction stream from the x86 CPU (offset0..offset3, each with an owner flag) feeding the cell-and-interface computation kernel; on-chip BRAM circular buffers hold operand and result sets (cell plus four interfaces); the owner visit writes interface data, the non-owner visit combines interfaces; cell data streams from and to DRAM.]
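The per-cell "instruction" described above – four interface fields, each carrying a buffer offset and an owner flag – can be sketched as a small record type. The field names and the toy interpreter below are hypothetical: the slide does not give the actual DFE encoding.

```python
from collections import namedtuple

# Hypothetical encoding of the per-cell instruction: four interface fields,
# each an offset into the on-chip circular buffer plus an "owner" flag.
# (The real DFE encoding is not given on the slide.)
Iface = namedtuple("Iface", "offset owner")
CellInstr = namedtuple("CellInstr", "ifaces")  # exactly four Iface entries

def visit_cell(instr, buffer, contribution):
    """Toy interpreter: the owner visit initialises an interface slot;
    the non-owner visit combines into it."""
    for f in instr.ifaces:
        if f.owner:
            buffer[f.offset] = contribution   # owner visit: write
        else:
            buffer[f.offset] += contribution  # non-owner visit: combine

buf = [0.0] * 8  # circular buffer (toy size)
# Cell 0 owns the interfaces at offsets 0-3; cell 1 revisits offset 1
# as non-owner, so the shared interface combines both contributions.
visit_cell(CellInstr([Iface(0, True), Iface(1, True),
                      Iface(2, True), Iface(3, True)]), buf, 2.0)
visit_cell(CellInstr([Iface(1, False), Iface(4, True),
                      Iface(5, True), Iface(6, True)]), buf, 3.0)
print(buf[1])  # interface shared by both cells holds the combined value
```

The owner flag is what lets a fully streaming machine handle each interface being visited twice, without any random-access lookup: the first (owner) visit claims the buffer slot, the second combines and retires it.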

SLIDE 30

Partitioning

What happens if the circular buffer isn't big enough to hold the interfaces?

[Same datapath diagram as the previous slide.]

SLIDE 31

Partitions and haloes

The mesh is partitioned so that offsets do not exceed the capacity of the BRAM buffers. Halo interfaces (in red) have contributions from cells in two different partitions.

[Partitioned-mesh diagram.]

SLIDE 32

Haloes

Halo interfaces are streamed from and to the x86 host CPU, and the CPU combines halo interfaces.

[Datapath diagram as before, now with halo-interface input and output data streams between the DFE and the x86 CPU.]

SLIDE 33

What does this mean?

What we have so far: a fully-streaming dataflow architecture for unstructured-mesh CFD, whose performance is limited by DRAM bandwidth for cell and interface data – provided that we can implement the datapath to match the DRAM bandwidth, and that CPU/PCIe bandwidth is not the bottleneck.
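The two "provided" conditions amount to a back-of-envelope bandwidth calculation: streamed bytes per cell update against sustained DRAM bandwidth, and halo traffic against PCIe bandwidth. Every number in the sketch below is an invented placeholder, not a measured figure from the talk.

```python
# Back-of-envelope check of the two "provided" conditions; every number
# here is an invented placeholder, not a measured figure from the talk.
dram_bw = 38e9        # sustained DRAM bandwidth, bytes/s (assumed)
pcie_bw = 4e9         # sustained PCIe bandwidth, bytes/s (assumed)
bytes_per_cell = 256  # cell + interface bytes streamed per update (assumed)
halo_fraction = 0.02  # fraction of traffic crossing partitions (assumed)

cell_rate = dram_bw / bytes_per_cell              # DRAM-limited updates/s
halo_traffic = cell_rate * bytes_per_cell * halo_fraction  # bytes/s to host

print("updates/s (DRAM-limited): %.2e" % cell_rate)
print("PCIe is the bottleneck:", halo_traffic > pcie_bw)
```

Under these assumed numbers the halo traffic stays well below the PCIe budget, so DRAM bandwidth remains the limit; a worse partitioning (larger halo fraction) shifts the bottleneck to the host link.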
SLIDE 34

Computer architecture – the book

Computer Architecture: A Quantitative Approach – five editions since 1990. A revolutionary, landmark book that brought experimental discipline to processor design – and yet almost entirely devoid of theory.

SLIDE 35

Computer architecture – the future?

Computer Architecture: An Asymptotic Approach

SLIDE 36

Conclusions

  • At the application level, the design space is enormous
  • Once you have code, many of the interesting points are unreachable
  • We need to get the applications people to navigate their design space – and automate the pathway to efficient implementation
  • We need to rewrite the textbooks to account for the physical realities of computing

SLIDE 37

Acknowledgements

Partly funded by

  • NERC Doctoral Training Grant (NE/G523512/1)
  • EPSRC "MAPDES" project (EP/I00677X/1)
  • EPSRC "PSL" project (EP/I006761/1)
  • Rolls-Royce and the TSB through the SILOET programme
  • EPSRC "PAMELA" Programme Grant (EP/K008730/1)
  • EPSRC "PRISM" Platform Grant (EP/I006761/1)
  • EPSRC "Custom Computing" Platform Grant (EP/I012036/1)
  • AMD, Codeplay, Maxeler Technologies

Code: http://www.firedrakeproject.org/ and http://op2.github.io/PyOP2/