

1. Is 2.44 trillion unknowns the largest finite element system that can be solved today?
   U. Rüde (LSS Erlangen, ruede@cs.fau.de)
   Lehrstuhl für Informatik 10 (Systemsimulation), Universität Erlangen-Nürnberg
   www10.informatik.uni-erlangen.de
   Advances in Numerical Algorithms and High Performance Computing, University College London, April 14-15, 2014

2. Overview
   - Motivation: how fast our computers are; some aspects of computer architecture; parallelism everywhere
   - Where we stand: scalable parallel multigrid; matrix-free multigrid FE solver; Hierarchical Hybrid Grids (HHG)
   - Multiphysics applications with multigrid and beyond; geodynamics: Terra-Neo
   - What I will not talk about today: electron beam melting; fully resolved 2- and 3-phase bubbly and particulate flows; electroosmotic flows; LBM, granular systems, multibody dynamics, GPUs/accelerators, medical applications, image processing, real-time applications
   - Conclusions

3. High Performance Computer Systems (on the way to Exa-Flops)

4. Computer Architecture is Hierarchical
   Core:
   - vectorization (SSE, AVX), i.e. vectors of 2-8 floating point numbers must be treated in blocks
   - cores may have their own cache memories
   - access to local (cache) memory is fast, access to remote memory is slow
   - pipelining, superscalar execution
   - each core may need several threads to hide memory access latency
   Node:
   - several CPU chips (2) may be combined with local memory to become a node
   - several cores (8) are on a CPU chip
   - within a node we can use „shared memory parallelism“, e.g. OpenMP
   - several cores may share second/third level caches
   - memory access bottlenecks may occur
   - sometimes nodes are equipped with accelerators (i.e. graphics cards)
   Cluster:
   - thousands of nodes are connected by a fast network
   - different network topologies
   - between nodes, message passing must be used, e.g. MPI
   - high latency, low bandwidth
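The split the slide describes, shared-memory parallelism inside a node (OpenMP) and message passing between nodes (MPI), is the usual hybrid programming model. The following is a minimal sketch of that model, not code from the talk: it computes a global dot product with OpenMP threads per rank and an MPI reduction across ranks; the vector size, file name and build flags are illustrative assumptions.

```cpp
// Hybrid MPI + OpenMP sketch (illustrative, not from the talk).
// Build, e.g.:  mpicxx -fopenmp -O2 hybrid_dot.cpp
#include <mpi.h>
#include <omp.h>
#include <cstddef>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each rank (one per node or socket) owns a local block of the global vector.
    const std::size_t n_local = 1 << 20;          // assumed local problem size
    std::vector<double> x(n_local, 1.0);

    // Shared-memory parallelism within the node: OpenMP threads;
    // the compiler may additionally vectorize this loop (SSE/AVX).
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+ : local_sum)
    for (std::size_t i = 0; i < n_local; ++i)
        local_sum += x[i] * x[i];

    // Distributed-memory parallelism between nodes: message passing.
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("ranks=%d, threads/rank=%d, ||x||^2=%.1f\n",
                    size, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}
```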

5. What will Computers Look Like in 2020?
   Super Computer (Heroic Computing):
   - cost: 200 Million €
   - parallel threads: 10^8 - 10^9
   - 10^18 FLOPS, memory: 10^15 - 10^17 Byte (1-100 PByte)
   - power consumption: 20 MW
   Departmental Server (Mainstream Computing for R&D):
   - cost: 200 000 €
   - parallel threads: 10^5 - 10^6
   - 10^15 FLOPS, memory: 10^12 - 10^14 Byte (1-100 TByte)
   - power consumption: 20 kW
   (Mobile) Workstation (Computing for the Masses):
   - ... scale down by another factor of 100
   But remember: predictions are difficult ... especially those about the future.

6. What Supercomputers are Like Today

                                    JUQUEEN               SuperMUC
   System                           IBM Blue Gene/Q       IBM System x iDataPlex
   Processor                        IBM PowerPC A2        Intel Xeon E5-2680 8C
   SIMD                             QPX (256 bit)         AVX (256 bit)
   Peak                             5 872.0 TFlop/s       3 185.1 TFlop/s
   Clock                            1.6 GHz               2.8 GHz
   Nodes                            28 672                9 216
   Node peak                        204.8 GFlop/s         358.4 GFlop/s
   Sockets/cores/threads per node   1/16/64               2/16/32
   GFlop/s per Watt                 2.54                  0.94

   SuperMUC total: 3 PFlop/s

7. Let's try to quantify: what are the limiting resources?
   - Floating point operations/sec (Flops)? Memory capacity? Communication / memory access?
   - What is the capacity of a contemporary supercomputer? Flops? Memory? Memory and communication bandwidth?
   - What are the resource requirements (e.g. to solve Laplace or Stokes) in Flops/DoF? Memory/DoF?
   - ... isn't it surprising that there are hardly any publications that quantify efficient computing in this form?

8. Estimating the cost complexity of FE solvers
   - Memory requirement for 10^6 unknowns:
     - solution vector: 8 MBytes, plus 3 auxiliary vectors: 32 MBytes
     - stiffness & mass matrix, assuming ≈ 15 nonzeros per row (linear tet elements): 240 MBytes
     - a matrix-free implementation can save a factor of O(10)
   - Assume an asymptotically optimal solver (multigrid for a scalar elliptic PDE): ≈ 100 Flops/unknown at efficiency η = 0.1
   - A machine with 1 GFlop/s and 100 MByte should then solve 3 × 10^6 unknowns in 3 seconds
   - A machine with 1 PFlop/s and 100 TByte should then solve 3 × 10^12 unknowns in 3 seconds
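For reference, the last two bullets follow from a simple back-of-the-envelope calculation with the slide's own figures (100 Flops per unknown, efficiency η = 0.1, four vectors of 8 Bytes per unknown):

$$ t \approx \frac{100\ \tfrac{\text{Flops}}{\text{unknown}} \cdot 3\times 10^{12}\ \text{unknowns}}{\eta \cdot P_{\text{peak}}} = \frac{3\times 10^{14}\ \text{Flops}}{0.1 \cdot 10^{15}\ \text{Flop/s}} = 3\ \text{s}, \qquad M \approx 3\times 10^{12} \cdot 4 \cdot 8\ \text{Byte} \approx 10^{14}\ \text{Byte} \approx 100\ \text{TByte}. $$

Scaling both machine and problem down by a factor of 10^6 gives the 1 GFlop/s, 100 MByte, 3 × 10^6 unknowns case.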

9. What good are 10^12 Finite Elements?
   - Earth's oceans together have ca. 1.3 × 10^9 km^3: we can resolve the volume of the planetary ocean globally with ca. 100 m resolution
   - Earth's mantle has 0.91 × 10^12 km^3: we can resolve the volume of the mantle with ca. 1 km resolution
   - The human circulatory system contains ca. 0.006 m^3 of volume:
     - discretized with 10^15 finite elements, this gives a mesh size of 2 μm
     - at exa-scale: 10^3 operations per second and per volume element
   - A red blood cell is ca. 7 μm large; we have ca. 2.5 × 10^13 red blood cells: with an exa-scale system we can spend 4 × 10^4 Flops per second and per blood cell
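The resolutions quoted above follow from dividing each volume by the number of cells and taking the cube root (my own check of the slide's numbers):

$$ h_{\text{ocean}} \approx \left(\frac{1.3\times 10^{18}\ \mathrm{m^3}}{10^{12}}\right)^{1/3} \approx 109\ \mathrm{m}, \qquad h_{\text{mantle}} \approx \left(\frac{0.91\times 10^{21}\ \mathrm{m^3}}{10^{12}}\right)^{1/3} \approx 0.97\ \mathrm{km}, \qquad h_{\text{blood}} \approx \left(\frac{6\times 10^{-3}\ \mathrm{m^3}}{10^{15}}\right)^{1/3} \approx 1.8\ \mathrm{\mu m}. $$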

10. Towards Scalable Algorithms and Data Structures

11. What are the problems?
   - Unprecedented levels of parallelism: maybe billions of cores/threads needed
   - Hybrid architectures: standard CPU vector units (SSE, AVX); accelerators (GPU, Intel Xeon Phi)
   - Memory wall: memory response is slow (latency); memory transfer is limited (bandwidth)
   - Power considerations dictate: limits to clock speed => multi-core; limits to memory size (Byte/Flop); limits to address references per operation; limits to resilience

12. Designing Algorithms!
   - Would you want to propel a Superjumbo with four strong jet engines or with 1,000,000 blow dryer fans?
   - [Diagram: large scale simulation software targeting either moderately parallel computing or massively parallel multicore systems]

13. The Energy Problem: a Thought Experiment
   - 10^12 elements/nodes
   - Assume that every entity must contribute to any other (as is typical for an elliptic problem), equivalent to either multiplication with the inverse matrix (n × n) or an n-body Coulomb interaction
   - This results in 10^24 data movements at 1 NanoJoule each (optimistic)
   - Together: 10^15 Joule, i.e. 277 GWh or 240 kilotons of TNT
   - [Picture: the Badger explosion of 1953 with 23 kilotons of TNT; source: Wikipedia]
   - From: Exascale Programming Challenges, Report of the 2011 Workshop on Exascale Programming Challenges, Marina del Rey, July 27-29, 2011
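The conversions behind these figures use the standard values 1 GWh = 3.6 × 10^12 J and 1 t TNT ≈ 4.184 × 10^9 J; the 1 nJ per data movement is the slide's own optimistic assumption:

$$ (10^{12})^2 \cdot 1\ \mathrm{nJ} = 10^{24} \cdot 10^{-9}\ \mathrm{J} = 10^{15}\ \mathrm{J}, \qquad \frac{10^{15}\ \mathrm{J}}{3.6\times 10^{12}\ \mathrm{J/GWh}} \approx 277\ \mathrm{GWh}, \qquad \frac{10^{15}\ \mathrm{J}}{4.184\times 10^{9}\ \mathrm{J/(t\ TNT)}} \approx 2.4\times 10^{5}\ \mathrm{t} = 240\ \mathrm{kt\ TNT}. $$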

14. Multigrid for FE on Peta-Scale Computers

15. Multigrid: V-Cycle
   Goal: solve $A_h u_h = f_h$ using a hierarchy of grids.
   [V-cycle diagram: relax on the fine grid, form the residual, restrict, solve by recursion on the coarser grid, interpolate, correct, relax again]
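To make the cycle structure concrete, here is a self-contained sketch of a V(2,2) cycle for the 1D Poisson problem -u'' = f with weighted Jacobi smoothing. It only illustrates the recursion pattern named on the slide (relax, restrict the residual, solve by recursion, interpolate and correct, relax); it is not the HHG solver, and the grid size, smoother and model problem are illustrative assumptions.

```cpp
// Minimal 1D multigrid V-cycle sketch (illustrative, not the HHG code).
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;   // grid values including the two boundary points

// Weighted Jacobi relaxation for -u'' = f, stencil (1/h^2) [-1 2 -1].
void relax(Vec& u, const Vec& f, double h, int sweeps) {
    const double w = 2.0 / 3.0;
    for (int s = 0; s < sweeps; ++s) {
        Vec old = u;
        for (std::size_t i = 1; i + 1 < u.size(); ++i)
            u[i] = (1.0 - w) * old[i] + w * 0.5 * (old[i - 1] + old[i + 1] + h * h * f[i]);
    }
}

Vec residual(const Vec& u, const Vec& f, double h) {
    Vec r(u.size(), 0.0);
    for (std::size_t i = 1; i + 1 < u.size(); ++i)
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h);
    return r;
}

// Full-weighting restriction to the next coarser grid (half the intervals).
Vec restrict_fw(const Vec& r) {
    Vec rc((r.size() - 1) / 2 + 1, 0.0);
    for (std::size_t i = 1; i + 1 < rc.size(); ++i)
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1];
    return rc;
}

// Linear interpolation of the coarse-grid correction ec, added onto u.
void correct(Vec& u, const Vec& ec) {
    for (std::size_t i = 1; i + 1 < ec.size(); ++i) {
        u[2 * i]     += ec[i];
        u[2 * i - 1] += 0.5 * (ec[i - 1] + ec[i]);
    }
    u[u.size() - 2] += 0.5 * ec[ec.size() - 2];   // last fine midpoint (boundary correction is 0)
}

// One V-cycle: relax, restrict the residual, solve by recursion, interpolate & correct, relax.
void vcycle(Vec& u, const Vec& f, double h) {
    if (u.size() <= 3) {               // coarsest grid: one interior point, solve exactly
        u[1] = 0.5 * h * h * f[1];
        return;
    }
    relax(u, f, h, 2);                               // pre-smoothing
    Vec rc = restrict_fw(residual(u, f, h));         // restrict the residual
    Vec ec(rc.size(), 0.0);                          // coarse-grid error, zero initial guess
    vcycle(ec, rc, 2.0 * h);                         // solve the coarse problem by recursion
    correct(u, ec);                                  // interpolate and add the correction
    relax(u, f, h, 2);                               // post-smoothing
}

int main() {
    const std::size_t n = 257;                       // 256 intervals, h = 1/256
    const double h = 1.0 / static_cast<double>(n - 1);
    Vec u(n, 0.0), f(n, 1.0);                        // -u'' = 1, u(0) = u(1) = 0
    for (int k = 0; k < 10; ++k) {
        vcycle(u, f, h);
        double rmax = 0.0;
        for (double r : residual(u, f, h)) rmax = std::max(rmax, std::fabs(r));
        std::printf("V-cycle %2d: max residual = %.2e\n", k + 1, rmax);
    }
    return 0;
}
```

Each cycle should reduce the residual by roughly a constant factor independent of the grid size, which is what makes multigrid an asymptotically optimal solver in the sense of slide 8.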

16. [Figure-only slide, no text content]

17. How fast can we make FE multigrid?
   - Parallelize „plain vanilla“ multigrid for tetrahedral finite elements (Bey's tetrahedral refinement):
     - partition the domain
     - parallelize all operations on all grids
     - use clever data structures
     - matrix-free implementation
   - Do not worry (so much) about coarse grids: idle processors? short messages? sequential dependencies in the grid hierarchy?
   - Elliptic problems always require global communication. This cannot be accomplished by local relaxation, Krylov space acceleration, or domain decomposition without a coarse grid.

18. Hierarchical Hybrid Grids (HHG)
   Joint work with Frank Hülsemann (now EDF, Paris), Ben Bergen (now Los Alamos), T. Gradl (Erlangen), B. Gmeiner (Erlangen)
   HHG goal: ultimate parallel FE performance!
   - unstructured coarse grid with regular refinement substructures for efficiency
   - superconvergence effects
   - matrix-free implementation using the regular substructures (see the sketch below):
     - constant stencil when coefficients are constant
     - assembly-on-the-fly for variable coefficients
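A minimal sketch of the matrix-free idea: inside a regularly refined patch the discrete operator reduces to a fixed stencil, so no sparse matrix is stored and v = A u is computed on the fly. For simplicity, a 7-point finite-difference Laplace stencil on a structured block stands in here for the roughly 15-entry tetrahedral FE stencils mentioned on slide 8; the data layout and all names are illustrative assumptions, not HHG code.

```cpp
// Matrix-free stencil application on one structured patch (illustrative sketch).
#include <cstddef>
#include <vector>

struct Block {                         // one regularly refined interior patch
    std::size_t nx, ny, nz;            // points per direction (including a ghost layer)
    std::vector<double> v;             // values in lexicographic order
    double& at(std::size_t i, std::size_t j, std::size_t k) {
        return v[(k * ny + j) * nx + i];
    }
};

// Apply the constant-coefficient operator: a handful of stencil weights
// replace a stored matrix row for every interior point of the patch.
// For variable coefficients, HHG instead assembles the local stencil on the
// fly from the element data rather than using fixed weights as done here.
void apply(Block& out, Block& in, double h) {
    const double c = 6.0 / (h * h), o = -1.0 / (h * h);   // constant stencil entries
    for (std::size_t k = 1; k + 1 < in.nz; ++k)
        for (std::size_t j = 1; j + 1 < in.ny; ++j)
            for (std::size_t i = 1; i + 1 < in.nx; ++i)
                out.at(i, j, k) =
                    c * in.at(i, j, k) +
                    o * (in.at(i - 1, j, k) + in.at(i + 1, j, k) +
                         in.at(i, j - 1, k) + in.at(i, j + 1, k) +
                         in.at(i, j, k - 1) + in.at(i, j, k + 1));
}

int main() {
    const std::size_t n = 33;                                   // assumed patch size
    Block u {n, n, n, std::vector<double>(n * n * n, 1.0)};
    Block Au{n, n, n, std::vector<double>(n * n * n, 0.0)};
    apply(Au, u, 1.0 / static_cast<double>(n - 1));             // v = A u without storing A
    return 0;
}
```

This is exactly the trade-off quantified on slide 8: storing the matrix costs roughly an order of magnitude more memory than the vectors, and the regular substructure is what makes it cheap to avoid.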

19. HHG refinement example: input grid

20. HHG refinement example: refinement level one

21. HHG refinement example: refinement level two

22. HHG refinement example: structured interior

23. HHG refinement example: structured interior

24. HHG refinement example: edge interior

25. HHG refinement example: edge interior

26. Regular tetrahedral refinement
   - structured refinement of tetrahedra
   - use regular HHG patches for partitioning the domain (shown in 2D for simplification)
   - the HHG input mesh is quite large on many cores: communication of ghost layers
   - coarse grid with 132k elements, as assigned to the supercomputer; each of the ≈ 132 k tetrahedral elements was assigned to one Jugen
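Since Bey's regular refinement (slide 17) splits every tetrahedron into eight children, ℓ levels of refinement turn the 132k-element input mesh into

$$ N_{\text{tet}}(\ell) = 132{,}000 \cdot 8^{\ell} $$

tetrahedra. This growth law is a general property of the refinement rule; the specific number of levels used for the 2.44-trillion-unknown run is not stated on this slide.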

27. HHG Parallel Update Algorithm

   for each vertex do
       apply operation to vertex
   end for
   update vertex primary dependencies

   for each edge do
       copy from vertex interior
       apply operation to edge
       copy to vertex halo
   end for
   update edge primary dependencies

   for each element do
       copy from edge/vertex interiors
       apply operation to element
       copy to edge/vertex halos
   end for
   update secondary dependencies
