Introduction to High Performance Computing and Optimization

SLIDE 1

Institut für Numerische Mathematik und Optimierung

Introduction to High Performance Computing and Optimization

Oliver Ernst

Audience: 1./3. CMS, 5./7./9. Mm, doctoral students Wintersemester 2012/13

SLIDE 2

Contents

  • 1. Introduction
  • 2. Processor Architecture
  • 3. Optimization of Serial Code
  • 4. Parallel Computers
  • 5. Parallelisation Fundamentals
  • 6. OpenMP Programming
  • 7. MPI Programming

Oliver Ernst (INMO) HPC Wintersemester 2012/13 1


SLIDE 4

High Performance Computing

Computing

Three broad domains:

  • Scientific Computing: engineering, earth sciences, medicine, finance, . . .
  • Consumer Computing: audio/image/video processing, graph analysis, . . .
  • Embedded Computing: control, communication, signal processing, . . .

Limited number of critical kernels:

  • Dense and sparse linear algebra
  • Convolution, stencils, filter-type operations
  • Graph algorithms
  • Codecs
  • . . .

  • Cf. the 13 dwarfs/motifs of computing

http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf

sources: IPAM, TU München; Apple Inc.; Drexel U

SLIDE 5

High Performance Computing

Hardware then and now

  • ENIAC (1946)
  • IBM 360 Series (1964)
  • Cray 1 (1976)
  • Connection Machine 2 (1987)
  • SGI Origin 2000 (1996)
  • IBM Blue Gene/Q (2012)


SLIDE 6

High Performance Computing

Developments

  • 70 years of electronic computing: initially unique, pioneering machines
  • Later (1970s–1990s): specialized designs and a hardware industry (CDC, Cray, TMC)
  • Up to here: the leading edge in computing was determined by HPC requirements
  • Last 20 years: commodity hardware designed for other purposes (business transactions, gaming) adapted/modified for HPC
  • Dominant design: general-purpose microprocessor with hierarchical memory structure


SLIDE 7

High Performance Computing

Moore’s Law

Gordon Moore, cofounder of Intel, in 1965¹:

“Integrated circuits will lead to such wonders as home computers—or at least terminals connected to a central computer—automatic controls for automobiles, and personal portable communications equipment. [. . . ] The complexity for minimum component costs has increased at a rate of roughly a factor of two per year (see graph). Certainly over the short term this rate can be expected to continue, if not to increase.”

Folklore: a period of 18 months for the performance doubling of computer chips.

¹ Gordon E. Moore, “Cramming More Components onto Integrated Circuits,” Electronics, pp. 114–117, April 19, 1965.


SLIDE 8

High Performance Computing

Moore’s Law: some data

[Figure: CPU transistor counts 1971–2012, semi-log plot of transistor count vs. year of introduction]

For a long time, increased transistor count translated to reduced cycle time for CPUs . . .

Year  Transistor count   Name
1971  2,300              Intel 4004
1972  3,500              Intel 8008
1974  4,100              Motorola 6800
1974  4,500              Intel 8080
1976  8,500              Zilog Z80
1978  29,000             Intel 8086
1979  68,000             Motorola 68000
1982  134,000            Intel 80286
1985  275,000            Intel 80386
1989  1,180,000          Intel 80486
1993  3,100,000          Intel Pentium
1995  5,500,000          Intel Pentium Pro
1996  4,300,000          AMD K5
1997  7,500,000          Pentium II
1997  8,800,000          AMD K6
1999  9,500,000          Pentium III
2000  42,000,000         Pentium 4
2003  105,900,000        AMD K8
2003  220,000,000        Itanium 2
2006  291,000,000        Core 2 Duo
2007  904,000,000        Opteron 2400
2007  789,000,000        Power 6
2008  758,000,000        AMD K10
2010  2,300,000,000      Nehalem-EX
2010  1,170,000,000      Core i7 Gulftown
2011  2,600,000,000      Xeon Westmere-EX
2011  2,270,000,000      Core i7 Sandy Bridge
2012  1,200,000,000      AMD Bulldozer


SLIDE 9

High Performance Computing

Moore’s Law: heat wall

source: M. Püschel, ETH Zürich

SLIDE 10

High Performance Computing

Top 500, June 2012

Rank  Site, Country                                                Computer
1     DOE/NNSA/LLNL, United States                                 Sequoia: BlueGene/Q, Power BQC 16C 1.60 GHz, Custom (IBM)
2     RIKEN Advanced Institute for Computational Science, Japan    K computer: SPARC64 VIIIfx 2.0 GHz, Tofu interconnect (Fujitsu)
3     DOE/SC/Argonne National Laboratory, United States            Mira: BlueGene/Q, Power BQC 16C 1.60 GHz, Custom (IBM)
4     Leibniz Rechenzentrum, Germany                               SuperMUC: iDataPlex DX360M4, Xeon E5-2680 8C 2.70 GHz, Infiniband FDR (IBM)
5     National Supercomputing Center in Tianjin, China             Tianhe-1A: NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050 (NUDT)
6     DOE/SC/Oak Ridge National Laboratory, United States          Jaguar: Cray XK6, Opteron 6274 16C 2.20 GHz, Cray Gemini interconnect, NVIDIA 2090 (Cray Inc.)
7     CINECA, Italy                                                Fermi: BlueGene/Q, Power BQC 16C 1.60 GHz, Custom (IBM)
8     Forschungszentrum Juelich (FZJ), Germany                     JuQUEEN: BlueGene/Q, Power BQC 16C 1.60 GHz, Custom (IBM)
9     CEA/TGCC-GENCI, France                                       Curie thin nodes: Bullx B510, Xeon E5-2680 8C 2.70 GHz, Infiniband QDR (Bull)
10    National Supercomputing Centre in Shenzhen (NSCS), China     Nebulae: Dawning TC3600 Blade System, Xeon X5650 6C 2.66 GHz, Infiniband QDR, NVIDIA 2050 (Dawning)

Ranking based on performance running the LINPACK benchmark, the LU factorization of a matrix.


SLIDE 11

High Performance Computing

Top 500, June 2012

Top-ranked system: Sequoia (BlueGene/Q)
Location: Lawrence Livermore National Laboratory (CA/USA)
Purpose: nuclear weapons simulations
Manufacturer: IBM

source: LLNL

Operating system: Linux
Cores: 1,572,864
Memory: 1,572,864 GB
Power consumption: 7.89 MW
Peak performance: 20,132.7 TFlops²
Sustained performance: 16,324.8 TFlops (81%)

  • Cf. 10,510 TFlops of top system November 2011 (K Computer, Japan)
  • Cf. top system of 1st Top500 (TMC CM-5): 273,930 times faster.

² 1 TFlops = 10¹² Flops


SLIDE 12

High Performance Computing

Top 500, progress


SLIDE 13

High Performance Computing

Efficiency

Most numerical code runs at ≈ 10% efficiency. Coping strategies:

  • Do nothing and hope hardware gets faster. (Worked up to 2004.)
  • Rely on the compiler to generate optimal code. (Not yet.)
  • Understand the intricacies of modern computer architectures and learn to write optimized code.
  • Write code which is efficient on any architecture.
  • Know the most efficient numerical libraries and use them.
