
SLIDE 1

High Performance Computing – Part Two

ADVANCED SCIENTIFIC COMPUTING

  • Dr.-Ing. Morris Riedel

Adjunct Associated Professor, School of Engineering and Natural Sciences, University of Iceland
Research Group Leader, Juelich Supercomputing Centre, Germany

Introduction to High Performance Computing

August 23, 2017, Room TG-227

SLIDE 2

Outline

SLIDE 3

Outline

  • High Performance Computing (HPC) Basics
    • Four basic building blocks of HPC
    • TOP500 and Performance Benchmarks
    • Shared Memory and Distributed Memory Architectures
    • Hybrid and Emerging Architectures
  • HPC Ecosystem Technologies
    • Software Environments & Scheduling
    • System Architectures & Network Topologies
    • Data Access & Large-scale Infrastructures
  • Parallel Programming Basics
    • Message Passing Interface (MPI)
    • OpenMP
    • GPGPUs
    • Selected Programming Challenges

SLIDE 4

High Performance Computing (HPC) Basics

SLIDE 5

What is High Performance Computing?

  • Wikipedia redirects from ‘HPC’ to ‘Supercomputer’ – an interesting hint at what the field is generally about
  • A supercomputer is a computer at the frontline of contemporary processing capacity – particularly speed of calculation [1]
  • HPC includes work on ‘four basic building blocks’ in this course:
    • Theory (numerical laws, physical models, speed-up performance, etc.)
    • Technology (multi-core, supercomputers, networks, storages, etc.)
    • Architecture (shared-memory, distributed-memory, interconnects, etc.)
    • Software (libraries, schedulers, monitoring, applications, etc.)

[1] Wikipedia ‘Supercomputer’ Online [2] Introduction to High Performance Computing for Scientists and Engineers

SLIDE 6

HPC vs. High Throughput Computing (HTC) Systems

  • High Performance Computing (HPC) is based on computing resources that enable the efficient use of parallel computing techniques through specific support with dedicated hardware such as high-performance CPU/core interconnections. These are compute-oriented systems – the network interconnection is important.
  • High Throughput Computing (HTC) is based on commonly available computing resources such as commodity PCs and small clusters that enable the execution of ‘farming jobs’ without providing a high-performance interconnection between the CPUs/cores. These are data-oriented systems – the network interconnection is less important.

SLIDE 7

Parallel Computing

  • We speak of parallel computing whenever a number of ‘compute elements’ (e.g. cores) solve a problem in a cooperative way [2]
  • All modern supercomputers depend heavily on parallelism
  • Often known as ‘parallel processing’ of some problem space
  • Tackle problems in parallel to enable the ‘best performance’ possible
  • ‘The measure of speed’ in High Performance Computing matters
  • A common measure for parallel computers is established by the TOP500 list
  • Based on a benchmark for ranking the best 500 computers worldwide

[2] Introduction to High Performance Computing for Scientists and Engineers [3] TOP 500 supercomputing sites

SLIDE 8

TOP 500 List (June 2017)

(Figure: TOP500 ranking of June 2017 – annotations mark the power challenge and the EU #1 system)

[3] TOP 500 supercomputing sites

SLIDE 9

LINPACK Benchmarks and Alternatives

  • The TOP500 ranking is based on the LINPACK benchmark
  • LINPACK solves a dense system of linear equations of unspecified size [4]
  • LINPACK covers only a single architectural aspect (‘critics exist’)
  • Measures ‘peak performance’: all involved ‘supercomputer elements’ operate at maximum performance
  • Available through a wide variety of ‘open source implementations’
  • Success via ‘simplicity & ease of use’, thus used for over two decades
  • The top 10 systems in the TOP500 list are dominated by companies, e.g. IBM, CRAY, Fujitsu, etc.
  • Benchmark suites based on more realistic applications might be alternatives:
    • HPC Challenge benchmarks (includes 7 tests) [5]
    • JUBE benchmark suite (based on real applications) [6]

[4] LINPACK Benchmark implementation [5] HPC Challenge Benchmark Suite [6] JUBE Benchmark Suite

SLIDE 10

Dominant Architectures of HPC Systems

  • Traditionally two dominant types of architectures:
    • Shared-Memory Computers
    • Distributed-Memory Computers
  • Often hierarchical (hybrid) systems of both in practice
  • In the last couple of years the community has been dominated by X86-based commodity clusters running the Linux OS on Intel/AMD processors
  • More recently, both of the above are also considered as ‘programming models’:
    • Shared-memory parallelization with OpenMP
    • Distributed-memory parallel programming with MPI

SLIDE 11

Shared-Memory Computers

  • A shared-memory parallel computer is a system in which a number of CPUs work on a common, shared physical address space [2]
  • Two varieties of shared-memory systems:
    1. Unified Memory Access (UMA)
    2. Cache-coherent Nonuniform Memory Access (ccNUMA)
  • The problem of ‘cache coherence’ (in UMA/ccNUMA):
    • Different CPUs use caches to ‘modify the same cache values’
    • Consistency between cached data & data in memory must be guaranteed
    • ‘Cache coherence protocols’ ensure a consistent view of memory

[2] Introduction to High Performance Computing for Scientists and Engineers

SLIDE 12

Shared-Memory with UMA

  • UMA systems use a ‘flat memory model’: latencies and bandwidth are the same for all processors and all memory locations
  • Also called Symmetric Multiprocessing (SMP)
  • Example: two dual-core chips (2 cores/socket)
    • Socket = a physical package (with multiple cores), typically a replaceable component
    • P = processor core
    • L1D = Level 1 Cache – Data (fastest)
    • L2 = Level 2 Cache (fast)
    • Memory = main memory (slow)
    • Chipset = enforces cache coherence and mediates connections to memory

[2] Introduction to High Performance Computing for Scientists and Engineers

SLIDE 13

Shared-Memory with ccNUMA

  • ccNUMA systems logically share memory that is physically distributed (similar to distributed-memory systems)
  • Network logic makes the aggregated memory appear as one single address space
  • Example: eight cores (4 cores/socket); L3 = Level 3 Cache
  • Memory interface = establishes a coherent link to enable one ‘logical’ single address space over ‘physically distributed memory’

[2] Introduction to High Performance Computing for Scientists and Engineers

SLIDE 14

Programming with Shared Memory using OpenMP

  • Shared-memory programming enables immediate access to all data from all processors without explicit communication
  • OpenMP is a set of compiler directives to ‘mark parallel regions’
  • Bindings are defined for the C, C++, and Fortran languages
  • Threads TX are ‘lightweight processes’ that mutually access data (T1 T2 T3 T4 T5 on one shared memory)
  • OpenMP is the dominant shared-memory programming standard today (v3) [7]

[7] OpenMP API Specification

SLIDE 15

Distributed-Memory Computers

  • A distributed-memory parallel computer establishes a ‘system view’ where no process can access another process’ memory directly [2]
  • Processors communicate via Network Interfaces (NI)
  • The NI mediates the connection to a communication network
  • This setup is rarely used as such today – it rather survives as a programming model view

[2] Introduction to High Performance Computing for Scientists and Engineers

SLIDE 16

Programming with Distributed Memory using MPI

  • Distributed-memory programming enables explicit message passing as communication between processors (P1 P2 P3 P4 P5)
  • No remote memory access on distributed-memory systems
  • Requires ‘sending messages’ back and forth between processes PX
  • Many free Message Passing Interface (MPI) libraries are available
  • Programming is tedious & complicated, but the most flexible method
  • MPI is the dominant distributed-memory programming standard today (v2.2) [8]

[8] MPI Standard

SLIDE 17

Hierarchical Hybrid Computers

  • A hierarchical hybrid parallel computer is neither a purely shared-memory nor a purely distributed-memory type system, but a mixture of both
  • Large-scale ‘hybrid’ parallel computers today have shared-memory building blocks interconnected with a fast network
  • Shared-memory nodes (here ccNUMA) with local NIs
  • The NI mediates connections to other remote ‘SMP nodes’

[2] Introduction to High Performance Computing for Scientists and Engineers

SLIDE 18

Programming Hybrid Systems

  • Hybrid systems programming uses MPI for explicit internode communication and OpenMP for parallelization within the node
  • Experience from HPC practice:
    • Most parallel applications still take no notice of the hardware structure
    • Pure MPI for parallelization remains the dominant programming model (historical reason: old supercomputers were all of the distributed-memory type)
  • Challenges with the ‘mapping problem’:
    • The performance of hybrid (as well as pure MPI) codes depends crucially on factors not directly connected to the programming model
    • It largely depends on the association of threads and processes to cores
  • Emerging ‘hybrid programming models’ using GPGPUs and CPUs

SLIDE 19

Emerging HPC System Architecture Developments

  • An increasing number of other ‘new’ emerging system architectures (often in a state of flux or vendor-specific; details are quickly outdated and not tackled in depth in this course)
  • General-Purpose Computation on Graphics Processing Units (GPGPUs)
    • Use of GPUs for computing instead of computer graphics
    • Programming models are OpenCL and NVIDIA CUDA
    • Getting more and more adopted in many application fields
  • Field Programmable Gate Arrays (FPGAs)
    • Integrated circuits designed to be configured by a user after shipping
    • Enable updates of functionality and reconfigurable ‘wired’ interconnects
  • Cell processors
    • Enable the combination of general-purpose cores with co-processing elements that accelerate dedicated forms of computation

SLIDE 20

HPC Ecosystem Technologies

SLIDE 21

HPC Software Environment

  • HPC systems typically provide a software environment that supports the processing of parallel applications
  • Operating System
    • In former times often a ‘proprietary OS’, nowadays often a (reduced) ‘Linux’
  • Scheduling Systems
    • Manage concurrent access of users on supercomputers
    • Different scheduling algorithms can be used with different ‘batch queues’
    • Examples: SLURM @ JÖTUNN cluster, LoadLeveler @ JUQUEEN, etc.
  • Monitoring Systems
    • Monitor and test the status of the system (‘system health checks/heartbeat’)
    • Enable a view of the usage of the system per node/rack (‘system load’)
    • Examples: LLView, INCA, Ganglia @ JÖTUNN cluster, etc.
  • Performance Analysis Systems
    • Measure the performance of an application and recommend improvements
    • Examples: SCALASCA, VAMPIR, etc.

SLIDE 22

Example: Ganglia @ Jötunn Cluster

[9] Jötunn Cluster Ganglia Monitoring Online

SLIDE 23

Scheduling Principles

  • Scheduling is the method by which user processes are given access to (shared) processor time
  • HPC systems are typically not used in an interactive fashion
    • A program application starts ‘processes’ on processors (‘do a job for a user’)
    • Users of HPC systems send ‘job scripts’ to schedulers to start programs
  • Scheduling enables the sharing of the HPC system with other users
  • Closely related to operating systems, with a wide variety of algorithms
    • E.g. First Come First Serve (FCFS): queues processes in the order in which they arrive in the ready queue
    • E.g. Backfilling: maximizes cluster utilization and throughput; the scheduler searches for jobs that can fill gaps in the schedule, so smaller jobs farther back in the queue run ahead of a job waiting at the front of the queue (but this job should not be delayed by backfilling!)

SLIDE 24

Example: Concurrent Usage of a Supercomputer

[10] LLView Tool

SLIDE 25

System Architectures

  • HPC systems are very complex ‘machines’ with many elements (example: IBM BlueGene/Q)
    • CPUs & multi-cores with ‘multi-threading’ capabilities
    • Data access levels with different levels of caches
    • Network topologies with various interconnects
  • HPC faced a significant change in practice with respect to performance increase after many years
    • Getting more speed for free by waiting for new CPU generations does not work any more
    • Multicore processors emerged and require the use of those multiple resources efficiently in parallel

SLIDE 26

Example: Supercomputer BlueGene/Q

[10] LLView Tool

SLIDE 27

Multi-core CPU Processors

  • Significant advances in CPUs (or microprocessor chips)
    • Multi-core architecture with dual, quad, six, or n processing cores
    • Processing cores are all on one chip
  • Multi-core CPU chip architecture with a hierarchy of caches (on/off chip):
    • L1 cache is private to each core; on-chip
    • L2 cache is shared; on-chip
    • L3 cache or Dynamic Random Access Memory (DRAM); off-chip
  • The clock rate of single processors increased from 10 MHz (Intel 286) to 4 GHz (Pentium 4) in 30 years
  • Clock rate increases beyond 5 GHz unfortunately reached a limit due to power limitations/heat
  • Multi-core CPU chips have quad, six, or n processing cores on one chip and use cache hierarchies

[11] Distributed & Cloud Computing Book

SLIDE 28

Example: BlueGene Architecture Evolution

  • BlueGene/P
  • BlueGene/Q

SLIDE 29

Network Topologies

  • Large-scale HPC systems have special network setups
    • Dedicated I/O nodes and fast interconnects, e.g. InfiniBand (IB)
    • Different network topologies, e.g. tree, 5D torus, mesh, etc. (raising challenges in task mappings and communication patterns)

[2] Introduction to High Performance Computing for Scientists and Engineers (Source: IBM)

SLIDE 30

Data Access

  • P = processor core elements that compute floating-point or integer operations:
    • Arithmetic units (compute operations)
    • Registers (feed those units with operands)
  • ‘Data access’ levels for applications (faster toward the top, cheaper and larger but slower toward the bottom):
    • Registers: accessed without any delay
    • L1D = Level 1 Cache – Data (fastest, used normally)
    • L2 = Level 2 Cache (fast, used often)
    • L3 = Level 3 Cache (still fast, used less often)
    • Main memory (slow, but larger in size)
    • Storage media like hard disks, tapes, etc. (too slow to be used in direct computing)
  • The DRAM gap is the large discrepancy between main memory and cache bandwidths

[2] Introduction to High Performance Computing for Scientists and Engineers

SLIDE 31

HPC Relationship to ‘Big Data’

[12] F. Berman: ‘Maximising the Potential of Research Data’

SLIDE 32

Large-scale Computing Infrastructures

  • Large computing systems are often embedded in infrastructures, e.g. Grid computing for distributed data storage and processing via middleware
  • The success of Grid computing was renowned when mentioned by Prof. Rolf-Dieter Heuer, CERN Director General, in the context of the Higgs boson discovery: ‘Results today only possible due to extraordinary performance of Accelerators – Experiments – Grid computing’ [13]
  • Other large-scale distributed infrastructures exist:
    • Partnership for Advanced Computing in Europe (PRACE) – EU HPC
    • Extreme Science and Engineering Discovery Environment (XSEDE) – US HPC

[13] Grid Computing YouTube Video

SLIDE 33

Towards new HPC Architectures – DEEP-EST EU Project

(Figure: sketch of the DEEP-EST modular architecture – a CN with general-purpose CPU and MEM, a BN with many-core CPU and MEM, a DN with general-purpose CPU, FPGA, MEM and NVRAM, a NAM with FPGA and NVRAM banks, and a GCE, mapped against a possible application workload)

[14] DEEP-EST EU Project

SLIDE 34

Parallel Programming Basics

SLIDE 35

Distributed-Memory Computers Reviewed

  • A distributed-memory parallel computer establishes a ‘system view’ where no process can access another process’ memory directly
  • Processors communicate via Network Interfaces (NI)
  • The NI mediates the connection to a communication network
  • This setup is rarely used as such today – it rather survives as a programming model view
  • Programming model: message passing

Modified from [2] Introduction to High Performance Computing for Scientists and Engineers

SLIDE 36

Programming with Distributed Memory using MPI

  • Distributed-memory programming enables explicit message passing as communication between processors (P1 P2 P3 P4 P5)
  • No remote memory access on distributed-memory systems
  • Requires ‘sending messages’ back and forth between processes PX
  • Many free Message Passing Interface (MPI) libraries are available
  • Programming is tedious & complicated, but the most flexible method
  • MPI is the dominant distributed-memory programming standard today (v3.1) [8]

[8] MPI Standard

SLIDE 37

What is MPI?

  • A ‘communication library’ abstracting from the low-level network view
    • Offers 500+ functions to communicate between computing nodes
    • Practice reveals: parallel applications often require just ~12 (!) functions
    • Includes routines for efficient ‘parallel I/O’ (using underlying hardware)
  • Supports ‘different ways of communication’
    • ‘Point-to-point communication’ between two computing nodes (P ↔ P)
    • Collective functions involve ‘N computing nodes in useful communication’
  • Deployment on supercomputers
    • Installed on (almost) all parallel computers
    • Bindings for different languages: C, Fortran, Python, R, etc.
    • Careful: different versions might be installed
  • Recall: ‘computing nodes’ are independent computing processors (that may also have N cores each) and are all part of one big parallel computer

SLIDE 38

Message Passing: Exchanging Data with Send/Receive

  • An HPC machine consists of compute nodes (P1 P2 P3 P4 P5 P6), each with its own processor (P) and memory (M)
  • Each processor has its own data in its memory that cannot be seen/accessed by other processors
  • Point-to-point communications exchange data between pairs of processes: in the slide’s example, nodes hold DATA: 17, 06, 19, 80, and after two send/receive exchanges the receiving nodes additionally hold NEW: 17 and NEW: 06 (see the sketch below)
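As a minimal, hedged C sketch of such a point-to-point exchange (the value 17 mirrors the slide’s example; everything else is illustrative, not the lecture’s code), rank 0 sends an integer to rank 1:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, data = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            data = 17;                     /* this process owns DATA: 17 */
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* blocking receive from rank 0 with matching tag 0 */
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Rank 1 received NEW: %d\n", data);
        }
        MPI_Finalize();
        return 0;
    }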

SLIDE 39

Collective Functions: Broadcast (one-to-many)

  • Broadcast distributes the same data to many or even all other processors
  • In the slide’s example, the nodes hold DATA: 17, 06, 19, 80; after the broadcast from the first node, the other nodes additionally hold NEW: 17 (see the sketch below)
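A corresponding hedged sketch with MPI_Bcast (the root’s value 17 follows the slide’s example; variable names are illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, data;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        data = (rank == 0) ? 17 : -1;   /* only root 0 holds DATA: 17 initially */
        /* every rank calls MPI_Bcast; root 0 sends, all others receive */
        MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("Rank %d now holds NEW: %d\n", rank, data);
        MPI_Finalize();
        return 0;
    }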

SLIDE 40

Collective Functions: Scatter (one-to-many)

  • Scatter distributes different data to many or even all other processors
  • In the slide’s example, the first node holds DATA: 10, 20, 30 (the other nodes hold DATA: 06, 19, 80); after the scatter, the receiving nodes hold NEW: 10, NEW: 20, and NEW: 30 respectively (see the sketch below)
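A hedged sketch with MPI_Scatter, meant to be run with 3 processes so the example values 10, 20, 30 map one per rank (buffer names are illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, recv;
        int data[3] = {10, 20, 30};    /* only meaningful on the root */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* root 0 sends one distinct element to each of the 3 processes;
           start with e.g. $> mpirun -np 3 ./scatter */
        MPI_Scatter(data, 1, MPI_INT, &recv, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("Rank %d received NEW: %d\n", rank, recv);
        MPI_Finalize();
        return 0;
    }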

SLIDE 41

Collective Functions: Gather (many-to-one)

  • Gather collects data from many or even all other processors at one specific processor
  • In the slide’s example, the nodes hold DATA: 17, 80, 19, 06; after the gather, one node additionally holds NEW: 80, NEW: 19, NEW: 06 (see the sketch below)
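A hedged sketch with MPI_Gather (the per-rank values mirror the slide’s example when run with 4 processes; the buffer size is an arbitrary upper bound):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, size, i;
        int gathered[64];                    /* receive buffer on the root */
        int values[4] = {17, 80, 19, 6};     /* the slide's example data */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int data = values[rank % 4];         /* one value per rank */
        /* root 0 collects one element from every process */
        MPI_Gather(&data, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (rank == 0)
            for (i = 0; i < size; i++)
                printf("NEW from rank %d: %d\n", i, gathered[i]);
        MPI_Finalize();
        return 0;
    }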

SLIDE 42

Collective Functions: Reduce (many-to-one)

  • Reduce combines collection with computation based on data from many or even all other processors
  • Usage of reduce includes finding a global minimum or maximum, sum, or product of the different data located at different processors
  • In the slide’s example (a global sum), DATA: 17 + 80 + 19 + 06 across four nodes yields NEW: 122 on one node (see the sketch below)
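A hedged sketch with MPI_Reduce using MPI_SUM; run with 4 processes, the slide’s example values sum to 122 on the root:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, global = 0;
        int values[4] = {17, 80, 19, 6};   /* the slide's example data */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int data = values[rank % 4];       /* one value per rank */
        /* combine all per-rank values with MPI_SUM; result lands on root 0 */
        MPI_Reduce(&data, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("Global sum NEW: %d\n", global);
        MPI_Finalize();
        return 0;
    }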

SLIDE 43

Is MPI yet another Network Library?

  • TCP/IP and socket programming libraries are plentifully available – do we need a dedicated communication & network protocol library?
  • Goal: simplify parallel programming; focus on applications
  • Selected reasons:
    • Designed for performance within large parallel computers (e.g. no security)
    • Supports various interconnects between ‘computing nodes’ (hardware)
    • Offers various benefits like ‘reliable messages’ or ‘in-order arrivals’
  • MPI is not designed to handle arbitrary communication in computer networks and is thus rather special:
    • Not good for clients that constantly establish/close connections again and again (this would perform very slowly in MPI)
    • Not good for internet chat clients or Web service servers on the Internet (e.g. no security beyond firewalls, no message encryption directly available, etc.)

SLIDE 44

(MPI) Basic Building Blocks: A main() Function

  • The main() function is automatically started when launching a C program (‘standard C programming…’)
  • Normally the ‘return code’ denotes whether the program exit was OK (0) or problematic (-1)
  • Practice view: resiliency (e.g. automatic restart and error handling) is not part of MPI, therefore the return code is rarely used in practice

SLIDE 45

(MPI) Basic Building Blocks: Variables & Output

  • Libraries can be used by including C header files, here the stdio library for screen output (‘standard C programming…’)
  • Two integer variables are declared that are later useful for working with specific data obtained from the MPI library
  • Output with printf using the stdio library: ‘Hello World’ and which process (out of all n processes) is printing

SLIDE 46

MPI Basic Building Blocks: Header & Init/Finalize

  • Libraries can be used by including C header files, here the MPI library (‘standard C programming including MPI library use…’)
  • The MPI_Init() function initializes the MPI environment and can take inputs via the main() function arguments
  • MPI_Finalize() shuts down the MPI environment (after this statement no parallel execution of the code can take place)

SLIDE 47

MPI Basic Building Blocks: Rank & Size Variables

  • The MPI_COMM_WORLD communicator constant denotes the ‘region of communication’, here all processes (‘standard C programming including MPI library use…’)
  • The MPI_Comm_size() function determines the overall number of n processes in the parallel program and stores it in the variable size
  • The MPI_Comm_rank() function determines the unique identifier of each process and stores it in the variable rank, with values (0 … n-1); see the combined sketch below
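Slides 44–47 describe code screenshots that did not survive extraction; a minimal sketch combining the described building blocks (the exact message wording is assumed) is:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);                 /* initialize MPI environment */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* overall number of processes n */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* unique ID 0 … n-1 */
        printf("Hello World from process %d of %d\n", rank, size);
        MPI_Finalize();                         /* shut down MPI environment */
        return 0;
    }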

SLIDE 48

Compiling & Executing an MPI Program

  • Compilers and linkers need various information about where include files and libraries can be found
    • E.g. C header files like ‘mpi.h’, or Fortran modules via ‘use MPI’
    • Compiling is different for each programming language
  • Executing the MPI program on 4 processors
    • Normally via batch system allocations (cf. SLURM on the JÖTUNN cluster)
    • Manual start-up example: $> mpirun -np 4 ./hello creates 4 processes (one per P/M pair) that produce output in parallel: hello hello hello hello
  • The order of the outputs can vary because the I/O screen is a ‘serial resource’
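For completeness, a typical compile-and-run sequence (the mpicc wrapper name is the common convention, assumed here rather than shown on the slide):

    $> mpicc -o hello hello.c    # wrapper adds MPI include & library paths
    $> mpirun -np 4 ./hello      # start 4 processes of the program
    hello
    hello
    hello
    hello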

SLIDE 49

Practice: Our 4-CPU Program alongside many other Programs

(Figure: LLView job map of a busy machine – ‘Maybe our program!’)

[10] LLView Tool

SLIDE 50

MPI Communicators

  • Each MPI activity specifies the context in which a corresponding function is performed
  • MPI_COMM_WORLD is the region/context of all processes
  • Communicators create (sub-)groups of the processes / virtual groups of processes
  • Communications can then easily be performed only within these sub-groups, with well-defined processes
  • Using communicators wisely in collective functions can reduce the number of affected processors; see the sketch below

[15] LLNL MPI Tutorial
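As a hedged illustration of sub-groups (the even/odd split criterion is made up for the example), MPI_Comm_split carves MPI_COMM_WORLD into smaller communicators:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int world_rank, sub_rank;
        MPI_Comm sub_comm;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        /* color 0 = even ranks, color 1 = odd ranks; key orders ranks inside */
        MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub_comm);
        MPI_Comm_rank(sub_comm, &sub_rank);
        printf("World rank %d has rank %d in its sub-communicator\n",
               world_rank, sub_rank);
        MPI_Comm_free(&sub_comm);
        MPI_Finalize();
        return 0;
    }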

SLIDE 51

Shared-Memory Computers Reviewed

  • Two varieties of shared-memory systems:
    1. Unified Memory Access (UMA)
    2. Cache-coherent Nonuniform Memory Access (ccNUMA)
  • A shared-memory parallel computer is a system in which a number of CPUs work on a common, shared physical address space
  • Programming model: work on a shared address space (‘local access to memory’)

[2] Introduction to High Performance Computing for Scientists and Engineers

SLIDE 52

Programming with Shared Memory using OpenMP

  • Shared-memory programming enables immediate access to all data from all processors without explicit communication
  • OpenMP is a set of compiler directives to ‘mark parallel regions’
  • Bindings are defined for the C, C++, and Fortran languages
  • Threads TX are ‘lightweight processes’ that mutually access data (T1 T2 T3 T4 T5 on one shared memory)
  • (The shared-memory concept itself is very old, cf. POSIX Threads)
  • OpenMP is the dominant shared-memory programming standard today (v3) [7]

[7] OpenMP API Specification

SLIDE 53

What is OpenMP?

  • OpenMP is a library for specifying ‘parallel regions in serial code’
    • Defined by major computer hardware/software vendors – portability!
    • Enables scalability with parallelization constructs without fixed thread numbers
    • Offers a suitable data environment for easier parallel processing of data
    • Uses specific environment variables for clever decoupling of code/problem
    • Included in standard C compiler distributions (e.g. gcc)
  • Threads are the central entity in OpenMP
    • Threads are lightweight processes that work with data in memory and share a common address space with other threads
    • Threads enable ‘work-sharing’ and can be synchronized if needed
    • Initiating (aka ‘spawning’) n threads is less costly than n processes (e.g. variable space)
  • Recall: ‘computing nodes’ are independent computing processors (that may also have N cores each) and are all part of one big parallel computer

SLIDE 54

Parallel and Serial Regions

  • fork() initiated by the master thread (which always exists) creates a team of threads
  • The team of threads works concurrently on shared-memory data in parallel regions
  • join() initiates the ‘shutdown’ of the parallel region and terminates the team of threads
  • The team of threads may also be put to sleep until the next parallel region begins
  • The number of threads can be different in each parallel region

(Figure: fork/join structure of an OpenMP program; modified from [2] Introduction to High Performance Computing for Scientists and Engineers)

SLIDE 55

Number of Threads & Scalability

  • The real number of threads is normally not known at compile time
    • (There are methods for setting it in the program – do not use them!)
    • The number is set in scripts and/or an environment variable before executing
  • Parallel programming is done without knowing the number of threads
  • OpenMP programs should always be written in a way that does not assume a specific number of threads – a scalable program
  • The master thread becomes T0 (threads mT, T0, T1, T2, T3)
  • Compile & execute example:

    int main() {
      #pragma omp parallel
      printf("Hello World");
    }

    $> ./helloworld.exe
    Hello World
    Hello World
    Hello World
    Hello World

SLIDE 56

OpenMP Basic Building Blocks: Library & Sentinel

  • The OpenMP library contains the OpenMP API definitions (‘standard C/C++ programming…’ and ‘standard Fortran programming…’)
  • The sentinel is a special string that starts an OpenMP compiler directive
  • Practice view: programming OpenMP in C/C++ and Fortran is slightly different, but provides the same basic concepts (e.g. no explicit end of a parallel region in C/C++, local variables, etc.); see the sketch below
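The original slide shows code screenshots that are not reproduced here; a minimal C sketch of the two building blocks (assuming the standard header and directive syntax) might look like this:

    #include <omp.h>               /* OpenMP library: API definitions */

    int main() {
        #pragma omp parallel       /* '#pragma omp' is the C/C++ sentinel */
        {
            /* code in this block runs in every thread of the team */
        }
        return 0;
    }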

SLIDE 57

OpenMP Basic Building Blocks: Unique Thread IDs

  • A do_work_package() routine’s code is now executed in parallel by each thread
  • BUT sub-routines of that routine are now also executed in parallel
  • The omp_get_thread_num() function provides the unique thread ID (0…n-1)
  • The omp_get_num_threads() function obtains the number of active threads in the current parallel region; see the sketch below
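Again the screenshot is missing; a hedged sketch of the idea, reusing the slide’s illustrative routine name do_work_package():

    #include <omp.h>
    #include <stdio.h>

    /* executed concurrently by every thread, each with its own ID */
    void do_work_package(int tid, int nthreads) {
        printf("thread %d of %d working\n", tid, nthreads);
    }

    int main() {
        #pragma omp parallel
        do_work_package(omp_get_thread_num(), omp_get_num_threads());
        return 0;
    }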

SLIDE 58

OpenMP Basic Building Blocks: Private Variables (Fortran)

  • PRIVATE defines local variables for each thread
  • Each thread works independently and thus needs space to ‘store’ local results
  • Practice view: the real parallelization idea here is in the loop – the simple sum of two arrays
  • For each value of i we can compute and store the array values independently of each other
  • The same code is executed n times with n threads, BUT tid is unique and thus different for each thread; see the sketch below
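The Fortran listing did not survive extraction; the same idea expressed in C (array names and the manual chunking are illustrative, not the slide’s code):

    #include <omp.h>
    #include <stdio.h>
    #define N 1000

    int main() {
        double a[N], b[N], c[N];
        int i, tid;
        for (i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }
        /* private(tid, i): every thread gets its own copy of these */
        #pragma omp parallel private(tid, i)
        {
            tid = omp_get_thread_num();
            /* each thread sums a disjoint chunk of the arrays */
            for (i = tid; i < N; i += omp_get_num_threads())
                c[i] = a[i] + b[i];
        }
        printf("c[1] = %f\n", c[1]);
        return 0;
    }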

SLIDE 59

Traditional HelloWorld Example (C/C++)

  • Simple parallel program
  • Only the master (tid == 0) provides output of how many threads exist in the parallel region
  • Shared variable nthreads, local variable tid

    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int nthreads, tid;
        #pragma omp parallel private(tid)
        {
            tid = omp_get_thread_num();
            printf("Hello World from thread = %d\n", tid);
            if (tid == 0) {
                nthreads = omp_get_num_threads();
                printf("Number of threads in parallel region = %d\n", nthreads);
            }
        }
        return 0;
    }

SLIDE 60

OpenMP Basic Building Blocks: Loops (DO in Fortran, for in C/C++)

  • FIRSTPRIVATE() copies the initial value of a shared variable to the local variable (a simple initialization here, otherwise problems arise in the loop)
  • The DO directive (in front of the usual do loop) distributes the loop iterations among threads (automatically) as a specifically supported ‘work-sharing’ construct
  • Smart programming support by OpenMP: loops are very often part of scientific applications!
  • Less burden for the programmer: no manual definition of local variables (e.g. the loop variable i is automatically localized)
  • A local sum exists – but where is the global sum? (see the sketch below; the next slides resolve this)
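A hedged C sketch of the work-sharing loop (the slide’s original is Fortran; sum and n are illustrative names):

    #include <omp.h>
    #include <stdio.h>

    int main() {
        int i, n = 100;
        double sum = 0.0;
        /* firstprivate(sum): every thread starts from sum's initial value;
           'omp for' splits the iterations among the threads automatically */
        #pragma omp parallel firstprivate(sum)
        {
            #pragma omp for
            for (i = 0; i < n; i++)       /* i is automatically localized */
                sum += 1.0;               /* each thread builds a local sum */
            /* here each thread holds only its local sum;
               the global sum is still missing (see critical/reduction) */
            printf("local sum = %f\n", sum);
        }
        return 0;
    }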

SLIDE 61

OpenMP Basic Building Blocks: Critical Regions

  • A local sum exists in each of the different threads
  • We now have the value of the local variable sum n times
  • Race condition in shared memory: the shared variable pi would be set concurrently by the different threads
  • The value of pi would then depend on the exact order in which the threads access pi, and wrong values would be assigned
  • Critical regions define a region within a parallel region where at most one thread at a time executes the code (e.g. the sum of the new pi based on pi); see the sketch below
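A minimal sketch of the critical-region fix; the pi accumulation follows the slide’s description, while the integration formula is a standard textbook example, not necessarily the slide’s code:

    #include <omp.h>
    #include <stdio.h>

    int main() {
        const int n = 1000000;
        const double h = 1.0 / n;
        double pi = 0.0;           /* shared result variable */
        #pragma omp parallel
        {
            double sum = 0.0;      /* local sum per thread */
            int i;
            #pragma omp for
            for (i = 0; i < n; i++) {
                double x = h * (i + 0.5);
                sum += 4.0 / (1.0 + x * x);
            }
            /* at most one thread at a time updates the shared pi */
            #pragma omp critical
            pi += h * sum;
        }
        printf("pi ~ %f\n", pi);
        return 0;
    }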

SLIDE 62

OpenMP Basic Building Blocks: Reduction

  • Several operations are common in scientific applications: +, *, -, &, |, ^, &&, ||, max, min
  • REDUCTION() with operator + on variable s enables the following:
    • Start with a local copy of s for each thread
    • During the progress of the parallel region, each local copy of s is accumulated separately by each thread
    • At the end of the parallel region, the copies are automatically synchronized and accumulated into the resulting master thread variable
  • Reduction operations are a smart alternative to manually defining critical regions around operations on variables; see the sketch below
  • The reduction operation automatically localizes the variable
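The same pi loop with a REDUCTION clause instead of a critical region (again a hedged sketch, not the slide’s exact listing):

    #include <omp.h>
    #include <stdio.h>

    int main() {
        const int n = 1000000;
        const double h = 1.0 / n;
        double s = 0.0;
        int i;
        /* reduction(+:s): each thread accumulates a private copy of s;
           the copies are summed automatically at the end of the region */
        #pragma omp parallel for reduction(+:s)
        for (i = 0; i < n; i++) {
            double x = h * (i + 0.5);
            s += 4.0 / (1.0 + x * x);
        }
        printf("pi ~ %f\n", h * s);
        return 0;
    }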

SLIDE 63

Many-core GPUs

  • A Graphics Processing Unit (GPU) is great for data parallelism and task parallelism
  • Compared to multi-core CPUs, GPUs have a many-core architecture with hundreds or even thousands of very simple cores executing threads rather slowly
  • Use of very many simple cores: a high-throughput-oriented architecture
    • Uses massive parallelism by executing a lot of concurrent threads slowly
    • Handles an ever-increasing number of instruction threads
    • CPUs instead typically execute a single long thread as fast as possible
  • Many-core GPUs are used in large clusters and within massively parallel supercomputers today
  • This is named General-Purpose Computing on GPUs (GPGPU)

[11] Distributed & Cloud Computing Book

SLIDE 64

GPU Acceleration

  • GPU acceleration means that GPUs accelerate computing through massive parallelism with thousands of threads, compared to only a few threads used by conventional CPUs
  • GPUs are designed to compute large numbers of floating-point operations in parallel
  • GPU accelerator architecture example (e.g. an NVIDIA card):
    • GPUs can have 128 cores on one single GPU chip
    • Each core can work with eight threads of instructions
    • The GPU is thus able to concurrently execute 128 * 8 = 1024 threads
  • The interaction between CPU and GPU via memory transfers is the major (bandwidth) bottleneck
  • E.g. applications that use matrix-vector multiplication

[11] Distributed & Cloud Computing Book

SLIDE 65

NVIDIA Fermi GPU Example

[11] Distributed & Cloud Computing Book

SLIDE 66

Challenges: Domain Decomposition & Load Imbalance

  • Load imbalance hampers performance, because some resources are underutilized

(Figure: domain decompositions showing unused resources and a boundary halo; [16] Map Analysis – Understanding Spatial Patterns and Relationships, modified from [2] Introduction to High Performance Computing for Scientists and Engineers)

SLIDE 67

Challenges: Ghost/Halo Regions & Stencil Methods

  • Stencil-based iterative methods update array elements according to a fixed pattern called a ‘stencil’
  • The key to stencil methods is their regular structure, mostly implemented using arrays in codes

(Figure: two decompositions with ghost/halo regions, annotated 3 * 16 = 48 and 4 * 8 = 32; [2] Introduction to High Performance Computing for Scientists and Engineers)

SLIDE 68

Lecture Bibliography

SLIDE 69

Lecture Bibliography

  • [1] Wikipedia ‘Supercomputer’, Online: http://en.wikipedia.org/wiki/Supercomputer
  • [2] Georg Hager & Gerhard Wellein, ‘Introduction to High Performance Computing for Scientists and Engineers’, Chapman & Hall/CRC Computational Science, ISBN 143981192X, ~330 pages, 2010, Online: http://www.amazon.de/Introduction-Performance-Computing-Scientists-Computational/dp/143981192X
  • [3] TOP500 Supercomputing Sites, Online: http://www.top500.org/
  • [4] LINPACK Benchmark, Online: http://www.netlib.org/benchmark/hpl/
  • [5] HPC Challenge Benchmark Suite, Online: http://icl.cs.utk.edu/hpcc/
  • [6] JUBE Benchmark Suite, Online: http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/JUBE/_node.html
  • [7] The OpenMP API Specification for Parallel Programming, Online: http://openmp.org/wp/openmp-specifications/
  • [8] The MPI Standard, Online: http://www.mpi-forum.org/docs/
  • [9] Jötunn HPC Cluster, Online: http://ihpc.is/jotunn/
  • [10] LLView Tool, Online: http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/LLview/_node.html
  • [11] K. Hwang, G. C. Fox, J. J. Dongarra, ‘Distributed and Cloud Computing’, Book, Online: http://store.elsevier.com/product.jsp?locale=en_EU&isbn=9780128002049
  • [12] Fran Berman, ‘Maximising the Potential of Research Data’
  • [13] How EMI Contributed to the Higgs Boson Discovery, YouTube Video, Online: http://www.youtube.com/watch?v=FgcoLUys3RY&list=UUz8n-tukF1S7fql19KOAAhw
  • [14] DEEP-EST EU Project, Online: http://www.deep-projects.eu/
  • [15] LLNL MPI Tutorial, Online: https://computing.llnl.gov/tutorials/mpi/
  • [16] Joseph K. Berry, ‘Map Analysis – Understanding Spatial Patterns and Relationships’, Online: http://www.innovativegis.com/basis/Books/MapAnalysis/Default.htm
