
SLIDE 1

Multicore Challenge Conference 2012 UWE, Bristol

Multi/Many Core Programming Strategies

Greg Michaelson School of Mathematical & Computer Sciences Heriot-Watt University


SLIDE 2

Overview

  • good old fashioned parallel computing based on lots of identical single CPUs is ver…


[Diagram: shared memory (PEs share one RAM via a network) vs distributed memory (each PE has its own RAM, connected by a network)]

SLIDE 3

Overview

  • Moore’s Law implications have changed

– speed of CPUs now stable at ~3.5 GHz
– performance increases from multi- & many-core CPUs


Intel 4004 – 1971

http://en.wikipedia.org/wiki/Intel_4004

Intel Core i7 – 2008

http://en.wikipedia.org/wiki/Intel_Core_i7

SLIDE 4

Overview

  • multi-processor architectures increasingly hierarchical & heterogeneous
  • message passing grids of clusters of:

– now: shared memory multi-core


Hector – Edinburgh Parallel Computing Centre

  • 464 compute blades with…
  • 4 compute nodes with…
  • 2 × 12-core processors
  • 44,544 cores

http://www.hector.ac.uk/abouthector/hectorbasics/

SLIDE 5

Overview

  • multi-processor architectures increasingly hierarchical & heterogeneous
  • message passing grids of clusters of:

– soon: message passing many-core arrays


SCC – Intel Research

http://techresearch.intel.com/ProjectDetails.aspx?Id=1

SLIDE 6

Overview

  • cores also have SIMD processors (MMX/SSE)
  • non-uniform memory

– differing degrees/levels of private & shared cache

  • old programming strategies break down

– one size no longer fits all

  • need for hybrid strategies


SLIDE 7

Overview

  • developing multi-processor software is still a black art

  • would like:

– low effort
– flexibility
– scalability
– future proof
– re-use


SLIDE 8

Overview

  • different approaches:

– require different effort
– offer different degrees of control over:

  • task division
  • communications
  • process placement


SLIDE 9

Methodological choices


[Decision tree: START]

SLIDE 10

Methodological choices


[Decision tree: START → automatic parallelisation]

SLIDE 11

Automatic Parallelisation

  • vector/array parallelisation
  • implicit

– e.g. SIMD in C with gcc

  • language directives

– Fortrans: Fortran 90; F; High Performance Fortran


SLIDE 12

Automatic Parallelisation


  • low effort

– no communications
– no/minimal task division

  • poor flexibility/scalability

– good for regular problems
– good on uniform architectures

SLIDE 13

Methodological choices


[Decision tree: START → automatic parallelisation | do it yourself]

SLIDE 14

Methodological choices


[Decision tree: START → automatic parallelisation | do it yourself → skeleton]

SLIDE 15

Algorithmic skeletons

  • capture common

patterns of data & control parallelism

– e.g. pipeline; farm; divide & conquer

  • skeleton libraries

for C/Java


[Diagrams: pipeline (stage 1 → stage 2 → … → stage N); process farm (a farmer distributing tasks to workers)]

SLIDE 16

Algorithmic skeletons



[Diagram: divide & conquer tree of a parent process over layers of parent/child processes]

SLIDE 17

Algorithmic skeletons

  • industrial frameworks
  • e.g. Google Map-Reduce

  • Apache Hadoop


Google Map-Reduce

http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0008.html

SLIDE 18

Algorithmic skeletons

  • industrial frameworks
  • e.g. Microsoft Dryad


Microsoft Dryad

www.wikibench.eu/CloudCP2011/wp-content/.../Isaacs-keynote.ppsx

SLIDE 19

Algorithmic skeletons

  • can choose appropriate skeleton for problem class
  • medium effort to use skeleton library/industrial framework

– must fit problem to skeleton

  • high effort to develop own skeletons

– must make communication & task division explicit


SLIDE 20

Algorithmic skeletons

  • can hand tune for:

– problem
– irregularity
– scalability
– process placement

  • strong potential re-use of components


SLIDE 21

Methodological choices


[Decision tree: START → automatic parallelisation | do it yourself → skeleton | programmed parallelisation]

SLIDE 22

Methodological choices


[Decision tree: START → automatic parallelisation | do it yourself → skeleton | programmed parallelisation → operating system]

SLIDE 23

Operating system

  • independent programs

– realised as threads

  • communication via pipes/sockets
  • bolted together with shell scripts


SLIDE 24

Operating system

  • low effort
  • highly dependent on underlying operating system for:

– communication
– scheduling
– process placement

  • unpredictable performance


SLIDE 25

Methodological choices


[Decision tree: START → automatic parallelisation | do it yourself → skeleton | programmed parallelisation → operating system | explicit processes]

SLIDE 26

Methodological choices


[Decision tree: START → automatic parallelisation | do it yourself → skeleton | programmed parallelisation → operating system | explicit processes → library]

SLIDE 27

Library

  • shared memory

– OpenMP: platform & architecture independent
– Posix Threads: Unix/Linux specific, architecture independent
– Intel Threading Building Blocks: platform/architecture independent


SLIDE 28

Library

  • distributed memory

– MPI & PVM

  • specialised hardware

– SIMD on MMX/SSE
– CUDA & OpenCL for GPU arrays


SLIDE 29

Library

  • now common to use:

– MPI for inter-cluster
– OpenMP for intra-cluster

  • medium to high effort

– explicit communication & task division

  • can shape algorithm to architecture
  • best for irregular problem/architecture


SLIDE 30

Library

  • often end up re-inventing some standard algorithmic skeleton

  • good potential for reuse of:

– structure
– components


SLIDE 31

Methodological choices


[Decision tree: START → automatic parallelisation | do it yourself → skeleton | programmed parallelisation → operating system | explicit processes → library | hand crafted]

SLIDE 32

Hand crafted

  • very low level
  • shared memory

– critical regions via semaphores

  • distributed memory

– communication over RS232; USB


SLIDE 33

Hand crafted

  • very high effort
  • highly problem/architecture specific
  • best for embedded systems


SLIDE 34

Questions...

  • is my problem suitable for parallelisation?
  • how do I know how my problem scales?
  • if I parallelise my problem, how do I tell how much communication overhead will be incurred?
  • how do I assess the benefits of shared versus distributed memory?

28th June, 2011 KTN ICT Scalable Applications & Services 34

SLIDE 35

Questions...

  • can I do better with smarter solutions on my existing technology?
  • where can I get help with deciding how to proceed?
  • have other people already come up with solutions that might work for me?


SLIDE 36

Future

  • UK has major research strengths in multi-processor architectures, parallel languages/compilers, skeletons etc
  • groups don’t talk much to each other or to practitioners e.g. in eScience
  • need to build inclusive UK community
  • opportunities through

– EPSRC multi-core priority for ICT
– TSB ICT KTN for multi-core