

Slide 1: Theory of Multicore Algorithms

Jeremy Kepner and Nadya Bliss, MIT Lincoln Laboratory, HPEC 2008

This work is sponsored by the Department of Defense under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Slide 2: Outline

  • Parallel Design
    – Programming Challenge
    – Example Issues
    – Theoretical Approach
    – Integrated Process
  • Distributed Arrays
  • Kuck Diagrams
  • Hierarchical Arrays
  • Tasks and Conduits
  • Summary
Slide 3: Multicore Programming Challenge

Past programming model: Von Neumann
  • Great success of the Moore's Law era
    – Simple model: load, op, store
    – Many transistors devoted to delivering this model
  • Moore's Law is ending
    – Need transistors for performance

Future programming model: ???
  • Can we describe how an algorithm is supposed to behave on a hierarchical heterogeneous multicore processor?
  • Processor topology includes: registers, cache, local memory, remote memory, disk
  • Cell has multiple programming models

Slide 4: Example Issues

Y = X + 1
X,Y : NxN

  • A serial algorithm can run on a serial processor with relatively little specification
  • A hierarchical heterogeneous multicore algorithm requires a lot more information:
    – Where is the data? How is it distributed? Where is it running?
    – Which binary to run? What is the initialization policy?
    – How does the data flow? What are the allowed message sizes?
    – Should computations and communications overlap?

Slide 5: Theoretical Approach

[Figure: Task1^S1() produces A : R^(N×P(N)) on processors 0–3; Task2^S2() consumes B : R^(N×P(N)) on processors 4–7; a conduit connects them through Topic12 (A ⇒n Topic12, Topic12 ⇒m B), with Task2 shown as Replica 0 and Replica 1.]

  • Provide notation and diagrams that allow hierarchical heterogeneous multicore algorithms to be specified

Slide 6: Integrated Development Process

  1. Develop serial code (desktop)
     Y = X + 1    X,Y : NxN
  2. Parallelize code (cluster)
     Y = X + 1    X,Y : P(N)xN
  3. Deploy code (embedded computer)
     Y = X + 1    X,Y : P(P(N))xN
  4. Automatically parallelize code

  • Should naturally support standard parallel embedded software development practices

Slide 7: Outline

  • Parallel Design
  • Distributed Arrays
    – Serial Program
    – Parallel Execution
    – Distributed Arrays
    – Redistribution
  • Kuck Diagrams
  • Hierarchical Arrays
  • Tasks and Conduits
  • Summary
Slide 8: Serial Program

  • Math is the language of algorithms
  • Allows mathematical expressions to be written concisely
  • Multi-dimensional arrays are fundamental to mathematics

Y = X + 1
X,Y : NxN
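
As a concrete illustration (not part of the original slides), a minimal serial version of this example in Python/NumPy, where the whole NxN array lives in a single memory, might look like:

    import numpy as np

    N = 8                                               # illustrative array dimension
    X = np.arange(N * N, dtype=float).reshape(N, N)     # X : NxN
    Y = X + 1                                           # Y = X + 1, the serial reference computation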

Slide 9: Parallel Execution

  • Run NP copies of the same program
    – Single Program Multiple Data (SPMD)
  • Each copy has a unique PID
  • Every array is replicated on each copy of the program

Y = X + 1
X,Y : NxN

PID=0  PID=1  ...  PID=NP-1
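
A minimal sketch of the SPMD model (my own illustration; the NP program copies are emulated in one process rather than launched with a real runtime):

    import numpy as np

    N, NP = 8, 4                   # problem size and number of program copies (illustrative)

    def spmd_copy(pid):
        """One copy of the program; every copy owns a full replica of X and Y."""
        X = np.ones((N, N))        # replicated: identical on every PID
        Y = X + 1                  # every copy redundantly computes the whole result
        return Y

    # Emulate running NP copies of the same program (PID = 0 .. NP-1).
    results = [spmd_copy(pid) for pid in range(NP)]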

Slide 10: Distributed Array Program

  • Use P() notation to declare a distributed array
  • Tells the program which dimension to distribute the data over
  • Each program implicitly operates on only its own data (owner-computes rule)

Y = X + 1
X,Y : P(N)xN

PID=0  PID=1  ...  PID=NP-1
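
The following sketch (again an emulation, not the slides' library) shows the owner-computes rule for X,Y : P(N)xN with a block map over the first dimension:

    import numpy as np

    N, NP = 8, 4                          # illustrative sizes

    def block_rows(pid, n, np_):
        """Row indices owned by processor `pid` under a P(N)xN block map."""
        return range((n * pid) // np_, (n * (pid + 1)) // np_)

    def spmd_copy(pid, X_global):
        """Owner-computes: each PID holds and updates only its block of rows."""
        X_loc = X_global[list(block_rows(pid, N, NP)), :]   # local piece of X
        Y_loc = X_loc + 1                                   # Y = X + 1 on owned rows only
        return Y_loc

    X_global = np.arange(N * N, dtype=float).reshape(N, N)
    Y_pieces = [spmd_copy(pid, X_global) for pid in range(NP)]   # emulate NP copies
    Y = np.vstack(Y_pieces)                                      # logical global view of Y : P(N)xN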

Slide 11: Explicitly Local Program

  • Use .loc notation to explicitly retrieve the local part of a distributed array
  • Operation is the same as in the serial program, but with different data on each processor (recommended approach)

Y.loc = X.loc + 1
X,Y : P(N)xN
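
The P()/.loc notation maps naturally onto a small wrapper class. The sketch below is a hypothetical toy (not the pMatlab/PVTOL API) showing how a .loc handle exposes only the owned rows:

    import numpy as np

    class DistArray:
        """Toy distributed array: rows block-distributed over NP emulated PIDs (P(N)xN)."""
        def __init__(self, n, np_, pid):
            lo, hi = (n * pid) // np_, (n * (pid + 1)) // np_
            self._local = np.zeros((hi - lo, n))   # only the owned rows are stored

        @property
        def loc(self):
            return self._local                     # explicit handle to the local piece

    NP = 4
    # Emulate the SPMD copies: each writes only its .loc, exactly like Y.loc = X.loc + 1
    for pid in range(NP):
        X, Y = DistArray(8, NP, pid), DistArray(8, NP, pid)
        Y.loc[:] = X.loc + 1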

Slide 12: Parallel Data Maps

  • A map is a mapping of array indices to processors
  • Can be block, cyclic, block-cyclic, or block with overlap
  • Use P() notation to set which dimension to split among processors

[Figure: math notation vs. computer layout for PIDs 0–3 — P(N)xN splits the rows, NxP(N) splits the columns, P(N)xP(N) splits both dimensions.]
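
A minimal sketch (my own illustration, not from the slides) of how block, cyclic, and block-cyclic maps assign a row index to a PID:

    def block_owner(i, n, np_):
        """Block map: contiguous chunks of rows go to consecutive PIDs."""
        return (i * np_) // n

    def cyclic_owner(i, np_):
        """Cyclic map: rows are dealt out round-robin to PIDs."""
        return i % np_

    def block_cyclic_owner(i, np_, b):
        """Block-cyclic map: blocks of b rows are dealt out round-robin."""
        return (i // b) % np_

    N, NP = 8, 4
    print([block_owner(i, N, NP) for i in range(N)])         # [0, 0, 1, 1, 2, 2, 3, 3]
    print([cyclic_owner(i, NP) for i in range(N)])           # [0, 1, 2, 3, 0, 1, 2, 3]
    print([block_cyclic_owner(i, NP, 2) for i in range(N)])  # [0, 0, 1, 1, 2, 2, 3, 3]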

Slide 13: Redistribution of Data

  • Different distributed arrays can have different maps
  • Assignment between arrays with the "=" operator causes data to be redistributed
  • The underlying library determines all the messages to send

Y = X + 1
X : P(N)xN
Y : NxP(N)

[Figure: X is split by rows over P0–P3 and Y is split by columns over P0–P3; the assignment sends each overlapping block between the two maps.]
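
To make the message schedule concrete, here is a hedged single-process sketch (names invented for illustration) of redistributing from a row-block map to a column-block map while applying Y = X + 1:

    import numpy as np

    N, NP = 8, 4

    def rows_of(pid):   # block of rows owned under X : P(N)xN
        return slice((N * pid) // NP, (N * (pid + 1)) // NP)

    def cols_of(pid):   # block of columns owned under Y : NxP(N)
        return slice((N * pid) // NP, (N * (pid + 1)) // NP)

    # Each (sender, receiver) pair exchanges the intersection of the sender's rows
    # with the receiver's columns; this is the message schedule the library derives.
    X_local = {p: np.random.rand(N // NP, N) for p in range(NP)}
    Y_local = {p: np.zeros((N, N // NP)) for p in range(NP)}
    for src in range(NP):
        for dst in range(NP):
            block = X_local[src][:, cols_of(dst)] + 1    # compute Y = X + 1 on the owned piece
            Y_local[dst][rows_of(src), :] = block        # "send" it to where Y's map puts it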

Slide 14: Outline

  • Parallel Design
  • Distributed Arrays
  • Kuck Diagrams
    – Serial
    – Parallel
    – Hierarchical
    – Cell
  • Hierarchical Arrays
  • Tasks and Conduits
  • Summary
Slide 15: Single Processor Kuck Diagram

A : R^(N×N)

[Diagram: one processor P0 connected to one memory M0.]

  • Processors are denoted by boxes
  • Memory is denoted by ovals
  • Lines connect associated processors and memories
  • Subscript denotes level in the memory hierarchy
Slide 16: Parallel Kuck Diagram

A : R^(N×P(N))

[Diagram: four P0/M0 pairs connected by Net0.5.]

  • Replicates serial processors
  • Net denotes the network connecting memories at a level in the hierarchy (incremented by 0.5)
  • The distributed array has a local piece in each memory
Slide 17: Hierarchical Kuck Diagram

The Kuck notation* provides a clear way of describing a hardware architecture along with the memory and communication hierarchy.

[Diagram: a 2-level hierarchy — two groups of (M0, P0) pairs, each group joined by Net0.5 and sharing SM1 through SMNet1; the two groups are joined by Net1.5 and share SM2 through SMNet2.]

Legend:
  • P - processor
  • Net - inter-processor network
  • M - memory
  • SM - shared memory
  • SMNet - shared memory network

Subscript indicates hierarchy level; an x.5 subscript on Net indicates indirect memory access.

*High Performance Computing: Challenges for Future Systems, David Kuck, 1996

Slide 18: Cell Example

Kuck diagram for the Sony/Toshiba/IBM Cell processor

[Diagram: the PPE and SPEs 1–7, each with its own M0, connected by Net0.5 and reaching main memory M1 through MNet1.]

  • P_PPE = PPE speed (GFLOPS)
  • M0,PPE = size of PPE cache (bytes)
  • P_PPE–M0,PPE = PPE-to-cache bandwidth (GB/sec)
  • P_SPE = SPE speed (GFLOPS)
  • M0,SPE = size of SPE local store (bytes)
  • P_SPE–M0,SPE = SPE-to-local-store bandwidth (GB/sec)
  • Net0.5 = SPE-to-SPE bandwidth (matrix encoding topology, GB/sec)
  • MNet1 = PPE/SPE-to-main-memory bandwidth (GB/sec)
  • M1 = size of main memory (bytes)

Slide 19: Outline

  • Parallel Design
  • Distributed Arrays
  • Kuck Diagrams
  • Hierarchical Arrays
    – Hierarchical Arrays
    – Hierarchical Maps
    – Kuck Diagram
    – Explicitly Local Program
  • Tasks and Conduits
  • Summary
Slide 20: Hierarchical Arrays

  • Hierarchical arrays allow algorithms to conform to hierarchical multicore processors
  • Each processor in the outer set of processors controls another (local) set of processors
  • The array piece local to each outer processor is sub-divided among its local processors

[Figure: a global array split into local arrays across the outer PIDs; each local array is split again across the local PIDs.]

Slide 21: Hierarchical Array and Kuck Diagram

A : R^(N×P(P(N)))

[Diagram: two-level Kuck diagram; A.loc resides in each SM1, and A.loc.loc in each M0.]

  • The array is allocated across the SM1 memories of the outer processors
  • Within each SM1, responsibility for processing is divided among the local processors
  • The local processors move their portion into their local M0

Slide 22: Explicitly Local Hierarchical Program

  • Extend the .loc notation to explicitly retrieve the local part of a local distributed array: .loc.loc (assumes SPMD on the local processor set)
  • Subscripts on each .loc (the outer and inner processor indices) provide explicit access to a specific piece (implicit otherwise)

Y.loc_p.loc_p = X.loc_p.loc_p + 1
X,Y : P(P(N))xN
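
A toy illustration of the two-level .loc.loc idea (again emulating the processor hierarchy in one process; all names are hypothetical):

    import numpy as np

    N = 8
    NP_OUTER, NP_INNER = 2, 2          # illustrative hierarchy: 2 nodes x 2 cores

    def rows(i, n, parts):
        """Block of rows owned by piece i out of `parts`."""
        return slice((n * i) // parts, (n * (i + 1)) // parts)

    X = np.arange(N * N, dtype=float).reshape(N, N)
    Y = np.zeros_like(X)

    # Emulate X,Y : P(P(N))xN — the outer split gives .loc, the inner split gives .loc.loc
    for p_outer in range(NP_OUTER):
        X_loc, Y_loc = X[rows(p_outer, N, NP_OUTER)], Y[rows(p_outer, N, NP_OUTER)]
        n_loc = X_loc.shape[0]
        for p_inner in range(NP_INNER):
            sl = rows(p_inner, n_loc, NP_INNER)
            Y_loc[sl] = X_loc[sl] + 1   # Y.loc.loc = X.loc.loc + 1 on each innermost piece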

Slide 23: Block Hierarchical Arrays

  • Memory constraints are common at the lowest level of the hierarchy
  • Blocking at this level allows control of the size of data operated on by each processor

[Figure: the global array is split into local arrays; each local array is further divided into core blocks (blk, b=4), which are paged between out-of-core storage and in-core memory.]

Slide 24: Block Hierarchical Program

  • Pb(4) indicates each sub-array should be broken up into blocks of size 4
  • .nblk provides the number of blocks, for looping over each block; this allows controlling the size of data at the lowest level

for i = 0, X.loc.loc.nblk-1
    Y.loc.loc.blk_i = X.loc.loc.blk_i + 1

X,Y : P(P_b(4)(N))xN
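
The blocked inner loop can be sketched as follows (an illustration of the idea on one innermost piece, not the slides' library; B stands in for the block size b=4):

    import numpy as np

    N, B = 16, 4                           # rows per innermost piece and block size (illustrative)
    X_loc_loc = np.random.rand(N, N)       # stand-in for one innermost piece of X
    Y_loc_loc = np.zeros_like(X_loc_loc)

    nblk = X_loc_loc.shape[0] // B         # .nblk: number of b=4 row blocks in this piece
    for i in range(nblk):
        blk = slice(i * B, (i + 1) * B)    # .blk_i: the i-th block of rows
        Y_loc_loc[blk] = X_loc_loc[blk] + 1   # only b rows are touched at a time (in-core working set)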

Slide 25: Outline

  • Parallel Design
  • Distributed Arrays
  • Kuck Diagrams
  • Hierarchical Arrays
  • Tasks and Conduits
    – Basic Pipeline
    – Replicated Tasks
    – Replicated Pipelines
  • Summary
Slide 26: Tasks and Conduits

[Figure: Task1^S1() holds A : R^(N×P(N)) on processors 0–3; Task2^S2() holds B : R^(N×P(N)) on processors 4–7; a conduit connects them through Topic12 (A ⇒n Topic12, Topic12 ⇒m B).]

  • The S1 superscript runs a task on a set of processors; distributed arrays are allocated relative to this scope
  • Pub/sub conduits move data between tasks
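
A minimal single-process sketch of the pub/sub conduit idea, using Python threads and a queue; the topic and task names follow the slide, but the API is invented for illustration:

    import queue, threading
    import numpy as np

    topic12 = queue.Queue()            # the conduit: Task1 publishes, Task2 subscribes
    N, N_MSGS = 8, 4

    def task1():                       # Task1^S1(): produces A and publishes it to Topic12
        for _ in range(N_MSGS):
            topic12.put(np.random.rand(N, N))
        topic12.put(None)              # sentinel: no more data

    def task2():                       # Task2^S2(): subscribes to Topic12 and consumes B
        while (B := topic12.get()) is not None:
            _ = B + 1                  # stand-in for Task2's processing of B

    t1, t2 = threading.Thread(target=task1), threading.Thread(target=task2)
    t1.start(); t2.start(); t1.join(); t2.join()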

Slide 27: Replicated Tasks

[Figure: same pipeline as the previous slide, but Task2^S2() now has Replica 0 and Replica 1, each on its own set of processors.]

  • A subscript of 2 on a task creates two task replicas; the conduit will round-robin data between them
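
Extending the previous sketch (again a hypothetical illustration, not the slides' library), a conduit feeding two replicas of Task2 in round-robin fashion could look like:

    import itertools, queue, threading
    import numpy as np

    replica_queues = [queue.Queue(), queue.Queue()]   # one inbox per Task2 replica
    N, N_MSGS = 8, 6

    def task1():                                      # publisher round-robins over the replicas
        for q in itertools.islice(itertools.cycle(replica_queues), N_MSGS):
            q.put(np.random.rand(N, N))
        for q in replica_queues:
            q.put(None)                               # tell each replica to stop

    def task2(replica_id):                            # each replica consumes its share of the stream
        q = replica_queues[replica_id]
        while (B := q.get()) is not None:
            _ = B + 1                                 # stand-in processing

    threads = [threading.Thread(target=task1)] + [
        threading.Thread(target=task2, args=(r,)) for r in range(2)]
    for t in threads: t.start()
    for t in threads: t.join()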
Slide 28: Replicated Pipelines

[Figure: the same pipeline, with both Task1^S1() and Task2^S2() replicated (Replica 0 and Replica 1), forming two parallel copies of the whole pipeline.]

  • The same replication subscript (2) on both tasks creates a replicated pipeline
Slide 29: Summary

  • Hierarchical heterogeneous multicore processors are difficult to program
  • Specifying how an algorithm is supposed to behave on such a processor is critical
  • The proposed notation provides mathematical constructs for concisely describing hierarchical heterogeneous multicore algorithms