Fresh Breeze: A Radical Approach to Massively Parallel Architecture



slide-1
SLIDE 1

Fresh Breeze A Radical Approach to Massively Parallel Architecture and Programming

Jack Dennis
MIT-CSAIL Computer Science and Artificial Intelligence Laboratory

slide-2
SLIDE 2

The Multi Core Challenge

  • Many processing cores provide high potential performance.
  • Goal: Achieve high core utilization.
  • Goal: Achieve the highest energy efficiency.
  • Goal: Support modular construction of software for parallel computation.
  • Goal: Unify memory with the file system.
slide-3
SLIDE 3

Typical Processor Chip

Figure labels: Core and L1 Cache (two of each shown); L2 Cache; Network; Off-Chip Memory System (DRAM and Disk)

slide-4
SLIDE 4

The Popular Approach

MPI: Message Passing Interface

Issues:

  • Overhead
  • No satisfactory notion of Program Module
  • Difficult sharing of data objects

slide-5
SLIDE 5

Message Passing System

Figure labels: Cores (up to Core N - 1); Interconnection Network

Basic Commands:

  • Send m to p
  • Receive m from p
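
The two basic commands can be illustrated with a toy model. This is only a sketch: the `MessageNetwork` class, its per-pair mailboxes, and the non-blocking receive are assumptions made for illustration and are not part of MPI or of the slides.

```python
from collections import deque


class MessageNetwork:
    """Toy interconnection network: one FIFO mailbox per (sender, receiver) pair."""

    def __init__(self, num_cores):
        self.mailboxes = {
            (src, dst): deque()
            for src in range(num_cores)
            for dst in range(num_cores)
        }

    def send(self, src, message, dst):
        """Core `src` executes 'Send m to p' with p = dst."""
        self.mailboxes[(src, dst)].append(message)

    def receive(self, dst, src):
        """Core `dst` executes 'Receive m from p' with p = src.

        Returns None when nothing has arrived (a real receive would block).
        """
        box = self.mailboxes[(src, dst)]
        return box.popleft() if box else None


net = MessageNetwork(num_cores=4)
net.send(0, "partial sum", 3)   # core 0 sends to core 3
first = net.receive(3, 0)       # core 3 receives from core 0
second = net.receive(3, 0)      # nothing left in the mailbox
```

Even in this toy form, every exchange of data needs a matching send and receive pair, which hints at the overhead and data-sharing issues listed on the previous slide.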

slide-6
SLIDE 6

The Fresh Breeze Project

  • Co-design of Programming Model and System Architecture.
  • Goal: Support Fine-Grain Dynamic Resource Management.
  • Goal: Support Modular Construction of Software for Parallel Computation.

slide-7
SLIDE 7

What is a Program Execution Model?

User Code

  • Application Code
  • Software Packages
  • Program Libraries
  • Compilers
  • Utility Applications

PXM (API)

System

  • Hardware
  • Runtime Code
  • Operating System

slide-8
SLIDE 8

Features a User Program Depends On

Features expressed within a programming language:

  • Procedures; call/return
  • Access to parameters and variables
  • Use of data structures (static and dynamic)

But that’s not all!

Features expressed outside a (typical) programming language:

  • File creation, naming and access
  • Object directories
  • Communication: networks and peripherals
  • Concurrency: coordination; scheduling

slide-9
SLIDE 9

Today’s Conventional Software Stack

User Code

  • Application Code, Etc.

System

  • Runtime Code
  • Operating System
  • Hardware

PXM (API): shown three times in the figure, once per system layer.

Each system layer compensates for inadequacies of the layers below, leading to an inefficient whole.

slide-10
SLIDE 10

Fresh Breeze Characteristics

  • Use of fixed-size chunks to represent all data objects, simplifying dynamic memory management.
  • Write-once data eliminates cache consistency problems.
  • Use of codelets executed according to dataflow principles yields a fine-grain tasking model.
  • Hardware task scheduler and load balancer provide highly effective dynamic management of processing load.

slide-11
SLIDE 11

Project Components

  • The funJava Programming Language for functional programming to support parallel execution.
  • The Fresh Breeze architecture for parallel computing with fine-grain execution of many codelets.
  • The Kiva system simulator capable of cycle-accurate simulation of systems with large numbers of components.
  • The Fresh Breeze compiler for generating codelets for highly parallel computation from funJava programs.

slide-12
SLIDE 12

A Functional Programming Language

funJava

  • A language in which all forms of parallelism are readily expressed: Expression Parallel, Data Parallel, Producer-Consumer and Transaction Processing.
  • A high-level programming language in which data streams are first-class data objects.
  • Retains the type-secure features of the Java language.

slide-13
SLIDE 13

Flexibility of resource management requires choice of a unit of exchange for memory and for processing

  • Unit of Memory – Fixed Size Memory Chunk
  • Unit of Processing – Execution of a Codelet
slide-14
SLIDE 14

What is a Memory Chunk?

A chunk holds sixteen data items that may be data values or pointers to other memory chunks.

Figure: a chunk holding sample values such as 104, 128, 57 and 12.

slide-15
SLIDE 15

Data Structures as Trees of Chunks

  • Fan-out as large as 16
  • Arrays: Three levels yield 4096 elements (longs or doubles)
  • Write-once, then read-only
  • Cycle-free heap

Figure labels: Arrays as Trees of Chunks; Master Chunk; Data Chunks (e.g. 128 bytes)
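
As a rough illustration of the three-level array, the sketch below stores 4096 elements in a tree of 16-slot chunks and reads one element by taking four bits of the index per level. The function names and the use of Python tuples to model write-once chunks are assumptions; the slides do not specify an addressing scheme.

```python
FANOUT = 16  # data items per chunk, as stated on the slide


def build_tree(values, levels):
    """Build a tree of immutable chunks (tuples) with `values` at the leaves."""
    assert len(values) == FANOUT ** levels
    if levels == 1:
        return tuple(values)                      # leaf chunk: 16 data values
    span = FANOUT ** (levels - 1)                 # elements under each child
    return tuple(
        build_tree(values[i * span:(i + 1) * span], levels - 1)
        for i in range(FANOUT)
    )


def read_element(root, index, levels):
    """Follow one slot per level from the master chunk down to the element."""
    chunk = root
    for level in range(levels - 1, -1, -1):
        slot = (index // FANOUT ** level) % FANOUT
        chunk = chunk[slot]
    return chunk


tree = build_tree(list(range(4096)), levels=3)    # 16 ** 3 = 4096 elements
value = read_element(tree, 1234, levels=3)        # path: slots 4, 13, 2
```

Because tuples cannot be modified after creation, the model also reflects the write-once, then read-only rule: changing an element means building new chunks along the path to it.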

slide-16
SLIDE 16

Benefits of the Memory Model

  • Uniform representation scheme for all data objects.
  • Ease of selecting components of a data object.
  • Simplified memory management.
  • Write-once policy eliminates coherence issues.

slide-17
SLIDE 17

What is a Codelet?

  • A block of instructions scheduled for execution when needed data objects are available.
  • Results made available to successor codelets.
  • Data objects are trees of chunks.

Figure labels: Codelet; Object A; Object B

slide-18
SLIDE 18

Work and Continuation Codelets (Data Parallel Computation)

Master Codelet

  SyncCreate (cont, n) -> sync
  TaskSpawn (work, sync, 0)
  TaskSpawn (work, sync, n-1)

Work Codelet

  SyncUpdate (sync, 0, data)

Work Codelet

  SyncUpdate (sync, n-1, data)
  TaskQuit ()

Continuation Codelet
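
The master, work and continuation pattern can be mimicked in a few lines. This is a sequential sketch under stated assumptions: the `Sync` class, the task queue, and the `run` loop are hypothetical stand-ins for SyncCreate, TaskSpawn and SyncUpdate, not the Fresh Breeze hardware interface.

```python
from collections import deque


class Sync:
    """Hypothetical join object: collects n results, then runs the continuation."""

    def __init__(self, continuation, n):
        self.continuation = continuation
        self.slots = [None] * n
        self.remaining = n
        self.result = None

    def update(self, index, data):
        """Model of SyncUpdate (sync, index, data)."""
        self.slots[index] = data
        self.remaining -= 1
        if self.remaining == 0:                    # last worker has reported
            self.result = self.continuation(self.slots)


def master(work, continuation, n, task_queue):
    sync = Sync(continuation, n)                   # SyncCreate (cont, n) -> sync
    for i in range(n):
        task_queue.append((work, sync, i))         # TaskSpawn (work, sync, i)
    return sync


def run(task_queue):
    """Execute queued work codelets one at a time, in arbitrary order."""
    while task_queue:
        work, sync, i = task_queue.pop()
        sync.update(i, work(i))                    # work codelet ends with SyncUpdate


queue = deque()
sync = master(lambda i: i * i, sum, 4, queue)
run(queue)
print(sync.result)  # 0 + 1 + 4 + 9 = 14
```

Each worker writes only its own slot, so the order in which work codelets run does not affect the value the continuation receives.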

slide-19
SLIDE 19

Example: The Dot Product

  • 5 levels: Vector length = 16^5 = 1,048,576
  • Each of 65,536 leaf tasks: dot product of two 16-element vectors (16 multiplies; 15 adds)

Figure: vectors A and B are multiplied element by element (*) and the products are summed (+) to give a scalar result.
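
The sizes on this slide follow from the fan-out of 16: a 5-level tree has 16^5 elements and 16^4 leaf chunks. The sketch below checks those numbers and runs a recursive dot product on a smaller 3-level tree; the helper names and the nested-tuple representation are assumptions for illustration.

```python
FANOUT = 16


def to_tree(values, levels):
    """Pack a flat list into nested 16-way tuples, `levels` deep."""
    if levels == 1:
        return tuple(values)
    span = len(values) // FANOUT
    return tuple(
        to_tree(values[i * span:(i + 1) * span], levels - 1)
        for i in range(FANOUT)
    )


def dot(a, b, levels, stats):
    """Traverse both trees in step; leaves do 16 multiplies and 15 adds."""
    if levels == 1:
        stats["leaf_tasks"] += 1
        return sum(x * y for x, y in zip(a, b))
    # Interior level: one sub-task per pair of child chunks, then combine sums.
    return sum(dot(ca, cb, levels - 1, stats) for ca, cb in zip(a, b))


levels = 3                                  # small demo; the slide uses 5 levels
n = FANOUT ** levels
a = to_tree(list(range(n)), levels)
b = to_tree([2] * n, levels)
stats = {"leaf_tasks": 0}
result = dot(a, b, levels, stats)

vector_length_5_levels = FANOUT ** 5        # 1,048,576
leaf_tasks_5_levels = FANOUT ** 4           # 65,536
```

In the demo the sub-tasks run one after another; in the model described on the slides, each recursive call would be an independently scheduled task.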

slide-20
SLIDE 20

Codelets for the Dot Product

Figure labels: codelets Traverse Vectors, Compute, and Combine Sums; operations TaskSpawn, ForAllSpawn, and Update

slide-21
SLIDE 21

Fresh Breeze Multicore Chip

Figure labels: Load Balancer; four units, each with AB, P and S; L2 Cache; Network; Off-Chip Memory System

Legend: AB - AutoBuffer; P - Processor Core; S - Scheduler

Innovations:

  • AutoBuffer (AB)
  • Load Balancer

slide-22
SLIDE 22

Linear Algebra: Three Algorithms

  • Dot Product
  • Matrix Multiply
  • Fast Fourier Transform

Let’s consider the special characteristics of each.

slide-23
SLIDE 23

Dot Product

Leaf Task: Dot Product of 16-element segments A and B

Multiplies   Adds   Operations
    16        15        31

  • No data reuse
  • No intermediate data
  • Large volume of input data

slide-24
SLIDE 24

Matrix Multiply

Leaf Task: Product of two 4-by-4 matrices (16 dot products of four-element vectors)

Multiplies   Adds   Operations
    64        48       112

  • Each input chunk used many times
  • Result chunks written to memory
  • No intermediate data
  • Relatively small input data
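
The leaf-task counts can be checked by instrumenting a plain 4-by-4 product: each of the 16 result entries is a four-element dot product costing 4 multiplies and 3 adds. The function below is a sketch written for this check, not code from the Fresh Breeze project.

```python
def matmul_counted(a, b):
    """Multiply square matrices and count scalar multiplies and adds."""
    n = len(a)
    mults = adds = 0
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = a[i][0] * b[0][j]          # first product needs no add
            mults += 1
            for k in range(1, n):
                acc += a[i][k] * b[k][j]     # one multiply, one add
                mults += 1
                adds += 1
            c[i][j] = acc
    return c, mults, adds


identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
m = [[4 * i + j for j in range(4)] for i in range(4)]
product, mults, adds = matmul_counted(m, identity)
print(mults, adds, mults + adds)  # 64 48 112
```

Two 4-by-4 matrices of longs or doubles hold 16 items each, so each operand fits in a single chunk, which is consistent with the "relatively small input data" remark.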

slide-25
SLIDE 25

Fast Fourier Transform

Leaf Task: Group of Four Butterfly Computations

Inputs: Eight Data Samples; Four Twiddle Factors
Outputs: Eight Results

                   Multiplies   Adds   Operations
One Butterfly           4         6        10
Four Butterflies       16        24        40

  • Log2 (n) stages
  • Intermediate data
  • Chunks written and read
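
One way to arrive at 4 multiplies and 6 adds per butterfly is a radix-2 butterfly with a full complex multiplication by the twiddle factor. The sketch below assumes that form; the pairing of samples inside `leaf_task` is also an assumption, since the slide does not show the data layout.

```python
MULTS_PER_BUTTERFLY = 4   # real multiplies in one complex product
ADDS_PER_BUTTERFLY = 6    # 2 inside the product + 4 for the sum and difference


def butterfly(a, b, w):
    """Radix-2 butterfly on (re, im) pairs: returns (a + w*b, a - w*b)."""
    t_re = w[0] * b[0] - w[1] * b[1]          # 2 multiplies, 1 add
    t_im = w[0] * b[1] + w[1] * b[0]          # 2 multiplies, 1 add
    top = (a[0] + t_re, a[1] + t_im)          # 2 adds
    bottom = (a[0] - t_re, a[1] - t_im)       # 2 adds
    return top, bottom


def leaf_task(samples, twiddles):
    """Four butterflies over eight samples (pairing 2i, 2i+1 is assumed)."""
    results = []
    for i, w in enumerate(twiddles):
        top, bottom = butterfly(samples[2 * i], samples[2 * i + 1], w)
        results.extend([top, bottom])
    return results


samples = [(k, 0) for k in range(8)]          # eight data samples
twiddles = [(1, 0)] * 4                       # four trivial twiddle factors
results = leaf_task(samples, twiddles)
check = butterfly((1, 0), (0, 1), (0, -1))    # w*b = (-i)(i) = 1
```

With four butterflies per leaf task, the totals are 16 multiplies and 24 adds, or 40 operations, matching the table.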

slide-26
SLIDE 26

Principle of the AutoBuffer

Figure labels: Register File (registers, valid flag, buffer index); AutoBuffer (Chunk Buffers, tags, Auxiliary Fields); Memory System

Codelets access chunks using chunk handles held in processor registers. Once a chunk is assigned a buffer, its index is held by the register containing the handle, providing direct access to the chunk.
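
The idea of keeping the buffer index alongside the handle can be sketched as follows. Everything here is a simplified assumption: the class names, the dictionary standing in for the memory system, and the absence of any buffer replacement policy are illustrative choices, not the actual hardware design.

```python
class Register:
    """A processor register holding a chunk handle plus AutoBuffer fields."""

    def __init__(self, handle):
        self.handle = handle
        self.valid = False          # valid flag: has a buffer been assigned?
        self.buffer_index = None    # index of the assigned chunk buffer


class AutoBuffer:
    def __init__(self, memory, num_buffers):
        self.memory = memory                    # handle -> chunk (tuple of 16 items)
        self.buffers = [None] * num_buffers
        self.next_free = 0
        self.memory_reads = 0

    def read(self, reg, slot):
        """Read one item of the chunk whose handle is in `reg`."""
        if not reg.valid:
            if self.next_free == len(self.buffers):
                raise RuntimeError("no free buffer; replacement is out of scope here")
            # First access: fetch the chunk and remember where it was placed.
            self.buffers[self.next_free] = self.memory[reg.handle]
            self.memory_reads += 1
            reg.buffer_index = self.next_free
            reg.valid = True
            self.next_free += 1
        # Later accesses go straight to the buffer by index, with no lookup.
        return self.buffers[reg.buffer_index][slot]


memory = {"h1": tuple(range(100, 116))}
ab = AutoBuffer(memory, num_buffers=4)
r = Register("h1")
first = ab.read(r, 0)
second = ab.read(r, 5)
```

After the first access, the register itself says where the chunk lives, so repeated reads need neither a memory access nor an associative search.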

slide-27
SLIDE 27

Dynamic Load Balancing

Figure labels: Load Balancer; Local Task Queue (LTQ) at each processor; Task Transfer Network; signals Load Measure, Send a Task To, Send a Task, Receive a Task

The Load Balancer monitors the number of tasks queued at each processor and instructs local schedulers to send tasks from processors with high load to processors with low load.
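
A minimal software sketch of this policy is shown below. It assumes the load measure is simply the queue length and that one task moves per step from the most loaded to the least loaded queue; the function name and the `threshold` parameter are hypothetical, and the real balancer is a hardware unit.

```python
from collections import deque


def balance_step(queues, threshold=1):
    """Move one task from the longest local task queue to the shortest.

    Returns (source, destination), or None when loads differ by at most
    `threshold` and no transfer is needed.
    """
    loads = [len(q) for q in queues]             # load measure per processor
    src = loads.index(max(loads))
    dst = loads.index(min(loads))
    if loads[src] - loads[dst] <= threshold:
        return None
    queues[dst].append(queues[src].pop())        # "send a task to" dst
    return src, dst


# Four local task queues with an uneven initial load of 8, 0, 2 and 2 tasks.
queues = [deque(range(8)), deque(), deque(range(2)), deque(range(2))]
transfers = 0
while balance_step(queues) is not None:
    transfers += 1
final_loads = [len(q) for q in queues]
```

In this run the twelve tasks end up three per queue after five transfers.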

slide-28
SLIDE 28

The Task Record

  • Codelet – index of codelet within the codelet library.
  • Arguments – The handle of an argument chunk.

Figure labels: Codelet; Arguments

slide-29
SLIDE 29

Simulated Fresh Breeze System

System Parameters:

  • Number of cores
  • Execution Slots
  • Size of AutoBuffer
  • Latency of Read

Figure labels: Load Balancer; four units, each with AB, P and S; Network; Memory Units

slide-30
SLIDE 30

Speed Up Data – Dot Product

Rows: Depth; columns: Processing Cores

Depth      1     2     4     8    16    32    64   128    256   512  1024
  5              2     4   7.9  15.4  30.4  59.4   114  204.5
  4        1     2   3.9   7.8  15.2  29.4  54.8  96.1    151
  3        1     2   3.8   7.3  12.8  19.9  26.3  30.3   27.9  26.5  26.4
  2        1   1.8   2.7   3.3   3.1   3.1   3.1   2.7    2.9   2.9   2.9
  1        1     1   0.9   0.9   0.8   0.8   0.7   0.7    0.7   0.6   0.6

slide-31
SLIDE 31

Running Two Jobs Together

System Configuration: 64 Processing Cores
Job DP: 4096-element Dot Product
Job MM: 16 x 16 Matrix Multiply

Job        Cycles
DP         10,979
MM         10,409
DP + MM    14,291

Ratio: Together / Separate: 0.67
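
The ratio can be reproduced from the cycle counts, assuming "Separate" means running the two jobs one after the other so that their cycle counts add. That reading is an interpretation of the slide, not something it states.

```python
cycles = {"DP": 10_979, "MM": 10_409, "DP + MM": 14_291}

separate = cycles["DP"] + cycles["MM"]      # 21,388 cycles back to back
ratio = cycles["DP + MM"] / separate        # together / separate
print(f"{ratio:.2f}")  # 0.67
```

Under that reading, running both jobs at once takes about two thirds of the time needed to run them in sequence on the same 64 cores.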

slide-32
SLIDE 32

Sources of Energy Savings

  • The AutoBuffer does not use a cache tag memory
  • Absence of TLB
  • No software cycles for task scheduling
  • No software cycles to handle page misses
  • No file system software

slide-33
SLIDE 33

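Fresh Breeze Compiler

Pipeline: funJava -> javac -> Bytecode Class Files -> Convert Class Files -> DFGs of Methods -> Transform Graphs -> DFGs for Codelets -> Construct Code -> Fresh Breeze Codelets -> Processor Simulator

The Fresh Breeze Compiler comprises the Convert Class Files, Transform Graphs, and Construct Code stages.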

slide-34
SLIDE 34

Structured Parallelism

Program modules are determinate unless nondeterminate behavior is desired and explicitly introduced by the programmer.

A program execution model must permit parallel execution of two modules whenever there is no data dependence between them, that is, neither module requires any result produced by the other.
slide-35
SLIDE 35

Information Hiding Principle

The user of a module must not need to know anything about the internal mechanism of the module to make effective use of it.

slide-36
SLIDE 36

Invariant Behavior Principle

The functional behavior of a module must be independent of the site or context from which it is invoked.

slide-37
SLIDE 37

Data Generality Principle

The interface to a module must be capable of passing any data object an application may require.

slide-38
SLIDE 38

Secure Arguments Principle

The interface to a module must not allow side-effects on arguments supplied to the interface.

slide-39
SLIDE 39

Recursive Construction Principle

A program constructed from modules must be useable as a component in building larger programs or modules.

slide-40
SLIDE 40

System Resource Management Principle

Resource management for program modules must be performed by the computer system and not by individual program modules.

slide-41
SLIDE 41

The list processing language Lisp

  • Data Objects: Lists – binary trees
  • Module: Function declaration
  • Garbage Collection: Yes
  • Secure Arguments: Pure Lisp: Yes; Complete Lisp: No
  • Unified Memory and File System: No
  • Parallel Execution: Pure Lisp: Yes, with functional behavior.

slide-42
SLIDE 42

The IBM AS/400 System

Designed to serve the corporate data processing market.

  • Data Objects: Files and segments of memory identified by handles.
  • Module: Procedure Declaration
  • Secure Arguments: Not Known
  • Unified Memory and File System: Yes
  • Garbage Collection: Yes