Scalable Distributed Memory Multiprocessors 1 Outline Scalability - PowerPoint PPT Presentation

Scalable Distributed Memory Multiprocessors 1

Outline Scalability • physical, bandwidth, latency and cost • level of integration Realizing Programming Models • network transactions • protocols • safety – input buffer problem: N-1 – fetch deadlock Communication Architecture Design Space • how much hardware interpretation of the network transaction? 2

Limited Scaling of a Bus Characteristic Bus Physical Length ~ 1 ft Number of Connections fixed Maximum Bandwidth fixed Interface to Comm. medium memory inf Global Order arbitration Protection Virt -> physical Trust total OS single comm. abstraction HW Bus: each level of the system design is grounded in the scaling limits at the layers below and assumptions of close coupling between components 3

Workstations in a LAN? Characteristic Bus LAN Physical Length ~ 1 ft KM Number of Connections fixed many Maximum Bandwidth fixed ??? Interface to Comm. medium memory inf peripheral Global Order arbitration ??? Protection Virt -> physical OS Trust total none OS single independent comm. abstraction HW SW No clear limit to physical scaling, little trust, no global order, consensus difficult to achieve. Independent failure and restart 4

Scalable Machines What are the design trade-offs for the spectrum of machines between? • specialize or commodity nodes? • capability of node-to-network interface • supporting programming models? What does scalability mean? • avoids inherent design limits on resources • bandwidth increases with P • latency does not • cost increases slowly with P 5

Bandwidth Scalability Typical switches Bus S S S S Crossbar Multiplexers P M M P M M P M M P M M What fundamentally limits bandwidth? • single set of wires Must have many independent wires Connect modules through switches Bus vs Network Switch? 6

Dancehall MP Organization M M ° ° ° M Scalable network Switch Switch Switch ° ° ° $ $ $ $ P P P P Network bandwidth? Bandwidth demand? • independent processes? • communicating processes? Latency? 7

Generic Distributed Memory Org. Scalable network Switch Switch Switch ° ° ° M CA $ P Network bandwidth? Bandwidth demand? • independent processes? • communicating processes? Latency? 8

Key Property Large number of independent communication paths between nodes => allow a large number of concurrent transactions using different wires initiated independently no global arbitration effect of a transaction only visible to the nodes involved • effects propagated through additional transactions 9

Latency Scaling T(n) = Overhead + Channel Time + Routing Delay Overhead? Channel Time(n) = n/B --- BW at bottleneck RoutingDelay(h,n) 10

Typical example max distance: log n number of switches: α α n log n overhead = 1 us, BW = 64 MB/s, 200 ns per hop Pipelined T 64 (128) = 1.0 us + 2.0 us + 6 hops * 0.2 us/hop = 4.2 us T 1024 (128) = 1.0 us + 2.0 us + 10 hops * 0.2 us/hop = 5.0 us Store and Forward sf (128) = 1.0 us + 6 hops * (2.0 + 0.2) us/hop = 14.2 us T 64 sf (1024) = 1.0 us + 10 hops * (2.0 + 0.2) us/hop = 23 us T 64 11

Cost Scaling cost(p,m) = fixed cost + incremental cost (p,m) Bus Based SMP? Ratio of processors : memory : network : I/O ? Parallel efficiency(p) = Speedup(P) / P Costup(p) = Cost(p) / Cost(1) Cost-effective: speedup(p) > costup(p) Is super-linear speedup 12

Cost Effective? 2000 1500 Speedup = P/(1+ logP) Costup = 1 + 0.1 P 1000 500 0 0 500 1000 1500 2000 Processors 2048 processors: 475 fold speedup at 206x cost 13

Physical Scaling Chip-level integration Board-level System level 14

nCUBE/2 Machine Organization Basic module 1024 Nodes DRAM interface channels Router MMU DMA I-Fetch Operand Hypercube network & $ configuration decode Single-chip node Execution unit 64-bit integer IEEE floating point Entire machine synchronous at 40 MHz 15

CM-5 Machine Organization Diagnostics network Control network Data network PM PM Processing Processing Control I/O partition partition partition processors SPARC FPU Data Control networks network $ $ NI ctrl SRAM MBUS Vector Vector unit unit DRAM DRAM DRAM DRAM ctrl ctrl ctrl ctrl DRAM DRAM DRAM DRAM 16

System Level Integration Power 2 IBM SP-2 node CPU L 2 $ Memory bus General inter connection 4-way network formed from Memory interleaved 8-port switches controller DRAM MicroChannel bus NIC I/O DMA DRAM i860 NI 17

Outline Scalability • physical, bandwidth, latency and cost • level of integration Realizing Programming Models • network transactions • protocols • safety – input buffer problem: N-1 – fetch deadlock Communication Architecture Design Space • how much hardware interpretation of the network transaction? 18

Programming Models Realized by Protocols CAD Database Scientific modeling Parallel applications Multiprogramming Shared Message Data Programming models address passing parallel Compilation Communication abstraction or library User/system boundary Operating systems support Hardware/software boundary Communication har dware Physical communication medium Network Transactions 19

Network Transaction Primitive Communication Network serialized msg ° ° ° input buf fer output buf fer Destination Node Source Node one-way transfer of information from a source output buffer to a dest. input buffer • causes some action at the destination • occurrence is not directly visible at source deposit data, state change, reply 20

Bus Transactions vs Net Transactions Issues: protection check V->P ?? format wires flexible output buffering reg, FIFO ?? global local media arbitration destination naming and routing input buffering limited many source action completion detection 21

Shared Address Space Abstraction Source Destination r ← [ Global address] (1) Initiate memory access Load (2) Address translation (3) Local/remote check Read request (4) Request transaction Read request (5) Remote memory access Memory access Wait Read response (6) Reply transaction Read response (7) Complete memory access Time Fundamentally a two-way request/response protocol • writes have an acknowledgement Issues • fixed or variable length (bulk) transfers • remote virtual or physical address, where is action performed? • deadlock avoidance and input buffer full coherent? consistent? 22

The Fetch Deadlock Problem Even if a node cannot issue a request, it must sink network transactions. Incoming transaction may be a request, which will generate a response. Closed system (finite buffering) 23

Consistency while (flag==0); A=1; flag=1; print A; P P P 1 2 3 Memory Memory Memory A:0 flag:0->1 Delay 3: load A 1: A=1 2: flag=1 Interconnection network (a) P P 3 2 P 1 Congested path (b) write-atomicity violated without caching 24

Key Properties of SAS Abstraction Source and destination data addresses are specified by the source of the request • a degree of logical coupling and trust no storage logically “outside the application address space(s)” – may employ temporary buffers for transport Operations are fundamentally request response Remote operation can be performed on remote memory • logically does not require intervention of the remote processor 25

Message passing Bulk transfers Complex synchronization semantics • more complex protocols • More complex action Synchronous • Send completes after matching recv and source data sent • Receive completes after data transfer complete from matching send Asynchronous • Send completes after send buffer may be reused 26

Synchronous Message Passing Source Destination Recv P src , local VA, len (1) Initiate send (2) Address translation on P src Send P dest , local VA, len (3) Local/remote check (4) Send-ready request Send-rdy req (5) Remote check for posted receive Wait Tag check Processor (assume success) Action? (6) Reply transaction Recv-rdy reply (7) Bulk data transfer ¡ Dest VA or ID Source VA Data-xfer req Time Constrained programming model. Deterministic! What happens when threads added? Destination contention very limited. User/System boundary? 27

Asynch. Message Passing: Optimistic Destination Source (1) Initiate send (2) Address translation Send (P dest , local VA, len) (3) Local/remote check (4) Send data (5) Remote check for posted receive; on fail, Tag match allocate data buffer Data-xfer req Allocate buffer Recv P src , local VA, len Time More powerful programming model Wildcard receive => non-deterministic Storage required within msg layer? 28

Asynch. Msg Passing: Conservative Destination Source (1) Initiate send (2) Address translation on P dest Send P dest , local VA, len (3) Local/remote check Send-rdy req (4) Send-ready request (5) Remote check for posted receive (assume fail); Return and compute record send-ready Tag check (6) Receive-ready request Recv P src , local VA, len (7) Bulk data reply ¡ Dest VA or ID Source VA Recv-rdy req Data-xfer reply Time Where is the buffering? Contention control? Receiver initiated protocol? Short message optimizations 29

Scalable Distributed Memory Multiprocessors 1 Outline Scalability - PowerPoint PPT Presentation

Scalable Distributed Memory Multiprocessors 1 Outline Scalability physical, bandwidth, latency and cost level of integration Realizing Programming Models network transactions protocols safety input buffer problem: N-1

Shared Memory Multiprocessors Logical design and software interactions 1 Shared Memory

Architectural Support for Parallel Reduction in Scalable Shared Memory Multiprocessors in

Cap5 - Shared Memory Multiprocessors Logical design and software interactions 1 Shared Memory

Cache Coherence in Scalable Machines Scalable Cache Coherent Systems Scalable, distributed

Lecture 24: Virtual Memory, Multiprocessors Todays topics: Virtual memory

Lecture 23: Virtual Memory, Multiprocessors Todays topics: Virtual memory

4 Chip Multiprocessors (I) Chip Multiprocessors (ACS MPhil) Robert Mullins Overview

Memory II. Memory improvement III. Problems with memory 3 systems/stages of Memory: memory

Why Multiprocessors? Limits on the performance of a single processor: what are they? Spring 2009

5 Chip Multiprocessors (II) Chip Multiprocessors (ACS MPhil) Robert Mullins Overview

COMP 590-154: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors

Distributed Shared Memory 1 Distributed Shared Memory Making the main memory of a cluster of

Scalable Distributed Lineage Authentication Ashish Gehani Scalable Distributed Lineage

1 Memory SoC Persistent Memory-Driven Memory Memory Processor-Centric Memory SoC SoC

Networks Computer-Computer Comm CPU CPU CPU CPU Memory Device Device Memory Memory

Multiple- -Writer Distributed Memory Writer Distributed Memory Multiple The Sequential

Time and global state; Coordination and agreement; Distributed transactions Oleg Batrashev

CSC2/458 Parallel and Distributed Systems PPMI: Synchronization Preliminaries Sreepathi Pai

Introduction to Multithreading and Multiprocessing in the FreeBSD SMPng Network Stack EuroBSDCon

Locks & barriers INF4140 - Models of concurrency Locks & barriers, lecture 2 Hsten

D ISTRIBUTED S YSTEMS [COMP9243] Lecture 8b: Distributed File Systems Introduction NFS

The Distributed File System (DFS) The sharing of stored information is perhaps the most important

Distributed File Systems Distributed File Systems A distributed file system (DFS) is a

Distributed-File Systems Background Naming and Transparency Remote File Access

Scalable Distributed Memory Multiprocessors 1 Outline Scalability - PowerPoint PPT Presentation

Scalable Distributed Memory Multiprocessors 1 Outline Scalability physical, bandwidth, latency and cost level of integration Realizing Programming Models network transactions protocols safety input buffer problem: N-1

Shared Memory Multiprocessors Logical design and software interactions 1 Shared Memory

Architectural Support for Parallel Reduction in Scalable Shared Memory Multiprocessors in

Cap5 - Shared Memory Multiprocessors Logical design and software interactions 1 Shared Memory

Cache Coherence in Scalable Machines Scalable Cache Coherent Systems Scalable, distributed

Lecture 24: Virtual Memory, Multiprocessors Todays topics: Virtual memory

Lecture 23: Virtual Memory, Multiprocessors Todays topics: Virtual memory

4 Chip Multiprocessors (I) Chip Multiprocessors (ACS MPhil) Robert Mullins Overview

Memory II. Memory improvement III. Problems with memory 3 systems/stages of Memory: memory

Why Multiprocessors? Limits on the performance of a single processor: what are they? Spring 2009

5 Chip Multiprocessors (II) Chip Multiprocessors (ACS MPhil) Robert Mullins Overview

COMP 590-154: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors

Distributed Shared Memory 1 Distributed Shared Memory Making the main memory of a cluster of

Scalable Distributed Lineage Authentication Ashish Gehani Scalable Distributed Lineage

1 Memory SoC Persistent Memory-Driven Memory Memory Processor-Centric Memory SoC SoC

Networks Computer-Computer Comm CPU CPU CPU CPU Memory Device Device Memory Memory

Multiple- -Writer Distributed Memory Writer Distributed Memory Multiple The Sequential

Time and global state; Coordination and agreement; Distributed transactions Oleg Batrashev

CSC2/458 Parallel and Distributed Systems PPMI: Synchronization Preliminaries Sreepathi Pai

Introduction to Multithreading and Multiprocessing in the FreeBSD SMPng Network Stack EuroBSDCon

Locks &amp; barriers INF4140 - Models of concurrency Locks &amp; barriers, lecture 2 Hsten

D ISTRIBUTED S YSTEMS [COMP9243] Lecture 8b: Distributed File Systems Introduction NFS

The Distributed File System (DFS) The sharing of stored information is perhaps the most important

Distributed File Systems Distributed File Systems A distributed file system (DFS) is a

Distributed-File Systems Background Naming and Transparency Remote File Access

Locks & barriers INF4140 - Models of concurrency Locks & barriers, lecture 2 Hsten