Adding Low-Cost Hardware Barrier Support to Small Commodity Clusters - PowerPoint PPT Presentation

Overview: History · Our Design · MPI Implementation · Performance · Conclusions and Future Work

Torsten Höfler, Department of Computer Science, TU Chemnitz
June 24, 2006


SLIDE 1

Adding Low-Cost Hardware Barrier Support to Small Commodity Clusters

Torsten Höfler, Department of Computer Science, TU Chemnitz
June 24, 2006

SLIDE 2

Outline

1. History: Parallel Machines with Barrier Support
2. Our Design: Hardware State Machine
3. MPI Implementation: Parallel Port Access, Open MPI
4. Performance: Microbenchmark, Application Benchmark
5. Conclusions and Future Work


SLIDE 4

Earth Simulator

Global Barrier Counter (GBC)
Flag registers within a processor node (Global Barrier Flag, GBF)

SLIDE 5

Earth Simulator Barrier

Working principle:

1. The master node sets the number of nodes into the GBC
2. The control unit resets all GBFs of the nodes
3. A completed node decrements the GBC and loops on its GBF
4. When GBC = 0, the control unit sets all GBFs
5. All nodes continue

⇒ constant barrier latency of 3.5µs between 2 and 512 nodes
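The five steps above can be modeled in a few lines of C. This is a toy, single-threaded sketch for illustration only: the names gbc/gbf follow the slide, but the function names and the sequential control flow are assumptions (the real machine does this in registers and a hardware control unit).

```c
#define MAX_NODES 512

static int gbc;             /* Global Barrier Counter */
static int gbf[MAX_NODES];  /* one Global Barrier Flag per node */

/* Steps 1-2: the master stores the node count, the control unit
 * clears all GBFs */
void barrier_setup(int nodes) {
    gbc = nodes;
    for (int i = 0; i < nodes; i++)
        gbf[i] = 0;
}

/* Steps 3-5: a completed node decrements the GBC; the last arrival
 * drives GBC to 0, so all GBFs are set and every node that loops
 * on its flag continues */
void node_done(int nodes) {
    if (--gbc == 0)
        for (int i = 0; i < nodes; i++)
            gbf[i] = 1;
}
```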


SLIDE 6

BlueGene/L

Independent barrier network
Four independent channels

SLIDE 7

BlueGene/L Barrier

Working principle:

1. Global OR
2. Global AND by inverted logic
3. The signal is propagated to the top of a binomial tree and back down
4. The OR is used for interrupts (halt the machine)
5. The AND is used for the barrier
6. Can be partitioned at specific borders

⇒ constant barrier latency of 1.5µs between 2 and 65536 nodes
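Step 2 ("global AND by inverted logic") is just De Morgan's law: an OR network computes an AND if every input and the final output are inverted. A small C sketch of the idea (the function names are mine, not BlueGene/L's):

```c
/* A global-OR network: did ANY node raise its line? */
int global_or(const int *in, int n) {
    int v = 0;
    for (int i = 0; i < n; i++)
        v |= in[i];
    return v;
}

/* The same network reused as a global AND by inverting the inputs
 * and the result: AND(x_i) == NOT(OR(NOT x_i)) */
int global_and(const int *in, int n) {
    int inv[64];                 /* n <= 64 assumed for the sketch */
    for (int i = 0; i < n; i++)
        inv[i] = !in[i];
    return !global_or(inv, n);
}
```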


SLIDE 8

Cray T3D

Two Fetch&Increment registers per processor
Global AND/OR barrier

SLIDE 9

Other Hardware Barriers

Many more machines follow the same principles:
Cray T3D
Fujitsu VPP500
Thinking Machines CM-5
Purdue's Adapter
...

⇒ our approach: support commodity clusters without changes to the machine itself


SLIDE 11

FPGA-Based Prototype

Simple and cheap design
Prototype supports 1 barrier per node

SLIDE 12

Parallel Port

[Figure: parallel-port pinout, showing the Data Port (BASE + 0), Status Port (BASE + 1), and Control Port (BASE + 2) with IRQ enable, and the outgoing/incoming signal pins of the 25-pin connector]

Three cables per node (IN, OUT, GND)
Prototype supports 1 barrier per node


SLIDE 14

Two-State Machine

[State diagram: output o = '0' until i1 and i2 and i3 and i4 = '1'; then o = '1' until i1 or i2 or i3 or i4 = '0']

Two states (2 FFs + ⌈log2 P⌉ 2-port ANDs/ORs)
Very fast state transition
OUT ↔ iP, IN ↔ o
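The two-state machine can be modeled in C for illustration (four inputs as on the slide, generalized to P; the names are mine). The machine leaves state '0' only when all inputs are '1', and leaves state '1' only when all inputs are back at '0':

```c
/* One evaluation of the barrier state machine. *state holds the
 * flip-flop contents ('0' or '1'); the return value is the output o
 * driven back to every node's IN line. */
int fsm_step(int *state, const int *in, int P) {
    int all_one = 1, all_zero = 1;    /* the AND/OR reduction trees */
    for (int p = 0; p < P; p++) {
        all_one  &= (in[p] == 1);
        all_zero &= (in[p] == 0);
    }
    if (*state == 0 && all_one)       /* i1 and ... and iP = '1' */
        *state = 1;
    else if (*state == 1 && all_zero) /* i1 or ... or iP = '0' */
        *state = 0;
    return *state;
}
```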


SLIDE 15

Working Principle

Goal: minimize read/write operations!

1. Init only: read status (IN)
2. Toggle status
3. Write new status (OUT)
4. Read status (IN) until toggled

→ no "packets"; signaling is constant, voltage-level based
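The host side of steps 1-4 can be sketched as follows. Real code would issue outb()/inb() on the parallel port; here the port is replaced by a single-node loopback (the state machine's output immediately follows our own OUT line), so the spin loop terminates and the logic is testable. All names are illustrative:

```c
static int wire;        /* models OUT wired back to IN (P = 1 loopback) */
static int my_status;   /* the node's current barrier level */

static void port_write(int v) { wire = v; }    /* stands in for outb() */
static int  port_read(void)   { return wire; } /* stands in for inb()  */

/* Step 1 (init only): read the current level of the IN line */
void hwb_init(void) { my_status = port_read(); }

void hwb_barrier(void) {
    my_status ^= 1;                   /* step 2: toggle the status     */
    port_write(my_status);            /* step 3: write new status (OUT) */
    while (port_read() != my_status)  /* step 4: spin until IN toggles */
        ;
}
```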


SLIDE 16

Scalability

Goal: connect more than a thousand nodes!
Same principle as the BlueGene/L AND/OR tree
State is propagated up and down the tree
Two-state principle
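The up-phase of such a tree can be sketched as a recursive AND reduction in C (a software model only; in hardware each tree level is a row of gates, and the root's value is then driven back down to every leaf):

```c
/* AND-combine the "arrived" bits of leaves [lo, hi) up a binary tree */
int tree_and(const int *leaf, int lo, int hi) {
    if (hi - lo == 1)
        return leaf[lo];              /* a leaf: one node's wire */
    int mid = lo + (hi - lo) / 2;     /* inner node: AND of both subtrees */
    return tree_and(leaf, lo, mid) & tree_and(leaf, mid, hi);
}
```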



SLIDE 18

Accessing the Parallel Port

#include <stdio.h>
#include <sys/io.h>     /* outb(), inb(), ioperm(); x86 Linux */

#define BASEPORT 0x378

int main() {
    if (ioperm(BASEPORT, 3, 1))   /* gain port access (needs root) */
        return 1;
    /* Set the data signals (D0-7) of the port to '0' */
    outb(0, BASEPORT);
    /* Read from the status port (BASE+1) */
    printf("status: %d\n", inb(BASEPORT + 1));
    return 0;
}

Prototype uses INB/OUTB
Requires root access, and the OS adds overhead
A kernel module with mmapped registers is easily possible


SLIDE 20

Collective Module in Open MPI

[Figure: Open MPI component stack. The application calls MPI; below sit the PML (OB1), the BML (R2), and the BTLs (IB, TCP); HWBARR plugs in as a COLL module alongside them]

Implemented as a collective (COLL) module in Open MPI
Prototype supports only MPI_COMM_WORLD
Requires running as root


SLIDE 22

Performance Model

Variables:

1. tb: barrier latency
2. ow: CPU overhead to write to the parallel port
3. or: CPU overhead to read from the parallel port
4. op(P): processing overhead of a state change
5. P: number of processors

→ toggle - write - read schema: tb = ow + op(P) + or


SLIDE 23

Parameter Benchmark

Benchmarked parameters (four 2.4 GHz Xeon nodes):

ow = 1.2µs
or = 1.2µs
op(P) = P · 0.01µs

→ tb = 1.2µs + 4 · 0.01µs + 1.2µs = 2.44µs
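These numbers plug directly into the model tb = ow + op(P) + or from the performance-model slide, with the linear fit op(P) = P · 0.01µs. A one-line C check:

```c
/* Predicted barrier latency in microseconds for P processors:
 * t_b = o_w + o_p(P) + o_r, with o_p(P) = P * c */
double t_barrier(double o_w, double o_r, double c, int P) {
    return o_w + (double)P * c + o_r;
}
```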


SLIDE 24

MPI Microbenchmark

PMB-1 (four 2.4 GHz Xeon nodes): 1000 repetitions of MPI_BARRIER
Average of 2.57µs
The Open MPI framework adds only 0.13µs

cf. GigE, 4 nodes: ≈ 80µs
cf. IB, 4 nodes: ≈ 14µs

→ comparable to commercial hardware barriers



SLIDE 26

Benchmarking Abinit

Calculates electronic structures of solids
Uses MPI_BARRIER for MPI_COMM_WORLD
8% MPI overhead
65% of the MPI overhead is due to MPI_BARRIER

SLIDE 27

Abinit Results

Comparison between GigE and HWBARR:
GigE: 4:34 min
HWBARR: 4:27 min
MPI overhead decreased by nearly 32%
MPI_BARRIER overhead is halved

SLIDE 28

Conclusions

Comparable to commercial hardware barriers
Extensible design:
or/ow can be reduced with memory mapping
More wires per node could be used (5 in, 12 out) → up to 211 barriers
Incoming interrupt wire
General OS support (e.g. /dev/barrier0)
...