Adding Low-Cost Hardware Barrier Support to Small Commodity Clusters - PowerPoint PPT Presentation

Overview: History · Our Design · MPI Implementation · Performance · Conclusions and Future Work

Torsten Höfler, Department of Computer Science, TU Chemnitz
June 24, 2006


SLIDE 1

Adding Low-Cost Hardware Barrier Support to Small Commodity Clusters

Torsten Höfler, Department of Computer Science, TU Chemnitz
June 24, 2006

SLIDE 2

Outline

1. History: Parallel Machines with Barrier Support
2. Our Design: Hardware State Machine
3. MPI Implementation: Parallel Port Access, Open MPI
4. Performance: Microbenchmark, Application Benchmark
5. Conclusions and Future Work


SLIDE 4

Earth Simulator

Global Barrier Counter (GBC)
Flag registers within a processor node (Global Barrier Flag, GBF)

SLIDE 5

Earth Simulator Barrier

Working principle:

1. The master node sets the number of nodes into the GBC
2. The control unit resets all GBFs of the nodes
3. A completed node decrements the GBC and loops on its GBF
4. When GBC = 0, the control unit sets all GBFs
5. All nodes continue

⇒ constant barrier latency of 3.5µs between 2 and 512 nodes
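The five steps above can be modeled in a few lines of C. This is a toy, single-threaded sketch for illustration only: the names gbc/gbf follow the slide, but the function names and the sequential control flow are assumptions (the real machine does this in registers and a hardware control unit).

```c
#define MAX_NODES 512

static int gbc;             /* Global Barrier Counter */
static int gbf[MAX_NODES];  /* one Global Barrier Flag per node */

/* Steps 1-2: the master stores the node count, the control unit
 * clears all GBFs */
void barrier_setup(int nodes) {
    gbc = nodes;
    for (int i = 0; i < nodes; i++)
        gbf[i] = 0;
}

/* Steps 3-5: a completed node decrements the GBC; the last arrival
 * drives GBC to 0, so all GBFs are set and every node that loops
 * on its flag continues */
void node_done(int nodes) {
    if (--gbc == 0)
        for (int i = 0; i < nodes; i++)
            gbf[i] = 1;
}
```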


SLIDE 6

BlueGene/L

Independent barrier network
Four independent channels

SLIDE 7

BlueGene/L Barrier

Working principle:

1. Global OR
2. Global AND by inverted logic
3. The signal is propagated to the top of a binomial tree and back down
4. The OR is used for interrupts (halt the machine)
5. The AND is used for the barrier
6. Can be partitioned at specific borders

⇒ constant barrier latency of 1.5µs between 2 and 65536 nodes
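Step 2 ("global AND by inverted logic") is just De Morgan's law: an OR network computes an AND if every input and the final output are inverted. A small C sketch of the idea (the function names are mine, not BlueGene/L's):

```c
/* A global-OR network: did ANY node raise its line? */
int global_or(const int *in, int n) {
    int v = 0;
    for (int i = 0; i < n; i++)
        v |= in[i];
    return v;
}

/* The same network reused as a global AND by inverting the inputs
 * and the result: AND(x_i) == NOT(OR(NOT x_i)) */
int global_and(const int *in, int n) {
    int inv[64];                 /* n <= 64 assumed for the sketch */
    for (int i = 0; i < n; i++)
        inv[i] = !in[i];
    return !global_or(inv, n);
}
```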


SLIDE 8

Cray T3D

Two Fetch&Increment registers per processor
Global AND/OR barrier

SLIDE 9

Other Hardware Barriers

Many more machines follow the same principles:
Cray T3D
Fujitsu VPP500
Thinking Machines CM-5
Purdue's Adapter
...

⇒ our approach: support commodity clusters without changes to the machine itself


SLIDE 11

FPGA-Based Prototype

Simple and cheap design
Prototype supports 1 barrier per node

SLIDE 12

Parallel Port

[Figure: parallel-port pinout, showing the Data Port (BASE + 0), Status Port (BASE + 1), and Control Port (BASE + 2) with IRQ enable, and the outgoing/incoming signal pins of the 25-pin connector]

Three cables per node (IN, OUT, GND)
Prototype supports 1 barrier per node


SLIDE 14

Two-State Machine

[State diagram: output o = '0' until i1 and i2 and i3 and i4 = '1'; then o = '1' until i1 or i2 or i3 or i4 = '0']

Two states (2 FFs + ⌈log2 P⌉ 2-port ANDs/ORs)
Very fast state transition
OUT ↔ iP, IN ↔ o
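The two-state machine can be modeled in C for illustration (four inputs as on the slide, generalized to P; the names are mine). The machine leaves state '0' only when all inputs are '1', and leaves state '1' only when all inputs are back at '0':

```c
/* One evaluation of the barrier state machine. *state holds the
 * flip-flop contents ('0' or '1'); the return value is the output o
 * driven back to every node's IN line. */
int fsm_step(int *state, const int *in, int P) {
    int all_one = 1, all_zero = 1;    /* the AND/OR reduction trees */
    for (int p = 0; p < P; p++) {
        all_one  &= (in[p] == 1);
        all_zero &= (in[p] == 0);
    }
    if (*state == 0 && all_one)       /* i1 and ... and iP = '1' */
        *state = 1;
    else if (*state == 1 && all_zero) /* i1 or ... or iP = '0' */
        *state = 0;
    return *state;
}
```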


SLIDE 15

Working Principle

Goal: minimize read/write operations!

1. Init only: read status (IN)
2. Toggle status
3. Write new status (OUT)
4. Read status (IN) until toggled

→ no "packets"; signaling is constant, voltage-level based
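The host side of steps 1-4 can be sketched as follows. Real code would issue outb()/inb() on the parallel port; here the port is replaced by a single-node loopback (the state machine's output immediately follows our own OUT line), so the spin loop terminates and the logic is testable. All names are illustrative:

```c
static int wire;        /* models OUT wired back to IN (P = 1 loopback) */
static int my_status;   /* the node's current barrier level */

static void port_write(int v) { wire = v; }    /* stands in for outb() */
static int  port_read(void)   { return wire; } /* stands in for inb()  */

/* Step 1 (init only): read the current level of the IN line */
void hwb_init(void) { my_status = port_read(); }

void hwb_barrier(void) {
    my_status ^= 1;                   /* step 2: toggle the status     */
    port_write(my_status);            /* step 3: write new status (OUT) */
    while (port_read() != my_status)  /* step 4: spin until IN toggles */
        ;
}
```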


SLIDE 16

Scalability

Goal: connect more than a thousand nodes!
Same principle as the BlueGene/L AND/OR tree
State is propagated up and down the tree
Two-state principle
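The up-phase of such a tree can be sketched as a recursive AND reduction in C (a software model only; in hardware each tree level is a row of gates, and the root's value is then driven back down to every leaf):

```c
/* AND-combine the "arrived" bits of leaves [lo, hi) up a binary tree */
int tree_and(const int *leaf, int lo, int hi) {
    if (hi - lo == 1)
        return leaf[lo];              /* a leaf: one node's wire */
    int mid = lo + (hi - lo) / 2;     /* inner node: AND of both subtrees */
    return tree_and(leaf, lo, mid) & tree_and(leaf, mid, hi);
}
```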



SLIDE 18

Accessing the Parallel Port

#include <stdio.h>
#include <sys/io.h>     /* outb(), inb(), ioperm(); x86 Linux */

#define BASEPORT 0x378

int main() {
    if (ioperm(BASEPORT, 3, 1))   /* gain port access (needs root) */
        return 1;
    /* Set the data signals (D0-7) of the port to '0' */
    outb(0, BASEPORT);
    /* Read from the status port (BASE+1) */
    printf("status: %d\n", inb(BASEPORT + 1));
    return 0;
}

Prototype uses INB/OUTB
Requires root access, and the OS adds overhead
A kernel module with mmapped registers is easily possible


SLIDE 20

Collective Module in Open MPI

[Figure: Open MPI component stack. The application calls MPI; below sit the PML (OB1), the BML (R2), and the BTLs (IB, TCP); HWBARR plugs in as a COLL module alongside them]

Implemented as a collective (COLL) module in Open MPI
Prototype supports only MPI_COMM_WORLD
Requires running as root


SLIDE 22

Performance Model

Variables:

1. tb: barrier latency
2. ow: CPU overhead to write to the parallel port
3. or: CPU overhead to read from the parallel port
4. op(P): processing overhead of a state change
5. P: number of processors

→ toggle - write - read schema: tb = ow + op(P) + or


SLIDE 23

Parameter Benchmark

Benchmarked parameters (four 2.4 GHz Xeon nodes):

ow = 1.2µs
or = 1.2µs
op(P) = P · 0.01µs

→ tb = 1.2µs + 4 · 0.01µs + 1.2µs = 2.44µs
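These numbers plug directly into the model tb = ow + op(P) + or from the performance-model slide, with the linear fit op(P) = P · 0.01µs. A one-line C check:

```c
/* Predicted barrier latency in microseconds for P processors:
 * t_b = o_w + o_p(P) + o_r, with o_p(P) = P * c */
double t_barrier(double o_w, double o_r, double c, int P) {
    return o_w + (double)P * c + o_r;
}
```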


SLIDE 24

MPI Microbenchmark

PMB-1 (four 2.4 GHz Xeon nodes): 1000 repetitions of MPI_BARRIER
Average of 2.57µs
The Open MPI framework adds only 0.13µs

cf. GigE, 4 nodes: ≈ 80µs
cf. IB, 4 nodes: ≈ 14µs

→ comparable to commercial hardware barriers



SLIDE 26

Benchmarking Abinit

Calculates electronic structures of solids
Uses MPI_BARRIER for MPI_COMM_WORLD
8% MPI overhead
65% of the MPI overhead is due to MPI_BARRIER

SLIDE 27

Abinit Results

Comparison between GigE and HWBARR:
GigE: 4:34 min
HWBARR: 4:27 min
MPI overhead decreased by nearly 32%
MPI_BARRIER overhead is halved

SLIDE 28

Conclusions

Comparable to commercial hardware barriers
Extensible design:
or/ow can be reduced with memory mapping
More wires per node could be used (5 in, 12 out) → up to 211 barriers
Incoming interrupt wire
General OS support (e.g. /dev/barrier0)
...