A Power-Aware, Application-Based, Performance Study Of Modern Commodity Cluster Interconnection Networks




SLIDE 1

A Power-Aware, Application-Based, Performance Study Of Modern Commodity Cluster Interconnection Networks

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine

Open Systems Lab Indiana University Bloomington, USA

CAC’09 - IPDPS’09

Rome, Italy

May 25th, 2009

Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine A Power-Aware, Application-Based, Performance Study Of Moder

SLIDE 2

Motivation I (economic)

[Figure: Commercial Energy Price. x-axis: Year (1960 to 2005); y-axis: Price [cent/kWh] (2 to 11)]

SLIDE 3

Motivation II (personal)

SLIDE 4

Motivation III (scientific)

Interconnection network is the heart of parallel computing

  How do we compare different network technologies?
  Microbenchmarks! Often latency and bandwidth only
  Is this enough to predict application performance?

Power consumption is becoming a problem for system designers

  Green500 list as an addition to Top500
  Power input (cooling!) major design goal for large systems
  What about power efficiency of the network?

SLIDE 5

Experiment Setup

We compare three different network technologies:

  Fiber-based Myrinet 10G
  Copper-based Myrinet 10G
  Copper-based ConnectX InfiniBand

We compare latency and bandwidth results (NetPIPE) and application performance on absolutely identical systems:

  OpenMPI 1.2.8, OFED 1.3, MX 1.4.3
  SLES 10 SP 2 (Linux 2.6.16)
  14 nodes, 2 × 4 Xeons L5420 2.5 GHz
  4 GiB RAM per core

SLIDE 6

Microbenchmark Results - Latency

[Figure: Latency [usec] vs. message size [byte] (1 to 100) for IB-C, MX-C, and MX-F, each with OMPI]

Latency: IB 1.4 µs, MX-F 2.5 µs, MX-C 2.8 µs

SLIDE 7

Microbenchmark Results - Throughput

[Figure: Throughput [Gb/s] vs. message size [byte] (1.0k to 16.8M) for IB-C, MX-C, and MX-F, each with OMPI]

Bandwidth: IB 13.9 Gib/s (86.9%), MX 9.1 Gib/s (91%)
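
The percentages on this slide are consistent with dividing the measured throughput by a peak data rate of 16 for IB and 10 for MX. Those two peak values are an assumption inferred from the percentages; the slide does not state them. A minimal Python sketch of that arithmetic, with `link_efficiency` as a made-up helper name:

```python
def link_efficiency(measured, peak):
    """Return measured throughput as a percentage of the peak data rate."""
    return 100.0 * measured / peak


# (measured, peak) pairs in the slide's throughput unit.
# The measured values come from the slide; the peaks (16 and 10) are assumed.
results = {
    "IB": (13.9, 16.0),
    "MX": (9.1, 10.0),
}

for name, (measured, peak) in results.items():
    print(f"{name}: {link_efficiency(measured, peak):.1f}% of peak")
```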

SLIDE 8

Microbenchmark Summary

Results:

  IB performs significantly better in nearly all configurations!
  MX-F is slightly faster than MX-C
  OMPI’s MX eager-rendezvous switching point seems suboptimal

Projection:

  IB should deliver higher application performance
  no data about power consumption yet
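
The projection above matches what the simplest latency-bandwidth model would predict, in which a message of s bytes takes roughly latency + s / bandwidth. The Python sketch below applies that model to the measured values from the two microbenchmark slides. It is only an illustration under several assumptions: Gib/s is read as 2^30 bit/s, the single MX bandwidth figure is used for both MX-F and MX-C, and one latency and one bandwidth value are taken to describe each network at every message size. The function name `transfer_time_us` is hypothetical.

```python
def transfer_time_us(size_bytes, latency_us, bandwidth_gibit_s):
    """Estimate one-way message time in microseconds as latency + size / bandwidth."""
    bits = size_bytes * 8
    # Assumption: 1 Gib/s = 2**30 bit/s, converted here to bits per microsecond.
    bits_per_us = bandwidth_gibit_s * 2**30 / 1e6
    return latency_us + bits / bits_per_us


# (latency in usec, bandwidth in Gib/s) taken from the microbenchmark slides.
# Assumption: the single MX bandwidth value applies to both MX-F and MX-C.
networks = {
    "IB": (1.4, 13.9),
    "MX-F": (2.5, 9.1),
    "MX-C": (2.8, 9.1),
}

for size in (8, 65536, 8 * 2**20):
    estimates = {
        name: transfer_time_us(size, latency, bandwidth)
        for name, (latency, bandwidth) in networks.items()
    }
    print(size, {name: round(t, 1) for name, t in estimates.items()})
```

Under these assumptions IB comes out ahead at every message size, which is the kind of prediction that real application runs can confirm or refute.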

⇒ proceeding to real application runs!

  three runs with each application/network
  lowest running time counts
  all results were very stable (< 3% variance)
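
The run-selection rule above can be sketched in a few lines of Python. The run times below are hypothetical placeholders, not measurements from the study, and the slide's "variance" is interpreted here as the relative spread (max - min) / min, which is an assumption about what was measured. `best_of` is a made-up helper name.

```python
def best_of(runs):
    """Return the lowest running time and the relative spread (max - min) / min."""
    best = min(runs)
    spread = (max(runs) - best) / best
    return best, spread


# Hypothetical wall-clock times in seconds for three runs of one
# application on one network; these are NOT values from the study.
runs = [431.0, 426.0, 429.0]

best, spread = best_of(runs)
print(f"reported time: {best:.0f}s, relative spread: {spread:.1%}")
print("stable" if spread < 0.03 else "unstable")
```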

SLIDE 9

Application Performance - MILC

Quantum chromodynamics code (nuclear physics)

Multiple programs

We used NERSC "medium" benchmark for su3rmd

Runtime:

  IB: 444s (123s MPI)
  MX-C: 435s (115s MPI)
  MX-F: 426s (107s MPI)

[Figure: Time [s] per MPI function (MPI_Allreduce, MPI_Comm_rank, MPI_Init, MPI_Irecv, MPI_Isend, MPI_Wait) for IB-C, MX-C, and MX-F]

SLIDE 10

Application Performance - POP

Ocean circulation simulations

We used the x1 POP benchmark (32 cores on 14 nodes)

Runtime:

  IB: 66s (10s MPI)
  MX-C: 63s (7s MPI)
  MX-F: 61s (5s MPI)

[Figure: Time [s] per MPI function (MPI_Allreduce, MPI_Bcast, MPI_Init, MPI_Irecv, MPI_Isend, MPI_Waitall) for IB-C, MX-C, and MX-F]

SLIDE 11

Application Performance - RAxML

Models evolution by building phylogenetic trees from DNA

We calculated 112 trees (1 per core) from 50 genome sequences with 5000 base pairs each

Runtime:

  IB: 746s (35s MPI)
  MX-C: 743s (32s MPI)
  MX-F: 738s (32s MPI)!

[Figure: Time [s] per MPI function (MPI_Finalize, MPI_Init, MPI_Probe) for IB-C, MX-C, and MX-F]

SLIDE 12

Application Performance - WPP

Simulates time-dependent elastic and viscoelastic propagation of waves which occur during earthquakes and explosions

3D seismic modelling with finite difference methods

30k × 30k × 17k grid, single wave source (LOH1 example) on 112 cores

Runtime:

  IB: 702s (51s MPI)
  MX-C: 706s (57s MPI)
  MX-F: 701s (53s MPI)!

[Figure: Time [s] per MPI function (MPI_Allreduce, MPI_Barrier, MPI_Cart_create, MPI_Finalize, MPI_Init, MPI_Sendrecv) for IB-C, MX-C, and MX-F]
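
The runtimes reported for MILC, POP, RAxML, and WPP can be split into time inside and outside MPI. The Python sketch below does that arithmetic on the numbers from the four application slides. `split_runtime` is a hypothetical helper name, and treating "runtime minus MPI time" as non-MPI time assumes that the reported MPI time is contained in the reported runtime.

```python
# (total runtime in s, MPI time in s) as reported on the application slides.
runtimes = {
    "MILC":  {"IB": (444, 123), "MX-C": (435, 115), "MX-F": (426, 107)},
    "POP":   {"IB": (66, 10),   "MX-C": (63, 7),    "MX-F": (61, 5)},
    "RAxML": {"IB": (746, 35),  "MX-C": (743, 32),  "MX-F": (738, 32)},
    "WPP":   {"IB": (702, 51),  "MX-C": (706, 57),  "MX-F": (701, 53)},
}


def split_runtime(total_s, mpi_s):
    """Return (non-MPI seconds, MPI share of the runtime in percent)."""
    return total_s - mpi_s, 100.0 * mpi_s / total_s


for app, nets in runtimes.items():
    for net, (total_s, mpi_s) in nets.items():
        non_mpi_s, share = split_runtime(total_s, mpi_s)
        print(f"{app:6s} {net:5s} non-MPI {non_mpi_s:4d}s  MPI share {share:4.1f}%")
```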

SLIDE 13

Power Measurements

Methodology:

  two APC 7800 PDUs, resolution 0.1 A (120 V)
  data sampled every second via SNMP
  compute total power consumption as discrete integral
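
The discrete integral described above can be sketched as a rectangle-rule sum over the current samples. This is a minimal Python illustration, not the authors' script: it assumes a constant nominal voltage of 120 V, a fixed 1 s sampling interval, and that current times voltage equals real power (power factor 1). The function name `energy_kwh` and the sample values are hypothetical.

```python
def energy_kwh(current_samples_a, voltage_v=120.0, interval_s=1.0):
    """Rectangle-rule integral of V * I over time, converted from joules to kWh."""
    joules = sum(voltage_v * amps * interval_s for amps in current_samples_a)
    return joules / 3.6e6  # 1 kWh = 3.6e6 J


# Hypothetical readings: a constant 25.0 A draw sampled once per second
# for one hour. These are NOT measurements from the study.
samples = [25.0] * 3600

print(f"{energy_kwh(samples):.3f} kWh")
```

With a PDU resolution of 0.1 A, each sample carries a rounding error of up to 0.05 A, so small differences between runs should be read with that in mind.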

Base Data:

  idle system: IB 17.7 A, MX-C 17.3 A, MX-F 16.9 A
  IB switch: Cisco TopSpin SFS 7000D 0.48 A
  MX switch: 0.75 A (0.45 A w/o fan)

4 nodes idle vs. 8 MiB message-stream:

  IB: 3.9 A / 5.0 A
  MX-C: 3.77 A / 4.95 A (PML OB1)
  MX-C: 3.77 A / 4.75 A (MTL MX)
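
To put the idle versus message-stream currents above into more familiar units, the following Python sketch converts the increase under load into watts. It assumes the nominal 120 V from the methodology and a power factor of 1; `load_increase_w` is a made-up helper name.

```python
VOLTAGE_V = 120.0  # nominal voltage from the methodology; power factor 1 is assumed


def load_increase_w(idle_a, load_a, voltage_v=VOLTAGE_V):
    """Convert the idle-to-load current increase into watts."""
    return (load_a - idle_a) * voltage_v


# (idle current, current during the 8 MiB message stream) in amperes,
# for 4 nodes, as listed on the slide.
readings = {
    "IB": (3.9, 5.0),
    "MX-C (PML OB1)": (3.77, 4.95),
    "MX-C (MTL MX)": (3.77, 4.75),
}

for name, (idle_a, load_a) in readings.items():
    print(f"{name}: +{load_increase_w(idle_a, load_a):.1f} W under load")
```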

SLIDE 14

Power Consumption - MILC

[Figure: Power consumption [A] vs. application run time [s] for IB-C, MX-C, and MX-F]

Energy: IB 3.879 kWh, MX-C 0.1% less, MX-F 1.5% less

SLIDE 15

Power Consumption - POP

[Figure: Power consumption [A] vs. application run time [s] for IB-C, MX-C, and MX-F]

Energy: IB 0.458 kWh, MX-C 4.6% less, MX-F 11.3% less

SLIDE 16

Power Consumption - RAxML

[Figure: Power consumption [A] vs. application run time [s] for IB-C, MX-C, and MX-F]

Energy: IB 8.315 kWh, MX-C 1.8% less, MX-F 3.6% less

SLIDE 17

Power Consumption - WPP

[Figure: Power consumption [A] vs. application run time [s] for IB-C, MX-C, and MX-F]

Energy: IB 6.807 kWh, MX-C 0.4% less, MX-F 1.4% less
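
The four power slides give an absolute energy figure only for IB and relative savings for the two Myrinet variants. The Python sketch below turns those percentages into absolute values. It uses only the numbers from the slides; `energy_after_saving` is a hypothetical helper name, and the results inherit the rounding of the published percentages.

```python
# IB energy per run in kWh, as reported on the power consumption slides.
ib_energy_kwh = {"MILC": 3.879, "POP": 0.458, "RAxML": 8.315, "WPP": 6.807}

# Reported savings relative to IB, in percent.
percent_less = {
    "MILC":  {"MX-C": 0.1, "MX-F": 1.5},
    "POP":   {"MX-C": 4.6, "MX-F": 11.3},
    "RAxML": {"MX-C": 1.8, "MX-F": 3.6},
    "WPP":   {"MX-C": 0.4, "MX-F": 1.4},
}


def energy_after_saving(baseline_kwh, pct_less):
    """Return the energy that is pct_less percent below the baseline."""
    return baseline_kwh * (1.0 - pct_less / 100.0)


for app, baseline in ib_energy_kwh.items():
    for net, pct in percent_less[app].items():
        energy = energy_after_saving(baseline, pct)
        print(f"{app} {net}: {energy:.3f} kWh ({baseline - energy:.3f} kWh less than IB)")
```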

SLIDE 18

Conclusions

Microbenchmarks and simple metrics such as latency and bandwidth are not accurate performance predictors.

Other factors influence performance of parallel applications, for example tag matching in hardware, memory registration and cache pollution.

The network fabric can have an important impact on power consumption, up to 11% in our experiments.

Future Work

  more power-aware network fabric comparisons should be performed (not by us)
  study influence of the driver stack on application performance

SLIDE 19

Thanks

Thanks for your attention! Questions?
