SLIDE 1

Operating two InfiniBand grid clusters over 28 km distance

Sabine Richling, Steffen Hau, Heinz Kredel, Hans-Günther Kruse

IT-Center University of Heidelberg, Germany; IT-Center University of Mannheim, Germany

3PGCIC-2010, Fukuoka, 4. November 2010

SLIDE 2

Introduction

Motivation

Circumstances in Baden-Württemberg (BW)

Increasing demand for high-performance computing capacities from scientific communities
Demands are not high enough to qualify for the top German HPC centers in Jülich, Munich and Stuttgart
⇒ Grid infrastructure concept for the Universities in Baden-Württemberg

SLIDE 3

Introduction

Motivation

Special Circumstances in Heidelberg/Mannheim

Both IT-centers have a long record of cooperation
Both IT-centers are connected by a 10 Gbit dark fibre connection of 28 km (two color lines already used for backup and other services)
⇒ Connection of the clusters in Heidelberg and Mannheim to ease operation and to enhance utilization

SLIDE 4

Introduction

Outline

1. Introduction
2. bwGRiD cooperation
3. Interconnection of two bwGRiD clusters
4. Cluster operation
5. Performance modeling
6. Summary and Conclusions

SLIDE 5

bwGRiD cooperation

bwGRiD cooperation

SLIDE 6

bwGRiD cooperation

D-Grid

German Grid Initiative (www.d-grid.de)
Start: September 2005
Aim: Development and establishment of a reliable and sustainable Grid infrastructure for e-science in Germany
Funded by the Federal Ministry of Education and Research (BMBF) with ∼ 50 Million Euro

SLIDE 7

bwGRiD cooperation

bwGRiD

Community project of the Universities of BW (www.bw-grid.de)
Compute clusters at 8 locations: Stuttgart, Ulm (Konstanz), Karlsruhe, Tübingen, Freiburg, Mannheim/Heidelberg, Esslingen
Central storage unit in Karlsruhe
Distributed system with local administration
Access for all D-Grid virtual organizations via at least one middleware supported by D-Grid

SLIDE 8

bwGRiD cooperation

bwGRiD – Objectives

Verifying the functionality and the benefit of Grid concepts for the HPC community in BW
Managing organisational and security problems
Development of new cluster and Grid applications
Solving license difficulties
Enabling the computing centers to specialize

SLIDE 9

bwGRiD cooperation

bwGRiD – Access Possibilities

Access with local university accounts (via ssh):

→ Access to a local bwGRiD cluster only

Access with Grid Certificate and VO membership using a Grid middleware (e.g. Globus Toolkit: gsissh, GridFTP or Webservices):

→ Access to all bwGRiD resources

SLIDE 10

bwGRiD cooperation

bwGRiD – Resources

Compute cluster:

Mannheim/Heidelberg: 280 nodes (direct interconnection)
Karlsruhe: 140 nodes
Stuttgart: 420 nodes
Tübingen: 140 nodes
Ulm (Konstanz): 280 nodes (hardware in Ulm)
Freiburg: 140 nodes
Esslingen: 180 nodes (more recent hardware)

Central storage:

Karlsruhe: 128 TB (with backup), 256 TB (without backup)

[Map of bwGRiD sites: Mannheim and Heidelberg (interconnected to a single cluster), Karlsruhe, Stuttgart, Esslingen, Tübingen, Ulm (joint cluster with Konstanz), Freiburg; Frankfurt and München shown for orientation]

SLIDE 11

bwGRiD cooperation

bwGRiD – Software

Common Software:

Scientific Linux, Torque/Moab batch system, GNU and Intel compiler suite
Central repository for software modules (MPI versions, mathematical libraries, various free software, application software from each bwGRiD site)

Application areas of bwGRiD sites:

Freiburg: System Technology, Fluid Mechanics
Karlsruhe: Engineering, Compiler & Tools
Heidelberg: Mathematics, Neuroscience
Mannheim: Business Administration, Economics, Computer Algebra
Stuttgart: Automotive simulations, Particle simulations
Tübingen: Astrophysics, Bioinformatics
Ulm: Chemistry, Molecular Dynamics
Konstanz: Biochemistry, Theoretical Physics

SLIDE 12

Interconnection of two bwGRiD clusters

Interconnection of two bwGRiD clusters

SLIDE 13

Interconnection of two bwGRiD clusters

Hardware before Interconnection

10 Blade-Centers in Heidelberg and 10 Blade-Centers in Mannheim
Each Blade-Center contains 14 IBM HS21 XM Blades
Each Blade contains
  2 Intel Xeon CPUs, 2.8 GHz (each CPU with 4 cores)
  16 GB memory
  140 GB hard drive (since January 2009)
  Gigabit Ethernet (1 Gbit)
  InfiniBand network (20 Gbit)

⇒ 1120 Cores in Heidelberg and 1120 Cores in Mannheim
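For reference, the per-site core count is simply the product of the numbers above:

$$10 \;\text{Blade-Centers} \times 14 \;\text{Blades} \times 2 \;\text{CPUs} \times 4 \;\text{Cores} = 1120 \;\text{cores}.$$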

SLIDE 14

Interconnection of two bwGRiD clusters

Hardware – Bladecenter

SLIDE 15

Interconnection of two bwGRiD clusters

Hardware – Infiniband

SLIDE 16

Interconnection of two bwGRiD clusters

Interconnection of the bwGRiD clusters

Proposal in 2008
Acquisition and assembly until May 2009
Running since July 2009
InfiniBand over Ethernet over fibre optics: Longbow adaptor from Obsidian

[Photo: Longbow adaptor with InfiniBand connector (black cable) and fibre optic connector (yellow cable)]

SLIDE 17

Interconnection of two bwGRiD clusters

Interconnection of the bwGRiD clusters

ADVA component: transformation of white light from the Longbow to one-color light for the dark fibre connection between the IT centers

SLIDE 18

Interconnection of two bwGRiD clusters

MPI Performance – Prospects

Measurements for different distances (HLRS, Stuttgart, Germany)
Bandwidth of 900-1000 MB/sec for up to 50-60 km
Latency is not published

SLIDE 19

Interconnection of two bwGRiD clusters

MPI Performance – Latency

Local: ∼ 2 µsec
Interconnection: 145 µsec

[Plot: IMB 3.2 PingPong latency, time [µsec] vs. buffer size [bytes], local vs. MA-HD]

SLIDE 20

Interconnection of two bwGRiD clusters

MPI Performance – Bandwidth

Local: 1400 MB/sec
Interconnection: 930 MB/sec

[Plot: IMB 3.2 PingPong bandwidth [MB/sec] vs. buffer size [bytes], local vs. MA-HD]
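The latency and bandwidth figures on this and the previous slide come from the IMB 3.2 PingPong benchmark. As a rough illustration of what such a measurement does, here is a minimal ping-pong sketch; the use of Python with mpi4py, the buffer sizes and the repetition counts are assumptions for illustration, not the original IMB setup.

```python
# Minimal MPI ping-pong sketch (not the IMB 3.2 benchmark used on the slides).
# Assumes mpi4py and an MPI runtime; run with: mpirun -np 2 python pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

for size in (1, 1024, 1024**2, 16 * 1024**2):      # buffer sizes in bytes (assumed)
    buf = np.zeros(size, dtype=np.uint8)
    reps = 100 if size <= 1024**2 else 10           # fewer repetitions for large buffers
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)           # ping
            comm.Recv(buf, source=1, tag=0)         # pong
        elif rank == 1:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=0)
    dt = (MPI.Wtime() - t0) / reps / 2              # one-way time per message
    if rank == 0:
        print(f"{size:>9} bytes: latency {dt * 1e6:9.1f} usec, "
              f"bandwidth {size / dt / 1e6:9.1f} MB/sec")
```

Rank 0 reports the one-way time (dominated by latency for small buffers) and the resulting bandwidth (approached for large buffers), which is exactly the shape of the two curves in the plots.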

SLIDE 21

Interconnection of two bwGRiD clusters

Experiences with Interconnection Network

Cable distance MA-HD is 28 km (18 km linear distance in air)
⇒ Light needs 143 µsec for this distance (worked out below)
Latency is high: 145 µsec = light transit time + 2 µsec local latency
Bandwidth is as expected: about 930 MB/sec (local bandwidth 1200-1400 MB/sec)
Obsidian needs a license for 40 km
  Obsidian has buffers for larger distances
  Activation of the buffers requires a license
  A license for 10 km is not sufficient
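The light transit time can be checked with a short calculation; assuming a group refractive index of roughly 1.5 for the fibre (an assumed value, not stated on the slide):

$$t \approx \frac{28\,\text{km}}{(3 \times 10^5\,\text{km/s}) / 1.5} = \frac{28\,\text{km}}{2 \times 10^5\,\text{km/s}} \approx 140\,\mu\text{s},$$

close to the quoted 143 µsec; adding the ∼2 µsec local latency gives the measured 145 µsec.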

SLIDE 22

Interconnection of two bwGRiD clusters

MPI Bandwidth – Influence of the Obsidian License

[Plot: IMB 3.2 PingPong bandwidth [MB/sec] with 1 GB buffer vs. measurement start time, 16 Sep to 07 Oct]

SLIDE 23

Cluster operation

Cluster operation

SLIDE 24

Cluster operation

bwGRiD Cluster Mannheim/Heidelberg

[Diagram: Cluster Mannheim and Cluster Heidelberg with access nodes (Benutzer), admin node, passwd, LDAP (MA), AD (HD), PBS, Lustre bwFS MA and Lustre bwFS HD; the two InfiniBand fabrics are coupled via Obsidian + ADVA; external access via Belwue and Grid access via VORM]

SLIDE 25

Cluster operation

bwGRiD Cluster Mannheim/Heidelberg – Overview

Two clusters (blue boxes) are connected by InfiniBand (orange lines)
"Obsidian and ADVA" (orange box) represents the 28 km fibre connection
bwGRiD storage systems (grey boxes) are also connected by InfiniBand
Access nodes ("Benutzer") are connected with 10 Gbit (light orange lines) to the outside Internet "Belwue" (BW science net)

Access with local accounts from Mannheim ("LDAP")
Access with local accounts from Heidelberg ("AD")
Access with Grid certificates ("VORM")

Ethernet connection between all components is not shown

SLIDE 26

Cluster operation

Node Management

Compute nodes are booted via PXE and use an NFS read-only export as root file system
Administration server provides
  DHCP service for the nodes (MAC-to-IP address configuration file; see the sketch at the end of this slide)
  NFS export for the root file system
  NFS directory for software packages, accessible via module utilities
  queuing and scheduling system

Node administration (power on/off, execute commands, BIOS update, etc.) with
  adjusted shell scripts originally developed by HLRS
  IBM management module (command line interface and Web GUI)
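The slide mentions a MAC-to-IP address configuration file for the DHCP service. The sketch referenced above shows one way such ISC dhcpd host entries could be generated from a node list; the node names, addresses and the PXE filename are hypothetical, and this is not the actual bwGRiD administration script.

```python
# Hypothetical generator for ISC dhcpd host entries from a MAC-to-IP node list.
# Names, MACs, IPs and the PXE boot filename below are made-up examples.
NODES = [
    ("node001", "00:1a:64:aa:bb:01", "10.1.0.1"),
    ("node002", "00:1a:64:aa:bb:02", "10.1.0.2"),
]

def dhcp_host_entry(name, mac, ip):
    """Return a dhcpd.conf host block that pins a node's MAC to a fixed IP."""
    return (f"host {name} {{\n"
            f"  hardware ethernet {mac};\n"
            f"  fixed-address {ip};\n"
            f"  filename \"pxelinux.0\";\n"   # PXE boot loader served via TFTP
            f"}}\n")

if __name__ == "__main__":
    print("\n".join(dhcp_host_entry(*node) for node in NODES))
```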

SLIDE 27

Cluster operation

User Management

Users should have exclusive access to compute nodes

user names and user-ids must be unique
replacing passwd with a reduced passwd proved unreliable
a direct connection to PBS for user authorization via a PAM module works better

Authentication at the access nodes

directly against the directory services: LDAP (MA) and AD (HD)
or with a D-Grid certificate

Combining information from the directory services of both universities (see the sketch at the end of this slide)

Prefix "ma", "hd" or "mh" for group names
Adding offsets to group-ids
Adding offsets to user-ids
Activated user names from MA and HD must be different

Activation process

Adding a special attribute for the user in the directory service (for authentication)
Updating the user database of the cluster (for authorization)
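A minimal sketch of the combination scheme described above (site prefixes for group names, offsets for user- and group-ids); the concrete offset values and account data are made-up examples, not the values used in Mannheim/Heidelberg.

```python
# Sketch: merge site-local accounts into one cluster-wide namespace.
# Offsets and example entries are assumptions for illustration only.
UID_OFFSET = {"ma": 100000, "hd": 200000}   # hypothetical per-site uid offsets
GID_OFFSET = {"ma": 100000, "hd": 200000}   # hypothetical per-site gid offsets

def merge_user(site, name, uid, gid):
    """Map a site-local account to cluster-wide unique uid/gid; names from MA and HD must differ."""
    return {"name": name, "uid": uid + UID_OFFSET[site], "gid": gid + GID_OFFSET[site]}

def merge_group(site, name, gid):
    """Prefix the group name with its site and shift the gid into the site's range."""
    return {"name": f"{site}{name}", "gid": gid + GID_OFFSET[site]}

print(merge_user("ma", "alice", 1234, 100))
print(merge_group("hd", "physics", 500))
```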

SLIDE 28

Cluster operation

Job Management

The interconnection (high latency, limited bandwidth) provides
  enough bandwidth for I/O operations
  but is not sufficient for all kinds of MPI jobs

Jobs only run on nodes located either in HD or in MA (realized with attributes provided by the queuing system)
Before the interconnection:
  in Mannheim: mostly single-node jobs → free nodes
  in Heidelberg: many MPI jobs → long waiting times

With the interconnection: better resource utilization (see the Ganglia report on the next slide)

SLIDE 29

Cluster operation

Monitoring Report during activation of the interconnection

[Ganglia graphs: number of processes and percent CPU usage]

SLIDE 30

Performance modeling

Performance modeling

SLIDE 31

Performance modeling

MPI Jobs running across the interconnection

How does the interconnection influence the performance?
How much bandwidth would be necessary to improve the performance?
How much would such an upgrade cost?

SLIDE 32

Performance modeling

Performance modeling

Numerical model

High-Performance Linpack (HPL) benchmark
OpenMPI
Intel MKL

Model variants

Calculations on a single cluster with up to 1025 CPU cores
Calculations on the coupled cluster with up to 2048 CPU cores, symmetrically distributed

Analytical model for the speed-up to analyze the characteristics of the interconnection:
  high latency of 145 µsec
  limited bandwidth of 930 MB/sec

SLIDE 33

Performance modeling

Results for a single cluster

[Plot: HPL 1.0a speed-up vs. number of processors p on the local cluster, curves for np = 10000, 20000, 30000, 40000 and the p/ln(p) reference]

np: load parameter (matrix size for HPL)
Ideal speed-up for perfect parallel programs: Sideal(p) = p
Speed-up for a simple model ("all CPU configurations have equal probability"): Ssimple(p) = p/ln p
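For scale, the simple model at p = 1000 gives

$$S_{\text{simple}}(1000) = \frac{1000}{\ln 1000} \approx \frac{1000}{6.9} \approx 145.$$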

SLIDE 34

Performance modeling

Results for coupled cluster

[Plot: HPL 1.0a speed-up vs. number of processors p on the coupled cluster MA-HD, curves for np = 10000, 20000, 30000, 40000]

np: load parameter (matrix size for HPL)
for p > 256: speed-up reduced by a factor of ∼ 4 compared to the single cluster
for p > 500: constant (decreasing) speed-up

SLIDE 35

Performance modeling

Direct comparison of the two cases

[Plot: HPL 1.0a speed-up vs. number of processors p, local vs. MA-HD, for np = 10000 and np = 40000 (log-log scale)]

np: load parameter (matrix size for HPL)
for p < 50 the speed-up of the coupled cluster is acceptable; applications could run effectively across the interconnection (in the case of exclusive usage)

SLIDE 36

Performance modeling

Performance modeling

Following a performance model developed by Kruse (2009):

tc(p): communication time
tB(1): processing time for p = 1

Speed-up: $S_c(p) \le \dfrac{p}{\ln p + t_c(p)/t_B(1)}$

For tc(p) = 0, we recover the result of the simple model: Ssimple(p) = p/ln p

SLIDE 37

Performance modeling

Performance model for the high latency

Modeling tc(p) as a function of the typical communication time between 2 processes, $t_c^{(2)}$, and the communication topology c(p): $t_c(p) = t_c^{(2)}\, c(p)$

Defining a rate $r = t_c^{(2)}/t_A$ between $t_c^{(2)}$ and the computation time for a typical instruction, $t_A = t_B(1)/n$, so that $t_c(p)/t_B(1) = \frac{r}{n}\, c(p)$:

Speed-up: $S_c(p) \le \dfrac{p}{\ln p + \frac{r}{n}\, c(p)}$

Analysis for HPL ($n = \frac{2}{3}\, n_p^3 / p$):

for np = 1000: ∼ p/ln p for small p, decrease for p ≥ 30
for np = 10 000: ∼ p/ln p for p ≤ 10 000, decrease for c(p) > 10^6

Analysis does not explain the numerical results: the speed-up already decreases for smaller p.

SLIDE 38

Performance modeling

Performance model including a limited bandwidth

Modeling the interconnection as a shared medium for the communication of p processes with a given bandwidth B and average message length ⟨m⟩:

$t_c^{(2)} = t_L + \dfrac{\langle m \rangle}{B/p}$, hence $r(p) = \dfrac{t_L}{t_A} + \dfrac{\langle m \rangle}{t_A B}\, p$

With the measured bandwidth B = 1.5 · 10^6 and ⟨m⟩ = 10^6:

Speed-up: $S_c(p) \le \dfrac{p}{\ln p + \frac{3}{4}\left(\frac{100}{n_p}\right)^3 (1 + 4p)\, c(p)}$

With the assumption $c(p) = \frac{1}{2} p^2$:

for np = 10 000: ∼ p/ln p, decrease for p ≥ 50
for np = 40 000: ∼ p/ln p, decrease for p ≥ 250

Speed-up reproduces the measurements.
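A small evaluation sketch of the reconstructed formula (an illustration only, not the authors' code) makes the stated behaviour visible:

```python
# Bandwidth-limited speed-up model as given on this slide:
#   S_c(p) <= p / ( ln(p) + 3/4 * (100/np)^3 * (1 + 4*p) * c(p) ),  c(p) = p^2 / 2
import math

def model_speedup(p, np_):
    c = 0.5 * p ** 2                                     # assumed topology c(p) = p^2/2
    comm = 0.75 * (100.0 / np_) ** 3 * (1 + 4 * p) * c   # communication term
    return p / (math.log(p) + comm)

for np_ in (10_000, 40_000):
    print(f"np = {np_}")
    for p in (10, 50, 100, 250, 500, 1000, 2000):
        print(f"  p = {p:5d}   S_c <= {model_speedup(p, np_):7.1f}"
              f"   (p/ln p = {p / math.log(p):7.1f})")
```

With these numbers the deviation from p/ln p sets in around p ≈ 50 for np = 10 000 and around p ≈ 250 for np = 40 000, matching the bullet points above.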

SLIDE 39

Performance modeling

Speed-up of the model including limited bandwidth

np: load parameter (matrix size for HPL)
⇒ limited bandwidth is the performance bottleneck for the shared connection between the clusters
⇒ Doubling the bandwidth: 25 % improvement for np = 40 000
⇒ 100 % improvement with a ten-fold bandwidth (in the case of exclusive usage)

SLIDE 40

Summary and Conclusions

Summary and Conclusions

SLIDE 41

Summary and Conclusions

InfiniBand connection of two compute clusters

Network (Obsidian, ADVA and InfiniBand switches) is stable and reliable
Latency of 145 µsec is very high
Bandwidth of 930 MB/sec is as expected
Jobs are limited to one site, because MPI jobs would be slow (interconnection is a "shared medium")
Performance model predicts the cost for an improvement of the interconnection
Bandwidth sufficient for cluster administration and file I/O on Lustre file systems
Interconnection is useful and stable for a "Single System Cluster" administration
Better load balance at both sites due to common PBS
Solving organizational issues between two universities is a great challenge
