Introduction to PC-Cluster Hardware (II). Russian-German School on High Performance Computer Systems, June 27th until July 6th, 2005, Novosibirsk. Day 1, 27th of June, 2005. HLRS, University of Stuttgart, High Performance Computing Center.


SLIDE 1

Introduction to PC-Cluster Hardware (II)

Russian-German School on High Performance Computer Systems, June 27th until July 6th, 2005, Novosibirsk

  • 1st Day, 27th of June, 2005

HLRS, University of Stuttgart

SLIDE 2

Outline

  • I/O Bus
    – PCI
    – PCI-X
    – PCI Express
  • Network Interconnects
    – Ethernet
    – Myrinet
    – Quadrics Elan4
    – Infiniband
  • Mass Storage
    – Hard disks and RAIDs
  • Cluster File Systems
SLIDE 3

I/O Bus Layout Example

SLIDE 4

PCI

  • Stands for Peripheral Component Interconnect
  • Standard I/O interface in PCs since 1992
  • 32 Bit wide
  • 33.33 MHz clock
  • Max. 133 MB/s throughput (32 Bit × 33.33 MHz; see the sketch below)
  • Extended to 64 Bit, 266 MB/s throughput
  • Several adapters can share the bus (and the bandwidth)
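The peak figures above and in the PCI-X table on the next slide are simply bus width times clock rate. A minimal Python sketch of that arithmetic (illustrative only, not part of the original slides):

```python
# Minimal sketch: peak throughput of a parallel bus is bus width times clock rate.

def bus_peak_mb_per_s(width_bits: int, clock_mhz: float) -> float:
    """Peak throughput in MB/s for a parallel bus (width in bits, clock in MHz)."""
    bytes_per_transfer = width_bits / 8       # one transfer per clock cycle
    return bytes_per_transfer * clock_mhz     # MB/s, since MHz = 10^6 cycles/s

if __name__ == "__main__":
    print(bus_peak_mb_per_s(32, 33.33))   # classic PCI  -> ~133 MB/s
    print(bus_peak_mb_per_s(64, 33.33))   # 64-bit PCI   -> ~266 MB/s
    print(bus_peak_mb_per_s(64, 133.0))   # PCI-X 133    -> ~1066 MB/s
```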
SLIDE 5

PCI and PCI-X Overview

  • Also additional features within PCI-X and PCI-X 2.0

                 Width    Clock       Throughput   Voltage
  PCI            32 Bit   33.33 MHz   133 MB/s     3.3 and 5 V
  PCI            64 Bit   33.33 MHz   266 MB/s     3.3 and 5 V
  PCI(-X) 66     64 Bit   66.66 MHz   533 MB/s     3.3 V
  PCI-X 100      64 Bit   100 MHz     800 MB/s     3.3 V
  PCI-X 133      64 Bit   133 MHz     1066 MB/s    3.3 V
  (PCI-X 266)    64 Bit   266 MHz     2133 MB/s    3.3 and 1.5 V
  (PCI-X 533)    64 Bit   533 MHz     4266 MB/s    3.3 and 1.5 V

SLIDE 6

PCI-Express (PCIe)

  • Formerly known as 3GIO
  • Based on PCI programming concepts
  • Uses a serial communication system
    – Much faster
  • 2.5 GBit/s per lane (5 and 10 GBit/s in the future)
  • 8B/10B encoding, hence max. 250 MB/s per lane (see the sketch below)
  • Standard allows for 1, 2, 4, 8, 12, 16 and 32 lanes
  • Point-to-point connection between adapter and chipset
  • Allows for 95% of peak rate for large transfers
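The 250 MB/s per-lane figure follows from the 2.5 GBit/s line rate minus the 8B/10B encoding overhead, and the table on the next slide scales that value with the lane count. A small illustrative sketch of the calculation (not from the original slides):

```python
# Illustrative sketch: PCIe 1.x per-lane and per-link payload throughput.
# 8B/10B encoding means only 8 of every 10 transmitted bits carry payload.

def pcie_throughput_mb_per_s(lanes: int, lane_rate_gbit: float = 2.5) -> float:
    """Unidirectional payload throughput in MB/s for a PCIe link."""
    payload_bits_per_s = lane_rate_gbit * 1e9 * 8 / 10   # remove 8B/10B overhead
    per_lane_mb_per_s = payload_bits_per_s / 8 / 1e6     # bits -> MB/s
    return lanes * per_lane_mb_per_s

if __name__ == "__main__":
    for lanes in (1, 4, 8, 16, 32):
        uni = pcie_throughput_mb_per_s(lanes)
        print(f"x{lanes:<2}: {uni:6.0f} MB/s unidirectional, "
              f"{2 * uni:6.0f} MB/s bidirectional")
```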
SLIDE 7

PCI-Express (II)

  • Performance

  Lanes      Clock     Throughput Unidir.   Throughput Bidir.
  1 lane     2.5 GHz   250 MB/s             500 MB/s
  4 lanes    2.5 GHz   1 GB/s               2 GB/s
  8 lanes    2.5 GHz   2 GB/s               4 GB/s
  16 lanes   2.5 GHz   4 GB/s               8 GB/s
  32 lanes   2.5 GHz   8 GB/s               16 GB/s

SLIDE 8

Outline

  • I/O Bus
    – PCI
    – PCI-X
    – PCI Express
  • Network Interconnects
    – Ethernet
    – Myrinet
    – Quadrics
    – Infiniband
  • Mass Storage
    – Hard disks and RAIDs
  • Cluster File Systems
SLIDE 9

Ethernet

  • Gigabit Ethernet
    – Standard
    – Available in nearly every PC
    – Mostly copper
    – Cheap
    – But costs CPU performance
  • 10 GBit Ethernet
    – First adapters available
    – Copper/fibre
    – Currently expensive
    – Can consume up to 100% of a CPU
    – TCP offloading to decrease CPU load

SLIDE 10

Myrinet

  • Preferred cluster interconnect for quite a long time
  • Bandwidth higher than with Gigabit Ethernet
  • Lower latency than Ethernet
  • Has a processor on each adapter, so overlap of computation and communication is possible (see the MPI sketch after this list)
  • RDMA capability
  • Link aggregation possible
  • Myrinet 10G planned
  • Only one supplier, Myricom
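The slides do not show code; the following is a minimal sketch of how the overlap mentioned above is typically exploited from application code, here with non-blocking MPI calls via mpi4py (the library choice, the buffer size and the two-rank setup are assumptions for illustration, not part of the original material):

```python
# Illustrative sketch: overlapping computation with communication using
# non-blocking MPI calls. Assumes mpi4py and numpy are installed and the
# script is launched with mpirun/mpiexec on exactly 2 ranks.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
if comm.Get_size() != 2:
    raise SystemExit("run this sketch with exactly 2 ranks")

rank = comm.Get_rank()
peer = 1 - rank

send_buf = np.full(1_000_000, rank, dtype=np.float64)
recv_buf = np.empty_like(send_buf)

# Post non-blocking send/receive; the adapter (e.g. the Myrinet NIC's own
# processor) can move the data while the host CPU keeps computing.
requests = [comm.Isend(send_buf, dest=peer),
            comm.Irecv(recv_buf, source=peer)]

local_work = np.sin(send_buf).sum()   # computation overlapped with the transfer

MPI.Request.Waitall(requests)         # make sure the data has arrived
print(f"rank {rank}: local work {local_work:.3f}, "
      f"received data from rank {int(recv_buf[0])}")
```

The same pattern applies on the Quadrics adapters described on the next slide, which also carry their own processor.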
SLIDE 11

Quadrics Elan 4

  • Interconnect used for high performance clusters
  • Higher bandwidth than Myrinet
  • Lower latency
  • Has a processor on each adapter, so overlap of computation and communication is possible
  • RDMA capability
  • Link aggregation possible
  • Quite expensive
  • Only one supplier, Quadrics
SLIDE 12

Infiniband

  • Specified as an open standard protocol
  • Interconnect often used today for high performance clusters
  • Bandwidth comparable to Quadrics
  • Latency comparable to Myrinet
  • RDMA capability
  • Link aggregation possible
  • Costs similar to Myrinet; planned to become as cheap as GigE
  • Many vendors
SLIDE 13

Bandwidth

[Chart: throughput in MB/s (scale up to 2000) vs. message size from 1 Byte to 1 MB for Myrinet, Myrinet dual, Quadrics Elan 4, Elan 4 dual rail, Infiniband PCI-X and Infiniband PCI-Express]

SLIDE 14

Bandwidth Infiniband

  System Interface   unidirectional   bidirectional
  PCI-X              830 MB/s         900 MB/s
  PCI Express        930 MB/s         1800 MB/s

SLIDE 15

Network latency

  Interconnect        Latency (µs)
  Gigabit Ethernet    min. 11, up to 40
  10 G Ethernet       ?
  Myrinet             3.5 to 6
  Quadrics Elan 4     2.5
  Infiniband PCI-X    4.5
  Infiniband PCIe     3.5
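A rough way to connect these latency figures with the bandwidth curves shown two slides earlier is the usual linear message-time model, time = latency + size/bandwidth. A small illustrative sketch (the model and the example parameters are assumptions for illustration, not measurements from the slides):

```python
# Illustrative sketch: simple linear model for point-to-point message time,
#   time(m) = latency + m / bandwidth,
# showing why small messages are latency-bound and large ones bandwidth-bound.

def effective_bandwidth_mb_s(msg_bytes: int, latency_us: float, peak_mb_s: float) -> float:
    """Effective throughput in MB/s for a message of msg_bytes."""
    time_s = latency_us * 1e-6 + msg_bytes / (peak_mb_s * 1e6)
    return msg_bytes / time_s / 1e6

if __name__ == "__main__":
    # Example parameters loosely in the range of the slides (assumed, not measured):
    latency_us, peak_mb_s = 4.5, 830.0
    for size in (1, 256, 4096, 65536, 1 << 20):
        print(f"{size:>8} B: {effective_bandwidth_mb_s(size, latency_us, peak_mb_s):8.1f} MB/s")
```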

SLIDE 16

Outline

  • I/O Bus
    – PCI
    – PCI-X
    – PCI Express
  • Network Interconnects
    – Ethernet
    – Myrinet
    – Quadrics Elan4
    – Infiniband
  • Mass Storage
    – Hard disks and RAIDs
  • Cluster File Systems
SLIDE 17

Technologies to connect HDD (I)

  • IDE/PATA
    – Bus
    – Max. 2 devices
    – Max. 133 MB/s (ATA/133)
    – Typically system internal
  • SATA
    – Point-to-point
    – 150 MB/s (300 MB/s with SATA 2.0)
    – Typically system internal

SLIDE 18

Technologies to connect HDD (II)

  • SCSI
    – Bus
    – Max. 7/15 devices
    – Up to 320 MB/s throughput per bus
    – System internal and external
  • FC (Fibre Channel)
    – Network (fabric) and loop
    – Max. 127 devices per loop
    – Used for storage area networks
    – 2 GBit, 4 GBit in the near future
    – 8 and 10 GBit planned

SLIDE 19

Storage Area Network (SAN)

  • Fabric
    – HBAs
    – Switches
    – Today typically Fibre Channel, but also IP (iSCSI)

[Diagram: SAN]

SLIDE 20

Storage Media

  • Single hard disks
  • RAID systems (disk arrays); see the capacity sketch below
    – Fault tolerance
      • RAID 1
      • RAID 3
      • RAID 5
    – Higher throughput
      • RAID 0
      • RAID 3
      • RAID 5
    – FC, SCSI and SATA (performance and reliability vs. costs)
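To make the fault-tolerance versus capacity trade-off concrete, here is a small illustrative sketch of the usable capacity for the RAID levels listed above (simplified; the disk count and disk size are assumed for illustration):

```python
# Illustrative sketch: usable capacity and tolerated disk failures for the
# RAID levels mentioned on this slide (n identical disks of size disk_gb).

def raid_summary(level: int, n_disks: int, disk_gb: float):
    if level == 0:                  # striping only: throughput, no redundancy
        return n_disks * disk_gb, 0
    if level == 1:                  # mirroring (2-way mirror assumed here)
        return n_disks * disk_gb / 2, 1
    if level in (3, 5):             # one parity disk's worth of capacity lost
        return (n_disks - 1) * disk_gb, 1
    raise ValueError("unsupported RAID level in this sketch")

if __name__ == "__main__":
    for level in (0, 1, 3, 5):
        usable, failures = raid_summary(level, n_disks=8, disk_gb=146.0)
        print(f"RAID {level}: {usable:7.1f} GB usable, "
              f"survives at least {failures} disk failure(s)")
```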

SLIDE 21

File Systems for Clusters

SLIDE 22

Topologies

  • Roughly two classes
  • Shared storage class
  • Network Centric class

– Shared nothing class

SLIDE 23

Shared Storage Class

  • Sharing physical devices (disks)
    – Mainly by using a Fibre Channel network (SAN)
    – IP SAN with iSCSI is also possible
    – SRP within Infiniband
    – (Using a metadata server to organize disk access)

[Diagram: SAN]

SLIDE 24

Shared Storage Class - Implementation

  • Topology (CXFS, OpenGFS, SNFS and NEC GFS)

[Diagram: CXFS server and CXFS clients attached to a SAN, coordinated over a (private) IP network]

SLIDE 25

Network Centric Class

  • The storage is in the network
    – On storage nodes
    – May be on all nodes

[Diagram: nodes with storage connected by a network]

SLIDE 26

PVFS Topology

[Diagram: PVFS topology with nodes and the mgr daemon connected over the network]

SLIDE 27

File Systems for Clusters

  • Symmetric Clustered File System (e.g. GPFS): lock management is bottleneck
  • Parallel File System (like PFS): asymmetric, metadata (MD) server is bottleneck
  • Distributed File System (e.g. NFS/CIFS): server is bottleneck
  • SAN based File Systems (like SANergy): server is bottleneck, scale limited

[Diagrams: for each class, clients (C) connected to a server, component servers, a metadata server or a SAN]

SLIDE 28

Lustre Solution

  • Asymmetric cluster file system
  • Scalable
  • MDS handles object allocation, OSTs handle block allocation (a toy sketch follows below)

[Diagram: clients (C), OSTs and an MDS cluster]
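The division of labour mentioned above can be illustrated with a toy model (this is not Lustre code; the stripe size, function names and allocation policy are invented purely for illustration):

```python
# Toy model (not Lustre code): a pretend metadata server decides which OSTs
# hold a file's objects; clients then stripe data round-robin over those OSTs.
from dataclasses import dataclass

STRIPE_SIZE = 1 << 20          # 1 MB stripes (assumed for illustration)

@dataclass
class Layout:
    osts: list                 # OST indices chosen by the "MDS" for this file

def mds_allocate(filename: str, num_osts: int, stripe_count: int) -> Layout:
    """Pretend MDS: pick stripe_count OSTs for a new file (object allocation)."""
    start = sum(map(ord, filename)) % num_osts
    return Layout(osts=[(start + i) % num_osts for i in range(stripe_count)])

def ost_for_offset(layout: Layout, offset: int) -> int:
    """Pretend client: map a byte offset to the OST holding that stripe."""
    stripe_index = offset // STRIPE_SIZE
    return layout.osts[stripe_index % len(layout.osts)]

if __name__ == "__main__":
    layout = mds_allocate("results.dat", num_osts=8, stripe_count=4)
    for offset in (0, STRIPE_SIZE, 5 * STRIPE_SIZE):
        print(f"offset {offset:>8}: OST {ost_for_offset(layout, offset)}")
```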

SLIDE 29

Necessary features of a Cluster File System

  • Accessibility/Global Namespace
  • Access method
  • Authorization (and Accounting), Security
  • Maturity
  • Safety, Reliability, Availability
  • Parallelism
  • Scalability
  • Performance
  • Interfaces for Backup, Archiving and HSM Systems
  • (Costs)
SLIDE 30

HLRS File System Benchmark

  • The disk-I/O Benchmark

– Allows throughput measurements for reads and writes

  • Arbitrary file size
  • Arbitrary I/O chunk size
  • Arbitrary number of performing processes

– Allows metadata performance measurements

  • file creation, file status (list), file deletion rate
  • with an arbitrary number of processes (p-threads or MPI); a minimal single-process sketch follows below
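The HLRS benchmark itself is not reproduced in the slides; the following is a minimal single-process sketch of the two kinds of measurement it performs, with file names, sizes and chunk sizes chosen only for illustration:

```python
# Minimal single-process sketch of a disk-I/O benchmark in the spirit of the
# slides: write/read throughput for a given chunk size plus metadata
# (create/stat/delete) rates. Not the HLRS benchmark itself.
import os, time

def throughput_mb_s(path: str, total_mb: int = 256, chunk_mb: int = 1):
    chunk = b"x" * (chunk_mb << 20)
    t0 = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(total_mb // chunk_mb):
            f.write(chunk)
        f.flush(); os.fsync(f.fileno())        # make sure data hits the disk
    write_mb_s = total_mb / (time.perf_counter() - t0)
    t0 = time.perf_counter()
    with open(path, "rb") as f:                # note: a real benchmark would
        while f.read(chunk_mb << 20):          # avoid the page cache, e.g. by
            pass                               # using files larger than memory
    read_mb_s = total_mb / (time.perf_counter() - t0)
    os.remove(path)
    return write_mb_s, read_mb_s

def metadata_rates(directory: str, n_files: int = 50):
    names = [os.path.join(directory, f"bench_{i}") for i in range(n_files)]
    t0 = time.perf_counter()
    for n in names:
        open(n, "w").close()                   # file creation
    create = n_files / (time.perf_counter() - t0)
    t0 = time.perf_counter()
    for n in names:
        os.stat(n)                             # file status (list)
    stat = n_files / (time.perf_counter() - t0)
    t0 = time.perf_counter()
    for n in names:
        os.remove(n)                           # file deletion
    delete = n_files / (time.perf_counter() - t0)
    return create, stat, delete

if __name__ == "__main__":
    print("write/read MB/s:", throughput_mb_s("bench.dat"))
    print("create/stat/delete ops/s:", metadata_rates("."))
```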
SLIDE 31

Measurement Method

  • Measurement with disk I/O

– HLRS file system benchmark

  • Measurement of throughput
    – Depending on the I/O chunk size (1, 2, 4, 8 and 16 MB chunks)
    – For clients and server
  • Measurement of metadata performance
    – Essential for cluster file systems
    – Measured on clients and servers
    – File creation, file status and file deletion rate
    – With 1, 5, 10 and 20 processes on a client
    – 50 files per process

SLIDE 32

Measurement Environment

  • CXFS
    – Server: Origin 3000, 8 procs, 6 GByte buffer cache, 2x2 GBit FC
    – Client: Origin 3000, 20 procs, 20 GByte buffer cache, 2x2 GBit FC
    – RAID: Data Direct Networks
  • PVFS
    – 4 IA-64 systems, 2 procs, 8 GB memory, 36 GB local disk each; symmetric setup
  • Lustre
    – 7 IA-32 systems, Pentium III 1 GHz, 18 GB local disk; 1 MDS, 2 OSTs, 4 clients; Lustre 0.7 (old version!)
  • Measurements were performed in 2003
SLIDE 33

Throughput Client/Server, large file, 1 MB I/O chunks

[Chart: write/read throughput in MB/s (scale up to 250) on the client and on the server/local for CXFS, NEC GFS (Az) and NEC GFS (SX)]

SLIDE 34

Throughput remote/local disk, large file, 1 MB I/O chunks

[Chart: write/read throughput in MB/s (scale up to 70) on the client and on the local disk for PVFS, Lustre with 1 OST and Lustre with 2 OSTs]

SLIDE 35

Metadata Performance Client, Create (I)

[Chart: create operations/s (scale up to 2500) vs. number of processes (1, 5, 10, 20) for GFS(SX), GFS(Az), PVFS, Lustre and CXFS]

SLIDE 36

Metadata Performance Client, Create (II)

[Chart: create operations/s (scale up to 200) vs. number of processes (1, 5, 10, 20) for GFS(SX), GFS(Az), PVFS and Lustre]

SLIDE 37

Cluster of Workstations (COW)

[Diagram: Cluster of Workstations with three networks: a system network (Gigabit Ethernet), a high performance computing and visualization network (Myrinet/Quadrics/Infiniband), and a fast disk I/O network (Fibre Channel)]