PAD Cluster: An Open, Modular and Low Cost High Performance Computing System


SLIDE 1

PAD Cluster:

An Open, Modular and Low Cost High Performance Computing System

Volnys Borges Bernal Sergio Takeo Kofuji Guilherme Matos Sipahi Marcio Lobo Netto Laboratório de Sistemas Integráveis, EPUSP Alan G. Anderson Elebra Defesa e Controles Ltda

SLIDE 2

Agenda

  • Main Objectives
  • PAD Cluster Environment
  • PAD Cluster Architecture
  • Communication Libraries
  • System Administrator Tools
  • Operator Tools
  • User Tools
  • Development Environment

SLIDE 3

PAD Cluster

  • Main goals

– Parallel Cluster-Based Computing Environment

  • Based on Commodity Components
  • High Performance Communication Medium
  • Development Environment for Fortran77, Fortran90 & HPF
  • MPI Interface
  • IEEE POSIX UNIX Interface
  • X-Windows Interface

– Initial Application:

  • RAMS (Regional Atmospheric Modeling System)
  • Development: LSI-EPUSP + Elebra, FINEP support

SLIDE 4

PAD Cluster

  • Characteristics

– Use of High Performance Commodity Components
– Linux Operating System

  • Important:

– Integration

  • Hardware components
  • Software subsystems
SLIDE 5

PAD Cluster Environment

Configuration & Operation:
– ClusterMagic: configuration & replication
– Multiconsole
– Cluster Partitioning
– LSF: job scheduling
– Monitoring System

User Interface and Utilities:
– CDE Windows Interface
– PAD-ptools: parallel UNIX utilities
– POSIX UNIX Interface

Development Tools:
– Compilers: GNU C, C++, F77; Portland F77, F90, HPF
– Tools: Portland Profiler, Portland Debugger
– Libraries: BLAS, BLACS, LaPack, ScaLaPack, MPI (MPICH, FULL, MPICH-FULL), Myrinet API/BPI

SLIDE 6

PAD Cluster Architecture

  • System Architecture

– Processing nodes
– Access Workstation
– Administration Workstation
– Fast-Ethernet switch
– Myrinet Switch
– Synchronization Hardware

[Diagram: eight processing nodes connected to the Myrinet switch, the Fast-Ethernet switch, and the synchronization hardware; multi-serial lines link the nodes to the administration workstation; the access workstation connects the cluster to the external network.]

SLIDE 7

PAD Cluster Architecture

  • Node Architecture

[Diagram: dual Intel Pentium II 333 MHz processors, RAM, PCI bridge, Myrinet controller, Fast-Ethernet controller, SCSI controller, and LM78 hardware monitor.]

SLIDE 8

Communication Infrastructure

  • Primary Network

– Fast-Ethernet
– General purpose network

  • For traditional network services (NFS, DNS, SNMP, XNTP, …)

– Operating System TCP/IP Stack

SLIDE 9

Communication Infrastructure

  • High Performance Network

– Myrinet
– For application data
– Communication Libraries:

  • MPICH over Operating System TCP/IP Stack
  • FULL user level interface library
  • MPICH-FULL user level interface library
SLIDE 10

Communication Libraries

  • MPICH Library

– MPI over TCP/IP stack

  • FULL Library

– User level communication library
– Developed at LSI-EPUSP in 1998
– Implementation based on Cornell's U-Net

  • MPICH-FULL Library

– User level communication library
– Internode communication: MPICH + FULL
– Intranode communication: MPICH + Shared Memory
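The routing rule above (FULL over Myrinet between nodes, shared memory within a node) can be pictured as a small dispatch function. This is a hypothetical illustration of that design choice, not the MPICH-FULL source:

```python
# Hypothetical sketch of MPICH-FULL's transport choice: messages between
# processes on the same node use shared memory; messages between nodes
# take the FULL user-level path over Myrinet.

def choose_transport(src_node: int, dst_node: int) -> str:
    """Pick the communication path for a message between two processes."""
    if src_node == dst_node:
        return "shared-memory"   # intranode: MPICH + shared memory
    return "full-myrinet"        # internode: MPICH + FULL

# Example: ranks 0 and 1 on node 0, ranks 2 and 3 on node 1.
rank_to_node = {0: 0, 1: 0, 2: 1, 3: 1}
print(choose_transport(rank_to_node[0], rank_to_node[1]))  # shared-memory
print(choose_transport(rank_to_node[1], rank_to_node[2]))  # full-myrinet
```

Keeping the dispatch below the MPI layer is what lets applications stay unchanged while the library picks the fastest path.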

SLIDE 11

Communication Libraries

  • MPI-FULL performance

[Plots: Myrinet bandwidth with MPICH-FULL, Mbytes/s vs. message size in bytes, on dual 333 MHz nodes: (a) 2 processes, 1 per node, two nodes; (b) 4 processes, 2 per node, two nodes; (c) shared memory, 2 processes on one node.]
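Bandwidth curves of this kind are typically produced with a ping-pong microbenchmark: one rank sends a message of a given size, the peer echoes it back, and throughput is derived from the round trip. A minimal sketch of the arithmetic (hypothetical, not the PAD benchmark code):

```python
def pingpong_bandwidth(size_bytes: int, round_trip_s: float) -> float:
    """Mbytes/s moved by one ping-pong: 2 * size bytes per round trip."""
    return (2 * size_bytes) / round_trip_s / 1e6

# Made-up example: a 1 MB message echoed back in 40 ms is 50 Mbytes/s,
# the order of magnitude shown on the plots above.
bw = pingpong_bandwidth(1_000_000, 0.040)
print(round(bw))  # 50
```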

SLIDE 12

Communication Infrastructure

  • Synchronization Hardware

– Support for collective MPI operations
– Implemented in FPGA
– Interfaces for 8 nodes
– Based on PAPERS
– Operations

  • barrier
  • broadcast
  • allgather
  • allreduce

– Global Wall Clock
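The semantics of these collectives can be sketched in a few lines of pure Python over a list of per-node values; the FPGA board computes the same results in hardware across its 8 node interfaces (barrier carries no data, so it is omitted). An illustrative sketch of the semantics, not the hardware protocol:

```python
from functools import reduce

def broadcast(values, root=0):
    """Every node receives the root's value."""
    return [values[root]] * len(values)

def allgather(values):
    """Every node receives the full list of per-node values."""
    return [list(values) for _ in values]

def allreduce(values, op=lambda a, b: a + b):
    """Every node receives the reduction of all values."""
    result = reduce(op, values)
    return [result] * len(values)

# Eight nodes, as on the synchronization board.
nodes = [1, 2, 3, 4, 5, 6, 7, 8]
print(allreduce(nodes))          # [36, 36, 36, 36, 36, 36, 36, 36]
print(broadcast(nodes, root=2))  # [3, 3, 3, 3, 3, 3, 3, 3]
```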

SLIDE 13

Communication Infrastructure

  • Serial Lines

– Connect each node to the administration workstation
– Allow remote console access from the administration workstation

SLIDE 14

System Administrator Tools

  • ClusterMagic

– Two main functions:

  • Cluster Configuration
  • Node Replication

– Advantages

  • Easy configuration / reconfiguration
  • Assures uniformity
  • Fast node replication
SLIDE 15

System Administrator Tools

  • ClusterMagic: Cluster Configuration

[Diagram: the operator edits cluster.conf; ClusterMagic generates the node-common files (hosts, hosts.equiv, rhosts, fstab, nsswitch.conf, resolv.conf, ifcfg-lo, profile, inittab, issue, issue.net, motd), node-specific files (HOSTNAME, lilo.conf, exports, network, ifcfg-eth0), and administration files (bootptab, DNS server files).]
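ClusterMagic's generation step can be pictured as template expansion from cluster.conf into the files listed above. A hypothetical sketch producing an /etc/hosts-style file from a made-up configuration (the real cluster.conf format is not shown in the slides):

```python
def make_hosts(domain: str, nodes: dict) -> str:
    """Render /etc/hosts entries from a {hostname: ip} mapping."""
    lines = ["127.0.0.1 localhost"]
    for name, ip in sorted(nodes.items()):
        lines.append(f"{ip} {name}.{domain} {name}")
    return "\n".join(lines) + "\n"

# Made-up example configuration (names and domain are hypothetical).
conf = {"node1": "10.0.0.1", "node2": "10.0.0.2", "adm": "10.0.0.100"}
print(make_hosts("pad.example", conf))
```

Generating every file from one source is what gives the uniformity guarantee mentioned on the previous slide.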

SLIDE 16

System Administrator Tools

  • ClusterMagic: Node Replication

– Node installation based on the replication of a “Womb Node”
– ClusterMagic replication diskette:

  • boots a small Linux system
  • disk partitioning
  • womb image copying
  • configuration files installation
  • boot sector initialization

– Automatic process
– Takes about 12 minutes

SLIDE 17

Operator Tools

  • Xadmin

– Cluster Partitioning
– Remote Commands

  • Multiconsole

– Node console access

  • Job Scheduling

– Job submission
– LSF integrated with Cluster Partitioning

  • Cluster Monitoring
SLIDE 18

Operator Tools

  • Xadmin

– Node partitioning

[Diagram: the cluster partitioning tool groups nodes N0–N11 into partitions P1, P2, and P3.]
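Partitioning 12 nodes into groups like P1-P3 amounts to maintaining a disjoint node-to-partition map. A minimal sketch of the bookkeeping such a tool needs (hypothetical, not the Xadmin implementation):

```python
def partition(nodes, sizes):
    """Split an ordered node list into consecutive, disjoint partitions."""
    assert sum(sizes) <= len(nodes), "more slots requested than nodes"
    parts, start = {}, 0
    for i, size in enumerate(sizes, start=1):
        parts[f"P{i}"] = nodes[start:start + size]
        start += size
    return parts

nodes = [f"N{i}" for i in range(12)]   # N0 .. N11
parts = partition(nodes, [4, 4, 4])
print(parts["P1"])  # ['N0', 'N1', 'N2', 'N3']
```

Disjointness is the key invariant: it is what lets LSF schedule jobs onto a partition without interfering with the others.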

SLIDE 19

Operator Tools

  • Xadmin

– Remote Commands

SLIDE 20

Operator Tools

  • Multiconsole
SLIDE 21

Operator Tools

  • Cluster Monitoring

– Java + SNMP agents

SLIDE 22

User Tools

  • PAD-ptools

– Parallel versions of UNIX utilities
– pcp, pls, pcat, …
– Integrated with cluster partitioning

  • LSF

– Job submission and control

  • mpirun

– MPICH, MPI-FULL
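A parallel utility like pcp or pls fans one command out to every node of the current partition, typically over a remote shell. A hypothetical sketch of building the per-node command lines (command construction only; nothing is executed, and the node names are made up):

```python
def fan_out(nodes, command):
    """Build one remote invocation per node (rsh-style, as was common
    on late-1990s clusters); returns argv lists, runs nothing."""
    return [["rsh", node] + command for node in nodes]

# 'pls'-style example: list a directory on every node of a partition.
partition_nodes = ["node1", "node2", "node3"]
cmds = fan_out(partition_nodes, ["ls", "-l", "/scratch"])
print(cmds[0])  # ['rsh', 'node1', 'ls', '-l', '/scratch']
```

Integration with cluster partitioning then reduces to choosing which node list to pass in.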

SLIDE 23

Development Environment

  • Portland

– Fortran77
– Fortran90
– HPF
– Profiler
– Debugger

  • Libraries

– BLAS, BLACS, LaPack, ScaLaPack

  • TotalView debugger
  • VAMPIR profiler
SLIDE 24

Conclusions

  • Complete product system:

– Elebra Vortix Cluster (PAD Cluster)

  • www.elebra.com.br/aero
  • Several Developments:

– Hardware

  • Collective operations, synchronization, and global clock

– Software

  • Communication Libraries
  • Cluster Tools
  • Communication Drivers
SLIDE 25

Future Work

  • University of São Paulo + Purdue University + University of Pittsburgh

– Hardware for collective operations and synchronization with a 64-bit PCI interface

  • University of São Paulo + ICS-FORTH (Greece)

– ATM-like switch at 2.4 Gbit/s

  • University of São Paulo

– New cluster administration, management, and security tools
– High Availability Database applications

SLIDE 26

Acknowledgments

  • FINEP
  • LSI-EPUSP Development Team
  • Elebra Development Team