

slide-1
SLIDE 1

Cluster Computing

Simplest Scalable Architecture

NOW – Network Of Workstations

slide-2
SLIDE 2

Cluster Computing

Many types of Clusters

(from HP’s Dr. Bruce J. Walker)

  • High Performance Clusters

– Beowulf; 1000 nodes; parallel programs; MPI

  • Load-leveling Clusters

– Move processes around to borrow cycles (e.g. MOSIX)

  • Web-Service Clusters

– LVS; load-levels TCP connections; Web pages and applications

  • Storage Clusters

– parallel filesystems; same view of data from each node

  • Database Clusters

– Oracle Parallel Server;

  • High Availability Clusters

– ServiceGuard, Lifekeeper, Failsafe, heartbeat, failover clusters

slide-3
SLIDE 3

Cluster Computing

Many types of Clusters

(from HP’s Dr. Bruce J. Walker)

  • High Performance Clusters

– Beowulf; 1000 nodes; parallel programs; MPI

  • Load-leveling Clusters

– Move processes around to borrow cycles (e.g. MOSIX)

  • Web-Service Clusters

– LVS; load-levels TCP connections; Web pages and applications

  • Storage Clusters

– parallel filesystems; same view of data from each node

  • Database Clusters

– Oracle Parallel Server;

  • High Availability Clusters

– ServiceGuard, Lifekeeper, Failsafe, heartbeat, failover clusters

NOW type architectures

slide-4
SLIDE 4

Cluster Computing

NOW Approaches

  • Single System View
  • Shared Resources
  • Virtual Machine
  • Single Address Space
slide-5
SLIDE 5

Cluster Computing

Shared System View

  • Load-balancing clusters
  • High availability clusters
  • High Performance

– High throughput
– High capability

slide-6
SLIDE 6

Cluster Computing

Berkeley NOW

slide-7
SLIDE 7

Cluster Computing

NOW Philosophies

  • Commodity is cheaper
  • In 1994, 1 MB of RAM cost

– $40/MB for a PC
– $600/MB for a Cray M90

slide-8
SLIDE 8

Cluster Computing

NOW Philosophies

  • Commodity is faster

CPU              MPP year   WS year
150 MHz Alpha    93-94      92-93
50 MHz i860      92-93      ~91
32 MHz SS-1      91-92      89-90

slide-9
SLIDE 9

Cluster Computing

Network RAM

  • Swapping to disk is extremely expensive

– 16-24 ms for a page swap on disk

  • Network performance is much higher

– 700 µs for a page swap over the net

slide-10
SLIDE 10

Cluster Computing

Network RAM

slide-11
SLIDE 11

Cluster Computing

NOW or SuperComputer?

Machine              Time    Cost
C-90 (16)            27      $30M
RS6000 (256)         27374   $4M
  + ATM              2211    $5M
  + Parallel FS      205     $5M
  + NOW protocol     21      $5M

slide-12
SLIDE 12

Cluster Computing

The Condor System

  • Unix and NT
  • Operational since 1986
  • More than 1300 CPUs at UW-Madison
  • Available on the web
  • More than 150 clusters worldwide in academia and industry

slide-13
SLIDE 13

Cluster Computing

What is Condor?

  • Condor converts collections of distributively owned workstations and dedicated clusters into a high-throughput computing facility.
  • Condor uses matchmaking to make sure that everyone is happy.

slide-14
SLIDE 14

Cluster Computing

What is High-Throughput Computing?

  • High-performance: CPU cycles/second under ideal circumstances.

– “How fast can I run simulation X on this machine?”

  • High-throughput: CPU cycles/day (week, month, year?) under non-ideal circumstances.

– “How many times can I run simulation X in the next month using all available machines?”

slide-15
SLIDE 15

Cluster Computing

What is High-Throughput Computing?

  • Condor does whatever it takes to run your jobs, even if some machines…

– Crash! (or are disconnected)
– Run out of disk space
– Don’t have your software installed
– Are frequently needed by others
– Are far away & admin’ed by someone else

slide-16
SLIDE 16

Cluster Computing

A Submit Description File

# Example condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe   = vanilla
Executable = /home/wright/condor/my_job.condor
Input      = my_job.stdin
Output     = my_job.stdout
Error      = my_job.stderr
Arguments  = -arg1 -arg2
InitialDir = /home/wright/condor/run_1
Queue
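Assuming the file above were saved as, say, my_job.submit (the filename is only illustrative), it would typically be handed to Condor and then monitored with:

condor_submit my_job.submit
condor_q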

slide-17
SLIDE 17

Cluster Computing

What is Matchmaking?

  • Condor uses Matchmaking to make sure that work gets done within the constraints of both users and owners.

  • Users (jobs) have constraints:

– “I need an Alpha with 256 MB RAM”

  • Owners (machines) have constraints:

– “Only run jobs when I am away from my desk and never run jobs owned by Bob.”
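As a rough sketch (the attribute names are illustrative, not taken from the slides), the job-side constraint above could be written as a ClassAd Requirements expression in the submit description file:

Requirements = (Arch == "ALPHA") && (Memory >= 256)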

slide-18
SLIDE 18

Cluster Computing

Process Checkpointing

  • Condor’s Process Checkpointing mechanism saves all the state of a process into a checkpoint file

– Memory, CPU, I/O, etc.

  • The process can then be restarted from right where it left off
  • Typically no changes to your job’s source code are needed – however, your job must be relinked with Condor’s Standard Universe support library

slide-19
SLIDE 19

Cluster Computing

Remote System Calls

  • I/O system calls are trapped and sent back to the submit machine
  • Allows Transparent Migration Across Administrative Domains

– Checkpoint on machine A, restart on B

  • No Source Code changes required
  • Language Independent
  • Opportunities for Application Steering

– Example: Condor tells the customer process “how” to open files
slide-20
SLIDE 20

Cluster Computing

MOSIX and its characteristics

  • Software that can transform a Linux cluster of x86-based workstations and servers to run almost like an SMP
  • Has the ability to distribute and redistribute the processes among the nodes

slide-21
SLIDE 21

Cluster Computing

MOSIX

  • Dynamic migration added to the BSD kernel

– Now Linux

  • Uses TCP/IP for communication between workstations
  • Requires homogeneous networks
slide-22
SLIDE 22

Cluster Computing

MOSIX

  • All processes start their life at the user’s workstation
  • Migration is transparent and preemptive
  • Migrated processes use local resources as much as possible and the resources on the home workstation otherwise

slide-23
SLIDE 23

Cluster Computing

Process Migration in MOSIX

[Diagram: a local process and a migrated process – user-level and kernel layers on two nodes connected at the link layer; the “deputy” part of a migrated process stays on the home node while the “remote” part runs on the target node]

slide-24
SLIDE 24

Cluster Computing

MOSIX

slide-25
SLIDE 25

Cluster Computing

Mosix Make

slide-26
SLIDE 26

Cluster Computing

PVM

  • Task based
  • Tasks can be created at runtime
  • Tasks can be notified on the death of a parent or child
  • Tasks can be grouped
slide-27
SLIDE 27

Cluster Computing

PVM Architecture

  • Daemon-based communication
  • User-defined host list
  • Hosts can be added and removed during execution
  • The virtual machine may be used interactively or in the background

slide-28
SLIDE 28

Cluster Computing

Heterogeneous Computing

  • Runs processes on different architectures
  • Handles conversion between little-endian and big-endian architectures

slide-29
SLIDE 29

Cluster Computing

PVM communication model

  • Explicit message passing
  • Has mechanisms for packing into buffers and unpacking from buffers
  • Supports Asynchronous Communication
  • Supports one-to-many communication

– Broadcast
– Multicast
slide-30
SLIDE 30

Cluster Computing

The virtual machine codes

  • All calls to PVM return an integer; if it is less than zero, this indicates an error
  • pvm_perror();
slide-31
SLIDE 31

Cluster Computing

PVM

slide-32
SLIDE 32

Cluster Computing

Managing the virtual machine

  • Add a host to the virtual machine
  • int info = pvm_addhosts( char **hosts, int nhost, int *infos );
  • Deleting a host from the virtual machine
  • int info = pvm_delhosts( char **hosts, int nhost, int *infos );
  • Shutting down the virtual machine
  • int info = pvm_halt( void );
slide-33
SLIDE 33

Cluster Computing

Managing the virtual machine

  • Reading the virtual machine configuration
  • int info = pvm_config( int *nhost, int *narch, struct pvmhostinfo **hostp );
  • struct pvmhostinfo {
        int hi_tid;
        char *hi_name;
        char *hi_arch;
        int hi_speed;
    } hostp;
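A minimal C sketch of how these calls fit together (the hostname is made up; error handling is limited to the negative-return check described earlier):

#include <stdio.h>
#include <pvm3.h>

int main(void)
{
    char *hosts[] = { "node1.example.org" };   /* hypothetical hostname */
    int infos[1], nhost, narch, i;
    struct pvmhostinfo *hostp;

    pvm_mytid();                               /* enroll this task in PVM */
    if (pvm_addhosts(hosts, 1, infos) < 0)     /* negative return = error */
        pvm_perror("pvm_addhosts");

    pvm_config(&nhost, &narch, &hostp);        /* read the current configuration */
    for (i = 0; i < nhost; i++)
        printf("%s (%s) speed %d\n", hostp[i].hi_name, hostp[i].hi_arch, hostp[i].hi_speed);

    pvm_exit();
    return 0;
}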

slide-34
SLIDE 34

Cluster Computing

Managing the virtual machine

  • Check the status of a node
  • int mstat = pvm_mstat(char *host);
  • PvmOk host is OK
  • PvmNoHost host is not in virtual machine
  • PvmHostFail host is unreachable (and thus possibly failed)

slide-35
SLIDE 35

Cluster Computing

Tasks

  • PVM tasks can be created and killed during execution

  • id = pvm_mytid();
  • cnt = pvm_spawn(image, argv, flag, node, num, tids);
  • pid = pvm_parent();
  • pvm_kill(tids[0]);
  • pvm_exit();
  • int status = pvm_pstat( tid );
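A small sketch of the task calls above (the worker executable name "worker" and the number of tasks are only illustrative):

#include <stdio.h>
#include <pvm3.h>

int main(void)
{
    int tids[4];
    int mytid = pvm_mytid();                            /* enroll and get own task id */
    int cnt = pvm_spawn("worker", (char **)0, PvmTaskDefault, "", 4, tids);

    printf("task %x spawned %d workers\n", mytid, cnt);
    if (cnt > 0)
        pvm_kill(tids[0]);                              /* kill the first worker again */
    pvm_exit();                                         /* leave PVM */
    return 0;
}

Inside each worker, pvm_parent() would return the tid of the spawning task.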
slide-36
SLIDE 36

Cluster Computing

Tasks

  • int info = pvm_tasks( int where, int *ntask, struct pvmtaskinfo **taskp );

  • struct pvmtaskinfo {
        int ti_tid;
        int ti_ptid;
        int ti_host;
        int ti_flag;
        char *ti_a_out;
        int ti_pid;
    } taskp;

slide-37
SLIDE 37

Cluster Computing

Managing IO

  • In the newest version of PVM, output may be redirected to the parent
  • int bufid = pvm_catchout( FILE *ff );
slide-38
SLIDE 38

Cluster Computing

Asynchronous events

  • Notifications on special events
  • info = pvm_notify(event, tag, cnt, tids);
  • info = pvm_sendsig(tid, signal);
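For example, after a pvm_spawn the parent could ask to be told when any of its workers exits (a sketch: the tag 99 and the variable names are arbitrary, and for PvmTaskExit the notify message is expected to carry the tid of the dead task):

pvm_notify(PvmTaskExit, 99, cnt, tids);   /* one notify message per listed task */
...
pvm_recv(-1, 99);                         /* arrives when a worker dies */
pvm_upkint(&deadtid, 1, 1);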
slide-39
SLIDE 39

Cluster Computing

Groups

  • Groups allow for easy fragmentation of the execution in an application

  • num=pvm_joingroup("worker");
  • size = pvm_gsize("worker");
  • info = pvm_lvgroup("worker");
  • int inum = pvm_getinst( char *group, int tid )
  • int tid = pvm_gettid( char *group, int inum )
slide-40
SLIDE 40

Cluster Computing

Buffers

  • PVM applications have a default send and a default receive buffer

  • buf=pvm_initsend(Default|Raw|In place);
  • info = pvm_pk(type)(data,10,1);
  • info = pvm_upk(type)(data,10,1);
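A short sketch of the default-buffer calls (Default|Raw|In place on the slide correspond to the PvmDataDefault, PvmDataRaw and PvmDataInPlace constants; the tag 7 and the variable names are arbitrary):

double data[10];

pvm_initsend(PvmDataDefault);    /* clear and select the default send buffer */
pvm_pkdouble(data, 10, 1);       /* pvm_pk(type) -> pvm_pkdouble, pvm_pkint, ... */
pvm_send(desttid, 7);

/* on the receiving task */
pvm_recv(-1, 7);                 /* -1 = from any task */
pvm_upkdouble(data, 10, 1);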
slide-41
SLIDE 41

Cluster Computing

Managing Buffers

  • info = pvm_mkbuf(Default|Raw|In place);
  • oldbuf = pvm_setrbuf(bufid);
  • oldbuf = pvm_setsbuf(bufid);
  • int info = pvm_freebuf( int bufid )
  • int bufid = pvm_getrbuf( void );
  • int bufid = pvm_getsbuf( void );
slide-42
SLIDE 42

Cluster Computing

Receiving messages

  • Messages may be received blocking or nonblocking

  • bufid = pvm_probe(tid, tag);
  • bufid = pvm_recv(tid, tag);
  • bufid = pvm_trecv(tid, tag, tmout);
  • bufid = pvm_nrecv(tid, tag);
  • info = pvm_precv(tid, tag, array, cnt, type, &atid, &atag, &acnt);

slide-43
SLIDE 43

Cluster Computing

Sending messages

  • Messages can also be sent in various ways
  • info = pvm_send(tid, tag);
  • info = pvm_psend(tid, tag, data, cnt, type);
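As a sketch, pvm_psend and pvm_precv combine packing and sending (and receiving and unpacking) in one call; the tag 5 and the variable names are arbitrary:

double v[100];
int rtid, rtag, rcnt;

pvm_psend(desttid, 5, v, 100, PVM_DOUBLE);                  /* pack + send */

/* on the receiving task */
pvm_precv(-1, 5, v, 100, PVM_DOUBLE, &rtid, &rtag, &rcnt);  /* recv + unpack */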
slide-44
SLIDE 44

Cluster Computing

Managing Buffers

  • info = pvm_mkbuf(Default|Raw|In place);
  • oldbuf = pvm_setrbuf(bufid);
  • oldbuf = pvm_setsbuf(bufid);
  • int info = pvm_bufinfo( int bufid, int *bytes, int *msgtag, int *tid );

slide-45
SLIDE 45

Cluster Computing

Global reductions

  • Global reductions are useful for a wide array of parallel applications
  • info = pvm_reduce(PvmMax, &data, cnt, type, tag, "workers", roottid);

slide-46
SLIDE 46

Cluster Computing

PVM Reductions

  • Global
  • Sum
  • Product
  • Min
  • Max
slide-47
SLIDE 47

Cluster Computing

PVM Synchronizations

  • Barrier
  • inum=pvm_joingroup("worker");
  • pvm_barrier("worker",5);
slide-48
SLIDE 48

Cluster Computing

Broadcast

  • Sends the active buffer to all members of a group
  • info = pvm_bcast("worker", 42);
  • NOTE: the task that issues a broadcast need not be a member of the group!

slide-49
SLIDE 49

Cluster Computing

Multicasting

  • A message can be sent to a number of tasks without the existence of a shared group
  • info = pvm_mcast(list, number, 42);
slide-50
SLIDE 50

Cluster Computing

An example

  • Finite differences
  • Well-known technique for solving differential equations
  • The one-dimensional version is trivial if we don’t need information on the evolution in time

slide-51
SLIDE 51

Cluster Computing

The model

slide-52
SLIDE 52

Cluster Computing

The example

slide-53
SLIDE 53

Cluster Computing

First Solution

If left neighbor exists then
    read data from left
    send data to the left
Update points 0..n-1
If right neighbor exists then
    read data from right
    send data to the right
    update point n

slide-54
SLIDE 54

Cluster Computing

Problems with Solution 1?

  • Results in serialization!
  • We must eliminate this serialization
slide-55
SLIDE 55

Cluster Computing

Second Solution

If left neighbor exists then
    read data from left
    send data to the left
If right neighbor exists then
    send data to the right
    read data from right
Update points 0..n

slide-56
SLIDE 56

Cluster Computing

Problems with Solution 2

  • Enforced strict synchronous execution
  • Slowest Task dictates progress
  • All communication takes place at the same time

  • Stresses the communication network
slide-57
SLIDE 57

Cluster Computing

Solution 3

If left neighbor exists then
    send data to the left
If right neighbor exists then
    send data to the right
Update points 1..n-1
If left neighbor exists then
    read data from left
    Update point 0
If right neighbor exists then
    read data from right
    Update point n
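A rough PVM sketch of Solution 3 for one task (left/right hold the neighbours' tids or -1 if absent, u[0..n] are the local points, and the tags and the update helpers are hypothetical):

if (left  != -1) pvm_psend(left,  1, &u[0], 1, PVM_DOUBLE);
if (right != -1) pvm_psend(right, 2, &u[n], 1, PVM_DOUBLE);

update_points(1, n-1);                                     /* interior points need no neighbour data */

if (left != -1) {
    pvm_precv(left, 2, &ghost_left, 1, PVM_DOUBLE, &rt, &rg, &rc);
    update_point(0);
}
if (right != -1) {
    pvm_precv(right, 1, &ghost_right, 1, PVM_DOUBLE, &rt, &rg, &rc);
    update_point(n);
}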

slide-58
SLIDE 58

Cluster Computing

Problems with solution 3

  • Practically none!
  • Only potential improvement is to overlap communication and calculation (latency hiding)

slide-59
SLIDE 59

Cluster Computing

Solution 4

If left neighbor exists then
    issue_read data from left
    issue_send data to the left
If right neighbor exists then
    issue_read data from right
    issue_send data to the right
Update points 1..n-1
Finish_any_read; Update corresponding point
Finish_any_read; Update corresponding point

slide-60
SLIDE 60

Cluster Computing

Matrix Multiplication

Used extremely frequently in scientific applications

slide-61
SLIDE 61

Cluster Computing

Naïve version

mxmul(REAL **c, REAL **a, REAL **b, int n)
{
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
}

The performance of the naïve version may be improved by maintaining B in its transposed form!!
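A sketch of the transposed-B variant hinted at above: with bt holding B transposed, the inner loop walks both operands row by row, which is much friendlier to the cache.

mxmul_t(REAL **c, REAL **a, REAL **bt, int n)   /* bt[j][k] == b[k][j] */
{
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i][j] += a[i][k] * bt[j][k];
}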

slide-62
SLIDE 62

Cluster Computing

Blocked Sequential Version

bmul(REAL **c, REAL **a, REAL **b, int is, int js, int bs, int n)
{
    int i, j, k;
    for (i = is*bs; i < is*bs + bs; i++)
        for (j = js*bs; j < js*bs + bs; j++)
            for (k = 0; k < n; k++)
                C(i,j) += A(i,k) * B(k,j);
}

mxmul(REAL **c, REAL **a, REAL **b, int n)
{
    int i, j, k;
    for (i = 0; i < n; i += bs)
        for (j = 0; j < n; j += bs)
            bmul(i, i+bs, j, j+bs);
}

slide-63
SLIDE 63

Cluster Computing

Performance of the Basic versions

slide-64
SLIDE 64

Cluster Computing

Recursive Version

Matrix C mxmul(Matrix A, Matrix B, int s)
{
    if (s == 1)
        C = A*B;
    else {
        s = s/2;
        p0 = mxmul(UL(A), UL(B), s);
        p1 = mxmul(UR(A), LL(B), s);
        p2 = mxmul(UL(A), UR(B), s);
        p3 = mxmul(UR(A), LR(B), s);
        p4 = mxmul(LL(A), UL(B), s);
        p5 = mxmul(LR(A), LL(B), s);
        p6 = mxmul(LL(A), UR(B), s);
        p7 = mxmul(LR(A), LR(B), s);
        UL(C) = p0 + p1;
        UR(C) = p2 + p3;
        LL(C) = p4 + p5;
        LR(C) = p6 + p7;
    }
    return C;
}

slide-65
SLIDE 65

Cluster Computing

Blocked Parallel Version

  • If we have a broadcast medium then we can efficiently broadcast blocks to all workers

slide-66
SLIDE 66

Cluster Computing

Blocked Parallel version

  • Done in W broadcasts using W workers!
slide-67
SLIDE 67

Cluster Computing

Blocked Version in PVM

  • All workers hold one row-block and the corresponding column block
  • Worker zero first broadcasts its column block, then worker one, and so forth
  • The result is that exactly the size of B is broadcast in W blocks

slide-68
SLIDE 68

Cluster Computing

Main

main(int argc, char **argv)
{
    int bs;
    char msg[1024];

    N = atoi(argv[1]);
    bs = atoi(argv[2]);
    size = atoi(argv[3]);

    pvm_joingroup("workers");
    rank = pvm_getinst("workers", pvm_mytid());

    basicBsize = N/size;
    lastBsize = basicBsize + N%size;
    if (rank == size-1) myBsize = lastBsize;
    else myBsize = basicBsize;

    a = (REAL *)malloc(N*lastBsize*sizeof(REAL));  /* same for b, tb and c */

    mmul(bs);
    pvm_exit();
}

slide-69
SLIDE 69

Cluster Computing

Main loop

mmul(int bs)
{
    int w, i, j, k;
    int src, atag, acnt;
    REAL *t = tb;

    for (w = 0; w < size; w++) {
        pvm_initsend(PVM_COM_MODEL);
        if (rank == w) {
            tb = b;
            pvm_pkreal(b, N*(w == size-1 ? lastBsize : basicBsize), 1);
            pvm_bcast("workers", 100+w);
        } else {
            pvm_recv(-1, 100+w);
            pvm_upkreal(tb, N*(w == size-1 ? lastBsize : basicBsize), 1);
        }
        for (i = 0; i < myBsize; i += bs)
            for (j = 0; j < myBsize; j += bs)
                bmul(i, i+bs, j, j+bs);
        tb = t;
    }
}

slide-70
SLIDE 70

Cluster Computing

How may this version be improved?

  • Overlapping communication and calculation

slide-71
SLIDE 71

Cluster Computing

Summary

  • PVM is similar to programming with threads - except you need message passing
  • At first, parallel programs may be very inefficient

  • More efficient programs are more complex
slide-72
SLIDE 72

Cluster Computing

Programming NOW

  • Dynamic load balancing
  • Dynamic orchestration
slide-73
SLIDE 73

Cluster Computing

Dynamic Load Balancing

  • Base your applications on redundant parallelism
  • Rely on the OS to balance the application over the CPUs
  • Rather few applications can be orchestrated in this way
slide-74
SLIDE 74

Cluster Computing

Barnes Hut

  • Galaxy simulations are still quite interesting
  • Basic formula is the gravitational force: F = G·m1·m2 / r²
  • Naïve algorithm is O(n²)

slide-75
SLIDE 75

Cluster Computing

Barnes Hut

slide-76
SLIDE 76

Cluster Computing

Barnes Hut

O(n log n)

slide-77
SLIDE 77

Cluster Computing

Balancing Barnes Hut

slide-78
SLIDE 78

Cluster Computing

Dynamic Orchestration

  • Divide your application into a job-queue
  • Spawn workers
  • Let the workers take and execute jobs from the queue
  • Not all applications can be orchestrated in this way
  • Does not scale well – the job-queue process may become a bottleneck

slide-79
SLIDE 79

Cluster Computing

Parallel integration

slide-80
SLIDE 80

Cluster Computing

Parallel integration

  • Split the outer integral
  • Jobs = range(x1, x2, interval)
  • Tasks = integral with x1 = Jobs[i], x2 = Jobs[i+1]; for i in len(Jobs)-1

  • Result = Sum(Execute(Tasks))
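In C the job split might look roughly like this (x1, x2, interval and integrate_strip() are hypothetical; in the job-queue setting each strip becomes one job handed to a worker):

double total = 0.0, xa;
for (xa = x1; xa < x2; xa += interval) {
    double xb = (xa + interval < x2) ? xa + interval : x2;
    total += integrate_strip(xa, xb);   /* one job: integrate the strip [xa, xb] */
}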
slide-81
SLIDE 81

Cluster Computing

Genetic Algorithms

  • Genetic algorithms are very well suited for NOW-type architectures

– Require much processing time
– Little communication
– Many independent blocks

slide-82
SLIDE 82

Cluster Computing

Example

  • Based on Conway’s Game of Life
  • We have an area with weed

– Bacteria
– Or another simple organism

  • Life in this scenario is governed by very simple rules
  • We desire an initial setup that returns the most life after exactly 100 iterations

slide-83
SLIDE 83

Cluster Computing

Rules

  • A cell with fewer than 2 neighbors dies from loneliness
  • A cell with more than 3 neighbors dies from crowding
  • A living cell with 2 or 3 neighbors survives to the next generation
  • A dead cell with exactly 3 neighbors springs to life by reproduction
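The four rules condense to a small state function, sketched here in C (neighbors is assumed to be the number of live cells among the eight surrounding ones):

int next_state(int alive, int neighbors)
{
    if (alive && neighbors < 2) return 0;   /* dies from loneliness */
    if (alive && neighbors > 3) return 0;   /* dies from crowding */
    if (alive)                  return 1;   /* 2 or 3 neighbors: survives */
    return neighbors == 3;                  /* reproduction */
}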

slide-84
SLIDE 84

Cluster Computing

Approach

  • Let the computer test

– Various initial population sizes
– Various mutation rates

  • Run a parallel solution finder using the island model

– Where each node in a NOW runs independently from the others
– But nodes exchange champions every once in a while