Cluster Computing
Many types of Clusters
(from HP’s Dr. Bruce J. Walker)
- High Performance Clusters
– Beowulf; 1000 nodes; parallel programs; MPI
- Load-leveling Clusters
– Move processes around to borrow cycles (e.g., MOSIX)
- Web-Service Clusters
– LVS; load-levels TCP connections; Web pages and applications
- Storage Clusters
– parallel filesystems; same view of data from each node
- Database Clusters
– Oracle Parallel Server
- High Availability Clusters
– ServiceGuard, Lifekeeper, Failsafe, heartbeat, failover clusters
NOW-type architectures
Cluster Computing
NOW Approaches
- Single System View
- Shared Resources
- Virtual Machine
- Single Address Space
Cluster Computing
Shared System View
- Load-balancing clusters
- High availability clusters
- High Performance
– High throughput
– High capability
Cluster Computing
Berkeley NOW
Cluster Computing
NOW Philosophies
- Commodity is cheaper
- In 1994, RAM cost
– $40/MB for a PC
– $600/MB for a Cray M90
Cluster Computing
NOW Philosophies
- Commodity is faster
CPU              MPP year   WS year
150 MHz Alpha    93-94      92-93
50 MHz i860      92-93      ~91
32 MHz SS-1      91-92      89-90
Cluster Computing
Network RAM
- Swapping to disk is extremely expensive
– 16-24 ms for a page swap on disk
- Network performance is much higher
– ~700 µs for a page swap over the network (more than 20x faster)
Cluster Computing
Network RAM
Cluster Computing
NOW or SuperComputer?
Machine            Time    Cost
C-90 (16)          27      $30M
RS6000 (256)       27374   $4M
  + ATM            2211    $5M
  + Parallel FS    205     $5M
  + NOW protocol   21      $5M
Cluster Computing
The Condor System
- Unix and NT
- Operational since 1986
- More than 1300 CPUs at UW-Madison
- Available on the web
- More than 150 clusters worldwide in
academia and industry
Cluster Computing
What is Condor?
- Condor converts collections of
distributively owned workstations and dedicated clusters into a high- throughput computing facility.
- Condor uses matchmaking to make
sure that everyone is happy.
Cluster Computing
What is High-Throughput Computing?
- High-performance: CPU cycles/second under
ideal circumstances.
– “How fast can I run simulation X on this machine?”
- High-throughput: CPU cycles/day (week, month,
year?) under non-ideal circumstances.
– “How many times can I run simulation X in the next month using all available machines?”
Cluster Computing
What is High-Throughput Computing?
- Condor does whatever it takes to run your
jobs, even if some machines…
– Crash! (or are disconnected)
– Run out of disk space
– Don’t have your software installed
– Are frequently needed by others
– Are far away & admin’ed by someone else
Cluster Computing
A Submit Description File
# Example condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe = vanilla
Executable = /home/wright/condor/my_job.condor
Input = my_job.stdin
Output = my_job.stdout
Error = my_job.stderr
Arguments = -arg1 -arg2
InitialDir = /home/wright/condor/run_1
Queue
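Assuming the file above were saved as, say, my_job.submit (the filename is illustrative), it would be handed to the scheduler with:

condor_submit my_job.submit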
Cluster Computing
What is Matchmaking?
- Condor uses Matchmaking to make sure that
work gets done within the constraints of both users and owners.
- Users (jobs) have constraints:
– “I need an Alpha with 256 MB RAM”
- Owners (machines) have constraints:
– “Only run jobs when I am away from my desk and never run jobs owned by Bob.”
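As a rough illustration, these two constraints map onto Condor ClassAd expressions roughly like the sketch below. Arch, Memory, KeyboardIdle, and RemoteUser are standard ClassAd attribute names, but the exact expressions here are an assumption, not taken from the slides:

# user side, in the submit description file (Memory is in MB)
Requirements = Arch == "ALPHA" && Memory >= 256

# owner side, in the machine's Condor configuration: only start jobs
# after 15 idle minutes, and never jobs submitted by bob
START = KeyboardIdle > 15 * 60 && RemoteUser =!= "bob@cs.wisc.edu"

During matchmaking, the negotiator pairs job ads with machine ads whose expressions mutually evaluate to true.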
Cluster Computing
Process Checkpointing
- Condor’s Process Checkpointing
mechanism saves all the state of a process into a checkpoint file
– Memory, CPU, I/O, etc.
- The process can then be restarted from
right where it left off
- Typically no changes to your job’s source
code needed – however, your job must be relinked with Condor’s Standard Universe support library
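The relink is typically a one-line change to the build: invoke the compiler through Condor’s condor_compile wrapper (file names here are illustrative) and submit with Universe = standard instead of vanilla:

% condor_compile gcc -o my_job my_job.c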
Cluster Computing
Remote System Calls
- I/O System calls trapped and sent back to
submit machine
- Allows Transparent Migration Across
Administrative Domains
– Checkpoint on machine A, restart on B
- No Source Code changes required
- Language Independent
- Opportunities for Application Steering
– Example: Condor tells the customer process “how”
to open files
Cluster Computing
MOSIX and its characteristics
- Software that can transform a Linux cluster of
x86-based workstations and servers to run almost like an SMP
- Has the ability to distribute and redistribute the
processes among the nodes
Cluster Computing
MOSIX
- Dynamic migration added to the BSD
kernel
– Now Linux
- Uses TCP/IP for communication between
workstations
- Requires Homogeneous networks
Cluster Computing
MOSIX
- All processes start their life at the user’s
workstation
- Migration is transparent and preemptive
- Migrated processes use local resources as
much as possible and the resources on the home workstation otherwise
Cluster Computing
Process Migration in MOSIX
[Figure: a local process and a migrated process. The migrated process is split into a “deputy” that remains on the home node and a “remote” part on the target node; both span the user level, kernel, and link layer.]
Cluster Computing
MOSIX
Cluster Computing
Mosix Make
Cluster Computing
PVM
- Task based
- Tasks can be created at runtime
- Tasks can be notified on the death of a parent
or child
- Tasks can be grouped
Cluster Computing
PVM Architecture
- Daemon-based communication
- User defined host list
- Hosts can be added and removed during
execution
- The virtual machine may be used
interactively or in the background
Cluster Computing
Heterogeneous Computing
- Runs processes on different architectures
- Handles conversion between little endian
and big endian architectures
Cluster Computing
PVM communication model
- Explicit message passing
- Has mechanisms for packing into buffers
and unpacking from buffers
- Supports Asynchronous Communication
- Supports one-to-many communication
– Broadcast
– Multicast
Cluster Computing
The virtual machine codes
- All calls to PVM return an integer; a value less
than zero indicates an error
- pvm_perror(); prints a description of the last error
Cluster Computing
PVM
Cluster Computing
Managing the virtual machine
- Add a host to the virtual machine
- int info = pvm_addhosts( char **hosts, int nhost, int *infos );
- Deleting a host in the virtual machine
- int info = pvm_delhosts( char **hosts, int nhost, int *infos );
- Shutting down the virtual machine
- int info = pvm_halt( void );
Cluster Computing
Managing the virtual machine
- Reading the virtual machine configuration
- int info = pvm_config( int *nhost, int *narch, struct pvmhostinfo **hostp );
- struct pvmhostinfo {
      int hi_tid;
      char *hi_name;
      char *hi_arch;
      int hi_speed;
  };
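A small sketch using pvm_config() to list every host in the virtual machine (the output format is arbitrary):

#include <stdio.h>
#include "pvm3.h"

void list_hosts(void)
{
    int nhost, narch, i;
    struct pvmhostinfo *hostp;

    if (pvm_config(&nhost, &narch, &hostp) >= 0)
        for (i = 0; i < nhost; i++)
            printf("%s (%s), relative speed %d\n",
                   hostp[i].hi_name, hostp[i].hi_arch, hostp[i].hi_speed);
}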
Cluster Computing
Managing the virtual machine
- Check the status of a node
- int mstat = pvm_mstat(char *host);
- PvmOk: host is OK
- PvmNoHost: host is not in the virtual machine
- PvmHostFail: host is unreachable (and thus possibly failed)
Cluster Computing
Tasks
- PVM tasks can be created and killed during
execution
- id = pvm_mytid();
- cnt = pvm_spawn(image, argv, flag, node, num, tids);
- pid = pvm_parent();
- pvm_kill(tids[0]);
- pvm_exit();
- int status = pvm_pstat( tid );
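Putting these calls together, a minimal master/worker pair might look like the sketch below, assuming a worker binary named hello_other installed where PVM can find it; the names and the message tag are illustrative:

master.c:

#include <stdio.h>
#include "pvm3.h"

int main(void)
{
    int tid;                 /* tid of the spawned worker */
    char buf[100];

    pvm_mytid();             /* enroll this process in PVM */
    if (pvm_spawn("hello_other", NULL, PvmTaskDefault, "", 1, &tid) == 1) {
        pvm_recv(-1, -1);    /* block until any message arrives */
        pvm_upkstr(buf);
        printf("from t%x: %s\n", tid, buf);
    }
    pvm_exit();
    return 0;
}

hello_other.c:

#include <stdio.h>
#include "pvm3.h"

int main(void)
{
    int ptid = pvm_parent(); /* tid of the task that spawned us */
    char buf[100];

    sprintf(buf, "hello from t%x", pvm_mytid());
    pvm_initsend(PvmDataDefault);
    pvm_pkstr(buf);
    pvm_send(ptid, 1);       /* message tag 1 */
    pvm_exit();
    return 0;
}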
Cluster Computing
Tasks
- int info = pvm_tasks( int where, int *ntask, struct pvmtaskinfo **taskp );
- struct pvmtaskinfo {
      int ti_tid;
      int ti_ptid;
      int ti_host;
      int ti_flag;
      char *ti_a_out;
      int ti_pid;
  };
Cluster Computing
Managing IO
- In the newest version of PVM output may
be redirected to the parent
- int info = pvm_catchout( FILE *ff );
Cluster Computing
Asynchronous events
- Notifications on special events
- info = pvm_notify(event, tag, cnt, tids);
- info = pvm_sendsig(tid, signal);
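A small sketch of pvm_notify in use, assuming tids[] was filled in by an earlier pvm_spawn() (tag 99 is an arbitrary choice): when a watched task dies, PVM delivers a message with the chosen tag whose body holds the tid of the dead task.

int tids[4];     /* filled in by pvm_spawn() elsewhere */
int deadtid;

pvm_notify(PvmTaskExit, 99, 4, tids);  /* watch all 4 workers */
/* ... later ... */
pvm_recv(-1, 99);                      /* blocks until a worker exits */
pvm_upkint(&deadtid, 1, 1);            /* body holds the tid of the dead task */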
Cluster Computing
Groups
- Groups make it easy to partition the work in an application
- num=pvm_joingroup("worker");
- size = pvm_gsize("worker");
- info = pvm_lvgroup("worker");
- int inum = pvm_getinst( char *group, int tid )
- int tid = pvm_gettid( char *group, int inum )
Cluster Computing
Buffers
- PVM applications have a default send
and a default receive buffer
- buf = pvm_initsend(encoding); with encoding one of PvmDataDefault, PvmDataRaw, PvmDataInPlace
- info = pvm_pk<type>(data, 10, 1); e.g. pvm_pkint(), pvm_pkreal()
- info = pvm_upk<type>(data, 10, 1); e.g. pvm_upkint(), pvm_upkreal()
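A minimal sender/receiver sketch using the default buffers; the tag 42 and the array size are arbitrary, and desttid is assumed to hold the destination task id:

/* sender */
int data[10];
pvm_initsend(PvmDataDefault);   /* XDR encoding: safe across architectures */
pvm_pkint(data, 10, 1);         /* pack 10 ints, stride 1 */
pvm_send(desttid, 42);

/* receiver */
pvm_recv(-1, 42);               /* from anyone, tag 42 */
pvm_upkint(data, 10, 1);        /* unpack in the order they were packed */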
Cluster Computing
Managing Buffers
- bufid = pvm_mkbuf(encoding);
- oldbuf = pvm_setrbuf(bufid);
- oldbuf = pvm_setsbuf(bufid);
- int info = pvm_freebuf( int bufid )
- int bufid = pvm_getrbuf( void );
- int bufid = pvm_getsbuf( void );
Cluster Computing
Receiving messages
- Messages may be received blocking or
nonblocking
- bufid = pvm_probe(tid, tag);
- bufid = pvm_recv(tid, tag);
- bufid = pvm_trecv(tid, tag, tmout);
- bufid = pvm_nrecv(tid, tag);
- info = pvm_precv(tid, tag, array, cnt, type, &atid, &atag, &acnt);
Cluster Computing
Sending messages
- Messages can also be sent in various ways
- info = pvm_send(tid, tag);
- info = pvm_psend(tid, tag, data, cnt, type);
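pvm_psend/pvm_precv combine packing and sending into one call, which is both shorter and usually faster. A sketch (tag 7 and the array are arbitrary; desttid is assumed):

double vec[100];
int atid, atag, acnt;

/* sender: pack-and-send in one call */
pvm_psend(desttid, 7, vec, 100, PVM_DOUBLE);

/* receiver: the message lands directly in the array */
pvm_precv(-1, 7, vec, 100, PVM_DOUBLE, &atid, &atag, &acnt);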
Cluster Computing
Managing Buffers
- bufid = pvm_mkbuf(encoding);
- oldbuf = pvm_setrbuf(bufid);
- oldbuf = pvm_setsbuf(bufid);
- int info = pvm_bufinfo( int bufid, int *bytes, int *msgtag, int *tid );
Cluster Computing
Global reductions
- Global reductions are useful for a wide
array of parallel applications
- info = pvm_reduce(PvmMax, &data, cnt, type, tag, "workers", rootginst);
Cluster Computing
PVM Reductions
- Predefined global operations:
– PvmSum
– PvmProduct
– PvmMin
– PvmMax
Cluster Computing
PVM Synchronization
- Barrier
- inum=pvm_joingroup("worker");
- pvm_barrier("worker",5);
Cluster Computing
Broadcast
- Sends the active buffer to all members of a
group
- info = pvm_bcast("worker", 42);
- NOTE: the task that issues a broadcast
need not be a member of the group!
Cluster Computing
Multicasting
- A message can be sent to a number of
tasks without the existence of a shared group
- info = pvm_mcast(list, number, 42);
Cluster Computing
An example
- Finite differences
- Well-known technique for solving differential
equations
- The one-dimensional version is trivial if we
don’t need information on the evolution in time
Cluster Computing
The model
Cluster Computing
The example
Cluster Computing
First Solution
If left neighbor exists then
    read data from left
    send data to the left
update points 0..n-1
If right neighbor exists then
    read data from right
    send data to the right
update point n
Cluster Computing
Problems with Solution 1?
- Results in serialization!
- We must eliminate this serialization
Cluster Computing
Second Solution
If left neighbor exists then
    read data from left
    send data to the left
If right neighbor exists then
    send data to the right
    read data from right
update points 0..n
Cluster Computing
Problems with Solution 2
- Enforced strict synchronous execution
- Slowest Task dictates progress
- All communication takes place at the same
time
- Stresses the communication network
Cluster Computing
Solution 3
If left neighbor exists then
    send data to the left
If right neighbor exists then
    send data to the right
update points 1..n-1
If left neighbor exists then
    read data from left
    update point 0
If right neighbor exists then
    read data from right
    update point n
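A sketch of solution 3 with PVM point-to-point calls. u[0..n] is this worker's strip and nu receives the new iterate; left/right hold neighbor tids (or -1 at a boundary). The 3-point averaging stencil and the message tag are illustrative assumptions, not from the slides:

#include "pvm3.h"

void iterate(double *u, double *nu, int n, int left, int right)
{
    int atid, atag, acnt, i;
    double lg, rg;

    /* send boundary values first; PVM sends are asynchronous,
       so there is no deadlock and no serialization */
    if (left  >= 0) pvm_psend(left,  1, &u[0], 1, PVM_DOUBLE);
    if (right >= 0) pvm_psend(right, 1, &u[n], 1, PVM_DOUBLE);

    /* interior points need no remote data */
    for (i = 1; i < n; i++)
        nu[i] = 0.5 * (u[i-1] + u[i+1]);

    /* edge points wait for the neighbors' values */
    if (left >= 0) {
        pvm_precv(left, 1, &lg, 1, PVM_DOUBLE, &atid, &atag, &acnt);
        nu[0] = 0.5 * (lg + u[1]);
    }
    if (right >= 0) {
        pvm_precv(right, 1, &rg, 1, PVM_DOUBLE, &atid, &atag, &acnt);
        nu[n] = 0.5 * (u[n-1] + rg);
    }
}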
Cluster Computing
Problems with solution 3
- Practically none!
- Only potential improvement is to overlap
communication and calculation (latency hiding)
Cluster Computing
Solution 4
If left neighbor exists then
    issue_read data from left
    issue_send data to the left
If right neighbor exists then
    issue_read data from right
    issue_send data to the right
update points 1..n-1
finish_any_read; update corresponding point
finish_any_read; update corresponding point
Cluster Computing
Matrix Multiplication
Used extremely frequently in scientific applications
Cluster Computing
Naïve version
mxmul(REAL **c, REAL **a, REAL **b, int n)
{
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
}

The performance of the naïve version may be improved by maintaining B in its transposed form, as in the sketch below.
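A sketch of the transposed-B variant: with bt holding B transposed (bt[j][k] == b[k][j]), the inner loop walks both operands with unit stride, which is much friendlier to the cache:

void mxmul_t(REAL **c, REAL **a, REAL **bt, int n)
{
    int i, j, k;
    REAL sum;

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            sum = 0;
            for (k = 0; k < n; k++)       /* unit stride on a and bt */
                sum += a[i][k] * bt[j][k];
            c[i][j] = sum;
        }
}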
Cluster Computing
Blocked Sequential Version
bmul(REAL **c, REAL **a, REAL **b, int is, int js, int bs, int n)
{
    /* compute the bs x bs result block with block indices (is, js) */
    int i, j, k;
    for (i = is*bs; i < is*bs + bs; i++)
        for (j = js*bs; j < js*bs + bs; j++)
            for (k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
}

mxmul(REAL **c, REAL **a, REAL **b, int bs, int n)
{
    int i, j;
    for (i = 0; i < n/bs; i++)
        for (j = 0; j < n/bs; j++)
            bmul(c, a, b, i, j, bs, n);
}
Cluster Computing
Performance of the basic versions
Cluster Computing
Recursive Version
Matrix mxmul(Matrix A, Matrix B, int s)
{
    Matrix C;
    if (s == 1)
        C = A*B;
    else {
        s = s/2;
        p0 = mxmul(UL(A), UL(B), s);
        p1 = mxmul(UR(A), LL(B), s);
        p2 = mxmul(UL(A), UR(B), s);
        p3 = mxmul(UR(A), LR(B), s);
        p4 = mxmul(LL(A), UL(B), s);
        p5 = mxmul(LR(A), LL(B), s);
        p6 = mxmul(LL(A), UR(B), s);
        p7 = mxmul(LR(A), LR(B), s);
        UL(C) = p0 + p1;  UR(C) = p2 + p3;
        LL(C) = p4 + p5;  LR(C) = p6 + p7;
    }
    return C;
}
Cluster Computing
Blocked Parallel Version
- If we have a broadcast media then we can
efficiently broadcast blocks to all workers
Cluster Computing
Blocked Parallel version
- Done in W broadcasts using W workers!
Cluster Computing
Blocked Version in PVM
- All workers hold one row-block and the
corresponding column block
- Worker zero first broadcasts its column block,
then worker one, and so forth
- The result is that exactly the size of B is
broadcast, in W blocks
Cluster Computing
Main
main(int argc, char **argv)
{
    int bs;
    char msg[1024];

    N = atoi(argv[1]);
    bs = atoi(argv[2]);
    size = atoi(argv[3]);
    pvm_joingroup("workers");
    rank = pvm_getinst("workers", pvm_mytid());
    basicBsize = N/size;
    lastBsize = basicBsize + N%size;
    if (rank == size-1) myBsize = lastBsize;
    else myBsize = basicBsize;
    a = (REAL *)malloc(N*lastBsize*sizeof(REAL)); /* same for b, tb and c */
    mmul(bs);
    pvm_exit();
}
Cluster Computing
Main loop
mmul(int bs)
{
    int w, i, j, k;
    int src, atag, acnt;
    REAL *t = tb;

    for (w = 0; w < size; w++) {
        pvm_initsend(PVM_COM_MODEL);
        if (rank == w) {
            tb = b;
            pvm_pkreal(b, N*(w == size-1 ? lastBsize : basicBsize), 1);
            pvm_bcast("workers", 100+w);
        } else {
            pvm_recv(-1, 100+w);
            pvm_upkreal(tb, N*(w == size-1 ? lastBsize : basicBsize), 1);
        }
        for (i = 0; i < myBsize; i += bs)
            for (j = 0; j < myBsize; j += bs)
                bmul(i, i+bs, j, j+bs);
        tb = t;
    }
}
Cluster Computing
How may this version be improved?
- Overlapping communication and
calculation
Cluster Computing
Summary
- PVM is similar to programming with
threads - except you need message-passing
- At first parallel programs may be very
inefficient
- More efficient programs are more complex
Cluster Computing
Programming NOW
- Dynamic load balancing
- Dynamic orchestration
Cluster Computing
Dynamic Load Balancing
- Base your applications on redundant
parallelism
- Rely on the OS to balance the application
over the CPUs
- Rather few applications can be
orchestrated in this way
Cluster Computing
Barnes Hut
- Galaxy simulations are still quite interesting
- Basic formula is Newton's law of gravitation: F = G m1 m2 / r^2
- Naïve algorithm is O(n^2)
Cluster Computing
Barnes Hut
Cluster Computing
Barnes Hut
- The tree-based algorithm is O(n log n)
Cluster Computing
Balancing Barnes Hut
Cluster Computing
Dynamic Orchestration
- Divide your application into a job-queue
- Spawn workers
- Let the workers take and execute jobs
from the queue
- Not all applications can be orchestrated in
this way
- Does not scale well – job-queue process
may become a bottleneck
Cluster Computing
Parallel integration
Cluster Computing
Parallel integration
- Split the outer integral
- Jobs = range(x1, x2, interval)
- Task i = integral with x1 = Jobs[i], x2 = Jobs[i+1], for i in 0 .. len(Jobs)-2
- Result = Sum(Execute(Tasks))
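A small sequential sketch of this decomposition; f(), the limits, and the job count are illustrative, and in the job-queue scheme each loop body would be handed to a worker and the partial sums collected:

#include <stdio.h>

double f(double x) { return x * x; }             /* example integrand */

double integrate(double a, double b, int steps)  /* trapezoid rule */
{
    double h = (b - a) / steps, sum = 0.5 * (f(a) + f(b));
    int i;
    for (i = 1; i < steps; i++)
        sum += f(a + i * h);
    return sum * h;
}

int main(void)
{
    double x1 = 0.0, x2 = 1.0, result = 0.0;
    int jobs = 8, j;
    double w = (x2 - x1) / jobs;                 /* width of one job */

    for (j = 0; j < jobs; j++)                   /* each iteration is one job */
        result += integrate(x1 + j * w, x1 + (j + 1) * w, 1000);

    printf("integral = %f\n", result);           /* 1/3 for x^2 on [0,1] */
    return 0;
}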
Cluster Computing
Genetic Algorithms
- Genetic algorithms are very well suited for
NOW type architectures
– Requires much processing time
– Little communication
– Many independent blocks
Cluster Computing
Example
- Based on Conway’s-game-of-life
- We have an area with weeds
– bacteria
– or another simple organism
- Life in this scenario is governed by very
simple rules
- We desire an initial setup that returns the
most life after exactly 100 iterations
Cluster Computing
Rules
- A cell with fewer than 2 neighbors dies
from loneliness
- A cell with more than 3 neighbors dies from
crowding
- A living cell with 2 or 3 neighbors survives
to the next generation
- A dead cell with exactly 3 neighbors
springs to life by reproduction
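The four rules collapse into a few lines of C; alive is the cell's current state (0/1) and n its count of living neighbors:

int next_state(int alive, int n)
{
    if (alive)
        return n == 2 || n == 3;  /* otherwise dies of loneliness or crowding */
    return n == 3;                /* reproduction */
}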
Cluster Computing
Approach
- Let the computer test
– various initial population sizes
– varying mutation rates
- Run a parallel solution finder using the