XCPU: A Process Management System
Latchesar Ionkov, Los Alamos National Laboratory
[Diagram: HPC cluster architecture -- a head node, groups of compute nodes CN1...CNn, I/O nodes IO1...IOp backed by filesystems, and user desktops connected over the network]
Typical Linux cluster:
- Cluster management: Perceus, Warewulf, xCAT
- Job scheduling: Torque, Moab, SLURM

Running a job on the compute nodes:
- Make sure all libraries are included in the cluster software stack
- Collect all binaries, configuration and data files
- Write a job script that copies the files to the nodes (unless they are on an already mounted filesystem)
- Schedule the job and wait until it finishes
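Under a Torque/PBS-style scheduler, the job-script step above might look like the following sketch. The job name, node count, paths and the use of `scp`/`mpirun` are illustrative assumptions, not taken from the talk:

```shell
#!/bin/sh
#PBS -N myjob              # job name (illustrative)
#PBS -l nodes=16           # request 16 compute nodes

# Copy the binary and data files to every allocated node
# (hypothetical paths; skip this if they live on a mounted filesystem)
for n in $(sort -u "$PBS_NODEFILE"); do
    scp /home/user/app /home/user/app.conf "$n:/tmp/"
done

mpirun -np 16 /tmp/app /tmp/app.conf
```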
Not everything is a file:
- Devices: most are accessible as files, but are controlled through ioctl
- Ad-hoc protocols for each resource: files (NFS, CIFS, AFS, FTP), printers (CUPS, LPD), sound (PulseAudio, aRts, NAS), display (X11, VNC, NX)

What users want:
- Access local files on a remote server
- A remote program that uses the local sound card
- A program running remotely that prints on the local printer
- A program running remotely that uses the locally established VPN

If device files didn't depend on ioctl operations, sharing a device would be no harder than sharing a file.
Single file namespace:
- Root decides what filesystems are mounted
- Root decides what printers the users can print to
- Root decides what networks are available
Private namespaces:
- Linux allows processes to have private namespaces
- Security issues: legacy applications and libraries expect a single namespace
- Solution: only root can create private namespaces
- Result: nobody uses private namespaces
Workarounds:
- Virtualization
- User-level file access: GNOME GIO/GVFS, KDE KIO
- Printers: none
- Network: none
Proposal:
- Fix legacy code and loosen the private-namespace and mount restrictions
- Represent more resources as files: FUSE and 9P make it easy
- Get rid of ioctls for the kernel devices
Effect: users can bring their own resources to a server without involving the sysadmin.
When a user logs in on a remote server, a new private namespace is created:
- Print on the local printer: mount it at /dev/printer
- Sound on the local speakers: mount it at /dev/sound
- Use the local VPN: mount it at /dev/net
The resources are invisible to other users and don't affect their work.
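With the restrictions loosened, a login session could assemble such a namespace itself. A hypothetical transcript, in the style of the sessions later in the talk; the `desktop` address, the 9P mount options and the mount points are illustrative, not a documented XCPU interface:

```shell
$ unshare --mount $SHELL       # private mount namespace (needs privilege today)
$ mount -t 9p -o trans=tcp desktop /dev/printer
$ mount -t 9p -o trans=tcp desktop /dev/sound
$ cp report.ps /dev/printer    # prints on the desktop's local printer
```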
Cluster job control requirements:
- Distribute job-related files (binary, data, configuration) to all nodes
- Set up the job environment and arguments
- Start, monitor and control job execution
- Clean up when the job is done
- Survive a head node crash
The interface is implemented as a file tree:
- Global files (arch, clone, ctl, env)
- Per-job session files (argv, ctl, fs/, stdin, stdout, stderr, wait)

XCPU file interface mounted on /mnt/xcpu:
$ cd /mnt/xcpu
$ ls
arch  clone  ctl  env
$ tail -f clone &
2
$ cd 2
$ ls
argv  ctl  fs/  stdin  stdout  stderr  wait
$ echo foo > argv
$ cp /bin/cat fs/cat
$ echo hello world > fs/foo
$ echo exec cat > ctl
$ cat stdout
hello world
Tree spawn -- copying files to many nodes:
- Linear (or even parallel) distribution from one node doesn't scale
- Solution: set up a few sessions from the head node and instruct the compute nodes to clone them further
- Runs recursively, with as many levels as necessary
[Diagram: head node and compute nodes n1-n10]
1. Head node creates the sessions
2. Head node sets up a few of them (argv, env, executable and input files)
3. Head node instructs those sessions to clone themselves to the rest:
     echo clone n3,n4,n7,n8 > ctl
4. Head node starts execution
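The payoff of recursive cloning is that distribution time grows logarithmically rather than linearly with the node count. A small sketch of the arithmetic; the fan-out of 4 and the target of 1024 nodes are made-up numbers, not from the talk:

```shell
#!/bin/sh
# Each round, every node that already holds the session clones it to
# FANOUT more nodes, so coverage multiplies by (FANOUT + 1) per round.
TARGET=1024
FANOUT=4
nodes=1      # initially only the head node has the session
rounds=0
while [ "$nodes" -lt "$TARGET" ]; do
    nodes=$(( nodes * (FANOUT + 1) ))
    rounds=$(( rounds + 1 ))
done
echo "$rounds rounds cover $TARGET nodes"    # prints "5 rounds cover 1024 nodes"
```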
Implementation:
- Based on the 9P2000 resource sharing protocol
- The server (xcpufs) runs on every compute node
- The synthetic file interface exported by xcpufs can be mounted on Linux, or accessed directly over the network
- Total size, server, tools and libraries: 20K lines of code
Tools for job execution -- xrx:
  xrx n[1-128],n250 /bin/date
  xrx -s n[1-128] /bin/date
  xrx -n 2 n[1-128] /bin/date
  xrx -a /bin/date
  xrx -J foo
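The "mountable on Linux" path presumably goes through the kernel's v9fs 9P client; a hypothetical mount of one compute node's xcpufs might look like the transcript below. The node name `cn1`, the port and the mount point are invented for illustration:

```shell
$ mount -t 9p -o trans=tcp,port=20001 cn1 /mnt/xcpu/cn1
$ ls /mnt/xcpu/cn1
arch  clone  ctl  env
```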
Other tools:
- xps -- list processes
- xk -- kill a process
- libxcpu -- library interface
Security:
- Ownership and permissions on the files define who can do what
- A program runs as the user that mounted the file interface
- Authentication is set up in advance
- XCPU users are distinct from Unix users
Importing the desktop:
- The desktop exports its filesystem
- All nodes for a job mount it and see the same files as the user's desktop
- If an application works on the user's desktop, it will (most likely) work on the cluster
- No library mismatches, no missing files, no wrong pathnames
- Similar to Plan 9's cpu command
New node type -- job control node:
- Responsible for controlling the nodes assigned to a job
- Job nodes "see" the filesystem of the job control node
- Jobs on the same node can use different distributions
[Diagram: a control node, three job control nodes, and a group of compute nodes attached to each job control node]
XCPU file interface mounted on /mnt/xcpu
$ pwd
/home/lucho
$ ls
foo  bar
$ xrx remote pwd
/home/lucho
$ xrx remote ls
foo  bar
Custom namespaces:
- Common case: import the root filesystem from the job control node
- The ns file allows custom namespaces
- Operations: unshare, mount, bind, import, cd, chroot, cache

Example ns file:
  unshare
  import $XCPUTSADDR /mnt/term
  bind /dev /mnt/term/dev
  bind /proc /mnt/term/proc
  bind /sys /mnt/term/sys
  chroot /mnt/term
Caching:
- All nodes of a job are likely to use the same system files
- Cooperative caching between the nodes of a job would achieve a high hit rate
- Currently read-only, non-cooperative caching
[Diagram: head node and nodes n1-n6; /bin/cat and /etc/hosts are fetched once and then served from node-local caches]
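The read-only, non-cooperative case behaves like a read-through cache: the first access copies a file to node-local storage, and later reads are served locally. A sketch of that behavior; the cache directory and the `cached_read` helper are invented for illustration and are not xcpufs's actual code:

```shell
#!/bin/sh
# Read-through cache sketch: first access populates a node-local
# cache directory; subsequent reads never touch the source again.
CACHE=/tmp/xcpu-cache
mkdir -p "$CACHE"

cached_read() {
    # flatten the source path into a cache key
    key=$(printf '%s' "$1" | tr / _)
    if [ ! -f "$CACHE/$key" ]; then
        cp "$1" "$CACHE/$key"    # cache miss: fetch once from the source
    fi
    cat "$CACHE/$key"            # serve from the local copy
}

echo "hello world" > /tmp/demo-input
cached_read /tmp/demo-input      # miss: copies the file, prints "hello world"
cached_read /tmp/demo-input      # hit: served from the cache
```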
Conclusions:
- XCPU2 transparently imports the user's desktop environment to all cluster nodes
- Makes it very easy to use different distributions and configurations
- If more devices and services operated as normal files, the integration would be even better (Plan 9's cpu command)
- Experiment with user- and kernel-level services that look like regular files
- Don't be afraid of private namespaces: use them, and ask your distribution for support!
Resources:
- Plan 9: http://plan9.bell-labs.com
- Glendix: http://www.glendix.org
- 9P libraries: http://9p.cat-v.org/implementations
- XCPU: http://xcpu.org