SLIDE 1

Getting a first grip on doing large computations at CWI

Nicolas Höning

Centrum Wiskunde & Informatica – CWI
(Centre for Mathematics and Computer Science)

Intelligent Systems Group

SLIDE 2

Nicolas Höning, Intelligent Systems Group
June 25, 2014

Scope: embarrassingly parallel computation problems

◮ No effort is required to separate the problem into a number of parallel tasks
◮ Results are independent, e.g. in
  ◮ searching through large data sets
  ◮ 3D graphics rendering
  ◮ simulating independent scenarios
  ◮ repeating computations with differing randomisation seeds (Monte Carlo sampling)
◮ etc.
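The last case, repeated runs with different randomisation seeds, can be sketched as independent background jobs in the shell. The awk one-liner is a hypothetical stand-in for a real simulation program, and the file names are illustrative:

```shell
# Launch four independent runs, one per seed, as background jobs ('&').
# Each run writes its own result file, so no coordination is needed.
for seed in 1 2 3 4; do
  awk -v s="$seed" 'BEGIN { srand(s); print rand() }' > "result_$seed.txt" &
done
wait   # block until every background run has finished
```

Because the results are independent, collecting them afterwards is a simple `cat result_*.txt`.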

SLIDE 3

Scope: state of the art

So this is happening right now:

“$4,829-per-hour supercomputer (50,000 cores) built on Amazon cloud to fuel cancer research”

Source: ArsTechnica

SLIDE 4

Scope: state of the art

Linux is where it’s at:

Source: ZDNet

SLIDE 5

Scope: Everything increases

◮ Embarrassment:
  ◮ Lots of CPUs in your computer
  ◮ Lots of CPUs in clouds

SLIDE 6

Scope: Everything increases

◮ Embarrassment:
  ◮ Lots of CPUs in your computer
  ◮ Lots of CPUs in clouds
◮ Expectations:
  ◮ Big data
  ◮ Complex problems
  ◮ etc.

SLIDE 7

Scope: Everything increases

◮ Embarrassment:
  ◮ Lots of CPUs in your computer
  ◮ Lots of CPUs in clouds
◮ Expectations:
  ◮ Big data
  ◮ Complex problems
  ◮ etc.
◮ Possibilities:
  ◮ Many people write many tools
  ◮ You will invest time

SLIDE 8

Outline

1. Computing resources: How to use many CPUs for many more subtasks?

2. Workflow management: How to generate and keep track of all those subtasks?

SLIDE 9

Make use of computing resources: Choices

◮ Dynamic allocation of tasks. Requires one process to assign tasks (to workers).
◮ A parallelisation tool can be agnostic to the programming language you are using, or embed in a language. I am interested in the former.

SLIDE 10

Make use of computing resources: The dilemma

SLIDE 11

Make use of computing resources: your local network, via GNU Parallel

◮ https://www.gnu.org/software/parallel/
◮ repeat any Unix command on separate CPUs
◮ communicates via SSH, runs jobs in threads
◮ very mature and rich
◮ target audience: system admins
◮ Short demo

    parallel --gnu touch {}.tmp ::: 1 2 3 4 5
SLIDE 12

Make use of computing resources: your local network, via FJD

◮ https://github.com/nhoening/fjd
◮ communicates via SSH, runs workers in Unix screens which pick up jobs
◮ Assumes a shared home directory
◮ Advantages:
  1. light-weight
  2. config files for parameterisation
  3. inspecting workers in progress is possible
  4. you can re-sort the job queue on the fly
◮ suited for long-running jobs (fixed costs)
◮ Short demo

    fjd --exe 'touch $1.tmp' --parameters 1,2,3,4,5
    fjd --exe "mktemp XXX.tmp" --repeat 5

SLIDE 13

Make use of computing resources: SURFsara LISA cluster

◮ https://surfsara.nl/systems/lisa
◮ Computers with 12+ cores each (8960 cores in total)
◮ Uses PBS (Torque) scheduling (born in the 1980s); describe what you want in a job file
◮ Can use message passing (MPI) between nodes
◮ CWI/NWO has an agreement with SURFsara
◮ Short look around LISA
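A Torque/PBS job file is just a shell script whose `#PBS` comment lines describe what you want. A minimal sketch (the job name, resource numbers, and final command are illustrative, and the exact node syntax differs per site):

```shell
#!/bin/bash
#PBS -N my_experiment         # job name shown in the queue
#PBS -l nodes=1:ppn=12        # request one node with 12 cores
#PBS -l walltime=01:00:00     # maximum run time (hh:mm:ss)

# qsub starts jobs in $HOME; change back to the directory of submission.
cd "${PBS_O_WORKDIR:-.}"

# Stand-in for the real program, e.g. ./my_simulation default.param
echo "experiment would run here" > pbs_sketch.out
```

Submit it with `qsub job.sh` and watch its state with `qstat`.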

SLIDE 14

Make use of computing resources: The “cloud”

◮ Commercial cloud servers (e.g. Amazon EC2)
◮ Faster response time than PBS
◮ Almost no constraints on the number of computers, more on your budget
◮ Skills you need here are also very useful for industry jobs
◮ Use some protocol to distribute tasks between cores, via MPI or AMQP (e.g. RabbitMQ)
◮ Much control possible, e.g. with Docker

SLIDE 15

Make use of computing resources: The dilemma

SLIDE 16

Make use of computing resources: Getting better

Hard challenge: switch effortlessly back and forth

SLIDE 17

Workflow management

Let’s support iterative development + portability

SLIDE 18

Workflow management: Sumatra

◮ https://pythonhosted.org/Sumatra/
◮ “Scientific Notebook”
◮ “Automated tracking of scientific computations”
◮ For example,

    python main.py default.param

  becomes

    smt run --executable=python --main=main.py default.param

◮ Links code, parameters and result files by watching folders (using version control systems, e.g. git)
◮ automatic work history, viewable in a browser
◮ Can use parallel computation with an MPI layer (can also run on PBS)
SLIDE 19

Workflow management

You should use version control for your code, by the way: https://scm.cwi.nl/

SLIDE 20

Workflow management: StoSim

◮ https://homepages.cwi.nl/~nicolas/stosim
◮ Only tracks log files
◮ Very easy to get started
  1. few dependencies
  2. built-in support for FJD and PBS(+FJD) → switch effortlessly
  3. Can make plots and T-tests for you:

         stosim --run --plots --ttests

◮ You can make (incremental) snapshots of code and results
◮ Short demo

SLIDE 21

Misc: SSH config

◮ configure SSH in ~/.ssh/config [1] ◮ host shortcuts ◮ SSH keys ◮ connection sharing

[1] http://blogs.perl.org/users/smylers/2011/08/ssh-productivity-tips.html

SLIDE 22

Misc: my ~/.ssh/config file

    ControlMaster auto
    ControlPath /tmp/ssh_mux_%h_%p_%r
    ControlPersist 4h

    Host cwi
        HostName ssh.cwi.nl
        User nicolas
        IdentityFile ~/.ssh/id_cwi

SLIDE 23

Misc: run Unix commands in the background

◮ Appending an ampersand
◮ CTRL-Z and bg/fg
◮ Unix screens
◮ nohup
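The ampersand and nohup variants can be sketched non-interactively (CTRL-Z/bg/fg and screen need an attached terminal; `sleep` stands in for a long-running job):

```shell
# 1. Ampersand: the shell returns immediately, the job runs in the background.
sleep 1 &

# 2. nohup: the job ignores the hang-up signal sent when the terminal closes;
#    output would go to nohup.out unless redirected, as done here.
nohup sleep 1 > nohup.out 2>&1 &

# (A detached screen session would be: screen -dmS work ./long_job)

wait   # wait for both background jobs to finish
```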

SLIDE 24

Misc: get LISA account

◮ https://surfsara.nl/systems/shared/fom-ncf ◮ Basically, fill in forms and email copies over to them ◮ normally, projects begin March 1 ◮ People are helpful there. Call 020 800 1400 or write to

hic@surfsara.nl

SLIDE 25

Extra: ask the PBS system about its queue

#!/bin/bash job=$1 idle=‘showq | grep "IDLE JOBS" -n | cut -d: -f1 ‘ jobline=‘showq | grep -n $job | cut -d: -f1 ‘ place=‘expr $jobline - $idle - 2‘ echo "Idle Jobs section starts at line $idle" echo "Job $job at line $jobline" echo "Place in queue: $place"

SLIDE 26

Extra: use all CPUs on PBS nodes with GNU Parallel

1. write PBS files to request nodes
2. wait until the nodes are started
3. cat argfile | parallel --slf $PBS_NODEFILE your_command
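Step 3 works because, inside a running job, PBS lists the granted hostnames in the file named by `$PBS_NODEFILE`, and `--slf` (short for `--sshloginfile`) tells GNU Parallel to spread jobs over those hosts via SSH. A runnable sketch, with illustrative file names and an `echo` standing in for the real command; outside PBS (or without parallel installed) it falls back to a plain loop:

```shell
printf '%s\n' a b c > argfile            # one task argument per line

if [ -n "$PBS_NODEFILE" ] && command -v parallel >/dev/null 2>&1; then
  # On the cluster: one job slot per core on every granted node.
  cat argfile | parallel --slf "$PBS_NODEFILE" echo processed > results.txt
else
  # Fallback so the sketch also runs on a plain workstation.
  while read -r arg; do echo "processed $arg"; done < argfile > results.txt
fi
```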

SLIDE 27

Extra: use all CPUs on PBS nodes with FJD

A brute-force benchmark for a problem, evaluating > 600K problem configurations on a PBS computation cluster:
https://github.com/nhoening/fjd/blob/master/fjd/example/runbrute.py

SLIDE 28

Extra: use all CPUs on PBS nodes with StoSim (+ FJD)

Simply put “scheduler:pbs” in the stosim.conf file (see also the docs for additional information you can add about your requirements on LISA)

SLIDE 29

The End

Thanks for coming