RHIC Real-time data reconstruction using Magellan cloud computing



SLIDE 1

RHIC Real-time data reconstruction using Magellan cloud computing

2011 OSG All Hands

March 7-11, 2011, Harvard Medical School, Boston

Jan Balewski for the STAR Collaboration

[Images: Magellan Cluster at NERSC; a STAR event display]

SLIDE 2


Outline

  • STAR experiment at RHIC
  • Computing requirements for real data analysis
  • STAR encounters with Cloud-like computing
  • Deployment of real time data processing
  • Benefits of “instantaneous” data analysis
  • Summary + ...


SLIDE 3

STAR experiment at RHIC

Brookhaven National Laboratory, Upton NY, USA

~600 collaborators from ~50 institutions in ~12 countries

[Aerial view: the RHIC ring (1.2 km across) at BNL, Long Island, NY, with the STAR detector marked]

SLIDE 4

Explore properties of proton spin using the W boson

[Illustration: Magnetic Resonance Imaging, an everyday application of proton spin]

SLIDE 5

Explore properties of proton spin using the W boson

"Exploring the mystery of proton spin has been one of the key scientific research goals at RHIC," said Steven Vigdor, Brookhaven's Associate Laboratory Director for Nuclear and Particle Physics. "... The W boson measurements [will help us] ... in quantitative understanding of proton spin structure and dynamics."

http://www.bnl.gov/bnlweb/pubaf/pr/PR_display.asp?prID=1232

  • Phys. Rev. Lett. 106, 062002 (2011).

SLIDE 6

A recorded collision of two protons at high energy

Reconstruction of the particles emerging from a collision of two protons is a computational challenge.

[Event display: proton-proton collision]

SLIDE 7

Computational challenges at STAR for W physics

Data Acquisition
  • STAR records 'events' at 1 kHz
  • data rate ~1 GiB/sec
  • event file: 5 GB with 15,000 events

Data Reconstruction
  • reconstruction of 1 event: 10 seconds
  • time to process a 5 GB event file: ~40 hours
  • 10,000 CPUs needed for true real-time event processing

Analysis requires calibration of the detector response:
  • crude quality, available within an hour: 'fastOffline' reconstruction of 15% of events, used to monitor detector performance
  • preliminary quality, available within a month: start of the first data pass
  • final quality, available within 6 months: full data pass over all qualified events, used for publication of results

Cloud computing → application
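
These requirements follow from simple arithmetic. A minimal sketch in Python, using only the figures quoted on this slide:

```python
# Back-of-envelope check of the numbers above (all inputs are from the slide).
EVENT_RATE_HZ = 1_000      # STAR records events at 1 kHz
SEC_PER_EVENT = 10         # reconstruction of one event takes ~10 s
EVENTS_PER_FILE = 15_000   # one 5 GB event file

hours_per_file = EVENTS_PER_FILE * SEC_PER_EVENT / 3600
print(f"single-CPU time for one 5 GB file: {hours_per_file:.0f} h")  # ~42 h, i.e. ~40 hours

# Keeping up with a 1 kHz stream at 10 s/event needs rate x time-per-event busy CPUs:
print(f"CPUs for true real-time processing: {EVENT_RATE_HZ * SEC_PER_EVENT:,}")  # 10,000
```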

SLIDE 8

Traditional in-house data analysis model

[Diagram] Experiment → raw data at 1 kHz → HPSS (data taking for 3 months) → raw data read back at 300 Hz → in-house computing (for 1 year) → results

In-house computing: a farm of 2,000 dual-core machines running a highly customized analysis package.

SLIDE 9

Virtualization enables outsourcing of computation

At first on a laptop ... the STAR Virtual Machine (VM) is born ...

1) More recently, the STAR VM is prepared on a PC at NERSC
2) Pack it 'from inside' and ship it to Amazon EC2, Magellan@NERSC, Magellan@ANL, etc.

  • Validate once, re-use multiple times.
  • The same results are obtained ANYWHERE → virtualization allows normalization of resources
  • Reproducibility of old-code results rests in the archived old VM; there is no need to retain old hardware
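
Step 2, 'packing from inside' for Amazon EC2, could look roughly as follows. This is a hedged sketch using the classic EC2 AMI bundling tools; the slide does not name the exact tooling STAR used, and every path, ID, and bucket name below is a hypothetical placeholder.

```python
# Hypothetical sketch: bundle a running STAR VM's filesystem "from inside"
# and ship it to Amazon EC2 with the classic AMI tools (ec2-ami-tools).
import subprocess

# Create the bundle on the instance's own disk.
subprocess.run([
    "ec2-bundle-vol",
    "-k", "/tmp/pk.pem",      # private key (placeholder)
    "-c", "/tmp/cert.pem",    # X.509 certificate (placeholder)
    "-u", "123456789012",     # AWS account id (placeholder)
    "-d", "/mnt",             # where the bundle parts are written
    "-r", "x86_64",           # architecture of the STAR VM
], check=True)

# Upload the bundle to S3, from where it can be registered as an image.
subprocess.run([
    "ec2-upload-bundle",
    "-b", "star-vm-images",            # S3 bucket (placeholder)
    "-m", "/mnt/image.manifest.xml",   # manifest written by ec2-bundle-vol
    "-a", "ACCESS_KEY", "-s", "SECRET_KEY",
], check=True)
```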

SLIDE 10

STAR encounters with VMs

| date     | facility                         | tools                     | type of task | # of VMs | # jobs/VM | total CPU days | calendar days | total input (TB) | total output (TB) | remarks                                  |
|----------|----------------------------------|---------------------------|--------------|----------|-----------|----------------|---------------|------------------|-------------------|------------------------------------------|
| 2009 Mar | Amazon EC2                       | Nimbus, Globus, PBS batch | simu         | 100      | 1         | 500            | 5             | n/a              | 0.3               | works like a normal Globus GK grid site  |
| 2009 Nov | Amazon EC2                       | EC2                       | simu         | 10       | 1 or 2    | 1              | 1             | n/a              | 0.01              | use of the commercial interface          |
| 2010 Feb | GLOW, Univ. of Wisconsin-Madison | CondorVM                  | simu         | 430      | 1         | 130            | 0.6           | n/a              | 0.1               | call-home model                          |
| 2010 Jul | Clemson Univ., SC                | Kestrel, QEMU-KVM         | simu         | 1000     | 1         | 17,000         | 20            | n/a              | 7                 | VM lifetime 24 h, no ssh to VM           |
| 2011 Feb | Magellan, NERSC                  | Eucalyptus                | data reco    | 20       | 6 or 7    | 600+           | 20+           | 2                | 1                 | almost real-time processing              |

[Logos: Clemson, STAR, Amazon EC2, GLOW, NERSC]

SLIDE 11

Largest STAR simulations (ever) at Clemson, July 2010

[Plot: available, working, and idle machines (200-1,400) vs. date, Jul 17-31, 2010]

  • STAR MC simulations with partonic pT > 2 GeV, PYTHIA event generator
  • Virtual Machine prepared with the STAR software stack and deployed to over 1,000 machines
  • Using cloud computing at Clemson University in South Carolina (ranked #85 among supercomputers)
  • Over 12 billion events generated
  • Took over 400,000 CPU hours and generated 7 TB of data, transferred to BNL
  • Largest physics simulation on a cloud; largest STAR simulation in CPU hours
  • Benefit: shortened an MIT student's PhD studies by a year

SLIDE 12

Today: Magellan @ NERSC

20 nodes allocated to STAR

Employing VM technology to separate experiment-specific requirements from the facility infrastructure.

SLIDE 13

Real-time distributed processing of 2011 data

[Diagram] The STAR experiment at BNL writes raw data to HPSS. Raw data are shipped from BNL to Magellan @ NERSC, where the STAR VM is cloned x 20. A DB with time-dependent detector calibrations feeds the STAR VMs, and reconstructed ('reco') events flow back to BNL.
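
On Eucalyptus, which presents an EC2-style interface, 'clone STAR VM x 20' could be issued as below; a hedged sketch using euca2ools, where the image id and instance type are hypothetical placeholders:

```python
# Hypothetical sketch: launch 20 clones of the STAR VM image on the
# Eucalyptus-managed Magellan cloud via euca2ools (EC2-style CLI).
import subprocess

subprocess.run([
    "euca-run-instances",
    "-n", "20",          # number of instances to launch
    "-t", "c1.xlarge",   # instance type (placeholder)
    "emi-12345678",      # STAR VM image id (placeholder)
], check=True)
```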

SLIDE 14

Topology of connectivity RCF ↔ VMs

RCF @ BNL (stargrid01.bnl.gov):
  • HPSS plus a 3 TB cache
  • pushes raw data into the pool and pulls results back with globus-url-copy / globus-job-run

NERSC (carver.nersc.gov):
  • 20 TB GPFS cache
  • Magellan/Eucalyptus: 20 VMs * 7 jobs = 140 jobs
  • 1 job: input 5 GB, duration 1-3 days

STAR VMs (#1, #2, #3, #4, ...):
  • 8 cores, 20 GB RAM, 80 GB local scratch disk
  • STAR software plus a local DB
  • get raw data, put results
  • asynchronous local DBs updated periodically; a fresh DB snapshot is available every 2 hours
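
The push/pull steps could be scripted roughly as below; a minimal sketch assuming GridFTP endpoints on both hosts. The hostnames are the ones named on the slide; all file paths are hypothetical placeholders.

```python
# Hedged sketch of the data movement described above: push one raw event
# file from BNL into the NERSC pool, later pull results back to BNL.
import subprocess

RAW  = "gsiftp://stargrid01.bnl.gov/star/daq/run12039000.daq"    # placeholder path
POOL = "gsiftp://carver.nersc.gov/global/scratch/star/pool/"     # placeholder path

# Push raw data into the NERSC pool (-p 4: four parallel TCP streams).
subprocess.run(["globus-url-copy", "-p", "4", RAW, POOL], check=True)

RESULT = "gsiftp://carver.nersc.gov/global/scratch/star/out/run12039000.reco.root"
DEST   = "gsiftp://stargrid01.bnl.gov/star/reco/"                # placeholder path

# Pull reconstructed events back to BNL.
subprocess.run(["globus-url-copy", "-p", "4", RESULT, DEST], check=True)
```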

SLIDE 15

Model of coordination of VMs

Principles of VM operation:
1. Acts without supervision
2. Protects its own integrity
3. Initiates connection to the host:
  • acquire input
  • perform task
  • return results to the host
  • rest for '5 minutes'

Model citizen:
  • acts autonomously
  • highly specialized
  • aggregated output from many individuals serves a higher purpose

Consequences:
  • No micro-management of instances; local rules result in coherent aggregated output
  • No active reporting by VMs
  • No inter-machine synchronization
  • VMs compete for data; use of an 'atomic' rename avoids collisions (see the sketch after this list)
  • Instances are disposable, and failures don't disrupt the workload of the other N-1

Each VM: 8 cores, 20 GB RAM, 80 GB disk, with a local MySQL DB.
Self-check (state machine):
  load > 10      → hotCPU
  disk < 35 GB   → diskFull
  # jobs > 6     → busy
  DB too old     → oldDB
  jobs enabled   → open
  default        → closed
A new job will start only if the VM state = open.
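
A minimal sketch of these two local rules in Python, assuming the thresholds quoted above; the probe values, the '.daq' suffix, and the directory layout are hypothetical:

```python
import os

def vm_state(load, free_disk_gb, n_jobs, db_age_hours, jobs_enabled):
    """Self-check state machine; a new job starts only if this returns 'open'."""
    if load > 10:
        return "hotCPU"
    if free_disk_gb < 35:
        return "diskFull"
    if n_jobs > 6:
        return "busy"
    if db_age_hours > 2:       # a fresh DB snapshot arrives every ~2 hours
        return "oldDB"
    return "open" if jobs_enabled else "closed"

def claim_input(pool_dir, vm_id):
    """Claim one raw-data file from the shared pool with an 'atomic' rename.

    os.rename() is atomic on POSIX filesystems: if several VMs race for the
    same file, exactly one rename succeeds and the losers simply move on,
    so no inter-VM locking or synchronization is needed.
    """
    for name in sorted(os.listdir(pool_dir)):
        if not name.endswith(".daq"):   # hypothetical raw-file suffix
            continue
        src = os.path.join(pool_dir, name)
        dst = src + ".claimed-" + vm_id
        try:
            os.rename(src, dst)         # succeeds for exactly one competitor
            return dst
        except OSError:                 # another VM won this race
            continue
    return None                         # pool is empty; rest for '5 minutes'
```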

SLIDE 16

VM cluster load - example

[Plot: per-VM job count, CPU load, and 'open' state vs. time, with markers for the goal # of jobs, the # of cores per VM, and the 'hot VM' threshold]

This 'hot' VM will not launch new jobs even though only 5 jobs are running.

SLIDE 17

Time lag of 'real-time' processing

[Plot: a few hours' latency between data acquisition (blue) and reconstruction (red)]

SLIDE 18

Deliverables after 10 days of data taking

[Histogram: events (5-25) vs. electron cluster ET (10-60 GeV)]

Real-time 'on the Cloud' reconstruction of W-boson candidates with |η| < 1 at STAR, from 2011 p+p collisions at √s = 500 GeV (as of Feb. 21, 2011).

STAR: the first 100 Ws reconstructed in 2011 using Cloud resources (Magellan @ NERSC)

  • Phys. Rev. Lett. 106, 062002 (2011).

[Plot: uniformity of reconstructed tracks in 2011 data]

SLIDE 19

How much of ACCESS is SUCCESS?

Achieved timeline of the W measurement in 2009:
[Timeline, Mar 2009 - May 2010: data taking → calibration → reconstruction pass 1 → analysis, with conference presentations (☆) along the way]
→ scientific deliverables in 10 months

Intended timeline of the W measurement in 2011:
[Timeline, Feb - Oct 2011: data taking with cloud reco trailing by ~2 weeks → calibration → 1st analysis → cloud reco 2 → 2nd analysis, with conference presentations (☆) along the way]
→ deliverables in 3 months?

A 'Cloud boost' of 6 months.
slide-36
SLIDE 36

Jan Balewski, MIT RHIC & Cloud, 2011 OSG All Hands

Summary

22

  • Virtualization has allowed STAR to run complex workflow and address intense

processing demands

  • Today STAR is doing real data reconstruction in near real time, providing a valuable

QA and preview of the results (preliminaries for the W to be used for making the Physics case far ahead of final publishable results)

  • Processing on a distributed facility are real and beyond proof of principles

(STAR is doing this TODAY at a 7% level - scalability ramp up is next)

  • Availability of such capabilities in OSG would allow full exploitation of resources

available on a distributed National Facility

  • Thanks to virtualization capabilities of Cloud and the resources

provided by the Magellan project and the Magellan support team at NERSC STAR is in a world-wide unique position to process acquired data in real time. Experimentalist can see what they measure as they measure.

  • Faster data analysis will shorten publication cycle
  • Unified VMs allow easy integration over multiple geographical location

SLIDE 21

BACKUP

The real 'thing' can be seen here:
http://portal.nersc.gov/project/star/balewski/w2011/C/

SLIDE 22

Multi-site real-time STAR data reco, March 2011

[Diagram] Raw data flow from the STAR HPSS and DB at BNL to 20 VMs at NERSC and 4 VMs at ANL; reconstructed ('reco') events flow back to BNL.