PROOF as a Service on the Cloud: a Virtual Analysis Facility based on the CernVM ecosystem


SLIDE 1

PROOF as a Service on the Cloud
a Virtual Analysis Facility based on the CernVM ecosystem

Dario Berzano
R. Meusel, G. Lestaris, I. Charalampidis, G. Ganis, P. Buncic, J. Blomer
CERN PH-SFT

CHEP2013, Amsterdam, 15.10.2013
SLIDE 2

A cloud-aware analysis facility

  • Geographically distributed, independent cloud providers (from IaaS to SaaS)
  • The user's workflow does not change; admins provide virtual clusters
  • Virtual Analysis Facility → an analysis cluster on the cloud in one click

SLIDE 3
A cloud-aware analysis facility

Clouds can be a troubled environment:

  • Resources are diverse
    → like the Grid, but at the virtual machine level
  • Virtual machines are volatile
    → they might appear and disappear without notice

Building a cloud-aware application for HEP means:

  • Scaling promptly when resources vary
    → no prior pinning of the data to process to the workers
  • Dealing smoothly with crashes
    → automatic failover and clear recovery procedures

The usual Grid workflow, with its static job pre-splitting, is not cloud-aware.

SLIDE 4

PROOF is cloud-aware

PROOF: the Parallel ROOT Facility

  • Based on unique advanced features of ROOT
  • Event-based parallelism
  • Automatic merging and display of results
  • Runs on batch systems and the Grid with PROOF on Demand

PROOF is interactive

  • Constant control and feedback of the attached resources
  • Data is not preassigned to the workers → pull scheduler
  • New workers can be dynamically attached to a running process (new in ROOT v5.34.10)

Interactivity is what makes PROOF cloud-aware.
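
As a rough illustration of this interactive workflow, a PROOF session can be driven from PyROOT. This is a minimal sketch, assuming a PoD-managed master reachable via the pod:// scheme; the file path, the tree name "events" and the selector MySelector.C are illustrative placeholders:

    # Minimal sketch of an interactive PROOF session via PyROOT
    # (hypothetical dataset and selector; assumes a running PoD server).
    import ROOT

    # Connect to the PROOF master managed by PROOF on Demand (PoD)
    proof = ROOT.TProof.Open("pod://")

    # Build a chain of input files and route its processing through PROOF
    chain = ROOT.TChain("events")
    chain.Add("root://eosuser.cern.ch//eos/user/d/demo/data_*.root")
    chain.SetProof()

    # Workers pull packets of events until the chain is exhausted; the
    # resulting histograms are merged and returned automatically
    chain.Process("MySelector.C+")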

SLIDE 5

PROOF is cloud-aware

PROOF on Demand (PoD, http://pod.gsi.de) runs PROOF on top of batch systems:

  • Zero configuration
    → no system-wide installation
  • Sandboxing
    → user crashes do not propagate to others
  • Self-servicing
    → users can restart their own PROOF server
  • Advanced scheduling
    → leverages the policies of the underlying WMS

SLIDE 6

PROOF is cloud-aware

Adaptive workload: a very granular (down to per-event) pull architecture.

[Sequence diagram: each worker tells the master it is ready, the master's packet generator hands it the next packet, the worker processes it and asks again ("get next"), until the data is exhausted.]

[Plots: the number of packets processed per worker varies (nonuniform workload distribution), yet completion time is uniform: all workers stop within ~20 s of one another (query processing time: mean 2287 s, RMS 16.6 s).]
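
The pull architecture is easy to sketch. The toy below is an illustration, not PROOF's actual packetizer code; it shows why workers of very different speeds finish almost together: a faster worker simply asks for packets more often.

    # Toy pull scheduler: a shared packet queue drained by unequal workers.
    import time
    from queue import Queue, Empty
    from threading import Thread

    def packet_generator(total_events, packet_size):
        """Yield (first_event, n_events) packets covering the dataset."""
        first = 0
        while first < total_events:
            n = min(packet_size, total_events - first)
            yield first, n
            first += n

    def worker(name, packets, events_per_s):
        done = 0
        while True:
            try:
                first, n = packets.get_nowait()   # "get next"
            except Empty:
                break                             # no packets left: stop
            time.sleep(n / events_per_s)          # simulate processing
            done += n
        print(f"{name}: processed {done} events")

    packets = Queue()
    for p in packet_generator(total_events=10_000, packet_size=250):
        packets.put(p)

    # Four workers with very different speeds still finish almost together
    threads = [Thread(target=worker, args=(f"worker{i}", packets, 1000 * (i + 1)))
               for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()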

SLIDE 7

PROOF is cloud-aware

NEW IN ROOT v5.34.10: dynamic addition of workers. New workers can join and offload a running process.

[Timeline diagram: the initially available workers are initialized in bulk and start processing; workers appearing later auto-register with the master, go through a deferred init, and join the running process.]

SLIDE 8

PROOF dynamic workers

The user requests N workers: a bunch of them starts right away, while the others gradually become available.

Old workflow:
  • Wait until "some" workers are ready
  • Run the full analysis on those workers only
  • Late workers are only available at the next run

New workflow (sketched below):
  • Wait until at least one worker becomes available
  • Run the analysis
  • Additional workers join the processing as they appear

Minimal latency and optimal resource usage. See the ATLAS use case: http://chep2013.org/contrib/256
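
A hedged sketch of the new workflow, driven with PoD and PyROOT. The HTCondor backend, the worker count, the dataset and the selector are illustrative, and "pod-info -n" is assumed here to print the number of online workers:

    # Request many workers, but start the analysis as soon as one is up;
    # late workers auto-register with the master and join the processing.
    import subprocess
    import time
    import ROOT

    subprocess.run(["pod-server", "start"], check=True)
    subprocess.run(["pod-submit", "-r", "condor", "-n", "100"], check=True)

    # Wait for the first worker only, not for all 100
    while int(subprocess.check_output(["pod-info", "-n"])) < 1:
        time.sleep(5)

    proof = ROOT.TProof.Open("pod://")
    chain = ROOT.TChain("events")
    chain.Add("root://eosuser.cern.ch//eos/user/d/demo/data_*.root")
    chain.SetProof()
    chain.Process("MySelector.C+")   # workers added meanwhile pick up packets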

SLIDE 9

PROOF dynamic workers

Measured the time taken for 100 Grid jobs, requested at the same time, to start at various ATLAS Grid sites. See the ATLAS talk: http://chep2013.org/contrib/256

[Plot: number of available workers (up to 100) vs. time (up to ~3500 s) at CERN, CNAF, ROMA1, NAPOLI and MILANO.]

SLIDE 10

PROOF dynamic workers

[Plot: actual time to results vs. total required computing time, comparing Grid batch jobs (with an ideal number of workers) against PROOF with pull scheduling and dynamic workers.]

Analytically derived from the actual startup latency measurements: by design, PROOF is up to 30% more efficient on the same computing resources (analytical upper limit).

  • PROOF with dynamic workers: all of the job time is spent computing (never idle, no latencies)
  • Batch jobs: results are collected only when the late workers have finished (latencies and dead times)
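
A sketch of the analytic argument behind this upper limit, assuming N workers with startup latencies l_1 <= ... <= l_N and a total computing demand of W CPU-seconds:

\[
\sum_{i=1}^{N}\bigl(t^{\mathrm{pull}} - l_i\bigr) = W
\;\Rightarrow\;
t^{\mathrm{pull}} = \frac{W + \sum_{i} l_i}{N},
\qquad
t^{\mathrm{batch}} = \max_{i} l_i + \frac{W}{N}.
\]

With the pull scheduler every worker computes from the moment it starts until the common finish time, while an ideally pre-split batch job must wait for the last worker. Since \(\max_i l_i \ge \frac{1}{N}\sum_i l_i\), it follows that \(t^{\mathrm{pull}} \le t^{\mathrm{batch}}\), and the gap grows with the spread of the startup latencies.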

SLIDE 11

The virtual analysis facility

  • What: a cluster of µνCernVMs with HTCondor

→ One head node plus a scalable number of workers

  • How: contextualization configured on the Web

→ Simple web interface: http://cernvm-online.cern.ch

  • Who: so easy that it can even be created by end users

→ You can have your personal analysis facility

  • When: scales up/down automatically

→ Optimal usage of resources: fundamental when you pay for them!

[Diagram: the VAF software stack: PROOF, PoD, Elastiq, HTCondor, authn/authz, µνCernVM, CernVM Online, CernVM-FS.]

slide-12
SLIDE 12

Dario.Berzano@cern.ch - PROOF as a Service on the Cloud - http://chep2013.org/contrib/308

The virtual analysis facility


VAF leverages the CernVM ecosystem and HTCondor

  • µCernVM: an SLC6-compatible OS on demand

→ See previous talk: http://chep2013.org/contrib/213

  • CernVM-FS: HTTP-based cached FUSE filesystem

→ Both OS and experiments software downloaded on demand

  • CernVM Online: safe context GUI and repository

→ See previous talk: http://chep2013.org/contrib/185

  • HTCondor: light and stable workload management system

→ Workers auto-register to the head node: no static resource configuration

The full stack of components is cloud-aware

SLIDE 13

Elastiq queue monitor

[Diagram: elastiq watches the HTCondor queue (waiting/running jobs) and the running VMs (working/idle); through a cloud controller it starts new VMs when jobs wait too long and shuts down idle ones.]

Elastiq: a Python app that monitors HTCondor and scales the cluster up or down.

  • Jobs waiting too long trigger a scale-up
  • Uses the EC2 interface (credentials are given securely in the context)
  • Can be used on any HTCondor cluster and has a trivial configuration
  • Code available at http://bit.ly/elastiq

The cloud controller can also be CernVM Cloud, an experimental meta cloud controller that:

  • Accepts scale requests
  • Translates them to multiple clouds
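
A toy reduction of the control loop, as a sketch under assumptions: this is not elastiq's actual code, the thresholds are invented, and start_vm() stands in for the cloud call.

    # Sketch: watch the HTCondor queue and scale up when jobs wait too long.
    import subprocess
    import time

    WAIT_BEFORE_SCALE_UP = 120   # seconds a job may sit idle before reacting
    CHECK_EVERY = 60             # seconds between queue polls

    def idle_jobs():
        """Count idle jobs (JobStatus == 1) in the HTCondor queue."""
        out = subprocess.check_output(
            ["condor_q", "-constraint", "JobStatus == 1",
             "-format", "%d\n", "ClusterId"])
        return len(out.splitlines())

    def start_vm():
        # Hypothetical stand-in for the EC2 RunInstances call
        print("scale up: starting one more worker VM")

    idle_since = None
    while True:
        if idle_jobs() == 0:
            idle_since = None               # queue is drained
        elif idle_since is None:
            idle_since = time.time()        # jobs just started waiting
        elif time.time() - idle_since > WAIT_BEFORE_SCALE_UP:
            start_vm()                      # jobs waited too long
            idle_since = time.time()
        time.sleep(CHECK_EVERY)

The symmetric branch, shutting down VMs that sit idle, is omitted for brevity.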

SLIDE 14

Elastic cloud computing in action

Context creation with CernVM Online: http://cernvm-online.cern.ch (a launch sketch follows)

  • Create a new special context
  • Customize a few options
  • Get the generated user-data
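
Once the user-data is downloaded, the head node can be launched on any EC2-compatible cloud. A minimal sketch with boto, assuming a µνCernVM image registered on the cloud; the credentials, the AMI id and the instance type are placeholders:

    # Launch the VAF head node, passing the CernVM Online context as user-data.
    import boto

    ec2 = boto.connect_ec2(aws_access_key_id="KEY",
                           aws_secret_access_key="SECRET")

    with open("vaf-context.txt") as f:   # user-data generated by CernVM Online
        user_data = f.read()

    ec2.run_instances("ami-00000000",    # registered µνCernVM image
                      min_count=1, max_count=1,
                      user_data=user_data,
                      instance_type="m1.large")

On a private cloud the same call would go to the site's EC2-compatible endpoint rather than to Amazon.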

SLIDE 15

Elastic cloud computing in action


Screencast: http://youtu.be/fRq9CNXMcdI

SLIDE 16

µνCernVM+PROOF startup latency


Measured the delay before requested resources become available

Target clouds:

  • Small: OpenNebula @ INFN Torino
  • Large: OpenStack @ CERN (Agile)

Test conditions:

  • µνCernVM uses an HTTP caching proxy
    → precaching via a dummy boot
  • The µνCernVM image is only 12 MB
    → image transfer time is negligible
  • VMs are deployed only when resources are available
    → rules out delays and errors due to lack of resources

Measuring the latency due to:

  • µνCernVM boot time
  • HTCondor automatic registration of new nodes
  • PoD and PROOF reaction time

Note: this is not a comparison of cloud infrastructures; only the µνCernVM+PROOF latencies are measured.

SLIDE 17

µνCernVM+PROOF startup latency

Measured the time elapsed between the request of the PoD workers and their availability (checked with pod-info -l); 10 VMs were started in each test. The two clouds give compatible results: the latency is ~6 minutes from scratch.

[Plot: time to wait for workers [m:ss], between 0:00 and 8:00, for CERN OpenStack and Torino OpenNebula.]
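
The measurement itself is simple to reproduce. A sketch under assumptions: the talk polled pod-info -l, while here "pod-info -n" is assumed to print the count of online workers, and the backend and worker count are illustrative:

    # Time how long the requested PoD workers take to become available.
    import subprocess
    import time

    requested = 10
    t0 = time.time()
    subprocess.run(["pod-submit", "-r", "condor", "-n", str(requested)],
                   check=True)

    # Poll until all requested workers have registered with the PoD server
    while int(subprocess.check_output(["pod-info", "-n"])) < requested:
        time.sleep(10)

    print(f"All {requested} workers available after {time.time() - t0:.0f} s")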

SLIDE 18

Conclusions

Every VAF layer is cloud-aware

  • PROOF+HTCondor deal with “elastic” addition/removal of workers
  • µCernVM is very small and fast to deploy
  • CernVM-FS downloads only what is needed

Consistent configuration of solid and independent components

  • No login to configure: all done via CernVM Online context
  • PROOF+PoD also work dynamically on the Grid
  • Elastiq can scale any HTCondor cluster, not PROOF-specific
  • Reused existing components wherever possible


SLIDE 19

Thank you for your attention!

SLIDE 20

References


  • PROOF (the Parallel ROOT Facility): http://root.cern.ch/drupal/content/proof
  • Virtual Analysis Facility client and Elastiq: https://github.com/dberzano/virtual-analysis-facility
  • The CernVM ecosystem: http://cernvm.cern.ch/portal/publications
  • Cloud @ INFN Torino: http://chep2013.org/contrib/474
  • CERN Agile Infrastructure: http://chep2013.org/contrib/86