
SLIDE 1

Compute and data management strategies for grid deployment of high throughput protein structure studies

Ian Stokes-Rees and Piotr Sliz, Harvard Medical School
Many Task Computing on Grids and Supercomputers (MTAGS) 2010

SLIDE 2

Overview

  • Context: structural biology computing (think proteins)
  • Infrastructure: Open Science Grid
  • Computational model: application, data, workflow
  • Identity management and security
  • Perspectives & conclusions

SLIDE 3

SBGrid Consortium

Rice University: E. Nikonowicz, Y. Shamoo, Y.J. Tao
CalTech: P. Bjorkman, W. Clemons, G. Jensen, D. Rees
Stanford: A. Brunger, K. Garcia, T. Jardetzky
UCSF: JJ Miranda, Y. Cheng
UC Davis: H. Stahlberg
UCSD: T. Nakagawa, H. Viadiu
WesternU: M. Swairjo
U. Washington: T. Gonen
Washington U. School of Med.: T. Ellenberger, D. Fremont
Rosalind Franklin: D. Harrison
Harvard and Affiliates: A. Leschziner, K. Miller, A. Rao, T. Rapoport, M. Samso, P. Sliz, T. Springer, G. Verdine, G. Wagner, L. Walensky, S. Walker, T. Walz, J. Wang, S. Wong, N. Beglova, S. Blacklow, B. Chen, J. Chou, J. Clardy, M. Eck, B. Furie, R. Gaudet, M. Grant, S.C. Harrison, J. Hogle, D. Jeruzalmi, D. Kahne, T. Kirchhausen
Cornell U. (NE-CAT): R. Oswald, C. Parrish, H. Sondermann, R. Cerione, B. Crane, S. Ealick, M. Jin, A. Ke
Brandeis U.: N. Grigorieff
Tufts U.: K. Heldwein
UMass Medical: W. Royer
NIH: M. Mayer
U. Maryland: E. Toth
Yale U.: K. Reinisch, J. Schlessinger, F. Sigworth, F. Zhou, T. Boggon, D. Braddock, Y. Ha, E. Lolis
Vanderbilt Center for Structural Biology: C. Sanders, B. Spiller, M. Stone, M. Waterman, W. Chazin, B. Eichman, M. Egli, B. Lacy
Columbia U.: Q. Fan
Rockefeller U.: R. MacKinnon
Thomas Jefferson: J. Williams

Not Pictured: University of Toronto: L. Howell, E. Pai, F. Sicheri; NHRI (Taiwan): G. Liou; Trinity College, Dublin: Amir Khan

SLIDE 4

Northeast BioGrid Virtual Organization

  • Biomedical researchers: life sciences, universities, hospitals, government agencies
  • Currently Boston-focused

[Map of member institutions; recoverable label: Tufts University School of Medicine]
SLIDE 5

Protein Structure Studies

Pipeline: sample → imaging → data → structure. Techniques: X-ray crystallography, cryo-electron microscopy, and others. O(1e5) fragments processed using grid infrastructure.

SLIDE 6

Single Structure Study

SLIDE 7

Broad Structure Study

SLIDE 8

550 structures × 4,000 iterations ≈ 2.2 million iterations in the broad study

SLIDE 9

Single Structure Wide Search

100,000 iterations; 20,000 core-hours; 12 hours wall-clock (typical), i.e. roughly 1,700 concurrent cores on average

SLIDE 10

Open Science Grid

  • US national cyberinfrastructure
  • Primarily used for high-energy physics computing
  • 80 sites
  • O(1e5) job slots
  • O(1e6) core-hours per day
  • PB-scale aggregate storage

[Usage chart: OSG core-hours by VO, including LIGO, Engage, and SBGrid]

SLIDE 11

SLIDE 12

Typical Layered Environment

  • Command-line application (e.g. Fortran)
  • Friendly application API wrapper (sketched below)
  • Batch execution wrapper for N iterations
  • Results extraction and aggregation
  • Grid job management wrapper
  • Web interface: forms, views, static HTML results

GOAL: eliminate shell scripts, often found as the "glue" language between layers.

[Stack diagram: Fortran bin → Python API → multi-exec wrapper → result aggregator → grid management → web interface; MAP-REDUCE pattern]
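To make the two lowest layers concrete, here is a minimal sketch of a Python API wrapper around a command-line application; `search.bin` and its flags are hypothetical placeholders, not the actual SBGrid binary.

```python
# Minimal sketch of the "friendly application API wrapper" layer: a Python
# function wrapping a CLI binary. `search.bin` and its flags are placeholders.
import subprocess
from pathlib import Path

def run_search(model: str, data: str, workdir: str) -> Path:
    """Run one invocation of the CLI application; return the log file path."""
    out = Path(workdir)
    out.mkdir(parents=True, exist_ok=True)
    log = out / "run.log"
    cmd = ["search.bin", "--model", model, "--data", data]
    with log.open("w") as fh:
        # check=True raises on a non-zero exit code, letting the batch
        # wrapper one layer up catch and classify the failure.
        subprocess.run(cmd, stdout=fh, stderr=subprocess.STDOUT,
                       cwd=out, check=True)
    return log
```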

SLIDE 13

Shell Scripting vs. Structured Language

Shell scripting:
✓ Rich set of easy-to-use file system operations
✓ Quick to translate "experimental" operations from the command line into a reusable script
✓ Portable
  • Limited error handling
  • Configuration and parameter processing
  • Limited data structures
  • Difficult to build larger systems
  • Poor web integration

Structured language:
✓ Good modularization
✓ Good Web/RPC integration
✓ Good error handling
✓ Rich data structures
✓ GUI interfaces possible
  • File system interaction difficult
  • Portability
  • Translating CLI operations laborious

SLIDE 14

Developing MTC Workflows

  • Single CLI execution
  • Job submission
  • Configuration
  • API for invocation
  • Results suitable for aggregation
  • Multi-exec format (important for short invocations; see the sketch below)
  • Meta-data suitable for MTC management and metrics
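As a sketch of the multi-exec point: one grid job loops over many short invocations so per-job scheduling overhead is amortized. The JSON task-list format and `run_one` helper are illustrative assumptions, not the actual SBGrid format.

```python
# Hypothetical multi-exec wrapper: one grid job executes N short tasks and
# emits a single aggregatable results record per task.
import json, subprocess, sys, time

def run_one(task: dict) -> dict:
    """Execute a single task; task["cmd"] is an argv list (assumed format)."""
    subprocess.run(task["cmd"], check=True)
    return {"job": task["id"]}

def main(task_file: str) -> None:
    results = []
    for task in json.load(open(task_file)):
        t0 = time.time()
        try:
            rec = run_one(task)
            rec["status"] = "OK"
        except Exception as exc:
            rec = {"job": task.get("id"), "status": "ERROR", "detail": str(exc)}
        rec["runtime"] = round(time.time() - t0, 1)   # meta-data for metrics
        results.append(rec)
    json.dump(results, sys.stdout, indent=1)          # one results file per job

if __name__ == "__main__":
    main(sys.argv[1])
```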

SLIDE 15

Application Model

Application binary → API wrapper (single invocation)
  • shex: Python module for shell-like operations
  • xconfig: Python module for environment and module configuration

Grid wrapper (single invocation)
  • grid job description for a single invocation

Workflow generator (sketched below)
  • creates DAG and job descriptors

Standard results format; standard meta-data format
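A hedged sketch of what the workflow-generator step could look like with Condor, the batch system commonly used on OSG: emit one submit description per invocation plus a DAGMan file tying them together. The file layout and the `grid_wrapper.sh` name are assumptions.

```python
# Illustrative workflow generator: write per-job Condor submit files and a
# DAGMan file. grid_wrapper.sh and the naming scheme are assumptions.
from pathlib import Path

def write_workflow(tasks, outdir="workflow"):
    out = Path(outdir)
    out.mkdir(exist_ok=True)
    dag = []
    for i, args in enumerate(tasks):
        name = f"job{i:05d}"
        (out / f"{name}.sub").write_text(
            "universe   = vanilla\n"
            "executable = grid_wrapper.sh\n"
            f"arguments  = {args}\n"
            f"output     = {name}.out\n"
            f"error      = {name}.err\n"
            "log        = workflow.log\n"
            "queue\n")
        dag.append(f"JOB {name} {name}.sub")
        dag.append(f"RETRY {name} 3")   # DAGMan reruns transient failures
    dag_file = out / "workflow.dag"
    dag_file.write_text("\n".join(dag) + "\n")
    return dag_file  # submit with: condor_submit_dag workflow/workflow.dag
```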

SLIDE 16

Modules

http://portal.nebiogrid.org/devel/projects/shex/shex http://portal.nebiogrid.org/devel/projects/xconfig/xconfig

Results

jobset  job     status  start       runtime  exitcode  score
ba9     1scza_  OK      1287230825  635      0         614

Job meta-data: JOB_MARKER entry

JOB_MARKER WQCG-Harvard-OSG tuscany.med.harvard.edu 1287198043 ba9-1c5pa_ sbgrid@tuscany01.med.harvard.edu:/scratch/condor/execute/dir_16947/glide_e16995/execute/dir_27129 Sat Oct 16 03:00:43 UTC 2010
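Assuming the results file begins with the header row shown in the table above, aggregation can start from a few lines of parsing; this is a sketch, not the actual SBGrid tooling.

```python
# Parse the whitespace-delimited results format shown above into records.
def parse_results(path):
    with open(path) as fh:
        header = fh.readline().split()   # jobset job status start runtime ...
        for line in fh:
            rec = dict(zip(header, line.split()))
            for key in ("start", "runtime", "exitcode", "score"):
                rec[key] = int(rec[key])
            yield rec

# Example use: usable results ranked by score.
# best = sorted((r for r in parse_results("results.txt") if r["status"] == "OK"),
#               key=lambda r: r["score"], reverse=True)
```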

Application deployment

  • Locally host a "gold standard"
  • Replicate to a predictable location at all sites: $OSG_APP/sbgrid

System configuration

  • Sanity-check basic prerequisites: memory, disk space, applications, common data sets, directory existence and permissions, network (see the sketch below)
  • Environment: PATH, LD_LIBRARY_PATH, PYTHONPATH, etc.
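A minimal sketch of such a pre-flight sanity check; the thresholds and the application name are placeholders.

```python
# Pre-flight sanity check on a worker node; thresholds and the application
# name are illustrative assumptions.
import os, shutil

def sanity_check(app="search.bin", min_disk_gb=5):
    checks = {
        "application on PATH": shutil.which(app) is not None,
        "working dir writable": os.access(os.getcwd(), os.W_OK),
        "free disk space": shutil.disk_usage(".").free > min_disk_gb * 2**30,
        "common data set staged": os.path.isdir(os.path.join(
            os.environ.get("OSG_DATA", "/nonexistent"), "sbgrid", "biodb")),
    }
    failed = [name for name, ok in checks.items() if not ok]
    if failed:
        raise SystemExit("sanity check failed: " + ", ".join(failed))
```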

SLIDE 17

Data Model (I)

Per-job data
  • Minimize to the smallest unique set; even then, data may need to be pre-staged to a remote file server
  • Staged using the job manager, or pulled by rsync, curl (HTTP), or scp (see the staging sketch below)
  • Removed on job completion

Per-job-set (workflow instance) data
  • Pre-staged to each site at job set creation time: $OSG_DATA/users/$USERNAME/workflows/$WORKFLOWNAME
  • Fetched by each job to worker-node local disk (or read from NFS)
  • Removed on job set cleanup or by a weekly tmpwatch sweep
  • NEW: large data sets for a workflow instance are pre-staged to UCSD and pulled on a per-job basis (insufficient quota, but big pipes)
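A sketch of the per-job pull step under these assumptions; host names and paths are placeholders, and the tool order (HTTP first, scp fallback) is illustrative.

```python
# Try staging a per-job input over HTTP first, then fall back to scp.
# Host names and paths are placeholders.
import subprocess

def stage_in(name, dest="."):
    attempts = [
        ["curl", "-sSf", "-O", f"http://data.example.org/jobs/{name}"],
        ["scp", f"fileserver.example.org:/stage/{name}", "."],
    ]
    for cmd in attempts:
        if subprocess.run(cmd, cwd=dest).returncode == 0:
            return
    raise RuntimeError(f"could not stage {name} from any source")
```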
SLIDE 18

Data Model (II)

User project data
  • Pre-staged and manually managed at each site by the user: $OSG_DATA/users/$USERNAME/projects/$PROJECTNAME
  • Fetched by each job to worker-node local disk (or read from NFS)
  • Removed by the user, or manually by administrators on a quota basis

Static data
  • Maintain a "gold standard" and rsync or bulk-update as required
  • 20 GB of protein models pre-staged to $OSG_DATA/sbgrid/biodb

SLIDE 19

Workflow Model

Continuous aggregation
  • "In-progress" view of the data; accept the possibility of corruption

Track errors
  • Sort by execution site, the key predictor of network, disk, library, and configuration problems

Retain only key output
  • STDOUT, STDERR, and a single per-job "results" file are enough to easily retry arbitrary subsets of the overall job set (timeout, error, etc.)

On-demand updates
  • User-driven "expensive" status updates on queued, running, complete, and failed jobs, plus aggregated results and report generation

Finalized results
  • Cleaned results
  • Augmented results (inclusion of static per-job information)

SLIDE 20

Application Exit States

  • OK: job executed the application correctly and usable results are returned → done
  • NO_SOLUTION: job executed, but produced no usable results → failed, don't rerun
  • ERROR: job failed to execute properly → failed, rerun (up to the retry limit)
  • SHORT: job executed and produced output, but the runtime is suspiciously short → complete, but don't trust; rerun
  • TIMEOUT: job was aborted before completing; no results available → cancelled, don't rerun
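Transcribed into a lookup table (a direct restatement of the states above; the retry-limit value is an assumption):

```python
# Exit-state policy table: are results usable, and should the job rerun?
RETRY_LIMIT = 3   # assumed value

POLICY = {
    "OK":          {"usable": True,  "rerun": False},  # done
    "NO_SOLUTION": {"usable": False, "rerun": False},  # failed, don't rerun
    "ERROR":       {"usable": False, "rerun": True},   # rerun up to limit
    "SHORT":       {"usable": False, "rerun": True},   # output untrusted
    "TIMEOUT":     {"usable": False, "rerun": False},  # cancelled
}

def should_rerun(state, attempts):
    return POLICY[state]["rerun"] and attempts < RETRY_LIMIT
```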

SLIDE 21

Integrated View & Debug

Web portal access to files
  • X.509 access control (full file management)
  • .htpasswd read-only sharing

  • CLI or web interaction with running jobs
  • Web view of data (files, tables, reports, AJAX)
  • Web "file browsing" of all results, with augmented hyperlinking to details or static information
  • ssh/CLI access to files

Users need to be able to drill down into the roughly 1 million files and 5 GB of data generated by the execution of their workflow.

SLIDE 22

Access, IdM and Security

Relying heavily on OSG facilities for a federated environment:
  • X.509 DNs, proxy certs, MyProxy
  • LDAP for local accounts
  • Access control: mod_gridsite and GACL policies (example below)
  • Data access: Apache and mod_gridsite
  • Service access: web portal and GSI-enabled ssh

Challenge: making these facilities available to the user community. Alternatives to the web portal and gsi-ssh, local to the user, would be nice.
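For concreteness, a small policy of the kind mod_gridsite evaluates might look like the following; the DN and the VO group are placeholders, and the element names follow the GridSite GACL format as best understood here.

```xml
<!-- Illustrative GACL policy: one X.509 DN gets full access, members of a
     VO group (placeholder /sbgrid) get read-only access. -->
<gacl>
  <entry>
    <person>
      <dn>/DC=org/DC=doegrids/OU=People/CN=Example User</dn>
    </person>
    <allow><read/><list/><write/><admin/></allow>
  </entry>
  <entry>
    <voms>
      <fqan>/sbgrid</fqan>
    </voms>
    <allow><read/><list/></allow>
  </entry>
</gacl>
```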

SLIDE 23

Security Challenges

Identity Management

Mixture of .htpasswd, PAM, X.509, and application-specific IDs. The complexity of X.509 (and its associated paraphernalia) confuses users during account creation, use, and management.

Virtual Organization hierarchies and user-driven collaborations

  • Inheritance of rights/policies
  • How to allow users to easily create and manage groups

Merging security policies

Site/resource, VO, and user policies need to be merged

Encryption and Privacy Preservation

  • Generic mechanisms for encryption and key management
  • Preserving privacy of actions and data in a federated grid environment

SLIDE 24

SLIDE 25

SLIDE 26

SLIDE 27

Security Work

Meta-data system
  • Provide more generic pointers to ACLs and encryption keys

Extension of the GACL system
  • Include non-X.509 ID tokens as policy principals
  • Allow GACL policies to apply to web framework objects (pyGACL)

Simple replicated key system for file encryption (sketched below)
  • Use the meta-data framework to point to the encryption key (and replicas)
  • Use GACL to control key access (the key is a regular file)
  • Libraries to automatically read/write encrypted files
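A minimal sketch of the read/write library idea, using the modern Python `cryptography` package as a stand-in (the original system predates it; key paths are placeholders):

```python
# Transparent encrypted-file helpers; the symmetric key is itself a regular
# file that can be GACL-protected and replicated.
from cryptography.fernet import Fernet

def _key(key_path):
    with open(key_path, "rb") as fh:
        return Fernet(fh.read())

def write_encrypted(path, data: bytes, key_path):
    with open(path, "wb") as fh:
        fh.write(_key(key_path).encrypt(data))

def read_encrypted(path, key_path) -> bytes:
    with open(path, "rb") as fh:
        return _key(key_path).decrypt(fh.read())

# One-time key creation (then replicate alongside the data):
# with open("data.key", "wb") as fh:
#     fh.write(Fernet.generate_key())
```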

Future
  • VO hierarchies
  • Tools for user-driven ACL management
  • Tools for policy management (merging site, VO, and user policies)

SLIDE 28

Perspectives and Conclusions

Hierarchical execution model
  • application, API, cluster, grid, multi-exec
  • tension between experimentation, development, and debugging; Python provides the right mix for this

Necessity of a clear model for data and binary deployment
  • lifetime, configuration, ownership

Workflow management
  • application-specific part
  • infrastructure-specific part

Web interfaces, command line, and APIs are all required.
Identity management, access control, and the security model are tough!

SLIDE 29

Acknowledgements

  • Piotr Sliz: PI and SBGrid team leader
  • Peter Doherty: Grid Research IT Admin
  • Ian Levesque: Systems Architect
  • Ben Eisenbraun: Software Curator

Please be in touch if you have questions: http://portal.nebiogrid.org/ ijstokes@hkl.hms.harvard.edu

SLIDE 30

Extras

SLIDE 31

How to get a structural biologist using CI (cyberinfrastructure)

Ease of use

  • No command line
  • X.509 (initial request, VOs, proxies, roles, etc.) is really complicated
  • Support infrastructure (mailing lists, tickets, phone, training)

Killer apps

  • They will use it if they see peers using it to advance scientific goals
  • They will use it if novel workflows or workflow patterns are established
  • Data management is a big problem for everyone (see bonus, time permitting); we believe grid infrastructure could provide a solution

Security

  • Data needs to be secure ...
  • ... but users still want to control sharing/access

Roadblocks

  • Reliability of the underlying infrastructure and difficulty in debugging
  • Applications tied to GUIs; rudimentary interfaces

SLIDE 32

SLIDE 33