Compute and data management strategies for grid deployment of high throughput protein structure studies
Ian Stokes-Rees, Piotr Sliz Harvard Medical School Many Task Computing on Grids and Supercomputers 2010
Ian Stokes-Rees - portal.nebiogrid.org - Harvard Medical School MTAGS10, November 2010
Overview:
- Context: structural biology computing (think proteins)
- Infrastructure: Open Science Grid
- Computational model: application, data, workflow
- Identity management and security
- Perspectives & conclusions
Collaborating institutions (map slide): Rice University (Y.J. Tao); CalTech; Stanford; UCSF (JJ Miranda); UC Davis; UCSD; WesternU; Washington U. School of Medicine; Vanderbilt Center for Structural Biology; Rosalind Franklin; Harvard and affiliates (S. Walker, T. Walz, S.C. Harrison); NE-CAT; Cornell U.; Brandeis U.; Tufts U.; UMass Medical; NIH; Yale U.; Columbia U.; Rockefeller U.; Thomas Jefferson.
Not pictured: University of Toronto (L. Howell, E. Pai, F. Sicheri); NHRI (Taiwan): G. Liou; Trinity College, Dublin: Amir Khan.
User community:
- Biomedical researchers, life sciences
- Universities, hospitals, government agencies
- Currently Boston-focused
Pipeline: sample → imaging → data → structure (X-ray crystallography, cryo-electron microscopy, ...); fragments, O(1e5), processed using grid infrastructure.
Typical workload: 100,000 iterations, 20,000 core-hours, 12 hours wall-clock; that is roughly 12 core-minutes per iteration, running on roughly 1,700 cores concurrently.
Open Science Grid:
- US national cyberinfrastructure, primarily used for high energy physics computing
- 80 sites
- O(1e5) job slots
- O(1e6) core-hours per day
- PB-scale aggregate storage
[OSG usage chart; VOs shown include LIGO, Engage, SBGrid; 4,654,878 (value from chart, units not recoverable)]
Hierarchical execution model (goal: eliminate shell scripts):
- Command-line application (e.g. a Fortran binary)
- Friendly application API wrapper (Python API)
- Batch execution wrapper for N iterations (multi-exec wrapper)
- Results extraction and aggregation (result aggregator; map-reduce pattern)
- Grid job management wrapper
- Web interface: forms, views, static HTML results
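The bottom two layers can be sketched as a thin Python wrapper that turns one CLI invocation into a structured record. This is a minimal illustration under assumed conventions, not the actual SBGrid wrapper; the metadata fields mirror the per-job results format (runtime, exit code) used for aggregation.

```python
import subprocess
import time

def run_app(cmd, args, workdir=".", timeout=None):
    """Run one CLI invocation and capture outcome metadata.

    A sketch of the 'friendly application API wrapper' layer:
    the command name and argument style are placeholders, not the
    actual SBGrid tool interface.
    """
    start = time.time()
    proc = subprocess.run(
        [cmd] + [str(a) for a in args],
        cwd=workdir,
        capture_output=True,  # keep STDOUT/STDERR for later retention
        text=True,
        timeout=timeout,
    )
    return {
        "exitcode": proc.returncode,
        "runtime": time.time() - start,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

A multi-exec wrapper would then call `run_app` N times in a loop, amortizing job-scheduling overhead across many short invocations.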
Why Python:
✓ Rich set of easy-to-use file system processing
✓ Quick to translate an "experimental" script into a reusable script
✓ Portable
✓ Good modularization
✓ Good Web/RPC integration
✓ Good error handling
✓ Rich data structures
✓ GUI interfaces possible
Wrapper requirements:
- Single CLI execution
- Job submission
- Configuration API for invocation
- Results suitable for aggregation
- Multi-exec format (important for short invocations)
- Meta-data suitable for MTC management and metrics
Components:
- Application binary
- API wrapper (single invocation): shex, a Python module for shell-like operations; xconfig, a Python module for environment and module configuration
- Grid wrapper (single invocation): grid job description for a single invocation
- Workflow generator: creates the DAG and job descriptors
- Standard results format; standard meta-data format
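The workflow-generator step can be sketched as code that emits one job descriptor per invocation plus a DAG file tying them together. Condor/DAGMan syntax is used here because OSG submission commonly goes through Condor; the file layout, submit attributes, and wrapper script name (`multi_exec_wrapper.sh`) are illustrative assumptions, not the actual SBGrid descriptors.

```python
import os

def write_dag(jobset, jobs, outdir):
    """Emit a Condor submit description per job and a DAGMan file.

    A hypothetical sketch of the workflow generator: names and
    attributes are placeholders, not SBGrid's actual descriptors.
    """
    os.makedirs(outdir, exist_ok=True)
    dag_lines = []
    for job in jobs:
        sub = os.path.join(outdir, f"{jobset}-{job}.sub")
        with open(sub, "w") as f:
            f.write("universe = vanilla\n")
            f.write("executable = multi_exec_wrapper.sh\n")
            f.write(f"arguments = {jobset} {job}\n")
            f.write(f"output = {jobset}-{job}.out\n")
            f.write(f"error = {jobset}-{job}.err\n")
            f.write("queue\n")
        dag_lines.append(f"JOB {jobset}-{job} {sub}")
    dag_path = os.path.join(outdir, f"{jobset}.dag")
    with open(dag_path, "w") as f:
        f.write("\n".join(dag_lines) + "\n")
    return dag_path
```

The resulting `.dag` file would then be handed to `condor_submit_dag`, which manages submission and retries for the whole jobset.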
Modules:
- shex: http://portal.nebiogrid.org/devel/projects/shex/shex
- xconfig: http://portal.nebiogrid.org/devel/projects/xconfig/xconfig
Results: one whitespace-delimited line per job:

    jobset  job     status  start       runtime  exitcode  score
    ba9     1scza_  OK      1287230825  635      0         614

Job meta-data: JOB_MARKER entry:

    JOB_MARKER WQCG-Harvard-OSG tuscany.med.harvard.edu 1287198043 ba9-1c5pa_ sbgrid@tuscany01.med.harvard.edu:/scratch/condor/execute/dir_16947/glide_e16995/execute/dir_27129 Sat Oct 16 03:00:43 UTC 2010
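A few lines of Python suffice to parse and aggregate the per-job results lines; this is a hypothetical helper (the numeric typing of fields is an assumption), shown to illustrate how the one-line-per-job format supports retry selection and metrics.

```python
from collections import Counter

# Field names follow the results header shown above.
FIELDS = ("jobset", "job", "status", "start", "runtime", "exitcode", "score")

def parse_result(line):
    """Parse one results line, e.g. 'ba9 1scza_ OK 1287230825 635 0 614'."""
    rec = dict(zip(FIELDS, line.split()))
    for key in ("start", "runtime", "exitcode", "score"):
        rec[key] = int(rec[key])  # typing assumption: these are integers
    return rec

def status_counts(lines):
    """Aggregate status across a jobset, e.g. to pick retry candidates."""
    return Counter(parse_result(l)["status"] for l in lines if l.strip())
```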
Application deployment:
- Locally host a "gold standard" copy
- Replicate to a predictable location at all sites: $OSG_APP/sbgrid

System configuration:
- Sanity-check basic prerequisites (memory, disk space, applications, common data sets, directory existence and permissions, network)
- Environment: PATH, LD_LIBRARY_PATH, PYTHONPATH, etc.
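The sanity-check step might look like the following sketch; the thresholds, environment-variable names, and directory list are illustrative placeholders, not actual SBGrid site policy.

```python
import os
import shutil

def sanity_check(workdir, min_free_gb=1, required_env=("PATH",),
                 required_dirs=()):
    """Pre-flight checks before running jobs at a site.

    Returns a list of problem descriptions (empty means healthy).
    Defaults here are assumed values for illustration.
    """
    problems = []
    # Disk space on the working area
    free_gb = shutil.disk_usage(workdir).free / 1e9
    if free_gb < min_free_gb:
        problems.append(f"low disk: {free_gb:.1f} GB free")
    # Required environment variables (PATH, LD_LIBRARY_PATH, ...)
    for var in required_env:
        if var not in os.environ:
            problems.append(f"missing env var: {var}")
    # Directory existence and write permission
    for d in required_dirs:
        if not os.path.isdir(d) or not os.access(d, os.W_OK):
            problems.append(f"missing or unwritable dir: {d}")
    return problems
```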
Per-job data:
- Minimize to the smallest unique set per job
- Even then, may need to pre-stage data to a remote file server
- Staged using the job manager, or pulled by rsync, curl (HTTP), or scp
- Removed on job completion

Per-jobset (workflow instance) data:
- Pre-staged to each site at jobset creation time: $OSG_DATA/users/$USERNAME/workflows/$WORKFLOWNAME
- Fetched by each job to worker-node local disk (or read from NFS)
- Removed on jobset cleanup or by a weekly tmpwatch sweep
- New: large data sets for a workflow instance are pre-staged to UCSD and pulled from there
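The pull-with-fallback staging described above can be sketched as trying each transfer command in turn. The rsync/curl/scp argument lists in the docstring are standard tool usage with hypothetical source paths; the helper itself is illustrative, not SBGrid code.

```python
import subprocess

def stage_in(dest, attempts, timeout=600):
    """Try each transfer command in order until one succeeds.

    `attempts` is a list of argv lists, tried in order, e.g.
    (hypothetical sources):
        [["rsync", "-a", "fileserver:/data/job42/", dest],
         ["curl", "-sSf", "-o", dest, "http://fileserver/job42.tar"],
         ["scp", "fileserver:/data/job42.tar", dest]]
    Returns the name of the tool that succeeded.
    """
    for cmd in attempts:
        try:
            subprocess.run(cmd, check=True, timeout=timeout)
            return cmd[0]
        except (OSError, subprocess.CalledProcessError,
                subprocess.TimeoutExpired):
            continue  # fall through to the next transfer method
    raise RuntimeError(f"all transfer methods failed for {dest}")
```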
User project data:
- Pre-staged and manually managed at each site by the user: $OSG_DATA/users/$USERNAME/projects/$PROJECTNAME
- Fetched by each job to worker-node local disk (or read from NFS)
- Removed by the user, or manually by administrators on a quota basis

Static data:
- Maintain a "gold standard" copy; rsync or bulk-update as required
- 20 GB of protein models pre-staged to $OSG_DATA/sbgrid/biodb
Results handling:
- Continuous aggregation: an "in progress" view of the data; accept the possibility of corruption
- Track errors: sort by execution site, the key predictor of network, disk, library, and configuration problems
- Retain only key output: STDOUT, STDERR, and a single per-job "results" file are enough to easily retry arbitrary subsets of the overall jobset (timeout, error, etc.)
- On-demand updates: user-driven "expensive" status updates on queued, running, complete, and failed jobs, plus aggregated results and report generation
- Finalized results: cleaned results; augmented results (inclusion of static per-job information)
Per-job status taxonomy:
- OK: job executed the application correctly and usable results are returned (done)
- NO_SOLUTION: job executed, but no usable results (failed; don't rerun)
- ERROR: job failed to execute properly (failed; rerun, up to the retry limit)
- SHORT: job executed and produced output, but the runtime is suspicious (complete, but don't trust; rerun)
- TIMEOUT: job was aborted before completing; no results available (cancelled; don't rerun)
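This taxonomy reduces to a small decision table. A sketch follows; the retry limit is a parameter the slides mention but do not quantify, so the default of 3 is an assumption.

```python
# Decision table for the per-job status taxonomy above:
# status -> (outcome label, eligible for rerun)
DECISIONS = {
    "OK":          ("done", False),
    "NO_SOLUTION": ("failed", False),
    "ERROR":       ("failed", True),    # rerun up to the retry limit
    "SHORT":       ("suspect", True),   # output present but untrusted
    "TIMEOUT":     ("cancelled", False),
}

def should_rerun(status, attempts, retry_limit=3):
    """Return True if a job with this status should be resubmitted."""
    outcome, rerun = DECISIONS[status]
    return rerun and attempts < retry_limit
```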
Web portal access to files:
- X.509 access control (full file management)
- .htpasswd read-only sharing
- CLI or web interaction with running jobs
- Web view of data (files, tables, reports, AJAX)
- Web "file browsing" of all results, with augmented hyperlinking to details or static information

ssh/CLI access to files:
- Users need to be able to drill down into the roughly 1 million files and 5 GB of data generated by the execution of their workflow
Relying heavily on OSG facilities for the federated environment:
- X.509 DNs, proxy certificates, MyProxy
- LDAP for local accounts
- Access control: mod_gridsite and GACL policies
- Data access: Apache and mod_gridsite
- Service access: web portal and GSI-enabled ssh

Challenge: making these facilities available to the user community; alternatives to the web portal and gsi-ssh that run local to the user would be nice.
Identity management:
- A mixture of .htpasswd, PAM, X.509, and application-specific IDs
- The complexity of X.509 (and its associated paraphernalia) confuses users during account creation, use, and management

Virtual Organization hierarchies and user-driven collaborations:
- Inheritance of rights/policies
- How to allow users to easily create and manage groups

Merging security policies:
- Site/resource, VO, and user policies need to be merged

Encryption and privacy preservation:
- Generic mechanisms for encryption and key management
- Preserving the privacy of actions and data in a federated grid environment
Ian Stokes-Rees, http://sbgrid.org
Meta-data system:
- Provide more generic pointers to ACLs and encryption keys

Extension of the GACL system:
- Include non-X.509 ID tokens as policy principals
- Allow GACL policies to apply to web framework objects (pyGACL)

Simple replicated key system for file encryption:
- Use the meta-data framework to point to the encryption key (and its replicas)
- Use GACL to control key access (the key is a regular file)
- Libraries to automatically read/write encrypted files

Future:
- VO hierarchies
- Tools for user-driven ACL management
- Tools for policy management (merging site, VO, and user policies)
Conclusions:
- Hierarchical execution model: application, API, cluster, grid, multi-exec. There is tension between experimentation, development, and debugging; Python provides the right mix for this.
- Necessity of a clear model for data and binary deployment: lifetime, configuration, ownership.
- Workflow management has an application-specific part and an infrastructure-specific part.
- Web interfaces, command line, and APIs are all required.
- Identity management, access control, and the security model are tough!
Credits:
- Piotr Sliz: PI and SBGrid team leader
- Peter Doherty: Grid Research IT Admin
- Ian Levesque: Systems Architect
- Ben Eisenbraun: Software Curator
Please be in touch if you have questions: http://portal.nebiogrid.org/ ijstokes@hkl.hms.harvard.edu
Ease of use:
- No command line
- X.509 (initial request, VOs, proxies, roles, etc.) is really complicated
- Support infrastructure needed (mailing lists, tickets, phone, training)

Killer apps:
- Users will adopt the grid if they see peers using it to advance scientific goals
- They will use it if novel workflows or workflow patterns are established
- Data management is a big problem for everyone (see bonus, time permitting); we believe grid infrastructure could provide a solution

Security:
- Data needs to be secure ...
- ... but users still want to control sharing/access

Roadblocks:
- Reliability of the underlying infrastructure, and difficulty in debugging
- Applications tied to GUIs; rudimentary interfaces