Pegasus : Introducing Integrity to Scientific Workflows
Karan Vahi
vahi@isi.edu https://pegasus.isi.edu
Compute Pipelines – Building Blocks
HTCondor DAGMan
DAGMan is a reliable and scalable workflow executor:
- Sits on top of the HTCondor Schedd
- Can handle very large workflows
- Has useful reliability features built in: automatic job retries and rescue DAGs (recover from where you left off in case of failures)
- Throttling for jobs in a workflow (see the DAGMan file sketch below)

However, it is still up to the user to figure out:
- Data Management – How do you ship in the small/large amounts of data required by your pipeline, and which protocols should you use?
- How best to leverage different infrastructure setups – OSG has no shared filesystem, while XSEDE and your local campus cluster have one!
- Debug and Monitor Computations – Correlate data across lots of log files; you need to know what host a job ran on and how it was invoked.
- Restructure Workflows for Improved Performance – Short-running tasks? Data placement?
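For concreteness, a minimal sketch of a DAGMan input file exercising the retry and throttling features mentioned above (all node names and submit files here are hypothetical):

```
# pipeline.dag – a tiny DAGMan workflow
JOB  prepare  prepare.sub
JOB  analyze  analyze.sub
JOB  collect  collect.sub

# DAG edges: prepare -> analyze -> collect
PARENT prepare CHILD analyze
PARENT analyze CHILD collect

# Reliability: retry a failed node up to 3 times; if the run still
# fails, DAGMan writes a rescue DAG so a resubmission resumes where
# the previous run left off.
RETRY analyze 3

# Throttling: at most 20 nodes of this category run concurrently.
CATEGORY collect io_heavy
MAXJOBS io_heavy 20
```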
Automate Recover Debug
Why Pegasus?
- Automates complex, multi-stage processing pipelines
- Enables parallel, distributed computations
- Automatically executes data transfers
- Reusable, aids reproducibility
- Records how data was produced (provenance)
- Provides tools to handle and debug failures
- Keeps track of data and files
- Portable: describe once, execute multiple times

NSF-funded project since 2001, in close collaboration with the HTCondor team.
Pegasus plans the abstract workflow into an executable workflow, adding auxiliary jobs along the way:
- stage-in job – Transfers the workflow input data
- stage-out job – Transfers the workflow output data
- registration job – Registers the workflow output data
- cleanup job – Removes unused data
- clustered job – Groups small jobs together to improve performance

Portable Description – users describe the workflow as a directed acyclic graph (a DAG in XML) and don't worry about low-level execution details.
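For example, with the DAX3 Python API of the Pegasus 4.x series, the abstract DAG is written programmatically and serialized to XML; the file and transformation names below are illustrative:

```python
#!/usr/bin/env python
# Sketch of a portable workflow description using the Pegasus DAX3 API.
from Pegasus.DAX3 import ADAG, Job, File, Link

dax = ADAG("example-pipeline")

raw = File("input.txt")   # workflow input, located via the replica catalog
out = File("result.txt")  # workflow output, staged out automatically

job = Job(name="analyze") # logical name resolved via the transformation catalog
job.addArguments("-i", raw, "-o", out)
job.uses(raw, link=Link.INPUT)
job.uses(out, link=Link.OUTPUT, transfer=True)
dax.addJob(job)

# The resulting XML says nothing about sites, paths, or protocols:
# Pegasus fills in those low-level details at planning time.
with open("pipeline.dax", "w") as f:
    dax.writeXML(f)
```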
Data Staging Configurations
- Condor I/O (HTCondor pools, OSG, …) – via HTCondor file transfers
- Non-shared File System (clouds, OSG, …) – data flows through a staging site, possibly not co-located with the computation
- Shared File System (HPC sites, XSEDE, campus clusters, …)

[Diagram: the three setups – an HPC cluster compute site with a shared filesystem; a compute site with a separate staging site, e.g., Amazon EC2 with S3; and a compute site fed from the submit host's local filesystem via Condor I/O]
Pegasus Guarantee – Wherever and whenever a job runs, its inputs will be in the directory where it is launched.
Supported transfer protocols (including 3rd-party transfers): HTTP, SCP, GridFTP, Globus Online, iRODS, Amazon S3, Google Storage, SRM, FDT, stashcp, cp, ln -s
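The staging behavior is selected at planning time through a Pegasus property; a minimal sketch (the three values below are the documented modes in Pegasus 4.x):

```properties
# pegasus.properties – select how data reaches the jobs
#   sharedfs    : jobs work out of a directory on the site's shared filesystem
#   nonsharedfs : data flows through a staging site (e.g., Amazon S3)
#   condorio    : inputs/outputs move via HTCondor file transfers
pegasus.data.configuration = nonsharedfs
```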
cacr.iu.edu/projects/swip/
NSF CICI Awards 1642070, 1642053, and 1642090
Goals
- Provide additional assurances that a scientific workflow is not accidentally or maliciously tampered with during its execution
- Allow for detection of modification to its data or executables at later dates, to facilitate reproducibility
- Integrate cryptographic support for data integrity into the Pegasus Workflow Management System
PIs: Von Welch, Ilya Baldin, Ewa Deelman, Steve Myers. Team: Omkar Bhide, Rafael Ferreira da Silva, Randy Heiland, Anirban Mandal, Rajiv Mayani, Mats Rynge, Karan Vahi
Modern IT systems are not perfect - errors creep in. At modern “Big Data” sizes we are starting to see checksums breaking down. Plus there is the threat of intentional changes: malicious attackers, insider threats, etc.
Motivation: CERN Study of Disk Errors
Examined disk, memory, and RAID-5 errors. "The error rates are at the 10^-7 level, but with complicated patterns." E.g., 80% of disk errors were 64 KB regions of corruption. The study explored many fixes and their often significant performance trade-offs.
https://indico.cern.ch/event/13797/contributions/1362288/attachments/115080/163419/Data_integrity_v3.pdf
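To get a feel for what an error rate at the 10^-7 level means at "Big Data" scale, here is a back-of-the-envelope illustration (assuming, purely for illustration, that the rate applies per 64 KB region read):

```latex
\frac{10^{15}\,\text{bytes (1 PB)}}{6.4\times 10^{4}\,\text{bytes (64 KB)}}
\approx 1.6\times 10^{10}\ \text{regions},
\qquad
1.6\times 10^{10} \times 10^{-7} \approx 1.6\times 10^{3}\ \text{corrupted regions}
```

That is, a single pass over a petabyte could silently encounter on the order of a thousand corrupted regions.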
Network router software inadvertently corrupted TCP data and the checksum! An XSEDE and Internet2 example from 2013; a second, similar case occurred in 2017 with the FreeSurfer/Fsurf project.
https://www.xsede.org/news/-/news/item/6390 Brocade TSB 2013-162-A
A bug in the StashCache data transfer software would occasionally cause silent failure (the transfer failed but returned zero). Within the workflow this was detected: when the input to a stage was found to be corrupted, a retry was invoked (60k retries and an extra 2 years of CPU hours!). However, failures in the final staging went undetected, because there was no next workflow stage to catch the errors. The workflow management system, believing the workflow was complete, cleaned up, so the final data was incomplete and all intermediate data was lost. Ten CPU-years of computing came to naught.
Application-level checksums address these and other issues (e.g., malicious changes). They are in use by many data transfer applications: scp, Globus/GridFTP, some parts of HTCondor, etc. Covering all aspects of the application workflow, however, requires either manual application by a researcher or integration into the application(s).
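As a minimal sketch of what application-level integrity checking involves (SHA-256 is the algorithm Pegasus itself uses; the function names here are illustrative):

```python
# Compute and verify a SHA-256 checksum, streaming in chunks so large
# files never have to fit in memory.
import hashlib

def sha256sum(path, chunk_size=4 * 1024 * 1024):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, expected):
    # Fail loudly on mismatch, as a workflow system would fail the job.
    actual = sha256sum(path)
    if actual != expected:
        raise RuntimeError(
            "integrity check failed for %s: expected %s, got %s"
            % (path, expected, actual))
```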
Automatic Integrity Checking – Goals
- Perform integrity checks on data
- Support different types of files
- Verify data transferred to the output site
- Apply checks at multiple places in the workflow
- Pegasus performs integrity checksums on input files before a job starts on the remote node.
- Checksums for raw inputs can be provided in the input replica catalog along with the file locations; Pegasus can compute checksums while transferring if they are not specified.
- For intermediate and output files, checksums are generated and tracked within the system.
- A failure is triggered if checksum verification fails.

Introduced in Pegasus 4.9.
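For example, in the file-based replica catalog an entry can carry a precomputed checksum next to the file's location (the LFN, path, and abbreviated hash below are made up):

```
# rc.txt – replica catalog entry with checksum metadata (Pegasus 4.9+)
input.txt  file:///data/inputs/input.txt  site="local"  checksum.type="sha256"  checksum.value="ab3f...91e2"
```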
Initial Results with Integrity Checking on OSG

Integrity errors were detected during production runs on OSG. The problematic jobs were automatically retried and the workflow finished successfully. The failures clustered on a handful of hosts, including 3 at UNL.

Error Analysis: the affected jobs ran with only seconds in between on the same node, so we suspect that the node-level cache got corrupted.
Checksum Overheads
- … hours and 42 minutes spent doing checksum verification, with an overhead of 0.068%
- … seconds of checksum verification; the overhead was 0.062%
- 1000-node OSG Kinc workflow: overhead of 0.054% incurred
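The percentages above are presumably the ratio of time spent verifying checksums to the workflow's total runtime:

```latex
\text{overhead} = \frac{t_{\text{checksum verification}}}{t_{\text{total runtime}}} \times 100\%
```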
Automate, recover, and debug scientific computations.
Pegasus Website: https://pegasus.isi.edu
Users Mailing List: pegasus-users@isi.edu
Support: pegasus-support@isi.edu

Pegasus Online Office Hours: https://pegasus.isi.edu/blog/online-pegasus-office-hours/
Held bi-monthly, on the second Friday of the month, where we address user questions and apprise the community of new developments.
Karan Vahi – vahi@isi.edu

Pegasus Team: Karan Vahi, Rafael Ferreira da Silva, Rajiv Mayani, Mats Rynge, Ewa Deelman