Keeping Pace with Science
The CyVerse Data Store in 2020 and the Future
Tony Edgin and Edwin Skidmore
iRODS UGM 2020
Data Store Statistics
★ 190 million data objects (9 PiB)
★ 80 million files (5 PiB) transferred in 2019
★ 200 thousand files (14 TiB) transferred daily
★ 80 thousand users
★ 50 concurrent user connections on average
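As a rough consistency check on the figures above, the daily rate can be scaled to a year and compared with the 2019 totals. This is only a sketch: it assumes binary units (1 PiB = 1024 TiB), a 365-day year, and evenly spread transfers, and the variable names are illustrative.

```python
# Figures taken from the statistics above.
DAILY_FILES = 200_000
DAILY_TIB = 14
ANNUAL_FILES = 80_000_000
ANNUAL_PIB = 5

DAYS_PER_YEAR = 365   # assumption: transfers are spread evenly over the year
TIB_PER_PIB = 1024
MIB_PER_TIB = 1024 * 1024

# Scale the daily figures up to one year.
projected_files = DAILY_FILES * DAYS_PER_YEAR
projected_pib = DAILY_TIB * DAYS_PER_YEAR / TIB_PER_PIB

# Average size of a transferred file, in MiB.
avg_file_mib = DAILY_TIB * MIB_PER_TIB / DAILY_FILES

print(f"projected files/year: {projected_files:,} (reported: {ANNUAL_FILES:,})")
print(f"projected PiB/year:   {projected_pib:.2f} (reported: {ANNUAL_PIB})")
print(f"average file size:    {avg_file_mib:.1f} MiB")
```

Under these assumptions the daily and annual figures agree to within about 10%, and the average transferred file is on the order of tens of MiB.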
10 GiB File Transfer
Computation Platform                      Throughput (MiB/s)
Texas Advanced Computing Center (TACC)    170
Jetstream                                 330
Amazon Web Services (AWS)                 240
Google Cloud Platform (GCP)               260
File Transfer Performance Between CyVerse and Various Compute Platforms
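To put the throughput figures in terms of wall-clock time, the sketch below converts each rate into the approximate time needed to move the 10 GiB test file. It assumes the throughput is sustained for the whole transfer; the dictionary keys and function name are illustrative.

```python
FILE_SIZE_MIB = 10 * 1024  # 10 GiB expressed in MiB

# Throughput values from the table above, in MiB/s.
throughput_mib_s = {
    "TACC": 170,
    "Jetstream": 330,
    "AWS": 240,
    "GCP": 260,
}


def transfer_seconds(size_mib, rate_mib_s):
    """Time to move size_mib at a constant rate_mib_s."""
    return size_mib / rate_mib_s


times = {
    platform: transfer_seconds(FILE_SIZE_MIB, rate)
    for platform, rate in throughput_mib_s.items()
}

for platform, seconds in times.items():
    print(f"{platform}: {seconds:.0f} s")
```

At these rates a 10 GiB transfer takes roughly half a minute to one minute, depending on the platform.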
What is the CyVerse Data Store?
○ Project-specific storage
○ Service integration (see Appendix)
○ Custom application integration (see Appendix)
Optimizing Access to Large Sets of Small Files
Use Case
Datasets for genome browser, e.g., JBrowse or UCSC Genome Browser
○ loads files as needed
○ must be responsive, i.e., cannot take 20 seconds for each user request
CyVerse Solution
Set up a WebDAV server with a file cache
○ davrods for iRODS access
○ mod_file_cache to cache files
authenticated access
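The benefit of a cache for many small, repeatedly requested files can be illustrated with a toy example. The sketch below is not how the Apache caching module works internally; it is a minimal least-recently-used cache in front of a deliberately slow, hypothetical `slow_fetch` function standing in for a read from the Data Store. All names and paths are assumptions made for illustration.

```python
import time
from collections import OrderedDict


class FileCache:
    """Tiny LRU cache standing in for a caching layer in front of WebDAV."""

    def __init__(self, fetch, capacity=1024):
        self._fetch = fetch          # function called on a cache miss
        self._capacity = capacity    # maximum number of cached files
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, path):
        if path in self._entries:
            # Mark the entry as most recently used and serve it from memory.
            self._entries.move_to_end(path)
            self.hits += 1
            return self._entries[path]
        self.misses += 1
        data = self._fetch(path)
        self._entries[path] = data
        if len(self._entries) > self._capacity:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return data


def slow_fetch(path):
    """Hypothetical stand-in for reading a file from the backing store."""
    time.sleep(0.01)
    return f"contents of {path}".encode()


cache = FileCache(slow_fetch, capacity=2)
cache.get("/tracks/a.json")  # miss: fetched from the backing store
cache.get("/tracks/a.json")  # hit: served from the cache
cache.get("/tracks/b.json")  # miss
cache.get("/tracks/c.json")  # miss: evicts a.json, the least recently used
cache.get("/tracks/a.json")  # miss again, because a.json was evicted
print(cache.hits, cache.misses)
```

A genome browser that requests the same track files repeatedly would see mostly hits once the cache is warm, which is what keeps individual requests responsive.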
Project-Specific Storage
Use Case
A project wants to store its data in the Data Store.
CyVerse Solution
○ catalog consumers hosting storage resources
○ project uses replication resource
○ policy to ensure data locality
○ separate iRODS service account
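The data locality policy in the list above amounts to a routing decision: data written under a project's collection should land on that project's resource. The sketch below shows that decision in plain Python, not in the iRODS rule language; the zone name, collection paths, and resource names are all hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from project collections to their storage resources.
PROJECT_RESOURCES = {
    PurePosixPath("/zone/projects/example"): "exampleRepl",
}
DEFAULT_RESOURCE = "defaultResc"  # hypothetical name for the shared resource


def resource_for(path):
    """Pick the storage resource for a new data object at the given path."""
    target = PurePosixPath(path)
    for collection, resource in PROJECT_RESOURCES.items():
        # Match on whole path components, so /zone/projects/example2
        # is not mistaken for a child of /zone/projects/example.
        if collection == target or collection in target.parents:
            return resource
    return DEFAULT_RESOURCE


print(resource_for("/zone/projects/example/run1/a.dat"))
print(resource_for("/zone/projects/example2/a.dat"))
print(resource_for("/zone/home/alice/a.dat"))
```

Comparing path components rather than string prefixes avoids routing a similarly named collection to the wrong resource.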
Steps toward utopia
○ See “Your app makes me fat”
Data Store of Tomorrow
Upcoming Features
Distributed Data Services (THREDDS) (see Appendix)
○ BYO storage
○ BYO compute (later)
User-Provided, S3-Compliant Storage
Use Case
User wants to analyze their cloud data using CyVerse cyberinfrastructure.
system, e.g. Google Cloud Storage
CyVerse Solution
Use iRODS S3 Resource Plugin and Filesystem Scanner.
scalability
avoid egress costs
responsible for accrued costs
More details in Appendix
Why “Continuous Analysis”?
Greene et al., Nature Biotechnology, June 2016 (http://dx.doi.org/10.1038/nbt.3780)
○ Used GitHub and Drone to demonstrate “continuous analysis”, CI/CD for science
○ Code and data changes -> re-execute analysis and version everything
○ Authors admit limitations in dealing with data sets, though not impossible
mortals who don’t know YAML (or JSON)
Why Continuous Analysis (cont.)
○ Jetstream/Atmosphere (multi-cloud, ad hoc interactive environments, allocations)
○ Containerized workflows
○ Data management
Kubernetes-native analyses in a managed fashion
Example User Stories
ML training data is available, or every hour
actually use
project’s servers
scale as I define it
What is Continuous Analysis
Event-driven backend-as-a-service (BaaS) platform that will allow users to create, manage, and deploy containerized analyses to any (Kubernetes) cloud.
High-level capabilities:
○ Data events, workflow events, periodic, external events
○ Custom Resource Definition (CRD)
○ Supports k8s CRD workflows: standard k8s, Argo workflows
○ REST API is the initial focus (not so easy)
○ Command line interface (somewhat easier)
○ Easy to use UI
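The event-driven idea behind these capabilities can be sketched as a dispatcher that maps event types to analyses. This is an illustrative, in-process toy and not the platform's API: the `EventBus` class, the `retrain_model` handler, the event payloads, and the paths are all hypothetical, while the event type names follow the list above (data, workflow, periodic, external).

```python
class EventBus:
    """Minimal dispatcher; a hypothetical stand-in for the event layer."""

    def __init__(self):
        self._handlers = {}

    def on(self, event_type, handler):
        """Register a handler (an analysis launcher) for an event type."""
        self._handlers.setdefault(event_type, []).append(handler)

    def emit(self, event_type, payload):
        """Call every handler registered for event_type; return their results."""
        return [handler(payload) for handler in self._handlers.get(event_type, [])]


launched = []  # records which analyses were triggered, and why


def retrain_model(event):
    """Hypothetical analysis: retrain when new data arrives or on a schedule."""
    launched.append((event["type"], event.get("path")))
    return "retrain-model"


bus = EventBus()
bus.on("data", retrain_model)      # run when training data becomes available
bus.on("periodic", retrain_model)  # ...or on a timer, e.g. every hour

bus.emit("data", {"type": "data", "path": "/zone/project/training/batch1.csv"})
bus.emit("periodic", {"type": "periodic"})
bus.emit("external", {"type": "external"})  # no handler registered: ignored
print(launched)
```

In a real deployment the handler would submit a containerized workflow to a cluster rather than append to a list; that part is omitted here.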
Service Integration
Use Case
A Powered by CyVerse service needs to access its users’ data.
to data
CyVerse Solution
account
○ policy gives service write on contents
○ user has write permission, discourage delete, breaking service access
Example Application Integration
Use Case
Sparc’d is a desktop application supporting wildlife conservation created by Susan Malusa.
○ sizeable sets of small files
○ each set is tagged with metadata
○ supports sharing
○ images cannot be public, protect endangered species from poaching
○ volunteers, low frustration tolerance
○ require efficient uploads
CyVerse Solution
who gets own permission on contents, enforced by policy
○ Sparc’d packs images in one or more tar files
○ asynchronous rule unpacks, registers images
○ uploaded as CSV in each tar file
○ applied by image registration rule
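The pack, unpack, and register flow described above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the Sparc’d or CyVerse implementation: the `metadata.csv` file name, its `filename,attribute,value` row layout, and the in-memory `catalog` dictionary standing in for the iRODS catalog are all hypothetical.

```python
import csv
import io
import tarfile


def pack(images, metadata_rows):
    """Build an in-memory tar holding the images plus a metadata.csv."""
    csv_text = io.StringIO()
    csv.writer(csv_text).writerows(metadata_rows)
    members = dict(images)
    members["metadata.csv"] = csv_text.getvalue().encode()

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def unpack_and_register(tar_bytes):
    """Unpack the tar, register each image, then apply the CSV metadata."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        files = {
            member.name: tar.extractfile(member).read()
            for member in tar.getmembers()
            if member.isfile()
        }
    rows = csv.reader(io.StringIO(files.pop("metadata.csv").decode()))

    # "Register" every image in a dictionary standing in for the catalog.
    catalog = {
        name: {"size": len(data), "metadata": {}} for name, data in files.items()
    }
    for filename, attribute, value in rows:
        catalog[filename]["metadata"][attribute] = value
    return catalog


bundle = pack(
    {"img1.jpg": b"\xff\xd8fake", "img2.jpg": b"\xff\xd8fake2"},
    [("img1.jpg", "species", "ocelot"), ("img2.jpg", "species", "jaguar")],
)
catalog = unpack_and_register(bundle)
print(catalog)
```

Uploading one tar file instead of many small files means a single transfer per batch, with the per-file registration cost moved to an asynchronous step on the server side.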
THREDDS Support for NetCDF Data Sets
Use Case
A project uses NetCDF files to store its public data sets.
CyVerse Solution
THREDDS Data Server (TDS) provides a collection
datasets including NetCDF.
vault
THREDDS Integration Process
NetCDF data.
policy in iRODS and adds project to main TDS catalog. Project prepares data using iRODS client.
for THREDDS integration.
TDS
iRODS
User-Specific S3 Resource Creation Process
User Data
core.re
User S3 Resource Discovery Environment Powered by CyVerse Services Continuous Analysis Platform
CyVerse the S3 connection information
S3
creates S3 resource and data residency policies.
scanner on cloud platform to register data.
from CyVerse platforms.
iRODS Filesystem Scanner