Keeping Pace with Science: The CyVerse Data Store in 2020 and the Future
Tony Edgin and Edwin Skidmore
iRODS UGM 2020

Data Store Statistics
★ 190 million data objects (9 PiB)
★ 80 million files (5 PiB) transferred in 2019
★ 200 thousand files (14 TiB) transferred daily
★ 80 thousand users
★ 50 concurrent user connections on average

File Transfer Performance Between CyVerse and Various Compute Platforms (10 GiB File Transfer)

Computation Platform                       Throughput (MiB/s)
Texas Advanced Computing Center (TACC)     170
Jetstream                                  330
Amazon Web Services (AWS)                  240
Google Cloud Platform (GCP)                260
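As a rough worked example, the throughput figures above can be turned into wall-clock transfer times, and the daily totals into a mean file size. This is only a sketch: it assumes the reported rate is sustained for the whole 10 GiB transfer, and the function names are illustrative rather than part of any CyVerse tooling.

```python
# Back-of-the-envelope arithmetic from the statistics and throughput figures.
MIB_PER_GIB = 1024
MIB_PER_TIB = 1024 * 1024

# Throughput in MiB/s for a 10 GiB transfer, as reported on the slide.
THROUGHPUT_MIB_S = {
    "TACC": 170,
    "Jetstream": 330,
    "AWS": 240,
    "GCP": 260,
}


def transfer_seconds(size_gib: float, rate_mib_s: float) -> float:
    """Seconds needed to move size_gib at a sustained rate of rate_mib_s."""
    return size_gib * MIB_PER_GIB / rate_mib_s


def mean_file_size_mib(total_tib: float, file_count: int) -> float:
    """Average file size in MiB when total_tib is spread over file_count files."""
    return total_tib * MIB_PER_TIB / file_count


if __name__ == "__main__":
    for platform, rate in THROUGHPUT_MIB_S.items():
        print(f"{platform}: {transfer_seconds(10, rate):.1f} s for 10 GiB")
    # 200 thousand files and 14 TiB per day, from the statistics above.
    print(f"mean daily file size: {mean_file_size_mib(14, 200_000):.1f} MiB")
```

Under that assumption, a 10 GiB file takes about a minute to reach TACC and about half a minute to reach Jetstream, and the average file transferred on a typical day is on the order of 73 MiB.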

What is the CyVerse Data Store?
● Offsite replication
● Optimization for accessing large sets of small files
● Event publishing
● Customer-driven extensions
  ○ Project-specific storage
  ○ Service integration (see Appendix)
  ○ Custom application integration (see Appendix)

Optimizing Access to Large Sets of Small Files

Use Case
Datasets for genome browser, e.g., JBrowse or UCSC Genome Browser
● thousands of kilobyte-sized files
● browsers are interactive
  ○ loads files as needed
  ○ must be responsive, i.e., cannot take 20 seconds for each user request

CyVerse Solution
Set up a WebDAV server with a file cache
● Apache web server with
  ○ davrods for iRODS access
  ○ modfilecache to cache files
● separate virtual hosts for anonymous and authenticated access
● warm cache for byte-range access
● 100x faster than iget
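The byte-range access mentioned above is what lets a genome browser pull a slice of a file instead of the whole object. A minimal sketch of such a client request follows, using only the Python standard library. The URL is hypothetical and the helper names are illustrative; the sketch assumes the WebDAV server answers range requests with HTTP 206.

```python
import urllib.request


def build_range_request(url: str, start: int, end: int) -> urllib.request.Request:
    """Build an HTTP GET asking for bytes start..end (inclusive) of url."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


def fetch_range(url: str, start: int, end: int, timeout: float = 10.0) -> bytes:
    """Fetch a byte range; fail if the server ignores the Range header."""
    request = build_range_request(url, start, end)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # 206 Partial Content means the server honored the requested range.
        if response.status != 206:
            raise RuntimeError(f"expected 206, got {response.status}")
        return response.read()


if __name__ == "__main__":
    # Hypothetical anonymous WebDAV URL, for illustration only.
    demo = build_range_request("https://data.example.org/dav/tracks/reads.bam", 0, 1023)
    print(demo.get_header("Range"))
```

Running the script only prints the header that would be sent; calling `fetch_range` requires a reachable server.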

Project-Specific Storage

Use Case
A project wants to store its data in the Data Store.
● 100 TB of data
● replicas stored locally at two institutions

CyVerse Solution
● project provides institutional storage servers
● CyVerse configures storage servers
  ○ catalog consumers hosting storage resources
  ○ project uses replication resource
  ○ policy to ensure data localities
  ○ separate iRODS service account

Data Store of Tomorrow
Steps toward utopia
● Increase interoperability
● Reduce accidental complexity
  ○ See "Your app makes me fat"
● Shorten scientific analysis feedback loop

Upcoming Features
● Thematic Real-time Environmental Distributed Data Services (THREDDS) (see Appendix)
● Bring your own (BYO) infrastructure
  ○ BYO storage
  ○ BYO compute (later)
● Continuous analysis

User-Provided, S3-Compliant Storage

Use Case
User wants to analyze their cloud data using CyVerse cyberinfrastructure.
● data hosted in an S3-compliant storage system, e.g., Google Cloud Storage
● moving the data to the Data Store is not feasible

CyVerse Solution
Use iRODS S3 Resource Plugin and Filesystem Scanner.
● cacheless, detached S3 resource for scalability
● Filesystem Scanner registers data in place
● Filesystem Scanner runs on cloud platform to avoid egress costs
● project owns cloud access credentials and is responsible for accrued costs

More details in Appendix
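To make the "cacheless, detached S3 resource" concrete, the sketch below assembles the kind of `iadmin mkresc` command an administrator might run. It is an illustration, not CyVerse's actual configuration: the resource name, host, bucket path, and credentials file are hypothetical, and the context-string keys are assumed to match the version of the iRODS S3 resource plugin in use, so check them against that plugin's documentation.

```python
import shlex


def s3_context(hostname: str, auth_file: str, region: str, proto: str = "HTTPS") -> str:
    """Join S3 plugin settings into a semicolon-separated context string."""
    settings = {
        "S3_DEFAULT_HOSTNAME": hostname,
        "S3_AUTH_FILE": auth_file,
        "S3_REGIONNAME": region,
        "S3_PROTO": proto,
        "ARCHIVE_NAMING_POLICY": "consistent",
        # Assumed setting for a cacheless resource not tied to one server.
        "HOST_MODE": "cacheless_detached",
    }
    return ";".join(f"{key}={value}" for key, value in settings.items())


def mkresc_command(resc_name: str, host: str, bucket_path: str, context: str) -> str:
    """Render `iadmin mkresc Name Type Host:Path ContextString` for a shell."""
    argv = ["iadmin", "mkresc", resc_name, "s3", f"{host}:{bucket_path}", context]
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    context = s3_context("storage.googleapis.com", "/etc/irods/user-s3.keypair", "us-east-1")
    print(mkresc_command("userS3", "irods.example.org", "/user-bucket/data", context))
```

The script only prints the command; it does not contact iRODS or any cloud provider.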

CyVerse Continuous Analysis

Why “Continuous Analysis”?
● “Reproducibility of computational workflows is automated using continuous analysis”, CS Greene et al, Nature Biotechnology, June 2016 (http://dx.doi.org/10.1038/nbt.3780)
  ○ Used GitHub and Drone to demonstrate “continuous analysis”, CI/CD for science
  ○ Code and data changes -> re-execute analysis and version everything
  ○ Authors admit limitations in dealing with data sets, though not impossible
● Scientists and researchers want event-driven analysis (data growth, sensor data, etc.)
● Containers are becoming the de facto standard as units of reproducible compute
● Kubernetes is becoming the de facto standard for orchestrating containers
● Container orchestration and CI/CD technologies are difficult to use, especially for scientists and mortals who don’t know YAML (or JSON)

Why Continuous Analysis (cont.)
● Lessons learned
  ○ Jetstream/Atmosphere (multi-cloud, ad hoc interactive environments, allocations)
  ○ Containerized workflows
  ○ Data management
● Scientists need infrastructure to create, manage, and share these emerging Kubernetes-native analyses in a managed fashion
● Complements the CyVerse ecosystem, including Discovery Environment, Bisque, etc.

Example User Stories
● I want my analyses to launch every time my workflow changes, my data changes, new ML training data is available, or every hour
● I want my analyses to always be “available” and only be "charged" for the resources I actually use
● I want to launch or transfer my analyses onto Jetstream/AWS/GCP/IoT/my own project’s servers
● I want to use Argo, Airflow, Snakemake, or Makeflow workflows with Kubernetes and scale as I define it

What is Continuous Analysis?
Event-driven backend-as-a-service (BaaS) platform that will allow users to create, manage, and deploy containerized analyses to any (Kubernetes) cloud.

High-level capabilities:
● Multi-cloud (and iRODS integrated)
● Auto-scaling and scale to zero
● Event-driven, aka continuous analysis (CI/CD for science)
  ○ Data events, workflow events, periodic, external events
● Kubernetes/Cloud Native
  ○ Custom Resource Definition (CRD)
  ○ Supports k8s CRD workflows: standard k8s, Argo workflows
● Git for workflow persistence
● Support for federated identity (via Keycloak)
● CyVerse features: API, sharing/permissions, interop, etc.
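The event-driven idea above can be pictured as a small publish/subscribe loop: analyses register for an event kind, and each matching event triggers them. The sketch below is a toy model under that assumption. The class and handler names are hypothetical and do not describe the platform's actual API, which the slides do not show.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, DefaultDict, List


@dataclass(frozen=True)
class Event:
    kind: str     # "data", "workflow", "periodic", or "external"
    subject: str  # e.g. a data object path or a workflow name


Handler = Callable[[Event], str]


@dataclass
class Dispatcher:
    """Maps event kinds to the analyses that should run when they occur."""
    handlers: DefaultDict[str, List[Handler]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def on(self, kind: str, handler: Handler) -> None:
        """Register handler to run for every event of the given kind."""
        self.handlers[kind].append(handler)

    def publish(self, event: Event) -> List[str]:
        """Run every handler registered for event.kind and collect results."""
        return [handler(event) for handler in self.handlers.get(event.kind, [])]


def rerun_quality_check(event: Event) -> str:
    # Stand-in for launching a containerized analysis.
    return f"rerun quality check for {event.subject}"


if __name__ == "__main__":
    dispatcher = Dispatcher()
    dispatcher.on("data", rerun_quality_check)
    print(dispatcher.publish(Event("data", "/zone/home/project/sample.fastq")))
    print(dispatcher.publish(Event("periodic", "hourly")))
```

An event kind with no registered analyses triggers nothing, which is the toy analogue of scaling to zero: nothing runs until an event arrives.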

Current Status
● Currently in development
  ○ REST API is the initial focus (not so easy)
  ○ Command line interface (somewhat easier)
  ○ Easy to use UI
● Limited release in Q4 2020

Questions?

Appendix

Service Integration

Use Case
A Powered by CyVerse service needs to access its users’ data.
● not controlled by CyVerse, no admin access to data
● read-write access to its user data

CyVerse Solution
● service assigned rodsuser type iRODS account
● user opts into service through User Portal
● shared collection for user and service
  ○ owned by user in home collection
  ○ policy gives service write on contents
  ○ user has write permission, to discourage deletion, which would break service access

Example Application Integration

Use Case
Sparc’d is a desktop application supporting wildlife conservation created by Susan Malusa.
● manages sets of camera trap images
  ○ sizeable sets of small files
  ○ each set is tagged with metadata
  ○ supports sharing
  ○ images cannot be public, protect endangered species from poaching
● intended users are citizen scientists
  ○ volunteers, low frustration tolerance
  ○ require efficient uploads

CyVerse Solution
● project collection managed by Sparc’d creator who gets own permission on contents, enforced by policy
● “tar pipe” style upload
  ○ Sparc’d packs images in one or more tar files
  ○ asynchronous rule unpacks, registers images
● metadata attached in bulk
  ○ uploaded as CSV in each tar file
  ○ applied by image registration rule
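The client side of the “tar pipe” upload described above can be sketched as follows: many small images and one metadata CSV are packed into a single tar archive, so that one large transfer replaces thousands of small ones. The CSV file name and its columns are hypothetical, since the slides do not specify the layout Sparc’d uses.

```python
import csv
import io
import tarfile
from typing import Dict, List


def _add_member(tar: tarfile.TarFile, name: str, data: bytes) -> None:
    """Append an in-memory file to the open tar archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def pack_upload(images: Dict[str, bytes], metadata: List[Dict[str, str]]) -> bytes:
    """Pack a metadata CSV followed by the image files into one tar archive."""
    text = io.StringIO()
    # Hypothetical column layout: one attribute/value pair per row.
    writer = csv.DictWriter(text, fieldnames=["file", "attribute", "value"])
    writer.writeheader()
    writer.writerows(metadata)

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        _add_member(tar, "meta.csv", text.getvalue().encode("utf-8"))
        for name, data in images.items():
            _add_member(tar, name, data)
    return buffer.getvalue()


if __name__ == "__main__":
    archive = pack_upload(
        {"cam01/img_0001.jpg": b"\xff\xd8\xff"},
        [{"file": "cam01/img_0001.jpg", "attribute": "species", "value": "ocelot"}],
    )
    with tarfile.open(fileobj=io.BytesIO(archive)) as tar:
        print(tar.getnames())
```

On the server side, the asynchronous rule would unpack such an archive, register the images, and apply the CSV rows as metadata; that part is not shown here.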

THREDDS Support for NetCDF Data Sets

Use Case
A project uses NetCDF files to store its public data sets.
● files are multi-gigabyte sized
● only portions of some files needed at a time

CyVerse Solution
THREDDS Data Server (TDS) provides a collection of web services for accessing various types of datasets including NetCDF.
● iRODS resource server and TDS share host
● TDS has direct, read-only access to iRODS vault
● THREDDS data description files in vault
● project manages served data through iRODS
● analyst accesses data through TDS

THREDDS Integration Process
1. Project asks for THREDDS integration.
3. CyVerse sets up data residency policy in iRODS and adds project to main TDS catalog.
Project prepares data using iRODS client.
4. Analyst accesses NetCDF data.

User-Specific S3 Resource Creation Process
1. User gives CyVerse the S3 connection information.
2. CyVerse creates S3 resource and data residency policies.
3. User runs iRODS filesystem scanner on cloud platform to register data.
4. User accesses data from CyVerse platforms.

Diagram components: Continuous Analysis Platform, Discovery Environment, Powered by CyVerse Services, iRODS, iRODS core.re, User S3 Resource, Filesystem Scanner, S3 User Data
