iRODS in the Cloud: SciDAS and NIH Helium Commons



slide-1
SLIDE 1

iRODS in the Cloud: SciDAS and NIH Helium Commons

Claris Castillo RENCI, UNC Chapel Hill

slide-2
SLIDE 2
slide-3
SLIDE 3

Not Scaling up Data Analysis is Not an Option

www.smartpractice.com Wisegeek.org

DatAPocaLypse Prediction (Genomics): In 20 years, every CVS, subway, hospital, research lab, public health facility, police station, etc. will have a DNA sequencer generating exabytes of data in aggregate each week.

20th Century 21st Century

Normal veteran (giga-/terascale) and newbie (megascale) users MUST ADVANCE to the peta/exa-scale in this generation.

Issues:

  • Limited computational skills (What is a C library?)
  • Poor use of advanced networks (We need more HDs to mail!)
  • Limited access to computational resources (awareness, $$$)
  • Unpredictable time to compute result (queue times, queue times, queue times, broken nodes, segfaults, OOM, data geography)

  • Missing skillsets (I only know Perl)
  • Data must be organized and good stuff deleted (Data policies)
  • How many bioinformaticists are on the CVS payroll?
  • How many faculty recruitments failed because campus X research computing resources are stuck in 2015?

  • How many adverse drug reactions were not predicted because of limited/broken cyberinfrastructure?

Alex Feltus

slide-4
SLIDE 4

Heterogeneous and Complex CI Ecosystems

+100 sites +1500 users

Community data sharing platforms

Compute infrastructure, advanced networks, storage infrastructure

slide-5
SLIDE 5

Commoditization of cloud computing and the convergence of compute, storage, data and network technologies enable the ‘illusion’ of a single large computer consisting of widely distributed systems.

slide-6
SLIDE 6

Breakdown: One Layer at A Time -- Data


… iRODS team connected iRODS to a MariaDB Galera Cluster to provide a multi-master, distributed iRODS catalog over the WAN.

“Distributing the iRODS Catalog: a way forward”, M. Stealey et al., iRODS User Group Meeting (UGM), Netherlands, 2017.

SciDAS Zone MariaDB Galera cluster

slide-7
SLIDE 7

Breakdown: One Layer at A Time -- Compute

… Apache Mesos: A layer of abstraction to utilize an entire data center as a single large server

slide-8
SLIDE 8

Breakdown: One Layer at A Time -- Scientific Tools

Scientific applications will be available in the form of SciApps “virtual appliances” (NSF CC-ADAMANT, [works15])

[works15] “Enabling Workflow Repeatability with Virtualization Support”, Fan Jiang et al., Workshop on Workflows of Large-Scale Science, Supercomputing Conference (SC15), Austin, Texas, 2015.

slide-9
SLIDE 9

SciDAS: Bringing it All Together Into One System


SciDAS Middleware

Cost-Aware Optimize iRODS Shim (aaS) API PerfSONAR Shim (aaS) API PerfSONAR mapping Requester Orchestrator

  • Network-aware placement
  • Optimize for data locality
  • Capability-aware, resource-aware placement
  • GPU able nodes
  • Authentication and authorization infrastructure
  • CILogon
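
The placement criteria listed above (network awareness, data locality, capability awareness) can be illustrated with a minimal sketch. This is not the SciDAS middleware: the `Site` and `Job` types, their fields, and the `place` function are all hypothetical, and the throughput figures only stand in for the kind of measurements a perfSONAR deployment might supply.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Site:
    """A compute site (hypothetical model, not a SciDAS type)."""
    name: str
    has_gpu: bool
    free_cpus: int
    # Assumed measured throughput in Gbit/s into this site,
    # keyed by the name of the site that holds the data.
    throughput_gbps: Dict[str, float] = field(default_factory=dict)


@dataclass
class Job:
    """A job request (hypothetical model)."""
    needs_gpu: bool
    cpus: int
    data_site: str   # site where the input data currently lives
    data_gb: float   # input size in gigabytes


def transfer_seconds(job: Job, site: Site) -> float:
    """Estimate the time needed to stage the job's input at the given site."""
    if site.name == job.data_site:
        return 0.0  # data locality: nothing to move
    gbps = site.throughput_gbps.get(job.data_site, 0.0)
    if gbps <= 0:
        return float("inf")  # no known network path to the data
    return job.data_gb * 8 / gbps  # gigabytes to gigabits, divided by rate


def place(job: Job, sites: List[Site]) -> Optional[Site]:
    """Pick the capable site with the shortest estimated staging time."""
    capable = [
        s for s in sites
        if s.free_cpus >= job.cpus and (s.has_gpu or not job.needs_gpu)
    ]
    reachable = [s for s in capable if transfer_seconds(job, s) != float("inf")]
    if not reachable:
        return None
    return min(reachable, key=lambda s: transfer_seconds(job, s))


sites = [
    Site("site-a", has_gpu=False, free_cpus=8),
    Site("site-b", has_gpu=True, free_cpus=4, throughput_gbps={"site-a": 10.0}),
    Site("site-c", has_gpu=True, free_cpus=16, throughput_gbps={"site-a": 1.0}),
]
gpu_job = Job(needs_gpu=True, cpus=4, data_site="site-a", data_gb=100.0)
chosen = place(gpu_job, sites)
print(chosen.name if chosen else "no suitable site")
```

In this made-up setup the data sits at a site without GPUs, so the job goes to the GPU site with the fastest path to the data. A cost-aware optimizer would presumably weigh more factors (price, queue depth); the sketch keeps only the three criteria named on the slide.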
slide-10
SLIDE 10

Improving scientific productivity by the numbers

slide-11
SLIDE 11

Cloud-Agnostic Platform

Global Unique ID

Search & Indexing

Workspace

Security & Compliance

FAIR

Data Commons APIs Scientific Communities

Interoperability Component APIs

Data / Tools Enrollment

Data / Tools Discovery

Scalable, Secure and Collaborative Workflow Execution

Data Commons

slide-12
SLIDE 12


TOPMed MOD GTEx …

Marathon

Chronos

Jupyter apps

Appliances JSON Descriptors {:}

data: /aws/TopMed
cloud-preference: GC
Encryption: true
Docker-image: foo
Ram: 16G
CPU:
Stge: 5TB

PIVOT API/Core Service

  • High-level descriptor of applications
  • Intelligent decision (cloud aware)
  • Provision & deploy
  • Access/write data anywhere
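
As a rough illustration of how a high-level descriptor could be turned into something deployable, the sketch below maps the slide's example descriptor onto a Marathon-style app definition. The mapping is an assumption, not PIVOT's actual behavior: `size_to_mib`, `to_marathon_app`, the app id, the label keys, and the default CPU count are all hypothetical, and the image key is spelled out as `Docker-image`. The slide leaves the CPU value blank, so the sketch keeps it empty and falls back to a default instead of guessing.

```python
import json
from typing import Any, Dict

# Descriptor modeled on the slide's example. CPU is blank on the slide,
# so it stays None here rather than being filled in.
descriptor: Dict[str, Any] = {
    "data": "/aws/TopMed",
    "cloud-preference": "GC",
    "Encryption": True,
    "Docker-image": "foo",
    "Ram": "16G",
    "CPU": None,
    "Stge": "5TB",
}


def size_to_mib(text: str) -> int:
    """Convert a size such as '16G' or '5TB' to MiB (binary multiples assumed)."""
    factors = {"M": 1, "G": 1024, "T": 1024 * 1024}
    value = text.strip().upper().rstrip("B")
    return int(float(value[:-1]) * factors[value[-1]])


def to_marathon_app(desc: Dict[str, Any], app_id: str,
                    default_cpus: float = 1.0) -> Dict[str, Any]:
    """Build a Marathon-style app definition from a descriptor (hypothetical mapping)."""
    cpu = desc.get("CPU")
    return {
        "id": app_id,
        "cpus": float(cpu) if cpu else default_cpus,  # default is an assumption
        "mem": size_to_mib(desc["Ram"]),    # Marathon expects memory in MiB
        "disk": size_to_mib(desc["Stge"]),  # and disk in MiB
        "instances": 1,
        "container": {
            "type": "DOCKER",
            "docker": {"image": desc["Docker-image"]},
        },
        # Labels are string-to-string; these label keys are made up for the sketch.
        "labels": {
            "data": desc["data"],
            "cloud-preference": desc["cloud-preference"],
            "encryption": str(desc["Encryption"]).lower(),
        },
    }


app = to_marathon_app(descriptor, "/commons/demo-app")
print(json.dumps(app, indent=2))
```

Carrying the cloud preference and data path as labels leaves the cloud-aware decision to whatever component reads them, which is one plausible split of responsibilities, not a documented one.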

Input

Make results discoverable

CWL apps

CommonsShare (KC5:portal)

Bring-Your-Own-Data

TOPMed MOD GTEx …

  • Metadata to encode rich information
  • Rule engine programmed with rules to enact policies
  • Data Federation
  • Virtualization system
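
The idea of metadata plus a rule engine that enacts policies can be shown with a toy model. This is plain Python, not the iRODS rule language or any iRODS API: `ToyCatalog`, `on_post_put`, the paths, and both policies are hypothetical and exist only to show rules firing on a data event and recording their decisions as metadata.

```python
from typing import Callable, Dict, List

Metadata = Dict[str, str]
Rule = Callable[[str, int, Metadata], None]


class ToyCatalog:
    """A toy in-memory catalog mapping path -> metadata (not an iRODS API)."""

    def __init__(self) -> None:
        self.objects: Dict[str, Metadata] = {}
        self.post_put_rules: List[Rule] = []

    def on_post_put(self, rule: Rule) -> Rule:
        """Register a rule to run after every put (usable as a decorator)."""
        self.post_put_rules.append(rule)
        return rule

    def put(self, path: str, size_bytes: int) -> Metadata:
        """Register an object, then let each rule attach metadata to it."""
        meta: Metadata = {"size_bytes": str(size_bytes)}
        self.objects[path] = meta
        for rule in self.post_put_rules:
            rule(path, size_bytes, meta)
        return meta


catalog = ToyCatalog()


@catalog.on_post_put
def require_encryption(path: str, size_bytes: int, meta: Metadata) -> None:
    # Hypothetical policy: everything under /commons/protected/ must be encrypted.
    if path.startswith("/commons/protected/"):
        meta["encryption_required"] = "true"


@catalog.on_post_put
def tier_large_objects(path: str, size_bytes: int, meta: Metadata) -> None:
    # Hypothetical policy: objects of 1 GiB or more go to an archive tier.
    meta["tier"] = "archive" if size_bytes >= 1024 ** 3 else "standard"


print(catalog.put("/commons/protected/sample.bam", 2 * 1024 ** 3))
```

The point of the pattern is that policy lives with the data system and is applied on every ingest, so clients do not have to remember to apply it.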


Bring-Your-Own-Data-Service

slide-13
SLIDE 13


iRODS enables powerful data sharing models in the Commons

  • BYOD: Cloud storage can be added as storage resources
  • Extended data collaboration (BYODS): Seamless integration with data hosted on external data services
  • Data Federation (default): continuous virtual system while retaining control of each endpoint

slide-14
SLIDE 14

Thank you!

claris@renci.org