Superfacility and Gateways for Experimental and Observational Data - PowerPoint PPT Presentation


SLIDE 1

Superfacility and Gateways for Experimental and Observational Data

NUG 2020, August 17, 2020

Debbie Bard
Lead, Superfacility Project; Lead, Data Science Engagement Group

Cory Snavely
Deputy, Superfacility Project; Lead, Infrastructure Services Group

SLIDE 2

Superfacility: an ecosystem of connected facilities, software and expertise to enable new modes of discovery

Superfacility @ LBNL: NERSC, ESnet and CRD working together

  • A model to integrate experimental, computational and networking facilities for reproducible science
  • Enabling new discoveries by coupling experimental science with large-scale data analysis and simulations

SLIDE 3

The Superfacility concept is a key part of the LBNL strategy to support computing for experimental science

  • User Engagement
  • Data Lifecycle
  • Automated Resource Allocation
  • Computing at the Edge

SLIDES 4-5

NERSC supports many users and projects from DOE SC’s experimental and observational facilities

[Graphic: DOE SC facilities, showing future experiments and experiments operating now]

~35% of NERSC projects in 2018 said the primary role of the project is to work with experimental data

SLIDES 6-7

Needs go beyond compute hours:

  • High data volumes (today these projects use ~19% of computing hours, but store 78% of data)
  • Real-time (or near real-time) turnaround and interactive access for running experiments
  • Resilient workflows to run across multiple compute sites
  • An ecosystem of persistent edge services, including workflow managers, visualization, databases, web services…

[Chart: compute needs from experimental and observational facilities continue to increase - preliminary estimate taken from the Exascale Requirements Reviews]

You will hear much more about this in the next breakout for the NUGX SIG for Experimental Science Users!

SLIDE 8

Timing is critical

  • Experiments may need HPC feedback: real-time scheduling
  • Workflows may run continuously and automatically: API access, dedicated workflow nodes

First experiment of LCLS-II: studying a protease of SARS-CoV-2 and inhibitors

SLIDE 9

Data management is critical

  • Experiments move & manage data across sites and collaborators
  • Scientists need to search, collate and reuse data across sites and experiments

SLIDE 10

Access is critical

  • Experiments have their own user communities and policies: Federated ID
  • Scientists need access beyond the command line: Jupyter, API…

SLIDE 11

Project Goal: By the end of CY 2021, 3 (or more) of our 7 science application engagements will demonstrate automated pipelines that analyze data from remote facilities at large scale, without routine human intervention, using these capabilities:

  • Real-time computing support
  • Dynamic, high-performance networking
  • Data management and movement tools
  • API-driven automation
  • Authentication using Federated Identity

The CS Area Superfacility ‘project’ coordinates and tracks this work

SLIDE 12

We’ve developed and deployed many new tools and capabilities this year...

Supported HPC-scale Jupyter usage by experiments
  • Scaled out Jupyter notebooks to run on 1000s of nodes
  • Developed real-time visualization and interactive widgets
  • Curated notebooks, forking & reproducible workflows

Automation to reduce human effort in complex workflows
  • Released a programmable API to query NERSC status, reserve compute, move data, etc.
  • Upgraded Spin: container-based platform to support workflow & edge services
  • Designed federated ID management across facilities

Enabled time-sensitive workloads
  • Added appropriate scheduling policies, including real-time queues
  • Slurm NRE for job pre-emption, advance reservations and dynamic partitions
  • Workload introspection to identify spaces for opportunistic scheduling

Deployed data management tools for large geographically-distributed collaborations
  • Introduced Globus sharing for collaboration accounts
  • Deployed prototype GHI (GPFS-HPSS Interface) for easier archiving
  • PI dashboard for collaboration management

SLIDE 13

Superfacility Annual Meeting Demo Series

In May/June we held a series of virtual demonstrations of tools and utilities that have been developed to support the needs of experimental scientists at ESnet and NERSC.

▪ Recordings available here: https://www.nersc.gov/research-and-development/superfacility/
  – SENSE: Intelligent Network Services for Science Workflows (Xi Yang and the SENSE team)
  – New Data Management Tools and Capabilities (Lisa Gerhardt and Annette Greiner)
  – Superfacility API: Automation for Complex Workflows at Scale (Gabor Torok, Cory Snavely, Bjoern Enders)
  – Docker Containers and Dark Matter: An Overview of the Spin Container Platform with Highlights from the LZ Experiment (Cory Snavely, Quentin Riffard, Tyler Anderson)
  – Jupyter (Matthew Henderson, w. Shreyas Cholia and Rollin Thomas)

▪ Planning a second demo series in the Fall as we roll out the next round of capabilities

SLIDE 14

Priorities for 2020

1. Continue to deploy and integrate new tools, with a focus on the top “asks” from our partner facilities
   • API, data management tools, Federated ID
2. Resiliency in the PSPS era
   • Working with the NERSC facilities team to motivate center resilience
   • Working with experiments to help build more robust workflows, e.g. cross-site data analysis for LZ, DESI, ZTF, LCLS: using ALCC award and LDRD funding
3. Perlmutter prep
   • Key target: at least 4 superfacility science teams can use Perlmutter successfully in the Early Science period

SLIDE 15

Perlmutter was designed to include features that are good for Superfacility

SLIDE 16

Slingshot Network

The 4D-STEM microscope at NCEM will directly benefit from this

  • Currently it has to use SDN and a direct connection to the NERSC network to stream data to Cori compute nodes
    – uses a buffer in the data flow to send data to Cori via TCP, avoiding packet loss (sketched below)

[Diagram: 4D-STEM detector at NCEM → buffer → switch → Cori bridge node → Cori compute node]

  • Slingshot is Ethernet compatible
    – Blurs the line between the inside/outside of the machine
    – Allows for seamless external communication
    – Direct interface to storage
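To make the buffered streaming pattern concrete, here is a minimal sketch of length-prefixed frame delivery over TCP, in the spirit of the buffer-to-Cori data path; the host name, port, and framing scheme are invented for illustration and are not the actual NCEM/NERSC configuration.

import socket
import struct

# Hypothetical bridge-node endpoint; not the real NCEM/NERSC configuration.
BRIDGE_NODE = ("bridge.example.nersc.gov", 9000)

def stream_frames(frames):
    """Send length-prefixed detector frames over one TCP connection.
    TCP's retransmission is what lets the buffer node avoid packet loss."""
    with socket.create_connection(BRIDGE_NODE) as sock:
        for frame in frames:  # each frame is a bytes object
            sock.sendall(struct.pack("!Q", len(frame)))  # 8-byte length header
            sock.sendall(frame)

def recv_frame(sock):
    """Receiver side: read one length header, then exactly that many bytes."""
    header = _recv_exact(sock, 8)
    (length,) = struct.unpack("!Q", header)
    return _recv_exact(sock, length)

def _recv_exact(sock, n):
    # Loop because TCP is a byte stream: one recv() may return a partial chunk.
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("sender closed mid-frame")
        buf.extend(chunk)
    return bytes(buf)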

SLIDE 17

All-Flash Scratch Filesystem

  • Fast across many dimensions
    – 4 TB/s sustained bandwidth
    – 7,000,000 IOPS
    – 3,200,000 file creates/sec
  • Optimized for NERSC data workloads
    – NEW small-file I/O improvements
    – NEW features for high-IOPS, non-sequential I/O

Astronomy (and many other) data analysis workloads will directly benefit from this

  • IO-limited pipelines need random reads from large files and databases (see the sketch below)
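As an illustration of why IOPS rather than bandwidth dominates here, a minimal sketch of the random-read pattern such pipelines generate; the file path and dataset name are hypothetical, invented for illustration.

import numpy as np
import h5py

# Hypothetical survey file and dataset; names are placeholders.
PATH = "/pscratch/example/survey_images.h5"

def sample_cutouts(n_samples=10_000, seed=0):
    """Read rows scattered across a large HDF5 file. Each access lands on a
    different byte range, so the filesystem sees many small random reads:
    an IOPS-bound pattern that flash handles far better than spinning disk."""
    rng = np.random.default_rng(seed)
    with h5py.File(PATH, "r") as f:
        images = f["images"]  # e.g. shape (N, 64, 64), chunked
        idx = rng.integers(0, images.shape[0], size=n_samples)
        return np.stack([images[i] for i in idx])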

SLIDE 18

Demo: a Science Gateway in 5 Minutes

SLIDE 19

Motivation for Spin

“How can I run services alongside HPC that can…

  … access file systems
  … access HPC networks
  … scale up or out
  … use custom software
  … outlive jobs (persistence)
  … schedule jobs / workflows
  … stay up when HPC is down
  … be available on the web

…and are managed by my project team?”

SLIDE 20

Many Projects Need More Than HPC

  • Use public or custom software images
  • Access HPC file systems and networks
  • Orchestrate complex workflows
  • ...on a secure, scalable, managed platform

Spin answers this need.

Users can deploy their own science gateways, workflow managers, databases, and other network services with Docker containers.

SLIDE 21

Spin Embraces the Docker Methodology

Build images on your laptop with your custom software, and when they run reliably, …

Ship them to a registry for version control and safekeeping
  • DockerHub: share with the public
  • NERSC: keep private to your project

Run your workloads

SLIDE 22

Use a UI, Dockerfile, YAML Declarations…

my-project.yml

baseType: workload
containers:
  - name: app
    image: flask-app:v2
    imagePullPolicy: always
    environment:
      TZ: US/Pacific
    volumeMounts:
      - mountPath:
        name:
        type:
        readOnly: false
...

Dockerfile

FROM ubuntu:18.04
RUN apt-get update --quiet -y && \
    apt-get install --quiet -y \
        python-flask
WORKDIR /app
COPY app.py /app
ENTRYPOINT ["python"]
CMD ["app.py"]
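The Dockerfile above copies an app.py into the image and runs it with Python; as a minimal sketch, such a file could look like the following (the route and response text are placeholders, not taken from the deck).

# app.py - minimal Flask service matching the Dockerfile's ENTRYPOINT/CMD.
# The route and message are placeholders for illustration.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello from Spin!\n"

if __name__ == "__main__":
    # Bind to all interfaces so the container's published port is reachable.
    app.run(host="0.0.0.0", port=5000)

Once the app runs reliably on a laptop, docker build -t flask-app:v2 . followed by a push to DockerHub or the NERSC registry produces the image named in the YAML declaration.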

SLIDE 23

…to create running services.

A typical example:
  1. multiple nginx frontends
  2. custom Flask backend
  3. database or key-value store (dedicated, not shared)
  4. automatically plumbed into a private overlay network

Rancher starts all the containers and ensures they stay running.

[Diagram: web frontends, app backend, and database/key-value store spread across nodes 1…n on a private overlay network, with CFS and NFS mounts, under Rancher orchestration]

SLIDE 24

High-Level Spin Architecture

[Diagram: user-managed services (web frontends, app backend, database, key-value store) run across nodes 1…n with CFS, NFS and CVMFS mounts; NERSC handles the rest: ingress, management UI / CLI, security policy enforcement, image registry, Docker]

SLIDE 25

Demo: Creating a Service in Spin

SLIDE 26

Learn More about Spin

Attend a SpinUp Workshop to learn how you can build your own science gateways!

More info: https://www.nersc.gov/systems/spin/

SLIDE 27

Fin

SLIDE 28

New API functionality: https://api.nersc.gov/

  • Workflow automation needs to interact with NERSC without a human in the loop:
    – e.g. a beamline at NSLS-II wants to send data for analysis
  • Requirements based on a detailed survey in winter 2019
  • Ask questions like:
    – Is NERSC in maintenance?
    – When are future maintenances scheduled?
    – Is the scratch file system available?
  • Perform actions like:
    – Move my data
    – Launch a job
    – Make a reservation...
  • Finalizing authentication model and implementation
  • Not yet visible to users - pending completion and security review
  • Staff to contribute via a Gitlab-based process (see the sketch after this list)
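As a sketch of what no-human-in-the-loop automation could look like from a workflow’s side: the endpoint path, JSON fields, and token handling below are assumptions for illustration only, since the deck notes the API was not yet visible to users.

# Minimal sketch of automated status polling against the Superfacility API.
# The endpoint path and response fields are hypothetical.
import requests

API_BASE = "https://api.nersc.gov/api/v1"  # hypothetical version prefix
TOKEN = "..."                              # obtained out of band

def scratch_is_up():
    """Ask 'is the scratch file system available?' before launching a pipeline."""
    resp = requests.get(
        f"{API_BASE}/status/scratch",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("status") == "active"

if __name__ == "__main__":
    if scratch_is_up():
        print("scratch available - safe to submit the analysis job")
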
SLIDE 29

New Data Movement Tools Deployed

  • Large collaborations (e.g. LZ, LSST-DESC) struggle to manage their data between CFS and HPSS
  • GHI is deployed to early users
    – Easy way to archive data from CFS using command line tools
    – Automatically bundles data to optimal HPSS size
  • Experiments often share data management duties between multiple staff - we use collab accounts to enable this
  • Collaboration accounts enabled for Globus sharing
    – A dedicated endpoint allows specified users to transfer data in as the collab user, with no extra step needed to manage permissions (see the sketch after this list)
  • PIs of large teams often have to ask NERSC to chown/chgrp collaboration data when users leave or mess up their permissions
    – The PI Data Dashboard enables these actions via the click of a button
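To make the shared-endpoint workflow concrete, a minimal sketch using the Globus Python SDK; the endpoint UUIDs, paths, and token are placeholders, and the collab-account share itself is set up at NERSC (not shown here).

# Sketch of a transfer into a collaboration-account share via the Globus SDK.
# Endpoint UUIDs, paths, and the access token are placeholders.
import globus_sdk

TRANSFER_TOKEN = "..."                                    # obtained via Globus Auth
SOURCE_EP = "11111111-1111-1111-1111-111111111111"        # e.g. a beamline DTN
COLLAB_SHARE_EP = "22222222-2222-2222-2222-222222222222"  # NERSC collab share

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# Files land owned by the collab account, so no chown/chgrp step afterwards.
task = globus_sdk.TransferData(tc, SOURCE_EP, COLLAB_SHARE_EP,
                               label="raw data to collab share")
task.add_item("/raw/run_0042/", "/project/data/run_0042/", recursive=True)

result = tc.submit_transfer(task)
print("submitted transfer:", result["task_id"])
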
SLIDE 30

Areas of Technical Work

[Team photos, incl. Gabor Torok, Chris Samuel]

  • Advanced Scheduling; Resiliency - support forecasted real-time computing demands
  • Software-Defined Networks; SENSE; Self-Managed Systems - provide on-demand connectivity, QoS, fault handling, etc.
  • Data Movement; Data Dashboard; HDF5 - simplify data management tasks and optimize data production and analysis
  • Spin: Containers-as-a-Service Platform - support “edge services” adjacent to HPC for workflows
  • API and Federated Identity - automate it all and use modern cross-facility authentication

Drivers:
  • Complex workflows
  • Data-driven projects
  • Real-time compute
  • Streaming instrument data

SLIDE 31

LLAna: LCLS-LBNL Data Analytics Collaboration

[Diagram: detector → data reduction pipeline → fast feedback storage with online monitoring (~1 s) → offline storage → HPC (~minutes)]

  • Jupyter for shared analysis notebooks, with an HPC backend
  • HDF5 for high-performance file access and management, designed for LCLS-II needs
  • LCLS-II or BES facility generating HDF5
  • Workflow profiling, characterization and optimization for real-time LCLS-II analysis on HPC resources

SLIDE 32

The NERSC-9 Project Is Proceeding Well

On track:
  • Scope, Cost, Schedule
  • CD 2/3
  • System Contract Award
  • Facility Upgrade
  • App Readiness Progress
  • Risks Defined & Managed
  • Health & Safety Processes
  • Staff Experience
  • Well-Trained CAMs

Annual Project Review, Nov. 5-6, 2019: only 1 recommendation - continue prioritization of hiring a permanent lab project manager

12.5 MVA power upgrade and associated cooling for N9 underway
