Toward a National Research Platform

SLIDE 1

“Toward a National Research Platform”

Invited Presentation Open Science Grid All Hands Meeting Salt Lake City, UT March 20, 2018

  • Dr. Larry Smarr
    Director, California Institute for Telecommunications and Information Technology
    Harry E. Gruber Professor, Dept. of Computer Science and Engineering,
    Jacobs School of Engineering, UCSD
    http://lsmarr.calit2.net

SLIDE 2

30 Years Ago, NSF Brought the DOE HPC Center Model to University Researchers
NCSA Was Modeled on LLNL; SDSC Was Modeled on MFEnet
1985/6

SLIDE 3

I-WAY: Information Wide Area Year

Supercomputing ’95

  • The First National 155 Mbps Research Network

– 65 Science Projects
– Into the San Diego Convention Center

  • I-WAY Featured:

– Networked Visualization Applications
– Large-Scale Immersive Displays
– I-Soft Programming Environment

– Led to the Globus Project

UIC

http://archive.ncsa.uiuc.edu/General/Training/SC95/GII.HPCC.html

See talk by: Brian Bockelman

SLIDE 4

NSF’s PACI Program was Built on the vBNS to Prototype America’s 21st Century Information Infrastructure

The PACI Grid Testbed

National Computational Science

1997

vBNS led to

Key Role of Miron Livny & Condor

SLIDE 5

UCSD Has Been Working Toward PRP for Over 15 Years: NSF OptIPuter, Quartzite, Prism Awards

PI Smarr, 2002-2009; PI Papadopoulos, 2004-2007; PI Papadopoulos, 2013-2015

Precursors to DOE Defining the Science DMZ in 2010

SLIDE 6

Based on Community Input and on ESnet’s Science DMZ Concept, NSF Has Funded Over 100 Campuses to Build DMZs

Red: 2012 CC-NIE Awardees
Yellow: 2013 CC-NIE Awardees
Green: 2014 CC*IIE Awardees
Blue: 2015 CC*DNI Awardees
Purple: Multiple-Time Awardees

Source: NSF

NSF Program Officer: Kevin Thompson

SLIDE 7

Logical Next Step: The Pacific Research Platform Networks Campus DMZs to Create a Regional End-to-End Science-Driven “Big Data Superhighway” System

NSF CC*DNI Grant $5M 10/2015-10/2020

PI: Larry Smarr, UC San Diego Calit2
Co-PIs:

  • Camille Crittenden, UC Berkeley CITRIS,
  • Tom DeFanti, UC San Diego Calit2/QI,
  • Philip Papadopoulos, UCSD SDSC,
  • Frank Wuerthwein, UCSD Physics and SDSC

Letters of Commitment from:

  • 50 Researchers from 15 Campuses
  • 32 IT/Network Organization Leaders

NSF Program Officer: Amy Walton

Source: John Hess, CENIC

SLIDE 8

Note That the OSG Cluster Map Has Major Overlap with the NSF-Funded DMZ Map

Source: Frank Würthwein, OSG, UCSD/SDSC, PRP

NSF CC* Grants

SLIDE 9

Bringing OSG Software and Services to a Regional-Scale DMZ

Source: Frank Würthwein, OSG, UCSD/SDSC, PRP

SLIDE 10

Big Data Science Data Transfer Nodes (DTNs): Flash I/O Network Appliances (FIONAs)

Key PRP Innovation: UCSD Designed FIONAs To Solve the Disk-to-Disk Data Transfer Problem at Full Speed on 10/40/100G Networks

  • FIONA PCs [a.k.a. ESnet DTNs]:

– ~$8,000 Big Data PC with:
– 1 CPU
– 10/40 Gbps Network Interface Cards
– 3 TB SSDs or 100+ TB Disk Drive

– Extensible for Higher Performance to:
– +NVMe SSDs for 100 Gbps Disk-to-Disk
– +Up to 8 GPUs [4M GPU Core Hours/Week]
– +Up to 160 TB Disks for Data Posting
– +Up to 38 Intel CPUs

– $700 10 Gbps FIONAs Being Tested

  • FIONettes Are $270 FIONAs

– 1 Gbps NIC With USB-3 for Flash Storage or SSD

FIONAs—10/40G, $8,000
FIONette—1G, $250

Phil Papadopoulos, SDSC & Tom DeFanti, Joe Keefe & John Graham, Calit2

SLIDE 11

We Measure Disk-to-Disk Throughput with 10GB File Transfer Using Globus GridFTP 4 Times Per Day in Both Directions for All PRP Sites

From Start of Monitoring (January 29, 2016) to July 21, 2017: Grew From 12 DTNs to 24 DTNs Connected at 10-40G in 1½ Years

Source: John Graham, Calit2/QI
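For concreteness, here is a minimal sketch of how such a scheduled, full-mesh disk-to-disk test could be scripted, assuming the Globus Toolkit's globus-url-copy client is installed and a 10 GB test file is pre-staged on each DTN; the hostnames and paths are hypothetical placeholders, not the actual PRP measurement harness.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: time a 10 GB GridFTP transfer between every pair of DTNs.

Assumes globus-url-copy (Globus Toolkit) is installed and credentials are already
set up; hostnames and file paths below are placeholders.
"""
import itertools
import subprocess
import time

DTNS = ["dtn.campus-a.example.edu", "dtn.campus-b.example.edu"]  # placeholder hosts
TEST_FILE = "/data/testfiles/10GB.dat"   # pre-staged 10 GB file on each DTN
FILE_SIZE_GB = 10

def measure(src, dst):
    """Run one disk-to-disk transfer and return throughput in Gbps."""
    src_url = f"gsiftp://{src}{TEST_FILE}"
    dst_url = f"gsiftp://{dst}/data/incoming/10GB.dat"
    start = time.time()
    subprocess.run(["globus-url-copy", "-p", "8", src_url, dst_url], check=True)
    elapsed = time.time() - start
    return FILE_SIZE_GB * 8 / elapsed  # GB -> Gb, divided by seconds

if __name__ == "__main__":
    # Both directions for every pair, as in the 4x/day PRP measurements.
    for src, dst in itertools.permutations(DTNS, 2):
        print(f"{src} -> {dst}: {measure(src, dst):.1f} Gbps")
```

A scheduler (e.g., cron) would invoke a script like this four times per day and feed the results to a dashboard.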

SLIDE 12

PRP’s First 2 Years: Connecting Multi-Campus Application Teams and Devices

Earth Sciences

SLIDE 13

PRP Over CENIC Couples UC Santa Cruz Astrophysics Cluster to LBNL NERSC Supercomputer

CENIC 2018 Innovations in Networking Award for Research Applications

SLIDE 14

100 Gbps FIONA at UCSC Allows for Downloads to the UCSC Hyades Cluster from the LBNL NERSC Supercomputer for DESI Science Analysis

300 images per night, 100 MB per raw image, 120 GB per night
250 images per night, 530 MB per raw image, 800 GB per night

Source: Peter Nugent, LBNL Professor of Astronomy, UC Berkeley

Precursors to LSST and NCSA

NSF-Funded Cyberengineer Shaw Dong @UCSC Receiving FIONA Feb 7, 2017

SLIDE 15

Jupyter Has Become the Digital Fabric for Data Sciences
PRP Creates UC-JupyterHub Backbone

Source: John Graham, Calit2

Goal: Jupyter Everywhere
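As one illustration of the "Jupyter Everywhere" goal, a minimal sketch of starting a user's notebook server programmatically through the JupyterHub REST API; the hub URL, username, and token are hypothetical placeholders, not the PRP deployment's actual configuration.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: spawn a user's notebook server via the JupyterHub REST API."""
import requests

HUB_API = "https://jupyterhub.example.edu/hub/api"   # placeholder hub URL
TOKEN = "REPLACE_WITH_API_TOKEN"                      # placeholder API token
HEADERS = {"Authorization": f"token {TOKEN}"}

def start_server(user):
    """Ask the hub to spawn the user's single-user notebook server."""
    r = requests.post(f"{HUB_API}/users/{user}/server", headers=HEADERS)
    if r.status_code == 201:
        print(f"Server for {user} started.")
    elif r.status_code == 202:
        print(f"Server for {user} is spawning.")
    else:
        r.raise_for_status()

if __name__ == "__main__":
    start_server("researcher1")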

SLIDE 16

LHCONE Traffic Growth Is Large Now But Will Explode in 2026

31 Petabytes in January 2018, a +38% Change Within the Last Year

LHC Accounts for 47% of Total ESnet Traffic Today

Dramatic Data Volume Growth Expected for HL-LHC in 2026

Source: Frank Würthwein, OSG, UCSD/SDSC, PRP

SLIDE 17

Data Transfer Rates From 40 Gbps DTN in UCSD Physics Building, Across Campus on PRISM DMZ, Then to Chicago’s Fermilab Over CENIC/ESnet

Based on This Success, Würthwein Will Upgrade 40G DTN to 100G For Bandwidth Tests & Kubernetes Integration With OSG, Caltech, and UCSC

Source: Frank Würthwein, OSG, UCSD/SDSC, PRP

SLIDE 18

LHC Data Analysis Running on PRP

Source: Frank Würthwein, OSG, UCSD/SDSC, PRP

Two Projects:

  • OSG Cluster-in-a-Box for “T3”
  • Distributed Xrootd Cache for “T2”
SLIDE 19

First Steps Toward Integrating OSG and PRP – Tier 3 “Cluster-in-a-Box”

Source: Frank Würthwein, OSG, UCSD/SDSC, PRP

SLIDE 20

PRP Distributed Tier-2 Cache Across Caltech & UCSD

Diagram: Cache Servers with a local Redirector at UCSD and at Caltech, under a Top-Level Cache Redirector connected to the Global Data Federation of CMS

Applications Can Connect at the Local or the Top-Level Cache Redirector

⇒ Test the System as an Individual or Joint Cache

Provisioned pilot systems:

PRP UCSD: 9 x 12 SATA Disks of 2 TB @ 10 Gbps for Each System
PRP Caltech: 2 x 30 SATA Disks of 6 TB @ 40 Gbps for Each System

Production Use (UCSD Only): I/O in Production Is Limited by the Number of Apps Hitting the Cache and Their I/O Patterns

Source: Frank Würthwein, OSG, UCSD/SDSC, PRP
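A minimal sketch of how an analysis job might read a file through such a cache hierarchy, assuming the XRootD client (xrdcp) is available; the redirector hostname and logical file name are hypothetical placeholders, not the actual PRP/CMS endpoints.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: read a CMS file through a local XRootD cache redirector.

A real job would use its site's local cache redirector, which either serves the
blocks from cache or pulls them from the global data federation.
"""
import subprocess

LOCAL_REDIRECTOR = "xcache-redirector.example.edu"         # placeholder local redirector
LFN = "/store/data/Run2017/example/file.root"              # placeholder logical file name

def fetch_via_cache(lfn, dest="./"):
    """Copy the file via the cache hierarchy using the XRootD protocol."""
    url = f"root://{LOCAL_REDIRECTOR}/{lfn}"
    subprocess.run(["xrdcp", "--force", url, dest], check=True)

if __name__ == "__main__":
    fetch_via_cache(LFN)
```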

SLIDE 21

Game Changer: Using Kubernetes to Manage Containers Across the PRP

“Kubernetes is a way of stitching together a collection of machines into, basically, a big computer.”

  • – Craig McLuckie, Google, now CEO and Founder of Heptio

“Everything at Google runs in a container.”

  • – Joe Beda, Google

“Kubernetes has emerged as the container orchestration engine of choice for many cloud providers including Google, AWS, Rackspace, and Microsoft, and is now being used in HPC and Science DMZs.”

  • – John Graham, Calit2/QI, UC San Diego

See talk by: Rob Gardner
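To make the "one big computer" idea concrete, a minimal sketch using the official Python kubernetes client to total up the CPUs and GPUs that the nodes of a cluster such as Nautilus advertise; it assumes a valid kubeconfig is present, and the resource names shown are the standard Kubernetes ones, not anything PRP-specific.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: view a Kubernetes cluster as one big computer by summing
the allocatable CPUs and GPUs of its nodes.

Assumes `pip install kubernetes` and a valid kubeconfig; nothing here is specific
to the actual PRP/Nautilus deployment.
"""
from kubernetes import client, config

def parse_cpu(cpu_str):
    """Kubernetes reports CPU either as whole cores ("32") or millicores ("31900m")."""
    return int(cpu_str[:-1]) // 1000 if cpu_str.endswith("m") else int(cpu_str)

def main():
    config.load_kube_config()              # use local kubeconfig credentials
    v1 = client.CoreV1Api()
    total_cpu = total_gpu = 0
    for node in v1.list_node().items:
        alloc = node.status.allocatable
        cpus = parse_cpu(alloc.get("cpu", "0"))
        gpus = int(alloc.get("nvidia.com/gpu", "0"))   # standard NVIDIA device-plugin resource
        total_cpu += cpus
        total_gpu += gpus
        print(f"{node.metadata.name}: {cpus} CPUs, {gpus} GPUs")
    print(f"Cluster total: {total_cpu} CPUs, {total_gpu} GPUs")

if __name__ == "__main__":
    main()
```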

SLIDE 22

Distributed Computation on PRP Nautilus HyperCluster Coupling SDSU Cluster and SDSC Comet Using Kubernetes Containers

Developed and Executed an MPI-Based 100-Year [CO2(aq)] Simulation on the PRP Kubernetes Cluster

Figure: [CO2(aq)] snapshots at 25, 75, and 100 simulated years (run time: 4 days)

  • 0.5 km x 0.5 km x 17.5 m
  • Three sandstone layers separated by two shale layers

Simulating the Injection of CO2 in Brine-Saturated Reservoirs: Poroelastic & Pressure-Velocity Fields Solved In Parallel With MPI Using Domain Decomposition Across Containers

Source: Chris Paolini and Jose Castillo, SDSU
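A minimal sketch of the kind of MPI domain decomposition described above, written with mpi4py; the grid size and the toy diffusion-style update are illustrative stand-ins, not the SDSU reservoir code, which solves coupled poroelastic and pressure-velocity fields.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: 1-D domain decomposition with MPI, in the spirit of the
SDSU CO2-injection run (a toy diffusion update stands in for the real physics).

Run with, e.g.:  mpirun -n 4 python decomp_sketch.py
Requires mpi4py and numpy.
"""
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N = 1000                      # total grid points (illustrative)
local_n = N // size           # each rank owns a contiguous slab of the domain
u = np.zeros(local_n + 2)     # +2 ghost cells for neighbor exchange
if rank == 0:
    u[1] = 1.0                # an injection-like source at the left boundary

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for step in range(500):
    # Exchange ghost cells with left/right neighbor ranks.
    comm.Sendrecv(u[1:2],   dest=left,  recvbuf=u[-1:], source=right)
    comm.Sendrecv(u[-2:-1], dest=right, recvbuf=u[0:1], source=left)
    # Explicit diffusion update on the interior of the local slab.
    u[1:-1] += 0.25 * (u[:-2] - 2 * u[1:-1] + u[2:])

total = comm.reduce(np.sum(u[1:-1]), op=MPI.SUM, root=0)
if rank == 0:
    print(f"total 'concentration' after 500 steps: {total:.4f}")
```

On the PRP, each MPI rank runs inside a Kubernetes-managed container, so the same decomposition can span nodes at multiple campuses.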

SLIDE 23

Rook is Ceph Cloud-Native Object Storage ‘Inside’ Kubernetes https://rook.io/

Source: John Graham, Calit2/QI

See talk by: Shawn McKee

SLIDE 24

FIONA8: Adding GPUs to FIONAs Supports Data Science Machine Learning

Multi-Tenant Containerized GPU JupyterHub Running Kubernetes / CoreOS
Eight Nvidia GTX-1080 Ti GPUs, 32 GB RAM, 3 TB SSD, 40G & Dual 10G Ports, ~$13K

Source: John Graham, Calit2

SLIDE 25

Nautilus: A Multi-Tenant Containerized PRP HyperCluster for Big Data Applications, Running Kubernetes with Rook/Ceph Cloud-Native Storage and GPUs for Machine Learning

March 2018, John Graham, Calit2/QI

Diagram: FIONA8 and FIONA nodes across the PRP at Calit2 (with sdx-controller / controller-0), SDSC, SDSU, Caltech, UCAR, UCI, UCR, USC, UCLA, Stanford, UCSB, UCSC, and Hawaii, connected at 40G (SSD, 3T) or 100G (NVMe 6.4T, Gold/Epyc NVMe)

Rook/Ceph - Block/Object/FS Swift API compatible with SDSC, AWS, and Rackspace

Kubernetes on CentOS 7

SLIDE 26

Running Kubernetes/Rook/Ceph on PRP Allows Us to Deploy a Distributed PB+ of Storage for Posting Science Data

March 2018, John Graham, UCSD

Diagram: the same Nautilus node map as Slide 25, with the 40G FIONAs now carrying 160 TB of disk each alongside the 100G NVMe 6.4T and 100G Gold/Epyc nodes

Rook/Ceph - Block/Object/FS Swift API compatible with SDSC, AWS, and Rackspace

Kubernetes on CentOS 7
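Since the Ceph object gateway exposes S3- and Swift-compatible APIs, posting a dataset to such a distributed store could look roughly like the following boto3 sketch; the endpoint, credentials, bucket, and file names are hypothetical placeholders, not the actual PRP/Nautilus configuration.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: post a science dataset to an S3-compatible Ceph (Rook)
object store like the one described above. Requires boto3."""
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://ceph-rgw.example.edu",   # placeholder Ceph RADOS Gateway
    aws_access_key_id="ACCESS_KEY",                # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)

BUCKET = "my-science-data"

# Create the bucket if needed, then upload one file of the dataset.
s3.create_bucket(Bucket=BUCKET)
s3.upload_file("simulation_output_0001.h5", BUCKET, "co2-runs/simulation_output_0001.h5")

# List what is posted so collaborators can pull it from any PRP site.
for obj in s3.list_objects_v2(Bucket=BUCKET).get("Contents", []):
    print(obj["Key"], obj["Size"])
```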

SLIDE 27

Collaboration Opportunity with OSG & PRP on Distributed Storage

Total data volume pulled last year is dominated by 4 caches (1.8 PB, 1.2 PB, 1.6 PB, and 210 TB).

OSG Is Operating a Distributed Caching CI; At Present, 4 Caches Provide Significant Use. PRP Kubernetes Infrastructure Could Either Grow Existing Caches by Adding Servers, or Add Additional Locations

See talks by: Alex Feltus and Derek Weitzel

StashCache Users Include: LIGO, DES

See talk by: Marcelle Soares-Santos

Source: Frank Würthwein, OSG, UCSD/SDSC, PRP

SLIDE 28

New NSF CHASE-CI Grant Creates a Community Cyberinfrastructure: Adding a Machine Learning Layer Built on Top of the Pacific Research Platform

NSF Grant for a High-Speed “Cloud” of 256 GPUs for 30 ML Faculty & Their Students at 10 Campuses (Caltech, UCB, UCI, UCR, UCSD, UCSC, Stanford, MSU, UCM, SDSU) for Training AI Algorithms on Big Data

NSF Program Officer: Mimi McClure

SLIDE 29

48 GPUs for OSG Applications

UCSD Adding >350 Game GPUs to Data Sciences Cyberinfrastructure - Devoted to Data Analytics and Machine Learning

SunCAVE: 70 GPUs; WAVE + Vroom: 48 GPUs

FIONA with 8 Game GPUs

88 GPUs for Students; CHASE-CI Grant Provides 96 GPUs at UCSD for Training AI Algorithms on Big Data

SLIDE 30

Next Step: Surrounding the PRP Machine Learning Platform With Clouds of GPUs and Non-Von Neumann Processors

Microsoft Installs Altera FPGAs into Bing Servers & 384 into TACC for Academic Access

CHASE-CI

64-TrueNorth Cluster 64-bit GPUs

4352x NVIDIA Tesla V100 GPUs

See talk by: Hurtado Anampa

SLIDE 31

PRP is Partnering with NSF Grants Supporting Advanced Cyberinfrastructure Facilitators to Explore PRP Extension Toward NRP

PRP Connected

ACI-REF has also spawned the 35-member Campus Research Computing Consortium (CaRCC), Funded by the NSF as a Research Coordination Network (RCN)

CaRCC is Dedicated to Sharing Best Practices, Expertise, and Resources, Enabling the Advancement of Campus-Based Research Computing Activities Across the Nation

Jim Bottum, Principal Investigator
Tom Cheatham, ACI-REF Chair of Campus PIs

ACI-REF CaRCC

See talk by: Tom Cheatham

SLIDE 32

Expanding to the Global Research Platform Via CENIC/Pacific Wave, Internet2, and International Links

PRP’s Current International Partners: Netherlands, Guam, Australia, Korea, Japan, Singapore

Korea Shows Distance Is Not a Barrier to Above-5 Gb/s Disk-to-Disk Performance

SLIDE 33

The Second National Research Platform Workshop Bozeman, MT August 6-7, 2018

A follow-up FIONA workshop will be held as a lead-in to the 2nd NRP workshop in Bozeman, starting August 2nd. While the workshop will be open to the community, there is a specific focus on EPSCoR-affiliated and minority-serving institutions.

Co-Chairs: Larry Smarr, Calit2; Inder Monga, ESnet; Ana Hunsinger, Internet2

Local Host: Jerry Sheehan, MSU

SLIDE 34

Our Support:

  • US National Science Foundation (NSF) awards CNS-0821155, CNS-1338192, CNS-1456638, CNS-1730158, ACI-1540112, & ACI-1541349

  • University of California Office of the President CIO
  • UCSD Chancellor’s Integrated Digital Infrastructure Program
  • UCSD Next Generation Networking initiative
  • Calit2 and Calit2 Qualcomm Institute
  • CENIC, Pacific Wave, and StarLight
  • DOE ESnet