Streaming Data in Cosmology Salman Habib Argonne National - PowerPoint PPT Presentation

Streaming Data in Cosmology
Salman Habib, Argonne National Laboratory
Stream 2016, March 22, 2016
Figure labels: SPT, LSST, JVLA

Data Flows in Cosmology: The Big Picture
Note: Will not repeat material from Stream 2015
• Data/Computation in Cosmology: Data flows and associated analytics play an essential role in cosmology (combination of streaming and offline analyses)
• Streaming Data:
  • Observations: CMB experiments (ACT, SPT, ...), optical transients (SN surveys, GW follow-ups, ...), radio surveys
  • Simulations: Large datastreams (in situ and co-scheduled data transformation)
  • Analytics: Transient classification pipelines, imaging pipelines

Data Flow Example
• Transient Surveys: Optical searches for transients (e.g., DES, LSST, PTF) can have cadences in the range of fractions of minutes to minutes; current data rates are about 500 GB/night, and LSST can go up to 20 TB/night, about 10K alerts/night
• Machine Learning: Major opportunity for machine learning for filtering and classification of transient sources (potentially one in a million interesting events), demonstrated at NERSC with PTF
Figure: Palomar Transient Factory (courtesy Peter Nugent)
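To put the nightly volumes quoted above in terms of a sustained rate, a minimal sketch follows. The 10-hour observing night is an assumption made here for illustration and is not a number from the slides; decimal units (1 GB = 1e9 bytes) are also assumed.

```python
GB = 1e9   # decimal units assumed
TB = 1e12
NIGHT_HOURS = 10.0  # assumption for illustration, not from the slides


def sustained_rate_mbps(bytes_per_night: float, night_hours: float) -> float:
    """Average data rate in megabits per second over one observing night."""
    seconds = night_hours * 3600.0
    return bytes_per_night * 8.0 / seconds / 1e6


# Nightly volumes quoted on the slide.
current_mbps = sustained_rate_mbps(500 * GB, NIGHT_HOURS)
lsst_mbps = sustained_rate_mbps(20 * TB, NIGHT_HOURS)

# About 10K alerts per night, spread evenly over the assumed night.
alerts_per_second = 10_000 / (NIGHT_HOURS * 3600.0)

print(f"current surveys: {current_mbps:.0f} Mbps")
print(f"LSST:            {lsst_mbps:.0f} Mbps")
print(f"alert rate:      {alerts_per_second:.2f} per second")
```

Under these assumptions the LSST figure is 40 times the current rate, which is simply the ratio 20 TB / 500 GB; the absolute rates scale inversely with whatever night length is assumed.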

In Situ Analysis and Co-Scheduling
• Analysis Dataflows: Analysis data flows are complex and any future strategy must combine elements of in situ and offline approaches (Flops vs. IO/storage imbalance)
• CosmoTools Test: Test of coordinated offline analysis (“co-scheduling”)
• Portability: Analysis routines implemented using PISTON (part of VTK-m, built on NVIDIA’s Thrust library)
• Example Case (Titan): Large halo analysis (strong scaling bottleneck) offloaded to alternative resource using a listener script that looks for appropriate output files
Sewell et al. 2015, SC15 Technical Paper
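The listener idea in the Titan example can be sketched as a simple polling loop. This is not the script used in the work cited above; the function names, the `*.done` file pattern, and the polling interval are all hypothetical choices made for illustration.

```python
import fnmatch
import os
import time
from typing import Callable, List, Optional, Set


def find_new_outputs(directory: str, pattern: str, seen: Set[str]) -> List[str]:
    """Return sorted file names in `directory` that match `pattern` and are not in `seen`."""
    names = sorted(n for n in os.listdir(directory) if fnmatch.fnmatch(n, pattern))
    return [n for n in names if n not in seen]


def listen(directory: str,
           pattern: str,
           handle: Callable[[str], None],
           poll_seconds: float = 30.0,
           max_polls: Optional[int] = None) -> Set[str]:
    """Poll `directory` and call `handle(path)` once for each new matching file.

    `handle` stands in for whatever launches the offloaded analysis
    (hypothetical; for example, submitting a job on another resource).
    With max_polls=None the loop runs until interrupted.
    """
    seen: Set[str] = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for name in find_new_outputs(directory, pattern, seen):
            handle(os.path.join(directory, name))
            seen.add(name)
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(poll_seconds)
    return seen
```

A call such as `listen("/path/to/output", "*.done", print, max_polls=10)` would print each new path once. Matching on a separate marker file written after the data file is complete is one way to avoid picking up partially written output; that convention is an assumption here, not something the slides specify.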

In Situ Analysis Example
• Data Reduction: A trillion particle simulation with 100 analysis steps has a storage requirement of ~4 PB; in situ analysis reduces it to ~200 TB
• I/O Chokepoints: Large data analyses difficult because I/O time > analysis time, plus scheduling overhead
• Fast Algorithms: Analysis time is only a fraction of a full simulation timestep
• Ease of Workflow: Large analyses difficult to manage in post-processing
Figure: HACC Simulation with Simulation Inputs and Analysis Tools Configuration; Analysis Tools (k-d Tree, Halo Finders, Voronoi Tessellation, Merger Trees, N-point Functions, Caustics, Halo Profiles); Parallel File System. Predictions go into the Cosmic Calibration Framework to solve the Cosmic Inverse Problem.
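The data reduction numbers above imply a few derived quantities that are easy to check. The sketch below only divides the quoted figures; decimal units (1 PB = 1e15 bytes) are an assumption, and the per-particle figure is an average implied by the totals rather than a stated record size.

```python
PB = 1e15  # decimal units assumed
TB = 1e12

n_particles = 1e12          # "a trillion particle simulation"
n_steps = 100               # analysis steps
raw_total = 4 * PB          # storage requirement without in situ analysis
reduced_total = 200 * TB    # storage requirement with in situ analysis

raw_per_step = raw_total / n_steps                # bytes written per analysis step
bytes_per_particle = raw_per_step / n_particles   # implied average per particle per step
reduction_factor = raw_total / reduced_total

print(f"raw output per step: {raw_per_step / TB:.0f} TB")
print(f"implied bytes per particle per step: {bytes_per_particle:.0f}")
print(f"in situ reduction factor: {reduction_factor:.0f}x")
```

Each raw step is therefore about 40 TB, which helps explain why reading the data back for post-processing can take longer than the analysis itself.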

Offline Data Flow: Large-Scale Data Movement
• Offline Data Flows: Cosmological simulation data flows already require ~PB/week capability, next-generation streaming data will require similar bandwidth
• ESnet Project: Aim to achieve a production capability of 1 PB/week (FS to FS) across major compute sites
• Status: Very close but not there yet (600+ TB/week); numbers from a simulation dataset
• Future: Automate entire process within the data workflow including retrieval from archival storage (HPSS); add more compute/data hubs

Figure: Transfer rates between DTN endpoints alcf#dtn_mira (ALCF), nersc#dtn (NERSC), olcf#dtn_atlas (OLCF), and ncsa#BlueWaters (NCSA). Rates shown: 10.0, 10.5, 13.4, 7.3, 11.1, 6.7, 6.0, 13.3, 7.6, 8.2, 6.9, 6.8 Gbps (the pairing of rates with links is not recoverable from the text). Petascale DTN project, courtesy Eli Dart.

Data set: L380 “package” (4 TB)
Files: 19260
Directories: 211
Other files: 0
Total bytes: 4442781786482 (4.4T bytes)
Smallest file: 0 bytes (0 bytes)
Largest file: 11313896248 bytes (11G bytes)
Size distribution:
  1 - 10 bytes: 7 files
  10 - 100 bytes: 1 files
  100 - 1K bytes: 59 files
  1K - 10K bytes: 3170 files
  10K - 100K bytes: 1560 files
  100K - 1M bytes: 2817 files
  1M - 10M bytes: 3901 files
  10M - 100M bytes: 3800 files
  100M - 1G bytes: 2295 files
  1G - 10G bytes: 1647 files
  10G - 100G bytes: 3 files
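The weekly volume targets can be converted to a sustained link rate for comparison with the per-link Gbps values in the figure. This is a minimal sketch assuming decimal units and a transfer that runs continuously for the full week; the 10 Gbps rate used for the test package is an illustrative round number.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604800


def sustained_gbps(bytes_per_week: float) -> float:
    """Average rate in Gbps needed to move `bytes_per_week` in one week."""
    return bytes_per_week * 8 / SECONDS_PER_WEEK / 1e9


def transfer_hours(n_bytes: float, gbps: float) -> float:
    """Hours needed to move `n_bytes` at a constant `gbps`."""
    return n_bytes * 8 / (gbps * 1e9) / 3600


target_gbps = sustained_gbps(1e15)     # 1 PB/week goal
status_gbps = sustained_gbps(600e12)   # 600 TB/week achieved so far

# Test data set from the slide: 4442781786482 bytes, at an assumed 10 Gbps.
package_hours = transfer_hours(4442781786482, 10.0)

print(f"1 PB/week   needs about {target_gbps:.1f} Gbps sustained")
print(f"600 TB/week is about    {status_gbps:.1f} Gbps sustained")
print(f"4.4 TB package at 10 Gbps: about {package_hours:.2f} hours")
```

The 1 PB/week goal works out to roughly 13.2 Gbps sustained, and 600 TB/week to roughly 7.9 Gbps, which brackets most of the link rates listed for the figure.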

Extreme-Scale Analytics Systems (EASy) Project (ASCR/HEP)
• New Approaches to Large-Scale Data Analytics: Combine aspects of High Performance Computing, Data-Intensive Computing, and High Throughput Computing to develop new pathways for large-scale scientific analyses enabled through Science Portals
• EASy Elements (Initial focus on cosmological simulations and surveys):
  • Surveys: DESI, LSST, SPT, ...
  • Software Stack: Run complex software stacks on demand (containers and virtual machines)
  • Resilience: Handle job stream failures and restarts
  • Resource Flexibility: Run complex workflows with dynamic resource requirements
  • Wide-Area Data Awareness: Seamlessly move computing to data and vice versa; access to remote databases and data consistency
  • Automated Workloads: Run automated production workflows
  • End-to-End Simulation-Based Analyses: Run analysis workflows on simulations and data using a combination of in situ and offline/co-scheduling approaches
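As one illustration of the resilience element listed above, a job wrapper can retry a failed step a bounded number of times before giving up. This is a generic sketch, not EASy code; `run_with_restarts` and its parameters are hypothetical names introduced here.

```python
import time
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def run_with_restarts(job: Callable[[int], T],
                      max_attempts: int = 3,
                      backoff_seconds: float = 0.0) -> Tuple[T, List[str]]:
    """Run `job(attempt)` until it succeeds or `max_attempts` is reached.

    Returns the job result together with a log of earlier failures.
    Raises RuntimeError if every attempt fails.
    """
    errors: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            return job(attempt), errors
        except Exception as exc:  # broad on purpose: any failure triggers a restart
            errors.append(f"attempt {attempt}: {exc}")
            if attempt < max_attempts:
                # Linear backoff: wait longer after each failure.
                time.sleep(backoff_seconds * attempt)
    raise RuntimeError(f"job failed after {max_attempts} attempts: {errors}")


def flaky_job(attempt: int) -> str:
    """Toy job that fails on the first two attempts."""
    if attempt < 3:
        raise IOError("simulated node failure")
    return "done"


result, failure_log = run_with_restarts(flaky_job, max_attempts=5)
print(result, failure_log)
```

A production workflow would normally also checkpoint state so that a restart resumes partway through rather than repeating the whole step; that is outside this sketch.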

EASy Project: Infrastructure Components

Component | Description | Notes
Observational Data | Data from Dark Energy Survey (DES), Sloan Digital Sky Survey (SDSS), South Pole Telescope (SPT), and upcoming surveys (DESI, LSST, WFIRST, ...) | Make selected data subsets available given storage limits; make analysis software available to analyze the datasets
Simulation Data | Simulations for optical surveys (raw data, object catalogs, synthetic catalogs, predictions for observables); simulations for CMB observations | Very large amounts of simulation data need to be made available; hierarchical data views; data compression methods
Data Storage | Multi-layered storage on NVRAM, spinning disk, and disk-fronted tape (technologies include RAM disk, HPSS, parallel file systems) | Current storage availability for the project is ~PB on spinning disk; larger resources available within HPSS; RAM disk testbeds
Data Transfer | Data transfer synced with computational infrastructure and resources; data transfer as integral component of data-intensive workflows | Use of Globus transfer as an agreed mechanism; current separate project with ESnet to have a production capability at 1 PB/week
Computational Infrastructure | Wide range of computational resources, including high performance computing, high throughput computing, and data-intensive computing platforms | How to bring together a number of distinct resources to solve analysis tasks in a layered fashion? What is the optimal mix?
Computational Resources | Resources at NERSC include Edison and Cori Phase 1; resources at Argonne: Cooley, Jupiter/Magellan, Theta (future) | Melding HPC and cluster resources; testbeds for using HPC resources for data-intensive tasks and elastic computing paradigms
Containers and Virtualization | Running large-scale workflows with complex software stacks; allowing for interactive as well as batch modes for running jobs; use of web portals | Data management and analysis workflows, especially workflows that combine simulation and observational datastreams
Algorithmic Advances | New data-intensive algorithms with improved scaling properties, including approximate algorithms with error bounds; new statistical methods | As data volumes increase rapidly, new algorithms are needed to produce results in finite time, especially for interactive applications

Figure: Cosmological simulation predicting the distribution of matter

Future Challenges
• Data Filtering and Classification: The major challenges for machine learning approaches are high levels of throughput and lack of training datasets; these approaches are the only ones that are likely to succeed, however
• Data Access: View of streaming as “one-shot” is actually a statement of a technology limitation; to overcome this will require cheap and fast storage with database (or equivalent) overlays
• Software Management: Current data pipelines can be very complex (although not very computationally intensive) with many software interdependencies; work using VMs and containers shows substantial promise
• Resource Management: Cloud resources have attractive features, such as on-demand allocation; can enterprise-level science requirements for high-throughput data analytics be met by the cloud?
