Extreme Data-Intensive Scientific Computing (PowerPoint presentation)
Alex Szalay, JHU


SLIDE 1

Extreme Data-Intensive Scientific Computing

Alex Szalay JHU

SLIDE 2

Big Data in Science

  • Data is growing exponentially, in all sciences
  • All science is becoming data-driven
  • This is happening very rapidly
  • Data is becoming increasingly open/public
  • Non-incremental!
  • Convergence of the physical and life sciences through Big Data (statistics and computing)
  • The “long tail” is important
  • A scientific revolution in how discovery takes place

=> a rare and unique opportunity

SLIDE 3
SLIDE 4

Non-Incremental Changes

  • Need new randomized, incremental algorithms
    – Best result in 1 min, 1 hour, 1 day, 1 week
  • Need new computational tools and strategies
    – Not just statistics, not just computer science, not just astronomy…
  • Need new data-intensive scalable architectures
  • Science is moving from hypothesis-driven to data-driven discoveries
    – Astronomy has always been data-driven; this is now becoming more generally accepted

SLIDE 5

Sloan Digital Sky Survey

  • “The Cosmic Genome Project”
  • Two surveys in one
    – Photometric survey in 5 bands
    – Spectroscopic redshift survey
  • Data is public
    – 2.5 Terapixels of images => 5 Tpx
    – 10 TB of raw data => 120 TB processed
    – 0.5 TB catalogs => 35 TB in the end
  • Started in 1992, finished in 2008
  • Database and spectrograph built at JHU (SkyServer)

SLIDE 6

SkyServer

  • Prototype in 21st-century data access
    – 1B web hits in 10 years
    – 4,000,000 distinct users vs. 15,000 astronomers
    – The world’s most used astronomy facility today
    – The emergence of the “Internet scientist”
    – Collaborative server-side analysis done

SLIDE 7

The SDSS Genealogy

Projects descended from the SDSS SkyServer: VO Services, Life Under Your Feet, OncoSpace, CASJobs (MyDB), Turbulence DB, Milky Way Laboratory, INDRA Simulation, SkyQuery, Open SkyQuery, MHD DB, JHU 1K Genomes, Pan-STARRS, Hubble Legacy Archive, VO Footprint, VO Spectrum, SuperCOSMOS, Millennium (Potsdam), Palomar QUEST, GALEX, GalaxyZoo, UKIDSS

SLIDE 8

Data in HPC Simulations

  • HPC is an instrument in its own right
  • The largest simulations approach petabytes
    – From supernovae to turbulence, biology and brain modeling
  • Need public access to the best and latest through interactive numerical laboratories
  • Creates new challenges in
    – How to move the petabytes of data (high-speed networking)
    – How to interface (virtual sensors, immersive analysis)
    – How to look at it (render on top of the data, drive remotely)
    – How to analyze it (algorithms, scalable analytics)

SLIDE 9

Silver River Transfer

  • 150 TB in less than 10 days from Oak Ridge to JHU, using a dedicated 10G connection

SLIDE 10

Immersive Turbulence

“… the last unsolved problem of classical physics…” (Feynman)

  • Understand the nature of turbulence
    – Consecutive snapshots of a large simulation of turbulence: now 30 Terabytes
    – Treat it as an experiment: play with the database!
    – Shoot test particles (sensors) from your laptop into the simulation, as in the movie Twister
    – Next: 70 TB MHD simulation
  • New paradigm for analyzing simulations

with C. Meneveau, S-Y. Chen, G. Eyink, R. Burns
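The test-particle idea can be sketched numerically: sample the stored velocity field at the particle's position and take explicit time steps. A minimal pure-Python sketch, where the grid, Euler stepping, and nearest-cell sampling are illustrative simplifications (the actual turbulence database interpolates in both space and time, server-side):

```python
def advect(field, pos, dt, steps):
    """Advect a test particle through a gridded 2-D velocity field
    using nearest-grid-point sampling and explicit Euler steps.
    field[i][j] = (vx, vy) on a unit-spaced, periodic grid."""
    ny, nx = len(field), len(field[0])
    x, y = pos
    trajectory = [(x, y)]
    for _ in range(steps):
        # nearest grid cell (a real service interpolates in space and time)
        i = int(round(y)) % ny
        j = int(round(x)) % nx
        vx, vy = field[i][j]
        x, y = x + dt * vx, y + dt * vy
        trajectory.append((x, y))
    return trajectory

# uniform flow in +x: the particle should drift right only
field = [[(1.0, 0.0)] * 8 for _ in range(8)]
traj = advect(field, pos=(0.0, 3.0), dt=0.5, steps=4)
print(traj[-1])  # -> (2.0, 3.0)
```

In the database version the "field" is a multi-terabyte set of snapshots, so only the sampled velocities cross the network, never the field itself.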

SLIDE 11

Spatial queries, random samples

  • Spatial queries require multi-dimensional indexes
  • (x, y, z) alone does not work: need discretisation
    – Index on (ix, iy, iz) with ix = floor(x/10), etc.
  • More sophisticated: space-filling curves
    – Bit-interleaving / octree / Z-index
    – Peano-Hilbert curve
    – Need custom functions for range queries
    – Plug in a modular space-filling library (Budavari)
  • Random sampling using a RANDOM column
    – RANDOM drawn from [0, 1000000]
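The bit-interleaved Z-index can be written in a few lines: discretise each coordinate, then weave the bits of the three cell indices into one key, so that points close in 3-D tend to be close in the 1-D ordering. The names `morton3` and `cell_key` are illustrative, not the SkyServer/Budavari library API:

```python
import math

def morton3(ix, iy, iz, bits=10):
    """Z-index: interleave the bits of three cell indices."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b + 2)
    return key

def cell_key(x, y, z, cell=10.0):
    # discretise first, as on the slide: ix = floor(x / 10), etc.
    return morton3(math.floor(x / cell),
                   math.floor(y / cell),
                   math.floor(z / cell))

print(morton3(1, 0, 0))           # -> 1
print(morton3(0, 1, 0))           # -> 2
print(morton3(1, 1, 1))           # -> 7
print(cell_key(15.0, 7.0, 29.0))  # cells (1, 0, 2) -> 33
```

A B-tree over such a key turns a 3-D box query into a small number of sequential 1-D range scans, which is why the slides pair it with custom range-query functions.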

SLIDE 12

Cosmological Simulations

In 2000, cosmological simulations had 10^10 particles and produced over 30 TB of data (Millennium)

  • Build up dark matter halos
  • Track the merging history of halos
  • Use it to assign star formation history
  • Combination with spectral synthesis
  • Realistic distribution of galaxy types
  • Today: simulations with 10^12 particles and PB of output are under way (MillenniumXXL, Silver River, Exascale Sky)
  • Hard to analyze the data afterwards
  • What is the best way to compare to real data?
SLIDE 13

The Milky Way Laboratory

  • Use cosmology simulations as an immersive laboratory for general users
  • Via Lactea-II (20 TB) as prototype, then Silver River (50B particles) as production (15M CPU hours)
  • 800+ hi-rez snapshots (2.6 PB) => 800 TB in the DB
  • Users can insert test particles (dwarf galaxies) into the system and follow their trajectories in the pre-computed simulation
  • Users interact remotely with a PB in ‘real time’

Madau, Rockosi, Szalay, Wyse, Silk, Kuhlen, Lemson, Westermann, Blakeley

SLIDE 14

Visualizing Petabytes

  • Needs to be done where the data is…
  • It is easier to send an HD 3D video stream to the user than all the data
    – Interactive visualizations driven remotely
  • Visualizations are becoming IO-limited: precompute the octree and prefetch to SSDs
  • It is possible to build individual servers with extreme data rates (5 GBps per server… see Data-Scope)
  • Prototype on the turbulence simulation already works: data streaming directly from the DB to the GPU
  • N-body simulations next
SLIDE 15

Streaming Visualization of Turbulence

Kai Buerger, Technische Universität München, 24 million particles

SLIDE 16

Scalable Data-Intensive Analysis

  • Large data sets => the data resides on hard disks
  • Analysis has to move to the data
  • Hard disks are becoming sequential devices
    – For a PB data set you cannot use a random access pattern
  • Both analysis and visualization become streaming problems
  • The same is true of searches
    – Massively parallel sequential crawlers (MapReduce, Hadoop, etc.)
  • Spatial indexing needs to be maximally sequential
    – Space-filling curves (Peano-Hilbert, Morton, …)
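The streaming point can be made concrete with a one-pass statistic: each chunk is read once, in arrival order, exactly as a sequential crawler would, and nothing is ever revisited. A minimal sketch using Welford's online update (illustrative, not any specific SkyServer code):

```python
def stream_stats(chunks):
    """One-pass (streaming) mean and variance over a sequence of
    data chunks, consumed in arrival order -- a single sequential
    sweep over the disk, never random access."""
    n, mean, m2 = 0, 0.0, 0.0
    for chunk in chunks:
        for x in chunk:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)   # Welford's online update
    variance = m2 / n if n else 0.0
    return n, mean, variance

# pretend each chunk is one sequential disk read
chunks = [[1.0, 2.0], [3.0, 4.0], [5.0]]
n, mean, var = stream_stats(chunks)
print(n, mean, var)  # -> 5 3.0 2.0
```

The same pattern generalizes to any associative aggregate, which is what makes the MapReduce-style crawlers on the slide work at petabyte scale.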

SLIDE 17

Increased Diversification

One shoe does not fit all!

  • Diversity grows naturally, no matter what
  • Evolutionary pressures help
  • Individual groups want specializations
  • Large floating-point calculations move to GPUs
  • Big data moves into the cloud (private or public)
  • Random IO moves to solid state disks
  • Stream processing is emerging
  • NoSQL vs. databases vs. column stores vs. SciDB …

SLIDE 18

Extending SQL Server

  • User Defined Functions in the DB execute inside CUDA
    – 100x gains in floating-point-heavy computations
  • Dedicated service for direct access
    – Shared-memory IPC with on-the-fly data transform

Richard Wilton and Tamas Budavari (JHU)

SLIDE 19

Large Arrays in SQL Server

  • Recent effort by Laszlo Dobos (with J. Blakeley and D. Tomic)
  • Written in C++
  • Arrays packed into varbinary(8000) or varbinary(max)
  • Various subsets, aggregates, extractions and conversions in T-SQL (see the regrid example):

    SELECT s.ix, DoubleArray.Avg(s.a)
    INTO ##temptable
    FROM DoubleArray.Split(@a, Int16Array.Vector_3(4,4,4)) s

    SELECT @subsample = DoubleArray.Concat_N('##temptable')

Here @a is an array of doubles with 3 indices. The first command averages the array over 4×4×4 blocks and returns the block indices and average values into a table; the second builds a new (collapsed) array from its output.
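For readers without SQL Server, the regrid is just a block average. A pure-Python sketch of the same Split / Avg / Concat collapse, where the function name `regrid` and the nested-list representation are illustrative:

```python
def regrid(a, b):
    """Collapse a 3-D nested list by averaging over b x b x b
    blocks -- the plain-Python equivalent of the T-SQL
    Split / Avg / Concat_N pipeline above."""
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    out = [[[0.0] * (nz // b) for _ in range(ny // b)]
           for _ in range(nx // b)]
    for i in range(nx // b):
        for j in range(ny // b):
            for k in range(nz // b):
                s = 0.0
                for di in range(b):
                    for dj in range(b):
                        for dk in range(b):
                            s += a[b*i + di][b*j + dj][b*k + dk]
                out[i][j][k] = s / b**3   # block average
    return out

# 4x4x4 array whose value is its x-index, averaged over 2x2x2 blocks
a = [[[float(i)] * 4 for _ in range(4)] for i in range(4)]
small = regrid(a, 2)
print(small)  # -> [[[0.5, 0.5], [0.5, 0.5]], [[2.5, 2.5], [2.5, 2.5]]]
```

Doing this inside the database matters because only the collapsed array, not the full-resolution one, ever leaves the server.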

SLIDE 20

TileDB

  • Distributed DB that adapts to query patterns
  • No set physical schema
    – Represents data as tiles
    – Tiles replicate/migrate based on actual traffic
  • Can automatically load from an existing DB
    – Inherits its schema (for querying only!)
  • Fault tolerance
    – From one query, derive many mini-queries
    – Each mini-query is a checkpoint
    – Can also estimate overall progress through ‘tiling’
  • Execution order can be determined by sampling
    – Faster than sqrt(N) convergence
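The sampled, checkpointed execution can be sketched as a progressive estimator over a randomly ordered scan: each checkpoint yields a usable answer and a natural restart point. The name `progressive_mean` and the checkpoint sizes are illustrative, not TileDB's API:

```python
import random

def progressive_mean(data, checkpoints, seed=0):
    """Estimate the mean of a large column from a growing random
    sample, reporting the running estimate at each checkpoint --
    the kind of sampled, mini-query execution described above."""
    rng = random.Random(seed)
    order = list(range(len(data)))
    rng.shuffle(order)              # random execution order over the 'tiles'
    total, estimates = 0.0, []
    for n, idx in enumerate(order, 1):
        total += data[idx]
        if n in checkpoints:        # checkpoint: emit a partial answer
            estimates.append((n, total / n))
    return estimates

data = list(range(1000))            # true mean is 499.5
ests = progressive_mean(data, {10, 100, 1000})
for n, est in ests:
    print(n, est)   # estimates converge toward 499.5; n=1000 is exact
```

With uniform random sampling the error of each estimate shrinks like 1/sqrt(n); the slide's claim is that ordering the tiles by an initial sample can beat that baseline.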

SLIDE 21

JHU Data-Scope

  • Funded by an NSF MRI grant to build a new ‘instrument’ to look at data
  • Goal: 102 servers for $1M + about $200K for switches and racks
  • Two-tier: performance (P) and storage (S)
  • Large (5 PB) + cheap + fast (400+ GBps), but… a special-purpose instrument

Revised configuration:

                 1P      1S   All P   All S    Full
  servers         1       1      90       6     102
  rack units      4      34     360     204     564
  capacity       24     720    2160    4320    6480   TB
  price         8.8      57     8.8      57     792   $K
  power         1.4      10     126      60     186   kW
  GPU*         1.35          121.5              122   TF
  seq IO        5.3     3.8     477      23     500   GBps
  IOPS          240      54   21600     324   21924   kIOPS
  netwk bw       10      20     900     240    1140   Gbps

SLIDE 22

Sociology

  • Broad sociological changes
    – Convergence of the physical and life sciences
    – Data collection in ever larger collaborations
    – Virtual observatories: CERN, VAO, NCBI, NEON, OOI, …
    – Analysis decoupled, done off archived data by smaller groups
    – Emergence of the citizen/Internet scientist
    – Impact of demographic changes in science
  • Need to start training the next generations
    – Π-shaped vs. I-shaped people
    – Early involvement in “computational thinking”

SLIDE 23

Summary

  • Science is increasingly driven by data (large and small)
  • Large data sets are here; COTS solutions are not
  • Changing sociology
  • From hypothesis-driven to data-driven science
  • We need new instruments: “microscopes” and “telescopes” for data
  • The same problems are present in business and society
  • Data changes not only science, but society
  • A new, Fourth Paradigm of Science is emerging…

A convergence of statistics, computer science, and the physical and life sciences…