A Fast Database for Large Observational or Simulation Datasets - PowerPoint PPT Presentation

A Fast Database for Large Observational or Simulation Datasets Adrian M. Partl Leibniz-Institut für Astrophysik Potsdam (AIP) Monday, October 22, 12

Big Data in Astronomy and Astrophysics a tsunami of data... • Raw data usually with clear data formats, huge in Size of individual data set size Number of data sets Monday, October 22, 12

Big Data in Astronomy and Astrophysics a tsunami of data... • Raw data usually with clear data formats, huge in Size of individual data set size • Derived / processed data highly irregular in formats, large in size Number of data sets Monday, October 22, 12

Big Data in Astronomy and Astrophysics a tsunami of data... • Raw data usually with clear data formats, huge in Size of individual data set size • Derived / processed data highly irregular in formats, large in size • Analysed results with complex formats, small in Number of data sets size Monday, October 22, 12

www.multidark.org • Total row count: 2.34 10 11 • MS SQL Server • Apache Tomcat • similar setup to Millenium DB • Most queries 100 sec • significant amount of queries > 1000 sec Monday, October 22, 12

Problems while building the MultiDark DB • Data ingest time: Need to convert binary to ASCII CSV format (highly inefficient) • Data transformation: Computing values after ingest slow - best during ingest • Data indexing: Index on particle data (~10 10 particles) around one week • Data retrieval times slow on full table scans: cannot build index for every query • Spatial queries in 3D hard, impossible in 6D nearest neighbour search also inefficient Monday, October 22, 12

Why RDBMS? • SQL - it took long time for the community to adopt SQL (we think this is the main problem with NoSQL) • proven, widely available, large user base • good for structured data • Problems: – Built for different purposes (business, web, ...) result sets usually small - mostly in memory solutions – parallelisation of data / sharding – can be expensive Monday, October 22, 12

Our vision: • Open source DB solution for scientific purposes: A one size fits all solution built by the community for the community • Developments at AIP: – DB independent ingestion library and data transformation tool (DBIngestor and AsciiIngest) – MySQL sharding solution for scientific queries (PaQu) – MySQL plugins for data analysis, spatial queries and indexing, job queueing... – MySQL storage engine plugins for simulation raw data – Common web framework for data access (Daiquiri) Monday, October 22, 12

Our Developments with MySQL Spider engine: Federated engine: Great MySQL sharding engine Used for temporary results in developed by Kentoku Shiba joins and aggregates: MySQL http://spiderformysql.com Distributed with MySQL to Query queue access tables on other servers. Spider engine (would need rewrite to perform better though) Spider Federated MySQL MySQL MySQL MySQL MySQL MySQL Monday, October 22, 12

Our Developments with MySQL User Admin MySQL Daiquiri PaQu DBIngestor / AsciiIngest Query queue Spider engine Spider Federated MySQL MySQL MySQL MySQL MySQL MySQL Monday, October 22, 12

PaQu: Parallel Query Reformulation Implicit Joins: Aggregates: SELECT a.*, b.*, c.* SELECT a.bar, AVG(a.foo) FROM a, b, c FROM a WHERE b=2 AND GROUP BY a.bar; b.id=c.b_id AND a.id=b.a_id; SELECT a.*, tmp.* SELECT a.bar, FROM a, SUM(a.sum)/SUM(a.cnt) (SELECT b.*, c.* FROM FROM c, (SELECT a.bar as bar, (SELECT b.* SUM(a.foo) as sum, FROM b COUNT(a.foo) as cnt WHERE b=2) as b FROM a WHERE b.id=c.b_id) GROUP BY a.bar) as a WHERE a.id=tmp.b.a_ids GROUP BY a.bar; Monday, October 22, 12

Preliminary Performance Results indexed aggregate unindexed aggregate unindexed range speedup speedup speedup number of nodes number of nodes number of nodes • Strong correlation with hardware setup: – Cache sizes, size of data files (smaller is better / partitioning?), network and I/O performance Monday, October 22, 12

Conclusions • Scientific data increases rapidly in size ⇒ problems for off-the-shelf database systems • Slow acceptance by scientific community of anything new poses problems for NoSQL solutions and favours SQL • Open source solution for scientific use of open source databases needed • Promising results for using MySQL + Spider + PaQu for scientific data mining and fast full table scans Monday, October 22, 12

Interested? Help us in the development! DBIngestor/AsciiIngest: http://github.com/adrpar Thank you! Monday, October 22, 12

A Fast Database for Large Observational or Simulation Datasets - PowerPoint PPT Presentation

A Fast Database for Large Observational or Simulation Datasets Adrian M. Partl Leibniz-Institut fr Astrophysik Potsdam (AIP) Monday, October 22, 12 Big Data in Astronomy and Astrophysics a tsunami of data... Raw data usually with

Observational Methods and NATM NATM System for Observational approach to tunnel design Eurocode

Lecture 6/Chapters 5&6 Observational Studies & Review Advantages of Observational

Database Utilities 10/17/2007 DC/Win Database Utilities Opening Database Utilities From File on

Outline Narcisse Ngada DESY, MKK 1) What is simulation ? 14.05.2014 2) Why simulation ? 3)

Being a METS Startup Fast Failure; Fast Reward November 2016 Fast Failure; Fast Reward

Grid simulation (AliEn) Outline GRID simulation Simulation tool Ptolemy (Berkeley)

Observational Constraints of Observational Constraints of the Epoch of the Epoch of

Observational and Numerical Observational and Numerical Study of Ocean Dynamics over Study of

Chapter 2: Observational Studies In an observational study the subjects determine whether

Observational studies and experiments Introduction to Data Types of studies Observational

Lecture 6/Chapters 5&6 backward in time, about the past. Observational Studies & Review

Federated Observational and Simulation Data in the NASA Center for Climate Simulation Data

NEBC Database Course 2008 Database Servers Database Interfaces Tim Booth : tbooth@ceh.ac.uk

T7 Cloud Simulation On-demand access simulation December 2016 T7 Cloud Simulation December 2016

Simulation Simulation CHAPTER 1 INTRODUCTION TO SIMULATION 2 MODELING CHAPTER 1 INTRODUCTION

Large-N observatio ional l data 1 Large-N observational data The basic idea Whenever cases

Presentation Q1 2010 1 Disclaimer This presentation has been prepared by Duni AB (the

Presentation GSPP More pictures Disinfection of hands Disinfection of hands Disinfection of

Verifying Array Manipulating Programs by Tiling Authors: Supratik Chakraborty, Ashutosh Gupta,

First year review WP2 overview Trento - September 24th, 2007 1 Goal To investigate

ORCHIDACEA RESORT Kata Beach, Phuket w w w . o r c h i d a c e a r e s o r t . c o m About Us

Sales for the First Quarter 2019/20 ending 30 June 2019 18.07.2019 Q1 2019/20 Sales : Key

COVID PRESENT STATUS DR. K. BHARATH MD, FIP CONSULTANT CHEST PHYSICIAN & INTERVENTIONAL

Cost-effectiveness modelling for TB interventions Gabriela B Gomez, Associate Professor in

A Fast Database for Large Observational or Simulation Datasets - PowerPoint PPT Presentation

A Fast Database for Large Observational or Simulation Datasets Adrian M. Partl Leibniz-Institut fr Astrophysik Potsdam (AIP) Monday, October 22, 12 Big Data in Astronomy and Astrophysics a tsunami of data... Raw data usually with

Observational Methods and NATM NATM System for Observational approach to tunnel design Eurocode

Lecture 6/Chapters 5&amp;6 Observational Studies &amp; Review Advantages of Observational

Database Utilities 10/17/2007 DC/Win Database Utilities Opening Database Utilities From File on

Outline Narcisse Ngada DESY, MKK 1) What is simulation ? 14.05.2014 2) Why simulation ? 3)

Being a METS Startup Fast Failure; Fast Reward November 2016 Fast Failure; Fast Reward

Grid simulation (AliEn) Outline GRID simulation Simulation tool Ptolemy (Berkeley)

Observational Constraints of Observational Constraints of the Epoch of the Epoch of

Observational and Numerical Observational and Numerical Study of Ocean Dynamics over Study of

Chapter 2: Observational Studies In an observational study the subjects determine whether

Observational studies and experiments Introduction to Data Types of studies Observational

Lecture 6/Chapters 5&amp;6 backward in time, about the past. Observational Studies &amp; Review

Federated Observational and Simulation Data in the NASA Center for Climate Simulation Data

NEBC Database Course 2008 Database Servers Database Interfaces Tim Booth : tbooth@ceh.ac.uk

T7 Cloud Simulation On-demand access simulation December 2016 T7 Cloud Simulation December 2016

Simulation Simulation CHAPTER 1 INTRODUCTION TO SIMULATION 2 MODELING CHAPTER 1 INTRODUCTION

Large-N observatio ional l data 1 Large-N observational data The basic idea Whenever cases

Presentation Q1 2010 1 Disclaimer This presentation has been prepared by Duni AB (the

Presentation GSPP More pictures Disinfection of hands Disinfection of hands Disinfection of

Verifying Array Manipulating Programs by Tiling Authors: Supratik Chakraborty, Ashutosh Gupta,

First year review WP2 overview Trento - September 24th, 2007 1 Goal To investigate

ORCHIDACEA RESORT Kata Beach, Phuket w w w . o r c h i d a c e a r e s o r t . c o m About Us

Sales for the First Quarter 2019/20 ending 30 June 2019 18.07.2019 Q1 2019/20 Sales : Key

COVID PRESENT STATUS DR. K. BHARATH MD, FIP CONSULTANT CHEST PHYSICIAN &amp; INTERVENTIONAL

Cost-effectiveness modelling for TB interventions Gabriela B Gomez, Associate Professor in

Lecture 6/Chapters 5&6 Observational Studies & Review Advantages of Observational

Lecture 6/Chapters 5&6 backward in time, about the past. Observational Studies & Review

COVID PRESENT STATUS DR. K. BHARATH MD, FIP CONSULTANT CHEST PHYSICIAN & INTERVENTIONAL