a fast database for large observational or simulation
play

A Fast Database for Large Observational or Simulation Datasets - PowerPoint PPT Presentation

A Fast Database for Large Observational or Simulation Datasets Adrian M. Partl Leibniz-Institut fr Astrophysik Potsdam (AIP) Monday, October 22, 12 Big Data in Astronomy and Astrophysics a tsunami of data... Raw data usually with


  1. A Fast Database for Large Observational or Simulation Datasets Adrian M. Partl Leibniz-Institut für Astrophysik Potsdam (AIP) Monday, October 22, 12

  2. Big Data in Astronomy and Astrophysics a tsunami of data... • Raw data usually with clear data formats, huge in Size of individual data set size Number of data sets Monday, October 22, 12

  3. Big Data in Astronomy and Astrophysics a tsunami of data... • Raw data usually with clear data formats, huge in Size of individual data set size • Derived / processed data highly irregular in formats, large in size Number of data sets Monday, October 22, 12

  4. Big Data in Astronomy and Astrophysics a tsunami of data... • Raw data usually with clear data formats, huge in Size of individual data set size • Derived / processed data highly irregular in formats, large in size • Analysed results with complex formats, small in Number of data sets size Monday, October 22, 12

  5. www.multidark.org • Total row count: 2.34 10 11 • MS SQL Server • Apache Tomcat • similar setup to Millenium DB • Most queries 100 sec • significant amount of queries > 1000 sec Monday, October 22, 12

  6. Problems while building the MultiDark DB • Data ingest time: Need to convert binary to ASCII CSV format (highly inefficient) • Data transformation: Computing values after ingest slow - best during ingest • Data indexing: Index on particle data (~10 10 particles) around one week • Data retrieval times slow on full table scans: cannot build index for every query • Spatial queries in 3D hard, impossible in 6D nearest neighbour search also inefficient Monday, October 22, 12

  7. Why RDBMS? • SQL - it took long time for the community to adopt SQL (we think this is the main problem with NoSQL) • proven, widely available, large user base • good for structured data • Problems: – Built for different purposes (business, web, ...) result sets usually small - mostly in memory solutions – parallelisation of data / sharding – can be expensive Monday, October 22, 12

  8. Our vision: • Open source DB solution for scientific purposes: A one size fits all solution built by the community for the community • Developments at AIP: – DB independent ingestion library and data transformation tool (DBIngestor and AsciiIngest) – MySQL sharding solution for scientific queries (PaQu) – MySQL plugins for data analysis, spatial queries and indexing, job queueing... – MySQL storage engine plugins for simulation raw data – Common web framework for data access (Daiquiri) Monday, October 22, 12

  9. Our Developments with MySQL Spider engine: Federated engine: Great MySQL sharding engine Used for temporary results in developed by Kentoku Shiba joins and aggregates: MySQL http://spiderformysql.com Distributed with MySQL to Query queue access tables on other servers. Spider engine (would need rewrite to perform better though) Spider Federated MySQL MySQL MySQL MySQL MySQL MySQL Monday, October 22, 12

  10. Our Developments with MySQL User Admin MySQL Daiquiri PaQu DBIngestor / AsciiIngest Query queue Spider engine Spider Federated MySQL MySQL MySQL MySQL MySQL MySQL Monday, October 22, 12

  11. PaQu: Parallel Query Reformulation Implicit Joins: Aggregates: SELECT a.*, b.*, c.* SELECT a.bar, AVG(a.foo) FROM a, b, c FROM a WHERE b=2 AND GROUP BY a.bar; b.id=c.b_id AND a.id=b.a_id; SELECT a.*, tmp.* SELECT a.bar, FROM a, SUM(a.sum)/SUM(a.cnt) (SELECT b.*, c.* FROM FROM c, (SELECT a.bar as bar, (SELECT b.* SUM(a.foo) as sum, FROM b COUNT(a.foo) as cnt WHERE b=2) as b FROM a WHERE b.id=c.b_id) GROUP BY a.bar) as a WHERE a.id=tmp.b.a_ids GROUP BY a.bar; Monday, October 22, 12

  12. Preliminary Performance Results indexed aggregate unindexed aggregate unindexed range speedup speedup speedup number of nodes number of nodes number of nodes • Strong correlation with hardware setup: – Cache sizes, size of data files (smaller is better / partitioning?), network and I/O performance Monday, October 22, 12

  13. Conclusions • Scientific data increases rapidly in size ⇒ problems for off-the-shelf database systems • Slow acceptance by scientific community of anything new poses problems for NoSQL solutions and favours SQL • Open source solution for scientific use of open source databases needed • Promising results for using MySQL + Spider + PaQu for scientific data mining and fast full table scans Monday, October 22, 12

  14. Interested? Help us in the development! DBIngestor/AsciiIngest: http://github.com/adrpar Thank you! Monday, October 22, 12

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend