Data at the Leibniz-Institute for Astrophysics
Kristin Riebe, AIP
2
AIP – Leibniz-Institute for Astrophysics Potsdam
- Research areas:
– cosmic magnetic fields (solar/stellar physics, magnetohydrodynamics)
– extragalactic astrophysics (galactic archeology, galaxies and quasars, cosmology)
- Development of Research Technology and Infrastructure
– Robotic telescopes, (3D) spectroscopy
– Supercomputing and E-Science
- Participation in many projects
– e.g. RAVE, ROSAT, XMM-Newton, LOFAR, MUSE, ...
3
Example data types at AIP
- Observations:
– RAVE
- Radial velocity measurements + spectra
– SDSS
- Mirror of DR7, catalog server
– „minor data sets“:
- Plate archive (historical plates)
- CALIFA (spectra of galaxies)
- Cepheids (collection of data for time series), ...
- Simulation data:
– Magnetohydrodynamics
– Cosmological simulations: particle data, dark matter halo catalogues, halo merger history, ...
4
Behind the scenes
- Supercomputers: Leibniz, Babel, for in-house simulations, data processing
- Almagest: Graywulf cluster for archiving, exchanging data, hosting databases, publishing data, 700 TB disk space
- Virtual research environment:
– Erebos: ~ 250 TB disk space
– Used by CLUES collaboration to exchange and process data
- Web servers for publishing smaller data sets
5
Data center task: Extract – Transform – Load
- Extract: from different sources
- Transform: checking, corrections, additions; bring into (standard) format
- Load: publish the data (server, web server)
6
Example: MultiDark Database
- Collaboration with Spanish MultiDark project
- Publish data of cosmological simulations in a simulation database
- Success similar to MillenniumDB! :-)
- http://www.multidark.org
- 2 simulations uploaded (12+6 TB)
- > 1 million queries in 2 years, ~ 1500 per day, 4 TB downloaded
- ~ 140 registered users
7
Example workflow: MultiDark Database
- Extract:
– Cosmologists produce data, copy them to a server at AIP (VRE)
- Transform:
– We check data and reading routines, data curation (C/Fortran/Perl/Python)
- Load:
– Ingest data into database (SQL, bulk copy)
- Check and test:
– Check the data for completeness, consistency (SQL)
– Create Peano-Hilbert keys, indexes (C#, Spatial 3D library (T. Budavari, G. Lemson))
- Publish:
– Using simpledb (Gerard Lemson, Millennium DB, jsp)
– Write/update documentation; update admin tables of the database
– Inform users
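The Load and Check steps above can be sketched roughly as follows. This is only an illustration: it uses Python's built-in sqlite3 as a stand-in for the actual database server, and the table name `halos`, its columns, the sample rows and `BOX_SIZE` are all made-up assumptions, not the MultiDark schema.

```python
import sqlite3

# Hypothetical halo rows: (halo_id, x, y, z, mass); all values are made up.
rows = [
    (1, 10.0, 20.0, 30.0, 1.5e12),
    (2, 11.5, 19.0, 31.0, 3.2e11),
    (3, 250.0, 5.0, 99.0, 8.0e13),
]
BOX_SIZE = 1000.0  # assumed box length, same units as the positions

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE halos ("
    "halo_id INTEGER PRIMARY KEY, x REAL, y REAL, z REAL, mass REAL)"
)

# Load: insert all rows in one transaction (a simple stand-in for bulk copy).
with conn:
    conn.executemany("INSERT INTO halos VALUES (?, ?, ?, ?, ?)", rows)

# Check completeness: the row count must match what was delivered.
(n_loaded,) = conn.execute("SELECT COUNT(*) FROM halos").fetchone()

# Check consistency: no positions outside the box, no non-positive masses.
(n_bad,) = conn.execute(
    "SELECT COUNT(*) FROM halos "
    "WHERE x < 0 OR x >= ? OR y < 0 OR y >= ? OR z < 0 OR z >= ? OR mass <= 0",
    (BOX_SIZE, BOX_SIZE, BOX_SIZE),
).fetchone()

# Index a column that users are expected to filter on.
conn.execute("CREATE INDEX idx_halos_mass ON halos (mass)")

print(n_loaded, n_bad)
```

In a real ingest the expected row count would come from the data provider rather than from the script itself.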
8
Transform: Data curation
- Check completeness of data sets
- Create homogeneous data sets, bring into useful (standard) formats
- Add identifiers, grid indexes etc. for faster queries & for representing relations in the database
- Cross-link data with other catalogues
=> usually we applied tailor-made solutions, tuned to each individual data set, custom reading routines required
=> now things are improving ...
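As an illustration of "add identifiers, grid indexes", here is a minimal sketch that attaches a running identifier and a grid cell index to each position. It assumes a periodic cubic box divided into `n_cells` cells per dimension; the function names and the record layout are hypothetical and not taken from any AIP tool.

```python
import math


def grid_index(x, y, z, box_size, n_cells):
    """Return one integer cell index for a position in a periodic cubic box."""
    cell = box_size / n_cells
    # The modulo wraps positions at or beyond the upper boundary back into the box.
    ix = int(math.floor(x / cell)) % n_cells
    iy = int(math.floor(y / cell)) % n_cells
    iz = int(math.floor(z / cell)) % n_cells
    return (ix * n_cells + iy) * n_cells + iz


def curate(records, box_size, n_cells):
    """Attach a running identifier and a grid index to each (x, y, z) record."""
    curated = []
    for row_id, (x, y, z) in enumerate(records, start=1):
        curated.append(
            {
                "id": row_id,
                "x": x,
                "y": y,
                "z": z,
                "grid": grid_index(x, y, z, box_size, n_cells),
            }
        )
    return curated


if __name__ == "__main__":
    for row in curate([(15.0, 25.0, 35.0), (99.9, 99.9, 99.9)], 100.0, 10):
        print(row)
```

With an index on the `grid` column, a query for a small region only has to touch the rows in the few cells that overlap it.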
9
DBIngestor and libhilbert
- DBIngestor library + AsciiIngest
– Adrian Partl, https://github.com/adrpar/DBIngestor, …/AsciiIngest
– Apply converters (unit conversions, adding identifiers for db indexing, spatial grid indexes)
– Apply asserters (nan, inf etc.)
– => transform and load in one go
– Easy to write own converters & add own reading routines for binary data
- C-library libhilbert
– For creating indexes of space-filling Peano-Hilbert curve in 20 dimensions
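To make the idea of converters, asserters and Peano-Hilbert keys concrete, here is a small Python sketch. It does not use or reproduce the DBIngestor or libhilbert APIs: all function names are hypothetical, the key is computed in 2D only (using the widely published iterative grid-to-curve mapping), and the grid size `n` is assumed to be a power of two.

```python
import math


def hilbert_key_2d(n, x, y):
    """Map cell (x, y) of an n x n grid (n a power of two) to its
    position along a 2D Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def assert_finite(value):
    """Asserter: reject NaN and infinite values before they reach the database."""
    if math.isnan(value) or math.isinf(value):
        raise ValueError(f"non-finite value: {value!r}")
    return value


def kpc_to_mpc(value):
    """Converter: example unit conversion (kiloparsec to megaparsec)."""
    return value / 1000.0


def ingest_row(x_kpc, y_kpc, box_mpc, n):
    """Assert, convert and add a curve key to one row in a single pass."""
    x = kpc_to_mpc(assert_finite(x_kpc))
    y = kpc_to_mpc(assert_finite(y_kpc))
    # Clamp so that a position exactly on the upper edge stays inside the grid.
    ix = min(int(x / box_mpc * n), n - 1)
    iy = min(int(y / box_mpc * n), n - 1)
    return {"x": x, "y": y, "phkey": hilbert_key_2d(n, ix, iy)}


if __name__ == "__main__":
    print(ingest_row(12500.0, 80000.0, 100.0, 8))
```

Cells that are consecutive along the curve are always neighbours on the grid, which is why sorting or indexing rows by such a key tends to keep spatially close objects close together on disk.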
10
Data publication
- Many possibilities, very often individual solutions for each project
- Now: new webapp Daiquiri, http://escience.aip.de/daiquiri/
- Developed by Jochen Klar and Adrian Partl
- Web application for publishing data
- Modular, highly customizable
- Using PHP, Zend-framework
- Modern interface using bootstrap, jQuery
- Authentication, Query Interface
- Wordpress integration
- One code base to serve most needs
- Open source, (easily) extendable
11
Daiquiri examples
- MultiDark2
- Califa
- 4MOST workshop
- Plate Archive
- Jubilee, Curie simulation database in Madrid
http://escience.aip.de/daiquiri/
(Slides 12-14: screenshots)
15
VO compliance
- Currently working on including VO protocols with Daiquiri
– Download data as VOTables (MySQL-VOTable-Dump, see github)
– TAP protocol for accessing data
– UWS for job queues (MySQL query queue)
- Problems:
– No public PHP libraries for IVOA protocols available (only in Java)
– But community rather needs PHP or Python implementations
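For readers unfamiliar with TAP, a synchronous query is essentially an HTTP request with a few fixed parameters. The sketch below only builds such a request URL in Python and does not send it; the service URL and the table name `halos` are hypothetical placeholders, and the parameter names (`REQUEST`, `LANG`, `QUERY`) follow the IVOA TAP convention for synchronous queries.

```python
from urllib.parse import urlencode

TAP_BASE = "https://example.org/tap"  # hypothetical service URL, not a real endpoint


def tap_sync_url(base, adql):
    """Build the URL for a synchronous TAP query (nothing is sent here)."""
    params = {"REQUEST": "doQuery", "LANG": "ADQL", "QUERY": adql}
    return f"{base}/sync?{urlencode(params)}"


# 'halos' is a made-up table name used only for illustration.
url = tap_sync_url(TAP_BASE, "SELECT TOP 10 * FROM halos")
print(url)
```

A TAP service would typically answer such a request with a VOTable document; long-running queries go through the asynchronous, UWS-based job interface instead.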
16
Concluding Remarks
- Common tasks for each data publication: extracting, transforming, uploading the data
- Different tool for each data set?
– Should rather use only a few, generalized tools, reusable, easier to maintain
– Takes a lot of time to develop
– => Collect tools from data centers? Combine efforts?
- Would like to have more implementations/libraries of VO