

Data Ingestion in CTA
Stefano Gallozzi (1), Eva Sciacca (2), L. Angelo Antonelli (1,3), Alessandro Costa (2)
(1) INAF, Astrophysical Observatory of Rome
(2) INAF, Astronomical Observatory of Catania
(3) ASDC, ASI-Science Data Center
RIA-653549


  1. Data Ingestion in CTA
     Stefano Gallozzi (1), Eva Sciacca (2), L. Angelo Antonelli (1,3), Alessandro Costa (2)
     (1) INAF, Astrophysical Observatory of Rome
     (2) INAF, Astronomical Observatory of Catania
     (3) ASDC, ASI-Science Data Center
     RIA-653549
     stefano.gallozzi@oa-roma.inaf.it

  2. Cherenkov Telescope Array (https://cta-observatory.org)
     WHAT: CTA is the worldwide project for the future of Very High Energy gamma-ray astronomy, with ~20 telescopes at the northern site (Canary Islands) and ~100 telescopes at the southern site (Chile).
     WHO: The CTA Consortium consists of scientists and engineers from 32 countries on 5 continents and has become a truly global (ESFRI) project.
     WHY: One of the major technological challenges is the handling and archiving of the huge amount of data (from 20 to 100 PB/year) coming from the observatory facilities.

  3. Data Life Cycle

  4. CTA Data Model (data level, short name, description)
     • DL0, DAQ-RAW: Acquired raw data.
     • DL1, CALIBRATED: Calibrated camera data.
     • DL2, RECONSTRUCTED: Reconstructed shower parameters (such as energy, direction, particle ID).
     • DL3, REDUCED: Sets of selected events with associated instrumental response characterizations needed for science analysis.
     • DL4, SCIENCE: High-level binned data products (such as spectra, sky maps, or light curves).
     • DL5, OBSERVATORY: Legacy observatory data (such as survey sky maps or source catalog).
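
As a minimal sketch, the data levels above can be held in a small lookup structure. The Python names (`DataLevel`, `describe`) are hypothetical and not part of any CTA software; only the level codes, short names and descriptions come from the slide.

```python
from enum import Enum


class DataLevel(Enum):
    """CTA data levels from the slide, stored as (short name, description)."""

    DL0 = ("DAQ-RAW", "Acquired raw data.")
    DL1 = ("CALIBRATED", "Calibrated camera data.")
    DL2 = ("RECONSTRUCTED",
           "Reconstructed shower parameters (such as energy, direction, particle ID).")
    DL3 = ("REDUCED",
           "Sets of selected events with associated instrumental response "
           "characterizations needed for science analysis.")
    DL4 = ("SCIENCE",
           "High-level binned data products (such as spectra, sky maps, or light curves).")
    DL5 = ("OBSERVATORY",
           "Legacy observatory data (such as survey sky maps or source catalog).")

    @property
    def short_name(self) -> str:
        return self.value[0]

    @property
    def description(self) -> str:
        return self.value[1]


def describe(level_code: str) -> str:
    """Return 'DLn (SHORT-NAME): description' for a code such as 'DL3'."""
    level = DataLevel[level_code]  # raises KeyError for an unknown code
    return f"{level.name} ({level.short_name}): {level.description}"


print(describe("DL0"))
```

An enumeration keeps the six levels closed and ordered, so a tool that tags archived files by level cannot silently accept an unknown code.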

  5. Data Requirements
     Without data compression and assuming 165 operational nights/yr:
     • ASTRI/Prot. → ~0.8 TB/night → ~0.3 PB/year
     • Mini-Array → ~3 TB/night → ~6.1 TB/night A.R. → ~1.0 PB/year A.R.
     • CTA → ~8.5 GB/s → ~40 TB/night → ~4 PB/year → ~20 PB/year A.R.
     (A.R. = After Reduction → input + processed data, including calibrations, intermediate reduction and MC simulation data)
     OPTIMISTIC SCENARIO. The pessimistic one can take ~>100 PB/year!
     The CTA Archive system must store, manage, preserve and provide easy access to such a huge amount of data for a long time.
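
The step from a nightly volume to a yearly one is a plain multiplication by the number of operational nights. The sketch below checks this for the Mini-Array after-reduction figure only; `yearly_volume_pb` is a hypothetical helper and the decimal convention 1 PB = 1000 TB is an assumption, since the slide does not state its units convention. The other figures on the slide may include factors that are not stated there, so they are not recomputed here.

```python
NIGHTS_PER_YEAR = 165  # operational nights per year, as stated on the slide
TB_PER_PB = 1000       # assumption: decimal units (1 PB = 1000 TB)


def yearly_volume_pb(tb_per_night: float, nights: int = NIGHTS_PER_YEAR) -> float:
    """Convert a nightly data volume in TB into a yearly volume in PB."""
    return tb_per_night * nights / TB_PER_PB


# Mini-Array after reduction: ~6.1 TB/night (value taken from the slide)
mini_array = yearly_volume_pb(6.1)
print(f"Mini-Array after reduction: ~{mini_array:.1f} PB/year")
```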

  6. Archive Framework: Open Archival Information System (OAIS) standard
     • The INGEST unit is a collection of software and/or middleware that receives bulk data of different types coming from the array and prepares them for storage, performing basic operations such as data indexing, dependency handling and compression.
     • The STORAGE unit guarantees efficient retrieval of ingested data and provides simple management and maintenance of the archive hierarchy. Storage also supervises the status of the media used in the archive, guaranteeing error control and data security.
     • The ADMINISTRATION unit deals with all the operations related to the CTA archive system and its management. It will ensure archive performance and the fulfillment of standards and requirements by means of dedicated monitoring functionality and recovery from failures.
     • The ACCESS unit consists of a collection of software and on-line services that provide efficient access to the data for the other CTA components (e.g. the data processing pipelines). Furthermore, it will let CTA users access CTA data according to their specific data access privileges.
     • The DATA TRANSFER unit will guarantee the transfer of data and data products between the on-site and the off-site zones of the archive system.
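
To make the basic operations of the INGEST unit concrete, here is a minimal sketch that compresses a payload, computes a checksum and builds an index record with its dependencies. It is an illustration only: the `ingest` function, the record fields, the file names and the choice of gzip and SHA-256 are assumptions, not the CTA archive implementation.

```python
import gzip
import hashlib


def ingest(name: str, payload: bytes, depends_on: tuple = ()) -> dict:
    """Prepare one data product for storage and return its index record."""
    compressed = gzip.compress(payload)
    return {
        "name": name,
        # Checksum of the original bytes, so integrity can be verified later.
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size": len(payload),
        "stored_size": len(compressed),
        # Other products this one needs, e.g. calibration files.
        "depends_on": list(depends_on),
        "blob": compressed,
    }


# Hypothetical file names and a dummy payload of 4096 zero bytes.
record = ingest("run001_dl0.fits", b"\x00" * 4096, depends_on=("calib001.fits",))
restored = gzip.decompress(record["blob"])
print(record["name"], record["size"], "->", record["stored_size"], "bytes")
```

Storing the checksum of the uncompressed bytes lets the storage unit detect corruption regardless of how the payload is compressed on disk.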

  7. Architecture

  8. Running the Prototype
     • The test infrastructure has been set up using VirtualBox virtual machines and Docker containers.
     • Demo datasets coming from the ASTRI project are uploaded to the CTA OneZone within a space supported by the two providers.
     • The ingested data are enriched with metadata through the Cloud Data Management Interface (CDMI) or, alternatively, the REST API.
     • Metadata queries are performed using the REST API and indexing functions (associated with the space) on pre-defined extended attributes (metadata).
     • The CouchBase database (embedded in OneData) can alternatively be used to query and retrieve the metadata using Elastic Search engines (e.g. N1QL) or common MapReduce functions, through the standard CouchBase console and the SDK on the client side. This will give higher-level application frameworks and end-user analysis tools versatile access to the whole CTA dataset.

  9. Astronomical DATA (FITS format?); data descriptors == metadata

  10. Metadata

  11. Metadata
      Sample ingestion:
          curl -k -H "$TOKEN_HEADER" -H "$CDMI_VSN_HEADER" \
              -H 'Content-Type: application/cdmi-object' \
              -d '{"metadata" : {"PROGRAM_ID" : "001"}}' \
              -X PUT "$ENDPOINTDATA"
      Sample indexing function:
          function(meta) {
              // Index each file by its PROGRAM_ID attribute, if present.
              if (meta['PROGRAM_ID']) {
                  return meta['PROGRAM_ID'];
              }
              return null;
          }
      Query using a REST-API call:
          curl -v -k --tlsv1.2 -Ss -H "X-Auth-Token: $TOKEN" \
              -X GET "https://$HOST:8443/api/v3/oneprovider/query-index/$INDEX_ID?key=\"0001\"&stale=false"
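
The indexing function on this slide is JavaScript evaluated on the provider side. As a sketch of its behaviour, the Python below mirrors it and also assembles the query URL used in the REST call. `index_key`, `query_url`, the host name and the index ID are hypothetical placeholders; the URL path is the one shown on the slide, and nothing is sent over the network.

```python
import json
from urllib.parse import quote


def index_key(meta: dict):
    """Mirror of the slide's indexing function: PROGRAM_ID if set, else None."""
    program_id = meta.get("PROGRAM_ID")
    return program_id if program_id else None


def query_url(host: str, index_id: str, key: str) -> str:
    """Build the query-index URL; the key is a JSON string, percent-encoded."""
    encoded_key = quote(json.dumps(key))  # "001" -> %22001%22
    return (
        f"https://{host}:8443/api/v3/oneprovider/query-index/"
        f"{index_id}?key={encoded_key}&stale=false"
    )


# A file tagged at ingestion time is indexed under its PROGRAM_ID ...
print(index_key({"PROGRAM_ID": "001"}))
# ... while a file without that attribute is left out of the index.
print(index_key({"OBSERVER": "unknown"}))
# Placeholder host and index ID, for illustration only.
print(query_url("provider.example.org", "INDEX_ID", "001"))
```

Percent-encoding the quoted key avoids the shell escaping needed in the curl form of the same request.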

  12. Distributed Archive Advantages
      • lower costs compared with a single huge data center
      • easy manageability and maintenance by the human resources of each site
      • distributed database of metadata within the architecture
      • easy scalability by adding new nodes to the system
      • disaster recovery comes for free through redundancy across different sites

  13. Issues and Future Work
      • Improve metadata queries: make it possible to perform more complex queries.
      • Test OneData roles and data permissions (through connection with an Authentication and Authorization Infrastructure) → data is proprietary for a fixed period, then it becomes public.
      • Test the replication policies between providers.
      • Automatic metadata ingestion (from FITS file headers).
      • Prototype deployment of the CTA Archive at 3 sites (INAF-Catania, INAF-Rome, ASDC) to enable CTA users to test it.
      • Prototype deployment with Data-Grid functionalities for CTA-specific users (simulation and pipelines).
      • A look ahead to Cloud services, to be ready for the migration of the CTA Workload Management System (DIRAC) from the DataGrid environment to the Cloud paradigm.
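
Automatic metadata ingestion from FITS headers could start from something like the sketch below, which reads 80-character header cards and turns them into a metadata dictionary. It is a simplified, standard-library-only illustration and not the planned CTA implementation: `parse_card`, `header_to_metadata` and the `PROG_ID` keyword are hypothetical, and only simple string, logical, integer and float values are handled.

```python
def parse_card(card: str):
    """Parse one 80-character FITS header card into (keyword, value), or None."""
    keyword = card[:8].strip()
    if card[8:10] != "= " or not keyword:
        return None  # COMMENT, HISTORY and blank cards carry no value
    rest = card[10:].strip()
    if rest.startswith("'"):
        # String value: runs to the closing quote; '' is an escaped quote.
        end = 1
        while end < len(rest):
            if rest[end] == "'":
                if rest[end + 1:end + 2] == "'":
                    end += 2
                    continue
                break
            end += 1
        return keyword, rest[1:end].replace("''", "'").rstrip()
    value = rest.split("/", 1)[0].strip()  # drop the optional comment
    if value in ("T", "F"):
        return keyword, value == "T"
    for convert in (int, float):
        try:
            return keyword, convert(value)
        except ValueError:
            pass
    return keyword, value


def header_to_metadata(header: str) -> dict:
    """Split a header into 80-character cards and collect keyword/value pairs."""
    metadata = {}
    for start in range(0, len(header), 80):
        card = header[start:start + 80]
        if card[:8].strip() == "END":
            break
        parsed = parse_card(card)
        if parsed:
            metadata[parsed[0]] = parsed[1]
    return metadata


cards = [
    "SIMPLE  =                    T / conforms to FITS standard",
    "NAXIS   =                    0",
    "PROG_ID = '001     '           / hypothetical keyword",
]
header = "".join(card.ljust(80) for card in cards) + "END".ljust(80)
metadata = header_to_metadata(header)
print(metadata)
```

The resulting dictionary has the same shape as the `metadata` object in the CDMI request of slide 11, so it could be serialized to JSON and attached to the ingested file in the same way.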

  14. References
      • CTA web page: http://www.cta-observatory.org/
      • ASTRI web page: http://www.brera.inaf.it/astri/
      • YouTube demo: https://youtu.be/TbmJn1bIizE
      • OneData documentation: https://onedata.org/docs/index.html
      • OneData @ Docker Hub: https://hub.docker.com/u/onedata/

  15. Ready to share experience!
      stefano.gallozzi@oa-roma.inaf.it
      eva.sciacca@oact.inaf.it
      alessandro.costa@oact.inaf.it
      angelo.antonelli@oa-roma.inaf.it
      https://www.indigo-datacloud.eu
      Better Software for Better Science.

  16. Questions? CTA-North → ← CTA-South
