Data Ingestion in CTA Stefano Gallozzi 1 , Eva Sciacca 2 , L.Angelo - - PowerPoint PPT Presentation


Data Ingestion in CTA Stefano Gallozzi 1 , Eva Sciacca 2 , L.Angelo Antonelli 1,3 , Alessandro Costa 2 1 INAF, Astrophysical Observatory of Rome RIA-653549 2 INAF, Astronomical Observatory of Catania 3 ASDC, ASI-Science Data Center


SLIDE 1

Data Ingestion in CTA

Stefano Gallozzi (1), Eva Sciacca (2), L. Angelo Antonelli (1,3), Alessandro Costa (2)

(1) INAF, Astrophysical Observatory of Rome
(2) INAF, Astronomical Observatory of Catania
(3) ASDC, ASI-Science Data Center

stefano.gallozzi@oa-roma.inaf.it

RIA-653549

SLIDE 2

Cherenkov Telescope Array https://cta-observatory.org

WHAT: CTA is the worldwide project for the future of Very High Energy gamma-ray astronomy.

WHO: the CTA Consortium consists of scientists and engineers from 32 countries on 5 continents and has become a truly global (ESFRI) project.

WHY: one of the major technological challenges is the handling and archiving of the huge amount of data (from 20 to 100 PB/year) coming from the observatory facilities.

~20 telescopes for the North site (Canary Islands)
~100 telescopes for the South site (Chile)

SLIDE 3

Data Life Cycle

SLIDE 4

CTA Data Model

Data Level  Short Name     Description
DL0         DAQ-RAW        Acquired raw data.
DL1         CALIBRATED     Calibrated camera data.
DL2         RECONSTRUCTED  Reconstructed shower parameters (such as energy, direction, particle ID).
DL3         REDUCED        Sets of selected events with associated instrumental response characterizations needed for science analysis.
DL4         SCIENCE        High-level binned data products (such as spectra, sky maps, or light curves).
DL5         OBSERVATORY    Legacy observatory data (such as survey sky maps or source catalog).

SLIDE 5

Data Requirements


Without data compression and assuming 165 operational nights/yr:

ASTRI/Prot. → ~0.8 TB/night → ~0.3 PB/year
Mini-Array → ~3 TB/night → ~6.1 TB/night A.R. → ~1.0 PB/year A.R.
CTA → ~8.5 GB/s → ~40 TB/night → ~4 PB/year → ~20 PB/year A.R.

(A.R. = After Reduction → input+processed data including calibs, intermediate reduction and MC simulation data)

OPTIMISTIC SCENARIO. The pessimistic one can exceed ~100 PB/year! The CTA Archive system must store, manage, preserve and provide easy access to such a huge amount of data for a long time.
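The conversion from a nightly volume to a yearly one can be sketched as follows. This is only an illustration: it assumes decimal units (1 PB = 1000 TB, which the slide does not state) and uses the 165 operational nights quoted above. It reproduces the Mini-Array after-reduction figure; the other rows may fold in factors the slide does not spell out, so the sketch does not attempt to reproduce them.

```python
NIGHTS_PER_YEAR = 165  # operational nights per year, as assumed on the slide
TB_PER_PB = 1000       # decimal units; an assumption, not stated in the source


def yearly_volume_pb(tb_per_night: float, nights: int = NIGHTS_PER_YEAR) -> float:
    """Convert a nightly data volume in TB into a yearly volume in PB."""
    return tb_per_night * nights / TB_PER_PB


# Mini-Array after reduction: ~6.1 TB/night
mini_array_ar = yearly_volume_pb(6.1)
print(f"Mini-Array after reduction: ~{mini_array_ar:.1f} PB/year")
```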

SLIDE 6

Archive Framework

Open Archival Information System (OAIS) standard

  • INGEST unit: a collection of software and/or middleware able to receive bulk data of different types coming from the array and to prepare them for storage, performing basic operations such as data indexing, dependency handling and compression.
  • STORAGE guarantees the efficient retrieval of ingested data and provides simple archive hierarchy management and maintenance. Storage also supervises the status of the media used in the archive, providing a guarantee of error control and data security.
  • ADMINISTRATION unit deals with all the operations related to the CTA archive system and its management. It will ensure archive performance and the fulfillment of standards and requirements by means of dedicated monitoring functionality and recovery from failures.
  • ACCESS unit consists of a collection of software and on-line services that provide efficient access to the data for the other CTA components (e.g. the data processing pipelines). Furthermore, it will let CTA users access CTA data according to their specific data access privileges.
  • DATA TRANSFER unit will guarantee the transfer of data and data products between the on-site and the off-site zones of the archive system.
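As a rough illustration of what the INGEST unit does to one incoming file (indexing and compression), here is a minimal sketch. The function name `ingest`, the record layout and the choice of gzip and SHA-256 are all assumptions made for the example; the slides do not say which tools the CTA archive uses for these steps.

```python
import gzip
import hashlib
import json
import time


def ingest(name: str, payload: bytes, data_level: str) -> tuple[bytes, dict]:
    """Compress one incoming file and build its index record (hypothetical layout)."""
    compressed = gzip.compress(payload)
    record = {
        "name": name,
        "data_level": data_level,                       # e.g. "DL0" from the data model
        "sha256": hashlib.sha256(payload).hexdigest(),  # checksum of the uncompressed data
        "raw_bytes": len(payload),
        "stored_bytes": len(compressed),
        "ingested_at": time.time(),
    }
    return compressed, record


# Demo with a dummy payload standing in for a raw data file.
blob, entry = ingest("run_0001.fits", b"\x00" * 4096, "DL0")
print(json.dumps(entry, indent=2))
```

The checksum is taken before compression so that a later retrieval can verify the decompressed content against the index record.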

SLIDE 7

Architecture

SLIDE 8

Running the Prototype

  • The test infrastructure has been set up using VirtualBox virtual machines and Docker containers.
  • Demo datasets coming from the ASTRI project are uploaded to the CTA OneZone within a space supported by the two providers.
  • The ingested data are enriched with metadata through the Cloud Data Management Interface (CDMI) or, alternatively, the REST API.
  • Metadata queries are performed using the REST API and indexing functions (associated with the space) on predefined extended attributes (metadata).
  • Alternatively, the CouchBase database (embedded in OneData) can be used to query and retrieve the metadata using Elastic Search engines (e.g. N1QL) or common MapReduce functions, through the standard CouchBase console and the client-side SDK. This will give higher-level application frameworks and end-user analysis tools versatile access to the whole CTA dataset.
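The metadata enrichment step can also be scripted. The sketch below only builds the CDMI request and does not send it. The endpoint URL, the token value and the CDMI version string are placeholders chosen for the example, not values taken from the prototype; the header names follow the CDMI convention and the token header used in the REST query on the Metadata slide.

```python
import json
import urllib.request


def build_metadata_request(endpoint: str, token: str, metadata: dict) -> urllib.request.Request:
    """Prepare (but do not send) a CDMI PUT that sets metadata on a data object."""
    body = json.dumps({"metadata": metadata}).encode("utf-8")
    headers = {
        "X-Auth-Token": token,
        "X-CDMI-Specification-Version": "1.1.1",  # assumed version string
        "Content-Type": "application/cdmi-object",
    }
    return urllib.request.Request(endpoint, data=body, headers=headers, method="PUT")


req = build_metadata_request(
    "https://provider.example.org/cdmi/cta-space/run_0001.fits",  # hypothetical endpoint
    "PLACEHOLDER_TOKEN",                                          # hypothetical token
    {"PROGRAM_ID": "001"},
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would perform the call; that step is left out here because it needs a live provider and a valid token.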

SLIDE 9

Astronomical DATA (FITS format?)

data descriptors == metadata

SLIDE 10

Metadata

SLIDE 11

Metadata

Sample indexing function

    function(meta) {
      // Emit PROGRAM_ID as the index key when the attribute is set.
      if (meta['PROGRAM_ID']) {
        return meta['PROGRAM_ID'];
      }
      return null;
    }

Sample ingestion

    curl -k -H "$TOKEN_HEADER" -H "$CDMI_VSN_HEADER" -H 'Content-Type: application/cdmi-object' \
      -d '{"metadata" : {"PROGRAM_ID" : "001"}}' -X PUT "$ENDPOINTDATA"

Query using a REST-API call

    curl -v -k --tlsv1.2 -Ss -H "X-Auth-Token: $TOKEN" \
      -X GET "https://$HOST:8443/api/v3/oneprovider/query-index/$INDEX_ID?key=\"0001\"&stale=false"
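To make the role of the indexing function concrete, the following sketch emulates it locally: each file's metadata is passed through the function, and the returned value becomes the key under which the file is listed. This is a plain in-memory emulation written for illustration, not the OneData implementation; the file identifiers and the `OBSERVER` attribute are made up.

```python
from collections import defaultdict
from typing import Optional


def index_function(meta: dict) -> Optional[str]:
    """Python transcription of the sample indexing function: emit PROGRAM_ID or None."""
    return meta.get("PROGRAM_ID") or None


def build_index(files: dict) -> dict:
    """Group file identifiers by the key emitted by index_function."""
    index = defaultdict(list)
    for file_id, meta in files.items():
        key = index_function(meta)
        if key is not None:  # files without the attribute are left out of the index
            index[key].append(file_id)
    return dict(index)


# Hypothetical files and metadata, for illustration only.
files = {
    "file-a": {"PROGRAM_ID": "001"},
    "file-b": {"PROGRAM_ID": "002"},
    "file-c": {"OBSERVER": "demo"},
    "file-d": {"PROGRAM_ID": "001"},
}
index = build_index(files)
print(index.get("001", []))  # file IDs whose PROGRAM_ID is "001"
```

A query by key is then a dictionary lookup, which mirrors what the `key=` parameter of the REST call selects on the server side.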

SLIDE 12

Distributed Archive Advantages

  • lower costs with respect to a single huge data center
  • easy manageability and maintenance by single-site human resources
  • distributed database of metadata within the architecture
  • easy scalability by adding new nodes to the system
  • disaster recovery at no extra cost, thanks to redundancy across different sites

SLIDE 13

Issues and Future Works

  • Improve metadata queries: make it possible to perform more complex queries.
  • Test OneData roles and data permissions (through connection with an Authentication and Authorization Infrastructure). Data is proprietary for a fixed period, then it becomes public.
  • Test the replication policies between providers.
  • Automatic metadata ingestion (from FITS file headers).
  • Prototype deployment of the CTA Archive at 3 sites (INAF-Catania, INAF-Rome, ASDC) to enable CTA users to test it.
  • Prototype deployment with Data-Grid functionalities for CTA-specific users (simulation & pipelines).
  • A look forward to cloud services, to be ready for the migration of the CTA Workload Management System (DIRAC) from the DataGrid environment to the cloud paradigm.
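For the automatic metadata ingestion item, one possible starting point is to read keyword/value pairs straight from the FITS header, which is a sequence of fixed 80-character cards. The sketch below is a deliberately minimal parser written for illustration; in practice a FITS library such as astropy.io.fits would normally be used. The keyword `PROG_ID` is hypothetical, chosen because standard FITS keywords are limited to eight characters.

```python
CARD_LENGTH = 80  # every FITS header card is exactly 80 characters long


def parse_fits_header(raw: bytes) -> dict:
    """Extract keyword/value pairs (as strings) from a FITS header.

    Minimal sketch: handles cards with a value indicator ('= ') and stops at
    END; it ignores COMMENT, HISTORY and CONTINUE cards and escaped quotes.
    """
    meta = {}
    for start in range(0, len(raw) - CARD_LENGTH + 1, CARD_LENGTH):
        line = raw[start:start + CARD_LENGTH].decode("ascii", errors="replace")
        keyword = line[:8].strip()
        if keyword == "END":
            break
        if line[8:10] != "= ":
            continue  # not a keyword/value card
        field = line[10:].lstrip()
        if field.startswith("'"):
            # Quoted string: take the text between the first pair of quotes.
            value = field[1:].split("'", 1)[0].rstrip()
        else:
            # Number or logical: drop the trailing '/ comment' if present.
            value = field.split("/", 1)[0].strip()
        meta[keyword] = value
    return meta


def make_card(text: str) -> bytes:
    """Pad one header line to the fixed card length."""
    return text.ljust(CARD_LENGTH).encode("ascii")


# Hand-built demo header; a real file would be read from disk instead.
header = (
    make_card("SIMPLE  =                    T / conforms to FITS standard")
    + make_card("PROG_ID = '001     '           / hypothetical keyword")
    + make_card("NAXIS   =                    0")
    + make_card("END")
)
print(parse_fits_header(header))
```

The resulting dictionary could then be sent as the `metadata` payload of an ingestion call, so that the extended attributes no longer need to be typed by hand.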

SLIDE 14

References

  • CTA web page: http://www.cta-observatory.org/
  • ASTRI web page: http://www.brera.inaf.it/astri/
  • YouTube demo: https://youtu.be/TbmJn1bIizE
  • OneData documentation: https://onedata.org/docs/index.html
  • OneData @ Docker Hub: https://hub.docker.com/u/onedata/

SLIDE 15

Ready to share experience!

stefano.gallozzi@oa-roma.inaf.it
eva.sciacca@oact.inaf.it
alessandro.costa@oact.inaf.it
angelo.antonelli@oa-roma.inaf.it

https://www.indigo-datacloud.eu
Better Software for Better Science.

SLIDE 16

Questions?


CTA-North → ← CTA-South