Active Data: A Data-Centric Approach to Data Life-Cycle Management



  1. Introduction | Active Data | Discussion | Conclusion
     Active Data: A Data-Centric Approach to Data Life-Cycle Management
     Anthony Simonet (1), Gilles Fedak (1), Matei Ripeanu (2), Samer Al-Kiswany (2)
     (1) Inria, ENS Lyon, University of Lyon; (2) University of British Columbia
     November 18th, 2013
     A. Simonet (Inria), Active Data (PDSW'13), November 18th, 2013, slide 1/20

  2. Outline
     Introduction: Data Life Cycle Management, Use-case, Requirements
     Active Data: Active Data principles & features, Example: Globus Online and iRODS
     Discussion: Advantages, Limitations
     Conclusion: Related work, Conclusion

  3. Big Data
     ◮ Science and industry have become data-intensive
     ◮ The volume of data produced by science and industry grows exponentially
     ◮ How to store this deluge of data?
     ◮ How to extract knowledge and sense?
     ◮ How to make data valuable?
     ◮ Some examples
       ◮ CERN's Large Hadron Collider: 1.5 PB/week
       ◮ Large Synoptic Survey Telescope, Chile: 30 TB/night
       ◮ Billion-edge social network graphs
       ◮ Searching and mining the Web

  4. Data Life Cycle
     ◮ Creation/Acquisition
     ◮ Transfer
     ◮ Replication
     ◮ Disposal/Archiving
     Definition: The life cycle is the course of operational stages through which data pass from the time when they enter a system to the time when they leave it.

  5. Data Life Cycle Management
     Complicated scenarios
     ◮ Execution of workflows
     ◮ Complex interactions between software systems
     ◮ Need to react quickly to operational events
     Ad-hoc task-centric approaches
     ◮ Hard to program, maintain and debug
     ◮ No formal specification
     ◮ Complicate interactions between systems

  6. Data Life Cycle Use-case
     Example: the Advanced Photon Source at Argonne National Lab
     ◮ 100 TB of raw data per day
     ◮ Raw data are preprocessed and registered in a Globus dataset catalog
     ◮ Data are analyzed by various applications
     ◮ Results are stored in the dataset catalog and shared
     [Figure: data flow between the Instrument (Beamline), Local Storage, the Metadata Catalog, the Academic Cluster and the Remote Data Center. Edge labels: Transfer, Extract & Register Metadata, Analysis, Register result metadata, Upload result, More analysis]

  7. Use-case: Task Centric vs Data Centric
     Task Centric
     ◮ Independent scripts
     ◮ Hard to program, maintain, verify
     ◮ Coarse granularity
     Data Centric
     ◮ Express data dependencies
     ◮ Cross data-center coordination
     ◮ User-level fault-tolerance
     ◮ Incremental processing

  8. Requirements
     Challenges: a perfect system should...
     ◮ Simply represent the life cycle of data distributed across different data centers and systems
     ◮ Simplify data life cycle management (DLM) modeling and reasoning
     ◮ Hide the complexity resulting from using different infrastructures and systems
     ◮ Be easy to integrate with existing systems

  9. Active Data principles
     System programmers expose their system's internal data life cycle with a model based on Petri nets. A life cycle model is made of places and transitions.
     [Figure: Petri net with places Created, Written, Read and Terminated, transitions t1 to t4, and a token (•)]
     Each token has a unique identifier, corresponding to that of the actual data item.

  10. Active Data principles (continued)
     A transition is fired whenever a data state changes.
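The model described on these two slides (places, transitions, and one uniquely identified token per data item) could be sketched roughly as follows. This is only an illustration: every class and method name is hypothetical, and the wiring of t1 to t4 between places is an assumption, since the figure itself does not survive in the text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from the Active Data code base.
final class LifeCycleSketch {
    enum Place { CREATED, WRITTEN, READ, TERMINATED }

    // A transition moves a token from one place to another.
    static final class Transition {
        final String name;
        final Place from;
        final Place to;

        Transition(String name, Place from, Place to) {
            this.name = name;
            this.from = from;
            this.to = to;
        }
    }

    // Assumed wiring of t1..t4 (not recoverable from the slide text).
    static final Transition T1 = new Transition("t1", Place.CREATED, Place.WRITTEN);
    static final Transition T2 = new Transition("t2", Place.WRITTEN, Place.READ);
    static final Transition T3 = new Transition("t3", Place.READ, Place.READ);
    static final Transition T4 = new Transition("t4", Place.READ, Place.TERMINATED);

    // Token positions, keyed by the data item's unique identifier.
    private final Map<String, Place> tokens = new HashMap<>();

    // A new data item enters the system: its token starts in CREATED.
    void create(String dataId) {
        if (tokens.putIfAbsent(dataId, Place.CREATED) != null) {
            throw new IllegalStateException("duplicate id: " + dataId);
        }
    }

    // Fire a transition for one data item; reject it if the token is not in the source place.
    void fire(String dataId, Transition t) {
        Place current = tokens.get(dataId);
        if (current != t.from) {
            throw new IllegalStateException(t.name + " is not enabled for " + dataId);
        }
        tokens.put(dataId, t.to);
    }

    Place placeOf(String dataId) {
        return tokens.get(dataId);
    }

    public static void main(String[] args) {
        LifeCycleSketch net = new LifeCycleSketch();
        net.create("file-42");
        net.fire("file-42", T1);
        net.fire("file-42", T2);
        System.out.println(net.placeOf("file-42"));
    }
}
```

Keying tokens by the data item's identifier mirrors the statement that each token's identifier corresponds to the data item's own.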

  11. Active Data principles (continued)
     Clients may plug code into transitions. It is executed whenever the transition is fired.
     public void handler() {
         computeMD5();
     }
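A minimal sketch of plugging a handler into a transition, in the spirit of the slide's computeMD5() example. The registry, the handler interface and the in-memory content store are all hypothetical stand-ins; only java.security.MessageDigest is a real library call.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the Active Data API.
final class HandlerSketch {
    // Callback receiving the identifier of the data item whose token moved.
    interface TransitionHandler {
        void handle(String dataId);
    }

    private final Map<String, List<TransitionHandler>> handlers = new HashMap<>();

    // Attach client code to a named transition.
    void plug(String transition, TransitionHandler handler) {
        handlers.computeIfAbsent(transition, k -> new ArrayList<>()).add(handler);
    }

    // Firing a transition runs every handler plugged into it.
    void fire(String transition, String dataId) {
        List<TransitionHandler> plugged = handlers.getOrDefault(transition, Collections.emptyList());
        for (TransitionHandler h : plugged) {
            h.handle(dataId);
        }
    }

    // Hex-encoded MD5 digest of a byte array.
    static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for the stored data items.
        Map<String, byte[]> store = new HashMap<>();
        store.put("file-42", "hello".getBytes(StandardCharsets.UTF_8));
        Map<String, String> checksums = new HashMap<>();

        HandlerSketch net = new HandlerSketch();
        net.plug("t3", id -> checksums.put(id, md5Hex(store.get(id))));
        net.fire("t3", "file-42");
        System.out.println(checksums);
    }
}
```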

  12. Active Data features
     The Active Data programming model and runtime environment:
     ◮ Allows clients to react to life cycle progression
     ◮ Transparently exposes distributed data sets
     ◮ Can be integrated with existing systems
     ◮ Has scalable performance and minimal overhead over existing systems

  13. Implementation
     ◮ Prototype implemented in Java (≃ 2,800 LOC)
     ◮ Client/Service communication is Publish/Subscribe
     ◮ 2 types of subscription:
       ◮ Every transition for a given data item
       ◮ Every data item for a given transition
     [Figure: clients subscribe to the Active Data Service]
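The two subscription types listed above could look roughly like this. It is an in-process sketch only: the real prototype communicates between clients and a service, and all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-process stand-in for the publish/subscribe service.
final class SubscriptionSketch {
    interface Listener {
        void onTransition(String dataId, String transition);
    }

    private final Map<String, List<Listener>> byItem = new HashMap<>();
    private final Map<String, List<Listener>> byTransition = new HashMap<>();

    // Type 1: every transition for a given data item.
    void subscribeToItem(String dataId, Listener listener) {
        byItem.computeIfAbsent(dataId, k -> new ArrayList<>()).add(listener);
    }

    // Type 2: every data item for a given transition.
    void subscribeToTransition(String transition, Listener listener) {
        byTransition.computeIfAbsent(transition, k -> new ArrayList<>()).add(listener);
    }

    // Deliver one published transition to both kinds of subscribers.
    void publish(String dataId, String transition) {
        for (Listener l : byItem.getOrDefault(dataId, Collections.emptyList())) {
            l.onTransition(dataId, transition);
        }
        for (Listener l : byTransition.getOrDefault(transition, Collections.emptyList())) {
            l.onTransition(dataId, transition);
        }
    }

    public static void main(String[] args) {
        SubscriptionSketch service = new SubscriptionSketch();
        service.subscribeToItem("d1", (id, t) -> System.out.println("item listener: " + id + " " + t));
        service.subscribeToTransition("t2", (id, t) -> System.out.println("transition listener: " + id + " " + t));
        service.publish("d1", "t1");
        service.publish("d2", "t2");
    }
}
```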

  14. Implementation (continued)
     ◮ Several ways to publish transitions
       ◮ Instrument the code
       ◮ Read the logs
       ◮ Rely on an existing notification system
     ◮ The service orders transitions by time of arrival
     [Figure: clients publish transitions to the Active Data Service and subscribe to it]
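As an illustration of two of the points above, the sketch below publishes transitions by reading log lines and stamps each one with an arrival sequence number. The log format ("dataId transition", whitespace-separated) is invented for the example, and the class is a hypothetical stand-in for the service, not its actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of arrival ordering; the log format is an assumption.
final class ArrivalOrderSketch {
    static final class Event {
        final long seq;
        final String dataId;
        final String transition;

        Event(long seq, String dataId, String transition) {
            this.seq = seq;
            this.dataId = dataId;
            this.transition = transition;
        }
    }

    private long nextSeq = 0;
    private final List<Event> arrivals = new ArrayList<>();

    // synchronized: numbering and appending happen as one step, so the
    // sequence numbers reflect the order in which publications arrived.
    synchronized Event publish(String dataId, String transition) {
        Event e = new Event(nextSeq++, dataId, transition);
        arrivals.add(e);
        return e;
    }

    // Turn one log line of the assumed form "<dataId> <transition>" into a publication.
    Optional<Event> publishFromLogLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2) {
            return Optional.empty(); // not a line this sketch understands
        }
        return Optional.of(publish(parts[0], parts[1]));
    }

    synchronized List<Event> arrivals() {
        return new ArrayList<>(arrivals);
    }

    public static void main(String[] args) {
        ArrivalOrderSketch service = new ArrivalOrderSketch();
        service.publishFromLogLine("file-42 t1");
        service.publishFromLogLine("file-7 t1");
        service.publishFromLogLine("file-42 t2");
        for (Event e : service.arrivals()) {
            System.out.println(e.seq + " " + e.dataId + " " + e.transition);
        }
    }
}
```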

  15. Implementation (continued)
     ◮ Clients run transition handler code locally
     ◮ Transition handlers are executed
       ◮ Serially
       ◮ In a blocking way
       ◮ In the order transitions were published
     [Figure: clients publish transitions to the Active Data Service, which notifies subscribed clients]
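One way to obtain serial, ordered handler execution on the client side is a single worker thread fed in notification order. The sketch below uses a standard single-thread executor for that purpose; whether the prototype does it this way is not stated in the slides, and the class and method names are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical client-side sketch; not the prototype's actual mechanism.
final class SerialHandlersSketch {
    // A single-thread executor runs tasks one at a time, in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Called once per notification, in the order notifications arrive.
    void onNotify(Runnable handler) {
        worker.execute(handler);
    }

    // Stop accepting handlers and wait for the queued ones to finish.
    boolean shutdownAndWait(long seconds) {
        worker.shutdown();
        try {
            return worker.awaitTermination(seconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        SerialHandlersSketch client = new SerialHandlersSketch();
        for (int i = 0; i < 3; i++) {
            final int n = i;
            client.onNotify(() -> System.out.println("handler " + n));
        }
        client.shutdownAndWait(5);
    }
}
```

Because each handler blocks the worker until it returns, a slow handler delays all later ones, which is the trade-off implied by blocking, in-order execution.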

  16. Performance evaluation: Throughput
     [Figure: Average number of transitions per second handled by the Active Data Service. y-axis: transitions per second (5,000 to 35,000); x-axis: number of clients (10 to 550)]
     Clients publish 10,000 transitions in a row without pausing.

  17. Performance evaluation: Throughput (continued)
     The prototype scales up to 30,000 transitions per second.

  18. Example: Data Provenance
     Definition: The complete history of data life cycle derivations and operations.
     ◮ Assess the quality of data
     ◮ Keep track of the origin of data over time
     ◮ Specialized Provenance Aware Storage Systems

  19. Example: Data Provenance (continued)
     → What about heterogeneous systems?

  20. Example: Data Provenance (continued)
     Example with Globus Online and iRODS
     ◮ Globus Online: file transfer service
     ◮ iRODS: data store and metadata catalog

  21. Example: Globus Online and iRODS
     Data events coming from Globus Online and iRODS
     [Figure: Petri net combining the iRODS and Globus Online life cycles, with places Created, Put, Get, Failed, Succeeded and Terminated and transitions t1 to t10. A token (•) carries Id: { GO: 7b9e02c4-925d-11e2 }]

  22. Example: Globus Online and iRODS (continued)
     A handler plugged into a transition of the combined life cycle:
     public void handler() {
         iput(...);
     }
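A rough sketch of how such a handler might tie the two systems together, assuming it is attached to the transition marking a successful Globus Online transfer. The Uploader interface is a hypothetical stand-in for the slide's iput(...) call, whose arguments are elided there. No real Globus Online or iRODS API is used, and the iRODS identifier in the example is made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; no real Globus Online or iRODS API appears here.
final class CrossSystemSketch {
    // One token may carry an identifier per system, as in Id: { GO: ... }.
    static final class Token {
        final Map<String, String> ids = new LinkedHashMap<>();
    }

    // Stand-in for the slide's iput(...): stores a file, returns the store's identifier.
    interface Uploader {
        String put(String localPath);
    }

    private final Uploader irods;
    final List<String> provenance = new ArrayList<>();

    CrossSystemSketch(Uploader irods) {
        this.irods = irods;
    }

    // Handler assumed to run when the transfer's success transition fires.
    void onTransferSucceeded(Token token, String localPath) {
        String irodsId = irods.put(localPath);
        token.ids.put("iRODS", irodsId);
        // Record the link between the transfer and the stored object.
        provenance.add("GO:" + token.ids.get("GO") + " -> iRODS:" + irodsId);
    }

    public static void main(String[] args) {
        Token token = new Token();
        token.ids.put("GO", "7b9e02c4-925d-11e2");
        // Fake uploader that invents an identifier for illustration only.
        CrossSystemSketch sketch = new CrossSystemSketch(path -> "obj-for-" + path);
        sketch.onTransferSucceeded(token, "result.dat");
        System.out.println(sketch.provenance);
    }
}
```

Keeping both identifiers on one token is what would let a provenance record span heterogeneous systems, which is the question raised on the Data Provenance slides.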
