Active Data: A Data-Centric Approach to Data Life-Cycle Management

SLIDE 1

Introduction Active Data Discussion Conclusion

Active Data

A Data-Centric Approach to Data Life-Cycle Management

Anthony Simonet (1), Gilles Fedak (1), Matei Ripeanu (2), Samer Al-Kiswany (2)

(1) Inria, ENS Lyon, University of Lyon
(2) University of British Columbia

November 18th, 2013

  • A. Simonet(Inria)

Active Data (PDSW’13) November 18th, 2013 1/20

SLIDE 2

Outline

◮ Introduction
  ◮ Data Life Cycle Management
  ◮ Use-case
  ◮ Requirements
◮ Active Data
  ◮ Active Data: principles & features
  ◮ Example: Globus Online and iRODS
◮ Discussion
  ◮ Advantages
  ◮ Limitations
◮ Conclusion
  ◮ Related work
  ◮ Conclusion

SLIDE 3

Big Data


◮ Science and industry have become data-intensive
  ◮ The volume of data produced by science and industry grows exponentially
  ◮ How to store this deluge of data?
  ◮ How to extract knowledge and meaning?
  ◮ How to make data valuable?
◮ Some examples
  ◮ CERN’s Large Hadron Collider: 1.5 PB/week
  ◮ Large Synoptic Survey Telescope, Chile: 30 TB/night
  ◮ Billion-edge social network graphs
  ◮ Searching and mining the Web

SLIDE 4

Data Life Cycle

Data Life Cycle

◮ Creation/Acquisition
◮ Transfer
◮ Replication
◮ Disposal/Archiving

Definition

The life cycle is the course of operational stages through which data pass from the time when they enter a system to the time when they leave it.

SLIDE 5

Data Life Cycle Management

Complicated scenarios

◮ Execution of workflows
◮ Complex interactions between software systems
◮ Need to react quickly to operational events

Ad-hoc task-centric approaches

◮ Hard to program, maintain and debug
◮ No formal specification
◮ Complicates interactions between systems

SLIDE 6

Data Life Cycle Use-case

Example: the Advanced Photon Source at Argonne National Lab

◮ 100 TB of raw data per day
◮ Raw data are preprocessed and registered in a Globus dataset catalog
◮ Data are analyzed by various applications
◮ Results are stored in the dataset catalog and shared

Figure: data flow diagram. Components: Instrument (Beamline), Local Storage, Metadata Catalog, Remote Data Center, Academic Cluster. Steps: Extract & Register Metadata, Transfer, Analysis, More analysis, Upload result, Register result metadata.

SLIDE 7

Use-case

Task-centric vs. data-centric

Task-centric:
◮ Independent scripts
◮ Hard to program, maintain, verify
◮ Coarse granularity

Data-centric:
◮ Express data dependencies
◮ Cross data-center coordination
◮ User-level fault tolerance
◮ Incremental processing

SLIDE 8

Requirements

Challenges: a perfect system should...

◮ Simply represent the life cycle of data distributed across different data centers and systems
◮ Simplify DLM modeling and reasoning
◮ Hide the complexity resulting from using different infrastructures and systems
◮ Be easy to integrate with existing systems

SLIDE 9

Active Data principles

System programmers expose their system’s internal data life cycle with a model based on Petri nets. A life cycle model is made of places and transitions.

Figure: Petri net with places Created, Written, Read, and Terminated and transitions t1 to t4; a token (•) is in the place Created.

Each token has a unique identifier, which corresponds to the identifier of the actual data item.
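The model described above can be sketched in a few lines of Java. This is a minimal illustration, not the Active Data API: the class name `LifeCycleSketch`, its methods, and in particular the arcs chosen for t1 to t4 are assumptions, since the slide residue does not show how the transitions connect the places.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal sketch of a life cycle model: places, transitions, and identified tokens. */
class LifeCycleSketch {
    enum Place { CREATED, WRITTEN, READ, TERMINATED }

    /** A transition moves a token from one place to another (these arcs are assumed). */
    enum Transition {
        T1(Place.CREATED, Place.WRITTEN),
        T2(Place.WRITTEN, Place.READ),
        T3(Place.READ, Place.TERMINATED),
        T4(Place.WRITTEN, Place.TERMINATED);

        final Place from;
        final Place to;

        Transition(Place from, Place to) {
            this.from = from;
            this.to = to;
        }
    }

    // Token identifier (the data item's identifier) -> place currently holding the token.
    private final Map<String, Place> tokens = new HashMap<>();

    /** Puts a new token for the data item in the Created place. */
    void create(String dataId) {
        tokens.put(dataId, Place.CREATED);
    }

    Place placeOf(String dataId) {
        return tokens.get(dataId);
    }

    /** Fires a transition for one data item; fails if the transition is not enabled. */
    void fire(String dataId, Transition t) {
        Place current = tokens.get(dataId);
        if (current != t.from) {
            throw new IllegalStateException(t + " is not enabled for " + dataId + " in " + current);
        }
        tokens.put(dataId, t.to);
    }

    public static void main(String[] args) {
        LifeCycleSketch model = new LifeCycleSketch();
        model.create("data-42");
        model.fire("data-42", Transition.T1);
        System.out.println(model.placeOf("data-42")); // prints WRITTEN
    }
}
```

Keeping the token map keyed by the data item's own identifier mirrors the statement that each token carries the identifier of the data it represents.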

SLIDE 10

Active Data principles

Figure: the life cycle model (places Created, Written, Read, Terminated; transitions t1 to t4), with the token (•) now in the place Written.

A transition is fired whenever a data state changes.

SLIDE 11

Active Data principles

Figure: the life cycle model (places Created, Written, Read, Terminated; transitions t1 to t4), with the token (•) in the place Written and the following handler attached to a transition:

    public void handler() { computeMD5(); }

Clients may plug code into transitions. The code is executed whenever the transition is fired.
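A minimal sketch of how a client might plug a checksum handler into a transition. The names `HandlerSketch`, `TransitionHandler`, `plug`, and `fire` are hypothetical and do not come from the Active Data prototype; the in-memory content map stands in for real data. Only `java.security.MessageDigest` is a real library call.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: handlers registered per transition and run when the transition fires. */
class HandlerSketch {
    /** Hypothetical callback type: receives the identifier of the data item. */
    interface TransitionHandler {
        void handle(String dataId);
    }

    private final Map<String, List<TransitionHandler>> handlers = new HashMap<>();

    /** Attaches a handler to a transition, identified here by its name. */
    void plug(String transition, TransitionHandler handler) {
        handlers.computeIfAbsent(transition, k -> new ArrayList<>()).add(handler);
    }

    /** Runs every handler plugged into the fired transition. */
    void fire(String transition, String dataId) {
        List<TransitionHandler> plugged =
                handlers.getOrDefault(transition, Collections.<TransitionHandler>emptyList());
        for (TransitionHandler h : plugged) {
            h.handle(dataId);
        }
    }

    /** Computes the MD5 digest of the content as a lowercase hex string. */
    static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(String[] args) {
        HandlerSketch service = new HandlerSketch();
        // Hypothetical in-memory content store standing in for the real data.
        Map<String, byte[]> contents = new HashMap<>();
        contents.put("data-42", "abc".getBytes(StandardCharsets.UTF_8));

        service.plug("t1", dataId ->
                System.out.println(dataId + " md5=" + md5Hex(contents.get(dataId))));
        service.fire("t1", "data-42"); // prints data-42 md5=900150983cd24fb0d6963f7d28e17f72
    }
}
```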

SLIDE 12

Active Data features

The Active Data programming model and runtime environment:

◮ Lets clients react to life cycle progression
◮ Transparently exposes distributed data sets
◮ Can be integrated with existing systems
◮ Has scalable performance and minimal overhead over existing systems

SLIDE 13

Implementation

◮ Prototype implemented in Java (≃ 2,800 LOC)
◮ Client/Service communication is Publish/Subscribe
◮ 2 types of subscription:
  ◮ Every transition for a given data item
  ◮ Every data item for a given transition
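The two subscription types can be illustrated with a small in-process sketch. It assumes a single service object and synchronous delivery; the names `SubscriptionSketch`, `Listener`, `subscribeToDataItem`, and `subscribeToTransition` are hypothetical and are not taken from the prototype.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the two subscription types; all names here are hypothetical. */
class SubscriptionSketch {
    /** Callback invoked with the data item identifier and the transition name. */
    interface Listener {
        void onTransition(String dataId, String transition);
    }

    private final Map<String, List<Listener>> byDataItem = new HashMap<>();
    private final Map<String, List<Listener>> byTransition = new HashMap<>();

    /** Type 1: every transition for a given data item. */
    void subscribeToDataItem(String dataId, Listener listener) {
        byDataItem.computeIfAbsent(dataId, k -> new ArrayList<>()).add(listener);
    }

    /** Type 2: every data item for a given transition. */
    void subscribeToTransition(String transition, Listener listener) {
        byTransition.computeIfAbsent(transition, k -> new ArrayList<>()).add(listener);
    }

    /** Publishes one transition and notifies both kinds of subscribers. */
    void publish(String dataId, String transition) {
        for (Listener l : byDataItem.getOrDefault(dataId, Collections.<Listener>emptyList())) {
            l.onTransition(dataId, transition);
        }
        for (Listener l : byTransition.getOrDefault(transition, Collections.<Listener>emptyList())) {
            l.onTransition(dataId, transition);
        }
    }

    public static void main(String[] args) {
        SubscriptionSketch service = new SubscriptionSketch();
        service.subscribeToDataItem("data-1",
                (id, t) -> System.out.println("item watcher: " + id + " " + t));
        service.subscribeToTransition("t2",
                (id, t) -> System.out.println("t2 watcher: " + id + " " + t));
        service.publish("data-1", "t1"); // reaches the item watcher only
        service.publish("data-2", "t2"); // reaches the t2 watcher only
    }
}
```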

Figure: clients subscribe to the Active Data Service.

SLIDE 14

Implementation

◮ Several ways to publish transitions
  ◮ Instrument the code
  ◮ Read the logs
  ◮ Rely on an existing notification system
◮ The service orders transitions by time of arrival
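One simple way a single service could order transitions by time of arrival is to stamp each published transition with a sequence number as it is received. The sketch below assumes exactly that; the slides do not describe the prototype's actual mechanism, and `ArrivalOrderSketch` and `Event` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: the service stamps each transition with a sequence number on arrival. */
class ArrivalOrderSketch {
    /** One published transition, with the order in which the service received it. */
    static final class Event {
        final long sequence;
        final String dataId;
        final String transition;

        Event(long sequence, String dataId, String transition) {
            this.sequence = sequence;
            this.dataId = dataId;
            this.transition = transition;
        }
    }

    private long nextSequence = 0;
    private final List<Event> received = new ArrayList<>();

    /** Called by publishers; synchronized so concurrent arrivals get distinct, ordered numbers. */
    synchronized Event publish(String dataId, String transition) {
        Event event = new Event(nextSequence++, dataId, transition);
        received.add(event);
        return event;
    }

    /** Returns a copy of the transitions in arrival order. */
    synchronized List<Event> inArrivalOrder() {
        return new ArrayList<>(received);
    }

    public static void main(String[] args) {
        ArrivalOrderSketch service = new ArrivalOrderSketch();
        service.publish("data-1", "t1");
        service.publish("data-2", "t1");
        for (Event e : service.inArrivalOrder()) {
            System.out.println(e.sequence + " " + e.dataId + " " + e.transition);
        }
    }
}
```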

Figure: clients publish transitions to the Active Data Service, and clients subscribe to it.
SLIDE 15

Implementation

◮ Clients run transition handler code locally
◮ Transition handlers are executed
  ◮ Serially
  ◮ In a blocking way
  ◮ In the order transitions were published
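The execution rules above (serial, blocking, in publication order) match the behaviour of a single worker thread draining a queue. The sketch below uses the standard `Executors.newSingleThreadExecutor()`, which runs tasks one at a time in submission order; whether the prototype uses an executor is an assumption, and `SerialHandlerSketch` is a hypothetical name.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Sketch: a client that runs handlers one at a time, in the order notifications arrive. */
class SerialHandlerSketch {
    // A single worker thread: a running handler blocks all handlers queued after it.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Called for each notification, in the order the service delivers them. */
    void onNotification(Runnable handler) {
        worker.execute(handler);
    }

    /** Stops accepting notifications and waits for queued handlers to finish. */
    void shutdownAndWait() {
        worker.shutdown();
        try {
            if (!worker.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("handlers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for handlers", e);
        }
    }

    public static void main(String[] args) {
        SerialHandlerSketch client = new SerialHandlerSketch();
        List<String> executed = new CopyOnWriteArrayList<>();
        client.onNotification(() -> executed.add("t1"));
        client.onNotification(() -> executed.add("t2"));
        client.onNotification(() -> executed.add("t3"));
        client.shutdownAndWait();
        System.out.println(executed); // prints [t1, t2, t3]
    }
}
```

A consequence of this design is that a slow handler delays every later handler on the same client, which is the price of preserving order.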

Figure: when a client publishes a transition, the Active Data Service notifies the subscribed clients.
SLIDE 16

Performance evaluation: Throughput

Figure: Average number of transitions per second handled by the Active Data Service (x-axis: number of clients, 10 to 550; y-axis: transitions per second, 5,000 to 35,000).

Clients publish 10,000 transitions in a row without pausing.

SLIDE 17

Performance evaluation: Throughput

Figure: Average number of transitions per second handled by the Active Data Service (x-axis: number of clients, 10 to 550; y-axis: transitions per second, 5,000 to 35,000).

The prototype scales up to 30,000 transitions per second.

SLIDE 18

Example: Data Provenance

Definition

The complete history of data life cycle derivations and operations.

◮ Assess the quality of data
◮ Keep track of the origin of data over time
◮ Specialized Provenance Aware Storage Systems

SLIDE 19

Example: Data Provenance

→ What about heterogeneous systems?

SLIDE 20

Example: Data Provenance

→ What about heterogeneous systems? Example with Globus Online and iRODS:

◮ Globus Online: file transfer service
◮ iRODS: data store and metadata catalog

SLIDE 21

Example: Globus Online and iRODS

Data events coming from Globus Online and iRODS

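Figure: two life cycle models. iRODS: places Created, Get, Put, Terminated; transitions t5 to t10. Globus Online: places Created, Succeeded, Failed, Terminated; transitions t1 to t4. A token (•) is in the Globus Online place Created, with Id: {GO: 7b9e02c4-925d-11e2}.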

SLIDE 22

Example: Globus Online and iRODS

Data events coming from Globus Online and iRODS

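Figure: the iRODS life cycle model (places Created, Get, Put, Terminated; transitions t5 to t10) and the Globus Online life cycle model (places Created, Succeeded, Failed, Terminated; transitions t1 to t4). The token (•) is now in the Globus Online place Succeeded, and the following handler is attached:

    public void handler() { iput(...); }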

SLIDE 23

Example: Globus Online and iRODS

Data events coming from Globus Online and iRODS

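Figure: the iRODS life cycle model (places Created, Get, Put, Terminated; transitions t5 to t10) and the Globus Online life cycle model (places Created, Succeeded, Failed, Terminated; transitions t1 to t4). A token (•) is now in the iRODS place Created, with Id: {GO: 7b9e02c4-925d-11e2, iRODS: 10032}.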

SLIDE 24

Example: Globus Online and iRODS

Data events coming from Globus Online and iRODS

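Figure: the iRODS life cycle model (places Created, Get, Put, Terminated; transitions t5 to t10) and the Globus Online life cycle model (places Created, Succeeded, Failed, Terminated; transitions t1 to t4). The token (•) is now in the iRODS place Put, and the following handler is attached:

    public void handler() { annotate(); }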
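The identifier shown on these slides grows from {GO: ...} to {GO: ..., iRODS: 10032} as the data item moves from one system to the other. A minimal sketch of that idea, assuming the identifier is a map from system name to system-local identifier: `CompositeIdSketch`, `DataItem`, and `Uploader` are hypothetical names, and the `Uploader` lambda is a stand-in for the real upload, not an iRODS API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: one data item known under a different identifier in each system. */
class CompositeIdSketch {
    /** Holds one identifier per system, in the order the systems learned about the item. */
    static final class DataItem {
        final Map<String, String> ids = new LinkedHashMap<>();

        @Override
        public String toString() {
            return "Id: " + ids;
        }
    }

    /** Hypothetical stand-in for an upload to iRODS; returns the new object's identifier. */
    interface Uploader {
        String put(DataItem item);
    }

    /** Handler for a successful transfer: upload the file, then extend the identifier. */
    static void onTransferSucceeded(DataItem item, Uploader irods) {
        String irodsId = irods.put(item);
        item.ids.put("iRODS", irodsId);
    }

    public static void main(String[] args) {
        DataItem item = new DataItem();
        item.ids.put("GO", "7b9e02c4-925d-11e2");

        // The lambda fakes the upload and returns the identifier shown on the slide.
        onTransferSucceeded(item, uploaded -> "10032");
        System.out.println(item); // prints Id: {GO=7b9e02c4-925d-11e2, iRODS=10032}
    }
}
```

Because both identifiers live on the same item, a later handler on the iRODS side can look up the Globus Online task that produced the file.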

SLIDE 25

Example: Globus Online and iRODS

    $ imeta ls -d test/out_test_4628
    AVUs defined for dataObj test/out_test_4628:
    attribute: GO_FAULTS
    value: 0
    attribute: GO_COMPLETION_TIME
    value: 2013-03-21 19:28:41Z
    attribute: GO_REQUEST_TIME
    value: 2013-03-21 19:28:17Z
    attribute: GO_TASK_ID
    value: 7b9e02c4-925d-11e2-97ce-123139404f2e
    attribute: GO_SOURCE
    value: go#ep1/~/test
    attribute: GO_DESTINATION
    value: asimonet#fraise/~/out_test_4628
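Metadata like the listing above can be attached to an iRODS data object with `imeta add -d <object> <attribute> <value>`. The sketch below only builds such command lines as strings; it assumes the `annotate()` handler works this way, which the slides do not state, and `AnnotateSketch` and `imetaCommands` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: turn transfer details into `imeta add -d` command lines (built, not executed). */
class AnnotateSketch {
    /** Builds one command per attribute; values are quoted because they may contain spaces. */
    static List<String> imetaCommands(String dataObject, Map<String, String> attributes) {
        List<String> commands = new ArrayList<>();
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            commands.add("imeta add -d " + dataObject + " "
                    + entry.getKey() + " \"" + entry.getValue() + "\"");
        }
        return commands;
    }

    public static void main(String[] args) {
        // Attribute names and values copied from the listing on the slide.
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("GO_FAULTS", "0");
        attributes.put("GO_COMPLETION_TIME", "2013-03-21 19:28:41Z");
        attributes.put("GO_TASK_ID", "7b9e02c4-925d-11e2-97ce-123139404f2e");

        for (String command : imetaCommands("test/out_test_4628", attributes)) {
            System.out.println(command);
        }
    }
}
```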

SLIDE 26

Advantages

◮ Simple and graphical way to program DLM operations
◮ Allows formal verification of some properties of data life cycles
◮ Easy coordination between systems
◮ Easy to scale
◮ Easy to debug
◮ Easy fault tolerance
◮ Fine-grained interaction with the data life cycle

SLIDE 27

Limitations

◮ Reasoning in terms of life cycle events can be complex
◮ Lack of a standard

SLIDE 28

Related work

Data-centric parallel processing

◮ Programming models:
  ◮ MapReduce and higher level abstractions: PigLatin, Twister
  ◮ Incremental systems: MapReduce-Online, Percolator, Chimera, Nephele
  ◮ Other models with implicit parallelism: Swift, Dryad, Allpairs
◮ Storage systems
  ◮ BitDew
  ◮ MosaStore
  ◮ Provenance Aware Storage Systems
  ◮ Active Storage

SLIDE 29

Conclusion

Active Data is...

◮ Data-centric & Event-driven
◮ System-level data integration

What’s next?

◮ Advanced representation of operations that consume and produce data: represent data derivation
◮ Data collection abilities
◮ Distributed implementation of the Publish/Subscribe layer

SLIDE 30

Thank you!

Questions?

Inria booth #2116
