

SLIDE 1

Introduction PastryGrid Fault Tolerance in PastryGrid Conclusion

Fault-Tolerance for PastryGrid Middleware

Christophe Cérin1, Heithem Abbes1,2, Mohamed Jemni2, Yazid Missaoui2

1LIPN, Université de Paris XIII, CNRS UMR 7030, France

2UTIC, ESSTT, Université de Tunis, Tunisia

HPGC’10 - IPDPS

SLIDE 2

Outline

1. Introduction
2. PastryGrid
3. Fault Tolerance in PastryGrid
4. Conclusion

SLIDE 3

Desktop Grid Architectures Desktop Grid

[Figure: centralized (monolithic) desktop grid architecture — user/admin interface, firewall/NAT, application scheduler, task/data/net/FS/sandbox protocols, and a central coordinator performing resource match-making]

Key Points Federation of thousands of nodes; Internet as the communication layer: no trust! Volatility; local IPs; firewalls

SLIDE 4

Desktop Grid Architectures Desktop Grid

[Figure: modular desktop grid architecture — the same components (user/admin interface, application scheduler, task/data/net/FS/sandbox protocols, firewall/NAT) built around a configurable Scheduler (basis) and a Data Manager]

Future Generation (in 2006) Distributed architecture; modular architecture: every component is “configurable” (scheduler, storage, transport protocol); direct communications between peers; security; applications from all sciences (e-Science applications)

SLIDE 5

In search of a distributed architecture PastryGrid An approach based on a structured overlay network to discover (on the fly) the next node to execute the next task

SLIDE 6

In search of a distributed architecture PastryGrid An approach based on a structured overlay network to discover (on the fly) the next node to execute the next task Decentralizes the execution of a distributed application with precedences between tasks

SLIDE 7

PastryGrid’s overview Main objectives Fully distributed execution of task graph;

SLIDE 8

PastryGrid’s overview Main objectives Fully distributed execution of task graph; Distributed resource management;

SLIDE 9

PastryGrid’s overview Main objectives Fully distributed execution of task graph; Distributed resource management; Distributed coordination;

SLIDE 10

PastryGrid’s overview Main objectives Fully distributed execution of task graph; Distributed resource management; Distributed coordination; Dynamic creation of an execution environment;

SLIDE 11

PastryGrid’s overview Main objectives Fully distributed execution of task graph; Distributed resource management; Distributed coordination; Dynamic creation of an execution environment; No central element;


SLIDE 13

PastryGrid’s Terminology Task terminology Friend tasks: T2, T3 share the same successor (T6)

SLIDE 14

PastryGrid’s Terminology Task terminology Friend tasks T2, T3: share the same successor (T6) Shared task T6: has n > 1 ancestors (T2, T3)

SLIDE 15

PastryGrid’s Terminology Task terminology Friend tasks T2, T3: share the same successor (T6) Shared task T6: has n > 1 ancestors (T2, T3) Isolated tasks T4, T5: have a single ancestor

SLIDE 16

PastryGrid’s Terminology Task terminology Friend tasks T2, T3: share the same successor (T6) Shared task T6: has n > 1 ancestors (T2, T3) Isolated tasks T4, T5: have a single ancestor Example
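The task terminology above can be sketched as a small classifier over the precedence graph. The edge set below is a hypothetical reconstruction of the slide's example DAG (the figure itself is not in the text, so the edges are assumptions):

```python
from collections import defaultdict

def classify(edges):
    """Classify tasks of a DAG given (ancestor, successor) edges."""
    ancestors = defaultdict(set)
    for a, s in edges:
        ancestors[s].add(a)
    # shared task: more than one ancestor
    shared = {t for t, anc in ancestors.items() if len(anc) > 1}
    # friend tasks: tasks that share a successor, i.e. the ancestors
    # of any shared task
    friends = set()
    for t in shared:
        friends |= ancestors[t]
    # isolated tasks: a single ancestor and no shared successor
    isolated = {t for t, anc in ancestors.items() if len(anc) == 1} - friends
    return friends, shared, isolated

# assumed edge set reconstructing the slide's example
edges = [("T1", "T2"), ("T1", "T3"), ("T2", "T4"), ("T2", "T6"),
         ("T3", "T5"), ("T3", "T6")]
friends, shared, isolated = classify(edges)
```

On this assumed DAG, T2 and T3 come out as friends (they share T6), T6 as shared, and T4, T5 as isolated, matching the slide's example.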

SLIDE 17

PastryGrid components Addressing scheme to identify applications and users (based on hashing application name + submission date + user name — DHT (Pastry))

SLIDE 18

PastryGrid components Addressing scheme to identify applications and users (based on hashing application name + submission date + user name — DHT (Pastry)) Protocol of resource discovery; no dedicated nodes for the search of the next node to use → on the fly! Optimization: the machine that terminates last starts the search.

SLIDE 19

PastryGrid components Addressing scheme to identify applications and users (based on hashing application name + submission date + user name — DHT (Pastry)) Protocol of resource discovery; no dedicated nodes for the search of the next node to use → on the fly! Optimization: the machine that terminates last starts the search. Rendez-vous concept (RDV); objectives: localisation of a node without its IP; task coordination; data recovery;

SLIDE 20

PastryGrid components Addressing scheme to identify applications and users (based on hashing application name + submission date + user name — DHT (Pastry)) Protocol of resource discovery; no dedicated nodes for the search of the next node to use → on the fly! Optimization: the machine that terminates last starts the search. Rendez-vous concept (RDV); objectives: localisation of a node without its IP; task coordination; data recovery; coordination protocol between machines participating in the application.


SLIDE 22

RDV Concept Coordinator Known at the beginning;

SLIDE 23

RDV Concept Coordinator Known at the beginning; Central element at a dedicated place;

SLIDE 24

RDV Concept Coordinator Known at the beginning; Central element at a dedicated place; Failure: the system crashes;

SLIDE 25

RDV Concept Coordinator Known at the beginning; Central element at a dedicated place; Failure: the system crashes; Centralized resource management;

SLIDE 26

RDV Concept Coordinator Known at the beginning; Central element at a dedicated place; Failure: the system crashes; Centralized resource management; Management of all applications (overload)

SLIDE 27

RDV Concept Coordinator Known at the beginning; Central element at a dedicated place; Failure: the system crashes; Centralized resource management; Management of all applications (overload) RDV Unknown;

SLIDE 28

RDV Concept Coordinator Known at the beginning; Central element at a dedicated place; Failure: the system crashes; Centralized resource management; Management of all applications (overload) RDV Unknown; Variable;

SLIDE 29

RDV Concept Coordinator Known at the beginning; Central element at a dedicated place; Failure: the system crashes; Centralized resource management; Management of all applications (overload) RDV Unknown; Variable; Failure: may still run;

SLIDE 30

RDV Concept Coordinator Known at the beginning; Central element at a dedicated place; Failure: the system crashes; Centralized resource management; Management of all applications (overload) RDV Unknown; Variable; Failure: may still run; Distributed data management;

SLIDE 31

RDV Concept Coordinator Known at the beginning; Central element at a dedicated place; Failure: the system crashes; Centralized resource management; Management of all applications (overload) RDV Unknown; Variable; Failure: may still run; Distributed data management; RDV for each application (limited overload)


SLIDE 33

How PastryGrid works

SLIDE 34

How PastryGrid works Hash (Application Name + User Name + Submission Date): Unique identifier ApplicationId

SLIDE 35

How PastryGrid works Hash (Application Name + User Name + Submission Date): Unique identifier ApplicationId Initialization of RDV: The machine which is closest numerically to ApplicationId
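The two steps above can be sketched as follows — hashing the application metadata into an id, then picking as RDV the live node whose id is numerically closest. SHA-1, the 32-bit id space, and the sample node ids are illustrative assumptions; PastryGrid's actual hash function and id width may differ:

```python
import hashlib

def application_id(name, user, date, bits=32):
    """Hash application name + user name + submission date into the id space."""
    digest = hashlib.sha1(f"{name}{user}{date}".encode()).hexdigest()
    return int(digest, 16) % (1 << bits)   # truncate to the id space

def closest_node(app_id, node_ids, bits=32):
    """The RDV: the node numerically closest to app_id on the circular id space."""
    ring = 1 << bits
    return min(node_ids,
               key=lambda n: min((n - app_id) % ring, (app_id - n) % ring))

# hypothetical application and node ids
app_id = application_id("povray", "alice", "2010-04-19")
rdv = closest_node(app_id, [12, 905, 2**20, 2**31])
```

Because the id is a hash of name + user + date, resubmitting the same application on a different date yields a different ApplicationId, hence a different RDV.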

SLIDE 36

How PastryGrid works Hash (Application Name + User Name + Submission Date): Unique identifier ApplicationId Initialization of RDV: The machine which is closest numerically to ApplicationId Search for free machine and assignment of tasks T1, T2 and T3


SLIDE 38

How PastryGrid works Request and Data Recovery by M1, M2 and M3: DataRequest and YourData

SLIDE 39

How PastryGrid works

SLIDE 40

How PastryGrid works M1 assigns T4 to M4, which it had found

SLIDE 41

How PastryGrid works M1 assigns T4 to M4, which it had found M3 ends T3 but does not seek a machine for T6


SLIDE 46

How PastryGrid works M1 assigns T4 to M4, which it had found M3 ends T3 but does not seek a machine for T6 M2 seeks M5 and M6 and assigns T5 and T6


SLIDE 48

Fault Tolerance in PastryGrid Passive replication based on Past (maintenance of k copies of the node states); update copies when a modification occurs; on failure of a source node, automatic creation of a copy (to maintain k)
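The Past-style maintenance described above can be sketched as: keep the state on the k nodes numerically closest to the key, and let another node take over a copy when a replica holder fails. The node ids, key, and k are illustrative assumptions:

```python
def replica_set(key, live_nodes, k=3):
    """The k live nodes numerically closest to the key hold a copy.

    Ties are broken by node id so the result is deterministic.
    """
    return sorted(live_nodes, key=lambda n: (abs(n - key), n))[:k]

nodes = {10, 20, 30, 40, 50}
before = replica_set(25, nodes)   # 20 and 30 are closest to key 25, then 10
nodes.discard(30)                 # a replica holder fails
after = replica_set(25, nodes)    # 40 takes over so that k copies survive
```

The point of the scheme is that no explicit failover protocol is needed: recomputing the closest-k set over the surviving nodes automatically designates the replacement replica holder.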

SLIDE 49

Fault Tolerance in PastryGrid Passive replication based on Past (maintenance of k copies of the node states); update copies when a modification occurs; on failure of a source node, automatic creation of a copy (to maintain k) If we adopt such an approach ⇒ node explosion;

SLIDE 50

Fault Tolerance in PastryGrid Passive replication based on Past (maintenance of k copies of the node states); update copies when a modification occurs; on failure of a source node, automatic creation of a copy (to maintain k) If we adopt such an approach ⇒ node explosion; A new component has been added: the FTC (Fault Tolerant Component) node

Supervises tasks that are running;

SLIDE 51

Fault Tolerance in PastryGrid Passive replication based on Past (maintenance of k copies of the node states); update copies when a modification occurs; on failure of a source node, automatic creation of a copy (to maintain k) If we adopt such an approach ⇒ node explosion; A new component has been added: the FTC (Fault Tolerant Component) node

Supervises tasks that are running; An FTC component for each application; It contacts the RDV to decide which tasks to supervise;

SLIDE 52

Fault Tolerance in PastryGrid Passive replication based on Past (maintenance of k copies of the node states); update copies when a modification occurs; on failure of a source node, automatic creation of a copy (to maintain k) If we adopt such an approach ⇒ node explosion; A new component has been added: the FTC (Fault Tolerant Component) node

Supervises tasks that are running; An FTC component for each application; It contacts the RDV to decide which tasks to supervise; k copies of the FTC and k copies of the RDV per application. In fact there are 3 types of nodes to manage: computing nodes, FTC nodes and RDV nodes;
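A hypothetical sketch of the FTC's supervision role described above: it learns the running tasks from the RDV, then reassigns any task whose node stops responding. The function and parameter names are assumptions for illustration, not PastryGrid's API:

```python
def supervise(running, alive, find_free_node):
    """running: {task: node}; alive: set of nodes still answering heartbeats."""
    reassigned = {}
    for task, node in running.items():
        if node not in alive:              # node stopped answering
            reassigned[task] = find_free_node()
    running.update(reassigned)             # restart failed tasks elsewhere
    return reassigned

# hypothetical state: the RDV reported T1 on M1 and T2 on M2, but only
# M1 is still alive; the free node M7 takes over T2
running = {"T1": "M1", "T2": "M2"}
moved = supervise(running, alive={"M1"}, find_free_node=lambda: "M7")
```

Since the FTC and the RDV are themselves replicated k times by Past, a failure of the supervisor does not stop supervision: a replica takes over with the same task table.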


SLIDE 54

Fault Tolerance in PastryGrid

SLIDE 55

Fault Tolerance in PastryGrid M initializes the RDV and the FTC of the application

SLIDE 56

Fault Tolerance in PastryGrid M initializes the RDV and the FTC of the application M assigns tasks T1, T2 to M1 and M2

SLIDE 57

Fault Tolerance in PastryGrid M initializes the RDV and the FTC of the application M assigns tasks T1, T2 to M1 and M2 PAST creates k (k = 2) replicas RDV1, RDV2 for RDV and FTC1, FTC2 for FTC

SLIDE 58

Fault Tolerance in PastryGrid

SLIDE 59

Fault Tolerance in PastryGrid M1 and M2 recover the data for T1 and T2 from the RDV

SLIDE 60

Fault Tolerance in PastryGrid M1 and M2 recover the data for T1 and T2 from the RDV The RDV informs the FTC of running tasks (T1 and T2)

SLIDE 61

Fault Tolerance in PastryGrid M1 and M2 recover the data for T1 and T2 from the RDV The RDV informs the FTC of running tasks (T1 and T2) The FTC supervises the execution of tasks T1 and T2 on M1 and M2
SLIDE 63

PastryGrid Validation The FT part Intensive experiments have been conducted (each machine has a probability P to fail for X seconds): P = 20%, 40%, 80%; 100 applications (2 to 128 parallel tasks); on 200 nodes

SLIDE 64

PastryGrid Validation The FT part Intensive experiments have been conducted (each machine has a probability P to fail for X seconds): P = 20%, 40%, 80%; 100 applications (2 to 128 parallel tasks); on 200 nodes. Main observations:

In all cases, PastryGrid terminates; the recovery time depends on the node type; the delay varies from 4:53s to 7:16:41s. . . but it works! The number of delayed applications varies from 44 to 98.
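The failure model used in these experiments can be sketched as follows: each machine independently fails with probability P and stays down for X seconds. The seed, the 30-second downtime, and the machine names are illustrative, not the paper's exact setup:

```python
import random

def inject_failures(machines, p, x_seconds, rng):
    """Each machine independently fails with probability p, down x_seconds."""
    return {m: x_seconds for m in machines if rng.random() < p}

rng = random.Random(42)            # fixed seed for reproducibility
down = inject_failures([f"M{i}" for i in range(200)],
                       p=0.2, x_seconds=30, rng=rng)
```

With P = 0.2 over 200 nodes, roughly 40 machines are expected to fail in a round, which is why the observed completion delays vary so widely across runs.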

SLIDE 65

Conclusion and Perspectives Conclusion PastryGrid: Fault-tolerant decentralized system for running distributed applications with precedence between tasks

SLIDE 66

Conclusion and Perspectives Conclusion PastryGrid: Fault-tolerant decentralized system for running distributed applications with precedence between tasks Creation of a dynamic execution environment for each application

SLIDE 67

Conclusion and Perspectives Conclusion PastryGrid: Fault-tolerant decentralized system for running distributed applications with precedence between tasks Creation of a dynamic execution environment for each application Decentralized collaboration between machines for application tasks management

SLIDE 68

Conclusion and Perspectives Perspectives DG has proved to be relevant for resource sharing ⇒ transpose this success story to the Cloud and PaaS universes ⇒ offer a technical alternative to the big server farms of Google, Salesforce and Amazon

SLIDE 69

Conclusion and Perspectives Perspectives DG has proved to be relevant for resource sharing ⇒ transpose this success story to the Cloud and PaaS universes ⇒ offer a technical alternative to the big server farms of Google, Salesforce and Amazon PastryGrid builds on emerging open-source Cloud solutions. From an economic point of view: if it is less expensive to host services locally and if it supports a wide range of applications → more potential partners, then small/medium-size companies will adopt PastryGrid;


SLIDE 71

Fault-Tolerance for PastryGrid Middleware

Christophe Cérin1, Heithem Abbes1,2, Mohamed Jemni2, Yazid Missaoui2

1LIPN, Université de Paris XIII, CNRS UMR 7030, France

2UTIC, ESSTT, Université de Tunis, Tunisia

HPGC’10 - IPDPS