Fault tolerance based on the Publish-Subscribe Paradigm for the BonjourGrid Middleware

Heithem ABBES, Christophe CERIN, Mohamed JEMNI and Walid SAAD

Grid 2010 - 27 October 2010

University of Paris XIII, Institut Galilée, Laboratoire d'Informatique de Paris Nord (LIPN); University of Tunis, École Supérieure des Sciences et Techniques de Tunis, Unité de Recherche UTIC


Outline

  • Introduction
  • Objectives
  • Design of BonjourGrid
  • Integration of Boinc and Condor
  • Fault tolerance approach
  • Experimentation and validation
  • Conclusion and future works


Introduction (1/3)

  • P2P systems have enabled large improvements in the field of file sharing over the Internet.
  • Gnutella, Kazaa and Freenet

➡ Decentralized architecture
➡ No coordination between machines

Introduction (2/3)

  • Grid computing: obtaining an infrastructure offering computing power for users' applications.
  • Coordination between machines during application execution.
  • Centralized or hierarchical architectures (Globus, gLite, Condor).

➡ No scalability
➡ Complicated installation procedure
➡ Complicated configuration phase for an ordinary user


Introduction (3/3)

  • Desktop Grids led the community to build computing systems based on voluntary machines.
  • Current systems use the Master/Worker model
  • United Devices, BOINC, PLANETLAB, XtremWeb
  • Application domains
    • Global climate prediction (BOINC)
    • Search for extraterrestrial intelligence (SETI@Home)
    • Cosmic ray studies (XtremWeb)

✓ Demonstrate the potential of Desktop Grids
✴ Suffer from being hardly scalable due to centralized control
✴ Rely on permanent administrative staff who guarantee the master's operation


Objectives of BonjourGrid

  • Design a multi-platform, fault-tolerant system of coordinators using existing desktop grid middleware
  • Reduce the centralization factor: no static coordinator
  • Benefit from existing decentralized service discovery tools (Publish/Subscribe)
  • Create coordinators on demand, automatically and without administrator intervention
  • Each coordinator selects machines to participate in the execution of a given application


Design of BonjourGrid

  • 1 Computing Element (CE) = 1 coordinator + N workers
  • 1 instance: 1 CE managed by a middleware
  • The coordinator controls and orchestrates multiple instances

➡ Introduces the concept of meta-grids


Design of BonjourGrid

[Figure: independent computing elements A, B, C and D coexisting on the same set of machines]

  • A computing element for each user
  • No static coordinator
  • Each user can specify a middleware for his computing element

Components of BonjourGrid

  • BonjourGrid is based on:
    • A resource discovery protocol
      • Fully decentralized
    • A computing element
      • Executes and handles the various tasks of an application (Condor, Boinc, XtremWeb)
    • A global coordination protocol
      • Manages and controls all resources, services and computing elements
      • Does not depend on any specific machine or centralized element


Discovery protocol

  • Based on the Bonjour protocol
    • Multicast IP network
    • Apple's implementation of the ZeroConf protocol
  • Structured around three functionalities:
    • Dynamic allocation of IP addresses without DHCP
    • Resolution of names and IP addresses without DNS
    • Service discovery without a directory server
  • Motivations
    • Industrial protocol backed by Apple
    • Versions for the 3 OSes (Windows, Linux, MacOS)
    • Linux and MacOS distributions integrate Bonjour
    • Evolution of networks (10 Gb/s => 10x Gb/s) => low risk of network congestion for multicast protocols
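Bonjour announcements are DNS-SD records: a service instance carries its properties in a TXT record made of length-prefixed key=value strings (RFC 6763). A minimal stdlib-only sketch of that encoding, with property names borrowed from the integration slides (URL, ProjectName, NbreWorker) purely as an illustration:

```python
def encode_txt(props: dict) -> bytes:
    """Encode key=value properties as a DNS-SD TXT record (RFC 6763):
    a sequence of length-prefixed 'key=value' strings."""
    out = b""
    for key, value in props.items():
        item = f"{key}={value}".encode("utf-8")
        if len(item) > 255:
            raise ValueError("TXT item longer than 255 bytes")
        out += bytes([len(item)]) + item
    return out

def decode_txt(data: bytes) -> dict:
    """Parse a TXT record back into a properties dict."""
    props, i = {}, 0
    while i < len(data):
        n = data[i]
        key, _, value = data[i + 1:i + 1 + n].decode("utf-8").partition("=")
        props[key] = value
        i += 1 + n
    return props

# A CoordinatorService announcement might carry properties like these:
record = encode_txt({"URL": "http://coord.local:8080",
                     "ProjectName": "demo", "NbreWorker": "16"})
assert decode_txt(record)["ProjectName"] == "demo"
```

In a real deployment these records are published and browsed over mDNS by the Bonjour daemon; only the key=value payload format is shown here.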


Computing element (CE)

  • Each coordinator dynamically creates its CE
  • CE = coordinator + set of workers
  • CE functionalities
    • Allocates workers
    • Submits and runs tasks on workers
    • Schedules tasks and gets results
  • Computing systems
    • XtremWeb, Condor or Boinc

➡ 1 specific CE for each user


Coordination protocol

  • Each machine can be in one of three states (Idle, Worker or Coordinator).
  • A machine announces its state by publishing the service specific to that state:
    • IdleService for the idle state
    • WorkerService for the worker state
    • CoordinatorService for the coordinator state
  • When a machine's state changes:
    • it publishes the appropriate service to advertise the new state,
    • after having deactivated the old one.
  • Every machine can discover machines that are in a given state:
    • A machine launches a discovery on a particular service instead of permanently receiving all new events.
    • This restricts communication between machines.
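The protocol above can be modeled in a few lines. The sketch below replaces the Bonjour multicast registry with an in-process dict (an assumption made for readability) but keeps the two invariants of the slide: a machine publishes exactly one state service at a time, and discovery targets a single service type:

```python
# Toy in-process model of the coordination protocol; a real deployment
# would publish these services over Bonjour/mDNS.

SERVICES = {"idle": "IdleService", "worker": "WorkerService",
            "coordinator": "CoordinatorService"}

registry = {}  # service name -> set of hosts currently publishing it

class Machine:
    def __init__(self, host):
        self.host = host
        self.state = None
        self.set_state("idle")  # every machine starts idle

    def set_state(self, new_state):
        # Deactivate the old service first, then publish the new one,
        # so the machine is never advertised in two states at once.
        if self.state is not None:
            registry[SERVICES[self.state]].discard(self.host)
        registry.setdefault(SERVICES[new_state], set()).add(self.host)
        self.state = new_state

def discover(state):
    """Browse one service type instead of listening to every event."""
    return set(registry.get(SERVICES[state], set()))

machines = [Machine(f"node{i}") for i in range(4)]
machines[0].set_state("coordinator")   # node0 builds a CE
for m in machines[1:3]:
    m.set_state("worker")              # node1 and node2 join it

assert discover("coordinator") == {"node0"}
assert discover("idle") == {"node3"}
```

Browsing `discover("idle")` is exactly what a new coordinator does to select machines for its CE.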


Layered architecture

  • Publish/Subscribe
  • Connection to BonjourGrid
  • Resource discovery
  • Resource characteristics
  • Establishment of the CE network
  • XtremWeb / Condor / Boinc
  • Deployment of a computing system


Integration of Boinc in BonjourGrid

[Diagram: message exchanges between the BonjourGrid layers and Boinc during CE construction. The published records carry fields such as: Account, mail, Certificate…; URL, ProjectName, NbreWorker…; IP, Hostname, CPU, Memory…; ServiceType, HostName…]


Integration of Condor in BonjourGrid

[Diagram: message exchanges between the BonjourGrid layers and Condor during CE construction. The published records carry fields such as: Host/IP, access-level security, mail and network parameters; ServiceType, HostName…; IP, PoolName, ManagerName, CollectorName, DomainName, NbreWorker…; IP, Hostname, CPU, Memory…]


Fault tolerance approach

  • Each computing system is responsible for:
    • controlling and monitoring the execution of application tasks
    • the fault tolerance of its workers within a computing element

➡ The failure of workers is not the responsibility of BonjourGrid
➡ The failure of coordinators is the responsibility of BonjourGrid


Fault tolerance approach

  • Solution
    • Dynamically create backup coordinators for each application,
    • Provide k backups (k is a configuration setting that must be fixed before construction of the CE) for each application, using passive replication
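A minimal sketch of this scheme, assuming the checkpoint content and host names (neither is specified on the slide): with passive replication only the primary executes, pushing each checkpoint to its k backups, and failover promotes the first live backup:

```python
# Passive replication of a coordinator onto k backups (illustrative sketch).

class Coordinator:
    def __init__(self, host):
        self.host = host
        self.state = {}    # last checkpoint produced (primary) or received (backup)
        self.alive = True

def build_ce(primary_host, backup_hosts, k):
    """k is fixed before CE construction, as on the slide."""
    return Coordinator(primary_host), [Coordinator(h) for h in backup_hosts[:k]]

def checkpoint(primary, backups, state):
    primary.state = dict(state)
    for b in backups:          # backups only store state; they do not execute
        b.state = dict(state)

def failover(primary, backups):
    """Promote the first live backup when the primary fails."""
    if primary.alive:
        return primary
    for b in backups:
        if b.alive:
            return b
    raise RuntimeError("no live coordinator left")

primary, backups = build_ce("node0", ["node1", "node2", "node3"], k=2)
checkpoint(primary, backups, {"done_tasks": 42})
primary.alive = False
new_primary = failover(primary, backups)
assert new_primary.host == "node1"          # first backup takes over
assert new_primary.state == {"done_tasks": 42}
```

The new primary resumes from the last checkpoint, which is why the amount of state shipped between coordinators matters (see "Future works").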


Fault tolerance approach

[Figure: construction of a computing element and 2 backups of the coordinator; legend: Idle, Coordinator, Worker]

  • The workers go back to the idle state when the coordinator is disabled

➡ Problem: the coordinator has not yet completed the application

Fault tolerance approach

  • Solution: a Status field distinguishes between:
    • Stop due to failure
      ➡ Status = 0 (application still in execution)
    • Stop following the end of the application
      ➡ Status = 1 (application finished)
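The resulting decision rule is tiny; this hypothetical handler shows how a backup could interpret the Status field once the primary's CoordinatorService disappears:

```python
def on_coordinator_stop(status):
    """React to the disappearance of a CoordinatorService, based on the
    Status value it last published (hypothetical handler)."""
    if status == 0:
        # application still in execution: the stop is a failure
        return "take_over"
    if status == 1:
        # application finished: normal end, free the resources
        return "release_workers"
    raise ValueError(f"unknown Status value: {status}")

assert on_coordinator_stop(0) == "take_over"
assert on_coordinator_stop(1) == "release_workers"
```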


Experiments

  • System evaluation based on
    • a set of specific applications?
    • a specific arrival pattern (Poisson's law)?
  • Workload model very close to reality
    • Feitelson and Lublin
  • Inputs of the workload model
    • Number of nodes (system size)
    • Arrival time of applications
    • Maximum number of parallel tasks
    • Task execution times

Application ID | Arrival Time (s) | Execution Time (s) | Nbre of parallel tasks
             1 |               19 |                  4 |                     32
             2 |               39 |                 11 |                     13
             3 |               69 |                 13 |                     16
             4 |               98 |                 87 |                      1
             5 |              299 |                100 |                      4
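A simplified generator for such a workload, assuming plain Poisson arrivals (exponential interarrival times) and uniform task counts and durations; the real Feitelson/Lublin model uses richer distributions for each input:

```python
import random

def generate_workload(n_apps, mean_interarrival, max_parallel=128,
                      max_exec=500, seed=0):
    """Poisson arrival process: interarrival times are exponential with the
    given mean. Task counts and execution times are drawn uniformly here,
    which is a simplification of the Feitelson/Lublin model."""
    rng = random.Random(seed)
    apps, t = [], 0.0
    for app_id in range(1, n_apps + 1):
        t += rng.expovariate(1.0 / mean_interarrival)
        apps.append({
            "id": app_id,
            "arrival": round(t),
            "exec_time": rng.randint(1, max_exec),
            "n_tasks": rng.randint(2, max_parallel),
        })
    return apps

workload = generate_workload(130, mean_interarrival=30)
assert len(workload) == 130
arrivals = [a["arrival"] for a in workload]
assert arrivals == sorted(arrivals)   # arrivals are non-decreasing
```

The defaults (130 applications, 2 to 128 parallel tasks, 1 to 500 s execution times) mirror the setup of the Boinc and Condor experiments below.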


Experiments

  • Emulation of a set of users and a set of applications
  • 1 CE is dynamically created for each application
  • Emulator
    • Parameters: list of machines, list of applications, workload model
    • Submits an application following the arrival pattern of applications in the workload
    • Looks for a free machine on which a coordinator will start, to initiate the execution of the application's tasks
    • The CE is released when the application's tasks finish
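The emulator loop can be sketched as a small discrete-event simulation, under two assumptions not spelled out on the slide: an application occupies 1 coordinator plus one machine per parallel task, and applications start strictly in arrival order:

```python
import heapq

def emulate(apps, n_machines):
    """apps: list of (app_id, arrival, exec_time, n_tasks) tuples.
    Returns {app_id: turnaround}, turnaround = end time - submission time."""
    free = n_machines
    releases = []        # min-heap of (finish_time, machines_to_free)
    turnaround = {}
    now = 0
    for app_id, arrival, exec_time, n_tasks in sorted(apps, key=lambda a: a[1]):
        now = max(now, arrival)
        need = 1 + n_tasks              # 1 coordinator + one machine per task
        while free < need:              # wait for running CEs to be released
            finish, m = heapq.heappop(releases)
            now = max(now, finish)
            free += m
        free -= need                    # build the CE
        heapq.heappush(releases, (now + exec_time, need))
        turnaround[app_id] = now + exec_time - arrival
    return turnaround

# The five sample applications of the workload table, on the 200 machines of
# the Grid5000 setup: none of them waits, so turnaround == execution time.
table = [(1, 19, 4, 32), (2, 39, 11, 13), (3, 69, 13, 16),
         (4, 98, 87, 1), (5, 299, 100, 4)]
assert emulate(table, 200) == {1: 4, 2: 11, 3: 13, 4: 87, 5: 100}
```

With fewer machines the queuing delay appears: on 34 machines, two 32-task applications submitted together serialize, and the second one's turnaround includes the wait.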

Experiments

  • Measured metric: end time of an application - submission time
  • Analyze the delay caused by the decentralization
  • Analyze the behavior of BonjourGrid with Boinc and with Condor
  • Setup
    • BonjourGrid: N machines (dynamic infrastructure)
    • Boinc or Condor: 1 coordinator + N-1 workers (static infrastructure)


Experiments - Boinc

[Plot: turnaround time (s, log scale) of BOINC vs. BonjourGrid, and number of // tasks per application, for 130 applications]

Setup:

  • 130 applications (2 to 128 // tasks)
  • 200 machines on Grid5000 (Orsay site)
  • Execution times vary from 1 to 500 seconds

Results:

  • With BonjourGrid, 60% of the applications show a delay varying from 24 to 1277 s
  • BonjourGrid gives execution times < Boinc when the number of tasks is large

Experiments - Condor

[Plot: turnaround time (s, log scale) of Condor vs. BonjourGrid, and number of // tasks per application, for 130 applications]

Setup:

  • 130 applications (2 to 128 // tasks)
  • 200 machines on Grid5000 (Orsay site)
  • Execution times vary from 1 to 500 seconds

Results:

  • With BonjourGrid, 35% of the applications show a delay of around 30 s
  • BonjourGrid generates larger delays for applications that are preceded by applications with a large number (>100) of tasks


Experiments - Fault tolerance

  • Virtual machines are used to save the state of coordinators
    • XEN virtualization system
  • 10 applications, with parallel tasks ranging from 2 to 128
  • 50 machines on the Nancy site (Grid5000)
  • Fault scenarios obtained by injecting faults during the execution of applications
  • Recovery time
    • Time to recover the coordinator
    • Time to re-establish the connection of the workers


Fault tolerance framework

[Diagram: Main coordinator, Backup coordinator and Worker exchanging: (1) Save snapshot, (2) Migrate snapshot, (3) Restore snapshot, (4) Establishment of connection]
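A dry-run sketch of steps (1) to (3), assuming the classic Xen `xm` toolstack (`xm save` / `xm restore`) and `scp` for moving the snapshot file; host and domain names are hypothetical:

```python
def failover_plan(domain, snapshot, main_host, backup_host):
    """Return the command sequence for the snapshot-based failover.
    Dry run only: commands are returned as strings, not executed."""
    return [
        # (1) save the coordinator VM state on the main host
        f"ssh {main_host} xm save {domain} {snapshot}",
        # (2) migrate the snapshot file to the backup host
        f"scp {main_host}:{snapshot} {backup_host}:{snapshot}",
        # (3) restore the VM on the backup host
        f"ssh {backup_host} xm restore {snapshot}",
        # (4) is not a Xen step: the workers re-establish their connection to
        # the restored coordinator through the coordination protocol.
    ]

plan = failover_plan("coord-vm", "/tmp/coord.chk", "nancy-1", "nancy-2")
assert plan[0].endswith("xm save coord-vm /tmp/coord.chk")
```

Running the plan would be a matter of passing each string to `subprocess.run`; keeping it as data makes the sequence easy to log and test.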


Experiments - Fault tolerance: BOINC

  • Average delay of 197 s
  • Almost stable delay, which does not depend on the number of tasks
  • Boinc allows work to continue after a coordinator failure

[Plot: turnaround time (s, log scale) of BOINC vs. BOINC-FT, number of // tasks per application and level of fault injection, for 10 applications]


Experiments - Fault tolerance: Condor

  • Average delay of 238 s
  • Condor recovers tasks that have not completed their execution

[Plot: turnaround time (s, log scale) of CONDOR vs. CONDOR-FT, number of // tasks per application and level of fault injection, for 10 applications]


Conclusion

  • BonjourGrid: a novel approach for building collaborative and decentralized Desktop Grid systems.
  • A Publish/Subscribe protocol orchestrates the participants.
  • A computing system (Boinc, Condor or XtremWeb) handles the execution level of an application.
  • BonjourGrid provides distributed control over resources and does not depend on a central element.
  • BonjourGrid implements a fault-tolerance mechanism for coordinators.
  • BonjourGrid favors collaborative execution and Meta-Grid orchestration.


Future works

  • Minimize the amount of information transferred between coordinators
  • Include reservation rules based on history traces of previous executions
  • Integrate economic models
  • Add a new layer for security issues


Our background is coordination, our future will be coordination of clouds!


Thanks. Any questions?