SLIDE 1

Oscar 05 Symposium May 2005

Grid Aware HA-OSCAR

Kshitij Limaye1, Box Leangsuksun1, Venkata K. Munganuru1, Zeno Greenwood1, Stephen L. Scott2, Richard Libby3, and Kasidit Chanchio4

1.Louisiana Tech University, 2.Oak Ridge National Laboratory, 3.Intel, 4.Thammasat University, Thailand

SLIDE 2

Outline

  • Introduction
  • Traditional & dual-head architectures
  • Proposed framework
  • Smart Failover framework
  • Experiment
  • Planned & unplanned downtime
  • Conclusion
  • Future work

SLIDE 3

Introduction

  • Scientists across the world have employed grid computing to overcome resource-level hurdles.
  • Clusters are the favored job sites in grids.
  • High availability becomes increasingly important as critical applications shift to grid systems.
  • Though the grid is distributed, inevitable failures can make a site unusable, reducing the overall resources available and slowing down computation.

SLIDE 4

Introduction – continued…

  • Efforts need to concentrate on making critical systems highly available and on eliminating single points of failure in grids and clusters.
  • HA-OSCAR removes the single point of failure of a cluster-based job site (Beowulf) through component redundancy and self-healing capabilities.
  • The Smart Failover feature makes the failover mechanism graceful in terms of job management.

SLIDE 5

Traditional intra-site cluster configuration

  • The Site-Manager (the cluster head node running Globus services) is the node acting as the gateway between the cluster and the grid.
  • The Site-Manager is critical to the site being used to its full potential.
  • Failure of the Site-Manager leaves the whole site unused until it becomes healthy again.
  • Outages are aperiodic and unpredictable, so measures must be taken to guarantee high availability of services. Hence the proposed architecture.

SLIDE 6

Critical service monitoring and failover/failback capability for the Site-Manager

[Diagram: a client submits an MPI job to the Site-Manager (head node); HA-OSCAR fails over to the standby head node if critical services (Gatekeeper, gridFTP, PBS) die; the compute nodes sit behind the Site-Manager.]

SLIDE 7

Proposed Framework

Most current efforts have focused on task-level fault tolerance, such as retrying a job on an alternate site. There is a dearth of solutions for fault detection and recovery at the site level. We monitor the Gatekeeper and gridFTP services in the service-monitoring sublayer, and fail over and fail back in irreparable situations.

[Layer diagram: operating system, applications, cluster software, and grid layer, with the HA-OSCAR service-monitoring sublayer and the HA-OSCAR policy-based recovery mechanism alongside the stack.]

SLIDE 8

Grid Enabled HA service

  • HA-OSCAR monitors the Gatekeeper and gridFTP services every 3 seconds.
  • When a service fails to restart after 3 attempts, failover happens.
  • The standby also monitors the primary every 3 seconds to check whether it is alive.
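The policy above amounts to a small supervision loop. The sketch below is a hypothetical illustration, not HA-OSCAR's actual code; the probe and restart callables stand in for checking and restarting the Globus Gatekeeper and gridFTP daemons:

```python
def supervise_once(check_service, restart_service, attempts=3):
    """One HA-OSCAR-style monitoring cycle for a critical service.

    Returns 'up' if the service answers, 'restarted' if a restart
    brought it back, or 'failover' if `attempts` consecutive restarts
    all failed and the standby head node should take over.
    """
    if check_service():
        return "up"
    for _ in range(attempts):
        restart_service()
        if check_service():
            return "restarted"
    return "failover"

# In HA-OSCAR this cycle runs every 3 seconds; a real driver would
# sleep between cycles and trigger the failover procedure on 'failover'.
```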

SLIDE 9

Smart Failover Framework

  • The event monitor triggers the job-queue monitor on events such as JOB_ADD, JOB_COMPLETE, and system events.
  • On sensing a change in the job queue, the job-queue monitor triggers the backup updater to refresh the standby's backup.
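The event chain above can be sketched as follows. The class and method names are illustrative, not the framework's actual API, and the "backup" here is an in-memory copy rather than the real standby-node update:

```python
class BackupUpdater:
    """Copies the current job-queue state to the standby's backup."""
    def __init__(self):
        self.backup = []

    def update(self, queue):
        self.backup = list(queue)

class JobQueueMonitor:
    """Reacts to scheduler events and keeps the backup in sync."""
    def __init__(self, updater):
        self.queue = []
        self.updater = updater

    def on_event(self, event, job=None):
        if event == "JOB_ADD":
            self.queue.append(job)
        elif event == "JOB_COMPLETE":
            self.queue.remove(job)
        else:
            return  # other system events do not change the queue
        # queue changed -> trigger the backup updater
        self.updater.update(self.queue)
```

Only events that actually change the queue trigger a backup update, which keeps the standby consistent without copying state on every event.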

SLIDE 10

HA-OSCAR in a cluster based Grid environment

  • Production-quality open-source Linux-cluster project
  • HA and HPC clustering techniques to enable critical HPC infrastructure
  • Self-configuring multi-head Beowulf system
  • HA-enabled HPC services: active/hot standby
  • Self-healing with 3-5 second automatic failover time
  • The first known field-grade open-source HA Beowulf cluster release

SLIDE 11

Experiment

  • Globus Toolkit 3.2
  • OSCAR 3.0
  • HA-OSCAR beta 1.0

SLIDE 12

Observations

  • Average failover time was 19 seconds; average failback time was 20 seconds.
  • Services were restarted within 1-3 seconds, depending on when the last monitoring poll occurred.

Service-monitoring alerts:

  #  Group        Service     Type      Time                      Alert
  1  Service_mon  Gatekeeper  Alert     Sun Nov 21 09:10:30 2004  Xinetd.alert
  2  Service_mon  Gatekeeper  Up alert  Sun Nov 21 09:10:33 2004  Mail.alert

Primary-server alerts:

  #  Group           Service  Type      Time                      Alert
  1  Primary_server  Ping     Alert     Sun Nov 21 09:30:20 2004  Server-down.alert
  2  Primary_server  Ping     Up alert  Sun Nov 21 09:35:39 2004  Server-up.alert

SLIDE 13

Time needed for jobs to complete with/without “Smart Failover”

Assuming jobs start running after reboot on the clusters. TLR = time needed to complete the last running jobs.

  MTTR (seconds)   Total time without Smart Failover                          Total time with Smart Failover
  120 (2 min)      120 + run time of predecessors - TLR (running jobs lost)   20 + run time of predecessors + TLR
  600 (10 min)     600 + run time of predecessors - TLR (running jobs lost)   20 + run time of predecessors + TLR
  3600 (60 min)    3600 + run time of predecessors - TLR (running jobs lost)  20 + run time of predecessors + TLR
  7200 (2 hours)   7200 + run time of predecessors - TLR (running jobs lost)  20 + run time of predecessors + TLR
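The two columns encode the trade-off directly: with Smart Failover, the total time is independent of MTTR, since the standby takes over in about 20 seconds. A small sketch of the table's formulas (the argument names are ours; the accounting follows the slide verbatim):

```python
def completion_time(mttr, pred_runtime, tlr, smart_failover, failover_time=20):
    """Total time for jobs to complete after a head-node failure.

    Without Smart Failover, the site is down for the full MTTR and the
    running jobs are lost; with it, the standby takes over in
    ~failover_time seconds and the running jobs finish (TLR).
    """
    if smart_failover:
        return failover_time + pred_runtime + tlr
    return mttr + pred_runtime - tlr
```

For example, with an MTTR of one hour, 1000 s of predecessor run time, and 200 s left on the running jobs, the totals are 4400 s without Smart Failover versus 1220 s with it.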

SLIDE 14

Planned Downtime

  • The time taken to set up and configure software adds to the planned downtime.
  • We have developed an easy Globus Toolkit configuration-helper package.
  • It also helps install side packages such as schedulers, MPI implementations, etc.
  • This reduces planned downtime by automating the process.

SLIDE 15

Unplanned Downtime

Assumptions: modeled with the SPNP package.

  • Availability of a grid with a traditional cluster as the intra-site solution: 0.968, i.e. 11.68 days of downtime per year.
  • Availability of a grid with an HA-OSCAR-enabled cluster as the intra-site solution: 0.99992, i.e. 2 minutes of downtime per year. Hence the obvious availability gain.

[Chart: availability per year vs. mean time to failure (MTTF, 1000-6000 hours) for single-head and HA-OSCAR-enabled grids of 4 and 10 clusters; y-axis from 70% to 100%.]

HA-OSCAR enabled Grid Vs Traditional Grid
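The downtime figures follow from steady-state availability by simple arithmetic. A quick converter, independent of the SPNP model, shown here only as a sanity check for the traditional-cluster case:

```python
def annual_downtime(availability, unit="days"):
    """Expected downtime per year implied by a steady-state availability."""
    minutes_per_year = 365 * 24 * 60
    down_minutes = (1 - availability) * minutes_per_year
    if unit == "days":
        return down_minutes / (24 * 60)
    return down_minutes  # minutes

# annual_downtime(0.968) -> 11.68 days, matching the slide.
```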

SLIDE 16

Polling Overhead Measurement

  • 20-second failover time
  • 0.9% CPU usage at each monitoring interval

[Chart: HA-OSCAR network load in packets/min (measured by TCPtrace) for monitor polling intervals of 1, 2, 5, 10, 15, 20, 30, and 60 seconds; load spans roughly 50-300 packets/min.]

Comparison of network usage for different HA-OSCAR polling intervals
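The measured load falls off as the polling interval grows, as a fixed cost per poll would predict. A toy model of that relationship (packets_per_poll is an assumed constant, not a measured HA-OSCAR value):

```python
def polling_load(packets_per_poll, interval_s):
    """Estimated monitoring traffic in packets/min when polling every interval_s seconds."""
    return packets_per_poll * 60 / interval_s
```

Halving the interval doubles the traffic, which is the trade-off the chart quantifies against failure-detection latency.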

SLIDE 17

Summary

  • Institutions have a significant investment in resources, and that investment needs to be protected.
  • "Smart Failover" HA-OSCAR makes failover graceful in terms of job management.
  • "Smart Failover" HA-OSCAR, together with the failover-aware solution for the Site-Manager, provides better availability, self-healing, and fault tolerance.
  • HA-OSCAR ensures service- and job-level resilience for clusters and grids.

SLIDE 18

Current status

  • The Smart Failover feature has been tested with OSCAR 3.0, using OpenPBS as the scheduler.
  • A failover-aware client has been written to achieve resilience for jobs submitted through the grid.
  • A lab-grade automated Globus installation package is ready.

SLIDE 19

Future Work

  • Develop a wrapper around the scheduler for per-job add/complete events.
  • Test the Smart Failover feature with the event-monitoring system.
  • Integrate "Smart Failover" into the next HA-OSCAR release.
  • Research a lazy-failback mechanism.

SLIDE 20

Thank You