

Slide 1

Abdulrahman Azab abdulrahman.azab@uis.no

Slide 2

What is Grid?

“Grid computing is concerned with coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations.”

Ian Foster & Karl Kesselman , 2001.

[Diagram: virtual organizations VO1, VO2, VO3 and nodes B1, B2, B3.]

Slide 3

What is Cloud?

Slide 4

The KISS Rule

Keep it simple, stupid!

Slide 5

Grid vs Cloud

  • Grid

[Diagram: user Ali tells the manager(s) "I need a Scientific Linux machine with 2 GB RAM!"; the resource "Hero" reports "I have Scientific Linux with 3 GB RAM"; the manager replies "Take Hero".]

Slide 6

Grid vs Cloud

  • Cloud

[Diagram: a user asks the cloud provider "I need 3 high-CPU Windows machines for 2 weeks"; the provider replies "Available for $1000".]

Slide 7

Computational Grid vs. Computational Cloud

                                 Computational Grid     Computational Cloud
Provided service                 Computational power    Computational power
Amount of concurrent requests    Limited                Massive
Transparency                     Not required           Required
Scalability                      Limited                High

[Cartoon: VO1, VO2, VO3; a user remarks, "I don't care. Both are distributed computing."]

Slide 8

Challenges

  • Many, but we consider:
  • 1. Stability with scalability
  • 2. System transparency
Slide 9

Stability with Scalability

  • Stability: maintaining throughput under failures
  • Scalability: the ability to add more nodes
  • Stability with scalability: maintaining throughput under failures in a bigger environment

  • Achieve load balancing
  • Avoid job starvation
Slide 10

How?

  • Optimized machine organization
  • Efficient job scheduling
  • Efficient fault tolerance
Slide 11

Machine organization


  • Flat

(gLite, Condor, Globus,…)

Slide 12

Machine organization

  • Flat

(gLite, Condor, Globus,…)

Slide 13

Machine organization

  • Flat

(NorduGrid, HIMAN, XtreemOS)

Slide 14

Machine organization

  • Flat
Slide 15

Machine organization

  • Hierarchical

(UNICORE, GridWay, BOINC,…)

Slide 16

Machine organization

  • Hierarchical

(UNICORE, GridWay, BOINC,…)

Slide 17

Machine organization

  • Interconnected

(Condor (flocking), DEISA, EGEE, NorduGrid)

Slide 18

Machine organization

  • Interconnected

(Condor (flocking), DEISA, EGEE, NorduGrid)

Slide 19

Proposal

Slide 20

Machine Organization: Cell

Slide 21

Scheduling: Cooperative

[Diagram: VO1–VO5 cooperating; a broker redirects a request with "Go to VO2 or VO3".]

  • Minimize scheduling overhead using fuzzy logic (a toy sketch follows below)
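As an illustration only (not the deck's actual algorithm), a toy R sketch of fuzzy VO selection: each VO's load is mapped to a membership degree in the fuzzy set "lightly loaded", and the job is forwarded to the VO with the highest degree. The VO load values are hypothetical.

# membership in the fuzzy set "lightly loaded": 1 at zero load, 0 at load >= 0.8
light_load <- function(load) pmax(0, pmin(1, (0.8 - load) / 0.8))

vo_load    <- c(VO1 = 0.90, VO2 = 0.30, VO3 = 0.40, VO4 = 0.85, VO5 = 0.70)  # hypothetical loads
membership <- light_load(vo_load)

names(which.max(membership))   # "VO2": forward the job there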
Slide 22

Worker Failures

[Diagram: checkpointing timeline for workers W1–W5 over checkpoints 1–8, marking the last update and two worker failures.]

Slide 23

Broker Failures

Slide 24

Broker Failures

Slide 25

Broker Failures

Slide 26

Simulation Model: PeerSim

[Diagram: PeerSim simulation model. Each regular node runs an Allocation Protocol, an Idle Protocol, and a Grid CD Protocol; each broker runs a Broker Protocol, a Service Allocator, and a Grid CD Protocol, and the brokers are connected through a Broker Overlay.]

Slide 27

Performance Evaluation

  • Validity of the stored resource information.
  • Efficiency of service allocation.
  • Impact of broker failures on resource information updating.

N = total Grid size, M = number of VOs.

Slide 28

Performance Evaluation

  • Broker Overlay Topologies

[Diagrams: four broker overlay topologies over brokers 1…K: Ring, Hypercube, Fully connected, and Wire-k-out. A sketch of how such overlays could be generated follows.]
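As an illustrative sketch only (the function names are assumptions, not part of the deck), two of these overlays can be generated as edge lists in R, with brokers numbered 1..K:

# ring: each broker is wired to its successor on the ring
ring_overlay <- function(K) cbind(from = 1:K, to = c(2:K, 1))

# wire-k-out: each broker wires k outgoing links to randomly chosen distinct brokers
wire_k_out_overlay <- function(K, k)
  do.call(rbind, lapply(1:K, function(i) cbind(from = i, to = sample(setdiff(1:K, i), k))))

ring_overlay(5)            # 5 brokers in a ring
wire_k_out_overlay(5, 2)   # 5 brokers, 2 random outgoing links each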

Slide 29

Validity of the stored resource information

  • The deviation of the reading-time values of the RIDBs stored in the resource information data set from the current cycle in a broker, plotted against the simulation cycles.
  • The deviation value for cycle (c):
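One plausible formalization (an assumption, not necessarily the slide's exact equation): the deviation is the average gap between the current cycle c and the reading time t_i recorded in each stored RIDB,

\[ \mathrm{Dev}(c) = \frac{1}{|S_c|} \sum_{i \in S_c} \left( c - t_i \right) \]

where S_c is the set of RIDB entries in the broker's resource information data set at cycle c.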
Slide 30

Validity of the stored resource information

[Plots: deviation over the simulation cycles for N = 100, M = 20 and for N = 500, M = 100 (log scale).]

Slide 31

Efficiency of Job Allocation

  • Periodic allocation by a single broker.
Slide 32

Impact of Broker Failures on Resource Information Updating (N = 500, M = 100)

Ring Topology

Slide 33

System Transparency


Slide 34

Challenge

  • To submit jobs to a Grid system, you need to learn how to:

  • 1. Prepare your input files
  • 2. Write a detailed submission script.
  • 3. Submit your jobs through the front end.
  • 4. Monitor the execution.
  • 5. Collect the results.

Example for step 2: writing a condor_submit description file (a minimal sketch follows below). Do scientists have time for this?
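For illustration only (the executable and file names are hypothetical, not from the deck), a minimal HTCondor submit description file of the kind condor_submit consumes might look like this:

# minimal HTCondor submit description file (illustrative)
universe       = vanilla
executable     = analysis
arguments      = input.RData
output         = job.out
error          = job.err
log            = job.log
request_memory = 1024
queue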

Slide 35

Current solutions

  • Grid portals (Web-based gateways): WebSphere, WebLogic, GridSphere, GridPortlets, …
  • Useful for manual submission. In many cases, however, job submission must be performed automatically from user code.


Slide 36

Current solutions

  • Web services: Birdbath (Condor), GRAM (Globus), GridSAM, …
  • APIs: DRMAA, SAGA, HiLA, CondorAPI, GridR, …

  • The programming language has to support the technology, and the user must have the proper experience. This is not the case for many low-level, special-purpose languages, nor for most scientists.

Slide 37

Our Solution: GAFSI

  • Grid Access File System Interface: submission and management of Grid jobs are carried out by executing simple read() and write() file system commands.
  • This technique allows all categories of users to submit and manage Grid jobs, both manually and from their code, which may be written in any language.

Demo

Slide 38

GAFSI-File sharing

[Diagram: GAFSI over file sharing. Users and the broker exchange job files through a shared <GAFSI-S watch path>; GAFSI passes the jobs to Condor (via condor_schedd) and to UNICORE (via UCC). Example file name: Job$Cluster$R$memory1024$Condor$start]

Slide 39

GAFSI-SSH

[Diagram: GAFSI over SSH. Users and the broker transfer job files via SFTP through GAFSI-C clients into the <GAFSI-S watch path>; GAFSI passes the jobs to Condor (via condor_schedd) and to UNICORE (via UCC). Example file name: Job$Cluster$R$memory1024$Condor$start]

Slide 40

Simple Example: R code

  • 1. Create the input files:

for (j in 1:Grid.workers) {
  # ... prepare param, dataList and iterationList for worker j ...
  save(param, dataList, iterationList, file = paste(j, ".RData", sep = ""))
}

  • 2. Copy them to the GAFSI watch path:

for (j in 1:Grid.workers) {
  file.copy(paste(j, ".RData", sep = ""),
            paste(Grid.workers.addresses[j], "\\input.RData", sep = ""))
}

Slide 41

Simple Example: R code

  • 3. Copy the code file to the same path:

file.copy("worker.apl.kf.R", paste(Grid.mainpath,"\ \","code.R", sep=""))

  • 4. Create the start file to trigger the submission:

file.create(paste(Grid.mainpath, "\\", "mytask$cluster$R$memory300$start", sep = ""))

Slide 42

Simple Example: R code

  • 5. Wait for the completion, then collect the result

files:

while (TRUE) {
  Sys.sleep(1)
  if (file.exists(paste(Grid.mainpath, "\\mytask$cluster$R$exports=result.RData$memory300$start", sep = "")))
    next   # start file still present: the job is still running
  break    # start file removed: the job has completed
}
# Result collection
for (j in 1:results)
  load(paste(Grid.mainpath, "\\result", j, ".RData", sep = ""))

Slide 43

Initial Performance Evaluation

  • CPU utilization of the R process during the execution of a parallel version of the PSM.estimate() statistical modeling function on Condor.

Slide 44

Conclusions and Future work

  • Maintaining stability with scalability, together with achieving system transparency, is a considerable challenge.
  • We've proposed a broker-overlay-based model as an infrastructure to maintain stability with scalability.
  • A Grid access file system interface (GAFSI) is proposed to solve the transparency problem. It is currently being implemented on the Condor and UNICORE frameworks.
  • The proposed architecture is to be implemented on existing Grid frameworks.
  • GAFSI is to be implemented on Linux based on FUSE.
Slide 45

Thank You

Slide 46

Additional Slides

Slide 47

Machine organization: Flat

  • gLite Workload Management System (WMS)
Slide 48

Machine organization: Flat

  • Condor Central Manager (CM)
Slide 49

Machine organization: Flat

  • Globus: Grid Resource Allocation & Management (GRAM)