French / UK workshop on GRID Computing

Grand Large (ACI Grid CGP2P)
Investigating the impact of the Large Scale on distributed systems

F. Cappello, INRIA
Grand-Large Project, INRIA/PCRI
LRI, Université Paris Sud
fci@lri.fr, www.lri.fr/~fci
Two kinds of Grids, both large scale distributed systems:
– Computing « GRID »: large sites, computing centers, clusters. Node features: credential, authentication, confidence.
– « Desktop GRID » (Seti@home, Decrypthon, Climate-Prediction) and Peer-to-Peer systems (Napster, Kazaa, etc.). Node features: PCs running Windows or Linux.
[Figure: a coordination system (infrastructure) connects clients (PCs) and service providers (PCs). A client sends a request and receives a result; a provider accepts work and provides results. Requests and accepts may concern computations or data, and direct communications between provider PCs are possible for parallel applications.]
A very simple problem statement, but one that raises many classical OS research issues: scheduling, load balancing, security, fairness, coordination, message passing, data storage, programming, deployment, etc. BUT the « Large Scale » feature has severe implications:
– Conventional techniques and approaches may not fit (e.g., fault tolerance).
– New approaches, intrinsically scalable and fault tolerant, are needed.
Research topics and sub-projects:
– Global architecture (F. C. and O. R.)
– User interface, control language (SPI, S. Petiton)
– Security, sandboxing (SPII, O. Richard)
– Large scale storage (SPIII, Gil Utard)
– Inter-node communications: MPICH-V (SPIV, F. Cappello)
– Scheduling (large scale, multi-user) (SPIV, C. G. and F. C.)
– Theoretical proof of the protocols (SPV, J. Beauquier)
– GRID/P2P interoperability (SPV, A. Cordier)
– Validation on real applications (G. Alléon, etc.)
According to current knowledge, we need:
1) New tools (models, simulators, emulators, experimental platforms)
2) Strong interaction between the research tools

[Figure: research tools placed on a log(cost) vs. log(realism) map, spanning math (model for LSDS, protocol proof), simulation (SimLargeGrid), emulation (Grid eXplorer) and live systems (XtremWeb, MPICH-V, SMLSM, US, ADSL-Stats, Grid'5000).]
CGP2P results:
[Figure: the same log(cost) vs. log(realism) map, used here to position the CGP2P results across math, simulation, emulation and live systems.]
Network and node behaviour make failure detection ambiguous at this scale. When a probe fails (e.g., testing a node by opening a connection with select()):
– either the target is down,
– or it cannot accept new connections because all its slots are full,
– or it does not see the incoming SYN message due to high network traffic,
– etc.
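A minimal sketch of such a probe, assuming a plain TCP check written in Python (not taken from the slides): a non-blocking connect() watched with select() can only report whether the connection completed within the timeout; it cannot distinguish the causes listed above.

```python
# Sketch only: probing a peer with a non-blocking TCP connect and select().
# A timeout tells us nothing more than "the connection did not complete":
# the host may be down, its accept queue may be full, or the SYN may have
# been lost in congested traffic.
import socket
import select

def probe(host: str, port: int, timeout: float = 5.0) -> bool:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setblocking(False)
    try:
        s.connect((host, port))
    except BlockingIOError:
        pass  # connect() is in progress on a non-blocking socket
    _, writable, _ = select.select([], [s], [], timeout)
    try:
        if writable and s.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) == 0:
            return True   # the peer answered, so it is alive
        return False      # timeout or error: the cause remains ambiguous
    finally:
        s.close()
```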
Consensus is impossible in asynchronous distributed systems with failures (consensus impossibility). Can useful services be built without requiring consensus? An interesting strategy would be to consider, for each node, a "horizon": consensus would be guaranteed only inside this horizon (a toy sketch follows the workshop list below).
Workshop: Hugues Fauconnier, Carole Delporte (Paris 7), Joffroy Beauquier, Franck Cappello, Colette Johnen, Sébastien Tixeuil, Thomas Herault (Paris 11)
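Purely as an illustration of the horizon idea, here is a toy sketch (not the CGP2P protocol; the RTT table, the bound and the node names are invented): each node restricts agreement to the peers it considers inside its horizon, for instance those within a round-trip-time bound, instead of seeking agreement across the whole system.

```python
# Toy sketch of the "horizon" idea (illustrative only): agreement is sought
# only among the peers inside a node's horizon, e.g. peers whose measured
# round-trip time stays under a bound.
from typing import Dict, Set

def horizon(rtt_ms: Dict[str, float], bound_ms: float) -> Set[str]:
    """Peers considered inside the horizon: RTT below the bound."""
    return {peer for peer, ms in rtt_ms.items() if ms <= bound_ms}

def agree_within_horizon(proposals: Dict[str, int], members: Set[str]) -> int:
    """Deterministic choice over the members' proposals (here the minimum),
    so every member of the horizon computes the same value."""
    return min(proposals[m] for m in members)

# Example: node 'a' only tries to agree with peers reachable under 50 ms.
rtt_from_a = {"b": 12.0, "c": 40.0, "d": 300.0}
members = {"a"} | horizon(rtt_from_a, 50.0)                 # {'a', 'b', 'c'}
print(agree_within_horizon({"a": 7, "b": 3, "c": 9, "d": 1}, members))  # -> 3
```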
Global coordination seems very difficult at large scale (hierarchical solutions exist and may fit). More speculative approaches based on autonomous decisions and self-organization are also good candidates. We investigate this last idea with a concrete mechanism: a scheduler/load balancer (SimGrid, Bricks and GridSim do not scale). Current status: a simulation tool covering topology, volatility, asynchrony, latency/bandwidth and heterogeneity, plus nearest-neighbour scheduling algorithms, and use of the tool to compare them (a toy version of a nearest-neighbour step is sketched below).
The simulation tool is based on the Swarm multi-agent simulator.
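To make the nearest-neighbour idea concrete, here is a toy load-balancing step (an illustrative sketch, not the Swarm-based tool; the ring topology and node names are invented): each node consults only its direct neighbours and shifts one task towards the least-loaded of them, so no global view is ever required.

```python
# Toy nearest-neighbour load balancing: purely local decisions, no global state.
from typing import Dict, List

def balance_step(loads: Dict[str, int], neighbours: Dict[str, List[str]]) -> Dict[str, int]:
    new_loads = dict(loads)
    for node in loads:
        if not neighbours[node]:
            continue
        target = min(neighbours[node], key=lambda n: new_loads[n])
        if new_loads[node] > new_loads[target] + 1:   # move one task if it helps
            new_loads[node] -= 1
            new_loads[target] += 1
    return new_loads

# Example: a ring of 4 nodes with all the work initially on node 'a'.
loads = {"a": 8, "b": 0, "c": 0, "d": 0}
ring = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
for _ in range(5):
    loads = balance_step(loads, ring)
print(loads)  # the work spreads out using only local information
```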
Grid eXplorer: a "GRIDinLAB" instrument for CS researchers, funded by the French ministry of research through the ACI "Data Mass" incentive plus INRIA. Its purpose:
– addressing the specific issues of each domain,
– enabling research studies combining the 2 domains,
– easing and developing collaborations between the two communities.
Statistics: 13 laboratories, 80 researchers, 24 research experiments, >1M€ (not counting salaries), installed at IDRIS (Orsay).
It is close to Emulab and WaniLab: hardware + software.
– First GdX meeting was on September 16, 2003.
– Hardware design meeting planned for October 15.
– Hardware selection meeting on November 8.
– Choosing the nodes (single or dual?)
– Choosing the CPU (Intel IA-32, IA-64, Athlon 64, etc.)
– Choosing the experimental network (Myrinet, Ethernet, InfiniBand, etc.)
– Choosing the general experiment production architecture (parallel OS architecture, user access, batch scheduler, result repository)
– Choosing the experimental database hardware
– Etc.
XtremWeb architecture:
[Figure: clients (PCs) send job requests to the coordinator and retrieve results; worker PCs request jobs from the coordinator and return results. Workers and clients initiate the connections towards the coordinator (NAT/proxy bypass).]
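To illustrate this pull model, here is a minimal worker-side sketch (the coordinator URL, endpoints and helper names are hypothetical, not the real XtremWeb protocol): because the worker always opens the connection itself, it can sit behind a NAT or a proxy.

```python
# Illustrative worker loop: pull a job, run it, push the result back.
import time
import urllib.request

COORDINATOR = "http://coordinator.example.org"   # hypothetical address

def run(job: bytes) -> bytes:
    ...  # application-specific computation goes here

def worker_loop() -> None:
    while True:
        try:
            with urllib.request.urlopen(f"{COORDINATOR}/job/request") as resp:
                job = resp.read()
            if job:
                result = run(job)
                urllib.request.urlopen(f"{COORDINATOR}/job/result", data=result)
        except OSError:
            pass            # coordinator unreachable: retry on the next round
        time.sleep(30)      # polling period
```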
Application: understanding the origin of very high energy cosmic rays (air shower simulation).
– Sequential Monte Carlo code; time for a run: 5 to 10 hours; runs distributed with XtremWeb.
[Figure: PC clients and PC workers connected through the Internet and LANs to the XtremWeb coordinator and to the air shower parameter database (Lyon, France); annotations: "~5000", "by December", "replication".]
Deployment over the EU-USA network: the XW coordinator and XW clients at LRI (lri.fr), the LRI PC pool and other labs on the lab network, plus a Grenoble PC pool (PBS) and the Madison, Wisconsin PC pool (Condor) reached through the Internet and university networks. About 1K CPUs in total.
[Plot: scalability. Number of executed jobs vs. time in minutes for the WISC-97, WL-113, G-146, WLG-270 and WL-451 configurations.]
[Plot: resistance to a massive fault (150 CPUs). Processors used vs. time in minutes, comparing the WLG-309 run with faults to the fault-free WLG-270 run.]
MPICH-V combines:
– theoretical studies,
– experimental evaluations,
– pragmatic implementations,
aiming to provide an MPI implementation based on MPICH, featuring multiple fault tolerant protocols (3 currently), for Desktop Grids, large clusters and Grids.
[Plot: MPICH-V vs. MPICH-P4 on BT.A for 9, 16 and 25 nodes; time in seconds split into computation and communication for the base run, the run without checkpoint, MPICH-P4 and the whole execution.]
[Plot: execution time under fault injection on BT.A.9; total execution time (sec.) vs. number of faults during execution (about 1 fault every 110 sec.), for MPICH-V (CM but no logs), MPICH-V (CM with logs), MPICH-V (CM+CS+ckpt) and MPICH-P4, against the base execution without checkpoint and fault.]
Many of the CGP2P participants are also involved in:
Summary: we are involved in different projects related to large scale distributed systems:
– from theoretical studies to actual Grid deployments,
– about fault tolerance and performance,
– middleware design and implementation: XtremWeb, MPICH-V,
– large scale experimental platforms: Grid eXplorer, Grid'5000.
The 4 research topics and their leaders:
– Olivier Richard (ID-IMAG)
– Pierre Sens (LIP6)
– Pascale Primet (LIP, Inria RESO)
– Christophe Cérin (Laria)
[Table: the 24 research experiments, each marked against the categories Infrastructure, Emulation, Network and Application. Experiments: I.1 Platform, I.2 Virtual Grid, I.3 Virtualization techniques, I.4 Emulation-driven simulation, I.5 Networking, I.6 Heterogeneity emulation, I.7 Communication, I.8 Internet emulation, II.1 Engineering techniques, II.2 Mobile objects, II.3 Fault tolerance, II.4 DHT, II.5 Data base, II.6 Scheduling, II.7 Communication optimization, II.8 Data sharing, II.9 Uni- and multicast, II.10 Cellular automaton, II.11 Bioinformatics, II.12 P2P storage, II.13 Reliability, II.14 Security, II.15 Next-generation Internet, II.16 Grid coupled systems.]
Desktop Grid: a central coordinator schedules tasks over volunteer PCs (master-worker paradigm, cycle stealing).
[Figure: volunteer PCs connected to the coordinator through the Internet; each volunteer PC downloads the client application and its parameters and executes it.]
Examples:
– SETI@Home, distributed.net,
– Décrypthon (France),
– Folding@home, Genome@home,
– Xpulsar@home, Folderol,
– Exodus, Peer review,
– Javelin, Bayanihan, JET,
– Charlotte (based on Java),
– Entropia, Parabon,
– United Devices, Platform (AC).
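The coordinator side of this master-worker / cycle-stealing pattern can be sketched in a few lines (an illustration only, not the protocol of any of the systems listed above; the task values are made up): idle volunteers pull the next pending task and post back their result.

```python
# Toy master-worker coordinator: independent tasks handed to whoever asks next.
from queue import Queue

class Coordinator:
    def __init__(self, tasks):
        self.todo = Queue()
        for t in tasks:
            self.todo.put(t)
        self.results = {}

    def request_task(self):
        """Called by an idle volunteer: take the next pending task, if any."""
        return None if self.todo.empty() else self.todo.get()

    def submit_result(self, task, result):
        self.results[task] = result

# Example: volunteers drain the queue one request at a time.
coord = Coordinator(range(4))
while (task := coord.request_task()) is not None:
    coord.submit_result(task, task * task)   # stand-in for the real computation
print(coord.results)                         # {0: 0, 1: 1, 2: 4, 3: 9}
```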
Peer-to-Peer systems: a distributed and self-organizing infrastructure in which every volunteer PC participates in resource discovery and coordination and can act as both client and service provider, so requests can reach all system resources. Uses:
– instant messaging,
– managing and sharing information,
– collaboration,
– distributed storage.
Examples:
– Napster, Gnutella, Freenet,
– KaZaA, Music-city,
– Jabber, Groove,
– Globe (Tann.), Cx (Javalin), Farsite,
– OceanStore (USA),
– Pastry, Tapestry/Plaxton, CAN, Chord,
– Cosm, Wos, peer2peer.org,
– JXTA (Sun), PtPTL (Intel).
[Figure: volunteer PCs connected through the Internet; a client sends a request into the infrastructure, and the volunteers participating in resource discovery/coordination forward it.]
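Several of the systems above (Pastry, Tapestry, CAN, Chord) are structured overlays built on a distributed hash table. A toy sketch of that idea (hypothetical node names, simplified single-hop placement rather than real Chord routing): keys and nodes share one hash ring, and a key is stored on the first node whose identifier follows it.

```python
# Toy DHT placement: consistent hashing on a ring (illustration only).
import hashlib
from bisect import bisect_right

def ring_id(name: str, bits: int = 16) -> int:
    """Map a name onto the hash ring [0, 2^bits)."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (1 << bits)

def responsible_node(key: str, nodes: list) -> str:
    """First node clockwise from the key's position on the ring."""
    ids = sorted((ring_id(n), n) for n in nodes)
    i = bisect_right([nid for nid, _ in ids], ring_id(key))
    return ids[i % len(ids)][1]   # wrap around the ring

nodes = ["pc01.example.org", "pc02.example.org", "pc03.example.org", "pc04.example.org"]
print(responsible_node("some-data-block", nodes))   # the node storing this key
```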
– Symmetry for the negotiation phase.
– Asymmetry for the distribution and execution phases.
– Waves: several hours to get one movie; parallel simulation is required!
Goal: execute existing or new MPI applications.
Programmer's view unchanged: a PC client calls MPI_send(), another PC client calls MPI_recv().
Problems:
1) volatile nodes (any number, at any time),
2) non-named receptions (they must be replayed in the same order as in the previous, failed execution).
Objective summary:
1) automatic fault tolerance,
2) transparency for the programmer and user,
3) tolerate n faults (n being the number of MPI processes),
4) scalable infrastructure/protocols,
5) avoid global synchronizations (ckpt/restart),
6) theoretical verification of protocols.
A toy sketch of the receive-order replay idea follows.
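Here is a hedged sketch of the replay idea behind problem 2 (an illustration of the principle only, not the MPICH-V implementation): the order in which wildcard receives were satisfied is logged during the first execution, and after a failure the restarted process holds back messages until they can be delivered in exactly that order.

```python
# Toy receive-order logging and replay (illustrative only).
from collections import deque

class ReceiveLogger:
    """During the initial run: record which source satisfied each wildcard receive."""
    def __init__(self):
        self.order = []

    def record(self, source: int) -> None:
        self.order.append(source)

class Replayer:
    """During re-execution: deliver messages in the previously logged order."""
    def __init__(self, order):
        self.expected = deque(order)
        self.pending = {}   # messages that arrived ahead of their logged turn

    def deliver(self, source: int, msg):
        self.pending.setdefault(source, deque()).append(msg)
        ready = []
        while self.expected and self.pending.get(self.expected[0]):
            src = self.expected.popleft()
            ready.append((src, self.pending[src].popleft()))
        return ready

# The first run logged that sources were received in the order 2, then 1.
replay = Replayer([2, 1])
print(replay.deliver(1, "a"))   # []: held back, source 2 must be delivered first
print(replay.deliver(2, "b"))   # [(2, 'b'), (1, 'a')]: delivered in logged order
```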