SLIDE 1

Outline Theory Research question Test methods Observations Alternative booting Conclusion and future work Questions

Reliable network booting of cluster computers

Matthew Steggink July 2nd, 2008

Matthew Steggink Reliable network booting of cluster computers

SLIDE 2

Outline

◮ Theory
◮ Research question
◮ Test methods
◮ Observations
◮ Alternative booting
◮ Conclusion and future work
◮ Questions

SLIDE 7

Network booting

◮ Booting off the network instead of local disk
◮ Easily deploy new computers;
◮ Centralized image management;
◮ Possibility of diskless computers;
◮ Involves DHCP, ARP and TFTP
◮ Currently used for network booting: PXELinux

SLIDE 8

The setup

SLIDE 9

Research question

When booting a large number of clients, some do not complete the boot process.

◮ Analyze the points of failure;
◮ Determine why the clients fail;
◮ Search for a solution.

SLIDE 10

Testing

SLIDE 13

Shape the traffic

◮ Limit the traffic to simulate network characteristics
◮ Two options to shape the traffic

  1. VMware Teams
  2. Traffic Control (tc) in Linux: Token Bucket Filter

◮ Lower the rate limit step by step to find the point at which booting fails
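
The Token Bucket Filter idea can be sketched in a few lines. This is a minimal Python simulation, not the Linux implementation: the class name, the parameter names and the drop-when-empty behaviour are assumptions made for illustration (the real tbf qdisc can also queue packets for a bounded time instead of dropping them at once).

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Illustrative token bucket; all names here are hypothetical."""
    rate_bytes_per_s: float  # how fast tokens are refilled
    burst_bytes: float       # bucket capacity (largest allowed burst)
    tokens: float = 0.0
    last_time: float = 0.0

    def __post_init__(self) -> None:
        # Start with a full bucket so an initial burst is allowed.
        self.tokens = self.burst_bytes

    def allow(self, now: float, packet_bytes: int) -> bool:
        """Return True if the packet may pass at time `now` (seconds)."""
        elapsed = now - self.last_time
        self.last_time = now
        # Refill according to elapsed time, never above the capacity.
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.rate_bytes_per_s)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False  # assumption: out-of-profile packets are dropped
```

Lowering `rate_bytes_per_s` between runs mirrors the test approach of reducing the rate until clients start to fail.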

SLIDE 14

Observations - Traffic control

◮ VMware Teams does not shape traffic accurately
◮ tc shapes traffic more reliably

Figure: VMware versus tc traffic control

SLIDE 15

Observations - Fail point

◮ Too much packet loss and not enough bandwidth
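
Why packet loss hurts a lock-step protocol such as TFTP can be illustrated with a rough model. The sketch below is an assumption-laden estimate, not a result from the tests: it assumes independent losses in both directions, a fixed number of attempts per block, 512-byte blocks, and it ignores bandwidth and timeouts. The function names are hypothetical.

```python
def block_success_probability(loss: float, max_attempts: int) -> float:
    """Chance that one block gets through within max_attempts tries.

    One try succeeds only if both the DATA packet and its ACK arrive.
    """
    per_attempt = (1.0 - loss) ** 2
    return 1.0 - (1.0 - per_attempt) ** max_attempts


def transfer_success_probability(file_bytes: int, loss: float,
                                 max_attempts: int,
                                 block_bytes: int = 512) -> float:
    """Chance that every block of a file is delivered."""
    # A TFTP transfer ends with a block shorter than block_bytes
    # (possibly empty), hence the +1.
    blocks = file_bytes // block_bytes + 1
    return block_success_probability(loss, max_attempts) ** blocks
```

Because the per-block probability is raised to the number of blocks, even a small chance of losing a block becomes significant for a kernel image of several megabytes.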

SLIDE 19

Identified problems

◮ DHCP
  ◮ No DHCP Offers, No boot file
◮ ARP
  ◮ ARP Timeout
◮ TFTP
  ◮ TFTP Timeout, Read timeout, illegal operation, server does not support tsize
◮ During downloading (TFTP)
  ◮ Loading vmlinuz... boot failed
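
The "tsize" message presumably refers to the TFTP transfer size option (RFC 2349), which a client adds to its read request and which the server confirms in an option acknowledgment. The sketch below shows how such a request and the possible replies look on the wire; the helper names are hypothetical and no network traffic is sent.

```python
import struct

# TFTP opcodes from RFC 1350 (RRQ, ERROR) and RFC 2347 (OACK).
OP_RRQ, OP_ERROR, OP_OACK = 1, 5, 6


def build_rrq(filename: str, mode: str = "octet", options=None) -> bytes:
    """Build a read request, optionally with options such as tsize."""
    packet = struct.pack("!H", OP_RRQ)
    packet += filename.encode("ascii") + b"\0" + mode.encode("ascii") + b"\0"
    for name, value in (options or {}).items():
        packet += name.encode("ascii") + b"\0" + str(value).encode("ascii") + b"\0"
    return packet


def parse_reply(packet: bytes):
    """Decode an OACK or ERROR reply; other opcodes are returned as-is."""
    (opcode,) = struct.unpack("!H", packet[:2])
    if opcode == OP_OACK:
        fields = packet[2:].split(b"\0")[:-1]  # drop empty tail after last NUL
        pairs = zip(fields[0::2], fields[1::2])
        return ("oack", {k.decode().lower(): v.decode() for k, v in pairs})
    if opcode == OP_ERROR:
        (code,) = struct.unpack("!H", packet[2:4])
        message = packet[4:].rstrip(b"\0").decode("ascii", "replace")
        return ("error", code, message)
    return ("other", opcode)
```

A client asks for the file size with `build_rrq("pxelinux.0", options={"tsize": 0})`. A server without option support answers with plain data or an error instead of an OACK.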

SLIDE 21

Booting by TCP / HTTP using gPXE

◮ gPXE is an open source project
◮ TCP delivers data reliably through acknowledgments and retransmissions
◮ Two deployment methods

  1. gPXE flashed into the boot ROM
  2. gPXE used as a second stage loader
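
On the server side, booting over HTTP only needs a web server that hands out the boot files. The sketch below uses Python's standard library to show the idea; it is not the server used in the tests, and the directory layout, the file name `vmlinuz` and the loopback address are assumptions for illustration.

```python
import functools
import http.client
import tempfile
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


def start_boot_server(directory: str, host: str = "127.0.0.1",
                      port: int = 0) -> ThreadingHTTPServer:
    """Serve the files in `directory` over HTTP in a background thread.

    Port 0 lets the operating system pick a free port.
    """
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    server = ThreadingHTTPServer((host, port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def demo() -> bytes:
    """Serve a placeholder kernel image and fetch it back over TCP/HTTP."""
    with tempfile.TemporaryDirectory() as root:
        Path(root, "vmlinuz").write_bytes(b"placeholder kernel image")
        server = start_boot_server(root)
        try:
            port = server.server_address[1]
            conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
            conn.request("GET", "/vmlinuz")
            body = conn.getresponse().read()
            conn.close()
            return body
        finally:
            server.shutdown()
            server.server_close()
```

In a real cluster a production web server would take this role; the point is only that the transfer runs over TCP, so lost packets are retransmitted by the transport layer.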

SLIDE 22

gPXE results

◮ gPXE is easy to use, only a few extra lines of code
◮ No alterations to the clients are needed
◮ It was compatible with mainstream boot ROMs (tested: Intel, Broadcom, Nvidia)
◮ Connections are more reliable; no connections were aborted during testing
◮ Disadvantage at this point:
  ◮ Introduces a second DHCP transaction
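
The second DHCP transaction arises when gPXE is used as a second stage loader: the boot ROM asks for a boot file first, and gPXE asks again once it is running. The DHCP server then has to answer the two requests differently. The sketch below models that decision in Python, assuming the server distinguishes gPXE by its DHCP user class; the function name, the URL and the chosen image name are illustrative, not taken from the test setup.

```python
from typing import Optional

# gPXE identifies itself in its DHCP request with this user class.
GPXE_USER_CLASS = "gPXE"


def select_boot_filename(user_class: Optional[str]) -> str:
    """Pick the boot file for a DHCP request (illustrative logic only)."""
    if user_class == GPXE_USER_CLASS:
        # Second request, sent by gPXE itself: point it at HTTP.
        return "http://bootserver.example/boot.gpxe"  # hypothetical URL
    # First request, sent by the PXE boot ROM: hand out the gPXE image,
    # which the ROM still fetches over TFTP.
    return "undionly.kpxe"
```

Without this distinction the client would load gPXE again on every request and never reach the HTTP stage.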

SLIDE 23

Situations compared

SLIDE 24

Conclusion

◮ gPXE is ready to deploy with only minor alterations;
◮ The current setup should not use TFTP;
◮ Connections are more reliable with gPXE and TCP/HTTP;
◮ Results:
  ◮ DHCP is still the bottleneck
  ◮ TFTP bottlenecks have been solved

SLIDE 25

Future work

◮ Eliminate the second DHCP transaction
◮ Investigate whether another DHCP server performs better

SLIDE 26

Questions

◮ Matthew Steggink
  matthew.steggink@os3.nl