|
Your problem is very likely the PXE server implementation, not the servers or nodes. A typical way of configuring a PXE network boot server is combining the ISC DHCP server with a stand-alone TFTP server. This combination is the one provided by default with most Linux distributions. While it will work with small homogeneous installations, it will not provide reliable PXE service with large or diverse installations. Most large installations have recognized that there is a problem with many simultaneously booting machines. But rather than directly addressing the actual problem (unreliable booting), they have focused on work-arounds such as staggering powering up machines, or detecting machines that failed to boot and power cycling them.
A second issue with using the typical DHCP+TFTP combination is that it is not really a PXE server. Much like the doorbell-triggered tape playback in the movie "Ferris Bueller's Day Off," the DHCP server is providing pre-set responses to packets. This means it can't adapt to variations, such as known bugs in specific PXE client versions. This is something that can be worked around when the clients are homogeneous, but few installations remain homogeneous for long.
To solve this problem for our customers, we wrote our own integrated PXE server to reliably boot compute nodes. It encapsulates much of what we have learned about booting machines. It also
- interprets the initial request to work around bugs in different generations of PXE clients. The BIOS code is unlikely to be fixed, and there are some pretty ugly bugs. (What does a file name of '' mean? Use the last file requested...)
- works around the TFTP capture effect, where clients that drop a packet are squeezed out and quickly give up, leaving the machine powered on but useless.
- defers answering new requests when especially busy, but always responds before the client times out.
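Two of these behaviors are simple enough to sketch. The following is only an illustration of the ideas, not the code of the server described here: the class and function names, the `default` fallback, and the timeout values are hypothetical.

```python
import heapq
import itertools


def resolve_filename(requested, last_requested, default):
    """Treat an empty file name as 'the last file this client asked for'.

    `default` is a hypothetical fallback for a client with no history.
    """
    if requested:
        return requested
    return last_requested or default


class DeferredResponder:
    """Queue requests while busy, but answer each before its client gives up."""

    def __init__(self, client_timeout_s, safety_margin_s):
        # Longest time after arrival that a reply may be held back.
        self.reply_within_s = client_timeout_s - safety_margin_s
        self._heap = []                 # entries: (deadline, sequence, client_id)
        self._seq = itertools.count()   # tie-breaker for equal deadlines

    def defer(self, client_id, now):
        """Record a request that arrived while the server was busy."""
        deadline = now + self.reply_within_s
        heapq.heappush(self._heap, (deadline, next(self._seq), client_id))
        return deadline

    def due(self, now, busy):
        """Return the clients to answer right now, earliest deadline first.

        When idle, drain the whole queue. When busy, release only the
        requests whose deadline has been reached.
        """
        ready = []
        while self._heap and (not busy or self._heap[0][0] <= now):
            ready.append(heapq.heappop(self._heap)[2])
        return ready
```

The earliest-deadline-first queue is one possible design choice: it lets a busy server postpone work without ever holding a reply past the point where the client would give up.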
Just as importantly, Penguin's PXE server uses and updates the single cluster configuration file. Before writing the server we went through several rounds of writing configuration files from other configuration files, and each time we ended up with a fragile implementation that was difficult to debug. So do not do this. Writing a purpose-built PXE server that meets the above criteria should increase the reliability and consistency of your system's performance.
Even without using our server, there are several things you can do to make your network boot server more reliable:
- Verify that the network isn't a source of problems. "Smart" Ethernet switches are more likely to cause problems than non-configurable switches. Likely problems are:
  - Spanning Tree Protocol, which blocks broadcast traffic for 60 seconds after a link is enabled to check for network loops. Unfortunately PXE clients only try to contact a boot server for about 40 seconds.
  - Broadcast packet rate limits, where the switch prevents broadcast packet "storms" from impacting other traffic. Some switches default to rates as low as 16 packets per second. While PXE only needs to exchange a few broadcast packets before switching to directly addressed packets, traffic funneled through cascaded switches quickly exceeds per-port limits.
  - Duplex mismatches. Set network switches to autonegotiate, or leave at CSMA/CD "half duplex".
- Minimize the size of the image you serve over TFTP. This will reduce both the TFTP network traffic and the window of vulnerability.
- Avoid using multicast, except for service discovery. Few PXE servers implement multicast: that's a good thing since multicast opens up a whole catalog of possible problems.
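
The arithmetic behind several of these checks can be sketched in a few lines. This is a rough back-of-the-envelope aid, not a measurement tool. The 60-second, 40-second, and 16-packets-per-second figures come from the list above, and 512 bytes is the default TFTP data block size. The function names and the number of broadcasts per client are assumptions made for illustration.

```python
TFTP_BLOCK_BYTES = 512  # default TFTP data block size when no larger size is negotiated


def stp_outlasts_pxe(stp_block_s=60, pxe_retry_s=40):
    """True if the switch port is still blocked when the PXE client gives up."""
    return stp_block_s >= pxe_retry_s


def tftp_data_blocks(image_bytes, block_bytes=TFTP_BLOCK_BYTES):
    """Number of data blocks needed to send an image.

    A transfer ends with a short (possibly empty) block, hence the + 1.
    Each block is acknowledged before the next is sent, so fewer blocks
    means fewer chances for a lost packet to stall the boot.
    """
    return image_bytes // block_bytes + 1


def broadcasts_fit(clients, broadcasts_per_client, limit_pps, window_s=1.0):
    """True if the clients' broadcasts stay within a per-port rate limit.

    `broadcasts_per_client` is an assumed count for the given window.
    """
    return clients * broadcasts_per_client <= limit_pps * window_s
```

For example, assuming two broadcasts per client within the same second, `broadcasts_fit(8, 2, 16)` holds but `broadcasts_fit(9, 2, 16)` does not, which suggests how quickly a low default limit is reached when many nodes boot at once.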
|