[ale] Linux HA
Jeremy T. Bouse
jeremy.bouse at undergrid.net
Wed Oct 31 15:31:19 EDT 2007
Christopher Fowler wrote:
> I've been testing some stuff in regards to Linux HA today. Normally we
> sell 2 servers. One is a "master" and the other is a "slave". I've
> been testing today the capability to use a floating IP address and allow
> the slave to take over for the master. I have a few issues that do need
> to be resolved before I can roll this out. In my lab and colo I
> experienced 2 issues that HA could not have saved me from.
>
> #1. Kernel not responding.
>
> In this case I can ping the server. All connect() calls from clients
> seem to hang until they time out. In this scenario my slave will take
> the IP address but the master will still have it and still answer pings.
> He will also still answer ARP requests. HA can't save me here.
>
You can use a watchdog that reboots the system if nothing writes to the
watchdog device within its timeout. That should keep a kernel hang from
holding up the HA failover, since the hung master releases the floating IP
when it reboots.
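A minimal sketch of the watchdog idea, assuming the standard Linux watchdog device semantics (opening `/dev/watchdog` arms the timer, any write resets it, writing `V` before close requests a disarm). The function name, the `healthy` hook and the 10-second interval are illustrative assumptions; in practice a packaged watchdog daemon would normally do this job. A hardware watchdog is the safer choice here, because the software `softdog` driver relies on the kernel itself still running timers.

```python
import os
import time
from typing import Callable, Optional


def run_watchdog_feeder(path: str = "/dev/watchdog",
                        interval: float = 10.0,
                        iterations: Optional[int] = None,
                        healthy: Callable[[], bool] = lambda: True) -> int:
    """Feed a Linux watchdog device while healthy() keeps returning True.

    Returns how many times the device was fed. If this loop stops running
    (kernel hang, process wedged, or healthy() fails), the timer expires
    and the watchdog resets the machine.
    """
    fed = 0
    fd = os.open(path, os.O_WRONLY)  # opening /dev/watchdog arms the timer
    try:
        while iterations is None or fed < iterations:
            if not healthy():
                # Stop feeding and skip the magic close, so the timer keeps
                # running and the box reboots, releasing the floating IP.
                return fed
            os.write(fd, b"\0")  # any write resets ("pets") the timer
            fed += 1
            time.sleep(interval)
        # Orderly shutdown: writing 'V' before close asks the driver to
        # disarm the timer (drivers built with nowayout ignore this).
        os.write(fd, b"V")
    finally:
        os.close(fd)
    return fed
```

The `healthy` hook is a hypothetical extension point: a node could stop feeding the watchdog when a local check fails, turning a silent failure into a reboot.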
> #2. Kernel and programs still respond but disks are off
>
> In this case I/O to the drives was hosed. Apache would serve up pages that
> were in memory, but any request for a page on disk would result in that
> connection hanging forever. No I/O possible. In this scenario the
> heartbeat agent will probably still see a server that is working, but in
> reality it would be a DoS condition. Also, upon seeing this issue I'm still
> left with a server who will not relinquish his IP address.
>
What kind of I/O are you using? I believe that if you're using Fibre
Channel, you can use the hbaping option to add your device as a ping
node to check.
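hbaping only applies to Fibre Channel. For other storage, one generic approach (an assumption, not something from this thread) is a probe that actually exercises the disk path with a timeout, since a heartbeat packet never touches the disks. A minimal Python sketch; the function name and the `.ha-io-probe` file name are hypothetical:

```python
import os
import threading


def disk_io_ok(probe_dir: str, timeout: float = 5.0) -> bool:
    """Return True only if a small write + fsync in probe_dir completes in time.

    Unlike a heartbeat packet, this actually touches the storage path, so it
    fails when the disks are "off" even though the kernel still answers.
    """
    result = {"ok": False}

    def attempt() -> None:
        path = os.path.join(probe_dir, ".ha-io-probe")  # hypothetical file name
        try:
            with open(path, "wb") as f:
                f.write(b"probe")
                f.flush()
                os.fsync(f.fileno())  # block until the data reaches the device
            result["ok"] = True
        except OSError:
            result["ok"] = False

    # A thread stuck in uninterruptible disk I/O cannot be killed, so run the
    # probe in a daemon thread and just stop waiting for it after `timeout`.
    worker = threading.Thread(target=attempt, daemon=True)
    worker.start()
    worker.join(timeout)
    return not worker.is_alive() and result["ok"]
```

Using a daemon thread means a wedged probe never blocks the caller or interpreter exit; a failed result could then be used to stop the node's heartbeat or force a reboot so the peer can take over.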
> In both cases it seems my only recourse is to allow my slave to also
> control the power of the master. If #1 or #2 occurs, the slave can
> simply take the floating IP and determine whether he needs to kill
> power. If so, he can kill power and then the master can be repaired.
>
> Ideas?
>
Are you using a crossover cable and a serial cable as additional links to
monitor your cluster nodes? What about a PDU that can be controlled through
the STONITH module?
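The slave-side logic Chris describes, combined with a STONITH-controlled PDU, can be sketched roughly as below. This is a minimal Python sketch, not Heartbeat's API: `power_off_peer` is assumed to wrap whatever PDU/STONITH control is available, `take_floating_ip` is assumed to bring the address up, and the three-failure threshold is arbitrary. One ordering difference from the proposal is deliberate: in the usual STONITH sequence the slave powers the master off *before* taking the floating IP, because the hung master in issue #1 keeps answering ARP. Note also that a bare TCP connect may succeed while the application is wedged (the kernel completes the handshake), so a real check should issue an actual request, which would also catch issue #2.

```python
import socket
import time
from typing import Callable


def service_answers(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connect() to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def failover(check: Callable[[], bool],
             power_off_peer: Callable[[], bool],
             take_floating_ip: Callable[[], None],
             failures_needed: int = 3,
             interval: float = 1.0) -> str:
    """Slave-side decision: fence the master, then take over its IP."""
    for attempt in range(failures_needed):
        if check():
            return "master healthy"
        if attempt + 1 < failures_needed:
            time.sleep(interval)  # don't fence on a single lost probe
    # The master may still answer ping and ARP, so power it off *before*
    # claiming the floating IP; otherwise two hosts answer for one address.
    if not power_off_peer():
        return "fencing failed; not taking over"
    take_floating_ip()
    return "took over"
```

Usage might look like `failover(lambda: service_answers("10.0.0.1", 80), pdu_off, claim_ip)`, where the address, `pdu_off` and `claim_ip` are all hypothetical placeholders. Refusing to take over when fencing fails avoids the duplicate-IP situation Chris ran into.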
> Chris