Friday 8 March 2013

A RHEL 6/Centos 6 HA Cluster for LAN Services switching to EXT4 (Part 3)

The Startup and Shutdown Problem
When I was having issues on my cluster, I found it frustrating that the cluster services would come up automatically during a node reboot. It was overly time consuming to have to boot single user and chkconfig services off (I know there is the "nocluster" kernel boot flag, but it doesn't cover DRBD). I'd prefer to have a completely "up" full multiuser system before it even thinks about cluster services; this makes repairing it easier.

Another issue I have is that my service dependency of "clvmd" on "drbd" (which I described and set up in my original GFS2 cluster setup blog) keeps being overwritten by updates. This is annoying, and I often forget to reset it when I update.
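
For reference, the sort of edit that keeps being lost looks roughly like this (a sketch only; the dependency may be expressed differently in your clvmd init script, e.g. by bumping start priorities rather than a Required-Start line):

# In /etc/init.d/clvmd, make drbd a prerequisite (header lines illustrative):
#
#   # chkconfig: - 24 76
#   # Required-Start: drbd
#
# then rebuild the rc symlinks so the new ordering takes effect:
chkconfig --del clvmd
chkconfig --add clvmd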

I also found I wanted a greater level of paranoia in my startup scripts before starting cluster services. For example, I had issues in the past with GFS2 starting on a DRBD that wasn't fully synced (it generated oopses). This was likely just due to the overhead of the rebuild, but I decided I'd like a fully up-to-date DRBD on a node before starting services (even though I'm now using ext4). DRBD doesn't generally take long to sync up, as it can determine changed blocks quickly (a few minutes, providing a node hasn't been down too long).

So I decided to cut through all this and create my own startup and stop scripts to meet my needs.

The second major issue I have with my new demand-mounted cluster setup is that filesystems shared out via NFS have a nasty habit of not unmounting when the service is stopped. Sometimes they successfully unmount a couple of minutes after the service has stopped. Restarting the "nfs" service usually frees them, but even then they can occasionally stick.
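
For example, freeing a stuck export by hand usually amounts to something like this (the mount point is just the one from my setup):

# Kick the NFS daemons so they let go of the filesystem, then retry the umount
service nfs restart
umount /data/home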

A filesystem not unmounting wasn't such a big deal with GFS2, as we are allowed to have the filesystem mounted on both nodes. But with our ext4 filesystem this is a big no-no, so a stuck mount can stop us flipping the service to a new node.

A service will get marked as failed if it fails to unmount a filesystem when stopped. There isn't really a lot of choice in this situation. We can restart the service on its original node (though it will need to be disabled again first, and prudence suggests the "-m" flag to "clusvcadm" when re-enabling it, to force it onto the correct node). Starting the service on the node that doesn't have the hung mount will cause the filesystem to be mounted on both nodes at once and will likely result in filesystem corruption (a double disable using clusvcadm removes the cluster's check that it isn't mounted anywhere). So high levels of care and checking are needed in this situation (check which mounts are on which nodes), as sketched below.
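
In practice that recovery looks something like this (service and node names are from my setup; check clustat and the mounts on both nodes before re-enabling anything):

# Confirm the filesystem is not mounted anywhere unexpected
clustat
mount | grep "on /data/"

# Clear the failed service, then force it back onto its original node
clusvcadm -d nfsdhome
clusvcadm -e nfsdhome -m bldg1ux01n2i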

The only other choice is to hard halt the node with the stuck mount. This is really the only option when shutting down a system with a hung mount: we have to ensure that the DRBD storage is in an unused state before the "drbd" service will cleanly stop. If we just continue shutting down with a hung filesystem, drbd will fail to stop, and at some point in the process the network between the nodes will be shut down (at which point both sides could be changing the storage without the other knowing). This could result in a split brain (depending on what the other node is doing). A hard halt is nasty, but your storage is really more important than anything else.

Of course, the other option would be a setup where DRBD runs Primary/Secondary, but I resent having a high-performance node sitting there doing nothing when it could be doing some load balancing for me.

So, as it's likely I can't easily move an NFS service once started, and as the nodes will likely come up at different times (especially with my wait for an up-to-date DRBD before a node joins in), I have set up my cluster in cluster.conf to not start the NFS services automatically (and to prevent them failing back). My startup script handles placing them on the correct nodes.
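
For reference, in cluster.conf that looks roughly like this (a trimmed sketch, resources omitted and the domain name made up): autostart="0" on the service stops rgmanager starting it at boot, and nofailback="1" on its failover domain stops it migrating back when its preferred node returns.

<rm>
    <failoverdomains>
        <failoverdomain name="node1dom" nofailback="1" ordered="1" restricted="0">
            <failoverdomainnode name="bldg1ux01n1i" priority="1"/>
            <failoverdomainnode name="bldg1ux01n2i" priority="2"/>
        </failoverdomain>
    </failoverdomains>
    <service autostart="0" domain="node1dom" name="nfsdprojects" recovery="relocate">
        ...
    </service>
</rm>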

So I wanted startup and stop scripts that should handle all this.

Firstly I chkconfig'd off cman, drbd, clvmd, rgmanager and ricci on both nodes.
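
That is, on each node:

for svc in cman drbd clvmd rgmanager ricci ; do
    /sbin/chkconfig $svc off
done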

I make no claims for the scripts below; they aren't overly clean and were put together quickly to get things going. They assume I have only my two nodes and that the only NFS services the systems have are cluster ones. Also, somewhat nastily, the node names are hard coded in the body of the scripts.

First up is my /etc/rc.d/init.d/lclcluster startup script, which basically calls another script to do the actual work. It needs a chkconfig --add lclcluster and then a chkconfig lclcluster on.


#!/bin/sh
#
# Startup script for starting the Cluster with paranoia
#
# chkconfig: - 99 00
# description: This script starts all the cluster services
# processname: lclcluster
# pidfile:

# Source function library.
. /etc/rc.d/init.d/functions

# See how we were called.
case "$1" in
  start)
    if [ -f /tmp/shutmedown ] ; then
        rm -f /tmp/shutmedown
        /sbin/chkconfig apcupsd on
        echo 'shutdown -h now' | at now+1minutes
        echo Going Straight to shutdown
        failure
                echo
                RETVAL=1
        exit 1
    fi
        echo -n "Starting lclcluster: "
    if ! [ -f /var/lock/subsys/lclcluster ] ; then
        if [ -f /usr/local/sbin/clusterstart ] ; then
            /usr/local/sbin/clusterstart initd >/usr/local/logs/clusterstart 2>&1 &
        fi
            touch /var/lock/subsys/lclcluster
        RETVAL=0
        success
        echo
    else
        echo -n "Already Running "
        failure
        echo
        RETVAL=1
    fi
        echo
        ;;
  stop)
        echo -n "Stopping lclcluster: "
    if [ -f /var/lock/subsys/lclcluster ] ; then
        if [ -f /usr/local/sbin/clusterstop ] ; then
            /usr/local/sbin/clusterstop initd >/usr/local/logs/clusterstop 2>&1
        fi
            rm -f /var/lock/subsys/lclcluster
        success
        echo
        RETVAL=0
    else
        echo -n "Not running "
        failure
        echo
        RETVAL=1
    fi
        ;;
  restart)
    $0 stop
    $0 start
    RETVAL=$?
    ;;
  *)
        echo "Usage: $0 {start|stop|restart}"
        exit 1
esac

exit $RETVAL

Then my /usr/local/sbin/clusterstart script.
  
#!/bin/bash

# Script to start all cluster services in the correct order

# If passed as a startup script wait 2 minutes for the startup to happen
# this will allow us to prevent cluster startup by touching a file
if [ "$1" = "initd" ] ; then
        echo "Startup Script called me waiting 2 minutes"
        sleep 120
fi

# If this file exists, do not start the cluster
if [ -f /tmp/nocluster ] ; then
    echo "Told not to start cluster. Exiting..."
    exit 1
fi

# Start cman
/etc/init.d/cman start

sleep 2

# Start DRBD
/etc/init.d/drbd start

# Loop until consistent
ret=1
ret2=1
count=0
while ( [ $ret -ne 0 ] || [ $ret2 -ne 0 ] ) &&  [ $count -lt 140 ] ; do
    grep "ro:Primary/" /proc/drbd
    ret=$?
    grep "ds:UpToDate/" /proc/drbd
    ret2=$?
    echo "DRBD Count: $count Primary: $ret UpToDate: $ret2"
    count=`expr $count + 1`
    sleep 10
done

if [ $ret -ne 0 ] || [ $ret2 -ne 0 ] ; then
    echo "DRBD didn't sync in a timely manner. Not starting cluster services"
    exit 1
fi

if [ -f /tmp/splitbrain ] ; then
    echo "DRBD Split Brain detected. Not starting cluster services"
    exit 1
fi
  
echo "DRBD Consistent"

# Start clvmd
/etc/init.d/clvmd start

sleep 2

# Start rgmanager
/etc/init.d/rgmanager start

sleep 2

# Start ricci
/etc/init.d/ricci start

# If this file exists, do not start the cluster nfsds
if [ -f /tmp/nonfsds ] ; then
    echo "Told not to start NFS Services. Exiting..."
    exit 1
fi

# Waiting for things to settle
echo "Waiting for things to settle"
sleep 20

# Try to start my NFSD services
echo "Starting my NFSD services"
declare -A NFSD
NFSD[bldg1ux01n1i]="nfsdprojects"
NFSD[bldg1ux01n2i]="nfsdhome"


for f in ${NFSD[`hostname`]} ; do
    clustat | grep $f | grep -q disabled
    res=$?
    if [ $res -eq 0 ] ; then
        echo Starting $f
        clusvcadm -e $f -m `hostname`
    fi

    # If any of these were mine before and failed, bring them up again
    clustat | grep $f | grep `hostname` | grep -q failed
    res=$?
        if [ $res -eq 0 ] ; then
                echo Starting $f
        clusvcadm -d $f
        sleep 1
                clusvcadm -e $f -m `hostname`
        fi

done

echo "Check the other node's services after storage is up"

# Wait to see if the other node starts its services, or I will
echo "Waiting to see if the other node starts its services or not"
echo "First wait to see if its DRBD is consistent"

# Loop until consistent
ret=0
count=0
while [ $ret -eq 0 ]  &&  [ $count -lt 140 ] ; do
    grep "sync" /proc/drbd
    ret=$?
    echo "DRBD Count: $count syncing: $ret"
    count=`expr $count + 1`
    sleep 10
done

if [ $ret -eq 0 ] ; then
    echo "DRBD didn't sync on my partner, continuing anyway"
else  
    echo Give it a chance to start some services. Waiting....
    sleep 180
fi

if  [ "`hostname`" = "bldg1ux01n1i" ] ; then
    othermach="bldg1ux01n2i"
else
    othermach="bldg1ux01n1i"
fi

# Start anything the other node hasn't claimed
echo "What didn't the other node take"

for f in ${NFSD[$othermach]} ; do
    clustat | grep $f | grep -q disabled
        res=$?
        if [ $res -eq 0 ] ; then
                echo Starting $f
                clusvcadm -e $f -m `hostname`
        fi
done

# With all our apcupsd shenanigans, check that it's enabled
/sbin/chkconfig --list apcupsd | grep :on; ret=$?
if [ $ret -ne 0 ] ; then
    echo "Fixing apcupsd service"
    /sbin/chkconfig apcupsd on
    /etc/init.d/apcupsd start
fi

echo Cluster Start Complete


Some commentary on this script: on startup it will wait 2 minutes, which gives you time to touch /tmp/nocluster if you don't want the cluster to start after the rest of the machine has booted. We then start cman followed by DRBD, and the script loops checking whether the local DRBD is Primary and UpToDate, waiting until it is. If this doesn't happen in a timely manner, it stops.
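
For reference, the two greps in that loop are matching the status line in /proc/drbd, which (DRBD 8.3-style output shown) looks roughly like this once the local node is Primary and fully synced:

 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

While a resync is still running, /proc/drbd also shows a progress line containing "sync'ed:", which is what the later "sync" grep keys on.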

We also check that we haven't split brained. This works by having a line in my /etc/drbd.d/global_common.conf that points at a local version of the split brain handler:

 handlers {
        split-brain "/usr/local/sbin/notify-split-brain.sh root";
        out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
    }

This handler is a copy of the shipped DRBD split brain handler, but it also touches a flag file. Here is the fragment that was changed:

 case "$0" in
    *split-brain.sh)
        touch /tmp/splitbrain
        SUBJECT="DRBD split brain on resource $DRBD_RESOURCE"
        BODY="
        ...

The rest of the handler is as shipped. You will have to fix the split brain issue yourself.

The script then starts clvmd, rgmanager and ricci. Now we start the NFS services that should run on this node (hard coded in the script), i.e. the ones marked as disabled. Any marked "failed" that were previously this node's are restarted (these will likely be services whose filesystem failed to unmount).

We then wait for the other node's DRBD to be in sync and give it a chance to start its own services (if it fails to sync in a timely manner we just press on). Any NFS services the other node should have taken but, for whatever reason, hasn't, we then start here. The reason for the apcupsd service changes will become clear later.

So in this script I'm valuing load balancing and safety over fast start up time of services.  

Now the /usr/local/sbin/clusterstop script:

#!/bin/bash

# Script to stop all cluster services in the correct order

# Stop rgmanager
/etc/init.d/rgmanager stop

# Stop ricci
/etc/init.d/ricci stop

# Clear mount by restarting nfsd
/etc/init.d/nfs condrestart

sleep 20

# Unmount any filesystems still around
# Keep trying until all mounts are gone or we timeout
echo "Unmount all cluster filesystems"
retc=0
count=0
while   [ $retc -eq 0 ] && [ $count -lt 60 ] ; do

    umop=""
    for mnts in `mount | grep "on /data/" | cut -d' ' -f3`; do
        umount $mnts
        ret=$?
        umop=$umop$ret
    done
   
    echo $umop | grep -q '1'
    retc=$?
    echo "Count: $count     umop: $umop retc: $retc"
    count=`expr $count + 1`
    sleep 2
done

# Drop out unless shutting down (retc is still 0 if the umounts were still failing)
if [ $retc -eq 0 ] && [ "$1" != "initd" ] ; then
    echo
    echo "Failed to unmount all cluster file systems"
    echo "Still mounted:"
    mount | grep "on /data/"
    exit 1
   
fi

if [ $retc -eq 0 ] ; then
    echo "Failed to umount all cluster file systems but having to continue anyway"
    # If other node is offline we should be OK to unclean shutdown
    clustat | grep -q " Offline"
    ret=$?
    grep -q "cs:WFConnection" /proc/drbd
    ret2=$?

    echo "Is other node offline, check clustat and drbd status"
    echo "Clustat status want 0: $ret , drbd unconnected want 0: $ret2"
    if [ $ret -eq 0 ] && [ $ret2 -eq 0 ] ; then
        echo "Looks like I am the last node standing, go for unclean shutdown"
    else
        echo "Halt this node"
        # Only flag if truly shutting down
        if [ $RUNLEVEL -eq 0 ] ; then
            echo "Flag to shutdown straight away on quick boot"
            # Disable apcupsd so I can control the shutdown (no double shutdown)
            /sbin/chkconfig apcupsd off
            touch /tmp/shutmedown
        fi
        # Try and be a little bit nice to it
        sync; sync
        /sbin/halt -f
    fi
fi

sleep 2

# Stop Clustered LVM
/etc/init.d/clvmd stop

sleep 2

# Stop drbd service
/etc/init.d/drbd stop

sleep 2

# If DRBD is still loaded and we got here, we need to failout
/sbin/lsmod | grep -q drbd
res=$?

# Drop out unless shutting down
if [ $res -eq 0 ] && [ "$1" != "initd" ] ; then
    echo
    echo "The drbd module is still loaded"
    echo "Something must be using it"
    exit 1
   
fi

if [ $res -eq 0 ] ; then
    echo "The drbd module is still loaded"
    # If other node is offline we should be OK to unclean shutdown
    clustat | grep -q " Offline"
    ret=$?
    grep -q "cs:WFConnection" /proc/drbd
    ret2=$?
   
    echo "Is other node offline, check clustat and drbd status"
    echo "Clustat status want 0: $ret , drbd unconnected want 0: $ret2"
    if [ $ret -eq 0 ] && [ $ret2 -eq 0 ] ; then
        echo "Looks like I am the last node standing, go for unclean shutdown"
    else
        echo "Halt this node"
        # Only flag if truly shutting down
        if [ $RUNLEVEL -eq 0 ] ; then
            echo "Flag to shutdown straight away on quick boot"
            # Disable apcupsd so I can control the shutdown (no double shutdown)
            /sbin/chkconfig apcupsd off
            touch /tmp/shutmedown
        fi
        # Try and be a little bit nice to it
        sync; sync
        /sbin/halt -f
    fi
fi

sleep 2

# Stop cman
/etc/init.d/cman stop

echo "Cluster Stop Complete"


Now some commentary on the clusterstop script. We stop rgmanager, which should stop all services on this node and unmount all the clustered filesystems (we also stop ricci). We then restart the "nfs" service, which should allow us to unmount any that didn't unmount automatically. I've sometimes seen this take a couple of minutes, so we retry the umounts every 2 seconds for 2 minutes.

If we fail to unmount everything, we drop out, but that isn't an option when shutting down. In that case we check the status of drbd: if we are the only node left, we can shut down without unmounting these filesystems (we have the most up-to-date copy of the storage, so there will be no inconsistency).

If we aren't the last node standing, we need to halt this node. This isn't very pleasant, but storage consistency is more important than this node. We try to be a little bit nice about it by syncing anything unsaved back to disk.

Sadly, when you halt a node, the surviving node will fence and reboot it, but we want it to stay fully shut down. The only way I found around this was to set a flag file that causes the init script (earlier in this article) to shut the machine down again (so the sequence is: hard halt, then fence-initiated reboot with no cluster started, then clean shutdown).

The reason I chkconfig apcupsd off is that we found that if the surviving node was itself shutting down by that point, this special reboot cycle would be interrupted: apcupsd would notice the other node shutting down (if that node is the one with the UPS attached; my nodes share a UPS) and would shut this node down straight away. That would stop my shutdown trick working, so on the next boot (power restore) this node would shut itself down again (due to the presence of the /tmp/shutmedown file), which is not desirable. Obviously this isn't relevant if you don't own an APC UPS, but similar issues may apply with other UPSes. So in my case, during this quick reboot cycle, we ensure apcupsd isn't running and correct that on the next boot.

On UPS shutdowns, we generally program one node (the one that DOESN'T have the UPS cable attached to it) to shut down several minutes earlier than the one with the UPS directly attached. This way we can be pretty sure which node has the latest version of the DRBD storage, should any incident occur.
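
The staggering itself is done in apcupsd rather than the cluster. A rough sketch of the idea (values are examples; exactly how the cable-less node monitors the UPS, e.g. as an apcupsd network client, depends on your setup):

# /etc/apcupsd/apcupsd.conf on the node WITHOUT the UPS cable:
# start a shutdown after 5 minutes on battery
TIMEOUT 300

# /etc/apcupsd/apcupsd.conf on the node WITH the cable attached:
# hold on longer so this node ends up with the newest DRBD data
TIMEOUT 600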

Phew!!

STOP PRESS


Red Hat seem to have noticed the problem with cluster NFS services failing to unmount their filesystems every time. In RHEL 6.4 they have added an option (nfsrestart) to the clusterfs and fs resources that forces an NFS restart when a service is stopped; this should allow the filesystem to be cleanly unmounted by the cluster every time. I haven't tested this yet, but obviously all of the above still helps with the other issues (and it still pays to check a umount has actually occurred). I'm also not sure what effect this will have on any other NFS services on the node.

This will be used like:

<fs device="/dev/cluvg00/lv00home" force_fsck="1" force_unmount="1" mountpoint="/data/home" name="homefs" nfslock="1" options="acl" quick_status="0" self_fence="0" nfsrestart="1" />


