Saturday 9 March 2013

Getting Sendmail Relaying Everything He Doesn't Know About

Sendmail is a really powerful email tool, but if, as in my case, it isn't the final delivery point in your organisation, it can be quite tricky to set up. This is especially true if you'd still like sendmail to be the delivery point for some of your email (mailing lists, scripts, bit buckets etc., where sendmail is so handy).

Basically I wanted all email for domains that I didn't host on this sendmail server (i.e. anything not @my.domain) to go to an external relay host. Any email addressed to @my.domain, and any unqualified addresses, should be treated the same: run through the aliasing process and then delivered to another server.

So I wanted:

blah@other.domain -> relay host

blah -> aliasing (if an alias exists, otherwise unchanged) -> relay host as blah@my.domain (the name may have been changed by aliasing)

blah@my.domain -> aliasing (if an alias exists, otherwise unchanged) -> relay host as blah@my.domain (the name may have been changed by aliasing)

This is surprisingly tricky. I searched around for a while and found a few approaches.

This guy's approach of using a modified nullclient got me close:
http://brandonhutchinson.com/wiki/Nullclient_with_alias_processing

As I remember, though, this failed for me on blah@my.domain addresses, which I could never get to go through aliasing.

The correct approach seemed to be to set:

define(`SMART_HOST', `exchange.my.domain')dnl
define(`LOCAL_RELAY', `exchange.my.domain')dnl

Provided masquerading is set up properly, this worked for all cases for me, with one fatal flaw: if an address is unqualified, the LOCAL_RELAY option sends the email on to the relay as blah@exchange.my.domain rather than blah@my.domain. In our case the relay is Exchange, and it just rejects that.

I struggled with this for a while. Others have modified LUSER_RELAY so that it takes the domain name you'd like appended, rather than appending the LOCAL_RELAY host name. Such as here:

http://www.jmaimon.com/sendmail/anfi.homeunix.net/sendmail/relaycd.html

I quite liked this, but didn't want to maintain modified copies of the base m4 files shipped with the OS, as they might get overwritten by upgrades.

So I took the approach of creating a copy of the relay delivery agent under a new name (modrelay), pointing it at a new ruleset (MasqSMTPTwist), which is a copy of MasqSMTP's rules with a single extra rule at the top to remove the relay name from the email address and replace it with the preferred domain name. So I added this to the bottom of my sendmail.mc file:


LOCAL_RULE_0
Mmodrelay,              P=[IPC], F=mDFMuXa8, S=EnvFromSMTP/HdrFromSMTP, R=MasqSMTPTwist, E=\r\n, L=2040,
                T=DNS/RFC822/SMTP,
                A=TCP $h


LOCAL_RULESETS
SMasqSMTPTwist
R$* < @ exchange.my.domain > $*         $@ $1 < @ my.domain > $2        rewrite relay name to preferred domain
R$* < @ $* > $*         $@ $1 < @ $2 > $3               already fully qualified
R$+                     $@ $1 < @ *LOCAL* >             add local qualification


Then I called this from LOCAL_RELAY by putting "modrelay:" in front of the relay's name, e.g.


define(`SMART_HOST', `exchange.my.domain')dnl
define(`LOCAL_RELAY', `modrelay:exchange.my.domain')dnl


It's a little bit messy but works for me.
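To sanity-check the rewriting before putting it live, sendmail's address test mode can be used; a quick sketch (the output format varies slightly between versions, and the address is just an example):

# after rebuilding sendmail.cf from sendmail.mc
sendmail -bt <<'EOF'
3,MasqSMTPTwist blah@exchange.my.domain
EOF

The address should come out rewritten with my.domain in place of the relay's name (ruleset 3 is applied first to get the address into its tokenised < @ > form).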


A couple more clustered services (Squid and Sendmail)

I wanted to cluster a couple more services that I hadn't originally needed to. One was Squid, so I could make sure my Internet proxy service was highly available. The other was sendmail.

Clustered Squid
Clustering Squid is relatively trivial. I created a clustered logical volume for Squid with an ext4 filesystem (exactly as for my previous services), sized to suit the cache directory requirements. Then I put the following in /etc/sysconfig/squid on both nodes:

SQUID_CONF="/data/squid/etc/squid/squid.conf"

I copied /etc/squid to /data/squid/etc/squid, preserving permissions.
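Something like this, with the clustered filesystem mounted on the node doing the copy:

mkdir -p /data/squid/etc
cp -a /etc/squid /data/squid/etc/      # -a preserves permissions and ownership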

In /data/squid/etc/squid/squid.conf I had the following lines (as well as anything else I needed for Squid):


http_port corp01clusquid:3128
tcp_outgoing_address 10.1.2.2

# Cache Dir
cache_dir ufs /data/squid/var/cache/squid 76800 16 256

# Log directories
cache_access_log /data/squid/var/log/squid/access.log
cache_log /data/squid/var/log/squid/cache.log
cache_store_log /data/squid/var/log/squid/store.log

visible_hostname corp01clusquid

These direct all the logs and the cache to the clustered filesystem, so all these subdirectories must exist under /data/squid (i.e. /data/squid/var/log/squid, /data/squid/var/cache/squid etc.) with permissions and ownership matching the originals on the local filesystem.
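A rough sketch of creating those (assuming Squid runs as the squid user, as it does by default on RHEL/CentOS; squid -z then builds the cache swap directories):

mkdir -p /data/squid/var/log/squid /data/squid/var/cache/squid
chown -R squid:squid /data/squid/var/log/squid /data/squid/var/cache/squid
squid -z -f /data/squid/etc/squid/squid.conf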

Also, corp01clusquid is the name your clients will connect to, so it needs to be in DNS (either directly or via a CNAME) and resolve to the IP address of the cluster service (in this case 10.1.2.2). With the parameters above, this is also the IP address that outgoing requests will come from.

Here are the fragments from the cluster.conf required for this:


<ip address="10.1.2.2" monitor_link="0"/>
                        <fs device="/dev/cluvg00/lv00squid" force_fsck="1" force_unmount="1" mountpoint="/data/squid" name="squidfs" nfslock="1" options="acl" quick_status="0" self_fence="0"/>
               
 <service autostart="1" domain="corp01clusB" exclusive="0" name="squid" recovery="relocate">
                        <script file="/etc/init.d/squid" name="squid"/>
                        <ip ref="10.1.2.2"/>
                        <fs ref="squidfs"/>
</service>

Clustered Sendmail
Sendmail is also relatively easy to cluster, with one wrinkle: I still want both nodes to be able to send email locally even when they don't hold the sendmail service.

For this service I again created a clustered logical volume for sendmail with an ext4 filesystem (exactly as for my previous services), sized to suit the mail spooling needs.

The fragments from my cluster.conf required for this:

 <ip address="10.1.2.3" monitor_link="0"/>          
 <fs device="/dev/cluvg00/lv00sendmail" force_fsck="1" force_unmount="1" mountpoint="/data/sendmail" name="mailfs" nfslock="1" options="acl" quick_status="0" self_fence="0"/>

<service autostart="1" domain="corp01clusB" exclusive="0" name="mail" recovery="relocate">
                        <script file="/etc/init.d/sendmail" name="mail"/>
                        <ip ref="10.1.2.3"/>
                        <fs ref="mailfs"/>
</service>

Then in my sendmail.mc I need lines to redirect the queue and the alias file to the clustered directory. I also have lines to make sendmail listen on, and send from, the clustered IP:

define(`ALIAS_FILE', `/data/sendmail/etc/aliases')dnl
define(`QUEUE_DIR', `/data/sendmail/var/spool/mqueue')dnl

DAEMON_OPTIONS(`Port=smtp,Addr=10.1.2.3, Name=MTA')dnl
CLIENT_OPTIONS(`Family=inet, Address=10.1.2.3')dnl

I don't do local delivery on this box, but if you do you may well want to direct local delivery to the clustered directory too; you'll need to find a suitable sendmail m4 option for that.

I put the aliases file in the clustered directory so I only have to edit a single copy (i.e. not one on each node). You'll then have to create all the directories referred to in the options above and ensure their permissions and ownerships match the originals in the local filesystem.
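A minimal sketch of that preparation (run on the node currently holding the clustered filesystem; the --reference options just copy ownership and permissions from the originals):

mkdir -p /data/sendmail/etc /data/sendmail/var/spool/mqueue
chown --reference=/var/spool/mqueue /data/sendmail/var/spool/mqueue
chmod --reference=/var/spool/mqueue /data/sendmail/var/spool/mqueue
cp -p /etc/aliases /data/sendmail/etc/aliases
newaliases      # rebuild the aliases db in its new location (once sendmail.cf has been regenerated)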

Now, to ensure both nodes can still send email locally (useful for alerts etc.), you need the following in submit.mc:

FEATURE(`msp', `mailhub.my.domain')dnl
define(`SMART_HOST',`mailhub.my.domain')dnl

Where mailhub.my.domain resolves to the clustered service IP (in this case 10.1.2.3). This causes local submission to send mail via the service IP, on whichever node the service is running at the time.

Ensure sendmail.mc and submit.mc (and anything else in /etc/mail that you configure locally) are copied to both nodes in /etc/mail. Then you should be done.
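With the sendmail-cf package installed, regenerating the config and pushing it out is roughly (a sketch; "othernode" is a placeholder for the peer node, and "mail" is the cluster service name from the fragment above):

make -C /etc/mail      # rebuild sendmail.cf and submit.cf from the .mc files
scp /etc/mail/sendmail.cf /etc/mail/submit.cf othernode:/etc/mail/
clusvcadm -R mail      # restart the clustered sendmail service to pick up the change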


Friday 8 March 2013

Cluster for LAN Services switching to EXT4 (Part 4)

The rest of the services are pretty much configured as they were in the GFS2 version of this cluster, except using the ext4 filesystem. Here is my final cluster.conf file:


<?xml version="1.0"?>
<cluster config_version="84" name="bldg1ux01clu">
<cman expected_votes="1" two_node="1"/>
<clusternodes>
<clusternode name="bldg1ux01n1i" nodeid="1" votes="1">
<fence>
<method name="apc7920-dual">
<device action="off" name="apc7920" port="1"/>
<device action="off" name="apc7920" port="2"/>
<device action="on" name="apc7920" port="1"/>
<device action="on" name="apc7920" port="2"/>
</method>
<method name="bldg1ux01n1drac">
<device name="bldg1ux01n1drac"/>
</method>
</fence>
</clusternode>
<clusternode name="bldg1ux01n2i" nodeid="2" votes="1">
<fence>
<method name="apc7920-dual">
<device action="off" name="apc7920" port="3"/>
<device action="off" name="apc7920" port="4"/>
<device action="on" name="apc7920" port="3"/>
<device action="on" name="apc7920" port="4"/>
</method>
<method name="bldg1ux01n2drac">
<device name="bldg1ux01n2drac"/>
</method>
</fence>
</clusternode>
</clusternodes>
<rm>
<failoverdomains>
<failoverdomain name="bldg1ux01A" nofailback="0" ordered="1" restricted="1">
<failoverdomainnode name="bldg1ux01n1i" priority="1"/>
<failoverdomainnode name="bldg1ux01n2i" priority="2"/>
</failoverdomain>
<failoverdomain name="bldg1ux01B" nofailback="0" ordered="1" restricted="1">
<failoverdomainnode name="bldg1ux01n1i" priority="2"/>
<failoverdomainnode name="bldg1ux01n2i" priority="1"/>
</failoverdomain>
<failoverdomain name="bldg1ux01Anfb" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="bldg1ux01n1i" priority="1"/>
<failoverdomainnode name="bldg1ux01n2i" priority="2"/>
</failoverdomain>
<failoverdomain name="bldg1ux01Bnfb" nofailback="1" ordered="1" restricted="1">
<failoverdomainnode name="bldg1ux01n1i" priority="2"/>
<failoverdomainnode name="bldg1ux01n2i" priority="1"/>
</failoverdomain>
</failoverdomains>
<resources>
<ip address="10.1.10.25" monitor_link="0"/>
<fs device="/dev/cluvg00/lv00dhcpd" force_fsck="1" force_unmount="1" mountpoint="/data/dhcpd" name="dhcpdfs" nfslock="0" options="acl" quick_status="0" self_fence="0"/>
<ip address="10.1.10.26" monitor_link="0"/>
<fs device="/dev/cluvg00/lv00named" force_fsck="1" force_unmount="1" mountpoint="/data/named" name="namedfs" nfslock="0" options="acl" quick_status="0" self_fence="0"/>
<ip address="10.1.10.27" monitor_link="0"/>
<fs device="/dev/cluvg00/lv00cups" force_fsck="1" force_unmount="1" mountpoint="/data/cups" name="cupsfs" nfslock="0" options="acl" quick_status="0" self_fence="0"/>
<ip address="10.1.10.28" monitor_link="0"/>
<fs device="/dev/cluvg00/lv00httpd" force_fsck="1" force_unmount="1" mountpoint="/data/httpd" name="httpdfs" nfslock="0" options="acl" quick_status="0" self_fence="0"/>
<ip address="10.1.10.29" monitor_link="0"/>
<fs device="/dev/cluvg00/lv00projects" force_fsck="1" force_unmount="1" mountpoint="/data/projects" name="projectsfs" nfslock="1" options="acl" quick_status="0" self_fence="0"/>
<nfsexport name="exportbldg1clunfsprojects"/>
<nfsclient name="nfsdprojects" options="rw" target="10.0.0.0/8"/>
<ip address="10.1.10.30" monitor_link="0"/>
<fs device="/dev/cluvg00/lv00home" force_fsck="1" force_unmount="1" mountpoint="/data/home" name="homefs" nfslock="1" options="acl" quick_status="0" self_fence="0"/>
<nfsexport name="exportbldg1clunfshome"/>
<nfsclient name="nfsdhome" options="rw" target="10.0.0.0/8"/>
<ip address="10.1.10.32" monitor_link="0"/>
<fs device="/dev/cluvg00/lv00smbprj" force_fsck="1" force_unmount="1" mountpoint="/data/smbprj" name="smbdprjfs" nfslock="0" options="acl" quick_status="0" self_fence="0"/>
<ip address="10.1.10.33" monitor_link="0"/>
<fs device="/dev/cluvg00/lv00smbhome" force_fsck="1" force_unmount="1" mountpoint="/data/smbhome" name="smbdhomefs" nfslock="0" options="acl" quick_status="0" self_fence="0"/>
</resources>
<service autostart="1" domain="bldg1ux01B" exclusive="0" name="cups" recovery="relocate">
<script file="/etc/init.d/cups" name="cups"/>
<ip ref="10.1.10.27"/>
<fs ref="cupsfs"/>
</service>
<service autostart="0" domain="bldg1ux01Anfb" exclusive="0" name="nfsdprojects" nfslock="1" recovery="relocate">
<ip ref="10.1.10.29"/>
<fs ref="projectsfs">
<nfsexport ref="exportbldg1clunfsprojects">
<nfsclient ref="nfsdprojects"/>
</nfsexport>
</fs>
<ip ref="10.1.10.32">
<fs ref="smbdprjfs"/>
<samba config_file="/etc/samba/smb.conf.prj" name="bldg1clusmbprj" smbd_options="-p 445 -l /data/smbprj/var/log/samba"/>
</ip>
</service>
<service autostart="1" domain="bldg1ux01A" exclusive="0" name="httpd" nfslock="0" recovery="relocate">
<ip ref="10.1.10.28">
<fs ref="httpdfs"/>
<apache config_file="conf/httpd.conf" name="httpd" server_root="/data/httpd/etc/httpd" shutdown_wait="10"/>
</ip>
</service>
<service autostart="0" domain="bldg1ux01Bnfb" exclusive="0" name="nfsdhome" nfslock="1" recovery="relocate">
<ip ref="10.1.10.30"/>
<fs ref="homefs">
<nfsexport ref="exportbldg1clunfshome">
<nfsclient ref="nfsdhome"/>
</nfsexport>
</fs>
<ip ref="10.1.10.33">
<fs ref="smbdhomefs"/>
<samba config_file="/etc/samba/smb.conf.home" name="bldg1clusmbhome" smbd_options="-p 445 -l /data/smbhome/var/log/samba"/>
</ip>
</service>
<service autostart="1" domain="bldg1ux01A" exclusive="0" name="dhcpd" recovery="relocate">
<script file="/etc/init.d/dhcpd" name="dhcpd"/>
<ip ref="10.1.10.25"/>
<fs ref="dhcpdfs"/>
</service>
<service autostart="1" domain="bldg1ux01A" exclusive="0" name="named" recovery="relocate">
<script file="/etc/init.d/named" name="named"/>
<ip ref="10.1.10.26"/>
<fs ref="namedfs"/>
</service>
</rm>
<fencedevices>
<fencedevice agent="fence_apc" ipaddr="192.168.2.3" login="apc" name="apc7920" passwd="securepassword"/>
<fencedevice agent="fence_ipmilan" ipaddr="10.1.10.22" login="fence" name="bldg1ux01n1drac" passwd="securepassword"/>
<fencedevice agent="fence_ipmilan" ipaddr="10.1.10.23" login="fence" name="bldg1ux01n2drac" passwd="securepassword"/>
</fencedevices>
<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
</cluster>

One change I made was to set monitor_link="0", as I don't really want my services relocating on switch stack reboots (my network resilience comes from bonding).

I was also somewhat paranoid that my filesystems might (by admin accident) end up mounted on two nodes at once (a big no no with ext4), so I wrote the following script to check that a node doesn't have anything mounted that the cluster says it shouldn't.


#!/usr/bin/perl

# Script to parse /etc/cluster/cluster.conf and check any FS's associated with services I don't hold
# aren't mounted here

$clusterconfig="/etc/cluster/cluster.conf";


open (CLUSTAT, "/usr/sbin/clustat |");
$hostname=`hostname`;

chomp($hostname);

# Find services started but that I don't own
while (<CLUSTAT>) 
{
if (/service:(.+?)\s+(.+?)\sstarted/)
{
$service=$1;
if ( $2 !~ /$hostname/ )
{
push(@nomyservices,$service);
}
}
}

close CLUSTAT;

open (MOUNTS, "/bin/mount|");

while (<MOUNTS>)
{
# What do I have mounted
if ( /\/dev\/mapper.+?on\s+?(.+?)\s+?type/)
{
push(@mymounts,$1);
}
}
close MOUNTS;

$retval=0;

open (CONFFILE, "<$clusterconfig") or die "Can't open cluster config file $clusterconfig";

$checkthis=0;

while (<CONFFILE>) 
{
# Create a lookup table of fs resources names to paths
if (/\<fs.+mountpoint="(.+?)".+name="(.+?)".+?$/)
{
$fslookup{$2}=$1;
next;
}

if (/\<service.+?name="(.+?)".+?$/)
{
$service=$1;
if ( $service ~~ @nomyservices)
{
$checkthis=1;
}
else
{
$checkthis=0;
}
}

if  ((/\<fs ref="(.+?)"/) && ($checkthis) )
{
# So service I don't own and do I have it's FS's mounted
$fs=$fslookup{$1};
if ( $fs ~~ @mymounts )
{
print "Double mounted Filesystem: $fs on $hostname not running that service\n";
$retval=1;
}
}

if (/\<\/service\>/)
{
$checkthis=0;
}
}

close CONFFILE;
exit ($retval);



This parses the cluster.conf for filesystems associated with services and checks whether we should have them mounted (based on which services we hold). I wrapped a script around this that checks the return code and emails if it suspects a double mount, and cron'd it every 15 minutes, hopefully minimising the length of time any double mount is allowed to persist. This is purely paranoia; if you do things properly you should never end up here! Just a bit of belt and suspenders.
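The wrapper and cron entry are trivial; a sketch along these lines (the script names and paths here are just placeholders for my own):

#!/bin/bash
# /usr/local/sbin/checkclustermounts-wrap - email if the double mount check fails
/usr/local/sbin/checkclustermounts
if [ $? -ne 0 ] ; then
    echo "Possible double mounted cluster filesystem, investigate immediately" | \
        mail -s "Cluster double mount warning on `hostname`" root
fi

And in /etc/cron.d/checkclustermounts, a line like:

*/15 * * * * root /usr/local/sbin/checkclustermounts-wrap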

After switching to ext4 for my cluster I have found this setup to be very stable. It has been happily running on several clusters for well over a year now.



A RHEL 6/Centos 6 HA Cluster for LAN Services switching to EXT4 (Part 3)

The Startup and Shutdown Problem
When I was having issues with my cluster, I found it frustrating that the cluster would come up during a node reboot. It was overly time consuming to have to boot single user and chkconfig services off (I know there is the "nocluster" kernel boot flag, but it doesn't cover DRBD). I'd prefer a completely "up", full multiuser system before it even thinks about cluster services; this makes repairing it easier.

Another issue is that my service dependency of "clvmd" on "drbd" (described and set up in my original GFS2 cluster setup blog) keeps being overwritten by updates, which is annoying, and I often forgot to restore it after updating.

I also found I wanted a greater level of paranoia in my startup scripts before starting cluster services. For example, I had issues in the past with GFS2 starting on a not fully synced DRBD (it generated kernel oopses). That was probably just down to the overhead of the rebuild, but I decided I'd like a fully up-to-date DRBD on a node before starting services (even though I'm now using ext4). DRBD doesn't generally take long to sync up, as it can determine the changed blocks quickly (a few minutes, provided a node hasn't been down too long).

So I decided to cut through all this and create my own startup and stop script to meet my needs.

The second major issue I have with my new demand-mounted cluster setup is that filesystems shared out via NFS have a nasty habit of not umounting when the service is stopped. Sometimes they successfully umount a couple of minutes after the service has stopped, and restarting the "nfs" service usually frees them, but they can still stick.

A filesystem not umounting wasn't such a big deal with GFS2, as the filesystem is allowed to be mounted on both nodes. But with our ext4 filesystems this is a big no no, so a stuck mount can stop us flipping the service to the other node.

A service that fails to umount a filesystem when stopped will be marked as failed. There isn't really a lot of choice in this situation. We can restart the service on its original node (though it will need to be disabled again first, and prudence suggests the "-m" flag to "clusvcadm" when re-enabling it, to force it back onto its original home node). Starting the service on the node that doesn't have the hung mount would cause the filesystem to be mounted on both nodes at once and will likely result in filesystem corruption (disabling it a second time with clusvcadm removes the cluster's check that the filesystem isn't mounted anywhere). So high levels of care and checking are needed here: check which mounts are present on which nodes.
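For the record, recovering onto the original node looks roughly like this, using the service and node names from my cluster (check the mounts on both nodes before touching anything):

# On each node, check where the cluster filesystems are actually mounted
mount | grep "on /data/"

# Then disable the failed service and re-enable it pinned to its original node
clusvcadm -d nfsdhome
clusvcadm -e nfsdhome -m bldg1ux01n2i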

The only other choice is to hard halt the node with the stuck mount, and this is really the only option when shutting down a system with a hung mount. We have to ensure the DRBD storage is in an unused state before the "drbd" service will cleanly stop. If we just carry on shutting down with a hung filesystem, drbd will fail to stop, and at some point in the process the network between the nodes will be shut down, meaning both sides could be changing the storage without the other knowing. This could result in a split brain (depending on what the other node is doing). A hard halt is nasty, but your storage is really more important than anything else.

Of course the other option would be a setup where the DRBD is in Primary/Secondary, but I resent having a high performance node sitting there doing nothing when it could be doing some load balancing for me.

So, as it's likely I can't easily move an NFS service once started, and as the nodes will likely come up at different times (especially with my wait for an up-to-date DRBD before a node joins in), I have set up my cluster not to start the NFS services automatically in cluster.conf (and to prevent them failing back). My startup script handles placing them on the correct nodes.

So I wanted startup and stop scripts that should handle all this.

Firstly I chkconfig'd off cman, drbd, clvmd, rgmanager and ricci on both nodes.
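That is, on both nodes, something like:

for svc in cman drbd clvmd rgmanager ricci ; do
    /sbin/chkconfig $svc off
done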

I make no claims for the scripts below; they aren't overly clean and were put together quickly to get things going. They assume I have only my two nodes and that the only NFS services on these systems are cluster ones. Also, somewhat nastily, the node names are hard coded in the body of the scripts.

Firstly, my /etc/rc.d/init.d/lclcluster startup script, which basically calls another script to do the actual work. This needs chkconfig --add lclcluster and then chkconfig lclcluster on.
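Once the script below is in place on both nodes, registering it is just the usual init housekeeping, roughly:

chmod 755 /etc/rc.d/init.d/lclcluster
chkconfig --add lclcluster
chkconfig lclcluster on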


#!/bin/sh
#
# Startup script for starting the Cluster with paranoia
#
# chkconfig: - 99 00
# description: This script starts all the cluster services
# processname: lclcluster
# pidfile:

# Source function library.
. /etc/rc.d/init.d/functions

# See how we were called.
case "$1" in
  start)
    if [ -f /tmp/shutmedown ] ; then
        rm -f /tmp/shutmedown
        /sbin/chkconfig apcupsd on
        echo 'shutdown -h now' | at now+1minutes
        echo Going Straight to shutdown
        failure
                echo
                RETVAL=1
        exit 1
    fi
        echo -n "Starting lclcluster: "
    if ! [ -f /var/lock/subsys/lclcluster ] ; then
        if [ -f /usr/local/sbin/clusterstart ] ; then
            /usr/local/sbin/clusterstart initd >/usr/local/logs/clusterstart 2>&1 &
        fi
            touch /var/lock/subsys/lclcluster
        RETVAL=0
        success
        echo
    else
        echo -n "Already Running "
        failure
        echo
        RETVAL=1
    fi
        echo
        ;;
  stop)
        echo -n "Stopping lclcluster: "
    if [ -f /var/lock/subsys/lclcluster ] ; then
        if [ -f /usr/local/sbin/clusterstop ] ; then
            /usr/local/sbin/clusterstop initd >/usr/local/logs/clusterstop 2>&1
        fi
            rm -f /var/lock/subsys/lclcluster
        success
        echo
        RETVAL=0
    else
        echo -n "Not running "
        failure
        echo
        RETVAL=1
    fi
        ;;
  restart)
    $0 stop
    $0 start
    RETVAL=$?
    ;;
  *)
        echo "Usage: $0 {start|stop|restart}"
        exit 1
esac

exit $RETVAL

Then my /usr/local/sbin/clusterstart script.
  
#!/bin/bash

# Script to start all cluster services in the correct order

# If passed as a startup script wait 2 minutes for the startup to happen
# this will allow us to prevent cluster startup by touching a file
if [ "$1" = "initd" ] ; then
        echo "Startup Script called me waiting 2 minutes"
        sleep 120
fi

# If this file exists, do not start the cluster
if [ -f /tmp/nocluster ] ; then
    echo "Told not to start cluster. Exiting..."
    exit 1
fi

# Start cman
/etc/init.d/cman start

sleep 2

# Start DRBD
/etc/init.d/drbd start

# Loop until consistent
ret=1
ret2=1
count=0
while ( [ $ret -ne 0 ] || [ $ret2 -ne 0 ] ) &&  [ $count -lt 140 ] ; do
    grep "ro:Primary/" /proc/drbd
    ret=$?
    grep "ds:UpToDate/" /proc/drbd
    ret2=$?
    echo "DRBD Count: $count Primary: $ret UpToDate: $ret2"
    count=`expr $count + 1`
    sleep 10
done

if [ $count -eq 140 ] ; then
    echo "DRBD didnt sync in a timely manner. Not starting cluster services"
    exit 1
fi

if [ -f /tmp/splitbrain ] ; then
    echo "DRBD Split Brain detected. Not starting cluster services"
    exit 1
fi
  
echo "DRBD Consistent"

# Start clvmd
/etc/init.d/clvmd start

sleep 2

# Start rgmanager
/etc/init.d/rgmanager start

sleep 2

# Start ricci
/etc/init.d/ricci start

# If this file exists, do not start the cluster nfsds
if [ -f /tmp/nonfsds ] ; then
    echo "Told not to start NFS Services. Exiting..."
    exit 1
fi

# Waiting for things to settle
echo "Waiting for things to settle"
sleep 20

# Try to start my NFSD services
echo "Starting my NFSD services"
declare -A NFSD
NFSD[bldg1ux01n1i]="nfsdprojects"
NFSD[bldg1ux01n2i]="nfsdhome"


for f in ${NFSD[`hostname`]} ; do
    clustat | grep $f | grep -q disabled
    res=$?
    if [ $res -eq 0 ] ; then
        echo Starting $f
        clusvcadm -e $f -m `hostname`
    fi

    # Was any of these mine before and failed bring up again
    clustat | grep $f | grep `hostname` | grep -q failed
    res=$?
        if [ $res -eq 0 ] ; then
                echo Starting $f
        clusvcadm -d $f
        sleep 1
                clusvcadm -e $f -m `hostname`
        fi

done

echo Check other nodes services after storage is up

# Wait to see if other node starts its services or I will
echo Waiting to see if other nodes starts its services or not
echo First wait to see if its DRBD is consistent

# Loop until consistent
ret=0
count=0
while [ $ret -eq 0 ]  &&  [ $count -lt 140 ] ; do
    grep "sync" /proc/drbd
    ret=$?
    echo "DRBD Count: $count syncing: $ret"
    count=`expr $count + 1`
    sleep 10
done

if [ $count -eq 140 ] ; then
    echo "DRBD didnt sync on my partner continuing"
else  
    echo Give it a chance to start some services. Waiting....
    sleep 180
fi

if  [ "`hostname`" = "bldg1ux01n1i" ] ; then
    othermach="bldg1ux01n2i"
else
    othermach="
bldg1ux01n1i"
fi

# Start anything the other node hasnt claimed
echo "What didnt the other node take"

for f in ${NFSD[$othermach]} ; do
    clustat | grep $f | grep -q disabled
        res=$?
        if [ $res -eq 0 ] ; then
                echo Starting $f
                clusvcadm -e $f -m `hostname`
        fi
done

# Check with all our apcupsd shenanigans check that it's enabled
/sbin/chkconfig --list apcupsd | grep :on; ret=$?
if [ $ret -ne 0 ] ; then
    echo "Fixing apcupsd service"
    /sbin/chkconfig apcupsd on
    /etc/init.d/apcupsd start
fi

echo Cluster Start Complete


Some commentary on this script. On startup it waits 2 minutes; this gives you time to touch /tmp/nocluster if you don't want the cluster to start after the rest of the machine has booted. We then start cman followed by DRBD, and loop checking whether the local DRBD is UpToDate, waiting until it is. If this doesn't happen in a timely manner, we stop.

We also check that we haven't split brained. This works by having a line in my /etc/drbd.d/global_common.conf that points at a local version of the split brain handler:

 handlers {
        split-brain "/usr/local/sbin/notify-split-brain.sh root";
        out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
    }

This is a copy of the shipped DRBD split-brain handler, modified to touch a flag file. Here is the fragment that was changed:

 case "$0" in
    *split-brain.sh)
        touch /tmp/splitbrain
        SUBJECT="DRBD split brain on resource $DRBD_RESOURCE"
        BODY="



You will have to fix the split brain issue yourself

The script then starts clvmd, rgmanager and ricci. Next we start the NFS services that should run on this node (hard coded in the script), i.e. the ones marked as disabled. Any marked "failed" that were previously this node's are restarted (these will most likely be services whose filesystem failed to umount).

We then wait for the other node's DRBD to be in sync and give it a chance to start its own services. Anything it should be running but (for whatever reason) hasn't claimed, we then start ourselves. The reason for the apcupsd service changes will become clear later.

So in this script I'm valuing load balancing and safety over fast start up time of services.  

Now the /usr/local/sbin/clusterstop script:

#!/bin/bash

# Script to stop all cluster services in the correct order

# Stop rgmanager
/etc/init.d/rgmanager stop

# Stop ricci
/etc/init.d/ricci stop

# Clear mount by restarting nfsd
/etc/init.d/nfs condrestart

sleep 20

# Unmount any filesystems still around
# Keep trying until all mounts are gone or we timeout
echo "Unmount all cluster filesystems"
retc=0
count=0
while   [ $retc -eq 0 ] && [ $count -lt 60 ] ; do

    umop=""
    for mnts in `mount | grep "on /data/" | cut -d' ' -f3`; do
        umount $mnts
        ret=$?
        umop=$umop$ret
    done
   
    echo $umop | grep -q '1'
    retc=$?
    echo "Count: $count     umop: $umop retc: $retc"
    count=`expr $count + 1`
    sleep 2
done

# Drop out unless shutting down
if [ $count -eq 60 ] && [ "$1" != "initd" ] ; then
    echo
    echo "Failed to unmount all cluster file systems"
    echo "Still mounted:"
    mount | grep "on /data/"
    exit 1
   
fi

if [ $count -eq 60 ] ; then
    echo "Failed to umount all cluster file systems but having to continue a
nyway"
    # If other node is offline we should be OK to unclean shutdown
    clustat | grep -q " Offline"
    ret=$?
    grep -q "cs:WFConnection" /proc/drbd
    ret2=$?

    echo "Is other node offline, check clustat and drbd status"
    echo "Clustat status want 0: $ret , drbd unconnected want 0: $ret2"
    if [ $ret -eq 0 ] && [ $ret2 -eq 0 ] ; then
        echo "Looks like I am the last node standing, go for unclean shu
tdown"
    else
        echo "Halt this node"
        # Only flag if truly shutting down
        if [ $RUNLEVEL -eq 0 ] ; then
            echo "Flag to shutdown straight away on quick boot"
            # Disable apcupsd so I can control the shutdown no doubl
e shutdown
            /sbin/chkconfig apcupsd off
            touch /tmp/shutmedown
        fi
        # Try and be a little bit nice to it
        sync; sync
        /sbin/halt -f
    fi
fi

sleep 2

# Stop Clustered LVM
/etc/init.d/clvmd stop

sleep 2

# Stop drbd service
/etc/init.d/drbd stop

sleep 2

# If DRBD is still loaded and we got here, we need to failout
/sbin/lsmod | grep -q drbd
res=$?

# Drop out unless shutting down
if [ $res -eq 0 ] && [ "$1" != "initd" ] ; then
    echo
    echo "The drbd module is still loaded"
    echo "Something must be using it"
    exit 1
   
fi

if [ $res -eq 0 ] ; then
    echo "The drbd module is still loaded"
    # If other node is offline we should be OK to unclean shutdown
    clustat | grep -q " Offline"
    ret=$?
    grep -q "cs:WFConnection" /proc/drbd
    ret2=$?
   
    echo "Is other node offline, check clustat and drbd status"
    echo "Clustat status want 0: $ret , drbd unconnected want 0: $ret2"
    if [ $ret -eq 0 ] && [ $ret2 -eq 0 ] ; then
        echo "Looks like I am the last node standing, go for unclean shu
tdown"
    else
        echo "Halt this node"
        # Only flag if truly shutting down
        if [ $RUNLEVEL -eq 0 ] ; then
            echo "Flag to shutdown straight away on quick boot"
            # Disable apcupsd so I can control the shutdown no doubl
e shutdown
            /sbin/chkconfig apcupsd off
            touch /tmp/shutmedown
        fi
        # Try and be a little bit nice to it
        sync; sync
        /sbin/halt -f
    fi
fi

sleep 2

# Stop cman
/etc/init.d/cman stop

echo "Cluster Stop Complete"


Now some commentary on the clusterstop script. We stop rgmanager, which should stop all services on this node and umount all the clustered filesystems (we also stop ricci). We then restart the "nfs" service, which should allow us to umount any filesystems that didn't unmount automatically. I've sometimes seen this take a couple of minutes, so we retry the umounts every 2 seconds for 2 minutes.

If we still fail to umount we drop out, but that isn't an option if we're shutting down. In that case we check the status of drbd: if we are the only node left, we can shut down without umounting these filesystems (as we hold the most up-to-date copy of the storage, so there will be no inconsistency).

If we aren't the last node standing, we need to halt this node. This isn't very pleasant but the storage consistency is more important than this node. We try to be a little bit nice about this by syncing anything unsaved back to disk. 

Sadly, when you halt a node, the surviving node will fence it and reboot it, but we want a full shutdown. The only way I found around this was to set a flag file that causes the init script (earlier in this article) to shut the machine down again; so the sequence is hard halt, then fence-initiated reboot (without the cluster starting), then clean shutdown.

The reason I chkconfig apcupsd off is that we found that, if the surviving node was by that time itself shutting down, this special reboot cycle would be interrupted: apcupsd would notice the other node (the one with the UPS cable attached, assuming the nodes share a UPS) was shutting down and would shut this node down straight away. That would stop my shutdown trick working, so on the next boot (power restore) this node would shut itself straight down again (due to the presence of the /tmp/shutmedown file), which is not very desirable. Obviously this isn't relevant if you don't use an APC UPS, but similar issues may apply with other UPSs. So in my case, during this quick reboot cycle, we ensure apcupsd isn't running and correct that on the next boot.

On UPS shutdowns, we generally program one node (the one that DOESN'T have the UPS cable attached to it) to shutdown several minutes earlier than the one with the UPS directly attached. This is so we can be pretty sure we know which node has the latest version of the DRBD storage, should any incident occur.

Phew!!

STOP PRESS


Red Hat seem to have noticed the problem with clustered NFS services failing to umount their filesystems every time. They have added an option (nfsrestart) to the clusterfs and fs resources in RHEL 6.4 that forces an NFS restart when a service is stopped; this should allow the filesystem to be cleanly umounted by the cluster every time. I haven't tested this yet, but obviously all of the above will still help with the other issues (and it still pays to check that the umount has actually occurred). I'm also not sure what effect this will have on any other NFS services on the node.

This will be used like:

<fs device="/dev/cluvg00/lv00home" force_fsck="1" force_unmount="1" mountpoint="/data/home" name="homefs" nfslock="1" options="acl" quick_status="0" self_fence="0" nfsrestart="1" />