Monday 25 April 2011

Building a RHEL 6/Centos 6 HA Cluster for LAN Services (part 3)

Now that we have working storage it's time to use it. I plan to use the Clustered Logical Volume Manager (clvmd) on top of DRBD, which, if you don't know it, is simply a cluster-aware version of the standard logical volume manager.

Clustered Logical Volume Manager
All of these steps should be performed on both nodes. The first step is to edit /etc/lvm/lvm.conf and set the locking type to 3:

locking_type = 3

Then comment out the existing filter line and add a new one:


# By default we accept every block device:
    #filter = [ "a/.*/" ]
    filter = [ "a|/dev/drbd*|", "r/.*/" ]


All we are doing here is telling LVM to use cluster-aware locking, and telling it to only check the DRBD devices for Physical Volumes.
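A quick way to confirm that both changes took (just a grep of the file, nothing clever):

grep -E '^[[:space:]]*(locking_type|filter)' /etc/lvm/lvm.conf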

As clvmd (the Clustered Logical Volume Manager Daemon) requires DRBD to be up, and as GFS2 (the clustered file system we will be using) needs the LVM system, we must make these dependent on each other in the startup and shutdown order. To do this:

Edit /etc/init.d/clvmd and add drbd to the end of Required-Start and Required-Stop:

# Required-Start: $local_fs cman drbd
# Required-Stop: $local_fs cman drbd

Edit /etc/init.d/gfs2 and add clvmd to the end of Required-Start and Required-Stop:

# Required-Start: $network cman clvmd
# Required-Stop: $network cman clvmd

Now chkconfig these services on and reset the order based on these dependencies:

/sbin/chkconfig drbd on
/sbin/chkconfig clvmd on
/sbin/chkconfig gfs2 on
/sbin/chkconfig drbd resetpriorities
/sbin/chkconfig clvmd resetpriorities
/sbin/chkconfig gfs2 resetpriorities

To check that the order was set properly, run ls /etc/rc3.d. In there you should now have S70drbd, S71clvmd and S72gfs2 (your numbers may vary, but they should be in this startup order). You probably also want to check that the stop order in /etc/rc0.d is K06gfs2, K07clvmd, K08drbd.
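As a quick way of eyeballing both directories in one go:

ls /etc/rc3.d /etc/rc0.d | grep -E 'drbd|clvmd|gfs2'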

You will need to check these startup orders whenever updates get applied to these packages, re-add the changes to Required-Start and Required-Stop if they have disappeared, and then use "resetpriorities" as above to re-apply the ordering. I really must script this for myself. LINBIT have said they'll look into this too.
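If you do want to script it, something along these lines would do as a rough post-update check (a sketch only, assuming the standard runlevel 3 links on a stock RHEL 6 install, and that the Required-Start/Stop lines have already been re-added by hand):

#!/bin/bash
# Rough check: is the runlevel 3 start order drbd -> clvmd -> gfs2?
# If not, re-apply the chkconfig priorities.
order=$(ls /etc/rc3.d | grep -E '^S[0-9]+(drbd|clvmd|gfs2)$' | sed 's/^S[0-9]*//' | tr '\n' ' ')
if [ "$order" != "drbd clvmd gfs2 " ]; then
    echo "Start order looks wrong ($order), resetting priorities" >&2
    for svc in drbd clvmd gfs2; do
        /sbin/chkconfig $svc resetpriorities
    done
fi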

Just to remind you, all of the above should have been applied to both nodes.

If DRBD is running on both nodes and is in Primary/Primary (check cat /proc/drbd), which it should have been after the previous section, we are ready to start clvmd.
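A quick way to check that without reading the whole of /proc/drbd (the exact layout differs a little between DRBD versions):

grep -q 'Primary/Primary' /proc/drbd && echo "OK: primary/primary" || echo "NOT primary/primary - fix DRBD first"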

Start clvmd on both nodes: /etc/init.d/clvmd start

We now set up the DRBD device as a physical volume and add it to a volume group. On ONE node only:

# pvcreate /dev/drbd0 
  Physical volume "/dev/drbd0" successfully created

# vgcreate cluvg00 /dev/drbd0
  Volume group "cluvg00" successfully created

Then check it happened on both nodes with:

# /sbin/vgscan 
  Reading all physical volumes.  This may take a while...
  Found volume group "cluvg00" using metadata type lvm2

Creating a GFS2 Filesystem on top of clvmd
I plan to use one Logical Volume (LV) per service (for maximum flexibility). As I'll be starting with DHCP I will create an LV for it. I'm going to mount all my clustered filesystems under /data (most people would probably use /mnt, but it doesn't really matter).

Perform on one node only (1GB is the smallest filesystem you can have for GFS2, and I won't need any more space than this for DHCP):

#  lvcreate --size 1G --name lv00dhcpd cluvg00
  Logical volume "lv00dhcpd" created

Now make the filesystem:

mkfs -t gfs2 -p lock_dlm -j 2 -t bldg1ux01clu:dhcpd /dev/cluvg00/lv00dhcpd

This command line specifies "-j 2", which is the number of journals we require, basically one per node. I'm also naming my volumes after the service that will use them.

I also notice I needed to put the cluster name into this command line (the "-t bldg1ux01clu:dhcpd" lock table). I can't remember if I had to have cluster.conf in place at this point for this to work (as it names the cluster); if so, just take the cluster.conf given later in this post to keep things happy for the moment.

Now on both nodes:

mkdir /data
mkdir /data/dhcpd
mount /dev/mapper/cluvg00-lv00dhcpd /data/dhcpd

Check that it mounted and works: put a few test files in /data/dhcpd and see that they appear on both nodes. Check the output of the mount and df commands to see that things are as you expect (on both nodes).
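For example (the test file name here is just an arbitrary example, remove it again afterwards):

# on node 1
echo "hello from node 1" > /data/dhcpd/testfile

# on node 2 - the file should appear straight away
cat /data/dhcpd/testfile
mount | grep dhcpd
df -h /data/dhcpd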


Add to /etc/fstab (on both nodes):

/dev/cluvg00/lv00dhcpd /data/dhcpd gfs2 acl 0 0

A note on this: I was quite happy to let Cluster Suite demand-mount all these GFS2 filesystems and leave them out of fstab. Ultimately, though, I decided that I liked seeing all the filesystems mounted on both nodes (for easy admin and backup), and more importantly (unless you use the force-unmount options in cluster.conf) having them in fstab means the GFS2 mounts are unmounted cleanly if a node is shut down. Also, later we will be using clustered Samba, which sits outside cluster.conf, so demand mounting isn't an option there.

Once happy clear out any test files and directories you have in /data/dhcpd.


Setup the APC Fence Device (only if using this device)

I'll include this in case anyone else wishes to use an APC device for fencing. A quick word on fencing, though: a lot of people think the warnings about requiring fencing are overblown, but ultimately the remaining node needs an assurance that the other node is definitely dead when it stops responding to heartbeats, before it will continue any access to the GFS2 filesystems. If the fencing were to lie, the GFS2 filesystems would be corrupted. Fencing seems pretty harsh on the fenced node (there is no clean shutdown), but ultimately local filesystem data should be much less important than the clustered data.

Setting the IP address of these APC devices works by a directed ARP. So, using the MAC address printed on the side of the unit:

arp -s 192.168.2.3 00:c0:b7:52:23:4F 
ping 192.168.2.3 -s 113

Then telnet to the device from either node. Personally I turned off SNMP, set a secure password, gave it a hostname and a domain name (not that the domain matters hugely), and set the subnet mask.
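Once it is on the network you can check that the fence agent itself can talk to it, along these lines (using the same "apc" login and password that will go into cluster.conf later; outlet 1 here is just an example):

fence_apc -a 192.168.2.3 -l apc -p securepassword -n 1 -o status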

Setup iDRAC 6 backup fencing (if on a Dell Server)
As reaching my fence device depends on a single network card in each node, this becomes a single point of failure. To mitigate this I plan to use a secondary fence device, via the Dell DRAC (which is on my main network). This is only applicable to the current generation of Dell servers; similar methods exist on older Dell servers and other vendors' servers. I'm including it here as I struggled to find, straight off, how to configure IPMI on the new DRAC.

On each node.

I added a new custom user in OpenManage under "Main System Chassis" -> "Remote Access", then the Users tab. I called the user "fence" and set a secure password. The only option I set was in the "IPMI User Privileges" section: "Maximum LAN User Privilege Granted" to Administrator. That is the only access this user needs.

On the DRAC card itself, in the left hand menu select "Remote Access", then "Network/Security" at the top. Scroll down until you see "IPMI Settings", tick "Enable IPMI Over LAN", set "Channel Privilege Level Limit" to "Administrator", and type in 40 random hex characters as an encryption key.

I added the DRAC cards to my hosts file (I named them simply the node hostnames with drac on the end):

10.1.10.22    bldg1ux01n1drac
10.1.10.23    bldg1ux01n2drac

This can be tested with:

fence_ipmilan  -a bldg1ux01n1drac -l fence -p securepassword -o status

Replace securepassword with the secure password you used. This won't turn anything off, but it will show you that the cluster fencing can correctly talk to the DRACs.

Finally, for fencing to work properly with the DRAC we need to disable ACPI soft power-off:

/etc/init.d/acpid stop
/sbin/chkconfig acpid off

So we get a hard power off from this fencing method.

Setting up clustering
First let's put a basic cluster config file in place on both nodes (/etc/cluster/cluster.conf). Here is one that is useful at this stage:

<?xml version="1.0"?>
<cluster config_version="1" name="bldg1ux01clu">
  <cman expected_votes="1" two_node="1"/>
  <clusternodes>
    <clusternode name="bldg1ux01n1i" nodeid="1" votes="1">
      <fence>
        <method name="apc7920-dual">
          <device action="off" name="apc7920" port="1"/>
          <device action="off" name="apc7920" port="2"/>
          <device action="on" name="apc7920" port="1"/>
          <device action="on" name="apc7920" port="2"/>
        </method>
        <method name="bldg1ux01n1drac">
          <device name="bldg1ux01n1drac"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="bldg1ux01n2i" nodeid="2" votes="1">
      <fence>
        <method name="apc7920-dual">
          <device action="off" name="apc7920" port="3"/>
          <device action="off" name="apc7920" port="4"/>
          <device action="on" name="apc7920" port="3"/>
          <device action="on" name="apc7920" port="4"/>
        </method>
        <method name="bldg1ux01n2drac">
          <device name="bldg1ux01n2drac"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <failoverdomains>
      <failoverdomain name="bldg1ux01A" ordered="1" restricted="1">
        <failoverdomainnode name="bldg1ux01n1i" priority="1"/>
        <failoverdomainnode name="bldg1ux01n2i" priority="2"/>
      </failoverdomain>
      <failoverdomain name="bldg1ux01B" ordered="1" restricted="1">
        <failoverdomainnode name="bldg1ux01n1i" priority="2"/>
        <failoverdomainnode name="bldg1ux01n2i" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
    </resources>
  </rm>
  <fencedevices>
    <fencedevice agent="fence_apc" ipaddr="192.168.2.3" login="apc" name="apc7920" passwd="securepassword"/>
    <fencedevice agent="fence_ipmilan" ipaddr="bldg1ux01n1drac" login="fence" name="bldg1ux01n1drac" passwd="securepassword"/>
    <fencedevice agent="fence_ipmilan" ipaddr="bldg1ux01n2drac" login="fence" name="bldg1ux01n2drac" passwd="securepassword"/>
  </fencedevices>
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
</cluster>

Now some commentary on this. At the beginning I've given the cluster the name "bldg1ux01clu". I have set expected_votes="1" two_node="1" so that a single node is enough to be quorate, i.e. one surviving node can carry on working (kind of important in a two-node cluster; by default this isn't the case).

Next I set the node names. I'm using the internal names bldg1ux01n1i and bldg1ux01n2i so that cluster comms will go via my internal 10Gb private interface. The first fence method for each node is my APC 7920 power switch; as the servers have dual power supplies I need four actions: turn both outlets off, then turn both back on. Then for each node I have the secondary DRAC fence method.

After this I define two failover domains, bldg1ux01A and bldg1ux01B. I'm basically using these to set priorities for which node a service should try to run on: in bldg1ux01A node 1 has a higher priority than node 2, and vice versa in bldg1ux01B.

I've currently defined no services or resources. At the bottom are the details of my fence devices, referenced by each node; you'll need to put the real APC and DRAC passwords in place of "securepassword".

Now we have a basic cluster.conf, let's start the cluster services. Run the following on both nodes (except luci, which runs on node 1 only):

/sbin/chkconfig cman on
/sbin/chkconfig ricci on
/sbin/chkconfig rgmanager on

/sbin/chkconfig luci on # On node 1 only

/etc/init.d/cman start
/etc/init.d/ricci start
/etc/init.d/rgmanager start

/etc/init.d/luci start # On node 1 only


The luci service is basically the web interface to the cluster, so it only needs to run on one node.
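If you want to try it, it should be reachable in a browser on node 1 (luci listens on port 8084 by default, so something like https://bldg1ux01n1:8084/, assuming that hostname resolves on your network).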

Once these are up you can see whether everything is happy by running the clustat command on either node:

# clustat
Cluster Status for bldg1ux01clu @ Mon Apr 25 16:03:12 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 bldg1ux01n1i                                1 Online, Local, rgmanager
 bldg1ux01n2i                                2 Online, rgmanager

This indicates all is up and running on both nodes (at this stage).

We are now ready to bring up a service. Phew!!

Clustered DHCP
The ISC dhcpd shipped as standard has built-in failover clustering and it works very well. It does however suffer from what is, in my book, a fatal flaw if used with Dynamic DNS (DDNS). The issue is this: if the primary node is down, a machine registers its DHCP name/address via the secondary. When the primary comes back, the secondary doesn't share its DDNS registration with the primary, so if the secondary is down when the lease expires the name will never get removed from DNS (messy, and I don't like stale DNS entries). It was discussed briefly on the mailing list and this functionality is simply missing at this time.


So I decided to avoid the whole issue and use Cluster Suite to maintain a single running DHCPD that can swap nodes as required. 

My philosophy with services is that each one should have its own IP address (a separate floating IP per service) and its own LV and mount point to hold its data and config files. This gives me total flexibility as to which node to run things on. I like the idea of putting the config files on a clustered filesystem as it means I can change them and lazily avoid having to copy them to the other node by hand.

Each service is controlled by a resource agent (RA). It is a good idea to use the resource agents where they exist, as they do quite a lot of work for you, monitoring the service and editing the config files to use the clustered service's IP address and configs.

In dhcpd's case there isn't a dedicated RA, so you can use the "script" RA instead; this basically allows you to use a standard /etc/init.d script in the cluster config, but without some of the neatness of a full RA.

So ensure that /data/dhcpd is mounted on both nodes; if not, mount it and check it (with df and/or mount). I'm now going to create a mini-root of the files required for this service.

# cd /
# tar cpvf - etc/dhcp  var/lib/dhcpd | (cd /data/dhcpd; tar xpvf -)

I like old-style tar copying. Edit /etc/cluster/cluster.conf (on one node only) and add an ip and clusterfs resource between the resource tags, with the service after them (before the "</rm>"), i.e.:

<resources>
    <ip address="10.1.10.25" monitor_link="1"/>
    <clusterfs device="/dev/cluvg00/lv00dhcpd" fstype="gfs2" mountpoint="/data/dhcpd" name="dhcpdfs" options="acl"/>
</resources>
<service autostart="1" domain="bldg1ux01A" exclusive="0" name="dhcpd" recovery="relocate">
    <script file="/etc/init.d/dhcpd" name="dhcpd"/>
    <ip ref="10.1.10.25"/>
    <clusterfs ref="dhcpdfs"/>
</service>


With the domain directive I'm directing this service to run on node 1 by default.

Now edit the new config file at /data/dhcpd/etc/dhcp/dhcpd.conf to point the leases files at our shared filesystem (so both nodes can get at them when running the service) and to use the service IP:

# Cluster Configs 
server-identifier 10.1.10.25;
lease-file-name "/data/dhcpd/var/lib/dhcpd/dhcpd.leases";
dhcpv6-lease-file-name "/data/dhcpd/var/lib/dhcpd/dhcpd6.leases";

So my complete /data/dhcpd/etc/dhcp/dhcpd.conf file looks like this:

max-lease-time 432000;
default-lease-time 432000;
allow unknown-clients;
allow bootp;
deny duplicates;
deny declines;
deny client-updates;
ddns-updates on;
ddns-update-style interim;
ddns-domainname "lan";
ddns-rev-domainname "in-addr.arpa";

# Cluster Configs 
server-identifier 10.1.10.25;
lease-file-name "/data/dhcpd/var/lib/dhcpd/dhcpd.leases";
dhcpv6-lease-file-name "/data/dhcpd/var/lib/dhcpd/dhcpd6.leases";

authoritative;
subnet 10.1.10.0 netmask 255.255.255.0 {
        one-lease-per-client true;
        option routers 10.1.10.1;
        option broadcast-address 10.1.10.255;
        option subnet-mask 255.255.255.0;
#        option domain-name-servers 10.2.100.20, 10.2.100.21;
# The above option will change to the below when we get clustered DNS going
# option domain-name-servers 10.1.10.26, 10.2.100.20, 10.2.100.21;
        pool {
                range 10.1.10.50 10.1.10.254;
        }
}

Then change /etc/sysconfig/dhcpd (on both nodes) to point to the clustered copy of the config file and to use the bond network interface:

# Command line options here
DHCPDARGS="-cf /data/dhcpd/etc/dhcp/dhcpd.conf bond0"

Ensure that the service won't start at boot (run this on both nodes); Cluster Suite now manages starting and stopping it:

/sbin/chkconfig dhcpd off

I'd add an entry for this new service IP to the hosts files (and your site DNS):

10.1.10.25 bldg1cludhcp bldg1cludhcp.lan

Now we need to propagate the new cluster.conf. Bump the config_version number up by one in the cluster.conf on the node where we edited it, then validate the config with:


# ccs_config_validate
Configuration validates

Then, if it validates, propagate it using:

cman_tool version -r
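
You can check that both nodes have picked up the change by running the following on each node; it should report the config version you just bumped to:

cman_tool version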

Check whether the service starts at this point with clustat:

# clustat
Cluster Status for bldg1ux01clu @ Mon Apr 25 16:03:12 2011
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 bldg1ux01n1i                                1 Online, Local, rgmanager
 bldg1ux01n2i                                2 Online, rgmanager

 Service Name                            Owner (Last)                   State
 ------- ----                            ----- ------                   -----
 service:dhcpd                           bldg1ux01n1i                   started

If it fails to start, check /var/log/messages on each node for information.

If it didn't autostart, try starting it with "clusvcadm -e dhcpd".
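Once it is running it is also worth testing a failover by hand. Something like the following should relocate the service to node 2 and back again (a quick sketch using the node names from my cluster.conf):

clusvcadm -r dhcpd -m bldg1ux01n2i
clustat
clusvcadm -r dhcpd -m bldg1ux01n1i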

Phew x 2, one service down....Don't worry they get easier...

