Sunday, 24 April 2011

Building a RHEL 6/Centos 6 HA Cluster for LAN Services (part 2)

Initial OS Install


Now that we have the hardware set up, we can move on to installing the software.

Install a base RHEL/Centos 6 on both systems. The main customisations at this stage are the partition table and hostnames. I also picked out the packages for the services I required (including the cluster services).


On the partition table: I left the Dell Utility partition in place, added sda2 as the root filesystem (100GB), put swap on sda3 (64 GB, which is double my RAM) and used the remainder of the disk for DRBD (sda4). I assigned partition types as usual and just left the proposed DRBD partition as type Linux, giving me:

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1          15      120456   de  Dell Utility
/dev/sda2   *          16       12764   102400000   83  Linux
/dev/sda3           12764       21118    67108864   82  Linux swap / Solaris
/dev/sda4           21118      145782  1001364219   83  Linux

For hostnames I'd use something that makes the two hosts easy to identify, say bldg1ux01n1 and bldg1ux01n2 (so this is something like "building 1, Linux 01, node 1").

I'd disable SELinux. I'm not usually for this (esp on Internet facing hosts) as it just looks like laziness from an admin perspective (not bothering to learn it, it's not that hard). But in this case this is an Intranet server and SELinux on clustering has only come in on RHEL 6, so I'll personally leave it time to stabilise for my comfort factor before trying.

I chkconfig'd NetworkManager off, as it's not supported in a cluster (no surprise there really), and chkconfig'd network on.

I configured eth2 and eth3 as a load balanced bond0 network (this is for talking to the main network). I'm not going to go into too much detail on this. My switch supports link aggregation, so I'm using mode 4 (802.3ad). My "bond 0" network config now looks like this (on node 1):

more /etc/sysconfig/network-scripts/ifcfg-bond0                            
DEVICE=bond0
NM_CONTROLLED=no
ONBOOT=yes
BOOTPROTO=static
IPADDR="10.1.10.20"
NETMASK="255.255.255.0"
BROADCAST="10.1.10.255"
IPV6INIT=no
USERCTL=no
NAME="System bond 0"
BONDING_OPTS="miimon=100 mode=4 lacp_rate=1 xmit_hash_policy=layer3+4"

Node 2 is identical except the IPADDR is 10.1.10.21. 

BTW there is a bug (which might have been fixed by the time you try this): in the interface config files for the bond slaves (in this case ifcfg-eth2 and ifcfg-eth3) the MASTER=bond0 value MUST NOT be quoted or the bond fails to come up, see bz#669110.
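The slave files themselves are not shown here, so the following is only a sketch of what a typical RHEL 6 bonding slave file contains, with MASTER left unquoted as the bug requires. The function name write_slave_cfg and the exact key set are my assumptions, not the files from this setup.

```shell
# Sketch: write a bonding slave config such as ifcfg-eth2.
# Assumption: a typical RHEL 6 slave key set, not the exact file used here.
write_slave_cfg() {
    dev=$1      # slave interface, e.g. eth2
    master=$2   # bond it belongs to, e.g. bond0
    dir=${3:-/etc/sysconfig/network-scripts}
    # MASTER is deliberately written without quotes (see bz#669110).
    cat > "$dir/ifcfg-$dev" <<EOF
DEVICE=$dev
NM_CONTROLLED=no
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
MASTER=$master
SLAVE=yes
EOF
}

# Example (writes into the real network-scripts directory):
#   write_slave_cfg eth2 bond0
#   write_slave_cfg eth3 bond0
```

Passing a scratch directory as the third argument lets you inspect the generated file before copying it into place.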

I'm configuring bond1 for DRBD and cluster comms as a failover bond, as it's not a good idea to load balance the 10 Gb with a 1Gb backup. My "bond 1" network config looks like (on node 1):

more /etc/sysconfig/network-scripts/ifcfg-bond1
DEVICE=bond1
NM_CONTROLLED=no
ONBOOT=yes
BOOTPROTO=static
IPADDR="192.168.1.1"
NETMASK="255.255.255.0"
BROADCAST="192.168.1.255"
IPV6INIT=no
USERCTL=no
NAME="DRBD/Cluster bond 1"
BONDING_OPTS="miimon=100 mode=1 primary=eth4"
MTU=9000

Node 2 will be identical except for IPADDR being 192.168.1.2. Also note that I have set MTU to 9000 to allow Jumbo frames to optimise the performance of DRBD.
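It is worth checking that jumbo frames really pass end to end before relying on them. A rough sketch, assuming IPv4 and the iputils ping shipped with RHEL/Centos: the largest ICMP payload that fits in one frame is the MTU minus 28 bytes (20 for the IP header, 8 for the ICMP header). The function names are made up for illustration.

```shell
# Largest ICMP payload that fits in a single IPv4 frame of the given MTU.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Ping the peer with the Don't Fragment bit set (-M do), so the ping
# fails instead of silently fragmenting if any hop has a smaller MTU.
check_jumbo_path() {
    peer=$1
    mtu=${2:-9000}
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$mtu")" "$peer"
}

# Example, from node 1:
#   check_jumbo_path 192.168.1.2
```

If the ping reports that the message is too long, something on the path (the NIC, the bond or its slaves) is still at a smaller MTU.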

The other interface (eth1) is used for fencing. It isn't a bond; it is simply set up statically in ifcfg-eth1 with IP 192.168.2.1 on node 1 and 192.168.2.2 on node 2.

A few other quick setup things. I removed rhgb and quiet from kernel lines in grub.conf and set the startup level to 3 in /etc/inittab. I personally removed the local user I was forced to add at install time.
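If you would rather script those two edits than do them by hand, something like the sketch below should work. It assumes the stock file locations (/boot/grub/grub.conf and /etc/inittab) and the standard "id:N:initdefault:" line; the function names are hypothetical. Try it on copies of the files first.

```shell
# Remove the words rhgb and quiet from kernel lines only.
strip_boot_splash() {
    f=${1:-/boot/grub/grub.conf}
    sed -e '/^[[:space:]]*kernel[[:space:]]/{
s/[[:space:]]rhgb$//
s/[[:space:]]rhgb[[:space:]]/ /
s/[[:space:]]quiet$//
s/[[:space:]]quiet[[:space:]]/ /
}' "$f" > "$f.new" &&
        cat "$f.new" > "$f" && rm -f "$f.new"   # cat keeps the file's mode and owner
}

# Set the default runlevel in inittab.
set_default_runlevel() {
    level=$1
    f=${2:-/etc/inittab}
    sed -e "s/^id:[0-9]:initdefault:/id:$level:initdefault:/" "$f" > "$f.new" &&
        cat "$f.new" > "$f" && rm -f "$f.new"
}

# Example:
#   strip_boot_splash
#   set_default_runlevel 3
```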

Also, if you value your sanity, you should set up passwordless public key SSH for root between your two nodes. It makes life so much easier when you are copying files around.
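A minimal sketch of the key exchange, using the standard ssh-keygen and ssh-copy-id tools. The key type and size, the wrapper names and the DRY_RUN switch are my own choices for illustration; use whatever your site policy says. Run it on each node, pointing at the other.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create a root key pair if there isn't one, then install the public
# key on the peer (this prompts for the peer's root password once).
setup_root_ssh() {
    peer=$1
    keyfile=${2:-$HOME/.ssh/id_rsa}
    [ -f "$keyfile" ] || run ssh-keygen -t rsa -b 4096 -N "" -f "$keyfile"
    run ssh-copy-id -i "$keyfile.pub" "root@$peer"
}

# Example, as root on node 1:
#   setup_root_ssh bldg1ux01n2
```

Setting DRY_RUN=1 first shows the commands without running them.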

Then I just configured the two nodes as normal server hosts on my network (minus any service work). So basically that means any Directory Service/Authentication you use (LDAP, NIS, AD etc.), automounter and sendmail (I want alerts). I'm going to point these systems at another system for DNS in my resolv.conf until we get that going in the cluster. Add anything else you usually do, e.g. restricting SSH. I'd also "chkconfig iptables off" until we get everything going.

You should have a working yum.conf setup (whether Internet or local repos) for the OS, all the Cluster stuff (High Availability and Resilient Storage (GFS)) and DRBD repos.

I will now add all these IPs to the local hosts file on both nodes. I don't want to rely on DNS for any cluster communications.

more /etc/hosts                                                            
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.1.10.20    bldg1ux01n1.lan bldg1ux01n1
10.1.10.21    bldg1ux01n2.lan bldg1ux01n2
192.168.1.1   bldg1ux01n1i
192.168.1.2   bldg1ux01n2i
192.168.2.1   bldg1ux01n1f
192.168.2.2   bldg1ux01n2f
192.168.2.3   bldg1ux01fd

The "i" names are the back to back network connections (standing for Internal), the "f" names are the fence network IPs and bldg1ux01fd is going to be the IP of the APC power switch fence device.
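Since the cluster will depend on these entries, a quick check that every name is present on both nodes saves some head scratching later. A small sketch (the function name missing_hosts is made up): it reads a hosts file directly rather than going through the resolver, so DNS cannot mask a missing entry.

```shell
# Print each given name that does NOT appear in the hosts file.
# Comments are stripped and only the name columns (field 2 onwards)
# are compared.
missing_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        awk -v n="$name" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }
        ' "$hosts_file" || echo "$name"
    done
}

# Example (no output means everything is there):
#   missing_hosts /etc/hosts bldg1ux01n1 bldg1ux01n2 bldg1ux01n1i \
#       bldg1ux01n2i bldg1ux01n1f bldg1ux01n2f bldg1ux01fd
```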

So at this point you might as well configure the fence device you are using. As I said, I'm using an APC 7920, so I set its IP to 192.168.2.3 and gave it a secure password.

Configuring the NTP Service
I guess I cheated a bit by including NTP as a cluster service in my list in part 1. NTP is easy: you just give the client machines both nodes' IPs on the main network, and they will pick one and swap over if that node fails, so there is no need for a sophisticated cluster setup. I'd just recommend pointing the two nodes at the same source time servers and setting each node as a peer of the other in ntp.conf. A fragment of ntp.conf is below:

So on node 1:

restrict 10.2.100.20 mask 255.255.255.255 nomodify notrap noquery
server 10.2.100.20 key 30
restrict 10.2.100.21 mask 255.255.255.255 nomodify notrap noquery
server 10.2.100.21 key 30
restrict 192.168.1.2 mask 255.255.255.255 nomodify notrap noquery
peer 192.168.1.2 key 30

Node 2 is the same, but with 192.168.1.1 as the peer.

Then just chkconfig ntpdate and ntpd on.


Configuring DRBD
First install the DRBD packages. This assumes you have configured the DRBD repos; I'm not sure whether the package names are the same in the free repo.

yum install kmod-drbd drbd-utils drbd-bash-completion

DRBD now splits the config into global_common.conf and one file per DRBD resource.
I have:

# more /etc/drbd.d/global_common.conf                                        
global {
        usage-count no;
}

common {
        protocol C;

        handlers {
                split-brain "/usr/lib/drbd/notify-split-brain.sh root";
                out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
        }

        startup {
                wfc-timeout  600;       # Wait 600 seconds for the initial connection
                degr-wfc-timeout 600;   # Wait 600 seconds if this node was part of a degraded cluster
                #become-primary-on both;
        }

        disk {
                on-io-error detach;
        }

        net {
                sndbuf-size 1024k;
                allow-two-primaries;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
        }

        syncer {
                rate 100M;
                al-extents 3389;
                verify-alg md5;
        }
}

# more /etc/drbd.d/r0.res                                                    
resource r0 {

        on bldg1ux01n1 {
                device    /dev/drbd0;
                disk      /dev/sda4;
                address   192.168.1.1:7789;
                meta-disk internal;
        }

        on bldg1ux01n2 {
                device    /dev/drbd0;
                disk      /dev/sda4;
                address   192.168.1.2:7789;
                meta-disk internal;
        }
}
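One thing that commonly trips people up with the resource file: the name after each "on" has to match what "uname -n" prints on that node, so if your hosts report a fully qualified name the short names above will not match. A rough sketch of a check (the function names are made up, and it only handles "on <host> {" written on one line, as in the file above):

```shell
# Print the host name from every "on <host> {" line of a resource file.
res_hosts() {
    sed -n 's/^[[:space:]]*on[[:space:]]\{1,\}\([^[:space:]{]\{1,\}\).*/\1/p' "$1"
}

# Succeed if the given node name (default: this node's uname -n)
# has an "on" section in the resource file.
node_in_res() {
    res_hosts "$1" | grep -Fqx "${2:-$(uname -n)}"
}

# Example, on each node:
#   node_in_res /etc/drbd.d/r0.res || echo "no 'on' section for $(uname -n)"
```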

Note that I have some tuning parameters in here that are very specific to my HW setup, namely "al-extents" and "sndbuf-size" (plus the MTU value set earlier on the network). You should probably take these out during setup, then follow the DRBD manual for performance optimisation and testing once DRBD is working, but before you put any file systems or cluster setup (e.g. CLVMD) on top of it.

To set up DRBD, run the following on both nodes:

/etc/init.d/drbd start 
drbdadm create-md r0
drbdadm up r0
                                                                                                   
Both nodes should show (cat /proc/drbd):
                                                                                                   
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
                                                                                                   
On node 1 (or node 2, please yourself) : 
drbdadm -- --overwrite-data-of-peer primary r0
                                                                                                   
Devices should now sync up. cat /proc/drbd for status.
                                                                                                   
I'd let the sync finish.
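If you want to script the wait rather than watch it, the status line can be picked apart with standard tools. This is only a sketch based on the status line format shown above (cs:, ro: and ds: fields); the function names and the 30 second poll interval are arbitrary choices of mine.

```shell
# Extract one field (cs, ro or ds) from a /proc/drbd status line.
drbd_field() {
    printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1://p"
}

# Block until the given minor reports both sides UpToDate.
# Note: loops forever if the minor never appears in the status file.
wait_for_sync() {
    minor=$1
    status_file=${2:-/proc/drbd}
    interval=${3:-30}
    until [ "$(drbd_field ds "$(grep "^ *$minor: cs:" "$status_file")")" = "UpToDate/UpToDate" ]; do
        sleep "$interval"
    done
}

# Example:
#   wait_for_sync 0 && echo "drbd0 is in sync"
```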
                                                                                                   
Uncomment "become-primary-on both" in the global_common.conf file, and run

/sbin/drbdadm adjust all 

to reload config.
                                                                                                   
Then, on the node that wasn't primary, switch it to primary too.
                                                                                                   
# /sbin/drbdadm primary r0
# more /proc/drbd 
                                                                                                   
will show                                                                                          
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
                                                                                                   
So we are now in dual primary mode!

As I said earlier, you should now follow the performance optimisation section of the DRBD manual for tuning parameters. The method they use is destructive and involves dd'ing to the raw device as well as the DRBD device. Personally, once I was happy with my performance I removed the DRBD device and re-added it to ensure my setup was clean (no inconsistent blocks). This will involve some reading of the DRBD manual to understand what you are doing. I'm glossing over this here.

Also, you should avoid using /etc/init.d/drbd restart at any point. At the time of writing it brings the device up as "Secondary" even in a dual primary setup, which is a minor bug in the startup script that I have reported to Linbit. Using stop then start works fine.
