If you're reading this third part of the Creating a Windows Cluster series, welcome back!
In the first two installments of the series we covered Using iSCSI to Connect to Shared Storage in Part 1 and Configuring Shared Disk in the OS in Part 2. Now we're at the point where we can actually create the cluster.
This installment will be a little longer. It covers two main tasks. First, the Failover Clustering feature will be installed on the prospective cluster nodes. Next, the cluster itself will be created and we'll check some basic configuration items.
At this point the word "prospective" will be dropped and the machines we have been working with will have graduated to the position of full-blown "Cluster Nodes".
Repeat the steps above on the other prospective cluster nodes.
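If you'd rather script the feature installation than click through the Add Roles and Features wizard, a minimal PowerShell sketch looks like this (run it in an elevated session; the node names are placeholders for your own servers):

```powershell
# Install the Failover Clustering feature plus the management tools
# (Failover Cluster Manager and the FailoverClusters PowerShell module).
Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools

# The same cmdlet can target the other prospective nodes remotely.
Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools -ComputerName NODE2
```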
Now we have completed installing the Failover Clustering feature on each of the servers that will be nodes in our cluster. To review: we connected to shared storage in the first post in this series. We used iSCSI, but there are other ways to connect to shared storage, such as Fibre Channel. Next, we configured the disks in the OS of the prospective nodes, so at this point one node has the disks online and the other has them offline. The next step is to actually create the cluster.
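As with the feature install, this can be scripted instead of done through the Create Cluster wizard. A minimal sketch, where the node names, cluster name, and IP address are all placeholders you'd swap for your own:

```powershell
# Run the cluster validation tests against the prospective nodes first.
Test-Cluster -Node NODE1, NODE2

# Create the cluster. The wizard asks for these same two values:
# a name for the cluster and a static IP for its network name resource.
New-Cluster -Name CLUSTER1 -Node NODE1, NODE2 -StaticAddress 192.168.1.100
```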
Now the cluster is created and we have checked that the storage is available and the network connections are all configured as they should be. The next thing we should do is to test failover. After all, what good is a cluster if it doesn't fail over? Might as well just have a standalone machine.
In this test we will monitor the disk resources and watch them as they fail over from one node to the next.
Examine the Failover Cluster Manager. Check the Disks node and look at the storage in the right pane. Make note of which machine owns the storage as indicated in the Owner Node column.
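You can get the same answer from the FailoverClusters PowerShell module that was installed along with the feature. A quick sketch:

```powershell
# List the clustered disks with their current state and owner node.
Get-ClusterResource |
    Where-Object ResourceType -eq 'Physical Disk' |
    Format-Table Name, State, OwnerNode -AutoSize
```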
If you aren't already there, you'll need to log on to the machine that is not the storage Owner Node and monitor the failover from Failover Cluster Manager on that machine. The test involves simulating an unexpected failure of the node that owns the storage resource, and we won't be able to monitor the cluster from a "failed" node.
As fun as it might sound, I doubt any of us will test our node failover by wailing on it with a sledgehammer, or dousing it with a bucket of water, or making a big noise with a stick of dynamite strategically wedged into one of the drive bays.
Makes me think of a guy I worked with back in the early '90s. When he would see someone with an open server on the bench, something compelled him to make a point of walking by to flip a quarter into the open chassis. Oh, the fun of watching George bounce around on the motherboard while scrambling for the power button before something bad happened. But I digress…
No, we are going to pick something a little more benign to simulate a failure. Whatever you do, make sure it is unexpected to the OS of the node you are causing to fail. If you stop the cluster service, the resources will fail over, but that failover is handled by the service as it stops. If you do a proper shutdown, the service will likewise coordinate the failover. We want something more like the moment when the motherboard decides to do an impression of a genie leaving the bottle. (OK, old reference I haven't heard in a long time. Let's go back 28 years: I was doing component-level electronics, repairing military equipment that had been returned from the field, and this was the phrase we used when you powered up a unit and it let out some smoke. "Hey, you just let the genie out of the bottle!" Man, this post turned into a trip down memory lane, didn't it?)
Your failure needs to be something more along the lines of disconnecting the network cables. Another good option, if you are using virtual machines, is to go into the machine configuration and disconnect the NICs; this way the machine is no longer communicating on the network, and the other node will detect that as a failure. A third option is to disconnect the power, not giving the machine a chance to initiate a failover.
Personally, I prefer the network-interruption method. I haven't seen a machine go belly up due to an improper shutdown in some time, but I guess old habits die hard and I don't like the idea of not performing a graceful shutdown. Probably never will.
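If your nodes happen to be Hyper-V VMs, both failure modes can be triggered from the host. A sketch, with the VM and switch names as placeholders:

```powershell
# Simulate a network failure: yank every virtual NIC from the VM at once.
Disconnect-VMNetworkAdapter -VMName 'NODE1'

# Or simulate a power pull: -TurnOff stops the VM without a guest shutdown.
Stop-VM -Name 'NODE1' -TurnOff

# Reconnect the NICs afterward to bring the "failed" node back.
Connect-VMNetworkAdapter -VMName 'NODE1' -SwitchName 'ClusterSwitch'
```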
Once you do something to cause an unexpected failure on the node that owns the disk resources, watch the other cluster node. After about 10 or 15 seconds you will see the disk resources flash "Offline" and then come back online, and the name in the Owner Node column in Failover Cluster Manager will change from the node you caused to "fail" to the node you are monitoring from.
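If you'd rather watch from a console, the same disk query from earlier works in a quick polling loop (Ctrl+C to stop):

```powershell
# Re-list the clustered disks every couple of seconds so you can watch
# State flip to Offline and OwnerNode change during the failover.
while ($true) {
    Get-ClusterResource |
        Where-Object ResourceType -eq 'Physical Disk' |
        Format-Table Name, State, OwnerNode -AutoSize
    Start-Sleep -Seconds 2
}
```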
And when you see that, you know your basic cluster is working. Next time we'll look at creating a SQL Server cluster on top of the failover cluster we just built.
Until then, wishing you all the best!