Today I had an issue with one of the nodes in a cluster. After I installed the DPM agent on one of the nodes, the server would not rejoin the cluster. An error like this is not something to be happy about:
Cluster node ‘ServerName’ was removed from the active failover cluster membership.
In the cluster event logs, I could see a lot of events with ID 1135:
Cluster node ‘ServerName’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
Going through the System log on the server itself, I could see a few other errors:
Event IDs 1146 and 1070.
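If you prefer PowerShell to Event Viewer, the events mentioned above can be pulled with something like the sketch below. It assumes an elevated session on the affected node and that the events were written by the Microsoft-Windows-FailoverClustering provider. The 24-hour window is only an example value.

```powershell
# Sketch only: list the failover clustering events 1135, 1146 and 1070
# from the System log. The 24-hour window is an arbitrary assumption.
$events = Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 1135, 1146, 1070
    StartTime    = (Get-Date).AddHours(-24)
} -ErrorAction SilentlyContinue

# Oldest first, so the sequence of failures is easier to follow.
$events |
    Sort-Object TimeCreated |
    Format-List TimeCreated, Id, LevelDisplayName, Message
```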
All in all, there was no usable information in these logs.
To get useful information, I used the Get-ClusterLog PowerShell cmdlet to generate a log file for each member of the cluster.
The command I ran on a healthy node is:
Get-Clusterlog -Timespan 5 -Destination \\Node1\C$\ClusterLogs\
This generates a cluster log for every cluster member, covering the last 5 minutes, and collects all the files in the same destination. Because only the last 5 minutes are captured, try to start the Cluster service on the problem node shortly before you run the command, so that the failed start ends up in the logs.
Once you have tried to start the Cluster service on the problem node and the command has completed, you will end up with one cluster log file per node.
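Put together, the collection step might look roughly like this when run from a healthy node. Node1 and the share path come from the command above. Node5 is assumed to be the problem node, and listing the files by size is just my way of spotting the interesting log.

```powershell
# Sketch only. Assumptions: Node1 is a healthy node hosting the log folder,
# Node5 is the node that refuses to join.
$destination = '\\Node1\C$\ClusterLogs'

# Try to bring the problem node online first, so the failed start is recent.
# -ErrorAction Continue lets the script carry on when the start fails.
Start-ClusterNode -Name 'Node5' -ErrorAction Continue

# Write the last 5 minutes of cluster log from every node into one folder.
Get-ClusterLog -TimeSpan 5 -Destination $destination

# List the generated files, largest first.
Get-ChildItem -Path $destination -Filter '*.log' |
    Sort-Object Length -Descending |
    Select-Object Name, Length, LastWriteTime
```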
I analyzed the log file from the node that was having the issue (as you can see in the picture above, that log file is the largest).
From that cluster log, I found out that one of the NICs (iSCSI B2 in this case) was causing the issue.
ERR mscs::GumAgent::ExecuteHandlerLocally: AlreadyExists(183)' because of 'already exists'(Node5 - iSCSI B2)
WARN [DM] Aborting group transaction 80:80:5843+1
ERR [CORE] Node 5: exception caught (183)' because of 'Gum handler completed as failed'
ERR Exception in the PostForm is fatal (status = 183)
WARN [RHS] Cluster service has terminated. Cluster.Service.Running.Event got signaled.
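Cluster logs are long, so it helps to filter them down to the ERR and WARN lines before reading. Below is a minimal sketch of such a filter. Get-ClusterLogProblem is a helper name I made up, and the sample input is the excerpt above plus one placeholder line that is not from a real log.

```powershell
# Sketch only: Get-ClusterLogProblem is a hypothetical helper, not a built-in cmdlet.
function Get-ClusterLogProblem {
    param(
        [Parameter(Mandatory)]
        [AllowEmptyString()]
        [string[]] $Line
    )
    # Keep only lines that carry the ERR or WARN level token (case-sensitive).
    $Line | Where-Object { $_ -cmatch '\b(ERR|WARN)\b' }
}

# Sample input: the excerpt from this post plus one placeholder line.
$sample = @(
    'INFO placeholder line, not taken from a real log'
    "ERR mscs::GumAgent::ExecuteHandlerLocally: AlreadyExists(183)' because of 'already exists'(Node5 - iSCSI B2)"
    'WARN [DM] Aborting group transaction 80:80:5843+1'
    "ERR [CORE] Node 5: exception caught (183)' because of 'Gum handler completed as failed'"
    'ERR Exception in the PostForm is fatal (status = 183)'
    'WARN [RHS] Cluster service has terminated. Cluster.Service.Running.Event got signaled.'
)

$problems = Get-ClusterLogProblem -Line $sample
$problems
```

Against a real file you would pass in the output of Get-Content for the log of the problem node instead of the sample array.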
After I identified what could be the issue, I tried a simple fix: renaming the affected NIC (iSCSI B2 in this case) and trying to start the Cluster service again.
After a few seconds, the service was running and the node had rejoined the cluster.
After you confirm that the cluster node is operating normally, you can stop the Cluster service from the Failover Cluster Manager console and rename the NIC so that it matches the rest of the nodes.
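The same workaround can be sketched in PowerShell. I did the renaming by hand, so treat this as an untested outline. It assumes Windows Server 2012 or later (for Rename-NetAdapter), an elevated session on the affected node, and made-up names: Node5 for the node and 'iSCSI B2 temp' for the temporary adapter name. The final start of the node is also my assumption.

```powershell
# Sketch only. Hypothetical names: Node5, 'iSCSI B2 temp'.

# Step 1: give the conflicting adapter a temporary name and try to join again.
Rename-NetAdapter -Name 'iSCSI B2' -NewName 'iSCSI B2 temp'
Start-ClusterNode -Name 'Node5'

# Check that the node now reports as Up.
Get-ClusterNode | Format-Table Name, State

# Step 2 (later, once the node is confirmed healthy): restore the original
# name so the adapter matches the other nodes, then start the node again.
Stop-ClusterNode -Name 'Node5'
Rename-NetAdapter -Name 'iSCSI B2 temp' -NewName 'iSCSI B2'
Start-ClusterNode -Name 'Node5'
```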
I hope this helps someone.