Failover Cluster – Node won’t join the cluster (EventID 1135)

Today, I had an issue with one of the cluster nodes. After installing the DPM agent on one of the nodes, the server refused to join the cluster. An error like this is not something to be happy about.

Cluster node ‘ServerName’ was removed from the active failover cluster membership.

EventID 1135 – Node was removed from active failover cluster membership.

In the cluster logs, I could see a lot of EventID 1135 entries:

Cluster node ‘ServerName’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
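If you want to see how often the event is firing, here is a minimal sketch using Get-WinEvent (the 50-event cap is an arbitrary choice):

# Pull recent EventID 1135 entries from the System log
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 1135 } -MaxEvents 50 |
    Select-Object TimeCreated, Message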


Tracking down the system logs on the server itself, I could see a few other errors:

EventID 1146 and 1070.


All in all, there was no usable information in these logs.


Troubleshooting

To get the correct information and logs, I used the Get-ClusterLog PowerShell cmdlet to generate a log file for each member of the cluster.

The command I ran on a healthy node is:

Get-ClusterLog -TimeSpan 5 -Destination \\Node1\C$\ClusterLogs\

This tells every cluster member to generate a cluster log covering the last five minutes and collects all of the files in the same destination. Within that window, you should try to start the Cluster service on the node that is causing the issue, so the failure gets captured in the log; a sketch of doing that from PowerShell follows.
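A minimal sketch of starting the service from PowerShell, assuming the problem node is named Node5 (a placeholder; substitute your own node name):

# From any cluster member: try to bring the node back into cluster membership
Start-ClusterNode -Name Node5

# Or, while logged on to the problem node itself, start the service directly
Start-Service -Name ClusSvc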

After the command completes, and you have tried to start the Cluster service on the problem node, you will end up with a set of cluster log files.

Cluster log files for a 6-node cluster.

I analyzed the log file from the node that was having the issue (as you can see, that file is the largest in the picture above).
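To narrow the search, you can pull only the error and warning lines out of a log before reading it end to end. A minimal sketch, assuming the problem node’s log was collected as C:\ClusterLogs\Node5_cluster.log (the path and file name are assumptions):

# Show only ERR and WARN entries from the problem node's cluster log
Select-String -Path 'C:\ClusterLogs\Node5_cluster.log' -Pattern ' ERR ', ' WARN ' |
    Select-Object -ExpandProperty Line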

From the logs generated by Get-ClusterLog, I found out that one of the NICs (iSCSI B2 in this case) was causing the issue:

ERR mscs::GumAgent::ExecuteHandlerLocally: AlreadyExists(183)’ because of ‘already exists’ (Node5 – iSCSI B2)
WARN [DM] Aborting group transaction 80:80:5843+1
ERR [CORE] Node 5: exception caught (183)’ because of ‘Gum handler completed as failed’
ERR Exception in the PostForm is fatal (status = 183)
WARN [RHS] Cluster service has terminated. Cluster.Service.Running.Event got signaled.

Log file generated by the cluster, viewed with the Configuration Manager Trace Log Tool (CMTrace).

Resolution

After I identified the likely cause, I tried a simple fix: renaming the affected NIC (in this case iSCSI B2) and starting the Cluster service again.
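If you prefer to do the rename from PowerShell rather than the Network Connections GUI, here is a minimal sketch under the same assumptions (Node5 and the temporary adapter name are placeholders; Rename-NetAdapter is available on Windows Server 2012 and later):

# On the problem node: rename the affected adapter to clear the name conflict
Rename-NetAdapter -Name 'iSCSI B2' -NewName 'iSCSI B2 temp'

# Then try to rejoin the node to the cluster
Start-ClusterNode -Name Node5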

After a few seconds, the service was running and the node had joined the cluster again.


After you confirm that the cluster node is operating normally, you can stop the Cluster service from the Failover Cluster Manager console and rename the NIC back so it matches the rest of the nodes; a sketch of the same cleanup in PowerShell is below.
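The same cleanup can be scripted; a minimal sketch with the same placeholder names as above:

# Remove the node from active membership, restore the original NIC name, then rejoin
Stop-ClusterNode -Name Node5
Rename-NetAdapter -Name 'iSCSI B2 temp' -NewName 'iSCSI B2'
Start-ClusterNode -Name Node5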


Hope this helps someone.
