NetScaler HA Failover due to hardware failure

Article ID: CTX233299

Description

The primary node does not restart, but an HA failover occurs.
The old primary node logs the following entry, which shows that the failover request was initiated by the other node (the old secondary).

34895   207 PPE-0 self node 172.16.28.10: INIT due to REQUEST from HA peer node Thu Mar  1 08:40:16 2018
The old secondary node logs the following entries. They show that the old secondary requested the failover because it missed 15 heartbeats.
21655     0 PPE-1 interface(LA/1): No HA heartbeats (Last received: Thu Mar  1 08:40:13 2018 ; Missed 15 heartbeats) Thu Mar  1 08:40:16 2018
21656     0 PPE-1 remote node 172.16.28.10: DOWN         Thu Mar  1 08:40:16 2018
21657     0 PPE-1 self node 172.16.28.11: Claiming       Thu Mar  1 08:40:16 2018
21658     0 PPE-1 self node 172.16.28.11: Primary        Thu Mar  1 08:40:16 2018
21659     0 PPE-1 interface(LA/1): HA heartbeats received Thu Mar  1 08:40:16 2018
At this point there are two possible causes: a network issue or a hardware issue.

The newnslog file on the old primary gives a clue to the specific cause.

nsconmsg -K newnslog -d current -g ha_tot_pkt_tx -s time=01Mar2018:08:39 -s disptime=1 |more
reltime:mili second between two records Thu Mar  1 08:39:09 2018
  Index   rtime totalcount-val      delta rate/sec symbol-name&device-no&time
      7    7000       16066592         35        5 ha_tot_pkt_tx  Thu Mar  1 08:39:58 2018 
      8    7000       16066627         35        5 ha_tot_pkt_tx  Thu Mar  1 08:40:05 2018 
      9    8113       16066666         39        4 ha_tot_pkt_tx  Thu Mar  1 08:40:13 2018 
     10   10190       16066701         35        3 ha_tot_pkt_tx  Thu Mar  1 08:40:23 2018 
     11    7000       16066736         35        5 ha_tot_pkt_tx  Thu Mar  1 08:40:30 2018 

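Consider the two records logged at 08:40:13 and 08:40:23.
The interval between them is about 10 seconds (rtime 10190 ms), so NetScaler should have generated about 50 heartbeats (by default NetScaler generates 5 heartbeats per second). However, the log shows that only 35 heartbeats were generated and sent.
Because the shortfall is on the transmit side of the primary, the heartbeats were never sent rather than lost in the network, so a hardware failure is the more likely cause.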
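
The arithmetic above can be checked with a short script. This is a minimal sketch, not a Citrix tool: the rows are copied by hand from the nsconmsg output in this article, the function names are hypothetical, and it assumes, as the article does, that the ha_tot_pkt_tx delta can be read as the number of heartbeats sent at the default rate of 5 per second.

```python
# Rows copied from the nsconmsg output in this article:
# (rtime in milliseconds, delta of ha_tot_pkt_tx for that interval).
ROWS = [
    (7000, 35),
    (7000, 35),
    (8113, 39),
    (10190, 35),
    (7000, 35),
]

# Default rate stated in this article (one heartbeat every 200 ms).
HEARTBEATS_PER_SEC = 5


def heartbeat_shortfall(rtime_ms, delta, rate=HEARTBEATS_PER_SEC):
    """Return (expected, missing) heartbeats for one sampling interval.

    Integer arithmetic is used so that partial heartbeats are not counted.
    """
    expected = rtime_ms * rate // 1000
    missing = max(expected - delta, 0)
    return expected, missing


def report(rows):
    """Apply heartbeat_shortfall to every (rtime_ms, delta) row."""
    return [heartbeat_shortfall(rtime_ms, delta) for rtime_ms, delta in rows]


if __name__ == "__main__":
    for (rtime_ms, delta), (expected, missing) in zip(ROWS, report(ROWS)):
        print(f"rtime={rtime_ms} ms sent={delta} expected={expected} missing={missing}")
```

For the 10190 ms interval the sketch expects 50 heartbeats against 35 sent, a shortfall of 15, while the 7000 ms intervals show no shortfall.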

At the same time, ns_hw_err.bash reported HDD errors. The cause can therefore be identified as a hardware failure, and an RMA should be requested.
 

Resolution

Request an RMA for the HDD.

Problem Cause

A hardware failure prevented the primary from generating heartbeats for a period of time. As a result, the secondary received no heartbeat packets for 3 seconds and requested an HA failover.
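
The 3-second figure follows from the missed-heartbeat count in the secondary's log. The sketch below shows one way to derive it, assuming the default of 5 heartbeats per second (one every 200 ms) stated in this article; the function name and constant are hypothetical, and the log line is copied from the output above.

```python
import re

# Assumed default: 5 heartbeats per second, so one heartbeat every 200 ms.
HELLO_INTERVAL_MS = 200

# Log line copied from the old secondary node in this article.
LOG_LINE = (
    "21655     0 PPE-1 interface(LA/1): No HA heartbeats "
    "(Last received: Thu Mar  1 08:40:13 2018 ; Missed 15 heartbeats) "
    "Thu Mar  1 08:40:16 2018"
)


def silence_ms(line, hello_interval_ms=HELLO_INTERVAL_MS):
    """Return how long the peer was silent in ms, or None if no miss count is present."""
    match = re.search(r"Missed (\d+) heartbeats", line)
    if match is None:
        return None
    return int(match.group(1)) * hello_interval_ms


if __name__ == "__main__":
    print(silence_ms(LOG_LINE))
```

With 15 missed heartbeats the result is 3000 ms, which matches the gap between the last heartbeat received (08:40:13) and the failover (08:40:16).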