HA Active/Active NetScaler failure

Article ID: CTX228286

Description

An HA Active/Active failure occurred on a NetScaler HA pair. The HA counters captured around the time of the failure are shown below.
nsconmsg105 -K newnslog.74 -g ha_tot_state -g ha_tot_master -g ha_cur_master  -s disptime=1 -s time=10aug2017:07:43 -d current | more

Display start time set to Thu Aug 10 07:43:00 2017
Displaying performance information
NetScaler V20 Performance Data
NetScaler NS10.5: Build 60.7.nc, Date: Nov 14 2015, 04:53:49
 
 
reltime:mili second between two records Thu Aug 10 07:43:32 2017
  Index   rtime totalcount-val      delta rate/sec symbol-name&device-no&time
      0   84000             34          1        0 ha_tot_master_claim  Thu Aug 10 07:43:32 2017 >>>>>>HA Master  claim
      1    7000              2          2        0 ha_cur_master_state  Thu Aug 10 07:43:39 2017
      2       0             62          1        0 ha_tot_master_change  Thu Aug 10 07:43:39 2017
      3  175000             51          1        0 ha_tot_state_fail  Thu Aug 10 07:46:34 2017


72216281     0 PPE-2 interface(LA/2): HA heartbeats received Thu Aug 10 07:43:23 2017

72216282     7 PPE-2 interface(LA/2): No HA heartbeats (Last received: Thu Aug 10 07:43:23 2017
72216283     0 PPE-2 interface(LA/2): HA heartbeats received Thu Aug 10 07:43:26 2017
72216284     0 PPE-2 interface(LA/2): No HA heartbeats (Last received: Thu Aug 10 07:43:26 2017
72216285     0 PPE-2 interface(LA/1): No HA heartbeats (Last received: Thu Aug 10 07:43:28 2017

Event changes (HA primary state flipping between the two nodes):

72216286     0 PPE-2 remote node x.x.x.x: DOWN        Thu Aug 10 07:43:32 2017
72216287     0 PPE-2 self node x.x.x.x: Claiming      Thu Aug 10 07:43:32 2017
72216288     0 PPE-2 self node x.x.x.x: Primary       Thu Aug 10 07:43:32 2017
 
72216465     0 PPE-2 self node x.x.x.x: ROUTEMONITOR_FAIL Thu Aug 10 07:46:33 2017
72216492     0 PPE-2 self node x.x.x.x: Secondary     Thu Aug 10 07:46:33 2017
72216495     0 PPE-2 self node x.x.x.x: UP            Thu Aug 10 07:46:33 2017
72216496     7 PPE-2 self node x.x.x.x: Claiming      Thu Aug 10 07:46:35 2017
72216497     0 PPE-2 self node x.x.x.x: Primary       Thu Aug 10 07:46:36 2017
 
72216627     0 PPE-2 self node x.x.x.x: ROUTEMONITOR_FAIL Thu Aug 10 07:49:37 2017
72216693     0 PPE-2 self node x.x.x.x: Secondary     Thu Aug 10 07:49:37 2017
72216696     0 PPE-2 self node x.x.x.x: UP            Thu Aug 10 07:49:37 2017
72216697     0 PPE-2 self node x.x.x.x: Claiming      Thu Aug 10 07:49:39 2017
72216698     0 PPE-2 self node x.x.x.x: Primary       Thu Aug 10 07:49:40 2017
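The flip pattern in the events above (Primary, then ROUTEMONITOR_FAIL and Secondary, then Primary again roughly every three minutes) can be counted programmatically. The following is a minimal sketch, not a Citrix tool: the regular expression and the function names `parse_events` and `count_primary_claims` are assumptions written against the line layout shown in this article, and other builds may format event lines differently.

```python
import re

# Assumed layout, taken from the event lines in this article:
# "<seq> <n> PPE-<id> self|remote node <ip>: <STATE> <timestamp>"
EVENT_RE = re.compile(
    r"^\s*(?P<seq>\d+)\s+\d+\s+PPE-\d+\s+(?P<who>self|remote) node\s+\S+:\s+"
    r"(?P<state>[A-Za-z_]+)\s+"
    r"(?P<ts>\w{3} \w{3}\s+\d+ \d\d:\d\d:\d\d \d{4})\s*$"
)


def parse_events(text):
    """Return (node, state, timestamp) tuples for each HA state event line."""
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line)
        if match:
            events.append((match.group("who"), match.group("state"), match.group("ts")))
    return events


def count_primary_claims(events):
    """Count how many times the local node became Primary."""
    return sum(1 for who, state, _ in events if who == "self" and state == "Primary")


# Sample lines copied from the output above.
SAMPLE = """\
72216286     0 PPE-2 remote node x.x.x.x: DOWN        Thu Aug 10 07:43:32 2017
72216287     0 PPE-2 self node x.x.x.x: Claiming      Thu Aug 10 07:43:32 2017
72216288     0 PPE-2 self node x.x.x.x: Primary       Thu Aug 10 07:43:32 2017
72216465     0 PPE-2 self node x.x.x.x: ROUTEMONITOR_FAIL Thu Aug 10 07:46:33 2017
72216492     0 PPE-2 self node x.x.x.x: Secondary     Thu Aug 10 07:46:33 2017
72216497     0 PPE-2 self node x.x.x.x: Primary       Thu Aug 10 07:46:36 2017
"""

events = parse_events(SAMPLE)
claims = count_primary_claims(events)
print(f"local node became Primary {claims} time(s)")
```

More than one Primary transition on the same node within a few minutes is a reasonable signal that the pair is flapping rather than completing a single clean failover.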

============

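The node stops receiving HA traffic. In the output below, ha_tot_pkt_rx last increments at 07:43:32, while ha_tot_pkt_tx continues to increment at its usual rate: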

nsconmsg105 -K newnslog.74 -g ha_tot_pkt_ -s disptime=1 -s time=10aug2017:07:43 -d current | more
Display start time set to Thu Aug 10 07:43:00 2017
Displaying performance information
NetScaler V20 Performance Data
NetScaler NS10.5: Build 60.7.nc, Date: Nov 14 2015, 04:53:49
 
 
reltime:mili second between two records Thu Aug 10 07:43:11 2017
  Index   rtime totalcount-val      delta rate/sec symbol-name&device-no&time
      0   63000      298273455         37        5 ha_tot_pkt_rx  Thu Aug 10 07:43:11 2017
      1       0      503682722         70       10 ha_tot_pkt_tx  Thu Aug 10 07:43:11 2017
      2    7000      298273492         37        5 ha_tot_pkt_rx  Thu Aug 10 07:43:18 2017
      3       0      503682792         70       10 ha_tot_pkt_tx  Thu Aug 10 07:43:18 2017
      4    7000      298273529         37        5 ha_tot_pkt_rx  Thu Aug 10 07:43:25 2017
      5       0      503682862         70       10 ha_tot_pkt_tx  Thu Aug 10 07:43:25 2017
      6    7000      298273549         20        2 ha_tot_pkt_rx  Thu Aug 10 07:43:32 2017        <<< last HA RX packets
      7       0      503682936         74       10 ha_tot_pkt_tx  Thu Aug 10 07:43:32 2017
      8    7000      503683008         72       10 ha_tot_pkt_tx  Thu Aug 10 07:43:39 2017
      9    7000      503683078         70       10 ha_tot_pkt_tx  Thu Aug 10 07:43:46 2017
     10    7000      503683148         70       10 ha_tot_pkt_tx  Thu Aug 10 07:43:53 2017
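The same check can be scripted against saved nsconmsg output. This is a minimal sketch under two assumptions: that rows follow the column layout shown above, and that a counter row is printed only when the counter changes (which is what the output above suggests, since ha_tot_pkt_rx rows disappear after 07:43:32). The names `parse_rows`, `last_seen`, and `rx_stalled_at` are hypothetical helpers, not part of any NetScaler tooling.

```python
import re
from datetime import datetime

# Assumed columns: Index, rtime, totalcount-val, delta, rate/sec, symbol, timestamp.
ROW_RE = re.compile(
    r"^\s*(?P<index>\d+)\s+(?P<rtime>\d+)\s+(?P<total>\d+)\s+(?P<delta>\d+)\s+"
    r"(?P<rate>\d+)\s+(?P<symbol>\w+)\s+"
    r"(?P<ts>\w{3} \w{3}\s+\d+ \d\d:\d\d:\d\d \d{4})"
)
TS_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Thu Aug 10 07:43:32 2017"


def parse_rows(text):
    """Parse counter rows into dicts; header and banner lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if match:
            rows.append({
                "symbol": match.group("symbol"),
                "total": int(match.group("total")),
                "ts": match.group("ts"),
            })
    return rows


def last_seen(rows, symbol):
    """Timestamp string of the last row for the given counter, or None."""
    last = None
    for row in rows:
        if row["symbol"] == symbol:
            last = row["ts"]
    return last


def rx_stalled_at(rows):
    """Return the last RX timestamp if TX kept going after it, else None."""
    last_rx = last_seen(rows, "ha_tot_pkt_rx")
    last_tx = last_seen(rows, "ha_tot_pkt_tx")
    if last_rx is None or last_tx is None:
        return None
    if datetime.strptime(last_tx, TS_FORMAT) > datetime.strptime(last_rx, TS_FORMAT):
        return last_rx
    return None


# Sample rows copied from the output above.
SAMPLE = """\
      4    7000      298273529         37        5 ha_tot_pkt_rx  Thu Aug 10 07:43:25 2017
      5       0      503682862         70       10 ha_tot_pkt_tx  Thu Aug 10 07:43:25 2017
      6    7000      298273549         20        2 ha_tot_pkt_rx  Thu Aug 10 07:43:32 2017
      7       0      503682936         74       10 ha_tot_pkt_tx  Thu Aug 10 07:43:32 2017
      8    7000      503683008         72       10 ha_tot_pkt_tx  Thu Aug 10 07:43:39 2017
      9    7000      503683078         70       10 ha_tot_pkt_tx  Thu Aug 10 07:43:46 2017
"""

rows = parse_rows(SAMPLE)
stalled = rx_stalled_at(rows)
print("RX stalled after:", stalled)
```

Comparing the last RX and TX timestamps, rather than looking for a zero delta, matches how the stall appears in this output: the receive counter simply stops being reported.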


 

Resolution



By default, HA-related traffic flows through the VLANs to which the NSIP address is bound. To accommodate a potential surge in this traffic, you can separate HA-related traffic from management traffic and restrict it to a dedicated VLAN, called the HA SYNC VLAN. When an HA SYNC VLAN is used, configure the same VLAN ID on both nodes so that each node continues to receive the other node's HA traffic.

Problem Cause

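The sync VLAN configuration did not match between the primary and secondary nodes, which caused the nodes to fail over repeatedly. The ns.log entries below show the set HA node command being run with -syncvlan 3094 and with -syncvlan 1: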

var/log]$ grep syncvlan ns.log.*
ns.log.0:Aug 10 21:58:18 <local0.info> 220.76.194.2 08/10/2017:12:58:18 GMT ns 0-PPE-0 : CLI CMD_EXECUTED 547484968 0 :    - Command "set HA node -haStatus ENABLED -haSync ENABLED -haProp ENABLED -helloInterval 200 -deadInterval 3 -failSafe OFF -maxFlips 0 -maxFlipTime 0 -syncvlan 3094" - Status "Success"
ns 0-PPE-0 : CLI CMD_EXECUTED 546965186 0 :  Command "set HA node -haStatus ENABLED -haSync ENABLED -haProp ENABLED -helloInterval 200 -deadInterval 3 -failSafe OFF -maxFlips 0 -maxFlipTime 0 -syncvlan 1" - Status "Success"
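To compare the setting across a pair, the `-syncvlan` value can be pulled out of each node's ns.log. The sketch below is an illustration only: the function names are hypothetical, the regular expression assumes the `CMD_EXECUTED` line format shown above, and treating the two sample lines as coming from different nodes is an assumption, since the article does not say which node logged which command.

```python
import re

# Assumed format: ... Command "set HA node ... -syncvlan <id>" - Status "Success"
SYNCVLAN_RE = re.compile(r'Command "set HA node\b[^"]*?-syncvlan\s+(\d+)')


def syncvlan_values(log_text):
    """Return every -syncvlan value set via 'set HA node', in log order."""
    return [int(match.group(1)) for match in SYNCVLAN_RE.finditer(log_text)]


def latest_syncvlan(log_text):
    """Return the most recently set sync VLAN, or None if none was found."""
    values = syncvlan_values(log_text)
    return values[-1] if values else None


def syncvlans_match(node_a_log, node_b_log):
    """True only if both logs contain a sync VLAN and the latest values agree."""
    vlan_a = latest_syncvlan(node_a_log)
    vlan_b = latest_syncvlan(node_b_log)
    return vlan_a is not None and vlan_a == vlan_b


# Sample commands copied from the log lines above; assigning them to
# "node A" and "node B" is an assumption made for illustration.
NODE_A_LOG = (
    'ns 0-PPE-0 : CLI CMD_EXECUTED 547484968 0 :    - Command "set HA node '
    "-haStatus ENABLED -haSync ENABLED -haProp ENABLED -helloInterval 200 "
    "-deadInterval 3 -failSafe OFF -maxFlips 0 -maxFlipTime 0 "
    '-syncvlan 3094" - Status "Success"'
)
NODE_B_LOG = (
    'ns 0-PPE-0 : CLI CMD_EXECUTED 546965186 0 :  Command "set HA node '
    "-haStatus ENABLED -haSync ENABLED -haProp ENABLED -helloInterval 200 "
    "-deadInterval 3 -failSafe OFF -maxFlips 0 -maxFlipTime 0 "
    '-syncvlan 1" - Status "Success"'
)

print("node A sync VLAN:", latest_syncvlan(NODE_A_LOG))
print("node B sync VLAN:", latest_syncvlan(NODE_B_LOG))
print("match:", syncvlans_match(NODE_A_LOG, NODE_B_LOG))
```

The log only records commands that were executed, so the running configuration on each node should still be checked directly before concluding that the values differ.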
 
 

Issue/Introduction

The HA node stopped receiving HA heartbeats. This triggered the node to initiate the primary claim process, which left the setup in an Active/Active state.