High Availability Failovers Due to Missed HA HeartBeats of NetScaler VPX on VMware ESX Hypervisor


Article ID: CTX217788


Description

A NetScaler VPX high availability pair on the VMware hypervisor fails over due to missed HA heartbeats.
Note: This article only pertains to NetScaler VPX on the VMware hypervisor.

Background

The root cause of these HA failovers is missed heartbeats caused by VPX scheduling issues on the VMware host. The NetScaler's Packet Engine CPU stops running for several seconds because the hypervisor halts the VPX Virtual Machine while scheduling other Virtual Machines on the same core/CPU.

Citrix recommends reserving and dedicating resources to the NetScaler VPX Virtual Machine. Refer to the Citrix Product Documentation for detailed recommendations.

Note: For production use of the NetScaler virtual appliance, the full memory allocation must be reserved. CPU cycles (in MHz) equal to at least the speed of one CPU core of the ESX host should also be reserved.

Resolution

Within the VMware hypervisor, reserve CPU resources for the NetScaler VPX Virtual Machine - on both the Primary and Secondary Virtual Machines.

The following is an example for your reference:
The VMware host has a CPU capacity of 36 x 2.294 GHz.


Here is what we recommend, per the above example, if we allocate five (5) vCPUs (a scripted sketch follows after this list):

  • 1 CPU core = 2.294 GHz, so 5 x 2.294 GHz = 11.47 GHz, which converts to 11470 MHz
  • Reserve 11470 MHz for the VPX Virtual Machine
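
The reservation is normally set in the vSphere Client on each Virtual Machine's CPU and memory settings. As an alternative, the same change can be scripted. The following is a minimal pyVmomi sketch, not an official procedure: the vCenter host name, credentials, and VM name are placeholders, and the 11470 MHz value is taken from the example above.

# Minimal pyVmomi sketch: reserve CPU cycles (MHz) and the full memory allocation
# for a NetScaler VPX VM. Host, credentials, and VM name below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def find_vm(content, name):
    # Walk the inventory and return the first Virtual Machine with a matching name.
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    try:
        return next(vm for vm in view.view if vm.name == name)
    finally:
        view.Destroy()

ctx = ssl._create_unverified_context()  # lab only; validate certificates in production
si = SmartConnect(host="vcenter.example.local",           # placeholder vCenter
                  user="administrator@vsphere.local",
                  pwd="password",
                  sslContext=ctx)
try:
    vm = find_vm(si.RetrieveContent(), "NetScaler-VPX-Primary")  # placeholder VM name
    spec = vim.vm.ConfigSpec()
    spec.cpuAllocation = vim.ResourceAllocationInfo(reservation=11470)   # MHz, per the example above
    spec.memoryAllocation = vim.ResourceAllocationInfo(
        reservation=vm.config.hardware.memoryMB)   # reserve the full memory allocation (MB)
    vm.ReconfigVM_Task(spec=spec)   # returns a task; monitor it in vCenter
finally:
    Disconnect(si)

Apply the same reservation to the Secondary Virtual Machine, and verify afterwards in the vSphere Client that both reservations are in place.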

Issue/Introduction

NetScaler VPX on VMware hypervisor High Availability failovers due to missed HA heartbeats

Additional Information

Using newnslog events to confirm that the VPX has scheduling issues

Check the failover events in /var/nslog/newnslog* with the following command:
nsconmsg -K newnslog -d event | grep -E "node|heartbeat" | more

Here is an example of what is seen for an HA failover due to missed HA heartbeats.

Primary Device:

(The Primary device is now Secondary because the Secondary device did not receive HA heartbeats from it)

 2077  7537 PPE-0 self node 192.168.1.10: INIT due to REQUEST from HA peer node Tue Jul 26 10:20:25 2016
 2062     0 PPE-1 self node 192.168.1.10: INIT due to REQUEST from HA peer node Tue Jul 26 10:20:25 2016
 2064     0 PPE-2 self node 192.168.1.10: INIT due to REQUEST from HA peer node Tue Jul 26 10:20:25 2016
 2085     0 PPE-2 self node 192.168.1.10: Secondary      Tue Jul 26 10:20:25 2016

Secondary Device:

(This Secondary device did not receive the required HA heartbeats from the Primary, causing an HA failover, and it is now Primary)

 2630  7529 PPE-0 interface(0/1): No HA heartbeats (Last received: Tue Jul 26 10:20:24 2016; Missed 15 heartbeats) Tue Jul 26 10:20:27 2016
 2631     0 PPE-0 interface(1/1): No HA heartbeats (Last received: Tue Jul 26 10:20:24 2016; Missed 15 heartbeats) Tue Jul 26 10:20:27 2016
 2632     0 PPE-0 interface(1/2): No HA heartbeats (Last received: Tue Jul 26 10:20:24 2016; Missed 15 heartbeats) Tue Jul 26 10:20:27 2016
 2633     0 PPE-0 interface(1/3): No HA heartbeats (Last received: Tue Jul 26 10:20:24 2016; Missed 15 heartbeats) Tue Jul 26 10:20:27 2016
 2634     0 PPE-0 remote node 192.168.1.10: DOWN         Tue Jul 26 10:20:27 2016          
 2635     0 PPE-0 self node 192.168.1.20: Claiming       Tue Jul 26 10:20:27 2016
 2636     0 PPE-0 self node 192.168.1.20: Primary        Tue Jul 26 10:20:27 2016
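
If the event output is long, the missed-heartbeat lines can be pulled out with a short script. Below is a minimal Python sketch, assuming the nsconmsg event output shown above has been saved to a file (the file name is a placeholder) and that the lines follow the format in the example.

# Minimal sketch: report interfaces that logged "No HA heartbeats" in saved
# nsconmsg event output. File name and line format are assumptions based on
# the sample output in this article.
# Usage: nsconmsg -K newnslog -d event > ha_events.txt ; python check_heartbeats.py ha_events.txt
import re
import sys

PATTERN = re.compile(
    r"(?P<ppe>PPE-\d+)\s+interface\((?P<iface>[^)]+)\): No HA heartbeats"
    r".*Missed (?P<missed>\d+) heartbeats")

def report(lines):
    # Print every interface that reported missed HA heartbeats.
    for line in lines:
        m = PATTERN.search(line)
        if m:
            print(f"{m.group('ppe')} interface {m.group('iface')}: "
                  f"missed {m.group('missed')} heartbeats -> {line.strip()}")

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        report(f)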

Examining the netio_tot_called counter to confirm that the VPX has scheduling issues

In the following logs, we see that counter logging stops for a few seconds on both VPXs during the HA failover, which means the VPX Virtual Machines were scheduled out.

netio_tot_called - This counter is the number of times the netio function has been called. This function is called every time the NetScaler needs to start packet processing; ideally, the gap between consecutive counter samples should be seven (7) seconds.

Collector bundle for 192.168.1.10 - /var/nslog/

nsconmsg -g netio_tot_called -d current -K newnslog  -s time=26Jul2016:10:20 -s disptime=1 |more

   Index   rtime totalcount-val      delta rate/sec symbol-name&device-no&time
      0 3585223      287355050      56748     8105 netio_tot_called  Tue Jul 26 10:20:08 2016
      1    7002      287381927      26877     3838 netio_tot_called  Tue Jul 26 10:20:15 2016
      2    7002      287408841      26914     3843 netio_tot_called  Tue Jul 26 10:20:22 2016
      3    7002      287554531      85636    12230 netio_tot_called  Tue Jul 26 10:20:34 2016   <-- Here we have a 12-second gap; ideally it should have been just 7 seconds
      4    7002      287593240      38709     5528 netio_tot_called  Tue Jul 26 10:20:41 2016
      5    7003      287621530      28290     4039 netio_tot_called  Tue Jul 26 10:20:48 2016
      6    7003      287648373      26843     3833 netio_tot_called  Tue Jul 26 10:20:55 2016
      7    7001      287676102      27729     3960 netio_tot_called  Tue Jul 26 10:21:02 2016
      8    7004      287703248      27146     3875 netio_tot_called  Tue Jul 26 10:21:09 2016
      9    7004      287730415      27167     3878 netio_tot_called  Tue Jul 26 10:21:16 2016

Collector bundle for 192.168.1.20 - /var/nslog/

nsconmsg -g netio_tot_called -d current -K newnslog  -s time=26Jul2016:10:20 -s disptime=1 |more

  Index   rtime totalcount-val      delta rate/sec symbol-name&device-no&time
      0  343090      246967167      26729     3817 netio_tot_called  Tue Jul 26 10:20:07 2016
      1    7001      246994115      26948     3849 netio_tot_called  Tue Jul 26 10:20:14 2016
      2    7003      247019658      25543     3647 netio_tot_called  Tue Jul 26 10:20:21 2016
      3   12698      247055240      35582     2802 netio_tot_called  Tue Jul 26 10:20:33 2016   <-- Here is the 12-second gap
      4    7012      247125542      70302    10025 netio_tot_called  Tue Jul 26 10:20:40 2016
      5    7001      247200102      25784     3682 netio_tot_called  Tue Jul 26 10:20:55 2016
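
The same gap check can be scripted against the saved counter output. Below is a minimal Python sketch, assuming the output of the nsconmsg command shown above has been redirected to a file (the file name is a placeholder); it parses the trailing timestamp of each counter line and reports any interval noticeably longer than the expected seven seconds.

# Minimal sketch: flag scheduling gaps in saved nsconmsg counter output, e.g.
# nsconmsg -g netio_tot_called -d current -K newnslog -s disptime=1 > netio_counters.txt
# The file name is a placeholder; the "Www Mmm dd HH:MM:SS YYYY" timestamp format
# is taken from the output shown above.
import re
import sys
from datetime import datetime

TS_RE = re.compile(r"\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4}")
EXPECTED = 7  # seconds between consecutive counter samples

def find_gaps(lines, slack=1):
    # Yield (previous_ts, current_ts, gap_seconds) when the gap exceeds EXPECTED + slack.
    prev = None
    for line in lines:
        m = TS_RE.search(line)
        if not m:
            continue  # skip headers and lines without a timestamp
        ts = datetime.strptime(m.group(0), "%a %b %d %H:%M:%S %Y")
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > EXPECTED + slack:
                yield prev, ts, gap
        prev = ts

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        for prev, ts, gap in find_gaps(f):
            print(f"{gap:.0f}s gap between samples at {prev:%H:%M:%S} and {ts:%H:%M:%S} "
                  f"(expected ~{EXPECTED}s) - possible scheduling stall")

Run against the examples above, this sketch would report the 12-second gaps (10:20:22 to 10:20:34 on 192.168.1.10, and 10:20:21 to 10:20:33 on 192.168.1.20).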

Examining the sys_cur_duration_sincestart counter to confirm that the VPX has scheduling issues

You can also verify this issue using the sys_cur_duration_sincestart counter, which is updated every second and should therefore show a delta of seven (7) seconds between samples in the ideal case. Gaps in this uptime-reporting counter clearly indicate lost CPU time.

      9    7001   163.21:23:31          7        0 sys_cur_duration_sincestart  Mon Aug 14 13:32:12 2017
     10   12201   163.21:23:43         12        0 sys_cur_duration_sincestart  Mon Aug 14 13:32:25 2017   <-- Delta value more than 7
     11    7002   163.21:23:50          7        0 sys_cur_duration_sincestart  Mon Aug 14 13:32:32 2017
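
As with netio_tot_called, this check can be scripted. Below is a minimal Python sketch, assuming the sys_cur_duration_sincestart output shown above has been saved to a file (the file name is a placeholder); it reads the delta column directly and reports any sample where the delta exceeds seven seconds.

# Minimal sketch: flag sys_cur_duration_sincestart samples whose delta exceeds the
# expected 7 seconds, using the column layout of the output shown above.
# Usage: python check_uptime_delta.py sys_counter.txt   (file name is a placeholder)
import sys

with open(sys.argv[1]) as f:
    for line in f:
        fields = line.split()
        # Expected columns: Index, rtime, totalcount-val, delta, rate/sec, symbol-name, timestamp
        if len(fields) < 6 or fields[5] != "sys_cur_duration_sincestart":
            continue
        delta = int(fields[3])
        if delta > 7:  # more than the expected 7-second interval
            print(f"delta of {delta}s (expected 7s): {line.rstrip()}")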

Citrix Documentation - Managing High Availability Heartbeat Messages on a NetScaler Appliance