Extended ICA connection interruption during NetScaler HA failover on Azure

Article ID: CTX484564

Description

Users are encountering extended ICA connection interruptions during NetScaler High Availability (HA) failover events within the Azure environment.

  • The HA node pairs have been configured on Azure, with VPX serving as the Citrix Gateway for a Citrix Virtual Apps and Desktops (CVAD) environment.
  • Upon executing a force failover on the VPX, users are intermittently experiencing connection disruptions lasting approximately 22 seconds.
  • The issue occurs randomly during HA failover. When the switchover goes smoothly, the ICA connection is interrupted for only a few seconds.

Environment

Citrix is not responsible for and does not endorse or accept any responsibility for the contents or your use of these third party Web sites. Citrix is providing these links to you only as a convenience, and the inclusion of any link does not imply endorsement by Citrix of the linked Web site. It is your responsibility to take precautions to ensure that whatever Web site you use is free of viruses or other harmful items.

Resolution

  • Because the minimum Azure SLB health probe interval is 5 seconds, the disconnection time can only be reduced by tuning the TCP retransmission timeout (RTO) so that long retransmission gaps, such as the 12-second one, are avoided.
  • Changing the TCP retransmission RTO has been tested and alleviates the problem. However, because the probe interval cannot be reduced further, an interruption of about 10 seconds still occurs in the worst case.

Problem Cause

There are two sources of the extended interruption:

(1) caused by Azure SLB health probe interval;

(2) caused by exponential interval of TCP SYN retransmission.

If the Azure SLB health probe interval is 5 seconds, then according to information from Microsoft, it can take up to 10 seconds in the worst case for the Azure load balancer (ALB) to determine the active/passive state of the VPX.

By default, when the client receives a reset from the ADC, it immediately sends a new connection request (TCP SYN) and then retransmits it at intervals of 3, 6, and 12 seconds. In the cases with a long ICA disconnection, the retransmissions sent after the 3-second and 6-second intervals received no reply, while the retransmission sent after the 12-second interval was answered every time. This is consistent with the estimate above that ALB can take up to 10 seconds to follow the HA failover in the worst case.

In the case of a smooth switchover, the client received a valid reply to the retransmission sent after the 3-second or 6-second interval.
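The timing described above can be sketched as a small model. This is an illustration only, not Citrix or Microsoft code: the function names are hypothetical, the 3-second initial timeout and doubling are taken from the intervals observed in this article, and the 10-second load balancer switchover time is the worst-case estimate quoted above.

```python
def syn_send_offsets(initial_rto, retransmissions):
    """Return the times (seconds after the first SYN) at which each SYN is sent.

    The retransmission timeout doubles after every unanswered attempt.
    """
    offsets = [0.0]
    rto = initial_rto
    for _ in range(retransmissions):
        offsets.append(offsets[-1] + rto)
        rto *= 2
    return offsets


def first_answered_offset(offsets, lb_ready_after):
    """Return the offset of the first SYN sent once the load balancer forwards
    traffic to the new primary node, or None if no attempt is late enough."""
    for offset in offsets:
        if offset >= lb_ready_after:
            return offset
    return None


offsets = syn_send_offsets(initial_rto=3.0, retransmissions=3)
print(offsets)                               # [0.0, 3.0, 9.0, 21.0]
print(first_answered_offset(offsets, 10.0))  # 21.0 (worst case: switchover takes 10 s)
print(first_answered_offset(offsets, 2.0))   # 3.0  (smooth case: switchover takes 2 s)
```

With a 10-second switchover, the attempts at 0, 3, and 9 seconds all arrive too early, so the connection recovers only with the attempt at 21 seconds. The 2-second value in the last line is an arbitrary example of a fast switchover.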

Nstrace analysis:
(1) According to the packet capture on the client side, the HA failover happened at 16:18:14, when the client received a TCP reset from the ADC (Frame No. 4619). At 16:18:15, the client sent a TCP SYN (Frame No. 4718) to re-establish the connection. The client retransmitted the SYN at 16:18:18 and 16:18:24 and got no response (Frame No. 4865 and 4946). Only 12 seconds later, at 16:18:36, was the TCP connection established successfully.
The behavior of the client is expected because it complies with the TCP SYN retransmission mechanism defined in RFC 1122.
Reference: https://www.rfc-editor.org/rfc/rfc1122#page-96
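As a quick check of the arithmetic, the gaps between the SYN attempts and the total interruption can be derived from the timestamps in the client-side trace. The sketch below uses only the times quoted in item (1); the helper names are hypothetical.

```python
from datetime import datetime

# Timestamps quoted from the client-side trace in this article.
RESET_TIME = "16:18:14"                                      # TCP reset from the ADC
SYN_TIMES = ["16:18:15", "16:18:18", "16:18:24", "16:18:36"]  # first SYN and its retransmissions


def parse(ts):
    """Parse an HH:MM:SS timestamp from the trace."""
    return datetime.strptime(ts, "%H:%M:%S")


def gaps_in_seconds(timestamps):
    """Return the gap in seconds between each pair of consecutive timestamps."""
    parsed = [parse(ts) for ts in timestamps]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(parsed, parsed[1:])]


syn_gaps = gaps_in_seconds(SYN_TIMES)
total_interruption = int((parse(SYN_TIMES[-1]) - parse(RESET_TIME)).total_seconds())

print(syn_gaps)            # [3, 6, 12]
print(total_interruption)  # 22
```

The gaps double on each attempt, and the 22 seconds between the reset and the answered SYN match the interruption length reported in the Description.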
 
(2) In the nstrace on the VPX, the first TCP SYN is received at 16:18:36, which means that the earlier SYN and its two retransmissions were lost on the way and never reached the VPX.

Additional Information

Application signal, detection of the signal, and Load Balancer reaction:
https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-custom-probe-overview