Link aggregation (LA) channels on the NetScaler SDX platform that use LACP may intermittently flap under rare conditions.
Under these conditions, the SDX appliance disables the interface and renegotiates LACP with the partner device. Occasionally, this results in the entire LA channel being disabled. If the LA channel is configured as a critical interface and the node is primary, this causes an HA failover on the VPX instances. The issue has been observed only when the NetScaler is connected to Cisco Nexus switches, but it is likely to occur with other switches as well. The condition is intermittent, and the interfaces recover on their own after a few seconds.
The following are some prerequisite conditions for this issue to occur:
LACP configured on one or more interfaces
SVM version 10.5
XenServer version 6.1
Any NetScaler VPX version
Any NetScaler SDX platform
There are several signs that this issue is occurring.
When this issue occurs, /var/log/daemon.log on the XenServer will show the interfaces and the channel being disabled. It will also show that they recover on their own.
/var/log/daemon.log
Jun 1 14:53:15 netscaler-sdx ovs-vswitchd: 12450|bond|INFO|interface eth35: link state down
Jun 1 14:53:15 netscaler-sdx ovs-vswitchd: 12451|bond|WARN|interface eth35: disabled
Jun 1 14:54:09 netscaler-sdx ovs-vswitchd: 12452|bond|INFO|interface eth34: link state down
Jun 1 14:54:09 netscaler-sdx ovs-vswitchd: 12453|bond|WARN|interface eth34: disabled
Jun 1 14:54:09 netscaler-sdx ovs-vswitchd: 12454|bond|WARN|bond bond0: all interfaces disabled
Jun 1 14:54:12 netscaler-sdx ovs-vswitchd: 12455|bond|INFO|interface eth35: link state up
Jun 1 14:54:12 netscaler-sdx ovs-vswitchd: 12456|bond|INFO|interface eth35: will be enabled if it stays up for 31000 ms
Jun 1 14:54:12 netscaler-sdx ovs-vswitchd: 12457|bond|INFO|interface eth34: link state up
Jun 1 14:54:12 netscaler-sdx ovs-vswitchd: 12458|bond|INFO|interface eth34: will be enabled if it stays up for 31000 ms
Jun 1 14:54:12 netscaler-sdx ovs-vswitchd: 12459|bond|INFO|bond bond0: active interface is now eth35, skipping remaining 31000 ms updelay (since no interface was enabled)
Jun 1 14:54:12 netscaler-sdx ovs-vswitchd: 12460|bond|WARN|interface eth35: enabled
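The pattern in the excerpt above (all interfaces of a bond disabled, then an active interface selected again a few seconds later) can be searched for automatically. The following Python sketch is one possible way to do that. It assumes the log lines follow exactly the layout shown in the excerpt; the function name `find_bond_outages` and the regular expressions are illustrative and are not part of any NetScaler or XenServer tool.

```python
import re

# Assumed line layout, taken from the daemon.log excerpt above:
#   <Mon> <day> <hh:mm:ss> <host> ovs-vswitchd: <seq>|bond|<LEVEL>|<message>
BOND_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+ovs-vswitchd:\s+"
    r"\d+\|bond\|(?P<level>\w+)\|(?P<msg>.*)$"
)


def find_bond_outages(lines):
    """Return (down_ts, up_ts, bond) for each outage that later recovered."""
    outages = []
    pending = None  # (timestamp, bond) of an outage that has not recovered yet
    for line in lines:
        m = BOND_LINE.match(line.strip())
        if not m:
            continue  # not an ovs-vswitchd bond message
        msg = m.group("msg")
        down = re.match(r"bond (\S+): all interfaces disabled", msg)
        if down:
            pending = (m.group("ts"), down.group(1))
            continue
        up = re.match(r"bond (\S+): active interface is now", msg)
        if up and pending and up.group(1) == pending[1]:
            outages.append((pending[0], m.group("ts"), pending[1]))
            pending = None
    return outages


# Three lines copied from the excerpt above, used as sample input.
SAMPLE = """\
Jun 1 14:54:09 netscaler-sdx ovs-vswitchd: 12453|bond|WARN|interface eth34: disabled
Jun 1 14:54:09 netscaler-sdx ovs-vswitchd: 12454|bond|WARN|bond bond0: all interfaces disabled
Jun 1 14:54:12 netscaler-sdx ovs-vswitchd: 12459|bond|INFO|bond bond0: active interface is now eth35, skipping remaining 31000 ms updelay (since no interface was enabled)
"""

for down_ts, up_ts, bond in find_bond_outages(SAMPLE.splitlines()):
    print(f"{bond}: all interfaces disabled at {down_ts}, active again at {up_ts}")
```

To check a real log, the lines of /var/log/daemon.log can be passed to the function in place of the sample text. An outage that never recovers is not reported by this sketch.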
On the VPX instances, the channel interfaces will go through re-negotiation.
# nsconmsg -K newnslog -d event
67205 0 PPE-0 self node 10.52.248.34: Primary (peer: Secondary, COMPLETE_FAIL) Wed Jun 1 20:50:44 2016
67286 0 PPE-0 remote node 10.52.248.35: UP Wed Jun 1 20:52:19 2016
67289 406 PPE-0 interface(10/2): DISTRIBUTING Wed Jun 1 20:59:03 2016
67290 602 PPE-0 interface(10/2): COLLECTING Wed Jun 1 21:09:07 2016
67291 91 PPE-0 interface(10/2): DISTRIBUTING Wed Jun 1 21:10:41 2016
67292 2016 PPE-0 interface(10/2): COLLECTING Wed Jun 1 21:44:15 2016
67293 91 PPE-0 interface(10/1): COLLECTING Wed Jun 1 21:45:42 2016
67294 0 PPE-0 'interface(LA/1)' DOWN Wed Jun 1 21:45:42 2016
67295 0 PPE-0 self node 10.52.248.34: COMPLETE_FAIL Wed Jun 1 21:45:42 2016
67333 0 PPE-0 self node 10.52.248.34: Secondary (peer: Secondary, UP) Wed Jun 1 21:45:42 2016
67339 0 PPE-0 remote node 10.52.248.35: Primary Wed Jun 1 21:49:18 2016
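To confirm that an HA failover coincided with the LA channel going down, the event output above can be scanned for a channel DOWN event and a "self node ... Secondary" event carrying the same timestamp. The Python sketch below illustrates the idea. It assumes the column layout shown in the excerpt (event number, a numeric column, PPE identifier, message, timestamp); the function name `la_down_with_failover` is hypothetical, and matching on identical timestamps is a simple heuristic, not a documented NetScaler behavior.

```python
import re

# Assumed layout, taken from the nsconmsg event output above:
#   <event id> <number> PPE-<n> <message> <Day Mon d hh:mm:ss yyyy>
EVENT = re.compile(
    r"^\s*(?P<id>\d+)\s+\d+\s+PPE-\d+\s+(?P<msg>.+?)\s+"
    r"(?P<ts>\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\s*$"
)


def la_down_with_failover(lines):
    """Return (channel, timestamp, failed_over) for each LA channel DOWN event."""
    events = []
    for line in lines:
        m = EVENT.match(line)
        if m:
            events.append((m.group("msg"), m.group("ts")))

    results = []
    for msg, ts in events:
        ch = re.match(r"'?interface\((LA/\d+)\)'?\s+DOWN$", msg)
        if not ch:
            continue
        # Heuristic: the local node reports Secondary with the same timestamp.
        failed_over = any(
            other_ts == ts and re.match(r"self node \S+: Secondary\b", other_msg)
            for other_msg, other_ts in events
        )
        results.append((ch.group(1), ts, failed_over))
    return results


# Lines copied from the event output above, used as sample input.
SAMPLE = """\
67293 91 PPE-0 interface(10/1): COLLECTING Wed Jun 1 21:45:42 2016
67294 0 PPE-0 'interface(LA/1)' DOWN Wed Jun 1 21:45:42 2016
67295 0 PPE-0 self node 10.52.248.34: COMPLETE_FAIL Wed Jun 1 21:45:42 2016
67333 0 PPE-0 self node 10.52.248.34: Secondary (peer: Secondary, UP) Wed Jun 1 21:45:42 2016
"""

for channel, ts, failed_over in la_down_with_failover(SAMPLE.splitlines()):
    print(f"{channel} went DOWN at {ts}; failover at same time: {failed_over}")
```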
Simultaneous network traces taken from XenServer and a VPX instance show that XenServer does not appear to receive the LACPDUs, even though the VPX instance does see them.
This issue can be worked around or avoided entirely with one of the following resolutions:
Upgrade the NetScaler SDX SVM firmware to an 11.x version with XenServer version 6.5.
Configure the LA channels to be manual instead of LACP.