NetScaler SDX - Bad LACP packets from SDX device

Description

The LA interface on the SDX goes down repeatedly or shows errors in its LACP status.
On the switch, the customer sees BAD PDU errors and PDU_MISSED_TIME_OUT.
Builds: 11.1 build 53.10 or above.
The error messages relate to LACP negotiation. (Nexus and the standard use a 30-second LACPDU interval with the LONG timeout and a 1-second interval with the SHORT timeout.)
On the customer switch, the LACP packet error counters increase constantly on the port channel connected to the SDX.

show lacp counters (from Cisco device)

Checking the debug output, we see a large number of BAD PDU packets from the NetScaler:

SDSC-N77-PROD# sh lacp internal event-history errors

1) Event:E_DEBUG, length:87, at 173109 usecs after Tue Oct 3 12:55:55 2017
[102] lacp_net_rx_data(283): Rcvd BAD PDU: Sanity failed: if_idx 0x1a090000: pkt_len 64

2) Event:E_DEBUG, length:87, at 112596 usecs after Tue Oct 3 12:55:55 2017
[102] lacp_net_rx_data(283): Rcvd BAD PDU: Sanity failed: if_idx 0x1a010000: pkt_len 64

3) Event:E_DEBUG, length:87, at 26104 usecs after Tue Oct 3 12:55:55 2017
[102] lacp_net_rx_data(283): Rcvd BAD PDU: Sanity failed: if_idx 0x1a08f000: pkt_len 64

4) Event:E_DEBUG, length:87, at 984448 usecs after Tue Oct 3 12:55:54 2017
[102] lacp_net_rx_data(283): Rcvd BAD PDU: Sanity failed: if_idx 0x1a00f000: pkt_len 64

5) Event:E_DEBUG, length:87, at 530838 usecs after Tue Oct 3 12:55:54 2017
[102] lacp_net_rx_data(283): Rcvd BAD PDU: Sanity failed: if_idx 0x1a08e000: pkt_len 64
...

The same errors appear in the output of: show lacp internal pdu interface #
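
To see which port-channel members are affected, the BAD PDU entries in the event history above can be tallied per interface index. The following is a minimal Python sketch, not a Cisco tool. It assumes the switch output has been saved as text in the format shown above; BAD_PDU_RE and count_bad_pdus are hypothetical names introduced for illustration.

```python
import re
from collections import Counter

# Matches the NX-OS event-history lines shown above, for example:
# [102] lacp_net_rx_data(283): Rcvd BAD PDU: Sanity failed: if_idx 0x1a090000: pkt_len 64
BAD_PDU_RE = re.compile(
    r"Rcvd BAD PDU: Sanity failed: if_idx (0x[0-9a-fA-F]+): pkt_len (\d+)"
)


def count_bad_pdus(log_text: str) -> Counter:
    """Count BAD PDU events per interface index (hypothetical helper)."""
    counts = Counter()
    for match in BAD_PDU_RE.finditer(log_text):
        counts[match.group(1)] += 1
    return counts


# Three entries copied from the output above.
SAMPLE = """
[102] lacp_net_rx_data(283): Rcvd BAD PDU: Sanity failed: if_idx 0x1a090000: pkt_len 64
[102] lacp_net_rx_data(283): Rcvd BAD PDU: Sanity failed: if_idx 0x1a010000: pkt_len 64
[102] lacp_net_rx_data(283): Rcvd BAD PDU: Sanity failed: if_idx 0x1a08f000: pkt_len 64
"""

if __name__ == "__main__":
    for if_idx, hits in count_bad_pdus(SAMPLE).most_common():
        print(if_idx, hits)
```

Running the same function over the full saved output shows whether the errors are spread across all members of the port channel or concentrated on a few.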

The switch constantly reports the following errors in its log:

(11) Recv BAD LACP PDU: len:64 at 748114 usecs after Mon Oct 9 13:23:43 2017

0180c200 000200e0 ed75a0ee 88090000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
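
The dump above can be decoded by hand or with a short script. The Python sketch below assumes the 64 bytes begin at the Ethernet destination address, with no preamble or VLAN tag; decode_header and format_mac are hypothetical helpers introduced for illustration. The two constants are the IEEE 802.3 Slow Protocols EtherType (0x8809) and the LACP subtype (1).

```python
# Hex dump copied from the switch log above (64 bytes).
DUMP = """
0180c200 000200e0 ed75a0ee 88090000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
"""

SLOW_PROTOCOLS_ETHERTYPE = 0x8809  # IEEE 802.3 Slow Protocols
LACP_SUBTYPE = 0x01                # subtype carried by genuine LACPDUs


def format_mac(raw: bytes) -> str:
    """Render six raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def decode_header(hex_dump: str) -> dict:
    """Decode the Ethernet header and Slow Protocols subtype of a frame.

    Assumes an untagged frame: 6 bytes destination, 6 bytes source,
    2 bytes EtherType, then the 1-byte Slow Protocols subtype.
    """
    frame = bytes.fromhex("".join(hex_dump.split()))
    if len(frame) < 15:
        raise ValueError("frame too short to contain a subtype byte")
    ethertype = int.from_bytes(frame[12:14], "big")
    subtype = frame[14]
    return {
        "length": len(frame),
        "dst": format_mac(frame[0:6]),
        "src": format_mac(frame[6:12]),
        "ethertype": ethertype,
        "subtype": subtype,
        "is_lacpdu": ethertype == SLOW_PROTOCOLS_ETHERTYPE
        and subtype == LACP_SUBTYPE,
    }


if __name__ == "__main__":
    for field, value in decode_header(DUMP).items():
        print(field, value)
```

For this dump the fields decode to destination 01:80:c2:00:00:02 (the Slow Protocols multicast address), source 00:e0:ed:75:a0:ee, EtherType 0x8809 and subtype 0, so the frame is a Slow Protocols frame but not an LACPDU.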

In daemon.log on the XenServer, the vswitchd link state for the interfaces that face the switch and are used for the LACP channel went up and down several times.

A packet capture shows the SDX sending Slow Protocols frames with incorrect data within the packet.

Resolution

The packet highlighted in the capture is not an LACP packet. It is generated by a daemon on the SDX that detects TX stalls in the system. It sets the type to “Slow Protocols”, but the subtype is 0, not 1 (LACP). The switch should ignore this packet and not treat it as an LACP PDU. The affected switches process those packets as LACP.

Disabling the daemon stops the flaps and the error messages in the switch logs. This confirms or rules out that the switch is misinterpreting the subtype 0 packets.

On the XenServer shell:
[root@netscaler-sdx ~]# ps -ax | grep nictx
8763 ? S 0:00 /usr/bin/python /etc/rc3.d/S20sdx-nictx start
9439 pts/4 S+ 0:00 grep nictx

[root@netscaler-sdx ~]# /etc/rc3.d/S20sdx-nictx stop
Sending SIGTERM to 8763

[root@netscaler-sdx ~]# ps -ax | grep nictx
10929 pts/4 S+ 0:00 grep nictx

Problem Cause

The packet highlighted in the capture is not an LACP packet. It is generated by a daemon on the SDX that detects TX stalls in the system. It sets the type to “Slow Protocols”, but the subtype is 0, not 1 (LACP). The switch should ignore this packet and not treat it as an LACP PDU. The affected switches process those packets as LACP.

Disabling the daemon stops the flaps and the error messages in the switch logs. This confirms or rules out that the switch is misinterpreting the subtype 0 packets.

Issue/Introduction

The SDX is sending bad LACP packets (wrong PDUs).
