Persistence Sessions not getting cleared on NetScaler due to SSF Connection Failure


Article ID: CTX235248



Description

1. Persistent session entries were not getting cleared.

2. The output of "show persistentSessions" kept growing until it reached the system limit (by default, 250,000 persistent sessions per packet engine (PE)), after which persistence failures were observed.

3. The following counters were incrementing, which indicates that the Primary node is unable to create or delete persistent sessions on the Secondary node:

   21       0          65377466 dht_err_unable_to_put_replica 
   23       0          65377466 dht_err_unable_to_send_replicate_message 
   29       0        5500585108 dht_err_unable_to_put_replica_del_msg 
   31       0        5565963946 dht_err_pcb_link_err 


4. The Secure Socket Funneling (SSF) connection counters showed that no SSF connections were forming between the nodes:

    1       0 100 ( NSSSF_HDXINSIGHT_CONNACTIVE ) tcp_cur_ssf_flags 
    3       0                 0 tcp_cur_ssf_srvr_conn 
    5       0                 0 tcp_cur_ssf_clnt_conn 
    7       0                 0 tcp_cur_ssf_cm_srvr_conn 
    9       0                 0 tcp_cur_ssf_cm_clnt_conn 


5. RPC password issues were ruled out.

6. Traces taken on the Primary and the Secondary showed the Primary trying to reach the Secondary on TCP port 3009 to set up the SSF connections, but the Secondary was not responding to the SYN.

7. The internal service on port 3009 was DOWN on both the Secondary and the Primary because no certificate was bound to it. In the output below, X.X.X.X stands for the NSIP of the node.

show service -internal
...
5)      nskrpcs-127.0.0.1-3009 (X.X.X.X:3009) - RPCSVRS
        State: DOWN
        Last state change was at Wed Mar  7 12:08:57 2018
        Time since last state change: 53 days, 14:26:36.20
[Certkey not bound]     Server Name: #ns-internal-127.0.0.1#
.....


8. Usually the default ns-server-cert is bound to the internal services. In this case it had been removed and replaced with a custom certificate, but the custom certificate was bound only to the internal service on port 443 (for HTTPS access); the internal service on port 3009 was left without a certificate.
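
As a rough illustration of the check in steps 7 and 8, the sketch below scans saved "show service -internal" output for internal services that carry the "[Certkey not bound]" marker. It assumes the output follows the same layout as the excerpt above; the class and function names are hypothetical and this is not a Citrix tool.

```python
import re
from dataclasses import dataclass

# Entry header, e.g. "5)      nskrpcs-127.0.0.1-3009 (X.X.X.X:3009) - RPCSVRS"
ENTRY_RE = re.compile(r"^\s*\d+\)\s+(\S+)\s+\((\S+):(\d+)\)\s+-\s+(\S+)")
STATE_RE = re.compile(r"^\s*State:\s*(\S+)")


@dataclass
class InternalService:
    name: str
    port: int
    svc_type: str
    state: str = "UNKNOWN"
    cert_bound: bool = True  # flipped when "[Certkey not bound]" is seen


def parse_internal_services(text):
    """Parse 'show service -internal' output into InternalService records."""
    services = []
    current = None
    for line in text.splitlines():
        header = ENTRY_RE.match(line)
        if header:
            current = InternalService(
                name=header.group(1),
                port=int(header.group(3)),
                svc_type=header.group(4),
            )
            services.append(current)
            continue
        if current is None:
            continue  # skip anything before the first entry
        state = STATE_RE.match(line)
        if state:
            current.state = state.group(1)
        elif "Certkey not bound" in line:
            current.cert_bound = False
    return services


def services_missing_cert(text):
    """Return the internal services that have no certificate bound."""
    return [s for s in parse_internal_services(text) if not s.cert_bound]


# Excerpt copied from the output in step 7.
SAMPLE = """\
5)      nskrpcs-127.0.0.1-3009 (X.X.X.X:3009) - RPCSVRS
        State: DOWN
        Last state change was at Wed Mar  7 12:08:57 2018
        Time since last state change: 53 days, 14:26:36.20
[Certkey not bound]     Server Name: #ns-internal-127.0.0.1#
"""

for svc in services_missing_cert(SAMPLE):
    print(f"{svc.name} (port {svc.port}) is {svc.state} with no certkey bound")
```

Running this against the output from both HA nodes makes it easy to spot a service such as the one on port 3009 that was overlooked when the default certificate was replaced.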

Resolution

The certificate was bound to the internal service on port 3009, after which the SSF connections came up. That resolved the issue.
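
The article does not show the exact command that was used. As a hedged sketch, the helper below only builds the CLI lines for review and does not execute anything. It assumes the usual NetScaler "bind ssl service <serviceName> -certkeyName <certkeyName>" form, which should be confirmed against the command reference for the firmware in use. The service name comes from the output above; "custom-cert" is a placeholder for the actual certkey name, and the function name is hypothetical.

```python
def bind_cert_commands(service_names, certkey_name):
    """Build one 'bind ssl service' CLI line per internal service.

    Nothing is executed here; the lines are meant to be reviewed and then
    run manually on the appliance.
    """
    for value in [certkey_name, *service_names]:
        # Reject empty names or names containing whitespace, which would
        # produce a malformed CLI line.
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"invalid name: {value!r}")
    return [
        f"bind ssl service {name} -certkeyName {certkey_name}"
        for name in service_names
    ]


# Service name taken from the article; "custom-cert" is a placeholder.
for line in bind_cert_commands(["nskrpcs-127.0.0.1-3009"], "custom-cert"):
    print(line)
```

After the bind, "show service -internal" should report the port 3009 service as UP, and the SSF connection counters should become non-zero.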

Problem Cause

By design, persistent session information is communicated to the Secondary node; once an entry is cleared on the Secondary, the Primary clears its own copy.

The internal service on port 3009 was down because no certificate was bound to it, which caused the SSF connections between the HA nodes (Primary and Secondary) to fail. The Primary therefore could not clear persistent session information on the Secondary and, as a result, did not clear its own entries either, which ultimately led to the issue.
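
The chain described above leaves a recognizable signature in the counters listed in the Description: the dht_err_* counters keep growing while every tcp_cur_ssf_*_conn counter stays at zero. The sketch below compares two saved counter snapshots for that signature. It assumes each line has the layout shown in the Description (index, a second numeric column, the counter value, then the counter name as the last token); the function names are hypothetical.

```python
def parse_counters(text):
    """Map counter name -> value for lines like
    '   21       0          65377466 dht_err_unable_to_put_replica'.

    Assumption: the third column is the counter value and the last
    whitespace-separated token is the counter name.
    """
    counters = {}
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 4:
            continue
        try:
            int(tokens[0])
            int(tokens[1])
            value = int(tokens[2])
        except ValueError:
            continue  # not a counter line
        counters[tokens[-1]] = value
    return counters


def diagnose(before_text, after_text):
    """Return (growing_dht_errors, ssf_down) for two snapshots.

    growing_dht_errors: sorted dht_err_* names whose value increased
    (a counter missing from the first snapshot is treated as 0).
    ssf_down: True when SSF connection counters are present in the
    second snapshot and all of them are 0.
    """
    before = parse_counters(before_text)
    after = parse_counters(after_text)
    growing = sorted(
        name
        for name, value in after.items()
        if name.startswith("dht_err_") and value > before.get(name, 0)
    )
    ssf_names = [
        name
        for name in after
        if name.startswith("tcp_cur_ssf_") and name.endswith("_conn")
    ]
    ssf_down = bool(ssf_names) and all(after[name] == 0 for name in ssf_names)
    return growing, ssf_down
```

If diagnose() reports growing DHT error counters together with ssf_down set to True, the internal service on port 3009 and its certificate binding are worth checking on both nodes.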

Issue/Introduction

SSF connections failed because the internal service on port 3009 was down.