1. Persistent session entries were not being cleared.
2. The count reported by "show persistentSessions" kept increasing until it hit the system limit (the default limit is 250,000 persistent sessions per PE), after which persistence failures were observed.
3. The following counters were incrementing, which indicates that the primary node was unable to create or delete persistent sessions on the secondary node:
21 0 65377466 dht_err_unable_to_put_replica
23 0 65377466 dht_err_unable_to_send_replicate_message
29 0 5500585108 dht_err_unable_to_put_replica_del_msg
31 0 5565963946 dht_err_pcb_link_err
4. Looking at the Secure Socket Funneling (SSF) connection counters, we saw that SSF connections were not forming between the nodes:
1 0 100 ( NSSSF_HDXINSIGHT_CONNACTIVE ) tcp_cur_ssf_flags
3 0 0 tcp_cur_ssf_srvr_conn
5 0 0 tcp_cur_ssf_clnt_conn
7 0 0 tcp_cur_ssf_cm_srvr_conn
9 0 0 tcp_cur_ssf_cm_clnt_conn
5. RPC password issues were ruled out.
6. We took a trace on the primary and secondary nodes and saw that the primary was trying to reach the secondary on TCP port 3009 for the SSF connections, but the secondary was not responding to the SYN.
7. Looking at the internal service on port 3009, we saw that it was DOWN on both the secondary and the primary because no certificate was bound (X.X.X.X is the NSIP of the node):
show service -internal
...
5) nskrpcs-127.0.0.1-3009 (X.X.X.X:3009) - RPCSVRS
State: DOWN
Last state change was at Wed Mar 7 12:08:57 2018
Time since last state change: 53 days, 14:26:36.20
[Certkey not bound] Server Name: #ns-internal-127.0.0.1#
...
8. The default ns-server-cert is normally bound to the internal services. In this case it had been removed and replaced with a custom certificate, but that certificate was bound only to the internal services on port 443 (for HTTPS access); the internal services on port 3009 were left without a certificate.
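
The checks in steps 3, 4 and 7 can be sketched as a small script that scans saved command output for the same symptoms. This is only an illustrative sketch, not a Citrix tool: the function names are hypothetical, the sample text is copied from the output above, and it assumes each counter line has the shape "<index> <number> <value> <name>" with the third number being the counter value.

```python
import re

# Assumption: counter lines look like "<index> <number> <value> [(flags)] <name>",
# as in the output pasted in this article; the third number is the counter value.
COUNTER_RE = re.compile(r"^\s*\d+\s+\d+\s+(\d+)\s+(?:\(.*?\)\s+)?(\S+)\s*$")
# Service header lines look like "5) nskrpcs-127.0.0.1-3009 (X.X.X.X:3009) - RPCSVRS".
SERVICE_RE = re.compile(r"^\s*\d+\)\s+(\S+)\s+\(")


def parse_counters(text):
    """Return {counter_name: value} for every line shaped like a counter."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_RE.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1))
    return counters


def replication_errors(counters):
    """Return the dht_err_* counters that are non-zero (step 3)."""
    return {k: v for k, v in counters.items()
            if k.startswith("dht_err_") and v > 0}


def ssf_connections_missing(counters):
    """True if SSF connection counters are present and all zero (step 4)."""
    names = [k for k in counters
             if k.startswith("tcp_cur_ssf_") and k.endswith("_conn")]
    return bool(names) and all(counters[k] == 0 for k in names)


def unbound_down_services(text):
    """Return services reported DOWN with "Certkey not bound" (step 7)."""
    found, current, down = [], None, False
    for line in text.splitlines():
        header = SERVICE_RE.match(line)
        if header:
            current, down = header.group(1), False
        elif current and re.search(r"State:\s*DOWN", line):
            down = True
        elif current and down and "Certkey not bound" in line:
            found.append(current)
            current, down = None, False
    return found


# Sample input copied from the output shown in this article.
SAMPLE_COUNTERS = """\
21 0 65377466 dht_err_unable_to_put_replica
23 0 65377466 dht_err_unable_to_send_replicate_message
29 0 5500585108 dht_err_unable_to_put_replica_del_msg
31 0 5565963946 dht_err_pcb_link_err
1 0 100 ( NSSSF_HDXINSIGHT_CONNACTIVE ) tcp_cur_ssf_flags
3 0 0 tcp_cur_ssf_srvr_conn
5 0 0 tcp_cur_ssf_clnt_conn
7 0 0 tcp_cur_ssf_cm_srvr_conn
9 0 0 tcp_cur_ssf_cm_clnt_conn
"""

SAMPLE_SERVICES = """\
5) nskrpcs-127.0.0.1-3009 (X.X.X.X:3009) - RPCSVRS
State: DOWN
Last state change was at Wed Mar 7 12:08:57 2018
Time since last state change: 53 days, 14:26:36.20
[Certkey not bound] Server Name: #ns-internal-127.0.0.1#
"""

if __name__ == "__main__":
    counters = parse_counters(SAMPLE_COUNTERS)
    print("Replication errors:", replication_errors(counters))
    print("SSF connections missing:", ssf_connections_missing(counters))
    print("DOWN, no certkey:", unbound_down_services(SAMPLE_SERVICES))
```

If all three checks fire together (non-zero dht_err counters, zero SSF connections, and an internal service on port 3009 that is DOWN with no certkey bound), the situation matches the one described in this article.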