When we bind the StoreFront monitor to our StoreFront 3.5 servers, an entry appears every hour on the dashboard and in the system log reporting a failure: "Failure - Probe failed".
Changing the monitor parameters so that the interval between successive probes is 10 seconds and the response time-out is 5 seconds has fixed the issue.
This tells us that the script is fine and that the NetScaler is able to reach the StoreFront servers, but it is not getting a response in time to consider the server as UP.
To understand how to monitor Citrix StoreFront, you can refer to: http://docs.citrix.com/en-us/netscaler/11/traffic-management/load-balancing/load-balancing-builtin-monitors/monitor-citrix-sf-services.html
Looking through the newnslog messages we see the following entries:
2155 14856 PPE-0 MonServiceBinding_SRV-BE-DI-0134.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0134_SSL): DOWN; Last response: Failure - Probe failed. Tue Mar 1 17:41:53 2016
2165 21 PPE-0 MonServiceBinding_SRV-BE-DI-0135.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0135_SSL): DOWN; Last response: Failure - Probe failed. Tue Mar 1 17:42:19 2016
2195 2032 PPE-0 MonServiceBinding_SRV-BE-DI-0134.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0134_SSL): DOWN; Last response: Failure - Probe failed. Tue Mar 1 18:19:15 2016
2202 7 PPE-0 MonServiceBinding_SRV-BE-DI-0134.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0134_SSL): DOWN; Last response: Failure - Probe failed. Tue Mar 1 18:23:20 2016
2203 7 PPE-0 DBSMonServiceBinding_SRV-BE-DI-0134.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0134_SSL): DOWN; Last response: Failure - Probe failed. Tue Mar 1 18:23:25 2016
2210 231 PPE-0 MonServiceBinding_SRV-BE-DI-0135.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0135_SSL): DOWN; Last response: Failure - Probe failed. Tue Mar 1 18:27:18 2016
2222 0 PPE-0 MonServiceBinding_SRV-BE-DI-0135.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0135_SSL): DOWN; Last response: Failure - Probe failed. Tue Mar 1 18:46:43 2016
2223 14 PPE-0 DBSMonServiceBinding_SRV-BE-DI-0135.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0135_SSL): DOWN; Last response: Failure - Probe failed. Tue Mar 1 18:46:53 2016
2275 28 PPE-0 MonServiceBinding_SRV-BE-DI-0135.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0135_SSL): DOWN; Last response: Failure - Probe failed. Tue Mar 1 18:49:28 2016
2333 168 PPE-0 MonServiceBinding_SRV-BE-DI-0135.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0135_SSL): DOWN; Last response: Failure - Probe failed. Tue Mar 1 19:04:02 2016
2335 196 PPE-0 MonServiceBinding_SRV-BE-DI-0135.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0135_SSL): UP; Last response: Success - Probe succeeded. Tue Mar 1 19:07:20 2016
2337 70 PPE-0 MonServiceBinding_SRV-BE-DI-0135.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0135_SSL): DOWN; Last response: Failure - Probe failed. Tue Mar 1 19:08:30 2016
2349 7 PPE-0 MonServiceBinding_SRV-BE-DI-0134.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0134_SSL): DOWN; Last response: Failure - Probe failed. Tue Mar 1 19:08:47 2016
2350 28 PPE-0 MonServiceBinding_SRV-BE-DI-0135.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0135_SSL): UP; Last response: Success - Probe succeeded. Tue Mar 1 19:09:12 2016
2352 7 PPE-0 MonServiceBinding_SRV-BE-DI-0134.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0134_SSL): DOWN; Last response: Failure - Probe failed. Tue Mar 1 19:09:22 2016
2375 14 PPE-0 MonServiceBinding_SRV-BE-DI-0134.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0134_SSL): DOWN; Last response: Failure - Probe failed. Tue Mar 1 19:11:43 2016
2377 315 PPE-0 MonServiceBinding_SRV-BE-DI-0134.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0134_SSL): UP; Last response: Success - Probe succeeded. Tue Mar 1 19:16:54 2016
As you can see, the probes are failing more often than every hour. However, these entries do not give a reason for the probe failures.
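To check how often the probes fail without reading the entries by eye, the monitor state messages can be parsed and the time between consecutive DOWN events measured per service. The following is a minimal Python sketch, not a Citrix tool. The regular expression is an assumption derived only from the entries quoted above, the function names are hypothetical, and the sample is a short excerpt of those entries. It also assumes an English locale, since the timestamps use English day and month names.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed layout, inferred from the entries quoted above. It also matches the
# "DBSMonServiceBinding_" variant, and works whether entries are one per line
# or run together.
ENTRY = re.compile(
    r"MonServiceBinding_(?P<target>\S+?)_\((?P<monitor>[^)]+)\)\((?P<service>[^)]+)\):\s+"
    r"(?P<state>UP|DOWN); Last response: (?P<response>.+?)\.\s+"
    r"(?P<when>[A-Z][a-z]{2} [A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})"
)

def parse_entries(text):
    """Return (service, state, timestamp) tuples in log order."""
    entries = []
    for m in ENTRY.finditer(text):
        when = datetime.strptime(m.group("when"), "%a %b %d %H:%M:%S %Y")
        entries.append((m.group("service"), m.group("state"), when))
    return entries

def down_gaps(entries):
    """Map each service to the gaps between its consecutive DOWN events."""
    last_down = {}
    gaps = defaultdict(list)
    for service, state, when in entries:
        if state != "DOWN":
            continue
        if service in last_down:
            gaps[service].append(when - last_down[service])
        last_down[service] = when
    return dict(gaps)

# A few of the entries quoted above, used as sample input.
SAMPLE = """\
2155 14856 PPE-0 MonServiceBinding_SRV-BE-DI-0134.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0134_SSL): DOWN; Last response: Failure - Probe failed. Tue Mar 1 17:41:53 2016
2165 21 PPE-0 MonServiceBinding_SRV-BE-DI-0135.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0135_SSL): DOWN; Last response: Failure - Probe failed. Tue Mar 1 17:42:19 2016
2195 2032 PPE-0 MonServiceBinding_SRV-BE-DI-0134.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0134_SSL): DOWN; Last response: Failure - Probe failed. Tue Mar 1 18:19:15 2016
2202 7 PPE-0 MonServiceBinding_SRV-BE-DI-0134.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0134_SSL): DOWN; Last response: Failure - Probe failed. Tue Mar 1 18:23:20 2016
2210 231 PPE-0 MonServiceBinding_SRV-BE-DI-0135.XXXXXXX.COM:443_(mon_storefront)(svc_srv-be-di-0135_SSL): DOWN; Last response: Failure - Probe failed. Tue Mar 1 18:27:18 2016
"""

entries = parse_entries(SAMPLE)
gaps = down_gaps(entries)
for service, deltas in gaps.items():
    under_an_hour = sum(1 for d in deltas if d < timedelta(hours=1))
    print(service, [str(d) for d in deltas], f"{under_an_hour} gap(s) under one hour")
```

Running the same functions over the full newnslog output rather than this excerpt gives the complete picture of how closely spaced the failures are for each service.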
I also checked the nsumond.log and there are a lot of entries for the script nssf.pl failing for different reasons:
Wed Mar 2 20:54:52 2016: /netscaler/monitors/nssf.pl Script failed. Exit code : 1 (Partition ID: 0)
Wed Mar 2 20:54:52 2016: /netscaler/monitors/nssf.pl Exit Reason : (404 Not Found) (Partition ID: 0)
Wed Mar 2 21:05:22 2016: /netscaler/monitors/nssf.pl Script failed. Exit code : 1 (Partition ID: 0)
Wed Mar 2 21:05:22 2016: /netscaler/monitors/nssf.pl Exit Reason : (Citrix Peer Resolution Service CitrixConfigurationReplication CitrixCredentialWallet CitrixDefaultDomainService CitrixSubscriptionsStore WAS W3SVC stopped running.Degraded Services.) (Partition ID: 0)
Fri Mar 4 14:15:23 2016: /netscaler/monitors/nssf.pl Script failed. Exit code : 1 (Partition ID: 0)
Fri Mar 4 14:15:23 2016: /netscaler/monitors/nssf.pl Exit Reason : (200 OK) (Partition ID: 0)
However, the main reason for the failures is this:
Wed Mar 2 19:34:38 2016: /netscaler/monitors/nssf.pl Script failed. Exit code : 1 (Partition ID: 0)
Wed Mar 2 19:34:38 2016: /netscaler/monitors/nssf.pl Exit Reason : (500 Can't connect to 192.168.200.135:443) (Partition ID: 0)
Wed Mar 2 19:34:43 2016: /netscaler/monitors/nssf.pl Script failed. Exit code : 1 (Partition ID: 0)
Wed Mar 2 19:34:43 2016: /netscaler/monitors/nssf.pl Exit Reason : (500 Can't connect to 192.168.200.135:443) (Partition ID: 0)
Wed Mar 2 19:34:58 2016: /netscaler/monitors/nssf.pl Script failed. Exit code : 1 (Partition ID: 0)
Wed Mar 2 19:34:58 2016: /netscaler/monitors/nssf.pl Exit Reason : (500 Can't connect to 192.168.200.134:443) (Partition ID: 0)
Wed Mar 2 20:07:49 2016: /netscaler/monitors/nssf.pl Script failed. Exit code : 1 (Partition ID: 0)
Wed Mar 2 20:07:49 2016: /netscaler/monitors/nssf.pl Exit Reason : (500 Can't connect to 192.168.200.135:443) (Partition ID: 0)
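One way to confirm which exit reason dominates is to tally the "Exit Reason" entries in nsumond.log. The sketch below is a rough Python illustration under stated assumptions: the pattern is inferred only from the entries quoted above, the function name and the `<host>` placeholder are hypothetical, and the sample is a handful of those quoted entries rather than a full log.

```python
import re
from collections import Counter

# Assumed layout of the "Exit Reason" lines, inferred from the entries above.
REASON = re.compile(r"Exit Reason : \((?P<reason>.*?)\) \(Partition ID: \d+\)")
# Replace the backend address so that failures against different servers
# are counted under the same reason.
HOST = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}:\d+")

def tally_exit_reasons(text):
    """Count nssf.pl exit reasons, ignoring which server was probed."""
    tally = Counter()
    for m in REASON.finditer(text):
        tally[HOST.sub("<host>", m.group("reason"))] += 1
    return tally

# A few of the entries quoted above, used as sample input.
SAMPLE = """\
Wed Mar 2 19:34:38 2016: /netscaler/monitors/nssf.pl Exit Reason : (500 Can't connect to 192.168.200.135:443) (Partition ID: 0)
Wed Mar 2 19:34:43 2016: /netscaler/monitors/nssf.pl Exit Reason : (500 Can't connect to 192.168.200.135:443) (Partition ID: 0)
Wed Mar 2 19:34:58 2016: /netscaler/monitors/nssf.pl Exit Reason : (500 Can't connect to 192.168.200.134:443) (Partition ID: 0)
Wed Mar 2 20:54:52 2016: /netscaler/monitors/nssf.pl Exit Reason : (404 Not Found) (Partition ID: 0)
Fri Mar 4 14:15:23 2016: /netscaler/monitors/nssf.pl Exit Reason : (200 OK) (Partition ID: 0)
"""

tally = tally_exit_reasons(SAMPLE)
for reason, count in tally.most_common():
    print(count, reason)
```

Grouping by reason rather than by server is a deliberate choice here, since the point is to see which kind of failure is most frequent; dropping the `HOST.sub` step would break the counts down per StoreFront server instead.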