Note: This article applies to Citrix SD-WAN WANOP.
The following are high-level troubleshooting steps for WCCP cluster deployment issues:
Verify the following:
Has deployment planning been performed? Refer to Citrix Documentation - Preparing for Your Deployment for more information.
Have the number and models of the CloudBridge appliances been determined? Refer to Citrix Documentation - Selecting Appliances for more information.
Have the two WCCP Service Groups (SGs) been selected? Refer to Citrix Documentation - Quick Start WCCP Clustering Guide for more information.
Has the router been configured? Refer to Citrix Documentation - Configuring the Router and Quick Start WCCP Clustering Guide for more information.
Has the CloudBridge WCCP configuration been determined, especially MaskValue, DC algorithm, and forwarding method? Refer to Citrix Documentation - Load-Balancing in the WCCP Cluster, Assigning Buckets to Appliances and Quick Start WCCP Clustering Guide for more information.
Have the CloudBridge appliances and routers been configured? Refer to Citrix Documentation - Limitations and Quick Start WCCP Clustering Guide for more information.
Is there connectivity between these devices? Refer to Citrix Documentation - Testing and Troubleshooting for more information.
If there are multiple routers, focus on a single router and verify the following on that router, then repeat for any additional routers one at a time.
Pick a CloudBridge appliance that is a member of the WCCP cluster.
Go to the WCCP Monitoring page called Cache Status under Monitoring > Appliance Performance > WCCP as shown in the following screen shot.
Examine the specific SG pair.
If the status shows as Has Assignment then go to Step 5.
Verify the appliance’s router status under Monitoring > Appliance Performance > WCCP Cache Status. Refer to the status in Table 1 - Router Status Field on WCCP Monitoring Page.
This status page does not automatically refresh, so a manual refresh is required to determine the latest status.
If the status appears in blue text, as described in Table 1 - Router Status Field on WCCP Monitoring Page, then it is an SG configuration error. Determine the cause of the error and correct it.
The following are two examples:
WCCP might be disabled.
The interface selected in the SG configuration is incorrect.
If the table entry has a notification, then verify the appliance’s notifications and look up that condition in Table 2 - WCCP Notifications. This table describes the problem and possible corrections. Correct the condition. Refer to CTX200319 and CTX200412 for more information.
If the CloudBridge appliance WCCP monitoring status page has the selected router and SG pair with a "No response from Router" status then the possible reasons and corrections for this failure are listed in this point. The following are the log messages:
If there is no connectivity to the router, then perform the following steps:
Ping the router from the CloudBridge appliance’s traffic interface used in the configuration, such as apA.
Verify if the interface is enabled and if the correct interface is selected.
If the router is on the same subnet as the interface, and ping fails, then examine wiring or any intermediate switch. Correct the condition.
If the router is on a different subnet, then a route to it is required. On any appliance other than a CloudBridge 4000/5000, the interface needs a default gateway. On a CloudBridge 4000/5000 appliance, a static route is needed on the NetScaler component.
In either case, ping the gateway, and then ping the router from that gateway device. The router might need a static route back to the appliance.
The router has learned SG characteristics from another CloudBridge appliance, and this appliance’s configuration does not match. Possible mismatches are the Password, Priority, or Protocol, or a reversal of the WAN and LAN SGs, which causes a direction mismatch. An unlikely possibility is that there are already 32 caches in the SG. Correct this appliance’s configuration.
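The on-subnet versus off-net decision in the connectivity checks above can be sketched in Python with the standard `ipaddress` module. This is only an illustrative aid, not a Citrix tool: the function name, the returned hint strings, and the /24 prefix length in the usage lines are assumptions. The gateway validity rule (0.0.0.0 or not on the interface's subnet) follows the WMAJ-8 description in Table 2.

```python
import ipaddress


def classify_router_path(interface_cidr, router_ip, gateway_ip=None):
    """Return a hint describing how the traffic interface reaches the router.

    interface_cidr is the interface address with prefix length, e.g. "110.0.10.2/24".
    All names and the hint wording are hypothetical, for illustration only.
    """
    iface = ipaddress.ip_interface(interface_cidr)
    router = ipaddress.ip_address(router_ip)

    # Same subnet: a failed ping points at wiring or an intermediate switch.
    if router in iface.network:
        return "on-subnet: check wiring or intermediate switch if ping fails"

    # Off-net: a usable gateway must be configured on the interface's subnet.
    if gateway_ip is None:
        return "off-net: gateway missing"
    gateway = ipaddress.ip_address(gateway_ip)
    if gateway.is_unspecified or gateway not in iface.network:
        return "off-net: gateway invalid"
    return "off-net: ping the gateway, then ping the router from the gateway"


# Example values; the prefix length and the off-net router address are assumed.
print(classify_router_path("110.0.10.2/24", "110.0.10.1"))
print(classify_router_path("110.0.10.2/24", "10.1.1.1", "0.0.0.0"))
```

The sketch only classifies addresses. It does not send pings, so the actual reachability tests still have to be run from the appliance and the gateway device.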
If the status is "SG has socket send error", a reboot of the appliance should resolve this issue. If the issue persists, collect the support bundle and contact Citrix technical support. Refer to Citrix Documentation - Generating a Tar Archive for Technical Support for more information.
If the status is "Needs Assignment" or "Waiting for DC to Assign", this might be temporary. However, if the condition continues for more than a minute, then the Designated Cache (DC) has decided not to provide an assignment to the router on behalf of this appliance. In this case, go to the WCCP router's monitoring page (Monitoring > Appliance Performance > WCCP (Cache x) > Routers), as shown in the following screen shot. From this page, determine the DC. Then examine that DC appliance and view the logs to determine why the DC did not assign any mask elements to the original CloudBridge appliance that has the "Needs Assignment" router status. (It is possible that the original appliance is the DC and it did not give itself an assignment.)
Generally, the reason is that the "Mask Value" has too few bits set, so there are not enough mask elements to distribute to all SG members (see the following screen shots).
Another reason is that the appliance does not have the same release as the DC. If there is no DC after a few minutes, then:
- Take a PCAP trace on the cache with the lowest IP address, and on another cache of the SG pair.
- Create a support bundle (or an extraction of the logs) for these two appliances. Refer to Citrix Documentation - Generating a Tar Archive for Technical Support for more information.
The condition might also correct itself if WCCP is disabled for 40 seconds on the cache with the lowest IP address and then re-enabled.
If the status is "Needs Assignment" or "Waiting for DC to Assign", the rest of the cluster analysis can still be performed, even though the missing assignment of this appliance is not resolved. To continue the analysis, go to Step 5.
DC Cache Status:
SC1 Cache Status:
SC2 Cache Status:
Verify the appliance’s Notifications. If there are any WCCP related notifications on either of the SGs in the pair on the selected router, then look up that condition in Table 2 - WCCP Notifications. This table describes the problem and possible corrections. Correct the condition. Refer to CTX200319 and CTX200412 for more information.
If the CloudBridge appliance has "Has Assignment", "Needs Assignment", or "Waiting for DC to Assign" as its status on the WCCP monitoring page, then look at all members of the cluster. Go to Monitoring > Appliance Performance > WCCP (Cache X) > Cluster Summary. If this appliance is a CloudBridge 4000/5000, then examining either cache of the pair is sufficient (the status on the two is nearly identical). An example of this page is shown in the following screen shot. The lines with "Handshake" represent the other caches (CloudBridge appliances) in the cluster. Examine this list (for the selected router and SG pair) and determine if all expected caches are present. If any are missing, then go to that appliance’s management IP and examine the WCCP status. Follow this procedure starting at Step 2 on that appliance to resolve why it is not part of this cluster.
In the list of original appliances, if any member has a router status of "Not Seen", then that cache (CloudBridge appliance) is no longer part of the cluster. This might be temporary, such as during a reboot. If it is not, then follow this procedure starting at Step 2 on that appliance to resolve why it is not part of this cluster. If the router status is "Seen" but not "Assigned", then you have to determine why it has no assignment. For all of the caches that have no assignment, follow the procedure described in Step 3.e by examining the DC logs.
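The membership review described above can be organized with a small Python sketch. It assumes the Cluster Summary page has been transcribed by hand into a dictionary that maps each cache IP to the set of router-status flags shown for it. That data layout, the function name, and the bucket names are hypothetical. The appliance does not export this structure.

```python
def triage_cluster(expected_caches, cluster_summary):
    """Sort the expected caches into follow-up buckets.

    cluster_summary maps a cache IP to a set of flags copied from the
    Cluster Summary page, e.g. {"Seen", "Assigned"} (assumed layout).
    """
    result = {"missing": [], "not_seen": [], "no_assignment": [], "ok": []}
    for ip in expected_caches:
        flags = cluster_summary.get(ip)
        if flags is None:
            # Not listed at all: start at Step 2 on that appliance.
            result["missing"].append(ip)
        elif "Not Seen" in flags:
            # Dropped out of the cluster: may be a reboot, else Step 2.
            result["not_seen"].append(ip)
        elif "Assigned" not in flags:
            # Seen but unassigned: examine the DC logs (Step 3.e).
            result["no_assignment"].append(ip)
        else:
            result["ok"].append(ip)
    return result


# Example with illustrative addresses.
expected = ["110.0.10.2", "110.0.10.3", "110.0.10.4"]
summary = {"110.0.10.2": {"Seen", "Assigned"}, "110.0.10.3": {"Seen"}}
print(triage_cluster(expected, summary))
```

Each bucket corresponds to one follow-up action from the procedure, so the output can serve as a checklist while iterating over routers and SG pairs.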
At this step, all of the SG's cache members are present and they all have assignments. The details of the assignment can be viewed on any of the status pages; however, the Monitoring > Appliance Performance > WCCP (Cache X) > Service Groups page provides a comprehensive view, as shown in the following screen shot. The example shows two active cache members in the 52/51 SG pair. The MaskValue is 0x1, 110.0.10.3 has element 0, and 110.0.10.2 has element 1. Next, evaluate the load at each appliance. There is no comprehensive tool for this other than checking each appliance and comparing the connection, LAN, and WAN usage graphs. If any appliance is heavily loaded while one or more are lightly loaded, then the MaskValue choice should be reevaluated. If a network component mask has been selected, then a host component mask might be a better solution. If the issue cannot be resolved with a different MaskValue, additional appliances might need to be added to support the complete peak load and an N+1 solution. Refer to Citrix Documentation - Planning Your Deployment and Quick Start WCCP Clustering Guide for more information.
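The relationship between MaskValue and the number of mask elements can be checked with a short Python sketch. It assumes that a mask with n bits set yields 2^n elements, which is consistent with the example above (MaskValue 0x1 gives elements 0 and 1). The function names are hypothetical.

```python
def mask_element_count(mask_value):
    """Number of mask elements, assuming 2 ** (count of bits set in the mask)."""
    return 1 << bin(mask_value).count("1")


def caches_without_assignment(mask_value, cache_count):
    """How many caches cannot receive an element because the mask is too small."""
    return max(0, cache_count - mask_element_count(mask_value))


# MaskValue 0x1 has one bit set, so two elements are available.
print(mask_element_count(0x1))            # 2
# With three caches in the SG, one cache would be left without an assignment.
print(caches_without_assignment(0x1, 3))  # 1
```

A nonzero result from `caches_without_assignment` matches the situation described by WWARN-2 and WWARN-3 in Table 2, where the mask is too small for the number of caches in the cluster.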
Iterate on the other SG pairs and the other routers. When all cluster members are active on all routers, and the load is reasonably distributed, the last step is to verify whether there are two or more routers in the same SG pair. In this case, the "HSRP Deterministic" algorithm should be selected to support migration of connections between routers. The "Least Disruptive" algorithm should only be used when there is a single router.
Table 1 - Router Status Field on WCCP Monitoring Page: This table lists the Router Status values that appear for each router and in the following diagram. Status conditions that indicate configuration errors are colored blue, conditions that represent internal errors are purple, and operationally determined errors are orange.
# | Status Text | Notification |
---|---|---|
1 | Undefined interface | |
2 | Bad configuration | |
3 | Disabled interface | |
4 | Bad subnet for interface | |
5 | Internal problem | |
6 | Service Group is disabled | |
7 | WCCP is disabled | |
8 | Acceleration is disabled | |
9 | Contacting router | |
10 | Connecting to router | |
11 | Connected to router | |
12 | Disconnecting from router | |
13 | No response from router | WMAJ-1 or 13 |
14 | Router's Forward or Return or Assignment capability mismatch | WMAJ-4, 5, 6 or 7 |
15 | Multicast Discovering | |
16 | Multicast Failed to Discover | |
17 | Multicast Shutdown | |
18 | Router's view has other cache | WMAJ-5 |
19 | Router's Assignment Capability mismatch | WMAJ-4, 5, 6 or 7 |
20 | Router Is OffNet and AP's GW is Invalid | WMAJ-8 |
21 | SG had socket send error | |
22 | Needs Assignment | |
23 | Has Assignment | |
24 | Waiting for Partner SG | WMAJ-9 |
25 | Waiting for DC to Assign | WWARN-1, 2 or 3 |
26 | Direction (source/destination) mismatch with DC | |
27 | Number of mask bits mismatch with DC | WMAJ-10, 11 & 12 |
28 | Mask value mismatch with DC | WMAJ-10, 11 & 12 |
29 | SNH | |
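
The lookup from a router status to its related notifications can be sketched as a small Python dictionary built from the rows of the table above that list a notification. The shorthand in the table (for example "WMAJ-4, 5, 6 or 7") is expanded here into full reference numbers, which is an interpretation. The names `ROUTER_STATUS_NOTIFICATIONS` and `notifications_for` are hypothetical.

```python
# Status text (lower-cased) -> notification references from Table 1.
ROUTER_STATUS_NOTIFICATIONS = {
    "no response from router": ["WMAJ-1", "WMAJ-13"],
    "router's forward or return or assignment capability mismatch":
        ["WMAJ-4", "WMAJ-5", "WMAJ-6", "WMAJ-7"],
    "router's view has other cache": ["WMAJ-5"],
    "router's assignment capability mismatch":
        ["WMAJ-4", "WMAJ-5", "WMAJ-6", "WMAJ-7"],
    "router is offnet and ap's gw is invalid": ["WMAJ-8"],
    "waiting for partner sg": ["WMAJ-9"],
    "waiting for dc to assign": ["WWARN-1", "WWARN-2", "WWARN-3"],
    "number of mask bits mismatch with dc": ["WMAJ-10", "WMAJ-11", "WMAJ-12"],
    "mask value mismatch with dc": ["WMAJ-10", "WMAJ-11", "WMAJ-12"],
}


def notifications_for(status_text):
    """Return the notification references to look up in Table 2, if any."""
    return ROUTER_STATUS_NOTIFICATIONS.get(status_text.strip().lower(), [])


print(notifications_for("No response from Router"))  # ['WMAJ-1', 'WMAJ-13']
print(notifications_for("Has Assignment"))           # []
```

The lookup is case-insensitive because the status text is capitalized differently in different places in this article.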
The following diagram indicates a typical flow of router status presented to the user on the WCCP status page for a particular router in a particular SG.
Note: Some of these status messages are brief, and a user might not see all the status messages from "Contacting Router" to "Has Assignment".
Ref # | Alert Type | Message | Meaning |
---|---|---|---|
WMAJ-1 | WCCP MAJOR | WCCP SG: <sg> on <apDisplayName> Router: <IP> Has not sent I See You in X Time Outs (Y s) | CloudBridge cannot connect to an SG router after 1 minute. Verify configuration and connectivity. Ping the specified router from the CloudBridge. Verify the router’s WCCP SG configuration. In a cluster, the SG’s priority and protocol must be the same on all SG members. The password must match the router’s password in the SG, and both SGs must use the same password. |
WMAJ-3 | WCCP MAJOR | WCCP SG: <sg> on <apDisplayName> Router: <ip> Has incompatible Packet Return Method: LEVEL2 and is Offnet, and does not support GRE for packet return. | The specified router has an incompatible Packet Return Method, LEVEL2, and does not support GRE for this method. Also the router is not on the CloudBridge’s subnet, that is, it is Offnet. Packet Acceleration is incompatible. One solution would be to move the CloudBridge and the router to be on the same subnet. |
WMAJ-4 | WCCP MAJOR | WCCP SG: <sg> on <apDisplayName> Router: <ip> Has incompatible Router Forwarding: < GRE \| LEVEL2 \| GRE&LEVEL2 \| UNKNOWN > Configured as: < GRE \| LEVEL2 \| AUTO \| UNKNOWN> [ NSLB Solution must have same forwarding as return method.] [Router-Not-On-Same-Subnet]. | The specified router has an incompatible Forwarding method. Packet Acceleration is incompatible. One solution would be to reconfigure the SG to use a compatible Forwarding Method. Another solution, if the router is "Offnet" and the Router Forwarding was configured as LEVEL2 on the CloudBridge, would be to move the CloudBridge to the same subnet as the router, or to change the configuration to GRE or AUTO. |
WMAJ-5 | WCCP MAJOR | WCCP Not connecting. Another WCCP cache is operating on SG: <sg> on <apDisplayName> Cache’s IP: <ip> " Received from Router: <ip>. | Another WCCP cache device such as a WS is actively connected to this router. One solution would be to reconfigure the router and WS for a new SG, or to determine which device is the other cache, and only allow one cache in the SG. |
WMAJ-6 | WCCP MAJOR | WCCP SG: <sg> on <apDisplayName> Router: <ip> Has incompatible Router Assignment Method: <HASH\|MASK> and SG configured: <HASH\|MASK> router not used. | The specified router has an incompatible Assignment method. This router in the SG will not be used. One solution would be to reconfigure the SG to use a compatible Router Assignment Method, or to change from "Auto" to "Hash" or "Mask" to specifically select a method compatible with all routers in the SG. |
WMAJ-7 | WCCP MAJOR | WCCP SG: <sg> on <apDisplayName> Router: <ip> Has incompatible Packet Return Method: <GRE \| LEVEL2 \| GRE&LEVEL2 > and doesn't support the SG's selected method. | The specified router has an incompatible Packet Return Method, either GRE or LEVEL2, and does not support the method selected by the SG. If the SG is configured as "Auto" for the Packet Return Method, another router’s method was selected before this router was discovered. Try selecting GRE or LEVEL2 specifically. All routers in a Multicast SG must have the same packet return method. A unicast SG can be composed with different packet return methods. |
WMAJ-8 | WCCP MAJOR | WCCP SG: <sg> Router: <ip> is offNet and GW: <ip> is invalid on <apDisplayName>. | The AP’s GW is invalid (either 0.0.0.0 or is not on the AP’s subnet), and the router is offnet and thus unreachable. |
WMAJ-9 | WCCP MAJOR | SG: <num> partner SG: <num> has been unable to connect to Router: <ip>. Please check the WCCP configuration. | For the specified router, only one SG of an SG pair is able to connect to the router. Verify that router’s WCCP configuration to determine why the CloudBridge cannot establish a connection. (WCCP_RTR::PartnerConnectFail) |
WMAJ-10 | WCCP MAJOR | WCCP <ap> SG:<sg> Router <ip> Alone I See You component ASSIGN MAP numValues <value> mismatches Configured <value> shutting down both SG. | This should never happen. It is here to verify the operation of a legacy implementation. Call the factory. The SG is shut down. (WCCP_OPER_SG. NumMaskValuesMismatch) |
WMAJ-11 | WCCP MAJOR | WCCP <ap> detecting SG:<sg1> and partner <sg2> Router <ip> <Designated Cache \| Sub-ordinate Cache> I See You component ASSIGN MAP <DcNumberBits> mismatches Configured <value> shutting down SG. | This could happen on a paired SG where the DC has a different configuration from this CloudBridge. The SG pair is shut down. Match the configuration of the number of bits in the mask. (WCCP_OPER_SG. NumMaskValuesMismatch) |
WMAJ-12 | WCCP MAJOR | WCCP On <ap> detecting SG:<sg1> and partner SG: <sg2> Router <ip> [Designated Cache \| Sub-ordinate Cache ] Learned Mask value: <hex-num> does not match Configured <hex-num2> shutting down both SGs. | This happens when a CloudBridge’s mask value does not match the learned value from the router’s I See You message which was set by the DC. (WCCP_OPER_SG:: MaskValueMismatch) |
WMAJ-13 | WCCP MAJOR | SG: <sg> on Router: <ip> [<ip>] connectivity has fluctuated at least <num> times in the last <num2> minutes. Shutting down this SG to prevent network outage. | This might occur on an overloaded unit where WCCP/UDP packets are dropped. The SG cycles up and down: the router starts sending load, the unit drops packets because of the overload, and then the SG goes down. |
WWARN-1 | WCCP WARNING | WCCP SG: <num> Router: <ip> Cache: <ip> has not been seen in partner SG: <num> for more than <val> seconds. | This is detected on the DC. This could happen on a paired SG where it was not possible to establish communication on one SG of the pair with this router. |
WWARN-2 | WCCP WARNING | WCCP <ap> SG: <num> Router Ip: <ip> Mode: Subordinate Cache has not received an assignment for more that <val> seconds. OR WCCP <ap> SG: <num> Router Ip: <ip> Mode: Subordinate Cache has lost its assignment. | This could happen on a paired SG where it was not possible for the acting DC to perform an assignment on this platform. One reason is that, with mask assignment, the number of bits chosen is too small for the number of CloudBridges in the cluster. Another reason is that the DC went away and a replacement DC has not yet been able to perform an assignment. |
WWARN-3 | WCCP WARNING | WCCP SG: <num> & <num> Router Ip: <ip> on <ap> this unit operating as Mode: Designated Cache not able to give itself assignment because mask-size is too small. | This could happen on a paired SG where there are more caches than there are mask elements (MaskSize) and the DC cannot give itself an assignment. This would happen when using the deterministic assignment algorithm and the DC has an IP address larger than the other caches that are given an assignment. This condition would remain until enough caches leave the SG to allow the DC to give itself an assignment. The SG configuration could have its “MaskSize” changed to a larger number. |
OneShot | MAJOR | WCCP_TIME_EVENT::StartWccpEvent WCCP SG: <sgnum> PopProc: <PopProc> PopName: <PopName> Handle: 0x<num> Seconds: <sec> seconds 'new' mailed MAJOR error no MEMORY. System is broken. Call the factory. | The memory heap is so short of available memory that even a small allocation cannot be made. |