Problem Definition
Users traveling to regional offices were unable to access applications published on servers in their home regions. Intermittently, users received the following error message:
“An error occurred when connecting to the MetaFrame server to launch the application. Please make sure that the MetaFrame server is running and that the network is functioning.”
Environment
Client Access: Web Interface 4.0
MetaFrame: Presentation Server 3.0 with Service Pack 2005.04
Server: Windows 2003, Service Pack 1
Data Store: SQL 2000 Service Pack 3 - MDAC 2.8
Zones: 3 geographically separated Zones, 2 Data Collectors per Zone - 1 PDC/1 BDC
Hotfixes: PSE400W2K3R01
This environment consists of three zones spanning North America, Europe, and Asia. There are approximately 600 servers in the New York office, 200 servers in London, and 100 in Tokyo. These run Presentation Server 3.0 with Service Pack 2005.04, R01, alongside Web Interface 4.0. Users launch the Web Interface site and, based on their location, are routed through a VIP to a localized Web Interface server. Load sharing has not been enabled between the zones.
Troubleshooting Methodology
Launch.ica file: Attempting to examine the launch.ica file also failed. The “save file as” operation would hang or produce a similar error.
CDF Tracing: CDF tracing was performed on the Zone Data Collectors (which are also used as the XML brokers by Web Interface in this environment). Trace modules were selected for capturing Dynamic Store and XML data.
Details
Using a batch file with tracelog.exe, the following pre-configured trace files were provided:
Runtrace.cmd containing the following:
tracelog -start tracesession -guid ZDC_trace.ctl -flag 0xffffff -level 16 -cir 50 -f c:\ZDC.etl
Stoptrace.cmd containing:
tracelog -stop tracesession
ZDC_trace.ctl containing:
7460B365-C1D2-495E-843E-5C88865CA6F1 MF_Driver_Wdica
5DF7C852-3BB0-49A2-A2DD-3D63D0B143DB MF_DLL_Wsxica
7EB18582-343E-4239-95DD-F34F6C1D60BC MF_Services_ServerFTA
04979F4A-470E-4569-97A9-A0D6FB785872 MF_Service_CtxXmlSS
28FB3FE7-B3E9-4E46-B462-D8AAB4AC3E0E MF_SDK_MfcomExe
0432987E-F918-4598-8AA0-50657FDFE334 IMA_Sals_MfServer
08265143-ADBE-4578-8EDB-987FD15F3104 IMA_Subsystems_Browser
5BD888D6-6540-462C-A011-9B0D2C205B3B IMA_Subsystems_MfServer
5AD82332-790A-4B1C-941C-AADF4AC4BB25 IMA_System_System
50381E5B-C32A-455B-B3C2-9735570677F2 IMA_Runtime_DynamicStore
4AEAF09B-6997-4CF3-96F4-F823A46510DC IMA_Runtime_ZoneManager
5D452398-2CE7-4A5B-955E-F907A86BC5F7 IMA_Runtime_HostResolver
3A02EF43-BC8F-4D76-BE63-2CE5EAFE7126 IMA_Runtime_PersistentStore
F2F8EC10-BDDF-4D92-9015-A07D3D2B97B8 IMA_Runtime_Runtime
QFarm: The output of QFarm /Offline was used to determine whether the local Zone Data Collector recognized the servers belonging to the remote zones.
Technical Analysis
In a multi-zone environment, a Zone Data Collector must establish a “gateway” to each remote Zone Data Collector in order to share dynamic data between zones. This process of building gateways is done automatically. Once the gateway has been established, Dynamic Store information can be replicated between the remote and local zones.
To validate gateway communication, Zone Data Collectors send heartbeat pings to each other and wait for a response. Each time a response is received, the failed-ping counter (pingFailedCount in the trace output) is reset.
Example
Sending PING to host [Server02TS]. |
Ping Succeeded and pingFailedCount has been reset |
If a zone times out while waiting for a response, it will add 1 to its counter and send another ping. Once 5 timeouts occur, the gateway is torn down and all data from the remote zone is purged from the local Dynamic Store. The gateway is then recreated.
Example (from CDF trace on Tokyo Zone Data Collector / XML broker):
The ping communication failure occurs:
No Pings received for 5 ping times. |
Creating all gateways |
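The counter behavior described above can be modeled with a short sketch. This is only an illustration, not the IMA implementation: the class name Gateway, the method record_ping, and the returned status strings are hypothetical, and resetting the counter after a teardown is an assumption. Only the threshold of 5 timeouts comes from the text.

```python
from dataclasses import dataclass

# Threshold taken from the description above: 5 consecutive timeouts.
MAX_FAILED_PINGS = 5


@dataclass
class Gateway:
    """Hypothetical model of one gateway to a remote zone."""
    zone: str
    ping_failed_count: int = 0
    teardowns: int = 0

    def record_ping(self, succeeded: bool) -> str:
        """Update the counter for one ping attempt and report the outcome."""
        if succeeded:
            # A response resets the failed-ping counter.
            self.ping_failed_count = 0
            return "ok"
        self.ping_failed_count += 1
        if self.ping_failed_count >= MAX_FAILED_PINGS:
            # Gateway is torn down; remote zone data would be purged here.
            # Resetting the counter at this point is an assumption.
            self.ping_failed_count = 0
            self.teardowns += 1
            return "teardown"
        return "retry"


if __name__ == "__main__":
    gw = Gateway("Europe")
    for attempt in range(1, 6):
        print(attempt, gw.record_ping(succeeded=False))
```

The sketch makes one point visible: a single successful response at any time clears the count, so a teardown requires five timeouts in a row, which is more likely on links where responses regularly arrive late.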
At this point, Citrix Engineering was able to focus on inter-zone communication as the problem area. An issue of this kind affects multi-zone environments where zones are separated by high-latency networks. The trace data shows the Tokyo Zone Data Collector destroying its gateways to the NorthAmerica and Europe zones.
Example (from CDF trace on Tokyo Zone Data Collector):
[0]1518.1448::09/08/2006-16:52:12.194 [ds2]Destroying gateway for zone [NorthAmerica] |
[0]1518.1448::09/08/2006-16:52:12.194 [ds2]Destroying gateway for zone [Europe] |
End user experiences application resolution or launch attempt failures:
[0]1518.1448::09/08/2006-16:53:11.068 [mfserver]MFServer::ResolveAppInZones : do not have a zone preference list or resolution failed in those zones |
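When a formatted trace runs to many thousands of lines, a small script can help locate the teardown events. The following is a minimal sketch that assumes the formatted trace is plain text with lines shaped like the excerpts above, and that the date is month/day/year. The function name find_gateway_teardowns and the regular expressions are illustrative only and are not part of any Citrix tool.

```python
import re
from datetime import datetime

# Line shape assumed from the excerpts above:
# [cpu]pid.tid::MM/DD/YYYY-HH:MM:SS.mmm [module]message |
LINE_RE = re.compile(
    r"^\[\d+\]\w+\.\w+::"
    r"(?P<ts>\d{2}/\d{2}/\d{4}-\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<module>\w+)\](?P<msg>.*?)\s*\|?\s*$"
)
DESTROY_RE = re.compile(r"Destroying gateway for zone \[(?P<zone>[^\]]+)\]")


def find_gateway_teardowns(lines):
    """Return (timestamp, zone) pairs for each gateway teardown message."""
    events = []
    for line in lines:
        parsed = LINE_RE.match(line.strip())
        if not parsed:
            continue  # skip lines that do not match the assumed shape
        hit = DESTROY_RE.search(parsed.group("msg"))
        if hit:
            # Month/day ordering is an assumption about the trace locale.
            when = datetime.strptime(parsed.group("ts"), "%m/%d/%Y-%H:%M:%S.%f")
            events.append((when, hit.group("zone")))
    return events


if __name__ == "__main__":
    sample = [
        "[0]1518.1448::09/08/2006-16:52:12.194 [ds2]Destroying gateway for zone [NorthAmerica] |",
        "[0]1518.1448::09/08/2006-16:52:12.194 [ds2]Destroying gateway for zone [Europe] |",
    ]
    for when, zone in find_gateway_teardowns(sample):
        print(when.isoformat(), zone)
```

Comparing the teardown timestamps with the times of user-reported launch failures is one way to check whether the two correlate, as they do in the excerpts above.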
The local zone then tries again to ping the remote zone. Once the gateway to the remote zone is rebuilt, all of the Dynamic Store data tables (applications, servers, users, and so on) are imported and rewritten to the local Dynamic Store. While this process is under way, application and server information for the remote zone is missing from the local Dynamic Store, so the launch.ica file cannot be created.
Sporadically, you may also notice that QFARM /OFFLINE lists the servers belonging to remote zones.
Resolution / Recommendation
1. PSE400R01W2K3034 [CPR#128340]
This issue of Gateways being torn down has been addressed in hotfix PSE400R01W2K3034 (now replaced with PSE400R01W2K3076 and PSE400W2K3R02). CDF tracing confirmed that the behavior also occurred on the test servers. Hotfix 34 was applied to the test environment and the resolution was confirmed with further CDF tracing. Hotfix 34 was then deployed to the customer’s production environment. However, it took approximately 2 hours before the zone data was fully converged.
2. PSE400R02W2K3003 [CPR#143478]
The change in hotfix 34 helps to reduce the tearing down of Gateways. However, a significant amount of time was still required for all zones to converge their information, for example, when Zone Data Collectors are changed or restarted intentionally. With the current customer configuration, zone data took 2 hours to converge.
Hotfix PSE400R02W2K3003 was applied to the Zone Data Collectors. This reduced the convergence time from 2 hours to less than half an hour.
Note: Hotfix PSE400R02W2K3003 is not yet publicly available.
3. Recommendation to remove the Gateway Validation Interval registry key.
Caution! This fix requires you to edit the registry. Using Registry Editor incorrectly can cause serious problems that may require you to reinstall your operating system. Citrix cannot guarantee that problems resulting from the incorrect use of Registry Editor can be solved. Use Registry Editor at your own risk. Be sure to back up the registry before you edit it.
The registry key below was previously added to the Zone Data Collectors in this environment:
HKEY_LOCAL_MACHINE\SOFTWARE\Citrix\IMA\RUNTIME
GatewayValidationInterval (DWORD)
Value: 0x00007530 (hex)
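Registry Editor can display a DWORD in either hexadecimal or decimal, so it helps to know both forms when checking for this value. The conversion is sketched below; the function name dword_to_decimal is hypothetical, and the unit of the interval is not stated here, so none is assumed.

```python
def dword_to_decimal(hex_text: str) -> int:
    """Convert a hexadecimal DWORD string such as '0x00007530' to an integer."""
    value = int(hex_text, 16)
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in a 32-bit DWORD")
    return value


if __name__ == "__main__":
    print(dword_to_decimal("0x00007530"))  # prints 30000
```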
Reducing the GatewayValidationInterval registry value can have an adverse effect in high-latency environments. Reducing this value was previously recommended in the Advanced Concepts Guide. We have since revised this recommendation for high-latency environments and produced the article referenced below.
Additional Information
CTX107059 – Advanced Concepts Guide