Troubleshooting Retries on Provisioned Virtual Machines


Article ID: CTX222944



Description

What is a retry?
When a target streams a vDisk image, it requests blocks of data from the PVS server, which replies with the requested data. Each reply is broken down into several packets called fragments. A retry occurs when a target device requests data from the PVS server, one or more data packets do not reach the target, and the target sends another request (a retry) for the same data. A retry may also occur if a heartbeat sent from the target does not get a response from the PVS server. Heartbeats occur every 30 seconds by default.
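
As a rough illustration of the request and retry cycle described above, the following Python sketch models one block read as a set of fragments and counts how often the target has to re-request the data. It is a toy model, not the PVS streaming protocol: the function name `read_block`, the per-fragment loss probability, and the `max_attempts` cutoff are all assumptions made for illustration.

```python
import random


def read_block(num_fragments, loss_probability, rng, max_attempts=10):
    """Toy model: return the number of retries needed to receive one block.

    Assumption: a reply is usable only if every fragment arrives; otherwise
    the target re-requests the same data, which counts as one retry.
    """
    retries = 0
    for _ in range(max_attempts):
        # Each fragment is lost independently with the given probability.
        arrived = all(rng.random() >= loss_probability
                      for _ in range(num_fragments))
        if arrived:
            return retries
        retries += 1
    raise TimeoutError("block not received within max_attempts")


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    total = sum(read_block(4, 0.01, rng) for _ in range(10_000))
    print(f"retries across 10,000 block reads: {total}")
```

Even in this simplified model, a small per-fragment loss rate adds up to a visible retry count over many block reads, which is why the counter is best read in relation to how much the target has streamed.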
 
On a vDisk, retries are a direct result of either packet loss or network latency of more than one second, which causes a timeout while communicating with the storage device hosting the vDisk.
 
Poor target device performance, sluggish mouse responsiveness, application latency, and slow screen updates can all indicate excessive retries. Excessive retries can result from high bandwidth utilization, a congested link, collisions, network quality of service (QoS) policies, application-layer filtering, or physical network problems such as a flapping network interface, a malfunctioning NIC, or bad cabling. To rule out physical hardware problems, look closely at the network infrastructure: check with your network team for bandwidth utilization, interface-related problems, or misconfigurations. Retries can also be traced to storage problems or application interference, such as intrusive antivirus or third-party security software.
 
In PVS, we can see the retries in two places:
 
On the PVS console > vDisk Pool > Show vDisk Usage


On the streaming target > Virtual Disk Status


What is considered high retries in a provisioned target device?
It depends on how long the machine has been up. While the machine is booting and the OS has not yet loaded, there should be no retries. If the machine has been up for about five minutes, 100 retries is a reason for concern. However, if the target has been up for over a week, 100 retries may not be much at all.
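
One way to make the comparison above concrete is to normalize the retry counter by uptime. The sketch below is only an illustration: `retries_per_hour` is a hypothetical helper, not a PVS metric, and the article does not define a numeric threshold, so none is assumed here.

```python
def retries_per_hour(retry_count, uptime_seconds):
    """Normalize a retry counter by uptime so machines can be compared."""
    if uptime_seconds <= 0:
        raise ValueError("uptime_seconds must be positive")
    return retry_count * 3600 / uptime_seconds


# The two cases mentioned in the text: the same counter, very different rates.
five_minutes = retries_per_hour(100, 5 * 60)
one_week = retries_per_hour(100, 7 * 24 * 3600)
print(f"100 retries in 5 minutes: {five_minutes:.1f} per hour")  # 1200.0
print(f"100 retries in one week:  {one_week:.1f} per hour")      # 0.6
```

The same counter value of 100 corresponds to rates that differ by roughly three orders of magnitude, which is why the raw number alone says little.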
 
To determine the cause of the retries, start narrowing down from the perspective of either storage or network. If the vDisk storage is located on a device reached over the network, that device may be experiencing performance or network issues. Involve the storage team and perform a test by moving the vDisk to another LUN or to local storage on the PVS server; this shows whether the problem is related to the vDisk storage.
 
Note: Local storage here means the local storage on a physical PVS server, or local storage on the hypervisor that hosts the PVS server. This test is not valid if the hypervisor points to remote storage.
 
For storage-related problems, run PerfMon (https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/perfmon) on the targets to track disk usage, and correlate the results with the storage platform logs.
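
If the PerfMon data is exported to CSV, a short script can summarize each counter before it is compared against the storage logs. The sketch below assumes a layout with a timestamp in the first column and one counter per remaining column; the sample rows, the host name `TARGET01`, and the helper `summarize_counters` are made up for illustration, so adjust them to match the actual export.

```python
import csv
import io
import statistics


def summarize_counters(csv_text):
    """Return sample count, mean, and max for each counter column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    names = header[1:]  # column 0 is assumed to hold the timestamp
    samples = {name: [] for name in names}
    for row in reader:
        for name, cell in zip(names, row[1:]):
            try:
                samples[name].append(float(cell))
            except ValueError:
                continue  # skip blank or non-numeric samples
    return {
        name: {"samples": len(values),
               "mean": statistics.fmean(values),
               "max": max(values)}
        for name, values in samples.items() if values
    }


# Hypothetical export with one disk latency counter and one missing sample.
sample = r'''"(PDH-CSV 4.0) (UTC)(0)","\\TARGET01\PhysicalDisk(_Total)\Avg. Disk sec/Read"
"01/01/2024 10:00:00.000","0.002"
"01/01/2024 10:00:15.000"," "
"01/01/2024 10:00:30.000","0.250"
'''

for counter, stats in summarize_counters(sample).items():
    print(counter, stats)
```

Timestamps of the maximum values are the natural points to look up in the storage platform logs.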
 
If the retries are still present after moving the vDisk to local storage, the issue is most likely related to the network between the server and the target. In this case, capture bi-directional network traces by following the instructions in Citrix article CTX139171 (https://support.citrix.com/article/CTX139171). After capturing the traces, contact Citrix to analyze them.