This article contains information about best practices for VDI-in-a-Box High Availability.
When using local storage instead of shared storage, a few key factors should be considered when it comes to user desktop High Availability (HA) and failover. An administrator can apply several best practice techniques to provide the best user experience with VDI-in-a-Box.
Consistent server-side and client-side HA is important to achieve a successful VDI deployment. It is also important to understand the VDI-in-a-Box architecture when discussing HA. As this article focuses on best practices, refer to the following article for details about what happens when a VDI-in-a-Box server fails:
CTX135014 - Overview of How High Availability Works with VDI-in-a-Box
There are several ways a VDI-in-a-Box administrator can ensure users always have a server to connect to. A Single Point of Failure (SPOF) scenario ensues when an administrator provides a single vdiMgr IP address or hostname to all users. If that particular vdiMgr goes offline for any reason, none of the users can log on to their desktops. Even though VDI-in-a-Box grids load balance desktops across all servers, users can log on and have connections brokered through any vdiMgr in the grid.
Grid Virtual IP
VDI-in-a-Box 5.1 introduces a new feature that allows an administrator to configure a single IP address for the entire grid. This applies to VDI-in-a-Box grids consisting of any number of servers, even if there is just one server in the grid. This feature allows the administrator to configure a single IP address and/or single DNS record, providing HA of the VDI-in-a-Box Web Interface and Connection Brokering roles.
The VDI-in-a-Box Grid Virtual IP provides a highly available single IP address for an entire VDI-in-a-Box grid. This ensures there is no downtime in the event of a server failure when users try to authenticate and connect to VDI-in-a-Box desktops. After the administrator configures a Grid IP, one server in the grid starts listening for requests on that IP address. This server handles all user Web Interface and connection brokering requests going to the Grid IP. Heartbeats between the servers in the grid ensure that a functional server is serving the Grid IP. If the vdiMgr serving the Grid IP goes offline, another functional server in the grid takes over the Grid IP role and listens for the requests. It typically takes a few seconds after a failure for another server in the grid to start listening for requests on the Virtual IP. This does not disrupt any existing user sessions.
Complete the following procedure to configure the Grid Virtual IP:
Log on to the VDI-in-a-Box web console as an administrator.
Open the Admin > Advanced Properties menu.
Type an IP address in the Grid IP Address field.
Click OK to save the settings.
Optionally, create a single Host (A) record on the DNS server to point an FQDN to the Grid IP.
Users connect to a single FQDN or IP address – the Grid Virtual IP.
Note: An administrator is no longer required to provide multiple IP addresses or use the obsolete DNS Round Robin technique to ensure that users always have a VDI-in-a-Box server to connect to. The SPOF is eliminated without the need for an external load balancer.
Java Client
The VDI-in-a-Box Java client can be used to connect users to a virtual desktop without requiring a web browser. After downloading the Java client file to a client device, the user authenticates and is given a choice to start their virtual desktop(s). This file contains a list of all the vdiMgr IP addresses in the grid and updates the list each time the user connects. Although the new Grid IP feature negates the need for the Java client to maintain a list of all vdiMgr IP addresses, you might be running a VDI-in-a-Box 5.0.x or earlier grid that does not have this feature. Another reason to use the Java client is a thin client or kiosk-type deployment where users do not have access to a web browser.
The Java client requires the client devices to have the Java Runtime Environment (JRE), a free download from http://www.java.com, and can be published to client devices on a domain. Users can also download the client file directly through any vdiMgr server in the grid, using the URL https://vdiMgrIP/dt/vdiclient.jnlp. The file can be opened directly to start the client, or it can be downloaded and saved on the client device for easy access. After downloading, the user can double-click the vdiclient.jnlp file on the client device to open it. The administrator can also create or publish a simple logon script or batch file that downloads the Java client. The script or batch file can be initiated with the following command:
javaws https://vdiMgrIP/dt/vdiclient.jnlp
Users are required to accept or trust the server when prompted. These prompts appear only once if Always Trust is selected. Installing a valid SSL certificate onto the vdiMgrs in the grid prevents the trust window from appearing.
NetScaler ADC Load Balancing
It is possible to use a NetScaler ADC, virtual or physical, to provide HA of the VDI-in-a-Box Web Interface and Connection Brokering roles. In most cases this is not necessary, as the methods described earlier in this article are simpler. Remember, this only load balances the VDI-in-a-Box Web Interface and Connection Brokering roles; it does not affect the back-end load balancing of VDI-in-a-Box desktops. Consider NetScaler load balancing as HA or failover of these roles so the users always have a means to connect.
To use a NetScaler with VDI-in-a-Box, specifically to provide HA of the mentioned roles, configure the NetScaler appliance as usual. This includes setting up a management address, installing a license (including the free VPX Express), and ensuring there is at least one MIP or SNIP on NetScaler.
Refer to CTX120318 - Using Mapped and Subnet IP Addresses on a NetScaler Appliance for more information about MIPs and SNIPs.
You can configure a load balancing virtual server using either SSL or HTTP. If using SSL, which is recommended for secure communication between the servers and clients:
First install an SSL certificate on the NetScaler.
Create a load balancing virtual server using the protocol of your choice.
Select an IP address.
Select COOKIEINSERT persistence, and the load balancing method of your choice (such as Round Robin, Least Connection).
Now you can create a load balancing service for each vdiMgr in the grid. Each service uses SSL or HTTP, the IP address of a particular vdiMgr server, and the HTTPS-ECV or HTTP-ECV monitor (respectively). Repeat this for each vdiMgr in the grid; the end result is a single load balancing virtual server composed of services, each service representing one vdiMgr.
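As a sketch, the steps above map to the following NetScaler CLI configuration, assuming a virtual server at 192.168.1.100 and two vdiMgrs at 192.168.1.10 and 192.168.1.11 (all names and addresses are illustrative; an SSL virtual server also requires a certificate-key pair to be bound to it):

```
add lb vserver vdi_vip SSL 192.168.1.100 443 -persistenceType COOKIEINSERT -lbMethod LEASTCONNECTION
add service vdimgr1_svc 192.168.1.10 SSL 443
add service vdimgr2_svc 192.168.1.11 SSL 443
bind service vdimgr1_svc -monitorName https-ecv
bind service vdimgr2_svc -monitorName https-ecv
bind lb vserver vdi_vip vdimgr1_svc
bind lb vserver vdi_vip vdimgr2_svc
```

The https-ecv monitor removes a vdiMgr from rotation when its health check fails, which is what keeps users from being directed to a failed server.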
Note: NetScaler monitors the health of each server and prevents users from connecting to a vdiMgr if the health check fails.
Each server and service has a status that clearly shows the state to be UP or DOWN.
DNS Round Robin
This technique is obsolete and not recommended after upgrading to VDI-in-a-Box 5.1. The Grid Virtual IP eliminates the need to create multiple host records pointing to the IP addresses of multiple vdiMgrs. Previously, administrators would use the DNS Round Robin technique because there was no internal mechanism to ensure HA of the VDI-in-a-Box Web Interface (with the exception of the Java client). Remember that DNS Round Robin is not true HA, as users might still be pointed to a failed vdiMgr IP address; DNS is not aware of server health.
Note: This section has been kept for historical purposes in the event that all other techniques cannot be deployed. This method is not recommended as other techniques provide true HA for VDI-in-a-Box Web Interface and connection brokering.
Example
An administrator is still running a four-server VDI-in-a-Box 5.0.2 grid in production and wants a single FQDN for users to connect to. The administrator is currently testing VDI-in-a-Box 5.1 in a lab environment but does not plan to upgrade the production grid for a few months, and does not want to invest time in setting up NetScaler load balancing at this time. The simplest method to use until upgrading to 5.1 is DNS Round Robin. The administrator creates the following DNS records for the company.com zone:
Type: A IP: 192.168.1.10 Name: vdi
Type: A IP: 192.168.1.11 Name: vdi
Type: A IP: 192.168.1.12 Name: vdi
Type: A IP: 192.168.1.13 Name: vdi
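In BIND-style zone file syntax the same four round-robin records look like this (format is illustrative only; on a Windows DNS server these records are created through the DNS console):

```
vdi    IN  A   192.168.1.10
vdi    IN  A   192.168.1.11
vdi    IN  A   192.168.1.12
vdi    IN  A   192.168.1.13
```

Users connect to vdi.company.com, and the DNS server rotates which address is returned first on each query.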
This section has information about the HA best practices for the VDI-in-a-Box servers.
Note: VDI-in-a-Box uses a linear architecture and the N+1 model.
The linear architecture makes it simple to scale out and manage, as there are no single points of failure; each vdiMgr in a grid is identical to the others (with the exception of some items such as IP addresses and IDs). With the N+1 model, each server participating in a VDI-in-a-Box grid is active and serving desktops to users. Server sizing is essential; when done properly, it allows an administrator to have an extra server that makes the grid highly available in the event of a server failure. This extra server, along with all the other servers in the grid, allows desktops to be spun up to make up for the loss of a failed server.
Load Balancing and Grid Capacity
Load balancing is done automatically in a VDI-in-a-Box grid. The algorithms used to determine load balancing primarily depend on the total, used, and available physical memory and processors in each server. VDI-in-a-Box grids can be composed of physical servers of different capacities and vendors. The only requirement for a VDI-in-a-Box grid is that the hypervisor (XenServer, ESX/ESXi, or Hyper-V) is the same on each server. By default, a vdiMgr assumes the entire server, with the exception of a 2 GB threshold, is available to use for virtual desktops. This capacity can be adjusted through the VDI-in-a-Box web console by completing the following procedure:
Open the Servers tab.
Select the Desktop link associated with a server.
Click Adjust Capacity.
Review the VDI-in-a-Box 5.1 Server Sizing Guide for details regarding the amount of RAM used by each hypervisor for a specific number of desktops provisioned.
Note: This is useful if you have other virtual machines on the same host that are outside the control of VDI-in-a-Box, such as a NetScaler VPX or a Windows Server virtual machine. This is typically not recommended, as infrastructure servers or services such as Active Directory should not be hosted on the same servers hosting VDI. This is a general recommendation and is not specific to VDI-in-a-Box.
The grid capacity is a combination of all the servers participating in a grid. Although there is some variance amongst the different hypervisors, VDI-in-a-Box uses a 90% maximum capacity for the grid by default, with the ability to lower or raise this number (up to 200%). This grid-wide setting is used when load balancing and spinning up desktops on each server in the grid. After a server reaches this threshold, it will not spin up any more desktops. This adjustment can be used in combination with the individual server capacity adjustment to ensure servers are not over-committed. The number can also be increased to a higher threshold to ensure that in the case of a disaster (server failure) the functional servers in the grid can spin up extra desktops. This might create instability on the servers and should be tested before going into production to find the sweet spot. In many cases, it might be more important to have extra desktops so all the users can continue working, even with some degraded performance on the servers. These situations are usually temporary, lasting from several hours to days or weeks. For long-term outages, it is recommended to check the performance on the servers and, if possible, add more servers to the grid to relieve the workload.
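As a minimal sketch of how a capacity threshold gates new desktops, the following hypothetical helper compares a server's committed memory against the grid capacity percentage; the real VDI-in-a-Box algorithm also weighs CPU and per-server reservations, so treat this as an illustration only:

```python
def can_spin_up(used_ram_gb, total_ram_gb, grid_capacity_pct=90):
    """Illustrative only: return True while the server remains below the
    grid capacity threshold (default 90%), so another desktop may start."""
    return (used_ram_gb / total_ram_gb) * 100 < grid_capacity_pct

# A server with 86 of 96 GB committed (~89.6%) is just under the default cap:
print(can_spin_up(86, 96))   # True
print(can_spin_up(88, 96))   # False (~91.7% used)
```

Raising grid_capacity_pct above 100 mirrors the over-commit option described above: the server keeps accepting desktops past its nominal capacity during a failover.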
Grid Settings
VDI-in-a-Box grid members communicate with each other on a regular basis using heartbeats. These heartbeats allow grid members to monitor the state of each server.
The default server failover threshold is 15 minutes. This means that a server does not go into a missing state until the other servers in the grid have not received any heartbeats from it for 15 minutes. This threshold is sufficient in most cases, primarily because most environments still contain SPOFs, and a short-term failure, such as an unplugged power or network cable, is often resolved within a few minutes. The threshold allows the administrator to resolve the issue within 15 minutes without the grid marking the server as missing and marking its pooled desktops as destroyed, which also places additional workload on the functional servers to make up for the missing server. Adjusting this threshold to a low number, such as two minutes, might cause more issues than anticipated: if the same short-term failure occurs, such as an unplugged cable, the entire grid endures a larger workload to make up for the server marked as missing. This might affect more users than a few minutes of downtime.
There are some scenarios in which a low server failover threshold is viable, primarily when most SPOFs have been reduced. Mitigating risk with redundant systems allows an administrator to be more confident that unexpected outages will not occur. These outages usually include a power supply failure or a network cable or switch being powered off or failing. Investing in redundant systems and components helps prevent these SPOFs from causing failures. Some failures, such as natural disasters, can cause more harm and require more intricate disaster recovery procedures. As a best practice, it is recommended to keep the server failover threshold at or near the default 15 minutes. Change it to a lower value only if your environment has eliminated common SPOFs and the change has been tested.
Complete the following procedure to adjust server failover:
Log on to the VDI-in-a-Box web console as an administrator.
Open the Admin > Advanced Properties menu.
Open the Grid section.
Adjust the time in the Seconds of lost heartbeat before failing server field.
Click OK to save the settings.
As a best practice, there are two things the administrator can do to ensure servers do not go missing from a grid, such as when updating the grid or when network issues are causing heartbeats to be missed. For planned maintenance, placing the grid in Maintenance mode is typically required. This is true when changing some hypervisor settings, updating the VDI-in-a-Box grid, or installing a new license. When a grid is placed in Maintenance mode, users are not able to log on, existing user sessions are not disrupted, and server-to-server heartbeats are disabled. However, if external factors are causing issues, such as significant packet loss on a switch, you might have to disable the server heartbeats without preventing users from logging on. In that case, go to the Grid section in the Advanced Properties menu and clear Enable server failover on lost heart beat. This allows the grid to function as expected, with the exception of these server-to-server heartbeats.
Desktop Templates
VDI-in-a-Box does not limit the number of desktops an administrator can spin up, even if it surpasses the number of licenses purchased. Apart from server memory, CPU, and disk limitations, VDI-in-a-Box licensing is based on concurrent user sessions, not total desktops or users. In some cases you might have users on different work shifts; if using pooled desktops that are refreshed on a regular basis, you only require enough licenses for the number of users at any given time.
HA is also configured through desktop templates, because an administrator must find the sweet spot for the deployment. Server sizing is also very important with the N+1 architecture to ensure that the remaining servers in a grid can spin up enough desktops to make up for a failed server. Assuming server sizing is done correctly, an administrator has a few options.
Many administrators spin up templates to equal the total number of licenses purchased. This works as expected when all servers are fully functional, but can impede user productivity in the event of a server failure. The simplest resolution is to set the maximum and pre-start numbers for the desktops to be more than the number of concurrent users expected at any given time. For example, with a 100 CCU license and an expected average of 90-100 users at any given time, you might want to spin up 120 desktops (depending on server quantity and capacities). If a server were to fail, this would ensure there are already desktops in the New status on the functional servers, all waiting for the users on the failed server to log into. This not only requires fewer server resources with regards to IOPS spikes, but also ensures the users get quicker access to desktops and continue working. The number of desktops to spin up depends on the server capacities, but as a general best practice, ensure there are a significant number of New desktops available across the grid to be allocated to users in the event of a server failure.
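The sizing rule of thumb above can be sketched in a few lines of Python (the helper name and the 20% headroom figure are our illustration, not a VDI-in-a-Box API):

```python
def prestart_plan(ccu_licenses, servers, headroom_pct=20):
    """Pre-start more desktops than licensed concurrent users so that
    spare 'New' desktops exist on every server if one server fails."""
    total = int(ccu_licenses * (1 + headroom_pct / 100))
    return total, total // servers

# 100 CCU license across 5 servers with 20% headroom:
print(prestart_plan(100, 5))   # (120, 24) -> 120 desktops, 24 per server
```

The right headroom percentage depends on server capacities; the point is only that the total pre-started count should exceed the expected concurrent users.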
Finding the Right HA Configuration
This section covers some ways to ensure HA meets your specific requirements. It is important to note that VDI-in-a-Box does not limit an administrator from spinning up more desktops than the grid is licensed for; licensing is based on concurrent user sessions and not the total number of desktops. The ability to spin up more desktops than required is vital to HA because it allows a much faster failover for users on a failed server. There are also performance benefits, as the functional servers are not required to spin up any desktops (or only a few), preventing the IOPS spikes that occur when new desktops are created. Refer to the Knowledge Center article CTX135014 - Overview of How High Availability Works with VDI-in-a-Box for more information regarding what happens when a VDI-in-a-Box server fails.
Following is a real-world example of how spinning up more desktops can help:
VDI-in-a-Box Grid consists of 5 servers.
Each server is identical in regards to RAM, CPUs, and Disks.
Each server is capable of running 120 desktops.
The entire grid is capable of running a maximum of 600 desktops total.
The customer has purchased a 400 CCU VDI-in-a-Box license.
There are several Windows 7 32-bit Golden Images.
There are several Pooled desktop templates, each with the same RAM and vCPU allocation.
The VDI-in-a-Box administrator has invested in several infrastructure features such as redundant power supplies to limit the number of SPOF in the environment. The servers are locked in the racks and the administrator is confident that servers will not accidentally go offline because of human error. Most of the VDI-in-a-Box users are the Support and Sales teams so the need for users to install new applications on the desktops is not a requirement. All profiles are synced up using Citrix UPM, so even if a user loses a desktop, all their application settings, documents, and files are available on the new desktop.
The administrator decides to leave the Grid Capacity at 90%.
The calculations show that if one server fails while all desktops are in use, 80 desktops need to be spun up on the remaining 4 servers in the grid. This means each server would need to spin up 20 desktops, totaling 100 desktops each. As each server has enough RAM, CPU, and disk to spin up 120 desktops, a single server failure would cause each server to go up to 83% usage. Under a normal workload of 80 desktops, the capacity sits around 67% for each server, so the failure causes an increase of about 16 percentage points. The performance impact is minimal, as most of the users are task workers. The administrator accounted for this when sizing the servers: if each server were only able to host a maximum of 100 desktops, the capacity of each remaining server after a failure would reach 100%. Depending on the hypervisor being used, the administrator could have purchased smaller servers and adjusted the capacity to 100% (up to 200% on some hypervisors). This might work but is quite risky, as running servers at or over maximum capacity can lead to major performance issues and possibly desktop or server crashes.
Normally, an administrator would set all the templates to spin up 400 desktops because a 400 CCU license was purchased. This would be acceptable, but if a single server were to fail, 80 desktops would need to be spun up across the remaining 4 servers in the grid. Some of the users might not be able to log back into a desktop for up to half an hour from the moment the server failed. The administrator feels this is unacceptable and makes a few more changes.
The administrator decides to change the maximum and pre-start of the templates to equal 500 instead of 400. After saving the templates, a message appears indicating there are more desktops than licenses, but this does not prevent desktops from spinning up. Since all the servers in the grid are the same, each server will have 100 desktops instead of just 80. Remember, this brings the capacity of each server up to 83% instead of 67%; even though many of the desktops are idle and not in use, they still consume memory, CPU cycles, and disk space. If all 400 users were logged into desktops and a single server failed, they would be able to reconnect once the server goes missing, as there are already 80 desktops in the New status on the remaining servers. The calculation is rather simple:
100 desktops per server * 5 servers = 500 total desktops.
400 working users / 5 servers = 80 users per server, leaving 20 idle (New) desktops per server.
If a single server fails: 80 users on the failed server / 4 functional servers = 20 New desktops required on each remaining server.
There are already 20 New desktops per server (100 desktops per server – 80 desktops in use = 20 available).
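The arithmetic above can be double-checked with a short Python sketch (the figures come from the example; the variable names are ours):

```python
# 5 identical servers, 120-desktop capacity each, a 400 CCU license,
# and 500 pre-started desktops.
servers, capacity_each = 5, 120
desktops_total, active_users = 500, 400

per_server = desktops_total // servers            # 100 desktops per server
active_per_server = active_users // servers       # 80 in use per server
idle_per_server = per_server - active_per_server  # 20 idle (New) per server

# If one server fails, its 80 users spread across the 4 survivors:
needed_per_survivor = active_per_server // (servers - 1)  # 20
assert idle_per_server >= needed_per_survivor             # failover is covered

normal_load = round(100 * active_per_server / capacity_each)  # 67 (%)
failover_load = round(100 * per_server / capacity_each)       # 83 (%)
```

If the final assertion held false for your numbers, the grid would have to spin up replacement desktops on demand, with the slower failover described earlier.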
As the last step, the administrator changes the server failover threshold. The default 15 minutes is too long, as the users on a failed server would not be able to log back in until the 15-minute threshold is met, even if there are desktops available on the remaining servers. The administrator has invested heavily in infrastructure and changes this threshold to 90 seconds (1.5 minutes). This means that if a server goes missing from the grid, its users can log back into new desktops on the other servers in the grid after 90 seconds.
All 500 desktops are spun up and the administrator is ready to do some testing with Login VSI and some real users. After a week of testing there are no server failures and everything works as expected. The administrator decides to simulate a single server failure to verify that the configuration produces the expected results. With 400 active sessions running, the administrator unplugs the redundant power supply cables on one of the servers in the grid and logs into the VDI-in-a-Box web console. The administrator notices warnings in Recent Tasks and Events showing the server has missed several heartbeats, and after 90 seconds the server goes into the missing state. The desktop sessions residing on the failed server ended immediately after the power cables were unplugged, so the administrator attempts to log back in as one of those test users once the server goes missing. The administrator successfully receives a new desktop on one of the remaining servers, and the documents are still there because Citrix UPM is being used. Several of the other test users on the failed server log back in as well after the 90-second threshold, and all of them immediately receive new desktops.
The following lessons are derived from this example:
It is important to find the sweet spot for High Availability when it comes to VDI-in-a-Box.
The default settings are good for many use cases, but if desktops are required immediately in the event of a failure for business continuity planning (BCP), it is important to keep a few things in mind.
First, it is a good idea to invest in simple and cost-efficient redundancy such as redundant power supplies and teamed or bonded network interfaces.
Second, always test the grid in a lab or staged environment, especially when tweaking some of the settings. Although this example provides an outline of expected behavior, results will vary based on server capacities.
Third, properly size servers using the N+1 model to ensure desktops can safely spin up in the event of a single server failure.
Fourth, adjust the capacity of the individual servers, the capacity of the entire grid, and the server failover time only if required. If there are many SPOFs in the environment, a 90-second server failover might cause more harm than good: users lose their desktop sessions unnecessarily if, for example, a network cable on a server is unplugged for only a few minutes.