This article is intended to provide an overview of how High Availability works with VDI-in-a-Box. It explains what happens when servers fail, what users will experience, and what administrators can do to improve the user experience.
For best practices regarding VDI-in-a-Box High Availability, refer to the Knowledge Center article CTX135013 - VDI-in-a-Box High Availability Best Practices.
It is important to understand exactly what happens when a server fails and how it will affect the users. A clear understanding allows VDI-in-a-Box administrators to be proactive in ensuring that the users have little to no downtime. Each of the following sections is dedicated to describing a different aspect of VDI-in-a-Box High Availability and failover. Remember that the information provided describes the events that occur, but the VDI-in-a-Box HA Best Practices guide should be followed to understand some of the grid settings such as Grid IP and Server Failover Timeout.
The VDI-in-a-Box product uses health checks amongst grid members. This allows the load balancing and failover mechanisms to understand the current state of each server. Functional servers will typically be in an Activated or Deactivated state. Servers encountering functional problems, such as incorrect hypervisor credentials because of a password change, will be in a Broken state. If a server does not respond to heartbeats for a certain period of time (the default threshold is 15 minutes), it will go into a Missing state in the grid. This means that each remaining grid member will recognize that server as missing and will try to compensate for the missing desktops.
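The relationship between heartbeat silence and server state can be sketched as follows. This is only an illustrative model, not the product's implementation: the function name, the status names `OK` and `WARNING`, and the 30-second heartbeat interval are assumptions made for the example. Only the 15-minute default failover threshold and the Missing state come from this article. The Broken state has a different cause (for example, bad hypervisor credentials) and is not modeled here.

```python
from enum import Enum

# Default server failover threshold described in this article: 15 minutes.
DEFAULT_FAILOVER_THRESHOLD_S = 15 * 60


class HeartbeatStatus(Enum):
    OK = "ok"            # heartbeats are arriving on schedule
    WARNING = "warning"  # heartbeats missed, still below the threshold
    MISSING = "missing"  # silent for the whole threshold; treated as Missing


def classify_server(seconds_since_last_heartbeat: float,
                    heartbeat_interval_s: float = 30.0,  # assumed value
                    failover_threshold_s: float = DEFAULT_FAILOVER_THRESHOLD_S
                    ) -> HeartbeatStatus:
    """Map the time since the last heartbeat to a status (hypothetical helper)."""
    if seconds_since_last_heartbeat >= failover_threshold_s:
        return HeartbeatStatus.MISSING
    if seconds_since_last_heartbeat > heartbeat_interval_s:
        return HeartbeatStatus.WARNING
    return HeartbeatStatus.OK


# An appliance restart that takes three minutes only produces warnings.
print(classify_server(180).name)
# Fifteen minutes of silence crosses the default threshold.
print(classify_server(15 * 60).name)
```

In this model a short outage, such as an appliance restart, never reaches the Missing state as long as the server answers again before the threshold expires.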
These health checks are sent out between servers on a regular basis. Servers can miss several heartbeats without any interruption or removal from the grid, but the administrator will see warning messages in the console. For example, with the default server failover threshold of 15 minutes, the administrator might decide to restart one of the vdiMgr appliances without placing the grid into Maintenance Mode. As the appliance restart might take a few minutes, there will be warning messages in the console stating that server X.X.X.X has been missing a heart beat for XX seconds. As soon as that vdiMgr appliance comes back online, the warnings stop and the server will not go into a Missing state.
Dynamic desktops in VDI-in-a-Box are refreshed on logout and/or on a schedule. VDI-in-a-Box 5.1 categorizes these types of desktops as Pooled Desktops. When a server fails, all Dynamic desktops in the New or In-Use status on that server will be forgotten by the remaining servers in the grid. This means the remaining grid will need to compensate for these lost desktops and will start spinning up desktops to reach the Pre-Start and Max numbers defined in the templates.
Persistent desktops in a VDI-in-a-Box grid are defined as those set to a Manual refresh. These are similar to Dedicated Desktops in a XenDesktop environment, as users are assigned to a single desktop and will always log into it. Unfortunately, VDI-in-a-Box does not offer High Availability for Persistent desktops. When a server fails, users assigned to Persistent desktops will not be able to log back into their desktops unless the server comes back online, or the administrator manually destroys the desktops and creates new ones for the users.
Personal desktops in a VDI-in-a-Box grid are a hybrid of Dynamic and Persistent desktops. The base image can be refreshed, providing granular control to the administrator, while the personal vDisk attached to the desktop allows users to install applications and have persistent profile data. When a server fails, Personal desktops in the New or In-Use status will be remembered by the remaining members of the grid. This means the remaining grid will not spin up more Personal desktops to compensate for the Pre-Start and Max values set in the Personal desktop templates.
If the administrator has spun up more desktops than are licensed, the grid will still attempt to compensate for the total number, but most likely there will not be enough server capacity (the limitation can be disk, CPU, or memory). However, users will be able to log into new desktops much faster, as there will already be a sufficient number of desktops in the New status in the remaining grid.
Dynamic desktops that were spun up on remaining servers in a grid will not be destroyed if they are in the In-Use status. Initially, the grid might exceed the total number of New desktops defined in the templates as the failed server comes back online. All Dynamic desktops in the New status across the grid will automatically spin down according to server workload using the automatic load balancing algorithms. This means that the users originally on the failed server who are now on desktops on other servers will not be interrupted. These In-Use desktops will be refreshed according to the refresh policy set in the templates.
Persistent desktops that were not manually destroyed by the administrator will come back online when the server returns to the grid. This means that the users assigned to those Persistent desktops will be able to log back in as usual. It is important to note that if the server encountered a severe failure, such as disk failure with possible data corruption, Persistent desktops might need to be destroyed. Ensure that all users, even those with Persistent desktops, have important documents and files backed up on a regular basis.
Personal desktops will have the same behavior as Persistent desktops if the administrator did not intervene. This means Personal desktop users will be able to log back into their desktops on the original server once it has come back online. The personal vDisks can be backed up and restored to another server or central storage repository, allowing users to log into new Personal desktops on other servers. Refer to the following Knowledge Center articles for more information regarding Personal desktops and how to back up or restore such desktops:
CTX134792 - Backup and Restore VDI-in-a-Box Personal Desktops
CTX134793 - Best Practices for Personal Desktops with VDI-in-a-Box
There could be many reasons for a server to go into a Missing state: power failure, network failure, disk failure, or any other situation that causes a server to no longer be operational. Grid reformation happens when a server, or a group of servers, goes into a Missing state and then comes back online. There are two possible scenarios for grid reformation.
A single server returning to an existing grid will always receive grid state from one of the existing members of the grid. This applies when there are at least three servers in the grid, including the returning server. If there are only two servers in the grid, each server will see the other as missing, assuming it is a network failure and not a physical server failure. Upon return, one of the servers will be chosen to send the grid state to the other server.
When a grid with multiple servers is severed, there can be a situation where two separate grids remain, each containing multiple servers. Once the issue is resolved and all servers can communicate with one another again, the grid with the most servers will be chosen to provide grid state to the other grid members. The easiest way to think about this is that the side with the most servers always wins and provides grid state. However, if both sides of the severed grid contain the same number of servers, the two-server behavior described above applies: one side will be chosen to send the grid state to the other.
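The "most servers wins" rule can be expressed as a small sketch. The function name and the use of server address lists are assumptions made for illustration; this is not VDI-in-a-Box code. The article does not say how the winner is picked when both sides are the same size, so the sketch returns `None` for a tie rather than guessing.

```python
from typing import Optional, Sequence


def grid_state_source(side_a: Sequence[str],
                      side_b: Sequence[str]) -> Optional[Sequence[str]]:
    """Return the side of a severed grid that provides grid state.

    Hypothetical helper: the larger side wins. A tie returns None because
    the selection rule for equally sized sides is not documented here.
    """
    if len(side_a) > len(side_b):
        return side_a
    if len(side_b) > len(side_a):
        return side_b
    return None


# A five-server grid split 3/2: the three-server side provides grid state.
majority = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
minority = ["10.0.0.4", "10.0.0.5"]
print(grid_state_source(majority, minority))

# A two-server grid split 1/1 is a tie; one server is chosen by the grid.
print(grid_state_source(["10.0.0.1"], ["10.0.0.2"]))
```

The single returning server case is the same rule with a one-member side: the existing grid is larger, so it provides the state.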
Server sizing comes into play because servers in a grid will compensate for the loss of a failed server. This is especially true for Dynamic desktops, so if servers are only sized to accommodate the number of desktops licensed, you might run into issues when a server fails. Basic memory, CPU, and disk calculations need to be taken into consideration when sizing servers to ensure failover can occur. Using the N+1 model will ensure that your grid can safely accommodate the loss of a single server. A simple way to look at this concept is to take the maximum number of desktops you plan to require at any given time, size each server so that the total load (memory/CPU/disk) is distributed evenly, and then add another server. The additional server will be used actively in the grid, and is not what others might call a “hot spare”. This N+1 model allows you to spin up extra desktops on all the remaining servers in a grid.
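As a rough worked example of the N+1 model, the arithmetic might look like the sketch below. The function names and the figures (100 desktops, 25 desktops per server) are illustrative assumptions only. Real sizing should be based on measured memory, CPU, and disk requirements per desktop.

```python
import math


def servers_needed_n_plus_1(max_desktops: int, desktops_per_server: int) -> int:
    """N servers to carry the full load, plus one more (hypothetical helper)."""
    n = math.ceil(max_desktops / desktops_per_server)
    return n + 1


def desktops_per_active_server(max_desktops: int, active_servers: int) -> int:
    """Worst-case desktops per server when the load is spread evenly."""
    return math.ceil(max_desktops / active_servers)


max_desktops = 100         # assumed peak number of desktops
desktops_per_server = 25   # assumed capacity of one server

total = servers_needed_n_plus_1(max_desktops, desktops_per_server)
print(total)                                                # 5 servers in the grid
print(desktops_per_active_server(max_desktops, total))      # 20 each, all healthy
print(desktops_per_active_server(max_desktops, total - 1))  # 25 each, one failed
```

With all five servers active, each one runs below its capacity. After one failure the remaining four run at capacity but can still host every desktop, which is the point of the extra server.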
VDI-in-a-Box takes CPU and memory into consideration for load balancing. Disk space is not used for these calculations, but the servers are aware of total/used/free disk space. Running low on disk space, especially during a server failover, will cause warning and error messages to appear in the VDI-in-a-Box web console for the administrator. Remember that most server failover scenarios are temporary and there will be a slight performance impact on users, as more desktops than usual will be running on the functional servers. The impact depends on many factors such as current workload, server specs, and types of users.
If the servers are not sized properly and a server fails, you might encounter a situation where some desktops cannot be spun up on the remaining grid members because there is not enough memory/CPU/disk. The VDI-in-a-Box grid will spin up desktops until maximum capacity is reached, and at that point the administrator will see warning and error messages in the console.
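To estimate in advance whether the remaining servers could absorb a failure, a back-of-the-envelope check along these lines may help. Everything here is an assumption for illustration: the class and function names, the per-desktop footprint, and the simplification that each desktop consumes a fixed amount of memory, CPU, and disk. It does not reproduce the grid's load balancing algorithm.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Resources:
    """Free resources on a server, or the footprint of one desktop."""
    memory_gb: float
    vcpus: float
    disk_gb: float


def desktops_that_fit(free: Resources, footprint: Resources) -> int:
    """Desktops one server can still host; the scarcest resource decides."""
    return int(min(free.memory_gb // footprint.memory_gb,
                   free.vcpus // footprint.vcpus,
                   free.disk_gb // footprint.disk_gb))


def failover_shortfall(lost_desktops: int,
                       remaining_servers: Iterable[Resources],
                       footprint: Resources) -> int:
    """Desktops that could not be recreated on the remaining servers."""
    capacity = sum(desktops_that_fit(s, footprint) for s in remaining_servers)
    return max(0, lost_desktops - capacity)


footprint = Resources(memory_gb=2, vcpus=1, disk_gb=20)  # assumed per desktop
remaining = [
    Resources(memory_gb=16, vcpus=8, disk_gb=200),  # fits 8 (memory and CPU bound)
    Resources(memory_gb=8, vcpus=8, disk_gb=50),    # fits 2 (disk bound)
]
print(failover_shortfall(15, remaining, footprint))  # 5 desktops cannot be placed
```

A non-zero result corresponds to the situation described above, where the grid fills the remaining servers and the administrator then sees warning and error messages in the console.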
MAC Address Pools
If MAC address pools are used and a single server in a grid becomes unavailable, the missing server's allocated MAC addresses will be reclaimed. These can be used for new desktops spun up on the remaining functional grid. The scenario is a bit different when there are two servers that become mutually missing. In this case, each server will make use of the full range of MAC addresses in the pool as required.