Recovering Virtual Machines from Failed Pool Member

Article ID: CTX132387

Description

In the event of a XenServer host power failure, any Virtual Machines (VMs) running on that host might not be displayed in XenCenter. This is the expected behavior without High Availability (HA) enabled.

The following XenCenter screenshot shows the pool before the failure of the host named "xenserver2":

User-added image

The following screenshot shows the same pool after the failure of xenserver2:

User-added image

Note: VMs that were running on xenserver2 are not displayed in XenCenter.

Environment

The above mentioned sample code is provided to you as is with no representations, warranties or conditions of any kind. You may use, modify and distribute it at your own risk. CITRIX DISCLAIMS ALL WARRANTIES WHATSOEVER, EXPRESS, IMPLIED, WRITTEN, ORAL OR STATUTORY, INCLUDING WITHOUT LIMITATION WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NONINFRINGEMENT. Without limiting the generality of the foregoing, you acknowledge and agree that (a) the sample code may exhibit errors, design flaws or other problems, possibly resulting in loss of data or damage to property; (b) it may not be possible to make the sample code fully functional; and (c) Citrix may, without notice or liability to you, cease to make available the current version and/or any future versions of the sample code. In no event should the code be used to support ultra-hazardous activities, including but not limited to life support or blasting activities. NEITHER CITRIX NOR ITS AFFILIATES OR AGENTS WILL BE LIABLE, UNDER BREACH OF CONTRACT OR ANY OTHER THEORY OF LIABILITY, FOR ANY DAMAGES WHATSOEVER ARISING FROM USE OF THE SAMPLE CODE, INCLUDING WITHOUT LIMITATION DIRECT, SPECIAL, INCIDENTAL, PUNITIVE, CONSEQUENTIAL OR OTHER DAMAGES, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Although the copyright in the code belongs to Citrix, any distribution of the sample code should include only your own standard copyright attribution, and not that of Citrix. You agree to indemnify and defend Citrix against any and all claims arising from your use, modification or distribution of the sample code.

Resolution

To recover from this issue, complete the following procedure:

  1. Confirm whether the host has actually failed and determine whether it can be recovered.
    After certain hardware failures, the host cannot be recovered and must be removed from the pool altogether.

  2. If the host is not recoverable, run the following command to obtain a list of VMs that are running on the failed host:
    xe vm-list resident-on=<UUID_of_failed_host> is-control-domain=false params=uuid

  3. Run the following command to reset the power-state on those VMs:
    xe vm-reset-powerstate uuid=<UUID_of_the_VM_to_recover> --force

    a) To reset the power-state of all the VMs that were locked on the failed pool member, run the following command:
    xe vm-reset-powerstate resident-on=<UUID_of_failed_host> --multiple --force

    As the power states are reset for the VMs, they are displayed at the bottom of the pool list in XenCenter:

    User-added image

  4. After all the VMs are recovered, the failed host can be forgotten, that is, removed from the pool (see CTX130821 for the host-forget command).

  5. Before you can start the VMs on another XenServer host, you must release the "locks" on the VM storage.
    Each disk in a Storage Repository (SR) can be used by only one host at a time, so the disks must be made accessible to other XenServer hosts after a host has failed.

  6. To do so, run the following script on the pool master for each SR that contains disks of any affected VMs:
    /opt/xensource/sm/resetvdis.py all <UUID_of_failed_host> <UUID_of_SR> [--master]

    Note: Supply the optional final argument ("--master") only in the following cases:
    . When the SR is not shared (i.e. local storage)
    . When the SR is a shared SR and the failed host is the pool master

    Warning! Incorrect use of this command can lead to data corruption. Before running the preceding command for an SR, ensure that the following conditions are true:
    . The failed XenServer host is unrecoverable
    . The SR is not attached
    . The VDIs on the SR are not in use

    If you attempt to start a VM on another XenServer host before running this command, you might receive the following error message:
    VDI <UUID> already attached RW.
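
The xe commands in steps 2 and 3 can be combined into a small shell sketch. The function names and the DRY_RUN variable below are illustrative assumptions, not part of XenServer. The sketch assumes it runs on a pool host where the xe CLI is available, and by default it only prints the reset commands instead of running them.

```bash
#!/usr/bin/env bash
# Sketch only: wraps the xe commands from steps 2 and 3.
# list_resident_vms, reset_resident_vms and DRY_RUN are hypothetical names.

# Print the UUIDs of all non-control-domain VMs still recorded as
# resident on the given host, one UUID per line.
list_resident_vms() {
    local host_uuid="$1"
    # --minimal makes xe print a comma-separated list of UUIDs;
    # tr splits it into lines and sed drops an empty result line.
    xe vm-list resident-on="$host_uuid" is-control-domain=false \
        params=uuid --minimal | tr ',' '\n' | sed '/^$/d'
}

# Reset the power-state of each VM returned by list_resident_vms.
# With DRY_RUN=1 (the default) the commands are printed, not executed.
reset_resident_vms() {
    local host_uuid="$1" vm_uuid
    list_resident_vms "$host_uuid" | while read -r vm_uuid; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "xe vm-reset-powerstate uuid=$vm_uuid --force"
        else
            # Redirect stdin so xe cannot consume the UUID list.
            xe vm-reset-powerstate uuid="$vm_uuid" --force </dev/null
        fi
    done
}
```

For example, calling reset_resident_vms <UUID_of_failed_host> prints one reset command per affected VM, and calling it with DRY_RUN=0 runs them. Step 1 still applies: the host must be confirmed as unrecoverable first.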


Problem Cause

This is the expected behavior when High Availability (HA) is not enabled. Without HA, XenServer cannot confirm that a host has failed and therefore cannot recover the VMs from that host. It is unsafe to restart the VMs on other hosts in the pool if the problem host is unresponsive rather than completely failed, because this might cause data corruption.
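
Whether this behavior applies to a given pool depends on its HA setting. As a minimal sketch, assuming the xe CLI is available on a pool host, the pool's ha-enabled parameter can be queried as shown below. The function name describe_ha_state and the message texts are illustrative only.

```bash
#!/usr/bin/env bash
# Sketch only: report whether HA is enabled on the current pool.
# describe_ha_state is a hypothetical helper name.
describe_ha_state() {
    local ha
    # With --minimal, xe prints just the parameter value (true or false).
    ha="$(xe pool-list params=ha-enabled --minimal)"
    if [ "$ha" = "true" ]; then
        echo "HA is enabled on this pool."
    else
        echo "HA is not enabled on this pool; VMs on a failed host need manual recovery."
    fi
}
```

Calling describe_ha_state on a pool host prints one of the two messages.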

Issue/Introduction

Recovering Virtual Machines from Failed Pool Member.

Additional Information

CTX119717 – XenServer High Availability

CTX130821 – How to Clean the Xapi Database after Running the host-forget Command