How to use and troubleshoot NVIDIA vGPU XenMotion with XenServer

Article ID: CTX232828

Description

This article describes how to use and troubleshoot the NVIDIA vGPU XenMotion feature.
 


Instructions

NVIDIA vGPU XenMotion was a Tech Preview feature in XenServer 7.3; it is fully supported as of XenServer 7.4.
NVIDIA vGPU XenMotion enables a VM that uses a virtual GPU to perform XenMotion, Storage XenMotion, or VM Suspend. VMs with vGPU XenMotion capabilities can be migrated to avoid downtime and can be included in High Availability deployments.
 

Enabling and using NVIDIA vGPU XenMotion

Here are the high-level steps to prepare for and use the NVIDIA vGPU XenMotion feature. 
  1. Install XenServer 7.4 on a server with NVIDIA GRID cards attached and license it with Enterprise Edition or through a XenDesktop/XenApp entitlement.
  2. Install the NVIDIA GRID vGPU Manager with XenMotion enabled for XenServer in the Control Domain.
  3. Assign a virtual GPU to the VM.
  4. Install the NVIDIA XenMotion-enabled vGPU driver on Windows VMs.
  5. After preparation, you can perform the following operations on a VM that uses GPU Pass-through or a virtual GPU:
  • XenMotion (Intra-Pool)

 
  • Storage XenMotion (Inter-Pool)
[Screenshot: The Add New Server dialog]
 
  • VM Suspend
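These operations can also be performed from the xe CLI. The commands below are a sketch using standard xe usage; the angle-bracket placeholders are hypothetical names you would substitute, and the flags should be verified against your XenServer version:

```shell
# Intra-pool XenMotion: live-migrate the VM to another host in the same pool
xe vm-migrate vm=<vm-name> host=<destination-host> live=true

# Inter-pool Storage XenMotion: migrate to a host in a different pool
xe vm-migrate vm=<vm-name> host=<destination-host> live=true \
    remote-master=<destination-pool-master> remote-username=root remote-password=<password>

# Suspend the VM
xe vm-suspend vm=<vm-name>
```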
 
 

The following requirements and restrictions apply when using NVIDIA vGPU XenMotion:

  • XenServer 7.4 and above (7.3 was Tech Preview)
  • XenServer Enterprise Edition (or access to XenServer through a XenDesktop/XenApp entitlement)
  • An NVIDIA GRID card, Maxwell family or later. Migration is only supported between hosts with the same GPU card model.
  • An NVIDIA GRID Virtual GPU Manager for XenServer with XenMotion enabled. For more information, see the NVIDIA Documentation.
  • Windows VM with NVIDIA XenMotion-enabled vGPU drivers installed. VMs without the appropriate vGPU drivers installed are not supported with any vGPU XenMotion features.
  • NVIDIA subscription or a license for the supported NVIDIA cards. Refer to the NVIDIA product information for details.
  • XenMotion of VMs from previous versions of XenServer is not supported. For example, VMs running on a XenServer 7.3 host can’t be migrated to a XenServer 7.4 host.
  • Reboot and shutdown operations on a VM while a migration is in progress are not supported and can cause the migration to fail.
  • Linux VMs are not supported with any NVIDIA vGPU XenMotion features.
  • Disk and memory snapshotting is not supported.


Troubleshooting

Here are some common checkpoints when troubleshooting NVIDIA vGPU XenMotion issues.

Common checkpoint #1: Ensure the NVIDIA host driver is loaded successfully and vGPU migration is enabled.

  1. Verify that vGPU migration is enabled in the NVIDIA host driver by running the following command:

[root@xenserver ~]# dmesg | grep -E "NVRM|nvidia"
[   23.667910] nvidia: module license 'NVIDIA' taints kernel.
[   23.701203] nvidia-nvlink: Nvlink Core is being initialized, major device number 242
[   23.701952] nvidia 0000:07:00.0: enabling device (0000 -> 0003)
[   23.702288] nvidia 0000:08:00.0: enabling device (0000 -> 0003)
[   23.702580] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  390.28  Wed Jan 31 02:38:16 PST 2018 (using threaded interrupts)
[   23.702597] NVRM: PAT configuration unsupported.
[   25.478421] NVRM: Enabling vGPU live migration.
[   26.311678] NVRM: Enabling vGPU live migration.
[   34.076410] NVRM: Enabling vGPU live migration.
[   34.697255] NVRM: Enabling vGPU live migration.
[root@xenserver ~]#
 

You can also find this information in the kernel log (/var/log/kern.log), in lines prefixed with NVRM or nvidia.
If you don’t see “NVRM: Enabling vGPU live migration” in the output, refer to the NVIDIA documentation to enable vGPU live migration.
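The check above can be scripted. A minimal sketch follows: the function name is our own (not part of any NVIDIA tooling), and on a real host you would feed it `dmesg` output rather than the sample line shown.

```shell
# check_vgpu_migration: succeeds if the log text on stdin contains the
# NVRM message that confirms vGPU live migration is enabled.
check_vgpu_migration() {
    grep -q "NVRM: Enabling vGPU live migration"
}

# Example against a sample log line (on a host: dmesg | check_vgpu_migration):
if printf '%s\n' "[   25.478421] NVRM: Enabling vGPU live migration." | check_vgpu_migration; then
    echo "vGPU live migration: enabled"
else
    echo "vGPU live migration: NOT enabled - check the NVIDIA host driver build"
fi
```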


Common checkpoint #2: Ensure the host and guest NVIDIA driver versions match.

The NVIDIA host and guest drivers come in pairs in the same NVIDIA GRID vGPU software package. Whenever the host driver is updated, ensure the guest driver is updated too; mismatched host and guest driver versions can cause problems. The versions don’t need to be identical, but they must come from the same GRID major release (GRID 6 or later).
Perform the following steps to check the NVIDIA driver versions:

  1. In the Control Domain, run the command nvidia-smi:
    [root@xenserver ~]# nvidia-smi
    Tue Feb 27 08:30:10 2018
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 390.28                 Driver Version: 390.28                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|

    [root@xenserver ~]#
 
  2. In the Control Domain, run the command nvidia-smi vgpu -q to get the guest driver version. 
    You can also get the guest driver version from inside the guest VM at Control Panel > Programs > Programs and Features:

 

Note: The NVIDIA System Management Interface, nvidia-smi, is a command-line tool that reports management information for NVIDIA GPUs. See NVIDIA System Management Interface nvidia-smi for more information.
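As a quicker check of the host driver version alone, nvidia-smi supports a query mode (standard nvidia-smi options; run in the Control Domain — the version shown here matches the example session above):

```shell
[root@xenserver ~]# nvidia-smi --query-gpu=driver_version --format=csv,noheader
390.28
```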



Common checkpoint #3: Ensure the destination host has an available GPU of the type required by the VM.

Only a single vGPU type is permitted on each pGPU. If you attempt to migrate a VM to a host whose pGPUs are already running a different vGPU type, the migration fails with an error. This applies to both Intra-Pool and Inter-Pool scenarios.
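Before migrating, you can check which vGPU types are already running on the destination host from its Control Domain. This is a sketch: `nvidia-smi vgpu` is the same tool used elsewhere in this article, while the `xe vgpu-list` output fields may vary by XenServer version.

```shell
# List running vGPUs and their types on the destination host:
[root@xenserver ~]# nvidia-smi vgpu

# Or, via the xe CLI, list the vGPU objects on the pool:
[root@xenserver ~]# xe vgpu-list
```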



Common checkpoint #4: Possible TDR after VM migration with a vGPU attached.
 

A TDR (Timeout Detection and Recovery: the operating system resets the graphics device when it doesn’t respond to a request within two seconds) may occur after migration of a Windows VM with a vGPU attached. This can happen because Windows may treat the downtime during the migration as a timeout in GPU operations.

To resolve this issue, change how Windows tracks time during a migration by disabling two of the Viridian enlightenments:

xe vm-param-set uuid=<vm-uuid> platform:viridian_reference_tsc=false
xe vm-param-set uuid=<vm-uuid> platform:viridian_time_ref_count=false
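To confirm the flags were applied, you can read the VM's platform map back (standard xe usage; <vm-uuid> is your VM's UUID as above). Platform settings are read when the VM starts, so restart the VM for the change to take effect.

```shell
xe vm-param-get uuid=<vm-uuid> param-name=platform
```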

 

For general troubleshooting, you can:

  • Collect a Server Status Report from both the source and destination XenServer hosts. Look for “emu-manager*” lines in /var/log/daemon.log. emu-manager is a coordination helper used during migration and suspend/resume; its output is useful for diagnosing migration and suspend/resume failures.

  • For NVIDIA guest driver installation issues, you can also collect the driver installation log at C:\windows\inf\setupapi.dev.log.

 


    Additional Information

    Configuring Citrix XenServer 7.4 for Graphics
    NVIDIA vGPU Troubleshooting Guide  
    NVIDIA System Management Interface nvidia-smi