This article describes how to use and troubleshoot the NVIDIA vGPU XenMotion feature.
The following requirements and restrictions apply when using NVIDIA vGPU XenMotion:
Here are some common checkpoints for troubleshooting NVIDIA vGPU XenMotion-related issues.
Common checkpoint #1: Ensure the NVIDIA host driver is loaded successfully and vGPU migration is enabled.
[root@xenserver ~]# dmesg |grep -E "NVRM|nvidia"
[ 23.667910] nvidia: module license 'NVIDIA' taints kernel.
[ 23.701203] nvidia-nvlink: Nvlink Core is being initialized, major device number 242
[ 23.701952] nvidia 0000:07:00.0: enabling device (0000 -> 0003)
[ 23.702288] nvidia 0000:08:00.0: enabling device (0000 -> 0003)
[ 23.702580] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 390.28 Wed Jan 31 02:38:16 PST 2018 (using threaded interrupts)
[ 23.702597] NVRM: PAT configuration unsupported.
[ 25.478421] NVRM: Enabling vGPU live migration.
[ 26.311678] NVRM: Enabling vGPU live migration.
[ 34.076410] NVRM: Enabling vGPU live migration.
[ 34.697255] NVRM: Enabling vGPU live migration.
[root@xenserver ~]#
You can also find this information in the kernel log (/var/log/kern.log); the relevant lines are prefixed with NVRM or nvidia.
If the output does not contain “NVRM: Enabling vGPU live migration”, refer to the NVIDIA documentation to enable vGPU live migration.
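As a quick check, you can search the kernel log directly for the migration message (a minimal sketch, assuming the default log location mentioned above):
[root@xenserver ~]# grep "NVRM: Enabling vGPU live migration" /var/log/kern.log
If this returns no matching lines, vGPU live migration has not been enabled on the host.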
Common checkpoint #2: Ensure the host and guest NVIDIA driver versions match.
The NVIDIA host and guest drivers are released in pairs in the same NVIDIA GRID vGPU software package. Whenever the host driver is updated, make sure the guest driver is updated too; otherwise, mismatched host and guest driver versions may cause problems. (The versions do not need to be identical, but they must come from the same GRID major release, GRID 6 or later.)
Perform the following steps to check NVIDIA driver versions:
Note: The NVIDIA System Management Interface, nvidia-smi, is a command-line tool that reports management information for NVIDIA GPUs. See NVIDIA System Management Interface nvidia-smi for more information.
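For example, on the XenServer host you can query the loaded driver version with nvidia-smi (a minimal sketch; the output below is illustrative and matches the dmesg example above):
[root@xenserver ~]# nvidia-smi --query-gpu=driver_version --format=csv,noheader
390.28
390.28
In the Windows guest, the installed driver version is shown in the NVIDIA Control Panel (System Information page). Confirm that both versions come from the same GRID major release.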
Common checkpoint #3: Ensure the destination host has an available GPU of the type required by the VM.
Only a single vGPU type is permitted on each pGPU. If you attempt to migrate to a host whose pGPUs are already running a different vGPU type, the VM migration operation fails with an error.
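Before migrating, you can check which vGPU types are currently active on the destination host (a sketch; the nvidia-smi vgpu subcommand is provided by the NVIDIA vGPU software host driver):
[root@xenserver ~]# nvidia-smi vgpu
This lists the running vGPUs per physical GPU, so you can confirm whether the destination pGPUs already host a different vGPU type than the one the VM requires.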
A TDR (Timeout Detection and Recovery, meaning that the graphics device did not respond to a request within two seconds and the operating system reset the graphics card) may occur after migrating a Windows VM with a vGPU attached. This can happen because Windows may treat the downtime during the migration as a timeout in GPU operations.
To resolve this issue, modify how Windows tracks time during a migration by disabling two of the viridian enlightenments:
xe vm-param-set uuid=<vm-uuid> platform:viridian_reference_tsc=false
xe vm-param-set uuid=<vm-uuid> platform:viridian_time_ref_count=false
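To confirm the change, you can read back the VM’s platform field (a sketch; xe vm-param-get is the standard XenServer CLI for inspecting VM parameters, and platform changes generally take effect at the next VM boot):
xe vm-param-get uuid=<vm-uuid> param-name=platform
The returned map should include viridian_reference_tsc: false and viridian_time_ref_count: false.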
For general troubleshooting purposes, you can:
Collect a Server Status Report from both the source and destination XenServer hosts. Look for “emu-manager*” lines in /var/log/daemon.log (see the example after this list); emu-manager is a coordination helper used in migration and suspend/resume, and its output is useful for diagnosing migration and suspend/resume failures.
For NVIDIA guest driver installation issues, you can also collect the driver installation log from C:\windows\inf\setupapi.dev.log.
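As an example of the daemon.log check mentioned above (a minimal sketch, assuming the default XenServer log location):
[root@xenserver ~]# grep "emu-manager" /var/log/daemon.log
Reviewing these lines around the time of the failed migration can show where the migration or suspend/resume sequence stopped.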