Cannot log in to Nvidia vGPU attached VM

Article ID: CTX463681

Description

In an Nvidia vGPU environment, users can no longer log in to a VM, and XenCenter shows the VM with a YELLOW status.

The oom-killer kills xenopsd-xc because Dom0 memory usage exceeds 99%. The kernel log on the host contains entries like the following:

Jul 29 15:07:47 <HOSTNAME> kernel: [22724504.962426] vgpu invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=2, oom_score_adj=0
...
Jul 29 15:07:47 <HOSTNAME> kernel: [22724504.963735] DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 2*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15768kB
Jul 29 15:07:47 <HOSTNAME> kernel: [22724504.963740] DMA32: 2813*4kB (UME) 2596*8kB (UE) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB (H) 0*4096kB = 34068kB
Jul 29 15:07:47 <HOSTNAME> kernel: [22724504.963746] Normal: 3205*4kB (UE) 1172*8kB (UE) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22196kB
...
Jul 29 15:07:47 <HOSTNAME> kernel: [22724504.964792] Out of memory: Kill process 32578 (xenopsd-xc) score 12 or sacrifice child
Jul 29 15:07:47 <HOSTNAME> kernel: [22724504.965018] Killed process 32578 (xenopsd-xc) total-vm:657120kB, anon-rss:108740kB, file-rss:4420kB, shmem-rss:0kB
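To confirm whether a host has hit this condition, the kernel log can be searched for "Killed process" entries like the one above. The following is a minimal sketch, not an official Citrix tool: the function name `find_oom_kills` is hypothetical, and the regular expression assumes the log line layout shown in this article.

```python
import re

# Matches kernel log lines of the form shown in this article:
# "... Killed process 32578 (xenopsd-xc) total-vm:657120kB, anon-rss:108740kB, ..."
# The layout is an assumption based on that excerpt; other kernels may differ.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def find_oom_kills(log_text):
    """Return one dict per 'Killed process' line found in log_text."""
    kills = []
    for line in log_text.splitlines():
        match = KILLED_RE.search(line)
        if match:
            kills.append({
                "pid": int(match.group("pid")),
                "name": match.group("name"),
                "total_vm_kb": int(match.group("total_vm_kb")),
                "anon_rss_kb": int(match.group("anon_rss_kb")),
            })
    return kills
```

Passing the contents of the host's kernel log to `find_oom_kills` and checking whether any returned entry has the name `xenopsd-xc` indicates that the toolstack process was a victim of the oom-killer.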

Environment

Citrix is not responsible for and does not endorse or accept any responsibility for the contents or your use of these third party Web sites. Citrix is providing these links to you only as a convenience, and the inclusion of any link does not imply endorsement by Citrix of the linked Web site. It is your responsibility to take precautions to ensure that whatever Web site you use is free of viruses or other harmful items.

Resolution

The memory leak is fixed in Nvidia vGPU Software Driver version 13.3. Contact your hardware vendor to upgrade the Nvidia vGPU Software Driver to version 13.3 or later.
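When auditing several hosts, the "13.3 or later" condition can be checked with a simple numeric comparison. This is a sketch under stated assumptions: the input is assumed to be the vGPU software release number as a dotted string such as "13.2", obtaining that number from a host is not shown, and the names `parse_release` and `meets_fixed_release` are hypothetical.

```python
# Release that contains the fix, according to this article.
FIXED_IN = (13, 3)


def parse_release(version):
    """Turn a release string such as '13.3' into a tuple of ints, e.g. (13, 3)."""
    return tuple(int(part) for part in version.strip().split("."))


def meets_fixed_release(version):
    """Return True if the release is 13.3 or later.

    Assumption: releases compare numerically component by component,
    so '13.10' is treated as later than '13.3'.
    """
    return parse_release(version) >= FIXED_IN
```

The check only expresses "13.3 or later". It does not determine whether a release older than 13.0 is affected by the leak.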

Problem Cause

The memory leak is a known issue in Nvidia vGPU Software Driver versions 13.0 to 13.2. For details, see the Nvidia release notes: https://docs.nvidia.com/grid/13.0/grid-vgpu-release-notes-citrix-xenserver/index.html#bug-200724807-memory-leaks-in-vgpu-manager-plugin-cause-vm-to-hang

Applications running in a VM request memory to be allocated and freed by the vGPU manager plugin, which runs on the hypervisor host. When an application requests the vGPU manager plugin to free previously allocated memory, some of the memory is not freed. Some applications request memory more frequently than other applications. If such applications run for a long period of time, for example for two or more days, the failure to free all allocated memory might cause the hypervisor host to run out of memory. As a result, memory allocation for applications running in the VM might fail, causing the applications and, sometimes, the VM to hang.
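Because the leak grows over days, tracking host memory usage over time can reveal the problem before the oom-killer runs. The sketch below computes a usage percentage from text in the Linux `/proc/meminfo` format. It assumes the Dom0 kernel reports the `MemTotal` and `MemAvailable` fields; the function name `memory_used_percent` is hypothetical, and the way this article's 99% figure was measured is not stated, so the result may not match it exactly.

```python
def memory_used_percent(meminfo_text):
    """Return used memory as a percentage, from /proc/meminfo-style text.

    Assumption: usage is (MemTotal - MemAvailable) / MemTotal.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        values = rest.split()
        if sep and values:
            # The first token after the colon is the numeric value (in kB).
            fields[name.strip()] = int(values[0])
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total
```

On the host, the function could be fed the contents of `/proc/meminfo` at regular intervals; a value that climbs steadily while vGPU VMs are running is consistent with the leak described above.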