In an NVIDIA vGPU environment, users can no longer log in to a VM, and the VM is shown with YELLOW status in XenCenter.
xenopsd-xc is killed by the oom-killer because Dom0 memory usage exceeds 99%.
Jul 29 15:07:47 <HOSTNAME> kernel: [22724504.962426] vgpu invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=2, oom_score_adj=0
...
Jul 29 15:07:47 <HOSTNAME> kernel: [22724504.963735] DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 2*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15768kB
Jul 29 15:07:47 <HOSTNAME> kernel: [22724504.963740] DMA32: 2813*4kB (UME) 2596*8kB (UE) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB (H) 0*4096kB = 34068kB
Jul 29 15:07:47 <HOSTNAME> kernel: [22724504.963746] Normal: 3205*4kB (UE) 1172*8kB (UE) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22196kB
...
Jul 29 15:07:47 <HOSTNAME> kernel: [22724504.964792] Out of memory: Kill process 32578 (xenopsd-xc) score 12 or sacrifice child
Jul 29 15:07:47 <HOSTNAME> kernel: [22724504.965018] Killed process 32578 (xenopsd-xc) total-vm:657120kB, anon-rss:108740kB, file-rss:4420kB, shmem-rss:0kB
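The per-zone free lists in the log above can be read mechanically. The following is a minimal sketch, not a Citrix or NVIDIA tool: it assumes 4 kB pages (so the order=2 request in the first log line needs one contiguous 16 kB block), and the function names are hypothetical. It totals the free memory in a zone and counts the free blocks that are large enough for the request.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, so an order-2 request needs 4 * 4 kB = 16 kB


def parse_free_list(line):
    """Return {block_size_kb: count} from one zone line of an OOM report."""
    # Matches pairs such as "3205*4kB"; the trailing "= 22196kB" has no "*"
    # and is therefore ignored.
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


def summarize(line, order):
    """Return (total free kB, number of free blocks big enough for `order`)."""
    blocks = parse_free_list(line)
    needed_kb = PAGE_KB << order
    total_kb = sum(size * count for size, count in blocks.items())
    usable = sum(count for size, count in blocks.items() if size >= needed_kb)
    return total_kb, usable


# Zone line copied from the log excerpt above.
normal = ("Normal: 3205*4kB (UE) 1172*8kB (UE) 0*16kB 0*32kB 0*64kB 0*128kB "
          "0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22196kB")

print(summarize(normal, order=2))  # prints (22196, 0)
```

For the Normal zone this reports 22196 kB free in total but no free block of 16 kB or larger, which is consistent with the order=2 allocation failing and the oom-killer being invoked.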
The memory leak is caused by a known issue in NVIDIA vGPU software driver versions 13.0 through 13.2. For more information, see the NVIDIA documentation: https://docs.nvidia.com/grid/13.0/grid-vgpu-release-notes-citrix-xenserver/index.html#bug-200724807-memory-leaks-in-vgpu-manager-plugin-cause-vm-to-hang
Applications running in a VM request memory to be allocated and freed by the vGPU manager plugin, which runs on the hypervisor host. When an application requests the vGPU manager plugin to free previously allocated memory, some of the memory is not freed. Some applications request memory more frequently than other applications. If such applications run for a long period of time, for example for two or more days, the failure to free all allocated memory might cause the hypervisor host to run out of memory. As a result, memory allocation for applications running in the VM might fail, causing the applications and, sometimes, the VM to hang.
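Because the leak builds up over days, it can help to track how much memory is in use on the host over time. The sketch below is illustrative only and is not part of any Citrix or NVIDIA tooling. It assumes a Python 3 interpreter is available, that the kernel exposes the standard `MemTotal` and `MemAvailable` fields in `/proc/meminfo`, and that a 90% warning threshold is suitable; the function names and the threshold are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer_value}."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])  # value in kB for memory fields
    return values


def used_percent(meminfo):
    """Percentage of memory in use, computed as MemTotal minus MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


def check_host(path="/proc/meminfo", threshold=90.0):
    """Return (used_percent, over_threshold) for the host this runs on."""
    # threshold is an illustrative value, not a vendor recommendation
    with open(path) as f:
        pct = used_percent(parse_meminfo(f.read()))
    return pct, pct >= threshold
```

Calling `check_host()` periodically, for example from a scheduled job, and recording the result would show whether usage climbs steadily while vGPU-enabled VMs are running.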