XenServer 7 hosts crash with the call trace below when starting multiple virtual machines that have a vGPU attached. The following trace can be found in xen.log in the crash folder /var/log/crash:
(XEN) [101632.198343] ----[ Xen-4.6.1-xs128153 x86_64 debug=n Not tainted ]----
(XEN) [101632.198344] CPU: 5
(XEN) [101632.198345] RIP: e008:[<ffff82d0801f34e7>] vmx_vmenter_helper+0x347/0x3b0
(XEN) [101632.198349] RFLAGS: 0000000000010003 CONTEXT: hypervisor (d0v12)
(XEN) [101632.198351] rax: 000000008005003b rbx: ffff83007a0f1000 rcx: 0000000000000000
(XEN) [101632.198352] rdx: 0000000000006c00 rsi: 0000000000000007 rdi: ffff83007a0f1000
(XEN) [101632.198353] rbp: ffff83007a0f1000 rsp: ffff83201bd4fd60 r8: 0000000000000008
(XEN) [101632.198354] r9: 0000000000000001 r10: 0000000000000000 r11: ffff82e000000000
(XEN) [101632.198355] r12: ffff83007a18d000 r13: ffff82d08035ce20 r14: 0000000000000005
(XEN) [101632.198356] r15: ffff83201bdfb000 cr0: 000000008005003b cr4: 00000000003526e0
(XEN) [101632.198357] cr3: 00000040cc016000 cr2: 0000007075300000
(XEN) [101632.198357] ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
(XEN) [101632.198359] Xen code around <ffff82d0801f34e7> (vmx_vmenter_helper+0x347/0x3b0):
(XEN) [101632.198360] 0f 0b 0f 0b 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b
(XEN) [101632.198363] Xen stack trace from rsp=ffff83201bd4fd60:
(XEN) [101632.198363] ffff8340cc053000 ffff82d0801680c7 0000000536d9dce0 ffff83201bd4ff18
(XEN) [101632.198365] ffff83007a18a000 ffff83201bda9f00 ffff83203ffe4f00 ffff83007a18d000
(XEN) [101632.198367] ffff830079efd000 ffff83201bdfb000 0000000000000005 ffff83201bd22000
(XEN) [101632.198368] ffff82d08035ce20 ffff82d08016bd29 0000000000000000 0000000000000000
(XEN) [101632.198369] 0000000000000000 0000000000000206 0000000000000086 0000000000000286
(XEN) [101632.198370] 0000000000000005 ffff830079efd058 ffff82d08013399b ffff830079efd000
(XEN) [101632.198372] 00005c6f17145fcb ffff83007a18d000 ffff83201bdfbf84 ffff83201bdfb000
(XEN) [101632.198374] ffff82d08035ce20 ffff82d08012d35e ffff83201bd48000 ffff83201bd52148
(XEN) [101632.198375] ffff83201bd22000 ffff83201bd52160 ffff83201bd55660 ffff83007a0f1000
(XEN) [101632.198377] 0000000000000005 ffff83007a18d000 0000000001c9c380 ffffffffffffff00
(XEN) [101632.198378] 000000fb000007b0 00000000ffffffff ffffffffffffffff ffff83201bd48000
(XEN) [101632.198379] ffff82d080340d00 ffff8340cc053000 ffff82d08035ce20 ffff82d0801308ec
(XEN) [101632.198381] ffff83007a0f1000 ffff83201bd48000 ffff83007a0f1000 ffff83201bd22000
(XEN) [101632.198383] 00000000ffffffff ffff82d080167d35 ffff830079efd000 00000000ffffffff
(XEN) [101632.198384] ffffe000544039c0 0000000000000000 0000000b54c81ef7 ffffd001b6afea20
(XEN) [101632.198385] ffffd001b6ad5180 0000000000000001 0000000000000002 0000000000000000
(XEN) [101632.198386] 00000000ffffffff 0000000000000020 0000000000000086 0000000000000000
(XEN) [101632.198387] ffffe000544038e0 0000000000000000 000000fc00000000 fffff8032ffb982f
(XEN) [101632.198390] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [101632.198391] Xen call trace:
(XEN) [101632.198392] [<ffff82d0801f34e7>] vmx_vmenter_helper+0x347/0x3b0
(XEN) [101632.198394] [<ffff82d0801680c7>] domain.c#__context_switch+0xc7/0x410
(XEN) [101632.198396] [<ffff82d08016bd29>] context_switch+0xc9/0xea0
(XEN) [101632.198398] [<ffff82d08013399b>] timer.c#add_entry+0x4b/0xb0
(XEN) [101632.198401] [<ffff82d08012d35e>] schedule.c#schedule+0x46e/0x7d0
(XEN) [101632.198402] [<ffff82d0801308ec>] softirq.c#__do_softirq+0x5c/0x90
(XEN) [101632.198403] [<ffff82d080167d35>] domain.c#idle_loop+0x25/0x60
(XEN) [101632.198404]
(XEN) [101632.198404]
(XEN) [101632.198405] ****************************************
(XEN) [101632.198406] Panic on CPU 5:
(XEN) [101632.198406] FATAL TRAP: vector = 6 (invalid opcode)
(XEN) [101632.198407] ****************************************
(XEN) [101632.198407]
(XEN) [101632.198408] Reboot in five seconds...
(XEN) [101632.198409] Executing kexec image on cpu5
(XEN) [101632.199415] Shot down all CPUs
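To check whether a given xen.log shows this crash, it may help to search for the two most distinctive strings in the trace above: the faulting function (vmx_vmenter_helper) and the invalid-opcode panic line. The following is a minimal sketch only. The function name has_pml_crash_signature is hypothetical, and matching on these two strings is an assumption about what identifies the crash, not an official diagnostic.

```shell
#!/bin/sh
# Sketch: succeed (exit status 0) if the given log file contains both
# strings taken from the trace in this article. The function name is
# hypothetical; the two matched strings are copied from the trace.
has_pml_crash_signature() {
    log_file="$1"
    # -F treats the pattern as a fixed string, -q suppresses output.
    grep -qF 'vmx_vmenter_helper' "$log_file" &&
        grep -qF 'FATAL TRAP: vector = 6 (invalid opcode)' "$log_file"
}
```

One possible use, assuming the xen.log files sit somewhere beneath /var/log/crash, is to run `find /var/log/crash -name xen.log` and pass each result to the function. A match only suggests the same failure; other faults can also end in vmx_vmenter_helper.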
XenServer hosts with a large amount of memory (more than 512 GB) reboot at random when the workaround below is not applied. This is seen most often on hosts running virtual machines with a vGPU attached, but it can also happen on hosts without them.
The issue is caused by Page Modification Logging (PML), a feature that was introduced with Intel Broadwell CPUs and adopted in Xen 4.6 and XenServer 7. You can read more about the feature at http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/page-modification-logging-vmm-white-paper.pdf
Citrix is aware of this issue and working towards a permanent fix.
As a workaround, run the following commands on each XenServer host in the pool:
# /opt/xensource/libexec/xen-cmdline --set-xen ept=no-pml
# /opt/xensource/libexec/xen-cmdline --set-xen iommu=dom0-passthrough
After running the above commands, reboot the hosts for the changes to take effect.
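After the reboot, it can be useful to confirm that both options reached the Xen command line. The sketch below checks a command-line string for the two options exactly as set above. The function name workaround_applied is hypothetical, and how the string is obtained is left to the caller.

```shell
#!/bin/sh
# Sketch: succeed (exit status 0) only if the given Xen command line
# contains both workaround options as standalone, space-separated words.
# The function name is hypothetical. It matches the options exactly as
# set in this article, so a combined value such as "ept=no-pml,other"
# would not be recognised.
workaround_applied() {
    cmdline=" $1 "   # pad with spaces so every option is space-delimited
    case "$cmdline" in
        *" ept=no-pml "*) ;;
        *) return 1 ;;
    esac
    case "$cmdline" in
        *" iommu=dom0-passthrough "*) ;;
        *) return 1 ;;
    esac
    return 0
}
```

Assuming `xl info` is available in dom0 and reports a `xen_commandline` field, one way to call it is `workaround_applied "$(xl info | sed -n 's/^xen_commandline *: *//p')" && echo applied`.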
This issue was fixed in XenServer 7.2 and the following hotfixes:
XS71E006 for XenServer 7.1 - https://support.citrix.com/article/CTX222424
XS70E032 for XenServer 7.0 - https://support.citrix.com/article/CTX222423
A buggy implementation of PML in Xen is causing the issue.