XenServer 7 Host Crashes While Starting Multiple Virtual Machines


Article ID: CTX220674

Description

XenServer 7 hosts crash with the call trace below when starting multiple virtual machines that have a vGPU attached. The following trace can be found in xen.log in the crash folder /var/log/crash:

(XEN) [101632.198343] ----[ Xen-4.6.1-xs128153  x86_64  debug=n  Not tainted ]----
(XEN) [101632.198344] CPU:    5
(XEN) [101632.198345] RIP:    e008:[<ffff82d0801f34e7>] vmx_vmenter_helper+0x347/0x3b0
(XEN) [101632.198349] RFLAGS: 0000000000010003   CONTEXT: hypervisor (d0v12)
(XEN) [101632.198351] rax: 000000008005003b   rbx: ffff83007a0f1000   rcx: 0000000000000000
(XEN) [101632.198352] rdx: 0000000000006c00   rsi: 0000000000000007   rdi: ffff83007a0f1000
(XEN) [101632.198353] rbp: ffff83007a0f1000   rsp: ffff83201bd4fd60   r8:  0000000000000008
(XEN) [101632.198354] r9:  0000000000000001   r10: 0000000000000000   r11: ffff82e000000000
(XEN) [101632.198355] r12: ffff83007a18d000   r13: ffff82d08035ce20   r14: 0000000000000005
(XEN) [101632.198356] r15: ffff83201bdfb000   cr0: 000000008005003b   cr4: 00000000003526e0
(XEN) [101632.198357] cr3: 00000040cc016000   cr2: 0000007075300000
(XEN) [101632.198357] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) [101632.198359] Xen code around <ffff82d0801f34e7> (vmx_vmenter_helper+0x347/0x3b0):
(XEN) [101632.198360]  0f 0b 0f 0b 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b
(XEN) [101632.198363] Xen stack trace from rsp=ffff83201bd4fd60:
(XEN) [101632.198363]    ffff8340cc053000 ffff82d0801680c7 0000000536d9dce0 ffff83201bd4ff18
(XEN) [101632.198365]    ffff83007a18a000 ffff83201bda9f00 ffff83203ffe4f00 ffff83007a18d000
(XEN) [101632.198367]    ffff830079efd000 ffff83201bdfb000 0000000000000005 ffff83201bd22000
(XEN) [101632.198368]    ffff82d08035ce20 ffff82d08016bd29 0000000000000000 0000000000000000
(XEN) [101632.198369]    0000000000000000 0000000000000206 0000000000000086 0000000000000286
(XEN) [101632.198370]    0000000000000005 ffff830079efd058 ffff82d08013399b ffff830079efd000
(XEN) [101632.198372]    00005c6f17145fcb ffff83007a18d000 ffff83201bdfbf84 ffff83201bdfb000
(XEN) [101632.198374]    ffff82d08035ce20 ffff82d08012d35e ffff83201bd48000 ffff83201bd52148
(XEN) [101632.198375]    ffff83201bd22000 ffff83201bd52160 ffff83201bd55660 ffff83007a0f1000
(XEN) [101632.198377]    0000000000000005 ffff83007a18d000 0000000001c9c380 ffffffffffffff00
(XEN) [101632.198378]    000000fb000007b0 00000000ffffffff ffffffffffffffff ffff83201bd48000
(XEN) [101632.198379]    ffff82d080340d00 ffff8340cc053000 ffff82d08035ce20 ffff82d0801308ec
(XEN) [101632.198381]    ffff83007a0f1000 ffff83201bd48000 ffff83007a0f1000 ffff83201bd22000
(XEN) [101632.198383]    00000000ffffffff ffff82d080167d35 ffff830079efd000 00000000ffffffff
(XEN) [101632.198384]    ffffe000544039c0 0000000000000000 0000000b54c81ef7 ffffd001b6afea20
(XEN) [101632.198385]    ffffd001b6ad5180 0000000000000001 0000000000000002 0000000000000000
(XEN) [101632.198386]    00000000ffffffff 0000000000000020 0000000000000086 0000000000000000
(XEN) [101632.198387]    ffffe000544038e0 0000000000000000 000000fc00000000 fffff8032ffb982f
(XEN) [101632.198390]    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) [101632.198391] Xen call trace:
(XEN) [101632.198392]    [<ffff82d0801f34e7>] vmx_vmenter_helper+0x347/0x3b0
(XEN) [101632.198394]    [<ffff82d0801680c7>] domain.c#__context_switch+0xc7/0x410
(XEN) [101632.198396]    [<ffff82d08016bd29>] context_switch+0xc9/0xea0
(XEN) [101632.198398]    [<ffff82d08013399b>] timer.c#add_entry+0x4b/0xb0
(XEN) [101632.198401]    [<ffff82d08012d35e>] schedule.c#schedule+0x46e/0x7d0
(XEN) [101632.198402]    [<ffff82d0801308ec>] softirq.c#__do_softirq+0x5c/0x90
(XEN) [101632.198403]    [<ffff82d080167d35>] domain.c#idle_loop+0x25/0x60
(XEN) [101632.198404] 
(XEN) [101632.198404] 
(XEN) [101632.198405] ****************************************
(XEN) [101632.198406] Panic on CPU 5:
(XEN) [101632.198406] FATAL TRAP: vector = 6 (invalid opcode)
(XEN) [101632.198407] ****************************************
(XEN) [101632.198407] 
(XEN) [101632.198408] Reboot in five seconds...
(XEN) [101632.198409] Executing kexec image on cpu5
(XEN) [101632.199415] Shot down all CPUs

Without the workaround described under Resolution, XenServer hosts may also reboot at random, particularly hosts with a large amount of memory (more than 512 GB). This is seen most often on hosts running virtual machines with a vGPU attached, but it can also occur on hosts without vGPU workloads.
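
To check whether a saved crash log matches the trace above, the two most distinctive markers (the faulting function and the trap type) can be searched for together. The following is a minimal sketch, not a Citrix-supplied tool; the function name matches_pml_crash is hypothetical, and it assumes the log is a plain-text file such as the xen.log mentioned above.

```shell
#!/bin/sh
# Sketch only: matches_pml_crash is a hypothetical helper name.
# Usage: matches_pml_crash /path/to/xen.log
# Succeeds (exit status 0) only if the log contains both markers
# from the trace in this article.
matches_pml_crash() {
    log="$1"
    # Marker 1: the faulting function reported in RIP and the call trace.
    # Marker 2: the panic reason printed just before the reboot.
    grep -q 'vmx_vmenter_helper' "$log" &&
        grep -q 'FATAL TRAP: vector = 6 (invalid opcode)' "$log"
}
```

For example, `find /var/log/crash -name xen.log | while read -r f; do matches_pml_crash "$f" && echo "$f"; done` would list the matching logs. A match indicates the same signature, not necessarily the same root cause.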

Resolution

The issue is caused by Page Modification Logging (PML), a feature that was introduced with Intel Broadwell CPUs and adopted in Xen 4.6 and XenServer 7. For more information about the feature, see http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/page-modification-logging-vmm-white-paper.pdf

Citrix is aware of this issue and working towards a permanent fix. 

As a workaround, run the following commands on each XenServer host in the pool:
    # /opt/xensource/libexec/xen-cmdline --set-xen ept=no-pml
    # /opt/xensource/libexec/xen-cmdline --set-xen iommu=dom0-passthrough

After running the above commands, reboot the hosts for the changes to take effect.
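
After the reboot, it can be useful to confirm that both options reached the running hypervisor. The sketch below is one possible check, not part of the official procedure. The helper names has_xen_opt and check_workaround are hypothetical, and the check assumes that `xl info` in dom0 reports the booted Xen command line in its `xen_commandline` field.

```shell
#!/bin/sh
# Sketch only: has_xen_opt and check_workaround are hypothetical helpers.

# has_xen_opt CMDLINE OPTION
# Succeeds if OPTION appears as a whole space-separated token in CMDLINE.
# An option merged with others (e.g. "ept=no-pml,other") is not matched.
has_xen_opt() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# check_workaround
# Reads the running Xen command line (assumed to be exposed by `xl info`)
# and reports whether each workaround option is present.
check_workaround() {
    cmdline=$(xl info | sed -n 's/^xen_commandline *: *//p')
    for opt in ept=no-pml iommu=dom0-passthrough; do
        if has_xen_opt "$cmdline" "$opt"; then
            echo "$opt: present"
        else
            echo "$opt: missing"
        fi
    done
}
```

Running `check_workaround` as root in dom0 on each host prints one line per option. A "missing" result suggests that the host has not been rebooted since the options were set, or that the options were not applied on that host.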

This issue was fixed in XenServer 7.2 and the following hotfixes:
XS71E006 for XenServer 7.1 - https://support.citrix.com/article/CTX222424
XS70E032 for XenServer 7.0 - https://support.citrix.com/article/CTX222423


Problem Cause

A bug in the Xen implementation of PML causes the issue.

Issue/Introduction

XenServer 7 hosts crash with the call trace shown in the Description when starting multiple virtual machines that have a vGPU attached.